New startup builds APM for today's distributed microservices applications
- Summary:
- New startup Lightstep builds APM designed to manage performance of today's composite distributed microservices applications
Ever since applications broke from the shackles of mainframe homogeneity, software designs have included an increasing number of modules and dependencies that have significantly complicated the task of gaining a complete picture of overall performance. Application performance monitoring (APM) emerged as a software category designed to tame such complexity. Unfortunately, the dotcom-era rise of Web applications, which are at the mercy of many different systems and of the networks and server plumbing that make the Internet work, complicated things even further. Now, the APM measurement problem faces another roadblock in the era of distributed apps composed of multiple microservices and running on shared cloud infrastructure.
In response to an exploding number of performance factors and input variables, APM vendors have layered tool upon tool, resulting in unwieldy and increasingly expensive suites that can create as many questions as they answer. Lightstep, a two-year-old startup just emerging from stealth, thinks it's found a better way based on the founders’ expertise in distributed systems, by emulating the modular, scale-out designs of cloud infrastructure and social applications.
The difficulty of measuring microservices applications
Lightstep's first product, awkwardly called [x]PM (in a nod to the bracket notation used for software variables), is intended to evoke an expansive new category of performance monitoring software built for an era in which applications are no longer monolithic, single-purpose, self-contained blocks of code. They have evolved into assemblies of microservices: small, reusable applets, each designed for a single task, that are pieced together like Legos.
According to Lightstep cofounder and CEO Ben Sigelman, while microservices allow developers to be more productive and agile in releasing software early and often, they also create more confusion. A disjointed collection of independent code modules lacks a consistent monitoring and tracing implementation; collectively, the modules compound the amount of telemetry produced and obscure the overall, end-to-end performance of the composite application. The company outlined the problem in a blog post last year about the monitoring and measurement problems created by microservices:
Many organizations thought that moving to a microservice architecture just meant 'splitting up one binary into 10 smaller ones.' What they found when they did was that they had the same old problems, just repeated 10 times over. Over time, they realized that building a robust application wasn’t just a matter of splitting up their monolith into smaller pieces but instead understanding the connections between these pieces.
The performance of a single module is often unimportant when the overall application might wind its way through a dozen services, some with multiple dependencies and dependents. Indeed, as Sigelman illustrates in a slide deck describing Lightstep’s approach, the number of paths an application might take through a portfolio of microservices is astronomical. The path can often change depending upon the outputs of services earlier in the execution chain. While module-level data from individual services is useful for debugging, it can’t answer important questions about the factors that control end-to-end application performance, such as an application’s service dependencies, the number of inter-service API calls and the network latency between services.
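To get a feel for how quickly those paths multiply, here is a minimal Python sketch that counts the distinct call paths through a service graph. The graph, the service names and the helper functions are all hypothetical and are not taken from Sigelman's slides; the sketch also assumes the call graph has no cycles.

```python
from functools import lru_cache

# Hypothetical call graph: each service maps to the services it may call next.
CALL_GRAPH = {
    "gateway": ["auth", "catalog"],
    "auth": ["users", "catalog"],
    "catalog": ["pricing", "inventory"],
    "users": ["db"],
    "pricing": ["db"],
    "inventory": ["db"],
    "db": [],
}


def count_paths(graph, start):
    """Count distinct call paths from `start` to any leaf service.

    Assumes the graph is acyclic; a cycle would recurse forever.
    """
    @lru_cache(maxsize=None)
    def walk(node):
        callees = graph[node]
        if not callees:
            return 1  # a leaf service ends exactly one path
        return sum(walk(callee) for callee in callees)

    return walk(start)


def layered_graph(layers, width):
    """Build a hypothetical graph of `layers` tiers with `width` services each.

    Every service may call every service in the next tier.
    """
    graph = {"entry": [f"L0-{j}" for j in range(width)]}
    for i in range(layers):
        if i + 1 < layers:
            next_tier = [f"L{i + 1}-{j}" for j in range(width)]
        else:
            next_tier = []  # the last tier holds leaf services
        for j in range(width):
            graph[f"L{i}-{j}"] = next_tier
    return graph


small = count_paths(CALL_GRAPH, "gateway")            # 5 paths
large = count_paths(layered_graph(12, 3), "entry")    # 3 ** 12 = 531,441 paths
```

Even this toy case, a request passing through a dozen tiers with three candidate services in each, yields more than half a million possible paths, which is why per-module metrics alone say little about end-to-end behavior.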
Lightstep’s APM architecture
Lightstep claims to solve the APM problems of composite applications by creating a distributed data collection system that resembles a message bus. Although the company has been skimpy with technical details, its system topology shows a collector backbone that aggregates data from virtually any source including API-based microservices, monolithic legacy applications, mobile and Web clients, and third-party log aggregation software like Splunk and Sumo Logic. The aggregated data is then analyzed by sending subsets of information, selected based on the type of performance report or parameters needed by a particular query or infographic dashboard, to the Lightstep SaaS system.
Sigelman claims that by using a distributed system with data collectors spread across every element contributing to performance, Lightstep can scale to handle applications of any size and ingest a continuous data stream from each component. From there, its IaaS-based (think AWS) SaaS analytics engine lets the system scale to aggregate and correlate all inputs so that it can calculate a variety of APM measures, while allowing users to drill into transaction- or trace-level details for debugging or performance optimization.
A coding challenge Lightstep has issued on GitHub illustrates the type of performance and optimization questions its software can answer about composite applications:
- Find the operation that has the highest error count. "We define error count as the number of transactions where that operation printed at least one ERROR level log entry."
- Find the longest transaction. "We define the longest transaction to be the transaction where the duration between the timestamp of the START of the first operation and the timestamp of the END of the last operation is the greatest."
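The two questions above can be sketched in a few lines of Python. The record layout used here (dictionaries with `txn`, `op`, `ts`, `kind` and `level` keys) is an assumption made for illustration, since the challenge's actual log format isn't described in this article. The sketch also reads "first operation" and "last operation" as the earliest START and the latest END within a transaction.

```python
from collections import defaultdict


def highest_error_operation(events):
    """Return the operation with the highest error count.

    Error count = number of transactions in which the operation logged
    at least one ERROR entry, so repeated errors in one transaction
    count once.
    """
    failing = defaultdict(set)  # operation -> transactions with an ERROR
    for e in events:
        if e["kind"] == "LOG" and e.get("level") == "ERROR":
            failing[e["op"]].add(e["txn"])
    if not failing:
        return None
    return max(failing, key=lambda op: len(failing[op]))


def longest_transaction(events):
    """Return the transaction with the largest (latest END - earliest START)."""
    first_start, last_end = {}, {}
    for e in events:
        txn, ts = e["txn"], e["ts"]
        if e["kind"] == "START":
            first_start[txn] = min(ts, first_start.get(txn, ts))
        elif e["kind"] == "END":
            last_end[txn] = max(ts, last_end.get(txn, ts))
    durations = {t: last_end[t] - first_start[t]
                 for t in first_start if t in last_end}
    return max(durations, key=durations.get) if durations else None


# Hypothetical sample data: three transactions across two operations.
SAMPLE = [
    {"txn": "t1", "op": "auth", "ts": 0, "kind": "START"},
    {"txn": "t1", "op": "auth", "ts": 1, "kind": "LOG", "level": "ERROR"},
    {"txn": "t1", "op": "auth", "ts": 2, "kind": "LOG", "level": "ERROR"},
    {"txn": "t1", "op": "auth", "ts": 5, "kind": "END"},
    {"txn": "t1", "op": "db", "ts": 5, "kind": "START"},
    {"txn": "t1", "op": "db", "ts": 12, "kind": "END"},
    {"txn": "t2", "op": "auth", "ts": 0, "kind": "START"},
    {"txn": "t2", "op": "auth", "ts": 3, "kind": "END"},
    {"txn": "t2", "op": "db", "ts": 3, "kind": "START"},
    {"txn": "t2", "op": "db", "ts": 4, "kind": "LOG", "level": "ERROR"},
    {"txn": "t2", "op": "db", "ts": 20, "kind": "END"},
    {"txn": "t3", "op": "db", "ts": 10, "kind": "START"},
    {"txn": "t3", "op": "db", "ts": 11, "kind": "LOG", "level": "ERROR"},
    {"txn": "t3", "op": "db", "ts": 15, "kind": "END"},
]

worst_op = highest_error_operation(SAMPLE)  # "db": errors in t2 and t3
slowest = longest_transaction(SAMPLE)       # "t2": duration 20
```

Note that `auth` logs two errors but only within one transaction, so `db`, with errors in two transactions, has the higher error count under the challenge's definition.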
For example, a screenshot shows how Lightstep can drill into the individual operations, spread across several components, that make up a transaction, showing latency and log entries for each.
Data integration using OpenTracing
Instrumental to Lightstep’s ability to collect data from any source, whether natively by its collectors or through so-called adapters, is its use of the OpenTracing data collection standard. Co-developed by Lightstep’s founders, but now a part of the Cloud Native Computing Foundation (CNCF), OpenTracing is a set of vendor-agnostic APIs and helper libraries for collecting transaction- or module-level telemetry from distributed applications.
According to Sigelman, OpenTracing decouples the code for application monitoring from the particular implementations of monitoring software vendors or open source libraries, creating a “virtuous cycle” that removes monitoring vendor lock-in and makes both monitoring vendors and developers more productive via the use of common, interchangeable standards. Although OpenTracing is a relatively new project, Sigelman highlights the fact that dozens of companies are using it in their applications and that it’s supported by numerous commercial software vendors and open source projects.
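The decoupling idea can be illustrated with a small, standard-library-only Python sketch. To be clear, this is not the OpenTracing API: the `Tracer` protocol, the `traced` helper and both tracer classes are hypothetical names invented here to show the pattern of instrumenting application code against a neutral interface and swapping the backend without touching that code.

```python
import time
from contextlib import contextmanager
from typing import Protocol


class Tracer(Protocol):
    """Hypothetical vendor-neutral interface the application codes against."""

    def record(self, operation: str, duration_s: float, tags: dict) -> None:
        ...


class NoopTracer:
    """Backend that discards everything, e.g. when tracing is disabled."""

    def record(self, operation, duration_s, tags):
        pass


class InMemoryTracer:
    """Stand-in for a real backend; keeps finished spans in a list."""

    def __init__(self):
        self.spans = []

    def record(self, operation, duration_s, tags):
        self.spans.append((operation, duration_s, tags))


@contextmanager
def traced(tracer, operation, **tags):
    """Time the enclosed block and hand the result to whichever tracer is given."""
    start = time.perf_counter()
    try:
        yield
    finally:
        tracer.record(operation, time.perf_counter() - start, tags)


def handle_checkout(tracer, cart_id):
    # Application code knows only the neutral interface, never the backend.
    with traced(tracer, "checkout", cart_id=cart_id):
        return f"order-for-{cart_id}"
```

Calling `handle_checkout(InMemoryTracer(), "c1")` and `handle_checkout(NoopTracer(), "c1")` exercises the same application code against two different backends, which is the kind of interchangeability the standard is meant to provide.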
Building a business in a crowded APM market
Lightstep started building its [x]PM system in 2015, but has quietly been building up VC investments and early customers to bootstrap its public push into the APM market. The company announced a total of $29 million in funding from several firms and five major customers, with Sigelman noting that several others prefer to remain anonymous.
For example, Twilio scrapped plans to build a custom performance monitoring system, using Lightstep to instrument its distributed software environment instead. According to Twilio's VP of engineering, Lightstep significantly reduced the time needed to troubleshoot and fix performance problems, “from an average of 40 minutes to less than three minutes.” Lyft is using Lightstep to track a complex environment that generates more than 100 billion microservice API calls per day, or over a million every second. Its VP of engineering says the system allows Lyft to isolate the root cause of performance bottlenecks wherever they are in the stack, whether on the mobile client or deep in the cloud infrastructure.
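The per-second figure follows directly from the daily one. As a quick check, assuming the load were spread evenly across the day (real traffic would peak well above the average):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def calls_per_second(calls_per_day):
    """Average call rate, assuming traffic is spread evenly over the day."""
    return calls_per_day / SECONDS_PER_DAY


lyft_rate = calls_per_second(100_000_000_000)
print(f"{lyft_rate:,.0f} calls per second")  # prints "1,157,407 calls per second"
```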
My take
Lightstep enters a crowded APM market, albeit one that various estimates peg as growing at low double-digit rates. Even so, the market is ripe for disruption given the substantial changes in how enterprise applications are being designed and deployed, and a trend towards the majority of APM buyers coming from outside traditional IT organizations and thus not being beholden to existing products or vendors.
Lightstep is spot on in its diagnosis of the challenges that distributed cloud infrastructure and composite microservice designs pose for developers, IT and DevOps teams trying to measure application performance and remediate problems. However, these issues have been apparent for quite a while, and the company is hardly alone in addressing them by applying machine learning, traditional statistics, content search and graph theory to correlate events, map dependencies, identify anomalies and build rich visualizations summarizing KPIs and incidents.
Vendors like Splunk, Sumo Logic, Datadog and SolarWinds that have historically focused on infrastructure monitoring are exploiting their ability to collect and aggregate data from any source by adding more sophisticated analysis and correlation capabilities that push them into the low end of APM functionality. Some are partnering with APM specialists like New Relic to create suites that deliver the same end-to-end application measurements and forensics Lightstep is touting.
Still, Lightstep offers some compelling innovations, including a hybrid collection-analysis design, demonstrated scalability (as Lyft’s example shows) and a customer-friendly pricing model based on the level of analysis (what Sigelman calls “units of value”), not data volume or host count. I don’t expect Lightstep to make immediate inroads into large enterprises but, as its reference customers demonstrate, it will prove popular with cloud natives providing online and mobile applications and services.