New Relic Edge chases your digital tail so you don't have to
- Summary: New Relic brings out a cloud-based service for intensive distributed tracing to help monitor performance across fast-moving microservices deployments
Businesses today need to know what's going wrong in their digital infrastructure long before customers notice — but that's easier said than done. In a fast-changing world, their business systems and applications are built on ever more complex microservices deployments, which in turn require sophisticated monitoring infrastructure. A new service introduced today by APM vendor New Relic aims to relieve some of that burden, says Andrew Tunall, Senior Director of Product Management:
The toil of figuring out how to deploy the software that will give you insight isn't your problem. It can be our problem.
Called New Relic Edge with Infinite Tracing, this is a cloud-based managed service capable of analyzing massive volumes of trace data to surface vital clues that help find and diagnose errors or latency. The service runs in Amazon Web Services (AWS) — initially in US East 1, with other regions planned to follow — and takes trace data from cloud or on-premises workloads. Tunall says:
Send your firehose of data and we'll do the hard work of working out what data you care about.
The aim is to show site engineers those clues to what's going wrong in context as early as possible, so that they can figure out what's happening and ensure the problem gets fixed before it impacts customer experience. The more information they have, the faster they can act, says Tunall:
The sheer amount of layers and amount of data generated results in a lot of complexity ... If you don't have enough examples of what's going wrong, it's hard for the human brain to understand what's happening.
How distributed tracing helps
The challenge of collecting the data and presenting it in a useful format has been growing rapidly. Breaking down once-monolithic applications and digital infrastructure into more nimble microservices provides the agility, responsiveness and elastic scaling required to succeed in the digital economy.
But as microservices proliferate, the sheer quantity of these individual components and the connections between them becomes daunting. At the same time, the use of DevOps and continuous integration and delivery (CI/CD) means that those microservices and connections are constantly evolving and changing.
The solution the industry has devised is called distributed tracing. Rather than simply keeping a log of each event and then attempting to match them up afterwards, distributed tracing adds identifiers to each step in a process. Every time a step is executed, it records the start and finish times, and this 'span' can then be linked to others in the same trace thanks to the shared unique identifier. So far so good, except that this quickly turns into a huge volume of instrumentation data — especially at those critical moments when you most need to understand what's happening. A New Relic white paper cites an example:
One company has an average span load of 3 million spans per minute, but when a new product launches, it sees spikes of 300 million spans per minute.
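The span-and-trace mechanism described above can be sketched in a few lines of Python. The names here (`Span`, `record_span`) are illustrative, not New Relic's API; the point is simply that each timed step carries the trace's shared identifier, so spans can be stitched back together afterwards:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One timed step in a distributed trace."""
    name: str
    trace_id: str                    # shared by every span in the same trace
    parent_id: Optional[str] = None  # the span that invoked this step, if any
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = 0.0
    finish: float = 0.0

def record_span(name, trace_id, parent_id, work):
    """Run `work`, recording its start and finish times as a span."""
    span = Span(name=name, trace_id=trace_id, parent_id=parent_id)
    span.start = time.time()
    work()
    span.finish = time.time()
    return span

# One request enters the system; a single trace ID ties all its spans together.
trace_id = uuid.uuid4().hex
parent = record_span("checkout", trace_id, None, lambda: time.sleep(0.01))
child = record_span("charge-card", trace_id, parent.span_id, lambda: time.sleep(0.01))
```

Multiply that handful of spans by every step of every request across hundreds of services, and the white paper's hundreds of millions of spans per minute become easy to believe.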
Head-based versus tail-based sampling
Because there are so many steps generating so many traces, the traditional approach in high-volume environments has been to track a subsample of traces rather than all of them. This is called head-based sampling, because the decision as to which traces to sample is taken before each trace begins. But randomly sampling what's going on in your infrastructure risks missing errors when they first occur. Assume that 1% of the traffic going through your application hits a specific operation within the code, that there's a 1% error rate in that operation, and that you're sampling at 1% — you have a minuscule chance of sampling that specific error, even assuming a perfectly random sample. In practice, the sampling is targeted by algorithms, so it may miss an issue altogether because it's not something the system is looking for. Tunall explains:
One component has an error or outlier that isn't immediately visible to the parent. Say it's only selecting 1% where that infrequent error is occurring. It is very improbable today's algorithms will select data that allows you to see what's going on.
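The arithmetic behind that miss rate is worth spelling out. Multiplying the three 1% figures from the example above gives roughly a one-in-a-million chance that any given request is both an instance of the error and part of the sample (the one-million-requests-per-minute volume below is an illustrative assumption, not a figure from New Relic):

```python
# Back-of-envelope for the head-based sampling example in the text
traffic_share = 0.01   # 1% of requests hit the specific operation
error_rate = 0.01      # 1% of those requests produce the error
sample_rate = 0.01     # head-based sampling keeps 1% of traces

# Chance a given request is an error in that operation AND lands in the sample
p_caught = traffic_share * error_rate * sample_rate   # roughly one in a million

# At an assumed one million requests per minute, you would expect to
# capture only about one example of the error per minute
requests_per_minute = 1_000_000
expected_sampled_errors = p_caught * requests_per_minute
```

One captured example per minute is, as Tunall puts it, not "enough examples of what's going wrong" for an engineer to see the pattern.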
The alternative approach is called tail-based sampling, which starts by collecting every trace and then identifying which ones to keep for further analysis. This means you find out about issues as soon as they start occurring, provided your analysis is accurate enough to identify which samples to keep.
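Reduced to its essence, a tail-based sampler buffers every completed trace and only then decides which to keep. The keep criteria below — traces containing an error or unusually slow traces — are illustrative stand-ins for a real sampler's heuristics, not New Relic's actual logic:

```python
def keep_trace(trace):
    """Decide, after the trace has completed, whether it is worth keeping.

    `trace` is a list of span dicts with 'start', 'finish' and 'error' keys;
    the error and latency thresholds are illustrative.
    """
    has_error = any(span["error"] for span in trace)
    duration = max(s["finish"] for s in trace) - min(s["start"] for s in trace)
    return has_error or duration > 0.5   # keep errors and slow outliers

# Tail-based: every trace is collected first, then filtered.
traces = [
    [{"start": 0.0, "finish": 0.1, "error": False}],  # fast and clean: dropped
    [{"start": 0.0, "finish": 0.1, "error": True}],   # contains an error: kept
    [{"start": 0.0, "finish": 0.9, "error": False}],  # slow outlier: kept
]
kept = [t for t in traces if keep_trace(t)]
```

Because the decision is made after the fact, the erroring trace and the slow outlier are guaranteed to survive — exactly the traces a 1% head-based sample would most likely have discarded.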
Why deliver from the cloud
The problem with this better solution? It means collecting and rapidly analyzing a huge mass of data — particularly at those times when you need it most, which will often be when activity spikes or connections are proliferating. This is why it's so important to productize this capability as a cloud-based service, says Tunall.
People have been talking about tail-based tracing for six years. What this does is offer it as a fully integrated, fully managed component of our platform.
Organizations that are already using tail-based tracing have had to build out their own infrastructure to support it, and that means diverting valuable resources into running the tracing infrastructure, says Tunall.
You don't want to have entire groups of SREs dedicated to maintaining observability infrastructure ... Our belief is that customers want that to be fully managed for them and fully integrated into their observability platform.
My take
This is a high-end addition to New Relic's offering but one that many organizations will want to take a look at. It's exactly the kind of service that is a good fit for IT teams that are pushing the envelope of their digital capabilities but have limits on the resources they can devote to building and running their own monitoring infrastructure. It's also the kind of burstable capability that needs the on-demand elasticity of the cloud.
As more and more enterprises go down the road of microservices and continuous delivery, the need for this kind of real-time, in-depth monitoring and insight can only increase.