New Relic boosts SRE with automated service level objectives and alerts
- New Relic today rolls out the ability to set Service Level Objectives and Indicators, a core component of Site Reliability Engineering for performance and uptime.
Observability vendor New Relic today rolls out new functionality that helps software engineers set service level indicators and objectives to improve site reliability and uptime. Generally available from today as a core component of the New Relic One platform, the new capability includes templates and guidance to automatically set up basic service levels on metrics such as uptime, performance, or error rates. More advanced users can add their own service level parameters using New Relic Query Language (NRQL).
Setting Service Level Objectives (SLOs) is a core principle in the practice of Site Reliability Engineering (SRE), as originally defined at Google. SLOs are the goals a DevOps team must achieve to meet the explicit or implicit Service Level Agreements (SLAs) that an organization has in place with customers and users. Service Level Indicators (SLIs) are the data points used to measure SLOs.
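To make the relationship between the two terms concrete, here is a minimal sketch in Python. It assumes the common SRE convention that an SLI is the percentage of valid events that were "good", and that the SLO is the target that percentage must reach. The `ServiceLevel` class, its field names, and the sample figures are hypothetical illustrations, not part of New Relic's product or API.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Hypothetical container pairing one SLI with its objective (SLO)."""

    name: str
    target_pct: float  # the SLO: e.g. 99.5 means 99.5% of valid events must be good

    def sli_pct(self, good_events: int, valid_events: int) -> float:
        """Return the SLI: the share of valid events that met the success criterion."""
        if valid_events == 0:
            return 100.0  # no traffic in the window, so nothing failed
        return 100.0 * good_events / valid_events

    def is_met(self, good_events: int, valid_events: int) -> bool:
        """Return True when the measured SLI reaches the SLO target."""
        return self.sli_pct(good_events, valid_events) >= self.target_pct


# Illustrative numbers only: 9,960 of 10,000 requests finished within the threshold.
latency = ServiceLevel(name="requests served in under 500 ms", target_pct=99.5)
print(latency.sli_pct(9_960, 10_000))  # 99.6
print(latency.is_met(9_960, 10_000))   # True
```

In this sketch the SLI is a measurement and the SLO is the threshold applied to it; the SLA sits outside the code, as the commitment the SLO is meant to protect.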
New Relic says that providing a standardized, easy-to-use mechanism for setting SLIs and SLOs makes it simpler to track the system metrics that have the greatest impact on business performance. Each team across the software infrastructure can set the parameters most relevant to the services it manages, in a format that makes it easy for engineering leaders and executives to interpret the results and assess the impact on business KPIs (sometimes called Business Level Objectives). This builds an early warning system that helps teams evaluate where to focus their attention. Alex Kroman, SVP and Product GM at New Relic, explains:
One of the areas that has traditionally been hard to get data on is just reliability and performance. How was the piece of software that you've created really performing against customer expectations?
Service levels are the answer to that problem. You can create SLOs across different teams. And you can use that as a way to help manage your software teams, and to help manage the trade-off between technical debt and roadmap work.
Defining SLOs builds in a safety margin — in SRE language, an 'error budget' — where performance issues can be detected and addressed before they become disruptive for customers and users. Kroman explains:
When customers create those service levels, they are able to manage to that safety margin, because the service level is going to be triggered before their SLA or their customer issue is going to be triggered. That's going to be the way that you can essentially manage the backlog of the team and make sure that issues are being addressed before you violate any expectations you're setting with your customers.
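The arithmetic behind that safety margin can be sketched as follows. This is an illustration under stated assumptions rather than New Relic's own calculation: it treats availability as time-based, uses a 28-day window, and picks a 99.9% internal SLO against a 99.5% customer SLA. The function names and all of the figures are hypothetical.

```python
def error_budget_minutes(target_pct: float, window_days: int) -> float:
    """Minutes of unavailability a target tolerates over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - target_pct) / 100.0


def budget_consumed_pct(target_pct: float, window_days: int, bad_minutes: float) -> float:
    """Percentage of the error budget already used up by bad minutes."""
    budget = error_budget_minutes(target_pct, window_days)
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return 100.0 * bad_minutes / budget


WINDOW_DAYS = 28  # assumed rolling window

slo_budget = error_budget_minutes(99.9, WINDOW_DAYS)  # internal objective (assumed)
sla_budget = error_budget_minutes(99.5, WINDOW_DAYS)  # customer commitment (assumed)
safety_margin = sla_budget - slo_budget

print(f"SLO budget:    {slo_budget:.1f} min")     # 40.3 min
print(f"SLA budget:    {sla_budget:.1f} min")     # 201.6 min
print(f"Safety margin: {safety_margin:.1f} min")  # 161.3 min
print(f"Consumed after 30 bad minutes: {budget_consumed_pct(99.9, WINDOW_DAYS, 30):.1f}%")  # 74.4%
```

With these assumed targets, the team would exhaust its SLO budget after roughly 40 minutes of downtime, well before the roughly 202 minutes that would breach the SLA, which is the gap Kroman describes teams managing to.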
The New Relic tool aims to make it easy to get started, with a 'one-click' setup process based on template scenarios to create basic sets of SLIs. New Relic's AIOps functionality then makes recommendations based on historical data to automatically establish baseline SLOs for performance and reliability, and the results are reported in a unified dashboard. Kroman says:
One thing that we've heard from customers about service levels is it's just really hard to get started with them. It's hard to find the right data, it's hard to know what the right objective is to set. So we spent a lot of time investing, and thinking about a one-click service level setup feature.
The vast majority of our customers can go into New Relic on April 5. If they want to get started with service levels, they can just click a button, and they will be presented with what we recommend as a good starting point. And that's going to be based on industry best practices, and it's also going to be based on the baseline data that we are seeing in your service.
The next step is to tie these measurements back into customer-facing SLAs to better manage the risk of service disruption. Dashboards in the product report metrics that allow business leaders to see how the organization overall is meeting expectations around reliability and delivery. Kroman adds:
Oftentimes you see an evolution where the main service level dashboard in the beginning is around errors and performance. Maybe a few months later, it's getting into the business indicators that are contributing to customer success.
Developers can customize the service levels being tracked in NRQL, and create compound service levels across multiple telemetry types, including business data. It's also possible to store service level parameters in code and deploy them automatically using HashiCorp's Terraform infrastructure-as-code software tool. Kroman explains:
A customer can just write their service levels into their Git repo, and just deploy it as part of that. They can revise it with code. That's a more advanced feature, but makes things really easy to manage the complexity of getting these created across a lot of teams.
How Achievers uses service level management
One early adopter is Achievers, a fast-growing employee recognition and engagement platform based in Canada with a global customer base, focused on helping businesses build their company culture. Achievers is a long-term user of New Relic, and its development and SRE teams rely heavily on New Relic One for alerting, distributed tracing, and infrastructure logging and monitoring, rather than using separate tools. It was natural to add service level management as soon as it became available, as Stefan Kolesnikowicz, the company's Principal Site Reliability Engineer, explains:
We didn't want to go somewhere else for the tool, because all the data is already in New Relic. Why would we want to spin up another tool and put metrics somewhere else?
Adding service level management has allowed the engineering team to be more proactive in how they manage operations. Kolesnikowicz says:
We have actually noticed engineers becoming less reactive to when things are broken, and actually more looking at the SLOs, seeing what they're doing, and can plan better for the future and what needs to happen in the roadmap.
These conversations can go back to the product owner, and they can actually discuss, 'Hey, this is high priority, you know, it's affecting reliability or availability for a customer, or a grouping of users, we need to prioritize this and bump it in our sprint.' So it's allowed us to plan things a lot better and get engineers focused on reliability.
Developers too now take more ownership of performance, he adds:
Our product teams are a lot more focused on reliability, too. Back in 2019, when I started at Achievers, we were under this big migration to move from monolith to microservices. And usually, when things broke back then, it was, 'Go to the SRE team, they're going to go and fix it.' Now we're able to offload that work to feature teams, and they handle the raw reliability, they own that service in production now. So this is a great metric to determine if they're doing their job well.
Having everything roll up to an enterprise-wide dashboard has helped Achievers look at reliability across all of its global regions, as a result of which it has set up automated failover between regions to improve reliability. Kolesnikowicz says:
It helps us do a little bit of reverse engineering to figure out how we can come up with a number that makes sense for an individual region. Then since we're global, we can get better reliability by having failover into other regions when things go wrong.
The service level management (SLM) capability is integrated into a self-service tool developed by Achievers' SRE team, which developers use to configure microservices for automated deployment. This means they can now build in SLOs as part of the infrastructure-as-code specification. Kolesnikowicz explains:
They can define today, a latency or an availability SLO within their microservice. We obviously give them a starting point and some sets of defaults that make sense. Then as their service progresses in production, they can adjust those over time. Again, it's all in their control, and they make the decisions with the product owner.
In our case, it's the product team who owns the features. If something is not critical to being up all the time, or maybe it's a feature that can have a more flexible error budget, that's up to the product manager and the engineers themselves.
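The pattern Kolesnikowicz describes, sensible defaults that each team can override for its own microservice, might look something like the sketch below. Achievers' self-service tool is not public, so everything here is an assumption made for illustration: the `DEFAULT_SLOS` values, the key names, and the `resolve_slos` helper are all hypothetical.

```python
# Hypothetical starting points an SRE team might hand to every new service.
DEFAULT_SLOS = {
    "availability_pct": 99.9,  # assumed default availability objective
    "latency_ms_p95": 500.0,   # assumed default 95th-percentile latency objective
}


def resolve_slos(overrides=None):
    """Merge a service's own SLO choices over the shared defaults.

    Unknown keys are rejected so a typo cannot silently drop an objective.
    """
    merged = dict(DEFAULT_SLOS)
    for key, value in (overrides or {}).items():
        if key not in DEFAULT_SLOS:
            raise KeyError(f"unknown SLO setting: {key}")
        merged[key] = float(value)
    return merged


# A critical service keeps the defaults untouched.
print(resolve_slos())

# A less critical feature accepts a more flexible error budget on availability.
print(resolve_slos({"availability_pct": 99.5}))
```

Keeping the overrides alongside the service definition means the product manager and engineers can adjust an objective through the same review process as any other code change.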
This is an important capability that keeps New Relic up to speed with competitors, and it is crucial for building and maintaining more reliable software for today's always-on world. It's good to see this implemented in a way that encourages adoption by making it easy for teams to get started.