How does a telemetry data giant handle its own telemetry needs?
- Summary:
- Rapid growth meant New Relic had to rethink its telemetry architecture - Wendy Shepperd shares the scaled-up solution and the lessons learned.
Two years ago, I joined New Relic to lead the company’s telemetry data platform engineering organization. At my previous company, I had been a New Relic customer, relying on the platform to help oversee and operate some of our most important applications and infrastructure. With my background firmly rooted in SaaS software, hosting platforms and cloud migrations, I was excited to get to work on one of the industry’s leading telemetry platforms.
This new role brought about an intriguing challenge – how does a company that delivers exceptional telemetry data tools to others handle its own enormous telemetry needs? The question is a matter of scale. New Relic’s telemetry data platform, also known as NRDB, ingests more than three billion data points per minute, more than 125 petabytes of data per month, and serves over 50 billion web requests each day. What’s more, this scale doubles approximately every nine months!
NRDB’s explosive growth was driven by the superior telemetry data service we provided to other companies, yet this massive scale threatened our ability to manage our own telemetry operations and performance. We were operating our data centers using a container-based architecture and a monolithic Kafka cluster, and we quickly reached a point where we couldn’t continue to scale with our original architecture.
Here’s how we built the solution for our next phase of telemetry leadership.
Changing architectures
NRDB is a massive distributed system built from the ground up for time series telemetry data. A large time series query fans out across thousands of CPU cores simultaneously, scanning trillions of data points each minute and returning answers in milliseconds. Instead of scaling our existing on-premises data platform cluster to accommodate the growing demands of these queries, we wanted the ability to run multiple copies of the NRDB cluster in multiple regions.
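To picture how a query like that is served, here is a minimal scatter-gather sketch in Python. The shard list, the scan_shard() helper, and the aggregation step are illustrative assumptions, not NRDB’s actual internals:

```python
# Illustrative scatter-gather sketch of a parallel time series query.
# The shards, scan_shard() helper, and count-style aggregate are hypothetical
# stand-ins for NRDB internals that are not described in this article.
from concurrent.futures import ThreadPoolExecutor

def scan_shard(shard, predicate):
    """Scan one shard's slice of the data and count the matching points."""
    return sum(1 for point in shard if predicate(point))

def run_query(shards, predicate):
    # Fan the query out to every shard in parallel, then merge the partial results.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: scan_shard(s, predicate), shards)
    return sum(partials)

if __name__ == "__main__":
    shards = [range(i, i + 1000) for i in range(0, 10_000, 1000)]  # toy data shards
    print(run_query(shards, lambda p: p % 2 == 0))                 # -> 5000
```

In the real system each “shard” is a slice of time series data living on its own node, so adding nodes (or whole cells) adds query parallelism rather than just capacity.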
To achieve this distributed goal, we relied on cell architecture. Using fault-isolated cells increases parallelization, enabling us to operate multiple NRDB clusters at the same time. Most importantly, the cell architecture reduces the impact of failures – one cell failure does not impact the other cells. As the number of cells grows, the customer impact of a hardware or software failure in any single cell declines – with two cells, an incident in one cell will impact 50% of customers, but with 10 cells, an incident only impacts 10%. Cells are added incrementally as more capacity is required.
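To make that arithmetic concrete, here is a minimal sketch of how customers might be assigned to cells and how the blast radius shrinks as cells are added. The hashing scheme and cell counts are illustrative assumptions, not New Relic’s actual placement logic:

```python
# A minimal sketch of cell-based routing and its blast radius. The hashing
# scheme is an illustrative assumption, not New Relic's real placement logic.
import hashlib

def cell_for(account_id: str, num_cells: int) -> int:
    """Deterministically map a customer account to one fault-isolated cell."""
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    return int(digest, 16) % num_cells

def blast_radius(num_cells: int) -> float:
    """Fraction of customers affected when a single cell fails (even spread assumed)."""
    return 1 / num_cells

for n in (2, 10):
    print(f"{n} cells -> one cell failing impacts {blast_radius(n):.0%} of customers")
# 2 cells -> one cell failing impacts 50% of customers
# 10 cells -> one cell failing impacts 10% of customers
```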
By the end of 2019, we had two cells running in our production environment, and at the end of December, our new architecture was put to the test. During the critical holiday period, we had an unexpected traffic spike that caused one of the cells to become overloaded, resulting in sporadic failures. We quickly shifted traffic from the unhealthy cell to the healthy cell, enabling us to keep serving traffic and scale up nodes in both cells concurrently. Most importantly, the incident had no customer impact, something that would have been impossible with our previous architecture.
Lift and shift migration
The transition to our new architecture was a serious undertaking, given the massive amount of data that needed to be migrated as well as the tight integration of NRDB with all New Relic products. We decided to prioritize speed by taking a lift and shift approach, iterating through each data type and associated customer accounts throughout the data migration. Using this approach, we successfully migrated 95% of our data ingestion in just under one year.
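The migration loop itself can be pictured as iterating over each data type and the accounts that use it, with a verification step before moving on. The data type names, migrate_account(), and verify_parity() below are hypothetical placeholders, not our actual migration tooling:

```python
# Hypothetical sketch of a lift-and-shift migration loop that iterates through
# data types and their associated customer accounts.
DATA_TYPES = ["events", "metrics", "logs", "traces"]  # illustrative list

def migrate_account(account, data_type):
    print(f"migrating {data_type} for account {account}")

def verify_parity(account, data_type) -> bool:
    # In practice: compare counts / checksums between the old and new clusters.
    return True

def run_migration(accounts_by_type):
    for data_type in DATA_TYPES:
        for account in accounts_by_type.get(data_type, []):
            migrate_account(account, data_type)
            if not verify_parity(account, data_type):
                raise RuntimeError(f"parity check failed: {account}/{data_type}")

run_migration({"events": ["acct-1", "acct-2"], "metrics": ["acct-3"]})
```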
Cellular architecture provides us with a lot of flexibility to meet the different needs of our customers. Our new architecture features dozens of cells, with each cell configured automatically based on different profiles. For example, we provision our largest capacity cells with the beefiest compute and storage resources, and use these cells to host customers with the heaviest workloads. We have cells tailored specifically for customers in highly regulated industries with stringent security and compliance requirements such as FedRAMP and HIPAA. We have the ability to pin specific customers to a single cell as needed to isolate them from other customers. We keep hot spare cells available in case we need to shift traffic quickly due to load spikes.
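Profile-driven provisioning of this kind could be expressed roughly as follows. The profile names, sizes, and pinning fields are illustrative assumptions rather than New Relic’s actual configuration format:

```python
# Illustrative cell profiles; the names, sizes, and fields are assumptions,
# not New Relic's real configuration.
CELL_PROFILES = {
    "high-capacity": {"compute": "largest", "storage_tb": 500, "workload": "heavy"},
    "regulated":     {"compute": "standard", "storage_tb": 200, "compliance": ["FedRAMP", "HIPAA"]},
    "hot-spare":     {"compute": "standard", "storage_tb": 200, "standby": True},
}

PINNED_ACCOUNTS = {"acct-42": "cell-07"}  # isolate specific customers on one cell

def cell_for_account(account_id, default_cell):
    """Honor pinning first, otherwise fall back to the normal placement logic."""
    return PINNED_ACCOUNTS.get(account_id, default_cell)
```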
Lessons learned
Not every company is going to have telemetry needs on the scale of New Relic. However, some lessons learned from our cloud migration are applicable to any company, regardless of size. Here are a few key takeaways:
- Observability is vital – It’s important to instrument every part of your stack, especially when migrating to a new architecture and infrastructure. We took baseline measurements of how our systems performed on our previous architecture, which then enabled us to compare a number of metrics during and after migration. Automatically instrumenting each new cell as it is built should be standard operating procedure.
- Don’t plan for the easy route – Even the best-laid plans will experience surprises and discoveries. Instead of going in with a comprehensive plan, start small and iterate. Our iterative approach through each data type allowed us to make changes and improve our automation throughout the migration process.
- Continuously build new cells – Our cells have a limited lifespan of 90 days. We continuously build and decommission cells so we can take advantage of the latest functionality improvements, security patches, and OS updates that keep our fleet up to date. By accepting that any one cell will be in use for 90 days or less, we can make fundamental changes in new cells and remove old ones without stress (see the sketch after this list). As a result, our platform reliability has improved significantly and we have reclaimed engineering capacity that was previously spent on upgrading infrastructure.
- Invest in communication – Cloud migrations are a company-wide event, and they should be treated as such with company-wide communications. Be sure to create solid plans across engineering, finance, sales and support, and make a point of over-communicating throughout the process. Don’t keep your customers in the dark, either; let them know about what you’re doing and how it will benefit them in the long run.
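The 90-day rotation mentioned above boils down to a simple age check over the fleet. The build_cell() and drain_and_decommission() functions below are hypothetical stand-ins for fleet tooling that is not described in this article:

```python
# Hypothetical sketch of a 90-day cell rotation policy.
from datetime import datetime, timedelta, timezone

MAX_CELL_AGE = timedelta(days=90)

def build_cell(name):
    print(f"building replacement cell {name} with the latest OS and patches")

def drain_and_decommission(name):
    print(f"draining traffic from {name} and decommissioning it")

def rotate_fleet(cells):
    """cells: mapping of cell name -> creation timestamp (UTC)."""
    now = datetime.now(timezone.utc)
    for name, created_at in cells.items():
        if now - created_at > MAX_CELL_AGE:
            build_cell(f"{name}-replacement")
            drain_and_decommission(name)

rotate_fleet({"cell-01": datetime(2020, 1, 1, tzinfo=timezone.utc)})
```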
Beyond just supporting our scalability needs, this cloud migration and cell architecture opens new doors for future growth and innovation. Our ability to add cells enables us to scale continuously, making it easier for us to grow in new clouds and regions. We’re on a course now for limitless growth – not just for us, but also for our customers.