Porsche revs up its use of Splunk to tame Apache Kafka Streamzilla

By Derek du Preez, October 21, 2021
Summary:
Porsche has created a data streaming managed service for internal product teams - called Streamzilla - which is being monitored through the use of Splunk.

(Image: Godzilla. Image by Artist and zabiyaka from Pixabay)

Porsche needs little introduction, being one of the most famous luxury car manufacturers in the world. However, much like the rest of the automotive industry, Porsche is having to blend its physical assets with the digital world in a bid to provide wrap-around services for customers, including monitoring of vehicle performance and personalized experiences. 

This requires connecting vehicles with sensors to create digital twins, which stream data back to Porsche HQ. Not only this, but Porsche's production lines are increasingly connected in a bid to improve quality and efficiency. These data streams are used by a variety of Porsche product teams and, as they become more integral to the car manufacturer's business, they require careful monitoring and support.

Enter Streamzilla - the centrally managed service run by Sridhar Mamella, Data Streaming Manager at Porsche, which uses open source software (Apache Kafka) to meet all of Porsche's streaming data needs. But as Streamzilla has grown alongside Porsche's digital capabilities, so has the need to monitor the data logs, in a bid to improve reliability and support.

As a result, Porsche is using Splunk to tame its growing Apache Kafka estate, bringing it under control and providing internal product teams with a service that is predictable and has high availability. 

Speaking at .conf, Splunk's annual user conference, project leader Mamella explained the thinking behind Porsche's Kafka implementation. He said:

Our aim is to bring the Porsche experience into the digital future. That being said, this requires a robust infrastructure and a solution at scale. At Porsche, teams needed something to decouple their architecture and accelerate the data flow. We looked far and we looked wide, and then we found out the problem could be solved by a technology called Apache Kafka. 

It's a distributed system, originally developed at LinkedIn, that's used for data streaming, log storage and stream processing. What's so cool about that? Kafka's main benefit, apart from enabling the decoupled transportation of data in real time, is its ability to scale. So you can now easily add or remove Kafka brokers from your system automatically and adjust your workload. This capability also makes it highly available. 
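The decoupling Mamella describes can be illustrated with a toy, in-memory model of a log. This is only a sketch under stated assumptions, not Kafka's implementation and not Porsche's code: the `TopicLog` class, its methods, the consumer group names and the sample readings are all hypothetical. It shows a producer appending records without knowing who reads them, while each consumer group keeps its own read position.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log; a stand-in for a single Kafka topic partition."""

    def __init__(self):
        self._records = []                 # the ordered, append-only log
        self._offsets = defaultdict(int)   # next offset to read, per consumer group

    def produce(self, record):
        """Append a record and return its offset; the producer never sees consumers."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return unread records for `group` and advance only that group's offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def lag(self, group):
        """Number of records appended but not yet read by `group`."""
        return len(self._records) - self._offsets[group]


log = TopicLog()
# Made-up sample readings, for illustration only.
for reading in ({"vin": "demo-1", "speed_kmh": 80}, {"vin": "demo-1", "speed_kmh": 95}):
    log.produce(reading)

twin_batch = log.poll("digital-twin")            # one team reads everything so far
log.produce({"vin": "demo-1", "speed_kmh": 60})  # producer carries on regardless
```

In this model a group that has never polled, such as a hypothetical `"quality-analytics"`, still finds every record waiting for it, which is the sense in which producers and consumers are decoupled.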

Mamella explained that just a couple of years ago, Porsche had several teams using independent Kafka clusters, each with its own use cases and managed separately. This worked, but something like creating digital twins of all Porsche cars requires data to be sent from dozens of systems across the world. That was unsustainable in the existing setup, so Porsche decided to consolidate. He said:

An outdated digital twin has no value. So updating the current version had to happen and in a reliable way. We had different product teams using Kafka, but obviously maintaining a data stream solution was not their core business, so that's where we ended up with different clusters, most of them were neglected, which in the long run would not have been ideal for a customer that expects things to be perfect…just like their Porsche. 

A central service

As noted above, the decision was made to build a central data streaming platform, based on Apache Kafka, called Streamzilla, which would act as a shared service provider within Porsche. Alongside the platform, the Streamzilla team also provides consulting services and open documentation for Kafka knowledge. Mamella said: 

This meant both technological and organizational change. We act as the mediator between product teams when they need access to streaming data, a completely new approach with lots of potential and lots of thinking ahead of the time. 

The data strategy is now to enable the usage of data along data domains, allowing various product teams and various departments to use and enrich the data when they need it. A perfect example would be the implementation of digital twins. This enables the organization to introduce new systems with optimized quality, minimized time to market, whilst using a continuously updated and managed solution. 

Mamella and his team built the platform from scratch using open source components, in order to avoid vendor lock-in. The components included Kafka brokers, ZooKeeper nodes, MirrorMakers, proxies and application layer gateways.

Mamella said that managing this was easy enough at the start, as Porsche was only using one of each of these components and had four Docker containers. He said: 

That was pretty manageable in the start. We could identify errors, we could act on them, we could provide a pretty decent service. But as the platform grew, more and more components were added, like Kafka Connect, a REST proxy, the schema registry, and a lot more components from the Kafka ecosystem that would add value to the end customer (product teams).

This number of components was extremely hard for us to keep up with. We had the best team, the best developers and engineers, but still at times it was extremely hard to identify and act on errors in due time. We are talking about production level systems, replicated all across the globe, with real-time data coming in. And now without the right tool, it was starting to get tedious. That's the point where we decided to use Splunk. 

One tool to rule them all

Splunk was implemented in order to monitor the entire open source stack. Porsche already had an in-house Splunk team, so getting them in to set up the service and build dashboards could happen pretty quickly, according to Mamella. He added: 

The requirements for monitoring grew, which was easily managed by Splunk. But after a while, we realised it's not only relevant for us, but also the internal customer too. Reliability and reliable systems are always valuable. So we wanted to provide direct value to them, so we, as a platform provider, set up monitoring for their systems. We set up alerts so that whatever data is coming in for them, monitoring if it is fine or not. 

Most of the time they needed analysis when asking our operations team why they had errors all the time, what tasks needed to be executed while they had these errors - this time waiting for reply could easily be eliminated by providing the necessary information as self service, which is at the core. 

The Streamzilla team set up alerting for ACLs, access control lists within Kafka. So if any errors happened, the corresponding topic owner would be contacted by the Splunk alerting system. Mamella added: 

Security related information was also highly relevant for the customer. As you can imagine for a company like Porsche, we take security very seriously. So the internal customer would receive errors, or error logs, or security information at their fingertips. 

We also provided functionality where they could see the Kafka consumer statistics, we added monitoring for replication errors, so connectivity problems and misconfigurations could be easily identified. Getting all this information helped us identify areas that could need some improvement, but at the same time benefits the customer, which is an immeasurable benefit.
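The idea of contacting the corresponding topic owner when an ACL error occurs can be sketched in a few lines. The article does not describe Porsche's log format or how its Splunk alerts are configured, so everything here is an assumption for illustration: the event fields (`type`, `topic`, `principal`), the owner mapping, the addresses and the `route_acl_denials` helper are hypothetical, and the sketch does not call Splunk or Kafka at all.

```python
from collections import defaultdict

# Hypothetical topic-to-owner mapping; a real platform would keep this in a registry.
TOPIC_OWNERS = {
    "vehicle.telemetry": "team-twin@example.com",
    "plant.quality": "team-quality@example.com",
}
FALLBACK_OWNER = "platform-team@example.com"


def route_acl_denials(events, owners=TOPIC_OWNERS, fallback=FALLBACK_OWNER):
    """Group ACL-denial events by the owner of the affected topic.

    `events` is an iterable of dicts with assumed keys "type", "topic" and
    "principal". Events of any other type are ignored; denials on topics
    without a known owner go to the fallback address.
    """
    alerts = defaultdict(list)
    for event in events:
        if event.get("type") != "acl_denied":
            continue
        owner = owners.get(event.get("topic"), fallback)
        alerts[owner].append(f'{event.get("principal")} denied on {event.get("topic")}')
    return dict(alerts)


# Made-up sample events, for illustration only.
sample_events = [
    {"type": "acl_denied", "topic": "vehicle.telemetry", "principal": "User:app-a"},
    {"type": "replication_error", "topic": "plant.quality", "principal": None},
    {"type": "acl_denied", "topic": "unknown.topic", "principal": "User:app-b"},
]
alerts = route_acl_denials(sample_events)
```

Sending unowned topics to a fallback address is one possible design choice; it keeps a denial from going unnoticed when the ownership mapping is incomplete.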

Since implementing Splunk, Porsche has seen many benefits. For instance, it has integrated the Splunk alerting system into Microsoft Teams, which means that engineers get alerts immediately when an error occurs and can take action, normally within seconds. This has resulted in faster resolution times - an 85% reduction in mean time to resolution, in fact. Mamella added: 

But the time until we noticed incidents isn't the only aspect - the logs are provided in an easily understandable format, which frees up important resources and allows us to focus on developing the platform and developing business value. And finally by minimising operational overhead, we could free up the time of the Streamzilla team, which would have been blocked up by having to identify issues and the solutions to that (26% increased cost efficiency). 

A system like Splunk means you have every single piece of information you need at a single glance.