McAfee provides online cybersecurity protection to millions of consumers, and their devices, around the world. Securing, monitoring and providing support to these millions of endpoints requires a sophisticated back-end architecture, which is supported by streams of flowing data.
However, until recently McAfee was struggling with its infrastructure: a highly integrated, on-premises monolith that didn’t scale effectively, was prone to breaks between connected domains and services, and limited the speed at which the company could bring new services to market.
As a result, the cybersecurity provider decided to decouple its architecture, introduce microservices in the cloud, standardize on a platform approach, and adopt more real-time, event-driven ways of operating.
To do this, McAfee is using Confluent Cloud, Confluent’s managed Apache Kafka service, to underpin its new microservices environment. We got the chance to speak with McAfee VP of Platforms, Mahesh Tyagarajan, about the company’s shift away from a monolith environment and how its new, highly decoupled architecture is reducing operational costs and allowing it to innovate at greater speed.
Tyagarajan said that to create its microservices, Confluent Cloud was absolutely necessary, as it stitches the decoupled domains and services together through an event-based architecture.
He explained why microservices and this approach are so crucial:
For a microservices architecture - eventing - decoupling is very much the foundation. Otherwise, you'll have one service calling into another and that'll be impacted by the availability of another. So just keeping it clean and decoupled is key. Within a domain you're okay. But if a service depends on some protection, like a printing service, or a SMS service, then you're reducing the availability of that domain, based on the availability of something else.
That was the foundation. But then, we also wanted to be as near real time as possible, whether we were handing over data to analytics, or if we were getting data streams with telemetry from our endpoints.
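The decoupling Tyagarajan describes can be sketched in a few lines. The example below is a toy, in-process stand-in for a streaming platform - it is not the Kafka or Confluent API, and the `EventBus` class, topic name and service functions are all hypothetical. It only illustrates the principle that a producer publishes an event and carries on, even when one of the consumers (here, an SMS service) is unavailable.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """Toy in-process event bus (illustrative only, not the Kafka API)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)
        # Events that a consumer failed to process, kept for later inspection.
        self.dead_letters: List[Tuple[str, Event, str]] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> int:
        """Deliver to every subscriber; a failing consumer never fails the producer."""
        delivered = 0
        for handler in self._subscribers[topic]:
            try:
                handler(event)
                delivered += 1
            except Exception as exc:
                self.dead_letters.append((topic, event, repr(exc)))
        return delivered


analytics_rows: List[Event] = []


def sms_service(event: Event) -> None:
    # Simulates a downstream dependency that is currently unavailable.
    raise ConnectionError("SMS gateway unavailable")


def analytics_service(event: Event) -> None:
    analytics_rows.append(event)


bus = EventBus()
bus.subscribe("account.created", sms_service)
bus.subscribe("account.created", analytics_service)

# The accounts domain publishes and moves on; the SMS outage does not block it.
delivered = bus.publish("account.created", {"user_id": "u-1"})
```

In a real deployment the broker would sit between the services and the failed consumer would catch up from the stream once it recovered; the dead-letter list here is only a simplification of that idea.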
McAfee also relies on a lot of partners to sell its products, given that much of McAfee’s software ends up on laptops sold by Dell or HP, for example. Because many customers belong to both McAfee and its partners, a lot of data is exchanged within this ecosystem. This, again, is why data streaming is critical to the company. Tyagarajan said:
We're completely in the microservices world right now. Half of my system has migrated over, the other half are getting migrated over right now from the monolith. So, user accounts, login, the analytics side of the house - those have been migrated over. Subscriptions, which is when you basically subscribe to the services we provide, are in the process of moving over from the old legacy to the new.
By the end of Q1 next year McAfee expects that most of its traffic will have moved over to the new microservices environment, which is hosted on AWS.
Verbs and nouns
Tyagarajan said that he educated his team on microservices by describing the systems to them in terms of ‘verbs and nouns’. By this, he means: what are the entities and what can that entity do? He explained:
If I'm a McAfee user, what are the five things that a user can do? It shouldn't overreach. It should be complete. And you should provide all of that functionality. It's almost like taking these little, itsy, bitsy pieces of things that you can stitch together. And then you think ‘that’s a noun, what does it provide in terms of verbs that can be done to it?’
Or what can it do to other nouns in the system? So that's the domain decomposition, which is the first exercise.
These domains, and the services that they provide, are all stitched together with Confluent’s streaming data platform. Data is shared between the various systems so that when an event occurs in one area, the rest of the system is notified that it happened - but all as part of a decoupled architecture. Tyagarajan said that each team takes about a quarter to understand this ‘noun and verb’ metaphor and get to grips with the new microservices approach, but added that once they’ve grasped it, it brings lots of opportunities.
A big benefit of this approach is the ability to reuse components. Tyagarajan said:
The lower layers have no business logic. They just do functional things. The use-case-specific business logic floats to the top of the stack. That way if we have to change something, I don't have to change the APIs underneath. It's just the flow, or the orchestration. We've provided velocity by allowing for reuse at the bottom tiers.
For instance, when working with channel partners, such as HP or Dell, they may have their own customer IDs that McAfee doesn’t want floating into its own systems. With the microservices architecture, McAfee has an abstraction layer that means partners can use their own IDs without that data being relevant to McAfee. Tyagarajan said:
The plugin mechanism allows you to do something specific for a partner, or specific for a use case, which can override the default. So all these concepts sort of come together. It's slightly different company to company, but generally speaking, don't do too much customization, reuse, keep the business logic up top, and drive reuse at the bottom.
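As a rough illustration of a plugin that overrides a default, the sketch below resolves a partner's own customer ID to an internal one. Everything in it is an assumption made for the example: the `PluginRegistry` class, the `partner_a` name and the ID values are hypothetical and are not taken from McAfee's implementation.

```python
from typing import Callable, Dict

# A resolver turns the ID a caller supplies into an internal ID.
Resolver = Callable[[str], str]


def default_resolver(external_id: str) -> str:
    """Default behaviour: the caller already uses the internal ID."""
    return external_id


class PluginRegistry:
    """Holds one default plus optional per-partner overrides (hypothetical)."""

    def __init__(self, default: Resolver) -> None:
        self._default = default
        self._overrides: Dict[str, Resolver] = {}

    def register(self, partner: str, resolver: Resolver) -> None:
        self._overrides[partner] = resolver

    def resolve(self, partner: str, external_id: str) -> str:
        # Use the partner-specific plugin if one exists, else the default.
        return self._overrides.get(partner, self._default)(external_id)


# Hypothetical mapping kept inside the partner plugin, so partner IDs
# never leak into the reusable lower layers.
PARTNER_A_IDS = {"A-1001": "internal-42"}


def partner_a_resolver(external_id: str) -> str:
    return PARTNER_A_IDS[external_id]


registry = PluginRegistry(default_resolver)
registry.register("partner_a", partner_a_resolver)
```

Calling `registry.resolve("partner_a", "A-1001")` goes through the partner plugin, while any unregistered caller falls back to the default. The lower-level services only ever see internal IDs, which is the kind of reuse at the bottom tiers the quote describes.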
This decoupled architecture, with reusable components, allows McAfee to move at a much faster pace. It recently announced a product for small businesses, for example, which is designed for one to 100 users and was built on the new architecture. On the timeframes for getting it to market, Tyagarajan said:
We were able to put out the product in three months. A similar product, of a similar size, took us about nine months in prior project undertakings. That’s because of the reusable components, the decoupling, and isolation.
I don't ever test it! Because if it's working, I don't have to test that piece. I only have to test that new orchestration that comes in. Isolating that to be able to reduce your blast radius, we call it the isolation of concerns - that way you don't have to worry about retesting something.
This has resulted in anywhere between a 50% and 60% reduction in coding time, according to Tyagarajan, based on the fact that McAfee is now using a lot of frameworks and is being driven by data. He added:
If you have to set up a new plan, where you get antivirus, you get a VPN - like five or six things that we sell. We used to spend three weeks setting that up. Right now, with the tools, no engineer gets involved, and it takes two hours for a business person to set it up.
And so I saved myself the engineering side of the house. And I've also given a huge velocity boost from the business processes, which can go much faster.
Tyagarajan said that for other organizations seeking to do something similar - breaking a monolith up into microservices based in the cloud - the key is to bring experienced talent into the organization. He added:
Invest in getting somebody who's seen this before. For example, my architect, he's done Confluent before. He told me: ‘when we have to manage this over time, the schema changes, and managing the schema changes is a very important thing’.
That didn't occur to me until he told me. The manageability, the security aspects of it, are all things I learned from my architect as I was going through this process, which is just amazing.
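To show why schema changes need managing, here is a minimal sketch of one common rule: a new version of an event schema can safely read old events only if every field it adds has a default. The schema dictionaries, field names and helper functions are hypothetical and greatly simplified; in practice this kind of check is typically delegated to a schema registry rather than hand-written.

```python
from typing import Any, Dict

# Sentinel marking a field that has no default value.
REQUIRED = object()

# Hypothetical schema versions: field name -> default (or REQUIRED).
SCHEMA_V1: Dict[str, Any] = {"user_id": REQUIRED, "plan": REQUIRED}
SCHEMA_V2: Dict[str, Any] = {"user_id": REQUIRED, "plan": REQUIRED, "seats": 1}


def is_backward_compatible(old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    """A new schema can read old data if every field it adds has a default."""
    return all(
        default is not REQUIRED
        for name, default in new.items()
        if name not in old
    )


def read_event(event: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Read an event with the given schema, filling defaults for missing fields."""
    record: Dict[str, Any] = {}
    for name, default in schema.items():
        if name in event:
            record[name] = event[name]
        elif default is REQUIRED:
            raise KeyError(f"missing required field: {name}")
        else:
            record[name] = default
    return record


# An event written under the old schema is still readable under the new one.
old_event = {"user_id": "u-1", "plan": "family"}
upgraded = read_event(old_event, SCHEMA_V2)
```

Adding a required field without a default would fail the compatibility check, which is the kind of change that can break consumers still processing older events.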
But the benefits for McAfee have been clear. It can now deploy data pipelines efficiently without one domain impacting the availability of another; it has higher resiliency; it has lower operational costs and faster speed to market; and it can take advantage of public cloud SLAs and support. Speaking to the outcomes of the project, Tyagarajan said:
Microservices are the way to go. It allows you to scale, right? It allows you to scale different parts of the system independent of the other. It gives you a lot of flexibility. You can take one small area, isolate it, you can take it down, but my entire system is not down, I'm taking some preventive maintenance, and I can scale it independent of the rest of the system.
There's so many benefits to it, right? It allows you to make use of your cloud infrastructure more efficiently. And that is I think the key. It allows you to bring down the costs.