A look at how ServiceNow is managing 85,000 databases with 25 billion queries per hour
Leading enterprise cloud vendor ServiceNow runs an incredibly complex cloud architecture, which it manages using automation and MariaDB.
Yim explained that ServiceNow is using its own platform internally to automate the entire management of the infrastructure and has gone through an evolution to ensure stability and availability for customers.
ServiceNow has eight data centre pairs across the globe, six support centres and three site reliability engineering departments, which are watching the infrastructure 24/7.
Yim explained that the first question he normally gets asked is whether ServiceNow is single tenant or multi-tenant. According to Yim, the answer is neither, as ServiceNow has come up with a new deployment model, which he calls ‘multi-instance’. Yim explained:
Traditionally companies will start off with a multi-tenant database, where all your customers are on one large database. Then you will duplicate that infrastructure, to your HA/DR side. 1,000 customers on this large database. We all know you can have a single bad actor in that single, large database, and then that one customer ends up affecting 999 other customers. So if you fail over, you have to fail over every single customer. So instead of one outage, you just caused 1,000 outages.
Yim said that as companies continue to scale, they progress to a second-generation infrastructure, which keeps the idea of a multi-tenant database but reduces the failure domain to perhaps 100 customers per database. He added:
So it’s still co-mingled data and multi-tenant, just many pods. The same problem exists. One bad actor can cause 100 fail overs, instead of just one.
To counter this, ServiceNow has come up with a third generation approach, where every single customer gets their own database, sometimes many databases. Yim said:
Every single customer gets their own front end application tier as well. So on the back-end there is no co-mingling of data at all. What this means is that we have thousands and thousands of databases we have to maintain.
The advantage is that you can failover a single customer at a time, you can scale a single customer up. If they need additional compute power, memory, etc. We really have built in surgical scale on a per customer basis, globally.
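To make the three generations concrete, here is a minimal sketch of the failure-domain arithmetic Yim describes. The 1,000 and 100 customers-per-database figures come from his examples. The function and variable names are illustrative assumptions, not anything ServiceNow has published.

```python
import math


def failover_blast_radius(total_customers: int, customers_per_database: int) -> int:
    """Customers forced to fail over when one bad actor takes down one database."""
    return min(total_customers, customers_per_database)


TOTAL_CUSTOMERS = 1000  # figure used in Yim's example

# Customers sharing one database in each generation, as described in the article.
generations = {
    "gen 1: one large multi-tenant database": 1000,
    "gen 2: multi-tenant pods": 100,
    "gen 3: multi-instance": 1,
}

for name, per_db in generations.items():
    databases = math.ceil(TOTAL_CUSTOMERS / per_db)
    affected = failover_blast_radius(TOTAL_CUSTOMERS, per_db)
    print(f"{name}: {databases} database(s), {affected} customer(s) fail over per incident")
```

The trade-off is visible in the same numbers: the blast radius shrinks from 1,000 customers to one, while the number of databases to operate grows from one to 1,000.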
How it all works
ServiceNow isn’t using any sort of hypervisor or container to do this. Every single customer instance runs directly on bare metal for enhanced performance. Yim said it’s not that containers are “bad”, but rather “running on bare metal directly was the best choice for us”.
The way it works is that the hardware itself is shared across each tier, the app tier and the database tier, but the processes are ‘containerised’ with SELinux, cgroups and iptables. So no co-mingling of data at any level and everybody is running on bare metal. Every piece of gear is duplicated, every single configuration, every single customer app node, every single database, is replicated as well. Every single night we back up every single database. There’s a lot going on every single hour in the ServiceNow infrastructure.
And the key is automation, according to Yim. He said:
So, how were we able to achieve that? It really comes down to, you have to automate everything. We did this with the ServiceNow platform. We have a single instance running in the ServiceNow cloud that we run everything through. We use our own software. It starts with discovery in the CMDB, so you have all of your assets tracked in the single location, then you can build workflows against them. We have our provisioning system, high availability, our fail over, is all automated on top of the ServiceNow platform.
Once you have all of your information into a CMDB, it really gives you that power and flexibility of automating that information. In the recent releases, we have what we call a flow designer, where you can actually automate without writing any code, with natural language instead. This really enabled non-tech people to start writing automations within their department.
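As a rough illustration of the pattern Yim describes, with assets tracked in one place and workflows built against them, the sketch below models a toy CMDB as a Python dictionary and a failover workflow that touches exactly one customer. Everything here is hypothetical: the record fields, the site names and the `fail_over` function are assumptions for illustration and do not reflect ServiceNow's actual data model or APIs.

```python
from dataclasses import dataclass


@dataclass
class InstanceRecord:
    """Hypothetical CMDB entry for one customer instance."""
    customer: str
    active_site: str   # data centre currently serving the customer
    standby_site: str  # paired data centre holding the replica


def fail_over(cmdb: dict[str, InstanceRecord], customer: str) -> InstanceRecord:
    """Swap active and standby sites for a single customer, leaving all others alone."""
    record = cmdb[customer]
    record.active_site, record.standby_site = record.standby_site, record.active_site
    return record


# Toy inventory: two customers on the same (made-up) data centre pair.
cmdb = {
    "acme": InstanceRecord("acme", active_site="dc-a", standby_site="dc-b"),
    "globex": InstanceRecord("globex", active_site="dc-a", standby_site="dc-b"),
}

fail_over(cmdb, "acme")
for record in cmdb.values():
    print(f"{record.customer}: active={record.active_site}, standby={record.standby_site}")
```

Because the workflow only reads and writes the one record it was asked about, the second customer stays on its original site, which is the per-customer failover behaviour the multi-instance model is meant to allow.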
ServiceNow is aiming for 99.996% availability. Globally, it has 50,000 ServiceNow instances supporting the cloud instances, 150 million active users and 10 billion transactions a month. Yim said that everything, including every single customer instance, runs on MariaDB. He said:
We have hospitals running ServiceNow. We have power stations running ServiceNow. So availability is critical for us. But it’s more than availability, it’s stability with MariaDB. More than just the technology, it’s really the people. The folks at MariaDB are on the right track.
Running as many databases as we have, we have problems. Given that it’s a dynamic platform, where customers can write code, the query patterns change. It’s one of the most difficult platforms to tune. It’s very difficult, just trust me on that one. So we’ve had to engage with support and it’s been a fantastic relationship.
They’re helping us get involved with the roadmap, where the roadmap is going to go. We are going to be giving back to the open source community. It’s less about the technology and more about the people.
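For a sense of what the 99.996% availability target mentioned earlier allows, the short calculation below converts an availability percentage into a downtime budget. It assumes a 365-day year and treats availability as simple uptime over elapsed time. The article does not say how ServiceNow measures it, so this is only a back-of-the-envelope sketch.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget implied by an availability target over the given period."""
    return period_minutes * (1 - availability_percent / 100)


target = 99.996  # target quoted in the article
per_year = allowed_downtime_minutes(target)
per_month = per_year / 12

print(f"{target}% availability allows about {per_year:.1f} minutes of downtime per year")
print(f"That is roughly {per_month * 60:.0f} seconds per month")
```

Under those assumptions the budget works out to about 21 minutes a year per instance.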
And the results? In a single hour, the ServiceNow global infrastructure handles 730,000 configurations added to a configuration management database, along with 76,000 assets and half a petabyte of backups.
All of this runs on 85,000 databases containing 176 million InnoDB tables, accessed at a rate of 25 billion queries per hour.
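The headline figures are easier to picture as per-second and per-database averages. The snippet below does that arithmetic using only the numbers quoted in the article. The averages assume load is spread evenly across databases, which the article does not claim and which is unlikely to hold in practice.

```python
DATABASES = 85_000
INNODB_TABLES = 176_000_000
QUERIES_PER_HOUR = 25_000_000_000
SECONDS_PER_HOUR = 3600

# Fleet-wide query rate per second.
queries_per_second = QUERIES_PER_HOUR / SECONDS_PER_HOUR

# Simple averages per database (assumes an even spread, for illustration only).
tables_per_database = INNODB_TABLES / DATABASES
qps_per_database = queries_per_second / DATABASES

print(f"Fleet-wide: about {queries_per_second:,.0f} queries per second")
print(f"Average tables per database: about {tables_per_database:,.0f}")
print(f"Average queries per second per database: about {qps_per_database:.1f}")
```

That comes to roughly 6.9 million queries per second across the fleet, or a little over 2,000 tables and around 80 queries per second for an average database.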
This is impressive stuff.