Google announced its intention to acquire healthcare data and wearables company Fitbit last year for $2.1 billion. Although the deal is under scrutiny from antitrust regulators and consumer advocacy groups, in the meantime Fitbit has been working to move its entire monolithic application to Google Cloud Platform (GCP) - a coup for the IaaS provider.
Sean-Michael Lewis, Principal Software Engineer at Fitbit, spoke at Google Cloud's Next ’20 On Air event about the migration, which was completed three weeks early and focused on ensuring minimal disruption for paying customers. Lewis provided a technical breakdown of the company's approach to shifting to GCP.
He explained that Fitbit started with a monolithic application, which handled all of its traffic for a long time. The company then started to break that up into microservices - but the monolith still handles up to about 70% of customer traffic. Moving microservices is a much easier task than moving a monolithic application. Lewis said:
So let's talk about our monolith. It's a single Java binary, about 1,000 instances, backed by around 200 MySQL instances. Those MySQL instances are mostly sharded by user data, so if a request comes in I can get all of my data from one shard. We have a sharding library that manages the shards, so it figures out where users go and when new users sign up. It allows you to expand and contract the number of MySQL instances that we have. It also handles caching, for which we had about 400 nodes of memcache. We also have Kafka for which we do asynchronous processing and those messages are largely processed by other instances within the monolith.
We thought about what it would look like for us to move and who we should be thinking about as we are moving. The most important stakeholder in the move is your users. Sometimes doing the best thing for your users isn't the easiest thing technically. But that should be the thing you do first, or the person you think about first.
Keeping users in mind
So, what options were available to Fitbit for moving to GCP, whilst keeping users in mind? Lewis explained that there were two clear possibilities - either to do it progressively, to move individual users or batches of users over time, or to move the entire application at once.
There are pros and cons to both approaches, he said. The progressive approach would allow Fitbit to identify any challenges or problems with the new environment without affecting the majority of users, and to observe how the new environment handled load. But it would mean relying on network calls to the old data centre, which would add a lot of latency. It would also mean operating and managing two full application stacks at the same time, in two data centres.
On the other hand, the ‘move it all at once’ approach would affect everyone - if there were a big outage, every user would go down, so it carries a lot of risk. Testing could be done beforehand, but you don't really know what's going to happen until you flick the switch. On the positive side, once you're in the new data centre all of your network calls are local, so there are no latency issues. And although you don't have to maintain and operate two data centres at the same time, you do need one on standby in case something goes wrong.
Assessing these points, Fitbit felt more comfortable going with the progressive approach. Lewis said:
We felt that overall, thinking of the users in mind, that the progressive migration was going to be much better for them. On average most users wouldn't experience any downtime or any noticeable downtime. We thought that if we could slowly roll into GCP we would give our users the best experience.
Fitbit's first guiding principle was that it wanted to route users to the environment where their data is resident. As noted above, Fitbit shards its data by user - so if a user's data is moved out of the old hosting environment to GCP, that user's requests should be routed to GCP. Lewis explained:
That was the goal of this migration, to move the user with the data. Let's say we didn't route correctly and my data is in GCP but I got routed to the old data centre - well, that should still work. We still want that to be an experience, we don't want to be throwing errors when we have a misroute. So we wanted an experience no matter what.
In addition to this, Fitbit wanted flexibility. Lewis added:
We wanted to move users to GCP, but if things aren't going well we need to be able to move them back. We also need to be able to move at variable speed. If things are going great we should be able to move users more quickly, if they're not going so great, we should trickle them in.
Lastly as we were making these decisions about fast, slow, forward, backwards, we don't want to be making a lot of like gut check decisions. We want to be able to look at a chart and say we're meeting our SLOs, we're going to have to move forward now. So we needed to develop more SLOs and get a handle on them before we can do this appropriately.
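The idea of letting SLOs, rather than gut feel, set the direction and speed of the migration can be sketched in a few lines of Python. This is only an illustration: the `Slo` type, the `plan_next_step` function, the batch sizes and the rollback margin are all hypothetical and are not taken from Lewis's talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float    # minimum acceptable value, e.g. 0.999 availability
    observed: float  # value measured over the most recent window


def plan_next_step(slos, batch_size, min_batch=100, max_batch=100_000,
                   rollback_margin=0.01):
    """Return (direction, batch_size) for the next round of user moves.

    Hypothetical policy: speed up while every SLO is met, fall back to a
    trickle when one is narrowly missed, and move users back when one is
    missed by more than rollback_margin.
    """
    # Largest shortfall across all SLOs; zero or negative means all are met.
    worst_gap = max((s.target - s.observed for s in slos), default=0.0)
    if worst_gap > rollback_margin:
        return ("back", batch_size)
    if worst_gap > 0:
        return ("forward", min_batch)
    return ("forward", min(batch_size * 2, max_batch))
```

With an availability SLO of 0.999 and an observed value of 0.9995, for example, the sketch doubles the batch size; at 0.998 it drops back to the minimum batch, and at 0.95 it signals a move back to the old environment.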
In order to make sure everything was working as it should be in GCP, Fitbit used its employees as guinea pigs for the new environment. Tests can always be done, but until you're routing production users, it's not real. As such, Fitbit employees were all routed through GCP with their data kept in the old hosting environment. This meant the experience was slow, but allowed Fitbit to fix any bugs to make sure the experience was meaningful. Lewis said:
Once we felt confident in that we decided it was time to start moving paying customers. So we started moving users slowly, in a slow trickle. Then once we got a lot of confidence we started moving users in big batches. We moved them quickly and quicker than we even anticipated.
Initially Fitbit had empty databases in GCP and full databases in its legacy data centre. Fitbit's sharding scheme groups users into buckets, and each shard holds a number of buckets. This gave Fitbit two options for moving users. It could move one bucket at a time to a new database server in GCP, which is slow and gradual. Or it could move a whole shard, which contains a number of buckets, using a replication strategy - where a replica is created in GCP and, when appropriate, promoted to leader. That's a much faster approach.
Fitbit also used Cloudflare Workers to provide the logic that decides where users' traffic should be routed. Lewis explained:
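The two ways of moving users can be pictured with a small Python model of a shard map. Everything here is an assumption made for illustration: the talk does not say how users are assigned to buckets, so the CRC32 hash, the bucket count and the `ShardMap` class with its method names are all hypothetical.

```python
import zlib
from dataclasses import dataclass
from typing import Dict

NUM_BUCKETS = 8  # hypothetical; the real bucket count is not given


@dataclass
class ShardMap:
    bucket_to_shard: Dict[int, str]  # bucket number -> shard name
    shard_env: Dict[str, str]        # shard name -> "legacy" or "gcp"

    def bucket_for(self, user_id: str) -> int:
        # Assumed hashing scheme: a stable hash of the user ID.
        return zlib.crc32(user_id.encode("utf-8")) % NUM_BUCKETS

    def env_for_bucket(self, bucket: int) -> str:
        return self.shard_env[self.bucket_to_shard[bucket]]

    def env_for_user(self, user_id: str) -> str:
        return self.env_for_bucket(self.bucket_for(user_id))

    def move_bucket(self, bucket: int, target_shard: str) -> None:
        # Slow, gradual option: reassign one bucket to a shard in GCP.
        self.bucket_to_shard[bucket] = target_shard

    def promote_replica(self, shard: str) -> None:
        # Fast option: the GCP replica of a whole shard becomes the leader,
        # so every bucket on that shard changes environment at once.
        self.shard_env[shard] = "gcp"


shard_map = ShardMap(
    bucket_to_shard={b: ("shard-a" if b < 4 else "shard-b") for b in range(NUM_BUCKETS)},
    shard_env={"shard-a": "legacy", "shard-b": "legacy", "shard-gcp": "gcp"},
)
shard_map.move_bucket(0, "shard-gcp")  # one bucket moves
shard_map.promote_replica("shard-b")   # buckets 4 to 7 move together
```

After these two calls, bucket 0 and buckets 4 to 7 resolve to GCP, while buckets 1 to 3 still resolve to the legacy data centre.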
So regardless of where your data is resident, we could say ‘oh this user should always go to GCP no matter what’. The logic is simply, is the user in the whitelist? If so, send them to GCP. If not check where their bucket lives. Is the bucket in GCP? If yes, then route to GCP. If not, or there's some kind of error, or no user information, send them to the legacy data centre. Eventually once we had more than 50% movement into GCP, we made the default GCP.
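The decision Lewis describes can be written out as a short function. The production logic ran in Cloudflare Workers; the Python below is only a sketch of the decision order, and the names `route`, `always_gcp` and `bucket_env` are hypothetical. How a bucket resident in the legacy data centre was treated once the default flipped to GCP is not stated, so the sketch assumes a known bucket location always wins over the default.

```python
from typing import Callable, Optional, Set

LEGACY = "legacy"
GCP = "gcp"


def route(user_id: Optional[str],
          always_gcp: Set[str],
          bucket_env: Callable[[str], str],
          default: str = LEGACY) -> str:
    """Pick the environment for a request, falling back to `default`."""
    if user_id is None:
        # No user information: use the default environment.
        return default
    if user_id in always_gcp:
        # Listed users (the "whitelist") go to GCP regardless of data location.
        return GCP
    try:
        env = bucket_env(user_id)  # where does this user's bucket live?
    except Exception:
        # Any lookup error: use the default environment.
        return default
    # Assumption: a known bucket location overrides the default.
    return env if env in (GCP, LEGACY) else default
```

Calling `route(None, set(), lookup)` returns the legacy environment until `default=GCP` is passed, which mirrors the switch Fitbit made once more than half of the movement into GCP was done.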
Fitbit ultimately found success with its approach, completing the migration three weeks early. Lewis said that it went "better than any of us ever expected" and the routing of employees first proved critical. He explained:
The routing of employees before anyone else uncovered many critical bugs. It uncovered around 60 bugs and we closed 58 of them before we moved a paying customer. And I think those other 2 bugs got closed, they didn't end up being that critical. Anything that was really blocking an experience in GCP was closed - we did that live, we tested it in production, without affecting our paying customers. We also finished early. We had a pretty hard deadline to be out at a certain time. Based on our SLOs we were able to see that we were doing well, everything was under control and we could really push this migration faster.