Strava leaves AWS Redshift in the dirt in its race to Snowflake

By Jessica Twentyman, January 24, 2019
Summary:
Last year, the fitness app company migrated a 120TB data warehouse in its bid to bring new product experiences to runners and cyclists

If getting fit is one of your goals for 2019, then it’s likely you’ve already got a fitness app on your smartphone.

And if that app is Strava, then you’re in good company: according to the company’s 2018 Year in Sport report, released back in November, the app’s 36 million users logged more than 624 million activities in the preceding 12 months and covered more than 6.67 billion miles, on foot and by bike.

All the data relating to those runs and bike rides ends up in a 120 TB data warehouse, which Strava recently shifted to Snowflake, after running into concurrency issues with its previous Redshift warehousing service from Amazon Web Services (AWS). According to Cathy Tanimura, Strava’s senior director of analytics and data science:

We were using Redshift, which I happen to think is a great solution for getting up and running, for getting started. And we have a lot of other infrastructure in AWS so it was kind of easy from that perspective, but the concurrency issue was a problem.

With Redshift, you have a certain number of connections to the service - connections for loading the data and connections for querying it. And so when you have a lot of jobs that are writing data, then they use up those connections and there are too few connections left over for consumers of the data to query. You end up with something like a traffic jam: people can’t query data, or someone writes a long-running query, and then the loading of data backs up. It can be quite a management challenge.

The result was query times that were far too long, forcing analysts to schedule queries for their lunch hours or even overnight - but on top of that, Tanimura adds, there was also a scalability issue:

We were just growing, growing, growing the amount of data that we had and the price-point you end up in in the Redshift world in order to support our kinds of needs… well, it became a bit eye-watering.

Concurrency conundrum

The concurrency issues that customers can experience with Redshift are a problem that Diginomica has reported on before, in my story last year on how they prompted a shift from Redshift to Snowflake at food delivery service Deliveroo.

Snowflake tackles this issue by separating compute from storage and spinning up independent compute clusters to host ‘virtual data warehouses’ for specific workloads. These virtual data warehouses can be instantly resized according to need or paused entirely, allowing concurrent workloads to run without impacting each other.
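To make the contrast concrete, here is a minimal toy model in Python. It is not Snowflake's or Redshift's API or scheduler: the `ComputePool` class, its slot counts and the job numbers are all hypothetical, chosen only to illustrate the 'traffic jam' Tanimura describes and why isolated, independently resizable compute avoids it.

```python
from dataclasses import dataclass


@dataclass
class ComputePool:
    """Toy stand-in for a cluster with a fixed number of concurrent slots.

    Hypothetical model for illustration only; not a real warehouse API.
    """
    name: str
    slots: int
    running: int = 0
    queued: int = 0

    def submit(self, jobs: int) -> None:
        """Start as many jobs as free slots allow and queue the rest."""
        free = max(0, self.slots - self.running)
        started = min(free, jobs)
        self.running += started
        self.queued += jobs - started

    def resize(self, slots: int) -> None:
        """Change capacity, then try to start any queued jobs."""
        self.slots = slots
        backlog, self.queued = self.queued, 0
        self.submit(backlog)


# Shared pool: load jobs arrive first and crowd out analyst queries.
shared = ComputePool("shared", slots=10)
shared.submit(8)   # load jobs
shared.submit(6)   # analyst queries: only 2 slots are left

# Separate pools: each workload has its own compute, sized on its own.
loading = ComputePool("loading", slots=8)
analytics = ComputePool("analytics", slots=4)
loading.submit(8)
analytics.submit(6)   # 2 queries queue, regardless of load activity
analytics.resize(6)   # scale up the analytics pool only

print(shared.queued, loading.queued, analytics.queued)
```

In the shared pool, four analyst queries are left waiting behind the load jobs. With separate pools, the analytics backlog depends only on the analytics pool's own size, and resizing it clears the queue without touching the loading side.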

Tanimura, who joined Strava from cloud identity management company Okta in February last year, hadn’t worked with Snowflake before, but says she had been pitched by the company in her previous job and had been intrigued by this separation of compute and storage. As a result, she says, she was excited by the prospect of using it in her new job at Strava.

The migration between the two systems, which took place between March and June last year, went pretty smoothly, she says. The vast majority of historical data in Redshift was loaded in bulk into Snowflake - although some effort went in ahead of time to clean up data and, for example, remove tables that were no longer used:

But of course, there was the issue that we have data loading constantly from our athletes around the world throughout the day and night - it never stops. So what we did was we ended up loading in parallel to the two platforms for a cutover period, where we were comparing the two to check our stats looked the same on each and so on. Plus we had a consultancy company help us to port our Looker front-end tools to the new warehouse so we could point our reports at it. It was a fairly big effort, but it was a team effort.
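The kind of check Tanimura describes during the cutover period can be sketched in a few lines of Python. This is an illustration only, not Strava's actual tooling: the `compare_stats` function, the metric names and the figures are all hypothetical, and in practice the two dictionaries would be filled by running the same summary queries against each warehouse.

```python
from math import isclose


def compare_stats(old, new, rel_tol=1e-9):
    """Return {metric: (old_value, new_value)} for every metric that
    differs between the two warehouses or is missing from one of them."""
    mismatches = {}
    for key in sorted(set(old) | set(new)):
        a, b = old.get(key), new.get(key)
        if a is None or b is None or not isclose(a, b, rel_tol=rel_tol):
            mismatches[key] = (a, b)
    return mismatches


# Hypothetical summary figures gathered from each platform.
redshift_stats = {
    "activities.row_count": 1000,
    "activities.total_miles": 5230.5,
    "athletes.row_count": 50,
}
snowflake_stats = {
    "activities.row_count": 1000,
    "activities.total_miles": 5230.5,
    "athletes.row_count": 49,
}

print(compare_stats(redshift_stats, snowflake_stats))
# {'athletes.row_count': (50, 49)}
```

An empty result would suggest the two loads agree on the metrics checked; any entry points to a table worth investigating before reports are switched over.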

New features, new efforts

Where that effort leaves Strava, she says, is in a much better position to use data to bring new services to its app for users - or ‘athletes’, as the company prefers to call them. These include the launch in April 2018 of a ‘relative effort’ score, which enables athletes to compare and contrast heart rate metrics from their wearable devices across different fitness activities. Data also powers features such as matched runs and rides, which enable runners and cyclists to benchmark themselves against past performance on their most common routes, and social features that enable Strava users to connect and compete with each other.

As new features are launched, data discovery offers important clues to what’s working and what’s not, says Tanimura:

If we put something new out and we see usage below what we expected, then we can go away and think about whether we’re making users aware of this feature, do they need more help using it and so on. It’s about really understanding athletes and what they need and building a great product for them, as well as turning the data that they’re uploading into formats that are visible and engaging.

These product experiences will hopefully inspire customers to greater efforts on the fitness front. They might even help some of us stick to those all-too-slippery New Year’s resolutions.