Delivery Hero turns to Google Cloud BigQuery to improve data accessibility and sharing
Online food delivery network Delivery Hero wants to improve data access between its different business units, with the aim of advancing its machine learning capabilities.
Founded in 2011, Europe-based Delivery Hero is the largest global food network outside of China, offering services in over 700 cities and taking up to 5 million orders every day. An operation of this size collects petabytes of useful data, which Delivery Hero is striving to make more accessible and usable with Google Cloud BigQuery.
Speaking this week at Google Cloud's 'Born-Digital Summit', Mathias Nitzsche, VP of Engineering for the Global Data team at Delivery Hero, explained how the company was facing challenges in making data discoverable. When trying to solve a problem or execute on a new idea, a person spent 80% of their time trying to find out who owned the data, how they could get access, what approvals were needed, and whether it was safe to use.
The aim of the Global Data team is to make Delivery Hero's data globally and universally accessible and useful. It began doing this by creating a Data Streams system, essentially a real-time messaging platform that allows all of Delivery Hero's different entities to exchange key information in near-real time. Data Streams is built on AWS, and Nitzsche says that it has been critical to connecting all of the company's complex application environments together.
More recently, the company has embarked on creating what it calls Data Hub, which follows the 'data mesh philosophy' and aims to provide a common infrastructure used by all data owners. These data owners keep responsibility and accountability for their data, which is then shared in this common environment. Nitzsche says:
Data Streams was completely built on AWS and we started that two and a half years ago. But when we came to the point when we were thinking about Data Hub, this data mesh, we didn't see many alternatives to Google Cloud's BigQuery. BigQuery provides a lot of the features that we expect from such an environment, along with the other tools around it that are supported by Google. But we are happy with this multi-cloud decision, it gives us the best of both worlds.
Delivery Hero's Data Hub is essentially a new infrastructure, with tooling and governance, to make all of the company's data accessible - focusing on interoperability, usability, security, scalability, quality and efficiency. Nitzsche explains how Delivery Hero wanted to move away from its previous complex data architecture. He says:
In the past, our environment was a lot of different data warehouses - once we counted them and there were like 15 different data warehouses, and a lot of intermediate solutions between them, importing data from all different places, all sharing a lot of data with each other.
What Data Hub is giving us is one central infrastructure, where every business unit is isolated from each other, and is still fully owning and responsible for the data, but we're able to share and join data across these units.
Simply put, each business unit has its own Google Cloud project, in which it fully owns its data. Each unit uses Airflow to import the data it needs from different sources, making use of other tools such as Dataproc, and pushes that data into the central BigQuery infrastructure.
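As a rough illustration of what "share and join data across these units" can look like in practice, the sketch below builds a BigQuery standard SQL query that joins tables living in two separate Google Cloud projects. It relies only on BigQuery's fully qualified `project.dataset.table` naming. The project, dataset, table, and column names are hypothetical and do not come from Delivery Hero; the helper names `TableRef` and `cross_unit_join` are likewise invented for this example, and the code only assembles the query text rather than running it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TableRef:
    """A BigQuery table identified by project, dataset, and table name."""
    project: str
    dataset: str
    table: str

    def qualified(self) -> str:
        # BigQuery standard SQL accepts `project.dataset.table` in backticks.
        return f"`{self.project}.{self.dataset}.{self.table}`"


def cross_unit_join(left: TableRef, right: TableRef,
                    key: str, columns: Sequence[str]) -> str:
    """Build a query joining tables owned by two business units.

    `columns` must use the aliases l (left table) and r (right table).
    """
    select_list = ", ".join(columns)
    return (
        f"SELECT {select_list}\n"
        f"FROM {left.qualified()} AS l\n"
        f"JOIN {right.qualified()} AS r\n"
        f"  ON l.{key} = r.{key}"
    )


# Hypothetical names: one Google Cloud project per business unit.
orders = TableRef("bu-logistics", "curated", "orders")
vendors = TableRef("bu-marketplace", "curated", "vendors")

query = cross_unit_join(
    orders, vendors, key="vendor_id",
    columns=["l.order_id", "r.vendor_name"],
)
print(query)
```

Whether such a query succeeds would depend on the access each business unit has granted on its datasets, which is where ownership stays with the data owner in the setup Nitzsche describes.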
It's still fairly early days for Delivery Hero, but it already has 11 business units on the Data Hub, which includes 174 data sets, over 5,000 tables and 2.7 petabytes of accessible data, against which it is running 7 million queries a month. Nitzsche says:
This may seem hard to believe, but the interesting thing is we do not even have half of the business units we want on this solution. And the business units we have are still adding more and more of their tables and exposing more of their data. So we believe it's still the very beginning of this solution and that can easily grow 10 times that size.
Nitzsche says that prior to the Data Hub, Delivery Hero had different BI teams importing whole data warehouses from each other, which created a "spider's web of big data dumps" happening each day. This pretty much stopped once the new common infrastructure was put in place, and the different business units can now access each other's data more easily. He adds:
Also for the machine learning teams, the data scientists, they now have much bigger amounts of data at their fingertips, and they can just play with it and find insights, which would have been very hard to find just a year ago.
The end goal of this, beyond the Data Hub, is to advance Delivery Hero's machine learning capabilities. Nitzsche says:
We are very early with this solution; we're still onboarding business units, and adding the data is ongoing. We haven't even started with the machine learning platform, which will really leverage the power. So right now we are just solving and making the data accessible. And then, we will be unleashing a lot of solutions for the future.
Key to the project's success has been the way the data team has worked in the open within Delivery Hero. Nitzsche says that this approach is important for gaining trust. He adds:
I think it's fair to say that data is power - who has the data, who controls the data can do a lot with it. And that makes the changing of data and handing over control of your infrastructure to somebody else a political topic. That's very hard to overcome without strong decision making.
I think what was most successful for us was not running the Data Hub project as a solution one team creates and provides to the others, but more as a maintainer of an open source project. So everything we do is publicly shared - that starts with the code, where everybody can see all code, but that continues with every meeting, every bigger change.