As business leaders worldwide continue to grapple with make-or-break decisions to protect lives and livelihoods in the face of the COVID-19 pandemic, data-led responses have become a top priority in boardrooms.
But for many executives, questions remain not just around the best sources of reliable third-party data, but also around how to bring these together for analysis with internal business data relating to their own workers, their own facilities and their own supply chains.
To help them navigate this challenge, data engineering specialist Starschema has created a massive dataset of curated COVID-19 incidence data, hosted on the Snowflake Data Exchange from cloud data warehouse provider Snowflake.
Data is ingested from multiple sources and made ‘analytics-ready’ along the way, so it can easily be accessed and used by organizations where executives face difficult short-term decisions around closures and layoffs, for example. And, as the pandemic unfolds, Starschema CTO Tamas Foldi says he’s hopeful it will help them better adjust over the longer term to a vastly different economic and operational environment. As he explains:
I believe data analytics experts can play a decisive role in the battle against COVID-19 and we saw a chance here to use our knowledge and expertise to play our part in driving better public health outcomes and business continuity. Our clients were telling us that they had a real need for epidemiological, demographic and other data, but that this wasn’t available in one place and in formats that made it easy to combine it with their own business data.
A free public resource
As of 1 April, more than 800 requests for the Starschema COVID-19 dataset had been made by a wide range of private and public-sector organizations. The dataset is publicly available, free of charge, and is based on information from sources that have quickly become recognized as some of the most reliable around. They include Johns Hopkins University, the World Health Organization and the Henry J Kaiser Family Foundation, along with state and federal government organizations, healthcare informatics companies and geographic street mapping specialists.
Before it’s added to the dataset, all data is first cleansed and formatted by Starschema’s data engineers and data scientists, who then maintain, update and add to it on a day-to-day basis, says Foldi. That data preparation job, in itself, can be quite a challenge:
First you have to identify relevant data, and then, you’ll often find it’s only available in an unstructured way on a website or even in PDF documents. And that’s understandable, because a lot of these datasets are built by government organizations where the priority is to give the information to a mass audience in an easy-to-read way. But it’s not necessarily computer-readable when we first get it, so we have to use algorithms we’ve built to extract information and create structured tables from it all, and enhance it where necessary, so it can blend with your company’s existing data sources. It’s a lot of work.
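The kind of extraction work Foldi describes can be sketched in miniature: turning a human-readable situation report into structured rows that can be joined with other data. The report text, field names and regular expression below are invented for illustration; Starschema's actual extraction algorithms are not public.

```python
import re

# Hypothetical snippet in the style of a government situation report:
# easy for people to read, but not yet structured for analysis.
raw_report = """
Daily situation report - Region summary
Alpha County: 1,240 confirmed cases, 37 deaths
Beta County: 310 confirmed cases, 4 deaths
"""

# Pattern for one report line; named groups become column values.
ROW_PATTERN = re.compile(
    r"^(?P<region>[A-Za-z ]+): "
    r"(?P<cases>[\d,]+) confirmed cases, "
    r"(?P<deaths>[\d,]+) deaths$"
)

def parse_report(text):
    """Turn free-text report lines into structured, typed records."""
    records = []
    for line in text.strip().splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            records.append({
                "region": match.group("region"),
                "cases": int(match.group("cases").replace(",", "")),
                "deaths": int(match.group("deaths").replace(",", "")),
            })
    return records

rows = parse_report(raw_report)
# rows[0] -> {"region": "Alpha County", "cases": 1240, "deaths": 37}
```

Real sources are messier than this (PDF layouts, shifting formats, footnotes), which is why keeping such parsers correct day after day is, as Foldi says, a lot of work.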
Snowflake Data Exchange makes a lot of sense for a use case like this. Announced at Snowflake’s annual user conference in June 2019, the main thinking behind this technology is that companies that already use Snowflake might use private exchanges to host company datasets and enable their employees to ‘shop’ for the data they need to get work done, much as they would on a consumer-focused private marketplace, but in an access-controlled way.
However, the technology first emerged as a public-exchange version, where third-party information providers like Weather Source could host datasets, as Starschema is now doing with its COVID-19 dataset.
There are a number of factors behind the choice of Snowflake, according to Starschema’s Foldi. For a start, because the Snowflake Data Exchange is cloud-based, data can be loaded from multiple sources to form a single dataset, which can then be accessed from anywhere. At the same time, Snowflake is engineered to be able to ingest a wide variety of data types. Plus, without a complex infrastructure environment to set up and manage, Starschema’s engineers can focus instead on identifying and importing new data. But perhaps most importantly, Foldi says, having a single data engine makes it easy for organizations to download the dataset and mix it with their existing data assets, which he believes is where the greatest potential value lies.
Smart data combinations
Combinations of the Starschema COVID-19 data with internal data could provide major benefits to all kinds of organizations. A healthcare provider, for example, might mix the Starschema data with its own data on hospital-bed capacity to identify hospitals at risk of reaching or exceeding their limits, enabling patients to be diverted or transferred elsewhere. In similar ways, retailers could identify stores vulnerable to empty-shelf situations; charities could spot where school closures threaten to leave children who depend on school meals going hungry; and financial services providers could pinpoint default risks and offer more realistic repayment plans to customers.
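The hospital-capacity scenario above amounts to a simple join between external incidence data and internal capacity data. Everything in this sketch (region names, case counts, the assumed 5% hospitalization rate and the 90% capacity threshold) is invented for illustration and does not reflect the actual Starschema schema.

```python
# External, analytics-ready incidence data: active case counts per region.
incidence = [
    {"region": "Alpha County", "active_cases": 1240},
    {"region": "Beta County", "active_cases": 310},
]

# Internal business data: each hospital's region and total bed count.
capacity = [
    {"hospital": "Alpha General", "region": "Alpha County", "beds": 60},
    {"hospital": "Beta Memorial", "region": "Beta County", "beds": 120},
]

HOSPITALIZATION_RATE = 0.05  # assumed share of active cases needing a bed
CAPACITY_THRESHOLD = 0.9     # flag hospitals within 90% of capacity

# Join the two datasets on region, then flag at-risk hospitals.
cases_by_region = {row["region"]: row["active_cases"] for row in incidence}

at_risk = []
for hospital in capacity:
    expected_demand = (
        cases_by_region.get(hospital["region"], 0) * HOSPITALIZATION_RATE
    )
    if expected_demand > hospital["beds"] * CAPACITY_THRESHOLD:
        at_risk.append(hospital["hospital"])

# Alpha General: 1240 * 0.05 = 62 expected admissions vs a 54-bed
# threshold (60 * 0.9), so it is flagged; Beta Memorial is not.
```

In practice this join would run inside the warehouse, with the external dataset accessed via the data exchange rather than copied by hand, but the logic is the same: a shared key (here, region) lets third-party and internal data answer a question neither could alone.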
Over the longer term, the dataset could be a valuable historical resource, providing vital clues as to which strategies were successful and which were not, when companies, governments and non-profit organizations are called upon to tackle future global healthcare emergencies. Says Foldi:
I really believe our dataset will be the basis for a lot of new scientific papers and research for at least the next decade. Over time we’ll be covering more countries and more strategies within this dataset. Once the immediate chaos levels have decreased, we’ll be able to see what policies worked, what decisions proved to be poor choices, what was effective. This data isn’t just for immediate decision-making in the current emergency. It’s also to help us prepare for the next one.