Let's face it - most corporate mission statements are a little bland. But that's not the case for IHME, the Institute for Health Metrics and Evaluation.
A noble mission - and easier said than done. At M|17, Ernst, IHME's Infrastructure Team Lead, laid out the not-insignificant data challenges his nine-person admin/DBA team faces in pulling data and managing volume.
For IHME, an independent, non-profit population health research center at UW Medicine - part of the University of Washington - it's about tackling data issues to ensure that public consumption is easy and hassle-free.
Transforming public health care data into visualizations
At the heart of IHME's public work are free, interactive data visualizations. Individuals and agencies can use these downloadable charts to quantify their cause and bolster their case for policy change, funding, or fresh research. More than 1,200 researchers and collaborators across 100 countries contribute data to IHME's tools.
IHME's data visualizations chart everything from global health financing to tobacco consumption to a core project called the Global Burden of Disease (GBD), which tracks how the global increase in life expectancy is offset by war, obesity, and substance abuse. Their GBD Compare tool now tracks data from 1990 to 2015.
Using treemaps, arrow diagrams, and other charts, users can compare causes of death and risks within a country, compare countries or regions, and slice and dice by age and gender in search of patterns. You can see which causes of death and disability are having more impact, and which are waning. That led Ernst to joke during his presentation that "I see dead people," the famous line from the movie The Sixth Sense. Jokes aside, data on changing risks and life expectancies have huge policy implications.
They also require a serious data architecture, including:
- 10 petabytes of file systems
- 100 database servers, about 10 bare-metal and 90 virtualized
Bringing health care data together, and breaking silos
Readers may be wondering: Isn't there free health data all over the place? What makes IHME different? True, you can pull in chunks of free health data here and there. But as Ernst told me after his presentation, IHME is working to solve a silo problem:
Early on, the World Health Organization had stats. We had stats. Other people had stats. Largely, there wasn't a comprehensive base of all the diseases affecting everyone. That's what I think IHME has done very well, building that comprehensive look to the point where we're working hand in hand now with the World Health Organization, and we're working with ministries of health and governments around the world.
IHME works to solve this scattered data problem in two ways: better communication/partnerships, and better data integration. Their team pulls in vast amounts of data from various sources, including obscure or hard-to-mine info.
A lot of this data comes from journals. You really have to dig to find it. We do a lot of digging, and then we're able to present it back out.
So far, the results are encouraging:
The picture we're providing to the public and to the ministries of health, they're a better picture than they had prior to us being able to disseminate that data. They're able to make better decisions for their countries based on actual metrics... Over time as people saw what we did, they were getting access to information that they may not have had before.
Arming people who can influence policy with better data is a win. But the use cases are as varied as the site's visitors. Ernst told me about a criminology expert who has been using the site to identify crime rate patterns.
Facing scale problems - "The data gets larger every year"
Gathering the right data is only part of the problem. Data has a way of growing on you: "The data gets better every year. The data gets larger every year, and the scope of what we do tends to grow as well."
To manage that volume, Ernst's team uses a "best of breed" collection of open source tools, from Apache to Docker to Ubuntu to Kubernetes. But two years ago, Ernst's team ran into a problem. Their GBD data growth in gigabytes was rapidly outpacing InnoDB's ability to manage it (InnoDB is MySQL's default row-oriented storage engine).
Other issues included a lack of offline capability (IHME team members need to be able to manage pre-release visualizations on their laptops) and that dreaded performance complaint: "This query is too slow."
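The row-versus-column tradeoff behind that complaint can be sketched in a few lines of plain Python. This is a toy model, not IHME's schema: an analytical query that aggregates a single field has to walk every full record in a row store, while a column store scans one contiguous array and skips the rest.

```python
# Toy illustration of row-oriented vs. column-oriented storage.
# Field names (year, deaths, country) are made up for this sketch.

rows = [  # row store: one record per entry, all fields together
    {"year": 1990 + (i % 26), "deaths": (i * 3) % 1000, "country": i % 100}
    for i in range(10_000)
]

columns = {  # column store: one contiguous array per field
    "year": [r["year"] for r in rows],
    "deaths": [r["deaths"] for r in rows],
    "country": [r["country"] for r in rows],
}

def total_deaths_row_store(data):
    # Must touch every full record, even though only one field is needed.
    return sum(r["deaths"] for r in data)

def total_deaths_column_store(data):
    # Scans a single array; the other columns are never read.
    return sum(data["deaths"])

assert total_deaths_row_store(rows) == total_deaths_column_store(columns)
```

Both functions return the same total, but the columnar version reads a fraction of the data - which is why column stores tend to win on the aggregate-heavy queries behind visualizations like GBD Compare.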
So, early in 2015, Ernst, a self-described open source enthusiast, started kicking the tires on open source alternatives. He had been intrigued by InfiniDB, but then the company behind InfiniDB went out of business. MariaDB stepped in, eventually morphing InfiniDB into their new open source ColumnStore offering. Talks with IHME heated up:
I already knew about InfiniDB. I knew that the platform could handle a fair amount of ingest... I saw the press releases saying that MariaDB was taking up that code base and starting active development again. That got me excited - I started saying, "Well let's take a look at this," because I think if we can get the best of the latest-generation MySQL functions and features combined with that storage engine, it was probably going to be an interesting story that we could come out with on the other end.
Enter MariaDB ColumnStore
That meant signing on as a ColumnStore "alpha" customer. After talking it through, Ernst's team took the plunge. They started on InfiniDB and then, a few weeks later, moved to the first alpha of ColumnStore. As with any alpha program, IHME found plenty of issues and bugs, but those got sorted.
Basically, Ernst knew what he was getting into. They worked through the issues in the summer of 2016, and went into production on MariaDB ColumnStore in October of 2016. As for the results - how about a visual?
Pretty good result so far. Ernst and team are also using MariaDB's MaxScale for load balancing:
The GBD took about six hours of loading in InnoDB using a LOAD DATA INFILE. That same data being loaded into ColumnStore took about an hour or two, on much lower-quality hardware. In the VM environment, it was super fast compared to this dedicated hardware that we were using the old way.
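The load path Ernst describes can be sketched in SQL. Table and file names here are hypothetical; the key difference is the ENGINE clause - InnoDB stores whole rows, while ColumnStore stores each column separately and also ships a parallel bulk loader, cpimport, as a faster alternative to LOAD DATA INFILE:

```sql
-- Hypothetical table and file names, for illustration only.
-- Swapping ENGINE=InnoDB for ENGINE=ColumnStore is the core of the migration.
CREATE TABLE gbd_estimates (
    location_id INT,
    year_id     INT,
    cause_id    INT,
    deaths      DOUBLE
) ENGINE=ColumnStore;

-- The bulk-load statement quoted above:
LOAD DATA INFILE '/data/gbd_estimates.csv'
INTO TABLE gbd_estimates
FIELDS TERMINATED BY ',';
```

Same statement, very different load times - which is the six-hours-to-roughly-one-hour gap Ernst is describing.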
Just to show what the site can do, I spent some time doing my own slicing and dicing. Basically I looked for data anomalies or interesting questions to explore. This 2014 visual shows an odd concentration of neglected tropical diseases in the Midwest:
And this one looks at male global smoking trends:
The female smoking trends across age groups are geographically very different. I'm sure the reasons behind these patterns have already been uncovered, but these samples give you a glimpse into the data investigations you can conduct.
The wrap - how do you turn data into action?
It's obvious that better/accessible data can help change policy, but is that enough? Does the IHME have an obligation to go beyond releasing the data and hoping for the best? I wasn't able to stump Ernst there. The IHME has several initiatives to help push data into action. One such project is called, appropriately enough, Act on Data:
It's really about getting out there and sending information. Our director, Chris Murray, gives plenty of talks. He's done some TED Talks around what we do. He goes to different countries and presents on interesting findings within our data. We're constantly finding new and interesting things.
IHME also awards the Roux Prize - a $100,000 prize for someone who takes IHME's research further. Ernst must take his own work further as well. Ambitious goals are pending: within five years, the team is committed to adding geospatial capabilities, in five-by-five-kilometer map tiles across the globe.
Ernst's team plans to enhance the analytical tech to support that geospatial effort, and, eventually, make all of their data queryable and accessible via APIs. Hurdles remain, but when you have a mission like IHME's to rally around, I like your chances.