Can data.world become the GitHub of enterprise data sharing?

By Jerry Bowles, June 6, 2017
Summary:
The world is awash in data, but most of it is buried away in hard-to-find or inaccessible places (researchers’ hard drives, private servers or obscure web sites) or is in formats that are difficult to work with. data.world aims to change that, one dataset at a time.

The Census Bureau has been counting Americans every decade since 1790, but it only began collecting the socio-economic, housing and demographic data much beloved by government agencies, nonprofits, universities, research groups, and journalists in 2005, when it introduced the annual American Community Survey (ACS).

The ACS is the largest and most up-to-date annual survey performed by the Census Bureau detailing information about the American people and housing units. It affects $400 billion in annual spending and influences local officials, community leaders, and businesses who rely on the data to understand the changes taking place in their communities.

The ACS is known for its accuracy and thoroughness but has historically been very difficult to use, even for data scientists, largely because of its tabular structure and lack of metadata. Jonathan Ortiz, now a data scientist at data.world, an Austin-based startup whose stated ambition is to build “the most meaningful, collaborative, and abundant data resource in the world” to leverage data's “societal problem-solving utility,” explains:

What comes to you in the microdata survey file … is essentially just: one piece is the CSV, which has coded values throughout, and you constantly have to refer back and forth to the data dictionary.  And while the data dictionary is a human-readable document, it’s not computer-readable at all.
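To make the back-and-forth Ortiz describes concrete, here is a minimal Python sketch of what a machine-readable data dictionary allows. The column names, codes and labels are hypothetical placeholders chosen for illustration; they are not taken from the actual ACS files or from data.world's tooling.

```python
import csv
import io

# Hypothetical data dictionary: column name -> {coded value -> readable label}.
# These columns and codes are illustrative assumptions, not the real ACS codebook.
DATA_DICTIONARY = {
    "SEX": {"1": "Male", "2": "Female"},
    "TENURE": {"1": "Owned", "2": "Rented"},
}

# Hypothetical microdata extract in which categories appear only as codes.
RAW_CSV = "ID,SEX,TENURE,AGE\n1,1,2,34\n2,2,1,61\n"


def decode_rows(csv_text, dictionary):
    """Replace coded values with labels wherever the dictionary has an entry."""
    reader = csv.DictReader(io.StringIO(csv_text))
    decoded = []
    for row in reader:
        decoded.append({
            # Columns or values missing from the dictionary pass through unchanged.
            column: dictionary.get(column, {}).get(value, value)
            for column, value in row.items()
        })
    return decoded


rows = decode_rows(RAW_CSV, DATA_DICTIONARY)
for row in rows:
    print(row)
```

Once the dictionary is itself data rather than a document, the lookup that a person would otherwise do by hand becomes a single pass over the file.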

The Census Bureau teamed up with data.world more than a year ago to find ways to make its public data more accessible and easier to use. Ortiz, still a graduate student and NSF fellow at the time, began linking ACS datasets together using semantic web technology and adding metadata about the concepts within the datasets, making them easier for people and machines to work with and to combine with other datasets.
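As a rough illustration of the idea behind that linking, semantic web data is usually expressed as subject-predicate-object statements ("triples"). The sketch below uses plain Python tuples rather than a real RDF library, and every identifier in it (the example.org URIs, the column names, the "represents" predicate) is a hypothetical assumption, not the vocabulary the Census Bureau or data.world actually uses.

```python
# Each statement is a (subject, predicate, object) triple.
# All URIs below are hypothetical placeholders, not real ACS or data.world identifiers.
EX = "http://example.org/"
REPRESENTS = EX + "represents"

triples = [
    (EX + "survey/col/ST", REPRESENTS, EX + "concept/State"),
    (EX + "health/col/state_code", REPRESENTS, EX + "concept/State"),
    (EX + "survey/col/AGEP", REPRESENTS, EX + "concept/Age"),
]


def columns_sharing_concepts(statements):
    """Group columns by the concept they represent; keep concepts with 2+ columns."""
    by_concept = {}
    for subject, predicate, obj in statements:
        if predicate == REPRESENTS:
            by_concept.setdefault(obj, []).append(subject)
    return {concept: cols for concept, cols in by_concept.items() if len(cols) > 1}


linked = columns_sharing_concepts(triples)
for concept, cols in linked.items():
    print(concept, "is shared by", cols)
```

Columns from two unrelated files that point to the same concept become candidates for combining the datasets, which is the practical payoff of adding metadata about concepts.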

Adding that extra code made the survey bigger and more unwieldy, so Amazon Web Services (AWS) offered to make the data available for analysis in the cloud. Now anyone can access and analyze the ACS Public Use Microdata Sample (PUMS) in the cloud without needing to download and store their own copy.

Building a social network for data nerds

Data.world may be one of Austin’s hottest startups these days, having raised nearly $32.7 million in two rounds of funding to build its data sharing and collaboration platform in the expectation that it will become the go-to network site for the open data movement. CEO Brett Hurt, a serial investor and entrepreneur, believes that making data more accessible will solve real problems for society.

We started to imagine what the world would be like in this age where processing and storage costs have come down to a point where we can think ambitiously about what if all the world's most important data was connected and linked together and how that would lead to breakthroughs in areas like solving climate change and cancer and poverty alleviation programs and all of these things that are so important to advancing the world.

To reinforce that commitment, data.world is organized as a “public benefit” or B Corporation, meaning that its governing documents mandate it be accountable for creating value for society, rather than just value for shareholders. Through data philanthropy, Hurt says, data.world hopes to help companies see the benefits of sharing their data.

As the company’s co-founder and chief product officer Jon Loyens puts it:

Data.world seeks to increase data collaboration to accelerate problem solving. It’s a social platform that helps people who work with data discover, prepare and share datasets. By linking datasets together using semantic web technology, data.world identifies and adds information about the concepts within the datasets, which makes them easier for people and machines to work with.

My take

data.world hopes to do for the open data movement what the open source movement did for developers, or what GitHub, which now has 15 million users working together, did for sharing source code that solves a myriad of problems.

One of the company's defining principles is:

Datasets are social objects, and open data platforms should reflect this. Social networking can accelerate and enhance data science by improving decision-making, knowledge transfer, problem-solving, and much more. We’ve built a community where conversation adds context to data and deep collaboration is the norm.

Community members are incentivized to share data through the method many social networks use: the recognition that comes from sharing data and seeing it used in research published in journals, websites and news publications. Members can also keep data private on the site if they wish.

“Data collaboration within and across the enterprise will drive the next era of business,” Pat Ryan Jr., chairman of Chicago Ventures and data.world investor, said in a news release. “By making it easy to use and share data in both public and private configurations, data.world is fueling the next wave of innovation and growth.”

Whether Ryan is right depends on how willing database managers, companies, governments and researchers are to share their datasets.

This will be an interesting one to follow. Companies that already believe FANG (Facebook, Amazon, Netflix, Google) represent the competitive business models of the future are actively considering how they should change or adapt. We have already heard of instances where companies that would normally compete, but which are effectively in adjacent markets, recognize the need to collaborate where doing so could lead to new services.

Right now, that collaboration is mostly concerned with topics like product and service design, but from there it is a small step to sharing elements of operational data as a way of discovering patterns of commonality that can be leveraged in the future. data.world positions itself as the Switzerland of publicly held datasets, which could well serve as the place for those collaborations to flourish.

Taking that one step further, we foresee a time when publicly shared datasets act as a way to inform entirely new businesses that cannot be imagined today. Now that would truly be radical.

Endnote: Europe/Asia - your move.