The term "data sharing" has, until recently, referred to scientific and academic institutions sharing data from scholarly research.
The brokering or selling of information is an established industry and doesn't fit this definition of "sharing," but the term is increasingly being applied to it. Scholarly data sharing is mostly free of controversy, but all other forms of so-called sharing present some concerns.
Information Resources (IRI), Nielsen and Catalina Marketing have been in the business of collecting data and selling data and applications for decades, but the explosion of computing power, giant network pipelines, cloud storage and, lately, AI has created fertile ground for literally thousands of data brokers, mostly unregulated and presently a challenge to privacy and fairness:
Currently, data brokers are required by federal law to maintain the privacy of a person's data if it is used for credit, employment, insurance or housing. Unfortunately, this is clearly not scrupulously enforced, and beyond those four categories, there are no regulations (in the US). And while medical privacy laws prohibit doctors from sharing patient information, medical information that data brokers get elsewhere, such as from the purchase of over-the-counter drugs and other health care items, is fair game.
You might assume that your medical records are private and used only for the purposes of your healthcare, but as Adam Tanner writes in "How Data Brokers Make Money Off Your Medical Records":
IMS and other data brokers are not restricted by medical privacy rules in the U.S., because their records are designed to be anonymous, containing only year of birth, gender, partial zip code and doctor's name. The Health Insurance Portability and Accountability Act (HIPAA) of 1996, for instance, governs only the transfer of medical information that is tied directly to an individual's identity.
It is a simple process for skilled data miners to combine anonymized and non-anonymized data sources to re-identify people from what are supposed to be protected medical records:
One small step toward reestablishing trust in the confidentiality of medical information is to give individuals the chance to forbid collection of their information for commercial use, an option the Framingham study now offers its participants, as does the state of Rhode Island in its sharing of anonymized insurance claims. "I personally believe that at the end of the day, individuals own their data," says Pfizer's Berger [Marc Berger oversees the analysis of anonymized patient data at Pfizer]. "If somebody is using [their] data, they should know." And if the collection is "only for commercial purposes, I think patients should have the ability to opt out."
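To make the re-identification point concrete, here is a minimal sketch of a linkage attack in Python. It joins "anonymized" records to an identified list on the same fields the Tanner quote mentions (year of birth, gender, partial zip code and doctor's name). All of the names, records and the `reidentify` helper are made up for illustration; real data miners work with far larger and messier sources, and this is not a description of any particular broker's method.

```python
from collections import defaultdict

# Hypothetical "anonymized" medical records: no names, only quasi-identifiers.
anonymized_records = [
    {"birth_year": 1961, "gender": "F", "zip3": "021", "doctor": "Dr. Adams", "diagnosis": "diabetes"},
    {"birth_year": 1975, "gender": "M", "zip3": "021", "doctor": "Dr. Adams", "diagnosis": "insomnia"},
    {"birth_year": 1975, "gender": "M", "zip3": "100", "doctor": "Dr. Baker", "diagnosis": "anxiety"},
]

# Hypothetical identified source (say, a marketing list) that carries names.
identified_people = [
    {"name": "Alice Example", "birth_year": 1961, "gender": "F", "zip": "02139", "doctor": "Dr. Adams"},
    {"name": "Bob Example", "birth_year": 1975, "gender": "M", "zip": "02144", "doctor": "Dr. Adams"},
    {"name": "Carl Example", "birth_year": 1975, "gender": "M", "zip": "10027", "doctor": "Dr. Baker"},
    {"name": "Dan Example", "birth_year": 1975, "gender": "M", "zip": "10011", "doctor": "Dr. Baker"},
]


def reidentify(records, people):
    """Link each anonymized record to a name when exactly one person matches."""
    # Index the identified list by the quasi-identifier tuple.
    candidates = defaultdict(list)
    for person in people:
        key = (person["birth_year"], person["gender"], person["zip"][:3], person["doctor"])
        candidates[key].append(person["name"])

    matches = {}
    for record in records:
        key = (record["birth_year"], record["gender"], record["zip3"], record["doctor"])
        names = candidates.get(key, [])
        if len(names) == 1:  # a unique match means the record is re-identified
            matches[names[0]] = record["diagnosis"]
    return matches


reidentified = reidentify(anonymized_records, identified_people)
```

In this toy data, two of the three records link to exactly one person. The third stays ambiguous only because two people happen to share the same quasi-identifiers, which is the thin protection that "anonymous" records rely on.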
There are also legitimate data markets that gather and curate data responsibly. Most notable lately is Snowflake, which I'll cover below. Others are Datamarket.com, which is now part of QLIK, and InfoChimps.com.
One I can't get my arms around is Acxiom. They are a $1B business that collects all sorts of information about people in 144 million households. Apparently their business is creating profiles so advertisers can target you more accurately. That seems innocent enough, but I don't know if that's the whole story. However, about five years ago, Acxiom launched https://aboutthedata.com/portal which allows you to see what data they have about you.
Even more remarkable, you can correct mistakes and you can opt out. According to Acxiom, though, if you do opt out, you can expect to get a lot of ads you're not interested in. Keep in mind, though, that this business is still unregulated, so it would take an investigative reporter to validate these claims.
Then there is this: Acxiom, a huge ad data broker, comes out in favor of Apple CEO Tim Cook's quest to bring GDPR-like regulation to the United States:
In the statement, Acxiom said that it is "actively participating in discussions with US lawmakers" on consumer transparency, which it claims to have been voluntarily providing "for years." Still, the company denied that it partakes in the unchecked "shadow economy" which Cook made reference to in his op-ed.
Update: Acxiom now directs their aboutthedata.com portal to a privacy page which explains:
Acxiom and our former subsidiary, LiveRamp, Inc. separated in September 2018. When Acxiom and LiveRamp separated, ownership of AboutTheData.com was transferred to LiveRamp. Acxiom data is no longer available through AboutTheData.com.
They include details on their plans to build a new "customer access portal."
The good - let's start with data.gov
Data.gov is a U.S. government website launched in late May 2009 by the then Federal Chief Information Officer (CIO) of the United States, Vivek Kundra. Data.gov aims to improve public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government. The site is a repository for federal, state, local, and tribal government information, made available to the public. Data.gov has grown from 47 datasets at launch to over 180,000 (actually now over 250,000).
This chart gives a sense of the vastness and variety of free, open and curated data on data.gov:
Don't confuse this with:
The Open Data Initiative (ODI) is a joint effort to securely combine data from Adobe, Microsoft, SAP, and other third-party systems in a customer's data lake. It is based on three guiding principles:
- Every organization owns and maintains complete, direct control of all their data
- Customers can use AI to get insights from unified behavioral and operational data
- Partners can easily leverage an open and extensible data model to extend solutions
ODI is an ambitious effort with admirable goals, but it is not the subject of this article.
Epsilon (acquired in April 2019 for $4.4B) refused to give a congressional committee all the information it requested, saying: "We also have to protect our business, and cannot release proprietary competitive information." [...] information on people who are believed to have medical conditions such as anxiety, depression, diabetes, high blood pressure, insomnia, and osteoporosis.
Sprint, T-Mobile, and AT&T said they were taking steps to crack down on the "misuse" of customer location data after an investigation this week found how easy it was for third parties to track the locations of customers. (Misuse? They SOLD the data).
Experian sold Social Security numbers to an identity theft service posing as a private investigator.
Optum. The company, owned by the massive UnitedHealth Group, has collected the medical diagnoses, tests, prescriptions, costs and socioeconomic data of 150 million Americans going back to 1993, according to its marketing materials. Since most of this is covered by HIPAA, they are very clever in getting around the regulations. But that socioeconomic thing is a real red flag.
What it means, at the very minimum, is the use of the "Social Determinants": income and social status, employment, childhood experiences, gender, genetic endowment. That's just the start. You have to ask yourself, why would anyone want to use this information? Life insurance, car insurance, mortgage, education, adoption, personal liability insurance, health insurance, renting, employment…there is no end to it and you will never know what's in there.
The World Privacy Forum found a list of rape victims for sale at one data broker. The group also found brokers selling lists of AIDS patients, the home addresses of police officers, a mailing list for domestic violence shelters (which are typically kept secret by law) and a list of people with addictive behaviors towards drugs and alcohol.
Snowflake's Data Sharing
Snowflake is a cloud-native data warehouse offering. Their secret sauce is the separation of data from logic. Take Amazon as an example (Snowflake also runs, or shortly will, on Google Cloud and Microsoft Azure). Your data resides in S3, where costs are asymptotically approaching zero, and you basically pay only for processing on EC2. Everything works as a "virtual data warehouse," meaning you create abstractions over the data and nothing moves or is copied. You can have virtually thousands of data warehouses with one copy of the data.
I don't know this for sure, but I suspect Snowflake, despite their success, saw the need to create some other technology, as data warehouses are a limited market. What they came up with was using their existing technology to provide a mechanism for data providers to locate their data in a Snowflake region and allow others to "rent" data without copying or downloading it. Besides the obvious productivity gains and cost savings, Snowflake added features to their data sharing product, including some level of curation and verification of the data. I get the impression this is still a work in progress.
And, because all access to data is through (virtual) data warehouse views, integration of data sources, reference data and a level of semantic coherence - all qualities of a data warehouse - are there. In contrast to a bucket of bits you can download and wrangle later, this seems like a good idea to me.
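As a rough illustration of the "one copy, many views" idea, here is a toy model in Python. It is emphatically not Snowflake's API or implementation; `SharedTable` and `VirtualView` are hypothetical names I made up to show how consumers can query through views that hold a reference to the provider's data rather than a copy of it.

```python
class SharedTable:
    """The single physical copy of the data, owned by the provider."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per row


class VirtualView:
    """A consumer-side view: a reference to the table plus a query, no data."""

    def __init__(self, table, columns, predicate=lambda row: True):
        self._table = table          # a reference, not a copy
        self._columns = columns      # columns exposed to the consumer
        self._predicate = predicate  # row filter applied at query time

    def query(self):
        # Rows are read from the provider's table only when the view is queried.
        return [
            {col: row[col] for col in self._columns}
            for row in self._table.rows
            if self._predicate(row)
        ]


# The provider publishes one table.
sales = SharedTable([
    {"region": "east", "sku": "A1", "units": 10},
    {"region": "west", "sku": "B2", "units": 4},
])

# Two consumers define their own views over the same copy.
east_view = VirtualView(sales, ["sku", "units"], lambda r: r["region"] == "east")
all_units_view = VirtualView(sales, ["units"])

# The provider adds a row; nothing is re-exported or downloaded.
sales.rows.append({"region": "east", "sku": "C3", "units": 7})

east_rows = east_view.query()
total_units = sum(row["units"] for row in all_units_view.query())
```

Both views see the new row immediately because neither ever held its own copy, which is the property that makes "renting" data without downloading it plausible.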
I asked Justin Langseth, Snowflake's CTO, if he was concerned about criminal, civil or even ethical exposure to Snowflake from the data provided. His e-response was:
Legally no we're just the communication platform, the provider of the data is responsible for their data... but we are looking at some tools that can detect hidden bias in models and data though, so it is an area of interest. Should this be enough of a reason to not have people share data? There's tons of social good that can come from this as well.
The problem with his response is two-fold. First, not just the data but any calculations and modeling a customer does take place within Snowflake. Second, legal responsibility is an abstract term. You may not be legally responsible, but you may still be charged or sued and have to defend yourself, with an uncertain outcome.
Beyond all of these issues, I wonder how many companies have data someone else would want to buy. If you dig into data lakes, the volume comes from things like log files, which would be useless without context, and imported data, which may not be resalable anyway. Between data.gov and Google and Facebook et al, is there really a market for this? I'm also thinking about edge data; how would you package that, because the trend is not to bring it back to the cloud (though I still don't understand how you do machine learning at the edge).
Langseth also recently posted an article on Medium. The article covers the "hardest" issues data marketplaces will face:
- Faked and Doctored data
- Sales of Stolen, Confidential, and Insider Data
- Piracy by Buyers of Data
- Big Data can be really Big
- Data is Fast
- Data Quality can be Questionable
- Lack of Metadata Standards
And in conclusion, he asks: So how do IOTA, SingularityNET, and Datum address these issues?
Mostly they don't, at least so far. Most of the projects working on decentralized data marketplaces have simply not hit these issues yet as they are just in a test mode on a test network. To the extent they have thought about the trust-oriented issues, most of them propose either a reputation system or a centralized validation authority. Reputation systems for data marketplace are highly prone to Sybil attacks (large #'s of fake accounts colluding), and if you need a centralized authority forever you're defeating the purpose of a decentralized crypto system and may as well do everything the old way.
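The Sybil problem Langseth mentions is easy to see with a little arithmetic. The sketch below assumes the most naive reputation system possible, a plain average of star ratings with one vote per account; the account names and numbers are invented, and real marketplaces may weight ratings in ways this does not capture.

```python
def reputation(ratings):
    """Naive reputation score: the plain average of all star ratings (1-5)."""
    return sum(r["stars"] for r in ratings) / len(ratings)


# Twenty genuine buyers rate a seller of doctored data one star each.
honest_ratings = [{"account": f"buyer{i}", "stars": 1} for i in range(20)]

# The seller creates two hundred fake accounts that each leave five stars.
sybil_ratings = [{"account": f"sybil{i}", "stars": 5} for i in range(200)]

score_before = reputation(honest_ratings)
score_after = reputation(honest_ratings + sybil_ratings)
```

With these numbers the score moves from 1.0 to (20 × 1 + 200 × 5) / 220, or roughly 4.6, even though no honest buyer changed their mind. As long as accounts are cheap to create, the honest ratings are simply outvoted.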
The battle for privacy is already lost. Once data is out, it's gone. Stemming the flow of current data could eventually dilute the value of the data brokers, but that requires regulation, which is unlikely in the USA. To rein in data brokers, who exist in the shadows, unlike a polluting coal-fired power plant, will require digital enforcement and for-good trolls sniffing out the bad guys. The only question is, who will pay for the development and operation?
Updated, July 23, 2019, with clarifications on a couple of URLs cited in the original piece. Microsoft Azure DataMarket is defunct and was removed from the piece.