The world can't afford cheap data
Summary: Cheap data would be a false economy, argues Salesforce's Peter Coffee.
At this year’s annual conference of the Pacific Telecommunications Council, almost the first words spoken in the opening session concerned the growth of the “DataSphere”: the collective measure of what’s contained in data centers, what’s crossing the boundaries of wired/wireless networks, and what’s living on our point-of-use endpoints (including PCs, smartphones, and every manner of connected device). As defined by IDC, and as reported in work sponsored by Seagate, the DataSphere measure is intentionally open to the chance that a single datum may be counted multiple times – because this is not about data as static knowledge, but rather about data as an active agent of causes and effects. (Remember, as noted here in 2015: “information” was born as an action word.)
Permission is given to be overwhelmed, for a moment or two, by IDC’s projected aggregate DataSphere volume of 175 zettabytes by 2025. Not only do those endpoints imply a compound growth rate of roughly 27% per year, starting from the November 2018 estimate of a mere 33 zettabytes; the new figure also represents a 9% increase over the forecast for 2025 that IDC issued less than two years ago, so the estimated rate of data growth…is growing.
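Projections like these can be sanity-checked from their own endpoints: the implied compound annual growth rate follows directly from the standard CAGR formula. A minimal sketch, using no inputs beyond the figures quoted above:

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1

# IDC's figures as quoted above: 33 ZB in 2018, 175 ZB projected for 2025.
implied = cagr(33, 175, 2025 - 2018)
print(f"Implied growth rate: {implied:.1%} per year")  # roughly 27% per year
```

Recomputing a headline growth rate from the underlying numbers is exactly the kind of cheap arithmetic check this column argues for.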
Now that the moment of marveling at magnitude is over, it seems like a good idea to devote some brain cycles to the pressing matter of data quality – because recent weeks have brought us too many examples of life becoming, not so much data-driven, as drivel-driven, with consequences that mess with people’s lives.
Regard, for example, the unfortunate discovery that Japan’s Ministry of Health, Labour and Welfare apparently overlooked its own shift from a complete tabulation to a sampling approach for key economic data: it appears that, having collected numbers from only a third of a particular category of companies, the ministry…overlooked the crucial step of multiplying the total by three to get a nationwide estimated value. Estimated impact? Underpayment of various pensions and other such public benefits by roughly 53 billion Japanese yen (about £365 million, or $485 million) over a period of fourteen years.
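The arithmetic that was skipped is the most basic form of survey estimation: when only a fraction of a population is measured, the sampled total must be scaled by the inverse of the sampling fraction. A minimal illustration (the numbers here are hypothetical, not the ministry’s actual data):

```python
def estimate_population_total(sample_total: float, sampling_fraction: float) -> float:
    """Scale a sampled total up to a population estimate by the
    inverse of the sampling fraction (e.g. a one-third sample -> x3)."""
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling fraction must be in (0, 1]")
    return sample_total / sampling_fraction

# Hypothetical payroll totals collected from one third of the firms:
sampled = 120.0  # billions of yen, from the sampled third
print(estimate_population_total(sampled, 1 / 3))  # 360.0 -- the step that was skipped
```

Forgetting that one division is all it takes to understate a national statistic by two thirds.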
Further, there is some evidence to suggest that the error was discovered at least a year before it was revealed, and that published data in the meantime may have been manipulated to conceal the mistake. It’s already a subject of growing concern, as I noted here two years ago, that people’s lives are increasingly affected by algorithms with limited transparency and potentially biased effects. Auditing of algorithms, though, seems like a naïve, angels-on-pinheads abstraction compared to questions like the validity of the raw data, and the simple arithmetic transformations of the raw numbers that get fed to the machine.
Reality gap
What also needs attention, it appears, is the reality gap between people who collect and publish data for a living, and people who live in the world that publishers’ oversimplified data can seriously mis-describe. For example, it’s useful to make at least some attempt to estimate the physical location that’s associated with an Internet address; it’s possible, and responsible, to annotate such estimates with warnings of their uncertainty. It’s predictable, though, that some users of those location services may be casual about taking full advantage of messy things like error bars, and will simply proceed with the most likely location as if it were a nice, neat pin in a map.
That’s why a farm in the state of Kansas, unfortunately close to the geographic center of the United States, was identified for fourteen years as the physical location of any IP address that could not be pinned down more precisely than “somewhere in the USA.” Complications ensued. Apparently, it was a novel idea that database locations of this kind could usefully be adjusted—you know, by actual application of human judgment—to points like the middle of a lake, or a public square in the center of a city, rather than letting them wind up painting a target on an ordinary (and random) person’s home.
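One defensive pattern here is to refuse to degrade a coarse, uncertain estimate into a precise-looking pin. A hedged sketch (a hypothetical data model, not any real geolocation vendor’s API) that returns a map point only when the stated accuracy radius is tight enough:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GeoEstimate:
    lat: float
    lon: float
    accuracy_km: float  # stated uncertainty radius around the point

def pin_or_none(est: GeoEstimate, max_radius_km: float = 25.0) -> Optional[Tuple[float, float]]:
    """Return a map pin only when the estimate is precise enough;
    otherwise return None rather than a misleading default point."""
    if est.accuracy_km > max_radius_km:
        return None  # "somewhere in the USA" should never become one farm in Kansas
    return (est.lat, est.lon)

# A country-level guess (radius ~2,000 km) yields no pin at all:
print(pin_or_none(GeoEstimate(39.8, -98.6, 2000.0)))        # None
# A city-block-level estimate does:
print(pin_or_none(GeoEstimate(51.5074, -0.1278, 5.0)))      # (51.5074, -0.1278)
```

The design choice is the point: carrying the error bar through to the consumer, instead of silently collapsing it, is what keeps a default centroid from painting a target on someone’s home.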
Moreover, this kind of carelessness is not confined to profit-sector publishers cutting corners through lack of knowledge or care; it also arises in databases produced by major governments, sometimes with effects on military or intelligence operations. Smart weapons fed by dumb data practices…that’s a sentence that doesn’t need much imagination to complete it.
There’s huge opportunity to improve the performance of institutions of society, whether public- or profit-sector, by feeding carefully curated data into scalable, supervised processes that are capable of continuous learning. Fixing the technologies, end to end, from the point of first data collection to the point of reality-checked recommendation, is a task that we can all envision and drive with vigor.
Perversely, though, it may turn out that cultural change will (as often happens) be more difficult to achieve than technical progress. For example, the process of cleaning up contaminated drinking water in a US city was being optimized by machine-learning methods – but there were complaints from people who saw what seemed to them like inconsistent treatment, with some homes in a neighborhood receiving remedial attention while others did not. An altered mandate for superficial “fairness” of treatment, said the contractor doing the work, has “abandoned” the “core priority” of lead removal – and resulted in higher costs for a slower rate of progress toward that goal.
“The citizens,” said the program manager for the affected city, “are not going to trust a computer model.” That seems like a mis-statement of the problem. What people are getting too many reasons not to trust is not the computers, or the models, but the people who are making the decisions about what the computers are told and how the models are being used to turn data into action.
The “fifth industrial revolution,” as noted here at diginomica in Stuart Lauchlan’s observations of this year’s WEF at Davos, is driven by a “trust crisis…a tipping point in so many ways.” Cheap data, carelessly governed and thoughtlessly applied, must be high on the list of things that any such revolution seeks to change.