I've been thinking about metadata. I know that's a strange thing to think about, especially since I used to refer to it as "dead data in a drawer, neatly arranged." And it was. Our halting attempts to tag extracted data with a semblance of meaning always fell short - because it was just too hard. And it was even harder to figure out how to get application programs to use it effectively.
It is now common for organizations to collect whole datasets, and many of them, with no direct purpose beyond the anticipated needs of data scientists and AI engineers. The ability to vastly expand the scale of these loose repositories - first with Hadoop, then data lakes, and now cloud object stores and their variants - is what makes these large collections possible.
Understanding what is in the collection becomes more complicated. Data pulled into data warehouses typically came from "operational exhaust," not the digital exhaust of logs, streams, and third-party data. The latter includes unstructured data that is not easily searchable - audio, video, social media postings, spreadsheets, and email - all without a clear data model.
Existing approaches to metadata are simply not adequate.
Back in the late nineties at Sandia and Los Alamos Labs, we tried to make sense of experimental data relating to both nuclear waste disposal (from the nuclear weapons research program) and the efficacy of the nuclear weapons stockpile through simulations (we couldn't test them anymore). I proposed using ontologies to drive fully attributed DAGs using semantic web approaches. We ran into two problems:
- There weren't enough people in the world to tag all the data and to resolve all of the relationships.
- There wasn't enough computing power to resolve even the subset of data we did tag.
We didn't know how to solve the first problem. For the second problem, we convinced the DOE to fund us to build the world's fastest supercomputer, ASCI Red. It was the first teraFLOP computer (one trillion double-precision floating-point calculations per second). To put that in perspective: if you did one calculation per second, it would take you about 32,000 years to do what it did in one second. It was also the first to deviate from the Cray vector model by using a massively parallel processing (MPP) approach and primarily commodity components. It didn't work either, at least not for the metadata problem.
In 2021, three HPE/Cray machines will come online that are roughly two million times faster, but they'll be too busy with defense, intelligence, climate, and genomic problems to give a few cycles to metadata. And you can't buy one: they cost more than $500 million, are the size of two football fields, and require up to 40 MW of electricity. It is fair to say, though, that the next generation will be quite a bit smaller and consume less energy. Beyond that, the conventional wisdom is that quantum computers will replace them, but that's like saying in 1965 that we'd have flying cars soon.
We do apply metadata, to a certain extent, to what is considered "source" data. What is source data? The definition of "source" is beginning, origin, genesis. But it is our perspective of its intended purpose that identifies it as source. It is itself already derivative - the result of other things that precede our use of it. Typically, data is processed and transformed through some programmatic logic and stored on a persistent device. Is it a reliable proxy for reality? We have no idea. "Source" typically means data to be used for a particular purpose: the starting point for assembling data and drawing inferences and conclusions.
If it's the dataset, log, stream, etc., that we start with in order to re-use it (after all, all of the data is at least second-hand from its original purpose, whatever that was), we consider it the source. But what do we know about it? I sign up for seminars and webinars and podcasts and even download reports once in a while, and to do so, I have to fill out a form. Inevitably there is a box for my industry and my position. There is never an appropriate choice, so I just click anything. The data is stored, copied, aggregated, and sold, and no one who uses it knows why I chose that item. I get invitations to Zoom meetings that consistently mix up standard and daylight time. You could call these data quality problems, but that's just the usage effect from the reader's perspective.
What is the provenance of source data? How did it exist in its primordial form? At a certain level, "raw data" is an oxymoron. The context of data - why, how, and when it was recorded, by what method it was collected, and how it was then transformed - is always relevant. We can't triangulate data to see if it's consistent with other instances of the same phenomenon or event. As Nick Barrowman said in Why Data Is Never Raw: "There is, then, no such thing as context-free data, and thus data cannot manifest the kind of perfect objectivity that is sometimes imagined."
This is a real conundrum. In "Raw Data" Is an Oxymoron, Lisa Gitelman comments that:
Data isn't something that's abstract, out there, and value-neutral. Data only exists when it's collected, and collecting data is a human activity. And in turn, the act of collecting and analyzing data changes (one could even say 'interprets') us.
So what are we supposed to do? We have tools today that I didn't have twenty years ago. We have deep learning and NLP models that can sort this out while we're sleeping.
A few years ago, we wondered how we were going to send the massive amount of data from remote sensors to the data center or the cloud. Pushing all of the data onto the public networks would break the bank. The most common answer was not to send it all: "Just send abnormal readings with a time stamp." (Remember Y2K?) But suppose other sensors were reporting abnormal readings at the same moment one wasn't. We would miss that, and it is probably meaningful. In other words, the capture of data is already changing it. Of course, a better answer did come along - build smarter sensors that can do some of their own processing - but that's the same problem all over again. The locally processed data isn't raw anymore.
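To make the trade-off concrete, here is a minimal Python sketch of the "just send abnormal readings" filter and what it silently drops. The threshold, sensor IDs, and values are all hypothetical:

```python
from dataclasses import dataclass

THRESHOLD = 75.0  # hypothetical per-sensor alarm threshold

@dataclass
class Reading:
    sensor_id: str
    timestamp: int
    value: float

def edge_filter(readings):
    """Forward only 'abnormal' readings, as the naive scheme suggests."""
    return [r for r in readings if r.value > THRESHOLD]

# Two co-located sensors at the same instant: A is over threshold, B is not.
raw = [
    Reading("A", 1000, 80.2),
    Reading("B", 1000, 74.9),  # just under threshold, silently dropped
]

sent = edge_filter(raw)
# Only sensor A's reading reaches the data center. The fact that B read
# normal while its neighbor alarmed -- possibly the meaningful signal --
# never leaves the edge.
print([r.sensor_id for r in sent])  # -> ['A']
```

The per-sensor rule has no view of cross-sensor context, which is exactly the information the paragraph above says we would miss.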
So it boils down to what we call the provenance of data, which begins with what we identify as the source, though it rarely is the true source. What if we could dispatch robots to investigate the logic of the programs that create and store the data the first time - a sort of technical introspection? About fifteen years ago, Tom Hite (now of VMware) co-founded Metallect and patented a system to do just that.
You need an inventory, not only of the applications themselves, but of the semantics of the applications - what they mean, what they do, and how they relate to other resources. A relational schema is a representation of a model; an ontology is a model. A relational schema is passive; it is not capable of analyzing itself. An ontology is a run-time model: it can introspect itself and produce new information that was not given to it directly.
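To illustrate the distinction, here is a toy Python sketch - not a real reasoner, and the facts are invented - of statements stored as triples plus one transitive-closure rule, deriving a statement that was never asserted directly:

```python
# Toy illustration of an ontology as a run-time model: facts are triples,
# and a simple inference rule produces new facts not given directly.
facts = {
    ("Invoice", "is_a", "FinancialDocument"),
    ("FinancialDocument", "is_a", "Document"),
    ("BillingApp", "produces", "Invoice"),
}

def infer(triples):
    """Apply the rule 'is_a is transitive' until no new facts appear."""
    derived = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(derived):
            for (c, p2, d) in list(derived):
                if p1 == p2 == "is_a" and b == c and (a, "is_a", d) not in derived:
                    derived.add((a, "is_a", d))
                    changed = True
    return derived

closed = infer(facts)
# Never asserted, but derivable: an Invoice is a Document.
assert ("Invoice", "is_a", "Document") in closed
```

A relational schema could store the same three rows, but the row stating that an invoice is a document would have to be written by hand; here it falls out of the model at run time.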
What if all data collected the first time is tagged by its application or device and carries that tag wherever it goes? And is augmented by each handling or transformation? Forget about rows and columns. That won't work. This may be cumbersome, but not if smart tools are handling it.
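As a rough sketch of what such a travelling tag might look like - the names, steps, and format here are hypothetical, not any particular product's scheme - consider a value that carries an append-only provenance chain, augmented by every handling:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Tagged:
    """A value that carries its provenance wherever it goes."""
    value: object
    provenance: tuple = ()  # append-only chain of (step, timestamp) entries

def transform(tagged, step_name, fn):
    """Apply fn to the value and record the handling in the chain."""
    stamp = (step_name, datetime.now(timezone.utc).isoformat())
    return Tagged(fn(tagged.value), tagged.provenance + (stamp,))

# Origin tag applied by the collecting device (hypothetical IDs).
reading = Tagged(41.5, (("sensor-7/firmware-2.3", "2024-01-01T00:00:00Z"),))

# Each transformation extends the chain rather than replacing it.
celsius = transform(reading, "fahrenheit_to_celsius", lambda f: (f - 32) * 5 / 9)
rounded = transform(celsius, "round_1dp", lambda c: round(c, 1))

print(rounded.value)                       # -> 5.3
print([p[0] for p in rounded.provenance])  # full lineage of every handling
```

The point is not the container but the discipline: the tag is created once at capture and only ever appended to, so any downstream consumer can walk the chain back toward the origin.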
Semantic representations differ from descriptions or definitions in that they capture the relationships between things. Using a trivial analogy, consider a set of Tinker Toys® without the connectors. A metadata repository is useful as a reference tool, but for determining which resources depend on other resources, or the context in which they interact, it is just a database. The sticks are all there, but the connectors exist only outside the metadata - in the knowledge of the staff who construct the queries.
On the other hand, a semantic map is a structure that allows computer programs to reason and draw inferences about things - for example, a parts catalog that knows to reject an order without needing a business rule, stored procedure, or application program, because the semantic map stores both the sticks and the connectors. Also, unlike metadata in a relational database, a semantic map is a run-time system. Because of its graph-based structure, it is dynamic, capable of expanding its knowledge as it works by finding implicit knowledge within its own structure.
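A hypothetical miniature of the parts-catalog idea in Python - the part names and relations are invented - where the compatibility knowledge lives in the graph's edges, so rejecting an order is generic traversal rather than an application-specific rule:

```python
# Toy semantic map: the "connectors" (relationships) live in the graph,
# so validation is graph traversal, not a hard-coded business rule.
edges = {
    ("motor-X", "requires", "mount-A"),
    ("mount-A", "incompatible_with", "chassis-9"),
}

def related(graph, subject, predicate):
    """All objects linked to subject by the given predicate."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}

def order_ok(graph, part, target_assembly):
    """Reject the order if any required part conflicts with the target."""
    for req in related(graph, part, "requires"):
        if target_assembly in related(graph, req, "incompatible_with"):
            return False
    return True

print(order_ok(edges, "motor-X", "chassis-9"))   # -> False: inferred conflict
print(order_ok(edges, "motor-X", "chassis-10"))  # -> True
```

No rule anywhere says "motor-X cannot go on chassis-9"; that conclusion is inferred by following the connectors, which is the behavior a passive metadata repository cannot provide.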
Capturing the original intent of data, and actually putting it to use, will only work with an approach like this.