Should you be more specific? Six points about data

Peter Coffee, May 30, 2016
How much data - and metadata - should you collect? Salesforce's Peter Coffee outlines six points to consider when thinking about data capture and context

When I was working in aerospace applications of AI, something like thirty years ago — yes, there were computers and even 36-bit Lisp machines back then — there was an “elephant joke” on the subject of knowledge that came up in the context of devising expert systems.

How is an elephant like a grape?

They’re both purple, except for the elephant.

If you don’t think that’s funny, I’m sorry, but there are at least two points being made.

  • Point One: we think of many things as basically “the same,” or at least as members of a single class of object, despite substantial differences in uncountable fine details. A purple grape, a red grape, and a green grape are all “grapes”; a big grape and a tiny grape, ditto, and most of us would say that any of them is more like the others than any of them is like an elephant (even if I painted the elephant purple).
  • Point Two: actually characterizing elephants and grapes (and pretty much anything more complex than a hydrogen atom) requires a whole lot more than observing their color. To a winemaker, merely saying “grape” or even “red grape” is about as specific as saying “wrench” to a mechanic (who may own five different kinds, each in various sizes) – but how much detail, in any given context, is enough? When we pay for storage by the bit, during collection and transmission and archival and (perhaps especially) protection, that question is more than philosophical.

Representing the essence

Let’s talk about representing the essence of things. We sometimes use the DNA blueprints of living things as the unobtainium standard of precision in an enormous space of options. It’s often remarked that there are more possible variations on a human DNA molecule than there are atoms in the universe. Really, though, that’s not the most illuminating comparison.

For any given human being, the number of different genomes that would produce an apparently identical person (because much of our DNA is so-called “junk”) may possibly outnumber the known universe’s atoms by a factor of 10^6,141,012 – yes, 1 followed by about six million zeros. I’m reasonably sure that this argument will not get you acquitted if they find your blood at the crime scene, even if a member of the cast of “CSI” appears as an expert witness, but perhaps it occasions a reality check – because these are questions of how much similarity is enough, and how much knowledge is necessary, for us to make decisions and take actions.

There are at least two further points to be made here.

Regulatory data

My own previous reference to “junk DNA” is an example of Point Four, because that 1970s coinage was considerably rethought about four years ago. The phrase “does need to be totally expunged from the lexicon,” said Ewan Birney of the European Bioinformatics Institute in Cambridge, because perhaps as much as 80 per cent of our genome may actually play a functional role (compared to the 1970s assessment that 97 per cent of our DNA was “non-coding”).

“It was always clear that there was regulation. What we didn’t know was just quite how extensive this was,” said Birney, and in a connected world I believe this is more than a biological comment. In every situation, we’re increasingly awash in opportunities to capture information that might matter someday — that might turn out to play, as the biologists put it, a “regulatory” even if not a directly “coding” role — and that, once discarded, can never be re-created.

I’m talking about questions of digital practice, of value versus cost — of diginomics, if you like — such as:

  • What do we let go by us entirely?
  • What do we capture only in aggregates of totals and averages?
  • What do we record item by item?
  • What do we capture, not merely in full detail, but in the context of almost unbounded metadata?
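The cost side of those four tiers is easy to make concrete. Here is a minimal sketch, in Python, of the same hypothetical event stream stored at each level of specificity; the event fields and names are my own illustration, not anything proposed in this article:

```python
import json
import statistics

# Hypothetical purchase events; all field names are illustrative.
events = [
    {"item": "grape-red", "price": 2.10, "ts": "2016-05-30T09:01:00",
     "geo": [34.05, -118.24], "ambient_music": "jazz"},
    {"item": "grape-green", "price": 1.95, "ts": "2016-05-30T09:02:30",
     "geo": [34.05, -118.24], "ambient_music": "jazz"},
    {"item": "grape-red", "price": 2.15, "ts": "2016-05-30T09:04:10",
     "geo": [34.06, -118.25], "ambient_music": "pop"},
]

# Tier 1: let it go by entirely -- zero bytes stored, never recoverable.
tier1 = None

# Tier 2: aggregates only -- totals and averages.
tier2 = {"count": len(events),
         "avg_price": statistics.mean(e["price"] for e in events)}

# Tier 3: item by item, stripped of context.
tier3 = [{"item": e["item"], "price": e["price"]} for e in events]

# Tier 4: full detail plus contextual metadata.
tier4 = events

for name, data in [("aggregate", tier2), ("item", tier3), ("full", tier4)]:
    print(name, len(json.dumps(data)), "bytes")
```

Even on three toy records, the byte counts fan out quickly from tier to tier — and tier 2 can never be un-averaged back into tier 3, which is the whole point about discarded data being unrecoverable.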

Oh, my, that last one is a question.

Engineering the environment

With people using uniquely personal devices, enabling time- and geo-tagged recording of what they do, we can choose to capture the time and the location in which a question was asked; the times and the sequences in which the questioner got what answers, from whom; the actions subsequently taken, when and where and in the company of whom. And I’m just getting started: what about the music that was playing in the location where a choice was made, and the colors that were present, and perhaps even the smells that might have played a part?
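A record carrying that kind of context might look something like the following sketch — a hypothetical schema of my own, with every field name an assumption rather than a proposed standard:

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class ContextualEvent:
    """One captured interaction, with as much ambient context as we
    choose to pay to keep. All field names are illustrative only."""
    question: str
    asked_at: str                  # ISO-8601 timestamp of the question
    latitude: float
    longitude: float
    answers: list = field(default_factory=list)     # (who, what, when)
    companions: list = field(default_factory=list)  # who was present
    ambient_music: Optional[str] = None
    dominant_colors: list = field(default_factory=list)
    scent: Optional[str] = None    # "where music was 15 years ago"

e = ContextualEvent(
    question="Which wrench do I need?",
    asked_at="2016-05-30T10:15:00Z",
    latitude=34.05, longitude=-118.24,
    answers=[("forum", "a 10mm socket", "2016-05-30T10:16:40Z")],
    ambient_music="jazz",
    dominant_colors=["purple"],
)
print(asdict(e)["ambient_music"])
```

Note how many of the fields default to empty: each optional slot is a diginomic decision to capture, or not, a piece of context that can never be re-created later.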

What we capture, and analyze, and can control on future occasions, we can engineer instead of leaving to luck or happenstance. Choosing not to do this will soon seem as half-hearted as using default fonts in monochrome layouts in advertising. Engineering the scent of an environment, for example, is today “where music was 15 years ago,” in the words of one player in this space, about a year and a half ago.

New database disciplines

This kind of thing won’t be done with databases as we have known them. We’ll need to do two further things that traditional database disciplines don’t merely neglect, but actually work against:

“I think I’ll stop here,” to borrow an ironically famous quotation from mathematician Andrew Wiles. As it turned out, he had not actually proved Fermat’s last theorem at that point, but he’d given people enough to think about while he filled in the remaining details. By all means, let’s each of us go do the same and see what we can build from the results.
