I often call Twitter “the log file of my brain.” Instead of ending my day with a baker’s dozen of open browser tabs, each trying to remind me of something that I thought might merit a blog post, I now Tweet a key quotation from something that sparked my interest—along with a link to the item—and then move on, knowing that I will have that reminder waiting for me if I want to go into more depth sometime soon. In the meantime, much of the value (“Hey, this thing here made this interesting point”) has already been realized – but at moments like this, I can review my recent Tweetstream and look for larger-scale patterns.
On this particular day, recent election results (from the Brexit vote through the USA’s just-completed cycle) have moved me to share a number of items on quality versus quantity of data. People are a bit wound up on this subject: as the U.S. election results became apparent, one professional opiner said simply, “Tonight data died.” I absolutely disagree, but the comment captures the controversy arising from so many being so wrong despite so many bits of opportunity to get it right.
Here on diginomica, the always-insightful Denis Pombriant called attention to what he called “The Moneyball Failure”: the use of techniques from a sporting domain, where data are complete and exact, in an unstructured world where data come from people who might misrepresent their intentions (or change their minds). There are understandable, non-neutral forces of distortion in play if the person asking about a voter’s intentions seems like a member of an “old order” – when that is precisely what one side in an election deeply wishes to overturn.
A related problem is often shared in the story of the man who’s looking for his dropped keys. When someone offers to help, and diligent effort fails to find them, the would-be helper asks, “Are you sure you dropped them here?” The reply is then some variant of, “No, I dropped them in that puddle down the block, but the light is much better here.” In the course of looking for a linkable version of the story, I found for the first time that this phenomenon has a name: “Streetlight Effect.” This term and the related phrase, “Drunkard’s Search,” are apparently more than fifty years old.
In either case, whether by asking questions the wrong way or asking them in the wrong place, overwhelming amounts of data can turn into overwhelming misinformation – especially if there’s a confirmation bias in play, where too many of the people asking questions are all expecting (or even hoping for) the same result.
What to do?
Are there any good ways to address this? It’s not enough to be aware of the problem. As noted by Jack Davis at the CIA, cognitive bias is like an optical illusion: even when you’re told what’s happening and why, you still see something that isn’t actually there.
As we grapple with surging data volume, we should take care to remember that dealing with the real world’s data in a purely technical way is perhaps as flawed an approach as so-called “technical analysis” in investment markets. Yes, that’s an opinion: I know there are passionate and well-paid proponents of the idea that price movement already captures everything that anyone knows about the behavior of a company’s shares, and that any additional information merely tempts people to think that they know more than the entire collectively-wise market. I see a similar notion emerging in the big data community: that the data knows all.
I’m reminded, though, of a statistical technique that I believe is called “stream extension” (but since I can’t find any good reference on this, I may be misremembering). Suppose that I have 100 years of monthly data on a behavior, and that I believe I have a good model of why that behavior happens. Suppose that I have only three years of monthly data on what I think is a fundamentally related behavior, and that statistical correlation between those data sets (month-on-month) is strong. I could then use a forecast of the first behavior, based on long and deep knowledge of seasonal and other patterns, to predict the second behavior – perhaps much better than I could do from the second, smaller data set alone. I use the longer stream of data to extend my understanding of the smaller, hence the name (as I recall it) of this method.
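The idea can be sketched in a few lines. This is a minimal illustration with invented data, not a reference implementation of any named method: a long series with a known seasonal pattern, a short series that correlates strongly with it over their overlap, and a simple linear fit that lets a forecast of the long series stand in for a forecast of the short one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical long series: 100 years of monthly observations (1200 points)
# with a clear seasonal pattern plus noise.
months = np.arange(1200)
long_series = 10 + 3 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 0.5, 1200)

# Hypothetical short series: only the last 36 months, but strongly
# correlated (month-on-month) with the long series over that window.
overlap = long_series[-36:]
short_series = 2.0 * overlap + 5 + rng.normal(0, 0.3, 36)

# Fit the linear relationship on the overlapping window.
slope, intercept = np.polyfit(overlap, short_series, 1)

# Suppose our seasonal model of the long series forecasts next month's value.
next_long = 10 + 3 * np.sin(2 * np.pi * 1200 / 12)

# Use that forecast, plus the fitted relationship, to predict the short series.
predicted_short = slope * next_long + intercept
print(f"predicted next value of short series: {predicted_short:.2f}")
```

The prediction leans on a century of pattern knowledge rather than on three years of thin data – which is precisely the appeal, and also precisely why it fails if the correlation is coincidental rather than causal.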
The key point is that this only works if I have gone to considerable effort to understand the causes, and not merely record the behaviors, of what’s going on. A world of data disgorges almost uncountable examples of correlations with astonishing strength but with no plausible underlying mechanism. I don’t for a moment believe we’ll ever find a causal link between the number of letters in the winning word of the Scripps National Spelling Bee and the number of people killed in the same year by venomous spiders.
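How easy is it to find such meaningless correlations? Easy enough to demonstrate in a few lines. This sketch (my own illustration, not from the original piece) generates a few hundred independent random walks – series with no causal connection to one another by construction – and then finds the strongest pairwise correlation among them purely by chance.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 independent random walks of 50 steps each -- none causally related.
walks = rng.normal(size=(200, 50)).cumsum(axis=1)

# Correlation of every walk against every other walk.
corr = np.corrcoef(walks)
np.fill_diagonal(corr, 0)  # ignore each walk's correlation with itself

# The strongest purely coincidental correlation in the batch.
best = np.abs(corr).max()
print(f"strongest spurious correlation: {best:.2f}")
```

With thousands of pairs to choose from, the best match is nearly perfect – astonishing strength, no underlying mechanism at all.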
We need to avoid declaring victory merely because we have mastered the magnitude of the data challenges and opportunities that face us. When timely, precise, highly accurate data flow in newly massive quantities from connected devices and from the activities of obsessively connected and socially active people, there will be a growing number of well-lit places that tempt us to collect what’s there – rather than going looking in darker, messier places for the kind of data that can tell us more things better.