For the past five to ten years, most of the attention has been focused on "big data," especially as fuel for data science and machine learning. All the energy spent on big data obscured something we used to know: good things come in small packages.
Organizations are replete with valuable small data: files on PCs, small databases enriched by models, ingested external data sources measured in kilobytes or megabytes, and departmental applications separated from enterprise IT systems. "Small data" is a fungible term, with three definitions:
- Big data reduced to human scale: data too big for humans to understand until powerful computers and algorithms distill it into accessible, understandable, and actionable information, most often charts and visualizations
- Data that is small in the first place, such as observations at a human scale
- An entirely different situation: the problem of having too little data for machine learning models, and the core issues of dealing with small data sets
To break this down, the first definition isn't very interesting. It just applies the term "small data" to something we have always done: statistical analysis, visualization, business intelligence. The second definition is profoundly interesting. The third definition is only interesting if the nuts and bolts of AI appeal to you. Besides, it is a misapplication of the term "small data," which it uses to mean "not enough data." Not the same thing, but we'll cover it first and get it out of the way.
#3 - Small data for AI models
When dealing with models that lack sufficient data (and "sufficient" is a fungible concept), AI engineers identify three problems:
- Lack of Generalization
- Data Imbalance
- Difficulty in Optimization
Lack of Generalization manifests itself in many ways, but the most common remedy is Data Augmentation combined with ensemble techniques ("select a collection (ensemble) of hypotheses and combine their predictions into a final prediction") such as bagging and boosting. A quick example is automotive manufacturing, where most OEMs and Tier One suppliers strive for a vanishingly small number of defects, such as 3 or 4 per million parts. The rarity of these defects makes it challenging to collect enough defect data to train visual inspection models, so synthetic data generation is used to create novel defect images.
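The bagging idea can be sketched in a few lines. The one-dimensional "defect" data and the threshold "stump" model below are toy illustrations (not the image pipelines real inspection systems use): each model trains on a bootstrap resample of the data, and the ensemble majority-votes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dataset: class 1 ("defect") when x > 0.5, with 10% label noise.
X = rng.uniform(0, 1, size=200)
y = ((X > 0.5) ^ (rng.uniform(size=200) < 0.1)).astype(int)

def fit_stump(X, y):
    """Pick the threshold that best separates the two classes."""
    best_t, best_acc = 0.0, 0.0
    for t in np.linspace(0, 1, 51):
        acc = np.mean((X > t).astype(int) == y)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def bagged_predict(X_train, y_train, X_new, n_models=25):
    """Bagging: fit each stump on a bootstrap resample, then majority-vote."""
    votes = np.zeros(len(X_new))
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), len(X_train))  # bootstrap sample
        t = fit_stump(X_train[idx], y_train[idx])
        votes += (X_new > t).astype(int)
    return (votes / n_models > 0.5).astype(int)

pred = bagged_predict(X, y, np.array([0.1, 0.9]))
```

Because each stump sees a slightly different resample, the ensemble's vote is less sensitive to any one noisy point than a single model would be, which is the point of bagging on small or noisy data.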
Data Imbalance occurs when the number of data points in different classes is uneven. In most machine learning models, imbalance is not a problem, but it is consequential with small data. One technique is to change the loss function by adjusting class weights, another example of how AI models are not perfect. A very readable explanation of imbalance and its remedies can be found here.
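Adjusting the loss with class weights can be sketched as follows. The inverse-frequency weighting used here is one common convention, assumed for illustration: the rare class gets a larger weight, so mistakes on it cost more.

```python
import numpy as np

# Imbalanced toy labels: 90 negatives, 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
p_pred = np.full(100, 0.5)  # a model that is maximally unsure

# Per-class weights inversely proportional to class frequency.
counts = np.bincount(y_true)                 # [90, 10]
weights = len(y_true) / (len(counts) * counts)  # [~0.56, 5.0]

def weighted_cross_entropy(y, p, w):
    """Binary cross-entropy with each sample scaled by its class weight."""
    per_sample = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return np.mean(w[y] * per_sample)

loss = weighted_cross_entropy(y_true, p_pred, weights)
```

With this weighting, the 10 positive samples contribute as much to the loss as the 90 negatives, so the optimizer can no longer do well by ignoring the rare class.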
Difficulty in Optimization is a fundamental problem, since optimization is what machine learning is meant to do. Optimization starts with defining some kind of loss function or cost function and ends with minimizing it using an optimization routine, usually Gradient Descent, an iterative algorithm for finding a local minimum of a differentiable function (first-semester calculus, not magic). But if the dataset is weak, the technique may not find a good minimum. The most popular remedy is Transfer Learning. As the name implies, transfer learning is a machine learning method where a model trained on one task is reused to improve a model on a related task. A simple explanation of transfer learning can be found here.
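Gradient descent itself fits in a few lines. A minimal sketch, minimizing the toy function f(w) = (w - 3)^2 (chosen purely for illustration, since its gradient 2(w - 3) is easy to check by hand):

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to walk downhill on f."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # move opposite the slope
    return w

# f(w) = (w - 3)^2 has gradient f'(w) = 2 * (w - 3); its minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

In real models the same loop runs over a high-dimensional loss surface estimated from data, which is why a weak dataset can leave the routine stuck in a poor local minimum.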
I wanted to do #3 first, because #2 is the more compelling discussion about small data. Searching, I found no references to "small data" before 2016, the year in which the most influential book on the subject was published: Lindstrøm, Martin, Small Data: The Tiny Clues That Uncover Huge Trends, St. Martin's Press, 2016.
His premise is that small data consists of seemingly insignificant observations that disclose people's subconscious behavior. Since 85% of our behavior is subconscious, small data provides the clues to the causation and hypotheses behind that behavior. Big data, by comparison, is rational data that creates correlation by connecting the dots. But because analysts mining big data apply their own hypotheses in a random search across billions of data points, the results are imprecise. Moreover, since everyone mines big data the same way, they end up with the same results. By using small data, which is specific and detailed, as preliminary research to find the hypotheses behind our subconscious behavior, you can find the imbalances in people's lives that represent a need, and ultimately a gap in the market for a new brand.
One example: he believes Amazon will fail if it attempts to open brick-and-mortar bookstores, because it won't embed itself in the community. It won't pick up on the clues the way independent booksellers do, or as Lindstrøm says:
Where Big Data on the Internet is good at going down the transaction path if you click, pick and run. You could say that the Small Data is fueling the experiential shopping, the feeling of community, the feeling of the senses - all that stuff you can't replicate online.
Lindstrøm's premise is very compelling. I've always been suspicious of Big Data predictive analytics. My classical training in statistics taught me that you can't predict the future by manipulating data; statistical analysis was used to understand past behavior in order to build better analytical models. Big Data and Data Science perverted this guideline into the belief that a terabyte is OK, a petabyte is better, and an exabyte is the cat's meow. The more data you have, the more likely it is you do not understand its context. Data doesn't speak for itself.
Big Data applications are extremely useful in areas such as smart electric grids, autonomous vehicles (when they arrive), money-laundering detection, and threat detection. Sadly, a reference search for how AI will help humanity is not nearly as expansive, and every reference seems to say the same three things. I am hopeful these applications will emerge.