Solving the data wrangling conundrum - can machine learning transform data management?

Neil Raden, April 26, 2022
Do data scientists really spend 80% of their time wrangling data? Last time around, we examined this notion. But when it comes to data management, how can machine learning change data platforms for the better?


In my last piece, I asked: do data scientists really spend 80% of their time wrangling data? Now it's time for the follow-up: can machine learning make a difference in data management? Can it alter that 80/20 data cleansing ratio?

Machine Learning (ML) is a term that can mean just about anything. In evaluating a (proprietary) tool for data management that claims to use machine learning, you should understand what that means. It isn’t necessary to see the math or even the code that implements the algorithm.

It should suffice to understand what the algorithm evaluates, with at least a high-level explanation of how it operates and what it produces.

Keep in mind that the fundamental workings of the algorithms are usually proprietary, so the explanations, if given, will be pretty high level. How coherent the explanation is, though, should help you understand what is real. 

Despite its lofty name, machine learning isn’t that mysterious. The most popular algorithms in use today are pretty mature. What makes them “machine learning” instead of just statistical models is the use of massive amounts of data, which was not previously possible. Some machine learning algorithms in common use are:

  • Linear Regression 
  • Logistic Regression 
  • Linear Discriminant Analysis 
  • Classification and Regression Trees 
  • Naive Bayes 
  • K-Nearest Neighbors 
  • Learning Vector Quantization 
  • Support Vector Machines 
  • Bagging and Random Forest 
  • Boosting and AdaBoost 

Is Machine Learning Artificial Intelligence? 

There is a tendency to conflate machine learning with Artificial Intelligence (AI). There are two general fields of AI. The first, Artificial General Intelligence (AGI), is about machines having human-like cognition and human intelligence, but there is some disagreement about when or if we will reach that threshold. Each new bold advance in what appears to be AGI demonstrates that what was assumed to be intelligence turns out not to be. Facial recognition is a good example. The other is what is in place now: non-sentient machine intelligence, typically focused on a narrow task. This is where machine learning and AI get mixed up. 

For an ML algorithm to learn, it sifts through lots of data using a variety of statistical, non-parametric and other quantitative algorithms to find relationships, patterns and connections in the data. According to Judea Pearl, the Turing Award winner and author of “The Book of Why: The New Science of Cause and Effect,” ML cannot understand cause and effect. ML without causal capabilities, as Pearl derisively claims, “is just curve fitting.” 

Pearl has led the field on the issue of cause and effect, and while there is some truth in his comment, there are many useful applications for ML that are “just curve fitting.” For example, sifting through billions of records to find what relates to what and how strongly, and then giving analysts or data stewards the opportunity to edit those findings. That’s how ML actually “learns.”

For a data discovery/relationship discovery process to tie to a data catalog, the essential abilities are: 

  • The ability to scale, because the data volumes are large and the processing is continuous.
  • The ML algorithms operate in supervised and unsupervised mode.
  • No ML discovery algorithm is perfect. User input is captured and cycled back into the ML process.
  • Continuous relearning and adapting of the models. 

I wrote a few years ago: The real magic in applying machine learning models to a software product is producing the right mix of things that are general enough to work with a wide range of situations and powerful enough to produce non-trivial results repeatedly (a useless example: “Most auto injury accidents occur when the driver is at least 16 years old.”). Supporting data science with integrated (no-code) tools requires creating and maintaining a comprehensive data catalog, but a few steps precede it.

Relationship discovery

If you think about it, the most crucial part of managing collections of unalike data is finding relationships. Finding relationships between so many forms of data is practically impossible to do by hand. When dealing with tabular/columnar data, one approach is to figure out which column names are likely to point to similar kinds of data, though this is not consistently accurate. Instead, the magic investigates the actual data to determine what it is.
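As a rough illustration of what "investigating the actual data" can mean, here is a minimal sketch that ignores column names entirely and scores pairs of columns by how many distinct values they share (Jaccard similarity). This is my own simplified stand-in, not any vendor's method; the function names, the toy tables and the 0.5 threshold are all assumptions made for the example.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of distinct values two columns have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_relationships(tables: dict, threshold: float = 0.5) -> list:
    """Compare every pair of columns across different tables by their values.

    `tables` maps table name -> {column name -> list of values}.
    Returns (score, "table.column", "table.column") tuples, best match first.
    """
    columns = [
        (table, column, set(values))
        for table, cols in tables.items()
        for column, values in cols.items()
    ]
    matches = []
    for (t1, c1, v1), (t2, c2, v2) in combinations(columns, 2):
        if t1 == t2:
            continue  # only look for links between different tables
        score = jaccard(v1, v2)
        if score >= threshold:
            matches.append((round(score, 2), f"{t1}.{c1}", f"{t2}.{c2}"))
    return sorted(matches, reverse=True)


# Hypothetical toy data: the column names give no hint that the two are related.
tables = {
    "orders": {"cust": [101, 102, 103, 104], "amount": [20, 35, 20, 50]},
    "crm": {"id_number": [101, 102, 103, 105], "region": ["N", "S", "E", "W"]},
}
matches = candidate_relationships(tables)
print(matches)
```

On this toy data the only candidate is orders.cust and crm.id_number, which share three of five distinct values. Note that the sketch compares every column with every other column, which is exactly the brute-force cost that becomes a problem at scale.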

To put this in perspective, if you have a few billion instances to compare, this can be a computationally expensive (read: slow) process. Here is the first example of machine learning boosting the process. Using some of the algorithms mentioned above, an unsupervised machine learning model can quickly break down the similarities and converge to a solution. As the process flows through the data collection, it builds a relationship map that drives all of the elements of the system. Some powerful techniques that data discovery vendors are employing to find these relationships are:

  • Recurrent Convolutional Neural Networks (RCNN).
  • Semi-Structured Data Parsing: Hidden Markov Model and Gene Sequencing algorithms.

Recommendations are then provided to help the analyst join data sets, enrich the data, choose columns, add filters, and aggregate the data. The algorithms convert the mapping recommendation problem into a machine translation problem using:

  • Encoder-Decoder architecture for primitive one-to-one mappings.
  • Then using maximal grouping.
  • An Attention Neural Network (ANN) is used to resolve the recommendation.

Data flow 

This is the machine learning-based discovery of how data flows between databases and data sources, and ultimately how data moves through the organization: discovering where data originates and the affinities in the data itself.
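Once flows have been discovered, tracing where a given data set originates is a simple graph walk. The sketch below assumes the discovery step has already produced a list of (source, destination) pairs; it does not attempt the machine learning part. The data set names and the two helper functions are hypothetical.

```python
from collections import defaultdict


def upstream_sources(edges, target):
    """Return every data set that feeds `target`, directly or indirectly."""
    parents = defaultdict(set)
    for source, dest in edges:
        parents[dest].add(source)

    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:  # also protects against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def origins(edges, target):
    """Upstream data sets that have no parent of their own."""
    has_parent = {dest for _, dest in edges}
    return {n for n in upstream_sources(edges, target) if n not in has_parent}


# Hypothetical flows, written as (source, destination) pairs.
edges = [
    ("erp.orders", "staging.orders"),
    ("crm.accounts", "staging.customers"),
    ("staging.orders", "warehouse.sales"),
    ("staging.customers", "warehouse.sales"),
    ("warehouse.sales", "dashboard.revenue"),
]
print(sorted(upstream_sources(edges, "dashboard.revenue")))
print(sorted(origins(edges, "dashboard.revenue")))
```

Here the dashboard traces back through the warehouse and staging tables to two origins, erp.orders and crm.accounts.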

Sensitive data discovery

There are two types of sensitive data in sources. The first is the obvious personal information such as name, social security number, date of birth, and demographic, sociographic, and psychographic data. The problem is that this data may not be identifiable by merely looking at the column names or other available metadata. Only by examining the data itself can an algorithm decide whether the data falls within the “sensitive realm.”
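To make the point concrete, here is a minimal sketch of classifying a column by its values rather than its name. It uses a few regular expressions as a deliberately crude stand-in for the ML classifiers a real product would use; the patterns, the 80 percent threshold and the column names are assumptions for illustration only.

```python
import re

# Illustrative patterns only; real detectors would need far more than regexes.
PATTERNS = {
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def classify_column(values, min_share=0.8):
    """Return pattern labels matched by at least `min_share` of non-empty values."""
    cleaned = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not cleaned:
        return []
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in cleaned if pattern.fullmatch(v))
        if hits / len(cleaned) >= min_share:
            labels.append(label)
    return labels


# Hypothetical columns whose names reveal nothing about their contents.
columns = {
    "field_17": ["123-45-6789", "987-65-4321", "111-22-3333", None],
    "notes_2": ["a@example.com", "b@example.org", "call back later"],
    "col_x": ["blue", "green", "red"],
}
findings = {name: classify_column(vals) for name, vals in columns.items()}
print(findings)
```

Only field_17 is flagged. The notes_2 column contains email addresses in two of three values, which falls under the threshold, a reminder that free-text fields are where simple rules tend to miss things.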

But there is a deeper problem. Personally Identifiable Information (PII) also includes seemingly non-sensitive information that can be combined with other non-sensitive details to create an “emergent” identity. Additionally, information that company policy defines as sensitive or confidential to the organization may also fall within the realm of “sensitive.”
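The “emergent” identity problem can be shown with a few lines of code. The sketch below counts how many records share the same values for a set of fields; this smallest group size is the k in what the privacy literature calls k-anonymity, a term I am borrowing here rather than one any particular tool is claimed to use. The records and field names are invented for the example.

```python
from collections import Counter


def smallest_group(records, fields):
    """Size of the smallest group of records sharing the same values for `fields`.

    A result of 1 means at least one record is unique on that combination.
    Returns 0 when there are no records.
    """
    if not records:
        return 0
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return min(counts.values())


# Hypothetical records: no single field identifies anyone.
records = [
    {"zip": "02139", "birth_year": 1980, "gender": "F"},
    {"zip": "02139", "birth_year": 1980, "gender": "M"},
    {"zip": "02139", "birth_year": 1975, "gender": "F"},
    {"zip": "02139", "birth_year": 1975, "gender": "M"},
    {"zip": "02140", "birth_year": 1980, "gender": "F"},
    {"zip": "02140", "birth_year": 1980, "gender": "M"},
    {"zip": "02140", "birth_year": 1975, "gender": "F"},
    {"zip": "02140", "birth_year": 1975, "gender": "M"},
]
for fields in (["zip"], ["birth_year"], ["gender"], ["zip", "birth_year", "gender"]):
    print(fields, smallest_group(records, fields))
```

Each field on its own leaves every person hidden in a group of four, but the three fields together single out every record.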

With these types of sensitive data, there are several reasons why managing the process is essential. First, of course, are regulatory issues, such as the General Data Protection Regulation (GDPR). But there are also organizational promises to customers and suppliers to be good stewards of the data you collect about them. It is relatively easy to govern these policies when a single internal system generates and manages the data. Still, if the data is scattered across sources and locations, gaps in governance and even the “emergent” problem can occur.

And finally, the gap between policy and digital processing is wide. The policy is stated in natural language, but implementing that policy in software can be pretty tricky.

Impact analysis

Like a trend analysis, this captures changes in the source data at different points in time. For example, if new sensitive data is introduced into the database, impact analysis can determine when that occurred and quantify the delta.
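A minimal sketch of the idea, assuming each scan produces a mapping from column to the set of sensitive labels found in it. The snapshot format, the labels and the two function names are hypothetical, chosen only to show the comparison.

```python
def sensitive_delta(before, after):
    """Compare two scans; each maps "table.column" -> set of sensitive labels."""
    added = {
        col: sorted(labels - before.get(col, set()))
        for col, labels in after.items()
        if labels - before.get(col, set())
    }
    removed = {
        col: sorted(labels - after.get(col, set()))
        for col, labels in before.items()
        if labels - after.get(col, set())
    }
    return {"added": added, "removed": removed}


def first_seen(snapshots, column, label):
    """Date of the earliest scan in which `label` was found on `column`, else None."""
    for date in sorted(snapshots):
        if label in snapshots[date].get(column, set()):
            return date
    return None


# Hypothetical scan results keyed by ISO date, so string sorting is chronological.
snapshots = {
    "2022-01-01": {"crm.email": {"email"}},
    "2022-02-01": {"crm.email": {"email"}, "orders.memo": {"us_ssn"}},
    "2022-03-01": {"crm.email": {"email"}, "orders.memo": {"us_ssn"}},
}
delta = sensitive_delta(snapshots["2022-01-01"], snapshots["2022-03-01"])
print(delta)
print(first_seen(snapshots, "orders.memo", "us_ssn"))
```

The delta shows that orders.memo picked up social security numbers between January and March, and first_seen narrows that to the February scan.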

Redundant data analysis

Redundant data may, and usually does, have different modification cycles, leading to data confusion. Generally, there aren’t redundant data sources of primary enterprise data (though it happens). Still, other data sets can creep into the universe of sources, such as saved analysis outputs, training data sets and even spreadsheets. The relationship map can identify these redundant sources and allow the analysts to choose the appropriate one. 
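As a simple illustration, exact copies can be found by hashing each data set's contents in a way that ignores row order. This sketch is an assumption-laden simplification: it catches only identical copies, whereas copies that have drifted apart through different modification cycles would need a similarity measure instead. The data set names are made up.

```python
import hashlib
from collections import defaultdict


def fingerprint(rows):
    """Order-independent hash of a data set's rows (each row a tuple of values)."""
    digest = hashlib.sha256()
    for line in sorted(repr(tuple(row)) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def redundant_groups(datasets):
    """Group data set names whose contents are identical, ignoring row order."""
    groups = defaultdict(list)
    for name, rows in datasets.items():
        groups[fingerprint(rows)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical sources: a spreadsheet export that duplicates a warehouse table.
datasets = {
    "warehouse.customers": [(1, "Ada"), (2, "Grace"), (3, "Alan")],
    "exports/customers.xlsx": [(3, "Alan"), (1, "Ada"), (2, "Grace")],
    "training/customers_v2": [(1, "Ada"), (2, "Grace"), (4, "Edsger")],
}
print(redundant_groups(datasets))
```

The spreadsheet export is grouped with the warehouse table even though its rows are in a different order, while the training set, which differs by one row, is not.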

Organizations can accumulate vast quantities of redundant data. They may incur storage costs and unknowingly leave such data unmanaged and unprotected. Once identified, redundant data also needs to be managed so that organizations can decide on the appropriate remediation steps as part of the data management process.

Data catalog

This is the most important element. The automated data catalog is driven by relationship discovery. The whole point of a semantically rich data catalog is to give analysts, data scientists, business and technology users (anyone who uses data, actually) a means to find the data they need and to understand what it means, how it relates to other data, and how it flows. It should also support collaboration and enable good data governance, data management and ultimately business analytics. Unlike the proprietary metadata of enterprise applications such as ERP or CRM, or of Business Intelligence and visualization tools, the catalog is not tied to a specific schema or model. Its generality is the key to its usefulness.
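To show what "not tied to a specific schema" might look like in the simplest possible terms, here is a sketch of a catalog entry that records a description, tags, related data sets and upstream sources, plus a keyword search. The class names, fields and sample entries are hypothetical; a real catalog would be populated by the discovery process rather than by hand.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str
    tags: set = field(default_factory=set)
    related: set = field(default_factory=set)   # names of related data sets
    upstream: set = field(default_factory=set)  # names of data sets feeding this one


class Catalog:
    def __init__(self):
        self._entries = {}

    def add(self, entry):
        self._entries[entry.name] = entry

    def search(self, term):
        """Names of entries whose name, description or tags mention `term`."""
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in tag.lower() for tag in e.tags)
        )

    def related_to(self, name):
        """Names of data sets recorded as related to `name`."""
        return sorted(self._entries[name].related)


catalog = Catalog()
catalog.add(CatalogEntry("warehouse.sales", "Daily sales by customer",
                         tags={"finance"}, related={"crm.accounts"},
                         upstream={"erp.orders"}))
catalog.add(CatalogEntry("crm.accounts", "Customer master records",
                         tags={"customer", "pii"}))
print(catalog.search("customer"))
print(catalog.related_to("warehouse.sales"))
```

A search for "customer" finds both entries, one through its description and one through its description and tags, regardless of which system each data set lives in.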

The most common repositories of metadata relate to customer and product domains. There is no doubt that these repositories are useful, but they lack perhaps 90 percent of the valuable data for analytics and data science. 

My take

Machine Learning alone cannot break through the 80% problem, but it is the necessary element if applied intelligently. A unified platform, from data discovery to data catalog, can vastly reduce the time it takes to do the analytics required for digital transformation. 
