In a previous article, The modern stack has a messy data problem, I borrowed a phrase from George Fraser, CEO and co-founder of Fivetran: "messy data."
"People need to realize that the sources produce very unclean data. And if you need to send the data to a relational database that supports updates and things like that, the data you will be looking at will be very ugly." That's how Fraser explains it. I think it's a wild understatement.
The problems with data run broader and deeper than extracting data from one place and putting it somewhere else. Cleansing, enriching, parsing, normalizing, transforming, filtering, shaping, integrating, and formatting data prior to its intended use should be entirely automated. But first, you must find the data, understand what it means, discover its provenance, and get access to it if it is protected.
Today's enterprises use more data than ever, drawn from a wide variety of sources. Because these files come in disparate formats and native file types, carry limited metadata, and sit crammed alongside thousands or even millions of other files in a "data lake" and other locations across the enterprise, it has historically been incredibly difficult to create an all-encompassing, searchable resource of metadata spanning all these sources.
It is common in the industry today to proclaim that a software product is AI-enabled or AI-assisted, and some of these claims are flimsy. Today's challenges raise management's expectations of becoming "data-driven" or a "digital enterprise," but it takes more than will. Our investigation revealed that AI practices are pervasive in a few products, and that they can be enhanced and maintained most easily by intelligently consolidating the AI methods into one module that can be applied to any other part of the product set.
In addition to the "number crunching, curve fitting" types of models, the best solutions I've seen charge ahead with exotic neural networks; Natural Language Processing (NLP), so you can ask questions in your own language; and, perhaps most notably given the sometimes mysterious operations of AI, a kind of "Cooperative AI," where the decisions of people using the system are fed back to the algorithms.
In a recent article from Battery Ventures, Fire vs. Ice: Databricks and Snowflake Face Off as AI Wave Approaches, by Dharmesh Thakker et al., I found the following paragraph to be a good, concise description of what enterprises are likely to do with AI combined with their data:
The theme of bringing models/compute closer to the troves of proprietary, enterprise data that already exists inside of Databricks and Snowflake. While we have long debated the end state of how enterprises will leverage AI in production—either through sending data directly to off-the-shelf, third-party model providers like OpenAI, Cohere, or Anthropic, or bringing models, both third-party and open-source directly to the data—both Databricks and Snowflake have made it abundantly clear that data has gravity. And, despite the size, sophistication, and abstraction that off-the-shelf, third-party models offer, enterprises want the ability to train, fine-tune, and run models directly on top of their proprietary, first-party data without compromising on performance, cost, and security and governance concerns.
Beyond this paragraph, the piece is an excellent review of the recent Databricks and Snowflake conferences and announcements, but that is not the subject of this article.
This raises the question: "How can AI move the needle on this 'mess,' and what sorts of surrounding capabilities are needed?" The question is too broad for this article, but I will propose some bedrock applications that were not possible before our current state of AI, such as:
- Knowledge Graphs, Complete Active Data Catalog, and Natural Language
- Similar Dataset Overlap Detection - There may be hundreds of similar datasets, some identical in structure but not in content: subsets, supersets, partial super/subsets.
- Mapping Recommendations/Crowd Source Mappings
- Trending/Data Drift
The terms below appear in the descriptions of AI techniques and add some specificity to how AI aids the bedrock applications we need to deal with messy data. Some terms used in this article:
- Classification is a technique to categorize data into a desired and distinct number of classes and to label each. Some classifier algorithms are: Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbor, ANN (see below), and Support Vector Machine.
- Deep Learning - Deep learning models can recognize complex pictures, text, sound, and other data patterns to produce accurate insights and predictions.
- Recurrent Neural Networks - are distinguished by their "memory" as they take information from prior inputs to influence the current input. Recurrent neural networks' output depends on the sequence's previous elements.
- LSTM - Long Short-Term Memory networks are a variety of Recurrent Neural Networks (RNNs) that are capable of learning long-term dependencies, especially in sequence prediction problems
- Beam Search (BEAM) - chooses the best output given target variables, such as the maximum probability of the next output character.
- Gradient Descent - iterative optimization algorithm for finding a local minimum of a differentiable function.
- Attention Neural Network - enhances some parts of the input data while diminishing other components—so the network focuses more on the important aspects of the data, trained by gradient descent.
- Transformer – architecture to solve sequence-to-sequence tasks handling long-range dependencies. It relies entirely on self-attention to compute representations of its input and output
- Encoder/Decoder - The transformer uses an encoder-decoder architecture. The encoder extracts features from an input sentence, and the decoder uses the features to produce an output sentence (translation).
- Parsing Expression Grammar (PEG) algorithm - describes a formal language in terms of a set of rules for recognizing strings in the language.
- ML Regular Expression Framework - expressions used to extract or replace a specific pattern in a text corpus.
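To make one of the terms above concrete, here is a minimal, self-contained sketch of beam search over a toy sequence of per-step token probabilities. The `step_probs` structure and token names are hypothetical, chosen only to show how the algorithm keeps the highest-probability partial sequences at each step:

```python
import math

def beam_search(step_probs, beam_width=2):
    """Toy beam search: step_probs is a list of dicts, each mapping
    candidate tokens to their probability at that step. The search keeps
    only the beam_width best partial sequences (by cumulative log-prob)."""
    beams = [([], 0.0)]  # (sequence so far, cumulative log-probability)
    for probs in step_probs:
        candidates = []
        for seq, score in beams:
            for token, p in probs.items():
                candidates.append((seq + [token], score + math.log(p)))
        # prune to the best beam_width hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]  # the highest-probability sequence found

# Hypothetical per-step token probabilities
steps = [{"a": 0.6, "b": 0.4}, {"a": 0.3, "b": 0.7}]
print(beam_search(steps))  # → ['a', 'b']
```

Real decoders apply the same pruning idea over a neural model's output distribution rather than a fixed table.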
Using the Knowledge Graph
Consider this scenario: an organization was an early adopter of a data lake but proceeded without proper effort to maintain metadata, lineage, and provenance, and as a result had a utility that was neither secure nor well administered. The data lake was full of files from within the organization's data centers, but also files from desktop/laptop applications, email, and memoranda, as well as files pulled from external sources and scraped from social media.
There were the typical structured formats and many others: data from applications, logs, and, interestingly, an active machine learning group that generated hundreds of new files a day. Because many of the files ingested into the data lake also existed outside of it, spread around the organization, it was impossible to know which ones were current, to assess data quality, or to apply any rational way to avoid duplication.
Unfortunately, finding the right data for data scientists, data engineers, and AI engineers was done by word of mouth, passing notes back and forth, and pure manual examination. The data lake was never considered strategic because projects never progressed beyond the pilot phase and never addressed the big questions. It was an endless repository with little value but substantial cost.
The solution was to build a Knowledge Graph by crawling through the universe of data, identifying 60 different data formats (including JSON, AVRO, Parquet, XML, and RDP) as well as semi-structured data such as web logs and log files, and using AI techniques to encode the relationships between files across the entire corpus of data. The Knowledge Graph underlies the catalog and powerfully distills complexity into something understandable. Connections between nodes form a directed graph, which is the bedrock of the catalog and background processing. AI tools employed for constructing and using the Knowledge Graph are:
- Recurrent Convolutional Neural Networks (RCNN)
- Semi-Structured Data Parsing: Hidden Markov Model and Gene Sequencing Algorithms
- Matrix factorization is a class of techniques that use linear algebra to map high-dimensional data into a low-dimensional space. This projection is accomplished by decomposing a matrix into a set of smaller rectangular matrices. Methods for matrix decomposition include Isomap, Laplacian eigenmaps, and Principal Component Analysis (PCA) / Singular Value Decomposition (SVD). These methods were designed to be used on many different types of data.
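The PCA/SVD idea mentioned above can be sketched in a few lines of NumPy. The dataset here is hypothetical (six rows of four correlated attributes), chosen only to show how centering plus SVD projects high-dimensional rows into a low-dimensional space:

```python
import numpy as np

# Hypothetical dataset: 6 rows described by 4 correlated attributes.
X = np.array([[2.0, 4.1, 1.0, 0.9],
              [1.0, 2.0, 0.5, 0.6],
              [3.0, 6.2, 1.5, 1.4],
              [4.0, 8.1, 2.0, 2.1],
              [2.5, 5.0, 1.2, 1.2],
              [3.5, 7.0, 1.8, 1.7]])

# Center the data, then decompose with SVD (the workhorse behind PCA).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project the 4-dimensional rows onto the top 2 principal components.
X2 = Xc @ Vt[:2].T
print(X2.shape)  # → (6, 2)

# Fraction of variance captured by the 2-D projection.
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
```

Because the attributes in this toy matrix are strongly correlated, nearly all of the variance survives the projection; that compression is exactly what makes factorization useful for comparing otherwise unwieldy datasets.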
Graphs are essential because navigating the ocean of unalike data available for modeling and analysis is only possible with tools that illuminate the process. Graphs are about relationships and provide the ability to traverse far-away relations easily and quickly, something for which relational databases are quite limited. The important part of managing unalike data is finding relationships, and manual methods for finding them are too limited to be effective. Technical metadata like column names is useful, but the magic is understanding the data itself to determine what it is. Without robust relationship analysis in an extensive data collection, errors and biases are inevitable.
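The traversal advantage described above can be illustrated with a tiny directed graph. The node names and "derived-from" edges are hypothetical; the point is that a lineage question ("what is downstream of this dataset?") is a simple breadth-first walk on a graph, where the equivalent relational query would need recursive joins:

```python
from collections import deque

# Hypothetical knowledge-graph edges: dataset nodes connected by
# "derived-from" relationships, stored as a directed adjacency list.
edges = {
    "sales_raw":     ["sales_clean"],
    "sales_clean":   ["q3_report", "revenue_model"],
    "crm_export":    ["revenue_model"],
    "q3_report":     [],
    "revenue_model": [],
}

def downstream(graph, start):
    """Breadth-first traversal: everything derived, directly or
    transitively, from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(downstream(edges, "sales_raw")))
# → ['q3_report', 'revenue_model', 'sales_clean']
```

A production knowledge graph adds typed nodes, edge attributes, and indexing, but the traversal at its core is this simple.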
Complete Active Data Catalog
AI allows users to resolve this key issue by creating and maintaining a "complete data catalog" (many offerings are far from complete) that automatically catalogs data where it sits. The active catalog is a vital tool for IT, developers, and AI engineers, and the value-adding part of the architecture, freeing business analysts and decision-makers from time-consuming, inaccurate manual processes. Metadata is collected and cataloged no matter where the original data sources exist.
In addition to a conversational user interface for interacting with the catalog, and support for APIs to do so programmatically, the catalog should provide a Natural Language interface where one can ask questions in one's own language about discovered objects, along with many other kinds of queries. For example, NLP for the catalog can answer questions about:
- Metadata & Quality (show me a description of something)
- Statistics for datasets and columns (what's the meaning of something)
- Governance (give me datasets that have PII)
- Permissions (show me users who have access to dataset <customer>)
- BI/Tableau (show me tableau views for <revenue>)
- Glossary (show me glossary with <word> in title/description)
- Feature Depth (gaining equivalence to consumer-grade search & NLP)
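A minimal sketch of how such questions might be routed to catalog query types follows. This is a hypothetical keyword-based stand-in for a real NLP layer; the intent names and patterns are assumptions for illustration only:

```python
import re

# Hypothetical intent router: maps a user's question to a catalog
# query category. A real system would use an NLP model, not keywords.
INTENTS = [
    (r"\bpii\b|personally identifiable", "governance"),
    (r"\baccess\b|\bpermission",         "permissions"),
    (r"\bglossary\b",                    "glossary"),
    (r"\bdescription\b|\bdescribe\b",    "metadata"),
    (r"\btableau\b|\bview",              "bi"),
]

def route(question):
    """Return the first matching catalog intent, else fall back to
    full-text search."""
    q = question.lower()
    for pattern, intent in INTENTS:
        if re.search(pattern, q):
            return intent
    return "search"

print(route("Give me datasets that have PII"))           # → governance
print(route("Show me users who have access to orders"))  # → permissions
```

The value of the real NLP layer is precisely that it handles phrasings no keyword table anticipates; this sketch only shows where such a router sits in the catalog's architecture.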
As noted in The Future of AI: Machine Learning and Knowledge Graphs (https://neo4j.com/blog/future-ai-machine-learning-knowledge-graphs/), machine learning is enhanced by knowledge graphs because of their innate ability to surface context. Contextual information increases predictive accuracy, makes decisioning systems more flexible, and provides a framework for tracking data lineage.
Similar datasets, similar dataset overlap detection
There may be hundreds of similar datasets, some identical in structure but not in content: subsets, supersets, partial super/subsets.
AI can determine whether datasets are related in several ways: one may be a subset or superset of another, a perfect match, or even a partial subset. Suppose an engineer needs twenty attributes for a model and typically uses the same file, which has 300 attributes. AI can recommend a smaller file with 16 of those attributes and map the other four from the superset file. Once it begins to understand the subject areas of interest to the developer, it can recommend this approach without being asked.
When you have hundreds, thousands, or even millions of datasets to consider, it is clearly beyond human capability to map files for investigations. Doing it manually generally leads to repetitive use of the same sources and stale investigations.
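A simplified sketch of the structural side of overlap detection follows, comparing datasets by their column sets. Real systems also compare content (row values, distributions); the dataset and column names here are hypothetical:

```python
def overlap_relation(cols_a, cols_b):
    """Classify how two datasets' column sets relate: identical
    structure, subset/superset, partial overlap, or disjoint."""
    a, b = set(cols_a), set(cols_b)
    if a == b:
        return "identical structure"
    if a < b:
        return "subset"
    if a > b:
        return "superset"
    return "partial overlap" if a & b else "disjoint"

def jaccard(cols_a, cols_b):
    """Similarity score in [0, 1]: shared columns over all columns."""
    a, b = set(cols_a), set(cols_b)
    return len(a & b) / len(a | b)

customers      = ["id", "name", "email", "region"]
customers_lite = ["id", "name", "email"]

print(overlap_relation(customers_lite, customers))   # → subset
print(round(jaccard(customers_lite, customers), 2))  # → 0.75
```

Run at scale, scores like these let the system surface candidate super/subsets automatically instead of relying on analysts to remember which files overlap.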
Another example of AI at work is machine learning classifiers that auto-detect PII (Personally Identifiable Information) in your data. When a data scientist is working with familiar data sources, this is not much of a problem, but introducing new sources can inadvertently reveal PII or, even more insidiously, provide enough non-PII data that a machine learning algorithm can de-anonymize the data through the latent values in the model, potentially causing ethical risk and even harm. Dataset and attribute classification is the ability to classify both a dataset and all of its attributes, auto-mask prohibited data, or simply auto-tag it and auto-authorize access. A series of AI tools should service this process.
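A rule-based first pass at attribute-level PII detection might look like the sketch below. This is an assumption-laden simplification: the pattern set and threshold are hypothetical, and a production system would combine such rules with a trained classifier over sampled column values:

```python
import re

# Hypothetical pattern table for a rule-based PII first pass.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def classify_column(values, threshold=0.5):
    """Tag a column with a PII type if enough of its sampled values
    match that type's pattern; returns None for apparently non-PII
    columns."""
    for label, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in values if pattern.search(str(v)))
        if hits / len(values) >= threshold:
            return label
    return None

print(classify_column(["ann@x.com", "bob@y.org", "n/a"]))  # → email
print(classify_column(["12", "07", "33"]))                 # → None
```

The tag returned here is what downstream policy acts on: auto-masking the column, or auto-tagging it so access requires explicit authorization.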
Business Value: The Data Catalog enables data to be cataloged automatically where it sits, sparing users this time-consuming manual process. Metadata is collected and cataloged no matter where the original data sources exist. For an organization with a dysfunctional data lake, which is prevalent, the Knowledge Graph powers features like the Data Catalog and Natural Language Processing, enabling an army of analysts to develop applications that drive measurable results: embedding inference into a dynamic sourcing system, screening resumes of both candidates and employees for best fit, or, in financial institutions, fine-tuning reserves more efficiently. The catalog and Knowledge Graph can energize a latent data lake by providing data efficiently to those who need it and keeping the catalog fresh behind the scenes.
Mapping recommendations, crowd source mapping
How does AI provide recommendations? It begins by capturing mappings that people already use, because recommendations are governed and based on previous mappings. The more a customer uses the product, the more the product learns about that customer's usage patterns and, through observation, the more contextual information it gains about how different data objects and entities are related.
- As analysts begin mapping their integrated datasets to targets, each user's data is recorded in the system
- Generate a histogram of mappings for the specific set of datasets that map to a target
- Rank the mappings and provide the associated jobs that have previously used them
- Permissions in the surrounding system govern these suggested mappings
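The histogram-and-rank steps above can be sketched as follows. The observed mappings and column names are hypothetical; the mechanism is simply counting which targets analysts have previously mapped a given source column to, then ranking by frequency:

```python
from collections import Counter

# Hypothetical log of mappings analysts have already applied
# (source column → target column).
observed = [
    ("cust_nm", "customer_name"),
    ("cust_nm", "customer_name"),
    ("cust_nm", "cname"),
    ("rev", "revenue"),
]

def recommend(source_column, history, top_n=3):
    """Histogram of prior targets for this source column, ranked by
    frequency — the crowd-sourced mapping suggestion."""
    counts = Counter(t for s, t in history if s == source_column)
    return [target for target, _ in counts.most_common(top_n)]

print(recommend("cust_nm", observed))  # → ['customer_name', 'cname']
```

In the real system, each suggestion would also carry the jobs that previously used it and be filtered by the requesting user's permissions.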
Further recommendations are then provided to help the analyst perform such tasks as joining data sets, enriching the data, choosing columns, adding filters, and aggregating the data. Then, the algorithms convert the mapping recommendation problem into a machine translation problem. Some AI approaches used:
- Encoder-Decoder architecture for primitive one-to-one mappings
- Encoder-decoder models, in the context of recurrent neural networks (RNNs), are sequence-to-sequence mapping models: an RNN encoder-decoder takes a sequence as input and generates another as output, with the decoder network using the encoded representation to produce the output sequence.
- Then using maximal grouping (a concept from Abstract Algebra, a little complicated)
- An Attention Neural Network is then used to resolve the recommendation.
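The attention mechanism underlying the last step can be shown in miniature with NumPy. This is a generic scaled dot-product attention sketch (the standard formulation from the Transformer literature), not the specific network any product uses; the matrix sizes are arbitrary:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core of the attention mechanism: weight the value vectors V by
    how well each query row in Q matches each key row in K."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarity
    # Row-wise softmax turns similarities into weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 2 queries attending over 3 key/value vectors of dim 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # → (2, 4) (2, 3)
```

This "enhance some inputs, diminish others" weighting is what lets a mapping model focus on the source columns most relevant to each target it is predicting.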
I intended to introduce just some of the AI capabilities applied to data management. I suspect there will be rapid innovation in managing the "messy data" problem. Incumbents in the data management space are making (or will make) progress in 2023, especially with the explosion of offerings in Large Language Models, which are based on the techniques in this article.
Editor's note: the author provided a number of links to relevant definitions of the terms used; we'll include these as a comment on this piece.