Moving data to the cloud can bring immense operational benefits. However, the sheer volume and complexity of today's enterprise data can cause downstream headaches for data users.
Semantics, context, and how data is tracked and used matter even more as you work toward post-migration goals. This is why, when data moves, organizations must prioritize data discovery.
In today’s AI/ML-driven world of data analytics, explainability needs a repository, just as those doing the explaining need access to metadata, that is, information about the data being used. Data discovery is also critical for data governance, which, when ineffective, can hinder organizational growth. And, as organizations progress and grow, “data drift” starts to impact data usage, models, and the business.
This two-part article will explore how data discovery, fragmented data governance, ongoing data drift, and the need for ML explainability can all be overcome with a data catalog for accurate data and metadata record keeping.
The cloud data migration challenge
With the onslaught of AI/ML, data volumes, cadence, and complexity have exploded. Cloud providers like Amazon Web Services, Microsoft Azure, Google, and Alibaba not only offer capacity beyond what the data center can provide; their current and emerging capabilities and services are also driving the execution of AI/ML away from the data center.
The future lies in the cloud. A cloud-ready data discovery process can ease your transition to cloud computing and streamline processes upon arrival. So how do you take full advantage of the cloud? Migration leaders would be wise to enable all the enhancements a cloud environment offers, including:
- Special requirements for AI/ML
- Data pipeline orchestration
- Collaboration and governance
- Low-code, no-code operation
- Support for programming languages and SQL
- Moving and integrating data in the cloud
- Data exploration and quality assessment
Once migration is complete, your data scientists and engineers must have the tools to search, assemble, and manipulate data sources through the following techniques and tools.
Critical analytics tools for cloud environments
- Predictive Transformation: An inference algorithm that presents the analyst with a ranked set of suggested transformations.
- Parametrization: A technique to automate changes in iterative passes.
- Pattern Matching: A valuable feature for exposing patterns in the data.
- Visual Profiling: Supports the ability to interact with the actual data and perform analysis on it.
- Sampling: Automatic sampling to test transformation.
- Scheduling: Provides the facility to run a job at a set time or on an event, and offers useful post-run information.
- Target Matching: Given a target schema, such as a data warehouse table, this prep capability automates the development of a transformation recipe to match it.
- Collaboration: Support for multiple analysts working together, with the facility to share quality work for reuse.
Taken together, these techniques enable everyone to trust the data and the insights of their peers. A cloud environment with such features will support collaboration across departments and standard data types, including CSV, JSON, XML, Avro, Parquet, Hyper, TDE, etc.
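As a minimal illustration of the "Sampling" technique listed above, the sketch below draws a reproducible random sample of rows and applies a candidate transformation to them, so an analyst can inspect the results before running the job over the full dataset. The data, field names, and transformation are all illustrative assumptions.

```python
import csv
import io
import random

# Illustrative CSV batch: messy whitespace and a missing value.
RAW = """name,revenue
acme, 1200
globex,900
initech,
"""

def sample_rows(rows, k, seed=42):
    """Draw a reproducible random sample to test a transformation on."""
    random.seed(seed)
    return random.sample(rows, min(k, len(rows)))

def clean_revenue(row):
    """Candidate transformation: strip whitespace, default missing values to 0."""
    value = row["revenue"].strip()
    row["revenue"] = int(value) if value else 0
    return row

rows = list(csv.DictReader(io.StringIO(RAW)))
for row in sample_rows(rows, k=2):
    print(clean_revenue(row))
```

Testing the transformation on a small sample first keeps the feedback loop fast; only once the sampled output looks right does the same recipe run against the full dataset.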
It’s more important to know what your data means than where it is
The vision of big data freed organizations to capture more data sources at lower levels of detail and in vastly greater volumes. The problem with this collection was that it exposed a far more complex semantic dissonance problem.
For example, data science always consumes "historical" data, and there is no guarantee that the semantics of older datasets are the same, even if their names are unchanged. Pushing data to a data lake and assuming it is ready for use is shortsighted.
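Because a dataset's name can stay the same while its semantics change, a simple automated check can compare the declared schema of a historical snapshot against the current one. The sketch below is a toy example under assumed metadata fields (`type`, `unit`); a real check would draw these from a catalog rather than hand-written dictionaries.

```python
# Hypothetical schema snapshots: the "revenue" column keeps its name and
# type, but its unit (and therefore its meaning) has changed.
schema_old = {"revenue": {"type": "int", "unit": "USD"},
              "region":  {"type": "str", "unit": None}}
schema_new = {"revenue": {"type": "int", "unit": "EUR"},
              "region":  {"type": "str", "unit": None}}

def schema_diff(old, new):
    """Return columns whose declared semantics changed between snapshots."""
    changed = {}
    for col in old.keys() & new.keys():
        if old[col] != new[col]:
            changed[col] = (old[col], new[col])
    return changed

print(schema_diff(schema_old, schema_new))
```

A matching column name passes a naive join or union; only the metadata comparison surfaces that the two "revenue" columns are not comparable.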
Organizations launched initiatives to be "data-driven" (though we at Hired Brains Research prefer the term "data-aware"). They strove to ramp up skills in predictive modeling, machine learning, AI, or even deep learning. And, of course, the existing analytics could not be left behind, so any solution must also satisfy those requirements. Integrating data from your own ERP and CRM systems may be a chore, but for today's data-aware applications, the fabric of data is multi-colored.
The primary issue is that enterprise data no longer exists solely in a data center or even a single cloud; it is spread across multiple clouds, on-premises systems, and combinations of both.
Edge analytics for IoT, for example, captures, digests, curates, and even pulls data from other application platforms and from live connections to partners (previously a snail-paced exercise using obsolete processes like EDI). Edge computing can be decentralized across on-premises systems, cellular networks, data centers, and the cloud. These factors mean data can originate in far-flung environments where its structures and semantics are not well understood or documented.
Problems arise when data sources are semantically incompatible. The challenge of smoothly moving data and logic while everything is in motion is too extreme for manual methods. And valuable analytics are often derived by drawing from multiple sources.
There are four critical components needed for a successful cloud data migration:
- AI/ML models to automate the discovery and semantics of the data
- Cloud governance
- On-premises business intelligence and databases
- A data catalog sophisticated enough to support the other components
Data security throughout the migration process is also essential. A data catalog that tracks labeled data and spotlights the most valuable data can help migration managers ensure the process goes smoothly. A data catalog with a governance framework can also ensure that cloud data governance is in place once data is migrated.
Data governance and data security
Security and governance are often confused because they are tightly bound, but security is only a part of governance. According to Strategies in IT Governance:
[Data] governance is the system by which entities are directed and controlled. It is concerned with structure and processes for decision making, accountability, control and behavior at the top of an entity. Governance influences how an organization’s objectives are set and achieved, how risk is monitored and addressed, and how performance is optimized.
It’s not a simple definition. Governance has to be codified in an open system for applications across the enterprise to apply. Adding to the confusion here is that ethics and compliance are often used interchangeably. Ethics is about the right thing to do; compliance includes the rules, regulations, statutes, and even organizational direction, which try to realize and guide this “correct” course of action.
How can governance help? The role of governance is to define the rules and policies for how individuals and groups access data properties and the kind of access they are allowed. Yet people in an organization rarely operate according to well-defined roles. They perform in multiple roles, often provisionally. On-ramping has to happen immediately; off-ramping has to be a centralized function. One very large organization we dealt with discovered that departing employees still had access to critical data for seven to nine days!
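The off-ramping problem above comes down to where access is resolved. The sketch below, a toy example with invented role and resource names, resolves a user's roles at request time instead of relying on cached grants, so centrally revoking a departed employee's roles takes effect immediately rather than days later.

```python
# Illustrative role-to-permission policies; names are assumptions.
POLICIES = {"analyst":  {"sales_data": "read"},
            "engineer": {"sales_data": "read", "pipelines": "write"}}

# Central role assignments; people often hold multiple roles at once.
user_roles = {"dana": {"analyst", "engineer"}}

def allowed(user, resource, action):
    """Check access against the user's *current* roles, not a cached grant."""
    return any(POLICIES.get(role, {}).get(resource) == action
               for role in user_roles.get(user, set()))

assert allowed("dana", "pipelines", "write")
user_roles["dana"] = set()   # off-ramping: revoke all roles centrally
assert not allowed("dana", "pipelines", "write")  # effective immediately
```

Because every check goes through the central role table, there is one place to revoke access, instead of the seven-to-nine-day lag the article describes.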
So how can data governance support more intelligent data security? After all, without governance, security would be arbitrary. Many organizations that employ security schemes struggle because such schemes tend to be either too loose or too tight and almost always too rigid (insufficiently dynamic).
In this way, security can hinder the progress of the organization. Yet, given the complexity of data architecture today, it’s become impossible to manage security for individuals without a coherent and dynamic governance policy to drive security allowance or grants for exceptions to those rules. It is impossible to have a coherent security policy that isn’t part of the larger governance framework.
Governance has grown too complicated for manual methods in today's complex data architecture. A data governance application with the ability to connect to data and security is needed. As discussed in the next installment, that data governance app must also connect to a data catalog.
Governance of data drift
Part of the complexity of managing security is the constant change or “drift” in the data, the models, the semantics, the Master Data Management, and all dependencies.
Once data is ingested into an organization’s repository or is connected through a managed pipeline, data sources tend to drift. Data drift alters data over time, meaning data cannot maintain the perfect objectivity we tend to endow it with. Data originating from mostly stable operational systems (AKA operational exhaust) and data extracted from static database tables generally exhibit less drift.
However, almost any other data, the so-called “digital exhaust,” includes user-generated artifacts from web-based systems and networks, such as cookies, logs, temporary browsing history, and indicators that help website managers. In addition, there are external datasets from data brokers that need scrutiny every time they are accessed. Incorporating this data into your corpus of information without constant surveillance for drift would render your repository unusable.
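One common way to automate that surveillance is to compare the distribution of a field in a fresh batch against a baseline captured at ingestion time. The sketch below uses the Population Stability Index (PSI); the 0.2 threshold is a widely used rule of thumb, not a universal standard, and the bin proportions are invented for illustration.

```python
import math

def psi(expected, actual):
    """Population Stability Index over pre-binned proportions
    (each list sums to 1); higher means more drift."""
    eps = 1e-6  # avoid log(0) on empty bins
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

baseline = [0.5, 0.3, 0.2]   # bin proportions at ingestion time
fresh    = [0.2, 0.3, 0.5]   # proportions in the latest batch

score = psi(baseline, fresh)
print(f"PSI = {score:.3f}:", "drift detected" if score > 0.2 else "stable")
```

Run on every refresh of an external or digital-exhaust source, a check like this turns "constant surveillance for drift" into a cheap, repeatable gate in the pipeline.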
Another source of drift is organizational. It arises from mergers and acquisitions, joint ventures, and dynamic supply chains. All these changes render simple, high-level schemes for security unworkable and create considerable liability to an organization.
Legacy data adds to the challenge. Coupled with inadequate security are last-generation solutions to metadata, effectively rows and columns that have to be queried and often joined in a relational format. This approach is generations old and too limited and rigid for the requirements of a digital organization.
Models, too, can drift and transform over time. Analysts and data scientists may or may not register their models, but effective governance of AI/ML models should include those in production: versioning the models, managing documentation updates and notifications, monitoring models and their results, and aligning machine learning with existing IT policies.
AI/ML models pose unique versioning and testing challenges. For instance, new users may find it difficult to understand how an ML model arrived at its conclusion. This is the so-called black-box problem. Models built with procedural methods can be traced, but ML is quite mysterious in its operations. However, ML yields some of its mystery through explainable AI, or XAI. These examinations include measuring bias and fairness by understanding which variables affected the conclusions, among other investigations. That said, XAI has only started to show some maturity.
By tracking, documenting, monitoring, versioning, and controlling access to all models, organizations can closely control model inputs and begin to understand all the variables that might affect the results. A key benefit of model governance is identifying who owns a model while a company changes over time. For example, if someone worked on a project recently but has left the company, model governance helps keep track of projects, how they run, and where you left off.
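A minimal sketch of the model-governance record described above: each production model carries a version, an owner, its inputs, and an ownership history, so ownership survives staff turnover. The field names and the model itself are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ModelRecord:
    """One governed entry in a hypothetical model registry."""
    name: str
    version: str
    owner: str
    deployed: date
    inputs: list                      # variables that affect results
    history: list = field(default_factory=list)

    def transfer_ownership(self, new_owner: str):
        """Log the hand-off so past ownership remains auditable."""
        self.history.append((self.owner, date.today()))
        self.owner = new_owner

churn = ModelRecord("churn-predictor", "2.1.0", "j.doe",
                    date(2024, 3, 1), inputs=["tenure", "plan", "usage"])
churn.transfer_ownership("a.lee")   # e.g., j.doe has left the company
print(churn.owner, churn.history)
```

When someone leaves mid-project, the record answers the questions the article raises: who owns the model now, what it consumes, and where the previous owner left off.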
The many kinds of data drift are challenging to address manually and may create problems with data security down the line. How can data leaders safeguard security while ensuring data drift is kept in check?
The solution to the problem is a data catalog. A data catalog collects metadata, combines it with data management and search tools, and helps analysts and other data users find the data they need. Data catalogs can be continuously updated with AI/ML routines to provide richer metadata, broader and deeper coverage, and far faster performance. In the next installment, we’ll look at how AI/ML helps data catalogs improve data governance, data discoverability, data usage, and more.
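To make the idea concrete, here is a toy sketch of a data catalog: metadata records combined with a simple search over names, descriptions, and tags. The entries and fields are invented for illustration; a production catalog adds lineage, governance policies, and the AI/ML-enriched metadata discussed above.

```python
# Hypothetical catalog entries; all names and tags are assumptions.
CATALOG = [
    {"name": "sales_2024", "format": "parquet",
     "description": "Daily sales by region", "tags": ["sales", "finance"]},
    {"name": "web_logs", "format": "json",
     "description": "Clickstream digital exhaust", "tags": ["web", "logs"]},
]

def search(catalog, term):
    """Return entries whose metadata mentions the search term."""
    term = term.lower()
    return [e for e in catalog
            if term in e["name"].lower()
            or term in e["description"].lower()
            or any(term in tag for tag in e["tags"])]

print([e["name"] for e in search(CATALOG, "sales")])
```

Even this tiny version shows the core value proposition: analysts find data through its metadata, which is exactly the layer AI/ML routines can continuously enrich.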