“After me, the deluge,” is attributed to French King Louis XV or his mistress, Madame de Pompadour, in reference to signs of the approaching Revolution.
The virtually unlimited supply of data, processing capacity, and tools to leverage them is a modern deluge, without a doubt. The challenge today is not the volume of data; it is making sense of it, at scale, continuously.
No organization today is immune from the push for some form of digital transformation. The late Peter Drucker famously said, “The computer may have aggravated management's degenerative tendency to focus inward.” That was almost twenty years ago and is almost certainly not true today. However, it illustrates how information systems have changed, and how quickly.
It is no longer sufficient to comb through your internal record-keeping systems for insight, and it is very likely that you already do your analytics across multiple locations, platforms, and clusters, and with very different kinds of data. In addition, better software tools mean more of your staff are engaged in analytics, and more will be. But you need help.
The sudden surge in computing capacity drove demand for more data and analytics, but it also led to a temporary structural shortage of professional data scientists and statisticians. Exacerbating this shortage were productivity problems associated with the work, known as the 80/20 problem: 80% of the time is spent managing data and only 20% doing the quantitative investigation. This claim first appeared at least eight years ago, and it still endures. There is no clear, rigorous evidence that the 80% figure is accurate; it varies widely by organization, application, and the skills and tools applied. However, it is impossible to deny that sourcing data for analytical and data science uses is a significant effort, regardless of the percentage cited.
Clearly, better tools are needed to improve the productivity (and job satisfaction) of highly skilled and highly compensated professionals. Even those performing analytics in more traditional ways can benefit from an intelligent, integrated product that takes them from data discovery and profiling to navigating the data with a dynamic, semantically rich data catalog and applying it in unique and productive ways.
The cadence of technology innovation surpasses most organizations’ ability to implement each new or improved technique before the next one arrives. Data management can never be a pure, complete process. It requires trade-offs: picking the issues that make the most sense, are most central to the organization’s strategy (or strategies), provide the most protection against danger, and ensure the organization can be as effective as possible. Rich, metadata-driven catalogs are emerging as the solution.
It would be a mistake to assume that a static catalog, even one replete with valuable information, is an adequate solution. What if your catalog could make recommendations for you, such as “people who searched for this also searched for that”? Or point out datasets that are subsets or supersets of one another, showing you where redundant data exists? The lift from that kind of technology can be measured in time to value: when companies set up a data catalog, the effort to populate the data dictionary and map data sets to the catalog can take months. What is missing is knowledge graph technology and a recommendation engine; with them, the time to establish a business-viable data discovery tool can often be measured in weeks rather than months.
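To make the subset/superset idea concrete, here is a minimal sketch of how a catalog service might flag redundant datasets by comparing their key-column values. The dataset names and values are hypothetical illustrations; a production catalog would compare profiled column statistics rather than raw value sets:

```python
# Sketch: flag redundant datasets by detecting subset/superset
# relationships between their key-column values.
# Dataset names and contents below are hypothetical.

def find_containments(datasets):
    """Yield (smaller, larger) pairs where one dataset's key
    values are entirely contained in another's."""
    names = list(datasets)
    for a in names:
        for b in names:
            if a != b and datasets[a] <= datasets[b]:
                yield (a, b)

catalog = {
    "customers_eu":  {101, 102, 103},
    "customers_all": {101, 102, 103, 201, 202},
    "orders":        {9001, 9002},
}

for smaller, larger in find_containments(catalog):
    print(f"{smaller} is a subset of {larger}")
# prints: customers_eu is a subset of customers_all
```

A real implementation would use sampled or sketched value sets (e.g. Bloom filters or MinHash) to make pairwise containment checks tractable at scale.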
It can’t be done with traditional methods. There is too much data and too much diversity of sources for programmatic solutions. Data scientists (we use the term “data scientist” broadly to mean anyone using data for analytical and quantitative work) need some help. Interestingly, that help comes from the same disciplines they use in their own work: the solutions that work today have machine learning and AI at their core.
The promise of machine learning and AI-imbued applications catapulting us to impressive capabilities is energized by leaps in processing, storage and networking technologies, the ability to process data at an incredible scale and the expanding skill sets of data scientists. This technology bedrock allows for an innovative approach to data management, not possible even a decade ago.
Here are a few techniques that enterprise software vendors and public cloud providers bring to market. First, a definition: Data governance and metadata have a close relationship, but they are not the same. Data governance defines and manages policies. Metadata is the mechanism to inform stakeholders about governance policies.
User/use case clustering
Relational databases developed algorithms that supercharge their query optimizers by finding patterns in data usage and grouping users with similar usage patterns. With advanced platform technology, the same approach can recommend similar content for new use cases, increase the performance of existing ones, or even detect anomalies.
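As an illustration of usage-pattern clustering, the sketch below groups users by the overlap (Jaccard similarity) of the tables they query. The user names, table names, and the greedy threshold approach are assumptions for illustration; real optimizers mine query logs with more sophisticated clustering:

```python
# Sketch: group users with similar usage patterns by comparing the
# sets of tables each user queries (Jaccard similarity).
# Users and tables here are hypothetical.

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def cluster_users(usage, threshold=0.5):
    """Greedy clustering: a user joins the first cluster whose
    representative shares enough tables; otherwise starts a new one."""
    clusters = []  # list of (representative_tables, [users])
    for user, tables in usage.items():
        for rep, members in clusters:
            if jaccard(rep, tables) >= threshold:
                members.append(user)
                break
        else:
            clusters.append((set(tables), [user]))
    return [members for _, members in clusters]

usage = {
    "ana":   {"sales", "regions", "products"},
    "ben":   {"sales", "regions"},
    "chloe": {"hr", "payroll"},
}
print(cluster_users(usage))  # [['ana', 'ben'], ['chloe']]
```

The greedy pass keeps the example short; in practice one would use a proper clustering algorithm (e.g. k-means or hierarchical clustering) over feature vectors derived from query logs.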
Resource allocation metrics
Dynamic and autonomous maintenance of software and hardware resources through the detailed collection of metrics in a dynamic metadata system.
Alerts and recommendations
Analytics software of all varieties (Business Intelligence, visualization, and custom statistical models) can be connected to alerts and notifications, enabling insight to be instantaneous.
Orchestrate recommendations and responses
Interoperability with data management platforms provides more actionable answers. Harmonizing metadata across data platforms leads to valuable inferences and new information.
New asset inference
Finding unseen relationships in data is the province of the knowledge graph.
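A toy example of what “finding unseen relationships” can mean in practice: given stored derived-from edges between data assets, a knowledge graph can infer indirect lineage by computing the transitive closure of those edges. The asset names and the transitive “derived-from” relation are hypothetical:

```python
# Sketch: a toy knowledge graph inferring unseen relationships by
# composing known derived-from edges (the relation is transitive).
# Asset names below are hypothetical.

from collections import defaultdict

def infer_lineage(edges):
    """Return the indirect derived-from edges implied (but not
    stored) by the given direct edges, via transitive closure."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    closure = set(edges)
    changed = True
    while changed:  # keep composing edges until nothing new appears
        changed = False
        for src, dst in list(closure):
            for nxt in graph[dst]:
                if (src, nxt) not in closure:
                    closure.add((src, nxt))
                    changed = True
    return closure - set(edges)  # only the newly inferred edges

edges = [("raw_clicks", "sessions"), ("sessions", "weekly_report")]
print(infer_lineage(edges))  # {('raw_clicks', 'weekly_report')}
```

The inferred edge (raw_clicks feeds weekly_report) was never recorded directly; surfacing such relationships is exactly the value a knowledge graph adds to a catalog.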
Active metadata management is the key to effective data access and governance compliance
- Different approaches to managing the new data challenges have elevated metadata management. It enables compliance with governance rules and provides immediate access to high-quality data via self-service and intelligent services. Previously proven practices are no longer suitable for today's requirements.
- The fundamental hindrance to active metadata management will be home-brewed AI, given the ever-increasing number of powerful enterprise tools available to do the job.
Metadata management is a part of the data fabric architecture
- Data lake architecture addresses the need for non-homogeneous datasets but does not accommodate federated domains or multi- and hybrid-cloud deployments.
- Data fabric and mesh approaches address the drawbacks of previous schemes. Though they choose somewhat different methods, both attack the difficulty of delivering data insights.
- Data mesh is built for distributed, domain-centered data stewardship, while data fabric strives for integration and sharing in the middle. In data mesh, governance compliance is enforced separately by the domains, and the final data governance layer is emergent.
- In a data mesh orientation, data governance, of which metadata management is a part, exhibits a bottom-up structure and is the responsibility of the domains. In a data fabric architecture, governance is a built-in feature, with data management and security as first principles.
- In general, metadata management has to be developed gradually and be demand-driven, defining standards accompanied by blueprints and focusing on the most critical and relevant data first.
Those of us who were tasked with delivering the usual analytics scoffed at metadata, describing it as “dead data in a drawer, neatly arranged,” because it provided no value. The emerging role of active metadata, providing knowledge graphs and catalogs, is essential to implementing whatever data management platform you choose: data fabric, data mesh, or both.