Distributed data sources are everywhere - can DataOps save us from cloud data complexity?
- Cloud data was supposed to enable AI at scale and democratize data. But how do we cope with the new complexities of distributed data? The emerging discipline of DataOps may help us here - along with concepts like "Data in mind, data in hand."
What we’re all striving for with data is timely and virtually frictionless access to it. It is a critical requirement for the expanding need for data science and machine learning. (I leave out the qualifier “AI” because there is a large and expanding set of capabilities in AI that are not ML.)
The precious time of skilled practitioners is often spent managing data instead of building models. Ingenious technology that allows them to think about data, and get it without delays, requests, and errors, is here now. Data in mind, data in hand, is a concept that shrinks the effort and latency between conceiving a model and having the data to run it.
The era of Data Science and Machine Learning created a fundamental shift in data provisioning for analytics. Until recently, data was moved and transformed from sources generically. In other words, decisions were made about which data elements were needed, and in what form, to satisfy a range of analytical requirements. This typically relied on persistent data warehouses and carefully constructed data transformation and integration.
It was primarily a “managing from scarcity” approach to minimize hardware and storage costs. The source systems were relatively stable, but the major problem was their semantic dissonance. Merging and unifying various attributes with differing semantics took a great deal of time. Adding a new source required reexamining those relationships and a funded effort. Big Data coupled with cloud resources changed all of that by encouraging organizations to discard the scarcity concept and to gather data in data lakes, collecting not the data elements they perceived a need for, but whole data sources in bulk and in a myriad of formats.
The problem with this approach was that the data had no unifying element and little of it produced value due to the difficulty of navigating the data lake. Most organizations felt compelled to move some or all of these collections to cloud services, and in a short period, data sources became distributed. The physical location of the data was of lesser concern.
As a result, technologies for access and control had to deal with multiple locations, and accommodate the movement of data with abstractions that no longer require requesters to understand where the data is located at any point in time. This new reality posed a severe problem for those who relied on a steady source of integrated, conformed data. With the number of source systems identified and captured, it was no longer possible for an analyst to identify, much less qualify, a data source for their investigations.
The situation became so complicated that a new approach emerged: DataOps. While DataOps promises to streamline analytics, it comes at a cost. The architecture to materialize this has many components and is complex. For the data scientist, the “Data in Mind, Data in Hand” concept demands that all of this complexity is not hidden but rather exposed in such a way that all of the capabilities of the DataOps architecture are there for them to exploit.
However, that is a lot of complexity: the components of operations, governance, and agile data pipelines. When you consider that every one of these elements represents multiple, if not hundreds of, instances, and that there is often more than one location in today’s hybrid cloud world, DataOps masks the complexity.
Still, the whole point of DataOps is to provide an “intent-driven design.” The reality is that data movement in a world of unfathomable data volumes is highly complex. However, simplifying the abstraction layer is still valuable, especially in democratizing the data experience. Nevertheless, the burgeoning world of analytics is not shying away from scale, so these are fundamental needs for the data team, engineers, IT, and decision scientists.
Two essential components of the DataOps architecture are connectors and pipelines. A connector is merely a template that describes how to access the data in a particular source. A given connector may be used in a dozen, or even hundreds, of different pipelines, each designed for a specific point-to-point transfer. A pipeline can also carry SLAs and just about anything else that may be needed to stage complex, vast, distributed data.
A pipeline accesses a data source (or more than one). It can move data from place to place and perform transformations and operations on the data, such as profiling, transforming, cleaning, aggregating, and providing operational metrics. Pipelines are not singular operators. They can work across parallel processors and interoperate with each other.
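To make the connector and pipeline distinction concrete, here is a minimal Python sketch. It is not taken from any DataOps product: the `Connector` and `Pipeline` classes, their fields, and the sample `orders` data are all hypothetical names invented for illustration. It shows one connector template reused by two pipelines, each of which applies its own steps and records simple operational metrics.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable

Record = dict
Step = Callable[[list], list]


@dataclass(frozen=True)
class Connector:
    """Hypothetical template describing how to read one source."""
    name: str
    read: Callable[[], Iterable[Record]]


@dataclass
class Pipeline:
    """Hypothetical pipeline: one source connector plus ordered steps."""
    name: str
    source: Connector
    steps: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)

    def run(self) -> list:
        rows = list(self.source.read())
        self.metrics["rows_in"] = len(rows)   # operational metric
        for step in self.steps:
            rows = step(rows)
        self.metrics["rows_out"] = len(rows)  # operational metric
        return rows


def drop_missing_amount(rows: list) -> list:
    """Cleaning step: discard records with no amount."""
    return [r for r in rows if r.get("amount") is not None]


def total_by_region(rows: list) -> list:
    """Aggregation step: sum amounts per region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return [{"region": k, "amount": v} for k, v in sorted(totals.items())]


# Assumed in-memory stand-in for a real source system.
orders = Connector(
    name="orders",
    read=lambda: [
        {"region": "east", "amount": 10},
        {"region": "west", "amount": None},
        {"region": "east", "amount": 5},
    ],
)

# The same connector serves two different pipelines.
cleaning = Pipeline("clean_orders", orders, [drop_missing_amount])
reporting = Pipeline("regional_totals", orders,
                     [drop_missing_amount, total_by_region])
```

Calling `cleaning.run()` or `reporting.run()` reads from the shared connector and fills in that pipeline's `metrics`. A real deployment would add scheduling, parallelism, and error handling, which this sketch leaves out on purpose.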
Once the number of active pipelines expands, a central function of DataOps is the overall management and orchestration of the entire environment. There is always tension between complexity and simplicity. Something that appears to be operating seamlessly relies on a great deal of structure, function, and complexity. The old concept of “ease of use” is, essentially, use-less. It tended to dumb things down to make them understandable, resulting in “masked complexity.” That is not a useful approach to DataOps.
There is a term that works. “Revealed complexity” means, in the case of a user interface, something designed to expose complexity through a metaphor that facilitates actions and disburdens the user from the underlying detail while remaining approachable. If you had to drive your car with a GUI, you might not get out of the driveway, because all of the functions would be hidden behind drop-downs and buttons. Instead, imagine a voice-based interface, or even a “stick” whose subtle movements invoke a cascade of logical functions that are too numerous and too fast for you to control directly, but that give you access to all of the underlying complexity.
Each nuance with the stick controls a series of events that you are not directly aware of but that you control at the detail level. Many software products have user (masked complexity) interfaces and much richer, functional interfaces for administrators, for example. However, why limit this to administrators? Getting the job done takes a lot of structure, features, and complexity. Revealed complexity, though it sounds like an oxymoron, reminds me of the old Dolly Parton quote, “It takes a lot of money to look this cheap.” (Spoiler alert: this is not a misogynist comment. I admire Dolly Parton for concocting this clever phrasing. Dolly Parton is a musical genius, a national treasure, and a philanthropist.) She captured this tension perfectly.
Revealed complexity should be a design goal for today’s software systems. Self-service data access and analytics development stress the data supply chain by expanding and complicating the related systems. Analytic systems are different from operational systems because they are dynamic. Even the data sources that are stable and persistent are subject to data drift, changes in the semantics (the meaning of the data), and changes in upstream and downstream systems.
DataOps deals with this by applying the fundamental concepts of DevOps to the infinitely more complex world of data, providing the capabilities for data practitioners to become more effective. Because all of this data movement is complicated, the glue that holds DataOps together is monitoring and observability (I’ll dig into that latter issue next month). Analytics can now outpace the speed of data delivery, or move faster than the quality of the data can be assured.
In the final analysis, is the data ready for consumption? The urgency with which real-time data is consumed, and the impact that data drift has on data health, make continuous monitoring at every point in a pipeline critical to the performance of the application or process relying on the data.
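As a rough illustration of what a monitoring check at one point in a pipeline might look like, the Python sketch below compares an incoming batch against a baseline profile. The technique is a deliberately simple stand-in (a null-rate threshold plus a mean shift measured in baseline standard deviations), not a method prescribed by DataOps or by any particular tool. The function names, thresholds, and sample values are all assumptions made for the example.

```python
from statistics import mean, stdev


def profile(values):
    """Summarize a numeric column; None marks a missing value.

    Assumes at least two non-missing values are present.
    """
    present = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(present) / len(values),
        "mean": mean(present),
        "stdev": stdev(present),
    }


def drift_alerts(baseline, batch, max_null_rate=0.1, max_mean_shift=3.0):
    """Return the names of the checks that the batch fails.

    The thresholds are illustrative defaults, not recommended values.
    """
    current = profile(batch)
    alerts = []
    if current["null_rate"] > max_null_rate:
        alerts.append("null_rate")
    # Shift of the batch mean, in units of the baseline standard deviation.
    shift = abs(current["mean"] - baseline["mean"]) / baseline["stdev"]
    if shift > max_mean_shift:
        alerts.append("mean_shift")
    return alerts


# Hypothetical history for one column, then two incoming batches.
baseline = profile([10, 11, 9, 10, 12, 8, 10, 11])
healthy = drift_alerts(baseline, [10, 9, 11, 10, 12])
drifted = drift_alerts(baseline, [25, 27, None, 26, 24])
```

Here `healthy` is an empty list, while `drifted` flags both checks. In practice a check like this would run at each stage of a pipeline and feed whatever alerting the team already uses.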
About fifteen years ago, our data management and integration methods were primitive compared to today. We can think of those days as an Andy Griffith show; today this industry is more like “Fear the Walking Dead.” It just keeps coming back for you. I should point out that data pipelines can get pretty complicated, so tools for their orchestration, like Apache Airflow, are gaining traction. But consider this: 80% of in-house machine learning projects fail, and data is still the prevailing problem.
Next time we’ll take a look at that statistic and drill down to ground truth. Maybe massive amounts of data aren’t needed. Maybe parsimonious solutions will emerge. Maybe synthetic data is the answer - clean, labeled and ready to go.