Solving data integration at scale - DataOps, knowledge graphs and permissioned blockchains emerge

By Neil Raden, April 7, 2020
Despite shiny new AI and data science tools, the problem of data integration at scale hasn't gone away. But promising new approaches from vendors like StreamSets and FlureeDB are worth a closer look.


This article has two sections. The first describes the longstanding difficulties in integrating data for analytics and, more recently, data science and AI.

The second section describes a promising, and long-awaited, solution for true federation of data through the use of semantic technology, graph databases, analysis and blockchain.

The longstanding problem of data integration

Single-database transaction systems have the most straightforward data management issues, because the application defines the data policies and there typically is no ad hoc data integration with external sources.

Some single-source transaction systems may require either on-the-fly access to foreign data or just routine access that is well-defined. These connections have to be built from scratch (and maintained), especially those relying on APIs; they are workarounds rather than value-added features.

When data architectures are built for reporting, analytics, data science and AI, they are supported by complicated, expensive and not always reliable data warehouses, data lakes, and more recently, object stores in the cloud. Their common characteristic is a physical, persistent data store assembled from many data sources.

Because of limited network bandwidth, expensive storage and the application of technologies not designed for the task, the solution to assembling data was, for decades, to move it from its source to a single "target." Over time, it became clear that when you move data, it loses its context. Large-scale analytical tools process far more data, in many forms, than an individual can understand and vet without digital assistance. They require advanced, AI-driven tools. It is too complicated for humans to manage.

Data scientists and AI developers usually create subsets of data called training sets, to test their models against data with known outcomes. This is a very important step and one that requires great skill and care so as not to introduce bias into the models. There is even a concept called "overtraining" where the models become too sensitized to the training sets and lose their ability to predict well.
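To make "overtraining" concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy data, the two deliberately mislabeled training points, and the two hypothetical models (a "memorizer" that copies the label of the nearest training point, and a simple cut-off rule). It is not drawn from any particular data science toolkit.

```python
# Toy data: the true rule is "label is 1 when x >= 5.5".
# Two training labels (x=3 and x=8) are deliberately wrong to simulate noise.
TRAIN = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
         (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]

# Held-out data with clean labels, never seen during fitting.
TEST = [(2.6, 0), (3.4, 0), (4.5, 0), (5.4, 0),
        (6.5, 1), (7.6, 1), (8.4, 1), (9.5, 1)]


def memorizer(x):
    """Overtrained model: copy the label of the closest training point."""
    _, nearest_label = min(TRAIN, key=lambda point: abs(point[0] - x))
    return nearest_label


def fit_threshold(data):
    """Simple model: pick the cut-off that makes the fewest training errors."""
    def errors(t):
        return sum((x >= t) != bool(label) for x, label in data)
    return min((x for x, _ in data), key=errors)


THRESHOLD = fit_threshold(TRAIN)


def simple_model(x):
    """Predict 1 when x is at or above the fitted cut-off."""
    return int(x >= THRESHOLD)


def accuracy(model, data):
    """Fraction of points whose predicted label matches the recorded label."""
    return sum(model(x) == label for x, label in data) / len(data)


if __name__ == "__main__":
    for name, model in [("memorizer", memorizer), ("simple", simple_model)]:
        print(name, accuracy(model, TRAIN), accuracy(model, TEST))
```

On this toy data the memorizer is perfect on the training set but gets only half of the held-out points right, while the simple rule misses the two noisy training points yet classifies every held-out point correctly. That gap between training and held-out performance is the signature of a model that has become too sensitized to its training set.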

Most modelers do not source their data directly from operational systems or from data warehouses; instead, they spend a great deal of time assembling their data from multiple sources, both internal and external, to extract the data they need in a format that lets them work on it. This process wastes a great deal of the analyst's time, up to 80% by some estimates, because the data has to be assembled and moved to a platform where the algorithms can sift through it.

Data is often unusable without adequate curation, governance and integration. With these in place, data can be used reliably by tools with intelligence, such as AI and knowledge graphs (see below). One guiding principle about data should not be overlooked: data should never be accepted on faith. How data are construed, recorded and collected is the result of human decisions about what to measure, when and where, and by what methods. In fact, the context of data (why the data were collected, how they were collected and how they were transformed) is always relevant.

There is no such thing as context-free data; data cannot manifest the kind of perfect objectivity that is sometimes imagined. At a certain level, the collection and management of data may be said to presuppose interpretation. "Raw data" is not merely a practical impossibility, owing to the reality of pre-processing; it is a conceptual impossibility, for data collection itself is already a form of processing. As an industry, we have made stumbling and inadequate progress in applying data to solving problems.

We are finally at the point where none of this is necessary. Data should rest in place, and powerful techniques can substitute for the numbing labor of assembly. If you think about cooking competition shows, as soon as the contestants open the basket, they sprint around the kitchen to gather up what they think they need. They always fail and waste precious time (80%?) going back to the pantry or cooler to fetch some more. There is a better way. Try to imagine them peering inside the basket, visualizing what they will make and staying at their station as an invisible hand delivers to them, just-in-time, everything they need so they can maximize their effort to cook.

Data transformation - new options to consider

The first step in transforming enterprise data and applications is having a complete understanding of all the data. The second is having a mechanism to put it into play. That mechanism combines graph databases, open query languages that are graph-savvy and, most important, what we used to call metadata, now a complex fabric that endows data with meaning, relationships and its own sense of security.

Some refer to data as "the new oil," but I'd rather think of it as the invisible hand supplying the chef. Enterprise focus today is on data-centric applications, not process automation (RPA is an exception, but a small one).

One of the more common uses of analytics is tying together a sequence of events to arrive at some cause-and-effect analysis.

1. Fluree - a graph database and semantic technology platform

FlureeDB combines a graph database, a "permissioned blockchain," and semantic technology to create a "knowledge graph." The combination of these elements is designed to provide decentralization (the blockchain element) in what the company refers to as a "familiar database format" with immutability. Rather than an app-centric platform, Fluree frames its product in terms of "data centricity," which it describes as:

  • Data security embedded with data
  • Data provenance/trust - verifiable data
  • Real-time, embeddable (even in a browser)
  • Share data without APIs

An interesting aspect of this is that the database can be queried with non-proprietary query languages SPARQL and GraphQL, in addition to their internal language FlureeQL. The use of a graph database, with RDF and W3C standards interoperability, provides instantly machine-readable capabilities and the expanded query power of a property graph with nodes, edges and properties to represent and store data.
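For readers unfamiliar with how a graph of subject-predicate-object statements gets queried, the following is a minimal sketch in plain Python. It is not Fluree's API and not real SPARQL; the sample triples, the identifiers in them, and the `match` helper are all hypothetical, chosen only to show the idea of pattern matching with wildcards that graph query languages build on.

```python
# Each fact is a (subject, predicate, object) triple, the basic unit of RDF.
# All identifiers below are made up for illustration.
TRIPLES = [
    ("shipment:42", "carriedBy", "carrier:acme"),
    ("shipment:42", "destination", "port:rotterdam"),
    ("shipment:43", "carriedBy", "carrier:acme"),
    ("carrier:acme", "insuredBy", "insurer:globex"),
]


def match(pattern, triples=TRIPLES):
    """Return triples that fit the pattern; None acts as a wildcard."""
    return [
        triple for triple in triples
        if all(want is None or want == have
               for want, have in zip(pattern, triple))
    ]


def shipments_insured_by(insurer):
    """Two-hop traversal: shipment -> carrier -> insurer."""
    carriers = {s for s, _, _ in match((None, "insuredBy", insurer))}
    return sorted(
        s for s, _, carrier in match((None, "carriedBy", None))
        if carrier in carriers
    )


if __name__ == "__main__":
    print(match(("shipment:42", None, None)))
    print(shipments_insured_by("insurer:globex"))
```

The second function follows two edges in sequence, which is the kind of relationship-hopping question that is awkward to express against flat tables and natural against a graph.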

And of course, the blockchain element distributes control over the database instead of relying on the usual centralized authority. It captures data in real time with timestamps, down to the millisecond, that never disappear and are secured with cryptography. Fluree uses a colorful term, "Time Travel," to describe a "query against any point in time to reproduce Fluree's state down to the millisecond."
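The general idea behind an immutable, timestamped record that supports point-in-time queries can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Fluree's implementation or API: the `Ledger` class, its method names and the block layout are hypothetical, and a real permissioned blockchain adds consensus, signatures and access control on top.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def _digest(timestamp_ms, changes, prev_hash):
    """Hash a block's contents together with the hash of its predecessor."""
    body = {"t": timestamp_ms, "changes": changes, "prev": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class Ledger:
    """Append-only list of blocks; each block is chained to the one before."""

    def __init__(self):
        self.blocks = []

    def append(self, timestamp_ms, changes):
        if self.blocks and timestamp_ms < self.blocks[-1]["t"]:
            raise ValueError("timestamps must not go backwards")
        prev_hash = self.blocks[-1]["hash"] if self.blocks else GENESIS
        changes = dict(changes)  # keep our own copy of the caller's data
        self.blocks.append({
            "t": timestamp_ms,
            "changes": changes,
            "prev": prev_hash,
            "hash": _digest(timestamp_ms, changes, prev_hash),
        })

    def state_at(self, timestamp_ms):
        """Replay every block up to the given millisecond timestamp."""
        state = {}
        for block in self.blocks:
            if block["t"] > timestamp_ms:
                break
            state.update(block["changes"])
        return state

    def verify(self):
        """Recompute the chain; any edit to history breaks a hash."""
        prev_hash = GENESIS
        for block in self.blocks:
            expected = _digest(block["t"], block["changes"], block["prev"])
            if block["prev"] != prev_hash or block["hash"] != expected:
                return False
            prev_hash = block["hash"]
        return True


if __name__ == "__main__":
    ledger = Ledger()
    ledger.append(1_000, {"owner": "alice"})
    ledger.append(2_000, {"owner": "bob"})
    print(ledger.state_at(1_500), ledger.state_at(2_500), ledger.verify())
```

Because nothing is ever overwritten, asking "what did the data look like at time t?" is just a replay of the blocks up to t, and because each block's hash covers the previous one, quietly rewriting an old block is detectable.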

Industries that are moving to Fluree are, not surprisingly, supply chain management, insurance and real estate.

2. StreamSets - "DataOps for modern data integration"

Self-service data access and analytics development stresses the data supply chain by expanding and complicating the system. Analytics are different from operational systems because they are dynamic. Even those data sources that are stable and persistent are subject to data drift: changes in the semantics (the meaning of the data), in the structure, and in upstream and downstream systems. DataOps deals with this by applying the fundamental concepts of DevOps to the infinitely more complex world of data, providing the capabilities for data practitioners to become more effective.
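Structural drift, the simplest kind to detect, can be illustrated with a short Python sketch. This is not StreamSets functionality; the `detect_drift` function, the expected schema and the sample record are hypothetical and only show the kind of check a pipeline might run on incoming records. Semantic drift, where a field keeps its name and type but changes its meaning, is much harder and is not covered here.

```python
def detect_drift(expected_schema, record):
    """Compare a record against an expected {field: type name} schema.

    Returns the fields that were added, went missing, or changed type.
    """
    observed = {field: type(value).__name__ for field, value in record.items()}
    expected_fields = set(expected_schema)
    observed_fields = set(observed)
    return {
        "added": sorted(observed_fields - expected_fields),
        "missing": sorted(expected_fields - observed_fields),
        "retyped": sorted(
            field for field in expected_fields & observed_fields
            if expected_schema[field] != observed[field]
        ),
    }


if __name__ == "__main__":
    # What the downstream consumer was built to expect.
    expected = {"id": "int", "amount": "float", "region": "str"}
    # What an upstream system started sending after an unannounced change.
    incoming = {"id": 7, "amount": "19.99", "country": "DE"}
    print(detect_drift(expected, incoming))
```

In this example the upstream source renamed `region` to `country` and began sending `amount` as text, so the check reports one added field, one missing field and one retyped field. A pipeline could use such a report to alert an operator or route the record aside instead of failing silently.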

While DataOps promises to streamline the analytics of big data, it comes at a cost. The architecture to materialize this has many components and is complex.

Common issues with legacy and point solutions include an inability to autoscale pipelines, little operational visibility, and few out-of-the-box integrations, all of which lead to a longer time to insights and lower business confidence. StreamSets delivers a DataOps platform that aims to solve these problems and enable organizations to stay agile with their data-in-motion strategy.

Automation of pipelines and, even at the most basic level, visibility into pipeline needs are paramount. Automation capabilities may include:

  • Pipeline provisioning
  • The application of pipeline attributes
  • Pre-built connections and transformation stages
  • Pipeline re-use and segment/fragment re-use
  • Integration with solutions like Jenkins, Puppet and Chef (infrastructure automation)

My take

Attempts to label mountains of data for AI, or to attach semantic definitions by hand, have proven to be too labor-intensive, slow and error-prone. The application of ML and AI, graph databases and data pipelines at scale is having a positive impact on the delivery and application of data resources, providing provenance, governance and security. Fluree and StreamSets are excellent examples of this development, but you can expect many other good approaches.

Automation coupled with monitoring gives companies the confidence to set-and-forget the scaling of data pipelines.