DataOps challenge - the complicated art of making things simple

By Neil Raden, September 26, 2019
Summary:
Variations on Dolly Parton - making things simple is pretty complicated.


There is always a tension between complexity and simplicity. Something that appears to operate seamlessly necessarily relies on a great deal of structure, function, and complexity. I heard a term the other day that got me thinking about this: revealed complexity. I don't precisely know what it means, but I have a theory. The old concept of "ease of use" is, essentially, useless. It tended to dumb things down to make them understandable, and it masked complexity.

Revealed complexity, to me, means, in the case of a user interface, something designed not to hide the complexity but rather to expose it in a metaphor that facilitates actions and disburdens the user from the underlying complexity while remaining approachable.

Perhaps it could be called an expert interface. Think about a helicopter pilot controlling the flight of the aircraft with a "stick." The subtle movements of the stick invoke a cascade of logical and servo-mechanical functions that are too numerous and too fast for the pilot to control but allow the pilot access to all the underlying complexity.

Many software products have a user (masked complexity) interface and much richer, more functional interfaces for administrators, for example. But why limit this to administrators?

It takes a lot of structure, features, and complexity to get the job done. However, revealed complexity, though it sounds like an oxymoron, reminds me of the old Dolly Parton quote, “It takes a lot of money to look this cheap.” Revealed complexity should be a design goal for today’s software systems, like DataOps, for example:

DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. ... This merging of software development and IT operations has improved velocity, quality, predictability and scale of software engineering and deployment.

Name change

Watching presentations at the StreamSets DataOps Summit, I saw diagrams of massive data rationalization at the Department of Defense, GSK (pharmaceutical), and others. I'm struck that I've seen these diagrams for decades; only the names have changed. The dream is to make sense out of chaos in data.

There was the Enterprise Data Model, a terrible idea because it couldn't be done. There was the Enterprise Data Warehouse, a scaled-down version of the EDM, which hit the wall on scale, cost, and complexity while its builders spent limited time thinking about value and analytics: a classic "build it and they will come" strategy. Master Data Management was another mega dream that rarely got past customer or product. Then there was Hadoop, which turned out to be a file system but not a solution. Data Lakes, another dream to get everything in one place, are where we are now. However, are they persistent, distributed, virtual?

One speaker, Kirk Borne of Booz Allen, admonished the audience to think big but start small. However, another presentation by two Booz Allen people described a mind-boggling effort to build an audit platform for DoD, which clearly did not start small. Mark Ramsey, formerly Chief Data Officer of pharmaceutical company GSK, takes the same POV, which is to think big and go big. Reflecting on it, I think the important part was the "think big." As the old Chinese proverb from the late, great data architect Laozi goes, "A journey of a thousand miles starts beneath one's feet." What we don’t know about these programs is how well they work, what still needs to be done, and how much they did (do) cost vs. original estimates.

In trying to get a sense of the complexity of DataOps, I’ve used a graphic, courtesy of DataKitchen, to convey how many subprocesses and tools are needed. It stays at a high level and shows the components of operations, governance, and agile data pipelines. This last one can get pretty complicated. The GSK model I was shown had over 10,000 pipelines, too many to create and manage by hand, so they developed agents to develop and maintain the pipelines. There were software products in the mix like Tamr, Hadoop, StreamSets, Spark, Kafka, and quite a few others to partition and order events. To be clear, GSK was not using DataKitchen.

DataOps

Consider that every object in this diagram represents multiple, if not hundreds of, instances, and that the diagram does not depict the multiple sites that are common in today's hybrid cloud world. The green boxes are masked complexity, for the purpose of illustration, but the whole point of DataOps is to provide revealed complexity for those “data customers” in the upper right corner. This is the weakness I see in DataOps today. Just like data warehousing before it, the "users" are always stick figures on the right side, with no indication of how they are served.

I’m disappointed that I don’t hear more about how semantics are generated to make the data useful. I guess that's magic. It's all technical: pipelines, parsing, tagging, ingesting. Kirk Borne talks about value and metadata and even ontology (or that may have been someone else), but no detail is given on how these are created. I know that it's possible to create a knowledge graph from multiple sources, but it's still limited by what a model can infer from column names and maybe the data instances themselves, and I haven't seen how this can be done with object stores and semi-structured data. I’ve seen catalogs that are rich and useful for the technical operation of the system, or for staging datasets for analysis, but they don't carry enough information about the meaning of the data. This is a DataOps necessity.
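To make that limitation concrete, here is a minimal sketch in Python of what inferring meaning from column names and data instances tends to look like. It is my own illustration, not how any product mentioned here works; the hint tables, the tags, and the `guess_meaning` function are all hypothetical.

```python
import re

# Hypothetical hint tables: patterns mapped to semantic tags (assumptions for illustration).
NAME_HINTS = [
    (re.compile(r"(^|_)email($|_)"), "email address"),
    (re.compile(r"(^|_)(dob|birth_?date)($|_)"), "date of birth"),
    (re.compile(r"(^|_)(amt|amount|price)($|_)"), "monetary amount"),
]
VALUE_HINTS = [
    (re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]{2,}$", re.IGNORECASE), "email address"),
    (re.compile(r"^\d{4}-\d{2}-\d{2}$"), "ISO date"),
]


def guess_meaning(column, samples):
    """Guess a semantic tag from the column name first, then from sample values."""
    name = column.lower()
    for pattern, tag in NAME_HINTS:
        if pattern.search(name):
            return tag
    for pattern, tag in VALUE_HINTS:
        # Only trust a value pattern if every sample matches it.
        if samples and all(pattern.match(str(s)) for s in samples):
            return tag
    return None  # Nothing to go on: the meaning has to come from a person.


print(guess_meaning("cust_email", ["a@b.com"]))
print(guess_meaning("contact", ["a@b.com", "c@d.org"]))
print(guess_meaning("status_cd", ["A", "X"]))
```

The first two columns get a tag. The third, `status_cd`, returns nothing, and that is the point: neither the name nor the values say what "A" or "X" means to the business.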

What are we trying to do with DataOps?

  • Deliver value efficiently and effectively
  • Remove fragility and friction
  • See something, say something (presupposes you can navigate through the system)
  • Promote cognitive thinking. Find the question in the data
  • Avoid Right Test Wrong Model (Kirk Borne tells a colorful story about how the first mirror for the Hubble telescope passed tests but didn't work)
  • Cope with data drift, which comes from rapid change in apps
  • Cope with changing data structures
  • Cope with semantic drift
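
The last three items on that list lend themselves to a small illustration. What follows is a minimal sketch in Python of detecting structural drift between two batches of records. It is an assumption-laden toy, not how StreamSets or any other product named here does it; `infer_schema`, `detect_drift`, and the sample records are hypothetical.

```python
from typing import Any


def infer_schema(records: list[dict[str, Any]]) -> dict[str, str]:
    """Map each field name to the type names observed for its values (hypothetical helper)."""
    observed: dict[str, set[str]] = {}
    for record in records:
        for field, value in record.items():
            observed.setdefault(field, set()).add(type(value).__name__)
    # Join multiple observed types so mixed-type fields stay visible.
    return {field: "|".join(sorted(types)) for field, types in observed.items()}


def detect_drift(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two inferred schemas and report added, removed, and retyped fields."""
    shared = set(baseline) & set(current)
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "retyped": sorted(f for f in shared if baseline[f] != current[f]),
    }


# Made-up sample batches: an upstream app renamed one field and changed another's type.
yesterday = [{"id": 1, "amount": 9.5, "region": "EU"}]
today = [{"id": "1", "amount": 9.5, "country": "DE"}]

report = detect_drift(infer_schema(yesterday), infer_schema(today))
print(report)
```

A check like this catches a renamed field or a changed type. It does not catch semantic drift: if `amount` quietly switches from dollars to euros, the name and type stay the same and the report comes back clean.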

My take

The bottom line: making it simple takes a lot of complication, a false oxymoron just like the inimitable Dolly Parton’s quip.