The modern stack has a messy data problem

Neil Raden, May 8, 2023
Summary:
Machine learning and AI are non-starters without the right data. But are data lakes now legacy? What does a modern data stack (MDS) look like? A recent debate sheds light.

(© Vectormine - Shutterstock)

At a recent MDSCon panel discussion titled "Evolution or revolution: Tech leaders on the future of the data stack," some of the biggest names in the industry took center stage to discuss the current state of the Modern Data Stack (MDS). Panelists included:

  • George Fraser, CEO & Co-Founder, Fivetran
  • Tristan Handy, CEO & Founder, dbt Labs
  • Matei Zaharia, Co-founder & Chief Technologist, Databricks

And the moderator was:

  • Jennifer Li, Enterprise Investment Partner, a16z

Usually, these panels are dull, especially when the panelists are all founders or executives of software companies. This was different. The panelists were so engaging and convincing that I felt like Johann Nelböck on his way to dispense with Professor Moritz Schlick, the Father of Logical Positivism, crying, "You've destroyed my belief in everything."

One particularly memorable moment came when George Fraser clarified Fivetran's role in the MDS (Modern Data Stack). As a data replication company, Fivetran focuses on replicating data into the desired destination without getting involved in any of the workflows intended by the user. This approach has helped Fivetran stand out in the crowded data integration space. However, as the MDS evolves, the "messy problem of data integration" will undoubtedly remain a challenge for all players in the industry.

"People need to realize that the sources produce very unclean data. And if you need to send the data to a relational database that supports updates and things like that, the data you will be looking at will be very ugly." That's how Fraser explains it.

How messy is this messy data problem?

I noticed this when I first looked into Fivetran a few years ago, around its early (and since abandoned) use of the term ELT (Extract, Load, Transform). I asked, "Where is the T?" Transform can mean two things: 1) modifying the source data for analytics, or 2) cleaning up the mess. Either way, Fraser says Fivetran guarantees the accuracy of its replication but does not clean up the mess.

Fraser punted on the unsolved problem of messy data in the MDS: someone has to solve this, not Fivetran. That was a stunning admission, and it caught the other two panelists flat-footed for a moment when Fraser suggested they look into it. Matei Zaharia, co-founder of Databricks and inventor of Spark, explained that Databricks was unique in that its product was an all-in-one solution, unifying data, analytics and AI in one platform. Tristan Handy paused to gather his thoughts and said he'd get right on it as soon as the current development of dbt was wrapped up (more on this next month; it is a stunning success story). It was disarming to see Handy struggling to come up with an answer.

The standard term for what Fraser exposes is "dirty data." Data can be dirty at its source because of timing differences, careless recording, or semantic errors introduced in version or platform conversions, and it can become dirty during extraction, when it is separated from its context. Merging data from separate sources creates inconsistent, duplicated and inaccurate data.
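To make the merge problem concrete, here is a minimal Python sketch. The two source lists, their field names and the `normalize_email` and `find_conflicts` helpers are all hypothetical, invented for illustration; none of this comes from the panel or from any vendor's product.

```python
from collections import defaultdict

# Hypothetical records for the same two customers held in two source systems.
crm_rows = [
    {"email": "Ann.Lee@Example.com ", "name": "Ann Lee", "country": "US"},
    {"email": "bob@example.com", "name": "Bob Roy", "country": "CA"},
]
billing_rows = [
    {"email": "ann.lee@example.com", "name": "Lee, Ann", "country": "USA"},
    {"email": "bob@example.com", "name": "Bob Roy", "country": "CA"},
]


def normalize_email(raw):
    """Trim whitespace and lowercase so the same address compares equal."""
    return raw.strip().lower()


def find_conflicts(*sources):
    """Group rows by normalized email and report fields whose values disagree."""
    grouped = defaultdict(list)
    for rows in sources:
        for row in rows:
            grouped[normalize_email(row["email"])].append(row)

    conflicts = {}
    for key, rows in grouped.items():
        for field in ("name", "country"):
            values = sorted({row[field] for row in rows})
            if len(values) > 1:
                conflicts.setdefault(key, {})[field] = values
    return conflicts


# A naive merge simply stacks the rows: four rows for two real customers.
naive_merge = crm_rows + billing_rows
conflicts = find_conflicts(crm_rows, billing_rows)
print(len(naive_merge), conflicts)
```

Even with only two customers, the stacked result doubles the row count, and one customer ends up with two spellings of her name and two country codes. Detecting the disagreement is the easy part; deciding which value is right is the work that someone in the stack still has to own.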

Timely and virtually frictionless access to data is a critical requirement for the expanding need for data science and AI/ML. The precious time of skilled practitioners is often spent managing data instead of building models. Ingenious technology that allows them to think about data and get it without delays, requests, and errors shrinks the effort and latency between conceiving a model and having the data to run it.

The rise of data science and machine learning led to a significant shift in how data is provisioned for analytics. Traditionally, data was moved and transformed from sources generically, with decisions made up front about which data elements were needed, and in what form, to meet analytical requirements. This approach was primarily focused on minimizing hardware and storage costs and involved persistent data warehouses and carefully constructed data transformation and integration processes. While the source systems themselves were relatively stable, the major problem was the semantic dissonance between them. Merging and unifying various attributes with unlike semantics took much time, and adding a new source required a funded effort.

In recent years, most organizations have migrated some or all of these collections to cloud services, resulting in a distributed data landscape. As a result, the physical location of the data is of lesser concern, and access technologies have had to adapt to accommodate the movement of data and deal with multiple locations. Access routines have been abstracted to the point where requesters no longer need to understand where the data is located at any point in time.

This approach, data federation, has made data access easier, faster, and more seamless for analysts, allowing them to focus on building models rather than managing data.

Jennifer Li brought up data federation, a topic Fraser is well known to loathe. Paraphrasing from my notes: "Newer approaches around query federation suggest that engineers and developers no longer need ETL altogether because you can query data regardless of where it lives. So, you're obliterating this step of ETL. Will ETL continue in light of the federation trend?"

Paraphrasing Fraser again: as the era of big data evolves, so do the challenges that come with it. Query federation, a technique that has been around for decades, has been "a stupid idea" the entire time. While it may make for an impressive demo, more viable solutions exist for production environments. The problem lies in the speed of the data sources, which are often too slow to support realistic queries.

While some optimizations like predicate push-down can help speed things up, many queries are not subject to those optimizations, and moving the data becomes necessary.

Fraser explained that Fivetran moves the data but treats this as a replication problem. In the data warehouse, you see precisely the same schema as in the data sources, except that it is implemented using data movement because it's impossible to do anything else. In this way, he said, Fivetran creates the same user experience as query federation.
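The push-down point is easier to see with numbers. The Python sketch below is a toy model, not any real federation engine: `RemoteSource`, its `scan` method and the sample tables are hypothetical, and I assume the engine can push only a single-source filter down to the source. It counts how many rows have to travel for a filter that can be pushed down and for a join across two sources that cannot.

```python
class RemoteSource:
    """Hypothetical slow source that counts every row it ships over the wire."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_shipped = 0

    def scan(self, predicate=None):
        """Return rows; a predicate, if given, is evaluated at the source."""
        selected = [row for row in self._rows if predicate is None or predicate(row)]
        self.rows_shipped += len(selected)
        return selected


# Two made-up tables living in two different systems.
orders = RemoteSource(
    [
        {"order_id": i, "customer_id": i % 100, "region": "EU" if i % 10 == 0 else "US"}
        for i in range(1000)
    ]
)
customers = RemoteSource(
    [{"customer_id": i, "segment": "smb" if i % 2 else "enterprise"} for i in range(100)]
)

# Query 1: a single-source filter. It is pushed down, so only matches travel.
eu_orders = orders.scan(lambda row: row["region"] == "EU")
pushed_down_rows = orders.rows_shipped

# Query 2: a join across the two sources. Neither source can evaluate it
# alone in this toy model, so both tables are shipped in full and joined
# at the engine.
orders.rows_shipped = 0
segment_by_customer = {c["customer_id"]: c["segment"] for c in customers.scan()}
joined = [dict(o, segment=segment_by_customer[o["customer_id"]]) for o in orders.scan()]
join_rows = orders.rows_shipped + customers.rows_shipped

print(pushed_down_rows, join_rows)
```

In this toy setup the filter moves 100 rows while the join moves all 1,100, and it would move them again on every run. That is the sense in which the data ends up being moved anyway; replication does the move once, ahead of the query.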

Can data lakes solve this?

The discussion then shifted to the topic of data lakes, a popular solution for storing vast amounts of data. Some experts, including the panel, believe the need for data lakes is shrinking as newer solutions, such as cloud-native data warehouses that separate computing from storage, have emerged. Navigating a data lake can also be difficult, because the data lacks a unifying element.

But for organizations without a cloud strategy or lacking cloud-native data warehouses, the cost calculations for hybrid-cloud or multi-cloud solutions can be complex. Regardless, the consensus among experts is that data lakes are no longer the optimal solution in the modern data stack and are becoming legacy technology.

My take

With all the hoopla about the MDS, it was brave of Fraser to cite its glaring deficiency while, at the same time, praising MDS.

Fraser's comments about data federation could be more accurate. There is an exception: object storage like S3 has enough bandwidth to make query federation work when you read data from it. But other than that, it's hopeless.

To put a fine point on it, there are many solutions to messy data, but the MDS still needs to address it. That is Fraser's point.

Transformation versus replication: pure replication copies data files from one place to another. What Fraser calls replication is slightly more than that, because it fits the data into a schema that is already specified but does not change the data itself. The industry accepted this approach with enthusiasm for its speed and lack of latency, and especially because it sidesteps endless arguments about the "truth." Transformation in the sense of ETL, by contrast, deals with errors and duplicate data, merges and translates, and even aggregates data.
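The distinction can be sketched in a few lines of Python. Everything here is hypothetical: the sample rows, the `replicate` and `transform` functions and the cleanup rules are my own illustration of the two ideas, not how Fivetran, dbt or any ETL tool actually behaves.

```python
import copy

# Made-up source rows with a duplicate, an unparseable amount and
# inconsistent customer spellings.
source_rows = [
    {"id": 1, "customer": "acme corp", "amount": "100.00"},
    {"id": 1, "customer": "acme corp", "amount": "100.00"},
    {"id": 2, "customer": "ACME Corp.", "amount": "n/a"},
    {"id": 3, "customer": "Beta LLC", "amount": "40.50"},
]


def replicate(rows):
    """Copy rows to the destination unchanged: same columns, same values."""
    return copy.deepcopy(rows)


def transform(rows):
    """ETL-style cleanup: drop duplicates and bad values, standardize names,
    then aggregate amounts per customer."""
    totals = {}
    seen_ids = set()
    for row in rows:
        if row["id"] in seen_ids:
            continue  # duplicate record
        seen_ids.add(row["id"])
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # amount cannot be parsed
        name = row["customer"].rstrip(".").upper()
        totals[name] = totals.get(name, 0.0) + amount
    return totals


replica = replicate(source_rows)
cleaned = transform(source_rows)
print(replica == source_rows, cleaned)
```

The replica is a faithful copy, mess included, so nobody can argue that the destination differs from the source. The transformed result is tidier, but every rule in `transform` is a judgment call that someone had to make and someone else may dispute.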

Fivetran, Databricks and dbt are all different products. While Fivetran is a replication engine, dbt focuses on analytics by directly transforming data in your data warehouse. It integrates directly with Google BigQuery, Amazon Redshift, Databricks, and Snowflake. Databricks is "a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale." That is a generic definition because its current focus is "combining data warehouses & data lakes into a lakehouse architecture as a unified platform for analytics, data management and AI including Machine Learning and NLP."
