The data lake collects raw data: thousands, perhaps millions, of files. This is posited as a benefit. But is it really?
At a certain level, raw data is an oxymoron. Without context, we can't triangulate data to see if it's consistent with other instances of the same phenomenon or event. "Raw data" typically implies that the data is to be used for a particular purpose, as the starting point for drawing inferences and conclusions.
The context of data - why, how, and when it was recorded, and by what method it was collected and then transformed - is essential. Context-free data simply does not exist. The perfect objectivity we assign to "raw data" is a myth. That's why, in data warehousing, we attempted to integrate and rationalize things.
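To make the point about context concrete, here is a minimal sketch of a record that carries its own context along with the value. The class name `Observation`, its fields, and the `comparable` rule are all assumptions made for illustration, not an established schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class Observation:
    """A recorded value plus the context needed to interpret it (hypothetical schema)."""
    value: float
    unit: str                  # what the number measures
    recorded_at: datetime      # when it was recorded
    method: str                # how it was collected
    purpose: str               # why it was collected
    transformations: Tuple[str, ...] = ()  # what was done to it afterwards


def comparable(a: Observation, b: Observation) -> bool:
    """Assumed rule: two observations can be triangulated only if
    they share a unit and a collection method."""
    return a.unit == b.unit and a.method == b.method
```

A bare `21.5` in a file says none of this. With the context attached, at least the question "can these two values be compared?" has an answer.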
Data lakes hail from Hadoop (and later, other cloud storage options), which was indifferent to the size and type of files that could be processed, as opposed to the rigid and not nearly as scalable nature of relational data warehouses. That hatched the idea of the single place for everything - the data lake. In truth, it was a concept hatched by the Hadoop distributors to sell more licenses. Though it did simplify searching for and locating files, it provided no analytical processing tools at all. Moving a JSON file from Paris, France, to a cloud location in Paris, Texas, adds no value - except for some economies of scale in storage and processing.
Industry analyst Andrew Brust, in Big on Data, quotes George Fraser, CEO of Fivetran:
I think 2021 will reveal the need for data lakes in the modern data stack is shrinking...there are no longer new technical reasons for adopting data lakes because data warehouses that separate compute from storage have emerged.
If that's not categorical enough for you, Fraser sums things up thus: "In the world of the modern data stack, data lakes are not the optimal solution. They are becoming legacy technology."
For organizations that lack cloud-native data warehouses that separate compute from storage, or that lack a cloud strategy altogether, that is something of an oversimplification. Calculating the costs of hybrid-cloud, multi-cloud, and separation of storage from compute borders on alchemy. And even a good approximation is only as good as the moment you make it, because things change so quickly. There is one secret, though: you will do worse without a model, no matter what approach you take.
Another thing to consider is that "organization" is often an oxymoron. While most organizations may have a single "strategy" for data architecture, as a result of acquisitions, legacies, geography, and just the usual punctuated progress there may in practice be a collection of them, distributed physically and architecturally. The best advice is:
Pay more attention to what your data means than where you put it.
To patch some of the data lake idea's manifest deficiencies, cloud providers have regularly added processing capabilities that mimic early data warehousing features - comically calling the result the "Data Lakehouse" (or the Databricks variant, the Delta Lake).
What is a data lakehouse?
According to Databricks:
A data lakehouse is a new, open data management paradigm that combines the capabilities of data lakes and data warehouses, enabling BI and ML on all data. ... Merging them into a single system means that data teams can move faster as they can use data without accessing multiple systems.
This statement is more aspiration than fact. Data warehouses represent forty years of continuous (though not always smooth) progress and provide all of the services that are needed, such as:
- AI-driven query optimizer
- Complex query formation
- Massively parallel operation based on the model, not just sharding
- Workload management
- Load balancing
- Scaling to thousands of simultaneous queries
- Full ANSI SQL and beyond
- In-database advanced analytics and support for ML
- Ability to handle native data types such as spatial and time-series
The fact is that some data warehouse platforms do perform all of these functions and more, and are central to the operations of businesses.
In the early seventies, the world was beset with an energy crisis. Some executives in Detroit decided that the US needed small cars, with which they had little experience, but they came up with a platform anyway. But Americans loved their pickup trucks, which accounted for a substantial share of the automakers' revenue, Ford and Chevy especially. When you have a terrible solution, the worst thing you can do is pile on more terrible decisions - enter the 1973 Ford Courier mini pickup truck, one of the most poorly designed, ill-conceived vehicles in history.
If you can query a JSON file in the Data Lakehouse with SQL transparently, you have accomplished something. But not enough. What troubles me the most is that the data lakehouse's excuse is that it's a data lake with some analytical capabilities. Those analytical capabilities are mostly inherited from the expanding capabilities of cloud services themselves. What I haven't heard about are understandability and usability.
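As a rough illustration of what "querying a JSON file with SQL" amounts to, the sketch below loads a small JSON document into an in-memory SQLite table using only Python's standard library. It is not how any particular lakehouse product works; the table name `orders` and the fields `id`, `amt`, and `cur` are invented for the example.

```python
import json
import sqlite3

# Hypothetical raw file contents; the field names are invented for illustration.
RAW_JSON = (
    '[{"id": 1, "amt": 120.0, "cur": "EUR"},'
    ' {"id": 2, "amt": 80.0, "cur": "USD"},'
    ' {"id": 3, "amt": 45.5, "cur": "EUR"}]'
)


def load_json_into_sql(raw: str) -> sqlite3.Connection:
    """Parse a JSON array of flat objects and copy it into an in-memory SQL table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amt REAL, cur TEXT)")
    rows = [(r["id"], r["amt"], r["cur"]) for r in json.loads(raw)]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


def total_by_currency(conn: sqlite3.Connection) -> list:
    """Run an ordinary SQL aggregate over the loaded rows."""
    cursor = conn.execute(
        "SELECT cur, SUM(amt) FROM orders GROUP BY cur ORDER BY cur"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    print(total_by_currency(load_json_into_sql(RAW_JSON)))
```

The query runs, and that is the accomplishment. Nothing in the file or the query says whether `amt` is gross or net, who recorded it, or whether the rows have been deduplicated, and that missing context is what understandability and usability depend on.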
What are cloud data warehouses?
There are principally three: AWS Redshift, Snowflake, and Google BigQuery. Many other relational data warehouse technologies have acceptable cloud versions, but the cloud-natives claim the high ground for now. At a certain maturity, they provide all of the functions listed above natively, rather than as bolt-on capabilities to generic cloud features. However, it does get a little blurry because the CDWs provide more than a traditional data warehouse. One, for example, provides a public data exchange market. I've noticed the word "warehouse" starting to disappear from their content.
Would you rather have a cloud-native data warehouse that can handle the most challenging data warehouse tasks but can also provide most of the functionality of a data lake (or, to put it another way, eliminate the need for a data lake), or would you prefer a data lake with partial data warehouse capabilities slapped on?
To sum up:
- The concept of a data lake is flawed. In an age of multi-cloud and hybrid-cloud distributed data, not to mention sprawling sensor farms of IoT, there is no advantage to pulling it all together. AI-driven knowledge graphs that locate and tag data where it is are a far better alternative.
- If you dismiss the data lake, you must of necessity dismiss the lakehouse.
- Pay more attention to what your data means than where you put it.
A data lake looks to be static "dumb" data neatly arranged. A data lakehouse, if you must use that term, is fundamentally different from a cloud data warehouse. A data warehouse is a comprehensive set of capabilities that provides a graph-based, linked, and contextualized information fabric (semantic metadata and linked datasets) in which NLP (natural language processing), sentiment analysis, rules engines, connectors, and canonical models for common domains are available. Add to that cognitive tools that can be plugged in to turn "dumb" data into information assets with speed, agility, reuse, and value. I haven't seen one of those yet.