Data lakes are typically standalone objects; that is, they are not designed around any particular requirements or upstream/downstream dependencies. In a way, they represent the rebirth of the original data warehouse concept: getting all the data in one place, without the limitations of schema and scale, but also without anything that helps people use it. In this form, organizations are finding that getting value from the investment is elusive. Data lakes lack most of the capabilities needed to make them useful to anyone other than data scientists and IT developers.
The reason is that there is a yawning gap between a Brobdingnagian (OK, huge) collection of datasets with mixed-up formats, semantics, and types and an organized data warehouse. Without a substantial enhancement and translation process that provides both an abstraction layer and pipelines from the data lake to operations and analytics, reaping value from the effort is elusive.
Given the size and variety of data that data lakes typically contain, it would be preferable to build abstractions, maps, graphs and catalogs so that data can be “discovered” and manipulated with the power and scale of the physical data lake.
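To make the idea of a catalog concrete, here is a minimal sketch, assuming a toy in-memory registry. The class names, fields, paths, and tags are all hypothetical and chosen only for illustration; a real catalog would persist its metadata and handle schemas, lineage, and access control.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one dataset in the lake (all fields are illustrative)."""
    name: str
    location: str       # hypothetical storage path
    data_format: str    # e.g. "parquet", "csv", "json"
    tags: set = field(default_factory=set)


class Catalog:
    """A toy in-memory catalog: register datasets, then discover them by tag or format."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def find(self, tag=None, data_format=None):
        """Return sorted names of datasets matching every criterion that was supplied."""
        matches = []
        for entry in self._entries.values():
            if tag is not None and tag not in entry.tags:
                continue
            if data_format is not None and entry.data_format != data_format:
                continue
            matches.append(entry.name)
        return sorted(matches)


catalog = Catalog()
catalog.register(DatasetEntry("web_clicks", "lake/raw/web_clicks/", "json", {"marketing", "raw"}))
catalog.register(DatasetEntry("orders", "lake/curated/orders/", "parquet", {"sales", "curated"}))
catalog.register(DatasetEntry("campaigns", "lake/curated/campaigns/", "parquet", {"marketing", "curated"}))

print(catalog.find(tag="marketing"))                       # ['campaigns', 'web_clicks']
print(catalog.find(tag="curated", data_format="parquet"))  # ['campaigns', 'orders']
```

The point of the sketch is that discovery works on metadata, not on the files themselves: a user asks for "marketing" data and gets back names and locations without scanning the lake.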
What this means is that Hadoop, which has been synonymous with data lakes, is not the ideal platform for a data lake.
SiliconANGLE, in “The sun sets on the big-data era: HPE to acquire MapR’s assets,” opines:
“MapR was one of three high-profile startups, along with Cloudera Inc. and Hortonworks Inc., that collectively raised more than $1.5 billion in funding in the heady early days of the big data movement. Its CEO publicly announced plans for an initial public offering in 2015, but customers’ faster-than-expected move to the cloud, combined with the market’s overall ebbing interest in Hadoop, sent the fortunes of all three companies spiraling downward shortly thereafter.”
First of all, MapR forked from open source a long time ago, so I don’t agree that their demise was caused by an ebbing interest in Hadoop. I was personally chided a few years ago for referring to MapR as one of the “distros.”
Quite the contrary, by distancing themselves from Hadoop distributors, they failed to identify themselves as something else.
Platforms for AI are a dime a dozen now and I believe their problem was execution and positioning. Now, as for the ebbing interest in Hadoop, it’s far worse than that. It goes back to the eruption of interest and what Hadoop turned out to be.
Hadoop found its way into the enterprise for one reason: it was cheap. The dream of data warehousing hit the wall in terms of scale and cost, and was late to address big data. (I don’t agree with the claim that data warehouses were difficult to maintain. They weren’t if you knew what you were doing.)
But once Hadoop was put to enterprise applications, IT wanted better servers, not cheap commodity servers, and SSDs, not spinning disks. Performance was not good because it was all batch, which brought in Spark (in-memory), and suddenly Hadoop wasn’t cheap at all. To make things worse, as organizations became more comfortable with the cloud, it became abundantly clear that Hadoop was not developed for the cloud and failed to exploit many of the advantages cloud platforms offered.
Hadoop was great for analyzing weblogs and indexing search engines. But once its deficits became apparent, the cloud vendors began to offer competing alternatives. So this isn’t a case of “ebbing” interest, it’s a full-scale retreat.
So let’s get back to data lakes. Who came up with the idea? Hadoop vendors, of course. What better way to get enterprise license revenue? Massive scale of storage, a single place for everything, but it was like a bridge to nowhere: now what? Hadoop as a place for the lake lacks so many functions it’s impossible to list them all, but for starters:
- Hadoop performs poorly with small datasets. MapReduce operates at the file level and is best for batch processes, but it is limited for interactive queries. Data warehouses always have both large and small files.
- While there were many distributors of Hadoop initially, as of this writing it is down to one: Hortonworks was blended into Cloudera; Microsoft, IBM and Pivotal are no longer in the market; and MapR effectively shut down and sold its remaining assets to HPE.
- Open source Hadoop is lacking in security functions. Over time the Apache Foundation delivered some modules, but the most-used ones were developed by the Hortonworks/Cloudera company, not as open source (though that changed recently), leaving only a single source. Hadoop on its own ships with security disabled by default, so implementing security is a difficult task for a programmer.
- Because Hadoop is written largely in Java, a widely used and frequently targeted language, it is exposed to malicious hackers.
- Hadoop is not ideal for real-time analytics. Hadoop still runs batch processing, and response time is not up to standard. Hadoop distributors have layered on Spark and Kafka to improve this, but the effort of configuring them and keeping them running is extensive. For data science and ML applications, if not for interactive analytics, Kubernetes is coming on strong.
- Most Hadoop implementations were on-premises. Hadoop did not have a cloud-native offering, and even today it struggles to exploit the cloud’s advantages.
- Most important of all, Hadoop was never designed for interactive, real-time analytical processing.
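The small-dataset point in the list above can be illustrated with some back-of-the-envelope arithmetic. This is a rough sketch, not a sizing tool: it assumes a 128 MiB HDFS block size (the default in recent Hadoop releases) and the often-quoted rule of thumb of roughly 150 bytes of NameNode memory per file or block object. The function names are hypothetical, and real clusters will differ.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # assumed HDFS block size: 128 MiB
BYTES_PER_OBJECT = 150       # assumed NameNode heap per file or block object (rule of thumb)


def namenode_objects(total_bytes, file_size):
    """Count file objects plus block objects when total_bytes is stored in files of file_size.

    Simplification: every file is assumed to be exactly file_size bytes.
    """
    files = math.ceil(total_bytes / file_size)
    blocks_per_file = math.ceil(file_size / BLOCK_SIZE)
    return files + files * blocks_per_file


def namenode_heap_bytes(total_bytes, file_size):
    """Estimate NameNode memory needed to track the files and their blocks."""
    return namenode_objects(total_bytes, file_size) * BYTES_PER_OBJECT


ONE_TIB = 1024**4
small = namenode_heap_bytes(ONE_TIB, 1024**2)   # 1 TiB stored as 1 MiB files
large = namenode_heap_bytes(ONE_TIB, 1024**3)   # 1 TiB stored as 1 GiB files

print(f"1 MiB files: {small / 1024**2:.1f} MiB of NameNode heap")
print(f"1 GiB files: {large / 1024**2:.1f} MiB of NameNode heap")
print(f"ratio: {small / large:.0f}x")
```

Under these assumptions the same terabyte costs a couple of hundred times more metadata when it arrives as small files, which is one reason a mixed workload of large and small files sits awkwardly on HDFS.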
Hadoop’s infrastructure requires a great deal of system administration, even in cloud managed systems. Administration tasks include: replication, adding nodes, creating directories and partitions, performance, workload management, data (re-)distribution, etc. Core security tools are minimal, often requiring add-ons. Disaster recovery is another major headache. Although Hadoop is considered a “shared nothing” architecture, all users compete for resources across the cluster.
If Hadoop turns out to be a less than ideal platform for a data lake, what is, assuming a data lake is needed at all? I believe it is, as long as it fulfills some basic requirements:
- It should be in the cloud. Not just one cloud, either. There is no reason for a data lake not to be distributed across cloud instances.
- Cloud services, such as Amazon’s S3, offer compelling non-Hadoop solutions for data lakes because of their virtually unlimited scalability.
- Cloud vendors need to provide much more than storage advantages: backup and archive, data catalog, analytics, machine learning and even data warehousing.
- Cloud-native products like Snowflake can actually provide standalone data lake capabilities.
Hadoop was shoehorned into applications it wasn’t suited for. Its cost advantage dissipated once applications went beyond batch search and indexing. It offers no scalability features beyond what cloud services provide. Successful Hadoop applications going forward are going to be complicated amalgams of modules, less coherent than the offerings of cloud vendors, which have massive capital to bring to bear.