Do we really need a data lakehouse? Hashing out AI, cloud, and customer proof points with Databricks CEO Ali Ghodsi

By Jon Reed, June 17, 2021
Poking fun at vendors for pushing terms like data lakehouses is part of my job. But the real purpose here isn't humor, it's delivering value on data and AI rather than buzzword polishing. Databricks CEO Ali Ghodsi was up for the debate, and he brought customer proof points.

(Ali Ghodsi of Databricks talking shop)

Readers know I'm not a fan of enterprise buzzword bingo. So you can imagine how I feel about the spanking new term "Data Lakehouse," which, to me, sounds like a mashup of bad Sandra Bullock movies.

So why not a virtual 1:1 with the CEO of Databricks, a company that is betting heavily on its "data lakehouse" architecture?

I didn't get here first: Diginomica contributor Neil Raden already skewered data lakehouses (and other cloudy data terms) in his piece Data lakes, data lakehouses and cloud data warehouses - which is real?

The pandemic economy - an AI adoption accelerator?

I'll get to that - but my chat with Databricks CEO Ali Ghodsi started elsewhere. Ghodsi was a speaker at Collision '21, coming off a talk on AI adoption. So what are we learning? As Ghodsi told me:

If you look at the enterprise space today, especially since the pandemic started, everyone wants to quickly figure out how they can get AI into their business and become more data-driven... I think something happened in business leaders' minds in the last year, as they were sitting home and everything was turned upside down. Something clicked, and they realized that 'Maybe we should move into the future faster.' So there's this urgency.

Applying "AI" to massive data sets is nothing new - but the urgency has changed. Ghodsi pointed to a textbook example: the triumph of Google over Yahoo.

If you look at Google, people think of it as a search company, or an app company. But if you think about it, they would not be here today if they didn't have artificial intelligence and data as a strategic asset from the beginning. You'd probably be using Yahoo right now. Same thing with Facebook. It's not like they have your friends list, then you can click on your friends. It's the data and AI they leveraged in a really strategic way. Same with Twitter; same with Airbnb; same with Uber.

It was enough to displace incumbents. Ghodsi's data shows the pandemic adding AI fuel: "We see it in our usage numbers. Everything is accelerated significantly." And yet, AI and analytics is a crowded space. What is the unique pain point Databricks aims to solve? How is their approach different? Ghodsi says Databricks can be understood via three early bets: cloud-first, AI, and open source. Databricks bet on cloud early, back when cloud BI wasn't a thing.

Three Databricks bets - AI, cloud data, and open source

Ghodsi says every year, he had a standing bet with an (unnamed) Gartner analyst. Each year, the analyst would say to him: "This is a huge mistake to bet all your business in the cloud." Each year, Ghodsi would ask him, "You still think it's a big strategic mistake of me staying in the cloud?"

Each year, the answer was the same: "Absolutely, I talk to customers all the time. There's this thing you don't understand: all data has gravity, and the data is on-prem; it's not in the cloud."

Well, that changed. All three bets worked out, though I'd argue open source has issues, e.g. commercialization, and proprietary caretakers whose agendas are hardly neutral. But when we fast forward, what makes Databricks unique today? Ghodsi:

What's special about Databricks is we're the only company right now that combines massive data processing with AI. The market is split for historical reasons. There are vendors that do data management and data processing, like Snowflake, and they're great for data processing - but they have no AI or machine [learning] capabilities whatsoever.

And then there are startups, on the other hand, that do machine learning and AI. They're great for the machine learning algorithms, but they actually are not in the business of processing massive petabytes of data... So we're the only vendor that combines those two into one product.

Databricks customer proof points - Shell, Starbucks and Regeneron

I'm not sure Snowflake would agree with that "no AI or ML capabilities whatsoever" assertion, but we move on. Here's my next problem: customers don't call vendors and say, 'I need some AI.' No - they have a business problem to solve. How does Databricks fit in? Ghodsi responded:

When you said, 'All vendors say they do AI,' look at the use cases on their web pages. Let me tell you mine. Regeneron is a pharma company; they found the genome responsible for chronic liver disease in Databricks, using AI, and they actually have a drug [to treat that] now.

Okay, that's a good use case. Ghodsi added:

At Regeneron, they built up a phenotype database, and a DNA database with lots of patients. So you have diseases on the one hand, and you have DNA on the other side. And you want to quickly iterate between those, and find the needle in the haystack: the gene markers that are responsible for that disease.

Using Databricks, Regeneron accelerated drug target identification. They reduced the time it takes data scientists and computational biologists to run queries on their entire dataset from 30 minutes down to 3 seconds - a 600x improvement. Ghodsi's next slide was Starbucks:

They have so many different use cases. Today, using Databricks, down to every SKU of the stuff they sell, to over 30,000 stores in the world, they can predict and forecast sales and revenue. It's all machine learning, and it's based on real-time patterns, how people have been buying stuff, the diurnal patterns, what's happening in their other stores, and then that forecasting is done.

A 50x to 100x improvement in processing time is another Starbucks proof point. For his third customer example, Ghodsi walked me through Shell:

Shell has over 200 million valves. Each of these valves has a sensor attached to it, and it just spews out lots of data all the time. That data comes into Databricks; the technology that we have is called Lakehouse.

We predict the likelihood of a particular sensor breaking down. We're actually not that great at it; we have roughly 70% accuracy. If we say it's likely to break down in the next month, we're right 70% of the time. Not 99% or anything like that.

As it turns out, 70% accuracy is still a big deal:

It's massively cost efficient to go replace that right away, when we give that prediction. And that saves lives; that's better for the environment. And it also saves a lot of money.

My take - call it a data lakehouse if you want, just give me the customer proof points

Did you notice how Ghodsi snuck in the 'data lakehouse' term there? I'm not going to spend much time sweating this particular buzzword. I realize there is some benefit in drilling in, as Raden did, to understand the semantics - and, say, the difference between a cloud data warehouse and a data lakehouse. Databricks has their own answers here, and justifications for this terminology. I don't really care about that. Nor do I care about Ghodsi's assertion that they are the only company that has productized AI and massive data processing. Here's what I care about:

You say you're using AI and "big data" to help customers - let's see the proof points. There, Ghodsi was ready, with multiple blue chip customers to point to, operating at scale. He was able to talk speeds and feeds, but also business results. Ghodsi also wanted to talk about Comcast, and how Databricks uses voice data from Comcast and converts it to digital data inside their Data Lakehouse, and "actuates it in real-time." Meaning: instantly responding to customer voice requests.

My next step would be to drill into this with Databricks customers, which I hope to do in good time. But in the meantime, Databricks is full speed ahead. They conducted their own major virtual event, the Data + AI Summit (replays available through June 28, 2021). Timed with the show, Databricks issued a slew of product news. One news item that caught my eye: Databricks Unveils Delta Sharing. From the press release:

Today, at the Data + AI Summit, Databricks announced the launch of a new open source project called Delta Sharing, the world's first open protocol for securely sharing data across organizations in real-time, completely independent of the platform on which the data resides.

If traction can be achieved, this opens up possibilities beyond the data access issues inherent in proprietary systems. A so-called "neutral governance model" is certainly something to push for, though industry-wide adoption is always the rub. But unraveling all these news stories is beyond my scope today. I have a hunch I'll be debating such topics with Ghodsi again.