DataStax is a company to keep an eye on. Although it is not even four years old, its enterprise offering of the Apache Cassandra database now powers some of the biggest internet companies out there – including the likes of Netflix and eBay. Companies that need to power heavy web applications are beginning to see the appeal of Cassandra's distributed NoSQL architecture, which can scale rapidly and run on cheap commodity hardware.
That's not to say that the market is tied up. Of course it's not. Relational databases provided by the likes of Oracle, Microsoft and SAP still dominate, and it's going to take a while for customers to turn their backs on what is a multi-billion dollar market.
However, DataStax has made good headway in four short years, and its co-founder, Matt Pfeil, told me that it is already in 25% of Fortune 100 companies (in some form or another). He added that he is certain this figure can rise to 100% over the next decade.
In ten years from now we should easily have Cassandra in 100% of the Fortune 100. In fact, it should not take a decade and I don't think that that's going to be hard to achieve.
Bold words. When talking to Pfeil I don't get a sense that he's delusional, or that he's driven by marketing hype. When he's talking about the benefits of Cassandra and DataStax for the enterprise, he simply talks about why it makes better economic sense (cheap hardware, cheaper software) and that the “computer science” behind the system is less constrained for those that want big, powerful online applications. If the likes of Netflix are managing to run 100% on Cassandra, then it's an argument worth considering. Here's what Pfeil said:
We are in the data age and data itself is driving all these changes. Why? The answer to that is pretty easy. The answer is because for the first time in the history of the human race, it's economically feasible to store unlimited data. I can go online and buy a 3 TB hard drive for like $99. It's very desirable to store unlimited data, you don't have to guess anymore, you can look for trends.
Commodity hardware works from an economics perspective, but you also need really smart software that can manage it. The problem is that from an architecture perspective, traditional systems weren't built with that in mind, and even worse, a lot of the providers latched on to high end hardware to make their revenue numbers. But once you've got that double marriage of both computer science [Cassandra] and the business incentive [cheap hardware], the market begins to shift somewhere else and a lot of the traditional players over the long term will begin to suffer.
Pfeil said that although DataStax has customers, such as Netflix, that are running 100% on Cassandra, what tends to happen is that enterprises look to NoSQL to run alongside traditional relational databases and power some of the newer features of their web applications. Customers start new projects with Cassandra as the database of choice while keeping their old relational databases running alongside, with the application tying it all together. He said:
As a user I don't care if there's two different databases behind the scenes, and your relational database is powering features from the first version and the new parts are being powered by something else. As the application evolves we just see less and less go on to the old stuff and honestly, at some point, it just disappears.
But what about the pain points for customers? I'm always interested to hear how the end-user is adapting to a change in technology, and although the vendor is never the best person to ask, Pfeil did say that it takes customers some time to get used to the fact that they are no longer constrained by the structures of a relational system. He said:
It's a little bit like the Matrix. You remember the red pill and the blue pill? It's a slightly different way to think about data, you sort of have to free your mind. When you use relational technology, there's restrictions on what you can do. But all of those rules disappear, so you start thinking if I can store it all, what can I do with it? Over time it starts to hit you that there aren't rules anymore, because you look at the data differently - it's literally the whole picture. Imagine if we could drive on either side of the road and it was safe to do so, it would take a little bit of time before I got in the car and started driving on the other side of the road.
Integrating with Spark
Given that the likes of SAP are partnering with Databricks to take advantage of Hadoop and the open-source market for large datasets (see Den's excellent piece on the implications for that deal here), it's perhaps unsurprising that DataStax is doing something similar and has made Spark integrations possible for its latest release, DSE 4.5.
Pfeil explained to me that although his mission is to make DataStax the best online operational database that “the planet has ever seen”, the company also recognises that its clients are demanding integration and partnerships from across the open-source community. “Anyone that says that you can be the best at everything and do it all in-house is probably full of shit.” Agreed.
- SAP HANA gets some Spark from Databricks (diginomica.com)
- Alteryx and Databricks team up to simplify and accelerate Hadoop analytics (diginomica.com)
- DataStax strengthens foundations for relational database assault (diginomica.com)
Spark is an evolution of MapReduce that allows companies to improve the performance of their analytical queries. For example, if you are doing similar queries over and over again, Spark will keep recent result sets in memory to ensure a faster retrieval. But Pfeil was honest about when it should and should not be adopted – i.e. it should be used to run analytics, but it shouldn't tread on the toes of DataStax and be used for online operations. He said:
The question I asked when I was learning about Spark was – is it analytical or operational? If the answer you are seeking has already been asked, it's probably fast enough to be operational. The problem is, since you don't always know what a user is going to do, you need operational queries to be uniform in performance. And if you ask a question that is not already cached in memory, the performance can not be what is needed in an online environment.
In an online environment you need guaranteed response times – if you go to Amazon.com you don't want to search for something and 80% of the time it is fast, then the other 20% of the time it takes 20 seconds. You give up. You need that uniformity – Spark isn't built to guarantee that, it's built for optimal use cases, but not every use case. So it's very much an analytical, data warehouse type engine. We integrated it because people do have a desire when they're building on that operational data, to ask analytical questions of it.
Say you are building a music recommendation service, it could be very beneficial if you know every song everyone has ever played and you want to know what was the most popular song over the last six weeks.
The future? DataStax as an all-round data solution
Given Pfeil's optimism and the rate at which DataStax seems to be growing (he mentioned that the company is aiming for an IPO over the next three to five years), I began to question how the enterprise offering of Cassandra will develop. Will DataStax stick to what it's good at, or will it start branching into other areas? According to Pfeil, DataStax is a platform play (as is the case with almost everyone else), and what we are likely to see in the long term is the emergence of offerings that are all-round data solutions. Pfeil said:
It's a platform play, so there's no company that shouldn't be using Cassandra – that's not my sales hat, that's my computer science hat. So it's really about how do we provide offerings that make all their lives better. The cool thing about being a database is that no one buys a database because they want to, they buy a database and build their own solution because the solution doesn't exist. They have to build their own. The good thing is that we have a really good database, so the more people that use it, we can look for trends and we can work it up over time to the point where we are offering a full business solution. But you never want to stop the database part, because that's how you learn about the use cases.
A decade from now I think there will be a shift to: how do we provide the best data solutions the planet has ever needed? And the database is a component of that.