The NoSQL and Hadoop disruptive open source dividend

By Den Howlett, June 10, 2015
Summary:
Open source technology has arguably become mainstream, especially in modern databases like NoSQL and Hadoop based systems. They are unlocking huge value.

Everyone's hiring at Hadoop Summit

diginomica is largely a business focused technology property. We are less concerned with the minutiae of the technology and lean more towards what it delivers. But in recent weeks, we've been spending a good amount of time at technical events organized by open source leaders like MongoDB, Couchbase and Hortonworks. I've also spent quality time with people who are working on groundbreaking projects where open source is front and center.

I'm not going to pretend that we've suddenly become experts in NoSQL, document databases, YARN or a myriad of other technologies. What I will say is that I've observed a very clear set of value propositions that make the world of the relational database, and many of the follow-on ERP, CRM and supply chain applications, look archaic. I say that not with any intent of malice or out of a newfound wonder for technology. I say that because customers are solving problems at scale that would previously have been unimaginable.

Why open source matters

Vishal Sikka - Infosys

I'll start with a conversation I recently had with Vishal Sikka, CEO Infosys. That company is rapidly building out a competency center that embraces a slew of open source technologies and which underpins new, value based services it offers to some of the largest companies in the world.

I was interested in his position on this topic because there is a direct contrast with his past life as the engineering head at SAP driving HANA, SAP's fast in-memory database, which the company expects will replace relational systems from Oracle, IBM and Microsoft for both transaction processing and analytics. HANA is a proprietary system. It is commonly sold as a license with attendant support and maintenance costs, in more or less the same way as any other relational database.

Sikka's vision for SAP HANA always steered in the direction of solving some of the world's biggest problems and, in that regard, he would cite work done on accelerating genome sequencing at Stanford Medical School. While that is a big problem, Sikka would also muse on solving problems of a more nebulous nature. It was never clear to me what he meant, but I have since come to learn that in his mind, software should be capable of solving known unknowns.

These are problems that come from left field and which, on the face of it, have no clear or obvious answer. They don't fit into particular boxes but are representative of industry-specific or large-scale business problems. Linking this back to open source, he told me something profound that has informed my thinking as I've met with executives, developers and customers on the event circuit:

We have to use open source, be active in those communities and be good actors through code contributions for one very simple reason. In today's world, no one company can know all the answers to its customers' problems. Even when you have domain experts, it's not enough. We need the community to help us find answers to complex problems and you only get that in open source communities. You cannot get that in a vendor's forum because that's not their focus. We have to combine that with techniques you know as 'design thinking,' the stuff going on at the d.school for instance, so that we can learn how to solve problems and then apply the right technology answer. In short, it's about constant learning. It is the only way.

Diverse proof points

That's quite a statement, but it makes perfect sense when you understand some of the topics with which companies are wrestling. It doesn't go unnoticed that representatives from GE have been showcased at both Couchbase and Hortonworks events, talking about how you solve for predictive pipeline maintenance in hostile environments. Or how you keep complex aircraft engines running.

Then there was John Wilson, chief medical information officer at Optum, talking about the use of Hadoop in a specialized use case:

We can predict when diabetic patients are not taking their meds and can intervene where necessary at scale. That's 7,000 patients under our care.

Peter Crossley - Webtrends

Or how about Peter Crossley, director architecture and technology at Webtrends. His company provides digital marketing solutions for highly regulated industries. The scale at which they're operating is staggering: 2,000 plus customers, collecting and analyzing data from 25,000 sources which equates to 13-15 billion events per day and growing at the rate of one petabyte every 6 months. He argues that the traditional world of SQL databases simply cannot cope at scale or at economical price points.

Using Hadoop and commodity hardware we're driving down cost by 20-40%. Huge savings.
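To put the Webtrends event volumes quoted above into per-second terms, here is a rough back-of-envelope sketch. It assumes events arrive evenly across the day, which real traffic will not do, and the function and variable names are purely illustrative.

```python
# Back-of-envelope conversion of the quoted daily event volumes
# (13-15 billion events per day) into an average per-second rate.
# Assumption: events are spread evenly over 24 hours.

SECONDS_PER_DAY = 24 * 60 * 60


def events_per_second(events_per_day):
    """Average events per second for a given daily event count."""
    return events_per_day / SECONDS_PER_DAY


low = events_per_second(13e9)   # lower bound quoted in the text
high = events_per_second(15e9)  # upper bound quoted in the text

print(f"{low:,.0f} to {high:,.0f} events per second on average")
```

On those assumptions the average works out to somewhere between roughly 150,000 and 174,000 events every second, before allowing for peaks.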

On the question of open source, he says:

The point about no one company having all the answers is absolutely right. More to the point, what happens if you mess up? Open source helps us avoid big mistakes. The community is there as a sort of safety net.

Russell Foltz-Smith - TrueCar

Then there is Russell Foltz-Smith, VP data platform at TrueCar Inc. He explained to me the intricacies of the car buying market in the US and how the bewildering number of variables and their changing nature have an impact on the price a person will pay for a vehicle.

It’s an endless search to discover what’s going on in the marketplace. It needs to be an open platform because we have to acquire all the data. We jumped in head first (with Hadoop) because you couldn’t continue to use relational databases.

I'll talk more about TrueCar in a later piece, but earlier I discussed the Ryanair case, where a move to Couchbase solved a very real and complex problem around data synchronization in situations where a customer's connectivity is spotty, and where the solution has the side effect of reducing network usage.

The other week, Jon Reed met with Apervita. They said:

All of you are individuals, and your data is unique at every health care institution you go to. The uniqueness of the data led us to a place where we can’t deal with a schema, we can’t deal with a database where we have to try to pigeonhole pieces of data into well known areas and well known descriptors...

...We have a big scaling problem: everyone has a ton of data, and we need to evaluate as much of it as possible. MongoDB was a very quick decision for us, and we’re very happy with it. Our entire platform runs on MongoDB, everything from user information to analytics results to patient information. It’s all one database. We want to keep it a simple, scalable stack.

You see the common threads here? Highly differentiated problems in vertical markets at massive scale but at minimal cost.

Early days

Ron Bodkin - Think Big

Where does this lead us? I asked that question of Bob Wiederhold, CEO Couchbase, Ron Bodkin, president Think Big and Tim Hall, product marketing at Hortonworks. All agreed that the business model to support the new classes of application and analytics can no longer operate within the context of proprietary databases, and that the role of the commercial open source providers is one geared towards service, support and a simplification of existing methods. For me, Bodkin put it well when he said of the current state of the art:

Nobody’s yet written the book on modeling patterns for big data.

Looking across the commercial varieties of business, I somehow doubt anyone will. Which brings us back to the question of community and how that plays out alongside advances being made to reduce the complexity of managing the moving parts inside, say, a Hadoop system. Hall made the important point that:

Tim Hall - Hortonworks

We used to see Hadoop as the file system and MapReduce. Now it's a platform and with that come fresh needs. (Among other things) we're looking to collect information from customer clusters and provide recommendations about how they tune their configurations.

It is clear there is a long way to go but again, what struck me is that companies have an appetite to build because they can't buy and that while it is early days in the NoSQL and Hadoop worlds, the community is winning.

Small is the new big

Even so, there are some perceptual hurdles to overcome. In a recent Gartner blog, Merv Adrian, one of the world's best thinkers on databases, said:

...a procurement perspective on Hadoop is that it is a tiny subset of the $33B DBMS portion of the information management market. It’s healthy, and growing, and has an enormous amount of upside adoption potential. It may show associated growth in revenue – though this is not yet clear. Commercial open source software revenue may not scale as linearly with deployment as commercial closed source software does. But that’s a topic for another post.

In pure financial terms, Adrian is right. He adds up the top players as earning some $150 million a year in top line revenue. That's likely light given that Hortonworks' recent Q1 FY2016 alone showed revenue of $23 million. Soundings elsewhere suggest both Cloudera and MapR are doing just as well. On my math, these three are closer to $300 million, but that's still less than one percent of the total market. I think that misses the point.
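For readers who want to check the share-of-market arithmetic, a quick sketch follows. It uses only the figures quoted above (the $33B DBMS market, Adrian's $150 million estimate and my rough $300 million estimate); the names in the code are illustrative and carry no meaning beyond this example.

```python
# Rough check of the market-share arithmetic, using only figures
# quoted in the article. All names here are illustrative.

DBMS_MARKET_USD = 33e9        # Gartner's $33B DBMS market figure
ADRIAN_ESTIMATE_USD = 150e6   # Adrian's estimate for the top players
AUTHOR_ESTIMATE_USD = 300e6   # the author's rough estimate


def market_share_pct(revenue_usd, market_usd=DBMS_MARKET_USD):
    """Return revenue as a percentage of the total market."""
    return 100.0 * revenue_usd / market_usd


adrian_share = market_share_pct(ADRIAN_ESTIMATE_USD)
author_share = market_share_pct(AUTHOR_ESTIMATE_USD)

print(f"Adrian's estimate: {adrian_share:.2f}% of the DBMS market")
print(f"Author's estimate: {author_share:.2f}% of the DBMS market")
```

Either way the result sits below one percent: about 0.45% on Adrian's number and about 0.91% on the higher one.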

Alan Saldich - Cloudera

The stats suggest that Hadoop as a technology investment is the fastest growing of any category in history. As Adrian points out, that doesn't necessarily equate to revenue in the short term but the indications are fascinating. Alan Saldich, VP marketing at Cloudera said to me:

Look at Tableau and how long it took them to reach $100 million in revenue. Now look at where we're at. We're well ahead of the curve. But also we've moved on with use cases and customer types. There's plenty of variety if you look around.

To take it one step further: you don't get companies like Intel investing, or a total of $1.2 billion in funding, unless you're riding a rocket ship.

The trend towards a services-based software economy drives out the license cost, something none of the open source companies can charge for. Instead, they get to charge for other services like security, governance, UI development and support.

More to the point, a pure financial measure at this point in the cycle provides no insight into the data quantities, database instances, clusters or nodes in operation. Therein lies a real problem for everyone.

On the one hand, Adrian, and Gartner more generally, are suggesting that some 46% of large enterprises are active or will be active on Hadoop in the next 12-24 months. On the other hand, Wiederhold freely acknowledged that Couchbase really doesn't know how many of its downloads are self-supported or are in test ahead of future support.

The measure I'd like to get a handle on is a count of nodes in a Hadoop environment. All vendors are tight-lipped on this, but I learned of at least one installation where the node count runs into the low tens of thousands. That's a BIG number.

Last words

I am more interested in the impact that these technologies are having on customers' ability to solve problems. That's what matters. The use cases and examples are adding up at a good clip. The diversity is fascinating. The early results are staggering. I sense that in only a few years we will be discussing, with awe, some of the most extraordinary stories to emerge from these technologies.

Oh - and by the way. I wonder what happens to those relational license sales? After all, Couchbase has just introduced SQL for documents in its N1QL offering.