Spark outpacing Hadoop? Survey says yes


Databricks' survey of Spark usage provides insights into an open source platform that supports multiple languages and platforms. That is very promising for the future of big data and streaming analytics.

The other day, we published a Hadoop adoption story that indicates the eponymous big data file system is moving from science experiment cum proof of concept to in-production value generator. Today, Databricks published the results of its own Spark-related survey of 1,417 respondents from 842 organizations and guess what? Databricks is not only claiming more or less the same growth and adoption vectors but also that Spark is outpacing Hadoop on the adoption and implementation curve. Both can be right.

Let’s be clear: surveys of this kind are riddled with methodological problems, largely because they address an already invested audience that is biased towards the technology it works with. However, in the case of Spark, I am inclined to be less harsh about this point. Here’s why.

Last year, I was asked to look into Spark as a potential solution for specific analytic use cases. At the time I felt that Databricks – the company most closely associated with the product – was still in science experiment mode. I said as much. But then, when I saw SAP provide its endorsement, along with enthusiastic take-up by the Scala community, I sat back and went ‘hmm.’ The survey certainly provokes a reaction but, more to the point, Spark is much younger than Hadoop yet has captured the imagination of powerful influencers in a way that took Hadoop fans years to achieve.

Here are the talking points from the Databricks survey in infographic form:

Spark survey highlights

Let’s discount the chest-thumping element on the left-hand side and instead concentrate on the reasons for choosing Spark and the industries that are adopting it.

Performance is considered paramount these days, especially when looking at the impact of streaming data, so that 91% figure is not surprising. Hadoop has always been considered something of a development beast, largely because it was never a finished product per se. You can say the same of Spark. The difference is that Spark is positioned as a platform upon which you can ‘bring your own language’ and that has important positive implications for attracting developers.

As always with relatively new technology, it tends to be the technology companies that pick it up first, so the industry adoption pattern is not surprising. Again, adoption among marketing and advertising businesses should not surprise. We’ve recently written about how dysfunctional media shot itself in the face, foot and narrowly missed its heart.

How is Spark used? See below:

Spark use cases

This is a very broad mix that supports the general purpose nature of the Spark platform. Again, the result is more or less aligned with what the Hadoop Maturity Survey discovered.

More important, 69% of respondents are building two or more solutions with Spark while 49% are building three or more products. That’s an important data point because, typically, data processing organizations only allow one data problem to be solved at a time.

Other data from the survey suggests that Spark is not simply a technical platform for data engineers: other professionals such as data scientists and data architects are using it in a collaborative fashion. So are business managers, although they are a small minority at this stage.

The survey doesn’t go into the reasons for these remarkable adoption curve numbers. I am willing to bet a lot of money this has a lot to do with the inclusive programming and platform environment Spark established from the get-go. Yes, the fashionistas in the Scala camp are up there as top users but Python is very popular too. Use of the Windows platform has grown 283% in the last year to 23% of all Spark projects noted in the survey. You don’t get that when languages and platforms are restricted or limited.

My take

I remember having discussions about language support with Vishal Sikka (now CEO of Infosys but then in charge of HANA development) when SAP HANA was groping for a purpose beyond feeds and speeds. There was a vocal minority of extremely well seasoned and influential SAP developers who wanted HANA to support languages beyond Java and ABAP, SAP’s proprietary language. Surveys showed the most common demand was for PHP. SAP dropped the ball on that one and is now a data source for Spark.

Elsewhere, we hear Oracle tout Java as the most popular language for application building, yet the Spark survey is telling a different story about solving tomorrow’s problems with that magic combination of ease of use and blistering performance.

In short, the mega vendors need to wake up and learn from what upstarts like Databricks are doing or, rather, from the outcomes achieved by those working on this platform which, by the way, is used in the cloud by 51% of respondents.

Disclosure: SAP and Oracle are premier partners at time of writing.