If you're paying attention to the 'big data' topic then you'll know that Hadoop is widely regarded as the platform of choice for storing and processing large data sets. It does, however, suffer from several important limitations.
- Hadoop typically runs in batch mode, which means that it cannot be considered for real-time analytics on its own. Whether real time is desirable, and what 'real time' means, is another topic for conversation, but current thinking is that 'right time', which may well mean minutes or hours, is a more appropriate way of describing the need for iterative analysis.
- Each time you want to get something from Hadoop, a custom MapReduce program has to be written. This is a drain on resources that harks back to the days of queuing up IT report requests, and it effectively adds latency to decision-making.
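To make the second point concrete, here is a toy sketch of the difference between hand-writing a map/reduce job and asking the same question in SQL. It is plain Python using only the standard library, not the Hadoop or Spark APIs; the `SALES` data, the function names and the SQLite table are all assumptions made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample records: (region, sale amount).
SALES = [
    ("east", 120.0),
    ("west", 80.0),
    ("east", 30.0),
    ("north", 50.0),
    ("west", 20.0),
]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def totals_map_reduce(records):
    """Total sales per region, written as explicit map, shuffle and reduce steps."""
    return reduce_phase(shuffle_phase(map_phase(records)))


def totals_sql(records):
    """The same question asked declaratively against an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
        rows = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(totals_map_reduce(SALES))
    print(totals_sql(SALES))
```

Both functions return the same totals per region. The point of the comparison is that the first needs a developer to write and maintain three functions for every new question, while the second is a single query that an analyst who knows SQL could write unaided.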
Alteryx and Databricks believe they have the answer to these problems in the shape of Spark, an Apache open source engine that overcomes Hadoop's limitations.
These companies (Alteryx and Databricks) will become the primary committers to SparkR, a subset of the overall Spark framework. In addition, Alteryx and Databricks are announcing a technology and go-to-market partnership to accelerate the adoption of SparkR and SparkSQL, in order to help analysts get greater value from Spark as the leading open-source in-memory engine...
...The collaboration between Alteryx and Databricks will foster faster delivery of a market leading in-memory engine for R-based analytics within Hadoop that is available for the Spark community.
Sound interesting? It should, because as we noted in the past:
Alteryx has largely felt that the so-called data science capability is in the hands of too few people...
...Alteryx is building the ETL platform it believes is needed for the 21st century where self-service and end user prepared modeling is the topic du jour. That means aiming squarely at SAS as a competitor and, like Numerify, tackling the problem of blending multiple data sources.
- This is an important development. Improving the tools that work with Hadoop should allow for broader adoption beyond those businesses that can afford in-house data science skills. The idea of democratizing access to large data sets is a net good and should allow both Alteryx and Databricks to grow rapidly.
- But the message will require a lot of amplification, and the new tools will need to reach version 1.0 before companies are prepared to commit resources.
- The good news is that since this is another commercial open source initiative, it should drive interest from SQL developers working in the analytics field who so far have not got to grips with Hadoop. At least that is the expectation.
- The even better news is that the Databricks founders are the people who created Spark. That should mean technical development will proceed apace.