For those unfamiliar, Databricks is a company rapidly moving from 'academic science project' status to Hadoop platform provider with a cloud twist. In short, it is betting that a significant number of big data workloads will go straight to the cloud. For that to happen, building applications needs to be easy, so Databricks is working to make development and deployment as simple as possible atop what is otherwise a highly complex set of technologies.
The certification program is one step in that direction. So what's happening? From the blurbs:
Databricks’ Spark experts and O'Reilly's editorial team are creating a program – consisting of a formal exam and subsequent certification – that establishes the industry standard for measuring and validating Spark technical expertise. As enterprises increasingly focus on turning data into value, Certified Spark developers can help them take advantage of Spark’s combination of sophisticated analytics and blazing speed to deliver deeper insights, faster.
Of certification and differentiation
On the call, I made no secret of the fact that I find certification exams of limited value. To me, they tend to demonstrate a person's ability to pass an exam rather than provide a clear indication of practical competence. Databricks agreed that can be the case, but Arsalan Tavakoli-Shiraji, head of customer engagement, said:
We’re in the early days so it’s a little easier for us to figure how people might game the exam. This is an evolving thing and not limited to individual certification for which we are seeing a demand. We're looking at certification of SI companies based upon their real world deployments. We see this as an additional level of comfort for customers as Spark usage grows among customers.
Databricks remains clear that its business model is fundamentally wedded to the idea of a community built around open source Spark. When asked how it differentiates from, say, Platfora or ClearStory, given that it doesn't own its own Spark distribution, Tavakoli-Shiraji said:
If you go back to Databricks Cloud then you see we’re a platform company and not a SaaS company. We’re much more like Apple in that regard. Right now we’re getting a ton of apps certified through the Certified on Spark initiative launched earlier in the year. We feel that not only acts as a validation but it makes the platform more sticky.
Update support for Spark?
I then switched gears to ask more technical questions. For example: what are the company's plans to support updates/writes in Spark and its ecosystem? The answer may seem less than perfect, but despite the company's fast growth, it knows these are very early days. What's more, the variety of Hadoop-based projects is truly staggering, and that poses challenges in terms of setting direction.
At the core, Spark is not designed to be a transactional engine. One area we see that changing is streaming. Elsewhere, we’re starting to see combinations of Spark and Cassandra. Can we extend Spark? It's always a possibility but that's not really the direction we want to take right now. Spark is agnostic to the underlying data store and we prefer to start with one platform for everything then see what happens. As a cloud offering, we get far more data about what customers are doing so we have a way to understand what the use cases look like and take direction from there.
This was an interesting response because it was the second time in as many days that I'd heard Spark and Cassandra mentioned in the same breath. For those unfamiliar, Cassandra is an open source database used where companies need high availability, scalability and performance for analytics.
It is in use at many well-known brands such as CERN, eBay, GoDaddy, Hulu, Intuit and Netflix. The largest reported deployment is at Apple, which stores 10 petabytes of data across 75,000 nodes.
Speeds and feeds?
Finally, I wanted to know about typical compression rates achieved in memory. This is not an esoteric question. In-memory databases can run queries very quickly, something that is exercising the minds of engineers trying to understand what 'real-time' actually means. With speed comes a cost tradeoff since memory is expensive compared to spinning disks.
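Part of the difficulty is that compression rates are inherently data-dependent, so there is no single honest number to quote. A toy sketch makes the point — here Python's stdlib `zlib` stands in, purely as an assumption for illustration, for whatever columnar compression an in-memory engine might apply:

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size as a fraction of original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# A highly repetitive byte stream, like a low-cardinality column,
# compresses extremely well.
repetitive = b"status=OK;" * 10_000

# Random bytes are essentially incompressible; the 'compressed' copy
# even carries a few bytes of format overhead.
random_blob = os.urandom(100_000)

print(f"repetitive data: {compression_ratio(repetitive):.3f}")
print(f"random data:     {compression_ratio(random_blob):.3f}")
```

The same workload question applies to any in-memory engine: a dataset of repeated categorical values may shrink by orders of magnitude, while already-compressed or high-entropy data barely shrinks at all, which is why a vendor quoting one headline ratio would be meaningless without the use case attached.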
Here Databricks was very unclear, and I got the impression that this is one of those situations where the use case determines the benchmark outcome; even then, benchmarks will depend upon the technology combination in play.
There isn’t a single measure. In the Hadoop world, I consider tens of seconds equivalent to real-time which we can regularly achieve.
Clearly there is some messaging work to be done here, but I do understand the reticence.
In the past, we've been inundated with messaging that tries to push real-time as a 'speeds and feeds' story. After a while it gets tiring because, without context, the expression 'real-time' doesn't mean much. If I'm a quant trader on Wall Street then real-time means nanoseconds. But if I have to spit out an action item report for inventory replenishment then I am driven by the slowest part of the process inside which that report sits. Most typically, that will be the logistics around getting inventory to the right location, not the report upon which I need to make judgments.
Far better, in my view, to think in terms of 'right time,' an expression I've used consistently since around 1999 to better understand individual use cases.