According to CEO Ali Ghodsi:
They're not getting the value of it, so the idea was to put them together like bricks so you can build much greater things with it than just bricks.
Ghodsi is a man on a mission – explaining how most companies are doing ML and AI wrong.
Yes, some might argue it's a self-serving mission, but we are at the beginning of the real application of some very important technologies. ML and AI are destined to have impacts we do not yet understand, and we have little idea of how we might get to any worthwhile results.
The problem is that many companies are already out there planning grand 'buildings' with little or no idea of how the 'walls' or 'floor' are actually made.
Take driverless cars as an example. I have no idea how many accidents there have been set against the cumulative number of miles driven in 'driverless' mode, or how that compares with the number of accidents per hand-driven mile, but Ghodsi holds the view that this is a classic top-down, grand global scheme that will no doubt be good and very useful eventually.
But at present no one can be sure that the AI has all the right information for the job, because no one can really be sure what the ML systems need to be trained to look for. And while we all think that is obvious, it is – but only up to a point. Our near-term future, at the very least, is made up of such developments, yet we are still in the early-learning phase of how to build the 'bricks' properly, let alone how to put them together to make 'walls'.
Some background on Ghodsi may lend credence to his thoughts here. He was one of six researchers at UC Berkeley who got involved in working on projects for the likes of Google at a time when Google was leveraging data to do machine learning. Most other businesses were doing the opposite - locking their data into data warehouses, where the most they could do was look at the past.
At the same time, Google had built a massive infrastructure for data and was doing predictions, AI and ML. The researchers set out to democratise this and get it out to the rest of the world, opting for the open source route. But there wasn't much uptake because they were not yet a business.
Around 2013, however, they 'got lucky', as Ghodsi tells it, and received $14 million in funding from Ben Horowitz. At the same time they donated their project to the Apache Software Foundation, where it started life as Apache Spark.
The three parts of unified analytics
Ghodsi doesn’t think that company leaderships are naïve when it comes to AI and ML; it is more a case of it simply being difficult to get there, not least because a shortage of staff with the right skills is currently widespread:
People who can actually get the data and automate it are very rare and far apart; they have to be good at three things. They need to be good at the mathematical stuff, that's data science. They also have to be good at dealing with the data, that's the data engineering part. Third, they have to have the domain knowledge of that company.
They also have to understand what problems their company’s customers are trying to solve, and there are currently very few people that can do all three as part of the same single task. Databricks has set itself the target of covering off the first two – data science and data engineering – in a single product, creating something that domain expertise can then be mapped on to.
Ghodsi has named this combination 'unified analytics', and it maps well onto the trend for viable AI applications to be really quite small and focussed on one or two tasks. To Databricks these are the equivalent of walls or floors.
One customer is using Databricks to block credit card fraud as the card is swiped, comparing the location, time of day, continent, user background, and the frequency and types of purchases with a growing dataset of learned fraudulent behaviours and operations, statistically identifying – and blocking – fraudulent actions against the card. Ghodsi says:
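The kind of per-swipe statistical check described here can be sketched in a few lines. This is a hypothetical illustration in plain Python – the field names, signals, and thresholds are invented for the example, not Databricks' or the customer's actual logic:

```python
from dataclasses import dataclass

@dataclass
class Swipe:
    card_id: str
    country: str
    hour: int        # local hour of day, 0-23
    amount: float

@dataclass
class Profile:
    """Hypothetical per-card profile learned from past behaviour."""
    usual_countries: set
    mean_amount: float
    std_amount: float

def fraud_score(swipe: Swipe, profile: Profile) -> float:
    """Combine simple signals into a score in [0, 1]; higher = more suspicious."""
    score = 0.0
    if swipe.country not in profile.usual_countries:
        score += 0.5                      # unusual location for this card
    if swipe.hour < 6:
        score += 0.2                      # small-hours purchase
    # Amount far outside the card's learned spending distribution
    if profile.std_amount > 0 and \
       abs(swipe.amount - profile.mean_amount) > 3 * profile.std_amount:
        score += 0.3
    return min(score, 1.0)

def should_block(swipe: Swipe, profile: Profile, threshold: float = 0.7) -> bool:
    """Block the transaction when the combined score crosses the threshold."""
    return fraud_score(swipe, profile) >= threshold
```

A real system would learn the profile and the weights from the growing dataset of labelled fraud rather than hard-coding them, but the shape of the decision – score the swipe against learned behaviour, block above a threshold – is the same.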
Now, two years later, that one company has over 50 such use cases. And similar numbers of use cases are emerging in most of our 700 customers. It is the boring type of AI that nobody talks about. Everybody is talking about self-driving cars and 'are humans going to be replaced?', that kind of stuff.
One of the main reasons AI projects are failing, he adds, is that separate teams of data scientists and data (software) engineers sit in two different groups and report to two different departments:
What we're seeing is you get stuck in politics between these two groups. So, the data engineer says 'I have the data, and you can't do anything without the data, and no you can't have the data because I'm not sure you would be able to keep it secure'. The other group is saying 'you need to give me access to it so I can actually build models that can bring AI to this company. This is the future, you're the past'.
In Ghodsi's view, they now have to be on the same team, speaking the same language – which he suggests is the role of Spark. Making that happen, however, is involving the company, he acknowledges, in some customer education at quite an abstract level.
But he sees that as the reality of the chasm between the 99% of companies that are not so successful with AI and the 1% that are currently making the headlines. He feels every Fortune 2000 company has hundreds, if not thousands, of use cases where it can apply pretty basic machine learning.
Ironically, Databricks is operating in the driverless car arena, but only where it involves the sequencing and analysis of vast amounts of data from the many sensors cars now have detecting their behaviour and surroundings. But in Ghodsi’s view, detecting problems or the build-up of potential accident scenarios is, essentially, just the same technology as used to detect credit card fraud.
“If you talk to any of those companies and ask them what's the hard part, they'll tell you that 95% of their effort went into the stuff that I'm talking about and only 5% into the front-end management system.”
This point is made, and discussed in some depth, in 'Hidden Technical Debt in Machine Learning Systems', a paper by Google researchers that focussed on how time was spent running projects inside Google. It outlined the concept of technical debt in developing ML systems, and found that it is common to incur massive ongoing maintenance costs in real-world ML systems. The authors unearthed a number of risk factors to account for in system design, including 'boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns'.
Configuring the resources for thousands or maybe hundreds of thousands of machines, analysis tools, process management tools, extracting the data relevant for the machine learning algorithm, managing the infrastructure around this and then, most importantly, monitoring everything is the issue. If you just feed garbage in, you're going to get garbage out, it doesn't matter how good your algorithm is. That's the secret sauce of Spark. It's one language to do both.
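The 'one language to do both' point can be illustrated with a toy pipeline. This is a plain-Python stand-in, not actual Spark code: the same short script performs the data-engineering step (validating and cleaning the input so garbage is filtered out rather than fed to the model) and the data-science step (fitting a simple statistical profile), and reports what it dropped so the pipeline can be monitored:

```python
def clean(records):
    """Engineering step: drop malformed rows ('garbage in') and count them."""
    good, dropped = [], 0
    for r in records:
        if isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0:
            good.append(r)
        else:
            dropped += 1
    return good, dropped

def fit_mean_std(records):
    """Science step: learn a simple spending profile from the cleaned data."""
    amounts = [r["amount"] for r in records]
    mean = sum(amounts) / len(amounts)
    var = sum((a - mean) ** 2 for a in amounts) / len(amounts)
    return mean, var ** 0.5

# Two of these four hypothetical rows are garbage and must never reach the model
raw = [{"amount": 10.0}, {"amount": "oops"}, {"amount": 30.0}, {"amount": -5}]
cleaned, dropped = clean(raw)
mean, std = fit_mean_std(cleaned)
print(f"dropped {dropped} bad rows; learned mean={mean:.1f}, std={std:.1f}")
```

In Spark the two steps would be DataFrame transformations and an MLlib fit on the same cluster, in the same program – which is precisely the unification Ghodsi is pointing at.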
This does highlight one of the dichotomies of IT in the modern era – users are increasingly starting their decision-making at the final-solution end of the telescope, yet still sometimes have to start small and build towards it. So far most of the top-down 'large project' AI and ML developments have stumbled and not yet achieved worthwhile results – driverless cars being a good example: good so far as it goes, but still with some potentially dangerous holes to fill in the walls and floors of what is being built. Anything that helps create the necessary bricks to fill those holes is probably going to be of value in bringing AI to a first level of maturity.