Breeding a Babel Fish for Internet of Things analytics


Collaboration between big systems environments is an essential capability now if complex systems of systems are to work at all, let alone properly, so Pentaho is pitching to be the tech version of Douglas Adams’ Babel Fish.

Once popped into someone’s ear, the Babel Fish, the fabled universal translating animal from Douglas Adams’ famous trilogy, The Hitchhiker’s Guide To The Galaxy, could translate the language of any sentient being in the galaxy into the language of that individual. Anyone could then communicate with anyone.

Quentin Gallivan, the CEO of Pentaho – now a division of Hitachi Group – knew nothing of said fabled fish, but he likes the analogy, for he now sees Pentaho as the Babel Fish of the Internet of Things (IoT) – and a lot more, for that matter:

When it comes to machine data, yeah, I like Babel Fish, we’ll have to use that. It’s really about that adaptive layer that takes all the different formats and devices. What’s unique from a data orchestration standpoint in IoT is not the size of the data, it’s the volume and the short burst frequency and the real time nature of it.
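
To make that ‘adaptive layer’ idea a little more concrete, here is a minimal sketch of what translating different device formats into one common shape might look like. It is purely illustrative and is not Pentaho’s API: the `normalise` function, the field names and the two sample payload formats are all assumptions made up for this example.

```python
import csv
import io
import json


def normalise(payload: str, fmt: str) -> dict:
    """Map one device reading onto a common record shape.

    The formats and field names here are hypothetical examples,
    not anything defined by Pentaho or Hitachi.
    """
    if fmt == "json":
        raw = json.loads(payload)
        return {"device": raw["id"], "metric": raw["type"], "value": float(raw["val"])}
    if fmt == "csv":
        # Assumed column order: device, metric, value
        device, metric, value = next(csv.reader(io.StringIO(payload)))
        return {"device": device, "metric": metric, "value": float(value)}
    raise ValueError(f"unsupported format: {fmt}")


# Two devices reporting the same kind of reading in different formats.
readings = [
    ('{"id": "pump-1", "type": "temp_c", "val": 71.5}', "json"),
    ("pump-2,temp_c,68.0", "csv"),
]
records = [normalise(payload, fmt) for payload, fmt in readings]
```

Once every reading has the same shape, whatever sits downstream no longer needs to know which device, or which format, it came from.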

As even traditional buyers of capital equipment move away from capital expenditure and towards obtaining what they require as-a-service, big products included, the need to work with ever-faster and larger bursts of data, from a growing number of sources, will expand rapidly.

That also changes the rules for the vendors of such capital equipment. They are no longer just selling machines; they also have to manage the risk for the customer, as businesses are now paying by the hour. Predictive maintenance is therefore becoming an important way to manage a business model now geared to delivering a service. It even involves applying predictive maintenance to the IoT system itself.

To that end, Hitachi is targeting the rapid and simple deployment of hyperconverged big data analytics with a Hyper Scale-out Platform, the HPS400. This is the first of an expected range of systems that follow the ‘appliance’ route by combining compute, storage and virtualisation with Pentaho’s data integration and management environment, together with Hadoop in its various forms and a range of database tools and data warehouses.

For businesses with the right compute resources, it can also be run as a cloud service, which opens up the possibility of Cloud Service Providers offering it.

Target market sectors are across the board in data analytics, but according to Gallivan the company is looking particularly at security, smart/safe city applications, and of course the Industrial Internet.

Everyone paddle in the data lakes

Though Hitachi makes much of its use of Software Defined Data Center architectures, this does in fact take that approach a bit further, conceptually. This appears to be what one might call an ‘Abstracted Function Appliance Architecture’, a hardware and software combination that moves the user’s point of interaction with the resources and deliverables further up the levels of abstraction, away from the technology.

The idea is that users don’t need to know how it works, they just need to understand what it can achieve for them. And in the highly complex and interactive world of an industrial internet environment, that will be all many users will need to understand.

So this is being pitched at the new trend in IoT: companies that are now trying to figure out the new lines of business open to them, and what they can do to make those new lines a reality. Their IT departments then have the problem of trying to build systems that achieve those results, and it is here that Gallivan sees HPS400 and what follows having a direct effect:

You hear words like ‘Data Lake’. How do you incorporate an architecture for big data if you’re a large corporate? This concept is happening where corporate IT is building a centralised infrastructure which is a parallel to the ERP kind of world. So they’re setting up this centralised data lake. We aim to help companies operationalise that lake by creating an abstraction layer around it so it’s easy for that line of business to actually dip in and grab information they need, and blend it with information they may have within the line of business, such as relational data about machine specifications.

This abstraction layer is designed to help corporate IT operationalise the data lake, at a time when line of business managers are putting pressure on IT to make it easier to exploit the data sitting in the data lake. At the same time, corporate IT also has to make sure that the data is verifiable, auditable, that there is data lineage, and that it is safe and secure. 
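
The ‘dip in and blend’ idea can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a description of how Pentaho does it: the in-memory SQLite table stands in for a line of business’s relational machine specifications, the list of dictionaries stands in for readings pulled from the data lake, and every table, column and function name is hypothetical.

```python
import sqlite3

# Relational side: machine specifications held by the line of business.
# (Hypothetical schema, for illustration only.)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE machine_specs (machine_id TEXT PRIMARY KEY, max_temp_c REAL)"
)
conn.executemany(
    "INSERT INTO machine_specs VALUES (?, ?)",
    [("pump-1", 70.0), ("pump-2", 75.0)],
)

# Data lake side: readings as they might arrive from the centralised store.
lake_readings = [
    {"machine_id": "pump-1", "temp_c": 71.5},
    {"machine_id": "pump-2", "temp_c": 68.0},
]


def blend(readings, connection):
    """Attach each machine's specified limit to its reading and flag breaches."""
    specs = dict(
        connection.execute("SELECT machine_id, max_temp_c FROM machine_specs")
    )
    blended = []
    for reading in readings:
        limit = specs[reading["machine_id"]]
        blended.append(
            {**reading, "max_temp_c": limit, "over_limit": reading["temp_c"] > limit}
        )
    return blended


blended = blend(lake_readings, conn)
```

The point of the abstraction layer is that the line of business only ever sees something like `blended`, not the plumbing on either side of it.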

According to Gallivan, the advantage of the HPS400 appliance is that having it in a box takes the complexity out of setting up the Data Lake. It’s tactical, with Pentaho the key to the abstraction layer. The relationship between the two companies then makes bundling with the Hitachi hardware simpler and easier to configure.

What they’ve put in HPS400 is self-deployment, self-configuration, just taking the complexity out of setting up the Data Lake.

This then raises an obvious question: is there scope for further levels of abstraction? For example, there must be scope for deeper organisation or orchestration of the data. When enterprises are dealing with petabytes of data, the amounts are such that they need to understand, at a data science level, what data is actually relevant to what question.

It might then make sense for the company to build a partnership with, for example, IBM Watson on the IoT side, using its inference search engine environment to infer what events and actions are likely/possible and where to look for relevant data. Is that something that Hitachi or Pentaho are looking at? Gallivan says:

One thing is this year we’re putting in more machine learning into our data orchestration so that companies and their data scientists can easier research, and we start looking at that Petabyte scale of data. We also have a machine learning capability, open source at Pentaho, and then Hitachi have several initiatives around machine learning and artificial intelligence. They could possibly be part of this IOT capability.

As an example of the general direction here, he points to FINRA, the largest financial services regulator in the US. It captures every trade that’s done by every broker dealer in North America, some 7 billion trades a day, with the objective of looking for insider trading, fraud and non-compliance.

Everything is run in the cloud, on Amazon Web Services, with Pentaho used to grab all the trades and transform them for processing in Hadoop, which also runs in AWS. It then uses the results to develop new machine learning algorithms, says Gallivan: 

FINRA uses a great line which is,  ‘Not only are we looking for a needle in a haystack, we’re looking for the bent needle in the haystack’. They’re looking for anomalies in terms of fraud detection. So they’re using Pentaho as the infrastructure for that anomaly detection or that machine learning.
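
For readers wondering what hunting for a ‘bent needle’ looks like at its very simplest, the sketch below flags values that sit unusually far from the rest. It is emphatically not FINRA’s method, which the article does not describe; it is a basic z-score check, with made-up trade sizes and an assumed threshold of 2.0 standard deviations, offered only to show the general shape of anomaly detection.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return the indices of values more than `threshold` standard
    deviations from the mean (a simple z-score test).

    The threshold is an illustrative assumption, not a recommended setting.
    """
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # every value is identical, so nothing stands out
    return [
        index
        for index, value in enumerate(values)
        if abs(value - mean) / spread > threshold
    ]


# Made-up trade sizes: six ordinary trades and one that stands out.
trade_sizes = [100, 102, 98, 101, 99, 100, 500]
suspicious = find_anomalies(trade_sizes)
```

Here `suspicious` holds the position of the 500-unit trade. Real fraud detection works on far richer features than a single number, which is where the machine learning comes in.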

Optimise, abstract

The goal now is to engage in the post-analytic management of specific events. With more prescriptive analytics providing the ability to predict that something will happen, the obvious next step is to manage what is done about it from a system control standpoint. ‘Closing the loop’ is how Gallivan describes it.

HDS is also looking closely at the scope for optimised implementations of the HPS400, not just of Pentaho but also Hadoop. Coupling this with the inevitable effects of Moore’s Law – where appliances like this will become smaller, more integrated, and cheaper – it is possible to foresee a wide range of devices optimised for specific functions or application areas, with increasing levels of autonomous automation for managing the specific operations of an industrial internet environment.

According to Gallivan, the HPS400 is the first step down a track towards a product line where appliances are simply plugged in with the minimum of final optimisation required.

The biggest advantage is setting up that Data Lake. Where you’re taking complexity and time out of the process, because it is a box. You can optimise it, configure it, put Pentaho in and optimise its performance. It is more dial turning than coding and system testing and is just part of that infrastructure where the benefit is that it is easier to deploy. It is a building block to a broader thing that we’re alluding to.

One interesting part here, and a core part of that Babel Fish analogy, is that despite now being an integral part of Hitachi Data Systems, Pentaho is still free to sell to rivals of HDS, especially in the IoT sector:

We are the data orchestration and the analytics, then other companies provide their own end-to-end. We’re doing that today and we will continue to do that. GE is a customer of ours and some other Hitachi rivals are customers of Pentaho. Hitachi is OK about this and, more importantly, the rival who is our customer is ok with it. That’s more important from our standpoint. Big global environment companies can be competitors in the morning, customers in the afternoon and partners in the evening. That’s the nature of the game.

And when it comes to IoT, partnerships are going to be inevitable as different businesses use different IoT equipment and service providers. They will simply have to work together. So the cloud, whether private or public, is the natural place to facilitate autonomous end-to-end IoT environments. In Gallivan’s view any IoT environment is, by its very nature, a large and complex ecosystem, with many vendors playing their part.

My take

The development of IoT, and Big Data analytics in general, is only really going to take off once the need for users to dabble with the technology is removed from the equation as much as possible. That way, users can dedicate their time to understanding issues, formulating questions and understanding the answers. This could represent a significant move in that direction, not least because competing vendors seem happy to use Pentaho as that fabled Babel Fish. Just Don’t Panic!

Image credit - BBC