Data in motion was a big theme among Hortonworks executives at last week’s Hadoop Summit in Dublin.
It’s all very well for companies to have a huge Hadoop data lake based on, say, Hortonworks Data Platform (HDP), they argued - but how should companies go about analyzing data before it comes to rest there? How, in other words, can they get real-time business insights from data arriving at a fast rate, while it’s still in motion?
The reason behind the ‘data-at-rest versus data-in-motion’ messaging is clear: Hortonworks is looking to capitalize on its late 2015 acquisition of Onyara, the early-stage start-up behind an Apache project called NiFi, which started life at the US National Security Agency (NSA).
NiFi (pronounced ‘nigh-figh’) is now open source, and Hortonworks sells it under the name Hortonworks DataFlow (HDF) as a separate product from HDP. It can be used to collect data from many sources, with a strong emphasis on the kinds of sensors and meters that increasingly make up the Internet of Things (IoT), but also from web clickstreams and social networks, for example.
The thinking behind all this is that Hortonworks customers will rely on HDP to provide an invaluable repository of data at rest for analysis, but will turn to HDF to assess and react in real time to fast-changing data in motion from ‘edge’ devices and sources, as Hortonworks president Herb Cunitz explained:
A large retail customer with thousands of stores collects data on all of the transactions that take place in a single business day. Once that data lands in HDP, the data-at-rest platform, they can go analyze it. But one of the challenges they ran into is that HDP only updates periodically, maybe once an hour. What the retailer also needs is real-time insight into data in motion, which is where HDF, with its streaming analytics capabilities, comes in as our data-in-motion solution.
In the mail
But it wasn’t just Hortonworks executives extolling the virtues of HDF. Thomas Lee-Warren, director of the technology data group at Royal Mail, said that while the organisation’s use of Hadoop is still at a relatively early stage, HDF is helping him and his 15-strong data insight group to move faster on data-science experiments:
For us, HDP is about where we land data. HDF, by contrast, is allowing data scientists to quickly look at data, play with ideas, visualise those ideas back to business leaders so that they can say which have merit, which don’t and which they’d like to explore further. As we move outwards, towards the edges of where data is captured, we have to consider what’s the best toolset for experimentation.
Royal Mail may not have the biggest cluster, but we do a lot of experimentation. And we’ve got a lot to prove, because [CEO] Moya Greene and the rest of the executive board are very excited about what we’re doing and are directing our efforts.
One project he and his team have worked on, for example, focused on churn modelling, with the goal of cutting customer attrition rates by identifying why and when customers are most likely to move to another provider. Another looks at identifying accurate delivery times for business customers:
Before, we’d have had to have a whole debate around ETL [extraction, transformation and loading] in order to create these analyses. But HDF helps us to streamline the flow of data and build models and visualisations quickly, so that my team can work iteratively with business colleagues on building solutions that work for the business.
It’s a real step forwards from the situation that Lee-Warren found at Royal Mail when he joined the organization around two years ago.
Then, he observed, the data analysis team spent around 90% of its time ferrying data back and forth between operational systems and analytics platforms, and only 10% on actual analysis. His goal was to reverse those figures, so that the vast majority of the team’s time is spent exploring data:
As an organization, we already have a lot of data, but what we needed to do when I joined was to start to change the whole dialogue around data from just something that appears in reports to something that drives the fortunes of the business and lies at the heart of our decision-making.
That’s the thing with Big Data today - you need a platform where you can store everything, you need to experiment as much and as fast as you can but, soon enough, data starts to speak to you. More work is needed at Royal Mail, but I think we’re getting there.