So when Hortonworks invited me to the opening of their new office in London this week, where a number of high-profile customers were speaking, I thought it would be a good opportunity to get some insight into real-life examples of organisations that are starting to find value in analysing unstructured data.
The organisations in question are all Hortonworks customers and have a high profile in their respective fields – Jaguar Land Rover, Zurich Insurance and the UK's Home Office.
Below are their stories.
Jaguar Land Rover
Adam Grzywaczewski, Research Strategy Engineer of Self Learning Cars at Jaguar Land Rover, took to the stage at Hortonworks' offices and explained how his team is using Hadoop to better understand how the company's fleet of cars is used across a number of regions. The aim is ultimately to feed this information back to the engineering team so that Jaguar Land Rover can reduce its manufacturing costs and place emphasis on better and cheaper design. He said:
Modern cars are not purely mechanical anymore. It comes as a surprise to many that we currently have 60 computers on board. And a good couple of thousand different sensors. An average car on a good day will generate around 1.5 gigabytes of information. That can be valuable, it tells us a lot about how our customers use our cars. That really tells us a lot about how our cars operate in real-life environments and places we don't fully understand, like Dubai, South Africa or China, where people are different than us. We quickly realised that the data is valuable.
It's fairly obvious that in order to convince our board that the products that we are creating have business value and actually work, we had to validate them. Especially in a sector that is so safety and robustness focused. It was crucial to collect representative data from day one. And the volumes just didn't allow us to go in any other direction [than Hadoop].
Really, 1.5 gigabytes of data from a single car a day is substantial. We had to validate our technology across multiple markets and have a reasonable sample size on key markets as well. So we knew from day one that [Hadoop] would be the technology of choice. We have inherited a lot of different brands and vehicle architecture so we have a lot of data and a lot of data barriers across different models, so even though everything is very well documented, there isn't a very straightforward schema. Every car will have 20,000 different time series around the network at any given point in time. We couldn't just go for a relational database and do relational analytics, we had to think outside of the box.
It was fairly obvious that it was essential to validate our product internationally, because people do differ. They have drastically different habits and needs and desires across the globe. One of the things that caught us by surprise when we were looking at the automation of some of our climate systems was the way that a lot of people in Dubai and Saudi Arabia use their cars. One day we looked at just climate statistics and we noticed that a lot of cars are just left on for 12 hours – people will just park in front of their office, they will leave their car on with the air-con on, go to work and collect it later on. The definition of a weekend also varies a lot from country to country. So we have seen a huge variety of behaviours.
The mission of my team, which is self-learning cars, is to build the first truly intelligent car, which will cater for user needs. We are collecting data from a global fleet of cars to build our machine learning logic and to validate it in real-life settings.
To be honest, if British manufacturing is to survive it needs to be competitive. And it cannot be competitive without data. Currently cars are designed on worst case scenarios, engineering judgement and very well defined tests. But that needs to be more informed by how people use our vehicles and products. We frequently over-engineer our cars because it's very difficult for us to tell what is the normal usage of certain vehicle components. What is the normal usage of disc brakes? We don't really know. Hopefully we can use this to drive down the unit cost and increase quality.
Zurich Insurance
Rajdeep Mukherjee, Head of Big Data Architecture at Zurich Insurance, one of the largest companies in the world, explained how data has long been used as part of Zurich's decision making process. However, Mukherjee is now trying to take this further as he moves the company from the relational world to the non-relational with Hadoop. He said:
Zurich Insurance is one of the world's largest insurance groups. My role in the organisation is to lead a group of technical architects to design and implement a big data platform, which is going to serve our business and our business intelligence, analytical divisions. [We want to] really bring data to life. Insurance is really a data business [and] not all the data is being used at the moment.
The aim is to use Hadoop, a mix of internal and external data, to take Zurich Insurance to the next level of maturity in terms of using data to drive business decisions.
Just to give a bit of background, data driven decision making has been a part of Zurich's culture for a while now. When we did a strategy exercise last year we found that there were challenges. Some of those were getting access to data in a timely fashion. And to do that in a relational world, in a data warehousing environment, was cost prohibitive, [even more so to do] it for the entire organisation. That was one thing. Then it was around having a similar platform to run predictive use cases. We were running predictive models on laptops and desktops and to turn around the models, prove the hypothesis was taking 18 to 24 months.
How can we change that and have a new modern data architecture? That's when we started thinking about Hadoop as a distribution. We did an initial proof of concept around getting our corporate business, which spans across all geography, and because that data is in different places and not centralised in different shapes and forms – to get all the data and do a price optimisation use case, the way you could do it using Hadoop and non-relational technologies was an eye opener. Having a conformed metadata layer throughout, that is important.
However, Mukherjee also noted that there are things he still needs to see develop within the Hadoop open source community. He said:
Quality testing within Hadoop, I haven't seen that maturity today, so in terms of a testing framework around Hadoop all the way up to application testing, that would be really nice.
The Home Office
Finally, the Home Office's Wayne Horkan, Senior Enterprise Architect, explained how Hadoop and the open source community will help government departments to move away from their legacy systems and avoid future vendor lock-in (well, that's the theory anyway). He said:
The Home Office obviously is involved in immigration control, border control and coordinating the police and in general protecting the country. The last couple of years I have been looking at border control and basically border checking. That's the area of interest. Obviously we have a large number of existing databases and sets of data and we would like to bring those together and get as much intelligence from them as possible.
The metadata layer and taxonomy and management thereof is absolutely key. The ability to re-schema, late binding and the ability to bring lots of different, disparate data together for us to be able to relate it in real time and make decisions on it, that's the key enabler. It's that breaking away from legacy relational thinking, where structure and data are so tightly coupled, because that's killing us really.
That's the piece that's exciting, to repurpose the data, to bring it all together, to reuse it very quickly to make decisions and [is becoming] increasingly transactional. And questions of how do we make real-time systems that are born in real-time from very large datasets? That's some of the exciting things that we look to the technology for.
What I've seen from Hortonworks and enjoyed is the alignment to open source, you are very closely aligned to the open source community. There is a lot of feedback. That's really good for us because it protects us from vendor change and lock-in, which we are not too keen on at the moment. The other piece is that you roll up everything together, get consistent build and delivery, that's really useful to us. There is also a maturity in the ecosystem. Previously where this was a technologist's technology, it's not now.