Northern Trust drains the data lake with Hadoop to build a reservoir

By Jessica Twentyman, August 5, 2015
Summary:
Northern Trust Bank builds a data reservoir with Cloudera Hadoop; no more drowning in the data lake.

Len Hardy, Northern Trust

Behind the scenes at Chicago-based financial services company Northern Trust, a massive construction project is underway.

Len Hardy, the bank’s chief architect, is overseeing the development of a huge information repository, capable of holding vast quantities and varieties of data from a wide range of systems and keeping it in its native format until it’s needed for analysis.

In other words, Northern Trust is building what many in the information management business have taken to calling a ‘data lake’ - but that’s not a term that Hardy likes to use.

A lake, he points out, occurs naturally. What he and his team are hard at work on, he explains, is a ‘data reservoir’:

because a reservoir is man-made. It takes a lot of concrete, a lot of work and a lot of engineering skill if it’s going to be a success.

While the terminology may differ from company to company, the vision for this repository is one that the team at Northern Trust shares with IT professionals at other organisations: by creating a vast data pool on the open source Hadoop framework, running across a large cluster of low-cost commodity servers, the bank will have a cost-efficient and scalable way to capture and store all the information it could possibly need for future analysis, without the IT team having to worry upfront about complex ETL [extraction, transformation and loading] processes or schema development.

To put it bluntly, they can just dump data in Hadoop - and that’s a great way of eliminating the problem of data silos.
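
To make that idea concrete, here is a minimal sketch of landing a raw extract in HDFS exactly as it arrives, with no upfront schema or transformation. The paths, source-system name and Python wrapper around the standard hdfs dfs commands are illustrative assumptions, not a description of the bank’s actual pipeline.

```python
# Minimal sketch: landing a raw extract in HDFS as-is, deferring schema and ETL.
# Paths and the source-system name are illustrative assumptions.
import subprocess

def land_raw_file(local_path, source_system):
    """Copy a file into the landing zone in its native format."""
    target_dir = "/data/landing/{0}".format(source_system)
    # Real landing zones typically also partition by load date; omitted here for brevity.
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", target_dir])
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_path, target_dir])
    return target_dir

if __name__ == "__main__":
    land_raw_file("wire_transfers_20150805.csv", "payments")
```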

Ambitious

In Northern Trust’s case, the enterprise Hadoop distribution chosen for this project is from Cloudera. Hardy says he likes Cloudera’s philosophy of supplementing the core, open source Apache Hadoop technology with its own, proprietary products. For example, while Northern Trust is using pure, open source Apache Software Foundation tools such as Hive (for data warehousing) and Flume (for large-scale log aggregation) on top of its Hadoop implementation, it also relies on a number of Cloudera's homegrown add-ons, says Hardy. Cloudera Impala, for running SQL queries, is an important part of the set-up.
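
As a rough illustration of how those pieces can fit together, the sketch below defines a Hive-style external table over a raw landing directory and queries it through Impala via the impyla Python client. The host, port, table and column names are assumptions made for the example, not details of Northern Trust’s environment.

```python
# Sketch: schema-on-read over raw files, queried via Impala (impyla client assumed).
# Host, table and column names are illustrative assumptions.
from impala.dbapi import connect

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_payments (
  txn_id STRING,
  step_name STRING,
  ts TIMESTAMP,
  amount DECIMAL(18,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/landing/payments'
"""

conn = connect(host="impala-host.example.com", port=21050)  # Impala's default query port
cur = conn.cursor()
cur.execute(DDL)  # the schema is applied at read time; the underlying files stay untouched
cur.execute("SELECT COUNT(*) FROM raw_payments")
print(cur.fetchone()[0])
```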

This is a hugely ambitious project. Hadoop’s an important element in this enterprise data platform (EDP), but other information management technologies (such as traditional data warehousing technology) will likely play a role, too. At this stage, a first production release of the EDP isn’t expected until some time in 2016 - but two earlier proofs of concept with Hadoop are really paying dividends and smoothing the path of implementation, he says:

The reason we introduced Hadoop at Northern Trust was to create an enterprise data platform - that was always our strategy and what we were ultimately working towards. But, like I said, creating a data reservoir is very hard work, so first, we worked on two other use cases for Hadoop that represented ‘quick wins’ with great value for us.

The first of these Hadoop use cases is Northern Trust’s new Infrastructure Analytics application, which collects activity data and transaction logs from over 10,000 PCs and laptops across the bank’s global operations and funnels it into Hadoop. This enables the IT team to perform preemptive analysis of desktop performance issues, he explains:

For instance, we get a dashboard in the morning that shows us the ten computers across the globe that had the poorest performance during the previous day. We can call the user, ask them what software they’re running, for example, and arrange to upgrade their machine if necessary.
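
A query along the lines of the sketch below could sit behind that kind of morning dashboard, pulling the previous day’s ten worst-performing machines out of Impala. The table name, columns and host are assumptions made for illustration, not Northern Trust’s actual schema.

```python
# Sketch of a 'ten worst machines yesterday' query against an assumed Impala
# table of desktop telemetry; schema and host are illustrative only.
from impala.dbapi import connect

QUERY = """
SELECT hostname,
       AVG(app_response_ms) AS avg_response_ms,
       MAX(cpu_pct)         AS peak_cpu_pct
FROM desktop_metrics
WHERE metric_date = '2015-08-04'
GROUP BY hostname
ORDER BY avg_response_ms DESC
LIMIT 10
"""

cur = connect(host="impala-host.example.com", port=21050).cursor()
cur.execute(QUERY)
for hostname, avg_ms, peak_cpu in cur.fetchall():
    print("{0}: {1:.0f} ms avg response, {2:.0f}% peak CPU".format(hostname, avg_ms, peak_cpu))
```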

The second Hadoop use case is Northern Trust’s new Financial Transaction Tracker. As the name suggests, this monitors the progress of complex, multi-step transactions, such as wire transfers, from start to finish, by pulling in data from the various workflows and applications through which they pass. Hardy explains:

We can then take that information from Hadoop and present it to both operations and technology people in the form of dashboards that can tell you the state of any transaction - where it is in the process, how long it’s been at that stage, what’s the next step, when it’s likely to complete. We liken it to tracking a parcel that a courier’s bringing to your home or office.

We can visualise this data as a timeline on the dashboard, with all the stops and stages along that timeline, and we can flag the timeline as red if a transaction’s taking longer than usual. From a client servicing perspective, that’s very helpful if a client calls up to enquire about a wire transfer, for example. From an application support perspective, it can help technology people to identify slowdowns and bottlenecks that perhaps need attention.
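
One way such a tracker view might be fed is sketched below: a query that finds each transaction’s latest step and how long it has sat there, with anything past a threshold flagged red. The table, columns and one-hour threshold are illustrative assumptions only.

```python
# Sketch of the tracker view: latest step per transaction and time spent in it,
# flagged red past an assumed threshold. Table, columns and SLA are illustrative.
import time
from impala.dbapi import connect

SLA_SECONDS = 3600  # assumed per-step threshold; real thresholds would vary by step

QUERY = """
SELECT txn_id, step_name, unix_timestamp(ts) AS entered_at
FROM (
  SELECT txn_id, step_name, ts,
         ROW_NUMBER() OVER (PARTITION BY txn_id ORDER BY ts DESC) AS rn
  FROM wire_transfer_events
) latest
WHERE rn = 1
"""

cur = connect(host="impala-host.example.com", port=21050).cursor()
cur.execute(QUERY)
now = time.time()
for txn_id, step, entered_at in cur.fetchall():
    dwell = now - entered_at
    flag = "RED" if dwell > SLA_SECONDS else "ok"
    print("{0}  step={1}  {2:.0f}s in step  [{3}]".format(txn_id, step, dwell, flag))
```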

Learning

These applications, says Hardy, have provided Northern Trust with “great learning experiences” in Hadoop, an area of technology that’s constantly evolving. For a risk-averse company in a highly regulated industry, that can make this a challenging area in which to place bets. He says:

We’ve done our own evaluation of the Hadoop ecosystem and chosen tools that we felt to be the most mature, the most stable and we make sure we have the best support available. But that’s something we constantly have to re-evaluate, because there are things we’re looking at today that weren’t available six months ago.

But with those two initial use cases now up and running, the Northern Trust IT team is forging ahead on its data reservoir vision. Says Hardy:

Hadoop will basically be the ‘first stop’ for every kind of data in Northern Trust - every transaction processing system and every app that creates data will feed this reservoir. We’ll be keeping all history, every transaction and any data that’ll end up in a client report or on an employee desktop - it will all be stored within that Hadoop layer.

But then, we can use the power of Hadoop and its parallel processing technologies to transform that data, as it’s needed, into a new, common format that we’re able to export out of Hadoop and into traditional relational databases, so that we can do reporting and create data marts.
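
A heavily simplified sketch of that transform-then-export step is shown below: an Impala INSERT ... SELECT reshapes raw events into a conformed table inside the cluster, and Sqoop then pushes the result out to a relational data mart. The table names, JDBC URL, credentials and HDFS path are assumptions made purely for illustration.

```python
# Sketch: transform inside the cluster, then export to a relational data mart.
# Table names, JDBC URL, credentials and HDFS path are illustrative assumptions;
# the conformed_transactions table is assumed to exist already.
import subprocess
from impala.dbapi import connect

TRANSFORM = """
INSERT OVERWRITE TABLE conformed_transactions
SELECT txn_id,
       MIN(ts)  AS started_at,
       MAX(ts)  AS last_event_at,
       COUNT(*) AS step_count
FROM wire_transfer_events
GROUP BY txn_id
"""

cur = connect(host="impala-host.example.com", port=21050).cursor()
cur.execute(TRANSFORM)  # the heavy lifting runs in parallel across the cluster

# Push the conformed table into the downstream relational database with Sqoop.
subprocess.check_call([
    "sqoop", "export",
    "--connect", "jdbc:oracle:thin:@dbhost.example.com:1521/MARTS",
    "--table", "CONFORMED_TRANSACTIONS",
    "--export-dir", "/user/hive/warehouse/conformed_transactions",
    "--username", "etl_user", "-P",
])
```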

Today, much of the design and early implementation work is complete. Hardy and his team are starting to move data into the EDP, transform it and move it into target relational databases:

The data reservoir concept is very appealing because getting the data in, even in massive volumes, is actually quite easy. And the price-point of the hardware that Hadoop runs on is very attractive: you can keep data there as long as you need it, and beyond.

But where the real value comes into play is in the analysis you can perform once you’ve got all your data into one place - and not just your internal data, but data from external sources, too. That’s going to be pretty powerful.