A large bank is, perhaps, the classic use case of a business where understanding its data inventory is a key governance issue. It's essential to ensure that it is never overwhelmed by trying to see what data is coming from where, where it is going, and who is using it. At Huntington Bank, a full-service bank based in the US Midwest, this is the mission of Shaun Rankin, Senior VP of Information Governance.
His key roles are supporting the bank's enterprise data strategy and expanding appropriate access to that data across the bank's staff. The hot topic for the latter goal is self-service access, so that staff understand where to go to get information. He explains:
We're focused on helping data to move like water as opposed to moving like oil through our systems, so people know what to look for, where to get it, and that they can trust it.
A key part of that data strategy is the use of metadata components, which are the responsibility of what Rankin calls a small but mighty team put together last year with a purpose — to centralize knowledge and decentralize and distribute understanding. He says:
It's a nice simple North Star. And we realize it isn't going to be a magic bullet, or just a simple tool that answers it, we've really got to get engaged with people and establish scalable processes, and then use best-in-class tools to help decentralize that understanding.
How his team has set out to achieve this goal was the subject of a presentation at last week's EVOLVE21 virtual conference, hosted by Rocket Software and its recently acquired subsidiary ASG Technologies, whose Data Intelligence tool, ASG DI, has played a significant part in their work.
It's all in the metadata
The data ecosystem Rankin's team works with is focused on customer accounts, transactions and interactions, from some 40 to 45 sources that feed into the ecosystem. There is a warehouse and Hadoop layer for structuring that data, together with a Snowflake layer for serving up the finance business capabilities users want. Analytics is supported by a data dictionary curated for data quality, plus other tools. This layer is also the home of the metadata capabilities.
Using ASG DI to analyze its metadata, the bank decided to start with customer data: how it flows into the data warehouse, whether it can be located in a data mart, and where it is used in the reporting layer. Rankin describes the scale of the task:
We have over 20,000 tables and views between our customer master warehouse data mart. We have over 14,000 transformations that are data stage layer and over 13,000 reports. So this is where it gets overwhelming. There's a lot of data. The Data Intelligence tool actually started to help us visualize and see some of the lineage. And as we were looking at the lineage, we realized there are gaps — we can't see the data mart, and we can't see the reports.
To plug these gaps, the team has added SQL Server Integration Services (SSIS) and is now in the middle of scanning that. "We expect to get the full lineage within our data ecosystem from there," he says. This is helping expose deeper levels of information about the data, such as the embedded SQL that's used and the transformations for moving that data and mapping it from source to target.
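The kind of gap Rankin describes can be illustrated with a small lineage graph. The following is a minimal sketch in Python, not ASG DI's API or Huntington's implementation: the asset names, the `warehouse.` and `report.` prefixes, and the `lineage_gaps` helper are all hypothetical. It treats lineage as source-to-target edges and flags warehouse assets with no downstream path to any report.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (source asset -> target asset).
# Asset names and layer prefixes are illustrative assumptions only.
EDGES = [
    ("src.crm.customer", "warehouse.customer_master"),
    ("warehouse.customer_master", "mart.customer_summary"),
    ("mart.customer_summary", "report.customer_overview"),
    ("src.core.accounts", "warehouse.accounts"),
]


def build_graph(edges):
    """Map each source asset to the set of assets it feeds directly."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
    return graph


def downstream(graph, start):
    """Return every asset reachable from `start` (breadth-first search)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def lineage_gaps(edges, layer_prefix="warehouse.", target_prefix="report."):
    """List assets in one layer that never reach an asset in the target layer."""
    graph = build_graph(edges)
    assets = {asset for edge in edges for asset in edge}
    return sorted(
        asset
        for asset in assets
        if asset.startswith(layer_prefix)
        and not any(d.startswith(target_prefix) for d in downstream(graph, asset))
    )


GAPS = lineage_gaps(EDGES)
```

In this toy data, `warehouse.accounts` is reported as a gap because no scanned edge connects it to a data mart or report, which is the situation a missing scanner would produce.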
An intuitive data vernacular
This leads on to the other side of the bank's objective: how to decentralize and raise the level of understanding about the data across the bank. Rankin rejected the option of training everyone in all the technical jargon around metadata curation, choosing instead to move in the opposite direction.
His goal was to expose data assets in intuitive vernacular that people can immediately understand. This approach is not without its own difficulties, not least being that different parts of the business have their own vernaculars and taxonomies. His team had to figure out how they were going to tag metadata assets to make best use of them. It was a daunting prospect, as he explains:
There's a field within Data Intelligence called 'Tag'. Where do I put applications and processes and data domains and business segments? Do I just tag everything with everything? Or is there a more intuitive way to actually help us manage that? So as we're thinking about this problem, we actually looked to a couple of industry examples.
One inspiration was search: instead of asking staff what their system records, or how they defined a subject such as 'customer', the team analyzed what staff searched for. The second inspiration was the collaborative editing of Wikipedia. The aim was to create a similar federated, organic process that would unlock that business acumen. Rather than try to get to a consensus on the definition of 'a customer', which Rankin sees as missing the real point, the goal became to develop some principles around which to get started.
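As an illustration of the search-led idea, the sketch below counts what people actually look for rather than asking them for definitions. It is a hedged example only: the query list is invented, and `normalize` and `top_terms` are hypothetical helpers, not part of any tool named in this article.

```python
from collections import Counter

# Hypothetical search queries typed by staff; invented for illustration.
QUERIES = [
    "customer address",
    "Customer balance",
    "household",
    "customer address ",
    "branch deposits",
]


def normalize(query):
    """Lower-case a query and collapse surrounding or repeated whitespace."""
    return " ".join(query.lower().split())


def top_terms(queries, n=3):
    """Return the n most frequent normalized queries with their counts."""
    counts = Counter(normalize(q) for q in queries if q.strip())
    return counts.most_common(n)


MOST_SEARCHED = top_terms(QUERIES, n=1)
```

Ranking assets by search frequency gives a team with limited resources an evidence-based order in which to curate terms.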
Key principles for defining metadata
The overriding principle is the key goal of centralizing knowledge and decentralizing understanding, coupled with targeting and focus to make the best use of the team's limited resources. Given these constraints, the next principle is a bias towards action: encouraging people to jump in with both feet rather than sit around trying to design the best ontology or the best organization. Then comes the need to try something in order to build an understanding of the data assets, which in turn begins to inform how they should be organized. He sums up:
It's actually metadata-driven or data-driven. It's looking at that empirical evidence within the metadata to inform how you want to organize the assets.
Another principle is the use of agile software development, with the target of getting to the minimum viable product based on what is the most important metadata to work on next. He says:
That actually gets you through test and learn. Our first attempts at organizing work were not the best, but you could see the wheels turning. And now that we've gone through a handful of these sprints, they're starting to get sharp, and they're starting to uncover some complexities of, 'How do I get marketing and finance to have a definition of customer that's related, and still allows for their own distinct positions?'
This approach is then backed up by regular review processes. There are sprint reviews every two weeks with data stewards — or curators as Huntington calls them — and monthly reviews with senior management. This allows the team to see early whether they are on the right track or not.
The final principle is to build the ontology in a way that recognizes the different contexts in which data is used. The objective is to organize data assets as both data domains and business segments, and not as an either/or choice. Rankin explains:
Within Data Intelligence, we're putting business terms on top of that. Those business terms can be organized by both a business unit, a business segment, and data domain. And this visualization lets you see glossaries with contexts. I can have an enterprise glossary, and I can have a business unit glossary specific for them.
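One way to picture a glossary with contexts is a term that carries both a data domain and business segments, plus a definition per context. The sketch below is an assumption-laden illustration, not the Data Intelligence data model: the `BusinessTerm` class, the context names, and the example definitions are all hypothetical.

```python
from dataclasses import dataclass, field

ENTERPRISE = "enterprise"  # hypothetical name for the shared, bank-wide context


@dataclass
class BusinessTerm:
    """A glossary term organized by data domain AND business segment."""

    name: str
    data_domain: str
    business_segments: set = field(default_factory=set)
    definitions: dict = field(default_factory=dict)  # context -> definition text

    def define(self, context, text):
        """Record a definition for one context; other contexts are untouched."""
        self.definitions[context] = text
        if context != ENTERPRISE:
            self.business_segments.add(context)

    def definition_for(self, context):
        """Prefer the context's own definition, else fall back to enterprise."""
        return self.definitions.get(context, self.definitions.get(ENTERPRISE))


# Example definitions are invented for illustration only.
customer = BusinessTerm(name="customer", data_domain="party")
customer.define(ENTERPRISE, "A person or organization with a relationship to the bank.")
customer.define("marketing", "A person or organization that can be contacted with offers.")

MARKETING_VIEW = customer.definition_for("marketing")
FINANCE_VIEW = customer.definition_for("finance")  # falls back to the enterprise definition
```

The fallback keeps related definitions anchored to a shared enterprise meaning while still letting a business unit hold its own distinct version.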