There’s a Washington, DC public library branch near me, and I’m relearning how great libraries are. Aside from all the other things they offer communities – and there are a lot – libraries still have books. Nowadays, though, the books available to me are much more extensive than just what’s in the one branch. I can search for a book online, and if it’s anywhere in the DC Library system, I can place a hold. Within a few days, the book is delivered to my local branch for pickup.
Libraries put information at your fingertips, on request. Why isn’t enterprise data like this?
Sure, for some data that we use individually, it is. We can search email, or Slack, or Google Docs, or the content management system. If I’m lucky, I can probably find that one thing that I think I saw one time. But what about when you need to access company data? Say you’re:
- Building a new mobile app for your customers to use, which needs access to their existing customer records
- Trying to make the right business decisions based on an analysis of your aggregate sales data
- Or (good luck) trying to get a single view of a customer and all the information you have about them.
How many hoops do you have to jump through, and how long does it take, to get access to one existing data source, let alone multiple? Or, from the point of view of an end user, how many systems might your customer service reps have to access to answer a customer’s questions on the phone? For most organisations, the state of data is not good.
What’s the problem?
How did it get like this? Quite naturally, in fact. As your company grew, so did your data silos. You built new applications, and often they had their own backends. You bought new off-the-shelf software, and that came with its own backing databases too. You built more applications to extend or fix issues with the existing ones and had to duplicate data. More duplication happened when you loaded that data into a data warehouse, and then made different cuts of that cold data for different purposes. You acquired or merged with a company, but never quite merged the data or deduplicated the systems. These different data silos are owned by different teams; quite reasonably, they have their own policies and security restrictions on granting access.
The final picture of all these reasonable steps, however, is of a dystopian data landscape where getting the data you need is near impossible.
Ideally, getting access to data, when you need it, should be as simple as it is to spin up an instance in your cloud provider of choice. We have Infrastructure as a Service – why not Data-as-a-Service?
Providing Data-as-a-Service means that the people who need access to data can get it on demand. Developers can build new applications and services that query company data. Analysts can run numbers, produce insights, and create visualisations. External parties could have access too: partners get a limited view, or you can securely grant access to clients.
There are two basic approaches for Data-as-a-Service:
- Leave the data where it is, and give access to it through some sort of broker layer that connects to the source systems.
- Combine the critical data needed for your key use cases in one place, a pattern with a few names: Operational Data Layer, Data Hub, Data Fabric, and other variations on the theme.
If you don’t mind stretching the library metaphor a bit, the first approach resembles the DC public library or even the broader interlibrary loan system. Books live in many different source libraries, then are supplied to the requester on demand.
The second is more like the Library of Congress. It’s the largest library in the world, with more than 168 million items, including more than 39 million books. Anyone can get a Reader card and then access the library’s collection. Like the second approach, the Library of Congress has all the books in one place. (In fact, it has a few different physical repositories for space reasons, but let’s not let reality get in the way of a good metaphor.)
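To make the difference between the two approaches concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `CRM` and `BILLING` dictionaries stand in for real source systems, and the `Broker` and `OperationalDataLayer` classes are hypothetical names, not any particular product's API.

```python
# Hypothetical source systems, each keyed by customer id.
CRM = {"c1": {"name": "Ada Lopez", "email": "ada@example.com"}}
BILLING = {"c1": {"balance": 42.50}}


class Broker:
    """Approach 1: leave data in place and query every source per request."""

    def __init__(self, sources):
        self.sources = sources

    def get_customer(self, customer_id):
        view = {}
        for source in self.sources:  # one call per source system, every time
            view.update(source.get(customer_id, {}))
        return view


class OperationalDataLayer:
    """Approach 2: copy the critical data into one store ahead of time."""

    def __init__(self):
        self.store = {}

    def load(self, sources):
        # Upfront work: combine records from all sources by customer id.
        for source in sources:
            for customer_id, record in source.items():
                self.store.setdefault(customer_id, {}).update(record)

    def get_customer(self, customer_id):
        # A single local lookup; the source systems are not touched.
        return dict(self.store.get(customer_id, {}))


broker = Broker([CRM, BILLING])
odl = OperationalDataLayer()
odl.load([CRM, BILLING])
```

Both objects return the same combined record for `"c1"`. What differs is where the work happens: the broker touches every source system on each request, while the data layer pays that cost once during `load`. The sketch leaves out how the combined store would be kept up to date as the sources change.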
Combining needed data into one place certainly takes more effort upfront. It does, however, avoid several possible pitfalls in areas such as:
- Latency and user experience – Accessing data directly on many source systems can have unpredictable latency, as each source system is doing its own thing; this may be OK for analytical workloads, but is unlikely to suffice for real-time applications. Combining data into an Operational Data Layer (ODL) provides low and predictable latency.
- Performance and scalability – New demands will put additional load on source systems, which can have negative implications both for the performance for existing workloads and the cost and scaling requirements of those systems. An ODL can handle new workloads while source systems remain unburdened; over time, existing consumers can even transition to the ODL.
- Single source of truth – If you have data duplication or multiple sources of truth, this is an opportunity to deduplicate and combine, producing a useful single view. Similarly, if you need to aggregate data that originally lived in multiple source systems, doing it in a single Operational Data Layer is likely to be easier than pulling it up from all the source systems and doing the aggregation in the broker layer or in the application.
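The deduplication step behind a single view can be sketched as follows. This is an illustration under stated assumptions, not a recommended matching strategy: it assumes every record carries an `email` field that can serve as the matching key, and the function names, field names, and sample records are all hypothetical. Real reconciliation rules are usually more involved.

```python
def normalise_email(email):
    """Treat emails that differ only in case or padding as the same key."""
    return email.strip().lower()


def build_single_view(*sources):
    """Merge records from several sources, keyed on normalised email."""
    merged = {}
    for records in sources:
        for record in records:
            key = normalise_email(record["email"])
            target = merged.setdefault(key, {"email": key})
            for field_name, value in record.items():
                if field_name == "email":
                    continue
                # Keep the first non-empty value seen; later sources fill gaps only.
                if value and not target.get(field_name):
                    target[field_name] = value
    return merged


# Two hypothetical systems holding overlapping data about one customer.
crm = [{"email": "Ada@Example.com ", "name": "Ada Lopez", "phone": ""}]
support = [{"email": "ada@example.com", "name": "A. Lopez", "phone": "555-0100"}]

single_view = build_single_view(crm, support)
```

Here the two records collapse into one entry: the name comes from the first source, and the phone number is filled in from the second because the first left it blank. Which source wins on a conflict is a policy decision; "first non-empty value" is only the simplest rule to show.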
A best practice approach
The data itself is only the start. As important, if not more so, is the process side. When you’re doing the upfront work, you need to carefully identify both data producers and data consumers to make sure they can all hook successfully into your system, and develop a plan for merging and reconciling data as needed.
To facilitate this, it’s a good idea to identify data “stewards” – the person or people who know a given data source and can ensure that it’s made available in an accurate and useful way. Then there’s the process of exposing that data, implementing new ways of working that rely on consuming Data-as-a-Service instead of the customary methods, and of course applying appropriate security and permissions such that only the right people and systems can access the right data.
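As one illustration of "the right people and systems can access the right data", the sketch below filters a record down to the fields a given role may read. The roles, field names, and the `POLICY` table are assumptions invented for this example; a real deployment would more likely rely on the access controls of its data platform than on application code like this.

```python
# Hypothetical policy: which fields each role may read.
POLICY = {
    "support_rep": {"name", "email", "open_tickets"},
    "analyst": {"region", "lifetime_value"},
    "partner": {"name"},
}


def read_customer(record, role):
    """Return only the fields the role is allowed to see."""
    allowed = POLICY.get(role)
    if allowed is None:
        # Unknown roles get nothing rather than everything.
        raise PermissionError(f"role {role!r} has no access")
    return {key: value for key, value in record.items() if key in allowed}


record = {
    "name": "Ada Lopez",
    "email": "ada@example.com",
    "region": "EMEA",
    "lifetime_value": 1200,
    "open_tickets": 2,
}
```

Calling `read_customer(record, "partner")` returns only the name, while an analyst sees the region and lifetime value but no contact details. Denying access by default for unlisted roles is the design choice worth keeping, whatever mechanism enforces it.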
Another best practice: start small. Given the probable tangle of data sources and systems, it’s tempting to try to solve all problems in one fell swoop: just audit all your organisation’s data and make it all available in round one. Sometimes that’s possible, but such projects usually don’t get out of the planning stage. It’s generally better to start with just a few data producers and consumers and improve the way they work together. Build some credibility for the project with early wins – delivering tangible business value – then incrementally add more producers and consumers to the system to deliver additional use cases.
Successfully delivering Data-as-a-Service isn’t easy. The worse your data estate looks, the harder it will be – but also, the bigger the payoff. The Library of Congress was founded in 1800 and continues to add millions of new data sources every year: not just books, but all sorts of different formats and structures. Your time horizon probably isn’t two centuries, but an iterative, always-improving approach can make working with enterprise data easier and easier. There’s a lot to learn from libraries, whether you look to the largest one in the world or your local branch.