We need a real semantic layer - but something is missing
- Semantic layers have been around for a long time - but with underwhelming results. Is the concept outdated? No, but something is missing. In the big data era, we need to raise our data abstraction game.
There is an old urban myth - and some urban myths are true - that a strategy session is planned in an organization to decide on the next steps for some business issue. The various participants file into a conference room with their laptops and handouts.
One after another, they give their presentations, all very data-heavy and all in conflict with the others. The executive in charge is baffled that no one can come up with the same data. Is the problem data quality? Maybe. Is it the analysis methods? Perhaps. Is it a lack of common semantics? No question about it.
What’s semantics got to do with it?
Today, there is a remake of the term “semantic layer.” Some thirty years ago, BI tools used the term to describe a mapping of operational data to an “OLAP cube,” a multidimensional structure compatible with the product's graphical query capability. The semantic mapping was proprietary: it worked only with that vendor's software product. It came in two forms: an actual population of a cube, and a virtual cube that did not move the data, but employed generated SQL.
The semantic layer also had a broader definition, derived from the Semantic Web, that attempted to map everything on the web using ontologies coded in RDF. It was a bold and expansive idea, but the effort to populate the ontologies was beyond anyone's capabilities. Nevertheless, it made two interesting contributions: 1) It raised the idea of a semantic layer that stretched beyond a single proprietary software product, and 2) It introduced the idea of a graph, capable of assigning fully attributed qualities to data. These were two giant steps beyond existing definitions of metadata, which provided little helpful information for analysts.
Picking up on the broader definition of the semantic layer, some vendors have developed the capability to map multiple data sources to an abstraction of “business terms” that any number of software tools can access. In other words, arcane attribute names in the source systems are translated to “friendly” terms, and analytics, visualization and BI tools can access the abstraction as if it were the actual data. Unlike ETL, no data is moved. This is a giant step forward. But something is missing.
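As a rough illustration of this kind of mapping, here is a minimal Python sketch. It is not any vendor's implementation: the table name `crm.accounts`, the column `ORD_TOT_AMT`, and the `TermMapping` and `build_query` names are all made up for the example. It shows only the basic idea that a tool asks for business terms and the layer generates SQL against the source, so no data is copied.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TermMapping:
    """Links one friendly business term to one physical column."""
    business_term: str
    source_table: str
    source_column: str


# Hypothetical mappings; the table and the second column are invented.
MAPPINGS = {
    "Customer": TermMapping("Customer", "crm.accounts", "X_ZF_21_SU"),
    "Order Total": TermMapping("Order Total", "crm.accounts", "ORD_TOT_AMT"),
}


def build_query(terms: list[str]) -> str:
    """Translate business terms into a SELECT against the source table."""
    mapped = [MAPPINGS[term] for term in terms]
    tables = {m.source_table for m in mapped}
    if len(tables) != 1:
        raise ValueError("this sketch only handles terms from a single table")
    columns = ", ".join(
        f'{m.source_column} AS "{m.business_term}"' for m in mapped
    )
    return f"SELECT {columns} FROM {tables.pop()}"


print(build_query(["Customer", "Order Total"]))
```

The consuming tool never sees X_ZF_21_SU, only "Customer". What the sketch does not capture is what "Customer" actually means.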
A real semantic layer - what's missing?
Mapping X_ZF_21_SU to Customer may be useful, but what is a customer? This is where semantic layer technologies are weak. Sales, Finance, Marketing, and Product Marketing all attach different meanings to the word. There are past customers, upsell customers, churning customers and a slew of other types of customers. A semantic layer should carry deep semantic descriptions, input from subject-matter experts (SMEs), and so on. The semantic layers I see today are still technical and data-centric.
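To make the point concrete, here is a small Python sketch of a business term that carries a description, a steward and a rule for each context. Everything in it is an assumption made for illustration: the three rules, the `Account` fields, the steward labels and the sample dates are invented, not drawn from any real product or organization.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Account:
    name: str
    last_purchase: Optional[date]
    open_invoices: int
    on_mailing_list: bool


@dataclass(frozen=True)
class Definition:
    description: str                       # plain-language meaning
    steward: str                           # SME who owns the definition
    rule: Callable[[Account, date], bool]  # executable form of the meaning


def _bought_within_year(account: Account, as_of: date) -> bool:
    if account.last_purchase is None:
        return False
    return (as_of - account.last_purchase).days <= 365


# Hypothetical, deliberately conflicting definitions of "customer".
CUSTOMER: Dict[str, Definition] = {
    "Sales": Definition(
        "Bought something in the last 365 days", "sales SME",
        _bought_within_year),
    "Finance": Definition(
        "Has at least one open invoice", "finance SME",
        lambda account, as_of: account.open_invoices > 0),
    "Marketing": Definition(
        "Is on the mailing list", "marketing SME",
        lambda account, as_of: account.on_mailing_list),
}


def is_customer(account: Account, context: str, as_of: date) -> bool:
    """Answer 'is this a customer?' for one department's definition."""
    return CUSTOMER[context].rule(account, as_of)


acme = Account("Acme", date(2020, 1, 15), 0, True)
as_of = date(2022, 6, 30)
for context, definition in CUSTOMER.items():
    answer = is_customer(acme, context, as_of)
    print(f"{context}: {answer} ({definition.description})")
```

With these made-up rules, the same account is a customer for Marketing but not for Sales or Finance, which is exactly the disagreement a column rename cannot express.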
Semantics is an attempt to acknowledge reality in all of its complexity. The word “reality” has a nice, crisp sound to it. “Let’s deal with reality.” “The reality of the situation is...,” and “Take a reality check.” Unfortunately, reality is not as solid as we would like it to be, even if it is persistent. People frequently see truth differently. Things change over time, becoming different versions of themselves or different things entirely. A supplier can become a customer, creating conflict, opportunity, or both. Complex relationships between people and the contexts in which they operate can result. Creating a crisp representation of such a tangled reality is devilishly tricky.
This was a weakness in data warehousing and BI approaches. The methodologies for designing these structures were like pouring concrete: wonderfully fluid while being poured, but needing a jackhammer to modify once set. Because the designs were based on managing from scarcity (never enough resources for an optimal solution), persistence and careful design were needed so as not to overwhelm the resources.
But things can adopt new, even opposite attributes, yet retain their original character, sometimes bordering on the contradictory or the absurd. Is a Toyota Prius a gas-powered vehicle or an electric vehicle? Is an SUV a light truck or a passenger car? Is Buffy the Vampire Slayer sci-fi or comedy? Does the world make sense, or do people make sense of the world? Sometimes, when trying to make sense of reality, we err on the side of precision.
Consider the Mars Climate Orbiter, a $125 million project that crashed and burned in the Martian atmosphere because one engineering team used metric units, while another used English units for a key spacecraft operation. This was a relatively simple but catastrophic mistake. When you consider the hundreds or thousands of different attributes in a data lake, the risk of semantic confusion is extreme. It is widely reported (though not precisely confirmed) that data scientists spend 80% of their time preparing data for their models. A large part of the effort, regardless of its relative magnitude, is figuring out what all the data means.
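The unit problem can be sketched in a few lines of Python. In the Orbiter case the mismatch is generally described as impulse reported in pound-force seconds but read as newton-seconds. The `Impulse` class and the sample value of 10.0 below are hypothetical, chosen only to show what changes when the unit travels with the number; the conversion factor is the standard one for pound-force to newtons.

```python
from dataclasses import dataclass

# 1 pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """A number that carries its unit with it (hypothetical helper)."""
    value: float
    unit: str  # "N*s" or "lbf*s"

    def to_newton_seconds(self) -> float:
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown unit: {self.unit!r}")


# Illustrative value only, not real mission data.
reported = Impulse(10.0, "lbf*s")

naive = reported.value                  # bare number read as if it were N*s
correct = reported.to_newton_seconds()  # unit-aware conversion

print(f"naive={naive} N*s, correct={correct:.2f} N*s")
```

A bare column of numbers invites the naive reading. Attaching the unit, which is the simplest possible piece of semantics, lets the software either convert the value or refuse it.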
Making sense of used data - what's the solution?
Remember that analytics is essentially performed on “used” data, that is, data captured and stored for some purpose other than the business at hand - sales data, human resource data, maintenance data, etc. Historically, this “used” data was mostly hand-me-downs within the organization, something we now call “operational exhaust.” But early data warehouse practitioners learned a painful lesson: even internal, structured data can be challenging to understand and work with when all you have is technical metadata (table name, column name, datatype, etc.). Now big data and data science bring all sorts of external data into the mix, often referred to as "digital exhaust."
“The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise.” - Edsger W. Dijkstra
The solution is some form of abstraction, a way for consumers of data to understand the data resources available to them while being insulated from the physical complexity of it all. Abstraction is applied routinely to systems that are to some degree complex, especially when they are subject to frequent change. A 2022 model car contains more MIPS of computer processing than most computers did only a decade ago. Even under extreme conditions, driving the vehicle is a perfect example of abstraction. Stepping on the gas doesn’t pump gas to the engine. It alerts the engine management system to increase speed by sampling and alerting dozens of circuits, relays and devices to achieve the desired effect subject to many constraints, such as limiting engine speed and watching the fuel-air mixture for maximum economy or lowest emissions.
If the driver needed to attend to all of these things directly, he would not get out of the driveway.
For example, a 1971 Audi had virtually no electronics at all. A 2022 Audi S8 practically drives (and stops) itself. It’s claimed that the S8 has 2 teraflops of computing power. That’s two million million (2 × 10^12) floating-point operations per second. To put that in perspective, Los Alamos Labs was just developing a teraflop supercomputer in 1996 to simulate the effect of nuclear weapon explosions.
Today, working with big data is still a lot like driving a 1971 Audi. You have to do everything yourself. But it will quickly (much faster than 50 years!) resemble driving a 2022 Audi S8. How quickly? 2-3 years. But today, big data relies on at least some business users understanding the location and naming conventions (in the best cases) and semantics of the data, if not the intricacies of crafting queries. This is a considerable barrier to progress.
Business people need to define their work on their terms. A business modeling environment is required for designing and maintaining structures. It is essential to have business modeling for the inevitable changes in those structures. It is likewise essential to leverage those structures' latent value through analytical work that is enhanced by understandable models that are relevant and useful to business people.
Semantic layers over data give you the understanding that truth is relative and fleeting and that well-formulated contexts can be powerful without being perfectly clear. Obviously, for regulatory reporting, launching a Mars probe or making a soufflé, precision is required. But rapid decision-making with incomplete and imperfect information is the hallmark of intellect. Any fool can make decisions with all the information in front of him - and many do.
Semantic abstraction benefits analytics by moving information integration to a new level, where intelligence can more easily and swiftly proliferate throughout an organization. Productivity will increase because machines will make everyone less reliant on a small number of go-to super-users and help us get beyond today's rigid, often brittle schemas and thin metadata.
Physical implementation decisions always belong to the technologists. However, those most familiar with business objectives, models and processes will ultimately control information resources. Analytics professionals can't afford to hang back. Semantics are the future.