The eight layers that make up the complete enterprise AI stack

Phil Wainewright, March 25, 2024
Enterprise use cases for generative AI are becoming far more sophisticated than at first thought. The engineering required spans these eight separate layers.

AI engineering concept drawing with people arranging robot in mobile phone, brain in lightbulb

Much of the early discussion about generative AI in the enterprise was focused on the choice of foundational Large Language Models (LLMs). But after a year of seeing how enterprise vendors have harnessed this new form of AI, it’s become obvious that there’s a lot more to it than the underlying models. LLMs form a part of just one layer in the entire stack that’s required to apply generative AI effectively and safely to enterprise use cases.

These use cases not only need far greater accuracy and governance than the purposes to which individual consumers put ChatGPT — they are becoming far more sophisticated, too. Yes, with the right guardrails in place, generative AI has enormous value as an alternative to traditional enterprise search, for summarizing information, and for drafting and editing text and images. But enterprise application vendors are able to take it to another level, because the technology can also learn how the application itself works, and therefore it can be configured to execute actions on behalf of users. This is opening up new use cases for low-code and no-code automation at scale, provided the right infrastructure is put in place. By my estimate, there are eight layers to take into account, and each of them requires careful engineering. Here's a rundown.

Conversation layer

Let's start at the point where most users will encounter AI — although if it's working well, the user may not even be aware that they're dealing with AI; they're just getting something done. At its simplest, the conversation layer is a chatbot, or if it's a bit more sophisticated, an intelligent agent, often called an AI assistant. The user requests some information or an action, and the agent may respond with questions to clarify the request before then delivering the result. This back-and-forth conversation replaces the old ways of interacting with enterprise applications through buttons, menus and forms. Instead of the user having to know all the various idiosyncrasies of how each application works, the AI takes on this burden and the user simply has to describe the outcome they want. Obviously this makes it a lot easier for users, who need far less training to be able to get a result out of the system — although they still need relevant expertise to be able to evaluate the information they're receiving and decide upon the best outcome. This conversational interface also means they rarely need to use the keyboard and mouse, so they can get things done while on the move through a mobile phone or voice interface.

All of this of course presumes that the AI can be relied upon to properly understand the user's intent and then pull all the right levers in the underlying system. This is where all the remaining layers come into play, which we'll come to in a moment. But while we're still thinking about the conversation layer, there's another important consideration that should weigh on the minds of IT decision makers.

In the past, the buttons, menus and forms through which people interacted with enterprise applications were specific to each separate application. But in recent years, we've seen the emergence of new ways to access the underlying information and functions through messaging and automation apps such as Slack and Teams, bringing these into the user's workflow rather than requiring them to switch from one application to another. I've long argued that enterprise IT leaders need to take a strategic view of these apps as the backbone of a Collaborative Canvas for enterprise teamwork. This now becomes an even more pressing issue.

At the moment, each enterprise application vendor is developing their own conversational AI interface, but it hardly makes sense for users to have to switch from one AI agent to another to get their work done. It will be much simpler for them to work with a single AI assistant, which may vary according to their role — and in most cases, it's going to be up to the enterprise IT team to take a view on which ones to recommend. They need to choose carefully, because the vendors that end up owning the dominant AI assistants will have a very powerful market position.

Prompt layer

There's been a lot of discussion of the concept of prompt engineering in relation to generative AI, with users of ChatGPT and other consumer services needing to write fairly sophisticated instructions to get the results they want out of these all-purpose LLMs. But enterprise vendors have realized that, in a business environment, it's much better to build the prompt engineering into their AI assistants, rather than depend on users first having to learn how to become skilled prompt engineers themselves. More importantly, it ensures that the prompts, or instructions, that go to the LLM are precisely grounded in the relevant business context, helping to produce an accurate and verifiable result.

This grounding process uses a technique known as Retrieval Augmented Generation (RAG). Instead of simply sending a question or an instruction to the LLM, a RAG system first retrieves information that's relevant to the user's question and then sends that along to the LLM as part of the prompt. The LLM creates its response drawing on that information and can include citations of the source material as part of its answer.
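The RAG pattern described above can be sketched in a few lines. This is a deliberately minimal illustration, not any vendor's actual implementation: the keyword-overlap retrieval, document store and function names are all invented for the example, and a real system would use a vector store and an actual LLM call in place of these stand-ins.

```python
def retrieve(question: str, documents: dict[str, str], top_k: int = 2) -> list[tuple[str, str]]:
    """Rank documents by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(question: str, documents: dict[str, str]) -> str:
    """Ground the prompt in retrieved context, with source citations."""
    context = retrieve(question, documents)
    cited = "\n".join(f"[{doc_id}] {text}" for doc_id, text in context)
    return (
        "Answer using only the sources below, citing them by id.\n"
        f"Sources:\n{cited}\n"
        f"Question: {question}"
    )

docs = {
    "policy-7": "Refunds are issued within 14 days of purchase.",
    "memo-3": "The quarterly sales kickoff is in March.",
}
prompt = build_prompt("How many days do refunds take?", docs)
```

The point of the sketch is the shape of the flow: retrieve first, then send the retrieved material to the model as part of the prompt, so the answer can cite its sources.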

There's a lot going on in the prompt layer that vendors are building. For example, they have to be able to drive the back-and-forth conversation with the user where the user's intention is not initially clear. In this case, the system has to be able to recognize whether it has enough grounding to provide a meaningful response, and if not it then has to figure out what questions it needs to ask to get the instructions it needs. The user may then go on to ask further questions, and the system needs to be able to recall its previous answers so that it can continue the conversation in context. Often, a system may draw on more than one language model and must therefore decide which models to deploy, depending on what the user is asking for.
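Two of those prompt-layer duties — recalling prior turns and asking for clarification when grounding is insufficient — can be illustrated with a toy conversation manager. The class, its messages and the "empty grounding means ask a question" rule are all invented for the sketch; real systems use far more nuanced checks.

```python
class Conversation:
    """Toy prompt-layer state: keeps history, asks for clarification."""

    def __init__(self):
        self.history: list[tuple[str, str]] = []  # (role, text) turns

    def handle(self, user_message: str, grounding: list[str]) -> str:
        self.history.append(("user", user_message))
        if not grounding:
            # Not enough context to answer safely: ask, don't guess.
            reply = "Which account or record are you asking about?"
        else:
            reply = f"Based on {len(grounding)} source(s): ..."
        self.history.append(("assistant", reply))
        return reply

convo = Conversation()
first = convo.handle("Update the renewal date", grounding=[])
second = convo.handle("The Acme account", grounding=["crm:acme"])
```

Keeping the full history in one place is what lets the follow-up turn be interpreted in the context of the first request.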

Graph layer

Many vendors when talking about their prompt engineering have spoken about their use of an existing graph database to help provide the contextual grounding for their prompts. The graph database maps the objects in a set of business processes, such as users, customers, tasks, documents, goals and so on, and the relationships between them. Collaboration vendors, for example, have each had their own graph database for several years and have noted over the past year how useful the mappings within them have been when building instructions for generative AI. Now vendors in other fields are creating their own graph databases to help create a framework that helps LLMs make sense of business data. For example, SAP is building a Knowledge Graph, which as Philipp Herzig, its Chief AI Officer, explains:

We will take the entire master data — so basically, the Data Dictionary of S/4, and put it into a comprehensive Knowledge Graph, and that gives us even more capabilities to ask questions and to reason about the business data, so to speak. Because now you understand, not only the data, but you have relationships between all the entities.

Meanwhile Salesforce is building on its longstanding metadata model by adding the ability to create data graphs to map relationships between data points as part of its AI automation stack. It seems that, one way or another, a graph layer is becoming an essential ingredient in helping AI make sense of enterprise data.
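To make the graph layer concrete, here is a toy graph of business objects of the kind described above — users, customers, tasks, documents — with a traversal that collects related entities to ground a prompt. The entities, relationship names and adjacency structure are invented for illustration; real knowledge graphs live in dedicated graph databases, not a Python dictionary.

```python
# (source, relationship) -> list of target entities
edges = {
    ("customer:acme", "has_open_task"): ["task:renewal-42"],
    ("task:renewal-42", "assigned_to"): ["user:jsmith"],
    ("task:renewal-42", "references"): ["doc:contract-2024"],
}

def neighbours(entity: str) -> dict[str, list[str]]:
    """Return every relationship leaving an entity."""
    return {rel: targets for (src, rel), targets in edges.items() if src == entity}

def context_for(entity: str, depth: int = 2) -> set[str]:
    """Collect related entities up to `depth` hops, to ground a prompt."""
    seen, frontier = {entity}, [entity]
    for _ in range(depth):
        frontier = [t for e in frontier for ts in neighbours(e).values() for t in ts]
        seen.update(frontier)
    return seen
```

The value the vendors are describing is exactly this: the relationships, not just the records, are what let an LLM reason about which data belongs in the answer.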

Data layer

Next comes the data itself, which because of the legacy of separate enterprise applications, is scattered across multiple incompatible data stores, while much more is locked away in unstructured document repositories, message threads, video and audio recordings, as well as web analytics, log files and IoT data streams. The pressure to break data out of these function-specific silos has been building for some time, and I've written about the trend towards a Tierless Architecture in which data becomes readily accessible to any function. Now there's a new imperative to bring all of these different sources and formats together so that they can be made available to AI — not necessarily as training data, but increasingly as source material for grounding prompts.

This is the thinking behind initiatives such as Salesforce Data Cloud, which aims to provide a unified data platform together with tooling that can power AI prompts and insights for users based on the full range of data throughout the enterprise. Breaking data out of the old silos and making all of it more easily accessible on demand is another pre-requisite for a successful enterprise AI stack.

Trust layer

Every enterprise vendor will tell you that they have a market-leading proposition on AI trust, ethics, security and privacy. They know that adoption will stall if they don't get these things right, and therefore they want you to know how seriously they take this. The level of commitment to this essential component is reassuring, but in truth it shouldn't take much to get the basics right. One of the most important elements is to ensure that each individual user's permissions are carried through into the AI stack, so that no one is able to see data they're not authorized to access. That permissions infrastructure should already be robust, so it's just a matter of carrying it through correctly.
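Carrying permissions through means filtering at retrieval time, before anything reaches the prompt. A minimal sketch, with an invented ACL structure purely for illustration:

```python
# Document -> set of groups allowed to see it (illustrative data).
acl = {
    "doc:salary-bands": {"hr-team"},
    "doc:office-map": {"hr-team", "all-staff"},
}

def visible_docs(user_groups: set[str]) -> list[str]:
    """Only documents shared with at least one of the user's groups.

    Applied before retrieval, so the prompt is never grounded in
    records the requesting user is not authorized to access.
    """
    return [doc for doc, groups in acl.items() if groups & user_groups]
```

The design point is where the check happens: filtering the candidate documents before they are retrieved, rather than trying to censor the model's answer afterwards.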

Care needs to be taken when training LLMs to ensure that confidential data isn't shared, but most LLMs are trained on public data anyway, and should have been carefully vetted from an ethical standpoint prior to adoption. By the time they are being prompted with business data, the training is long past and they are simply generating answers without retaining anything. Other necessary precautions include masking of Personally Identifiable Information (PII), and checking responses to remove potential toxicity such as inappropriate language, content or profiling.
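PII masking of the kind mentioned above is often a substitution pass over text before it reaches the model. The two regexes below are deliberately simple illustrations; production systems use dedicated PII-detection services that handle far more formats and languages.

```python
import re

# Illustrative patterns only: real email and phone formats vary widely.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

The same shape — detect, replace with a token, pass the masked text onward — applies whether the masking runs on prompts going in or on responses coming back.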

Model layer

Finally we come to the underlying LLMs themselves. It's becoming clear from speaking to vendors that this layer is typically going to contain several different models of varying types, and extends beyond LLMs to include other models such as those used for predictive analytics. Salesforce, for example, allows customers to link to external LLMs such as those from OpenAI and Anthropic, in addition to its own internally developed models. Enterprises can also create their own predictive models on external platforms such as Amazon SageMaker, Google Vertex AI and Databricks, and train these models on their own data held in Salesforce.

LLMs themselves vary enormously in size. The largest general-purpose LLMs contain hundreds of billions of parameters that help them process information. GPT-4, the model that powers ChatGPT, is reportedly based on eight different LLMs, each with around 220 billion parameters, for a total of nearly 1.8 trillion parameters. Enterprise LLMs can be smaller, because they can target the specific knowledge domains of the enterprise application, rather than having to be ready to answer any question on earth. This has the further advantage of making them much less costly to build, train and operate.

For example, when we spoke to Ramprakash Ramamoorthy, AI Research Lead at Zoho, he told us that a 50-billion parameter model would be sufficient to serve a finance application, saying that finance users "don't need to ask the model how to bake a cake." This brings us into the realms of Medium Language Models and Small Language Models, with more function-specific models for tasks such as machine translation being smaller again. Ramamoorthy says:

We have built foundational models that are, let's say, a 3 billion [parameter] model, a 5 billion model, a 7 billion model and a 20 billion parameter model. Then what we do is take this foundational model, and fine tune it to either a domain or a task.

So for example, we fine tune it to the finance domain, or we fine tune it to a legal domain, or we fine tune it to something like question answering, or we fine tune it to something like document similarity prediction. So it's either fine-tuned on a domain or a task.

There's clearly a lot of expertise required to set up and operate the model layer, with particular skills required to decide on the right mix of models and then link them together in the most effective way.
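One part of that expertise is routing each request to the right model from a catalogue of domain- and task-tuned options, along the lines Ramamoorthy describes. The catalogue below is entirely invented — the model names and sizes are not Zoho's or anyone's actual line-up — but it shows the shape of the decision: prefer a domain specialist, fall back to a task generalist, then to a general model.

```python
# Invented catalogue: (domain, task) -> model identifier.
MODELS = {
    ("finance", "qa"): "finance-qa-7b",
    ("legal", "qa"): "legal-qa-20b",
    ("any", "translation"): "translate-3b",
}

def pick_model(domain: str, task: str) -> str:
    """Prefer a domain-specific model, then a task generalist."""
    return MODELS.get((domain, task)) or MODELS.get(("any", task), "general-20b")
```

A routing table like this is also where the cost argument lands: the 7-billion-parameter finance model handles finance questions so the expensive general model is only reached as a last resort.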

API and runtime layer

So you've created a prompt that's grounded in your enterprise data, and the model has returned a suggested action which the user approves. To be able to fetch all that data and then execute the desired action, the AI needs to be connected into the underlying system resources. This is where the API and runtime layer comes in. It may well exist already, because the drive to create an API-first infrastructure that can connect to internal and external resources, call functions and execute actions has been ongoing for several years, as part of the move towards no-code application building. It's a core component of the Tierless Architecture of composable IT that I mentioned earlier. But in many enterprises this layer of infrastructure is still a work-in-progress. It's now become even more of a priority to complete the job, to enable the advanced automation that these AI innovations are making possible.
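The runtime step described above — the model proposes an action as structured data, and a dispatcher maps it onto a registered API call only after the user approves — can be sketched as follows. The registry, the action format and the `create_invoice` function are all hypothetical, standing in for whatever APIs the enterprise exposes.

```python
def create_invoice(customer: str, amount: float) -> str:
    """Stand-in for a real business API call."""
    return f"invoice for {customer}: {amount:.2f}"

# Only explicitly registered functions can ever be invoked by the AI.
REGISTRY = {"create_invoice": create_invoice}

def execute(action: dict, approved: bool) -> str:
    """Dispatch a model-proposed action, gated on user approval."""
    if not approved:
        return "action cancelled"
    fn = REGISTRY.get(action["name"])
    if fn is None:
        raise ValueError(f"unknown action: {action['name']}")
    return fn(**action["args"])
```

The registry is the key design choice: the model can only suggest actions, and only names that have been deliberately exposed can be executed, which is what makes AI-driven automation safe to connect to real systems.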

Audit layer

One consequence of the reluctance to train LLMs on customer data is that there needs to be an alternative way of feeding back issues where fine tuning may be needed. During the many beta processes that vendors are going through at the moment, we're seeing their AI teams keep a close eye on how their AI assistants are performing and working closely with LLM providers to improve results. In doing so, they're building up experience in how best to monitor and audit AI systems.

This final layer is crucial to make sure things don't run out of control. Effective prompt engineering hugely reduces the likelihood of the AI producing incorrect answers or making them up, but there's always the risk of outdated information getting picked up in error or that the model fills in some missing detail with a hallucination. The trust layer may need fine-tuning to maintain the effectiveness of data masking or toxicity filters. Increasing automation will reach a point where it's simply impractical for a human to individually approve every AI action, and humans make mistakes anyway — it's very easy to miss an error when most of the time the system gets it right.

While vendors pay a lot of lip service to the concept of keeping a 'human in the loop', the direction of travel is towards automation on a scale where human operators really can't be left carrying the can for system failures. Human supervision needs to come in elsewhere, for example building in automated alerts to detect negative feedback or pushback from users as a signal that errors have started creeping in. Vendors aren't talking a lot about this layer at the moment — perhaps because they don't want to draw attention to the possibility that their tech isn't perfect yet — but building in heightened awareness and sensitivity for errors and the ability to act on them quickly will be another essential ingredient in winning the trust of customers.
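An automated alert of the kind suggested above can be as simple as watching the rolling rate of negative user feedback and flagging when it drifts upward. The window size and threshold here are arbitrary illustrations; a real monitor would also track which assistant, model and prompt version each rating belongs to.

```python
from collections import deque

class FeedbackMonitor:
    """Flag when the rolling rate of negative ratings exceeds a threshold."""

    def __init__(self, window: int = 50, threshold: float = 0.2):
        self.recent = deque(maxlen=window)  # most recent ratings only
        self.threshold = threshold

    def record(self, negative: bool) -> bool:
        """Record one rating; return True if an alert should fire."""
        self.recent.append(negative)
        rate = sum(self.recent) / len(self.recent)
        return rate > self.threshold
```

Signals like this don't replace human supervision; they tell the humans where to look once individually approving every action has become impractical.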
