Pure Storage and NVIDIA collaborate on RAG to customize generative AI and to help eliminate hallucinations

Derek du Preez, March 27, 2024
Pure Storage talks through how enterprises are thinking about training generative AI models in the future, via a variety of approaches - but particularly RAG.

(Image: electronic brain with connection line © Kittipong Jirasukhanont - Canva.com)

For much of 2023, the rise of generative AI product releases - in particular those that use Large Language Models (LLMs) - was coupled with a small disclaimer from vendors that there was, in fact, still room for improvement. The tacit implication was that LLMs were often too vague for enterprises to find them genuinely useful at scale - and, of course, there is the ongoing problem of hallucinations. Vendors proposed that in the future, as the technology improved, organizations would be able to reduce the impact of false information being presented, whilst also being able to customize the models using their own data. 

And so it was with interest that an announcement was made last week, at NVIDIA’s GTC conference, that the chip-maker, which has experienced huge demand in the wake of the AI market uptake, is working with Pure Storage on the development of Retrieval-Augmented Generation (RAG) - one of the approaches used to achieve this customization. NVIDIA is working with a huge variety of partners and there are varying solutions being developed, but I sat down with Pure Storage’s Global Practice Leader for Analytics and AI, Miroslav Klivansky, to understand the significance of RAG and how it fits more broadly into the LLM customization story for enterprise buyers. 

In terms of what was announced, specifically, NVIDIA and Pure Storage said that they are working on the following: 

  • Retrieval-Augmented Generation (RAG) Pipeline for AI Inference - to help improve the accuracy, currency, and relevance of inference capabilities for large language models (LLMs), Pure Storage has created a RAG pipeline leveraging NVIDIA NeMo Retriever microservices, NVIDIA GPUs, and Pure Storage all-flash enterprise storage. 

  • Vertical RAG Development - Pure Storage is also creating vertical-specific RAGs in collaboration with NVIDIA. First, Pure Storage has created a financial services RAG solution to summarize and query datasets with higher accuracy than off-the-shelf LLMs. Additional RAGs for healthcare and the public sector are to be released.

Getting more specific

As noted above, the LLMs that we see today, such as those provided by OpenAI or Meta, are often trained on public data (typically scraped from the Internet) and are foundational models that embody some sort of base knowledge to which ‘reasoning’ can be applied. The models improve as people interact with them and fine-tuning is introduced, but anyone who has used ChatGPT and its ilk will soon recognize that more often than not they aren’t there for deep domain knowledge. For instance, whilst ChatGPT could do a fairly decent job of summarizing a conversation based on a transcription I plug into it, perhaps even providing some broad technological context, it wouldn’t be able to write an analysis of the conversation, based on previous topics covered on this specific site, in the style of diginomica, in the same way that one of our authors with 20 years of experience could (as of yet, anyway…). 

As Klivansky notes, these foundational models are the beginning, and we can build on them from there: 

That’s the starting point. The way that I visualize a model is essentially as this matrix with different nodes and weights, in the same way that our brain has neurons and axons and so on. The weights are what defines the instantiation of this model. There are three different ways that people have typically tried to customize it, by domain specific information. 

As noted, one of the ways to customize is through the use of RAG, which is the area where Pure is prioritizing its work with NVIDIA. Essentially, RAG allows enterprises to improve LLMs with the addition of more proprietary data sources, making them more relevant to users in specific scenarios. As Klivansky explains:

With RAG, you're essentially building up a library or a database of this knowledge. That database is indexed in such a way that if you give it a concept, it can pull other concepts that are related to it out of that database. So instead of just searching for keywords, it could actually do what people call semantic search, or contextual search. 

The way that RAG works is it'll pull these contextually meaningful things, it'll take your question or your request, and try to understand what might be the core concepts that you're getting at or looking for. It will then go into that RAG database and pull out entries that are related to your request. 

And then it will stick those entries into the model context for your question. Part of the problem with these foundation models is they're not very precise, and they don't necessarily know what pieces of their embedded knowledge are most meaningful. So by pulling this stuff out of the RAG database, and sticking it into the immediate context, it's essentially saying: ‘Hey, serve this request, but keep all this other stuff in mind because it's probably relevant to what this person is asking.’
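Klivansky’s description maps onto a short loop: index documents by embedding, embed the incoming request, pull the most related entries, and prepend them to the model’s context. The sketch below is illustrative only - the bag-of-words “embedding” is a toy stand-in for a real embedding model (such as one served via NeMo Retriever microservices), and the document texts are hypothetical.

```python
# Minimal retrieve-then-augment loop. The toy embed() below stands in
# for a real embedding model; everything else follows the pattern
# Klivansky describes: semantic lookup, then context injection.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "semantic" embedding: lower-cased word counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# 1. Build the "library": index each document by its embedding.
documents = [
    "FlashBlade is Pure Storage's all-flash platform for unstructured data.",
    "Retrieval-Augmented Generation pulls indexed context into the prompt.",
    "NVIDIA GPUs accelerate both training and inference workloads.",
]
index = [(embed(d), d) for d in documents]

def retrieve(question: str, k: int = 2) -> list:
    # 2. Embed the question and pull the k most related entries.
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

def build_prompt(question: str) -> str:
    # 3. Stick the retrieved entries into the model's immediate context.
    context = "\n".join(retrieve(question))
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("How does Retrieval-Augmented Generation work?")
```

In a production pipeline the index would live in a vector database and the prompt would be handed to an LLM; the structure of the loop is the same.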

Klivansky compares the use of LLMs versus LLMs with RAG to taking a test in either a closed-book or open-book scenario. The use of RAGs to support LLMs is the equivalent of an open-book test: you have a broad range of information in your head, plus the specific information to hand to reference when need be. He adds: 

It doesn't necessarily highlight everything that I retrieve from the database, right? It doesn’t say, ‘I'm going to use these 100 pages of context to answer your request’, it just has it behind the scenes. 

More customization scenarios

But RAGs aren’t the only approach being considered by buyers and vendors to further customize LLMs. The most commonly known alternative is fine-tuning: users in an enterprise, if they have access to a model and its weights, tune it themselves - in a similar way to how OpenAI fine-tunes its foundational models internally. However, this approach does come with challenges. Klivansky says: 

You essentially push all of this additional domain specific information into the model. And that can be useful. But you have to do it right. First of all, it's relatively complex and expensive to fine tune a model. 

And then, let's say you've trained this model on the cutting edge models that have, for instance, 1.8 trillion tokens in them…if you're taking your domain specific dataset, and training the entire model with it, your domain specific data set maybe has 200 million tokens…if that. So it's only a drop in the bucket of knowledge that this LLM has to work with. 

And if you don't do it carefully and work with some of the layers in the front or the back of the model that are most likely to provide the context, it just gets diluted. So there's a level of AI sophistication and computer science or data science that is required to get the most out of fine tuning a model. It's also the most expensive and time consuming way to add domain specific knowledge to a model. 

The third approach is adopting LoRAs (Low-Rank Adaptations) in conjunction with LLMs. Klivansky describes LoRAs as similar to “sticking a bunch of filters before and after a model” and then tuning those layers. He adds: 

For example, if you've used something like Stable Diffusion, or DALL-E, to generate graphics and art, a lot of times you can pick different styles that the art should come out in. Those styles are actually LoRAs. In the same way that there are these LoRAs for graphic Generative AI, there are also LoRAs that you can create for Large Language Models. 
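The “filters” analogy can be made concrete with a toy numerical sketch of the Low-Rank Adaptation idea: keep the pretrained weight matrix frozen and train only two small matrices whose product is added on top of it. The shapes and numbers here are illustrative assumptions, not drawn from any real model.

```python
# Toy LoRA sketch: a frozen weight matrix W plus a trainable low-rank
# update B @ A. Training touches only A and B (2*d*r values) instead of
# the full d*d matrix - which is why it is cheaper than full fine-tuning.
import numpy as np

d, r = 6, 2                       # model dimension and much smaller LoRA rank
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))       # frozen pretrained weights
A = rng.normal(size=(r, d))       # trainable down-projection
B = np.zeros((d, r))              # trainable up-projection (starts at zero)

def forward(x, scale=1.0):
    # Base path plus the low-rank "filter" layered on top of it.
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d)
# With B initialised to zero the adapter is inert: the output equals the
# frozen base model's, and training gradually moves it away from there.
base_equivalent = np.allclose(forward(x), W @ x)
```

Swapping in a different pair (A, B) changes the model’s behaviour without touching W - which is how one base model can carry many “styles”.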

What the future holds (probably)

So at this moment in time, the Generative AI industry - and buyers - can think about these three approaches as the key routes to LLM customization, according to Klivansky. He says that LoRAs are probably the easiest way to tune a model, and RAGs give more domain-specific context, whilst full fine-tuning is important and helpful, but expensive. What will likely happen is that a combination of the above becomes the priority. He says: 

There’s recent academic papers that have come out that say when you combine these things, it's even better. 

However, what’s most promising about some of these approaches - in particular RAG, which is why Pure and NVIDIA have prioritized it for development - is the potential to reduce hallucinations within Generative AI models, which could instil more trust for enterprise buyers. Klivansky adds: 

RAG tends to be very effective in eliminating hallucinations because you can use that vector database to provide context for your answer. It makes it much more likely that the answer you get is going to be based on factual data. That is one of the more helpful things. 

If you implement it carefully, you can also use that vector database as part of the guardrails and fact checking. Before you actually present your result to the user, you can do some fact checking, you can go through guardrails to make sure that it's a safe and good answer. 
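The guardrail step Klivansky describes can be pictured as a grounding check: before a candidate answer is shown, every sentence in it must be supported by something in the retrieval database. This is a toy sketch - the word-overlap score stands in for a real vector-similarity lookup, and the example “facts” are hypothetical.

```python
# Toy grounding check: an answer passes only if each of its sentences
# overlaps sufficiently with at least one entry in the retrieval database.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def grounded(answer: str, database: list, threshold: float = 0.3) -> bool:
    # Split the answer into sentences and require support for each one.
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return all(
        any(jaccard(s, doc) >= threshold for doc in database)
        for s in sentences
    )

facts = ["pure storage and nvidia are collaborating on a rag pipeline"]
ok = grounded("Pure Storage and NVIDIA are collaborating on a RAG pipeline.", facts)
bad = grounded("The moon is made of cheese.", facts)
```

An ungrounded answer would be blocked or flagged at this point, rather than presented to the user - the fact-checking pass Klivansky refers to.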

My take

Last year, it was pretty impressive to see vendors introduce Generative AI features so quickly into their products, providing some interesting use cases. However, it was clear that there was a lot more work to do - and at the time I couldn’t get a real sense of how large enterprises may further adapt these models to ‘graduate’ to serious enterprise use cases. The discussion here with Klivansky is of interest as it provides a good indication of where things are headed. And of course Pure Storage isn’t the only vendor working on RAG development - we’ve seen a number of others indicate similar approaches - but for buyers, it will be useful context to think about their options and direction of travel. 
