Galileo takes on detecting AI hallucinations, but new metrics are needed

George Lawton, November 22, 2023
Galileo Labs’ new metrics for detecting hallucinations promise to help improve generative AI accuracy.


In a week that may well inspire the creation of an AI safety awareness week, it’s worth considering the rise of new tools to quantify the various limitations of AI. Hallucinations are emerging as one of the biggest issues as new AI tools grow better at spewing out authoritative-sounding BS. Indeed, the Cambridge Dictionary named ‘hallucinate’ its word of the year for 2023.

Researchers and vendors are developing a bevy of new algorithms to detect and mitigate the kinds of hallucinations that crop up in the Large Language Models (LLMs) that power ChatGPT and, increasingly, enterprise apps. One such tool is Galileo Labs’ new Hallucination Index, which ranks popular LLMs based on their propensity to hallucinate.

Of particular note, OpenAI GPT-4, one of the best performers, is likely to hallucinate about 23% of the time for basic question and answer (Q&A) tasks. Some others fared far worse, struggling 60% of the time. Under the hood, the picture is more nuanced: the index relies on newly developed metrics such as correctness and context adherence. The company has also developed tooling and workflows to help enterprises test for and mitigate these aspects of hallucination within their own AI implementations.

Galileo Labs co-founder and CEO Vikram Chatterji says the company defines hallucination as the generation of information or data that is either factually incorrect, irrelevant, or not grounded in the input provided. The nature of a hallucination and how it’s measured depends on the task type, which is why they structured the Hallucination Index by task. 

For example, in a Q&A scenario where context is required, an LLM must retrieve the right context and provide a response that’s grounded in the retrieved context. Techniques like retrieval augmented generation (RAG) prompt an LLM with a relevant selection of passages to draw on, which generally improves results. However, GPT-4 actually gets slightly worse with RAG.
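
To make the retrieval augmented generation idea concrete, here is a minimal sketch of how a prompt might be assembled from retrieved passages. It is an illustration only, not Galileo Labs’ or OpenAI’s implementation: the function names, the word-overlap ranking, and the prompt wording are all assumptions, and real systems typically rank passages with embeddings rather than shared words.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(question: str, passage: str) -> int:
    """Toy relevance score (an assumption): number of words shared with the question."""
    return len(tokens(question) & tokens(passage))

def build_rag_prompt(question: str, passages: list[str], top_k: int = 2) -> str:
    """Pick the top_k most relevant passages and place them ahead of the question."""
    ranked = sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)
    context = "\n".join(f"- {p}" for p in ranked[:top_k])
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# Hypothetical company documents, invented for this example.
docs = [
    "The refund window is 30 days.",
    "Our office is in Austin.",
    "Refund requests need a receipt.",
]
print(build_rag_prompt("What is the refund window?", docs, top_k=1))
```

The model only sees what the retrieval step selects, which is why the quality of that step matters so much for whether the answer stays grounded.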

In other cases, such as long-form text generation, it’s important to be able to test the factuality of the response provided by the LLM. Here, the new correctness metric identifies factual errors that don’t relate to any specific document or context. 

Chatterji says they have identified a handful of dimensions that influence an LLM’s propensity to hallucinate. Some of these include:

  • Task Type: Is the LLM being asked to complete a domain-specific or general-purpose task? In cases where the LLM is being asked to answer domain-specific questions (e.g., reference a company’s documents and answer a question), is the LLM effectively referencing and retrieving the necessary context? 
  • LLM Size: How many parameters does the LLM have? Bigger does not always mean better. 
  • Context Window: In domain-specific scenarios where RAG is needed, what is the LLM’s context window and limit? For example, a recent paper by UC Berkeley, Stanford, and Samaya AI researchers highlights how LLMs are unable to effectively retrieve information that’s found in the middle of the provided text.

New metrics required

Chatterji acknowledges there are many more factors to consider, since hallucinations are multi-faceted and require a nuanced approach. To simplify the process of detecting hallucinations, Galileo Labs researchers developed ChainPoll, a new hallucination detection methodology. Their recent paper dives into how it works in more detail.

At a high level, the company claims ChainPoll is about 20 times more cost-efficient than previous hallucination detection techniques. It uses a chain-of-thought prompt engineering approach that elicits specific and systematic explanations from the model. This helps teams better understand why hallucinations are happening and is an important step towards more explainable AI.
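
As a rough sketch of the general idea, and not the code from the ChainPoll paper, the example below assumes the method asks a judge model the same chain-of-thought question several times and reports the fraction of runs that flag a hallucination. The prompt wording, the `VERDICT:` reply format, and the names `poll_for_hallucination` and `make_stub_judge` are hypothetical. The stub judge stands in for a real LLM call so the sketch runs offline.

```python
import random
from typing import Callable

# Hypothetical judge interface: takes a prompt, returns free-text reasoning
# whose last line is "VERDICT: yes" or "VERDICT: no".
Judge = Callable[[str], str]

PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Think step by step about whether the answer contains a hallucination, "
    "then finish with a line 'VERDICT: yes' or 'VERDICT: no'."
)

def poll_for_hallucination(
    question: str, answer: str, judge: Judge, n_polls: int = 5
) -> tuple[float, list[str]]:
    """Return (fraction of polls that flagged a hallucination, all explanations)."""
    prompt = PROMPT_TEMPLATE.format(question=question, answer=answer)
    explanations: list[str] = []
    flagged = 0
    for _ in range(n_polls):
        reply = judge(prompt)
        explanations.append(reply)  # keep the reasoning for later review
        lines = reply.strip().splitlines()
        last_line = lines[-1].strip().lower() if lines else ""
        if last_line == "verdict: yes":
            flagged += 1
    return flagged / n_polls, explanations

def make_stub_judge(seed: int = 0, p_yes: float = 0.6) -> Judge:
    """Stand-in for a real LLM call: answers 'yes' with probability p_yes."""
    rng = random.Random(seed)

    def judge(prompt: str) -> str:
        verdict = "yes" if rng.random() < p_yes else "no"
        return f"Reasoning: (model explanation would go here)\nVERDICT: {verdict}"

    return judge

score, reasons = poll_for_hallucination(
    "Who wrote Hamlet?", "Christopher Marlowe", make_stub_judge()
)
print(f"flagged in {score:.0%} of {len(reasons)} polls")
```

Keeping the explanations alongside the score reflects the explainability point above: a team can read why the judge flagged a response, not just that it did.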

The new tools helped Galileo Labs researchers develop the two hallucination evaluation metrics used in the Hallucination Index. Chatterji argues these metrics do a better job than competing approaches at quantifying LLM output quality. He says they scale across common task types such as chat, summarization, and generation, work with and without RAG, and are cost-effective and quick to process. They also appear to correlate well with human feedback.

It’s important to note that the measures reflect the probability of hallucination rather than an absolute measure of hallucination. For example, a correctness score of 0.70 suggests a 30% chance of a hallucination in the response. Here is a bit more detail about the nuance of the new metrics:

  • Correctness: Measures whether a given model response is factual or not. Correctness uncovers so-called open-domain hallucinations, which are factual errors that do not relate to any specific documents or context. The higher the correctness score, the higher the probability that the response is accurate. This is useful for evaluating tasks like long-form text generation and Q&A without RAG. 
  • Context Adherence: Context Adherence evaluates the degree to which a model's response aligns strictly with the given context, serving as a metric to gauge closed-domain hallucinations, wherein the model generates content that deviates from the provided context. A lower score indicates the model response is not included in the context provided to the model. This is useful for evaluating Q&A with RAG. 
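
The arithmetic behind these scores is simple enough to show directly. The sketch below assumes, as described above, that each score lies between 0 and 1 and that one minus the score is the chance of a hallucination. The `MetricResult` type and both function names are hypothetical, invented for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricResult:
    metric: str   # "correctness" or "context_adherence"
    score: float  # between 0 and 1; higher means less likely to be hallucinated

def pick_metric(uses_rag: bool) -> str:
    """Context adherence applies when context is supplied (RAG); correctness otherwise."""
    return "context_adherence" if uses_rag else "correctness"

def hallucination_probability(result: MetricResult) -> float:
    """Read a score as a probability: a score of 0.70 implies a 30% chance of hallucination."""
    if not 0.0 <= result.score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return 1.0 - result.score

result = MetricResult(metric=pick_metric(uses_rag=False), score=0.70)
print(f"{result.metric}: {hallucination_probability(result):.0%} chance of hallucination")
# prints "correctness: 30% chance of hallucination"
```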

Since different metrics are used across various tasks, it is not a true apples-to-apples comparison. For example, GPT-4 gets a correctness score of 0.77 on Q&A without RAG but a slightly lower context adherence score of 0.76 with RAG. Most of the other models scored higher on the relevant metric with RAG.

These metrics enable a continuous feedback loop for teams building LLM applications and, thus, significantly reduce the development time needed to launch safe and trustworthy LLM apps. Chatterji explains:

These metrics allow teams to iterate and test on prompts, context, model choices, and more during development to find the combination that works. And these same metrics allow the team to evaluate LLM outputs in production. Armed with these metrics, teams can quickly identify inputs and outputs that require additional attention, and the underlying data, context, and prompts driving this sub-par behavior.

Enterprise teams are already using these hallucination detection metrics in development workflows. They also help with production monitoring and can trigger proactive alerts and notifications when output starts to degrade. 
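
One way such an alert could work is a rolling average over recent scores. The sketch below is an assumption about the general pattern, not a description of Galileo Labs’ product: the `ScoreMonitor` class, the window size, and the 0.7 threshold are hypothetical choices for illustration.

```python
from collections import deque

class ScoreMonitor:
    """Track recent metric scores and flag when their rolling mean drops too low."""

    def __init__(self, window: int = 50, threshold: float = 0.7) -> None:
        self.scores: deque[float] = deque(maxlen=window)  # oldest scores fall off
        self.threshold = threshold

    def record(self, score: float) -> bool:
        """Add a score; return True once the window is full and its mean is below threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        mean = sum(self.scores) / len(self.scores)
        return window_full and mean < self.threshold

monitor = ScoreMonitor(window=3, threshold=0.7)
alerts = [monitor.record(s) for s in [0.9, 0.8, 0.75, 0.5, 0.4]]
print(alerts)  # prints [False, False, False, True, True]
```

Waiting for a full window before alerting avoids firing on a single poor response, at the cost of reacting a little later when quality genuinely degrades.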

However, it’s important to note that the new metrics are still a work in progress. For example, they have only achieved 85% correlation with human feedback. More work will also be required for multi-modal LLMs that work across different types of data such as text, code, images, sounds and video. Galileo Labs also plans to expand the list of ranked models as popular new LLMs emerge. Chatterji says:

The area of hallucination research is nascent, exciting and has a lot of avenues for experimentation.

My take

One surprising thread in the recent OpenAI drama was CEO Sam Altman’s suggestion that the company may be hitting a wall in getting LLMs to hallucinate less with bigger models. New approaches for discerning the deeper laws of physics will be required.

At a public discussion at Cambridge, Altman said:

We need another breakthrough. We can push on Large Language Models quite a lot, and we will do that. We can take the hill that we're on and keep climbing it, and the peak of that is still pretty far away. But, within reason, I don't think that doing that I view as critical to an AGI… If super-intelligence can't discover novel physics, I don't think it's a super-intelligence. And teaching it to clone the behavior of humans and human text - I don't think that's going to get there. And so, there's this question which has been debated in the field for a long time of what do we have to do in addition to a language model to make a system that can go discover new physics, and that will be our next quest.

It has taken nearly six years to get from the seminal discovery that drove progress in LLMs to the point today where they hallucinate a little less. With all the new AI-specific hardware hitting the market and general enthusiasm, it may take a little less time for any subsequent approach to achieve the same level of acceptance and tooling. 

In the meantime, tools like those from Galileo Labs for detecting and reducing hallucinations will help enterprises take advantage of LLMs a bit more safely.
