Hallucinations are shaping up to be one of the most important topics for enterprises trying to improve their use of the large language models (LLMs) that underpin services like ChatGPT. The difficulty is that there are many ways to frame, and therefore measure, the problem. The obvious approach is to have humans review the outputs, but this is expensive and does not scale well.
Over the last six months, several vendors have developed metrics and processes to help automate hallucination detection. Here, we take a deep dive into TruEra, a leader in machine learning monitoring, testing, and quality assurance. The company released its hallucination detection and mitigation workflows in March as part of a broader open-source framework for LLMs. This preceded some of the more recent hallucination metrics diginomica previously covered from Galileo and Vectara.
TruEra defines hallucination slightly differently from some of these other vendors. Its approach combines components for evaluation, deep tracing and logging, and the ability to scale to larger data sets on an ongoing basis. TruEra cofounder and CTO Shayak Sen argues that some of the competing hallucination management approaches focus only on individual parts of the problem. This broader approach also shapes how the company thinks about hallucinations. Sen explains:
“Generally, the prevailing definition of hallucination is a language model producing outputs that are factually incorrect. Without a source of truth, this definition is unenforceable. We’ve been promoting a stricter definition of hallucination: the output from a language system is hallucinatory if it responds to a prompt in a way that does not accurately represent a source of truth in a verifiable way.
“One consequence of this definition is that using ChatGPT (or any other LLM) as a question answering system is considered hallucinatory by default. It isn’t even attempting to represent a source of factual truths. It is generating plausible text that may or may not be factually correct. Given that it is often factually correct is a coincidence. This may seem like a bit of a radical viewpoint, but in reality, the fact that generative models hallucinate should be viewed as a feature and not a bug.”
A process approach
Sen argues that one way to build systems that actually do represent a source of truth is to improve how retrieval-augmented generation (RAG) shapes interactions with an LLM. In a RAG architecture, the LLM’s task is not to produce facts but to summarize information retrieved from a database or APIs. In this context, hallucination can be checked by answering three questions:
- QA relevance: Is the answer relevant to the question?
- Context relevance: Is the retrieved context relevant to the query?
- Groundedness: Is the response supported by the context?
If the answer to any of these questions is ‘No,’ then the system’s output could be misleading or irrelevant. In TruEra’s TruLens framework, each of these hallucination metrics captures a different failure mode of an LLM-based system.
Identifying which metric scores lowest helps teams decide which part of the system to fix. For example, if an implementation is hallucinating because it frequently uses irrelevant context, developers can prioritize improving retrieval for the use case.
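To make the three questions concrete, here is a minimal Python sketch. It is not TruEra’s or TruLens’s implementation: the function names, the 0.5 threshold, and the crude word-overlap score are all assumptions chosen for illustration. A real system would use a language model or another semantic measure in place of `overlap`.

```python
import re


def tokens(text: str) -> set:
    """Lowercase word tokens; a crude stand-in for real semantic comparison."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(claim: str, source: str) -> float:
    """Fraction of the claim's tokens that also appear in the source (0.0 to 1.0)."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(source)) / len(claim_tokens)


def triad_scores(query: str, context: str, response: str) -> dict:
    """Score the three checks from the article with the toy overlap measure."""
    return {
        # QA relevance: does the answer address the question?
        "qa_relevance": overlap(query, response),
        # Context relevance: is the retrieved context about the question?
        "context_relevance": overlap(query, context),
        # Groundedness: is the answer supported by the retrieved context?
        "groundedness": overlap(response, context),
    }


def failed_checks(scores: dict, threshold: float = 0.5) -> list:
    """Names of checks scoring below the (assumed) threshold."""
    return [name for name, score in scores.items() if score < threshold]


if __name__ == "__main__":
    query = "What is the capital of France?"
    context = "Paris is the capital of France."
    for response in ("The capital of France is Paris.",
                     "Lyon has two million residents."):
        scores = triad_scores(query, context, response)
        print(response, scores, failed_checks(scores))
```

An empty list from `failed_checks` corresponds to answering ‘Yes’ to all three questions; any name it returns points to the failure mode worth investigating first.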
It’s also important to have a robust system of record to track how performance changes over time as teams experiment with different configurations of the system. Evaluation and monitoring matter throughout the lifecycle of an application, not just at launch. This reduces the risk of over-focusing on fixing single examples while ignoring the broader quality of the system.
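A system of record can start out very simply. The Python sketch below is an illustration under stated assumptions, not part of any TruEra product: `EvalRun`, `summarize`, `regressions`, and the 0.05 tolerance are hypothetical names and values. It stores per-example scores for each configuration and flags metrics whose average dropped between two runs, so that attention stays on aggregate quality rather than on single examples.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalRun:
    """One evaluation of one system configuration (hypothetical record type)."""
    config: str   # label for the configuration under test
    scores: dict  # metric name -> list of per-example scores in [0, 1]


def summarize(run: EvalRun) -> dict:
    """Average each metric over all evaluated examples."""
    return {metric: mean(values) for metric, values in run.scores.items()}


def regressions(baseline: EvalRun, candidate: EvalRun,
                tolerance: float = 0.05) -> dict:
    """Metrics whose average fell by more than `tolerance` (an assumed value).

    Returns metric name -> (baseline average, candidate average).
    """
    base, cand = summarize(baseline), summarize(candidate)
    return {
        metric: (base[metric], cand[metric])
        for metric in base
        if metric in cand and base[metric] - cand[metric] > tolerance
    }


if __name__ == "__main__":
    v1 = EvalRun("prompt-v1", {"groundedness": [0.9, 0.8],
                               "context_relevance": [0.7, 0.7]})
    v2 = EvalRun("prompt-v2", {"groundedness": [0.6, 0.7],
                               "context_relevance": [0.8, 0.8]})
    for metric, (before, after) in regressions(v1, v2).items():
        print(f"{metric} regressed: {before:.2f} -> {after:.2f}")
```

Comparing averages over the whole evaluation set, rather than eyeballing individual answers, is what lets a team notice that a change which fixed one example made the system worse overall.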
Enterprises can use these algorithms throughout the lifecycle of a system from development to production to:
- Build confidence that they’ve covered basic edge cases before deployment.
- Use evaluations to guide improvements to their systems by prioritizing the root causes of hallucinations.
- Monitor performance over time to quickly detect and address regressions.
Understanding the root causes of the issues helps create a feedback loop that determines what kind of fix you need to make. For example, if the system is doing poorly at retrieving relevant content, then improving retrieval makes the most sense. On the other hand, if groundedness is the key issue, then fine-tuning and prompt engineering to do well on domain-specific data is likely to have the most impact. In either case, it’s important to systematically test your system and track improvements.
One essential limitation in hallucination research is scale. Because most of the evaluations are themselves run by language models, they can be costly and hard to scale in production. Sen said future research and development will focus on ways to algorithmically scale up the evaluation of LLM hallucination so it can run more cost-effectively for more use cases.
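The feedback loop described above can be sketched in a few lines of Python. This is an illustrative assumption rather than a documented TruEra workflow: the metric names follow the three checks listed earlier, the two suggested fixes paraphrase this article, and `weakest_metric` and `suggested_focus` are hypothetical helpers.

```python
from statistics import mean

# Fixes named in the article for two of the three metrics.
# No specific fix is named for QA relevance, so it falls through to a default.
SUGGESTED_FOCUS = {
    "context_relevance": "improve retrieval",
    "groundedness": "fine-tuning and prompt engineering on domain-specific data",
}


def weakest_metric(records: list) -> tuple:
    """Return (metric name, average score) for the lowest-scoring metric.

    `records` is a list of dicts, one per evaluated example, each mapping
    metric name -> score in [0, 1].
    """
    if not records:
        raise ValueError("need at least one evaluated example")
    averages = {
        name: mean(record[name] for record in records)
        for name in records[0]
    }
    worst = min(averages, key=averages.get)
    return worst, averages[worst]


def suggested_focus(metric: str) -> str:
    """Map the weakest metric to the kind of fix the article associates with it."""
    return SUGGESTED_FOCUS.get(metric, "no specific fix named; investigate further")


if __name__ == "__main__":
    records = [
        {"qa_relevance": 0.9, "context_relevance": 0.3, "groundedness": 0.8},
        {"qa_relevance": 0.8, "context_relevance": 0.4, "groundedness": 0.9},
    ]
    metric, score = weakest_metric(records)
    print(f"weakest: {metric} ({score:.2f}); focus: {suggested_focus(metric)}")
```

Re-running the same evaluation after each fix closes the loop: the weakest metric should improve, and a different metric may then become the priority.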
In addition, more work will be required to mitigate hallucinations that combine text, code, audio, video, and other types of data. Sen explains:
“Also, as models grow and become more diverse, a lot of the focus is shifting towards multi-modal models which means that the evaluations need to shift towards multi-modal use cases as well and we need a new set of tools for what hallucinations mean in a multi-modal setting.”
At the moment, researchers and vendors are all struggling to find the most efficient way to measure and reduce hallucinations in AI. Solving this will be critical for generative AI to scale in the enterprise. Just as importantly, even the best hallucination metrics risk remaining one-off techniques unless they are built into a broader workflow.
TruEra is approaching the problem as part of a broader effort to streamline the AI development lifecycle. It’s reasonable to assume that competing AI development vendors will roll out similar capabilities in the near future. These may be surfaced directly in the tool or via plug-ins and third-party marketplaces, much as new quality assurance and testing capabilities are added to integrated development environments today.
Additionally, many of the current hallucination metrics focus on the kind of conversations a human might have with a chatbot. Different approaches might be required to improve the quality of code suggestions and other types of recommendations in the rapidly evolving world of AI-powered assistants and copilots.