Hallucinations are emerging as one of the biggest challenges in working with new generative AI techniques. As part of ongoing research, I recently reached out to Simon Mark Hughes, ML Engineer & AI Researcher at Vectara, about the company's approach to the problem.
Sadly, Hughes passed away peacefully in his sleep last weekend. Vectara has renamed the model he was working on as the Hughes Hallucination Evaluation Model (HHEM) in his memory. Now, on to one of his last great achievements.
Vectara recently posted a hallucination leaderboard and a link to an open-source implementation based on Hughes's work. Many of the top Large Language Models (LLMs) score much higher on these new metrics than on other hallucination evaluation metrics. For example, GPT-4 weighs in at 97% and Llama 2 70B at 94.9%, compared to 77% for GPT-4 and 68% for Llama on the Galileo metrics diginomica recently covered.
One factor accounting for this difference is Vectara's narrower focus on reducing hallucinations in searching and summarizing documents, rather than the broader set of tasks to which LLMs can be applied. Hughes said that in this context, they define a hallucination as any fact in the generated summary that is added to, or altered from, the source document. He explained:
One thing that is important to stress is that we are focused primarily on evaluating hallucinations created whilst summarizing a short document, which is a very specific type of hallucination. We are not looking to fully address factual errors—for instance, if you asked the model ‘who is the prime minister of the UK’ and it answered ‘Boris Johnson’ due to out-of-date information—these are already measured extensively in the open LLM leaderboards under the question answering tasks. The big challenge there is that you don’t know what data the model was trained on (as it’s not publicized), so you can’t tell if the model is hallucinating or recalling incorrect information from its training corpus.
To solve this problem completely requires having a model that knows the entirety of human knowledge and would thus be impractical and almost akin to solving the problem of eliminating hallucinations when building the detector. Instead, we simulate a task that many of these models are being asked to do when used in search engines - here’s some information (e.g., search results), and summarize it faithfully and accurately. This allows us to fact check that data against the source, something we couldn’t do with the more general problem (as LLM providers don’t publish all of their training data).
At the same time, Hughes believed this is also a good metric for other LLM tasks, such as generating emails, since these require the LLM to use data from the user before generating the text of the email. This framing suggests a way of thinking about the lower bound on the true hallucination rate of the model: if the model adds information to a short piece of text when explicitly told not to do so while summarizing it, it is even more likely to do so when performing other tasks.
Hughes believed this approach is the only meaningful way hallucinations can be measured. Importantly, this is not a new protocol invented by Vectara. It builds on hallucination detection and mitigation research explored in these recent papers:
- Benchmarking Large Language Models in Retrieval-Augmented Generation
- Mitigating the Hallucinations of Large Language Models with Retrieval Augmentation
- Retrieval Augmentation Reduces Hallucination in Conversation
Different dimensions of hallucination
Hughes said that hallucinations can come in many forms. Some important examples include:
- Instead of answering an end-user question, the generative model goes completely off the rails and gives a nonsensical response.
- A model takes creative liberties with its response and includes facts that aren't based in reality.
- The generative system draws on its body of knowledge and reproduces copyrighted works in its output.
- The introduction of specific biases due to the training data.
However, in most cases, hallucinations are not as blatant. Usually, the model makes some reasonable extrapolations from the source text that nonetheless deviate from it. Here's an example:
- Original Passage - The plants were found during the search of a warehouse near Ashbourne on Saturday morning. Police said they were in "an elaborate grow house." A man in his late 40s was arrested at the scene.
- Summary from PaLM - Police have arrested a man in his late 40s after cannabis plants worth an estimated £100,000 were found in a warehouse near Ashbourne.
In this case, it is reasonable to assume that the plants were cannabis, but this could be incorrect. Even worse, the street value was completely fabricated. However, a street value is exactly the kind of information you would usually expect in such an article, so the model filled it in. The technical term for this process is slot filling.
Another example is inverting relationships. For instance, a source document might state, ‘Manny lists Mark Wahlberg as a fan,’ which the model rewrites as ‘Manny was a fan of Mark Wahlberg,’ reversing the relationship.
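HHEM itself is a trained model, but the core idea behind catching slot filling, flagging specific details in a summary that have no support in the source, can be illustrated with a toy check. The numeric-token heuristic below is a hypothetical sketch for illustration, not Vectara's model:

```python
import re

def unsupported_numbers(source: str, summary: str) -> list[str]:
    """Toy heuristic: return numeric tokens (counts, prices, ages) that
    appear in the summary but nowhere in the source -- candidates for
    slot-filled, fabricated details."""
    number = re.compile(r"[£$€]?\d[\d,]*")
    source_numbers = set(number.findall(source))
    return [tok for tok in number.findall(summary) if tok not in source_numbers]

source = ('The plants were found during the search of a warehouse near '
          'Ashbourne on Saturday morning. Police said they were in "an '
          'elaborate grow house." A man in his late 40s was arrested at the scene.')
summary = ('Police have arrested a man in his late 40s after cannabis plants '
           'worth an estimated £100,000 were found in a warehouse near Ashbourne.')

# The man's age ("40s") is supported by the source, but the street value
# ("£100,000") is not, so it gets flagged.
print(unsupported_numbers(source, summary))
```

A real detector has to catch unsupported entities, paraphrases, and inverted relationships as well, which is why Vectara trained a model rather than relying on rules like this.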
Improving LLM workflows
Hughes thought of HHEM as one of a battery of automated tests that researchers could use to ensure their models are aligned with their values and following their instructions. Automating the evaluation with a model is the key advantage: a large-scale human evaluation would not allow others to replicate results efficiently with new models or new types of content. It is also important to note that Hughes did not believe that detecting hallucinations is like other AI benchmarks, such as measuring the accuracy of single-word-answer question answering or sentiment analysis.
Hughes also hoped that the new metric could be used in an RLHF-like feedback mechanism, replacing human feedback with the model's feedback to help guide LLMs to be more truthful. Enterprises could use this in an automated LLM evaluation suite as one of many different metrics. It can also be used to reject LLM responses that are not factually consistent with a user's request and as a mechanism to help train models to be more truthful.
The new measure could also be used to help understand specific issues and select the most appropriate model and tuning process, such as:
- It might act as a filtering mechanism to filter out untruthful responses and as a feedback mechanism for training or optimization.
- For prompt engineering, it could be used to select prompts that lower the hallucination rate of a particular task.
- In a RAG context, teams could use HHEM to generate multiple summaries of the provided search results and pick the most accurate, or to help train a better summarizer or filter out bad summaries.
- It could also be used in an RLHF manner while a model is undergoing instruction tuning, with the metric serving as an additional feedback signal alongside human raters.
- It can also be used to select the best foundation model to start with for some related task based on that model’s level of truthfulness.
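Several of the uses above reduce to the same pattern: score each candidate response for factual consistency against the source, then filter out candidates below a threshold and keep the best survivor. Here is a minimal sketch of that pattern; the `consistency_score` function is a crude word-overlap placeholder standing in for a trained model such as HHEM, and the names and threshold are assumptions for illustration:

```python
def consistency_score(source: str, candidate: str) -> float:
    """Placeholder for a trained factual-consistency model: the fraction of
    candidate words that also appear in the source (higher = more grounded)."""
    source_words = set(source.lower().split())
    candidate_words = candidate.lower().split()
    return sum(w in source_words for w in candidate_words) / max(len(candidate_words), 1)

def best_candidate(source: str, candidates: list[str], threshold: float = 0.5):
    """Reject candidates scoring below the threshold; return the top survivor,
    or None if every candidate was filtered out."""
    scored = [(consistency_score(source, c), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= threshold]
    return max(kept)[1] if kept else None

source = "the plants were found in a warehouse near ashbourne on saturday"
candidates = [
    "plants were found in a warehouse near ashbourne",        # grounded
    "cannabis worth thousands was seized by armed officers",  # mostly unsupported
]
print(best_candidate(source, candidates))
```

Swapping the placeholder for a real consistency model turns the same loop into the response filter, summary selector, or training-feedback signal described above.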
Vendors, including TruEra, Gleen, and Galileo, are developing a wide variety of AI hallucination metrics. Hughes said Vectara has focused on transparency in its approach, has open-sourced its model, and has tested it against current academic benchmarks in the area, such as SummaC and TRUE. He said:
Our Hallucination Evaluation Model is a first-of-its-kind initiative to proffer a commercially available and open-source model that addresses the accuracy and level of hallucination in LLMs, paired with a publicly available and regularly updated leaderboard while inviting other model builders like OpenAI, Cohere, Google, and Anthropic to participate in defining an open and free industry-standard in support of self-governance and responsible AI.
At the same time, Hughes cautioned that HHEM:
...is not a panacea. It’s one approach that can help measure truthfulness.
More work required
At the moment, Vectara is looking at hallucinations in summaries of documents. The company is a provider of search engines and plans to extend hallucination detection and mitigation to more accurately simulate how LLMs are used within these systems in the future.
For instance, GPT sometimes generates fake or inaccurate citations when compiling a summary of search results, so Vectara is actively working on citation accuracy. This is also a multi-document summary task, not a single-document one, so it requires measuring the accuracy of summaries drawn from several sources. Another area of research is self-inconsistent responses, which occur when a model says one thing and then contradicts itself later in the response.
It's also important to note that hallucination research is still a work in progress. LLMs perform many tasks such as writing song lyrics, summarization and question answering tasks, synthesizing numerical and financial information, doing logical or mathematical reasoning, generating images, and describing images. All of these are prone to some forms of hallucination. Having a suite of tests that can look for factual accuracy across all of these tasks is important.
Hallucination detection is one small step towards AI alignment research. We are really detecting how well these models are following instructions. When they hallucinate while summarizing, they are doing something we explicitly told them not to do in the prompt. LLM providers are also attempting to train these models not to hallucinate via reinforcement learning from human feedback (RLHF) and other techniques. What we really have is one method for detecting how well the models follow human instructions, which is what AI alignment research is all about. This is a very small step towards measuring alignment, and much more work is needed in this area for us to be able to know if AI is truly safe (not to mention actually solving the issue).
Vectara’s LLM leaderboard assesses a subset of these tasks but focuses less on hallucination and more on factual accuracy. For long-form answers, it’s possible to both answer a question correctly and introduce false data. Hughes said:
We are trying to mitigate the issue by focusing on a particular task, summarization. There's a lot more work to be done, and this should be done in conjunction with AI alignment and AI safety research. Models that provide false information can have safety implications in some fields, and the manner in which we are testing the models is also looking at alignment by checking how well the model follows instructions.
I was saddened to learn that Hughes passed on November 25th, 2023, which happened to be my 54th birthday. He was only 44. My connection to diginomica began after reading some in-depth analysis that Kurt Marko had done on autonomous shopping, and later I stumbled on the tribute Jon Reed had written to Marko. It really touched me, because I had seen other colleagues who had spent years explaining technology's intricacies silently disappear with barely a whisper.
It's important to keep in mind that all of our great advancements, and the conversations that expand their impact, start with real people. It is always sad when they pass away. I hope that Simon's contribution goes on to make the world a better place. I feel fortunate to have had one of the last conversations with him about his important work to make AI a little more honest.
God bless Simon Mark Hughes. May he rest in peace.
Here is his memorial page: https://www.ivinsfuneralhome.com/obituaries/Simon-Mark-Hughes?obId=30000023#/celebrationWall