Watchful bridges generative AI transparency gap

George Lawton, November 9, 2023
Modern large language models have a transparency problem, which confounds efforts to reduce hallucinations and other issues. Watchful is introducing a new set of open-source tools and metrics to shine a light on these problems and help reduce them.


Generative AI services like ChatGPT have demonstrated an impressive ability to generate text with human-level realism. But they also have a propensity to hallucinate and gloss over uncertainty in their responses. The big challenge is that the newest wave of AI capabilities built on large language models is far more opaque than traditional AI, machine learning, and analytics techniques, making it much harder to identify and mitigate the root causes of problems.

Watchful, a leader in AI data labeling and observability tools, has introduced a new set of open-source tools that it pitches as helping to bridge this gap. At their core, these tools support new metrics to help enterprises assess the occurrence of hallucinations, bias, and toxicity in services built on Large Language Models (LLMs) and develop mitigation strategies.

The fundamental problem is that modern LLMs come with billions of knobs for adjusting how they process text, called parameters. And these parameters bear little direct relationship to how a human might interpret the meaning of things. To make matters worse, many of the latest models are offered via APIs into proprietary services that limit visibility into their inner workings.

Watchful CEO Shayan Mohanty explains:

In the case of open-source models, there aren’t well-defined ways to evaluate how a model performs on a given task. Even when you have access to the model itself, the best approaches that have been developed to evaluate models are largely benchmarking on common/generic language tasks. There haven’t been a concrete set of metrics that have been developed to help users of these models evaluate how they perform on a given task.

This problem is compounded with closed-source models, where users don’t have access to the model internals at all, as is the case with proprietary LLM services. As a result, developers and quality assurance (QA) teams have no means to interpret how a model came to its conclusion. 

New metrics for generative AI

Many humans (except maybe politicians and social media executives) have the impressive ability to identify flaws in their thinking that may have led to inaccuracies. Over our years of learning about how the world works, we have developed a conceptual framework for making sense of our thinking processes using symbols we connect to observations and facts. 

In contrast, LLMs use statistical correlation techniques to map the relationship between words, code, visual elements, and other types of data. However, they lack the internal scaffolding to recognize the root cause of a faulty assertion or hallucination. So, a different approach is required to help pinpoint these flawed processes. This is where the new metrics come in:

  • Token importance estimation measures the relative importance of individual tokens (roughly, words or word fragments) in a prompt to an LLM. This helps the user understand which tokens the model focused on, which in turn can guide prompt improvements and aid LLM interpretability. The process works by making subtle adjustments to a prompt, such as dropping words or adding new ones, and measuring how much the model’s output changes.
  • Model uncertainty scoring measures the reliability and variability of model outputs in terms of conceptual uncertainty (how sure the model was about what it was trying to say) and structural uncertainty (how sure the model was about how to say it). To estimate structural uncertainty, it tracks changes in the model’s embedding space, its internal representation of the world. To estimate conceptual uncertainty, it tracks how many distinct logical branches the model considered before landing on its final answer. This metric can help assess how reliably an AI service will produce answers. It can also help tune a model away from biased or toxic answers by pruning them from the logical branches of a response.
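
The perturbation idea behind token importance can be illustrated with a minimal sketch. This is not Watchful's implementation: the leave-one-out strategy, the Jaccard word-set distance, and the names `token_importance`, `jaccard_distance`, and `toy_model` are all assumptions made for illustration, and `toy_model` is a hypothetical stand-in for a real LLM call.

```python
from typing import Callable, List, Tuple


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A & B| / |A | B| over the word sets of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def token_importance(
    prompt: str, model: Callable[[str], str]
) -> List[Tuple[str, float]]:
    """Score each whitespace token by how much the output changes when it is dropped."""
    tokens = prompt.split()
    baseline = model(prompt)
    scores = []
    for i, token in enumerate(tokens):
        # Rebuild the prompt without token i and compare the new output to the baseline.
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        scores.append((token, jaccard_distance(baseline, model(ablated))))
    return scores


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: the answer depends only on the word 'french'."""
    if "french" in prompt.lower().split():
        return "bonjour le monde"
    return "hello world"


scores = token_importance("Translate to French please", toy_model)
```

With this toy model, `French` scores 1.0 and every other token scores 0.0, because removing `French` is the only change that alters the output. A real setup would replace `toy_model` with calls to an actual model and would likely need a more robust distance than word overlap.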

How the metrics work

Watchful has been a leader in tools for AI data labeling and observability, which have been the state-of-the-art approach for improving the trust, risk management, and security of traditional AI models. Data labeling required a rigorous process to manually tag data with descriptions about what was in a certain image or video. 

However, Google’s introduction of the transformer architecture in 2017, which inspired the generative AI boom, pioneered ways to discover underlying patterns with far less manual effort. That’s why companies like OpenAI, Google, and others have been able to train foundation models on billions of pages scraped from the internet. As a result, the generative AI industry is shifting away from data labeling towards newer approaches: fine-tuning existing LLMs, or refining the way prompts are fed to them, called prompt engineering. Mohanty argues:

The term ‘data labeling’ has fallen out of vogue in the gen AI world. Folks have generally taken to calling its application ‘fine-tuning.’ Fine-tuning is when a user of a gen AI model can take a small, labeled dataset and train the model on it with an extremely low learning rate. Currently, there are two distinct camps: those that are proponents of prompt engineering as a mechanism for getting high-quality outputs and those that are proponents of fine-tuning. In our opinion, they are two sides to the same coin.
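
To make the role of the learning rate concrete, here is a toy sketch rather than anything resembling LLM training. It assumes a one-parameter model `y = weight * x`, a made-up "pretrained" weight of 1.0, and a hypothetical two-example labeled dataset; the function name `fine_tune` is likewise illustrative.

```python
from typing import List, Tuple


def fine_tune(
    weight: float,
    data: List[Tuple[float, float]],
    learning_rate: float,
    epochs: int,
) -> float:
    """Run gradient descent on mean squared error for the model y = weight * x."""
    for _ in range(epochs):
        # Gradient of mean((weight * x - y) ** 2) with respect to weight.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * grad
    return weight


pretrained = 1.0                     # assumed starting point, standing in for a pretrained model
labeled = [(1.0, 2.0), (2.0, 4.0)]   # small hypothetical dataset, consistent with y = 2 * x

gentle = fine_tune(pretrained, labeled, learning_rate=0.001, epochs=10)
aggressive = fine_tune(pretrained, labeled, learning_rate=0.1, epochs=10)
```

After the same ten passes, `gentle` has moved only slightly from the starting weight, while `aggressive` has almost fully adopted the new data. A very low learning rate nudges a model toward the labeled examples while keeping it close to what it learned in pre-training.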

Prompt engineering is currently the only viable way to introduce net-new knowledge to a gen AI model without controlling its pre-training process. However, it’s somewhat opaque and more of an art than a science. The new metrics bring some rigor and feedback to this process.

Fine-tuning, on the other hand, is useful in focusing the model’s output on a smaller domain that might be pulled from enterprise systems, legal documents, or product manuals. Mohanty says fine-tuning is useful when you want to influence a response’s style or structure but not necessarily its substance.

Both techniques can be combined to improve different aspects of the results. For example, token importance metrics can help figure out what parts of a prompt to iterate on to yield the greatest outcomes. Uncertainty measures can assess the progress of prompt engineering efforts. Eventually, prompt engineering efforts hit a plateau of improvement. Then, fine-tuning can help continue to decrease uncertainty. 
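
One rough way to picture using an uncertainty measure to track prompt engineering progress is to sample several answers for each prompt version and measure how much they disagree. The sketch below swaps in a simple proxy, normalized Shannon entropy over distinct answers, and is not Watchful's embedding-based metric. The function name `answer_entropy` and the hard-coded sample lists are hypothetical; in practice the samples would come from repeated calls to a model.

```python
import math
from collections import Counter
from typing import List


def answer_entropy(samples: List[str]) -> float:
    """Normalized Shannon entropy of distinct answers: 0.0 = all agree, 1.0 = all differ."""
    n = len(samples)
    if n < 2:
        return 0.0
    counts = Counter(samples)
    # Sum p * log(1 / p) over each distinct answer's relative frequency p.
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    # Divide by the maximum possible entropy for n samples, log(n).
    return entropy / math.log(n)


# Hypothetical answers sampled four times for two versions of a prompt.
before_rewrite = ["Paris", "Lyon", "Paris", "Marseille"]
after_rewrite = ["Paris", "Paris", "Paris", "Paris"]

score_before = answer_entropy(before_rewrite)
score_after = answer_entropy(after_rewrite)
```

Here `score_after` is 0.0 and lower than `score_before`, which would suggest that the rewritten prompt yields more consistent answers. Exact string matching is a crude choice: two differently worded but equivalent answers count as disagreement, which is one reason an embedding-based measure may be preferable.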

A work in progress

These new metrics are an important first step, but more work will be required. For example, the current metrics are limited by how much information they can extract from the embeddings layer used for efficiently representing raw data. But it was important to get the ball rolling, according to Mohanty:

We’re planning on eventually diving deep and addressing this with future research, but we’re first taking a breadth-first approach in trying to address several complementary transparency issues with gen AI. We started with estimating token importance in prompt inputs to these models, then estimated uncertainty in the model’s responses.

We plan on continuing this research by investigating metrics that can describe how well a model is actually able to address a given task (e.g., the closeness of a generated output to a space of acceptable ones). There’s more to be done here, and we’re really only scratching the tip of the iceberg with the research we’ve done so far, but we’re excited about continuing down this train of thought as we’ve discovered that these approaches are already completely model agnostic as-is.

Down the road, Mohanty hopes that this contribution will inspire others to develop better metrics for generative AI. He explains:

We haven’t yet seen the rigor in techniques and metrics that we’ve come to expect coming from a world of conventional AI, but we’re optimistic that this is just a function of the newness of gen AI. So far, a lot of the evaluation techniques we’ve seen have revolved around models self-scoring outputs (e.g., “How good do you think your response was to the given prompt?”) and comparing those to human scores.

This type of approach is rife with cyclical issues such as bias and hallucination and is largely what prompted us to explore alternative approaches through research. In addition, model transparency is a data problem at its core. Being able to reason about how outputs are generated and what inputs (prompt, training data, etc.) were at play when it was generated is critical. A lot of the industry focus has been on applications of outputs of LLMs but not as much into introspection of the inputs. We’re hopeful, though, that this is just a function of where we are in the overall hype cycle of gen AI - the market seems to still be in experimentation mode.

My take

Regulators and enterprises are starting to call for greater transparency into the foundation models that power generative AI services. But the technical solutions for peering inside their inner workings are in their infancy. Efforts to rethink quality metrics for identifying and fixing hallucinations, bias, and toxicity are critical for building responsible, ethical, and transparent AI. 

Others are exploring the problem from different perspectives. For example, Anthropic recently developed a technique for decomposing LLMs into understandable components. This is an important milestone for LLM developers, but more work is required for enterprises that want to safely use these models for their own applications.

Watchful’s new metrics suggest a promising alternative that is likely to inspire researchers, competitors, and regulators to build trust in generative AI.
