Can enterprise LLMs achieve results without hallucinating? How LOOP Insurance is changing customer service with a gen AI bot

By Jon Reed, December 4, 2023
Summary:
If enterprise LLM vendors want to deliver business results, they must earn the trust of customers and employees alike. Can this be done without hallucinations? It's a provocative claim - one that led me to document my first live gen AI story with LOOP Insurance, which utilizes Quiq's AI bot.

(Weslee Berke, Head of Customer Care, LOOP - photo by Jon Reed)

I've been critical of the flaws in generative AI - Large Language Models in particular. I've also blown a gasket or two on why "responsible AI" posturing by OpenAI and other "big AI" vendors is farcical.

What I'm really after is precision: enterprise customers should know what AI's limits and potentials are, without the braggadocio and exaggerations. These tools are powerful enough.

Imagine my surprise, then, when yet another PR campaign about "our gen AI doesn't hallucinate" turned into something different: my first published use case on a live gen AI project. Another curve ball: this isn't an internally-facing bot or digital assistant. This is a customer-facing chat bot, operating in a regulated industry (insurance).

A live generative AI use case - "without hallucinations"?

The AI vendor in question? Quiq, a customer service AI vendor. Their customer? LOOP - a very different kind of insurance company. Upon request, I ordinarily allow a vendor to sit in when I interview one of their customers. But in this case, I insisted on talking to the customer on my own. After all, Quiq's PR verbiage asserted "without hallucinations." I needed to ask the customer: is this true?

To be fair, I'm not sure "hallucinations" is the best word for my enterprise AI concerns. The degree of LLM accuracy needed is very use-case-specific. In the case of a customer-facing bot for an insurance company, I'd argue that an outright wacky bot answer would be much less damaging than a slightly inaccurate one, which the policy holder might take as true - and run with.

But then again, we don't always get completely accurate information from human agents either - we've all been in that type of call center purgatory at one time or another. For this particular scenario, I see this as an issue of regulatory compliance on the one hand, and accuracy on the other. I figure that a service bot's accuracy, for basic "level one" support inquiries, needs to be on the level of a competent - though perhaps not exceptional - human agent.

More on this in the "My take" conclusion. What matters for now is that Quiq CEO Mike Myer agreed to my interview terms, with the addition of a follow-on conversation with Myer to discuss the technology in detail. So a few weeks ago, I found myself on a video call with Weslee Berke, Head of Customer Care, LOOP.

One look at LOOP's home page indicates we are dealing with a very different kind of insurance company. When was the last time you saw a phrase like "fair and equitable insurance that reinvests in your community" on an insurance home page?  Or: 

Get an honest car insurance rate based on how and where you drive. See how much you save when we take systemic bias out of the equation.

Why LOOP chose Quiq's AI bot - can customer service scale?

Currently, LOOP's insurance is auto-only and Texas-based, but with such a refreshing stance, perhaps that too will change before long. Prior to my interview with Berke, I tested LOOP's bot for myself. Not being a customer, I wasn't able to put the bot through its full paces. But my bot interactions confirmed: this is not that dreaded legacy "lead gen" bot that forces you to pick from a handful of dumb-bot static answers. I asked Berke: how did all this come about? As Berke told me:

 I've been at LOOP for almost a year and a half. My team really answers any question at this point, from new prospective customers to members who want to make any updates to their policies, or who have questions about their billing, or anything insurance-related - even questions about renewals and all of that stuff.

Those interactions take place across channels. But now, via Quiq's gen AI service chatbot, a "customer-facing digital assistant" is part of the mix. Berke explains:

We talk to customers through phones; we have our bot. And if you don't get the answer you are looking for from the bot, you can talk to an agent through it. The bot is through chat, but also through text message, and we can talk about that, because I actually think that's pretty cool.

For a growing startup like LOOP, staffing customer service is a constant challenge. That's where LOOP's bot comes in:

 We started working with Quiq because, like many other young startups who grow quickly, your customer base becomes a lot bigger, and you're trying to find a solution to help your customer service team answer all those customers. So Quiq seemed like a great option to open up different channels of communication.

Quiq's LLM service bot in action - on results, accuracy, and preventing hallucinations

One big appeal of the Quiq bot? LOOP can gain a quick benefit. Then, further automations can be added down the road, such as integrations with payment systems like Stripe. But LOOP got a quick start by training the bot with their own help center documentation:

As we started exploring what we could do together, it felt like we could figure out a way to use our own Help Center [documentation] to answer customers through the bot. And so to me, getting those simple questions answered was a huge win. We do not staff people 24/7, but the bot is 24/7. So if you are just looking for an answer, it's a great resource for our customers.

Those results are backed up in Quiq's published LOOP gen AI case study (PDF):

• Customer self-service rate increased by 3X to more than 50% automated resolution
• 75% positive customer satisfaction rating for the AI Assistant
• 55% decrease in email tickets

A substantial reduction in email tickets is always a happy thing. But how were those results achieved? I have a hard time picturing a general purpose ChatGPT bot getting this done. Berke says the results trace back to training Quiq's bot on their own customer FAQs. Fortunately for LOOP, the caliber of that help data was strong from the get-go:

Our bot only pulls from our help center. We are also in a regulated industry, so everything that the bot is saying is programmed in - it's not pulling from other sources. There are things that Quiq has helped us program such as links, as an example. And so we spent a lot of time going back and forth with Quiq to determine what we wanted included.

And you're right, it does show you the help center articles. Because like you said, it gives you more in-depth information [by linking to] those help center articles. Because how much do you want the bot to say? It's kind of taking bits and pieces. But sometimes it's a very specific question that someone has, and you need just a little more information that an article from the help desk is there to give you.

And now for the no-hallucinations part. Due to the importance of this issue, I'm reprinting this part verbatim:

Reed: Quiq talked about no hallucinations. Based on how you limited the bot's output content, I can see how you could solve that. But you haven't run into a situation where the bot has provided anything inaccurate, because it's only pulling from your own content, so it can't really be inaccurate. Is that what it comes down to?

Berke: Yeah, that's what I'm seeing. I read a lot of the transcripts, and I don't see it giving incorrect information. We definitely programmed it for a couple of things - for when people are just playing around and testing it out. They won't answer people who are being silly. If it doesn't know, it will say things like, "I'm not trained to answer that." I see that a good amount in all honesty, because some people ask very specific questions. And also the way you and I might ask a question could be different. And it might just understand the way that you're asking it. And it might not understand the way that I'm asking it. I think that's just kind of the nature of it.

Reed: But as far as you can tell, it hasn't given out, for example, an inaccurate quote or something like that.

Berke: No. I don't see it giving wrong information; it's got what it's got.

I'm out of space for a deep dive into Quiq's bot technology, but the LOOP case study has plenty of detail, e.g. a section on semantic search with LLMs:

'Semantic similarity' is a special type of search that compares not just the words that a customer used in their question, but instead the actual meaning of the question. Quiq uses semantic similarity for LOOP to compare what customers ask to content already in the LOOP knowledge base...

The semantic search identifies potentially several articles that are relevant and uses the language generation capabilities of the LLM to summarize the articles into a highly relevant and personalized response. 

The older generation of service bots had major limitations. They would generally surface one article at a time; the answer the customer sought was buried somewhere in that article. As Quiq says, this type of LLM bot is different:

With Quiq AI Assistants, the one article, one answer constraint is a thing of the past, allowing all the information relevant to a question to be combined into an answer precisely matching the customer's inquiry.
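To make the retrieve-then-summarize pattern concrete, here is a minimal Python sketch of the general approach Quiq describes - my own illustration, not Quiq's implementation. The help center articles are invented, the "embedding" is a toy bag-of-words stand-in for a real embedding model, and the final step is stubbed where a real assistant would hand the retrieved articles to an LLM for summarization.

```python
# Minimal sketch of semantic retrieval plus LLM summarization.
# embed() is a toy bag-of-words stand-in for a real embedding model;
# summarization is stubbed where a real assistant would call an LLM
# with the retrieved articles as its only context.
import math
from collections import Counter

# Invented example articles - not real LOOP help center content.
HELP_CENTER = {
    "billing-faq": "We bill monthly and you can update your payment method in the app.",
    "coverage-basics": "Your policy covers liability, collision, and comprehensive damage.",
    "renewals": "Policies renew every six months, with a renewal notice sent 30 days in advance.",
}

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k help center articles most semantically similar to the question."""
    q_vec = embed(question)
    ranked = sorted(HELP_CENTER, key=lambda doc: cosine(q_vec, embed(HELP_CENTER[doc])), reverse=True)
    return ranked[:k]

def answer(question: str) -> str:
    articles = retrieve(question)
    # A real assistant would prompt an LLM to answer ONLY from this content,
    # and link back to the full articles for more depth.
    context = " ".join(HELP_CENTER[a] for a in articles)
    return f"(LLM summary drawn from {', '.join(articles)}) {context}"

print(answer("When does my policy renew?"))
```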

Quiq acknowledges the hallucination issue, and how it is mitigated:

While this sounds easy, it isn’t. In addition to language understanding and generation skills, LLMs also can come up with answers on their own. This is known as hallucinating.

Clearly, LOOP doesn’t want its customers receiving any information that they haven’t pre-approved, so Quiq harnesses the power of the LLM while instituting safeguards and fact-checking to ensure the only answers provided to customers come from LOOP’s knowledge base.
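Quiq hasn't published the internals of those safeguards, but the general shape of a post-generation grounding check is easy to sketch. The snippet below is a hypothetical illustration only: it checks whether each sentence of a draft answer is lexically supported by the retrieved knowledge-base passages, and refuses otherwise. A production system would more likely use an LLM or entailment model for this check, but the fallback behavior mirrors the "I'm not trained to answer that" response Berke describes.

```python
# Hypothetical post-generation grounding check - an illustration of the
# general safeguard idea, not Quiq's actual fact-checking pipeline.
def is_grounded(sentence: str, passages: list[str], min_overlap: float = 0.6) -> bool:
    """Crude lexical support check: most content words appear in some passage."""
    words = {w for w in sentence.lower().split() if len(w) > 3}
    if not words:
        return True
    return any(len(words & set(p.lower().split())) / len(words) >= min_overlap for p in passages)

def guard(draft_answer: str, passages: list[str]) -> str:
    """Pass the draft through only if every sentence is supported by the knowledge base."""
    sentences = [s.strip() for s in draft_answer.split(".") if s.strip()]
    if all(is_grounded(s, passages) for s in sentences):
        return draft_answer
    return "I'm not trained to answer that. Let me connect you with an agent."

# Invented example passage and drafts:
passages = ["Policies renew every six months. You'll get a notice 30 days before renewal."]
print(guard("Policies renew every six months.", passages))          # supported - passes through
print(guard("Your renewal discount is 40% this year.", passages))   # unsupported - refused
```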

To illustrate, Quiq shared this graphic with diginomica:

(Quiq's AI assistant - process flow)

I asked Myer: why multiple LLMs? Myer explained that running in parallel has advantages:

The benefit of doing it in parallel is we can ask multiple questions like: Is it a current customer? Or a prospect? Is it a customer in Texas or somewhere else? We can ask those questions [from multiple LLMs] at the same time. 

Different LLMs have different strengths, on a per-query basis. Myer:

As it turns out, some questions are better handled by different LLMs. GPT-4 is really slow, and so we try not to use GPT-4 very often, but there are some types of questions that are actually handled much better by GPT-4. And so in the process of training the assistant, if we're not satisfied with the performance that we're getting out of GPT-3.5, we might, in a particular instance, be required to use GPT-4, and have a little bit slower response time.

Claude and LLaMA are coming up. And in some situations, the speed - especially when you're using the phone - the speed is really important. And even GPT-3.5 might not be fast enough; other language models might be fast enough. So we've taken this approach: language models are like utilities, and we'll use the right utility for that particular case.
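Quiq hasn't shared its orchestration code, but the pattern Myer describes - several routing questions fired in parallel, each mapped to whichever model handles it best - can be sketched roughly as follows. The model names, routing table, and call_llm stub are all illustrative assumptions, not Quiq's actual configuration.

```python
# Rough sketch of parallel classification plus per-question model routing.
# call_llm is a stub standing in for a real provider API call.
import asyncio

async def call_llm(model: str, prompt: str) -> str:
    """Stub for an LLM API call; a production system would hit the provider here."""
    await asyncio.sleep(0.05)  # simulate network latency
    return f"[{model}] answer to: {prompt}"

# Hypothetical routing table: faster/cheaper models by default,
# a slower but stronger model only for question types that need it.
ROUTES = {
    "is_current_customer": "fast-model",
    "customer_region": "fast-model",
    "policy_coverage_detail": "strong-but-slow-model",
}

async def classify_in_parallel(message: str) -> dict[str, str]:
    """Ask several routing questions about the same customer message at once."""
    tasks = {
        question: asyncio.create_task(call_llm(model, f"{question}: {message}"))
        for question, model in ROUTES.items()
    }
    return {question: await task for question, task in tasks.items()}

results = asyncio.run(classify_in_parallel("Can I add my teenager to my Texas policy?"))
for question, reply in results.items():
    print(question, "->", reply)
```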

My take - enterprise LLM success depends on accuracy and data quality

2024 will be a revealing year for enterprise LLMs; LOOP's story demonstrates exactly why. Some may obsess over expanding LLM parameters; I'm more interested in how the accuracy of LLM output will change in an enterprise context, when honed with industry- and customer-specific data.

We've heard bold proclamations about customer service jobs getting demolished by generative AI. But what we see with LOOP is more what I envisioned: harnessing a potent-but-imperfect technology to achieve notable results, but not eliminating the need for higher-level human reps for service escalations. Of course, LOOP also uses those human reps for sales; the blurring of the lines between service, support and sales/upselling is one of the most compelling CX business stories. AI has a valuable role to play here, freeing up humans for the interactions that matter most.

When I used LOOP's bot, I couldn't compare the results to my own customer record, since I'm not a customer. But I was unable to make Quiq's bot do anything close to a hallucination. Of course, Quiq's bot experience is very different from the freewheeling appeal of a ChatGPT interface with limited guardrails. Part of the appeal (and problem) with the ChatGPT prompt is the ability to engage the bot on just about anything. That might be a lot more fun, but it won't translate to most enterprise situations (see: accuracy issues, IP risk, regulatory risk, copyright risk, hallucination risk, etc).

But it's the constraints of Quiq's bot that make it well-suited for a service bot use case, particularly in a regulated industry. Even with those constraints, it's far more engaging to interact with LOOP's bot than the frustrating service bots that are still the norm on most home pages. However, I was occasionally able to get LOOP's bot to offer a boilerplate/static answer, in moments when I felt the bot could have positioned LOOP's differentiated insurance services better. But LOOP will surely add to its help documentation to improve those moments; once the documentation covers those scenarios, the bot can draw on them too.

LOOP's clean/quality data is a non-negotiable key to this bot's success. But that shouldn't scare off companies wrestling with data silo and governance problems. To be effective, LOOP just needed a focused set of quality data. This opens up the possibility of launching assistants in narrower areas where data is cleaner, without having to overhaul master data across the entire enterprise to achieve an AI result. Better to notch an early win and earn some AI user trust - then clean, build, and expand from there, as LOOP is doing.

For AI-changes-everything enthusiasts, this type of gen AI use case probably just isn't sexy enough. But live projects with results are not to be taken lightly (in the ten years since I first wrote about blockchain, not one vendor has stepped up with a live production customer for me to write about).

This is shaping up as an evolution of customer service (notice I did not say revolution), but success will not be pre-ordained because of generative AI. Each project will have to earn LOOP's type of success, via careful attention to bot design and data input. It's a disciplined new option for a business result, not magical technology powder to sprinkle on flawed data.

Opening up new automated service channels, e.g. text messaging, is another clear win. In 2023, serving customers via LLMs without going off the rails is a pretty strong LLM achievement (if you disagree, I'd point you to one of countless articles such as Amazon’s Q has ‘severe hallucinations’ and leaks confidential data in public preview, employees warn, and Simple Hacking Technique Can Extract ChatGPT Training Data). At any rate, this LOOP use case is the kind of enterprise gen AI story we intend to document further in 2024.
