Enterprises will test the limits of LLMs - and ChatGPT is just the beginning
- Summary:
- The fall event season is upon us, and customers have plenty of generative AI questions - but are vendors ready? Here are the questions I'll be pressing, starting with: the impact of industry-specific LLMs, and the debate over AI pricing. Let's see who responded to my PR challenge.
Generative AI has taken over my inbox. Every PR pitch extols generative AI's enterprise revolution. Ready for the true shocker? I really don't mind.
I believe the enterprise is going to test LLMs to the edge of their capabilities. What we find out will inform the practical future of AI.
Enterprise AI vendors are determined to raise the bar on the flawed performance of ChatGPT-type output, where models are trained on the entire Internet - and imperfect guardrails then imposed. But will these vendors succeed? If so, which ones?
The generative AI PR challenge - how are your LLMs trained?
I'm not much of a mark for PR pitches. Right now, the customer use cases are mostly in the early stages. Ergo, there isn't much ROI to talk about yet; cue the "revolutionary" hot air.
So I laid down a PR challenge: instead of telling me how revolutionary your AI is, can your firm address the specifics of how your LLMs (Large Language Models) are trained, and how customer data is used (and protected)? As I wrote in my PR GenAI Challenge:
Some of the issues I am looking at include: managing IP risk, problems of customer data/pricing, black box/explainability, technical limitations of LLMs, difficulties with using third party LLMs and customizing them with customer training data - plus implications for customer pricing, use case pros/cons etc.
PR firms just looove these questions! About ten companies decided my AI challenge was better left unanswered. So far, two firms have really stepped it up: Pantheon and Genpact (with honorable mention to Alteryx).
The million dollar enterprise AI question - will industry LLMs improve results?
I'll be pressing the issue at the fall events I am ~~crashing~~ attending soon. It's time to move off "AI will transform customer service!" type proclamations, and share specifics:
- How are enterprise LLMs being trained?
- How are beta customers faring? What use cases have early traction?
- Will early customers pay a premium for AI features? How will that premium go over, if those same customers are investing their own data - and co-innovation time - in training enterprise models?
- How comprehensive is the approach to reducing model/data bias?
I'm sure vendors are looking forward to seeing me soon! As I see it, here's the question at the heart of enterprise AI:
Will industry-specific LLMs, further refined by individual customer data, provide a superior level of output to the ChatGPT-type experience? Will enterprise LLMs outperform models trained on GPT's larger-but-messier data sets? Just how superior will that output be? Enough to potentially take (human) adult supervision out of the loop, at least in some cases?
This is a million dollar question, a billion dollar question - maybe a lot more. Now, I don't believe that even a specialized industry LLM will deliver output that is accurate enough to remove humans from the loop (surprise - some vendors disagree). Some data scientists believe they can solve this via deep learning refinements. My view is that deep learning can't solve itself, and that "hybrid" AI approaches will be needed - but that's a complex debate for another time.
Will the "instant mediocrity" of ChatGPT be good enough for the enterprise?
Analyst Hyoun Park has a different take. Park agrees that the output of ChatGPT is mediocre, but he makes the point that automating mediocrity can still lead to new efficiencies. He calls this "instant mediocrity," and he doesn't see this as an insult, but a productivity enhancer. As Park wrote in Instant Mediocrity: a Business Guide to ChatGPT in the Enterprise:
The truth is that instant mediocrity is often a useful level of skill. If one is trying to answer a question that has one of three or four answers, a technology that is mediocre at that skill will probably give you the right answer. If you want to provide a standard answer for structuring a project or setting up a spreadsheet to support a process, a mediocre response is good enough. If you want to remember all of the standard marketing tools used in a business, a mediocre answer is just fine. As long as you don't need inspired answers, mediocrity can provide a lot of value.
As Park implies, getting to mediocre much faster than a team of humans can has real value. Why? Because instant mediocrity can scale. And: if you put generative AI into a specialized digital assistant role, you can help humans raise their game.
In perhaps the most comprehensive study on generative AI in the enterprise to date, Generative AI at Work (PDF link), researchers identified a nifty win: junior-level customer service employees, armed with an AI digital assistant that baked in the know-how of senior level employees, delivered better results. Service reps liked their jobs better too - therefore, better retention. (You can check a survey summary at MIT Sloan, Workers with less experience gain the most from generative AI.)
With the help of enterprise LLMs and customer-specific refinements, I believe we can move beyond "instant mediocrity" use cases. But human-in-the-loop design can't take the rough edges off of every use case. Some generative AI use scenarios truly have an outlier problem - where the inevitable outliers have too much downside to pursue. Here's a goofy/extreme example: generative AI for mushroom classification:
'Life or Death:' AI-Generated Mushroom Foraging Books Are All Over Amazon https://t.co/Sq5azvhFar
-> I also heard from one goofy startup intending to focus on mushrooms for their GPT endeavors. Talk about an outlier problem!
— Jon Reed (@jonerp) September 3, 2023
Enterprise use cases involving hazardous industrial materials would surely qualify. At a less intense extreme are interactions with customers governed by regulations - which is why some industries are putting customer-facing generative AI projects on the back burner, and starting with internal-facing systems (banking is a good example, as generative AI can struggle a bit with numbers).
Can GPT work for enterprises out of the box? Mixed reviews
If you want to make an argument for out-of-the-box enterprise value, that service example is the type of use case you'd point to: imperfect but powerful tools, narrowed via customer-specific data and role-based functions. The outlier problem - giving out slightly inaccurate information - has a minimal downside, and is probably happening regardless.
Alteryx surveyed 300 companies worldwide on the perceived risks of generative AI. Early results are promising.
While 89% of the respondents currently using generative AI in their organization (36%) have already realized modest or substantial benefits of generative AI, 70% of generative AI users reported that they trusted AI to "deliver initial, rapid results that I can review and modify to completion."
However, many companies are sitting on the sidelines. Some have banned ChatGPT. Others are surely awaiting tools with satisfactory IP, data privacy and governance guardrails. As per Alteryx:
47% of respondents not currently using generative AI in their organization cited data privacy concerns as the main reason they haven't implemented the technology.
No doubt those risk numbers are higher in some industries/projects, and lower in others (e.g. financial services must check off numerous regulatory boxes, though AI can also help with that - my podcast on this topic will be live soon).
Judging from my PR challenge, we are nowhere near consensus on whether a GPT-type, out-of-the-box solution will suit enterprise needs. Sreekanth Menon, Genpact's VP and Global AI/ML Services Leader, reports good early results with GPT-based approaches:
Currently, we are collaborating with clients on a variety of use cases and have found that GPT performs satisfactorily in the majority of scenarios and excels in several others.
However, Menon noted that to make this work, his team utilizes a number of ChatGPT "mitigations" (a rough code sketch of the prompting piece follows the list):
- Fine-Tuning: Models can be fine-tuned on domain-specific corpora to make them more relevant and accurate for enterprise use cases.
- Human-in-the-Loop: Combining AI with human expertise can allow for more accurate and reliable systems.
- Secure Deployment: By using on-premises installations or specialized cloud services, data security can be maintained.
- Custom Prompting: Careful design of prompts can guide the model towards generating more relevant and accurate answers.
- Standardization: Enterprises can set guidelines for how the model should be used, to ensure consistency across the organization.
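To make the "custom prompting" item concrete, here's a minimal sketch of the pattern, using the OpenAI Python client (pre-1.0 style). The model name, system prompt, and banking scenario are my own illustrative assumptions - not Genpact's implementation:

```python
# Minimal "custom prompting" sketch - illustrative only, not Genpact's setup.
# Assumes the openai package (pre-1.0) and OPENAI_API_KEY in the environment.
import openai

SYSTEM_PROMPT = (
    "You are a customer service assistant for a regulated bank. "
    "Answer only from the provided policy excerpts. If the excerpts "
    "do not cover the question, say so and route to a human agent."
)

def draft_answer(question: str, policy_excerpts: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # favor consistent, reviewable output over creativity
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Policy excerpts:\n{policy_excerpts}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message["content"]
```

Note how this dovetails with the "human-in-the-loop" mitigation: in a setup like the customer service study above, a draft answer like this would land in front of a rep for review, not go straight to the customer.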
As for the data privacy/risk concerns that keep enterprises in the caution lane, Menon is also optimistic:
Moving forward, OpenAI has announced the launch of ChatGPT Enterprise. OpenAI states that the upcoming features on ChatGPT Enterprise can securely extend ChatGPT's knowledge with company data by connecting the applications already in-use. With this in place, such models may be able to respond within the context boundary of the enterprise. However, no concrete research results are available yet to substantiate the claim.
Josh Koenig, CSO and co-founder at Pantheon, had a different take on GPT's enterprise readiness:
I think GPT is a long way away from fully automating much of anything, and the specific user-experience framework of "Chat" has limited utility. However, with the ability to bring in proprietary training data to the table, as well as setting up deeper integrations via API, I do think there are real enterprise applications. But it's going to feel more like staff augmentation, 'what could you do with a thousand interns that never sleep' vs 'the AI can handle this task or give these answers authoritatively.'
My exchange with Davi Ottenheimer, VP of Trust and Digital Ethics at Inrupt, was even spicier. As per my assertion that ChatGPT won't adapt well to the enterprise, Ottenheimer responded:
I agree with this line of thinking. Your belief is well-founded in what probably should be described as post-enlightenment thinking. The post-Newton – or perhaps post-Hume – scientific rigor of discarding things that are probabilistically false helps Enterprises navigate towards a progressive upside of profit without causing harm. The very nature of an Enterprise business model is that it operates using efficiencies, better known as regulations, to avoid costly known bad behaviors. Academic approaches to data lean towards overly "wide spectrum" or "both sides" doctrines, dragging everything and anything into view yet eschewing accountability for errors. ChatGPT has had some very problematic missteps and hasn't yet proven it won't blindly drive an Enterprise into some wasteful, entirely avoidable accidents and real harm.
My take
My current position: enterprise AI will not be an out-of-the-box winner, especially those models trained on massive Internet data sets. Yes, there may be some quick-value scenarios, but I believe we'll hit the limits of output relevance pretty quickly. For those who disagree, I would counter: who can argue against testing industry-tailored LLMs, in combination with customer data?
Either way, we're about to find out. C3.ai, which I hope to feature in a future installment in this series, has already released 28 domain-specific LLMs. SAP's "Business AI" plans are also intriguing. Their approach? Combine third-party LLMs from trusted partners with SAP's own (opt-in) foundational model, enhanced with an infusion of real-time customer data via a vector database (for more links and info on the industry LLM surge, check Larry Dignan's Get ready for a parade of domain specific LLMs).
Which leads me smack into the enterprise AI pricing debate. Some vendors have been vocal about their plans to put premiums on generative AI pricing. Before I weigh in, I should note the essential follow-up to my million-dollar question above: how much involvement will a customer's domain experts need to have, in order to launch effective generative AI scenarios? Experts differ.
I happen to believe that reinforcement learning, via a customer's domain experts, is also an important piece of this puzzle - at least in the launch phases (not to mention helping to hone/test the most effective prompts). Not everyone agrees: some deep learning experts believe the need for human experts to refine models won't be significant; they point to ML advancements such as RAG (retrieval-augmented generation).
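For readers who haven't run into the term: RAG retrieves relevant documents at query time and injects them into the prompt, rather than retraining the model - which is why its proponents argue it shrinks the expert-refinement burden. Here's a toy sketch of the pattern; the hashed bag-of-words embedding is my stand-in, purely so the example runs - a real system would use a trained embedding model and a vector database:

```python
# Toy RAG skeleton: rank documents by similarity to the query, then
# inject the best matches into the prompt. Illustrative only.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Hashed bag-of-words - a stand-in for a real sentence-embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def top_k(query: str, docs: list[str], doc_vecs: np.ndarray, k: int = 3) -> list[str]:
    q = embed(query)
    # Cosine similarity between the query and every document vector
    sims = (doc_vecs @ q) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

def rag_prompt(query: str, docs: list[str], doc_vecs: np.ndarray) -> str:
    # Retrieved passages ride along in the prompt; the model itself is
    # never retrained on the customer's documents
    context = "\n---\n".join(top_k(query, docs, doc_vecs))
    return ("Answer using only the context below. If it is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# Example: two policy snippets standing in for a customer's document store
docs = ["Refunds are issued within 30 days of purchase with a receipt.",
        "Standard shipping takes 5-7 business days."]
doc_vecs = np.stack([embed(d) for d in docs])
print(rag_prompt("How long do refunds take?", docs, doc_vecs))
```

Whether that retrieval step is enough to sideline domain experts - or just changes what those experts spend their time on (curating the document store, testing prompts, judging outputs) - is exactly the debate above. My money is on the latter.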
Why does this matter? I believe it has a strong bearing on the pricing debate. Does it make sense to charge aggressive AI premiums when the model's accuracy depends on the infusion of the customer's data, and the significant effort of their own domain experts? As I said in a recent podcast, the more a customer invests in co-innovation, the less you can rationalize big AI surcharges to that same customer.
I'm not saying that vendors won't need to boost prices to profit from AI; we all know generative AI is not a discount aisle technology. But there is a tension here. Until a customer adopts your AI platform, they could be kicking tires on another.
On Workday's last earnings call, co-CEO Aneel Bhusri made the case for initial pricing based on customer data/participation. As per Constellation's Larry Dignan:
Bhusri said Workday isn't necessarily looking to charge for generative AI add-ons. Why? Because customers are sharing anonymized data in return for insights. Workday can use that data to train models.
"The data is valuable to train LLMs and domain specific LLMs. We turn around and make our products more competitive," said Bhusri, who added that Workday is likely to create new products based on models.
Mark me down on the customer co-innovation side, where pricing takes a back seat to earning customer AI trust - and developing winning use cases. As vendors lean into their fall shows, I hope they emphasize transparency and open discussions, and spare us the hyperbole about AI revolutions. That bombast was fine last spring, but the conversations have moved on. During my recent AI discussion with Acumatica CPO Ali Jani, it was refreshing to hear Jani talk about the importance of experimenting on AI with customers. He said for every 10 or 15 such experiments, one of them really sticks. That's the kind of AI project transparency we sorely need right now.
In the months to come, we can look forward to learning just how effective/accurate/transformative enterprise LLMs can be. But that day hasn't arrived yet. What should we do in the meantime? Well, we could linger in the land of the "game changing" AI keynotes, or dig deeper. The customers I speak to urge us to press on: into the essential topics of data privacy, bias, explainability and pricing. This summer, I've had vendors punt the question of how their models are trained. That's not going to work this fall.
Obviously, I'm out of word count to go further into my PR challenge responses. If your firm would like to participate, my next installment will get into enterprise alternatives to ChatGPT. For now, I'm hitting the tarmac, pesky questions in hand.