
ChatGPT-4o - a tiny step for AGI, a giant leap for emotional (like) AI

George Lawton | May 15, 2024
OpenAI is starting to roll out ChatGPT-4o (omni), which boasts impressive emotional intelligence emulation skills. It represents a significant advance in multi-modal AI. However, caution is warranted as we learn about new safety issues and enterprise risks.


Much digital ink has been spilled on the release of OpenAI’s new ChatGPT-4o (omni), which boasts impressive advances in emotional (like) intelligence, more seamless interactions, and the fact that it will be free. Less analysis has been devoted to why it represents a significant advance in multi-modal AI, what its current limits are, and what new risks it introduces, which I will go into below.

OpenAI has made impressive strides with ChatGPT-4o, which is built on a more efficient and capable Large Language Model (LLM), GPT-4o. It outperforms previous models and competitive offerings in terms of speed, cost-effectiveness, and emotional responsiveness. This represents a significant leap toward a new generation of LLMs capable of interpreting paralinguistic cues (tone of voice, cadence, and facial expressions) that signal our intentions, responses, and behavior.

OpenAI says it is taking a gradual approach in rolling out the enhanced audio capabilities to help understand and mitigate new safety risks:

We recognize that GPT-4o’s audio modalities present a variety of novel risks. Today we are publicly releasing text and image inputs and text outputs. Over the upcoming weeks and months, we’ll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities. For example, at launch, audio outputs will be limited to a selection of preset voices and will abide by our existing safety policies. We will share further details addressing the full range of GPT-4o’s modalities in the forthcoming system card.

The new ChatGPT-4o (omni) model is not only faster but also more efficient. It can respond to audio in as little as 0.32 seconds, compared with the 2.8 or 5.4 seconds of previous models. Moreover, it directly processes raw audio, enhancing the user experience. It’s definitely worth watching the demos available here and exploring the service when it goes live.

OpenAI also made modest technical improvements in the model’s ability to answer questions, with slightly lower hallucination rates. For example, it hallucinates answers to math problems only 23.4% of the time, compared to 27.4% for GPT-4T and 57.5% for the original GPT-4. OpenAI’s demos showed how ChatGPT-4o could help teach math, but this may be unwise for now.

You also may not want to trust it too much when analyzing important business data without the guidance of human experts. This will become increasingly difficult: if ChatGPT already instilled a false sense of confidence, the new service's emotional emulation skills are likely to do so in spades. That said, OpenAI has also introduced a new Model Spec, a set of safety guidelines intended to deepen public discussion on AI safety, covering expressing uncertainty, protecting privacy, and respecting creators and their rights, among other things.

Emotional emulation

The biggest advance is that the new model treats raw audio as a first-class citizen in generating responses. Previously, OpenAI’s Whisper Automatic Speech Recognition (ASR) engine translated speech to text, a language model processed the text, and a text-to-speech engine returned the results.
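OpenAI has not published GPT-4o’s internals, so the following is only a toy sketch, with all names, fields, and functions hypothetical, of why the older cascaded pipeline loses information: once audio is reduced to a transcript, paralinguistic cues such as pitch and pace are gone before the language model ever sees the input.

```python
from dataclasses import dataclass


@dataclass
class AudioClip:
    """Hypothetical stand-in for a raw audio recording."""
    text: str   # the literal words spoken
    pitch: str  # paralinguistic cues carried by the waveform
    pace: str


def transcribe(clip: AudioClip) -> str:
    """ASR step: returns words only; pitch and pace are discarded."""
    return clip.text


def cascaded_pipeline(clip: AudioClip) -> str:
    # Old approach: speech -> text -> LLM -> speech.
    # The language model only ever sees the transcript.
    transcript = transcribe(clip)
    return f"model input: {transcript!r}"


def end_to_end(clip: AudioClip) -> str:
    # GPT-4o-style approach (sketched): the model consumes the raw
    # audio, so paralinguistic cues can shape the response.
    return f"model input: {clip.text!r} (pitch={clip.pitch}, pace={clip.pace})"


clip = AudioClip(text="I'm fine.", pitch="flat", pace="slow")
print(cascaded_pipeline(clip))
print(end_to_end(clip))
```

A flat, slow “I’m fine.” plainly does not mean what the transcript says, and only the end-to-end path retains the cues needed to notice that.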

This new approach allows the model to take advantage of paralinguistic cues about how we speak when generating responses. Paralinguistic information is everything we convey beyond the literal words when talking with other people to express nuance, carry emotion, or modify meaning. Examples include prosody, pitch, volume, and intonation.

This advances multimodal AI to a new level of precision by cross-referencing different kinds of information within each modality. In this context, modality refers to information channels like text, audio, video, or data streams. However, there is also some structure to the data within a channel. For example, human speech also contains paralinguistic data that relates to and shapes the meaning of what we say.

The industry is still figuring out how to characterize the different types of meaning conveyed within a single modality, like audio. Intra-modal AI might be a good way to describe this ability to cross-correlate different types of meaning within a given modality and how this differs from traditional inter-modal approaches that consider text descriptions and images. OpenAI has not provided details on how they implemented this new capability or what they call this type of technique internally.

The upshot is that it will make it easier to standardize the analysis of conversational cues across different speakers and cultures. Early use cases might include better customer service, sales, and counselor coaching tools. It may be tempting to roll this out for customer-facing chatbots to improve user experience. However, this could also introduce new reputational or business risks.

Free (sort of)

OpenAI also plans to launch a free version of its latest, most capable models. This is to be lauded, as it will certainly extend the value of these models to a much larger audience.

But as with all free products, it's worth cautioning that when big tech companies offer us a service for free, we end up being the product. With social media, that means monetizing our attention. With search, that means monetizing our intent. A free, emotionally responsive AI that tracks our questions, tone of voice, and subsequent behavior has the potential to monetize something far more intimate, in ways that could prove far more helpful or dangerous depending on guardrails, regulators, and investors.

As with all big tech companies, it's worth pondering the unique path of this new class of capabilities towards Enshittification. Cory Doctorow, who coined the term, dives deeper into it in this Financial Times essay. Social media companies extolling connection have been called out for monetizing dissent. Search companies extolling “don’t be evil” have been called out for burying quality search results behind less helpful ads.

OpenAI is on the right track by spelling out important safety goals in its Model Spec, at least regarding the models themselves. However, some of its business and operational behaviors deserve closer scrutiny. For example, its Model Spec includes the rule “Respect creators and their rights.” Yet the company has been less transparent about what copyrighted data it collects, a question at the center of many lawsuits still working their way through the courts.

My take

ChatGPT-4o represents an impressive leap in weaving signals of emotion into a new generation of LLMs. Once the service goes live, it will definitely be worth exploring to help appreciate how it and future competitive offerings may shape and improve user experience design. That said, the cautious reader may note that I deliberately avoided the term “understand” in describing what this new thing does, to distinguish it from human understanding.

Also, while this represents a significant leap in emulating emotional intelligence, I would argue that it is not emotionally intelligent. Some early experimenters have claimed that ChatGPT-4o is a major step towards AGI. It is clearly not; it is just going to be more persuasive and perhaps hallucinate less. These are important distinctions because its new emotional emulation capabilities might surface useful insights, and it is important not to mistake sophisticated pattern matching for genuine experience.

A few years ago, Google engineer Blake Lemoine was fired after publicly claiming that an LLM was sentient. With the wider rollout of emotional emulation capabilities in LLMs, far more people will fall for the same trick. When used cautiously, these tools may provide deeper insight into what stresses us, our relationships, and our businesses, and how to improve them. But it's important to realize we are still coming to grips with the new dangers they may create, both from the models themselves and from the businesses running them.
