Why semantic search is a better term than understanding for gen AI

George Lawton, May 20, 2024
On the surface, it looks like new generative AI models are getting better at understanding us and the world. But this glosses over the risks and opportunities. The term semantic search makes it easier to understand new risks, frame business problems, and identify opportunities, particularly with multimodal AI.


Generative AI is increasingly characterized as understanding our questions and the world. For example, a Google search on “generative AI understand” turns up 190 million results, compared with 550 million for just “generative AI.” Gen AI’s seemingly magical ability to answer questions, write content, and generate code is impressive. But “understand” seems different from sophisticated pattern matching.

Just for clarity, the Oxford Dictionary defines understand as:

to know or realize the meaning of words, a language, what somebody says, etc.

Gen AI is certainly good at matching patterns, but “know” or “realize” feels a bit generous for what it does. A much more useful framing is that gen AI uses semantic search to distill intent and match it with a relevant output. This is important because it helps to de-humanize AI a bit so we can think about its strengths and weaknesses more objectively. The distinction helps us keep a level head in the wake of increasingly persuasive and emotionally expressive AIs such as GPT-4o.

It's important to remember that there are many different kinds of semantic search; gen AI takes it to a more nuanced level thanks to its billions or trillions of parameters. Semantics does not just have to be about the meaning of words and sentences but also the interconnection of meaning found in audio, video, and even enterprise documents. Academics and technicians widely use this more nuanced definition, but it has fallen out of vogue in mainstream conversations.

For example, Maxime Vermeir, Sr Director of AI Strategy at ABBYY, explains how the company uses semantic object detection to interpret a document's structure and identify what kind of document it is, which improves processing:

A document has structure. So, here's an interesting exercise. Let's say that I put two documents over there. And one is one type of document, one is another. Let's say one is an invoice. One is a contract. Just from this distance, I'm pretty sure that you could tell apart which is which if you know that those are the two options. You don't need to actually read the text to know this. That is image semantic object detection.

What is semantics?

A search for the definition of “semantics” across multiple dictionaries turns up something like this first listing from the Oxford Learner's Dictionary:

the study of meaning in language.

But this glosses over the importance of meaning in other modalities of data like audio, images, video, protein structures, and IoT data streams. You have to dust off the 1939 listing from the Oxford English Dictionary to find this wider interpretation:

Of or relating to meaning (of any kind)

This expanded definition is important for understanding the current limitations and potential opportunities for different approaches to multimodal AI. For example, most of the current generation of multimodal AI models sort of glue audio or images to an existing large language model. These are much easier and cheaper to develop but lose insight into deeper patterns in the raw data.

Newer multimodal models, such as GPT-4o, combine multiple modalities at training to distill semantic information at a more nuanced level. Prior versions translated speech to text, processed it, and then translated this text back into speech. In that pipeline, the emotional nuance embedded in the raw audio never got baked into the resulting multimodal AI.

OpenAI has not published any details on how they trained this system. Competitors like Microsoft and Google’s DeepMind have elaborated on different ways of doing this. DeepMind has written about using transformers to co-tokenize video and text to efficiently fuse spatial, temporal, and language information for video question answering. They have a video showing how it can associate the meaning across domains to interpret the ingredients that go into a salad.

Microsoft researchers have explored how multimodal diffusion models help generate semantically consistent audio for video. In other words, the sounds of waves would match the way they crash over the shore, or a skier’s sounds would match the way they slalom down a slope.

Extending semantic search beyond words

Traditional search techniques looked for keyword matches. This works great when looking for an exact match but fails to bring up useful information when others use a different word to describe the same thing. Semantic search turns words and documents into vectors so that relevant information comes up even when different terms are used. For example, “puppy,” “kitten,” and “infant” all describe young creatures, while “run,” “trot,” and “canter” all describe similar ways a horse can move quickly.
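To make the contrast concrete, here is a minimal sketch of keyword matching next to vector similarity. The three-dimensional vectors are invented by hand for illustration and the function names are hypothetical; a real system would get its vectors from a trained embedding model with hundreds or thousands of dimensions.

```python
import math

# Toy vectors, hand-picked for illustration only (not from a real model).
# Hypothetical dimensions: [youth, animal, motion]
EMBEDDINGS = {
    "puppy":  [0.9, 0.8, 0.1],
    "kitten": [0.9, 0.7, 0.1],
    "infant": [0.9, 0.1, 0.0],
    "trot":   [0.0, 0.6, 0.9],
    "canter": [0.0, 0.6, 1.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_search(query, documents):
    """Return only the documents that contain the exact query word."""
    return [doc for doc in documents if query in doc.split()]


def semantic_search(query, vocabulary=EMBEDDINGS):
    """Rank every other word by how close its vector is to the query's."""
    query_vec = vocabulary[query]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in vocabulary.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = ["a kitten sleeps", "a puppy plays"]
    print(keyword_search("puppy", docs))   # exact match only
    for word, score in semantic_search("puppy"):
        print(f"{word}: {score:.3f}")
```

The keyword search misses the kitten sentence entirely, while the vector ranking places “kitten” and “infant” ahead of the horse gaits because their made-up vectors point in a similar direction.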

All types of AI need to translate raw information streams into numerical patterns, called embeddings, using a model trained with various machine learning techniques. Until recently, training a model to turn words into embeddings required a lot of labeled data, and the results were somewhat brittle when it came to nuanced patterns too complex for humans to recognize or consistently label.

Google’s innovation with transformers in 2017 enabled patterns to be learned from a large collection of raw text. This facilitated the recent progress in LLMs, which sparked the explosion in generative AI. The LLMs could learn an embedding model directly. The resulting embeddings can either be processed by the same LLM or used to represent information for the more efficient techniques found in vector databases.

More importantly, transformers can also be used to discern patterns across multiple data modalities. Translating different modalities into a consistent vector format just requires a bit more finesse. This work is also leading to better models of protein interactions in drug development, correlating network sensors with security events, or distilling business process models using process mining data streams.

Traditionally, semantic search has relied on older machine learning techniques like approximate nearest neighbor algorithms. They are useful for pulling up a site, a document, or a text string within a document. LLMs take this to a new level of flexibility by semantically searching across their billions of parameters to find a pattern relevant to answering a question, summarizing content, or generating code. They are hundreds of times slower than vector search algorithms but also more flexible.
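The retrieval step in a vector search can be sketched in a few lines. This version is an exact, brute-force nearest neighbor search, not an approximate one: approximate algorithms trade a little accuracy for speed by avoiding a comparison against every stored vector. The document IDs, vectors, and function names below are made up for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, index, k=2):
    """Exact nearest neighbor search: score every vector, keep the best k IDs.

    Approximate nearest neighbor indexes aim to return nearly the same
    answer while scanning only a small part of the collection.
    """
    best = heapq.nlargest(k, index, key=lambda item: cosine(query_vec, item[1]))
    return [doc_id for doc_id, _ in best]


# Hypothetical document embeddings (hand-picked, not from a real model).
INDEX = [
    ("invoice-001", [0.9, 0.1, 0.0]),
    ("invoice-002", [0.8, 0.2, 0.1]),
    ("contract-001", [0.1, 0.9, 0.2]),
    ("memo-001", [0.2, 0.2, 0.9]),
]

if __name__ == "__main__":
    # A query vector that sits close to the two invoices.
    print(top_k([1.0, 0.1, 0.0], INDEX, k=2))
```

The cost of the brute-force version grows with the size of the collection, which is why large vector databases usually rely on approximate indexes instead.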

This is where the risks from issues like hallucinations come in. For example, an LLM might not have been trained on enough examples to discern the meaning of a particular statement within a larger context. Or it may not be able to interpret how our tone of voice modifies the meaning of words with sarcasm and humor. It also may not have been trained on different representations of data at various levels of abstraction.

Innovations in combining multiple representations of data across modalities will certainly help reduce some of these kinds of hallucinations. For example, a voice audio stream can be segmented in different ways to represent tone, volume, and rhythm, all of which color the meaning of the words spoken.
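As a rough illustration of what segmenting an audio stream into several representations can look like, the sketch below splits a synthetic signal into frames and computes two simple hand-rolled features per frame: root-mean-square level as a stand-in for volume, and zero-crossing count as a crude proxy for pitch. The signal, sample rate, and function names are assumptions made for the example; this is not how any particular multimodal model processes audio.

```python
import math


def frames(samples, size):
    """Split samples into consecutive non-overlapping frames, dropping a short tail."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def rms(frame):
    """Root-mean-square level, a simple measure of loudness."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def zero_crossings(frame):
    """Count sign changes; more crossings per frame suggests a higher pitch."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def describe(samples, size):
    """Return one small feature record per frame."""
    return [
        {"volume": rms(f), "zero_crossings": zero_crossings(f)}
        for f in frames(samples, size)
    ]


# Synthetic signal (assumed 8 kHz sample rate): a quiet low tone, then a loud high tone.
RATE = 8000
quiet_low = [0.1 * math.sin(2 * math.pi * 100 * t / RATE) for t in range(800)]
loud_high = [0.8 * math.sin(2 * math.pi * 400 * t / RATE) for t in range(800)]
features = describe(quiet_low + loud_high, size=800)

if __name__ == "__main__":
    for i, record in enumerate(features):
        print(i, record)
```

The same words spoken in the first and second frame would carry identical text but very different feature records, which is the kind of information a speech-to-text step discards.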

My take

Calling whatever magic generative AI does ‘semantic search’ is a bit of a mouthful and not likely to replace ‘understand’ any time soon. However, it is also important to appreciate that LLMs and their multimodal variants process information and hallucinate very differently than humans.

But the next time you find yourself overly trusting your persuasive new chatbot or liking it a little too much, it may not hurt to remind yourself that it's just running a sophisticated semantic pattern-matching algorithm under the covers. Recognizing that it’s all about semantic search may also inspire a better approach to organizing data into multiple levels of meaning to get better results.
