Large Language Models versus human intelligence - where do we stand now?

Neil Raden, November 29, 2023
Behind the OpenAI news cycle is a more fundamental question: just how intelligent are LLMs? And what can we learn about human intelligence from LLMs? It's a debate worth exploring.

I don’t stay up at night, eagerly anticipating the latest updates on AI or the OpenAI saga and other technology issues, attempting to "scoop" topics of immediate interest. I'm not a reporter or a journalist. I prefer to step back and gather different perspectives before forming my own opinion.

I seek out people with much more insight into these developments. Regarding AI, particularly issues involving "intelligence," Melanie Mitchell is a Professor at the Santa Fe Institute, and her Substack, AI: A Guide for Thinking Humans, is a must-read.

Consider this: Mitchell firmly asserts that studying the nature of intelligence in humans and AI changes our view about what is unique about humans. Machines don't understand the world the way we do. They need more context and common sense. She asks how AI can better understand our world and values. How do we develop the tools to understand AI?

The Cambrian Explosion in AI companies and solutions, especially in Large Language Models (LLMs) and generative AI, is challenging the assumption that AI has reached some level of human intelligence. It's true that their performance boggles the mind. I tried this myself:

Neil: Please write a proof of Pythagoras's Theorem and make every line rhyme.

ChatGPT: Certainly! Here's a rhyming proof of Pythagoras' Theorem:

In a realm where right triangles reign,

A square on the hypotenuse does contain,

An area, you see, of a certain degree,

Equal to the sum, quite magically.

Let's call the sides 'a' and 'b' with delight

And the hypotenuse 'c' in this geometric sight.

On each side a square, we carefully lay,

To reveal a truth, in a poetic way.

Square 'a' and 'b', on their sides they stand,

Bordering the triangle, quite grand.

Inside these squares, triangles four,

Mirror our original, at its core.

It didn’t actually prove the theorem, but could you write this in a few seconds?

In "Can Large Language Models Reason?", Mitchell asks the question, “If LLMs Are Not Reasoning, What Are They Doing?”

If it turns out that LLMs are not reasoning to solve the problems we give them, how else could they be solving them? Several researchers have shown that LLMs are substantially better at solving problems that involve terms or concepts that appear more frequently in their training data, leading to the hypothesis that LLMs do not perform robust abstract reasoning to solve problems but instead solve problems (at least in part) by identifying patterns in their training data that match, or are similar to, or are otherwise related to the text of the prompts they are given.

This is not cognition; it's merely mechanical perception.

Some GPT-based LLMs (pre-trained on a known corpus) were much better at arithmetic problems that involved numbers that appeared frequently in the pre-training corpus than those that appeared less frequently. These models appear to lack a general ability for arithmetic but instead rely on a kind of "memorization"—matching patterns of text they have seen in pre-training. As a stark example of this, Horace He, an undergraduate researcher at Cornell, posted on Twitter that on a dataset of programming challenges, GPT-3 solved 10 out of 10 problems that had been published before 2021 (GPT-3's pre-training cutoff date) and zero out of 10 problems that had been published after 2021. GPT-3's success on the pre-2021 challenges thus seems to be due to memorizing problems seen in its training data rather than reasoning about the problems from scratch.

This is quite an indictment of GPT’s problem-solving capabilities. However, there is a vigorous debate about what exactly LLMs “understand” and how different that is from how humans understand. On one side, most academics hold that models trained on language “will never approximate human intelligence, even if trained from now until the heat death of the universe.” In this view, “the behavior of LLMs arises not from grasping the meaning of language but rather from learning complex patterns of statistical associations among words and phrases in training data and later performing ‘approximate retrieval’ of these patterns and applying them to new queries.” Not all researchers agree.

Some studies and anecdotes about LLMs' capabilities for generalization and abstraction demonstrate an uncanny ability to solve problems or deal with situations quite different from those they were exposed to in their training data. Other studies highlight their tendency to “hallucinate” answers to queries and their vulnerability to adversarial attacks, both of which betray a poor grasp of the natural world and, especially, of the subtlety and ambiguity of users’ prompts.

There are a multitude of challenges. First and foremost, how will these technologies come to understand our world? Second, when will we have the tools to know whether they do?

OpenAI disclosed that GPT-4 scored very well on the Uniform Bar Exam, the Graduate Record Exam, and several high-school Advanced Placement tests, among other standardized exams designed to assess language understanding, coding ability, and other capabilities. But the evidence of human-level intelligence in GPT-4 is sketchy.

Critics claim that data contamination was at play. People taking standardized tests answer questions they have not seen before, but a system like GPT-4 may very well have seen them in its training data. OpenAI says it used a "substring match" technique to search its training data for text matching the test questions, a method that does not catch similar but not exact matches. One analysis criticized OpenAI’s method as “superficial and sloppy.” The same critics noted that “for one of the coding benchmarks, GPT-4’s performance on problems published before 2021 was substantially better than on problems published after 2021, GPT-4’s training cutoff. This is a strong indication that the earlier problems were in GPT-4’s training data. There’s a reasonable possibility that OpenAI’s other benchmarks suffered similar contamination.”
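To make the critics' objection concrete, here is a minimal Python sketch of what a substring-match contamination check might look like. It is an illustration under stated assumptions, not OpenAI's actual code: the function names, the normalization step, the window length, and the number of sampled windows are all hypothetical choices made for this example.

```python
import random
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def is_contaminated(question, corpus, window=20, samples=3, seed=0):
    """Flag a question if any sampled window of it appears verbatim in the corpus.

    `window` and `samples` are illustrative parameters (assumptions).
    """
    q = normalize(question)
    if not q:
        return False
    docs = [normalize(doc) for doc in corpus]
    if len(q) <= window:
        probes = [q]
    else:
        rng = random.Random(seed)
        starts = [rng.randrange(len(q) - window + 1) for _ in range(samples)]
        probes = [q[s:s + window] for s in starts]
    # Exact substring test only: paraphrases slip through.
    return any(probe in doc for probe in probes for doc in docs)


# Hypothetical one-document "training corpus".
corpus = [
    "Problem 12: Given an array of integers, return the indices "
    "of the two numbers that add up to a target."
]
verbatim = ("Given an array of integers, return the indices "
            "of the two numbers that add up to a target.")
paraphrase = ("Find two positions in a list of whole numbers "
              "whose values sum to the goal.")

print(is_contaminated(verbatim, corpus))
print(is_contaminated(paraphrase, corpus))
```

The verbatim question is flagged, while the paraphrase of the same problem passes as "clean." That gap between exact and near matches is the weakness the critics point to.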

Shortcut learning: machine learning and deep learning systems can make unpredictable errors when facing situations that differ from their training data. This is because such systems are susceptible to shortcut learning, in which statistical associations in the training data allow the model to produce correct answers for the wrong reasons. Machine learning, neural nets, and deep learning do not learn concepts; instead, they learn shortcuts that connect responses to the training set, applying statistical associations and probability assumptions to produce correct answers without any cognition of the intended query. One study showed that “an AI system that attained human-level performance on a benchmark for assessing reasoning abilities relied on the fact that the correct answers were (unintentionally) more likely statistically to contain certain keywords. For example, answer choices containing the word ‘not’ were more likely to be correct.”
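A toy Python sketch shows how a keyword shortcut of this kind can look like reasoning. Everything here is hypothetical: the "model" is a one-rule heuristic and the answer sets are invented for illustration, not taken from the study quoted above.

```python
def shortcut_answer(choices):
    """Pick the first choice containing the word 'not'; otherwise pick choice 0."""
    for i, choice in enumerate(choices):
        if "not" in choice.lower().split():
            return i
    return 0


def accuracy(items):
    """Fraction of items where the shortcut picks the labeled answer."""
    hits = sum(shortcut_answer(choices) == label for choices, label in items)
    return hits / len(items)


# Invented toy items: (answer choices, index of the correct choice).
# In this set the correct choice always happens to contain 'not'.
biased = [
    (["The claim follows.", "The claim does not follow."], 1),
    (["Rain is not implied.", "Rain is implied."], 0),
    (["The premise is not sufficient.", "The premise is sufficient."], 0),
    (["The result is valid.", "The result is not valid."], 1),
]

# Same choices, but the correct answer is now the one without 'not'.
shifted = [(choices, 1 - label) for choices, label in biased]

print(accuracy(biased))   # 1.0
print(accuracy(shifted))  # 0.0
```

The heuristic never reads a question, yet it scores perfectly as long as the keyword correlation holds and collapses once it is reversed. That is the sense in which a correct answer can arrive for the wrong reason.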

Let’s give Melanie Mitchell the last word on this:

One might argue that humans also rely on memorization and pattern-matching when performing reasoning tasks. Many psychological studies have shown that people are better at reasoning about familiar than unfamiliar situations; one group of AI researchers argued that the same patterns of "content effects" affect both humans and LLMs. However, it is also known that humans are (at least in some cases) capable of abstract, content-independent reasoning, if given the time and incentive to do so, and moreover we are able to adapt our understanding of what we have learned to wholly new situations. Whether LLMs have such general abstract-reasoning capacities, elicited through prompting tricks, scratchpads, or other external enhancements, still needs to be systematically demonstrated. 
