
AI needs foundational models - so what can we learn from GPT-3, BERT, and DALL-E 2?

Neil Raden, November 2, 2022
Foundational models address a fundamental flaw in bespoke AI. But foundational and large language models have limitations. GPT-3, BERT, and DALL·E 2 garnered gushing headlines, but models like these deserve scrutiny.

Robot hand in trust AI machine learning © zapp2photo

The Stanford Institute for Human-Centered Artificial Intelligence introduced the term foundational model. Most AI applications today are bespoke, designed to solve a particular problem with little concern for reuse or more general problems.

Foundational models are trained on a broad set of unlabeled data, and the resulting model can be applied to a wide range of tasks with little to no additional effort.

Early foundational models such as GPT-3, BERT, and DALL-E 2 showed promise in language and images. Given only a short string as a prompt, the system can return an entire essay, even if it wasn’t specifically trained on what you’ve asked, and the result is convincing enough to fool you into believing that it understands what you’ve asked and what it’s written. It doesn’t.

Foundation models use unsupervised learning and transfer learning to apply the connections and relations learned in one context to another. This creates an uncanny performance that appears to be logical thinking. An analogy is learning to drive one car and then being able to drive others with little or no additional training.

The thrill of large language models writing novels is over. The concept of foundation models is well understood, and the industry is now busy finding ways to apply the technology across various domains.

For example, IBM released CodeFlare, an open-source tool that streamlines the development and production of machine learning workloads for future foundation models, so that organizations can apply them to their most important problems in AI. An insurer, for instance, could customize an existing language foundation model for fraud investigation.

Foundation models have the potential, barring other external factors, to accelerate AI in the enterprise. Reducing the biggest headache, labeling requirements, will “democratize” the building of highly accurate, efficient AI-driven automation. Whether foundation models will address bias more effectively remains to be seen; earlier LLMs have met with skepticism.

Even before the recent craze about sentient chatbots, large language models (LLMs) created a lot of excitement and a fair amount of concern. LLMs - deep learning models trained on vast amounts of text - display characteristics that resemble human language understanding.

LLMs like BERT, GPT-3, and LaMDA maintain coherence over remarkable stretches of text and appear to know a range of topics. They can remain consistent in lengthy conversations, so convincingly that their output can be misconstrued as human intelligence. It’s eerie how they maintain context in long conversations and switch context smoothly.

A burning issue is whether LLMs can do human-like logical reasoning. In a research paper on transformers, the deep learning networks behind LLMs, scientists at the University of California, Los Angeles argue that these models do not learn reasoning functions. Instead, applying statistical methods, they learn statistical features of the reasoning process.

What did the researchers mean by “statistical methods”? As the transformer crunched through billions of pages of text, it created a log of the relationships between different phrases. If you asked, “Was the conclusion of Flaubert’s ‘Madame Bovary’ contrived?” you might get an answer like, “I’m not familiar with that.” But if you asked, “How are you today?” you may get the response, “I’m fine, thank you, how are you?” not because it examined its internal state and applied reasoning leading to a logical answer. It simply knew that, in the context of the conversation, that was most likely the correct response.
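The idea of returning the "most likely" response can be illustrated with a deliberately tiny sketch. This is not how a transformer is implemented (real models predict tokens using learned weights, not a lookup table); the conversation pairs, function names, and fallback text below are all hypothetical and exist only to show a reply being chosen by frequency rather than by reasoning.

```python
from collections import Counter, defaultdict

# Hypothetical toy "training data": (prompt, reply) pairs seen in text.
PAIRS = [
    ("how are you today", "i'm fine, thank you, how are you"),
    ("how are you today", "i'm fine, thank you, how are you"),
    ("how are you today", "not bad"),
    ("what time is it", "i don't know"),
]


def normalize(text):
    """Lowercase and drop surrounding whitespace and end punctuation."""
    return text.lower().strip(" ?!.")


def build_reply_counts(pairs):
    """Count how often each reply followed each prompt."""
    counts = defaultdict(Counter)
    for prompt, reply in pairs:
        counts[normalize(prompt)][reply] += 1
    return counts


def most_likely_reply(counts, prompt, fallback="i'm not familiar with that"):
    """Return the most frequent reply for a prompt, or a fallback if unseen."""
    replies = counts.get(normalize(prompt))
    if not replies:
        return fallback
    return replies.most_common(1)[0][0]


counts = build_reply_counts(PAIRS)
greeting_reply = most_likely_reply(counts, "How are you today?")
unseen_reply = most_likely_reply(counts, "Was the conclusion of Madame Bovary contrived?")
```

Nothing in this sketch inspects an internal state or applies logic: the greeting gets a fluent answer because that answer was the most common one in the data, and the unseen question falls through to the fallback.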

The researchers used BERT, one of the most popular transformers in use. Their findings show that BERT responds quite well to reasoning problems drawn from the distribution it was trained on. However, it performs poorly on examples from other distributions and fails to generalize to them. This exposes shortcomings in deep neural networks, and developing benchmarks for them is a challenge.

Measuring logical reasoning in AI is challenging. GLUE, SuperGLUE, SNLI, and SQuAD are benchmark tests for AI, specifically for NLP models. Transformers have been backed by the largest AI companies, and as a result, their defects and deficiencies are addressed promptly and they demonstrate incremental improvement. Until now, progress has been driven by ever more massive scale as a way to achieve more accuracy. This scale does not come without concerns about resources. GPT-3 has 175 billion machine learning parameters. It was trained on NVIDIA V100 GPUs, but researchers have calculated that training the model on A100s would have taken 1,024 GPUs, 34 days, and $4.6 million. While energy usage has not been disclosed, it's estimated that GPT-3 consumed 936 MWh.
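To put those numbers in perspective, the following back-of-the-envelope sketch converts them into GPU-hours and an implied cost per GPU-hour. It takes the figures quoted above at face value and assumes, purely for illustration, that all 1,024 GPUs run around the clock for the full 34 days; the variable names are made up for this example, and the result is a rough ratio, not a published price.

```python
# Figures as quoted in the article; treated here as given, not verified.
GPUS = 1024
DAYS = 34
COST_USD = 4_600_000

# Assumption: every GPU runs 24 hours a day for the whole period.
gpu_hours = GPUS * DAYS * 24

# Implied average cost of one GPU for one hour under that assumption.
cost_per_gpu_hour = COST_USD / gpu_hours

print(f"GPU-hours: {gpu_hours:,}")
print(f"Implied cost per GPU-hour: ${cost_per_gpu_hour:.2f}")
```

Under these assumptions the run amounts to roughly 836,000 GPU-hours, or about $5.50 per GPU-hour.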

Are LLMs improving because they have acquired logical reasoning capabilities? Or is it simply that they have been trained on enormous volumes of text?

The UCLA researchers developed SimpleLogic, a class of logical reasoning problems based on propositional logic. A problem includes Rules, a Query (the problem that the ML model must respond to), and Facts, which are predicates known to be true. The answer to the query, “true” or “false,” is the Label.
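A problem of this shape can be solved exactly with a few lines of classical logic programming, which is the "correct reasoning function" a model would ideally emulate. The sketch below is a minimal illustration, not the researchers' code: it assumes each rule has the form "if all premises hold, then the conclusion holds," and the predicate names and the `entails` helper are hypothetical.

```python
def entails(facts, rules, query):
    """Forward chaining: apply rules until nothing new can be derived.

    facts: iterable of predicates known to be true
    rules: list of (premises, conclusion) pairs, read as
           "if every premise is true, the conclusion is true"
    query: the predicate whose truth value is the Label
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return query in known


# Hypothetical example problem.
facts = {"rainy", "cold"}
rules = [
    (("rainy",), "wet"),        # rainy -> wet
    (("wet", "cold"), "icy"),   # wet and cold -> icy
    (("sunny",), "warm"),       # sunny -> warm
]

label_icy = entails(facts, rules, "icy")
label_warm = entails(facts, rules, "warm")
```

Here "icy" is labeled true because it follows from the facts in two rule applications, while "warm" is labeled false because its premise is never established. The procedure gives the same answer however the problems are distributed, which is exactly the property a purely statistical learner lacks.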

The researchers concluded:

Upon further investigation, we provide an explanation for this paradox: the model attaining high accuracy only on in-distribution test examples has not learned to reason. The model has learned to use statistical features in logical reasoning problems to make predictions rather than to emulate the correct reasoning function.

Neural networks are very good at finding and fitting statistical features. In some applications, this can be very useful. For example, in sentiment analysis, there is a strong correlation between certain words and classes of sentiments. This finding highlights a critical challenge in using deep learning for language tasks:

Caution should be taken when we seek to train neural models end-to-end to solve NLP tasks that involve both logical reasoning and prior knowledge [emphasis mine] and are presented with language variance.

Reasoning in deep learning

Unfortunately, the logical reasoning problem does not disappear as language models become larger. It just becomes hidden in their huge architectures and training data. LLMs can spit out facts and nicely stitched-together sentences. However, when it comes to logical reasoning, they are still using statistical features to make inferences, which is not a solid foundation.

And there is no sign that the logical reasoning gap will be bridged by adding layers, parameters, and attention heads to transformers.

As the UCLA researchers conclude:

On the one hand, when a model is trained to learn a task from data, it always tends to learn statistical patterns, which inherently exist in reasoning examples; on the other hand, the rules of logic never rely on statistical patterns to conduct reasoning. Since it is challenging to construct a logical reasoning dataset with no statistical features, learning to reason from data is difficult.
