Deep Learning and AGI: Generalization Issue - Quick Introduction
- Kaan Bıçakcı
- Aug 14, 2024
- 9 min read
Updated: Aug 17, 2024
Deep Learning... one of the most popular terms of the last decade. With the rise of Generative AI, there is a lot of hype around it. In this post, I'll cover what I see as the fatal flaw of deep learning: generalization.
As I dive deeper into Deep Learning topics, I find it easy to become immersed in Neuroscience and Psychology concepts. I'll cover those in separate blog posts.
This blog is not intended to cover all the details; it's just a quick introduction.
How I'd Define Generalization - High Level
Generalization is the ability to efficiently apply knowledge across diverse contexts, bridging familiar, novel, and unseen situations through adaptive cognitive processes. It should mirror the brain's capacity for flexible problem-solving in uncertain environments and be achieved through data-efficient learning, in contrast to the data-hungry nature of current deep learning based approaches.
NOTE: I am not saying Deep Learning is useless. It's useful for what it is designed to do, but not for creating systems that are truly intelligent or capable of generalizing.
Topics:
Interpolation and Extrapolation
Why do I think this definition is an oversimplification?
How is this related to machine learning (more specifically, deep learning)?
Starting with a basic example - approximating f(x) = x^2:
More Complex Situation - Generative Models
As a basic example, let's say you trained a GAN model to generate MNIST digits
Curve Fitting and Manifold Hypothesis
Getting back to GANs
Interpolation & Extrapolation
The question is: can deep learning models perform extrapolation?
To explain these two terms, let's first look at the definition from the paper Learning in High Dimension Always Amounts to Extrapolation [1]:
Definition 1. Interpolation occurs for a sample x whenever this sample belongs to the convex hull of a set of samples X ≜ {x_1, . . . , x_N }, if not, extrapolation occurs.
Let's visualize this in 2D first, then we'll move to more complex scenarios:
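If you prefer code to figures, here is a minimal sketch of that definition (assuming NumPy and SciPy; the 2D Gaussian "training set" below is just for illustration):

```python
# Sketch of the paper's definition: a sample "interpolates" only if it lies
# inside the convex hull of the training set; otherwise it extrapolates.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
train_points = rng.normal(size=(200, 2))   # illustrative 2D training set
hull = Delaunay(train_points)              # triangulation of the hull region

def is_interpolation(x):
    """True if x lies inside the convex hull of the training points."""
    return hull.find_simplex(np.atleast_2d(x))[0] >= 0   # -1 means outside

print(is_interpolation([0.0, 0.0]))    # near the center -> interpolation
print(is_interpolation([10.0, 10.0]))  # far outside     -> extrapolation
```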
In high-dimensional spaces, the distinction between interpolation and extrapolation becomes much less clear-cut: under this definition, new samples almost always fall outside the convex hull of the training set, simply because points in high dimensions tend to be far apart from each other.
Why do I think this definition is an oversimplification?
Deep learning models used nowadays don't just work in the raw input space. They transform data through multiple layers, learning new representations along the way. These transformations can radically alter the geometry of the data, making the notion of "inside" or "outside" a fixed convex hull less relevant.
The authors also make the following statement in the conclusion of their paper [1]:
In short, the behavior of a model within a training set’s convex hull barely impacts that model’s generalization performance since new samples lie almost surely outside of that convex hull. This observation holds whether we are considering the original data space, or embeddings. We believe that those observations open the door to constructing better suited geometrical definitions of interpolation and extrapolation that align with generalization performances, especially in the context of high-dimensional data.
I think we need to consider interpolation and extrapolation in different ways. Deep learning leverages the manifold hypothesis (explained in later sections), which suggests that high-dimensional data frequently resides on or near lower-dimensional manifolds. This perspective shifts our understanding of how these models generalize. Instead of thinking about interpolation within a high-dimensional convex hull, we might consider how models learn to navigate along these data manifolds.
Definition of extrapolation from Wikipedia:
In mathematics, extrapolation is a type of estimation, beyond the original observation range, of the value of a variable on the basis of its relationship with another variable. It is similar to interpolation, which produces estimates between known observations, but extrapolation is subject to greater uncertainty and a higher risk of producing meaningless results. Extrapolation may also mean extension of a method, assuming similar methods will be applicable. Extrapolation may also apply to human experience to project, extend, or expand known experience into an area not known or previously experienced so as to arrive at a (usually conjectural) knowledge of the unknown (e.g. a driver extrapolates road conditions beyond his sight while driving). The extrapolation method can be applied in the interior reconstruction problem.
How is this related to machine learning (more specifically, deep learning)?
Extrapolation is predicting data points that are outside the range of the training data. If the data point is within this range, the model is interpolating; otherwise, the model is extrapolating.
In other words, extrapolation occurs when the learned function (in this case, approximated by a DL model) accurately approximates the data generating process across the entire domain, not just between the observed data points.
I believe those definitions can be extended, which I'm planning to cover in my next blog posts.
Starting with a basic example - approximating f(x) = x^2:
Neural networks can't learn ("learning" may be a misleading word here) x^2 over the whole domain (this is a very trivial example):
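Here is a minimal sketch of that experiment (PyTorch assumed; the architecture and training range are arbitrary choices for illustration):

```python
# Fit a small MLP to f(x) = x^2 on [-3, 3], then evaluate it far outside
# that range to see how the approximation breaks down.
import torch
import torch.nn as nn

torch.manual_seed(0)

x_train = torch.linspace(-3, 3, 256).unsqueeze(1)   # training data only covers [-3, 3]
y_train = x_train ** 2

model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(3000):
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

# Inside [-3, 3] the fit is close; outside it, a ReLU network just keeps
# extending its last linear piece instead of following the parabola.
with torch.no_grad():
    for x in [0.0, 2.0, 5.0, 10.0]:
        pred = model(torch.tensor([[x]])).item()
        print(f"x={x:5.1f}  true={x**2:7.1f}  predicted={pred:8.1f}")
```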
So you can see it does a great job within the range it was trained on, but the error grows as you move outside that region and tends toward infinity. This means the model was able to interpolate, but not extrapolate.
More Complex Situation - Generative Models
For a more complex example, let's start with GANs. They can generate images (sticking to the image case for simplicity; they can generate audio, etc. too!) that are not in their training data. In that case, are those models extrapolating? Not quite. Let me explain.
As a basic example, let's say you trained a GAN model to generate MNIST digits.
Recall the definition of extrapolation from above:
In other words, extrapolation occurs when the learned function accurately approximates the data generating process across the entire domain, not just between the observed data points.
GANs learn to model the distribution of the training data. When generating new samples, they typically sample from this learned distribution. The generated samples are usually combinations or variations of features seen in the training data, rather than entirely novel features or concepts.
Curve Fitting and Manifold Hypothesis
It makes sense, because at the end of the day it's deep learning, which is essentially curve fitting. Wait, what is the curve here? What does interpolation have to do with all of this?
Now, the term manifold kicks in. A manifold is a mathematical concept used to describe spaces in arbitrary dimensions. More precisely, it's a type of space that, when viewed locally, resembles familiar Euclidean space, but on a larger scale might have a more complex structure.
Imagine an ant walking on the surface of a basketball. From the ant's tiny perspective, the surface seems flat – it can move freely in any direction without feeling like it's going up or down. This is because locally, at the ant's small scale, the surface of the basketball resembles a flat plane. However, we know that globally, the basketball is actually spherical.
This is the essence of a manifold:
Locally, it looks like a piece of flat (Euclidean) space (not always literally flat, but for intuition you can think of it that way).
Globally, it may have a more intricate or curved structure.
The transitions between local 'flat' regions are smooth.
In the context of deep learning, the data we work with often forms a manifold in a high-dimensional space. The 'curve fitting' I've mentioned is really about learning the structure of this manifold.
When we interpolate between data points, we're essentially moving along this manifold, navigating its complex global structure by leveraging its simpler local properties.
Getting back to GANs
Wait, we were talking about GANs and then jumped into the manifold hypothesis. In the context of GANs, the generator learns to map from its input space (usually random noise) to points on or near this manifold. The GAN's ability to generate is limited to this learned manifold. It can create new combinations of features it has seen (interpolation), but it struggles to generate new features or extend beyond the boundaries of its training data (extrapolation).
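As a quick illustration of "moving along the manifold", here is a hedged sketch of latent-space interpolation (PyTorch; the untrained stand-in generator below exists only so the snippet runs, and a real GAN generator trained on MNIST would take its place):

```python
# Interpolating between two latent vectors and decoding every step produces a
# smooth sequence of outputs: new combinations of familiar features
# (interpolation on the learned manifold), not genuinely novel ones.
import torch
import torch.nn as nn

latent_dim = 100   # assumption: matches whatever the GAN was trained with

# Untrained stand-in so the sketch is runnable end to end.
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
)

z_a = torch.randn(1, latent_dim)   # one point in latent space
z_b = torch.randn(1, latent_dim)   # another point

with torch.no_grad():
    frames = [generator((1 - a) * z_a + a * z_b)
              for a in torch.linspace(0.0, 1.0, steps=8)]

print(len(frames), frames[0].shape)   # 8 interpolated "images", each (1, 784)
```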
In more advanced architectures, like StyleGANs, the style parameters can be seen as coordinates on this manifold. When we mix or interpolate styles, we're moving along the manifold in ways that combine different aspects of the training data.
A more intuitive example would be a GAN which generates realistic images. For this GAN to truly extrapolate and generate any possible realistic image - not just variations of what it has seen - it would need to understand the fundamental principles of physics, optics, and more.
What about LLMs?
Let's consider Large Language Models (LLMs) through the lens of manifolds and generalization. Similar to how GANs learn to navigate a manifold of visual data, LLMs operate on a vast, complex manifold of language and concepts.
When an LLM generates text, it's essentially traversing this high-dimensional space, combining and recombining elements it has learned during training.

Of course, a real manifold is more complex than the figure above, but the intuition stays the same. Word embeddings form local neighborhoods in the manifold, creating a multidimensional semantic space. If you could visualize the real manifold, you would see a landscape where regions of similarity in the embedding space form clusters (or hills, etc.) that represent local properties of the embeddings.
That's how LLMs perform interpolation within their learned manifold, which remains static after training. During inference, they navigate this fixed landscape without updating their parameters or accessing new external information. This differs from humans, who can continually learn and incorporate new knowledge from the world around them.
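To make the "clusters in embedding space" idea concrete, here is a toy sketch with invented vectors (real embeddings have hundreds of dimensions, but the geometry is the same):

```python
# Semantically related words end up close together in embedding space,
# forming the local "hills" mentioned above. The vectors here are made up.
import numpy as np

embeddings = {
    "cat":   np.array([0.90, 0.80, 0.10]),
    "dog":   np.array([0.85, 0.75, 0.20]),
    "car":   np.array([0.10, 0.20, 0.95]),
    "truck": np.array([0.15, 0.10, 0.90]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["cat"], embeddings["dog"]))    # high: same cluster
print(cosine(embeddings["cat"], embeddings["truck"]))  # low: different cluster
print(cosine(embeddings["car"], embeddings["truck"]))  # high: same cluster
```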
Current Problems of LLMs (as they are seen as a path to AGI)
Check out this article from 2016 (https://www.graphcore.ai/posts/is-moravecs-paradox-still-relevant-for-ai-today): those problems still remain today.
This is where we encounter a fascinating paradox (mentioned in the old article), akin to Moravec's paradox in robotics. LLMs excel at tasks that seem complex to us (humans) like generating coherent paragraphs on abstract topics or even producing syntactically correct code. Yet they struggle with tasks that we find trivially easy, such as understanding basic cause and effect or exhibiting common sense reasoning.
Current models, like GPT-4 or any other LLM at that level, struggle with abstract reasoning and cannot generalize beyond training data.
They fail at abstract reasoning.
In a nutshell, abstract reasoning is the ability to spot patterns or rules from limited information and then apply these patterns to new situations. Children are good at learning general rules from just a few examples and using them in new ways. For instance, a child might learn the concept of "bigger than" by comparing a few objects, then apply this idea to things they've never seen before.
The paper Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks explains this topic pretty well; it has some interesting outcomes too!
Overhyped claims about AI capabilities lead to disappointment.
I don't know if that needs an example; non-technical people tend to throw LLMs at every problem.
Current LLMs are not designed to be correct, often producing false statements (hallucinations).
So LLMs are just extremely advanced autocomplete systems. They're great at producing text that sounds plausible and flows well, but they're not built to always state facts accurately. They don't understand or verify the truth of what they're saying.
If you recall the manifold figure from above, generating false information makes no difference from the model's point of view. It simply ends up in some region of its manifold and outputs that token. Maybe identifying those regions on the manifold could reduce the amount of false information the model outputs. This is actually part of what you're trying to achieve when you fine-tune a model for a specific domain!
The rise of LLM-generated content
As more and more LLM-created material floods the internet, it might harm the quality and reliability of the information we all share. Thinking further ahead: people gather data from the internet to train their models, and if that data is not accurate, it will yield worse models over time.
True Intelligence, in my opinion
True intelligence is the ability to operate effectively in new, unforeseen situations.
LLMs, despite their vast knowledge base, fall short in this regard. They're incredibly proficient at interpolating within the space of their training data, but they struggle to extrapolate beyond it in meaningful ways. In other words, they are not intelligent.
This limitation becomes clear when we ask LLMs to tackle truly novel problems or to exhibit the kind of flexible, adaptive thinking that humans excel at. The models are, in essence, navigating a static distribution, no matter how vast.
The major drawback is that these models rely on function optimization that is static after training. If you try to add more data and train on that new data, the models tend to forget what they learned previously (catastrophic forgetting).
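Here is a minimal sketch of that effect (PyTorch; a toy regression stands in for real tasks): train on task A, then keep training on task B alone, and the task A error climbs right back up.

```python
# Catastrophic forgetting in miniature: a single network trained sequentially
# on two tasks loses most of its performance on the first one.
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.linspace(-3, 3, 256).unsqueeze(1)
y_a, y_b = torch.sin(x), torch.cos(x)          # task A and task B targets

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train(y_target, steps=2000):
    for _ in range(steps):
        opt.zero_grad()
        mse(model(x), y_target).backward()
        opt.step()

train(y_a)
print("task A error after phase 1:", mse(model(x), y_a).item())   # low

train(y_b)   # no task A data in this phase
print("task A error after phase 2:", mse(model(x), y_a).item())   # much higher
print("task B error after phase 2:", mse(model(x), y_b).item())   # low
```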
Last Thoughts
Advanced AI isn't just about scaling up these models or feeding them more data. It's about developing systems that can generalize, creating truly adaptive models that can navigate the dynamic, ever-changing landscape of real-world problems.
I believe this is the challenge that lies at the heart of AI research, pushing us to rethink our approaches, rather than using Deep Learning based models.
References:
[1]: Balestriero, R., Pesenti, J., & LeCun, Y. (2021). Learning in High Dimension Always Amounts to Extrapolation. arXiv [Cs.LG]. Retrieved from http://arxiv.org/abs/2110.09485






