11.27.2024 | By: Ashu Garg
For the past four years, the AI community has operated on a principle so powerful that it’s become almost an article of faith: intelligence emerges from scale. Make neural networks bigger, feed them more data, give them more computing power, and they become smarter. This idea is the engine behind ChatGPT (now on the eve of its two-year birthday) and the foundation of our current AI revolution. It’s driven billions in investment and reshaped the field of AI.
Yet, in recent months, this faith in scaling has begun to collide with an inconvenient reality: signs are emerging that brute-force scaling alone may not be enough to drive continued improvements in AI.
This realization comes at a crucial moment, as tech giants make unprecedented bets on the scaling hypothesis. OpenAI is courting trillions of dollars to boost global chip production. Google, Meta, Microsoft, and Amazon are dramatically expanding their AI computing capacity and infrastructure spend. These investments—collectively expected to exceed $1T by 2027—rest on the assumption that scaling transformer-based models will continue to deliver steady improvements.
The intellectual foundations of AI’s scaling laws trace back to 2019, when Richard Sutton, a Canadian computer scientist, published “The Bitter Lesson.” Sutton argued that, throughout AI’s 70-year history, approaches that leveraged raw computational power had consistently outperformed clever attempts to encode human knowledge and expertise. The implication was that we didn’t need to understand intelligence to recreate it. We just needed bigger computers.
A year later, OpenAI researchers published a paper that empirically confirmed Sutton’s hypothesis. They demonstrated that transformer-based models’ capabilities improved in predictable ways as model size, dataset size, and training compute increased. When all three factors were scaled in concert, model performance followed a smooth power-law curve.
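In schematic form (my simplified restatement of the paper’s fitted relationships, not a quote from it), the finding says that test loss falls as a power law in whichever factor is the bottleneck:

$$L(X) \approx \left(\frac{X_c}{X}\right)^{\alpha_X}, \qquad X \in \{N, D, C\}$$

where $N$ is parameter count, $D$ is dataset size in tokens, $C$ is training compute, and $X_c$ and $\alpha_X$ are constants fitted from experiments. Double the relevant resource and the loss drops by a predictable factor.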
Frontier model releases further validated this finding. When OpenAI released GPT-3, it demonstrated capabilities that seemed magical: writing poetry, coding software, and engaging in philosophical discussions. Each subsequent model—from GPT-4 to Claude 3.5 to Gemini—seemed to confirm the pattern. Intelligence, it appeared, wasn’t some ineffable mystery but an engineering problem that could be solved with enough resources.
Few have championed this view more forcefully than Sam Altman. In his recent essay “The Intelligence Age,” he distilled years of progress into fifteen words: “deep learning worked, got predictably better with scale, and we dedicated increasing resources to it.” He reiterated the point in conversation with Garry Tan earlier this month. “This is the first time ever where I felt like we actually know what to do. From here to building an AGI will still take a huge amount of work. There are some known unknowns but I think we basically know what to go do.”
Altman’s message has been consistent and clear: superintelligent AI isn’t just possible but inevitable, potentially arriving within the next “few thousand days.” OpenAI has to date raised $22 billion on the back of this conviction.
Beneath this confident exterior, a more complex reality is emerging.
OpenAI’s experience with its next-generation Orion model provides one data point. At 20% of its training process, Orion was matching GPT-4’s performance—in line with what scaling laws would predict. But as training continued, the model’s gains proved far smaller than the dramatic leap seen between GPT-3 and GPT-4. In some areas, particularly coding, Orion showed no consistent improvement, despite consuming significantly more resources than its predecessors.
This pattern of diminishing returns isn’t isolated. The latest version of Google’s Gemini is reportedly falling short of internal expectations. Anthropic has delayed its next-generation Claude model. Even accounting for the fact that our existing benchmarks are increasingly saturated and less informative, what was once an exponential now looks more like an S-curve, where each additional input of data, compute, and model size yields increasingly modest gains.
Perhaps most telling is the recent statement from OpenAI’s former chief scientist Ilya Sutskever to Reuters: “The 2010s were the age of scaling. Now we’re back in the age of wonder and discovery once again. Everyone is looking for the next thing. Scaling the right thing matters more now than ever.” For Sutskever, one of scaling’s earliest and most vocal proponents, this suggests a fundamental rethinking of AI’s path forward.
The challenges to scaling broadly fall into three interrelated categories: data, compute, and the limitations of next-token prediction. Each represents a barrier to further progress through brute-force scaling alone.
According to the 2022 Chinchilla paper, compute and data need to scale proportionally to achieve optimal model performance. While the indexed web contains about 500T tokens of unique text (30x more data than the largest known training dataset), the high-quality, human-created content that AI models need for training has largely been consumed. Excluding private, proprietary sources, what remains is increasingly repetitive, low-quality, or unsuitable for training.
By some estimates, to reach the reliability and intelligence needed for an AI to write a scientific paper—a basic requirement for a system that could advance AI research autonomously—we’d need to train the model on around 1e35 FLOPs, which would require 100,000x more high-quality data than currently exists. The existing corpus of human scientific writing wouldn’t be enough.
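To see how quickly the numbers run away, here’s a back-of-envelope sketch in Python. It leans on two commonly cited approximations (training compute C ≈ 6·N·D FLOPs, and a Chinchilla-style ratio of roughly 20 training tokens per parameter), so treat it as an illustration of scale rather than a forecast:

```python
# Back-of-envelope, Chinchilla-style estimate.
# Assumptions: training FLOPs C ~ 6 * N * D, and compute-optimal training
# uses roughly 20 tokens per parameter (D ~ 20 * N). Both are rules of thumb.

import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (parameters N, tokens D) for a compute-optimal run at C FLOPs."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N^2
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    C = 1e35  # the estimate quoted above for a paper-writing AI, not my own figure
    n_params, n_tokens = chinchilla_optimal(C)
    print(f"params ~ {n_params:.1e}, tokens ~ {n_tokens:.1e}")
    # params ~ 2.9e+16, tokens ~ 5.8e+17
```

Even before asking whether the data is any good, a compute-optimal run at that scale would want hundreds of quadrillions of tokens, against roughly 500T tokens of unique text on the entire indexed web.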
The field could make breakthroughs in data efficiency. Some researchers also propose synthetic data as a solution—using existing AI models like GPT-4 to generate training material for their successors. But this threatens to create a “hall of mirrors” problem, where models trained on synthetic data inherit and amplify their predecessors’ limitations. Unlike games like chess or Go, where success is clearly defined, evaluating the quality of AI-generated training data becomes circular: you need intelligence to evaluate intelligence. According to an OpenAI employee, Orion’s stalled progress stems in part from the model having been trained on outputs from o1.
The second barrier moves from the world of bits to that of atoms. Training SOTA models consumes as much electricity as small cities. AI is already hitting the limits of existing power sources, with tech companies pitching clean-energy providers and Microsoft turning to nuclear. The next generation of models could require the energy budget of entire nations. When OpenAI researcher Noam Brown asks, “Are we really going to train models that cost hundreds of billions or trillions of dollars?”, he’s not just asking about money—he’s also asking about physics.
The computational demands of scaling follow their own exponential curve. Some estimates suggest we’d need nine orders of magnitude more compute than our largest models today to approach human-level reasoning capabilities. At some point, the energy requirements and heat generated by computation become their own limiting factors.
Perhaps the most interesting limit is architectural. Many real-world tasks involve what Meta’s Yann LeCun calls the “long tail” problem: an effectively infinite supply of edge cases that no amount of training data can fully capture. Current AI architectures excel at interpolation but struggle with extrapolation: making predictions and reasoning about situations that fall outside their training distribution.
This limitation is baked into the transformer architecture itself. Next-token prediction, clever as it is, appears to create systems that react rather than truly “understand.” According to researchers like LeCun, no amount of scaling can bridge this architectural gap, just as no amount of data would teach a spreadsheet to comprehend what its numbers mean.
As computer scientist Pedro Domingos frames it, engineering problems involve optimizing what we already know works—making transformers bigger, training more efficiently, finding cleaner data. But we’re now hitting the limits of that approach: in his words, we’re sprinting toward a local maximum. To get beyond it, we face scientific problems that demand new ideas about how to create intelligence.
One such idea comes from OpenAI’s recent work on test-time compute. Instead of trying to instill all knowledge and capabilities into the model during training, the startup’s o1 model focuses on reasoning during inference. According to Noam Brown, the research lead on the project, “20 seconds of thinking time” achieved what would have required a “100,000x increase in model scale” under current methods. Recent research from MIT and the success of China’s DeepSeek model appear to further validate this approach.
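OpenAI hasn’t published how o1 works, but the simplest way to see the idea of trading training-time scale for inference-time compute is repeated sampling with a vote over the answers (often called self-consistency). The sketch below is purely illustrative; `solve_once` is a hypothetical stand-in for a stochastic model call, not anything resembling OpenAI’s actual method:

```python
# Illustrative only: spend more compute at inference time by sampling many
# candidate answers and returning the most common one (self-consistency voting).
# `solve_once` is a hypothetical stand-in for a stochastic model call.

import random
from collections import Counter

def solve_once(question: str) -> str:
    """Toy noisy solver: right about 60% of the time, wrong otherwise."""
    return "42" if random.random() < 0.6 else random.choice(["41", "43", "44"])

def solve_with_test_time_compute(question: str, samples: int = 20) -> str:
    """Use ~20x the inference compute: sample repeatedly, then majority-vote."""
    answers = [solve_once(question) for _ in range(samples)]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    q = "What is 6 * 7?"
    print("one sample :", solve_once(q))                    # often wrong
    print("20 samples :", solve_with_test_time_compute(q))  # almost always "42"
```

o1 reportedly goes further, reasoning through a long internal chain of thought before answering, but the trade is the same: more compute per query rather than more parameters in the model.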
While test-time compute evolves existing techniques, researchers are also pursuing new architectures to address the limitations of transformers. Alternatives with traction include state space models (SSMs) and RWKV. SSMs excel at handling long-term dependencies and continuous data, while RWKV uses a linear attention mechanism that’s markedly more compute-efficient than transformers, whose costs scale quadratically with input length.
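To make that cost difference concrete, here’s a minimal numpy sketch that contrasts standard attention, which materializes a T×T score matrix, with a kernelized linear-attention recurrence in the spirit of the “Transformers are RNNs” line of work. It illustrates the general linear-attention idea rather than RWKV’s specific formulation:

```python
# Minimal sketch: quadratic self-attention vs. a linear-attention recurrence.
# Shows why cost scales as O(T^2) vs. O(T) in sequence length T.
# Generic kernelized linear attention, not RWKV's exact design.

import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention (non-causal, for brevity): builds a T x T matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (T, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # (T, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Causal linear attention: a running d x d state, O(T) time overall."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[-1]))                    # running sum of phi(k) v^T
    z = np.zeros(d)                                   # running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(T):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + 1e-9)
    return out

if __name__ == "__main__":
    T, d = 512, 64
    Q, K, V = (np.random.randn(T, d) for _ in range(3))
    print(softmax_attention(Q, K, V).shape)   # (512, 64), via a 512 x 512 matrix
    print(linear_attention(Q, K, V).shape)    # (512, 64), via a 64 x 64 state
```

Doubling the context length doubles the work for the recurrent form but quadruples it for the quadratic one, which is exactly the pressure pushing researchers toward these alternatives.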
But perhaps the most radical proposals come from Domingos, along with Meta’s Yann LeCun and others like Fei-Fei Li, who fall on the grounding side of the field’s long-running debate over embodiment. They argue that we need to move beyond text-based models entirely and advocate for “world models”: systems designed to understand causality and physical interaction, rather than just recognize patterns in text.
In my opinion, the story of AI progress isn’t ending—it’s becoming more plural, in a way that’s likely to be better for the field in the long term.
As researcher and ARC prize co-founder François Chollet provocatively argues, the intense focus on scaling LLMs may have actually “set back progress towards AGI by quite a few years, probably like 5-10 years.” The issue isn’t just that frontier research has shifted behind closed doors, bucking the spirit of open sharing behind breakthroughs like the transformer. The bigger problem is that LLMs’ spectacular success has created an intellectual monoculture in AI research.
“LLMs have sucked the oxygen out of the room,” Chollet continues. “Everyone is just doing LLMs. I see LLMs as more of an off-ramp on the path to AGI actually. If you look further into the past to like 2015 or 2016, there were like a thousand times fewer people doing AI back then. Yet the rate of progress was higher because people were exploring more directions. The world felt more open-ended. You could just go and try. You could have a cool idea, launch it, and get some interesting results. There was this energy. Now everyone is very much doing some variation of the same thing.”
Today’s LLMs may not be the direct path to superhuman AI, but they remain incredibly powerful tools whose potential is far from tapped. They’ve reached what you might call “minimum viable intelligence”—smart enough to drive fundamental improvements across industries and architect a new generation of AI-native products that increasingly eat into global services’ spend.
But when everyone is climbing the same hill, we risk getting stuck at a local maximum: a peak that looks impressive until you realize there are higher mountains hidden behind it. The next breakthrough in AI might not come from making our current models bigger, but from making them fundamentally different. Just as Sutskever suggests, what the field might need most is a renewed spirit of “wonder and discovery.”
Two years ago, ChatGPT emerged seemingly from nowhere, transforming our understanding of what AI can do. The next step change may be equally unexpected, arising not from raw computing power, but from a deeper understanding of the intelligence we’re trying to create.