
What’s Next After Transformers


07.23.2024 | By: Joanne Chen

I talk with Recursal AI founder Eugene Cheah about RWKV, a new architecture for AI.


This essay is a part of my series, “AI in the Real World,” where I talk with leading AI researchers about their groundbreaking work and how it’s being applied in real businesses today. You can check out previous conversations in the series here.

I recently spoke with Eugene Cheah, a builder who’s working to democratize AI by tackling some of the core constraints of transformers. The backbone of powerhouse models like GPT and Claude, transformers have fueled the generative AI boom. But they’re not without drawbacks.

Enter RWKV (Receptance Weighted Key Value), an open-source architecture that Eugene and his team at Recursal AI are commercializing for enterprise use. Their goal is ambitious but clear: make AI more cost-effective, scalable, and universally accessible, regardless of a user’s native language and access to compute.

Eugene’s journey from nurturing RWKV’s open-source ecosystem to founding Recursal AI reflects the potential he sees in this technology. In our conversation, he explains the technical challenges facing transformers and details how RWKV aims to overcome them. I left with a compelling picture of what a more democratic future for AI might look like – and what it would take to get there.

Here are my notes.

Is attention really all you need?

Introduced in the 2017 paper “Attention is All You Need” by a group of Google Brain researchers, transformers are a form of deep learning architecture designed for natural language processing (NLP). One of their key innovations is self-attention: a mechanism that captures relationships between words regardless of their position in a sequence. This breakthrough has led to numerous advanced models, including BERT, GPT, and Claude.

Yet, despite their power, transformers face significant hurdles in compute cost and scalability. For each token (roughly equivalent to a short word or part of a longer word) it processes, a transformer attends over every token that came before, repeating much of the same computation. This leads to quadratic scaling costs as the context length increases: doubling the input length roughly quadruples the compute required.
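To make that quadratic cost concrete, here is a toy sketch of single-head self-attention in plain numpy. It is purely illustrative (no projections, no batching, nothing like a production implementation), but it shows why the score matrix, and with it the compute, grows with the square of the sequence length.

```python
import numpy as np

def naive_self_attention(x):
    """Toy scaled dot-product self-attention (single head, no learned projections).

    The score matrix is n x n, so compute and memory grow quadratically
    with sequence length n. Illustrative only, not a production implementation.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                       # (n, n) pairwise token comparisons
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ x                                  # (n, d) attended output

# Doubling the sequence length quadruples the number of pairwise scores:
for n in (1_024, 2_048, 4_096):
    print(n, "tokens ->", n * n, "pairwise scores")
```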

This inefficiency translates into enormous demands on compute. While exact figures are hard to come by, OpenAI reportedly uses over 300 Azure data centers just to serve 10% of the English-speaking market. Running transformers in production can cost hundreds of thousands or even millions of dollars per month, depending on their scale and usage.

Despite these steep scaling costs, transformers maintain their dominant position in the AI ecosystem. Stakeholders across all levels of the AI stack have invested substantial resources to build the infrastructure necessary to run these models in production. This investment has created a form of technological lock-in, resulting in strong resistance to change.

As my colleague Jaya explained: “The inertia around transformer architectures is real. Unless a new company bets big, we’ll likely see incremental improvements rather than architectural revolutions. This is partly due to the massive investment in optimizing transformers at every level, from chip design to software frameworks. Breaking this inertia would require not just a superior architecture, but a willingness to rebuild the entire AI infrastructure stack.” 

Faced with such a herculean lift, most stakeholders opt for the familiar. Of course, this status quo is not set in stone, and Eugene and the RWKV community certainly aren’t treating it that way.

RWKV: a potential alternative?

Instead of the all-to-all comparisons of transformers, RWKV uses a linear attention mechanism that’s applied sequentially. By maintaining a fixed state between tokens, RWKV achieves more efficient processing with linear compute costs. Eugene claims that this efficiency makes RWKV 10 to 100 times cheaper to run than transformers, especially for longer sequences.

RWKV’s benefits extend beyond compute efficiency. Its recurrent architecture means it only needs to maintain and update a single fixed-size hidden state as it processes each token. Compare this to transformers, which must juggle attention scores and intermediate representations for every possible token pair. The memory savings here could be substantial.
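For intuition, here is a minimal sketch of the kind of per-channel recurrence that underlies this design. It is a schematic illustration of linear attention with a fixed-size running state, not the actual RWKV formulation, which adds receptance, token-shift, and learned per-channel decay, among other details.

```python
import numpy as np

def linear_attention_step(state_num, state_den, k_t, v_t, decay):
    """One recurrent step of a simplified linear-attention cell.

    The running state has a fixed size no matter how many tokens have been
    processed, so per-token compute and memory stay constant. Schematic only,
    not the exact RWKV recurrence.
    """
    state_num = decay * state_num + np.exp(k_t) * v_t   # decayed weighted sum of past values
    state_den = decay * state_den + np.exp(k_t)         # matching normalizer
    out = state_num / (state_den + 1e-8)
    return out, state_num, state_den

d = 8
state_num, state_den = np.zeros(d), np.zeros(d)
for _ in range(1_000):                                  # total cost is linear in sequence length
    k_t, v_t = np.random.randn(d), np.random.randn(d)
    out, state_num, state_den = linear_attention_step(state_num, state_den, k_t, v_t, decay=0.9)
```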

RWKV’s performance compared to transformers remains a topic of active research and debate in the AI community. Its approach, while innovative, comes with its own set of challenges. The token relationships it builds, while more efficient to compute, aren’t as rich as those in transformers. This can lead to difficulties with long-range dependencies and information retrieval. RWKV is also more sensitive to the order of input tokens, meaning small changes in how a prompt is structured can significantly alter the model’s output.

Promising early signs

RWKV isn’t just a concept on paper: it’s being used in real applications today. Eugene cites a company processing over five million messages daily using RWKV for content moderation, achieving substantial cost savings compared to transformer-based alternatives.

Beyond cost-cutting, RWKV also promises to level the linguistic playing field. Its sequential processing method reduces the English-centric bias in many transformer-based models, which stems from their training data and tokenization methods, as well as the benchmarks by which they’re judged. Currently, RWKV models can handle over 100 languages with high proficiency: a significant step toward more inclusive AI.

While direct comparisons are challenging due to differences in training data, the early results are impressive. Eugene reports that RWKV’s 7B parameter model (trained on 1.7 trillion tokens) matches or outperforms Meta’s LLaMA 2 (trained on 2 trillion tokens) across a variety of benchmarks, particularly in non-English evals. These results hint at superior scaling properties compared to transformers, though more research is needed to confirm this conclusively.

Beyond encouraging evals, RWKV also has the potential to break us out of the “architecture inertia” described by my partner Jaya. Eugene explains that RWKV’s design allows for relatively simple integration into existing AI infrastructures. Training pipelines designed for transformers can be adapted for RWKV with minimal tweaks. Preprocessing steps like text normalization, tokenization, and batching also remain largely unchanged.

The primary adjustment needed when using RWKV comes at inference time. Unlike transformers, which treat each forward pass as stateless, RWKV carries a hidden state forward across time steps. To accommodate this, developers have to modify how that state is managed and passed through the model during inference. While this requires some changes to inference code, it’s a relatively manageable adaptation, more of a shift in approach than a complete overhaul.
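As a rough illustration, here is what that stateful generation loop can look like. The ToyRecurrentLM class below is a made-up stand-in, not the real RWKV interface; the point is simply that a fixed-size state is passed from step to step rather than re-processing the full context on every token.

```python
import numpy as np

class ToyRecurrentLM:
    """A stand-in for an RWKV-style model; NOT the real RWKV API.

    The only property that matters here: forward() takes one token plus a
    fixed-size state and returns logits plus the updated state.
    """
    def __init__(self, vocab_size=256, d=32):
        rng = np.random.default_rng(0)
        self.emb = rng.standard_normal((vocab_size, d))
        self.head = rng.standard_normal((d, vocab_size))

    def initial_state(self):
        return np.zeros(self.emb.shape[1])

    def forward(self, token, state):
        state = 0.9 * state + self.emb[token]       # toy recurrent state update
        return state @ self.head, state             # logits, new state

def generate(model, prompt_tokens, max_new_tokens):
    # The key difference from stateless transformer inference: a fixed-size
    # state is threaded through the loop instead of re-attending over the
    # whole context at every step.
    state = model.initial_state()
    logits = None
    for tok in prompt_tokens:                       # ingest the prompt token by token
        logits, state = model.forward(tok, state)
    out = []
    for _ in range(max_new_tokens):
        next_tok = int(np.argmax(logits))           # greedy decoding for simplicity
        out.append(next_tok)
        logits, state = model.forward(next_tok, state)
    return out

print(generate(ToyRecurrentLM(), prompt_tokens=[1, 2, 3], max_new_tokens=5))
```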

Implications for the AI field

By improving efficiency and reducing costs, RWKV has the potential to broaden access to AI. Here are a few of the implications that Eugene highlighted:

1. Unleashing innovation through lower costs

Current transformer-based models impose prohibitive costs, particularly in developing economies. This financial hurdle stifles experimentation, limits product development, and constrains the growth of AI-powered businesses. By providing a more cost-effective alternative, RWKV could level the playing field, allowing a more diverse range of ideas and innovations to flourish.

This democratization extends to academia as well. The exponential growth in compute costs driven by transformers has hampered research efforts, particularly in regions with limited resources. By lowering these financial barriers, RWKV could catalyze more diverse contributions to AI research from top universities in India, China, and Latin America, for instance. 

2. Breaking language barriers

Less than 20% of the world speaks English, yet, as discussed above, most transformer-based models are biased toward it. This limits users and applications, particularly in regions with multiple dialects and linguistic nuances.

RWKV’s multilingual strength could be used to build products that solve these local problems. The Eagle 7B model, a specific implementation of RWKV, has shown impressive results on multilingual benchmarks, making it a potential contender for local NLP tasks. Eugene shared an example of an RWKV-powered content moderation tool capable of detecting bullying across multiple languages, illustrating the potential for more inclusive and culturally attuned AI applications.

3. Enhancing AI agent capabilities

As we venture further into the realm of AI agents and multi-agent systems, the efficiency of token generation becomes increasingly crucial. As agents converse, collaborate, and call external tools, these complex systems often generate thousands of tokens before returning an output to the user. RWKV’s more efficient architecture could significantly enhance the capabilities of these agentic systems.

This efficiency gain isn’t just about speed; it’s about expanding the scope of what’s possible. Faster token generation could allow for more complex reasoning, longer-term planning, and more nuanced interactions between AI agents.

4. Decentralizing AI

The concentration of AI power in the hands of a few tech giants has raised valid concerns about access and control. Many enterprises aspire to run AI models within their own environments, yet this goal often remains out of reach. RWKV’s efficiency could make this aspiration a reality, allowing for a more decentralized AI ecosystem.

What’s next for RWKV?

While the potential of RWKV is clear, its journey from promising technology to industry standard is far from guaranteed.

Currently, Eugene is focused on raising capital and securing the substantial compute power needed for larger training runs. He aims to keep pushing the boundaries of RWKV’s model sizes and performance, and potentially expand into multimodal capabilities—combining text, audio, and vision into unified models. In parallel, the RWKV community is working on improving the quality and diversity of training datasets, with a particular emphasis on non-English languages.

Eugene is also excited about exploring other alternative architectures, such as diffusion models for text generation. His openness reflects a broader trend in the AI community: a recognition that the path forward requires novel ideas for model design.

While the long-term viability of these new architectures remains to be seen, democratizing AI is certainly a worthy goal. Lower costs, better multilingual capabilities, and easier deployment could enable AI to be used in a much wider range of applications and contexts, accelerating the pace of innovation in the field.

For founders interested in exploring these possibilities, Eugene recommends the RWKV Discord and wiki, as well as the EleutherAI Discord.

If you’re an ambitious founder thinking about what’s next, I’d love to connect: jchen@foundationcap.com.

