07.26.2025 | By: Ashu Garg
At ICML earlier this month (one of the year’s most important AI research conferences), reinforcement learning (RL) for LLMs dominated the agenda. This confirmed something I’ve observed in my conversations with builders over the past several months: the center of gravity for AI-focused founders has moved from retrieval-augmented generation (RAG) to RL, from knowledge retrieval to reasoning and decision-making.
This came home for me again on Monday when both Google and OpenAI achieved gold-medal performance on the International Math Olympiad (IMO) using LLMs powered by advanced reasoning and RL techniques. It’s the first time AI systems have ever hit this milestone, and it’s clear proof of the power of reasoning + RL.
RL is not new. So why the sudden resurgence? And, more importantly, what does it mean for AI-focused founders?
To get clear signal, I spoke to three founders in FC’s portfolio who are at the forefront of this new reasoning + RL paradigm for building AI apps: Animesh Koratana, CEO of PlayerZero (AI for engineering quality); Ishan Chhabra, CEO of Oliv AI (AI agents for revenue teams); and Kabir Nagrecha, CEO of Tessera Labs (transforming enterprise business workflows with advanced reasoning + RL techniques).
This month, drawing on my conversations with Animesh, Ishan, and Kabir, I explore the shift from RAG to reasoning + RL, why it’s happening now, and what it means for AI builders.
The short answer: RL is back in the spotlight because AI systems are being asked to think, act, and adapt in pursuit of business goals. Massive pretraining runs have given modern LLMs a broad map of how language works. The next frontier is getting them to do actual work. This means teaching them how to reason and take multi-step actions towards a goal.
This is where RL comes in. RL offers a way to train AI models via feedback and rewards so they can learn from outcomes and improve their decision-making. In a classic RL setup, an AI agent takes an action, and its environment returns a reward signal indicating how successful that action was. Over time, the agent adjusts its strategy to increase desirable outcomes.
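To make that loop concrete, here’s a minimal sketch in Python. The Environment and Agent classes are simplified stand-ins (a basic epsilon-greedy setup), not production training code:

```python
# Minimal sketch of the classic RL loop described above.
# `Environment` and `Agent` are hypothetical stand-ins, not a specific library.

import random

class Environment:
    def step(self, action: str) -> float:
        """Return a reward signal indicating how successful the action was."""
        return 1.0 if action == "good_action" else 0.0

class Agent:
    def __init__(self, actions: list[str]):
        self.values = {a: 0.0 for a in actions}  # running estimate of each action's value
        self.counts = {a: 0 for a in actions}

    def act(self, epsilon: float = 0.1) -> str:
        # Mostly exploit the best-known action, occasionally explore.
        if random.random() < epsilon:
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)

    def learn(self, action: str, reward: float) -> None:
        # Nudge the value estimate toward the observed reward.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

env = Environment()
agent = Agent(["good_action", "bad_action"])
for _ in range(100):
    action = agent.act()
    reward = env.step(action)    # environment returns a reward signal
    agent.learn(action, reward)  # agent adjusts its strategy over time
```

Real RL for LLMs is far more involved, but the core shape is the same: act, observe an outcome, adjust.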
RL was key to creating the new class of reasoning models released publicly over the past 9 months, from OpenAI’s “o” series to DeepSeek’s R1 and Google’s Gemini 2.0. Alongside a new generation of base models, this reasoning + RL recipe is also powering a new class of AI apps built around agents, tool use, and long-horizon decision-making.
As Kabir explained, real-world enterprise processes span dozens of systems and thousands of conditional steps. Small decisions at the right moment can swing outcomes by orders of magnitude. You need models that can plan, adapt, and act, not just think. The combination of reasoning and RL helps teach them to do that effectively.
Why are we seeing RL resurge as a central vector of AI advancement now? Why RL and not, say, synthetic data or new model architectures? There isn’t a single answer, but a few factors stand out:
Many early AI apps were LLM wrappers around a search engine or vector database. The developer’s job was to engineer good retrieval (using keywords or embeddings) and then have the LLM synthesize an answer from the retrieved text.
Today, a reasoning agent can break a task into multiple steps, plan its approach, gather information as needed in each step, and assemble the final result. Crucially, this whole workflow can be learned or handled by the model itself.
Let’s illustrate this with a concrete example. Imagine a user asks: “What are the differences between Gong and Clari?” A RAG-era app would retrieve documents mentioning both products and have the LLM summarize them in a single pass. A reasoning agent, by contrast, can plan its research: look up each product separately, decide at each step what it still needs to know, and then assemble the comparison.
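Here’s a rough sketch of what that agent loop could look like; call_llm and search are hypothetical placeholders rather than a specific framework:

```python
# Hypothetical sketch of a reasoning agent handling the query above.
# `call_llm` and `search` are placeholder functions, not a real API.

def call_llm(prompt: str) -> str:
    """Stand-in for a reasoning model call (replace with a real LLM client)."""
    return f"[LLM output for: {prompt[:40]}...]"

def search(query: str) -> str:
    """Stand-in for a retrieval or tool call (web search, vector DB, CRM API)."""
    return f"[search results for: {query[:40]}...]"

def answer(question: str) -> str:
    # 1. The model plans: break the question into research steps.
    plan = call_llm(f"Break this question into research steps: {question}")

    # 2. Gather information step by step, deciding per step what to look up.
    notes = []
    for step in plan.splitlines():
        query = call_llm(f"What should I search to complete this step? {step}")
        notes.append(search(query))

    # 3. Assemble the final result from the gathered evidence.
    context = "\n".join(notes)
    return call_llm(f"Using these notes:\n{context}\nAnswer: {question}")

# answer("What are the differences between Gong and Clari?")
```

The point isn’t the specific scaffolding; it’s that the planning and information-gathering loop sits inside the model’s own workflow rather than in hand-written retrieval code.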
Put simply, the control logic is shifting from humans to AI. The most advanced builders are spending less time on prompt engineering and retrieval heuristics, and more time training AI systems (with reasoning + RL) to handle that logic themselves.
What does this all mean for founders? Here are a few early best practices we’re seeing:
Reasoning models started off as cutting-edge but slow research prototypes. But we know how this story goes: today’s expensive model is a commodity in a few months. Optimization work is underway to make reasoning models more efficient, distill them into smaller versions, and run them on specialized hardware. In the coming months, tasks that were too latency-sensitive or too expensive for a reasoning agent approach will open up.
In the 2023 paradigm, if you had a mountain of data (say all the claims assessment guidelines for an insurance company, or a unique dataset of legal cases), you could fine-tune an LLM or use RAG to make a domain-expert bot. Your moat was that data: others didn’t have it, so their models would be less knowledgeable in that niche.
With RL-based systems, every user interaction spins your data flywheel faster. Suppose you’re building an AI coding agent. Each debugging session generates rich feedback signals: Did the fix resolve the bug? Did tests pass? Did the developer accept the suggestion? These outcomes immediately become training data for your RL loop.
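One way to picture this flywheel: each session’s outcome signals get logged alongside the agent’s trajectory and folded into a scalar reward. The sketch below is illustrative; the field names, weights, and file format are assumptions, not a real schema:

```python
# Hypothetical sketch: turning debugging-session outcomes into reward signals
# that an RL loop can later train on. Field names and weights are illustrative.

from dataclasses import dataclass, asdict
import json

@dataclass
class DebugOutcome:
    session_id: str
    fix_resolved_bug: bool      # did the fix resolve the reported bug?
    tests_passed: bool          # did the test suite pass afterwards?
    developer_accepted: bool    # did the developer accept the suggestion?

def reward(o: DebugOutcome) -> float:
    # Blend several outcome signals into a single scalar reward.
    return 0.5 * o.fix_resolved_bug + 0.3 * o.tests_passed + 0.2 * o.developer_accepted

def log_training_example(o: DebugOutcome, trajectory: list[str], path: str = "rl_data.jsonl") -> None:
    # Append (trajectory, reward) pairs for the next training run.
    with open(path, "a") as f:
        f.write(json.dumps({"trajectory": trajectory,
                            "reward": reward(o),
                            **asdict(o)}) + "\n")
```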
In the RAG paradigm, you might drop a knowledge base into a vector store. In the reasoning + RL paradigm, you want to build a process knowledge base. For instance, if you’re automating warehouse inventory management, you’d want to collect full walkthroughs of how the best human operations manager would handle a tricky inventory balancing problem. That could be in the form of annotated step-by-step examples, transcripts, or workflows.
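What might one entry in such a process knowledge base look like? A sketch, using the warehouse example above; the schema is an assumption for illustration, not a standard format:

```python
# Illustrative sketch of a "process knowledge base" entry: a full walkthrough of
# how an expert handled a tricky case, captured as annotated steps.

from dataclasses import dataclass, field

@dataclass
class Step:
    action: str          # what the expert did
    tool: str            # which system they used (ERP, spreadsheet, email, ...)
    rationale: str       # why they did it, in their own words

@dataclass
class ProcessWalkthrough:
    task: str
    outcome: str
    steps: list[Step] = field(default_factory=list)

example = ProcessWalkthrough(
    task="Rebalance inventory across two warehouses ahead of a demand surge",
    outcome="Stock-outs avoided; transfer cost kept under budget",
    steps=[
        Step("Pull 90-day demand forecast by SKU", "ERP", "Surge demand concentrates in region B"),
        Step("Identify SKUs below safety stock in warehouse B", "ERP", "These drive most stock-out risk"),
        Step("Draft transfer order from warehouse A", "TMS", "A has excess units and the cheapest lane to B"),
    ],
)
```

The value is in the rationale attached to each step: that’s the reasoning you want the model to learn, not just the final answer.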
RL-based systems can introduce new kinds of failures, so builders need to be vigilant. One issue is reward hacking, where the AI finds a shortcut to get a high reward that isn’t actually what you intended (a classic RL problem). To mitigate this, you need to design reward functions carefully and often include multiple objectives or constraints (e.g. maximizing profits is good but only within the bounds of valid accounting principles). Domain expertise is crucial here: you need to anticipate ways the agent might go astray and guard against them.
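As a toy illustration of a constrained reward, following the accounting example above (the thresholds and checks are assumptions, not a recipe):

```python
# Minimal sketch of a constrained reward function: reward profit, but only
# within the bounds of valid accounting. Numbers are illustrative.

def reward(profit: float, books_balance: bool, policy_violations: int) -> float:
    # Hard constraint: invalid accounting invalidates the episode entirely,
    # so the agent can't "hack" the reward by cooking the numbers.
    if not books_balance:
        return -1.0
    # Soft constraints: penalize each policy violation instead of only
    # rewarding the headline metric.
    return profit / 1_000_000 - 0.5 * policy_violations
```

The hard constraint is the important part: a single objective (“maximize profit”) invites shortcuts, while layered objectives and constraints close off the obvious ones.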
We’re still in the early days of this new paradigm, and it’s evolving rapidly. Here are a few trends we anticipate in the coming year as the industry embraces the reasoning + RL approach:
We’re now creating software that doesn’t just know things but can figure them out on its own. Builders who embrace the reasoning + RL paradigm will create the next generation of standout AI products. Those who stick with RAG may wake up to find their product feeling outdated, like a know-it-all who can’t solve a puzzle. As services become software, the ability to solve the puzzle is where the true value lies.