
What it takes to build AI agents that actually work

11.01.2025 | By: Ashu Garg

Getting to 80% takes 20% of the effort. Getting to 99.9% is 100x harder.


There’s a narrative in some corners of Silicon Valley that engineering skills have suddenly become less important now that AI can write code. I believe the opposite is true: engineering knowledge matters more than ever.

Building an agent that works in a demo is now relatively straightforward. Building one that works reliably enough to run a business on remains extremely hard. You can build an 80% solution with 20% of the effort, but getting to 99.9% accuracy takes 100x more work.

When banks deploy KYC software or loan origination agents, failure means millions in losses. These aren’t nice-to-have productivity tools – they’re infrastructure that companies stake their operations on. That final push from “mostly works” to “reliably works” requires exactly the kind of deep technical expertise that some claim is obsolete: understanding edge cases, debugging complex systems, building robust error handling, and iterating through 1000s of scenarios that only surface in production.

Great technical founders recognize that this painstaking work creates a competitive moat. What looks easy to replicate (the demo) masks what’s nearly impossible to replicate (reliability).

As Andrej Karpathy put it recently: “This isn’t the year of agents, it’s the decade of agents.” Karpathy spent years at Tesla working on self-driving software, where he witnessed firsthand the yawning gap between a demo that works 90% of the time and a system you can trust with your life. In autonomous vehicles, climbing from 90% to 99.9% reliability proved exponentially harder than the initial climb from 0% to 90%. AI agents are now beginning that same ascent.

So what makes the last mile so difficult? Why do those extra nines of reliability matter so much? And what are builders actually doing to get there? This month, I’ll explore these questions through lessons from one of our portfolio founders, Ram Krishnamurthy at Maximor.

Why 99% reliability matters

Building AI agents that do actual work means solving for reliability over long chains of actions, not just isolated tasks.

Think about what an agent actually does: it makes an API call, reads from a database, passes information to another system, waits for a response, then decides what to do next. Each of these is a discrete step where something can go wrong. True workflow automation happens when you can chain 1000s of these steps together reliably.

A metric that captures this is “99% step-length.” It asks: How many actions can your AI agent complete in sequence, maintaining 99% reliability at each step, before it needs a human to step in?

This is important because the length of tasks your system can reliably handle determines how economically valuable it is. If your agent can only manage short sequences before breaking, you’re selling a productivity tool that speeds up existing work. But if it can execute long, multi-step workflows autonomously, you’re selling a complete outcome. The business model shifts from helping people work faster to replacing entire processes.

The METR research group has been tracking how far frontier models can go. They measure what they call “horizon length”: the length of tasks, benchmarked by how long they take skilled humans, that an AI agent can complete reliably. They’ve found that the horizon length for software and coding agents has been doubling every 7 months.

Source: METR research

According to analysis from Exponential View, today’s best AI systems can manage around 100 steps at 99% accuracy. That translates to a day or two of focused analyst work – something like building a competitive landscape analysis or producing a research brief with multiple sources. Tools like OpenAI’s Deep Research can manage this level of complexity today.

If current trends hold, that means 10,000-step workflows at 99% accuracy will become feasible by 2029. At that scale, you’re looking at about a month of continuous work. To extend the example above, that would mean AI managing a complete product launch, including competitive analysis, product specification, go-to-market strategy, and launch execution.
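
The arithmetic behind that projection is easy to check. Taking the figures above at face value – roughly 100 reliable steps today and a 7-month doubling time – here’s a quick back-of-the-envelope sketch:

```python
import math

# Back-of-the-envelope check on the projection above, taking the cited
# figures at face value: ~100 reliable steps today, doubling every 7 months.
steps_today = 100
target_steps = 10_000
doubling_months = 7

doublings = math.log2(target_steps / steps_today)  # ~6.6 doublings needed
months = doublings * doubling_months               # ~46.5 months

print(f"~{doublings:.1f} doublings ≈ {months:.0f} months ≈ {months/12:.1f} years")
# ~6.6 doublings ≈ 47 months ≈ 3.9 years from today – i.e., around 2029
```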

The brutal math of compounding errors

Why focus so intensely on 99%? Why not settle for 95%, or even 90%?

Because the math gets unforgiving quickly. At 90% per-step accuracy, a 10-step task succeeds only 35% of the time (0.9^10 ≈ 0.35). At 99%, those same 10 steps succeed 90% of the time – the difference between “sometimes works” and “mostly works.” But even 99% isn’t enough for long horizons. A 100-step workflow at 99% per-step reliability completes successfully only ~37% of the time (0.99^100 ≈ 0.37). You’re back to failing more often than succeeding.

Now push to 99.9% reliability. Suddenly, 100 steps succeed ~90% of the time (0.999^100 ≈ 0.90). To reliably automate 100s or 1000s of actions, you need those extra nines. Each additional decimal point unlocks a disproportionately longer autonomy horizon.
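
These numbers fall straight out of multiplying per-step probabilities. A quick sketch reproduces them, assuming for simplicity that steps fail independently – an assumption the research discussed below complicates:

```python
# Probability that an n-step workflow completes flawlessly when each step
# succeeds independently with probability p.
def chain_success(p: float, n: int) -> float:
    return p ** n

print(f"{chain_success(0.90, 10):.2f}")    # 0.35 – 10 steps at 90%
print(f"{chain_success(0.99, 10):.2f}")    # 0.90 – 10 steps at 99%
print(f"{chain_success(0.99, 100):.2f}")   # 0.37 – 100 steps at 99%
print(f"{chain_success(0.999, 100):.2f}")  # 0.90 – 100 steps at 99.9%
```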

A recent paper co-led by researchers at the University of Cambridge, the University of Stuttgart, and the Max Planck Institute studied what they call the “illusion of diminishing returns.” On short-task benchmarks, model accuracy inches from 89 → 90 → 91% and appears to flatten. But for long-horizon execution, small per-step gains compound into much longer successful runs.

Source: “The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs” (September 2025)

In their experiments, models that look nearly identical on single-step metrics can perform very differently on multi-step tasks. Once you’re in the high-accuracy regime (roughly greater than 80% per step, and especially above 90%), tiny improvements compound sharply. Moving from 99% to 99.5% accuracy might barely register on a leaderboard, but it could double the number of steps your agent handles before failing. The relationship is hyperbolic: the closer you get to perfect reliability, the more additional steps you gain from each incremental improvement.
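
To make “double the number of steps” concrete: if you define the horizon as the number of steps before the chance of a flawless run falls below some threshold t, it works out to ln t / ln p, and since ln p ≈ −(1 − p) near p = 1, the horizon scales like 1/(1 − p) – the hyperbolic relationship above. A quick check, using a 50% success threshold purely as an illustrative cutoff:

```python
import math

# Steps an agent can take before the probability of a flawless run drops
# below threshold t, given per-step accuracy p: largest n with p**n >= t.
def horizon(p: float, t: float = 0.5) -> int:
    return math.floor(math.log(t) / math.log(p))

for p in (0.99, 0.995, 0.999):
    print(f"p = {p}: ~{horizon(p)} steps")
# p = 0.99:  ~68 steps
# p = 0.995: ~138 steps – halving the error rate doubles the horizon
# p = 0.999: ~692 steps
```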

Source: Same as above

For founders, this means every nine matters. If you’re not chasing sustained reliability over long sequences, you’re building a demo, not a product.

Why long horizons are so hard, even for frontier AI

Beyond compounding error rates, the paper highlights another problem that makes long horizons so difficult: self-conditioning. As an agent works through a lengthy task, the errors it makes become part of the context for future steps, which increases the probability of subsequent mistakes. The model conditions on its own flawed outputs and snowballs toward worse accuracy.

One remedy is reasoning. When a model is forced to think step by step, it can effectively “reset” each turn instead of being derailed by its past errors. The authors found that a version of GPT-5 that produced an explicit reasoning trace could execute over 2,100 steps correctly, with a roughly 80% success rate. Without that structured reasoning, it failed after only a handful of steps.
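
A toy simulation shows the shape of the effect. The numbers here are invented for illustration – this is not the paper’s setup – but the mechanism is the one described: in one run, every past error inflates the error rate of later steps; in the other, each step starts from a clean slate:

```python
import random

# Toy model of self-conditioning – invented numbers, not the paper's setup.
# Base error rate is 1% per step; in the self-conditioned run, each earlier
# error still sitting in context adds 2 points to the current error rate.
def errors_in_run(n_steps: int, reset_each_turn: bool) -> int:
    base, drift, errors = 0.01, 0.02, 0
    for _ in range(n_steps):
        p_err = base if reset_each_turn else min(base + drift * errors, 1.0)
        errors += random.random() < p_err
    return errors

random.seed(0)
trials = 2000
for reset in (False, True):
    mean = sum(errors_in_run(200, reset) for _ in range(trials)) / trials
    print(f"reset each turn = {reset}: mean errors per 200 steps ≈ {mean:.1f}")
# Conditioning on its own mistakes, the agent snowballs to ~25+ errors;
# with a per-turn reset it stays near the 2 you'd expect from the base rate.
```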

This study highlights why achieving long-horizon autonomy isn’t just about raw model performance. It’s equally about systems and process engineering – granular evals, guardrails, verification, and error correction so a single glitch doesn’t ripple through a full workflow. Put differently, you won’t get the right outcomes without the right underlying processes – especially in enterprise contexts, where teams won’t adopt your product unless they trust how it gets the work done.

How Maximor builds audit-grade reliability

To see how these principles work in practice, let’s look at one of our portfolio companies, Maximor.

Ram and his team are building an AI agent platform for finance and accounting. This is a domain where mistakes are extremely costly. Misstating financials or botching compliance could get a CFO fired and create serious legal liabilities. As Ram puts it: “In finance, reliability is not a metric: it’s the product.”

Ram and his co-founder, Ajay

Building vertical AI agents that actually work requires a specific combination: strong engineering skills plus deep domain expertise. Ram’s team has both. He and his co-founder Ajay met as undergraduates at IIT Madras – Ram graduated at the top of his class, and Ajay was an ACM-ICPC world finalist. They went on to spend over a decade at Microsoft, where they worked on global finance transformation projects and rebuilt Microsoft’s internal finance platform.

Maximor describes their approach as “audit-ready agentic automation.” They’re chasing nines: compounding reliability across long, multi-step workflows.

Here’s how they’re doing it in practice.

➡️ 1. Systems of agents > one mega model

First, Maximor breaks down the work. “There isn’t one single agent that can handle the entirety of a company’s accounting processes,” Ram explains. Just as a finance org has distinct teams for revenue recognition, treasury, accounting, and more, Maximor runs a system of specialized agents: one for invoice coding, another for reconciliations, another for cash forecasting, and so on.

Specialization allows each agent to be trained, tuned, and evaluated against a narrow, well-scoped task. This pushes per-step accuracy higher than a single, generalized agent could sustain over long horizons.

The system also mirrors how human teams escalate problems. An agent encountering an unusual case or low-confidence scenario doesn’t guess blindly: it hands off to another specialized agent or a human expert at clearly defined checkpoints in the workflow.

If you need extreme reliability, this multi-agent approach matters. A network of small, expert agents orchestrated with clear handoffs will outperform a monolithic AI attempting to handle everything.
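
As a structural sketch, the handoff logic looks something like the following. Everything here is a hypothetical stand-in – the agent, the confidence score, the threshold – not Maximor’s actual code, but it captures the “never guess blindly” checkpoint:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of confidence-gated handoffs between specialist
# agents – illustrative structure only, not Maximor's implementation.

@dataclass
class StepResult:
    value: str
    confidence: float  # the agent's self-assessed confidence, 0..1

CONFIDENCE_FLOOR = 0.95  # below this, never guess: escalate

def escalate_to_human(task: str, draft: StepResult) -> StepResult:
    # Stand-in for a human-review queue; a person confirms or corrects.
    print(f"[review queue] {task!r}: draft={draft.value!r} "
          f"(confidence {draft.confidence:.2f})")
    return StepResult(value=draft.value, confidence=1.0)

def run_step(agent: Callable[[str], StepResult], task: str) -> StepResult:
    result = agent(task)
    if result.confidence >= CONFIDENCE_FLOOR:
        return result
    # Defined checkpoint in the workflow: hand off rather than guess.
    return escalate_to_human(task, result)

# Example specialist: a fake invoice-coding agent with a low-confidence call.
def invoice_coder(task: str) -> StepResult:
    return StepResult(value="GL-6200 Travel", confidence=0.72)

print(run_step(invoice_coder, "Code invoice #1041").value)
```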

➡️ 2. Optimize for both process & outcomes

At the heart of Maximor’s product is what Ram calls the “trust engine”: an orchestration framework that weaves together reasoning models, predictive ML, and deterministic software tools.

In a given workflow (say revenue recognition), Maximor’s system might use an LLM to interpret an accounting policy, deploy a trained ML model to predict a value, call a deterministic calculator to apply a formula, have an LLM draft an explanation, then use another tool to verify all outputs against the underlying data.

The agent’s objective is twofold: optimize for outcomes (getting the right numbers) and processes (using the right method to get there). Maximor forces agents to show their work step-by-step. This verifier-centric design makes errors legible. If something goes wrong, the problem surfaces in the agent’s process trace. For long-horizon tasks, they document every step, then “look back” from the end to learn long-term correlations and identify which early actions rippled into later errors.
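
Here’s a minimal sketch of that pattern – one LLM step, one deterministic step, one verifier, each appending to a process trace. Every function and number is a made-up stand-in, not Maximor’s API, but it shows how a verifier-centric design makes errors legible:

```python
# Minimal "show your work" sketch in the spirit of the trust engine described
# above – every name and number here is an invented stand-in, not Maximor's
# API. Each step appends a trace entry so failures are legible afterward.

trace: list[dict] = []

def step(name: str, tool: str, fn, *args):
    out = fn(*args)
    trace.append({"step": name, "tool": tool, "inputs": args, "output": out})
    return out

def recognize(total: float, months: int) -> float:
    # Deterministic calculator: straight-line monthly revenue recognition.
    return round(total / months, 2)

contract_value, term = 120_000.00, 12

policy = step("interpret policy", "LLM",          # stand-in for an LLM call
              lambda: "recognize ratably over the contract term")
monthly = step("apply formula", "calculator", recognize, contract_value, term)
ok = step("verify", "checker",                    # outputs checked vs. inputs
          lambda: abs(monthly * term - contract_value) < 0.01)

assert ok, f"verification failed – inspect the process trace: {trace}"
for entry in trace:
    print(f"{entry['step']} → {entry['output']}")
```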

From a technical perspective, Maximor treats orchestration as an optimization problem with multiple constraints and objectives (both accuracy and explainability). They apply a mix of techniques: reasoning chains where helpful, RL where it adds value, and classical deterministic algorithms. Their approach is proof that good old-fashioned software engineering still matters tremendously.

➡️ 3. Forward-deployed engineers to surface edge cases

From a data perspective, accounting is uniquely advantaged. ERP systems make processes traceable. As Ram puts it, you can “time-travel” through two to three years of input data → decision data → output data to reconstruct ground truth and mine edge cases. That data richness and process trail are rare in other domains.

But not everything lives in an ERP. SOPs aren’t always codified, and important context often exists only in institutional memory. This missing context can only be uncovered through deep, relationship-based embedding with the finance team – speaking their language, surfacing unwritten rules, tweaking and tuning agents until they’re truly audit-ready.

To bridge this gap, Maximor has embraced “forward-deployed delivery” – a trend that we’re seeing with many AI startups. In the weeks both before and after an implementation, they embed their own engineers and accountants directly with a new client’s finance team. During those weeks, the combined human-AI team systematically uncovers corner cases, implicit assumptions, and undocumented policies hidden in the company’s actual processes. They then customize the agents to handle those specific edges before automation goes live.

This is a significant investment. It’s not a PLG motion with traditional SaaS economics, and it’s definitely not plug-and-play. But it’s what’s needed to achieve audit-ready reliability. This is also, Ram notes, “why OpenAI can’t just enter this space.” Getting deeply embedded in ERP systems, CRM data, and workflows requires sustained relationship-building and domain expertise that can’t be copied from 30,000 feet.

➡️ 4. Design interfaces that mimic human workflows & build trust through transparency

Maximor’s product design prioritizes user trust and oversight. The UI surfaces the agent’s reasoning and explanations clearly, and provides interfaces for finance staff to review and approve significant decisions. The agents are presented less like independent “coworkers” and more like assistants working under human accountants’ supervision.

By making the agent’s process transparent, users come to trust its outputs over time, because they can always drill down into why a number was produced. When the agent is unsure or encounters something novel, it asks for human confirmation rather than guessing.

This kind of UX – treating the human as the final backstop and showing them “here’s what I did, does it look right?” – is crucial for getting to high reliability. It both catches potential errors before they compound and builds the confidence needed for adoption. An agent might have 99.9% technical accuracy, but if users don’t trust it, they won’t use it.

This also means forgoing chat interfaces and integrating agents where the work is actually done. “For us, it’s not a chat experience where you interact with an AI once and then it goes off and comes back,” Ram explains. Accountants are accustomed to Excel, and they want to see things in a structured, sequential format – step one, step two – because it’s not just them using it. Auditors need to see the trail of steps too.

Maximor integrates directly into the tools accountants already use. Its agents converge at specific points in the process in ways that mimic how a human accounting team operates. By replicating these familiar workflows, Maximor creates a “digital twin” that feels trustworthy precisely because it mirrors the process users already understand.

The decade of agent engineering

As Maximor’s approach and the broader research on long-horizon reliability show, building production-grade AI agents remains an engineering problem at its core.

Getting from 90% to 99.9% reliability isn’t about prompt engineering or better models alone. It requires the kind of systems thinking that comes from deep technical experience: understanding how errors cascade, designing orchestration layers that can recover from failures, building verification systems that catch problems before they compound, and instrumenting workflows so you can debug what went wrong at step 124 of a 2,000-step process.

I agree with Karpathy that we’re entering the decade of agents. But, to be more precise, I believe we’re entering the decade of agent engineering. LLMs and reasoning models will become increasingly commoditized. The enduring value lies in the engineering required to make them reliable enough for production use.

The founders who win won’t be those who can ship demos fastest. They’ll be the teams who compound reliability across 1000s of steps – who understand that the unglamorous work of chasing those extra nines creates a moat that’s nearly impossible to replicate.
