When model providers eat everything: A survival guide for Service-as-Software startups

Ideas / Perspective / When model providers eat everything: A survival guide for Service-as-Software startups

#AI#Startups#OpenAI

08.05.2025 | By: Jaya Gupta, Ashu Garg

Foundation model providers like OpenAI and Anthropic are no longer just selling API access. In the past year, we’ve watched these companies move aggressively up the stack, evolving from infrastructure providers into true product companies. If your startup finds success by building with one of their models, there’s a real risk the provider will simply incorporate that functionality into their own offerings or release a competing app.

In other words, the model provider that powers you can also turn around and steamroll you.

The threat isn’t just theoretical. Because they serve as the backbone for countless AI apps, the major model labs have a panoramic view of what’s working across the AI ecosystem. They see which domains are driving heavy API usage, which new apps are gaining traction, and which prompts and features deliver the best results. Armed with this knowledge, they can quickly zero in on promising use cases and bake them directly into their products. Early on, fast-growing startups sometimes receive “special treatment” from model providers (like preferential API terms or early feature access) only to later face the provider as a rival or acquirer (as in the recent Windsurf saga).

So how do you build an enduring app-layer AI startup when the very platform you rely on might become your competitor? The wisdom to date is that you should architect your product so that each improvement in the underlying model serves as a tailwind. In practice, that means if OpenAI releases a more powerful model or a new feature, it should make your product better, not irrelevant. This is a great principle – but actually achieving it is easier said than done.

Over the past several months, we set out to answer this question. We’ve spoken with dozens of AI founders, sat in on board meetings, and worked closely with our portfolio companies to figure out what sets the most resilient AI startups apart. Several common threads emerged from these conversations. In short, the most forward-looking teams are thinking at a system level: they’re building AI agents that continuously learn from real-world interactions in ways that the major labs can’t easily replicate.

In the rest of this post, we’ll explore the strategies that can make an app-layer AI startup defensible. First, we’ll recap how the AI software landscape has evolved over the past three years, leading up to our current era of AI agents, powered by reinforcement learning (RL). Next, we’ll describe the specific strategies (illustrated by startups in our portfolio) that help app-layer AI startups turn model-provider threats into tailwinds.

The three waves of AI apps

We’ve seen three distinct waves of app-layer AI startups since the release of ChatGPT, each with increasing competition and consolidation:

Wave 1 (2022): The wrapper era. These were simple products: a clever prompt or two behind an intuitive user interface, all sitting on top of a foundation model’s API. These “wrappers” were easy to build, but just as easy to copy. And as soon as the underlying models improved, much of what these apps offered was subsumed into the model itself or became trivial for others to replicate.

Wave 2 (2023-2024): The AI-native era. The second wave of AI startups tried to be more truly “AI-native.” These teams started from first principles: they designed products around the unique abilities of LLMs, built custom data pipelines to augment model outputs, and tightly embedded AI into user workflows. But as model providers moved up the stack – launching their own apps, changing API terms, building competing products – these startups found themselves in a precarious position. Many got squeezed by the platforms they depended on.

Wave 3 (2025): The agentic era. In the current wave of agentic AI, a new generation of reasoning models and post-training techniques is further blurring the line between AI infrastructure and end-user applications. Model providers are shipping agentic products (like OpenAI’s Agent Mode and Anthropic’s Claude Code) directly to users. For startups, this means the bar for “AI-native” has risen even higher: it’s not enough to leverage a powerful model, you have to be building a learning system that can adapt and improve over time.

The growing importance of RL

In this third wave, RL has emerged as a core ingredient for building models and applications where AI acts as an autonomous worker, not just a copilot. This is changing the competitive dynamics in AI – potentially to the benefit of startups.

In the early days of generative AI, many assumed incumbents (Google, Meta, Microsoft, etc.) would inevitably win because they “owned the data.” But as AI’s capabilities advance, their stockpiles of historical data become less decisive. In an agentic world, the highest leverage asset isn’t a legacy dataset: it’s a feedback loop with real users in a real workflow.

Incumbents might control troves of data inside their SaaS apps, but they don’t see what happens between products: the decisions, hand-offs, and multi-tool processes that real users undertake outside any single UI. Startups can step into those gaps. By integrating with multiple tools and observing end-to-end processes, they can collect data on how work actually gets done across systems.

For example, consider an AI sales agent. A basic AI copilot for sales might automate individual tasks like logging a CRM entry or sending out a templated email. That’s useful as a productivity boost, but those gains are only marginal.

Now imagine an RL-driven sales agent that actively learns the craft of selling by watching the entire process from prospecting to close. Over time, it starts to pick up on patterns that close deals. Maybe it learns that following up two days after a demo yields better results than following up after one day, or that deals in a certain industry usually require looping in a sales engineer at the proposal stage. Eventually, such an agent won’t just draft emails for you – it will remind you when to send them, suggest who to loop in, and choose the most effective channel for communication. In essence, the agent is learning the unwritten playbook of the best human sales reps by observing their actions and outcomes – something you’d never find in the static fields of a CRM.

Building this kind of long-horizon, interactive intelligence is hard. It demands new approaches in both AI research and engineering. The model labs (OpenAI, Anthropic, DeepMind, etc.) are investing heavily to figure it out. A wave of new startups are also emerging to offer “RL-as-a-service” tooling. (Our portfolio company, Turing, for example, was one of the first to offer “RL gyms” to the leading AI labs.)

Several important pieces of this puzzle include:

Evaluators: An evaluator is any mechanism that judges the quality of an AI agent’s output or behavior. This is especially important in fuzzy domains where there’s no single correct answer. An evaluator can take many forms – it might be a rubric, a checklist, or another AI model serving as a critic. Its job is to provide feedback or a score that reflects how well the agent performed.

OpenAI is developing a “universal verifier” for this very purpose. This process reportedly uses one AI model to check and grade another model’s output, step by step, by cross-checking facts and logic against various sources. In practice, this means even for subjective tasks like creative writing or advanced math proofs, a model’s responses can be scored for quality and correctness. OpenAI credits this approach with helping one of its models earn a gold medal in the International Math Olympiad, a feat previously unheard of for AI.

Reward pipelines: A reward pipeline translates all those evaluator scores and other feedback signals into the reward values that an RL algorithm optimizes against. In practice, this means aggregating and weighting various feedback into a single number (perhaps prioritizing factual accuracy over style, for instance) so the agent is optimizing for overall quality, not just one metric.

A good reward pipeline also deals with long-horizon tasks that have many decision points. Complex, multi-step tasks might involve hundreds of actions before achieving a result, and the agent can’t wait until the very end to get a learning signal. The reward pipeline addresses this by assigning credit (or blame) to intermediate actions – either by giving feedback at key steps or by propagating the final outcome back to earlier decisions (the classic credit assignment problem in RL).

Custom training environments: These are simulated software environments – think of a sandboxed Salesforce instance, a fake DoorDash-like logistics dashboard, or a mock web browser for general web tasks. Such environments let an AI agent safely experiment: it can try actions, see the outcomes, and learn long-horizon strategies without affecting real users or data.

Notably, OpenAI and Anthropic are aggressively training agents inside simulated versions of the top 150 websites and every major system of record. Their goal isn’t to license these environments as products, but to use them as training grounds for their models. If a model masters these sandbox simulations, it gains a performance edge and can be deployed as an agent that understands real-world software workflows from day one.

RL-specific infrastructure for sparse and delayed feedback: Many valuable tasks (like debugging a complex system or closing an enterprise sales deal) only pay off after a long sequence of steps – sometimes minutes, hours, or even days later. That delay makes it hard for an AI agent to learn, because feedback is infrequent and arrives long after the actions that influenced it.

Training agents under these conditions requires specialized RL infrastructure. Examples include replay buffers (which let the agent remember and learn from past experiences), asynchronous rollouts (running many instances of the agent in parallel to collect training data faster), and advanced credit assignment techniques (figuring out which earlier actions were responsible for eventual success or failure).

All of this investment in RL isn’t just about building better agents: it’s about owning the learning loop. The stream of interaction data generated by users and agents working in tandem is becoming the key proprietary asset in AI. Whoever captures that loop – whoever’s interface is gathering those clicks, decisions, and outcomes – will build a compounding advantage.

How to build a defensible app-layer AI startup

So what does all of this mean for an AI founder on the application side? In practice, building a durable AI app startup today means focusing on domains where you control the most valuable data and feedback loops. Ideally, your AI agent should plug into judgment-heavy workflows and customer-specific processes that produce interaction data the big model providers can’t easily see or replicate.

This often means taking on messy tasks that rely on human judgment, qualitative signals, and context. Whether a task’s “success” is straightforward to measure (say, a claim was filed correctly) or more subjective (say, an analysis was insightful enough to prompt a CFO to act), the key is that you define what success looks like and capture those signals. By owning the data, outcomes, and definitions of success in your domain, you create a feedback loop that continuously improves your product.

Here are five patterns we’ve seen across app-layer AI startups getting this right.

⚠️ Strategy 1: Master high-exception workflows

Many enterprise processes are riddled with edge cases, exceptions, and constantly changing rules. Model providers will have a hard time handling all this variability across industries, but a focused startup can turn that complexity into an advantage.

Take the insurance industry: the process for handling claims can differ wildly based on policy type, jurisdiction, and new rules. No off-the-shelf AI model will automatically know how to handle every permutation.

One of our portfolio companies, Fulcrum, recognized this and designed its AI agent platform with humans in the loop for the tough cases. Whenever Fulcrum’s insurance-processing AI encounters a scenario that’s ambiguous or outside its confidence zone, it flags a human adjuster to step in. Instead of blundering through a rare edge case, the AI defers to a person – and, crucially, it learns from how the expert handles the situation. Over time, the agent builds up a knowledge base of these exceptions and the appropriate resolutions. By baking expert intervention into its workflow from the start, Fulcrum turns real-world complexity into a training asset.

🔄 Strategy 2: Create proprietary feedback loops in ambiguous domains

In domains where ground truth is fuzzy or subjective, the big model providers struggle to train effectively. Startups that embed themselves in these high-ambiguity environments can use this fuzziness to their benefit. They develop custom evaluators, rubrics, and reward functions that define success in their specific problem space, giving their agents learning signals where none exist in an off-the-shelf model.

PlayerZero is a great example. Their coding agent reasons across codebases, tickets, telemetry, and architectural graphs, learning from patterns of failure and resolution to improve its performance over time. Crucially, PlayerZero doesn’t rely on a static dataset for training: it continuously ingests new incident data, analyzes it, and updates its internal model of how distributed systems behave. This creates a virtuous cycle where the agent gets smarter with every anomaly it processes. PlayerZero’s ability to learn from real-world behavior, even when success criteria are murky or context-dependent, gives them a durable – and compounding – edge in software observability and infrastructure troubleshooting.

We’ve seen a similar approach from Traversal, which focuses on real-time incident response. They capture and log every action an agent takes during an incident, label the outcomes, trace the root causes, and feed those insights back into their model. By doing so, Traversal turns every failure (or close call) into an opportunity for their system to get better. This custom loop hones their agents in ways a one-size-fits-all model can’t match.

⚙️ Strategy 3: Own the execution layer

Defensibility can also come from owning the messy, hard-to-reach parts of the tech stack that big model providers aren’t inclined to tackle. Enterprise tech is full of legacy software, half-documented APIs, siloed institutional knowledge, and clunky manual workflows. The startups that thrive here embrace this mess. They build agents that plug into the duct-tape layer of business processes, where brittle scripts and human workarounds have been the norm.

For example, Maximor targets all the shadow processes that orbit around rigid ERP systems: the spreadsheets, desktop procedures, unofficial SOP documents, and other hacks that employees use to get work done outside their main software. Their AI audit tool observes these behaviors in detail and produces agent playbooks that reflect how work actually happens (as opposed to how the enterprise imagines it happens). From there, Maximor works closely with users to turn those playbooks into AI agents that can operate within each organization’s fragmented stack.

Another team, Optimized (currently in stealth), follows a similar playbook by fully owning the execution surface of DocuSign workflows. Because they manage the underlying data records and know exactly what templates, fields, and signatories are needed, their agent can complete a contract process end-to-end with just a short prompt. While other solutions rely on prompt-heavy interaction with multiple APIs, Optimized’s product delivers a one-click experience.

The unifying lesson is that defensibility lives in the parts of the tech stack that big model providers won’t bother to own. The more obscure, painful, or industry-specific the surface area, the more likely it can serve as a moat for a startup willing to own it.

👥 Strategy 4: Hire domain experts and embed with customers

Another pattern we’ve observed: building a powerful AI app often requires deep domain understanding and close collaboration with your end users. In other words, having great AI researchers isn’t enough: you also need team members who truly understand your customers’ world, and you should be working alongside those customers as you build.

Our portfolio company, Regie, which is developing AI agents for sales and marketing teams, is a case in point. They’ve structured their entire org around GTM expertise. Several of their leaders are former sales-engagement and marketing-automation execs: people who have lived the pain points that Regie’s product now addresses.

In addition to hiring domain experts, Regie embeds their engineers and product team with their customers. They sit alongside SDRs, AEs, and RevOps to directly observe how the sales process works and where it breaks down. This allows Regie to capture the nuances of edge-case sales logic: what reps actually say, how RevOps thinks about ICP scoring, where outreach sequencing breaks down. They also define the reward function for their agents in precise GTM terms, like conversions per rep per channel (rather than a generic metric like “emails sent”). This means their product is optimized for what makes true business impact for their customers.

🏃 Strategy 5: Be fast and in the details

Speed has always been a startup’s advantage, but AI has turned it into a superpower. Features that took quarters to build can now ship in weeks, and product-market fit that once took years can happen in months. Every day sooner you ship means you collect more user data, trigger more feedback loops, and drive faster model improvements.

Being fast, however, doesn’t mean cutting corners on understanding the problem. In fact, staying in the details – remaining close to the actual work and to your users – is crucial when you’re building AI systems to replicate complex workflows. To create high-performing agents in high-exception environments, you have to get in the weeds: do the primary research, map out every edge case, and internalize each pain point in the process you’re automating. Many of the founders succeeding here are often former ICs who have firsthand understanding of the work their agents are automating.

The path forward

Across all of these strategies, the common thread is embedding deeper into your chosen problem space than anyone else. The most resilient AI startups are building entire systems around their AI, complete with domain-specific data feeds, custom feedback loops, and human experts in the loop.

The key is to think in systems. If you can capture data no one else sees, learn from every interaction, and tailor your AI to excel in the nuances of your field, then you won’t have to worry about the next breakthrough from OpenAI or Anthropic – you’ll be able to cheer it on, because it will only make your product stronger.

Published on August 5, 2025
Written by Foundation Capital

When model providers eat everything: A survival guide for Service-as-Software startups

The three waves of AI apps

The growing importance of RL

How to build a defensible app-layer AI startup

⚠️ Strategy 1: Master high-exception workflows

🔄 Strategy 2: Create proprietary feedback loops in ambiguous domains

⚙️ Strategy 3: Own the execution layer

👥 Strategy 4: Hire domain experts and embed with customers

🏃 Strategy 5: Be fast and in the details

The path forward

Related Stories

The $4.6T Service-as-Software opportunity: Lessons from year one

Reasoning + RL: A new recipe for AI apps

A System of Agents brings Service-as-Software to life

Get insights directly to your inbox