
Meta’s bet on Scale & the new AI data paradigm


06.27.2025 | By: Ashu Garg

I explore how the data needed to advance AI has changed, and what it will take to push AI to the next frontier.

The big AI news this month: Meta paid nearly $15B for a 49% stake in Scale AI, effectively acqui-hiring Scale’s CEO Alexandr Wang into its new “Superintelligence” lab.

More than the hefty price tag, this move underscores how strategic data has become in advancing AI. The competitive edge is shifting from who has the biggest model and most GPUs to who has the best pipeline for generating proprietary, high-impact data.

In other words, having the largest model increasingly matters less than having the highest-value training data to feed it.

For years, data has been likened to oil – immensely valuable, but ultimately a static commodity any lab could buy or scrape. AI companies outsourced labeling to armies of gig workers, paying pennies per task to tag images or transcribe text. That assembly-line approach was Scale’s bread and butter in its early days.

As AI systems have advanced, simplistic cat-and-dog tagging tasks have given way to more complex needs. The most coveted training data today is expert-generated and dynamic – tied to how AI systems learn in real time. Each of the leading AI companies now spends on the order of $1B per year on human-driven data, and those budgets are only growing.

Meta’s rivals like Google and OpenAI, who once relied on Scale’s services, have been cutting ties in the wake of the deal. (One competitor CEO described Meta’s move as “the equivalent of an oil pipeline exploding between Russia and Europe.”)

In practical terms, AI labs worry that Scale’s deep partnership with Meta could lead to leakage of their data and ideas, along with second-class service as Scale prioritizes Meta’s needs. In a hyper-competitive industry, where novel IP determines who wins, even a whiff of these risks is unacceptable.

This has created massive tailwinds for Turing – a company that Jonathan Siddharth and Vijay Krishnan incubated in our Palo Alto office back in 2018. Turing has long positioned itself as a neutral partner, a kind of “Switzerland” for every AI lab. According to Jonathan, Turing has seen a major influx of demand in the weeks following Meta’s announcement, as AI labs scramble to find independent data partners they can trust.


In addition to reshaping competitive dynamics, the Meta-Scale deal highlights how the data needed to advance AI has changed, and what it will take to push AI to the next frontier. In this month’s newsletter, I explore what this new AI data paradigm looks like, and why every major lab is racing to embrace it.

⛽ Data is the new oil, but this isn’t the full story

Much of the AI progress of the last few years was driven by amassing huge, static datasets. The internet effectively became the oil field for training GPT models. But the old saying “data is the new oil” misses a key point. Oil is a finite resource you extract and burn. Data for AI, by contrast, isn’t a fixed asset you mine once – it needs to be continually created, refined, and renewed.

The industry is realizing that accumulating more of the same data yields diminishing returns. OpenAI’s COO, Brad Lightcap, recently noted that even if you combined all the proprietary text from major publishers and added it to GPT-4’s training mix, it would boost the data volume by less than 0.1%.

As Lightcap explained, highly specialized data (say, a detailed financial database or a trove of scientific papers) is often far more useful when provided to a model at inference time than if it were simply added to the model’s pre-training corpus.
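To make that concrete, here’s a rough sketch of the inference-time pattern Lightcap is describing (often called retrieval-augmented generation). The retriever and model call below are hypothetical stand-ins, not any lab’s actual API:

```python
from typing import List

# Hypothetical stand-ins: a retriever over a proprietary financial database
# and a call to a frozen, general-purpose model.
def search_financial_db(query: str, top_k: int = 5) -> List[str]:
    return ["(retrieved passage)"] * top_k

def call_model(prompt: str) -> str:
    return "(model completion)"

def answer_with_specialized_data(question: str) -> str:
    """Supply specialized data as context at inference time, instead of
    folding it into the model's pre-training corpus."""
    passages = search_financial_db(question, top_k=5)
    prompt = (
        "Use the following reference material to answer.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return call_model(prompt)
```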

Meta’s bet with Scale (and I agree with Zuckerberg here) is that the AI race won’t be won by reusing what’s already online or by purely synthetic data augmentation. Those tactics help, but they’re not sufficient. What’s needed is a constant stream of novel, complex training scenarios that mirror real-world use.

In other words, data is now less like oil and more like a renewable resource we must continuously cultivate. The labs that figure out how to reliably produce that resource, faster and better than their competitors, will hold the advantage.

💡 The 3 data frontiers: Depth, breadth, and agency

A few years ago, it was common to assume that whoever had the largest pile of data would inevitably win. Google and Facebook’s data stockpiles were considered unbeatable moats. Today, the truly scarce resource isn’t the static data sitting in your vault, but the ability to generate and use the right data faster than others.

As AI models grow more capable, the bar for data that actually drives further improvement gets higher. Frontier labs are now pushing their models along three key dimensions, each of which demands new kinds of data and human expertise:

➡️ Depth: Models need to get smarter in domains like coding, reasoning, and STEM fields.

Achieving this requires complex, expert-generated training scenarios that stretch models’ capabilities. For instance, improving a state-of-the-art coding model today means finding real-world programming challenges the model can’t yet solve, then providing step-by-step solutions from expert engineers. In math, finance, and science, it similarly means enlisting PhD-level experts to devise genuinely hard, model-breaking problems and detailed solutions.

➡️ Breadth: Models need to get broader – mastering multiple modalities, languages, and domains.

The progression we saw with text and code (from simple labeled examples to highly sophisticated datasets as models improved) is now playing out across new frontiers like speech, images/video, robotics, and 3D. The race is on to cover all the bases of human knowledge and perception, not just text.

Turing has invested heavily in supporting these emerging needs, building extensive datasets for audio (speech recognition, text to speech, voice library expansion, and full-duplex conversational data), robotics (VLM/VLA datasets and annotated robotic demonstration data), and next-generation image/video generation models.

➡️ Agency: Models need to become more agentic – capable of executing complex, multi-step tasks in real-world contexts.

Training an AI agent is very different from training a model to produce a single answer. It requires creating custom data-capture pipelines and interactive environments where the model can take a series of actions towards a goal.

Often this takes the form of setting up “RL gyms” (simulated, reinforcement-learning environments) where an AI can attempt a task step-by-step, guided by expert feedback. A lab might need to orchestrate an interdisciplinary team – software engineers, domain experts, data annotators, even UX designers – to design and run these training loops. The data produced isn’t just static input-output pairs; it includes sequences of agent actions, intermediate tool outputs, human corrections, and reward signals indicating success or failure. This kind of dynamic, interactive data is crucial for teaching AI to plan and act over long horizons.
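To give a feel for what this data looks like, here’s a minimal sketch of the kind of record such a loop might produce. The field names are illustrative, not any lab’s actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AgentStep:
    """One step an agent takes inside a simulated environment."""
    action: str                             # e.g. "run_tests" or "open_file src/parser.py"
    tool_output: str                        # what the environment returned
    human_correction: Optional[str] = None  # expert override, if the step went wrong

@dataclass
class Trajectory:
    """A full attempt at a task: the dynamic data an RL gym produces."""
    task_description: str
    steps: List[AgentStep] = field(default_factory=list)
    reward: float = 0.0                     # success signal, e.g. 1.0 if the goal was met
```

The unit of data here is an entire attempt at a task, not an isolated input-output pair – which is exactly what makes it so much harder to produce.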

⚙️ AI labs need research accelerators, not data vendors

In the early deep learning days, a data vendor’s job was straightforward: maintain a massive labeling workforce and provide heaps of annotated data to whoever paid. Those days are waning. Today’s frontier labs don’t just need a data supplier – they need a collaborator who can identify a model’s blind spots and engineer human-AI feedback loops to create the data that will solve those specific problems.

In short, leading labs don’t need basic data vendors; they need research accelerators.

I had a conversation with Jonathan last year that really drove this home for me. He described how back in 2022, if OpenAI wanted to improve its coding model, anyone could create a few new programming problems or bug fixes, and the model would get a bit better. Low-hanging fruit was everywhere. But today, state-of-the-art coding models have solved or memorized the obvious benchmarks. To meaningfully improve them, you have to find holes in their capabilities – genuinely hard problems they can’t yet crack.

So Turing’s team set out to generate new training data that would push frontier coding models to the next level. They combed through thousands of open-source GitHub repositories, using both automation and human expertise to surface tricky issues and bugs that current models failed to solve. This was no simple scrape; it meant spinning up complex dev environments for each project, running code and tests to verify where the models stumbled, and then packaging those failure cases into new training examples.
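In heavily simplified form, the filtering step at the heart of that kind of pipeline looks something like this sketch; the model-attempt and test-suite helpers are hypothetical placeholders for the real dev-environment tooling:

```python
from typing import Callable, Dict, List

def mine_model_breaking_issues(
    issues: List[Dict],
    model_attempts_fix: Callable[[Dict], str],    # model proposes a patch for an issue
    run_test_suite: Callable[[Dict, str], bool],  # True if the patch makes the tests pass
) -> List[Dict]:
    """Keep only the issues the current model fails to solve.

    Each surviving issue becomes a candidate training example once an
    expert supplies a verified, step-by-step solution.
    """
    hard_cases = []
    for issue in issues:
        candidate_patch = model_attempts_fix(issue)
        if not run_test_suite(issue, candidate_patch):  # model still stumbles here
            hard_cases.append(issue)
    return hard_cases
```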

To create truly agentic coding assistants, you also have to teach them how to think like software engineers. Turing tackled this by creating advanced reasoning data for coding. Imagine a top developer working on a bug fix and capturing every step: reading log files, tracing through source code, running unit tests, encountering errors, searching for the error, iterating on a fix, and so on. Turing orchestrated teams of expert coders to produce these rich, multi-step reasoning traces for problems that current models couldn’t handle on their own. The result was a trove of sequential, decision-by-decision data – not just static input/output pairs, but entire journeys from problem to solution.

Generating this kind of data is exponentially harder than old-school labeling. You can’t crowdsource it to random annotators on Mechanical Turk; you need skilled engineers, STEM PhDs, enterprise domain experts, custom software to capture the interactions, and careful AI-assisted QA to ensure the data produced is correct.

Crucially, Turing generated this data proactively – not because a client handed them a spec, but because their own researchers identified a weakness (“frontier model X struggles with this part of the taxonomy for coding or physics”) and went out to generate data to fix it. This is what a research accelerator does.

Notice how different that is from the old vendor model. It’s no longer a client handing over a spec for a dataset (“please label these 10M images”), and the vendor executing on it. It’s the data partner itself having AI research insight and co-piloting the model improvement process. In Turing’s case, they moved from being a data provider to an R&D collaborator for top AI labs.

The broader point here is that data generation itself has become a first-class engineering discipline in AI. It’s not grunt work to be outsourced cheaply; it’s core to AI R&D. Often, as in the case of coding reasoning traces, its production is inextricable from complex workflows that require coordinating multiple types of specialized talent. Companies like Turing exist because not every lab can (or wants to) build these capabilities fully in-house.

🏋️ From static datasets to “RL gyms”

Another important shift in how AI labs operate is the move from static datasets to dynamic RL environments (often called “RL gyms”) to train the next generation of AI agents.

Labs have been bumping up against the limits of the Supervised Fine-Tuning (SFT) + RLHF (Reinforcement Learning from Human Feedback) paradigm that gave us models like ChatGPT. SFT (learning from example demonstrations) was great for getting models to imitate desired behavior. And RLHF – using human preference judgments as a reward signal – allowed them to fluently interact with humans.

But to train AI agents that perform multi-step tasks (like browsing the web, writing and executing code, or controlling a robot), these techniques alone aren’t enough.

Take the example of a digital agent that can operate a web browser to complete tasks for you. Training it via supervised learning would require an impractically large number of recorded demonstrations to cover all the things a general assistant might need to do. Each demo might be minutes or hours long with countless branching possibilities. It doesn’t scale – there are too many ways to accomplish a task, and you’ll never cover them all with scripted examples. Plus, a model trained only on a handful of demonstrations might overfit to the quirks or inefficiencies of those specific examples.

The alternative is to let the agent learn by doing in a simulated environment. If you can define crisp success criteria, then you have a natural reward function and can use RL to optimize the agent’s behavior. This was the secret behind AlphaGo’s success (reward = winning the game), and we’re now applying it to things like coding agents (reward = tests passed) and web assistants (reward = user’s goal achieved).
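For a coding agent, that reward can be as simple as whether the project’s test suite passes. A minimal sketch, assuming the repository uses pytest:

```python
import subprocess

def coding_agent_reward(repo_dir: str) -> float:
    """Crisp, automatable reward: 1.0 if the test suite passes, else 0.0.

    Any automated success check (compilation, linting, an end-to-end
    script) can stand in for pytest here.
    """
    result = subprocess.run(
        ["pytest", "-q"],          # assumes a pytest-based test suite
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return 1.0 if result.returncode == 0 else 0.0
```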

In these agentic training loops, labs use pure RL whenever they can. When an automated success metric isn’t available, they still rely on RLHF for the more subjective aspects (things like style, helpfulness, and avoiding toxic outputs). But even there, labs are exploring hybrid approaches – for instance, using RL first with automatable rewards (like checking factual accuracy via web search or ensuring code compiles and tests pass) and then applying a light layer of human feedback on top.
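A rough sketch of what such a hybrid reward might look like; the weights and signals below are illustrative assumptions, not anyone’s published recipe:

```python
from typing import Optional

def hybrid_reward(
    code_compiles: bool,
    tests_passed: float,                       # fraction of tests passing, 0.0 to 1.0
    human_preference: Optional[float] = None,  # sparse human rating in [0, 1], when available
) -> float:
    """Blend cheap automated checks with a light layer of human feedback."""
    # The 0.3 / 0.5 / 0.2 weights are arbitrary for illustration, not a tuned recipe.
    reward = 0.3 * float(code_compiles) + 0.5 * tests_passed
    if human_preference is not None:
        reward += 0.2 * human_preference
    return reward
```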

⏱️ Research velocity is the new moat

All of this has made the speed of experimentation – the ability to build and iterate quickly on these RL training loops – a critical competitive advantage for AI labs. This is one area where Turing provides a lot of value: they help labs set up and run these custom RL gyms. It’s a perfect example of how Turing saw where the field was headed and positioned itself as a true research accelerator for its partners.

These RL gyms are part of Turing’s internal platform (codenamed ALAN), which is built to orchestrate complex AI training workflows at scale. ALAN coordinates all the moving parts – human experts and AI processes – needed to design and run these experiments.

The goal is to compress the cycle time of an AI experiment. If partnering with Turing means a lab can test, say, 3× more hypotheses, then that lab’s chance of discovering something novel goes up significantly.

In startups, the teams that learn the fastest win – they iterate to strong product-market fit before their competitors. For AI labs, the “product” is the model, and the “fit” is its performance on the target tasks. Turing acts as an extension of a lab’s R&D team, one dedicated to tightening that feedback loop as much as possible.

🇨🇭 Turing is the “Switzerland” of AI data and talent

Jonathan and Vijay founded Turing in our office in 2018 with a somewhat different initial mission: to build an “Intelligent Talent Cloud” for software developers. They created AI tools to evaluate and manage remote engineers, giving companies instant access to vetted global coding talent. In 2020, that business took off with the remote-work boom.

A turning point came in 2022. OpenAI, fresh off the success of GPT-3, invited Jonathan to a meeting, which he assumed would be about OpenAI using Turing’s platform to hire engineers. Instead, OpenAI had a different request. Their researchers had discovered that adding a lot more code data to an LLM’s training made it significantly better at reasoning and problem-solving. They asked if Turing could generate a massive dataset of code and coding solutions to feed their models.

Jonathan obliged, and in doing so he saw a far larger opportunity. By 2023, Turing had expanded from a talent marketplace into an AI R&D services company. Importantly, they didn’t abandon their original business (companies still use Turing to hire remote devs), but they layered on another mission: help the world’s leading AI labs build better models. In effect, Turing transformed from an on-demand talent platform into an AGI infrastructure firm.

Throughout this evolution, Turing has maintained a neutral stance. Unlike Scale, which is now effectively an arm of Meta, Turing remains independent. Jonathan often describes Turing as “Switzerland”: they’re working with all the major AI labs (OpenAI, DeepMind, Anthropic, and others) in their race to AGI.

By working across the entire industry, Turing has emerged as the leading research accelerator for frontier AI teams. Meta’s alignment with Scale has only opened the door wider for Turing to double down on being the neutral, strategic data partner for any lab pursuing AGI.

🔮 The future of AI data

Meta’s investment in Scale was a loud announcement that data is strategic, and that where you get your data – and who else has access to it – matters deeply. But the bigger story is the qualitative change in what kind of data is needed to advance AI. Static, one-off datasets are giving way to dynamic, evolving training curricula. Data providers that used to be simple order-takers are becoming true partners in scientific discovery.

For AI labs, this new breed of partner is a force multiplier. I’d argue that today, if you’re a serious AI lab and not leveraging an accelerator like Turing (or an internal equivalent), you’re at a real disadvantage. And beyond the top few labs, as AI spreads across the economy, every company building or fine-tuning AI models for its own domain will face similar needs. This points to an enormous emerging market for AI research acceleration services and tools in the coming years.

As an early investor in Turing, I’ve been bullish on these trends since day one. But beyond my investment, I’m fascinated by what Turing’s rise tells us about how AI is evolving. They bet early that data would become an iterative service rather than a one-time product – and that bet is now clearly paying off.

At Foundation, we’re proud to partner with Jonathan, Vijay, and the entire Turing team as they continue to shape where this new AI data paradigm leads.

