
Databricks vs. Snowflake: What their rivalry reveals about AI’s future


06.29.2024 | By: Ashu Garg

I explore the competition between these two data powerhouses and what it says about where enterprise AI is headed.

As a long-time tech investor, I’ve seen many industry rivalries play out: Microsoft vs. Apple, Google vs. Facebook, Uber vs. Lyft. But few have captured my attention quite like the intensifying battle between Databricks and Snowflake.

On the surface, it looks like another Silicon Valley turf war: two well-funded startups fighting for dominance in the multi-$100B data and analytics market. But as I’ve watched this conflict unfold, it’s become evident that the stakes are much higher. This isn’t just about market share. As I first noted last June, this is a fight for the future of enterprise AI.

To understand the true significance of this rivalry, it’s crucial to understand the transformation that data platforms have undergone in recent years. No longer just systems for storing and processing data, they’ve evolved into the foundation for a new breed of intelligent applications that combine data, tools, and multiple AI models into sophisticated systems. This trend, which I first wrote about in April, will fundamentally redefine how enterprises approach and apply software.

Both Databricks and Snowflake have recognized this paradigm shift. They understand that whoever controls enterprise data will control the future of enterprise AI. As a result, owning the platform layer for generative AI has become their top strategic priority.

This became crystal clear to me last year, when the two companies staged their annual conferences on the exact same dates. It felt like a calculated decision, one that forced their shared customers to take sides.

This year, their events were a week apart, but the competitive tensions were equally intense. The primary battleground was data and AI governance. Snowflake fired the first shot by open-sourcing Polaris, its catalog for Apache Iceberg, a popular open-source table format that’s compatible with any compute engine. Databricks countered by announcing its acquisition of Tabular, a managed solution for Iceberg created by the project’s founders, right in the middle of Snowflake’s conference. The following week, at its own summit, Databricks further upped the ante by open-sourcing its Unity Catalog in front of a live audience.

These strategic maneuvers underscore how AI is redrawing the battle lines in enterprise data infrastructure. Enterprises are increasingly demanding interoperability and portable compute. For Databricks, with its open-source roots, this is a natural evolution. For Snowflake, it marks a major shift from its traditionally closed approach. Both are racing to adapt as value migrates up the stack toward dynamic systems of models and tools built on top of their offerings.

It’s important to remember that this is not a two-horse race: other startups like OpenAI, which recently acquired Rockset to enhance its data retrieval capabilities, and tech giants like Google, Amazon, and Microsoft, who offer similar data management and governance services within their clouds, are formidable contenders.

In this month’s newsletter, I’ll explore four key concepts that shed light on these competitive dynamics: data gravity, the convergence of analytics and AI, the strategic importance of open source, and the rise of compound AI systems. I’ll close with reflections on what this means for enterprises and the broader tech ecosystem.

“The more data, the better”

At the heart of this battle lies a well-established principle in tech circles: “data has gravity.” Moving large datasets is difficult, expensive, and time-consuming. It’s far more efficient to bring applications and services to data rather than vice versa.

Databricks and Snowflake have crafted their strategies around this concept. They recognize that the more customer data they can attract and retain, the stickier and more valuable their platforms become, and the better equipped they are to power custom AI applications. As Databricks co-founder and CEO Ali Ghodsi succinctly put it: “The more data, the better.”

The rise of generative AI has only increased the value of enterprise data. Previously, AI models could only handle highly structured text data. But now, unstructured data—images, PDFs, audio files, and more—is also fair game. According to MIT, this unstructured data accounts for a staggering 80-90% of all enterprise data.

This “frontier data”—expert knowledge, workflow logs, multimedia assets, and so on—represents far more granular and domain-specific information than what’s publicly available on the internet. To put this in perspective, JPMorgan reportedly has 150 petabytes of data: a whopping 150 times the size of the dataset used to train GPT-4.

The challenge for enterprises lies in figuring out how to organize, process, and marshal this data to build AI solutions. In doing so, they must weigh the benefits of training their own AI models against the risks of sharing their IP with outside providers, some of whom have shown themselves to be less than scrupulous in their data-sourcing practices.

This focus on “frontier data” ties into a broader debate about the future of AI innovation: let’s call it the “scale maximalists” versus the “small-but-mighty” camp.

The first view, championed by foundation model companies like OpenAI, holds that scaling models trained on ever-larger datasets will drive ongoing breakthroughs. The underlying belief is that the sheer amount and diversity of data can compensate for the nuances of individual tasks. Given their privileged access to compute resources and first-mover advantage, these players have a strong economic incentive to continue pursuing this strategy.

The second view, more aligned with the privacy, security, and customization interests of enterprises, emphasizes efficiency and specialization over sheer scale. The goal here is to develop smaller, more targeted models that can achieve equally impressive results through retrieval and system design. Techniques like vectorized search, modular task decomposition, and iterative refinement (where models improve via their own generated answers) are central to this strategy. 

Think of it as the difference between a student who memorizes an entire textbook and one who knows how to effectively use it in an open-book exam. Research already suggests that this second approach can lead to better performance at a much lower cost.
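To make the open-book analogy concrete, here’s a minimal sketch of the retrieve-then-generate pattern in Python. Everything in it is illustrative: the documents, the bag-of-words “embedding,” and the prompt format are toy stand-ins for a real vector index and a compact, domain-tuned model, not any vendor’s actual implementation.

```python
# Minimal sketch of the "open-book" retrieval pattern described above.
# The embedding and generation steps are hypothetical placeholders.
from collections import Counter
import math

DOCUMENTS = [
    "Q2 churn rose 4% in the EMEA enterprise segment.",
    "The new onboarding flow cut support tickets by 18%.",
    "Contract renewals in healthcare require SOC 2 evidence.",
]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a vector model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank private documents by similarity to the query (vectorized search)."""
    q = embed(query)
    return sorted(DOCUMENTS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def answer(query: str) -> str:
    """Ground the model in retrieved context rather than memorized knowledge."""
    context = "\n".join(retrieve(query))
    # In practice, this prompt would be passed to a small, domain-tuned LLM.
    return f"Context:\n{context}\n\nQuestion: {query}"

print(answer("Why did churn change last quarter?"))
```

The point of the pattern is that the heavy lifting shifts from the model’s parameters to the system around it: the smaller model only needs to reason over the handful of documents the retriever surfaces.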

The convergence of analytics and AI

Both Databricks and Snowflake are now vying to build the ultimate enterprise AI platform: one capable of serving as the foundation for this “small-but-mighty” vision of AI. Their shared goal is to become the single source of truth for all of an organization’s data and use this position to power intelligent applications across every business function.

However, while their destination is the same, Databricks and Snowflake began their journeys from very different starting points. 

As I discussed in depth last June, Databricks emerged from the open-source Apache Spark project and initially focused on serving the needs of data scientists and ML engineers. Its big data processing capabilities made it a natural fit for AI and data science workloads. Snowflake, by contrast, built its early success around a SQL-centric architecture and tight integration with BI tools, catering to data analysts and traditional IT departments with a closed, “it just works” solution.

But as the generative AI revolution has accelerated, the lines between these once distinct domains have blurred. Building generative AI applications requires the ability to manage and process data (a traditional analytics skill) along with the ability to experiment with and fine-tune models (a data science skill). The worlds of analytics and AI are rapidly converging.

Databricks anticipated this convergence early and bet big on its “lakehouse” architecture, which aims to combine the best of data lakes and data warehouses. This AI-friendly approach can efficiently store and process massive amounts of structured and unstructured data. Snowflake, despite its success in BI, was slower to adapt to the rising importance of AI. As the market shifted towards AI-centric use cases, it found itself falling behind, with support only for structured and semi-structured data.

Over the past year, Databricks has aggressively pressed its AI advantage, releasing a steady drumbeat of new capabilities and investing heavily in R&D. Meanwhile, Snowflake has been rushing to reposition itself, with a flurry of acquisitions and product announcements aimed at closing the gap. Its new CEO, Sridhar Ramaswamy, a former Google executive and co-founder of search startup Neeva (which Snowflake acquired in 2023), came in with a clear mandate to amp up the company’s AI capabilities.

The numbers paint a stark picture of the shifting landscape. At its summit, Databricks announced that annualized revenue will hit $2.4B by July, up 60% from a year earlier. While still trailing Snowflake’s revenue, Databricks is growing nearly twice as quickly and has seen its private market valuation jump to $43B, neck-and-neck with Snowflake’s market capitalization of $43.6B as of mid-June.

Open source as a strategic front

As the rivalry between Databricks and Snowflake intensifies, a new front has opened up around the strategic role of open source.

Databricks again has the upper hand here, thanks to its open-source roots and ongoing contributions to projects like Delta Lake. As I discussed at the outset, its recent acquisition of Tabular, founded by the creators of Apache Iceberg, dealt a blow to Snowflake, which was also bidding for the company. While Snowflake has begun embracing open formats, it remains at a philosophical disadvantage to Databricks’ “open by default” ethos.

But the fight to house more enterprise data goes beyond table formats. Both Databricks and Snowflake are also releasing their own open-source AI models. Databricks recently unveiled DBRX, a new state-of-the-art open-source LLM, and ImageAI, a text-to-image model trained exclusively on Shutterstock’s image repository. Snowflake followed suit with its own Arctic LLM.

By providing affordable, high-quality open-source models that are integrated into their platforms, Databricks and Snowflake hope to attract more customers and developers seeking to build AI-powered applications. The big selling point is that their users will own these models, allowing them to avoid the risks of relying on third-party providers like OpenAI.

From monolithic models to multi-agent systems

These open-source skirmishes are ultimately just fronts in a larger war. The end goal for both Databricks and Snowflake is to become the dominant platform for a new type of intelligent application, where multiple AI models work together in sophisticated systems.

While incredibly powerful, AI models are essentially static, mapping inputs to outputs without the ability to understand broader contexts and objectives. By contrast, agents are dynamic AI systems that leverage models as tools to take actions. Agents can autonomously break down complex tasks into steps, delegate each step to the appropriate model or tool, and iteratively refine the results until the overarching objective is met. This contextual awareness and adaptability sets agents apart from traditional, rigidly defined AI pipelines.

The real magic happens when multiple specialized models and agents are combined into collaborative systems. These compound AI systems make use of a spectrum of architectures, ranging from basic chains with fixed steps and limited feedback to dynamic, agent-driven approaches that can tackle open-ended tasks with human-like goal-orientation and context-sensitivity.
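To illustrate how this differs from a fixed pipeline, here’s a toy sketch of an agent-style loop in Python: a planner decomposes a goal into steps, each step is routed to a specialized tool, and a critic decides whether another pass is needed. The planner, tools, and critic here are hypothetical stand-ins, not the interface of any actual platform.

```python
# Toy sketch of a compound, agent-style loop: decompose a goal into steps,
# delegate each step to a specialized "tool," and iterate until a check passes.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "sql": lambda step: f"[rows returned for: {step}]",
    "summarize": lambda step: f"[summary of: {step}]",
    "email": lambda step: f"[draft email about: {step}]",
}

def plan(goal: str) -> list[tuple[str, str]]:
    """A real planner would be an LLM; here the decomposition is hard-coded."""
    return [
        ("sql", f"pull churn data for '{goal}'"),
        ("summarize", "explain the main churn drivers"),
        ("email", "send findings to the account team"),
    ]

def critic(results: list[str]) -> bool:
    """Stand-in for an evaluation step (another model, a test, or a human)."""
    return len(results) == 3

def run_agent(goal: str, max_iters: int = 3) -> list[str]:
    results: list[str] = []
    for _ in range(max_iters):
        results = [TOOLS[tool](step) for tool, step in plan(goal)]
        if critic(results):  # stop once the overarching objective is met
            break
    return results

print(run_agent("Q2 enterprise churn"))
```

The design choice that matters is the loop itself: unlike a fixed chain, the system can re-plan and re-run steps based on feedback, which is what gives compound systems their adaptability.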

Databricks has emerged as the clear frontrunner, thanks to its $1.3B acquisition of MosaicML and the launch of its Mosaic AI platform. This platform allows users to assemble, manage, and orchestrate networks of models, agents, data sources, and other tools via an intuitive visual interface.

This tight integration of models, agents, and tools marks an important shift from the prevailing paradigm of building AI applications around monolithic models that require careful prompt engineering. Instead, Mosaic allows users to program complex “systems of systems” that can take action to achieve high-value business goals.

Despite its investment in tools like Arctic and Cortex AI, Snowflake remains several steps behind in providing the comprehensive, unified platform needed to fully realize this vision. As the battle for AI platform dominance heats up, the ability to enable this powerful new software paradigm may prove decisive.

What this means for the F500

A recent BCG survey of 1400+ C-suite executives found that 89% rank AI and generative AI among their top three tech priorities for 2024. This laser focus on AI is unsurprising, given this technology’s immense potential to drive economic growth. According to research by McKinsey and PwC, AI could increase global GDP by a staggering $13-15.7T (that’s T for trillion) by 2030. 

Despite the clear imperative to embrace AI, the BCG survey reveals that many organizations have been slow to act, with 90% being merely “observers” of AI adoption. Their sluggishness poses a significant risk, as the AI revolution is poised to create winners and losers in every industry.

The companies that pull ahead will be those that not only leverage the large-scale models developed by “scale maximalists” like OpenAI but also harness their private data to build targeted, “small-but-mighty” models tailored to their specific needs. They’ll deploy multi-agent systems to drive more advanced productivity and automation use cases. They’ll then reinvest these gains into new AI-driven revenue streams, fueling further growth and market share capture.

Databricks and Snowflake are positioning themselves as the key enablers and toll collectors of this $13T+ AI opportunity. Their rivalry is a microcosm of a much broader battle that will play out across the economy, as companies in all industries fight to secure dominance in the AI-driven future. 

While the results may not be obvious for a few years, the decisions that companies make today around their AI strategy will determine their ultimate fate. In the end, regardless of who comes out on top, consumers and society will be the ultimate beneficiaries.

