08.13.2024 | By: Ashu Garg, Jaya Gupta
We’ve seen countless buzzwords come and go. “AIOps” is the latest in a long line of catchy but ultimately misguided terms that fail to capture the true potential of AI in the world of IT ops and observability.
Here’s why we believe we’re on the cusp of a fundamental shift in how organizations monitor, debug, and optimize their increasingly complex software systems.
The term “AIOps” implies simply layering AI on top of existing operations processes. This vastly undersells what could happen in this space: today’s solutions get us little further than automating a few alerts or providing slightly smarter dashboards. The entire paradigm of how we approach observability and root cause analysis is poised for disruption.
The observability market has been fragmented for years, with leading vendors like Datadog, New Relic, and Splunk rarely capturing more than 20% market share. Why? Because fundamentally, observability has been treated as a big data problem rather than an intelligence problem.
Modern distributed systems generate an astronomical amount of telemetry data – often petabytes per day. This data comes in heterogeneous formats: unstructured logs, structured metrics, and complex distributed traces. Each of these data types traditionally requires its own specialized storage and query engine, leading to a proliferation of tools and data silos.
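To make that heterogeneity concrete, here is a minimal sketch (field names are illustrative, not tied to any particular vendor’s schema) of what a single event looks like in each of the three shapes:

```python
# Illustrative only: field names are invented, not tied to any particular vendor schema.

# 1. A log line: unstructured (or loosely structured) text.
log_line = ('2024-08-13T10:42:07Z ERROR payment-service request_id=abc123 '
            '"charge failed: upstream timeout"')

# 2. A metric sample: a timestamped numeric value identified by a name and a set of labels.
metric_sample = {
    "name": "http_requests_total",
    "labels": {"service": "payment", "status": "500", "region": "us-east-1"},
    "timestamp": 1723545727,
    "value": 42,
}

# 3. A trace span: one node in a directed acyclic graph of calls, linked by IDs.
trace_span = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "parent_span_id": "53995c3f42cd8ad8",
    "name": "POST /charge",
    "start_time": 1723545727.001,
    "duration_ms": 912,
    "attributes": {"http.status_code": 500, "peer.service": "card-processor"},
}
```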
However, the core challenge isn’t collecting or storing massive amounts of telemetry data. It’s making sense of that data quickly enough to drive real business value.
The challenges:
1. Data Volume and Velocity: The sheer scale of data generation in modern systems is staggering. Real-time ingestion and indexing at this scale remain computationally expensive, pushing the limits of even advanced platforms like Elasticsearch or InfluxDB.
2. Heterogeneous Data Formats: Logs are typically unstructured text, metrics are time-series data, and traces form directed acyclic graphs. Each requires specialized tools: Elasticsearch for logs, Prometheus for metrics, and Jaeger for traces, for instance.
3. Lack of Unified Data Model: There’s no standardized way to correlate events across logs, metrics, and traces. While initiatives like OpenTelemetry aim to address this, adoption is still in early stages.
4. Query Complexity: Each observability tool has its own query language. Elasticsearch uses the Lucene query syntax, Prometheus has PromQL, and many tracing tools use SQL-like languages. Mastering these diverse query languages is a significant barrier for many teams.
5. High-Cardinality Problem: Modern microservices architectures lead to an explosion in the number of unique label combinations. Traditional time-series databases like InfluxDB or Prometheus struggle with high-cardinality data, often leading to performance issues or increased costs (a back-of-the-envelope sketch of the series explosion follows this list).
6. Alert Correlation: A single root cause often triggers cascading alerts across multiple systems. Correlating these alerts programmatically is an NP-hard problem, making automated root cause analysis extremely challenging.
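To see why cardinality explodes, consider a back-of-the-envelope sketch with purely illustrative numbers: every unique combination of label values becomes its own time series that the database must index and store.

```python
import math

# Hypothetical label dimensions for a single metric in a microservices fleet.
label_cardinalities = {
    "service": 200,       # number of microservices
    "endpoint": 50,       # routes per service (rough average)
    "status_code": 10,    # distinct HTTP status codes observed
    "pod": 30,            # pods per service at any moment
    "region": 5,
}

# Every unique combination of label values is a separate time series to index and store.
series_count = math.prod(label_cardinalities.values())
print(f"{series_count:,} potential series for ONE metric name")  # 15,000,000

# Multiply by hundreds of metric names, plus churn from pod restarts
# (each new pod ID mints fresh series), and index size becomes the bottleneck.
```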
The average enterprise juggles 7-10 different observability tools, each with its own query language and data model. This makes it incredibly difficult to get a holistic view of system health. Engineers often spend up to 30% of their time just triaging alerts, many of which are false positives or symptoms rather than root causes. Even worse, the Mean Time to Resolution (MTTR) for critical incidents still averages 4-5 hours and can often be days in exceptional cases.
Companies like Splunk, Elastic, and Grafana Labs have made strides in unifying some of these data types, but a truly integrated solution remains elusive. Other entrants like Honeycomb and Lightstep (now part of ServiceNow) have focused on high-cardinality data and distributed tracing, but the challenge of unifying all observability data persists.
Earlier forays into applying AI to observability, including efforts by established players like Dynatrace and AppDynamics, have often disappointed. The reasons are multifaceted and deeply technical.
Supervised learning approaches struggle with the lack of labeled training data for rare failure modes. Feature engineering across heterogeneous data sources proves to be a Herculean task, often failing to capture the complex interactions in distributed systems. Black-box models, while sometimes accurate, fail to provide the explanations necessary to gain the trust of DevOps teams.
Perhaps most challenging is the issue of concept drift. In the world of continuous deployment, system behavior is constantly evolving. Traditional machine learning models require frequent retraining to maintain accuracy, a luxury rarely afforded in fast-paced production environments.
LLMs offer a unified approach to data understanding. Their ability to process and correlate heterogeneous data types – logs, metrics, and traces – in their raw formats breaks down some of the silos that have plagued observability. The transformer architecture underlying LLMs excels at capturing long-range dependencies, crucial for understanding system-wide patterns. LLMs also bring the power of zero-shot and few-shot learning, meaning they can adapt to new failure modes without extensive retraining, addressing the perennial issue of concept drift in rapidly evolving systems.
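As a rough illustration of few-shot adaptation (the prompt and the `complete` callable below are hypothetical, standing in for whatever LLM API is in use), a new failure mode can be taught by example inside the prompt rather than by retraining:

```python
# A minimal sketch of few-shot adaptation: instead of retraining a classifier,
# new failure modes are taught to the model by example, inside the prompt.
# `complete` is a placeholder for whatever LLM completion API is in use.

FEW_SHOT_PROMPT = """You label production log excerpts with a failure category.

Log: "connection pool exhausted, 0/50 available, queue depth 1200"
Category: resource_saturation

Log: "leader election lost, stepping down, term=47"
Category: consensus_disruption

Log: "checksum mismatch on segment 0012, expected 9f3a got 1c77"
Category: data_corruption

Log: "{log_excerpt}"
Category:"""

def classify_log(log_excerpt: str, complete) -> str:
    """Label a log excerpt using few-shot examples rather than a retrained model."""
    return complete(FEW_SHOT_PROMPT.format(log_excerpt=log_excerpt)).strip()
```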
One of the most exciting developments is the introduction of natural language interfaces to observability. Imagine being able to ask, “Show me all HTTP 500 errors in the payment service correlated with high CPU usage in the last hour,” and getting an instant, accurate response. This democratizes access to powerful debugging capabilities without requiring expertise in multiple query languages.
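A minimal sketch of the plumbing such an interface implies, assuming a hypothetical `llm_translate` step that turns the question into per-backend queries:

```python
# Sketch of a natural language interface over existing backends. Names are illustrative;
# `llm_translate` stands in for an LLM call made with a schema-aware prompt.

question = ("Show me all HTTP 500 errors in the payment service "
            "correlated with high CPU usage in the last hour")

def answer(question, llm_translate, run_promql, run_lucene):
    """Translate one question into per-backend queries, then fan out and join."""
    plan = llm_translate(question)
    # A translated plan might look like (metric and label names are illustrative):
    # {
    #   "promql": 'rate(container_cpu_usage_seconds_total{service="payment"}[5m])',
    #   "lucene": 'service:payment AND status:500',
    #   "range":  {"from": "now-1h", "to": "now"},
    # }
    cpu_series = run_promql(plan["promql"], plan["range"])
    error_logs = run_lucene(plan["lucene"], plan["range"])
    return cpu_series, error_logs  # correlation of the two happens downstream
```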
LLMs can also provide context-aware analysis by ingesting not just telemetry data, but also system documentation, code repositories, and historical incident reports. This allows for reasoning that incorporates deep domain knowledge, going far beyond simple pattern matching.
We believe LLMs offer a path to truly automated root cause analysis. By understanding the complex causal relationships in distributed systems, they can rapidly correlate events across the entire stack to pinpoint the root cause, potentially reducing MTTR by an order of magnitude.
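One plausible shape for this, sketched with hypothetical helpers: assemble a cross-stack timeline around the incident and ask the model for a causal hypothesis rather than a flat list of correlated alerts.

```python
# A toy sketch of cross-stack correlation feeding root cause analysis. All names are
# hypothetical; `complete` stands in for an LLM completion API.

from datetime import datetime, timedelta

def build_incident_timeline(events, incident_start, window=timedelta(minutes=15)):
    """Collect deploys, alerts, log anomalies, and trace outliers near the incident."""
    lo, hi = incident_start - window, incident_start + window
    relevant = [e for e in events if lo <= e["timestamp"] <= hi]
    return sorted(relevant, key=lambda e: e["timestamp"])

def hypothesize_root_cause(timeline, complete):
    """Ask the model for the most likely root cause, with supporting reasoning."""
    rendered = "\n".join(f'{e["timestamp"].isoformat()} [{e["source"]}] {e["summary"]}'
                         for e in timeline)
    prompt = ("Given this ordered incident timeline, identify the most likely root cause "
              "and explain which earlier events support it:\n" + rendered)
    return complete(prompt)
```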
While the potential of LLMs in observability is immense, significant technical hurdles persist. Real-time processing, crucial in observability contexts, remains challenging due to current LLM inference latencies and costs. Moreover, the confidential nature of telemetry data, often containing PII in large companies, raises legitimate data privacy and security concerns.
LLMs currently struggle with both tabular and time series data, common formats in observability. Although we anticipate that innovations in newer architectures, multimodality, and multi-agent systems will mitigate some of these challenges over time, near-term solutions will require creative workarounds from builders.
Furthermore, while LLMs excel at identifying correlations, true root cause analysis often demands causal reasoning. A more promising direction lies in integrating LLMs with causal graphical models, bridging the gap between correlation and causation in complex systems.
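A toy sketch of that integration, with an invented service dependency graph: the DAG constrains which causal paths are possible, so correlation-based candidates that sit downstream of the symptom are rejected.

```python
# Pairing an LLM with a causal graph: the service dependency DAG defines which causal
# paths are possible, and only candidates upstream of the symptom are accepted.
# The graph and service names are invented for illustration.

# Edges point from cause to effect: a fault in `db` can propagate to `payment`.
CAUSAL_GRAPH = {
    "db": ["payment", "inventory"],
    "payment": ["checkout"],
    "inventory": ["checkout"],
    "cache": ["payment"],
}

def upstream_of(symptom: str, graph=CAUSAL_GRAPH) -> set:
    """Return every service with a causal path into the symptomatic service."""
    parents = {c for c, effects in graph.items() if symptom in effects}
    ancestors = set(parents)
    for p in parents:
        ancestors |= upstream_of(p, graph)
    return ancestors

# Candidate causes suggested from correlations (e.g. by an LLM over telemetry).
candidates = ["cache", "checkout", "db"]
symptom = "payment"
plausible = [c for c in candidates if c in upstream_of(symptom)]
print(plausible)  # ['cache', 'db'] -- 'checkout' is downstream of the symptom, so it is filtered out
```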
The term “AIOps” will soon feel as dated as “Big Data.” We’re moving beyond simply applying narrow AI to existing ops processes. The future is LLM-powered, intelligent, unified observability that fundamentally transforms how organizations build, run, and optimize their software systems.
The economic impact of this shift will be profound. By dramatically reducing MTTR, preventing outages, and freeing up engineering time, these technologies will be a force multiplier for software-driven innovation across industries. While Gartner’s predictions for AIOps are modest ($3.1B by 2025), we believe that automating SRE work is worth 50X that: a $100B+ opportunity.
For startups entering this space, success will require a rare combination of deep expertise in ML, LLMs and distributed systems, along with a keen understanding of the practical challenges faced by DevOps teams. The ability to ingest and process heterogeneous data at scale, provide explainable insights, and deliver immediate value will be crucial.
While there are a handful of startups going after this opportunity, we believe the playing field is wide open, and there will be multiple decacorns built in this category. If you are building in this space, email agarg@foundationcap.com and jgupta@foundationcap.com.