The AI Agent Infrastructure Stack: Who's Building the Picks & Shovels
A deep-dive market map of the $100B+ emerging category — orchestration, memory, evals, security, and where the real moats are
Executive Summary
The gold rush metaphor is clichéd but accurate: while model companies compete for the throne, the real durable money in AI may be in the picks and shovels. We're at the inflection point where "AI features" are becoming "AI agents" — and agents need infrastructure that simply didn't exist two years ago.
The shift matters because agents change the failure modes. A chatbot that hallucinates is annoying. An agent that hallucinates, then calls your Stripe API, then emails a customer, is a liability. That gap between "LLM demo" and "production-grade autonomous system" is where a $260B+ market is being born.
Three takeaways from this report:
Orchestration is already commoditizing. LangGraph, CrewAI, AutoGen, and now OpenAI's own Agents SDK are converging on similar primitives. The orchestration layer will be infrastructure — valuable but margin-thin. The durable plays are one level up (evals) and one level down (memory + execution environment).
Memory and observability are the under-appreciated moats. Mem0's $24M Series A (Oct 2025), Arize's $70M Series C (Feb 2025), and Braintrust's $80M Series B (Feb 2026) signal where sophisticated buyers are spending. You can't swap out your eval dataset or your agent memory graph easily — that's the stickiness.
The model companies are moving up the stack, and it's accelerating. Anthropic's acquisition of Vercept (Feb 2026) for computer-use/agent capabilities, OpenAI's Agents SDK, and both companies' launch of hosted tool-use infrastructure signal a compression risk for pure-play orchestration vendors. Security, memory, and specialized deployment are the safer bets.
Market Overview
From Features to Agents: A New Infrastructure Requirement
For three years, "AI" in enterprise meant one thing: wrapping an LLM call in a product feature. Summarize this document. Draft this email. Classify this ticket. The infrastructure requirement was minimal — an API key, a prompt template, maybe a vector store for RAG.
Agents break that model entirely. An agent doesn't just respond to a prompt; it plans, executes sequences of actions, uses external tools, recovers from failures, and operates over time horizons measured in minutes or hours, not milliseconds. This creates an entirely new set of infrastructure requirements:
Durability: What happens when a 4-hour task hits a network timeout at hour 3?
Memory: How does an agent accumulate context across sessions without burning a 200K token window every call?
Tool access: How do agents authenticate against external systems, scope permissions, and avoid catastrophic side effects?
Observability: When an agent produces a wrong result, how do you trace back through a 40-step execution to find where it went wrong?
Security: How do you prevent prompt injection from malicious web content hijacking your agent mid-task?
None of these questions has an adequate answer in standard cloud infrastructure. That gap is the market.
The infrastructure layer that enables this market — orchestration, memory, evals, deployment, security — typically captures 15-25% of the spend in any platform shift. Even at the conservative end, that's a $10-15B infrastructure market emerging by 2030.
For context, the broader AI infrastructure market (GPU clouds, MLOps, etc.) was ~$65B in 2025. Agentic-specific tooling is being carved out from that as a distinct segment, currently estimated at $5-7B but growing 3-4x faster than general AI infra.
Why 2025-2026 Was the Inflection
Three things converged:
Model capability crossed a threshold. GPT-4o, Claude 3.5/3.7 Sonnet, and Gemini 2.0 achieved function-calling reliability high enough (~85-90%+ task completion on complex benchmarks) to make multi-step agents viable at scale — not just demos.
Context windows got long enough. 100K-200K context windows mean agents can carry substantial working memory without hitting walls. Combined with improved long-context recall, this enabled genuine multi-step planning.
Enterprise demand arrived. Q4 2025 saw Fortune 500 companies moving from "AI pilot" to "AI agent" budget lines. McKinsey's 2025 AI survey showed 60%+ of enterprise AI investment shifting from predictive models toward generative/agentic systems. Procurement teams started asking for agent infrastructure, not just LLM access.
The Stack: Layer-by-Layer Breakdown
Layer 1: Orchestration & Workflow
The orchestration layer is where most developer attention has landed — and where commoditization is furthest along.
What to watch: OpenAI's Agents SDK (launched March 2025) is the wildcard. It's simpler than LangGraph and integrates natively with the OpenAI ecosystem. If you're building on GPT-4o, it may be enough — which squeezes pure-play orchestration vendors.
Temporal is the outlier in this layer. At a $5B valuation and backed by a16z's $300M Series D in February 2026, they're not really an AI company — they're a durable execution platform that happens to be the right primitive for long-running agents. Their approach (write application logic, let the runtime handle retries/failures/state) solves the hardest agent reliability problems. They claim "thousands of independent AI projects" and adoption by major AI labs.
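Temporal's actual SDK is more involved, but the core durable-execution idea — checkpoint each step so a crashed task resumes instead of restarting — can be sketched generically. This is a toy illustration of the pattern, not Temporal's API; the JSON checkpoint format is invented for the example:

```python
import json
import os

def run_durably(steps, state_path):
    """Run a list of (name, fn) steps, checkpointing progress to disk.

    If the process dies mid-run, calling run_durably again resumes from
    the first incomplete step instead of restarting from zero -- the
    property durable-execution runtimes provide transparently, without
    the application author writing checkpoint code at all.
    """
    # Load prior progress, if any.
    state = {"done": 0, "results": []}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)

    for i in range(state["done"], len(steps)):
        name, fn = steps[i]
        result = fn()                     # the side-effecting work
        state["results"].append(result)
        state["done"] = i + 1
        with open(state_path, "w") as f:  # checkpoint before moving on
            json.dump(state, f)

    return state["results"]
```

The point of the pattern: a 4-hour agent task that dies at hour 3 replays only step 4 onward, and already-executed side effects are never re-run.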
Verdict: Orchestration is becoming a commodity. The exception is Temporal, which is solving a genuinely hard distributed systems problem — not just a framework ergonomics problem.
Layer 2: Memory & Context
Memory is the layer that separates "chatbot with tools" from "agent that actually learns." Most production agents today are stateless across sessions — every conversation starts fresh. That's a massive limitation for anything intended to be a long-running assistant or autonomous worker.
The memory problem has two halves. Short-term memory (within-session context management) is essentially solved by long-context models. Long-term memory (cross-session learning, user preference retention, factual grounding) remains genuinely hard.
What makes memory valuable as a business: it's deeply sticky data. An agent's memory graph (who its user is, what they've done, what they prefer) becomes harder to migrate the longer it accumulates. Mem0 is betting that every AI product company will eventually need a memory layer, and that it's better to be the infrastructure than to build it yourself.
Vector DBs as memory (Pinecone, Weaviate, Chroma) serve agents as retrieval layers but are not memory systems per se — they store and retrieve, but don't synthesize, update, or maintain temporal coherence the way purpose-built memory systems do. Expect consolidation pressure here.
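The "synthesize, update, maintain temporal coherence" distinction is concrete: a memory layer reconciles new observations with old ones rather than just appending embeddings. A toy sketch of that behavior (class and method names are invented for illustration, not any vendor's API):

```python
from datetime import datetime, timezone

class MemoryStore:
    """Toy long-term memory: keyed facts that get updated, not appended.

    Unlike a plain vector store, writing a contradicting fact replaces
    the old one and keeps a timestamped history -- the temporal
    coherence that purpose-built memory layers maintain.
    """
    def __init__(self):
        self._facts = {}    # (user, key) -> current value
        self._history = []  # audit trail of every write

    def remember(self, user, key, value):
        now = datetime.now(timezone.utc).isoformat()
        old = self._facts.get((user, key))
        self._facts[(user, key)] = value
        self._history.append({"user": user, "key": key,
                              "old": old, "new": value, "at": now})

    def recall(self, user, key):
        return self._facts.get((user, key))

mem = MemoryStore()
mem.remember("client_x", "report_format", "pdf")
mem.remember("client_x", "report_format", "xlsx")  # preference changed
```

After the second write, `recall` returns the current preference while the history preserves the old one — which is also why the data becomes so hard to migrate.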
Layer 3: Tool Use & MCP
If memory is the "what has happened" layer, tool use is the "what can happen" layer. The Model Context Protocol (MCP), released by Anthropic in November 2024 and rapidly standardized through 2025, has become the most important protocol development in agent infrastructure.
What MCP is: A standard JSON-RPC protocol that allows AI models to discover and call external tools (databases, APIs, file systems, services) in a structured, permissioned way. Think of it as USB for AI — a universal interface so models don't need custom integrations for every tool.
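Concretely, an MCP tool invocation is a JSON-RPC 2.0 request whose method is "tools/call". A sketch of the wire shape (the tool name and arguments below are invented for illustration; the MCP specification defines the exact fields):

```python
import json

# Shape of an MCP tool invocation: a JSON-RPC 2.0 request naming the
# tool and its arguments. Tool name and arguments here are invented.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_database",
        "arguments": {"sql": "SELECT count(*) FROM tickets"},
    },
}
wire = json.dumps(request)  # what actually crosses the transport
```

Because every tool speaks this one envelope, a model that can emit it can drive any MCP server — the "universal interface" claim in practice.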
Adoption in 2025-2026:
Microsoft integrated MCP into Copilot Studio (May 2025) — a major legitimization signal
Google announced official MCP support for Google services (Dec 2025), including Drive, Docs, Gmail
AWS launched MCP connectors for core services
Thousands of community-built MCP servers now exist across GitHub
Why it matters for the stack:
MCP shifts power to whoever owns the tool integrations. Companies that build the best MCP servers for enterprise software (Salesforce, SAP, ServiceNow) will be deeply embedded in agentic workflows.
It commoditizes one-off integrations — no moat in building a single MCP connector. The moat is in curated, secure, high-quality MCP marketplaces or specialized connectors for specific verticals.
MCP + computer use (Anthropic's Vercept acquisition) represents the most dangerous layer for enterprise software incumbents — agents that can operate any software interface without a custom integration.
Companies to watch: There is no clear winner yet in the MCP tooling/marketplace space. This is a genuine whitespace opportunity (see Section 6).
Layer 4: Evals & Observability
Evals are the unsexy category that's printing money. Every team shipping agents has the same problem: "My agent works in testing and fails in prod, and I don't know why." Observability and evaluation tooling exists to solve that.
The critical insight: Braintrust's $80M Series B in February 2026 is a strong signal. Their thesis is that evals are not a nice-to-have — they're the control plane for AI products. As AI systems grow more autonomous, the ability to detect regressions, measure quality, and run systematic experiments becomes the difference between teams that can ship fast and teams that get burned.
The valuation question is interesting: Braintrust at a post-Series-B implied valuation likely north of $500M, Arize at $400M+ after Series C. Neither is cheap. But the stickiness is real — your eval datasets, your scoring rubrics, your historical traces — these are genuinely hard to migrate.
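The stickiness claim is concrete: a regression eval is your golden dataset plus a scorer plus a baseline, re-run on every change. A minimal sketch of that loop (no vendor's API implied; the toy agent and dataset are invented):

```python
def run_evals(agent_fn, golden_set, baseline_score=None):
    """Score an agent against a golden dataset and flag regressions.

    golden_set: list of (input, expected) pairs.
    Returns (score, regressed), where score is the exact-match rate
    and regressed is True if it fell below the stored baseline.
    """
    correct = sum(1 for x, expected in golden_set if agent_fn(x) == expected)
    score = correct / len(golden_set)
    regressed = baseline_score is not None and score < baseline_score
    return score, regressed

# Toy "agent": uppercases its input. The last golden case will fail.
golden = [("ok", "OK"), ("hi", "HI"), ("no", "NO!")]
score, regressed = run_evals(str.upper, golden, baseline_score=1.0)
```

A year of accumulated golden sets, rubrics, and baselines is exactly the asset that doesn't migrate when you switch eval vendors.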
Layer 5: Agent Deployment & Hosting
Deploying agents is different from deploying APIs. Agents need: isolated execution environments (an agent browsing the web shouldn't be able to escape to your production network), long-running compute (not just millisecond API calls), state persistence across task execution, and the ability to run code safely.
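A minimal flavor of the isolation requirement: run agent-generated code in a separate process with a wall-clock timeout, so the host survives infinite loops and crashes. This is a sketch of the weakest useful isolation, not a real sandbox — production systems add network, filesystem, and memory isolation on top:

```python
import subprocess
import sys

def run_untrusted(code: str, timeout_s: float = 5.0) -> str:
    """Execute agent-generated Python in a separate process.

    A subprocess with a timeout protects the host from hangs and
    crashes in the agent's code, but nothing more. Real agent
    sandboxes add network isolation, filesystem jails, and resource
    limits on top of this.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.stdout
    except subprocess.TimeoutExpired:
        return "<killed: timeout>"
```

The gap between this sketch and a product is precisely the layer the sandboxing vendors sell.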
Cloud provider plays: AWS, GCP, and Azure are all moving here. AWS Bedrock Agents, GCP Agent Builder, and Azure AI Studio all provide managed agent runtimes. The question for startups: can you out-execute the clouds on specific needs (security, cost, developer experience) long enough to build switching costs?
E2B is the most interesting pure play. Their July 2025 Series A was raised specifically to build "the cloud designed for AI agents": isolated, sandboxed environments where agents can execute code, browse the web, and manage files without escaping into host infrastructure. Their customer list (Cursor, Vercel, unnamed Fortune 100s) suggests real traction. The security story is their differentiation: shared-nothing architecture, microsecond spin-up, automatic isolation.
Layer 6: Security & Guardrails
Security is the layer that most agent builders underestimate until they get burned. Prompt injection — where malicious content in agent-retrieved data hijacks the agent's instructions — is the most pressing threat. Beyond that: privilege escalation, data exfiltration, unauthorized actions.
This category is early. No company has cracked the full problem: (1) input sanitization, (2) output validation, (3) behavioral constraints on what the agent can do, (4) audit trails for compliance. The companies above address pieces of it.
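One of those pieces, behavioral constraints, reduces to a policy check before every side-effecting action. A toy sketch of the idea (action names and the policy shape are invented for illustration):

```python
# Toy behavioral guardrail: every side-effecting action an agent
# proposes is checked against an explicit policy before execution.
# Action names and the policy shape are invented for illustration.
POLICY = {
    "send_email":    {"allowed": True,  "requires_approval": True},
    "read_document": {"allowed": True,  "requires_approval": False},
    "issue_refund":  {"allowed": False, "requires_approval": True},
}

def check_action(action: str) -> str:
    """Return 'allow', 'needs_approval', or 'deny' for a proposed action."""
    rule = POLICY.get(action)
    if rule is None or not rule["allowed"]:
        return "deny"                    # default-deny unknown actions
    if rule["requires_approval"]:
        return "needs_approval"          # human in the loop
    return "allow"
```

The default-deny on unknown actions is the important design choice: a prompt-injected agent inventing a new action name gets a refusal, not a free pass.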
The business driver: as agents get authorized to take real-world actions (send emails, process payments, modify databases), the liability from agent misbehavior becomes intolerable without guardrails. GDPR and SOC2 compliance for agent actions is an emerging requirement. Expect significant funding in this category through 2026-2027.
The Consolidation Question
What's Commoditizing vs. What Has Durable Moats
Already commoditizing:
Basic orchestration — LangGraph, CrewAI, AutoGen, and OpenAI's SDK are all capable. Framework choice is increasingly a preference, not a strategic decision.
Vector storage for RAG — Pinecone, Weaviate, Chroma, Qdrant, and pgvector are all fine. The market won't support five premium vector DB companies at scale.
LLM routing/gateways — Portkey, LiteLLM, and similar tools provide model routing/caching. Important but unlikely to be a standalone large business.
Durable moat potential:
Memory graphs — Switching costs are high once an agent's behavioral history lives somewhere. Network effects from usage data.
Eval datasets and traces — A year of eval runs, golden datasets, regression baselines. These don't transfer easily.
Durable execution (Temporal) — Deep integration into application logic; high switching cost.
Security signatures — Lakera's threat models improve with every attack seen. This is a data moat.
Acquisition Targets
The most likely acquisition targets in 2026-2027:
Memory players (Mem0, Zep) — Obvious targets for LangChain, major model companies, or cloud providers wanting full-stack agent platforms
Eval platforms — Braintrust or Arize as "AI quality cloud" acquisitions for AWS/GCP/Azure; or by LangChain to lock in their ecosystem
E2B — Acquisition target for cloud providers wanting to add agent-native sandboxed execution to their platforms
Model Companies Moving Up the Stack
Anthropic is the most aggressive. The Vercept acquisition (February 2026) — a computer-use/agent task automation startup — is unambiguous: they want to own the top of the agent stack, not just the model layer. Combined with MCP (their protocol), Claude computer use capabilities, and their internal agent work, Anthropic is building a vertical stack.
OpenAI launched the Agents SDK (March 2025), the Responses API, and has Operator (their consumer agent product). They're also a threat to pure-play orchestration vendors.
The correct frame: model companies are building default runtimes for their own models. If you're a pure-play orchestration framework, you're competing with first-party tooling that's "free" and co-optimized. The response is either (a) go model-agnostic and add value beyond orchestration, or (b) find a layer where model companies don't want to compete (specialized deployment, vertical-specific memory, security).
Investment Activity
Active Investors
a16z — Led Temporal's $300M Series D; explicitly focused on "AI agent infrastructure" as a thesis (see their Feb 2026 Temporal announcement by Yoko Li and Martin Casado)
Sequoia — Led LangChain's Series B; published investment memo framing agent engineering as a "generational platform shift"
Benchmark, GV, Felicis — Active across evals and deployment categories
NVIDIA Ventures — Strategic investments and open-source bets across the security/guardrails layer
Valuation Multiples
Based on Q1 2026 market data:
Orchestration/platform plays: 15-25x ARR (LangChain at $1.25B on estimated $50-80M ARR)
Infrastructure pure-plays (Temporal): 20-30x ARR or forward revenue
Observability/evals: 10-20x ARR (Arize, Braintrust)
Security: 15-25x ARR (immature but growing)
For context, public DevOps/observability comps (Datadog, Dynatrace) trade at 10-18x forward revenue. AI infrastructure commands a premium for now, but expect compression as growth rates normalize.
Opportunities & White Space
What's Still Unsolved
1. Agent-to-Agent trust and communication
MCP handles model-to-tool communication, but multi-agent systems have no standardized trust or communication fabric. When Agent A instructs Agent B to perform an action, how does B verify A's authorization scope? How do you build auditability? The emerging A2A (Agent-to-Agent) protocol from Google is a start, but this is genuinely unsettled territory.
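One plausible primitive for that verification is a signed capability token: Agent A's request carries a token whose scope Agent B can check. A sketch under simplifying assumptions — HMAC with a shared secret, no expiry; real systems would use public-key signatures, expiry, and delegation chains:

```python
import hashlib
import hmac
import json

SECRET = b"demo-shared-secret"  # illustration only; real systems use PKI

def mint_token(agent_id: str, scopes: list[str]) -> str:
    """Issue a capability token binding an agent to an action scope."""
    claims = json.dumps({"agent": agent_id, "scopes": sorted(scopes)})
    sig = hmac.new(SECRET, claims.encode(), hashlib.sha256).hexdigest()
    return claims + "." + sig

def verify_scope(token: str, required_scope: str) -> bool:
    """Check that the token is untampered and grants the required scope."""
    claims, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, claims.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False                     # forged or tampered token
    return required_scope in json.loads(claims)["scopes"]
```

The open questions the protocol work has to settle are everything this sketch waves away: who mints, how delegation narrows scope, and how every check lands in an audit log.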
2. Deterministic planning and task decomposition
The hardest problem in agents is still planning: breaking a complex goal into reliable subtasks. Current agents fail unpredictably on novel task structures. A "task compiler" that could analyze a goal, generate a reliable execution plan, and check the plan against known failure modes would be enormously valuable. This is research-adjacent but close enough to product that a strong team could build it now.
3. Agent compliance and audit infrastructure
Enterprises can't deploy agents that take consequential actions without audit trails, approval workflows, and compliance reporting. Who authorized what? What did the agent do? Can you replay the execution? This exists in basic form (LangSmith traces, etc.) but not as a compliance-grade product that satisfies SOC2, GDPR, or financial services requirements.
4. Cross-agent memory and shared context
In a multi-agent system, how does institutional knowledge propagate? If Agent A learns that Client X prefers a certain format, Agent B shouldn't need to re-learn that. Shared memory layers with fine-grained access control are not solved.
Three Startup Ideas
1. Enterprise Agent Governance Platform
The missing product: a control plane that sits above multiple agents, enforces policy, maintains audit logs, provides approval workflows for high-risk actions, and generates compliance reports. Think PagerDuty or Okta but for agent operations. The wedge is regulated industries (financial services, healthcare, legal) where every agent action needs to be explainable and auditable. Revenue model: per-agent-action pricing + compliance report SaaS.
2. MCP Marketplace & Security Layer
The MCP ecosystem is growing faster than any security or quality control layer. A marketplace that hosts curated, security-scanned, enterprise-grade MCP servers — plus a runtime that enforces permissions, rate limits, and audit logs — would capture significant value as enterprises standardize on MCP. The Amazon App Store / Okta Integration Network model: you don't build the tools, you own the distribution and trust layer. Monetize on transactions + enterprise subscriptions.
3. Agent Testing & Synthetic Environment
QA for agents looks simple in principle: run the agent against your test cases. But agents that interact with real-world systems (Salesforce, email, web browsers) are expensive and risky to test against production. A company that builds high-fidelity synthetic environments — fake CRMs, simulated email inboxes, synthetic web content — that let teams stress-test agents before production deployment would fill a real gap. Think of it as "Playwright for multi-step agent workflows." The technical barriers are high (you need convincing fakes of major enterprise software) but so are the moats.
Conclusion & What to Watch
The AI agent infrastructure market is exactly where cloud infrastructure was in 2009: the category is real, the demand is accelerating, but the map is still being drawn. Most of the current tooling will be absorbed — by model companies, by cloud providers, or by each other.
The durable winners will be companies that: (1) own a layer with genuine switching costs (memory graphs, eval datasets, compliance audit trails), (2) solve a problem that model companies actively don't want to own (security, specialized deployment, vertical compliance), or (3) control a protocol-level abstraction like Temporal's durable execution or MCP's tool interface.
Watch closely through 2026:
Temporal's IPO trajectory — At a $5B valuation with $300M raised, they're the most credible path to a public AI infrastructure company in this cohort
Anthropic's stack-building pace — Post-Vercept, do they make additional acquisitions in memory or security? Every deal changes the threat landscape
MCP ecosystem maturity — The first company to build a defensible MCP marketplace/governance layer will capture significant value
Enterprise agent incidents — The first major public agent failure (a rogue agent causing real financial damage) will accelerate the security and guardrails category by 12-18 months overnight
The infrastructure layer beneath AI agents is not a sideshow — it's the foundation of every AI-native company built in the next decade.
AI Primitives is a B2B research newsletter on AI infrastructure and emerging tech. Issue #1, March 2026.

