The AI evolution

Why Most Multi-Agent AI Systems Waste 90% of Their Time (And How to Fix It)

Divy — Mon, 15 Jun 2026 12:26:17 GMT

I got five AI agents running in parallel, each analyzing a codebase. Total wall-clock time barely improved over sequential.

The code was clean. The concurrency was correct. And the number barely moved. Each agent had spent the first 90 seconds installing analysis tools before running a single line of actual work.

You can solve the concurrency problem perfectly and still lose on setup time.

Five agents running in parallel still take only 90 seconds to set up. But across all five VMs, that's 450 seconds of compute spent repeating the same work. The clock doesn't slow down, but the infrastructure cost does.

The fix was not more threads.

It was a memory snapshot. Build the environment once, checkpoint the entire VM state (filesystem, memory, running processes), and fork all five agents from that single frozen moment. Each fork warm-restores rapidly with the tools already loaded. No re-installs. No cold boots.

Here is what that looks like, what took me three iterations to get right, and where it still has rough edges.

Let's get the mental model first.

What This Does (30 Seconds)

The idea is straightforward: instead of five agents each spending 90 seconds installing the same tools, install them once, freeze that environment, and stamp out five identical copies.

Each copy runs a different analysis in parallel. A lead LLM reads all five results and tells you what to fix first.

In code:

Creates one Linux VM, installs code analysis tools (bandit, radon) and writes a sample Python project
Freezes the entire VM state into a memory snapshot (filesystem, memory, running processes included)
Forks 5 independent copies, each agent assigned a different analysis task (Security, Complexity, Docstrings, Tests, Structure)
Runs all 5 in parallel via asyncio.gather, finishing in seconds instead of minutes
Feeds all results to a lead LLM that produces a single prioritized fix list

Setup time is paid once, upfront, before any agent runs. The rest of this article explains how.

Why Sandboxes Matter for Agent Workloads

If you have not worked with sandboxes before: think of one as a disposable computer that lives in the cloud.

You spin it up, run whatever code you need inside it, and throw it away when you're done. It has its own filesystem, its own processes, its own network. Nothing it does can touch your machine or any other sandbox running at the same time.

In short: sandboxes give the agent a secure, isolated environment.

That isolation is the whole point. Your agent can install packages, write files, crash badly, or spin up a browser, and none of it bleeds out. When the task is done, you terminate the VM and it is gone.

The next agent starts clean.

Most agent frameworks treat the execution environment as an afterthought. The LLM call is the interesting part. The environment is just "wherever the code runs."

That works fine for single-turn tasks. It breaks down fast for anything multi-step.

When an agent needs to install packages, write intermediate files, maintain a browser session across multiple pages, or resume a task from a different machine, you need the execution environment to behave like a persistent object, not a function call that resets on every invocation.

Tensorlake gives each agent a MicroVM backed by Firecracker and CloudHypervisor, optimized for fast boot times and strong isolation. Each sandbox is a full Linux VM. It boots in hundreds of milliseconds, persists filesystem and memory state across sessions, and can be snapshotted at any point in its lifecycle.

Tensorlake also lets you spin up multiple sandboxes in parallel for concurrent agent execution, and honestly it is one of my favourite things about it.

It also ranks in the top 5 of SandboxBenchmarks.

What changes the math is a single question: what does the snapshot actually capture?

Two Kinds of Snapshots. Very Different Behavior.

Quick vocabulary before the details. Tensorlake sandboxes have four lifecycle modes.

An ephemeral sandbox runs a task and disappears when done, with no name and no persistence between runs.
A named sandbox outlives the process that created it and can be suspended then reconnected to from any machine. Suspend freezes the VM exactly as it is and resume brings it back to that same state.
A snapshot is that frozen moment saved as a reusable artifact.
A fork is a snapshot restored into a fresh, independent VM.

This project uses the last two.

Suspend and snapshot both preserve state but serve different purposes: suspend pauses this sandbox so you can resume it later, while a snapshot is a reusable artifact for retrying from a checkpoint or cloning an environment.

Tensorlake supports two checkpoint types. Most tutorials only mention one.

CheckpointType.FILESYSTEM captures disk state only. Restore from it and the new sandbox does a full cold boot: processes restart from scratch, packages get re-imported. Your pip installs survive. Nothing that was in memory does.
CheckpointType.MEMORY is different. It captures disk state, VM memory, and all running processes. The restored VM resumes mid-stride, exactly as the source was at checkpoint time. No boot sequence. No re-initialization. If Python had already imported bandit, the fork starts with it loaded. The environment is not rebuilt. It is copied.

The checkpoint type is not a performance detail. It determines whether your fork is a clone or a restart.

The default when you call sandbox.checkpoint() with no arguments is filesystem. That is the wrong choice for a parallel swarm where agents share a prepared environment. You want memory.

One more constraint worth knowing upfront: for memory snapshots, resources (CPUs, RAM) are baked into the snapshot at checkpoint time. You cannot override them when creating forks. Set the right cpus and memory_mb on the base sandbox before you checkpoint. Every fork inherits them automatically.

The Architecture

The pattern has five distinct phases. Each one has a single responsibility.

Phase 1 — Base Snapshot: Spins up a single baseline sandbox, installs analysis tools (bandit, radon), writes the target code, and checkpoints the entire running VM state using CheckpointType.MEMORY. The base sandbox is then terminated, leaving behind the reusable snapshot ID.

Phase 2 — Agent Forking: Restores 5 independent sandboxes from the base snapshot by calling AsyncSandbox.create(snapshot_id=...). Each fork is a warm start that inherits all installed tools, environment settings, and target files.

Phase 3 — Sequential Baseline (Timing): Runs each agent's analysis script (analyze.py) one-by-one inside its respective sandbox to measure sequential time as a benchmark denominator.

Phase 4 — Parallel Swarm: Executes all 5 agents concurrently using asyncio.gather(...). Each agent runs the same analysis script inside its isolated sandbox but with a different focus configuration passed via the PERSPECTIVE environment variable.

Phase 5 — LLM Aggregation: Collects the individual reports (Security, Complexity, Docstrings, Tests, Structure) alongside the timing data, and passes them to the lead LLM (GPT-4o) to synthesize a single prioritized fix list.

Phase 1 runs once. Phases 2 through 4 run every time you want results. The fork is cheap. The base environment build is not, but you only pay that cost once per snapshot.

Phase 1: Build and Snapshot

The base sandbox installs the analysis tools, writes the target codebase into the VM, then snapshots the entire state. Every fork inherits both the tools and the target project automatically.

from tensorlake.sandbox import AsyncSandbox, CheckpointType

async def build_base_snapshot() -> str:
    async with await AsyncSandbox.create(
        name="base-swarm-env",
        cpus=2.0,
        memory_mb=2048,
        timeout_secs=600,
    ) as sandbox:

        # Install analysis tools. These are baked into the snapshot
        # and available to every forked agent at no extra install cost.
        result = await sandbox.run(
            "pip",
            ["install", "bandit", "radon", "--user", "--break-system-packages", "-q"],
            timeout=180,
        )
        if result.exit_code != 0:
            raise RuntimeError(f"pip install failed:\n{result.stderr}")

        # Write a sample Python project with intentional issues for agents to find.
        # All forks inherit this from the snapshot; no need to write per-agent.
        target_files = {
            "/workspace/target/auth.py": b'''
import subprocess
DB_PASSWORD = "hardcoded_secret_123"

def authenticate(user_input):
    return eval(user_input)

def run_command(cmd):
    return subprocess.call(cmd, shell=True)
''',
            "/workspace/target/logic.py": b'''
def classify(a, b, c, d, e, f, g, h):
    if a and b:
        if c or d:
            if not e and f:
                return "path_a"
            elif e and not f:
                return "path_b"
            elif g and h:
                return "path_c"
            else:
                return "path_d"
        elif g:
            return "path_e"
    return "path_f"
''',
        }
        for path, content in target_files.items():
            parent = "/".join(path.split("/")[:-1])
            await sandbox.run("mkdir", ["-p", parent])
            await sandbox.write_file(path, content)

        # Verify tools work before snapshotting.
        # A broken tool in the snapshot means broken forks.
        verify = await sandbox.run(
            "python3", ["-m", "bandit", "--version"]
        )
        if verify.exit_code != 0:
            raise RuntimeError(f"Tool verification failed:\n{verify.stderr}")

        snapshot = await sandbox.checkpoint(
            checkpoint_type=CheckpointType.MEMORY
        )

    # Context manager terminates the base sandbox here.
    if snapshot.status.value != "completed":
        raise RuntimeError(f"Snapshot failed: {snapshot.status.value}")

    return snapshot.snapshot_id

The async with pattern guarantees terminate() is called on exit, including on exceptions. Without it, any exception before a manual terminate() call leaves an orphaned VM running in the background. Tensorlake's async documentation shows this pattern explicitly.

result.exit_code comes from CommandResult, the SDK's return type for run(). It has stdout: str, stderr: str, and exit_code: int. Note that stdout is already a string, not bytes, so no .decode() is needed anywhere.

The status check after checkpoint(): SnapshotStatus is an enum, so .value gives you "completed", "in_progress", or "failed". The documentation shows checkpoint() returns a SnapshotInfo with a status field. Checking that status before proceeding is a useful defensive practice. I learned this after a failed snapshot left me debugging downstream agent failures.

Phase 2: Fork and Run an Agent

This is the actual fork. The call is AsyncSandbox.create(snapshot_id=snapshot_id). No special fork() method. No copy-on-write API. Just create() with a snapshot ID. Every call produces a fully independent VM starting from that snapshot's frozen state.

import json
import time

PERSPECTIVES = ["Security", "Complexity", "Docstrings", "Tests", "Structure"]

async def run_agent(agent_id: int, snapshot_id: str) -> AgentReport:
    perspective = PERSPECTIVES[agent_id % len(PERSPECTIVES)]
    t_start = time.time()

    # cpus and memory_mb intentionally omitted.
    # For MEMORY snapshots, resources are inherited from the snapshot
    # and cannot be overridden at restore time.
    async with await AsyncSandbox.create(
        snapshot_id=snapshot_id,
        allow_internet_access=False,  # code analysis is offline; no outbound needed
        timeout_secs=120,
    ) as sandbox:

        await sandbox.write_file(
            "/workspace/analyze.py",
            ANALYSIS_SCRIPT.encode("utf-8")
        )

        result = await sandbox.run(
            "python3",
            ["/workspace/analyze.py"],
            env={"PERSPECTIVE": perspective},
            timeout=60,
        )

    elapsed = time.time() - t_start

    if result.exit_code != 0:
        raise RuntimeError(f"Agent {agent_id} failed:\n{result.stderr}")

    output = json.loads(result.stdout.strip())
    return AgentReport(
        agent_id=agent_id,
        perspective=perspective,
        score=output["score"],
        finding=output["finding"],
        execution_time_s=elapsed,
    )

allow_internet_access=False is safe here because bandit and radon analyze source code and do not make network calls. This parameter is not locked by MEMORY snapshots. Tensorlake's networking documentation recommends disabling outbound internet access for untrusted code.

The dispatch script gets written fresh into each forked VM via sandbox.write_file(). Each agent's VM is fully isolated: writing to /workspace/analyze.py in fork 0 has no effect on fork 1. The target project files are already there, inherited from the snapshot.

Since result.stdout is already a Python string, json.loads(result.stdout.strip()) works directly. The .strip() handles the trailing newline from print() inside the sandbox.

Phase 3: Sequential First, Then Parallel

The sequential baseline exists for one reason: to give the speedup calculation a real denominator. Without it, you have a time with no context.

async def run_sequential(snapshot_id: str, count: int) -> SwarmResult:
    reports = []
    for i in range(count):
        reports.append(await run_agent(i, snapshot_id))
    return SwarmResult(mode="sequential", ...)

async def run_parallel(snapshot_id: str, count: int) -> SwarmResult:
    # asyncio.gather returns a list of results when awaited.
    reports = await asyncio.gather(
        *(run_agent(i, snapshot_id) for i in range(count))
    )
    reports.sort(key=lambda r: r.agent_id)
    return SwarmResult(mode="parallel", ...)

asyncio.gather is what Tensorlake's async documentation recommends for concurrent sandbox fan-out. The ThreadPoolExecutor approach works too (the sync Sandbox API supports it), but if you are already in an async context, gather is cleaner.
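The run_sequential and run_parallel functions above elide the timing fields. Here is one way that gap could be filled, as a minimal sketch rather than the demo's actual code. The AgentReport and SwarmResult shapes are assumptions inferred from the fields this article uses, and fake_agent is a hypothetical stand-in that sleeps instead of forking a sandbox.

```python
import asyncio
import time
from dataclasses import dataclass, field


@dataclass
class AgentReport:
    # Field names mirror the ones run_agent fills in; the class itself is assumed.
    agent_id: int
    perspective: str
    score: int
    finding: str
    execution_time_s: float


@dataclass
class SwarmResult:
    # Assumed shape: the article only shows mode, reports and total_time_s in use.
    mode: str
    reports: list = field(default_factory=list)
    total_time_s: float = 0.0


async def fake_agent(agent_id: int, work_s: float = 0.05) -> AgentReport:
    """Hypothetical stand-in for run_agent: sleeps instead of forking a VM."""
    start = time.perf_counter()
    await asyncio.sleep(work_s)
    return AgentReport(agent_id, "Demo", 100, "stub", time.perf_counter() - start)


async def timed_sequential(count: int) -> SwarmResult:
    # Agents run one after another, so the total is the sum of their times.
    start = time.perf_counter()
    reports = [await fake_agent(i) for i in range(count)]
    return SwarmResult("sequential", reports, time.perf_counter() - start)


async def timed_parallel(count: int) -> SwarmResult:
    # Agents run concurrently, so the total is close to the slowest agent.
    start = time.perf_counter()
    reports = list(await asyncio.gather(*(fake_agent(i) for i in range(count))))
    reports.sort(key=lambda r: r.agent_id)
    return SwarmResult("parallel", reports, time.perf_counter() - start)


if __name__ == "__main__":
    seq = asyncio.run(timed_sequential(5))
    par = asyncio.run(timed_parallel(5))
    print(f"sequential {seq.total_time_s:.2f}s, parallel {par.total_time_s:.2f}s")
```

Measuring wall-clock time around the whole batch, not by summing per-agent times, is what makes the parallel number honest: it includes scheduling and startup overhead.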

Phase 4: What the Analysis Script Does

The dispatch script runs inside each forked sandbox. It reads the PERSPECTIVE environment variable, routes to the right analysis function, and prints one JSON line to stdout. All five analyses are fully offline, with no network calls needed.

# ANALYSIS_SCRIPT — runs INSIDE each forked sandbox
import json, os, subprocess, ast, pathlib, sys

PERSPECTIVE = os.environ["PERSPECTIVE"]
TARGET = "/workspace/target"

def run_security():
    """bandit: find hardcoded secrets, unsafe eval, shell injection."""
    r = subprocess.run(
        ["python3", "-m", "bandit", "-r", TARGET, "-f", "json", "-q"],
        capture_output=True, text=True
    )
    try:
        data = json.loads(r.stdout)
    except json.JSONDecodeError:
        return {"score": 0, "finding": "bandit parse error"}
    issues = data.get("results", [])
    high = [i for i in issues if i.get("issue_severity") == "HIGH"]
    return {
        "issues": len(issues), "high": len(high),
        "score": max(0, 100 - len(issues) * 10),
        "finding": high[0]["issue_text"] if high else ("Minor issues" if issues else "Clean"),
    }

def run_complexity():
    """radon: cyclomatic complexity per function."""
    r = subprocess.run(
        ["python3", "-m", "radon", "cc", TARGET, "-j"],
        capture_output=True, text=True
    )
    try:
        data = json.loads(r.stdout)
    except json.JSONDecodeError:
        return {"score": 0, "finding": "radon parse error"}
    blocks = [b for file_blocks in data.values() for b in file_blocks]
    complex_blocks = [b for b in blocks if b.get("complexity", 0) > 5]
    avg = sum(b["complexity"] for b in blocks) / len(blocks) if blocks else 0
    top = f"{complex_blocks[0]['name']} (cc={complex_blocks[0]['complexity']})" if complex_blocks else "All within threshold"
    return {
        "functions": len(blocks), "complex_count": len(complex_blocks),
        "avg_cc": round(avg, 2),
        "score": max(0, 100 - len(complex_blocks) * 15),
        "finding": top,
    }

def run_docstrings():
    """ast: count functions and classes that lack docstrings."""
    total, documented = 0, 0
    for path in pathlib.Path(TARGET).rglob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                total += 1
                if ast.get_docstring(node):
                    documented += 1
    pct = int(documented / total * 100) if total else 100
    return {"total": total, "documented": documented, "score": pct,
            "finding": f"{documented}/{total} documented ({pct}%)"}

def run_tests():
    """Count test files relative to source files."""
    all_py = list(pathlib.Path(TARGET).rglob("*.py"))
    test_files = [f for f in all_py if f.stem.startswith("test_") or f.stem.endswith("_test")]
    ratio = len(test_files) / len(all_py) * 100 if all_py else 0
    return {
        "source_files": len(all_py), "test_files": len(test_files),
        "score": min(100, int(ratio * 2)),
        "finding": f"{len(test_files)}/{len(all_py)} files are tests ({ratio:.0f}%)",
    }

def run_structure():
    """ast: count functions, classes, imports across the codebase."""
    stats = {"functions": 0, "classes": 0, "imports": 0, "files": 0}
    for path in pathlib.Path(TARGET).rglob("*.py"):
        stats["files"] += 1
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):          stats["functions"] += 1
            elif isinstance(node, ast.ClassDef):           stats["classes"] += 1
            elif isinstance(node, (ast.Import, ast.ImportFrom)): stats["imports"] += 1
    fpr = stats["functions"] / stats["files"] if stats["files"] else 0
    return {**stats, "functions_per_file": round(fpr, 1),
            "score": min(100, int(fpr * 20)),
            "finding": f"{stats['functions']} functions across {stats['files']} files"}

dispatch = {
    "Security":   run_security,
    "Complexity": run_complexity,
    "Docstrings": run_docstrings,
    "Tests":      run_tests,
    "Structure":  run_structure,
}

fn = dispatch.get(PERSPECTIVE)

if fn is None:
    print(json.dumps({"error": f"Unknown perspective: {PERSPECTIVE}"}))
    sys.exit(1)

result = fn()
result["perspective"] = PERSPECTIVE
print(json.dumps(result))

Two things worth keeping when you adapt this.

Parameters via environment variables: sandbox.run(env={"KEY": "val"}) passes per-command variables and avoids shell escaping issues when values contain spaces or special characters. It also keeps the dispatch script stateless, with no hardcoded perspective names inside the script itself.

JSON to stdout: the orchestrator reads result.stdout.strip() and passes it directly to json.loads(). The script has one job: print exactly one valid JSON line. Any other stdout output (debug prints, progress bars) breaks the parse. Keep it strict.
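If you cannot guarantee a perfectly clean stdout, a more forgiving parser is one option. This is a sketch, not part of the demo: parse_agent_output is a hypothetical helper that treats the last non-empty line of stdout as the JSON payload and ignores anything printed before it.

```python
import json


def parse_agent_output(stdout: str) -> dict:
    """Parse the last non-empty stdout line as a JSON object (hypothetical helper)."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("agent produced no output")
    try:
        payload = json.loads(lines[-1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"last stdout line is not JSON: {lines[-1]!r}") from exc
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


if __name__ == "__main__":
    noisy = 'loading rules...\n{"score": 90, "finding": "Clean"}\n'
    print(parse_agent_output(noisy))
```

The trade-off: this tolerates stray debug prints, but it also hides them. The strict one-line contract is still the better default.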

Phase 5: Lead Agent Synthesis

After all five agents return, a single GPT-4o call synthesizes their findings into a prioritized action list.

from openai import OpenAI

def aggregate_with_llm(parallel: SwarmResult, sequential: SwarmResult) -> str:
    client = OpenAI()
    speedup = sequential.total_time_s / parallel.total_time_s

    reports_block = "\n".join(
        f"[{r.perspective}] Score: {r.score}/100 | {r.finding}"
        for r in parallel.reports
    )

    prompt = (
        "You are a senior engineering lead reviewing a parallel code analysis report.\n\n"
        f"Agent Findings:\n{reports_block}\n\n"
        "Benchmark:\n"
        f"  Sequential : {sequential.total_time_s:.2f}s\n"
        f"  Parallel   : {parallel.total_time_s:.2f}s\n"
        f"  Speedup    : {speedup:.2f}x\n\n"
        "Provide: overall codebase health score, top three issues to fix immediately "
        "(with file and severity), recommended next actions, and one sentence on what "
        "the parallel speedup means for running this at scale."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

The lead agent sees both the analysis findings and the timing benchmark in the same context. That is the reduce step in a map-reduce agent pattern: give the aggregator everything the workers produced, not just the domain data. The call is synchronous because there is nothing left to concurrently await at this point.

Where the Time Actually Goes

Both timelines contain the same agents doing the same work. What changes is when setup happens. These numbers are structural projections based on typical pip install times and sandbox warm-restore behavior, not measured results. Your numbers will vary by workload and network conditions. Run the demo to measure your case.

Without memory snapshots:

Agent 0: [setup ~90s][work ~8s]
Agent 1: [setup ~90s][work ~9s]
Agent 2: [setup ~90s][work ~8s]
Agent 3: [setup ~90s][work ~9s]
Agent 4: [setup ~90s][work ~8s]

Sequential total: ~490s
Parallel total:   ~100s  (setup still paid by each agent separately)

With memory snapshots (MEMORY type):

Base build:  [setup ~90s][checkpoint ~3s]  ← paid once, outside the loop
Agent 0: [warm fork ~1s][work ~8s]
Agent 1: [warm fork ~1s][work ~9s]
Agent 2: [warm fork ~1s][work ~8s]
Agent 3: [warm fork ~1s][work ~9s]
Agent 4: [warm fork ~1s][work ~8s]

Sequential total: ~48s
Parallel total:   ~10s

The speedup ratio looks similar on paper. The absolute time is not. At five agents the gap is 450 seconds versus 5 seconds of overhead. At fifty agents it is 4,500 seconds versus 50 seconds.

Setup time does not scale down with parallelism. It multiplies. The snapshot moves it outside the loop entirely.
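The projections above come from simple arithmetic. As a sketch, using the per-agent figures from the timelines (estimates, not measurements) and assuming ideal parallelism with no scheduling overhead:

```python
WORK_S = [8, 9, 8, 9, 8]  # per-agent work time, taken from the timelines above


def totals(per_agent_overhead_s: float, work_s: list[float]) -> tuple[float, float]:
    """Return (sequential_total, parallel_total) in seconds.

    Sequential sums every agent. Parallel is bounded by the slowest agent,
    assuming all agents start at the same moment.
    """
    per_agent = [per_agent_overhead_s + w for w in work_s]
    return sum(per_agent), max(per_agent)


cold_seq, cold_par = totals(90, WORK_S)  # every agent reinstalls its tools
warm_seq, warm_par = totals(1, WORK_S)   # every agent warm-forks the snapshot

print(f"cold: sequential {cold_seq}s, parallel {cold_par}s")
print(f"warm: sequential {warm_seq}s, parallel {warm_par}s")
```

The one-off base build (setup plus checkpoint) sits outside both totals, which is the whole point: it is paid once, not once per agent.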

The benchmark captures four numbers: sequential total time (the denominator), parallel total time (wall-clock from first fork to last return), speedup (sequential divided by parallel), and efficiency (speedup divided by agent count, multiplied by 100).

Efficiency is the one most benchmarks skip. A 4.2x speedup across five agents is 84% parallel efficiency: 16% is lost to fork startup, scheduling, and I/O contention. That number matters when you scale from five agents to fifty.
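Both derived metrics are one-liners. A minimal sketch, where the function name is mine and the 42.0 and 10.0 second inputs are made-up values chosen to reproduce the 4.2x example:

```python
def speedup_and_efficiency(
    sequential_s: float, parallel_s: float, agents: int
) -> tuple[float, float]:
    """Return (speedup, efficiency_percent) as defined in the text above."""
    if parallel_s <= 0 or agents <= 0:
        raise ValueError("parallel_s and agents must be positive")
    speedup = sequential_s / parallel_s
    efficiency = speedup / agents * 100
    return speedup, efficiency


speedup, efficiency = speedup_and_efficiency(42.0, 10.0, 5)
print(f"{speedup:.1f}x speedup, {efficiency:.0f}% efficiency")
```

An efficiency near 100% means adding agents is still nearly free. Watching it drop as the agent count grows tells you where fork startup or I/O contention begins to dominate.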

What the Code Does Not Handle

The demo covers the happy path. Three things to add before production:

LLM rate limits. Twenty or thirty concurrent agents all hitting the OpenAI API will trigger rate limit errors. The demo has no retry logic. Add exponential backoff before you scale.
Snapshot storage. Snapshots may incur charges depending on your plan. Use Sandbox.delete_snapshot(snapshot_id) when done. The demo has a CLEANUP_SNAPSHOT_ON_EXIT flag at the top of the file.
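Agent error isolation. If one run_agent() coroutine raises inside asyncio.gather, the exception propagates to the caller and the results from the other agents are lost. In production, pass return_exceptions=True to asyncio.gather (or catch exceptions inside run_agent) and handle errors per agent.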
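Two of these gaps are small enough to sketch. The code below is illustrative, not the demo's code: with_backoff retries a failing call with exponential backoff, and run_isolated collects per-agent failures instead of losing the batch. ConnectionError stands in for whatever rate-limit exception your LLM client raises, and flaky_agent is a hypothetical agent that fails on purpose.

```python
import asyncio
import random


async def with_backoff(make_call, retries: int = 4, base_delay_s: float = 0.01):
    """Retry an async callable with exponential backoff and a little jitter."""
    for attempt in range(retries):
        try:
            return await make_call()
        except ConnectionError:  # stand-in for a provider's rate-limit error
            if attempt == retries - 1:
                raise
            delay = base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s)
            await asyncio.sleep(delay)


async def flaky_agent(agent_id: int) -> str:
    """Hypothetical agent: agent 2 always fails, the others succeed."""
    await asyncio.sleep(0.01)
    if agent_id == 2:
        raise RuntimeError(f"agent {agent_id} crashed")
    return f"agent {agent_id} ok"


async def run_isolated(count: int):
    # return_exceptions=True makes gather hand back exceptions as values
    # instead of raising the first one and dropping the other results.
    results = await asyncio.gather(
        *(flaky_agent(i) for i in range(count)), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return ok, failed


if __name__ == "__main__":
    ok, failed = asyncio.run(run_isolated(5))
    print(f"{len(ok)} succeeded, {len(failed)} failed")
```

The delays here are tiny so the sketch runs quickly. Real rate limits usually need a base delay of a second or more.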

When to Use This Pattern (And When Not To)

Use it when:

Multiple agents need the same environment
Their tasks are independent (no inter-agent communication mid-run)
Setup time is a meaningful fraction of total runtime
Reproducibility matters: every fork starts from an identical state

Skip it when:

Agents need to share state during execution. Forks are fully isolated. If agent 2 needs to react to what agent 1 found, use shared storage or message queues instead.
The task is fast enough for a single agent. Forking five sandboxes for a 3-second job adds overhead, not speed.
Environment setup takes under 5 seconds. The snapshot overhead only pays off when setup is the actual bottleneck.

Your situation	Right choice
Multiple agents, shared dependencies, independent outputs	Memory snapshot, fork N copies
Single agent, long task, needs to pause and resume	Named sandbox with suspend/resume
Pure browser automation, no code execution	Stagehand or BrowserBase
Stateless task, resets every run	Ephemeral sandbox, no snapshot needed
Environment setup under 5 seconds	Filesystem snapshot or skip snapshots

On filesystem performance: Tensorlake publishes performance benchmarks on their GitHub comparing sandbox execution times across providers. Refer to their repository for current numbers.

Running This

pip install tensorlake openai
export TENSORLAKE_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
python3 agent.py

Free tier at cloud.tensorlake.ai, no credit card required. The demo takes 3-5 minutes end to end. After it runs, benchmark_results.json has the full per-agent timing data.

Phase 1 (base build and snapshot) runs once. If you want to run the benchmark multiple times, pass your existing snapshot ID directly and skip Phase 1. The snapshot persists between runs until you delete it.
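One way to make that reuse automatic is to cache the snapshot ID on disk. This is a sketch under my own assumptions: the helper name, the cache file, and its JSON layout are all hypothetical, and build is any coroutine function that returns a snapshot ID (build_base_snapshot fits).

```python
import asyncio
import json
from pathlib import Path


async def get_or_build_snapshot(cache_path: Path, build) -> str:
    """Return a cached snapshot ID, or call build() once and cache the result."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())["snapshot_id"]
    snapshot_id = await build()
    cache_path.write_text(json.dumps({"snapshot_id": snapshot_id}))
    return snapshot_id


if __name__ == "__main__":
    async def fake_build() -> str:
        # Hypothetical stand-in for build_base_snapshot().
        return "snapshot-demo"

    path = Path("snapshot_cache.json")
    print(asyncio.run(get_or_build_snapshot(path, fake_build)))
    path.unlink()  # remove the demo cache file
```

The cached ID goes stale if the snapshot is deleted, so clear the cache file whenever you clean up snapshots.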

What Actually Took Three Iterations

The first version had plain await sandbox.terminate() at the end of each function. Two exceptions during testing left sandboxes running and billing for idle compute. Switched to async with await AsyncSandbox.create(...) as sandbox: and that stopped.

The second version called sandbox.checkpoint(sandbox.sandbox_id). I had copied the pattern from a CLI reference (tl sbx checkpoint ) and assumed the Python SDK matched. It does not. The Python instance method takes no positional arguments: sandbox.checkpoint(checkpoint_type=CheckpointType.MEMORY). That is it.

The third version was the first one that ran end to end, but with CheckpointType.FILESYSTEM by default because I had not read the snapshots documentation carefully. The benchmark looked reasonable. The forks were doing full cold boots and I was measuring them alongside the actual work. Switching to CheckpointType.MEMORY was the change that made setup time disappear from per-fork timing.

Small mistakes individually. What they share: Tensorlake's API is well documented, but the snapshot docs, the SDK reference, and the async docs are three separate pages. Read only the quickstart and you miss two of the three things that matter most for this pattern.

You can also check out the complete project on my GitHub here:

click_here

The Thing That Changes

Running the same five agents sequentially and then in parallel is one of those moments where the architecture becomes legible in a way that documentation does not fully convey.

The snapshot moves setup cost from inside the loop to outside it. The agents still do the same work on the same hardware. The savings come from not rebuilding an environment five times when it only needed to be built once.

Most multi-agent optimization advice focuses on LLM calls: batching, caching, cheaper models. That advice is right. But if you have five agents each spending 90 seconds on pip installs before making a single inference call, no amount of LLM optimization helps until you address setup time first.

The bottleneck was never the agents. It was rebuilding the same environment on every run. Snapshot it once, fork cheaply, and parallel execution finally delivers what you expected when you first wrote asyncio.gather.


MCP Is Dead. The Downloads Just Don't Know It Yet.

Divy — Fri, 05 Jun 2026 13:04:24 GMT

Your AI agent ran a query on a fake database last month.

It got real results. The tool worked perfectly. Your SSH keys left in the background.

The agent didn't flag it. The registry didn't catch it. Nobody warned you.

That's not a hypothetical. That's MCP in 2026, with 97 million monthly downloads and a Linux Foundation home.

The hype was real. So are the cracks.

First: what is MCP, and why should you care

If you've never built AI agents before, this matters. Skip it if you have.

Say you're building an AI assistant that needs to do real work:

Look up customer records in a database
Create tickets in Jira
Send a Slack message
Pull a file from Google Drive

Each of those lives in a different system. Different API, different auth, different data format.

To connect your AI to all of them, you'd write a custom integration for each one. Fine for two tools. Painful for ten. Then you switch models and rewrite everything.

This is the N×M problem: N tools multiplied by M AI models equals a mountain of glue code nobody wants to maintain.

MCP — the Model Context Protocol — solves that. Released by Anthropic in November 2024, it's an open standard that gives AI models one universal way to talk to external tools. You build an MCP server once around a tool, and any MCP-compatible AI can use it.

Your agent  →  MCP Client  →  MCP Server  →  Real Tool (Slack, Postgres, GitHub)

Three pieces:

MCP Host: your app (Claude Desktop, VS Code, a custom agent)
MCP Client: the component inside your app that speaks MCP, discovers tools, calls them
MCP Server: a small process wrapping a real tool, exposing it in a format any MCP client can use

That's it. The N×M problem disappears. One integration per tool, works with every AI.
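The arithmetic behind that claim is easy to check. A back-of-envelope sketch, assuming one adapter per tool-model pair without a standard, and one server per tool plus one client per model with one (a simplification of real deployments):

```python
def integrations_without_standard(tools: int, models: int) -> int:
    """Every model needs its own adapter for every tool: N x M."""
    return tools * models


def integrations_with_standard(tools: int, models: int) -> int:
    """One server per tool plus one client per model: N + M."""
    return tools + models


for tools, models in [(2, 1), (10, 3), (50, 5)]:
    before = integrations_without_standard(tools, models)
    after = integrations_with_standard(tools, models)
    print(f"{tools} tools x {models} models: {before} adapters vs {after} components")
```

The gap widens fast: the first count grows multiplicatively, the second additively.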

The pitch was real. The adoption proved it.

How MCP went from zero to everywhere in 16 months

The adoption happened fast. Unusually fast.

Nov 2024 — Anthropic launches MCP. ~2M monthly SDK downloads.
Apr 2025 — OpenAI adopts it. Downloads jump to 22M.
Jul 2025 — Microsoft integrates it into Copilot Studio. 45M.
Nov 2025 — AWS adds support. 68M.
Mar 2026 — Every major AI vendor on board. 97M downloads. 10,000+ public MCP servers.

In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, co-founded with OpenAI and Block. It stopped being Anthropic's protocol and became the industry's.

The "USB-C for AI" comparison spread everywhere. Everyone plugged in.

And then engineers started running it in production.

Why developers hate MCP

The complaints have been building in forums, GitHub issues, and private Slack channels for months. Not from people who misunderstood MCP. From people who ran it in production and got surprised by the same things.

"It's unauthenticated by default."

Out of the box, an MCP server trusts whatever connects to it.

No built-in check that the server is who it claims. No built-in check that the client is who it claims. You're responsible for adding that layer.

Most tutorials don't mention this.

"The STDIO transport executes arbitrary OS commands."

The official MCP STDIO transport runs any OS command you point at it to launch a server. Even when the server startup fails. No sanitization warnings. Nothing in the developer toolchain flags it.

OX Security documented this in April 2026.

Anthropic's response: expected behavior, sanitization is the developer's responsibility. LangChain said the same. Microsoft said the same.

Three major vendors. Same answer: your problem.

"The spec moves, community servers don't."

MCP servers published in community registries frequently fall behind spec updates. A server that worked last month may behave differently after the protocol updates. The registry has no enforcement mechanism. You find out in production.

"Every REST API I already have needs a new wrapper process."

Adding MCP to a tool that already has a clean REST API means building an entire MCP server around it.

That server needs to be:

Deployed and monitored
Updated when the underlying API changes
Secured separately from both the agent and the tool

For ten existing APIs, that's ten new processes to own. Month three of production, you feel every one of them.

"The registries are basically npm circa 2015."

In early 2026, OX Security cloned mcp-server-postgres and named it mcp-server-postgress (extra 's'). Functionally identical. Same queries, same responses.

Hidden inside: a payload that silently pulled SSH keys and environment files to an outside server.

They submitted it to eleven major MCP registries.

Nine published it. No automated security review. No source code analysis. Nothing.
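A name-similarity check is the kind of cheap guard that would have caught this particular package. The sketch below is my own illustration, not something any registry is known to run: the trusted list, the 0.9 threshold, and the function name are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional

# Illustrative allowlist; a real one would come from your own vetted inventory.
TRUSTED_SERVERS = ["mcp-server-postgres", "mcp-server-github"]


def looks_like_typosquat(
    name: str, trusted: list[str], threshold: float = 0.9
) -> Optional[str]:
    """Return the trusted name that `name` imitates, or None if it looks fine."""
    if name in trusted:
        return None  # exact match: this is the real package name
    for known in trusted:
        if SequenceMatcher(None, name, known).ratio() >= threshold:
            return known
    return None


for candidate in ["mcp-server-postgres", "mcp-server-postgress", "mcp-server-slack"]:
    print(candidate, "->", looks_like_typosquat(candidate, TRUSTED_SERVERS))
```

This only flags lookalike names. It says nothing about what the package does, so it complements source review rather than replacing it.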

"It eats my context window before the user says anything."

When your MCP client connects to a server, it loads the full tool schema into your context window — names, descriptions, parameters for every tool.

One server, five tools: ~500 tokens gone before the first message
Ten servers: 2,000–3,000 tokens gone before the first message

The model is already reasoning over a smaller budget. Before your user typed a word.

These aren't edge cases. They're the standard experience for anyone who's moved past a local demo.

The 3 problems that actually break production systems

The trust problem

Your agent has no way to verify an MCP server is who it claims to be.

The OX Security incident made this real: nine out of eleven registries accepted a typosquatted credential-stealing package. The malicious server functioned correctly. Ran database queries. Returned results. And silently pulled your SSH keys in the background. Nothing in the protocol flagged it.

Starting in January 2026, researchers filed 30 CVEs against the MCP ecosystem within 60 days. Prompt injection through tool descriptions. Credential theft via config file reads. "Tool poisoning," where a server description manipulates the agent's next decision. These aren't exotic attack vectors.

Your agent can't tell your Postgres server from an attacker's. That's not a code bug. It's a design gap.

MCP was built for a trusted local environment. Production isn't that.

The wrapper tax

Every tool you connect to MCP needs its own MCP server. Ten tools means ten additional processes to own.

Each one needs to:

Stay in sync when the underlying tool's API changes
Be monitored for failures in production
Be secured separately from the agent and the tool itself
Be deployed as part of your infrastructure

For the first two tools, manageable. Month three with fifteen integrations, it's a job.

The N×M problem is solved. The "N new processes" problem quietly replaced it.

The context window bill

Tool schemas are not free. They're tokens. And they arrive before your user's message does.

A team building a customer service agent connected to ten MCP servers found their available reasoning budget had shrunk by 30% before the first user question arrived. Same model. Same prompts. Just more tools.

In a long multi-step agent session, schema tokens compound. Quality drifts. Costs climb. Most teams don't trace this back to tool schema overhead until they look at what's actually in the context window.
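
To see the overhead before it shows up on the bill, you can estimate it from the schemas themselves. This is a back-of-the-envelope sketch that assumes roughly four characters per token, which is a heuristic and not a real tokenizer. The tool definition and the 8,000-token window are made up for illustration.

```python
import json

CHARS_PER_TOKEN = 4  # Rough heuristic; an assumption, not a tokenizer.


def estimate_schema_tokens(tools: list[dict]) -> int:
    """Approximate the tokens used by serializing tool schemas into the prompt."""
    serialized = json.dumps(tools, separators=(",", ":"))
    return len(serialized) // CHARS_PER_TOKEN


def budget_share(schema_tokens: int, context_window: int) -> float:
    """Fraction of the context window taken by schemas before any user input."""
    return schema_tokens / context_window


# Hypothetical tool definition, for illustration only.
example_tools = [
    {
        "name": "run_query",
        "description": "Run a read-only SQL query against the analytics database.",
        "input_schema": {
            "type": "object",
            "properties": {"sql": {"type": "string", "description": "SQL to run"}},
            "required": ["sql"],
        },
    }
]

tokens = estimate_schema_tokens(example_tools)
print(f"~{tokens} tokens, {budget_share(tokens, 8_000):.1%} of an 8k window")
```

For real numbers, count with the tokenizer your model provider ships. The point of the sketch is only that the cost scales with every schema you load.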

What engineers are using instead

A few patterns have emerged for teams that ran into the problems above.

Direct REST API calls

For tools with a clean existing API, skip MCP entirely. Call the API directly from your agent. No new server to maintain, no schema overhead, existing auth covers it.

Works well when you control the tool and the API is stable. Doesn't scale when you need multiple AI systems to share the same integrations.

Native provider tool use

Anthropic and OpenAI both have built-in tool calling that needs no MCP infrastructure. You define the tool schema inline, pass it with the request, the model calls it.

No server process
No registry
Your auth sits directly on the call
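
A minimal sketch of the inline pattern, using the tool-definition shape from Anthropic's Messages API (`name`, `description`, `input_schema`) and a `tool_use`-shaped block standing in for the model's reply. The `get_weather` tool, its handler, and the hand-written reply are hypothetical, and no API call is made here.

```python
import json

# Tool schema defined inline, next to the code that serves it.
TOOLS = [
    {
        "name": "get_weather",
        "description": "Return the current temperature for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]


def get_weather(city: str) -> dict:
    # Stand-in for a real API call you already own and authenticate.
    return {"city": city, "temp_c": 21}


HANDLERS = {"get_weather": get_weather}


def dispatch(tool_use: dict) -> dict:
    """Run the handler named in a tool_use block and wrap its result."""
    handler = HANDLERS.get(tool_use["name"])
    if handler is None:
        raise KeyError(f"unknown tool: {tool_use['name']}")
    output = handler(**tool_use["input"])
    return {
        "type": "tool_result",
        "tool_use_id": tool_use["id"],
        "content": json.dumps(output),
    }


# Hand-written stand-in for the block a model would return.
reply = {"type": "tool_use", "id": "toolu_example",
         "name": "get_weather", "input": {"city": "Oslo"}}
result = dispatch(reply)
print(result)
```

In a real agent, `TOOLS` goes in the request's `tools` parameter and `result` goes back to the model in the next user message.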

Most teams running focused single-purpose agents in 2026 are doing this. Simpler to reason about, harder to share across systems.

UTCP (Universal Tool Calling Protocol)

UTCP skips the wrapper entirely. Instead of wrapping a tool in an MCP server, it calls the tool's existing HTTP endpoints directly, with a discovery layer on top.

As of early 2026:

1,000+ GitHub stars
Implementations in Python, Go, and TypeScript
Growing community from teams that wanted lower latency and less infrastructure overhead

Best for teams with well-designed existing APIs who don't want a separate server layer. Not a full MCP replacement if you need the ecosystem breadth — but for many production use cases, materially simpler.

MCP with a gateway layer

For teams committed to MCP, the answer to most of the problems above is an MCP gateway — a controlled layer between your agent and your servers.

Your agent  →  MCP Gateway  →  MCP Server 1  →  Tool
                            →  MCP Server 2  →  Tool
                            →  MCP Server N  →  Tool

A gateway handles:

Authentication — verifies server identity before your agent calls anything
Tool filtering — loads only schemas relevant to the current task, not all of them
Audit logging — records every tool call for compliance and debugging
Rate limiting — stops runaway tool calls from blowing your budget
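
The four responsibilities above fit in a small amount of code if you strip them down. This is a toy in-process sketch, not a real gateway product: `ToolSpec`, `Gateway`, the tag-based filter, and the per-session call budget are all assumptions chosen for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    name: str
    server: str
    tags: frozenset
    schema: dict


@dataclass
class Gateway:
    verified_servers: set              # servers that passed your identity check
    max_calls_per_session: int = 50
    tools: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)
    calls: int = 0

    def register(self, tool: ToolSpec) -> None:
        # Authentication stand-in: refuse tools from unverified servers.
        if tool.server not in self.verified_servers:
            raise PermissionError(f"unverified server: {tool.server}")
        self.tools.append(tool)

    def schemas_for(self, task_tags: set) -> list:
        # Tool filtering: expose only schemas relevant to the current task.
        return [t.schema for t in self.tools if t.tags & task_tags]

    def record_call(self, tool_name: str, arguments: dict) -> None:
        # Rate limiting first, then audit logging.
        if self.calls >= self.max_calls_per_session:
            raise RuntimeError("tool-call budget exhausted")
        self.calls += 1
        self.audit_log.append(
            {"ts": time.time(), "tool": tool_name, "args": arguments}
        )


gateway = Gateway(verified_servers={"postgres-internal"})
gateway.register(
    ToolSpec("run_query", "postgres-internal", frozenset({"database"}),
             {"name": "run_query"})
)
print(gateway.schemas_for({"database"}))
```

A production gateway would verify server identity cryptographically and persist the audit log. The sketch only shows where each check sits relative to the agent.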

As of April 2026, 86–89% of AI agent pilots fail before reaching production. Governance gaps and audit visibility are the two most common reasons. A gateway is what closes both.

So do we actually need MCP?

Yes. With caveats that matter.

Use MCP when:

Multiple AI systems need to share the same tools
You're a SaaS company giving AI agents access to your product
You need dynamic tool discovery across a large integration ecosystem

Skip MCP when:

You're building a focused agent with two or three tools you already own
Your tools have clean REST APIs you control
You need low latency and minimal infrastructure overhead

The "MCP for everything" era is over. It's the right call when standardization pays off at scale. When you just need your agent to hit an API you already control, MCP is overhead pretending to be infrastructure.

Cheat sheet: what to actually do

Your situation	What makes sense
Local dev, one or two tools, just exploring	Bare MCP or native tool calls. Don't over-engineer.
Agent using tools you own, clean REST APIs	Direct API calls or native tool use. Skip MCP overhead.
Production agent, 5+ tools, or external users	MCP with a gateway. Authentication is not optional.
Enterprise, compliance, or regulated industry	MCP gateway with audit logs and SSO. Non-negotiable.
Pulling from community MCP registries	Treat every server as untrusted. Verify before deploying.
Need to share tools across multiple AI systems	MCP is the right call. This is exactly what it's for.

The actual state of things

MCP isn't going away. The downloads are real. The Linux Foundation governance is serious. Multi-vendor adoption means the protocol has institutional staying power.

But the MCP of early tutorials — install a community server, plug it in, done — that version is dead.

It was never safe for production. It was never meant to be.

The engineers moving to UTCP or direct API calls aren't abandoning MCP because it failed. They're routing around the parts that weren't built for what they're building.

I keep coming back to the OX Security test. Nine out of eleven registries. No automated review. The agent called the fake server, ran its queries, and handed over credentials it didn't know it was handing over.

Your agent does what it's told by the tools it trusts.

MCP hasn't fully answered how it decides what to trust. Until it does, treat every community MCP server the way you'd treat a random npm package in 2015.

You know how that era ended.

I Built a Stateful Research Agent Inside a Sandbox. Here's What the Numbers Actually Looked Like.

Divy — Wed, 27 May 2026 04:51:08 GMT

Three steps into a multi-page research task, the agent lost everything.

Not a crash. Not a thrown exception.

The function returned, context reset, and the pricing data it had just collected vanished.

This failure is predictable: stateless execution environments were never built to hold state across browser sessions that run for twenty minutes.

You hit it eventually, usually at the worst moment.

The two standard workarounds are both annoying. Stuffing state into the prompt works until token costs become an issue. An external state store solves the problem, but now you are maintaining another service.

I had been using E2B for short-lived code execution. It handles that well, and they have added persistence features over time, including early-stage snapshot support. But for agents that need to pause mid-task and resume from a different process, state management is still mostly on you.

Someone in my Discord mentioned Tensorlake. I opened the docs and decided to build against this specific problem.

In this article, I will walk you through the steps to build a desktop-using agent inside a sandbox.

Let's start with setting up.

Visual Explanation First

Setup

What caught my attention first: named sandboxes with suspend() and resume() that preserve the full VM state, not just files, but running processes and open browser sessions. Sub-second resume, according to their docs.

Ten minutes from zero to running:

pip install tensorlake
tl login   # or TENSORLAKE_API_KEY env var

Free tier, no credit card.

from tensorlake.sandbox import Sandbox

sandbox = Sandbox.create(
    name="research-agent",
    cpus=2.0,
    memory_mb=4096,
    secret_names=["OPENAI_API_KEY"],
    image="tensorlake/ubuntu-vnc",
)

The tensorlake/ubuntu-vnc image is what gives you a real desktop and Firefox inside the VM. You need an actual browser because modern pricing pages heavily use client-side rendering and bot detection that stops headless scrapers cold. Firefox inside a sandbox just looks like a person browsing.

Important: Playwright is not pre-installed in ubuntu-vnc. Install it before the agent runs:

sandbox.run("pip", ["install", "playwright"])
sandbox.run("playwright", ["install", "chromium"])

Two to three minutes on first setup. After that, packages persist across suspend/resume so you pay the cost once.

Latency: What I Actually Measured

The first sandbox went from the Sandbox.create() call to status running in roughly 800-900ms.

Here is where time actually goes:

Sandbox creation:        ~800ms          (named sandbox, first time)
Sandbox resume:          ~400ms          (from suspended state)
LLM call (GPT-4o):       2,000-4,000ms   (per step, dominates everything)
Browser screenshot:      ~300ms          (capture + transfer)
Page load in sandbox:    1,000-2,000ms   (varies by site)
File read/write:         <50ms           (block-based storage)
Sandbox suspend:         ~200ms

The LLM calls dominate by a large margin. Sandbox overhead is not the bottleneck. The main optimization is batching browser operations before each model call rather than interleaving individual round trips.
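
To put numbers on the batching argument, here is the arithmetic as a sketch. It uses midpoints of the ranges above and assumes every browser operation is one page load plus one screenshot, which is a simplification. The function names are mine.

```python
# Midpoints of the measured ranges above, in milliseconds.
LLM_CALL_MS = 3000
PAGE_LOAD_MS = 1500
SCREENSHOT_MS = 300


def interleaved_ms(browser_ops: int) -> int:
    """One model call after every browser operation."""
    return browser_ops * (PAGE_LOAD_MS + SCREENSHOT_MS + LLM_CALL_MS)


def batched_ms(browser_ops: int, ops_per_call: int) -> int:
    """Several browser operations, then one model call over the batch."""
    llm_calls = -(-browser_ops // ops_per_call)  # ceiling division
    return browser_ops * (PAGE_LOAD_MS + SCREENSHOT_MS) + llm_calls * LLM_CALL_MS


ops = 12
print(interleaved_ms(ops), batched_ms(ops, ops_per_call=3))
```

With twelve operations, interleaving costs 57,600ms and batching three per model call costs 33,600ms. The browser time is identical in both cases. All of the saving comes from making four model calls instead of twelve.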

Tensorlake publishes a SQLite filesystem benchmark claiming 1.6-1.9x faster I/O than E2B and Modal. Self-reported numbers. I could not independently verify them. What I can say is that the block-based storage felt responsive for frequent small writes, which is exactly the pattern a research agent uses when checkpointing after every step.

Computer Use: What Worked and What Didn't

The desktop API itself is clean:

with sandbox.connect_desktop(password="tensorlake") as desktop:
    png_bytes = desktop.screenshot()
    desktop.move_mouse(640, 400)
    desktop.click()
    desktop.type_text("pinecone.io")
    desktop.press("Return")

Screenshot as PNG bytes, decode it, figure out where to click, send coordinates. Each browser interaction takes 1-3 seconds depending on page load. Slow compared to an API call. But it works on pages that block scrapers, because from the server's side it is just a person using Firefox.

The problem: coordinates assume a fixed layout, and layouts do not stay fixed.

Weaviate's pricing page ran an A/B test between two of my agent's steps. The toggle moved 30px down. The agent clicked empty space. No error, no exception. Just a screenshot showing nothing happened, and twenty minutes of debugging before I identified the offset.

The fix: pass screenshots to GPT-4o Vision to identify element positions dynamically rather than hardcoding coordinates. Adds about 2 seconds per interaction, handles layout drift reliably. Worth it for reliability; too slow for high-frequency operations.

When the DOM is accessible, Playwright inside the sandbox is the better path:

result = sandbox.run(
    "python",
    ["-c", """
import asyncio
from playwright.async_api import async_playwright

async def get_pricing():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://pinecone.io/pricing")
        pricing_text = await page.inner_text(".pricing-section")
        print(pricing_text)
        await browser.close()

asyncio.run(get_pricing())
"""]
)

The hybrid strategy I landed on:

Situation	Approach	Why
Site with bot detection	Vision + coordinates	Playwright gets blocked
Accessible DOM	Playwright directly	Faster, no coordinate drift
Unknown or variable layout	Screenshot + GPT-4o Vision	Resolves position dynamically
High-frequency operations	Playwright only	Vision adds ~2s per call

Use vision as a fallback, not a first tool. Vision handles layout variation. Playwright handles speed. Neither does both well.
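
The table collapses into a small decision function. This sketch encodes my reading of it. The precedence (bot detection first, then speed, then DOM access) is an assumption, and `choose_approach` is a hypothetical helper rather than anything in the Tensorlake SDK.

```python
from enum import Enum


class Approach(Enum):
    PLAYWRIGHT = "playwright"
    VISION = "screenshot + vision"


def choose_approach(dom_accessible: bool, bot_detection: bool,
                    high_frequency: bool = False) -> Approach:
    """Pick a browser strategy following the table above."""
    if bot_detection:
        # Playwright gets blocked, so drive the real desktop browser.
        return Approach.VISION
    if high_frequency or dom_accessible:
        # Vision adds roughly 2s per call; prefer selectors when they work.
        return Approach.PLAYWRIGHT
    # Unknown or variable layout: resolve positions from a screenshot.
    return Approach.VISION


print(choose_approach(dom_accessible=True, bot_detection=False))
```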

Statefulness: The Part That Actually Mattered

After three steps (Pinecone free tier limits noted, $70/mo Starter plan recorded, Weaviate docs started), I called sandbox.suspend().

The sandbox froze. Filesystem, memory, running browser: all paused. Twelve minutes later, from a different terminal:

sandbox = Sandbox.connect("research-agent")
sandbox.resume()

About 400ms. The Weaviate pricing tab was still open. Tensorlake's suspend/resume preserves the full VM state, including memory and running processes.

Everything written to /workspace/research_notes.json was intact.

The workflow I settled on: write state explicitly after each meaningful step, then suspend.

import json

# After each step, before suspending:
sandbox.write_file(
    "/workspace/state.json",
    json.dumps({
        "pinecone_pricing": pinecone_data,
        "weaviate_started": True,
        "next_url": "https://weaviate.io/pricing"
    }).encode()
)
sandbox.suspend()

# On next invocation, from any process:
sandbox = Sandbox.connect("research-agent")
sandbox.resume()
state = json.loads(bytes(sandbox.read_file("/workspace/state.json")))
# picks up from state["next_url"]

The state file is the continuity mechanism. Not elegant, but it removes the need for an external database: the filesystem is fast, durable across suspend, and readable from any reconnecting process.

Scaling and Failure Handling

Sandbox.create() is a blocking synchronous call. For parallel workloads, wrap in concurrent.futures:

from tensorlake.sandbox import Sandbox
from concurrent.futures import ThreadPoolExecutor

def research_competitor(name, url):
    sandbox = Sandbox.create(
        name=f"research-{name}",
        cpus=1.0,
        memory_mb=2048,
        secret_names=["OPENAI_API_KEY"],
        image="tensorlake/ubuntu-vnc",  # same image as the setup section
    )
    try:
        # ... agent logic ...
        return sandbox.read_file("/workspace/report.json")
    finally:
        # Always release the VM, even if the agent logic raises.
        sandbox.terminate()

competitors = [
    ("pinecone", "pinecone.io/pricing"),
    ("weaviate", "weaviate.io/pricing"),
    ("qdrant", "qdrant.tech/pricing"),
]

with ThreadPoolExecutor(max_workers=5) as executor:
    reports = list(executor.map(lambda c: research_competitor(*c), competitors))

Three concurrent sandboxes ran without delay. I have not tested at twenty or fifty. Their docs mention hundreds per second. Take that at face value until you have load data.

Note: Tensorlake's Python SDK v0.5.8 introduced native async APIs that offer a cleaner alternative to threading for I/O-bound orchestration. If you are on v0.5.8 or later, those are worth reaching for before wrapping synchronous calls in a thread pool.

Patterns worth building from day one:

Idempotent state writes. Write state after each meaningful step. If the agent fails mid-run, the next invocation reads the file and skips completed work. This does not happen automatically.
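
Here is a minimal sketch of that pattern. It uses a local JSON file as a stand-in for the sandbox filesystem, so in the real agent the reads and writes would go through sandbox.read_file() and sandbox.write_file(). The `run_steps` helper, the state layout, and the step names are assumptions for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read saved state, or start fresh if no state file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": [], "results": {}}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves half a file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def run_steps(path: Path, steps: dict) -> dict:
    """Run each named step once; skip any step already recorded as completed."""
    state = load_state(path)
    for name, step in steps.items():
        if name in state["completed"]:
            continue
        state["results"][name] = step()
        state["completed"].append(name)
        save_state(path, state)  # checkpoint after every step
    return state


def make_step(name: str, log: list):
    """Build a fake research step that records when it actually runs."""
    def step() -> str:
        log.append(name)
        return f"{name} pricing noted"
    return step


with tempfile.TemporaryDirectory() as tmp_dir:
    state_path = Path(tmp_dir) / "state.json"
    calls = []
    steps = {name: make_step(name, calls) for name in ("pinecone", "weaviate")}
    run_steps(state_path, steps)
    final_state = run_steps(state_path, steps)  # second run skips both steps
    print(calls)
```

The second call to `run_steps` does no work because both step names are already in the state file. That is the behavior you want after a crash or a reconnect.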

Checkpoint before risky operations. sandbox.checkpoint() creates a restorable snapshot. By default, snapshots preserve the filesystem state. Preserving full memory state is supported as an explicit option. Either way, you can restore into a fresh sandbox if an operation goes wrong:

# Filesystem snapshot (default)
snapshot = sandbox.checkpoint()

try:
    agent.navigate_to_pricing_page()
except Exception:
    # Restore filesystem state into a new sandbox
    sandbox = Sandbox.create(snapshot_id=snapshot.snapshot_id)

Named sandboxes. If the orchestration process dies, any other process reconnects with Sandbox.connect("sandbox-name") and resumes from the last written state.

Architectural boundary: Tensorlake provides the execution environment and runtime for agents: the VM, the filesystem, the process lifecycle, the networking. It is not an agent framework. Retry logic, circuit breakers, and LLM rate-limit backoff belong in the orchestration layer above it: LangChain, LlamaIndex, a custom harness, or whatever you are using to drive the agent. That separation is deliberate, not a gap.
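
Since retries live in your orchestration layer, it helps to have a small helper ready. This is a generic exponential-backoff sketch with jitter. `retry_with_backoff` and its defaults are my own choices, not part of Tensorlake or any framework named above.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (with jitter), then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel agents from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

Wrap the LLM call or the page navigation in a zero-argument function and pass it in. Narrow `retry_on` to the errors that are actually transient, such as rate-limit or timeout exceptions from your client library.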

The Mental Model

The part that shifted how I thought about the design:

┌─────────────────────────────────────────────┐
│                 Your Agent                   │
│    (LLM + tool calling logic)                │
└──────────────────┬──────────────────────────┘
                   │ tool calls
┌──────────────────▼──────────────────────────┐
│           Tensorlake Sandbox                 │
│  ┌──────────────────────────────────────┐   │
│  │ State Layer: /workspace filesystem   │   │
│  │  state.json, research_notes.json     │   │
│  └──────────────────────────────────────┘   │
│  ┌──────────────────────────────────────┐   │
│  │ Execution Layer: processes, scripts  │   │
│  └──────────────────────────────────────┘   │
│  ┌──────────────────────────────────────┐   │
│  │ Computer Use: VNC, screenshots, mouse│   │
│  └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

The sandbox is not the agent. It is the stable environment the agent operates in. When it resumes, the environment is exactly where the agent left it. The agent's logic lives outside and reconnects to a world that did not reset.

That changes what you can build. An agent that runs for an hour, navigates fifteen pages, and writes a structured report is feasible when the execution environment outlasts the orchestration session. With purely ephemeral execution, it is not.

How It Compares

vs E2B:

Both use Firecracker microVMs. E2B markets sub-200ms cold starts; community reports put real-world p50 closer to 400-600ms. Tensorlake named sandbox creation was ~800ms in my testing.
E2B has added snapshot and pause-resume in recent releases. The statefulness gap is narrower than a year ago. Tensorlake's suspend/resume preserves the full running VM state, including open processes and browser sessions, and resumes in under a second. E2B's memory snapshot support is still described as early-stage.
Tensorlake claims 1.6-1.9x faster filesystem I/O on their own benchmarks. Self-reported. For an independent reference: Tensorlake recently ranked top 2 across all three categories in the ComputeSDK sandbox benchmarks.
Neither provides DOM-level element selection at the SDK layer.

vs Modal:

Modal uses gVisor rather than Firecracker, designed around stateless function execution. Stateful long-running agents work but need more setup. Cold starts are around 1-1.5 seconds per their docs.

vs Stagehand (BrowserBase):

Stagehand has DOM-level selectors (CSS, XPath, natural language) via locator(). For pure browser automation, this is a real ergonomic advantage.
Tensorlake gives you a full VM. Code execution, file management, package installs, and browser use in the same environment. If that combination is what you need, the full VM model is worth the coordinate complexity.
Browser automation only? Stagehand is the more focused tool.

To list your sandboxes and check their status:

from tensorlake.sandbox import SandboxClient

client = SandboxClient()

for sb in client.list():
    print(sb.sandbox_id, sb.status)

What the Build Produced

By the end of the session, the agent had produced the comparison: Pinecone versus Weaviate pricing, extracted across seven pages, with notes preserved across two suspensions and a full restart of the orchestrating machine.

report_bytes = sandbox.read_file("/workspace/comparison_report.md")
print(bytes(report_bytes).decode("utf-8"))

Accurate. Correct tier names and numbers.

Tensorlake did not solve the hard parts: the retrieval logic, state schema, hybrid browser strategy. It stayed out of the way while those got built. Most of the infrastructure friction came down to state management, and most of that went away once the sandbox filesystem became the state store.

Three Things to Know Before You Start

Speed is a systems problem, not a sandbox problem. LLM calls account for the bulk of per-step latency. Optimize by batching browser operations before each model call, not by chasing sandbox startup time.

Design for interruption from day one. Write state after every meaningful step. Not because the sandbox will crash, but because resuming from a different process after an unexpected interruption is a real scenario, not an edge case.

Computer use is a primitive. The coordinate-based API works, but layout drift will break hardcoded positions. Use Playwright when the DOM is accessible. Fall back to vision when you need a real browser session. Do not automate full workflows with raw coordinates.

Is the sandbox infrastructure production-ready? Yes. Suspend/resume held up, filesystem persistence was consistent, and Firecracker isolation did what it was supposed to.

Is the computer use layer production-ready? Not without additional engineering. The raw coordinate API is a reasonable primitive, but element resolution needs to be built on top of it. A vision-backed click_element() in the SDK would change the story significantly. Until then, budget the time to build that layer yourself.

Worth using? Yes, if you go in with clear expectations about what the platform handles and what it leaves to you. That boundary is sharper than most, which makes it easier to work with once you have internalized it.

You can also check out the complete project on my GitHub.

References

Tensorlake. Tensorlake Documentation & Sandbox SDK. https://tensorlake.ai

E2B. E2B Sandbox Infrastructure. https://e2b.dev

Modal. Modal Serverless Infrastructure. https://modal.com

Stagehand (BrowserBase). Stagehand Browser Automation. https://browserbase.com/stagehand

Amazon Web Services. Firecracker MicroVMs. https://firecracker-microvm.github.io/

Microsoft. Playwright Browser Automation. https://playwright.dev

ComputeSDK. Sandbox Benchmarks. https://www.computesdk.com/benchmarks/sandboxes/