TesterArmy: AI app testing - catch what your evals miss

Backed by

[02] AI stack

Made for the AI stack you
already use

OpenAI

Anthropic

Google

Meta

Perplexity

LangChain

Vercel Gateway

OpenAI

Anthropic

Google

Meta

Perplexity

LangChain

Vercel Gateway

[03] The eval gap

Evals see the model
We see the journey

Evals score answers. Sentry catches exceptions. Neither catches the journey that quietly stops working - and that's the one users feel.

Outputs aren't deterministic

Same prompt, different responses. Snapshot diffs are useless - you need semantic and journey-level checks.

A prompt tweak ships an outage

Change a system prompt to fix one thing, break three others. The eval still passes. The user journey breaks. Standard E2E suites won't tell you which.

In AI, quality IS the UX

A slow answer is a wrong answer. A refused query is a churned user. The metrics your dashboards track are the ones your users feel.

AI breaks at the seams

[04] Use cases

From copilots to multi-turn agents

Describe a real user journey and we'll watch it across every release. Here's how that looks for the most common AI app shapes.

01LLM

Chat & copilots

Ask the assistant a real user question. Assert it references the right doc, calls the right tool, and stays under 4s.

StreamingCitations

02Tools

Agents

Multi-turn flows: user, tool call, user follow-up, final action. Assert the outcome, not exact wording.

Multi-turnOutcomes

[05] Per-regression evidence

Reproducible, recorded, routable

When a journey breaks, you don't get a guess at which prompt change caused it. You get the recording, the step-by-step trace, screenshots, and a check status on the PR - pinned to the deploy that introduced the regression.

01Full browser or mobile simulator session recording of the journey
02Step-by-step pass/fail breakdown of every action
03Screenshots captured at each key moment
04Clear bug report with reproduction context
05Pull request comment + GitHub check status

[06] Integrations

Plugs into the tools your
AI team already runs

slack

CI/CD

github

expo

vercel

discord

api

webhook

slack

CI/CD

github

expo

vercel

discord

api

webhook

[07] How it works

Four steps to journey-level QA for your AI app

Natural-language journeys, quality bars that travel with each release, and evidence you can act on the moment something breaks.

Connect GitHub

staging.yourapp.com

Upload app binary

Create a project

Paste your staging or production URL to set up a project for your AI app. Works with Next.js, React, Vue, or any framework. No SDK, no test scripts, no infrastructure to maintain.

Any frameworkURLGitHub

staging.yourapp

user@test.com

Submit

Set quality bars

Describe your AI flow in plain English ("user asks for a refund and the agent processes it through the support tool") and define what "good" looks like: tool calls, refusals, latency budget, semantic match. We verify outcomes, not exact wording.

Tool callsLatencyRefusals

GitHub AppAuto on deploy

Production monitoringRecurring runs

WebhookAny CI pipeline

Run on every release

TesterArmy executes against staging or prod previews on each deploy. GitHub App or webhook from any CI pipeline.

GitHub AppVercel previewWebhook

PassSearch results

Query returns the correct filtered items and pagination controls work as expected

Get a clear report

Recordings, transcripts of the AI conversation, tool call traces, and a clean bug report. Triage in minutes, not hours.

RecordingTranscriptTool trace

[08] Resources

TesterArmy in action

Explore real bug finds, QA tips, and client case studies from our work in practice.

Read the blog

June 10, 2026

Inside Playwright CLI: Browser Automation Built for Coding Agents

Playwright CLI is a thin wrapper around a daemon that reuses the exact same tool layer as Playwright MCP - minus the token cost. Here's how it works under the hood and how to use it well.

Read article

June 6, 2026

How to Handle Authentication in Playwright E2E Tests

Stop logging in through the UI before every Playwright test. Use storageState, keep one real login test, isolate roles, and make stale auth fail loudly in CI.

[09] FAQ

AI testing, answered

Yes. Journeys can include multi-turn agent steps and tool calls. We assert on outcomes (did the agent call the right tool with the right arguments and reach the right end state) rather than on exact wording.

Yes - via the TesterArmy GitHub App or Vercel preview integration. You get pass/fail check statuses and report links right on each pull request.

Full session recording of the journey, a step-by-step pass/fail breakdown, screenshots at every key moment, a clear bug report with reproduction context, and a comment + check status posted on the pull request.

We don't replace it. Eval platforms score the model in isolation (prompt, output, against a dataset). TesterArmy verifies the journey in production (user, AI app, outcome, on a real deploy). Most AI teams need both - evals catch model regressions, TesterArmy catches everything around the model.

Yes. We treat the stream as one journey step: assertions run on the final concatenated output, and latency assertions split into first-token and completion budgets. You'll catch a hang in the middle, a stalled stream, or a slow first token without flake.

Outcome-based assertions, not snapshot diffs. You define what good looks like - the right tool was called, the response cites the right source, refusal happened when sources didn't support the claim, latency stayed under budget. We grade against intent, not exact wording, so a paraphrased answer that does the right thing still passes.

The next regression in your AI app will ship silently

A prompt tweak. A model upgrade. A retrieval index drift. The eval still passes; the user journey breaks. TesterArmy is the safety net you wire in once and trust on every release.

[10] Contact us

Catch what evals miss

Backed by

[02] AI stack

Made for the AI stack you
already use

OpenAI

Anthropic

Google

Meta

Perplexity

LangChain

Vercel Gateway

OpenAI

Anthropic

Google

Meta

Perplexity

LangChain

Vercel Gateway

[03] The eval gap