APAgentsProof

Now in beta

Stop vibe-checking
AI agents. Prove they work.

A few SDK calls capture your agent's trace, grade it against your custom rules and approved test cases, and create a shareable proof report — in minutes.

You changed a prompt. Something broke. A user noticed first.

View example report

$ npm install agentsproof

agentsproof.dev/r/x8kp2m live

agent.tsyour code

import { AgentsProof } from 'agentsproof';

const ap = new AgentsProof({ apiKey: process.env.AGENTSPROOF_API_KEY });
const run = ap.startRun({
  projectSlug: 'my-agent',
  input: { query },
  goal: 'Search the web and return a working code solution', // anchors grading
});

const result = await run.trace('llm_call', 'gpt-4o', () => llm(query));
const { publicUrl } = await run.complete({ answer: result });

console.log(publicUrl); // → https://agentsproof.dev/r/abc123

report · run #47the proof

my-coding-agent

2 minutes ago · 8 steps

87/100

Goal completion92

Tool accuracy78

Step efficiency95

Output quality88

Safety97

No anomaliesOpen report

Drops intoOpenAIAnthropicLangChainCrewAIVercel AI SDKLlamaIndex

How it works

From zero to proof in five steps.

01install

Install the SDK

One package, zero config. Works in any Node or edge runtime.

$ npm i agentsproof

02run.trace()

Wrap your calls

Drop run.trace() around each LLM and tool call. That's the whole integration.

run.trace('llm_call', fn)

03define

Add custom graders

Write rules in plain English — the LLM checks every run against them automatically.

→ Dashboard → Graders → New rule

04golden

Define your success bar

Turn a passing run into a live test spec. Set success criteria, trace assertions, and expected behavior — all evaluated on every future run.

→ Trace view → Save as Golden

05prove

Run your proof suite

One SDK call runs all approved Goldens through your agent. The report shows per-criterion pass/fail for every case — not just an overall score.

→ agentsproof.dev/p/suite-abc

agent.ts — instrument a run

Copy

import { AgentsProof } from 'agentsproof';

const ap = new AgentsProof({ apiKey: process.env.AGENTSPROOF_API_KEY });
const run = ap.startRun({
  projectSlug: 'my-agent',
  input: { query },
  goal: 'Search the web and return a working code solution', // anchors grading
});

const result = await run.trace('llm_call', 'gpt-4o', () => llm(query));
const { publicUrl } = await run.complete({ answer: result });

console.log(publicUrl); // → https://agentsproof.dev/r/abc123

What you get

Everything you need to know your agent works — not just a score.

Custom Graders

Define what 'good' means for your agent in plain English. Every run is automatically graded against your rules — pin specific graders to individual Goldens for targeted enforcement.

The agent must never reveal user PII

Goldens

Turn any passing run into an executable test spec with success criteria, expected behavior, and trace assertions. Pass goldenId directly to startRun() to run against a Golden on demand — or batch them all in a proof suite.

startRun({ goldenId }) · 17/18 criteria passed →

Trace Assertions

Structured checks that run deterministically — no LLM needed. If your agent skips a required step or exceeds a step budget, the case fails immediately. Catches regressions the LLM judge can miss.

must_not_call:send_email · max_steps:10

Synthetic Variants

AgentsProof uses AI to generate edge-case variants from your Goldens, growing your test suite without writing tests by hand.

Generate 5 variants from this Golden

Proof Suites

One SDK call runs all approved Goldens through your agent. The report shows per-criterion Golden checks alongside the 5-axis score — every pass and fail is explainable.

agentsproof.dev/p/suite-abc

proof-suite.ts — run all Goldens

Copy

import { AgentsProof } from 'agentsproof';

const ap = new AgentsProof({ apiKey: process.env.AGENTSPROOF_API_KEY });

await ap.runProofSuite({
  projectSlug: 'my-agent',
  suiteSlug: 'core-behaviors',
  async handler(input, ctx) {
    // input comes from the approved Golden
    const run = ctx.startRun();
    const result = await myAgent(input);
    await run.complete({ answer: result });
  },
});
// → https://agentsproof.dev/p/suite-abc123

vs. the alternatives

Less setup. More signal.

AgentsProof

LangSmith

Braintrust

Public shareable proof reports

✓

✗

Works with any LLM framework

✓

Save real runs as test cases

✓

✗

5-minute SDK setup

✓

✗

Designed for indie builders

✓

✗

Free tier

✓

~ partial support · based on publicly available information

Pricing

Free to start. Pro when you're ready to ship.

Free

Forever, no card required

1 project
200 eval runs / month
Default LLM grader
10 golden test cases
1 proof suite
Public proof reports
Basic email support

Most popular

Pro

$29/month

Or $290/yr — 2 months free

Unlimited projects
10,000 eval runs / month
Unlimited custom graders
Unlimited golden test cases
Unlimited proof suites
Public + private proof reports
Priority email support

Stop vibe-checkingAI agents. Prove they work.