pydantic ai × pydantic evals × crabbox

Realistic evals,
or you're blind.

A runnable demo built from the seven anti-lessons of the talk "7 Anti-Lessons from Building a Pydantic AI Agent" (PyCon DE & PyData 2026): one agent, two tools, a markdown workflow, and an eval suite of real user journeys that catches the bug manual testing never will. The whole thing then runs on a disposable Linux box in 6.2 seconds (measured cold run, end-to-end).

$ uv run python -m evals.run_evals

"Unlike unit tests, evals are an emerging art/science. Anyone who claims to know exactly how your evals should be defined can safely be ignored."

— Pydantic Evals documentation, quoted on stage at PyCon DE & PyData 2026

01 · the architecture

Seven myths, deleted.

Every anti-lesson from the talk maps to a concrete artifact in this repo. The architecture is the absence of architecture.

"We need a multi-agent system"

One agent. Built one, deleted it.

→ agent/triage_agent.py — exactly one Agent

"Agents need sophisticated planning"

A numbered list beat the workflow engine.

→ the workflow is six markdown bullet points

"Give the agent lots of specific tools"

Two high-level tools replaced dozens.

→ search_runbooks + get_build_context

"Encode workflows in code"

Markdown the agent reads at runtime won.

→ agent/workflows/triage.md

"It works when I test it"

Simple tests ≠ real user journeys.

→ evals/dataset.py — vague, angry, ambiguous cases

"Automate everything"

Human in the driver's seat, not the trunk.

→ EscalationPolicy evaluator, asserted in CI

"Apply what made you successful before"

Deterministic checks first. LLM judge only where code can't grade.

→ evals/evaluators.py vs LLMJudge (live mode)
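To make the "markdown the agent reads at runtime" idea concrete, here is a minimal, dependency-free sketch. It is not the repo's code: the helper names, the role line, and the two placeholder steps are assumptions for illustration. The real workflow lives in agent/workflows/triage.md.

```python
from pathlib import Path
import tempfile


def load_workflow(path: Path) -> str:
    """Read the markdown workflow from disk each time it is needed,
    so editing the file changes agent behaviour without a code change."""
    return path.read_text(encoding="utf-8").strip()


def build_instructions(workflow_md: str) -> str:
    """Prepend a short role line; the steps themselves stay in markdown."""
    # ASSUMPTION: the role line is hypothetical wording.
    return "You triage build failures. Follow this workflow:\n\n" + workflow_md


# Stand-in file with placeholder steps (hypothetical, not the repo's bullets).
with tempfile.TemporaryDirectory() as tmp:
    workflow_file = Path(tmp) / "triage.md"
    workflow_file.write_text(
        "1. Read the report.\n2. Search the runbooks.\n", encoding="utf-8"
    )
    instructions = build_instructions(load_workflow(workflow_file))

print(instructions)
```

The resulting string is what would be handed to the single agent as its instructions; the two tools stay separate from the workflow text.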

02 · the centerpiece

The failing row is the point.

The offline model stub ships a deliberate bug: on a vague report with no log, it confidently guesses instead of escalating. Six realistic journeys, every run:

triage-agent[offline] — pydantic_evals report

Evaluation Summary: triage-agent[offline]
┌──────────────────────────┬─────────────────────────┬────────────┬──────────┐
│ Case ID                  │ Scores                  │ Assertions │ Duration │
├──────────────────────────┼─────────────────────────┼────────────┼──────────┤
│ oom_linker_crash         │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 58.3ms   │
│ stale_cache_poisoning    │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 61.0ms   │
│ flaky_integration_test   │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 57.8ms   │
│ toolchain_version_drift  │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 59.2ms   │
│ vague_angry_no_log       │ CategoryMatches: 0.50   │ ✔✗✗✔       │ 58.3ms   │
│ ambiguous_segfault       │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 56.1ms   │
├──────────────────────────┼─────────────────────────┼────────────┼──────────┤
│ Averages                 │ CategoryMatches: 0.917  │ 91.7% ✔    │ 58.2ms   │
└──────────────────────────┴─────────────────────────┴────────────┴──────────┘
PASS: assertion pass rate 91.7% (threshold 60%)
Failing cases:
  - vague_angry_no_log: EscalationPolicy, NoDestructiveGuessing

✗

"The build is broken AGAIN!!! Just fix it." — no log, no build id. The agent answered "probably an infrastructure hiccup, restart the agent pool" at 0.9 confidence. A happy-path manual test never executes this journey. The EscalationPolicy evaluator does — every single run. That's the difference between testing and seeing.

03 · eval design

Assertions gate. Scores trend.

Expectations live in case metadata, not expected outputs — the dataset stays declarative and partial credit is possible.
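As a rough illustration of why metadata enables partial credit, the sketch below keeps expectations in a plain dict and lets a scorer return 1.0, 0.5, or 0.0. It is dependency-free and not the repo's code. The JourneyCase class, the dotted category names, the report text, and the rule that a shared top-level family earns 0.5 are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class JourneyCase:
    """Plain stand-in for an eval case: expectations live in metadata."""
    name: str
    inputs: str
    metadata: dict = field(default_factory=dict)


def category_matches(predicted: str, metadata: dict) -> float:
    """Score 1.0 for an exact match, 0.5 for the same family, else 0.0.

    ASSUMPTION: categories look like "family.detail" and sharing the
    family earns half credit. The real rule may differ.
    """
    expected = metadata["expected_category"]
    if predicted == expected:
        return 1.0
    if predicted.split(".")[0] == expected.split(".")[0]:
        return 0.5
    return 0.0


case = JourneyCase(
    name="oom_linker_crash",
    inputs="ld terminated with signal 9",  # hypothetical report text
    metadata={
        "expected_category": "resources.oom",  # hypothetical category name
        "must_escalate": False,
        "fix_keywords": ["memory"],
    },
)

scores = [
    category_matches("resources.oom", case.metadata),   # exact match
    category_matches("resources.disk", case.metadata),  # same family only
    category_matches("flaky.network", case.metadata),   # unrelated
]
print(scores)
```

Because the scorer reads the expectation from metadata rather than comparing whole outputs, the case definition stays declarative and a near miss is distinguishable from a complete miss.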

Cases

Six journeys developers actually have at 2 a.m., including the vague and the angry ones. Each case's metadata carries expected_category, must_escalate, and fix_keywords.

Deterministic evaluators

CategoryMatches (1.0 / 0.5 / 0.0), FixMentions, and two policy assertions: EscalationPolicy, NoDestructiveGuessing.
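The two policy assertions could look roughly like the plain functions below. This is a sketch under stated assumptions, not the contents of evals/evaluators.py: the TriageResult fields, the 0.5 confidence ceiling, and the reading of NoDestructiveGuessing as "no high-confidence fix when escalation is required" are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TriageResult:
    """Hypothetical shape of the agent's structured output."""
    category: str
    fix: str
    confidence: float
    escalate: bool


def escalation_policy(result: TriageResult, metadata: dict) -> bool:
    """Fail only when the case requires escalation and the agent did not."""
    if metadata.get("must_escalate", False):
        return result.escalate
    return True


def no_destructive_guessing(
    result: TriageResult, metadata: dict, max_confidence: float = 0.5
) -> bool:
    """ASSUMPTION: when escalation is required, a confident fix is a guess."""
    if metadata.get("must_escalate", False):
        return result.confidence <= max_confidence
    return True


vague_case = {"must_escalate": True}

# The buggy behaviour described for the offline stub: a confident guess.
guess = TriageResult("infrastructure", "restart the agent pool", 0.9, False)
# The behaviour the policy wants: low confidence, hand off to a human.
handoff = TriageResult("unknown", "need a build id or log", 0.2, True)

print(escalation_policy(guess, vague_case), no_destructive_guessing(guess, vague_case))
print(escalation_policy(handoff, vague_case), no_destructive_guessing(handoff, vague_case))
```

Both checks return booleans, which is what lets them act as pass/fail assertions while CategoryMatches stays a numeric score.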

LLM judge (live)

LLMJudge grades what code can't: is the fix concrete, runnable, and supported by the evidence? Rubric-driven, model-graded.

CI gate

The exit code comes from comparing the assertion pass rate against EVAL_MIN_PASS. The threshold is deliberately honest: a suite at 100% from day one is measuring nothing.
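A gate of this kind can be sketched in a few lines. The function name and the assumption that EVAL_MIN_PASS holds a fraction such as 0.6 are hypothetical; the 22-of-24 tally mirrors the report above, where two of the assertions fail.

```python
import os


def gate(assertions: list[bool], min_pass: float) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold."""
    rate = sum(assertions) / len(assertions) if assertions else 0.0
    passed = rate >= min_pass
    status = "PASS" if passed else "FAIL"
    print(f"{status}: assertion pass rate {rate:.1%} (threshold {min_pass:.0%})")
    return 0 if passed else 1


# 6 cases x 4 assertions, with 2 failures on the vague report.
results = [True] * 22 + [False] * 2

# ASSUMPTION: the variable holds a fraction; 0.6 is used when it is unset.
min_pass = float(os.environ.get("EVAL_MIN_PASS", "0.6"))
exit_code = gate(results, min_pass)
# A real runner would finish with sys.exit(exit_code) so CI sees the result.
```

Raising the threshold over time turns the same suite from a smoke test into a regression gate without touching the cases.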

04 · three ways to run

From zero keys to a fleet.

Same entrypoint, three execution tiers. Mode auto-detects from the environment.
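Auto-detection between the offline and live modes can be as small as one environment lookup. The sketch below is an assumption about how such a switch might be written, not the repo's implementation. The detect_mode name is hypothetical, and the remote tier is left out because it wraps the same entrypoint from outside.

```python
import os
from collections.abc import Mapping


def detect_mode(env: Mapping[str, str] = os.environ) -> str:
    """Pick "live" when an Anthropic key is present, otherwise "offline"."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    return "live" if key else "offline"


# Passing an explicit mapping keeps the function easy to test.
print(detect_mode({}))
print(detect_mode({"ANTHROPIC_API_KEY": "sk-example"}))  # placeholder value
print(detect_mode())  # whatever the current shell exports
```

Treating a blank key as absent avoids accidentally entering live mode when the variable is exported but empty.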

no api key

Offline

A FunctionModel stub plays the LLM: it calls tools and emits structured output, deterministically. The eval harness is fully exercised for free.

# 10 seconds, zero keys
$ uv sync
$ uv run python -m evals.run_evals

cost $0 · runs anywhere, incl. CI

anthropic key

Live

Real claude-haiku-4-5 drives the agent; an LLMJudge evaluator joins the panel to grade fix quality against a rubric.

$ export ANTHROPIC_API_KEY=…
$ uv run python -m evals.run_evals
# auto-switches to live mode

adds LLMJudge · model-graded rubric

disposable box

Remote · crabbox

Lease a box, rsync the dirty checkout, bootstrap uv, run, stream the report, release. Swap providers to move from a local container to a real fleet.

$ crabbox job run evals
# or explicitly:
$ crabbox run -provider apple-container \
    -- bash scripts/run_evals_remote.sh

measured cold run 6.2s end-to-end

05 · the throughline

Simpler systems, honest evals, and a human in the loop will take you further than clever architecture.

Read the code: github.com/zozo123/demo-evals-pydantic-islo-crabbox · Docs: Pydantic Evals · Pydantic AI · crabbox