pydantic ai × pydantic evals × crabbox

Realistic evals,
or you're blind.

A runnable demo built from the PyCon DE talk "7 Anti-Lessons from Building a Pydantic AI Agent": one agent, two tools, a markdown workflow, and an eval suite of real user journeys that catches the bug manual testing never will. The whole thing then runs on a disposable Linux box in 6.2 seconds, measured as a cold run end-to-end.

$ uv run python -m evals.run_evals
"Unlike unit tests, evals are an emerging art/science. Anyone who claims to know exactly how your evals should be defined can safely be ignored."
— Pydantic Evals documentation, quoted on stage at PyCon DE & PyData 2026
01 · the architecture

Seven myths, deleted.

Every anti-lesson from the talk maps to a concrete artifact in this repo. The architecture is the absence of architecture.

1. "We need a multi-agent system"
   One agent. Built one, deleted it.
   agent/triage_agent.py: exactly one Agent

2. "Agents need sophisticated planning"
   A numbered list beat the workflow engine.
   → the workflow is six markdown bullet points

3. "Give the agent lots of specific tools"
   Two high-level tools replaced dozens.
   search_runbooks + get_build_context

4. "Encode workflows in code"
   Markdown the agent reads at runtime won.
   agent/workflows/triage.md

5. "It works when I test it"
   Simple tests ≠ real user journeys.
   evals/dataset.py: vague, angry, ambiguous cases

6. "Automate everything"
   Human in the driver's seat, not the trunk.
   EscalationPolicy evaluator, asserted in CI

7. "Apply what made you successful before"
   Deterministic checks first. LLM judge only where code can't grade.
   evals/evaluators.py vs LLMJudge (live mode)
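The pieces in this list fit together roughly as sketched below. This is plain Python, not the repo's code: the tool bodies, the canned runbook data, the `build_instructions` helper, and the sample workflow text are all hypothetical. Only the tool names `search_runbooks` and `get_build_context`, and the idea of reading the workflow markdown at runtime, come from the list. In the repo the same wiring would go through a single Pydantic AI Agent.

```python
import tempfile
from pathlib import Path

# Hypothetical canned data standing in for a real runbook store.
RUNBOOKS = {
    "oom": "Linker ran out of memory: lower link parallelism or use a larger runner.",
    "cache": "Stale cache: bump the cache key and rebuild from clean.",
}


def search_runbooks(query: str) -> list[str]:
    """High-level tool 1: return every runbook whose key appears in the query."""
    q = query.lower()
    return [text for key, text in RUNBOOKS.items() if key in q]


def get_build_context(build_id: str) -> dict:
    """High-level tool 2: return log and metadata for one build (canned here)."""
    return {"build_id": build_id, "log_tail": "ld: out of memory", "runner": "linux-small"}


def build_instructions(workflow_path: Path) -> str:
    """Read the markdown workflow at call time, so editing the file changes behavior."""
    workflow = workflow_path.read_text(encoding="utf-8")
    return "You triage CI build failures. Follow this workflow:\n\n" + workflow


# Demo with a temporary workflow file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "triage.md"
    path.write_text(
        "- Get the build context.\n- Search the runbooks.\n- Escalate if evidence is missing.\n",
        encoding="utf-8",
    )
    instructions = build_instructions(path)

hits = search_runbooks("OOM during link step")
print(hits)
```

Keeping the workflow in a file the agent reads, rather than in control flow, means a change to triage behavior is a text edit that the eval suite can then re-check.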
02 · the centerpiece

The failing row is the point.

The offline model stub ships a deliberate bug: on a vague report with no log, it confidently guesses instead of escalating. Six realistic journeys, every run:

triage-agent[offline] — pydantic_evals report
Evaluation Summary: triage-agent[offline]
┌──────────────────────────┬─────────────────────────┬────────────┬──────────┐
│ Case ID                  │ Scores                  │ Assertions │ Duration │
├──────────────────────────┼─────────────────────────┼────────────┼──────────┤
│ oom_linker_crash         │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 58.3ms   │
│ stale_cache_poisoning    │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 61.0ms   │
│ flaky_integration_test   │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 57.8ms   │
│ toolchain_version_drift  │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 59.2ms   │
│ vague_angry_no_log       │ CategoryMatches: 0.50   │ ✗✗         │ 58.3ms   │
│ ambiguous_segfault       │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 56.1ms   │
├──────────────────────────┼─────────────────────────┼────────────┼──────────┤
│ Averages                 │ CategoryMatches: 0.917  │ 91.7% ✔    │ 58.2ms   │
└──────────────────────────┴─────────────────────────┴────────────┴──────────┘
PASS: assertion pass rate 91.7% (threshold 60%)
Failing cases:
- vague_angry_no_log: EscalationPolicy, NoDestructiveGuessing

"The build is broken AGAIN!!! Just fix it." — no log, no build id. The agent answered "probably an infrastructure hiccup, restart the agent pool" at 0.9 confidence. A happy-path manual test never executes this journey. The EscalationPolicy evaluator does — every single run. That's the difference between testing and seeing.

03 · eval design

Assertions gate. Scores trend.

Expectations live in case metadata, not expected outputs — the dataset stays declarative and partial credit is possible.

Cases

Six journeys developers actually have at 2am, including the vague and the angry ones. Each case's metadata carries expected_category, must_escalate, and fix_keywords.
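A declarative dataset in this style might look like the sketch below. The `Case` dataclass is a plain-Python stand-in that only loosely mirrors the shape of a pydantic_evals case. The category values, the keyword list, and the wording of the first input are hypothetical. The case names, the three metadata keys, and the second input come from this page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Case:
    name: str
    inputs: str  # what the user actually typed
    metadata: dict = field(default_factory=dict)  # expectations, not expected outputs


CASES = [
    Case(
        name="oom_linker_crash",
        inputs="Build died in the link step, log attached.",  # hypothetical wording
        metadata={
            "expected_category": "oom",  # hypothetical category label
            "must_escalate": False,
            "fix_keywords": ["memory"],
        },
    ),
    Case(
        name="vague_angry_no_log",
        inputs="The build is broken AGAIN!!! Just fix it.",
        metadata={
            "expected_category": "unknown",  # hypothetical category label
            "must_escalate": True,
            "fix_keywords": [],
        },
    ),
]

# Evaluators can select on metadata without touching the inputs.
must_escalate = [case.name for case in CASES if case.metadata["must_escalate"]]
print(must_escalate)  # ['vague_angry_no_log']
```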

Deterministic evaluators

CategoryMatches (1.0 / 0.5 / 0.0), FixMentions, and two policy assertions: EscalationPolicy, NoDestructiveGuessing.
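As a rough illustration, three of these checks could be written as plain functions over the agent's output and the case metadata. Everything here is an assumption made for the sketch: the `TriageResult` fields, the `RELATED` pairs that earn the 0.5 partial credit, and the function signatures. The repo builds its evaluators on pydantic_evals, and its exact rules may differ. NoDestructiveGuessing is left out because this page does not describe its rule.

```python
from dataclasses import dataclass


@dataclass
class TriageResult:
    category: str
    suggested_fix: str
    confidence: float
    escalate: bool


# Assumption: which category pairs count as a "near miss" worth 0.5.
RELATED = {
    frozenset({"unknown", "infrastructure"}),
    frozenset({"stale_cache", "toolchain_drift"}),
}


def category_matches(out: TriageResult, meta: dict) -> float:
    """Score: 1.0 for an exact match, 0.5 for a related category, 0.0 otherwise."""
    expected = meta["expected_category"]
    if out.category == expected:
        return 1.0
    return 0.5 if frozenset({out.category, expected}) in RELATED else 0.0


def fix_mentions(out: TriageResult, meta: dict) -> bool:
    """Assertion: the fix text mentions every expected keyword (case-insensitive)."""
    text = out.suggested_fix.lower()
    return all(kw.lower() in text for kw in meta.get("fix_keywords", []))


def escalation_policy(out: TriageResult, meta: dict) -> bool:
    """Assertion: when the case says must_escalate, the agent escalated."""
    return out.escalate or not meta.get("must_escalate", False)


# The vague, angry journey: expectations come from metadata.
meta = {"expected_category": "unknown", "must_escalate": True, "fix_keywords": []}
guess = TriageResult("infrastructure", "restart the agent pool", 0.9, False)

print(category_matches(guess, meta))   # 0.5
print(fix_mentions(guess, meta))       # True (no keywords required)
print(escalation_policy(guess, meta))  # False
```

Averaging the six CategoryMatches scores from the report, (5 × 1.0 + 0.5) / 6 ≈ 0.917, reproduces its Averages row.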

LLM judge (live)

LLMJudge grades what code can't: is the fix concrete, runnable, and supported by the evidence? Rubric-driven, model-graded.

CI gate

The exit code comes from comparing the assertion pass rate against EVAL_MIN_PASS. The threshold is deliberately honest: a suite at 100% from day one is measuring nothing.
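A gate of this kind reduces to a few lines. The sketch below assumes EVAL_MIN_PASS holds a fraction such as "0.6"; the repo may parse it differently, and `read_threshold` and `gate` are hypothetical names. The counts 22 of 24 are inferred from the 91.7% pass rate, assuming four assertions on each of the six cases.

```python
from collections.abc import Mapping


def read_threshold(env: Mapping[str, str], default: float = 0.6) -> float:
    """Assumption: EVAL_MIN_PASS holds a fraction such as "0.6"."""
    return float(env.get("EVAL_MIN_PASS", default))


def gate(passed: int, total: int, min_pass: float) -> tuple[int, str]:
    """Return (exit_code, summary): 0 when the pass rate meets the threshold, else 1."""
    rate = passed / total if total else 0.0
    ok = rate >= min_pass
    verdict = "PASS" if ok else "FAIL"
    summary = f"{verdict}: assertion pass rate {rate:.1%} (threshold {min_pass:.0%})"
    return (0 if ok else 1), summary


# A real entrypoint would pass os.environ instead of a literal mapping.
min_pass = read_threshold({"EVAL_MIN_PASS": "0.6"})
exit_code, summary = gate(passed=22, total=24, min_pass=min_pass)
print(summary)  # PASS: assertion pass rate 91.7% (threshold 60%)
# A real entrypoint would finish with: sys.exit(exit_code)
```

Gating on the assertion pass rate while only reporting scores lets a known failing case stay visible in every run without blocking CI until the threshold is raised.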

04 · three ways to run

From zero keys to a fleet.

Same entrypoint, three execution tiers. Mode auto-detects from the environment.

no api key

Offline

A FunctionModel stub plays the LLM: it calls tools and emits structured output, deterministically. The eval harness is fully exercised for free.
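The idea of a deterministic stand-in that carries a deliberate bug can be sketched with a plain function. This is not the repo's FunctionModel: `stub_model`, the `TriageResult` fields, and the two non-buggy branches are hypothetical. Only the buggy answer and its 0.9 confidence come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriageResult:
    category: str
    suggested_fix: str
    confidence: float
    escalate: bool


def stub_model(report: str, log: Optional[str]) -> TriageResult:
    """Deterministic stand-in for the LLM: same input, same output, no network."""
    if log is None:
        # Deliberate bug: with no evidence it guesses confidently instead of escalating.
        return TriageResult(
            category="infrastructure",
            suggested_fix="probably an infrastructure hiccup, restart the agent pool",
            confidence=0.9,
            escalate=False,
        )
    if "out of memory" in log.lower():
        return TriageResult("oom", "give the link step more memory", 0.8, False)
    # Anything else is handed to a human.
    return TriageResult("unknown", "", 0.3, True)


answer = stub_model("The build is broken AGAIN!!! Just fix it.", log=None)
print(answer.escalate, answer.confidence)  # False 0.9
```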

# 10 seconds, zero keys
$ uv sync
$ uv run python -m evals.run_evals
cost $0 · runs anywhere, incl. CI
anthropic key

Live

Real claude-haiku-4-5 drives the agent; an LLMJudge evaluator joins the panel to grade fix quality against a rubric.

$ export ANTHROPIC_API_KEY=…
$ uv run python -m evals.run_evals
# auto-switches to live mode
adds LLMJudge · model-graded rubric
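The auto-switch between the offline and live tiers could be as simple as checking for the key. This is a sketch under that assumption; `detect_mode` and `evaluator_names` are hypothetical helpers, not the repo's functions.

```python
import os
from collections.abc import Mapping


def detect_mode(env: Mapping[str, str]) -> str:
    """Pick "live" when a non-empty Anthropic key is present, else "offline"."""
    return "live" if env.get("ANTHROPIC_API_KEY", "").strip() else "offline"


def evaluator_names(mode: str) -> list[str]:
    """Deterministic evaluators always run; the judge joins only in live mode."""
    names = ["CategoryMatches", "FixMentions", "EscalationPolicy", "NoDestructiveGuessing"]
    if mode == "live":
        names.append("LLMJudge")
    return names


print(detect_mode({}))                                   # offline
print(detect_mode({"ANTHROPIC_API_KEY": "sk-example"}))  # live
print(evaluator_names("live")[-1])                       # LLMJudge

# A real entrypoint would read the process environment.
current_mode = detect_mode(os.environ)
```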
disposable box

Remote · crabbox

Lease a box, rsync the dirty checkout, bootstrap uv, run, stream the report, release. Swap providers to move from a local container to a real fleet.

$ crabbox job run evals
# or explicitly:
$ crabbox run -provider apple-container \
    -- bash scripts/run_evals_remote.sh
measured cold run 6.2s end-to-end
05 · the throughline

Simpler systems, honest evals, and a human in the loop will take you further than clever architecture.

Read the code: github.com/zozo123/demo-evals-pydantic-islo-crabbox · Docs: Pydantic Evals · Pydantic AI · crabbox