Realistic evals, or you're blind.
A runnable demo built from the seven anti-lessons of "7 Anti-Lessons from Building a Pydantic AI Agent" (PyCon DE): one agent, two tools, a markdown workflow — and an eval suite of real user journeys that catches the bug manual testing never will. Then the whole thing runs on a disposable Linux box in 6.2 seconds.
"Unlike unit tests, evals are an emerging art/science. Anyone who claims to know exactly how your evals should be defined can safely be ignored." (Pydantic Evals documentation, quoted on stage at PyCon DE & PyData 2026)
Seven myths, deleted.
Every anti-lesson from the talk maps to a concrete artifact in this repo. The architecture is the absence of architecture.
agent/triage_agent.py: exactly one Agent
search_runbooks + get_build_context
agent/workflows/triage.md
evals/dataset.py: vague, angry, ambiguous cases
EscalationPolicy evaluator, asserted in CI
evals/evaluators.py vs LLMJudge (live mode)
The failing row is the point.
The offline model stub ships a deliberate bug: on a vague report with no log, it confidently guesses instead of escalating. Six realistic journeys, every run:
"The build is broken AGAIN!!! Just fix it." — no log, no build id. The agent answered "probably an infrastructure hiccup, restart the agent pool" at 0.9 confidence. A happy-path manual test never executes this journey. The EscalationPolicy evaluator does — every single run. That's the difference between testing and seeing.
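The check that catches this row can be very small. Below is a minimal, library-free sketch of the idea, not the repo's actual EscalationPolicy: the `TriageOutput` shape and its `escalate` field are assumptions made for illustration, and only `must_escalate` comes from the case metadata described on this page.

```python
from dataclasses import dataclass


@dataclass
class TriageOutput:
    """Hypothetical shape of the agent's structured answer (an assumption)."""
    answer: str
    confidence: float
    escalate: bool


def escalation_policy(output: TriageOutput, metadata: dict) -> bool:
    """Pass unless the case requires escalation and the agent did not escalate."""
    if metadata.get("must_escalate", False):
        return output.escalate
    return True


# The failing journey: no log, no build id, yet a confident guess.
vague_case_metadata = {"must_escalate": True}
confident_guess = TriageOutput(
    answer="probably an infrastructure hiccup, restart the agent pool",
    confidence=0.9,
    escalate=False,
)

passed = escalation_policy(confident_guess, vague_case_metadata)
print(passed)
```

Because the policy is a plain boolean over the output and the metadata, it can run on every case in every run at no model cost.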
Assertions gate. Scores trend.
Expectations live in case metadata, not expected outputs — the dataset stays declarative and partial credit is possible.
Cases
Six journeys developers actually have at 2am — including the vague and the
angry ones. metadata carries expected_category,
must_escalate, fix_keywords.
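To illustrate what "expectations in metadata" can look like, here is a hedged sketch using plain dataclasses rather than the real dataset types. `JourneyCase` and `applicable_checks` are hypothetical names, and the second case is invented for the example; only the metadata keys and the first report text come from this page.

```python
from dataclasses import dataclass, field


@dataclass
class JourneyCase:
    """Hypothetical container: a user report plus declarative expectations."""
    name: str
    report: str
    metadata: dict = field(default_factory=dict)


CASES = [
    # The vague, angry journey: the only expectation is that the agent escalates.
    JourneyCase(
        name="vague_angry_no_log",
        report="The build is broken AGAIN!!! Just fix it.",
        metadata={"must_escalate": True},
    ),
    # Invented example of a well-specified report with a known fix.
    JourneyCase(
        name="missing_dependency",
        report="CI fails with ModuleNotFoundError: No module named 'yaml'",
        metadata={
            "expected_category": "dependency",
            "must_escalate": False,
            "fix_keywords": ["install", "pyyaml"],
        },
    ),
]


def applicable_checks(case: JourneyCase) -> list[str]:
    """List which expectations a case declares; absent keys are simply not graded."""
    return sorted(case.metadata)


for case in CASES:
    print(case.name, applicable_checks(case))
```

A case declares only what it cares about, so no full expected output has to be written, and evaluators can award partial credit per key.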
Deterministic evaluators
CategoryMatches (1.0 / 0.5 / 0.0), FixMentions,
and two policy assertions: EscalationPolicy,
NoDestructiveGuessing.
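The scoring evaluators might be sketched as below. This page does not say what earns the 0.5 in CategoryMatches, so the "same family, different subtype" rule and the `family/subtype` category format are assumptions, as is scoring FixMentions as the fraction of keywords found.

```python
def category_matches(predicted: str, expected: str) -> float:
    """Score 1.0 for an exact match, 0.5 for a near miss, 0.0 otherwise.

    Assumption: categories look like "family/subtype", and sharing the
    family counts as a near miss. The real rule may differ.
    """
    if predicted == expected:
        return 1.0
    if predicted.split("/")[0] == expected.split("/")[0]:
        return 0.5
    return 0.0


def fix_mentions(fix_text: str, fix_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the proposed fix."""
    if not fix_keywords:
        return 1.0  # nothing was expected, so nothing is missing
    text = fix_text.lower()
    hits = sum(1 for keyword in fix_keywords if keyword.lower() in text)
    return hits / len(fix_keywords)


print(category_matches("dependency/conflict", "dependency/missing"))
print(fix_mentions("Run pip install pyyaml", ["install", "pyyaml"]))
```

Scores like these are meant to be tracked over time, while the two policy assertions stay strictly pass or fail.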
LLM judge (live)
LLMJudge grades what code can't: is the fix concrete,
runnable, and supported by the evidence? Rubric-driven, model-graded.
CI gate
Exit code from pass rate vs EVAL_MIN_PASS. Honest threshold:
a suite at 100% from day one is measuring nothing.
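A gate of this kind could look roughly like the following sketch. The function names and the 0.8 default are assumptions; only the `EVAL_MIN_PASS` variable and the pass-rate comparison come from the description above.

```python
import os


def min_pass_from_env(default: float = 0.8) -> float:
    """Read the threshold from EVAL_MIN_PASS; the 0.8 default is an assumption."""
    return float(os.environ.get("EVAL_MIN_PASS", default))


def gate_exit_code(assertion_results: list[bool], min_pass: float) -> int:
    """Return 0 when the pass rate meets the threshold, 1 otherwise."""
    if not assertion_results:
        return 1  # an empty suite should not pass silently
    pass_rate = sum(assertion_results) / len(assertion_results)
    return 0 if pass_rate >= min_pass else 1


# Six journeys with one deliberate failure: a pass rate of 5/6.
results = [True, True, True, True, True, False]
print(gate_exit_code(results, min_pass=0.8))
print(gate_exit_code(results, min_pass=0.9))
```

In a CI entrypoint the result would typically be handed to `sys.exit(gate_exit_code(results, min_pass_from_env()))`, so the known-failing row is tolerated at an honest threshold and a regression beyond it breaks the build.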
From zero keys to a fleet.
Same entrypoint, three execution tiers. Mode auto-detects from the environment.
Offline
A FunctionModel stub plays the LLM — calls tools, emits structured
output, deterministic. The eval harness is fully exercised for free.
# 10 seconds, zero keys
$ uv sync
$ uv run python -m evals.run_evals
Live
Real claude-haiku-4-5 drives the agent; an LLMJudge
evaluator joins the panel to grade fix quality against a rubric.
$ export ANTHROPIC_API_KEY=…
$ uv run python -m evals.run_evals   # auto-switches to live mode
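The auto-switch can be as simple as checking for the key. This is a sketch under the assumption that the presence of `ANTHROPIC_API_KEY` is the only signal; `detect_mode` is a hypothetical name, and the repo may consult other settings.

```python
import os
from typing import Mapping, Optional


def detect_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "live" when an Anthropic key is set and non-empty, else "offline"."""
    env = os.environ if env is None else env
    return "live" if env.get("ANTHROPIC_API_KEY") else "offline"


print(detect_mode({}))
print(detect_mode({"ANTHROPIC_API_KEY": "example-key"}))
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.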
Remote · crabbox
Lease a box, rsync the dirty checkout, bootstrap uv, run, stream the report, release. Swap providers — local container to real fleet.
$ crabbox job run evals
# or explicitly:
$ crabbox run -provider apple-container \
    -- bash scripts/run_evals_remote.sh
Simpler systems, honest evals, and a human in the loop will take you further than clever architecture.
Read the code: github.com/zozo123/demo-evals-pydantic-islo-crabbox · Docs: Pydantic Evals · Pydantic AI · crabbox