Lab 3: Text-to-SQL Agent + Eval Harness

Overview

Overview & Goals

Learning objectives — after this lab you can…

Design a BIRD-style evaluation set whose questions exercise real enterprise ambiguity, not toy joins.
Implement an execution-accuracy grader over a pinned data snapshot, and defend it against false positives.
Build an agent loop with schema retrieval, SQL generation, and self-checking repair under explicit token and call budgets.
Author a Cube-style semantic layer and quantify exactly how much accuracy each of its ingredients buys.
Report results as ablation tables broken down by error class, instead of a single headline number.

In Week 9 lecture we traced the now-famous curve: a raw LLM pointed at an enterprise warehouse answers roughly 21% of analyst questions correctly, and a system wrapped in retrieval, a semantic layer, and self-checking execution reaches roughly 95% on the same questions. Vendors show you the endpoints. This lab makes you build the whole curve — every intermediate point, with your own grader, on a schema we have salted with the same traps that make real warehouses hostile: redundant metrics, contested definitions, and dead tables nobody deleted.

The deliverable is not “an agent.” It is a measured system: a 50-question eval set with gold SQL and gold answers, a grader you have stress-tested, an agent you can ablate, and a table that says — per error class — what each component is worth. We grade the table, not the demo.

Vendors show you the endpoints of the curve. You will build every point in between.

Non-goals & forbidden shortcuts

No fine-tuning. You use the course-pinned model via the proxy (DATA2027_LLM_BASE_URL), temperature 0, fixed seed. The lab is about systems around the model, not the model.
No gold SQL anywhere near the agent. Gold queries may not appear in prompts, retrieval corpora, few-shot examples, or the semantic layer. Automated leakage checks run at grading time (see Pitfalls).
No per-question hints. The semantic layer must be written against the schema, not against your 50 questions. If a YAML entry quotes question text, it is leakage.
No string-matching graders. SQL text equality and fuzzy SQL similarity are both forbidden as correctness signals. Execution accuracy only.
No schema edits. The snapshot is read-only; its SHA-256 is pinned in MANIFEST. Creating views inside your own session is allowed; persisting them is not.

Provided materials

The Warehouse You Are Given

You receive a snapshot of the data warehouse of Vantage Retail Group, a synthetic mid-size retailer: 126 tables across the schemas finance, sales, orders, returns, marketing, web, hr, and ops, plus a legacy graveyard left in place. The data is synthetic but the dysfunction is faithfully reproduced. Three traps are documented here because every real warehouse has them; finding the rest is part of the lab.

Three revenue variants. sales.rev_billed is invoice-line grain on the billed date and includes tax and shipping; finance.bookings is order-header grain at placement, includes later-cancelled orders, and is in local currency; finance.revenue_recognized is accrual-based, lags shipment by up to 30 days, and is the canonical definition for almost all reporting questions. A naive agent picks whichever it retrieves first.
Two definitions of “active user.” web.active_users_daily counts any tracked event in a 1-day window, bots included; marketing.active_users is a daily snapshot of a 30-day rolling window with fraud-flagged accounts excluded. Both are legitimate. Questions that say “active users” without qualification are answerable only with a stated definition — your semantic layer must pick one and say so.
Deprecated tables left in place. Fourteen legacy tables (*_old, *_v1, *_bak) were frozen at various cutover dates — e.g. finance.revenue_recognized_v1 at 2024-06-30, orders.orders_old at 2024-09-30 — and never dropped. They have plausible names, no FK constraints, and silently stale data. Any query touching them after their freeze date is wrong by construction.

File	What it is	You may modify?
`vrg_snapshot_2025-11-30.duckdb`	Pinned snapshot, logical date frozen at 2025-11-30; SHA-256 in `MANIFEST`	No — read-only
`schema_student.sql`	Full DDL for all 126 tables: columns, types, defaults, free-text column comments (some comments are stale on purpose)	No
`grader/grade.py`	Grader stub with the `check()` signature below and 4 example test cases	Yes — you implement it
`agent/`	Empty package + course proxy client; budget accounting built in	Yes — your code
`eval/questions.jsonl`	Empty; Milestone 1 fills it with 50 records	Yes — you write it
`semantic_layer.yaml`	Empty; Milestone 3	Yes — you write it

Fig. 3.1 — Lab 3 system. Top: the agent loop (retrieve → generate → self-check), optionally fed by the semantic layer. Bottom: the eval harness, which runs every configuration against the same pinned snapshot and emits per-error-class counts. Both halves are graded.

Milestone 1 · end of Week 9, Wed

Eval Set & Grader

Write 50 questions in BIRD style: a natural-language question, a difficulty rating, gold SQL, and the expected result shape; gold answers are materialized once from the snapshot and frozen. Distribution constraints — at least 10 questions joining 3+ tables; at least 8 that hit a trap metric (revenue or active-user); at least 5 unanswerable without choosing between contested definitions; at least 5 whose naive answer comes from a deprecated table. Every gold query must be deterministic on the snapshot (no now(), no unordered LIMIT without a tiebreaker).

One record per line in eval/questions.jsonl, exactly these fields:

# eval/questions.jsonl — one JSON object per line
{"id": "q06",
 "difficulty": "medium",
 "question": "What was total revenue in Q3 2025?",
 "gold_sql": "SELECT SUM(amount_usd) FROM finance.revenue_recognized ...",
 "expected_shape": "scalar"}              # or "table"

# grader/grade.py — the contract you implement
def check(qid: str, predicted_sql: str,
          conn: duckdb.DuckDBPyConnection) -> Verdict: ...

Verdict = {"ok": bool,
           "error_class": None | "wrong-join" | "wrong-metric"
                        | "hallucinated-column" | "stale-table" | "other"}

Grader semantics (normative). A prediction is correct iff it executes without error and its result matches the gold answer as an unordered multiset of rows (ordered comparison when the question implies ranking), after normalization: project to gold arity by position, compare floats with tolerance 1e-6 relative or 1e-9 absolute, render dates as ISO-8601, treat NULL as distinct from 0 and from the empty string. An empty predicted result matching an empty gold result is not automatically correct — see Pitfalls. On failure, the classifier assigns one error class by precedence: hallucinated-column (binder error on a nonexistent column/table) → stale-table (any reference into a frozen legacy table — *_old, *_v1, *_bak) → wrong-metric (correct row grain, wrong measure column or aggregate) → wrong-join (row-count or grain mismatch traceable to join structure) → other.

Ship an adversarial grader test suite: ≥10 hand-built near-miss cases (wrong query, right-looking answer) that your grader must reject, checked in under grader/tests/.

Milestone 2 · end of Week 10, Mon

The Agent Baseline

Implement the agent loop in agent/. Three stages are required; how you implement each is your design space, and your report must defend the choices.

Schema retrieval. You may not stuff all 126 tables into context. Hard budget: at most 20 tables and 8,000 schema tokens per question. Embedding search over schema_student.sql, keyword match, FK-graph expansion — anything goes within budget, and the budget is enforced by the proxy.
SQL generation. Single dialect (DuckDB). The generator sees the question, the retrieved schema slice, and — in semantic-layer configurations only — the relevant YAML entries.
Self-check. Execute the candidate with a LIMIT 20 probe. On binder errors, runtime errors, empty results, or a self-assessed implausibility (your heuristic — document it), revise. At most 3 repair rounds; at most 12 LLM calls per question end-to-end; at most 16k context tokens per call. Exceeding any budget marks the question failed.

Run the full eval under configuration A1 (retrieval + self-check, no semantic layer) and the degenerate A0 (no retrieval: first 20 tables alphabetically — the honest stand-in for “dump whatever fits”). Expect A0 to land near the bottom of the 21→95 curve; if A0 scores above 50%, your eval set is too easy and will be graded as such. Emit runs/<config>/predictions.jsonl with fields qid, sql, config, n_llm_calls, latency_ms.

Milestone 3 · end of Week 10, Fri

Semantic Layer & the Ablation Table

Write semantic_layer.yaml: a Cube-style document defining canonical measures, dimensions, join paths, and explicit anti-patterns. It is a contract about the schema, written as if for human analysts; it must never reference your eval questions. Format (normative — the autograder parses it):

# semantic_layer.yaml — Cube-style, parsed by the autograder
version: 1
dialect: duckdb
snapshot_logical_date: 2025-11-30

cubes:
  - name: revenue
    sql_table: finance.revenue_recognized
    description: >
      Canonical revenue: accrual basis, net of contract discounts.
      Use for any unqualified "revenue" question.
    filters:
      - sql: "posting_status = 'posted'"
    measures:
      - { name: amount, sql: amount_usd, type: sum, format: usd }
    dimensions:
      - { name: recognized_date, sql: recognized_date, type: time }
      - { name: cost_center_id,  sql: cost_center_id,  type: number }
    joins:
      - name: order_lines
        sql: "{revenue}.order_line_id = {order_lines}.order_line_id"
        relationship: many_to_one
    anti_patterns:
      - "sales.rev_billed is invoice-line grain on the billed date and
         INCLUDES tax and shipping; it is billing, not revenue."
      - "finance.revenue_recognized_v1 was frozen at the 2024-06-30
         ERP cutover; never query it."

Then run every configuration below against the same 50 questions and fill the ablation table. This table, completed with your numbers and a per-class reading of why each delta happened, is the centerpiece of your report. Reproduce the lecture’s 21→95 curve in miniature: A0 should sit near the left end, A3 near the right, and the table should show which component moved which error class.

Config	Components	Exec acc (%)	wrong-join	wrong-metric	halluc. column	stale-table
`A0`	no retrieval, no layer, no self-check	—	—	—	—	—
`A1`	retrieval + self-check	—	—	—	—	—
`A2`	A1 + semantic layer (descriptions + joins only)	—	—	—	—	—
`A3`	A2 + anti-patterns	—	—	—	—	—
`A3−sc`	A3 with self-check disabled	—	—	—	—	—

Report token and call costs per configuration alongside accuracy. A point on the Pareto frontier (accuracy vs. mean LLM calls per question) is worth more than a higher point bought with an unbounded loop.

Grading

Rubric

Component	Points	What we look at
Eval set quality (M1)	15	Distribution constraints met; gold SQL deterministic; traps genuinely exercised, not name-checked
Grader correctness (M1)	15	Normalization rules implemented as specified; adversarial suite present and passing; no string-match shortcuts
Agent baseline (M2)	15	Budgets enforced; A0/A1 reproducible from tag; self-check heuristic documented
Semantic layer (M3)	15	Parses; covers contested metrics with stated definitions; schema-level, no question leakage
Ablation & analysis	25	Table complete; per-error-class deltas explained mechanistically; Pareto framing of cost vs. accuracy
Report & reproducibility	10	`make reproduce` regenerates every number in the report from the tag, on the pinned snapshot
Checkpoint questions	5	Submitted on time at each milestone; engaged, not pro-forma

Total: 100. Headline accuracy is deliberately not a rubric line: a team reporting an honest 71% with a sharp error-class analysis outscores a team reporting 94% from a leaky grader. We re-run your harness; numbers that don’t reproduce score the row at zero.

Pitfalls

Where Previous Cohorts Lost Points

Pitfall · eval leakage The subtle version is not pasting gold SQL into a prompt — the autograder catches that. It is writing your semantic layer after watching the agent fail on your own questions, then encoding question-specific fixes as “documentation.” The check: every YAML entry must be justifiable to an analyst who has never seen your eval set. We diff your layer against your question text and flag near-verbatim overlap; we also hold out 10 instructor questions on the same schema, and a large gap between your-50 and our-10 accuracy is treated as leakage.

Pitfall · grader false positives Execution accuracy overcounts in predictable ways: two different wrong queries both return empty; a wrong aggregate coincidentally equals the right one on this snapshot; a query returns the right number wrapped in extra columns your projection-by-arity silently drops. Your adversarial suite exists to catch exactly these. A grader that passes a near-miss case loses the full grader-correctness component, because every downstream number becomes unfalsifiable.

Pitfall · nondeterminism now(), unseeded sampling, and tie-broken LIMIT are the classic ways a gold answer drifts between your laptop and our re-run. The snapshot’s logical date is 2025-11-30; write gold SQL against that constant, and add deterministic tiebreakers to every ranking question.

Checkpoints

Checkpoint Questions

Submit short written answers (½ page each) with the corresponding milestone. They are graded for engagement with your own measurements, not for matching an answer key.

Checkpoint 3.1 · with Milestone 1

Execution accuracy on a pinned snapshot can still mark wrong queries correct. Construct two concrete pairs (gold SQL, wrong-but-passing SQL) against the Vantage schema, explain the coincidence each exploits, and state which normalization rule or adversarial test in your grader closes the hole — or why it can’t be closed without a second snapshot.

Checkpoint 3.2 · with Milestone 2

Present the error-class histogram for A1. Before running Milestone 3, predict in writing which classes the semantic layer will reduce and which it cannot touch, with a mechanistic reason per class (e.g., can documentation fix a wrong-join caused by grain mismatch, or only a wrong-metric?). You will be graded on the quality of the reasoning, then asked to revisit the prediction in your final report.

Checkpoint 3.3 · with Milestone 3

Your 21→95 in miniature: using the ablation table, identify the single semantic-layer ingredient (descriptions, join paths, or anti-patterns) that bought the most accuracy per added context token. Where does your A3 sit on the accuracy-vs-cost Pareto frontier relative to A2, and is A3−sc evidence that self-check and the semantic layer are substitutes or complements on this schema?

❦

Provided Materials

Starter files

→

schema.sqlFull warehouse DDL — 126 tables across finance, sales, orders, returns, marketing, web, hr, ops plus a deprecated graveyard; instructor edition with -- TRAP annotations (stripped from the student build)

→

questions.jsonl50 BIRD-style eval questions with gold SQL (DuckDB dialect) and expected result shape; 12+ deliberately hit the salted traps

→

semantic_layer.yamlInstructor reference semantic layer in the Cube-style format the autograder parses — governed revenue, active-user, return-rate, and AOV definitions with anti-patterns and a deprecated-table list

→

grader/grade.pyExecution-accuracy grader stub — loads the schema into DuckDB, grades an answers JSONL against the gold queries (PASS / WRONG_RESULT / BINDER_ERROR / STALE_TABLE), with four example test cases via --self-test

→

README.mdLoading the DDL into DuckDB/Postgres, how the grader pins the snapshot by SHA-256, and the exact ablation-table format for report/ablation.md