Overview & Goals
- Design a BIRD-style evaluation set whose questions exercise real enterprise ambiguity, not toy joins.
- Implement an execution-accuracy grader over a pinned data snapshot, and defend it against false positives.
- Build an agent loop with schema retrieval, SQL generation, and self-checking repair under explicit token and call budgets.
- Author a Cube-style semantic layer and quantify exactly how much accuracy each of its ingredients buys.
- Report results as ablation tables broken down by error class, instead of a single headline number.
In Week 9 lecture we traced the now-famous curve: a raw LLM pointed at an enterprise warehouse answers roughly 21% of analyst questions correctly, and a system wrapped in retrieval, a semantic layer, and self-checking execution reaches roughly 95% on the same questions. Vendors show you the endpoints. This lab makes you build the whole curve — every intermediate point, with your own grader, on a schema we have salted with the same traps that make real warehouses hostile: redundant metrics, contested definitions, and dead tables nobody deleted.
The deliverable is not “an agent.” It is a measured system: a 50-question eval set with gold SQL and gold answers, a grader you have stress-tested, an agent you can ablate, and a table that says — per error class — what each component is worth. We grade the table, not the demo.
Non-goals & forbidden shortcuts
- No fine-tuning. You use the course-pinned model via the proxy (
DATA2027_LLM_BASE_URL), temperature 0, fixed seed. The lab is about systems around the model, not the model. - No gold SQL anywhere near the agent. Gold queries may not appear in prompts, retrieval corpora, few-shot examples, or the semantic layer. Automated leakage checks run at grading time (see Pitfalls).
- No per-question hints. The semantic layer must be written against the schema, not against your 50 questions. If a YAML entry quotes question text, it is leakage.
- No string-matching graders. SQL text equality and fuzzy SQL similarity are both forbidden as correctness signals. Execution accuracy only.
- No schema edits. The snapshot is read-only; its SHA-256 is pinned in
MANIFEST. Creating views inside your own session is allowed; persisting them is not.
The Warehouse You Are Given
You receive a snapshot of the data warehouse of Vantage Retail Group, a synthetic mid-size retailer: 126 tables across the schemas finance, sales, orders, returns, marketing, web, hr, and ops, plus a legacy graveyard left in place. The data is synthetic but the dysfunction is faithfully reproduced. Three traps are documented here because every real warehouse has them; finding the rest is part of the lab.
- Three revenue variants.
sales.rev_billedis invoice-line grain on the billed date and includes tax and shipping;finance.bookingsis order-header grain at placement, includes later-cancelled orders, and is in local currency;finance.revenue_recognizedis accrual-based, lags shipment by up to 30 days, and is the canonical definition for almost all reporting questions. A naive agent picks whichever it retrieves first. - Two definitions of “active user.”
web.active_users_dailycounts any tracked event in a 1-day window, bots included;marketing.active_usersis a daily snapshot of a 30-day rolling window with fraud-flagged accounts excluded. Both are legitimate. Questions that say “active users” without qualification are answerable only with a stated definition — your semantic layer must pick one and say so. - Deprecated tables left in place. Fourteen legacy tables (
*_old,*_v1,*_bak) were frozen at various cutover dates — e.g.finance.revenue_recognized_v1at 2024-06-30,orders.orders_oldat 2024-09-30 — and never dropped. They have plausible names, no FK constraints, and silently stale data. Any query touching them after their freeze date is wrong by construction.
| File | What it is | You may modify? |
|---|---|---|
vrg_snapshot_2025-11-30.duckdb | Pinned snapshot, logical date frozen at 2025-11-30; SHA-256 in MANIFEST | No — read-only |
schema_student.sql | Full DDL for all 126 tables: columns, types, defaults, free-text column comments (some comments are stale on purpose) | No |
grader/grade.py | Grader stub with the check() signature below and 4 example test cases | Yes — you implement it |
agent/ | Empty package + course proxy client; budget accounting built in | Yes — your code |
eval/questions.jsonl | Empty; Milestone 1 fills it with 50 records | Yes — you write it |
semantic_layer.yaml | Empty; Milestone 3 | Yes — you write it |
Eval Set & Grader
Write 50 questions in BIRD style: a natural-language question, a difficulty rating, gold SQL, and the expected result shape; gold answers are materialized once from the snapshot and frozen. Distribution constraints — at least 10 questions joining 3+ tables; at least 8 that hit a trap metric (revenue or active-user); at least 5 unanswerable without choosing between contested definitions; at least 5 whose naive answer comes from a deprecated table. Every gold query must be deterministic on the snapshot (no now(), no unordered LIMIT without a tiebreaker).
One record per line in eval/questions.jsonl, exactly these fields:
# eval/questions.jsonl — one JSON object per line
{"id": "q06",
"difficulty": "medium",
"question": "What was total revenue in Q3 2025?",
"gold_sql": "SELECT SUM(amount_usd) FROM finance.revenue_recognized ...",
"expected_shape": "scalar"} # or "table"
# grader/grade.py — the contract you implement
def check(qid: str, predicted_sql: str,
conn: duckdb.DuckDBPyConnection) -> Verdict: ...
Verdict = {"ok": bool,
"error_class": None | "wrong-join" | "wrong-metric"
| "hallucinated-column" | "stale-table" | "other"}
Grader semantics (normative). A prediction is correct iff it executes without error and its result matches the gold answer as an unordered multiset of rows (ordered comparison when the question implies ranking), after normalization: project to gold arity by position, compare floats with tolerance 1e-6 relative or 1e-9 absolute, render dates as ISO-8601, treat NULL as distinct from 0 and from the empty string. An empty predicted result matching an empty gold result is not automatically correct — see Pitfalls. On failure, the classifier assigns one error class by precedence: hallucinated-column (binder error on a nonexistent column/table) → stale-table (any reference into a frozen legacy table — *_old, *_v1, *_bak) → wrong-metric (correct row grain, wrong measure column or aggregate) → wrong-join (row-count or grain mismatch traceable to join structure) → other.
Ship an adversarial grader test suite: ≥10 hand-built near-miss cases (wrong query, right-looking answer) that your grader must reject, checked in under grader/tests/.
The Agent Baseline
Implement the agent loop in agent/. Three stages are required; how you implement each is your design space, and your report must defend the choices.
- Schema retrieval. You may not stuff all 126 tables into context. Hard budget: at most 20 tables and 8,000 schema tokens per question. Embedding search over
schema_student.sql, keyword match, FK-graph expansion — anything goes within budget, and the budget is enforced by the proxy. - SQL generation. Single dialect (DuckDB). The generator sees the question, the retrieved schema slice, and — in semantic-layer configurations only — the relevant YAML entries.
- Self-check. Execute the candidate with a
LIMIT 20probe. On binder errors, runtime errors, empty results, or a self-assessed implausibility (your heuristic — document it), revise. At most 3 repair rounds; at most 12 LLM calls per question end-to-end; at most 16k context tokens per call. Exceeding any budget marks the question failed.
Run the full eval under configuration A1 (retrieval + self-check, no semantic layer) and the degenerate A0 (no retrieval: first 20 tables alphabetically — the honest stand-in for “dump whatever fits”). Expect A0 to land near the bottom of the 21→95 curve; if A0 scores above 50%, your eval set is too easy and will be graded as such. Emit runs/<config>/predictions.jsonl with fields qid, sql, config, n_llm_calls, latency_ms.
Semantic Layer & the Ablation Table
Write semantic_layer.yaml: a Cube-style document defining canonical measures, dimensions, join paths, and explicit anti-patterns. It is a contract about the schema, written as if for human analysts; it must never reference your eval questions. Format (normative — the autograder parses it):
# semantic_layer.yaml — Cube-style, parsed by the autograder
version: 1
dialect: duckdb
snapshot_logical_date: 2025-11-30
cubes:
- name: revenue
sql_table: finance.revenue_recognized
description: >
Canonical revenue: accrual basis, net of contract discounts.
Use for any unqualified "revenue" question.
filters:
- sql: "posting_status = 'posted'"
measures:
- { name: amount, sql: amount_usd, type: sum, format: usd }
dimensions:
- { name: recognized_date, sql: recognized_date, type: time }
- { name: cost_center_id, sql: cost_center_id, type: number }
joins:
- name: order_lines
sql: "{revenue}.order_line_id = {order_lines}.order_line_id"
relationship: many_to_one
anti_patterns:
- "sales.rev_billed is invoice-line grain on the billed date and
INCLUDES tax and shipping; it is billing, not revenue."
- "finance.revenue_recognized_v1 was frozen at the 2024-06-30
ERP cutover; never query it."
Then run every configuration below against the same 50 questions and fill the ablation table. This table, completed with your numbers and a per-class reading of why each delta happened, is the centerpiece of your report. Reproduce the lecture’s 21→95 curve in miniature: A0 should sit near the left end, A3 near the right, and the table should show which component moved which error class.
| Config | Components | Exec acc (%) | wrong-join | wrong-metric | halluc. column | stale-table |
|---|---|---|---|---|---|---|
A0 | no retrieval, no layer, no self-check | — | — | — | — | — |
A1 | retrieval + self-check | — | — | — | — | — |
A2 | A1 + semantic layer (descriptions + joins only) | — | — | — | — | — |
A3 | A2 + anti-patterns | — | — | — | — | — |
A3−sc | A3 with self-check disabled | — | — | — | — | — |
Report token and call costs per configuration alongside accuracy. A point on the Pareto frontier (accuracy vs. mean LLM calls per question) is worth more than a higher point bought with an unbounded loop.
Rubric
| Component | Points | What we look at |
|---|---|---|
| Eval set quality (M1) | 15 | Distribution constraints met; gold SQL deterministic; traps genuinely exercised, not name-checked |
| Grader correctness (M1) | 15 | Normalization rules implemented as specified; adversarial suite present and passing; no string-match shortcuts |
| Agent baseline (M2) | 15 | Budgets enforced; A0/A1 reproducible from tag; self-check heuristic documented |
| Semantic layer (M3) | 15 | Parses; covers contested metrics with stated definitions; schema-level, no question leakage |
| Ablation & analysis | 25 | Table complete; per-error-class deltas explained mechanistically; Pareto framing of cost vs. accuracy |
| Report & reproducibility | 10 | make reproduce regenerates every number in the report from the tag, on the pinned snapshot |
| Checkpoint questions | 5 | Submitted on time at each milestone; engaged, not pro-forma |
Total: 100. Headline accuracy is deliberately not a rubric line: a team reporting an honest 71% with a sharp error-class analysis outscores a team reporting 94% from a leaky grader. We re-run your harness; numbers that don’t reproduce score the row at zero.
Where Previous Cohorts Lost Points
now(), unseeded sampling, and tie-broken LIMIT are the classic ways a gold answer drifts between your laptop and our re-run. The snapshot’s logical date is 2025-11-30; write gold SQL against that constant, and add deterministic tiebreakers to every ranking question.
Checkpoint Questions
Submit short written answers (½ page each) with the corresponding milestone. They are graded for engagement with your own measurements, not for matching an answer key.
Execution accuracy on a pinned snapshot can still mark wrong queries correct. Construct two concrete pairs (gold SQL, wrong-but-passing SQL) against the Vantage schema, explain the coincidence each exploits, and state which normalization rule or adversarial test in your grader closes the hole — or why it can’t be closed without a second snapshot.
Present the error-class histogram for A1. Before running Milestone 3, predict in writing which classes the semantic layer will reduce and which it cannot touch, with a mechanistic reason per class (e.g., can documentation fix a wrong-join caused by grain mismatch, or only a wrong-metric?). You will be graded on the quality of the reasoning, then asked to revisit the prediction in your final report.
Your 21→95 in miniature: using the ablation table, identify the single semantic-layer ingredient (descriptions, join paths, or anti-patterns) that bought the most accuracy per added context token. Where does your A3 sit on the accuracy-vs-cost Pareto frontier relative to A2, and is A3−sc evidence that self-check and the semantic layer are substitutes or complements on this schema?