DATA 2027 DATA SYSTEMS IN THE AGENTIC ERA ← SCHEDULE
Lab 3 · Weeks 9–10

Text-to-SQL Agent + Eval Harness

You will build an analytics agent over a deliberately hostile 126-table enterprise schema — and, more importantly, the evaluation harness that tells you whether it actually works. The agent is half the lab; the measurement is the other half.

Due: Fri Week 10, 23:59 · Weight: 10% of course grade (¼ of the 40% lab component) · Teams: pairs  ·  Submit: Git tag lab3-final + report/ablation.md
Overview

Overview & Goals

Learning objectives — after this lab you can…
  • Design a BIRD-style evaluation set whose questions exercise real enterprise ambiguity, not toy joins.
  • Implement an execution-accuracy grader over a pinned data snapshot, and defend it against false positives.
  • Build an agent loop with schema retrieval, SQL generation, and self-checking repair under explicit token and call budgets.
  • Author a Cube-style semantic layer and quantify exactly how much accuracy each of its ingredients buys.
  • Report results as ablation tables broken down by error class, instead of a single headline number.

In Week 9 lecture we traced the now-famous curve: a raw LLM pointed at an enterprise warehouse answers roughly 21% of analyst questions correctly, and a system wrapped in retrieval, a semantic layer, and self-checking execution reaches roughly 95% on the same questions. Vendors show you the endpoints. This lab makes you build the whole curve — every intermediate point, with your own grader, on a schema we have salted with the same traps that make real warehouses hostile: redundant metrics, contested definitions, and dead tables nobody deleted.

The deliverable is not “an agent.” It is a measured system: a 50-question eval set with gold SQL and gold answers, a grader you have stress-tested, an agent you can ablate, and a table that says — per error class — what each component is worth. We grade the table, not the demo.

Vendors show you the endpoints of the curve. You will build every point in between.

Non-goals & forbidden shortcuts

Provided materials

The Warehouse You Are Given

You receive a snapshot of the data warehouse of Vantage Retail Group, a synthetic mid-size retailer: 126 tables across the schemas finance, sales, orders, returns, marketing, web, hr, and ops, plus a legacy graveyard left in place. The data is synthetic but the dysfunction is faithfully reproduced. Three traps are documented here because every real warehouse has them; finding the rest is part of the lab.

FileWhat it isYou may modify?
vrg_snapshot_2025-11-30.duckdbPinned snapshot, logical date frozen at 2025-11-30; SHA-256 in MANIFESTNo — read-only
schema_student.sqlFull DDL for all 126 tables: columns, types, defaults, free-text column comments (some comments are stale on purpose)No
grader/grade.pyGrader stub with the check() signature below and 4 example test casesYes — you implement it
agent/Empty package + course proxy client; budget accounting built inYes — your code
eval/questions.jsonlEmpty; Milestone 1 fills it with 50 recordsYes — you write it
semantic_layer.yamlEmpty; Milestone 3Yes — you write it
NL questionanalyst intent schema retriever≤20 tables / 8k tok SQL generatortemp 0, pinned model self-check + run≤3 repair rounds answer repair: error / empty / suspicious result semantic_layer.yaml (Milestone 3, ablatable) pinned snapshotvrg_snapshot_2025-11-30.duckdb questions.jsonl50 q + gold SQL runnerper-config graderexec accuracy error classifier4 classes + other ablationtable
Fig. 3.1 — Lab 3 system. Top: the agent loop (retrieve → generate → self-check), optionally fed by the semantic layer. Bottom: the eval harness, which runs every configuration against the same pinned snapshot and emits per-error-class counts. Both halves are graded.
Milestone 1 · end of Week 9, Wed

Eval Set & Grader

Write 50 questions in BIRD style: a natural-language question, a difficulty rating, gold SQL, and the expected result shape; gold answers are materialized once from the snapshot and frozen. Distribution constraints — at least 10 questions joining 3+ tables; at least 8 that hit a trap metric (revenue or active-user); at least 5 unanswerable without choosing between contested definitions; at least 5 whose naive answer comes from a deprecated table. Every gold query must be deterministic on the snapshot (no now(), no unordered LIMIT without a tiebreaker).

One record per line in eval/questions.jsonl, exactly these fields:

# eval/questions.jsonl — one JSON object per line
{"id": "q06",
 "difficulty": "medium",
 "question": "What was total revenue in Q3 2025?",
 "gold_sql": "SELECT SUM(amount_usd) FROM finance.revenue_recognized ...",
 "expected_shape": "scalar"}              # or "table"

# grader/grade.py — the contract you implement
def check(qid: str, predicted_sql: str,
          conn: duckdb.DuckDBPyConnection) -> Verdict: ...

Verdict = {"ok": bool,
           "error_class": None | "wrong-join" | "wrong-metric"
                        | "hallucinated-column" | "stale-table" | "other"}

Grader semantics (normative). A prediction is correct iff it executes without error and its result matches the gold answer as an unordered multiset of rows (ordered comparison when the question implies ranking), after normalization: project to gold arity by position, compare floats with tolerance 1e-6 relative or 1e-9 absolute, render dates as ISO-8601, treat NULL as distinct from 0 and from the empty string. An empty predicted result matching an empty gold result is not automatically correct — see Pitfalls. On failure, the classifier assigns one error class by precedence: hallucinated-column (binder error on a nonexistent column/table) → stale-table (any reference into a frozen legacy table — *_old, *_v1, *_bak) → wrong-metric (correct row grain, wrong measure column or aggregate) → wrong-join (row-count or grain mismatch traceable to join structure) → other.

Ship an adversarial grader test suite: ≥10 hand-built near-miss cases (wrong query, right-looking answer) that your grader must reject, checked in under grader/tests/.

Milestone 2 · end of Week 10, Mon

The Agent Baseline

Implement the agent loop in agent/. Three stages are required; how you implement each is your design space, and your report must defend the choices.

Run the full eval under configuration A1 (retrieval + self-check, no semantic layer) and the degenerate A0 (no retrieval: first 20 tables alphabetically — the honest stand-in for “dump whatever fits”). Expect A0 to land near the bottom of the 21→95 curve; if A0 scores above 50%, your eval set is too easy and will be graded as such. Emit runs/<config>/predictions.jsonl with fields qid, sql, config, n_llm_calls, latency_ms.

Milestone 3 · end of Week 10, Fri

Semantic Layer & the Ablation Table

Write semantic_layer.yaml: a Cube-style document defining canonical measures, dimensions, join paths, and explicit anti-patterns. It is a contract about the schema, written as if for human analysts; it must never reference your eval questions. Format (normative — the autograder parses it):

# semantic_layer.yaml — Cube-style, parsed by the autograder
version: 1
dialect: duckdb
snapshot_logical_date: 2025-11-30

cubes:
  - name: revenue
    sql_table: finance.revenue_recognized
    description: >
      Canonical revenue: accrual basis, net of contract discounts.
      Use for any unqualified "revenue" question.
    filters:
      - sql: "posting_status = 'posted'"
    measures:
      - { name: amount, sql: amount_usd, type: sum, format: usd }
    dimensions:
      - { name: recognized_date, sql: recognized_date, type: time }
      - { name: cost_center_id,  sql: cost_center_id,  type: number }
    joins:
      - name: order_lines
        sql: "{revenue}.order_line_id = {order_lines}.order_line_id"
        relationship: many_to_one
    anti_patterns:
      - "sales.rev_billed is invoice-line grain on the billed date and
         INCLUDES tax and shipping; it is billing, not revenue."
      - "finance.revenue_recognized_v1 was frozen at the 2024-06-30
         ERP cutover; never query it."

Then run every configuration below against the same 50 questions and fill the ablation table. This table, completed with your numbers and a per-class reading of why each delta happened, is the centerpiece of your report. Reproduce the lecture’s 21→95 curve in miniature: A0 should sit near the left end, A3 near the right, and the table should show which component moved which error class.

ConfigComponentsExec acc (%)wrong-joinwrong-metrichalluc. columnstale-table
A0no retrieval, no layer, no self-check
A1retrieval + self-check
A2A1 + semantic layer (descriptions + joins only)
A3A2 + anti-patterns
A3−scA3 with self-check disabled

Report token and call costs per configuration alongside accuracy. A point on the Pareto frontier (accuracy vs. mean LLM calls per question) is worth more than a higher point bought with an unbounded loop.

Grading

Rubric

ComponentPointsWhat we look at
Eval set quality (M1)15Distribution constraints met; gold SQL deterministic; traps genuinely exercised, not name-checked
Grader correctness (M1)15Normalization rules implemented as specified; adversarial suite present and passing; no string-match shortcuts
Agent baseline (M2)15Budgets enforced; A0/A1 reproducible from tag; self-check heuristic documented
Semantic layer (M3)15Parses; covers contested metrics with stated definitions; schema-level, no question leakage
Ablation & analysis25Table complete; per-error-class deltas explained mechanistically; Pareto framing of cost vs. accuracy
Report & reproducibility10make reproduce regenerates every number in the report from the tag, on the pinned snapshot
Checkpoint questions5Submitted on time at each milestone; engaged, not pro-forma

Total: 100. Headline accuracy is deliberately not a rubric line: a team reporting an honest 71% with a sharp error-class analysis outscores a team reporting 94% from a leaky grader. We re-run your harness; numbers that don’t reproduce score the row at zero.

Pitfalls

Where Previous Cohorts Lost Points

Pitfall · eval leakage The subtle version is not pasting gold SQL into a prompt — the autograder catches that. It is writing your semantic layer after watching the agent fail on your own questions, then encoding question-specific fixes as “documentation.” The check: every YAML entry must be justifiable to an analyst who has never seen your eval set. We diff your layer against your question text and flag near-verbatim overlap; we also hold out 10 instructor questions on the same schema, and a large gap between your-50 and our-10 accuracy is treated as leakage.
Pitfall · grader false positives Execution accuracy overcounts in predictable ways: two different wrong queries both return empty; a wrong aggregate coincidentally equals the right one on this snapshot; a query returns the right number wrapped in extra columns your projection-by-arity silently drops. Your adversarial suite exists to catch exactly these. A grader that passes a near-miss case loses the full grader-correctness component, because every downstream number becomes unfalsifiable.
Pitfall · nondeterminism now(), unseeded sampling, and tie-broken LIMIT are the classic ways a gold answer drifts between your laptop and our re-run. The snapshot’s logical date is 2025-11-30; write gold SQL against that constant, and add deterministic tiebreakers to every ranking question.
Checkpoints

Checkpoint Questions

Submit short written answers (½ page each) with the corresponding milestone. They are graded for engagement with your own measurements, not for matching an answer key.

Checkpoint 3.1 · with Milestone 1

Execution accuracy on a pinned snapshot can still mark wrong queries correct. Construct two concrete pairs (gold SQL, wrong-but-passing SQL) against the Vantage schema, explain the coincidence each exploits, and state which normalization rule or adversarial test in your grader closes the hole — or why it can’t be closed without a second snapshot.

Checkpoint 3.2 · with Milestone 2

Present the error-class histogram for A1. Before running Milestone 3, predict in writing which classes the semantic layer will reduce and which it cannot touch, with a mechanistic reason per class (e.g., can documentation fix a wrong-join caused by grain mismatch, or only a wrong-metric?). You will be graded on the quality of the reasoning, then asked to revisit the prediction in your final report.

Checkpoint 3.3 · with Milestone 3

Your 21→95 in miniature: using the ablation table, identify the single semantic-layer ingredient (descriptions, join paths, or anti-patterns) that bought the most accuracy per added context token. Where does your A3 sit on the accuracy-vs-cost Pareto frontier relative to A2, and is A3−sc evidence that self-check and the semantic layer are substitutes or complements on this schema?

Provided Materials

Starter files

schema.sqlFull warehouse DDL — 126 tables across finance, sales, orders, returns, marketing, web, hr, ops plus a deprecated graveyard; instructor edition with -- TRAP annotations (stripped from the student build)
questions.jsonl50 BIRD-style eval questions with gold SQL (DuckDB dialect) and expected result shape; 12+ deliberately hit the salted traps
semantic_layer.yamlInstructor reference semantic layer in the Cube-style format the autograder parses — governed revenue, active-user, return-rate, and AOV definitions with anti-patterns and a deprecated-table list
grader/grade.pyExecution-accuracy grader stub — loads the schema into DuckDB, grades an answers JSONL against the gold queries (PASS / WRONG_RESULT / BINDER_ERROR / STALE_TABLE), with four example test cases via --self-test
README.mdLoading the DDL into DuckDB/Postgres, how the grader pins the snapshot by SHA-256, and the exact ablation-table format for report/ablation.md