Week 9: Text-to-SQL Is Not Solved; It’s Specified

Learning objectives — after this week you can…

Explain the 86% → 17% cliff between Spider 1.0 and Spider 2.0, and name the four properties of enterprise warehouses that cause it.
Classify a failed agent-generated query into the standard taxonomy: wrong join, wrong metric variant, hallucinated column, stale schema.
Define a semantic layer precisely — metrics, dimensions, grain, governed joins — and explain why it is a queryable contract rather than documentation.
Rank context sources by trustworthiness (semantic layer > lineage > query corpus > business context) and justify the ordering with failure-mode arguments.
Argue, with evidence, why LLM-bootstrapping a semantic layer fails, and where the human/model division of labor actually sits.

Lecture 1 · Tuesday

Why Enterprise Text-to-SQL Fails

For about three years, text-to-SQL looked like a solved problem. On Spider 1.0 — the academic benchmark of record, released in 2018 — frontier models climbed past 86% execution accuracy, and the leaderboard compressed into a fight over decimal points. Then in late 2024 the same research group released Spider 2.0, built from real enterprise warehouses instead of curated classroom databases, and the best agentic frameworks scored roughly 17% at launch. Nothing about the models got worse. The benchmark simply stopped lying about what the task is. Today’s lecture is about what Spider 1.0 was hiding, and why every property it hid is a property of the database environment, not of SQL generation per se.

The cliff: what Spider 2.0 actually changed

Spider 1.0 databases average around 27 columns across 5 tables, with clean names like student.age and one obvious join path between any two tables. Spider 2.0 instances are drawn from production-grade warehouses — BigQuery public datasets, Snowflake deployments, real dbt projects — and they differ on four axes simultaneously:

Scale of schema. Databases routinely carry 1,000–3,000+ columns. A schema dump alone can exceed 100K tokens, so “put the schema in the prompt” stops being a strategy and becomes a retrieval problem. The model must find the right three tables before it can join them, and column-name collisions (amount, amount_usd, amt_net, total_amount_v2) make lexical retrieval actively misleading.
Dialects. Real warehouses speak BigQuery SQL, Snowflake SQL, DuckDB, ClickHouse — each with different date functions, quoting rules, lateral-join syntax, and semi-structured-data operators. A model trained mostly on SQLite-flavored benchmarks emits strftime where it needs FORMAT_TIMESTAMP. Dialect errors are the cheapest failures because they fail loudly; they are also a warning that the training distribution does not match the deployment distribution.
Multi-step workflows. Spider 2.0 tasks often require exploration: inspect INFORMATION_SCHEMA, sample rows to learn the encoding of a status column, build an intermediate result, then write the final query — sometimes 100+ lines with CTEs, window functions, and a pivot. Single-shot generation is structurally incapable here; this is an agent task with a feedback loop, which is exactly why Spider 2.0 was evaluated with agentic frameworks rather than one forward pass.
Business logic absent from the schema. The killer. “Active customer” might mean made a purchase in the trailing 90 days, excluding refunds and test accounts. No DDL says that. The warehouse stores facts; the meaning of the facts lives in dbt models, BI definitions, Slack threads, and the heads of two analysts. A model with perfect SQL skill and zero access to those definitions can only guess — and a fluent guess is worse than an error, because it survives review.

Hold on to that decomposition. Axes one through three are engineering problems — retrieval, dialect adapters, agent loops — and the field is making steady progress on all of them. Axis four is not an engineering problem at all. It is a missing specification, and no amount of model scale fills in a definition that was never written down.

BIRD and the annotation-quality problem

BIRD (NeurIPS 2023) sits between the toy and the warehouse: 12,751 questions over 95 databases totaling 33 GB, with deliberately “dirty” values and an explicit external knowledge field — a hint like “revenue means price * quantity” attached to each question. BIRD was the first major benchmark to admit that the question alone underdetermines the SQL. That admission was the important contribution. But it surfaced a second, more uncomfortable lesson: when later audits re-examined BIRD’s gold queries, a substantial fraction were themselves wrong or ambiguous — wrong joins, questions with multiple defensible readings, gold SQL that disagreed with the stated evidence. Human experts hired to annotate “the correct SQL” for realistic questions could not reliably produce it either.

Students usually read this as a quality-control scandal. Read it instead as a measurement: the inter-annotator agreement on “the right SQL for this business question” is far below 1.0, even among professionals staring at the same schema. If humans can’t agree on the target, an 80%-accurate model isn’t 20% away from solved — it’s being graded against a noisy key. The task as posed (English in, SQL out, no shared definitions) is underspecified, and underspecified tasks have no ceiling to converge to.

Stonebraker’s zero

Michael Stonebraker — who has earned the right to be grumpy about database fashion — reported the bluntest version of the result: his group took state-of-the-art text-to-SQL systems, pointed them at MIT’s own data warehouse with realistic questions from its actual users, and measured an accuracy of approximately zero. Not 17%. Zero. MIT’s warehouse is not adversarial; it is merely ordinary — decades of accreted tables, cryptic column names inherited from a mainframe era, conventions documented nowhere. The benchmarks-to-production gap is not a gap. It is most of the problem.

Field note A team I worked with shipped a text-to-SQL assistant over a 2,400-column SAP-derived warehouse. In the pilot, the model answered “what was Q3 churn?” three times with three different numbers — all syntactically valid, all executed without error, each using a different one of the four churn-adjacent columns left behind by three generations of analysts. The post-mortem’s first finding: the company itself had no single agreed definition of churn. The model had faithfully reproduced the org chart’s disagreement. That post-mortem line — “the model reproduced our ambiguity” — is the thesis of Thursday’s lecture.

A failure taxonomy

When you grade failed enterprise queries by hand — and in Friday’s lab you will — nearly everything lands in four bins. Learn them; they recur in every paper and every production incident review.

Wrong join. The query joins orders to customers through a deprecated bridge table, or fans out a one-to-many and silently double-counts revenue. The SQL is valid; the cardinality is wrong. This is the most dangerous bin because the result is plausible — off by 1.3×, not 1000×.
Wrong metric variant. Gross revenue vs. net of refunds; ARR vs. recognized revenue; calendar month vs. fiscal month. The model picked a revenue, just not the revenue the asker meant. This is axis four wearing a costume.
Hallucinated column. Under retrieval pressure on a 3,000-column schema, the model invents customer.lifetime_value because it plausibly should exist. This one at least fails loudly — the database throws — unless a similarly named column exists, in which case it degrades into bin two.
Stale schema. The context the model saw (a cached schema dump, an old dbt manifest, a six-month-old example query) describes a warehouse that no longer exists. Migrations renamed user_id to account_id; the model writes correct SQL against a ghost. This is a freshness/invalidation problem, and it is the bin that grows fastest as organizations bolt context onto agents without versioning it.

Notice the pattern: only one of the four (hallucinated column) is a “model is dumb” failure, and it is the most fixable. The other three are failures of specification and contract — exactly the things a schema was supposed to provide and, for analytical meaning, never did. The schema tells you what is storable. Nothing machine-readable tells you what is meant. Thursday we fix that.

Lecture 2 · Thursday

The Semantic Layer as Schema-for-Models

Here is the reframe that organizes everything in Part III of this course: a relational schema is a contract about structure, and for fifty years that contract was enough, because the consumers of meaning were humans who carried definitions in their heads. Agents carry nothing. They need the meaning contract written down, machine-readable, and queryable — and that artifact already has a name from the BI world: the semantic layer. It was invented to keep dashboards consistent. It turns out to be the schema that models were missing.

What a semantic layer actually is

Strip away vendor branding and a semantic layer is four kinds of declarations sitting between raw tables and every query consumer:

Metrics — named, versioned computations with their filters baked in. net_revenue is not “whatever SUM the model writes” but a single governed expression: gross minus refunds minus test accounts, defined once.
Dimensions — the legal ways to slice a metric (region, plan tier, fiscal month), each bound to a specific column and casing convention, so “by month” can never mean three different date columns.
Grain — the declared row-level meaning of each table: one row per order, per order line, per customer-day. Grain declarations are what make fan-out double-counting detectable before execution rather than after the board meeting.
Governed joins — the explicit graph of permitted join paths with their cardinalities. If orders → customers is declared many-to-one along account_id, the deprecated bridge table simply is not in the graph, and the wrong-join failure class is closed by construction.

The crucial property is that this is a queryable contract, not documentation. The agent does not read a wiki page about revenue and then freestyle SQL; it issues a request like metric=net_revenue, dims=[region, fiscal_month] and the layer compiles the SQL. A request for a metric that doesn’t exist, a dimension that doesn’t apply, or a join that isn’t governed is rejected at compile time. In dbt’s phrasing, which you should tattoo somewhere visible: the system should “fail loudly rather than be plausibly wrong.” A compile error costs a retry; a plausible wrong number costs trust, and trust is the actual product of an analytics system. A typical metric declaration is small and boring, which is the point:

# semantic_layer/metrics/net_revenue.yml
metric: net_revenue
description: >
  Recognized revenue net of refunds. Excludes internal
  and test accounts (is_test = false). Fiscal calendar.
model: fct_order_lines          # grain: one row per order line
expr: sum(amount_usd) - sum(refund_amount_usd)
filters:
  - field: account_type
    operator: not_in
    values: [internal, test]
dimensions: [fiscal_month, region, plan_tier]
joins:
  - to: dim_customers
    type: many_to_one
    on: account_id
owner: finance-analytics        # a human team, not a model

The evidence: same model, different contract

Hold the model fixed and vary only the contract, and the numbers move more than any architecture change of the last three years. Three data points, in increasing order of experimental cleanliness:

Anthropic’s internal self-service analytics (June 2026 engineering writeup): Claude answering analytics questions over their warehouse scored about 21% with raw schema access and about 95% after the data team added curated “skills” — essentially semantic-layer-shaped documents encoding metric definitions, join guidance, and known pitfalls. Same model, 4.5× the accuracy. The expensive ingredient was analyst time spent writing definitions down, not GPUs.
Cube’s paired benchmark: identical questions, identical model, with and without semantic-layer context. Adding roughly a 4 KB document of metric and join definitions moved accuracy by +17 to +23 percentage points across question categories. Four kilobytes. That is the cheapest accuracy intervention ever recorded in this literature — context-efficiency matters because, per Tuesday, the schema alone can be 100K tokens of mostly noise.
dbt’s production argument: routing agent queries through the semantic layer doesn’t just raise accuracy on answerable questions — it converts a chunk of the unanswerable ones from silent wrong numbers into loud compile failures. The accuracy metric improves twice: more correct answers, and fewer confident wrong ones.

A bigger model makes the guess more fluent. Only a contract makes it unnecessary.

The trust hierarchy

Real deployments feed agents more than one kind of context, and the kinds are not equally trustworthy. The working ordering, which you should be able to defend on an exam:

Semantic layer > lineage > query corpus > business context. The semantic layer is governed and compiled — it cannot silently drift, because changes go through review and broken references fail builds. Lineage (which tables feed which) is automatically derived, hence accurate but meaning-free: it tells you fct_orders descends from raw_orders, not what either means. The historical query corpus is seductive — “here are 10,000 real queries” — but it encodes every past analyst’s mistakes and deprecated patterns with equal confidence; mining it gives you the warehouse’s folklore, not its law. And free-text business context (wiki pages, Slack) is the richest in meaning and the weakest in guarantees: unversioned, contradictory, and stale by default — the natural habitat of the stale-schema failure bin.

Fig. 9.1 — The trust hierarchy of agent context. An agent should resolve a question at the highest layer that can answer it, and fall through deliberately — never blend all four with equal weight. The dashed bottom layer is load-bearing for meaning and structurally unreliable; treat it as hypothesis, not ground truth.

The operational rule that falls out: an agent should attempt resolution top-down and report which layer answered. “Net revenue by region, resolved via semantic layer” deserves different downstream trust than “…inferred from a 2024 query I found in the corpus,” and surfacing the provenance is what lets humans calibrate.

Why LLM-bootstrapping the semantic layer failed

The obvious shortcut occurred to everyone simultaneously around 2025: writing a semantic layer is tedious, models are good at reading schemas and old queries — so have the model generate the semantic layer, then have models query through it. Closed loop, no analysts required. It failed, and it failed for a reason worth stating precisely: the inputs to bootstrapping are exactly the untrusted layers of the hierarchy. A model inferring “what does churn mean here?” from the query corpus and the wiki recovers all four ambient definitions of churn, picks the statistically most common one, and enshrines it with the false authority of a governed artifact. You haven’t removed the ambiguity; you’ve laundered it. The semantic layer’s entire value is that a human with organizational authority decided — finance owns net_revenue, the decision survived review, and dissent happened in a pull request rather than in production dashboards. Deciding is not a prediction task.

The division of labor that works in practice is narrower and more boring: humans own definitions; models draft documentation. A model is excellent at proposing candidate metrics from corpus mining (“these 340 queries all compute something revenue-shaped in three variants — here they are, diffed”), at writing the description fields, at flagging undeclared grain, at noticing that a dimension is referenced but unbound. That’s drafting. The commit — the act that turns a draft into a contract — needs an owner who can be wrong in a way that matters. Note the line in the YAML above: owner: finance-analytics. That field is not metadata decoration; it is the entire governance model in seventeen characters.

Historical aside This is the second time the industry has tried to make meaning machine-readable. The first was the 1990s CASE-tool and enterprise-data-dictionary era, which produced beautiful repositories of definitions that no running system ever consulted — so they rotted. The semantic layer avoids that fate for one mechanical reason: queries are routed through it, so a wrong or stale definition breaks something visible the same day. Specifications survive only when they sit on the execution path. Documentation beside the path is compost.

Closing the taxonomy

We can now line Tuesday’s failure bins up against their mitigations and see that each one is addressed by a contract mechanism, not by model improvement:

Failure mode	Root cause	Mitigation	Failure becomes…
Wrong join (fan-out, deprecated bridge)	Join paths and cardinalities undeclared; model infers from names	Governed join graph + declared grain; compiler rejects paths outside the graph	Compile-time error (loud)
Wrong metric variant (gross vs. net, fiscal vs. calendar)	Business logic absent from schema; many plausible readings	Named, owned metric definitions; agent selects a metric instead of writing an expression	A clarifying question (“net or gross?”) — the correct behavior
Hallucinated column	Retrieval pressure on 1,000–3,000-column schemas	Semantic layer as the retrieval surface: small (≈4 KB beats 100K tokens), curated, exhaustively enumerable	Unresolvable reference, rejected before SQL exists
Stale schema	Unversioned context drifts from the live warehouse	Layer is versioned and CI-tested against the warehouse; broken references fail the build, not the query	A failed build owned by a team, pre-production

Read the rightmost column as the week’s summary. In every row, the mitigation does not make the agent smarter — it converts a silent, plausible wrongness into a loud, attributable failure or an honest question. That is the same move that types, schemas, and constraints have always made in database systems. Text-to-SQL was never waiting for a better model. It was waiting, as this course keeps insisting, for somebody to write the specification — and the physics of that hasn’t changed since Codd: systems are reliable exactly to the degree that their contracts are explicit.

Readings

Read Before Thursday

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows — Lei et al., ICLR 2025.The 86→17 cliff, quantified. Focus on §3’s task construction and the error analysis: count how many failures are schema-scale or business-logic problems rather than SQL-skill problems.

Can LLM Already Serve as a Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs (BIRD) — Li et al., NeurIPS 2023.Read for the “external knowledge” design decision — the first benchmark to admit questions underdetermine SQL — and bring a skeptical eye to the gold annotations; we discuss the audit findings in class.

How Anthropic Enables Self-Service Data Analytics with Claude — Anthropic engineering blog, June 2026.The 21→95 result. Focus on what the curated skills actually contain (metric definitions, join guidance, pitfalls) and who maintains them — it is a semantic layer in everything but name.

Exercises

This Week’s Problems

Exercise 9.1 · warm-up

Take the question “How many active customers did we have last month?” and write down four materially different but individually defensible SQL interpretations (vary the definition of active, the calendar, and the customer-population filter). For each, name which failure bin from Tuesday’s taxonomy an agent would land in if it silently picked that reading when the asker meant another. Then write the one-paragraph semantic-layer entry for active_customers that would force the ambiguity to surface as a question instead.

Exercise 9.2 · core

Using the lab’s 400-column retail warehouse: (a) run the provided 25-question evaluation against a frontier model given only the raw schema dump, and bucket every failure into the four-bin taxonomy; (b) write a semantic layer of at most 4 KB — metrics, dimensions, grain, governed joins, in the YAML style from lecture — covering the ten metrics the questions need; (c) re-run with the model required to route through your layer, and report the accuracy delta per bin. You are graded on the error analysis, not the headline number: which bins did your 4 KB close, which survived, and why? (Reference point: Cube’s paired benchmark gained +17–23pp from a comparable artifact.)

Exercise 9.3 · stretch

Lecture 2 claims LLM-bootstrapping fails because it launders ambiguity — but the argument is qualitative. Design and run an experiment that measures it. Have a model mine the lab warehouse’s query corpus (provided: ~3,000 historical queries containing at least three latent variants of “revenue”) and auto-generate a semantic layer; then evaluate agents through the bootstrapped layer vs. your hand-written 9.2 layer vs. no layer. The hard parts you must solve, not hand-wave: (1) a metric for definition consistency distinct from execution accuracy — a bootstrapped layer can be internally consistent and uniformly wrong, and your metric must separate those; (2) an audit protocol that detects which corpus variant the bootstrap enshrined and whether it matches the (sealed) finance-approved definition; (3) a proposal, with measurements, for the highest-leverage partial automation — what is the largest slice of layer authoring a model can own before consistency degrades, and can you locate the knee of that curve? Write it up as a 4-page workshop paper. Genuinely open: a strong negative result is publishable.

❦