DATA 2027 DATA SYSTEMS IN THE AGENTIC ERA ← SCHEDULE
Week 09 · Part III — Semantics, Agents, Governance

Text-to-SQL Is Not Solved; It’s Specified

Models write fluent SQL against toy schemas and produce confident nonsense against real warehouses. The gap is not intelligence — it is the absence of a written contract about what the data means, and this week we build that contract.

Lectures: Tue — Why enterprise text-to-SQL fails · Thu — The semantic layer as schema-for-models  ·  Lab: Fri — Write a semantic layer for a 400-column retail warehouse and measure the accuracy delta  ·  Slides →
Learning objectives — after this week you can…
  • Explain the 86% → 17% cliff between Spider 1.0 and Spider 2.0, and name the four properties of enterprise warehouses that cause it.
  • Classify a failed agent-generated query into the standard taxonomy: wrong join, wrong metric variant, hallucinated column, stale schema.
  • Define a semantic layer precisely — metrics, dimensions, grain, governed joins — and explain why it is a queryable contract rather than documentation.
  • Rank context sources by trustworthiness (semantic layer > lineage > query corpus > business context) and justify the ordering with failure-mode arguments.
  • Argue, with evidence, why LLM-bootstrapping a semantic layer fails, and where the human/model division of labor actually sits.
Lecture 1 · Tuesday

Why Enterprise Text-to-SQL Fails

For about three years, text-to-SQL looked like a solved problem. On Spider 1.0 — the academic benchmark of record, released in 2018 — frontier models climbed past 86% execution accuracy, and the leaderboard compressed into a fight over decimal points. Then in late 2024 the same research group released Spider 2.0, built from real enterprise warehouses instead of curated classroom databases, and the best agentic frameworks scored roughly 17% at launch. Nothing about the models got worse. The benchmark simply stopped lying about what the task is. Today’s lecture is about what Spider 1.0 was hiding, and why every property it hid is a property of the database environment, not of SQL generation per se.

The cliff: what Spider 2.0 actually changed

Spider 1.0 databases average around 27 columns across 5 tables, with clean names like student.age and one obvious join path between any two tables. Spider 2.0 instances are drawn from production-grade warehouses — BigQuery public datasets, Snowflake deployments, real dbt projects — and they differ on four axes simultaneously:

Hold on to that decomposition. Axes one through three are engineering problems — retrieval, dialect adapters, agent loops — and the field is making steady progress on all of them. Axis four is not an engineering problem at all. It is a missing specification, and no amount of model scale fills in a definition that was never written down.

BIRD and the annotation-quality problem

BIRD (NeurIPS 2023) sits between the toy and the warehouse: 12,751 questions over 95 databases totaling 33 GB, with deliberately “dirty” values and an explicit external knowledge field — a hint like “revenue means price * quantity” attached to each question. BIRD was the first major benchmark to admit that the question alone underdetermines the SQL. That admission was the important contribution. But it surfaced a second, more uncomfortable lesson: when later audits re-examined BIRD’s gold queries, a substantial fraction were themselves wrong or ambiguous — wrong joins, questions with multiple defensible readings, gold SQL that disagreed with the stated evidence. Human experts hired to annotate “the correct SQL” for realistic questions could not reliably produce it either.

Students usually read this as a quality-control scandal. Read it instead as a measurement: the inter-annotator agreement on “the right SQL for this business question” is far below 1.0, even among professionals staring at the same schema. If humans can’t agree on the target, an 80%-accurate model isn’t 20% away from solved — it’s being graded against a noisy key. The task as posed (English in, SQL out, no shared definitions) is underspecified, and underspecified tasks have no ceiling to converge to.

Stonebraker’s zero

Michael Stonebraker — who has earned the right to be grumpy about database fashion — reported the bluntest version of the result: his group took state-of-the-art text-to-SQL systems, pointed them at MIT’s own data warehouse with realistic questions from its actual users, and measured an accuracy of approximately zero. Not 17%. Zero. MIT’s warehouse is not adversarial; it is merely ordinary — decades of accreted tables, cryptic column names inherited from a mainframe era, conventions documented nowhere. The benchmarks-to-production gap is not a gap. It is most of the problem.

Field note A team I worked with shipped a text-to-SQL assistant over a 2,400-column SAP-derived warehouse. In the pilot, the model answered “what was Q3 churn?” three times with three different numbers — all syntactically valid, all executed without error, each using a different one of the four churn-adjacent columns left behind by three generations of analysts. The post-mortem’s first finding: the company itself had no single agreed definition of churn. The model had faithfully reproduced the org chart’s disagreement. That post-mortem line — “the model reproduced our ambiguity” — is the thesis of Thursday’s lecture.

A failure taxonomy

When you grade failed enterprise queries by hand — and in Friday’s lab you will — nearly everything lands in four bins. Learn them; they recur in every paper and every production incident review.

Notice the pattern: only one of the four (hallucinated column) is a “model is dumb” failure, and it is the most fixable. The other three are failures of specification and contract — exactly the things a schema was supposed to provide and, for analytical meaning, never did. The schema tells you what is storable. Nothing machine-readable tells you what is meant. Thursday we fix that.

Lecture 2 · Thursday

The Semantic Layer as Schema-for-Models

Here is the reframe that organizes everything in Part III of this course: a relational schema is a contract about structure, and for fifty years that contract was enough, because the consumers of meaning were humans who carried definitions in their heads. Agents carry nothing. They need the meaning contract written down, machine-readable, and queryable — and that artifact already has a name from the BI world: the semantic layer. It was invented to keep dashboards consistent. It turns out to be the schema that models were missing.

What a semantic layer actually is

Strip away vendor branding and a semantic layer is four kinds of declarations sitting between raw tables and every query consumer:

The crucial property is that this is a queryable contract, not documentation. The agent does not read a wiki page about revenue and then freestyle SQL; it issues a request like metric=net_revenue, dims=[region, fiscal_month] and the layer compiles the SQL. A request for a metric that doesn’t exist, a dimension that doesn’t apply, or a join that isn’t governed is rejected at compile time. In dbt’s phrasing, which you should tattoo somewhere visible: the system should “fail loudly rather than be plausibly wrong.” A compile error costs a retry; a plausible wrong number costs trust, and trust is the actual product of an analytics system. A typical metric declaration is small and boring, which is the point:

# semantic_layer/metrics/net_revenue.yml
metric: net_revenue
description: >
  Recognized revenue net of refunds. Excludes internal
  and test accounts (is_test = false). Fiscal calendar.
model: fct_order_lines          # grain: one row per order line
expr: sum(amount_usd) - sum(refund_amount_usd)
filters:
  - field: account_type
    operator: not_in
    values: [internal, test]
dimensions: [fiscal_month, region, plan_tier]
joins:
  - to: dim_customers
    type: many_to_one
    on: account_id
owner: finance-analytics        # a human team, not a model

The evidence: same model, different contract

Hold the model fixed and vary only the contract, and the numbers move more than any architecture change of the last three years. Three data points, in increasing order of experimental cleanliness:

A bigger model makes the guess more fluent. Only a contract makes it unnecessary.

The trust hierarchy

Real deployments feed agents more than one kind of context, and the kinds are not equally trustworthy. The working ordering, which you should be able to defend on an exam:

Semantic layer > lineage > query corpus > business context. The semantic layer is governed and compiled — it cannot silently drift, because changes go through review and broken references fail builds. Lineage (which tables feed which) is automatically derived, hence accurate but meaning-free: it tells you fct_orders descends from raw_orders, not what either means. The historical query corpus is seductive — “here are 10,000 real queries” — but it encodes every past analyst’s mistakes and deprecated patterns with equal confidence; mining it gives you the warehouse’s folklore, not its law. And free-text business context (wiki pages, Slack) is the richest in meaning and the weakest in guarantees: unversioned, contradictory, and stale by default — the natural habitat of the stale-schema failure bin.

SEMANTIC LAYER governed · compiled · fails loudly LINEAGE derived · accurate · meaning-free QUERY CORPUS abundant · encodes old mistakes BUSINESS CONTEXT rich · unversioned · stale by default TRUST route here first
Fig. 9.1 — The trust hierarchy of agent context. An agent should resolve a question at the highest layer that can answer it, and fall through deliberately — never blend all four with equal weight. The dashed bottom layer is load-bearing for meaning and structurally unreliable; treat it as hypothesis, not ground truth.

The operational rule that falls out: an agent should attempt resolution top-down and report which layer answered. “Net revenue by region, resolved via semantic layer” deserves different downstream trust than “…inferred from a 2024 query I found in the corpus,” and surfacing the provenance is what lets humans calibrate.

Why LLM-bootstrapping the semantic layer failed

The obvious shortcut occurred to everyone simultaneously around 2025: writing a semantic layer is tedious, models are good at reading schemas and old queries — so have the model generate the semantic layer, then have models query through it. Closed loop, no analysts required. It failed, and it failed for a reason worth stating precisely: the inputs to bootstrapping are exactly the untrusted layers of the hierarchy. A model inferring “what does churn mean here?” from the query corpus and the wiki recovers all four ambient definitions of churn, picks the statistically most common one, and enshrines it with the false authority of a governed artifact. You haven’t removed the ambiguity; you’ve laundered it. The semantic layer’s entire value is that a human with organizational authority decided — finance owns net_revenue, the decision survived review, and dissent happened in a pull request rather than in production dashboards. Deciding is not a prediction task.

The division of labor that works in practice is narrower and more boring: humans own definitions; models draft documentation. A model is excellent at proposing candidate metrics from corpus mining (“these 340 queries all compute something revenue-shaped in three variants — here they are, diffed”), at writing the description fields, at flagging undeclared grain, at noticing that a dimension is referenced but unbound. That’s drafting. The commit — the act that turns a draft into a contract — needs an owner who can be wrong in a way that matters. Note the line in the YAML above: owner: finance-analytics. That field is not metadata decoration; it is the entire governance model in seventeen characters.

Historical aside This is the second time the industry has tried to make meaning machine-readable. The first was the 1990s CASE-tool and enterprise-data-dictionary era, which produced beautiful repositories of definitions that no running system ever consulted — so they rotted. The semantic layer avoids that fate for one mechanical reason: queries are routed through it, so a wrong or stale definition breaks something visible the same day. Specifications survive only when they sit on the execution path. Documentation beside the path is compost.

Closing the taxonomy

We can now line Tuesday’s failure bins up against their mitigations and see that each one is addressed by a contract mechanism, not by model improvement:

Failure modeRoot causeMitigationFailure becomes…
Wrong join (fan-out, deprecated bridge)Join paths and cardinalities undeclared; model infers from namesGoverned join graph + declared grain; compiler rejects paths outside the graphCompile-time error (loud)
Wrong metric variant (gross vs. net, fiscal vs. calendar)Business logic absent from schema; many plausible readingsNamed, owned metric definitions; agent selects a metric instead of writing an expressionA clarifying question (“net or gross?”) — the correct behavior
Hallucinated columnRetrieval pressure on 1,000–3,000-column schemasSemantic layer as the retrieval surface: small (≈4 KB beats 100K tokens), curated, exhaustively enumerableUnresolvable reference, rejected before SQL exists
Stale schemaUnversioned context drifts from the live warehouseLayer is versioned and CI-tested against the warehouse; broken references fail the build, not the queryA failed build owned by a team, pre-production

Read the rightmost column as the week’s summary. In every row, the mitigation does not make the agent smarter — it converts a silent, plausible wrongness into a loud, attributable failure or an honest question. That is the same move that types, schemas, and constraints have always made in database systems. Text-to-SQL was never waiting for a better model. It was waiting, as this course keeps insisting, for somebody to write the specification — and the physics of that hasn’t changed since Codd: systems are reliable exactly to the degree that their contracts are explicit.

Readings

Read Before Thursday

1
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows — Lei et al., ICLR 2025.The 86→17 cliff, quantified. Focus on §3’s task construction and the error analysis: count how many failures are schema-scale or business-logic problems rather than SQL-skill problems.
2
Can LLM Already Serve as a Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs (BIRD) — Li et al., NeurIPS 2023.Read for the “external knowledge” design decision — the first benchmark to admit questions underdetermine SQL — and bring a skeptical eye to the gold annotations; we discuss the audit findings in class.
3
How Anthropic Enables Self-Service Data Analytics with Claude — Anthropic engineering blog, June 2026.The 21→95 result. Focus on what the curated skills actually contain (metric definitions, join guidance, pitfalls) and who maintains them — it is a semantic layer in everything but name.
Exercises

This Week’s Problems

Exercise 9.1 · warm-up

Take the question “How many active customers did we have last month?” and write down four materially different but individually defensible SQL interpretations (vary the definition of active, the calendar, and the customer-population filter). For each, name which failure bin from Tuesday’s taxonomy an agent would land in if it silently picked that reading when the asker meant another. Then write the one-paragraph semantic-layer entry for active_customers that would force the ambiguity to surface as a question instead.

Exercise 9.2 · core

Using the lab’s 400-column retail warehouse: (a) run the provided 25-question evaluation against a frontier model given only the raw schema dump, and bucket every failure into the four-bin taxonomy; (b) write a semantic layer of at most 4 KB — metrics, dimensions, grain, governed joins, in the YAML style from lecture — covering the ten metrics the questions need; (c) re-run with the model required to route through your layer, and report the accuracy delta per bin. You are graded on the error analysis, not the headline number: which bins did your 4 KB close, which survived, and why? (Reference point: Cube’s paired benchmark gained +17–23pp from a comparable artifact.)

Exercise 9.3 · stretch

Lecture 2 claims LLM-bootstrapping fails because it launders ambiguity — but the argument is qualitative. Design and run an experiment that measures it. Have a model mine the lab warehouse’s query corpus (provided: ~3,000 historical queries containing at least three latent variants of “revenue”) and auto-generate a semantic layer; then evaluate agents through the bootstrapped layer vs. your hand-written 9.2 layer vs. no layer. The hard parts you must solve, not hand-wave: (1) a metric for definition consistency distinct from execution accuracy — a bootstrapped layer can be internally consistent and uniformly wrong, and your metric must separate those; (2) an audit protocol that detects which corpus variant the bootstrap enshrined and whether it matches the (sealed) finance-approved definition; (3) a proposal, with measurements, for the highest-leverage partial automation — what is the largest slice of layer authoring a model can own before consistency degrades, and can you locate the knee of that curve? Write it up as a 4-page workshop paper. Genuinely open: a strong negative result is publishable.