DATA 2027 · Week 09 · Part III — Semantics, Agents, Governance

Text-to-SQL Is Not Solved; It’s Specified

Models write fluent SQL against toy schemas and confident nonsense against real warehouses — the gap is a missing contract, not missing intelligence.

Lecture 1 — Why Enterprise Text-to-SQL Fails · Lecture 2 — The Semantic Layer as Schema-for-Models

Lecture 1 · Tuesday

Why Enterprise Text-to-SQL Fails

For three years the problem looked solved. Then the benchmark stopped lying.

L1 · The Cliff

Same models, real warehouses

86% → 17%

Spider 1.0 execution accuracy vs. Spider 2.0 at launch (late 2024). Nothing about the models got worse — the benchmark stopped hiding the task.

L1 · The Cliff

What Spider 1.0 was hiding

Databases average ~27 columns across 5 tables.
Clean names like student.age.
One obvious join path between any two tables.
Leaderboard became a fight over decimal points.
Spider 2.0: BigQuery datasets, Snowflake deployments, real dbt projects.

L1 · Four Axes

Enterprise differs on four axes at once

Scale of schema — 1,000–3,000+ columns.
Dialects — BigQuery, Snowflake, DuckDB, ClickHouse.
Multi-step workflows — explore, sample, then write.
Business logic absent from the schema — the killer.
First three are engineering. The fourth is a missing specification.

L1 · Axis 1 — Scale

The schema no longer fits

A schema dump alone can exceed 100K tokens.
“Put the schema in the prompt” becomes a retrieval problem.
The model must find the right three tables first.
Collisions: amount, amount_usd, amt_net, total_amount_v2.
Lexical retrieval turns actively misleading.
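
A minimal sketch of why name-based retrieval misleads once names collide. The table names, descriptions, and the question are hypothetical; only the four colliding "amount" column names come from the slide. The ranker scores each column by token overlap between the question and the column name, which is roughly what a naive lexical retriever does.

```python
import re

# Hypothetical inventory; the four "amount" column names are the slide's collisions.
COLUMNS = {
    "fct_orders.amount": "gross order amount in local currency",
    "fct_orders.amount_usd": "gross order amount converted to USD",
    "fct_refunds.amt_net": "refund amount net of fees",
    "legacy_orders.total_amount_v2": "deprecated total kept for an old dashboard",
    "dim_customers.region": "sales region of the customer",
}

def tokens(text):
    """Lowercase alphanumeric tokens; underscores and dots act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_rank(question, columns):
    """Rank columns by how many question tokens appear in the column name."""
    q = tokens(question)
    scored = [(name, len(q & tokens(name))) for name in columns]
    # Highest overlap first; ties broken alphabetically so the output is stable.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

ranking = lexical_rank("total order amount in USD last quarter", COLUMNS)
for name, score in ranking:
    print(score, name)
```

In this toy inventory the deprecated total_amount_v2 ties with amount_usd for first place, because "total" matches as well as "usd" does. Nothing in the names says which one is retired.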

L1 · Axes 2 & 3

Dialects & multi-step workflows

Dialects. The model writes SQLite-style strftime where BigQuery needs FORMAT_TIMESTAMP.
Cheapest failures — they fail loudly.
Warning: training distribution ≠ deployment distribution.

Workflows. Inspect INFORMATION_SCHEMA, sample rows, build intermediates.
Final queries: 100+ lines, CTEs, windows, a pivot.
An agent task with a feedback loop — not one forward pass.
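
The explore, sample, then write loop can be sketched against an in-memory SQLite database. This is an illustration under stated assumptions, not an agent framework: the orders table and the helper names are hypothetical, sqlite_master and PRAGMA table_info stand in for INFORMATION_SCHEMA, and the dialect roles are flipped relative to the slide, since here the engine is SQLite and the BigQuery-flavoured draft is the one that fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, ordered_at TEXT, amount_usd REAL);
    INSERT INTO orders VALUES (1, '2026-07-04 10:00:00', 20.0),
                              (2, '2026-08-15 09:30:00', 35.0);
""")

def explore(conn):
    """Step 1: list tables and their columns from the catalog."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    return {t: [c[1] for c in conn.execute(f"PRAGMA table_info({t})")]
            for t in tables}

def sample(conn, table, n=2):
    """Step 2: look at a few real rows before writing anything."""
    return conn.execute(f"SELECT * FROM {table} LIMIT {n}").fetchall()

def run_with_feedback(conn, candidates):
    """Step 3: try drafts in order, keeping each engine error as feedback."""
    errors = []
    for sql in candidates:
        try:
            return conn.execute(sql).fetchall(), errors
        except sqlite3.OperationalError as exc:
            errors.append(str(exc))
    return None, errors

schema = explore(conn)
rows = sample(conn, "orders")
candidates = [
    # BigQuery-flavoured draft: SQLite has no FORMAT_TIMESTAMP, so this fails loudly.
    "SELECT FORMAT_TIMESTAMP('%Y-%m', ordered_at) AS month, SUM(amount_usd) "
    "FROM orders GROUP BY month ORDER BY month",
    # Dialect-correct retry.
    "SELECT strftime('%Y-%m', ordered_at) AS month, SUM(amount_usd) "
    "FROM orders GROUP BY month ORDER BY month",
]
result, errors = run_with_feedback(conn, candidates)
```

The dialect error costs one retry because the engine reports it. That is what makes it the cheap failure: the feedback loop has something to act on.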

L1 · Axis 4 — Meaning

The killer: meaning isn’t in the schema

“Active customer” = purchase in trailing 90 days…
…excluding refunds and test accounts. No DDL says that.
Meaning lives in dbt models, BI definitions, Slack, two analysts’ heads.
A fluent guess is worse than an error — it survives review.
No model scale fills in a definition never written down.
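
To make the gap concrete, here is a small sketch with hypothetical tables (accounts, purchases) and a fixed as-of date. The first query is what the schema alone suggests; the second encodes the slide's definition of an active customer. Both are valid SQL and both run without error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER, account_type TEXT);
    CREATE TABLE purchases (account_id INTEGER, purchased_at TEXT, is_refunded INTEGER);
    INSERT INTO accounts VALUES (1, 'customer'), (2, 'customer'), (3, 'test'), (4, 'customer');
    INSERT INTO purchases VALUES
        (1, '2026-09-20', 0),   -- recent purchase, counts
        (2, '2026-09-01', 1),   -- recent but refunded
        (3, '2026-09-25', 0),   -- recent but a test account
        (4, '2026-03-01', 0);   -- outside the trailing 90 days
""")
AS_OF = "2026-09-30"  # hypothetical reporting date, fixed so the result is reproducible

# What the DDL suggests: anyone who appears in purchases.
naive_sql = "SELECT COUNT(DISTINCT account_id) FROM purchases"

# What the business means: trailing 90 days, no refunds, no test accounts.
defined_sql = """
    SELECT COUNT(DISTINCT p.account_id)
    FROM purchases AS p
    JOIN accounts AS a ON a.account_id = p.account_id
    WHERE p.purchased_at >= date(?, '-90 days')
      AND p.is_refunded = 0
      AND a.account_type <> 'test'
"""

naive = conn.execute(naive_sql).fetchone()[0]
defined = conn.execute(defined_sql, (AS_OF,)).fetchone()[0]
print(naive, defined)
```

On this toy data the two readings disagree by a factor of four, and neither raises an error. The three WHERE clauses are exactly the part no DDL carries.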

L1 · BIRD

BIRD: between toy and warehouse

12,751 questions over 95 databases totaling 33 GB.
Deliberately “dirty” values.
External knowledge field: “revenue means price * quantity”.
First benchmark to admit the question underdetermines the SQL.

L1 · BIRD

Even the answer key is noisy

Audits: a substantial fraction of gold queries wrong or ambiguous.
Hired experts couldn’t reliably produce “the correct SQL” either.
Inter-annotator agreement on the right SQL: far below 1.0.
An 80%-accurate model is graded against a noisy key.
Underspecified tasks have no ceiling to converge to.
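
A back-of-envelope sketch of what a noisy key does to a score. The 10% key error rate is a hypothetical figure chosen for the arithmetic, not a number from any audit, and the model assumes key errors and model errors are independent. lucky_match is the (also hypothetical) chance that a wrong answer happens to agree with a wrong gold query.

```python
def measured_accuracy(true_acc, key_error, lucky_match=0.0):
    """Expected score against a noisy key, assuming independent errors."""
    # Model is right and the key is right: graded correct.
    right_and_key_right = true_acc * (1.0 - key_error)
    # Model is wrong, key is wrong, and they happen to agree: graded correct.
    wrong_but_matches = (1.0 - true_acc) * key_error * lucky_match
    # A right answer graded against a wrong key is marked wrong and adds nothing.
    return right_and_key_right + wrong_but_matches

KEY_ERROR = 0.10  # hypothetical, for illustration only

observed = measured_accuracy(0.80, KEY_ERROR)   # an 80%-accurate model
ceiling = measured_accuracy(1.00, KEY_ERROR)    # a perfect model
print(round(observed, 2), round(ceiling, 2))
```

Under these assumptions a perfect model tops out at one minus the key error rate, and gains beyond that point measure agreement with the key's mistakes rather than correctness.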

L1 · Stonebraker’s Zero

The bluntest measurement

≈ 0

Accuracy of state-of-the-art text-to-SQL on MIT’s own warehouse, with realistic questions from its actual users. Not adversarial — merely ordinary.

L1 · Field Note

“The model reproduced our ambiguity”

Assistant over a 2,400-column SAP-derived warehouse.
“What was Q3 churn?” — three different numbers.
All valid SQL; four churn-adjacent columns, three analyst generations.
Post-mortem: the company had no agreed definition of churn.
The model faithfully reproduced the org chart’s disagreement.

L1 · Failure Taxonomy

Four bins hold nearly everything

Fig. 1 — Grade failed enterprise queries by hand and nearly everything lands here.

L1 · The Pattern

Three of four are specification failures

Only the hallucinated column is a “model is dumb” failure.
And it is the most fixable.
The other three are failures of specification and contract.
The schema tells you what is storable.
Nothing machine-readable tells you what is meant. Thursday we fix that.

Lecture 2 · Thursday

The Semantic Layer as Schema-for-Models

The meaning contract already exists. BI invented it to keep dashboards consistent.

L2 · The Reframe

Schemas contract structure, not meaning

A relational schema is a contract about structure.
Enough for fifty years — humans carried definitions in their heads.
Agents carry nothing.
The meaning contract must be written, machine-readable, queryable.
That artifact has a name: the semantic layer.

L2 · Anatomy

Four kinds of declarations

Metrics — named, versioned computations, filters baked in.
Dimensions — the legal slices, each bound to one column.
Grain — declared row meaning: per order, per order line.
Governed joins — explicit path graph with cardinalities.
Grain makes fan-out double-counting detectable before execution.
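
One way declared grain and join cardinalities can catch fan-out before any SQL runs, sketched under assumptions: the Join record, the FanOutError name, and the fct_orders table are hypothetical, and the rule is deliberately conservative (any one-to-many hop on the path from the measure's table is rejected).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Join:
    left: str
    right: str
    cardinality: str  # "many_to_one", "one_to_one" or "one_to_many", read left to right

class FanOutError(ValueError):
    """Raised when a join path would repeat the rows that carry the measure."""

def check_fanout(measure_model, path):
    """Walk the join path from the measure's table and reject any one-to-many hop."""
    current = measure_model
    for join in path:
        if join.left != current:
            raise ValueError(f"join path is not connected at {join.left!r}")
        if join.cardinality == "one_to_many":
            raise FanOutError(
                f"rows of {measure_model} would repeat once per {join.right} row")
        current = join.right

# Order-line measure joined up to customers: each line keeps exactly one row.
safe_path = [Join("fct_order_lines", "dim_customers", "many_to_one")]
check_fanout("fct_order_lines", safe_path)

# Order-level measure joined down to lines: a per-order fee would be summed per line.
unsafe_path = [Join("fct_orders", "fct_order_lines", "one_to_many")]
try:
    check_fanout("fct_orders", unsafe_path)
    rejected = False
except FanOutError as exc:
    rejected = True
    print("rejected:", exc)
```

The check needs only the declarations, not the data, which is why it can run before execution.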

L2 · Queryable Contract

The agent selects; the layer compiles

Fig. 2 — Undefined metric, inapplicable dimension, or ungoverned join: rejected before SQL exists.

L2 · Anatomy

A metric is small and boring — the point

# semantic_layer/metrics/net_revenue.yml
metric: net_revenue
model: fct_order_lines        # grain: one row per order line
expr: sum(amount_usd) - sum(refund_amount_usd)
filters:
  - field: account_type
    operator: not_in
    values: [internal, test]
dimensions: [fiscal_month, region, plan_tier]
joins:
  - { to: dim_customers, type: many_to_one, on: account_id }
owner: finance-analytics      # a human team, not a model

L2 · Queryable Contract

“Fail loudly rather than be plausibly wrong”

A queryable contract, not documentation.
The agent never reads a wiki and freestyles SQL.
Illegal requests are rejected at compile time.
A compile error costs a retry.
A plausible wrong number costs trust — the actual product.
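
A minimal sketch of "select, then compile", assuming a registry that mirrors the net_revenue definition shown earlier. compile_query, CompileError, and the dictionary layout are illustrative names, not the API of any particular semantic-layer product, and the generated SQL is simplified (unqualified columns, USING joins).

```python
METRICS = {
    "net_revenue": {
        "model": "fct_order_lines",
        "expr": "sum(amount_usd) - sum(refund_amount_usd)",
        "filters": ["account_type NOT IN ('internal', 'test')"],
        "dimensions": {"fiscal_month", "region", "plan_tier"},
        "joins": [("dim_customers", "account_id")],  # (table, join key)
    }
}

class CompileError(Exception):
    """The request is outside the contract; no SQL is produced."""

def compile_query(metric, dimensions):
    """Turn a (metric, dimensions) selection into SQL, or refuse."""
    spec = METRICS.get(metric)
    if spec is None:
        raise CompileError(f"undefined metric: {metric}")
    illegal = [d for d in dimensions if d not in spec["dimensions"]]
    if illegal:
        raise CompileError(f"{metric} cannot be sliced by: {', '.join(illegal)}")

    select = ", ".join(list(dimensions) + [f"{spec['expr']} AS {metric}"])
    sql = f"SELECT {select} FROM {spec['model']}"
    for table, key in spec["joins"]:
        sql += f" JOIN {table} USING ({key})"
    if spec["filters"]:
        sql += " WHERE " + " AND ".join(spec["filters"])
    if dimensions:
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql

print(compile_query("net_revenue", ["region"]))

for bad_request in [("gross_margin", ["region"]), ("net_revenue", ["sales_rep"])]:
    try:
        compile_query(*bad_request)
    except CompileError as exc:
        print("rejected:", exc)
```

The agent's output space shrinks from arbitrary SQL to a metric name plus a list of dimensions. Everything it cannot express that way becomes an error it can see and retry.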

L2 · Evidence

Same model, different contract

21% → 95%

Anthropic’s self-service analytics: raw schema access vs. curated “skills” encoding metric definitions, join guidance, and pitfalls. Same model, 4.5× the accuracy — the expensive ingredient was analyst time, not GPUs.

L2 · Evidence

The cheapest accuracy intervention recorded

4 KB

Cube’s paired benchmark: one ~4 KB document of metric and join definitions raised accuracy by 17 to 23 percentage points, while a schema dump alone can run to 100K tokens of mostly noise.

L2 · Evidence

Accuracy improves twice

dbt’s production argument: route agent queries through the layer.
More correct answers on answerable questions.
Unanswerable ones become loud compile failures…
…instead of silent wrong numbers.
Fewer confident wrong answers is also accuracy.

L2 · Trust Hierarchy

Not all context is equally trustworthy

Fig. 3 — Resolve at the highest layer that can answer; fall through deliberately. Never blend all four with equal weight.

L2 · Trust Hierarchy

Report which layer answered

Attempt resolution top-down; fall through deliberately.
“Resolved via semantic layer” ≠ “inferred from a 2024 corpus query.”
The query corpus is the warehouse’s folklore, not its law.
Free-text context: richest in meaning, weakest in guarantees.
Surfacing provenance is what lets humans calibrate trust.
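
Top-down resolution with reported provenance might look like the sketch below. It is built on assumptions: only three layers are modelled (the ones named on these slides), their relative order below the semantic layer is a guess, and the lookup tables hold made-up definitions.

```python
# Each layer is a (name, lookup) pair; lookup returns a definition or None.
# Layer order and contents are assumptions for illustration.
SEMANTIC_LAYER = {"net_revenue": "sum(amount_usd) - sum(refund_amount_usd)"}
QUERY_CORPUS = {"churn": "count of accounts with cancelled_at in period"}
FREE_TEXT = {"power user": "wiki: someone who logs in most days"}

LAYERS = [
    ("semantic layer", SEMANTIC_LAYER.get),
    ("query corpus", QUERY_CORPUS.get),
    ("free-text context", FREE_TEXT.get),
]

def resolve(term, layers=LAYERS):
    """Return the first layer's answer together with the layer that gave it."""
    for name, lookup in layers:
        definition = lookup(term)
        if definition is not None:
            # Stop at the highest layer that can answer; never blend layers.
            return {"term": term, "definition": definition, "resolved_via": name}
    return {"term": term, "definition": None, "resolved_via": None}

for term in ["net_revenue", "churn", "nps"]:
    print(resolve(term))
```

The returned resolved_via field is the point: a reader shown "query corpus" next to a number knows to treat it as folklore, and a None result is a prompt to ask rather than guess.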

L2 · Bootstrapping

Why LLM-bootstrapping the layer failed

The 2025 shortcut: have the model generate the layer.
Its inputs are exactly the untrusted layers of the hierarchy.
It recovers every competing churn definition and enshrines the most common.
You haven’t removed the ambiguity — you’ve laundered it.
Deciding is not a prediction task.

L2 · Division of Labor

Humans own definitions; models draft

Models: mine corpora, diff metric variants, flag undeclared grain.
That’s drafting. The commit turns a draft into a contract.
It needs an owner who can be wrong in a way that matters.
owner: finance-analytics — the entire governance model in seventeen characters.
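
The drafting half of that division of labor can be sketched as a small lint. The draft records, the churn expressions, and the grain field are hypothetical (in the YAML shown earlier, grain is a comment rather than a field). The code flags conflicting variants and missing declarations; it does not pick a winner.

```python
from collections import defaultdict

# Hypothetical model-mined drafts; the net_revenue entry echoes the YAML example.
DRAFTS = [
    {"metric": "churn_rate", "expr": "lost_customers / start_customers",
     "grain": "account", "owner": None},
    {"metric": "churn_rate", "expr": "lost_mrr / start_mrr",
     "grain": None, "owner": None},
    {"metric": "net_revenue", "expr": "sum(amount_usd) - sum(refund_amount_usd)",
     "grain": "order line", "owner": "finance-analytics"},
]

def conflicting_variants(drafts):
    """Metric names that were mined with more than one distinct expression."""
    by_name = defaultdict(set)
    for draft in drafts:
        by_name[draft["metric"]].add(draft["expr"])
    return {name: sorted(exprs) for name, exprs in by_name.items() if len(exprs) > 1}

def blockers(draft):
    """Reasons a single draft cannot be committed as-is."""
    problems = []
    if not draft.get("grain"):
        problems.append("undeclared grain")
    if not draft.get("owner"):
        problems.append("no owner")
    return problems

def committable(drafts):
    """Drafts with one agreed expression, a declared grain and a named owner."""
    conflicts = conflicting_variants(drafts)
    return [d["metric"] for d in drafts
            if d["metric"] not in conflicts and not blockers(d)]

print(conflicting_variants(DRAFTS))
print(committable(DRAFTS))
```

Note what the lint refuses to do: it surfaces the two churn variants and stops. Choosing between them is left to whoever ends up in the owner field.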

L2 · Closing the Taxonomy

Every bin meets a contract mechanism

Failure mode	Mitigation	Failure becomes…
Wrong join	Governed join graph + declared grain	Compile-time error (loud)
Wrong metric variant	Named, owned metric definitions	A clarifying question (“net or gross?”)
Hallucinated column	≈4 KB curated retrieval surface	Rejected before SQL exists
Stale schema	Versioned layer, CI-tested against warehouse	A failed build, pre-production
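
The last row, a layer that is CI-tested against the warehouse, can be sketched with SQLite standing in for the warehouse catalog. stale_references and the layer dictionary are illustrative names; a real build would read the layer's files and query the warehouse's own catalog.

```python
import sqlite3

def warehouse_columns(conn, table):
    """Columns the warehouse actually has for a table (empty set if the table is gone)."""
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}

def stale_references(conn, layer):
    """(table, column) pairs the layer references but the warehouse lacks."""
    missing = []
    for table, columns in sorted(layer.items()):
        present = warehouse_columns(conn, table)
        missing += [(table, col) for col in sorted(columns) if col not in present]
    return missing

# Columns the net_revenue metric depends on, per the YAML example.
layer = {"fct_order_lines": {"account_id", "amount_usd", "refund_amount_usd"}}

conn = sqlite3.connect(":memory:")
# Simulate drift: the warehouse table lacks refund_amount_usd.
conn.execute("CREATE TABLE fct_order_lines (account_id INTEGER, amount_usd REAL)")
missing_before = stale_references(conn, layer)

# After the schema is brought back in line, the check passes.
conn.execute("ALTER TABLE fct_order_lines ADD COLUMN refund_amount_usd REAL")
missing_after = stale_references(conn, layer)

print(missing_before, missing_after)
```

Run in CI, a non-empty result fails the build, so the stale reference is found by a pipeline before production rather than by a user reading a wrong number.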

A bigger model makes the guess more fluent. Only a contract makes it unnecessary.

— Week 9 lecture notes, DATA 2027

Week 09 · Checkpoint

Discuss before Friday’s lab

“How many active customers last month?” — write four defensible SQL readings. Which failure bin does each silent choice land in?
Which taxonomy bins can a 4 KB semantic layer close on the 2,400-column warehouse from the field note, and which survive?
A bootstrapped layer can be internally consistent yet uniformly wrong. How would you measure that, separately from execution accuracy?

Week 09 · Readings

Read before Thursday

Spider 2.0 — Lei et al., ICLR 2025. The 86→17 cliff, quantified; focus on the error analysis.
BIRD — Li et al., NeurIPS 2023. The “external knowledge” design decision; be skeptical of the gold annotations.
How Anthropic Enables Self-Service Data Analytics with Claude — engineering blog, June 2026. The 21→95 result; a semantic layer in everything but name.