DATA 2027 · Week 09 · Part III — Semantics, Agents, Governance

Text-to-SQL Is Not Solved; It’s Specified

Models write fluent SQL against toy schemas and confident nonsense against real warehouses — the gap is a missing contract, not missing intelligence.

Lecture 1 — Why Enterprise Text-to-SQL Fails · Lecture 2 — The Semantic Layer as Schema-for-Models

Lecture 1 · Tuesday

Why Enterprise Text-to-SQL Fails

For three years the problem looked solved. Then the benchmark stopped lying.

L1 · The Cliff

Same models, real warehouses

86% → 17%

Spider 1.0 execution accuracy vs. Spider 2.0 at launch (late 2024). Nothing about the models got worse — the benchmark stopped hiding the task.

L1 · The Cliff

What Spider 1.0 was hiding

L1 · Four Axes

Enterprise differs on four axes at once

L1 · Axis 1 — Scale

The schema no longer fits

L1 · Axes 2 & 3

Dialects & multi-step workflows

  • Dialects. strftime where FORMAT_TIMESTAMP is needed.
  • Cheapest failures — they fail loudly.
  • Warning: training distribution ≠ deployment distribution.
  • Workflows. Inspect INFORMATION_SCHEMA, sample rows, build intermediates.
  • Final queries: 100+ lines, CTEs, windows, a pivot.
  • An agent task with a feedback loop — not one forward pass.
L1 · Axis 4 — Meaning

The killer: meaning isn’t in the schema

L1 · BIRD

BIRD: between toy and warehouse

L1 · BIRD

Even the answer key is noisy

L1 · Stonebraker’s Zero

The bluntest measurement

≈ 0

Accuracy of state-of-the-art text-to-SQL on MIT’s own warehouse, with realistic questions from its actual users. Not adversarial — merely ordinary.

L1 · Field Note

“The model reproduced our ambiguity”

L1 · Failure Taxonomy

Four bins hold nearly everything

WRONG JOIN fan-out · deprecated bridge plausible: off by 1.3x, not 1000x WRONG METRIC VARIANT gross vs. net · fiscal vs. calendar a revenue, not the revenue HALLUCINATED COLUMN invented under retrieval pressure at least fails loudly STALE SCHEMA correct SQL against a ghost fastest-growing bin contract failure the only "model is dumb" bin
Fig. 1 — Grade failed enterprise queries by hand and nearly everything lands here.
L1 · The Pattern

Three of four are specification failures

Lecture 2 · Thursday

The Semantic Layer as Schema-for-Models

The meaning contract already exists. BI invented it to keep dashboards consistent.

L2 · The Reframe

Schemas contract structure, not meaning

L2 · Anatomy

Four kinds of declarations

L2 · Queryable Contract

The agent selects; the layer compiles

AGENT no freestyle SQL metric=net_revenue dims=[region, fiscal_month] SEMANTIC LAYER governed · compiled raw tables COMPILED SQL defined once COMPILE ERROR loud, costs a retry governed not in contract
Fig. 2 — Undefined metric, inapplicable dimension, or ungoverned join: rejected before SQL exists.
L2 · Anatomy

A metric is small and boring — the point

# semantic_layer/metrics/net_revenue.yml
metric: net_revenue
model: fct_order_lines        # grain: one row per order line
expr: sum(amount_usd) - sum(refund_amount_usd)
filters:
  - field: account_type
    operator: not_in
    values: [internal, test]
dimensions: [fiscal_month, region, plan_tier]
joins:
  - { to: dim_customers, type: many_to_one, on: account_id }
owner: finance-analytics      # a human team, not a model
L2 · Queryable Contract

“Fail loudly rather than be plausibly wrong”

L2 · Evidence

Same model, different contract

21% → 95%

Anthropic’s self-service analytics: raw schema access vs. curated “skills” encoding metric definitions, join guidance, and pitfalls. Same model, 4.5× the accuracy — the expensive ingredient was analyst time, not GPUs.

L2 · Evidence

The cheapest accuracy intervention recorded

4 KB

Cube’s paired benchmark: one ~4 KB document of metric and join definitions moved accuracy +17 to +23 percentage points — while a schema dump alone can be 100K tokens of mostly noise.

L2 · Evidence

Accuracy improves twice

L2 · Trust Hierarchy

Not all context is equally trustworthy

SEMANTIC LAYER governed · compiled · fails loudly LINEAGE derived · accurate · meaning-free QUERY CORPUS abundant · encodes old mistakes BUSINESS CONTEXT rich · unversioned · stale by default TRUST route here first
Fig. 3 — Resolve at the highest layer that can answer; fall through deliberately. Never blend all four with equal weight.
L2 · Trust Hierarchy

Report which layer answered

L2 · Bootstrapping

Why LLM-bootstrapping the layer failed

L2 · Division of Labor

Humans own definitions; models draft

L2 · Closing the Taxonomy

Every bin meets a contract mechanism

Failure modeMitigationFailure becomes…
Wrong joinGoverned join graph + declared grainCompile-time error (loud)
Wrong metric variantNamed, owned metric definitionsA clarifying question (“net or gross?”)
Hallucinated column≈4 KB curated retrieval surfaceRejected before SQL exists
Stale schemaVersioned layer, CI-tested against warehouseA failed build, pre-production
A bigger model makes the guess more fluent. Only a contract makes it unnecessary.
— Week 9 lecture notes, DATA 2027
Week 09 · Checkpoint

Discuss before Friday’s lab

Week 09 · Readings

Read before Thursday