DATA 2027 · Week 10 · Part III — Semantics, Agents, Governance

Semantic Operators

The predicate becomes English, the evaluator becomes a language model, and the optimizer must reason about plans that cost dollars and are sometimes wrong.

Lecture 1 — sem_filter, sem_join: relational algebra meets the model · Lecture 2 — Optimizing pipelines that cost dollars and lie

Lecture 1 · Tuesday

sem_filter, sem_join: relational algebra meets the model

Codd’s separation of what from how — now with English predicates.

L1 · The Bet

What is a semantic operator?

L1 · The Operator Zoo

Four operators carry the weight

L1 · The Operator Zoo

Implementation quirks worth knowing

L1 · The Contract

What the contract does not promise: determinism

L1 · A Real Pipeline

A LOTUS pipeline you can write

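import lotus  # importing LOTUS registers the sem_* operators on pandas DataFrames

# Assumes papers and datasets are DataFrames and a model has been configured.
claims = papers.sem_filter(
    "the abstract {abstract} explicitly claims results "
    "that are reproducible with released code"
)                                # 10,000 → ~1,800 survive

matched = claims.sem_join(
    datasets,                    # 500 benchmark rows
    "paper {abstract:left} reports results on the "
    "benchmark {name:right} ({description:right})"
)                                # 1,800 × 500 pairs?!

# Instructions name the columns the model should read as {column} placeholders.
digest = (matched
    .sem_topk("the {abstract} with the most rigorous experimental methodology",
              K=5, group_by=["name"])
    .sem_agg("write a 3-sentence reproducibility summary of {abstract}",
             group_by=["name"]))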
L1 · The Cost Shock

That join, executed naively

900,000

model calls: 1,800 × 500 pairs. ~500 input tokens each at $3/M → about $1,400 and several GPU-days of latency for one join.
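The arithmetic behind that number, as a minimal sketch. The function name and its parameters are illustrative; the inputs are the slide's own figures, and output tokens are ignored, as they are above.

```python
def naive_join_cost(left_rows, right_rows, tokens_per_call, usd_per_million_tokens):
    """Return (model_calls, dollars) when the model is asked about every pair."""
    calls = left_rows * right_rows
    dollars = calls * tokens_per_call * usd_per_million_tokens / 1_000_000
    return calls, dollars


calls, dollars = naive_join_cost(1_800, 500, tokens_per_call=500,
                                 usd_per_million_tokens=3.0)
print(f"{calls:,} calls, about ${dollars:,.0f}")
```

The unrounded result is $1,350, which the slide rounds up to about $1,400. Note that cost scales with the product of the two table sizes, so doubling either side doubles the bill.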

L1 · The Cost Shock

The algebra transplanted. The cost model didn’t.

L1 · Cascades

Cheap models propose, expensive models dispose

L1 · Cascades

Anatomy of a sem_join cascade

n × m candidate pairs → PROXY: cosine(e_r, e_s), ~$0.04 total
  s > τ⁺        → ACCEPT (free)
  τ⁻ ≤ s ≤ τ⁺   → ORACLE LLM, ~2% of pairs
  s < τ⁻        → REJECT (free)
τ⁻, τ⁺ calibrated on a labeled sample: P(precision ≥ p* and recall ≥ r*) ≥ 1 − δ
Fig. 10.1 — The proxy scores all n×m pairs for pennies; only the uncertain band pays for a frontier-model verdict.
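The three-way routing in the figure can be sketched in a few lines. This is an illustration, not LOTUS's implementation: `cascade_join`, the toy 2-d "embeddings", and the `oracle` callable standing in for the LLM call are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cascade_join(left, right, tau_lo, tau_hi, oracle):
    """Route every pair by proxy score; only the uncertain band calls the oracle.

    left, right: lists of (row_id, embedding) pairs.
    oracle: callable (left_id, right_id) -> bool, a stand-in for the LLM verdict.
    Returns (matched_pairs, number_of_oracle_calls).
    """
    matches, oracle_calls = [], 0
    for l_id, l_vec in left:
        for r_id, r_vec in right:
            s = cosine(l_vec, r_vec)
            if s > tau_hi:        # confident match: accept without the oracle
                matches.append((l_id, r_id))
            elif s >= tau_lo:     # uncertain band: pay for an oracle verdict
                oracle_calls += 1
                if oracle(l_id, r_id):
                    matches.append((l_id, r_id))
            # s < tau_lo: confident non-match, rejected for free
    return matches, oracle_calls


# Toy data and a hypothetical oracle that simply looks up the true matches.
papers = [("p1", (1.0, 0.0)), ("p2", (0.6, 0.8))]
benchmarks = [("b1", (1.0, 0.1)), ("b2", (0.0, 1.0))]
truth = {("p1", "b1"), ("p2", "b2")}
matches, calls = cascade_join(papers, benchmarks, tau_lo=0.3, tau_hi=0.95,
                              oracle=lambda l, r: (l, r) in truth)
print(matches, calls)
```

On this toy input, one pair is accepted outright, one is rejected outright, and two reach the oracle. Both true matches come back. The thresholds here are picked by hand; in the real recipe they come from calibration on a labeled sample.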
L1 · Guarantees

Thresholds are not vibes: the SUPG recipe
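One way to turn a labeled sample into a lower threshold with a recall guarantee, sketched under simplifying assumptions. The sample is assumed to be uniform, and the safety margin is a Hoeffding-style bound, a stand-in for the more careful sampling and estimation SUPG itself performs. The name `recall_threshold` and its parameters are illustrative.

```python
import math


def recall_threshold(labeled, target_recall, delta):
    """Pick the highest lower threshold whose sample recall clears a padded target.

    labeled: list of (proxy_score, oracle_label) pairs from a uniform sample.
    The target recall is padded by a one-sided Hoeffding-style margin so that
    sampling noise is unlikely (probability at most delta) to hide a shortfall.
    Returns None when the sample is too small to certify the target.
    """
    positives = sorted((s for s, y in labeled if y), reverse=True)
    n = len(positives)
    if n == 0:
        return None
    margin = math.sqrt(math.log(1 / delta) / (2 * n))
    needed = math.ceil((target_recall + margin) * n)   # positives that must be kept
    if needed > n:
        return None
    return positives[needed - 1]                       # keep every score >= this


# Toy sample: 200 true matches with scores 1/200 .. 200/200, plus 50 non-matches.
sample = ([(i / 200, True) for i in range(1, 201)]
          + [(0.001 * i, False) for i in range(50)])
tau_lo = recall_threshold(sample, target_recall=0.90, delta=0.05)
print(tau_lo)
```

The padding is the point: the threshold is chosen to hit a recall higher than 0.90 on the sample, so that true recall still clears 0.90 with probability at least 1 − δ. With too few labeled positives the function refuses to answer rather than guess.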

L1 · Worked Numbers

Costing the join honestly

Plan component                                        Cost
Naive: 900K pairs × ~500 tokens at $3/M               ~$1,400
Embed 1,800 + 500 rows (~575K tokens at $0.02/M)      $0.012
Oracle-label 1,000 calibration pairs                  $1.55
Oracle on ~20,000 uncertain pairs (2.2% band)         ~$30
Cascade total, with a certified recall floor          ~$32
L1 · Worked Numbers

The default you must justify deviating from

44×

cost reduction: ~$32 vs $1,400 — and latency drops from days to minutes, since embedding is two batched calls and the oracle pass parallelizes.
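The cascade total and the ratio can be reproduced from the plan components. A minimal sketch: the constant names are illustrative, the prices and counts are the lecture's, and the $1.55 calibration cost is taken as stated, not derived.

```python
USD_PER_M_ORACLE = 3.00   # frontier-model input price, dollars per million tokens
USD_PER_M_EMBED = 0.02    # embedding price, dollars per million tokens
TOKENS_PER_PAIR = 500     # approximate prompt size per candidate pair

naive = 900_000 * TOKENS_PER_PAIR * USD_PER_M_ORACLE / 1e6
embed = 575_000 * USD_PER_M_EMBED / 1e6
calibrate = 1.55          # oracle labels for 1,000 calibration pairs, as stated
oracle_band = 20_000 * TOKENS_PER_PAIR * USD_PER_M_ORACLE / 1e6
cascade = embed + calibrate + oracle_band

print(f"naive ${naive:,.0f}, cascade ${cascade:.2f}, ratio {naive / cascade:.0f}x")
```

With unrounded inputs the ratio comes out near 43×. The headline 44× uses the rounded $1,400 and $32. Either way, the oracle pass on the uncertain band is almost the entire cascade bill, so the width of that band is the number to watch.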

L1 · Field Note

“It’s just a join”

$19,000

one uncascaded sem_join: 40K tickets × 800 known issues = 32M model calls, four days of rate limiting. The afternoon fix — embeddings + a 3% oracle band — reran for $410.

Lecture 2 · Thursday

Optimizing pipelines that cost dollars and lie

The System R question, asked again: who picks the plan?

L2 · The New Problem

Three axes, one optimizer

L2 · Physical Plan Space

One logical filter, five physical operators

Implementation (10K rows)                      Cost            Latency   Quality
Frontier model, chain-of-thought               ~$95            hours     1.00 (reference)
Mid-tier model, terse prompt                   ~$6             ~20 min   0.93
Cascade: embedding → frontier on 3%            ~$3.20          ~10 min   0.95 (floor 0.90)
Three filters fused, one mid-tier call         ~$2.50/filter   ~20 min   0.88
Code synthesis: frontier writes a classifier   ~$0.40 total    seconds   0.71
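Given this menu, the simplest policy an optimizer could apply is "cheapest plan that meets a quality floor". A minimal sketch of that policy: the `Plan` type and `cheapest_meeting` are hypothetical names, the numbers are the table's, and latency is left out because two rows give it only qualitatively.

```python
from typing import NamedTuple, Optional


class Plan(NamedTuple):
    name: str
    cost_usd: float   # per filter over 10K rows
    quality: float    # agreement with the frontier reference


MENU = [
    Plan("frontier CoT", 95.00, 1.00),
    Plan("mid-tier", 6.00, 0.93),
    Plan("cascade", 3.20, 0.95),
    Plan("fused", 2.50, 0.88),
    Plan("code synth", 0.40, 0.71),
]


def cheapest_meeting(menu, quality_floor) -> Optional[Plan]:
    """Return the cheapest plan whose quality is at least the floor, if any."""
    eligible = [p for p in menu if p.quality >= quality_floor]
    return min(eligible, key=lambda p: p.cost_usd) if eligible else None


choices = {floor: cheapest_meeting(MENU, floor).name
           for floor in (0.70, 0.85, 0.90, 0.99)}
print(choices)
```

Four different floors pick four different plans, and the mid-tier model is never chosen. The answer to "which plan is best" depends entirely on the constraint the user states.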
L2 · Physical Plan Space

No single row dominates the rest: that’s the whole story

L2 · Optimizer Theory

Why System R’s dynamic programming breaks

L2 · Optimizer Theory

Memoize the frontier, not a winner

Axes: cost (10K rows, log scale) and quality. Points (cost / quality):
  code synth     $0.40 / 0.71
  fused          $2.50 / 0.88
  cascade        $3.20 / 0.95
  mid-tier       $6 / 0.93 (dominated)
  frontier CoT   $95 / 1.00
Fig. 10.2 — The Lecture 2 menu as a (cost, quality) plot. Only dominated points may be pruned; the cascade dominates the mid-tier model.
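The pruning rule in the caption can be sketched as a dominance check over the plotted points. The function name is illustrative and only the two plotted axes are used; a third axis such as latency would add one more comparison.

```python
def pareto_frontier(plans):
    """Keep the plans that no other plan beats on cost and quality at once.

    plans: dict mapping name -> (cost_usd, quality).
    Lower cost is better; higher quality is better.
    """
    def dominates(a, b):
        (cost_a, qual_a), (cost_b, qual_b) = a, b
        no_worse = cost_a <= cost_b and qual_a >= qual_b
        strictly_better = cost_a < cost_b or qual_a > qual_b
        return no_worse and strictly_better

    return {name: point for name, point in plans.items()
            if not any(dominates(other, point) for other in plans.values())}


menu = {
    "code synth": (0.40, 0.71),
    "fused": (2.50, 0.88),
    "cascade": (3.20, 0.95),
    "mid-tier": (6.00, 0.93),
    "frontier CoT": (95.00, 1.00),
}
frontier = pareto_frontier(menu)
print(sorted(frontier))
```

Only the mid-tier plan is pruned, because the cascade is both cheaper and more accurate. The other four survive: each is the best available at some price, so an optimizer that memoizes this set keeps every plan a user could rationally pick.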
L2 · Optimizer Theory

Pareto enumeration, kept tractable

L2 · Historical Aside

Old theory doesn’t die; it waits for its workload

L2 · Statistics

Quality estimation: the cardinality estimation of our era

L2 · DocETL

When every physical plan is bad, rewrite the query

L2 · DocETL

The unreliable substrate repairing itself

1.34–4.6×

quality improvement on DocETL’s benchmarks. It works because verification on a sample is cheaper and more reliable than generation in the large — the oldest trick in computer science.

L2 · Synthesis

The physics held; the workload moved

  • Intact:
      • declarative queries, logical/physical separation
      • cost-based search, statistics, fear of cross products
  • Transformed:
      • cost unit: microseconds → dollars
      • correctness: exact → bounded-error
      • “better”: total order → partial order
A query plan used to be a route to the one right answer. Now it’s a position you take on a frontier of dollars, hours, and truth.
— Week 10 notes, DATA 2027
Checkpoint · Discussion

Before you leave

Readings · Due Thursday

Read before Thursday