DATA 2027 · Week 10 · Part III — Semantics, Agents, Governance

Semantic Operators

The predicate becomes English, the evaluator becomes a language model, and the optimizer must reason about plans that cost dollars and are sometimes wrong.

Lecture 1 — sem_filter, sem_join: relational algebra meets the model · Lecture 2 — Optimizing pipelines that cost dollars and lie

Lecture 1 · Tuesday

sem_filter, sem_join: relational algebra meets the model

Codd’s separation of what from how — now with English predicates.

L1 · The Bet

What is a semantic operator?

Declarative transformation over relations, parameterized by natural language.
Physical implementation happens to involve LLM invocations.
Semantics defined against a reference judgment, not a model.
Plus a statistical accuracy target.
Result: rewrites, cost models, cardinality estimation come back online.

L1 · The Operator Zoo

Four operators carry the weight

sem_filter(R, p) — tuples where an English predicate holds.
sem_join(R, S, p) — pairs (r, s) satisfying a language predicate.
sem_agg(R, q) — many-to-one reduction: summarize, synthesize, answer.
sem_topk(R, q, k) — rank by a language criterion, return top k.

L1 · The Operator Zoo

Implementation quirks worth knowing

sem_agg: fold or hierarchical tree — groups rarely fit one context window.
sem_topk: LLMs are miscalibrated absolute scorers.
So: pairwise or listwise comparisons into a tournament network.
Call complexity: O(n log n) comparisons, not O(n) scores.
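
A minimal sketch of comparison-based ranking, assuming a hypothetical pairwise judge `stub_llm_prefers(a, b)` that is stubbed with a numeric comparison so it runs offline. This is not LOTUS's implementation. It only shows that a sort driven by pairwise verdicts needs on the order of n log n calls, far fewer than all n(n-1)/2 pairs.

```python
import functools
import math
import random

def make_counting_judge(judge):
    """Wrap a pairwise judge so every invocation is counted."""
    calls = {"n": 0}
    def counted(a, b):
        calls["n"] += 1
        return judge(a, b)
    return counted, calls

def stub_llm_prefers(a, b):
    # Hypothetical stand-in for a pairwise LLM verdict:
    # negative if a should rank ahead of b, positive if b should.
    return (b > a) - (a > b)

def sem_topk_sketch(items, k, judge):
    """Rank items using only pairwise comparisons, then keep the top k."""
    ranked = sorted(items, key=functools.cmp_to_key(judge))
    return ranked[:k]

random.seed(0)
rows = random.sample(range(10_000), 256)
judge, calls = make_counting_judge(stub_llm_prefers)
top5 = sem_topk_sketch(rows, 5, judge)
all_pairs = len(rows) * (len(rows) - 1) // 2
print(top5, calls["n"], "comparisons vs", all_pairs, "possible pairs")
```

A real judge is noisy and not guaranteed transitive, so a production version would need repeated or listwise comparisons; the stub sidesteps that.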

L1 · The Contract

What the contract does not promise: determinism

Run a sem_filter twice — you may get different rows.
Honest semantics: output is a sample from a distribution over relations.
Even eventual consistency converged to a unique value. This never does.
Promise instead: bounded disagreement with the reference.
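
One way to make "bounded disagreement" concrete is sketched below, assuming a uniform random sample of rows that has been labeled by the reference. The function name and the choice of a one-sided Hoeffding bound are illustrative assumptions, not part of any system named in these notes.

```python
import math

def disagreement_upper_bound(n_sampled, n_disagree, delta):
    """One-sided Hoeffding bound: with probability >= 1 - delta, the true
    disagreement rate is at most the returned value."""
    if n_sampled <= 0 or not 0 < delta < 1:
        raise ValueError("need n_sampled > 0 and 0 < delta < 1")
    observed = n_disagree / n_sampled
    slack = math.sqrt(math.log(1 / delta) / (2 * n_sampled))
    return min(1.0, observed + slack)

# 30 disagreements in 1,000 sampled rows, 95% confidence.
bound = disagreement_upper_bound(n_sampled=1000, n_disagree=30, delta=0.05)
print(f"disagreement <= {bound:.3f} with 95% confidence")
```

The slack term shrinks only as 1/sqrt(n), which is why tight guarantees need many reference labels.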

L1 · A Real Pipeline

A LOTUS pipeline you can write

# Assumes lotus.settings.configure(lm=...) has already run and that
# papers and datasets are pandas DataFrames.
claims = papers.sem_filter(
    "the abstract {abstract} explicitly claims results "
    "that are reproducible with released code"
)                                # 10,000 → ~1,800 survive

matched = claims.sem_join(
    datasets,                    # 500 benchmark rows
    "paper {abstract:left} reports results on the "
    "benchmark {name:right} ({description:right})"
)                                # 1,800 × 500 pairs?!

# Each instruction names the column it reads, as sem_filter does above.
digest = (matched
    .sem_topk("{abstract} has the most rigorous experimental methodology",
              K=5, group_by=["name"])
    .sem_agg("write a 3-sentence reproducibility summary of {abstract}",
             group_by=["name"]))

L1 · The Cost Shock

That join, executed naively

900,000

model calls: 1,800 × 500 pairs. ~500 input tokens each at $3/M → about $1,400 and several GPU-days of latency for one join.
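
The arithmetic behind that figure, as a small sketch. The function name is hypothetical, and the price covers input tokens only, as on the slide.

```python
def naive_join_cost(n_left, n_right, tokens_per_call, usd_per_million_tokens):
    """One model call per (left, right) pair, priced per input token."""
    calls = n_left * n_right
    usd = calls * tokens_per_call * usd_per_million_tokens / 1_000_000
    return calls, usd

calls, usd = naive_join_cost(1_800, 500, tokens_per_call=500,
                             usd_per_million_tokens=3.0)
print(f"{calls:,} calls, ${usd:,.0f}")
```

The unrounded result is $1,350, which the slide rounds up to about $1,400.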

L1 · The Cost Shock

The algebra transplanted. The cost model didn’t.

Joins are quadratic — that part Selinger knew.
Model calls: 5–9 orders of magnitude pricier than boolean predicates.
Selinger budgeted predicates at fractions of a microsecond.
Now every evaluation has a unit price printed on it.

L1 · Cascades

Cheap models propose, expensive models dispose

Older than LLMs: Viola–Jones (2001), multi-stage search rankers.
Proxy scores every candidate; two thresholds τ⁻ < τ⁺.
Below τ⁻: reject. Above τ⁺: accept. Middle band → oracle.
sem_filter proxy: small model’s log-probability on “True.”
sem_join proxy: embed once — O(n+m), not O(n·m) — cosine similarity.
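
The two-threshold routing can be sketched as follows. The proxy and oracle are passed in as plain functions and stubbed with precomputed fields so the example runs offline; all names here are illustrative.

```python
def cascade_filter(rows, proxy_score, oracle, tau_minus, tau_plus):
    """Two-threshold cascade: the proxy decides the confident cases and
    only the uncertain middle band is sent to the expensive oracle."""
    if tau_minus > tau_plus:
        raise ValueError("tau_minus must not exceed tau_plus")
    accepted, oracle_calls = [], 0
    for row in rows:
        score = proxy_score(row)
        if score < tau_minus:        # confident reject, no oracle call
            continue
        if score > tau_plus:         # confident accept, no oracle call
            accepted.append(row)
            continue
        oracle_calls += 1            # uncertain band pays for a verdict
        if oracle(row):
            accepted.append(row)
    return accepted, oracle_calls

rows = [
    {"id": 1, "proxy": 0.95, "truth": True},
    {"id": 2, "proxy": 0.60, "truth": True},
    {"id": 3, "proxy": 0.55, "truth": False},
    {"id": 4, "proxy": 0.10, "truth": False},
    {"id": 5, "proxy": 0.85, "truth": True},
]
accepted, oracle_calls = cascade_filter(
    rows,
    proxy_score=lambda r: r["proxy"],   # stand-in for a small model's score
    oracle=lambda r: r["truth"],        # stand-in for the frontier model
    tau_minus=0.3,
    tau_plus=0.8,
)
print([r["id"] for r in accepted], oracle_calls)
```

Here only rows 2 and 3 fall in the band, so two of five rows pay for the oracle.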

L1 · Cascades

Anatomy of a sem_join cascade

Fig. 10.1 — The proxy scores all n×m pairs for pennies; only the uncertain band pays for a frontier-model verdict.
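
A sketch of the proxy stage, assuming a toy bag-of-words `toy_embed` in place of a real embedding model. The vocabulary, texts, and function names are made up for illustration. The point is the call count: n + m embedding calls, after which all n × m scores are plain arithmetic.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def join_proxy_scores(left_texts, right_texts, embed):
    """Embed each side once, then score every pair by cosine similarity."""
    left_vecs = [embed(t) for t in left_texts]
    right_vecs = [embed(t) for t in right_texts]
    return [[cosine(u, v) for v in right_vecs] for u in left_vecs]

# Toy stand-in for an embedding model: word counts over a tiny vocabulary.
VOCAB = ["imagenet", "squad", "vision", "question", "answering"]
counter = {"embed_calls": 0}

def toy_embed(text):
    counter["embed_calls"] += 1
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

papers = ["We evaluate vision models on ImageNet",
          "Question answering results on SQuAD"]
benchmarks = ["ImageNet vision benchmark",
              "SQuAD question answering",
              "vision question answering"]
scores = join_proxy_scores(papers, benchmarks, toy_embed)
print(counter["embed_calls"], "embedding calls for",
      len(papers) * len(benchmarks), "pair scores")
```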

L1 · Guarantees

Thresholds are not vibes: the SUPG recipe

Oracle-label a sample; pick τ⁻, τ⁺ from it.
Guarantee: recall ≥ r* and precision ≥ p*, probability ≥ 1−δ.
Importance-weighted sampling concentrates labels near the decision boundary.
SLA-ready: “90% recall at 95% confidence.”
Garbage proxy? Band widens — you pay more, never silently answer worse.
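
A deliberately simplified sketch of the threshold-picking step. It assumes a uniform labeled sample and reads the thresholds straight off empirical precision and recall. It omits the importance weighting and the confidence correction that SUPG relies on for its 1−δ guarantee, so treat it as the shape of the recipe, not the recipe itself. The function name is hypothetical.

```python
def pick_thresholds(sample, recall_target, precision_target):
    """sample: list of (proxy_score, oracle_label) pairs with bool labels.
    Returns (tau_minus, tau_plus) chosen on the sample alone."""
    total_pos = sum(1 for _, y in sample if y)
    if total_pos == 0:
        raise ValueError("need at least one oracle-positive in the sample")
    candidates = sorted({s for s, _ in sample})

    def labels_at_or_above(t):
        return [y for s, y in sample if s >= t]

    # tau_plus: lowest cut whose auto-accepted set is precise enough.
    tau_plus = float("inf")          # inf means nothing is auto-accepted
    for t in candidates:
        above = labels_at_or_above(t)
        if sum(above) / len(above) >= precision_target:
            tau_plus = t
            break

    # tau_minus: highest cut that still keeps enough of the positives.
    tau_minus = candidates[0]
    for t in candidates:
        if sum(labels_at_or_above(t)) / total_pos >= recall_target:
            tau_minus = t

    return min(tau_minus, tau_plus), tau_plus

sample = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
          (0.4, False), (0.3, True), (0.2, False), (0.1, False)]
tau_minus, tau_plus = pick_thresholds(sample, recall_target=0.75,
                                      precision_target=0.95)
print(tau_minus, tau_plus)
```

Raising the recall target to 1.0 on this sample pushes τ⁻ down to 0.3, which widens the band the oracle must cover.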

L1 · Worked Numbers

Costing the join honestly

Plan component	Cost
Naive: 900K pairs × ~500 tokens at $3/M	~$1,400
Embed 1,800 + 500 rows (~575K tokens at $0.02/M)	$0.012
Oracle-label 1,000 calibration pairs	$1.50
Oracle on ~20,000 uncertain pairs (2.2% band)	~$30
Cascade total — with a certified recall floor	~$32
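
The table can be recomputed from its own inputs. This sketch assumes every oracle call costs the same ~500 input tokens at $3/M and takes the 2.2% band literally; the helper name is hypothetical.

```python
def llm_cost(calls, tokens_per_call, usd_per_million_tokens):
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000

pairs = 1_800 * 500                                   # 900,000 candidate pairs
naive = llm_cost(pairs, 500, 3.00)                    # every pair to the oracle
embedding = 575_000 * 0.02 / 1_000_000                # embed both sides once
calibration = llm_cost(1_000, 500, 3.00)              # labeled calibration sample
band = llm_cost(round(pairs * 0.022), 500, 3.00)      # uncertain pairs only
cascade = embedding + calibration + band

print(f"naive ${naive:,.0f}, cascade ${cascade:,.2f}, "
      f"ratio {naive / cascade:.0f}x")
```

Unrounded, the cascade comes to about $31 and the ratio to about 43×; the slide figures round the band up to 20,000 pairs and the naive plan up to $1,400.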

L1 · Worked Numbers

The default you must justify deviating from

44×

cost reduction: ~$32 vs $1,400 — and latency drops from days to minutes, since embedding is two batched calls and the oracle pass parallelizes.

L1 · Field Note

“It’s just a join”

$19,000

one uncascaded sem_join: 40K tickets × 800 known issues = 32M model calls, four days of rate limiting. The afternoon fix — embeddings + a 3% oracle band — reran for $410.

Lecture 2 · Thursday

Optimizing pipelines that cost dollars and lie

The System R question, asked again: who picks the plan?

L2 · The New Problem

Three axes, one optimizer

1979: dynamic program over join orders, one scalar cost.
2025: plans vary in runtime, dollars, and quality at once.
Quality can’t be bought back after the fact.
Palimpzest (CIDR 2025): declare a pipeline and a policy.
“Maximize quality subject to cost ≤ $50,” or the reverse.

L2 · Physical Plan Space

One logical filter, five physical operators

Implementation (10K rows)	Cost	Latency	Quality
Frontier model, chain-of-thought	~$95	hours	1.00 (reference)
Mid-tier model, terse prompt	~$6	~20 min	0.93
Cascade: embedding → frontier on 3%	~$3.20	~10 min	0.95 (floor 0.90)
Three filters fused, one mid-tier call	~$2.50/filter	~20 min	0.88
Code synthesis: frontier writes a classifier	~$0.40 total	seconds	0.71

L2 · Physical Plan Space

No row dominates them all: that’s the whole story

Tuesday’s cascade is just one physical operator here.
Fusion amortizes per-call token overhead — at quality cost (interference).
Code synthesis: great on lexical predicates, bad on judgment.
No single number without asking what the user values.
Hence: the policy is part of the query.
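
A sketch of what "the policy is part of the query" means operationally, using the cost and quality columns from the table above. The `Plan` class and both selector functions are hypothetical and are not Palimpzest's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    name: str
    cost_usd: float
    quality: float

MENU = [
    Plan("frontier, chain-of-thought", 95.00, 1.00),
    Plan("mid-tier, terse prompt", 6.00, 0.93),
    Plan("cascade", 3.20, 0.95),
    Plan("fused filters (per filter)", 2.50, 0.88),
    Plan("code synthesis", 0.40, 0.71),
]

def max_quality_under_budget(plans, budget_usd):
    """Policy: maximize quality subject to cost <= budget."""
    feasible = [p for p in plans if p.cost_usd <= budget_usd]
    return max(feasible, key=lambda p: p.quality) if feasible else None

def min_cost_above_quality(plans, quality_floor):
    """The reverse policy: minimize cost subject to quality >= floor."""
    feasible = [p for p in plans if p.quality >= quality_floor]
    return min(feasible, key=lambda p: p.cost_usd) if feasible else None

print(max_quality_under_budget(MENU, 50).name)
print(max_quality_under_budget(MENU, 1).name)
print(min_cost_above_quality(MENU, 0.99).name)
```

Same menu, three policies, three different winners.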

L2 · Optimizer Theory

Why System R’s dynamic programming breaks

Principle of optimality needs costs totally ordered.
(cost, accuracy) is a partial order: incomparable subplans exist.
Cheap-but-sloppy subplan may prefix the globally best plan.
Downstream operators may tolerate — even mask — its errors.
Prune it at the memo table: optimum gone.

L2 · Optimizer Theory

Memoize the frontier, not a winner

Fig. 10.2 — The Lecture 2 menu as a (cost, quality) plot. Only dominated points may be pruned; the cascade dominates the mid-tier model.
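
The pruning rule in the figure can be checked directly. This sketch uses the (cost, quality) pairs from the Lecture 2 menu and ignores latency; the function names are illustrative.

```python
def dominates(a, b):
    """a and b are (cost, quality). a dominates b when it is no worse on
    both axes and strictly better on at least one."""
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])

def pareto_frontier(points):
    """Keep every plan that no other plan dominates."""
    return {name: p for name, p in points.items()
            if not any(dominates(q, p) for q in points.values())}

menu = {
    "frontier": (95.00, 1.00),
    "mid-tier": (6.00, 0.93),
    "cascade": (3.20, 0.95),
    "fused": (2.50, 0.88),
    "code synthesis": (0.40, 0.71),
}
frontier = pareto_frontier(menu)
print(sorted(frontier))
```

Only the mid-tier plan drops out; the memo table has to carry the other four.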

L2 · Optimizer Theory

Pareto enumeration, kept tractable

You’ve seen the baby version: Selinger’s “interesting orders.”
This is interesting orders with the dial turned to eleven.
Frontiers grow multiplicatively with pipeline depth.
Repairs: sample-based quality estimates, frontier sparsification.
Confidence-interval pruning, à la multi-armed bandits.
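
One possible form of frontier sparsification, sketched under the assumption that plans are summarized as positive (cost, quality) pairs. It buckets points on a geometric grid with ratio 1 + ε and keeps one representative per cell, so every dropped point has a kept neighbor within a factor of 1 + ε on both axes. The function name and the random test data are illustrative.

```python
import math
import random

def sparsify(points, eps):
    """Keep one (cost, quality) point per cell of a geometric grid
    whose cells span a factor of (1 + eps) on each axis."""
    base = math.log1p(eps)
    cells = {}
    for cost, quality in points:
        key = (math.floor(math.log(cost) / base),
               math.floor(math.log(quality) / base))
        best = cells.get(key)
        # Within a cell, prefer the cheaper point, then the higher quality.
        if best is None or (cost, -quality) < (best[0], -best[1]):
            cells[key] = (cost, quality)
    return sorted(cells.values())

random.seed(1)
points = [(random.uniform(0.1, 100), random.uniform(0.5, 1.0))
          for _ in range(2000)]
eps = 0.1
kept = sparsify(points, eps)
print(len(points), "plans reduced to", len(kept))
```

The number of cells grows with log(range)/ε per axis, not with the number of plans, which is what keeps the memo table bounded.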

L2 · Historical Aside

Old theory doesn’t die; it waits for its workload

1992: Ioannidis, Ng, Shim, Sellis formalize parametric query optimization.
2001: Papadimitriou & Yannakakis — multi-objective complexity results.
Frontiers can be exponential, yet admit polynomial ε-approximations.
Thirty years: a beautiful answer in search of a question.
Predicates with API bills made it a load-bearing wall.

L2 · Statistics

Quality estimation: the cardinality estimation of our era

No precomputed histogram for “fraction a mid-tier model misjudges.”
So: online and sampled, scored against a champion frontier model.
Week 5 pathologies return: multiplicative error compounding, correlated predicates.
New twist: estimates themselves cost money — explore/exploit.
The perpetually weakest input; the next decade of papers.
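
A sketch of the compounding arithmetic, assuming stages err independently. That assumption is labeled loudly because correlated predicates are exactly what breaks it. Both function names are hypothetical.

```python
import math

def pipeline_quality(stage_qualities):
    """End-to-end quality if each stage errs independently (an assumption)."""
    return math.prod(stage_qualities)

def worst_case_estimate_factor(relative_error, n_stages):
    """If each stage's quality estimate can be off by a factor of
    (1 + relative_error), the product can be off by this much."""
    return (1 + relative_error) ** n_stages

print(f"{pipeline_quality([0.95, 0.93, 0.88]):.3f}")
print(f"{worst_case_estimate_factor(0.05, 3):.3f}")
```

Three stages that each look acceptable multiply to well under 0.8, and a 5% estimation error per stage grows to nearly 16% over the pipeline.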

L2 · DocETL

When every physical plan is bad, rewrite the query

Sometimes the logical operator is too big to execute faithfully.
“Extract all medications from an 80-page record” fails: attention thins out over long inputs and items get missed.
DocETL’s optimizer is agentic: an LLM proposes pipeline rewrites.
Directives: split + gather context, decompose maps, insert resolve.
An LLM-as-judge validates rewrites on sampled data.
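
A sketch of the idea behind the split + gather directive: break a long record into chunks and give each chunk a little preceding context. The function name, chunk size, and context width are assumptions for illustration, and this is not DocETL's API. The decompose and resolve directives are not shown.

```python
def split_with_context(pages, chunk_size, context):
    """Split a document into chunks of `chunk_size` pages. Each chunk also
    carries up to `context` preceding pages as read-only background."""
    if chunk_size <= 0 or context < 0:
        raise ValueError("chunk_size must be positive and context non-negative")
    chunks = []
    for start in range(0, len(pages), chunk_size):
        chunks.append({
            "context": pages[max(0, start - context):start],
            "body": pages[start:start + chunk_size],
        })
    return chunks

pages = [f"page {i}" for i in range(1, 81)]      # the 80-page record
chunks = split_with_context(pages, chunk_size=10, context=2)
print(len(chunks), chunks[1]["context"], chunks[1]["body"][0])
```

Each chunk would then go through its own extraction call, with a later step merging duplicate mentions across chunks.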

L2 · DocETL

The unreliable substrate repairing itself

1.34–4.6×

quality improvement on DocETL’s benchmarks. It works because verification on a sample is cheaper and more reliable than generation in the large — the oldest trick in computer science.

L2 · Synthesis

The physics held; the workload moved

Intact: declarative queries, logical/physical separation.
Cost-based search, statistics, fear of cross products.

Transformed: cost unit — microseconds → dollars.
Correctness — exact → bounded-error.
“Better” — total order → partial order.

A query plan used to be a route to the one right answer. Now it’s a position you take on a frontier of dollars, hours, and truth.

— Week 10 notes, DATA 2027

Checkpoint · Discussion

Before you leave

At what uncertain-band width does a cascade stop beating the naive plan? (Ex 10.1)
Why does cost climb as target recall r* moves toward 0.99? (Ex 10.2)
Where could DocETL’s LLM-as-judge validation loop be fooled?

Readings · Due Thursday

Read before Thursday

LOTUS — Patel et al., VLDB 2025. The algebra; calibrated cascade guarantees.
Palimpzest — Liu, Russo, Cafarella et al., CIDR 2025. The (runtime, cost, quality) Pareto search.
DocETL — Shankar, Parameswaran, Wu, VLDB 2025. Agentic rewrites; ask where the judge gets fooled.