The predicate becomes English, the evaluator becomes a language model, and the optimizer must reason about plans that cost dollars and are sometimes wrong.
Lecture 1 — sem_filter, sem_join: relational algebra meets the model · Lecture 2 — Optimizing pipelines that cost dollars and lie
Codd’s separation of what from how — now with English predicates.
- sem_filter(R, p): tuples where an English predicate holds.
- sem_join(R, S, p): pairs (r, s) satisfying a language predicate.
- sem_agg(R, q): many-to-one reduction (summarize, synthesize, answer).
- sem_topk(R, q, k): rank by a language criterion, return top k.

sem_agg: fold or hierarchical tree, because groups rarely fit one context window.
sem_topk: LLMs are miscalibrated absolute scorers.

```python
# Keep only papers whose abstract claims reproducible results.
claims = papers.sem_filter(
    "the abstract {abstract} explicitly claims results "
    "that are reproducible with released code"
)  # 10,000 → ~1,800 survive

# Match each surviving paper against every benchmark row.
matched = claims.sem_join(
    datasets,  # 500 benchmark rows
    "paper {abstract:left} reports results on the "
    "benchmark {name:right} ({description:right})"
)  # 1,800 × 500 pairs?!

# Per benchmark: keep the 5 most rigorous papers, then summarize them.
digest = (
    matched
    .sem_topk("most rigorous experimental methodology",
              K=5, group_by=["name"])
    .sem_agg("write a 3-sentence reproducibility summary",
             group_by=["name"])
)
```
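The "hierarchical tree" strategy for sem_agg can be sketched in a few lines. This is a minimal illustration, not how any particular library implements it: `tree_agg`, `fan_in`, and `fake_summarize` are hypothetical names, and the model call is replaced by a string-joining stub so the shape of the reduction is visible.

```python
from typing import Callable, List


def tree_agg(items: List[str],
             combine: Callable[[List[str]], str],
             fan_in: int = 4) -> str:
    """Reduce items to one string by combining batches of at most fan_in.

    Each level shrinks the list by roughly a factor of fan_in, so no single
    call to combine ever sees more than fan_in inputs.
    """
    if not items:
        raise ValueError("nothing to aggregate")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(items)
    while len(level) > 1:
        level = [combine(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]


# Hypothetical stand-in for a model call: records batch sizes, joins inputs.
batch_sizes: List[int] = []


def fake_summarize(batch: List[str]) -> str:
    batch_sizes.append(len(batch))
    return "(" + "+".join(batch) + ")"


summary = tree_agg([f"d{i}" for i in range(10)], fake_summarize, fan_in=4)
print(summary)
print(batch_sizes)
```

In a real pipeline `combine` would be a prompt such as "merge these partial summaries", and `fan_in` would be chosen so that a batch fits comfortably in the context window. A fold is the degenerate case where each step combines the running summary with one new batch.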
900K model calls: 1,800 × 500 pairs. At ~500 input tokens each and $3/M, that is about $1,400 and several GPU-days of latency for one join.
| Plan component | Cost |
|---|---|
| Naive: 900K pairs × ~500 tokens at $3/M | ~$1,400 |
| Embed 1,800 + 500 rows (~575K tokens at $0.02/M) | $0.012 |
| Oracle-label 1,000 calibration pairs | $1.55 |
| Oracle on ~20,000 uncertain pairs (2.2% band) | ~$30 |
| Cascade total — with a certified recall floor | ~$32 |
Over 40× cost reduction: ~$32 vs ~$1,400. Latency drops from days to minutes, since embedding is two batched calls and the oracle pass parallelizes.
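The arithmetic behind the table can be reproduced directly. The sketch below uses only the figures quoted above (row counts, tokens per pair, prices, the 2.2% band); the function and constant names are illustrative, and the calibration cost is taken from the table rather than derived.

```python
def llm_cost(calls: int, tokens_per_call: int, usd_per_million: float) -> float:
    """Input-token cost in dollars for a batch of model calls."""
    return calls * tokens_per_call * usd_per_million / 1_000_000


LEFT_ROWS, RIGHT_ROWS = 1_800, 500   # filtered papers x benchmark rows
TOKENS_PER_PAIR = 500                # approximate input tokens per pair
ORACLE_PRICE = 3.00                  # dollars per million input tokens
EMBED_TOKENS = 575_000               # total tokens embedded, from the table
EMBED_PRICE = 0.02                   # dollars per million tokens
CALIBRATION_COST = 1.55              # table figure for 1,000 labeled pairs
BAND_FRACTION = 0.022                # share of pairs sent to the oracle

pairs = LEFT_ROWS * RIGHT_ROWS
naive = llm_cost(pairs, TOKENS_PER_PAIR, ORACLE_PRICE)

embed = EMBED_TOKENS * EMBED_PRICE / 1_000_000
band_pairs = round(pairs * BAND_FRACTION)
band = llm_cost(band_pairs, TOKENS_PER_PAIR, ORACLE_PRICE)
cascade = embed + CALIBRATION_COST + band

print(f"pairs:   {pairs:,}")
print(f"naive:   ${naive:,.2f}")
print(f"cascade: ${cascade:,.2f}")
print(f"ratio:   {naive / cascade:.0f}x")
```

The oracle band dominates the cascade's cost, so the width of that band is the parameter worth tuning; embedding is effectively free by comparison.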
One uncascaded sem_join: 40K tickets × 800 known issues = 32M model calls and four days of rate limiting. The afternoon fix (embeddings plus a 3% oracle band) reran for $410.
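The "embeddings plus an oracle band" pattern can be sketched as a two-threshold cascade. This is a simplified illustration under stated assumptions: similarity scores are already computed, the thresholds `low` and `high` have already been chosen (for instance from a labeled calibration sample), and `cascade_join` and the toy oracle are hypothetical names. It does not include the statistical step that certifies a recall floor.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]


def cascade_join(scored_pairs: Iterable[Tuple[Pair, float]],
                 oracle: Callable[[Pair], bool],
                 low: float,
                 high: float) -> Tuple[List[Pair], int]:
    """Accept confident pairs, reject hopeless ones, ask the oracle in between.

    Returns the matched pairs and the number of oracle calls made.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    matches: List[Pair] = []
    oracle_calls = 0
    for pair, score in scored_pairs:
        if score >= high:
            matches.append(pair)          # cheap proxy is trusted: accept
        elif score > low:
            oracle_calls += 1             # uncertain band: pay for the model
            if oracle(pair):
                matches.append(pair)
        # score <= low: rejected without a model call
    return matches, oracle_calls


# Toy data: (ticket, issue) pairs with made-up similarity scores.
scored = [
    (("t1", "i1"), 0.95),
    (("t1", "i2"), 0.10),
    (("t2", "i1"), 0.55),
    (("t2", "i2"), 0.60),
]


def toy_oracle(pair: Pair) -> bool:
    """Hypothetical stand-in for the expensive model judgment."""
    return pair == ("t2", "i2")


matches, calls = cascade_join(scored, toy_oracle, low=0.3, high=0.8)
print(matches, calls)
```

Only two of the four pairs reach the oracle here. At the scale quoted above, narrowing the band from 100% to a few percent of pairs is what turns days of rate-limited calls into an afternoon rerun.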
The System R question, asked again: who picks the plan?
| Implementation (10K rows) | Cost | Latency | Quality |
|---|---|---|---|
| Frontier model, chain-of-thought | ~$95 | hours | 1.00 (reference) |
| Mid-tier model, terse prompt | ~$6 | ~20 min | 0.93 |
| Cascade: embedding → frontier on 3% | ~$3.20 | ~10 min | 0.95 (floor 0.90) |
| Three filters fused, one mid-tier call | ~$2.50/filter | ~20 min | 0.88 |
| Code synthesis: frontier writes a classifier | ~$0.40 total | seconds | 0.71 |
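One simple way to read the table as an optimizer would is to pick the cheapest implementation whose measured quality clears a floor. The sketch below encodes the table's approximate figures; `Plan` and `cheapest_meeting` are illustrative names, the fused plan's cost is the per-filter figure, and a real optimizer would also weigh latency and the uncertainty in each quality estimate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Plan:
    name: str
    cost_usd: float   # approximate cost for 10K rows, from the table
    quality: float    # agreement with the frontier reference


PLANS: List[Plan] = [
    Plan("frontier model, chain-of-thought", 95.00, 1.00),
    Plan("mid-tier model, terse prompt", 6.00, 0.93),
    Plan("cascade: embedding -> frontier on 3%", 3.20, 0.95),
    Plan("three filters fused, one mid-tier call", 2.50, 0.88),
    Plan("code synthesis: frontier writes a classifier", 0.40, 0.71),
]


def cheapest_meeting(plans: List[Plan], min_quality: float) -> Optional[Plan]:
    """Return the lowest-cost plan with quality >= min_quality, if any."""
    feasible = [p for p in plans if p.quality >= min_quality]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.cost_usd)


for floor in (0.70, 0.85, 0.90, 0.99):
    choice = cheapest_meeting(PLANS, floor)
    print(floor, "->", choice.name if choice else "no feasible plan")
```

Note that the mid-tier plan is never selected: the cascade is both cheaper and higher quality, so it dominates it at every floor.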
quality improvement on DocETL’s benchmarks. It works because verification on a sample is cheaper and more reliable than generation in the large — the oldest trick in computer science.