Week 10: Semantic Operators

Learning objectives — after this week you can…

Define sem_filter, sem_join, sem_agg, and sem_topk as relational operators with LLM-backed implementations, and state precisely what their output contract guarantees (and doesn’t).
Explain why a naive sem_join is O(n·m) model calls, and design a proxy→oracle cascade that cuts cost by 10–100× while meeting a precision/recall target with probability 1−δ.
Compute the dollar cost of a semantic pipeline from first principles: tokens per call × calls per operator × price per token — before running it.
Describe Palimpzest’s physical plan space (model tiers, prompt strategies, operator fusion, code synthesis) and why plan quality is a first-class optimization axis.
Prove to yourself why System R’s dynamic programming breaks under a (cost, accuracy) partial order, and explain Pareto-frontier plan enumeration as the repair.

Lecture 1 · Tuesday

sem_filter, sem_join: relational algebra meets the model

Relational algebra survived every hardware revolution since 1970 because it separates what from how. Codd’s operators don’t care whether the executor is a PDP-11 or a thousand-node cluster. The LOTUS line of work (Patel et al., VLDB 2025) makes a bet that should feel inevitable by Week 10 of this course: the same separation works when the predicate is written in English and the evaluator is a language model. A semantic operator is a declarative transformation over relations, parameterized by a natural-language expression, whose physical implementation happens to involve one or more LLM invocations. The crucial move is that the operator’s semantics are defined independently of any particular model — defined, instead, against a reference judgment (a “gold” oracle, in practice a frontier model or a human) plus a statistical accuracy target. Once you have that, the entire apparatus of database optimization — logical rewrites, physical operator selection, cost models, cardinality estimation — comes back online. That’s the whole lecture in one sentence; the rest is the details, and the details have teeth.

The operator zoo, precisely

Four operators carry most of the weight. sem_filter(R, p) takes a relation and a natural-language predicate (“this abstract claims a new state-of-the-art result”) and returns the tuples for which the predicate holds, where “holds” means: agrees with the reference judgment at a target accuracy. sem_join(R, S, p) returns pairs (r, s) satisfying a language predicate over both tuples — “this paper r addresses dataset s.” sem_agg(R, q) is a many-to-one reduction: summarize, synthesize, or answer q over a group of tuples, typically implemented as a fold or hierarchical tree of model calls because the group rarely fits one context window. sem_topk(R, q, k) ranks tuples by a language criterion (“most novel methodology”) and returns the top k — implemented not by scoring each tuple independently (LLMs are miscalibrated absolute scorers) but by pairwise or listwise comparisons fed into a sorting or tournament network, which is why its call complexity looks like O(n log n) comparisons rather than O(n) scores.

Notice what is not in the contract: determinism. Run a sem_filter twice and you may get different rows. The honest semantics are probabilistic: the output is a sample from a distribution over relations, and the guarantee is about expected agreement with the reference. This is genuinely new for the relational world — even our flakiest Week 4 friends (eventual consistency, anyone?) eventually converged to a unique value. Semantic operators never promise that. What they promise instead is bounded disagreement, and Tuesday’s second half is about how to make that bound real rather than aspirational.

A pipeline you can actually write

Here is a LOTUS-style pipeline in the Pandas-flavored API the paper popularized. The task: from a corpus of ML papers, find ones claiming reproducible results, match each against a relation of known benchmark datasets, and produce a ranked digest per dataset. Curly braces splice column values into the prompt:

papers = pd.read_parquet("arxiv_ml_2025.parquet")      # 10,000 rows
datasets = pd.read_parquet("benchmarks.parquet")        #    500 rows

claims = papers.sem_filter(
    "the abstract {abstract} explicitly claims results "
    "that are reproducible with released code"
)                                                       # ~1,800 survive

matched = claims.sem_join(
    datasets,
    "paper {abstract:left} reports results on the "
    "benchmark {name:right} ({description:right})"
)                                                       # 1,800 × 500 pairs?!

digest = (matched
    .sem_topk("most rigorous experimental methodology",
              K=5, group_by=["name"])
    .sem_agg("write a 3-sentence reproducibility summary "
             "of these papers for dataset {name}",
             group_by=["name"]))

Read the comment on line 11 again. That sem_join, executed naively, is 1,800 × 500 = 900,000 model calls. At roughly 500 input tokens per pair (two text fields plus instructions) and frontier pricing of $3 per million input tokens plus a few output tokens per verdict, you are staring at about $1,400 and several GPU-days of latency for one join. Joins are quadratic; model calls are five to nine orders of magnitude more expensive than the predicate evaluations Selinger budgeted at fractions of a microsecond. The algebra transplanted fine. The cost model did not.

Cascades: cheap models propose, expensive models dispose

The standard repair is the model cascade, and it is older than LLMs — Viola–Jones face detection (2001) was a cascade; so was every multi-stage ranker at a search company. The structure: a cheap proxy scores every candidate; we pick two thresholds τ⁻ < τ⁺; candidates scoring below τ⁻ are rejected outright, above τ⁺ accepted outright, and only the uncertain middle band is sent to the expensive oracle model. For sem_filter the proxy is a small instruction-tuned model’s token log-probability on “True.” For sem_join it is even cheaper: embed both sides once (O(n+m) embedding calls, not O(n·m)) and use cosine similarity as the proxy score over all pairs.

Fig. 10.1 — A sem_join cascade. The proxy (embedding cosine similarity) scores all n×m pairs for pennies; only the uncertain band between τ⁻ and τ⁺ pays for a frontier-model verdict. The thresholds are not vibes: they are calibrated on an oracle-labeled sample to meet a precision/recall target with confidence 1−δ.

The thresholds are where the statistics earn their keep. LOTUS inherits the SUPG recipe (Kang et al.): draw a sample of candidates, label them with the oracle, and use the sample to pick τ⁻ and τ⁺ such that, with probability at least 1−δ over the sampling randomness, the cascade’s output achieves recall ≥ r* (no more than a bounded fraction of true pairs auto-rejected below τ⁻) and precision ≥ p* (auto-accepts above τ⁺ are sufficiently pure). Importance-weighted sampling concentrates labels near the decision boundary where they matter. The result is a knob you can write into an SLA: “this join agrees with GPT-class judgment at 90% recall, 95% confidence,” with the residual risk quantified rather than vibed. If the proxy is garbage — uncorrelated with the oracle — the procedure degrades gracefully: the uncertain band widens until you’re paying the oracle for everything, which is the naive plan. You never silently get worse answers; you pay more money. That failure direction is a design choice and the correct one.

The worked numbers

Let’s cost the join from the code block honestly. Naive plan: 1,800 × 500 = 900,000 pairs × ~500 input tokens = 450M tokens at $3/M → $1,350, plus ~$45 of output tokens. Call it $1,400. Cascade plan: embed 1,800 abstracts and 500 descriptions at ~250 tokens each ≈ 575K tokens at $0.02/M → $0.012 (effectively free). Oracle-label a calibration sample of 1,000 pairs → $1.55. Suppose calibration sets thresholds such that 97.8% of pairs fall outside [τ⁻, τ⁺] — realistic, because almost all of the 900K pairs are obvious non-matches — leaving ~20,000 uncertain pairs for the oracle: 20,000 × 500 tokens × $3/M ≈ $30. Total ≈ $32 against $1,400: a 44× reduction, with a certified recall floor. And note the asymmetry of where time went: the embedding pass is two batched calls; the oracle pass parallelizes to whatever rate limit you can buy. Latency drops from days to minutes. This single pattern — cheap scan, calibrated thresholds, expensive verification on the margin — is to semantic query processing what the B-tree was to Week 2: the default you must justify deviating from.

Field note A team I advised ran an uncascaded sem_join between a 40K-row support-ticket table and an 800-row known-issues table as an overnight batch job — “it’s just a join.” Thirty-two million model calls later, the bill was $19,000 and the rate limiter had stretched “overnight” to four days. The fix took an afternoon: embeddings plus a 3%-band oracle pass brought the rerun to $410. Nobody on that team had read Selinger; everybody now has. Quadratic operators stop being an abstraction the moment each evaluation has a unit price printed on it.

Lecture 2 · Thursday

Optimizing pipelines that cost dollars and lie

Tuesday gave us an algebra and one heroic physical optimization. Thursday asks the System R question: who picks the plan? In 1979 the answer was a dynamic program over join orders minimizing a scalar cost. In 2025 the answer cannot be that, because semantic plans vary along three axes at once — runtime, dollars, and quality — and quality is not something you can buy back after the fact the way you can re-sort a mis-ordered intermediate. Palimpzest (Liu, Russo et al., CIDR 2025) is the cleanest statement of the new problem: the user declares a pipeline and a policy (“maximize quality subject to cost ≤ $50,” or “minimize cost subject to quality ≥ 0.9 of the reference plan”), and the optimizer searches a space of physical plans whose points are scattered across a three-dimensional Pareto surface. The deep lesson of the lecture is structural: when plan goodness is a partial order, the optimal-substructure assumption underneath forty-five years of query optimization quietly fails, and we have to reach for machinery the community built decades ago for a problem almost nobody had — until now.

The physical plan space is the model menu

What is a “physical operator” when the logical operator is sem_filter? Palimpzest’s answer: every concrete way of getting the judgment. The same logical filter can be implemented by a frontier model with chain-of-thought; a mid-tier model with a terse prompt; a small local model; an embedding-similarity proxy with calibrated thresholds (Tuesday’s cascade is just one physical operator here!); a fused prompt that evaluates three adjacent filters in one call to amortize the per-call token overhead; or — Palimpzest’s slyest trick — synthesized code: ask a frontier model once to write a Python function that approximates the predicate, then run that function for free on every row. Each choice lands at a different (runtime, cost, quality) point. A taste of the menu for one filter over 10,000 documents:

Physical implementation	Cost (10K rows)	Latency	Quality (vs. reference)
Frontier model, chain-of-thought	~$95	hours (rate-limited)	1.00 (is the reference)
Mid-tier model, terse prompt	~$6	~20 min	0.93
Cascade: embedding proxy → frontier on 3% band	~$3.20	~10 min	0.95 (certified floor 0.90)
Three filters fused into one mid-tier call	~$2.50/filter	~20 min	0.88 (interference between criteria)
Code synthesis: one frontier call writes a classifier fn	~$0.40 total	seconds	0.71 (great on lexical predicates, bad on judgment)

No row dominates. That non-domination is the whole story: the optimizer can’t collapse this table to a single number without asking the user what they value, which is exactly why Palimpzest makes the policy part of the query.

Why System R’s dynamic programming breaks

Selinger-style optimization rests on the principle of optimality: the best plan for joining {A,B,C} extends the best plan for some sub-join, so you can memoize one winner per subexpression and prune everything else. That works because scalar costs are totally ordered — “best” is well-defined and composition is monotone. Under a (cost, accuracy) objective, goodness is a partial order: plan P dominates Q only if P is no worse on every axis and strictly better on one. Two subplans where one is cheaper and the other more accurate are incomparable, and — here is the killer — the cheap-but-sloppy subplan may be the prefix of the globally best full plan, because a downstream operator might tolerate (or even mask) its errors, while the accurate-but-pricey subplan blows the budget. Prune it at the memo table and you’ve thrown away the optimum. The repair is to memoize the full Pareto frontier of undominated subplans at each node and compose frontiers, pruning only dominated points. You’ve seen a baby version of this: Selinger’s own “interesting orders” keeps multiple plans per subexpression because sortedness can pay off later. Pareto enumeration is interesting orders with the dial turned to eleven — and the frontier can grow multiplicatively with pipeline depth, so real systems (Palimpzest included) approximate: sample-based quality estimation on a small labeled prefix of the data, frontier sparsification, and confidence-interval pruning à la multi-armed bandits to stop spending estimation budget on plans that are probably dominated.

Historical aside The theory was waiting for us. Ioannidis, Ng, Shim and Sellis formalized parametric query optimization in 1992; Papadimitriou and Yannakakis gave multi-objective query optimization its complexity-theoretic teeth in 2001, including the result that Pareto frontiers can be exponentially large but admit polynomial ε-approximations. For thirty years this was a beautiful answer in search of a question — optimizers shipped with one cost number because users wanted one plan. It took predicates with API bills and error bars to make multi-objective optimization the load-bearing wall of a CIDR paper. Old theory doesn’t die; it waits for its workload.

Estimating quality: the cardinality estimation of our era

A cost model needs statistics. System R had histograms; what does Palimpzest have? Nothing ahead of time — there is no precomputed histogram for “fraction of abstracts a mid-tier model misjudges on this predicate.” So quality estimation is online and sampled: run candidate physical operators on a small sample, score them against the champion (frontier) implementation as pseudo-ground-truth, and propagate estimated selectivities and quality through the plan DAG. Every pathology you learned in Week 5 returns wearing a new coat: errors compound multiplicatively across operators; correlated predicates wreck independence assumptions; and the estimates themselves now cost money, so the optimizer faces an explore/exploit problem — every sample spent evaluating a doomed plan is budget stolen from running the good one. If you remember one structural fact from Thursday, make it this: quality estimation is to semantic optimizers what cardinality estimation was to relational ones — the perpetually weakest input, and the place where the next decade of papers will live.

DocETL: when the right plan needs a different query

Palimpzest holds the logical plan fixed and shops for physical operators. DocETL (Shankar et al., VLDB 2025) makes a more radical observation about document pipelines: often every physical implementation of your logical plan is bad, because the logical operator itself is too big for a model to execute faithfully — “extract all medication mentions with dosage from this 80-page medical record” fails not on model choice but on attention over long context. DocETL’s optimizer is agentic: an LLM proposes rewrites of the pipeline using a library of directives — split a document into chunks and gather peripheral context around each; decompose one mega-map into a chain of smaller maps; insert a deduplicating resolve before an aggregation — and a second LLM-as-judge evaluates candidate rewrites on sampled data to decide whether the rewrite actually improved output quality. This should make you pleasantly uneasy: the optimizer is using the unreliable substrate to repair plans over the unreliable substrate, with sampling and validation as the only adult supervision. It works — 1.34× to 4.6× quality improvements on their benchmarks — precisely because evaluation-on-a-sample is cheaper and more reliable than generation-in-the-large. Verification being easier than synthesis is the oldest trick in computer science; semantic data systems run almost entirely on it.

A query plan used to be a route to the one right answer. Now it’s a position you take on a frontier of dollars, hours, and truth.

Where does this leave the classical picture? Intact in shape, transformed in content. We still have declarative queries, logical/physical separation, cost-based search, statistics, and a healthy fear of cross products. What changed is the unit of cost (microseconds → dollars), the contract of correctness (exact → bounded-error), and the order structure of “better” (total → partial). The physics held; the workload moved. Next week we point these operators at a stranger target: the agent’s own memory.

Readings

Read Before Thursday

Semantic Operators: A Declarative Model for Rich, AI-Based Data Processing (LOTUS) — Patel, Jha, Pan, Gupta, Asawa, Guestrin, Zaharia, VLDB 2025.The algebra itself. Focus on the formal semantics of accuracy targets and the cascade algorithms for sem_filter/sem_join — especially how thresholds are calibrated to give statistical guarantees, not vibes.

Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing — Liu, Russo, Cafarella et al., CIDR 2025.The optimizer story. Focus on the physical plan space (model tiers, fusion, code synthesis) and how plan search handles the (runtime, cost, quality) Pareto frontier; skim the implementation section.

DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing — Shankar, Parameswaran, Wu, VLDB 2025.Optimization beyond physical operator choice. Focus on the rewrite directives and on how LLM-as-judge validates rewrites — ask yourself where this validation loop could be fooled.

Exercises

This Week’s Problems

Exercise 10.1 · warm-up

Cost out a sem_join between a 25,000-row product-review table and a 1,200-row defect-taxonomy table. Assume 600 input tokens per pair, $3/M input tokens for the oracle, $0.02/M for embeddings (300 tokens per row), and a cascade whose uncertain band is 1.5% of pairs after a 2,000-pair calibration sample. Compute: (a) the naive plan’s dollar cost; (b) the cascade’s total cost including calibration; (c) the band width (as a percentage of pairs) at which the cascade stops being cheaper than the naive plan. State every assumption you add.

Exercise 10.2 · core

Implement sem_filter with a calibrated cascade over the 5,000-abstract sample corpus from the course bucket, predicate: “this abstract proposes a new benchmark or dataset.” Use any embedding model as proxy and any frontier model as oracle. Following the SUPG recipe, label a 400-item importance-weighted sample, choose τ⁻ and τ⁺ to target recall ≥ 0.9 at δ = 0.05, then evaluate on a held-out 500-item oracle-labeled test set. Report: realized precision/recall, fraction of items routed to the oracle, total dollar cost, and a plot of cost versus target recall for r* ∈ {0.8, 0.9, 0.95, 0.99}. Explain the shape of that curve.

Exercise 10.3 · stretch

System R memoizes one optimal plan per subexpression; Pareto enumeration memoizes a frontier, which can grow multiplicatively with pipeline depth. Design and analyze a plan-enumeration algorithm for linear semantic pipelines of k operators, each with c physical implementations, that maintains at most B frontier points per memo entry via ε-dominance sparsification (P ε-dominates Q if P is within (1+ε) of Q on cost and at least Q’s quality). Prove a bound on how far your returned plan can be from the true Pareto frontier as a function of ε, k, and B — note that quality estimates carry sampling error, so your bound must hold with probability 1−δ given n-sample quality estimates per operator. Then either (a) show your bound is tight via an adversarial pipeline construction, or (b) implement the enumerator over the Lecture 2 table’s operator menu extended to k = 6 and measure realized regret against exhaustive enumeration. This is an open-flavored problem; partial rigor with honest gaps beats false completeness.

❦