Lab 4: A Semantic-Operator Optimizer

Learning objectives — after this lab you can…

Specify semantic operators (sem_filter, sem_join, sem_topk) with precision/recall contracts rather than exact-answer semantics.
Build a per-operator cost model spanning model tiers and use it to price candidate plans before running them.
Calibrate a proxy-score cascade with a one-sided confidence bound, and explain why the guarantee holds.
Apply classical optimizer ideas — selectivity estimation, predicate reordering — when the “predicate” is a language-model call.
Report results as Pareto frontiers and ablation tables instead of single-point claims.

Overview

What You’re Building

Relational engines got fast because the optimizer, not the user, chose how to evaluate a query. Semantic query processing is at the pre-optimizer stage of that history: most pipelines today call a frontier model on every row, for every predicate, in whatever order the author happened to write them. This lab makes you build the missing layer. You will implement three semantic operators over a fixed corpus — 20,000 product reviews and 5,000 product descriptions in CSV — and an optimizer that assigns each operator a model tier, a cascade configuration, and a position in the plan.

All model calls go through simulator.py, a provided simulator with configurable, seeded error rates. There are no real API calls, no network, and no nondeterminism: given the same seed and the same plan, your pipeline must produce byte-identical output. This is what makes the lab gradable — and it is also a non-negotiable spec item (see the rubric’s reproducibility row).

The benchmark query, fixed for everyone: find the 25 products whose reviews most credibly complain about battery degradation, joining each complaint to the product whose description claims long battery life. It decomposes into one sem_filter over reviews (P1: “complains about battery degradation”), one sem_filter over products (P2: “claims long battery life”), one sem_join (review discusses this product’s battery-life claim), and one sem_topk (k = 25, criterion “most credible battery complaint”).

Target. The naive plan runs every call on the frontier tier in written order; the autograder prices it at about $205,742 on the benchmark seed. Your optimized plan must cost ≤ naive/10 while scoring within 2% relative F1 of the same-seed naive plan for the join stage and ≥ 0.93 nDCG@25 for the top-k stage.

Non-goals and forbidden shortcuts

No gold-standard mining. You may read calibration_reviews.csv and calibration_products.csv (500 labeled rows in total) programmatically. The gold rules are visible in simulator.py — deriving labels from ground_truth_filter, the salt, or the planted signal fragments, in code or “by inspection,” is an academic-integrity violation; the grader re-calibrates your thresholds on a shifted split and catches it.
No answer caching across seeds. Memoizing within one run is fine (and expected); persisting simulator outputs to disk and replaying them on graded seeds is not.
No hand-tuned magic constants. Every threshold in your submitted plan must be produced by your calibration code from the calibration split. The grader re-runs calibration on a shifted split; hardcoded thresholds will visibly fail the guarantee check.
No real LLMs. Importing any model client disqualifies the run. The point is the optimizer, not the model.

Provided materials

The Starter Kit

Clone data2027/lab4-starter. It contains the sealed simulator (simulator.py), the corpus generator (make_data.py), and the operator stubs with a working naive sem_filter reference (operators_stub.py). Running python3 make_data.py writes two fixed-schema CSVs: reviews.csv (20,000 rows: review_id, product_id, stars, text) and products.csv (5,000 rows: product_id, title, description, category), plus the two calibration files. Every row is a pure function of its id, so the CSVs hash identically everywhere — regenerate freely; never edit them.

The simulator exposes two tiers plus a free proxy scorer. The proxy is an embedding-similarity heuristic: cheap, biased, and exactly the kind of signal a cascade is built to exploit. Its agreement with gold is roughly 75% on this corpus — useless as a final answer, valuable as a router.

# simulator.py — provided; treat as a sealed service. Do not modify.
from simulator import Simulator

sim = Simulator(seed=2027)               # deterministic per (seed, tier, doc)
keep  = sim.judge_filter(doc, predicate, tier="cheap")   # "cheap" | "frontier"
match = sim.judge_pair(left, right, predicate, tier="frontier")
s = sim.score_proxy(doc, predicate)      # float in [0, 1] — costs $0
sim.ledger                               # the only meter the grader trusts

# Operators YOU implement — operators_stub.py (signatures are frozen)
def sem_filter(sim: Simulator, rows: Sequence[Row], predicate: str,
               target_precision: float = 0.95,
               plan: Optional[OpConfig] = None) -> list[Row]
def sem_join(sim: Simulator, left: Sequence[Row], right: Sequence[Row],
             predicate: str, budget: Optional[float] = None,
             plan: Optional[OpConfig] = None) -> list[tuple[Any, Any]]
def sem_topk(sim: Simulator, rows: Sequence[Row], criterion: str, k: int,
             plan: Optional[OpConfig] = None) -> list[Row]

# The plan — Plan/OpConfig/CascadeConfig dataclasses in operators_stub.py;
# serialize into results/ in any stable text format (the grader re-executes
# plans, it does not trust prose)
Plan(order=["P2", "P1", "JOIN", "TOPK"],
     ops={"P1": OpConfig(op="sem_filter", predicate="...", tier="cheap",
                         cascade=CascadeConfig(tau_lo=0.21, tau_hi=0.84,
                                               delta=0.05)),
          "JOIN": OpConfig(op="sem_join", predicate="...", blocker_m=8)})

Costs and latencies are part of the spec, not something you measure. Your cost model must reproduce the table below exactly; the autograder cross-checks your reported plan price against its own meter and rejects mismatches above 0.5%.

Tier	$ / 1k input tok	$ / 1k output tok	Latency / call	Seeded error rate
`proxy`	0.000	—	0.4 ms	25% disagreement w/ gold at the 0.5 threshold
`cheap`	0.025	0.10	180 ms	8% flip, biased toward “yes”
`frontier`	0.750	3.00	1,400 ms	2% flip; false positives rarer than misses

Latency matters only for the stretch metric (critical-path latency under 32-way concurrency, reported but lightly weighted). Dollars are the primary axis.

Milestone 1 · due Friday, week 10

Correct Operators, Naive Plan

Implement all three operators so the conformance checks pass under the all-frontier plan (start from the provided smoke test: python3 operators_stub.py). Operator semantics are statistical contracts, frozen as follows. sem_filter(T, p) returns the satisfying rows, in input order; the kept set must achieve precision ≥ 0.95 and recall ≥ 0.90 against gold when run all-frontier. sem_join(L, R, p) returns the set of (left_id, right_id) pairs satisfying p; the naive implementation evaluates the full cross-product of surviving rows — that quadratic blowup (~1.73M pair judgements) is precisely what makes the naive plan cost $205,742. sem_topk(T, c, k) must return a full ranking of k rows; ties broken by ascending row id so output stays deterministic. Internally you may implement it as a tournament or pairwise-comparison sort, but the comparison count is billed like everything else.

Deliverable: green conformance run plus results/m1.json (the grader writes it) showing naive-plan cost within 0.5% of $205,742. Nothing clever yet — M1 is the control arm of every experiment you will run later.

Milestone 2 · due Friday, week 11

Cascades With a Guarantee

A cascade routes each row by its free proxy score s(x): accept outright if s(x) ≥ τ_hi, reject outright if s(x) ≤ τ_lo, and escalate the uncertain middle band to a model tier (cheap first, frontier only on low-confidence disagreement). The entire trick is choosing τ_hi and τ_lo so that the shortcuts don’t silently wreck quality.

The statistics are simpler than they look, and you must implement them rather than eyeball thresholds. Take the m calibration rows with s(x) ≥ τ_hi and compute p̂, the fraction the gold labels mark positive — that’s the observed precision of the “accept” shortcut. Because m is small, p̂ is noisy, so you don’t trust it directly: you use a one-sided lower confidence bound. With Hoeffding’s inequality the bound is p̂ − √(ln(1/δ) / 2m); with probability at least 1 − δ, the true precision is no lower than that. Your calibrator should sweep τ_hi from strict to loose and pick the loosest threshold whose lower bound still clears the target (0.95) — loosest, because every additional accepted row is a model call you never pay for. Run the mirror-image procedure on τ_lo against the recall target. Use δ = 0.05 per threshold. A Clopper–Pearson exact bound is accepted and slightly tighter; say which you used.

One subtlety the grader checks: the guarantee is only valid if calibration rows are an i.i.d. sample of the rows the cascade will route. Calibrate per predicate — a τ_hi fitted on P1 (battery complaints) transfers terribly to P2 (marketing claims), and the “shifted split” re-run in the autograder will expose any cross-predicate reuse.

Fig. 1 — Cascade routing for one semantic predicate. The free proxy score splits rows into accept / reject / escalate bands; only the uncertain middle pays for tokens, and only its hardest residue reaches the frontier tier. Thresholds are chosen so the shortcuts hold precision ≥ 0.95 and recall ≥ 0.90 with probability ≥ 1 − δ.

For the join, the cross-product is the enemy: cascade thresholds alone won’t save you if you score 20,000 × 5,000 pairs. Use the proxy as a blocker — keep only each review’s top-m products by proxy score (blocker_m in the plan; calibrate the recall loss it induces the same way you calibrate τ_lo). Deliverable: results/m2.json with per-predicate thresholds, their lower bounds, and an empirical check that quality held on calibration.

The optimizer’s job is not to avoid the frontier model — it is to spend it only where it changes the answer.

Milestone 3 · due Friday, week 12

Reordering, Experiments, Report

Now make it an optimizer. Your planner reads the operator DAG, estimates each predicate’s selectivity and per-row cost from a 200-row pilot sample (pilot calls are billed — budget for them), and emits the cheapest valid ordering. The classical rule survives translation: run the most selective, cheapest-per-row filter first. P2 runs over 5,000 products; P1 over 20,000 reviews; whether filtering products first shrinks the join enough to pay for itself is exactly the question your cost model must answer with numbers, not vibes.

Required experiments — all on seeds {2027, 2028, 2029}, reporting mean and range:

E-NAIVE: all-frontier, written order. Your control row.
E-CASCADE: calibrated cascades, written order.
E-CASCADE+REORDER: the full optimizer, plus join blocking.
Pareto sweep: vary δ ∈ {0.20, 0.10, 0.05, 0.01} and blocker_m ∈ {4, 8, 16}; plot cost (x, log scale) vs quality (y) and trace the frontier.
Ablation table: E-CASCADE+REORDER with each mechanism removed one at a time (no cascade, no reorder, no blocking, no cheap tier), one row each: cost, F1, nDCG@25.

Report format (report.pdf, ≤ 4 pages, two columns fine): §1 plan diagram of your final E-CASCADE+REORDER plan with tier and threshold annotations; §2 the E-NAIVE / E-CASCADE / E-CASCADE+REORDER cost/quality table; §3 the Pareto figure with your submitted operating point marked; §4 the ablation table plus one paragraph per row explaining the mechanism’s marginal contribution; §5 a short “where the guarantee leaks” discussion — every cascade has one, and finding yours is worth more than pretending it doesn’t exist. Submit the repo with your serialized plan, results/, and report.pdf; python3 run_experiments.py must reproduce every number in the report.

Grading rubric

How the 100 Points Fall

Component	Pts	What we check
Operator correctness (M1)	20	Conformance suite green; all-frontier plan meets precision/recall contracts; naive cost within 0.5% of $205,742.
Cascade calibration (M2)	20	Thresholds derived from calibration code; lower bounds computed correctly; guarantee holds on the grader’s shifted split.
Optimizer & target (M3)	20	≤ naive/10 at ≤ 2% F1 loss and ≥ 0.93 nDCG@25 on held-out seeds. Partial credit on a published sliding scale from 5× / 4%.
Experiments & sweeps	20	All five present, multi-seed, with ranges; Pareto frontier monotone-checked; ablations isolate one mechanism each.
Report	10	≤ 4 pages; every claim traceable to `results/`; honest §5 failure analysis.
Reproducibility & code	10	`python3 run_experiments.py` reproduces all numbers byte-identically; no forbidden shortcuts; readable plan/calibration code.

Late policy: 10 points per 24 h on the affected milestone, capped at 48 h. The quality threshold is a cliff, not a slope — a plan at 2.4% loss grades as a 5×-tier plan no matter how cheap it is. That asymmetry is deliberate; it is how production systems are judged too.

Pitfalls

Ways Previous Cohorts Lost Points

Pitfall · proxy bias The proxy is not uniformly wrong: it over-scores reviews that merely mention batteries without complaining, so a loose τ_hi on P1 admits a correlated cluster of false positives that a random-error model never predicts. Your Hoeffding bound still holds — on calibration’s distribution. Inspect the proxy’s confusion by stars-bucket before trusting one global threshold; per-bucket thresholds are allowed and often worth half a point of F1.

Pitfall · gold-standard contamination Subtler than opening the test file: tuning δ, m, or the plan order by repeatedly submitting to the autograder and watching the held-out score is also contamination — you’re using the test set as a validation set, twenty bits at a time. You get five graded submissions total. Carve your own validation split out of calibration and develop against that.

Pitfall · billing the wrong thing Teams routinely forget that pilot-sample calls, top-k comparison calls, and frontier escalations inside the join all hit the meter. If your self-reported cost disagrees with the grader’s by more than 0.5%, the run is rejected outright — instrument Simulator usage through one accounting choke point, not ad-hoc counters in three modules.

Pitfall · seed leakage Deterministic ≠ seed-independent. If your threshold sweep ties and you break the tie using simulator outputs from the benchmark seed, your plan quietly overfits to seed 2027 and the 2028/2029 runs sag. Break ties with fixed lexicographic rules.

Checkpoint questions

Answer Before You Submit

Each checkpoint is a short written answer (≤ 200 words) in CHECKPOINTS.md, graded inside the report points. They exist to catch conceptual errors before the autograder does.

Checkpoint 4.1 · the bound

Your calibration split has m = 92 rows above a candidate τ_hi, with p̂ = 0.978. Compute the Hoeffding lower bound at δ = 0.05. Does this threshold clear the 0.95 precision target? Now explain, in two sentences, why doubling the calibration sample is sometimes worth real simulator dollars — and what it would cost here given that calibration labels are free but proxy scores must still be computed.

Checkpoint 4.2 · the ordering

Suppose your pilot estimates: P1 selectivity 0.06 at $0.9/1k rows (cascaded), P2 selectivity 0.30 at $1.1/1k rows, and join cost proportional to |L′| × m. Using your cost model, derive which filter order is cheaper end-to-end and by how much. Then state one condition under which the classical “most selective first” heuristic gives the wrong answer for semantic predicates but never for boolean ones.

Checkpoint 4.3 · the leak

Your cascade guarantee is conditioned on calibration and deployment rows being exchangeable. Name the specific way the benchmark query breaks this for the join operator (hint: blocking changes which pairs the escalation tier ever sees), and propose a measurement — runnable inside your budget — that would detect the resulting quality drift without touching the test split.

❦

Provided Materials

Starter files

→

simulator.pySealed deterministic mock-LLM: tier presets, judge_filter / judge_pair / score_proxy, the cost-and-latency ledger, and the frozen-gold grading functions.

→

make_data.pySeeded corpus generator — 20,000 reviews + 5,000 products as CSV, byte-identical everywhere, plus the 500 sanctioned calibration labels.

→

operators_stub.pyThe file you edit: sem_filter / sem_join / sem_topk stubs, Plan and Cost dataclasses, Hoeffding bound helper, and a working naive sem_filter reference.

→

README.mdSetup, the three required experiments (naive vs cascade vs cascade+reorder), report format, and how the accuracy budget is checked against frozen gold.