- Specify semantic operators (
sem_filter,sem_join,sem_topk) with precision/recall contracts rather than exact-answer semantics. - Build a per-operator cost model spanning model tiers and use it to price candidate plans before running them.
- Calibrate a proxy-score cascade with a one-sided confidence bound, and explain why the guarantee holds.
- Apply classical optimizer ideas — selectivity estimation, predicate reordering — when the “predicate” is a language-model call.
- Report results as Pareto frontiers and ablation tables instead of single-point claims.
What You’re Building
Relational engines got fast because the optimizer, not the user, chose how to evaluate a query. Semantic query processing is at the pre-optimizer stage of that history: most pipelines today call a frontier model on every row, for every predicate, in whatever order the author happened to write them. This lab makes you build the missing layer. You will implement three semantic operators over a fixed corpus — 20,000 product reviews and 5,000 product descriptions in CSV — and an optimizer that assigns each operator a model tier, a cascade configuration, and a position in the plan.
All model calls go through simulator.py, a provided simulator with configurable, seeded error rates. There are no real API calls, no network, and no nondeterminism: given the same seed and the same plan, your pipeline must produce byte-identical output. This is what makes the lab gradable — and it is also a non-negotiable spec item (see the rubric’s reproducibility row).
The benchmark query, fixed for everyone: find the 25 products whose reviews most credibly complain about battery degradation, joining each complaint to the product whose description claims long battery life. It decomposes into one sem_filter over reviews (P1: “complains about battery degradation”), one sem_filter over products (P2: “claims long battery life”), one sem_join (review discusses this product’s battery-life claim), and one sem_topk (k = 25, criterion “most credible battery complaint”).
Target. The naive plan runs every call on the frontier tier in written order; the autograder prices it at about $205,742 on the benchmark seed. Your optimized plan must cost ≤ naive/10 while scoring within 2% relative F1 of the same-seed naive plan for the join stage and ≥ 0.93 nDCG@25 for the top-k stage.
Non-goals and forbidden shortcuts
- No gold-standard mining. You may read
calibration_reviews.csvandcalibration_products.csv(500 labeled rows in total) programmatically. The gold rules are visible insimulator.py— deriving labels fromground_truth_filter, the salt, or the planted signal fragments, in code or “by inspection,” is an academic-integrity violation; the grader re-calibrates your thresholds on a shifted split and catches it. - No answer caching across seeds. Memoizing within one run is fine (and expected); persisting simulator outputs to disk and replaying them on graded seeds is not.
- No hand-tuned magic constants. Every threshold in your submitted plan must be produced by your calibration code from the calibration split. The grader re-runs calibration on a shifted split; hardcoded thresholds will visibly fail the guarantee check.
- No real LLMs. Importing any model client disqualifies the run. The point is the optimizer, not the model.
The Starter Kit
Clone data2027/lab4-starter. It contains the sealed simulator (simulator.py), the corpus generator (make_data.py), and the operator stubs with a working naive sem_filter reference (operators_stub.py). Running python3 make_data.py writes two fixed-schema CSVs: reviews.csv (20,000 rows: review_id, product_id, stars, text) and products.csv (5,000 rows: product_id, title, description, category), plus the two calibration files. Every row is a pure function of its id, so the CSVs hash identically everywhere — regenerate freely; never edit them.
The simulator exposes two tiers plus a free proxy scorer. The proxy is an embedding-similarity heuristic: cheap, biased, and exactly the kind of signal a cascade is built to exploit. Its agreement with gold is roughly 75% on this corpus — useless as a final answer, valuable as a router.
# simulator.py — provided; treat as a sealed service. Do not modify.
from simulator import Simulator
sim = Simulator(seed=2027) # deterministic per (seed, tier, doc)
keep = sim.judge_filter(doc, predicate, tier="cheap") # "cheap" | "frontier"
match = sim.judge_pair(left, right, predicate, tier="frontier")
s = sim.score_proxy(doc, predicate) # float in [0, 1] — costs $0
sim.ledger # the only meter the grader trusts
# Operators YOU implement — operators_stub.py (signatures are frozen)
def sem_filter(sim: Simulator, rows: Sequence[Row], predicate: str,
target_precision: float = 0.95,
plan: Optional[OpConfig] = None) -> list[Row]
def sem_join(sim: Simulator, left: Sequence[Row], right: Sequence[Row],
predicate: str, budget: Optional[float] = None,
plan: Optional[OpConfig] = None) -> list[tuple[Any, Any]]
def sem_topk(sim: Simulator, rows: Sequence[Row], criterion: str, k: int,
plan: Optional[OpConfig] = None) -> list[Row]
# The plan — Plan/OpConfig/CascadeConfig dataclasses in operators_stub.py;
# serialize into results/ in any stable text format (the grader re-executes
# plans, it does not trust prose)
Plan(order=["P2", "P1", "JOIN", "TOPK"],
ops={"P1": OpConfig(op="sem_filter", predicate="...", tier="cheap",
cascade=CascadeConfig(tau_lo=0.21, tau_hi=0.84,
delta=0.05)),
"JOIN": OpConfig(op="sem_join", predicate="...", blocker_m=8)})
Costs and latencies are part of the spec, not something you measure. Your cost model must reproduce the table below exactly; the autograder cross-checks your reported plan price against its own meter and rejects mismatches above 0.5%.
| Tier | $ / 1k input tok | $ / 1k output tok | Latency / call | Seeded error rate |
|---|---|---|---|---|
proxy | 0.000 | — | 0.4 ms | 25% disagreement w/ gold at the 0.5 threshold |
cheap | 0.025 | 0.10 | 180 ms | 8% flip, biased toward “yes” |
frontier | 0.750 | 3.00 | 1,400 ms | 2% flip; false positives rarer than misses |
Latency matters only for the stretch metric (critical-path latency under 32-way concurrency, reported but lightly weighted). Dollars are the primary axis.
Correct Operators, Naive Plan
Implement all three operators so the conformance checks pass under the all-frontier plan (start from the provided smoke test: python3 operators_stub.py). Operator semantics are statistical contracts, frozen as follows. sem_filter(T, p) returns the satisfying rows, in input order; the kept set must achieve precision ≥ 0.95 and recall ≥ 0.90 against gold when run all-frontier. sem_join(L, R, p) returns the set of (left_id, right_id) pairs satisfying p; the naive implementation evaluates the full cross-product of surviving rows — that quadratic blowup (~1.73M pair judgements) is precisely what makes the naive plan cost $205,742. sem_topk(T, c, k) must return a full ranking of k rows; ties broken by ascending row id so output stays deterministic. Internally you may implement it as a tournament or pairwise-comparison sort, but the comparison count is billed like everything else.
Deliverable: green conformance run plus results/m1.json (the grader writes it) showing naive-plan cost within 0.5% of $205,742. Nothing clever yet — M1 is the control arm of every experiment you will run later.
Cascades With a Guarantee
A cascade routes each row by its free proxy score s(x): accept outright if s(x) ≥ τhi, reject outright if s(x) ≤ τlo, and escalate the uncertain middle band to a model tier (cheap first, frontier only on low-confidence disagreement). The entire trick is choosing τhi and τlo so that the shortcuts don’t silently wreck quality.
The statistics are simpler than they look, and you must implement them rather than eyeball thresholds. Take the m calibration rows with s(x) ≥ τhi and compute p̂, the fraction the gold labels mark positive — that’s the observed precision of the “accept” shortcut. Because m is small, p̂ is noisy, so you don’t trust it directly: you use a one-sided lower confidence bound. With Hoeffding’s inequality the bound is p̂ − √(ln(1/δ) / 2m); with probability at least 1 − δ, the true precision is no lower than that. Your calibrator should sweep τhi from strict to loose and pick the loosest threshold whose lower bound still clears the target (0.95) — loosest, because every additional accepted row is a model call you never pay for. Run the mirror-image procedure on τlo against the recall target. Use δ = 0.05 per threshold. A Clopper–Pearson exact bound is accepted and slightly tighter; say which you used.
One subtlety the grader checks: the guarantee is only valid if calibration rows are an i.i.d. sample of the rows the cascade will route. Calibrate per predicate — a τhi fitted on P1 (battery complaints) transfers terribly to P2 (marketing claims), and the “shifted split” re-run in the autograder will expose any cross-predicate reuse.
For the join, the cross-product is the enemy: cascade thresholds alone won’t save you if you score 20,000 × 5,000 pairs. Use the proxy as a blocker — keep only each review’s top-m products by proxy score (blocker_m in the plan; calibrate the recall loss it induces the same way you calibrate τlo). Deliverable: results/m2.json with per-predicate thresholds, their lower bounds, and an empirical check that quality held on calibration.
Reordering, Experiments, Report
Now make it an optimizer. Your planner reads the operator DAG, estimates each predicate’s selectivity and per-row cost from a 200-row pilot sample (pilot calls are billed — budget for them), and emits the cheapest valid ordering. The classical rule survives translation: run the most selective, cheapest-per-row filter first. P2 runs over 5,000 products; P1 over 20,000 reviews; whether filtering products first shrinks the join enough to pay for itself is exactly the question your cost model must answer with numbers, not vibes.
Required experiments — all on seeds {2027, 2028, 2029}, reporting mean and range:
- E-NAIVE: all-frontier, written order. Your control row.
- E-CASCADE: calibrated cascades, written order.
- E-CASCADE+REORDER: the full optimizer, plus join blocking.
- Pareto sweep: vary δ ∈ {0.20, 0.10, 0.05, 0.01} and
blocker_m∈ {4, 8, 16}; plot cost (x, log scale) vs quality (y) and trace the frontier. - Ablation table: E-CASCADE+REORDER with each mechanism removed one at a time (no cascade, no reorder, no blocking, no cheap tier), one row each: cost, F1, nDCG@25.
Report format (report.pdf, ≤ 4 pages, two columns fine): §1 plan diagram of your final E-CASCADE+REORDER plan with tier and threshold annotations; §2 the E-NAIVE / E-CASCADE / E-CASCADE+REORDER cost/quality table; §3 the Pareto figure with your submitted operating point marked; §4 the ablation table plus one paragraph per row explaining the mechanism’s marginal contribution; §5 a short “where the guarantee leaks” discussion — every cascade has one, and finding yours is worth more than pretending it doesn’t exist. Submit the repo with your serialized plan, results/, and report.pdf; python3 run_experiments.py must reproduce every number in the report.
How the 100 Points Fall
| Component | Pts | What we check |
|---|---|---|
| Operator correctness (M1) | 20 | Conformance suite green; all-frontier plan meets precision/recall contracts; naive cost within 0.5% of $205,742. |
| Cascade calibration (M2) | 20 | Thresholds derived from calibration code; lower bounds computed correctly; guarantee holds on the grader’s shifted split. |
| Optimizer & target (M3) | 20 | ≤ naive/10 at ≤ 2% F1 loss and ≥ 0.93 nDCG@25 on held-out seeds. Partial credit on a published sliding scale from 5× / 4%. |
| Experiments & sweeps | 20 | All five present, multi-seed, with ranges; Pareto frontier monotone-checked; ablations isolate one mechanism each. |
| Report | 10 | ≤ 4 pages; every claim traceable to results/; honest §5 failure analysis. |
| Reproducibility & code | 10 | python3 run_experiments.py reproduces all numbers byte-identically; no forbidden shortcuts; readable plan/calibration code. |
Late policy: 10 points per 24 h on the affected milestone, capped at 48 h. The quality threshold is a cliff, not a slope — a plan at 2.4% loss grades as a 5×-tier plan no matter how cheap it is. That asymmetry is deliberate; it is how production systems are judged too.
Ways Previous Cohorts Lost Points
Simulator usage through one accounting choke point, not ad-hoc counters in three modules.
Answer Before You Submit
Each checkpoint is a short written answer (≤ 200 words) in CHECKPOINTS.md, graded inside the report points. They exist to catch conceptual errors before the autograder does.
Your calibration split has m = 92 rows above a candidate τhi, with p̂ = 0.978. Compute the Hoeffding lower bound at δ = 0.05. Does this threshold clear the 0.95 precision target? Now explain, in two sentences, why doubling the calibration sample is sometimes worth real simulator dollars — and what it would cost here given that calibration labels are free but proxy scores must still be computed.
Suppose your pilot estimates: P1 selectivity 0.06 at $0.9/1k rows (cascaded), P2 selectivity 0.30 at $1.1/1k rows, and join cost proportional to |L′| × m. Using your cost model, derive which filter order is cheaper end-to-end and by how much. Then state one condition under which the classical “most selective first” heuristic gives the wrong answer for semantic predicates but never for boolean ones.
Your cascade guarantee is conditioned on calibration and deployment rows being exchangeable. Name the specific way the benchmark query breaks this for the join operator (hint: blocking changes which pairs the escalation tier ever sees), and propose a measurement — runnable inside your budget — that would detect the resulting quality drift without touching the test split.