- Why the course is sequenced stack-upward, and how the central thesis doubles as a grading instrument.
- Three pacing variants: 14-week semester, 10-week quarter, undergraduate adaptation.
- Lab logistics — what ships ready-made, what you must prepare, and the expected grading load.
- Enforcing "accountability, not abstinence" when frontier models are allowed everywhere.
- Re-skinning the Vantage Retail Group trap schema for your institution's domain.
Why Stack-Upward, and Why Frontiers
The course climbs from storage engines to agent governance because that is the order in which the thesis can be tested rather than asserted. The claim — the client changed, the physics didn't — is empty until students know what the physics are. So Part I re-reads B-trees, LSM-trees, and column stores as workload bets; Part II shows which bets the new client violates (recall as a first-class axis, branchable storage, learned components); Part III climbs to where accuracy actually lives; Part IV asks whether any of it is genuinely new. Teach it top-down and every claim about vector indexes or semantic layers becomes vendor folklore. Teach it bottom-up and Week 6's "a vector index is an access method, not a product" lands as an observation students can verify with amplification arithmetic they did in Week 2.
The thesis is also your grading instrument. Almost every assessment in the course reduces to one move: identify the classical assumption, show what the agentic workload does to it, measure the consequence. When you grade an exam answer or a project paper and can't find that move, the answer is wrong even when it is fluent — and fluency is now free, which is the point.
This is why everything is graded on Pareto frontiers and ablation tables rather than point estimates. A single headline number ("95% accuracy," "2 ms p99") is exactly the artifact a frontier model can produce on request, and exactly the artifact vendors lead with. A frontier requires the student to have actually swept the trade-off; an ablation requires them to have actually built each configuration. Both are cheap to verify (re-run the harness) and expensive to fake (you'd have to build the system anyway). The grading scheme is the LLM policy, expressed in points.
Three Ways to Schedule It
The 14-week semester is the design target: four parts, four labs, a project with real incubation time. The compressions below have been thought through; cutting anything else starts to sever load-bearing dependencies (Week 2's amplification arithmetic feeds Lab 1; Week 9's curve feeds Lab 3 directly).
| Variant | Schedule | What changes | What it costs |
|---|---|---|---|
| Semester 14 weeks |
As published. Labs in weeks 2–4, 5–8, 9–10, 10–12; project weeks 9–14; exam after Week 14. | Nothing. | Nothing. This is the course. |
| Quarter 10 weeks |
Cut Week 3 (column stores / vectorized execution) and Week 13 (self-driving systems); merge Weeks 11 and 12 into one "memory + protocols" week — Mem0/Graphiti on Tuesday, MCP and the lethal trifecta on Thursday. | Assign Week 3's two readings as background for Lab 1's quantization milestone; fold Week 13's self-design question into the Week 14 debate. Keep all four labs but overlap Labs 3 and 4 fully. | Students meet vectorized execution as a reading, not a lecture; the memory week loses its second lecture's depth. Acceptable. Cutting any Part I week other than 3 is not. |
| Undergrad 14 weeks |
Same lecture sequence, slower lab ramp. Drop Lab 1 Milestones 3–4 (graph-merging compaction and the full frontier harness) — students stop at a working SSTable format plus per-segment index, graded on correctness and a single measured trade-off. Replace the formal parts of exam Q1 (RUM-R) and Q4 (the optimizer's new objective) with structured essays: argue the four-axis conjecture and the multi-objective optimizer in prose, with one worked numeric example instead of a formalization. | Reweight Lab 1 to 10% and the project to 40%, or add a fifth problem-set week. Keep Labs 3 and 4 intact — they are the most undergrad-accessible and the most employable. | The course loses its hardest systems-building rite of passage. The thesis survives; the calluses don't. |
Running the Labs
Every lab ships with machine-tested starter materials under materials/ — Rust skeleton for Lab 1, Go pageserver for Lab 2, the full warehouse DDL and reference semantic layer for Lab 3, the deterministic simulator for Lab 4. What ships is the instructor edition. Four preparation tasks are yours, and two of them are about keeping answer keys out of student hands.
| Lab | Provided | You must | Grading load (pairs) |
|---|---|---|---|
| 1 · VLSM Rust, wks 2–4 |
Cargo skeleton (memtable.rs, sstable.rs, workload.rs), deterministic workload generator, bench.rs harness stub. cargo test passes on the skeleton. |
Distribute the grading suite — the hidden workload traces and recall ground truth — at the end of Week 2, after Milestone 1 is in. Releasing it with the skeleton invites overfitting the file format to the trace; withholding it past Week 2 makes Milestone 4's frontier unbuildable. | ~45 min/team: re-run their harness, eyeball the frontier for swept (not cherry-picked) points, read the ablation. |
| 2 · Mini-Neon any lang, wks 5–8 |
Go reference types, WAL generator (cmd/walgen), pageserver and branch skeletons. go vet clean. |
Stand up MinIO (or any S3-compatible store) before Week 5 — one shared instance with per-team buckets is fine; the lab's GetPage@LSN latencies assume object storage with real network round-trips, so do not let teams substitute the local filesystem. Budget an hour for credentials and a smoke test with walgen. |
~60 min/team: the 50-branch demo plus GC must run from their tag against your MinIO, not theirs. |
| 3 · Text-to-SQL Python, wks 9–10 |
Full 126-table Vantage DDL, 50 gold questions, reference semantic layer, grader stub. | The shipped schema.sql is the instructor edition: it carries 40 -- TRAP comments annotating every salted dysfunction. Strip them (one grep -v) to produce schema_student.sql before distribution, build the DuckDB snapshot, pin its SHA-256 in MANIFEST. Hold back questions.jsonl and the reference semantic_layer.yaml entirely — the gold questions are your held-out set and your leakage detector. |
~90 min/team: re-run the ablation grid, run the leakage diff, score their grader against your near-miss cases. |
| 4 · Semantic optimizer Python, wks 10–12 |
simulator.py, make_data.py (seeded, byte-identical corpus), operator stubs, 500 sanctioned calibration labels. |
The simulator students receive necessarily contains ground_truth_filter — but the grading salt and re-labeled held-out configuration stay with you. Generate a fresh salt per cohort, never commit it, and cap teams at five graded submissions so the autograder can't be used as an oracle. The grader re-labels under your salt; gold-mined plans collapse on contact. |
~45 min/team: mostly automated; the human time goes to reading the cascade-guarantee argument in the report. |
Total per cohort of 15 pairs: roughly 60 TA-hours across the term, front-loaded at lab deadlines. The deterministic harnesses are what make this tractable — every number in every report regenerates from a Git tag, so grading is re-running, not believing.
Accountability, Not Abstinence, in Practice
The policy — frontier models allowed everywhere, including the exam — reads as permissive. Enforced properly, it is the opposite: it relocates all integrity weight onto artifacts that models cannot fake, then makes faking them deterministically detectable.
Require methods notes. Every lab report and the exam carry a mandatory model-usage section: which models, for what, what they got wrong, what was kept. Grade it for specificity, not contrition — "Claude wrote the first compaction loop; it deadlocked on concurrent flush; here is the fix" is an A-grade sentence. Students learn quickly that the note is free marks if honest and a liability if vague, which is exactly the incentive you want them to carry into industry.
Know what cheating looks like when models are legal. It is never "used an LLM." It is:
- Gold-mining Lab 4's simulator. The labels are produced by visible code; deriving them from
ground_truth_filter, the salt, or the planted signal fragments — in code or "by inspection" — is the violation, and the lab page says so in those words. Detection is structural: the grader re-labels under a fresh salt, so a mined plan's quality collapses while an honest cascade's quality holds within its stated guarantee. Submission-throttling (five graded runs) closes the slower oracle attack. - Eval leakage in Lab 3. The subtle form is writing the semantic layer after watching the agent fail, encoding per-question fixes as "documentation." Detection: diff the YAML against question text for near-verbatim overlap, and compare accuracy on their 50 questions against your 10 held-out instructor questions on the same schema. A large gap is leakage, mechanically.
Both detections work only because the harnesses are deterministic — pinned model via the course proxy, temperature 0, fixed seeds, pinned snapshots. Preserve that determinism above all else when you adapt the course; it is the entire enforcement mechanism. A policy of "use anything, but every number must reproduce from your tag on our hardware" needs no honor code. The honor code is make reproduce.
What is graded is what models can't fake: Pareto frontiers from your own benchmarks, ablation tables, and designs you can defend in a hallway argument.— the course LLM policy, index page
Re-Skinning the Trap Schema
Vantage Retail Group is a synthetic retailer because retail is universally legible, but the schema works better when it smells like your institution's domain — a hospital network, a university ERP, a logistics firm. Re-skinning is safe if you understand what the 50 gold queries actually depend on: identifiers and data values, not semantics. The rules:
- Rename, never restructure. Build a single rename map (schema, table, and column names —
finance.revenue_recognized→billing.charges_recognized, etc.) and apply it consistently to the DDL, the data generator, the gold SQL inquestions.jsonl, and the referencesemantic_layer.yamlin one pass. Ased-script-from-a-table is the right tool; hand-editing four files is how gold queries silently break. - Preserve the trap topology. The three-revenue-variants trap must survive as three same-quantity-different-definition tables with the same grain, currency, and timing differences; the two active-user definitions must remain two defensible windows; the fourteen frozen legacy tables keep their
*_old/*_v1/*_baksuffixes and their freeze dates. The traps are the pedagogy — a hospital version has "billed vs. collected vs. recognized charges," not fewer revenue tables. - Do not touch grains, dates, or numbers. The snapshot's logical date (2025-11-30), every freeze date, row counts, and the generated values are load-bearing: gold answers are materialized from them. Change a date and five questions' answers drift.
- Verify mechanically. Rebuild the snapshot, run all 50 gold queries, and byte-diff the result sets against the originals. All 50 must match exactly — values are domain-neutral by construction, so any diff means your rename map missed a reference. Then re-pin the SHA-256, strip the
-- TRAPcomments, and re-run the four example grader cases.
Budget half a day. The renaming is an hour; convincing yourself the 50-for-50 diff is genuinely clean is the rest, and it is not optional.
Where This Courseware Came From
cargo test, the Lab 2 reference passes go vet and builds, the Lab 3 DDL and gold queries load and execute in DuckDB, and the Lab 4 simulator is deterministic by construction and verified seed-for-seed. None of that substitutes for your judgment. Machine-tested means the code runs, not that the pedagogy is right for your students; verified-base means the claims trace to sources, not that they will age well. Read every page you assign, run every lab you grade, and disagree where you disagree — then send the disagreement back. Corrections, adaptations, and new trap schemas are welcome as pull requests; the license is CC BY 4.0 precisely so this can become a course maintained by the people who teach it.