- Design and document a binary on-disk format with explicit alignment, endianness, and checksum rules — and write a reader that rejects corrupt input.
- Implement HNSW construction and beam search from the paper, and reason about its
M/efknobs as memory–recall–latency trade-offs. - Extend leveled compaction so that an auxiliary index (a proximity graph) is merged, not rebuilt, and quantify what that buys you.
- Build a benchmark harness for a mixed agentic workload and report results as Pareto frontiers and ablation tables rather than point estimates.
- Argue, with your own numbers, where the LSM write path and the ANN read path fight each other — and which knob resolves each fight.
What You Are Building
Agents read and write the same store in the same breath: an upsert of a tool result is followed, milliseconds later, by a semantic search over everything written so far. Classical vector databases assume a mostly-static corpus and rebuild indexes offline; classical LSM-trees ingest fast but cannot answer “what is near this embedding?” In this lab you build VLSM, an LSM engine in which every SSTable carries a vector segment: a block of scalar-quantized embeddings plus a serialized HNSW graph over them. Searches fan out across the memtable and every level’s graphs; compaction merges graphs instead of throwing them away.
The lab has four milestones. Milestone 1 is a file format. Milestone 2 is an index. Milestone 3 is the heart of the lab — compaction that preserves index work. Milestone 4 turns your engine into evidence: a harness, a frontier, and an ablation table. Each milestone has an acceptance test in the grading suite (cargo test --features grading, distributed at the end of Week 2); passing it is necessary but not sufficient for full credit.
Fix these workload parameters unless a milestone says otherwise: vector dimension D = 768 (f32 source vectors), cosine distance over L2-normalized inputs (so you may implement it as negative dot product), k = 10 for all recall figures, and the synthetic agent trace produced by the provided generator with seed 2027. “Recall@10” always means recall against exact f32 brute-force search over the live (non-deleted, latest-version) rows.
The Skeleton
Clone data2027/vlsm-skeleton. It is a small but real engine — roughly 1,000 lines — and it already passes its own key–value test suite. You get:
- Core types &
Enginetrait (lib.rs):Key/Value/DocId/Vector,EngineConfig/HnswParams, the error enum, andParetoRow. - Memtable (
memtable.rs): aBTreeMapmapping keys to values, with vectors in aDocId-keyed side-table and brute-forceknn. Flush threshold 64 MiB. - SSTable writer & format spec (
sstable.rs): the normative byte layout in its doc comments, a workingSstWriter(codec 0 = raw f32), and a referencecrc32c;SstReaderistodo!()— Milestone 1. - Workload generator (
workload.rs): emits the Milestone 4 trace, with brute-force ground truth per query. - Smoke harness (
src/bin/bench.rs): replays the trace against the memtable; the seed of your Milestone 4 harness.
The public surface you extend, abridged:
// src/lib.rs — provided; the memtable implements it, your Db must too
pub trait Engine {
fn put(&mut self, key: Key, value: Value, vector: Option<Vector>)
-> Result<Option<DocId>>;
fn get(&self, key: &[u8]) -> Result<Option<Value>>;
fn scan(&self, start: &[u8], end: &[u8]) -> Result<Vec<(Key, Value)>>;
fn knn(&self, query: &Vector, k: usize, ef: usize)
-> Result<Vec<(DocId, f32)>>;
}
// src/sstable.rs — writer provided (codec 0)
impl SstWriter {
pub fn new(path: impl Into<PathBuf>, dim: usize) -> SstWriter;
pub fn add(&mut self, key: &[u8], value: &[u8],
vector: Option<(DocId, &Vector)>) -> Result<()>;
pub fn add_tombstone(&mut self, key: &[u8]) -> Result<()>;
pub fn finish(self) -> Result<SstMeta>;
}
// src/sstable.rs — YOU WRITE (Milestones 1–2): the todo!() stubs
impl SstReader {
pub fn open(path: &Path) -> Result<SstReader>;
pub fn get(&self, key: &[u8]) -> Result<Option<Value>>;
pub fn scan(&self, start: &[u8], end: &[u8]) -> Result<Vec<(Key, Option<Value>)>>;
pub fn vector(&self, ordinal: u32) -> Result<Vector>;
pub fn knn(&self, query: &Vector, k: usize, ef: usize)
-> Result<Vec<(DocId, f32)>>; // Milestone 2
}
// new modules — YOU WRITE (Milestones 2–3): HNSW build/search/serialize,
// a Db (memtable + levels) implementing Engine, and graph-merging
// compaction, e.g.
pub fn merge_graphs(srcs: &[&SstReader], live: &DocIdFilter,
params: HnswParams) -> (SstWriter, MergeStats);
DocId is a u64, assigned sequentially in put order, one per vectored version; each segment’s row-id map takes a graph ordinal to its DocId. The workload generator pre-assigns the same sequence in Op::Upsert::doc_id — replay the trace in order, exactly once, and ground-truth ids match engine ids with no translation table.
The Vector Segment Format
Every .sst file holds six sections — header, key block, vector codes, codebook, row-id map, HNSW adjacency — followed by a 64-byte footer at EOF that points backward at all of them. All integers are little-endian. Every section starts at a 64-byte-aligned offset (pad with zeros), so the reader can hand out aligned zero-copy slices from the mmap. The layout:
| Section | Field | Type / size | Meaning |
|---|---|---|---|
| header (32 B) | magic | [u8; 8] | ASCII “VSST0001”; reject anything else |
version | u32 | 1 | |
dim | u32 | vector dimension D (768 here) | |
entry_count | u32 | rows in the key block | |
vector_count | u32 | number of vectors N (may be 0) | |
codec | u8 | 0 = raw f32, 1 = per-dimension i8 scalar quant, 2 = PQ | |
reserved | 7 B, zero | future codecs; readers must ignore | |
| key block | — | entry_count records | per record: key_len: u32 · val_len: u32 · flags: u32 (bit0 = tombstone) · vec_ordinal: u32 (0xFFFF_FFFF = no vector), then key and value bytes; keys strictly increasing |
| codes | — | N × D codes | row-major vectors; codec 0 = f32, codec 1 = i8 — code c decodes as scale[j]·c + bias[j]; codec 2 = PQ u8 |
| codebook | — | m_pq: u32 · k_pq: u32 + params | codec 1: D × f32 scales then D × f32 biases, computed over this segment’s vectors; codec 2: PQ centroids |
| row-id map | — | N × u64 | graph ordinal → DocId; sorted by ordinal |
| HNSW | — | see Fig. 1 | empty in Milestone 1 (num_layers = 0); filled in Milestone 2 |
| footer (64 B) | — | 6 × u64 + u32 crc32c + u32 magic “VSFT” + u64 pad | section offsets from file start; CRC covers header through HNSW |
Acceptance: round-trip property tests pass (write → read → decode within quantization error); the reader returns Err(Corrupt), never panics or UB, on the provided corpus of 200 mutated files; brute-force search over decoded i8 codes achieves recall@10 ≥ 0.95 against f32 brute force on the 100k-vector sample set.
Per-Level HNSW: Build and Search
Implement HNSW (Malkov & Yashunin, 2018 — assigned in Week 2) from scratch. Construction runs at flush and at the end of compaction, over the quantized codes; drive it from the provided HnswParams { m, ef_construction, ef_search } and serialize the result into the segment’s HNSW section as per-layer CSR adjacency (Fig. 1). Distances during build and search are computed directly on i8 codes using the segment’s scale/bias — do not decode whole vectors to f32 buffers.
Db::knn must consult every live source: brute-force over the memtable, then beam search (width ef) in each level’s segment graphs, merging candidates in a single global heap. A candidate whose DocId is shadowed by a newer version or a tombstone above it must be filtered before it can occupy one of the k result slots. Optionally re-rank the final 4k candidates with f32 vectors — keep this behind a flag; it is one of your ablations.
Acceptance: on a 1M-vector single-segment fixture with m = 16, ef = 64: recall@10 ≥ 0.90 with mean query latency ≤ 2 ms on the grading machine (8 cores, AVX2); construction ≤ 8 minutes; the serialized graph reloads byte-identically.
Graph-Merging Compaction
This is where VLSM differs from “LSM plus a vector index bolted on.” When compaction merges SSTables S₁…Sₙ into new output tables, the naive move is to rebuild each output graph from scratch — O(N log N) distance computations you already paid for once. Implement merge_graphs instead:
- Remap, don’t rebuild. Surviving nodes keep their neighbor lists, with ordinals rewritten through the compaction remap table.
- Repair, locally. Edges pointing at dropped rows (tombstoned or shadowed) are repaired by promoting neighbors-of-neighbors of the dropped node, re-selected with the standard HNSW heuristic; cap repair fan-out at
2mdistance evaluations per dropped node. - Insert the minority. Nodes from all source graphs but the largest are inserted into the largest graph’s structure via normal HNSW insertion (over quantized codes). Layer assignments may be preserved or redrawn — measure both and report.
- Requantize when forced. If the merged value range moves any dimension’s scale by more than 25%, recompute scale/bias and re-encode; otherwise keep the dominant segment’s parameters. Record which case occurred in
MergeStats.
Acceptance: after the scripted compaction storm in tests/m3_storm.rs (50 compactions, 30% deletes), recall@10 of merged graphs is within 0.02 of rebuilt-from-scratch graphs at equal ef, and total compaction CPU time is at least 3× lower than the rebuild baseline — which you must also implement, behind --features rebuild-baseline, because it is the control arm of your ablation.
Benchmark Harness and Pareto Report
The provided generator (workload.rs) emits a synthetic agent trace: sessions that interleave upserts (tool results, ~20% of ops), range scans (~10%), and vector searches (~70%) with Zipfian skew over both keys and query templates. Build a harness that replays it at a fixed arrival rate while sampling four axes: search latency p99, sustained update throughput, resident memory (RSS, plus your own accounting of graph bytes), and recall@10 (sampled: every 50th search is also answered by the brute-force oracle).
Sweep at least three knobs — m ∈ {8, 16, 32}, ef ∈ {16, 64, 256}, and re-ranking on/off at minimum; codec and level fan-out are encouraged — and plot the Pareto frontier in the recall–latency plane with memory encoded as mark size, plus one ablation table isolating each knob’s marginal effect with the others held at defaults. The report (≤ 6 pages, PDF) must state the hardware, the seed, and the exact commit; every figure must be regenerable by make report.
Rubric
| Component | Points | What we look at |
|---|---|---|
| Correctness | 40 | Milestone acceptance tests (10 each): format round-trip and corruption handling; HNSW recall/latency floor; merge-vs-rebuild recall gap and CPU ratio; harness oracle agreement. Tests run on our machine, fresh clone, cargo test --features grading. |
| Performance frontier | 30 | Your frontier’s dominance region against the class. Full credit requires a frontier — at least 6 non-dominated configurations — not a single tuned point. A fast system with recall 0.6 and a slow one with recall 0.99 both earn less than a genuine trade-off curve. |
| Report | 20 | Pareto plot correctness, ablation table, honest negative results, reproducibility (make report works). Claims must cite your own measurements. |
| Code quality | 10 | No unsafe outside the SstReader cast paths (each with a safety comment); no panics on user input; clippy-clean; the format documented in FORMAT.md well enough for a stranger to write an independent reader. |
Late policy: 10% per day, 3 days maximum. One milestone may be designated “best effort” in your report for a 5-point cap reduction instead of failing its tests silently — honesty is cheaper than hope.
Where Last Year’s Cohorts Bled
&[u32] at an unaligned offset is undefined behavior that often “works” on x86 until it doesn’t. The 64-byte section alignment in the spec is not decorative — validate alignment in the reader and add a miri test for the cast paths.ef and recall decays as the trace ages. Filter at candidate-expansion time, and keep dropped-but-repaired nodes usable for routing until the next compaction if you want the easy fix.What You May Not Do
- No off-the-shelf vector or ANN libraries.
faiss,hnswlib,usearch,instant-distance,qdrant-anything — including “just for the baseline” or via FFI. The brute-force oracle, the quantizer, and the graph are the lab. - No serde-derived on-disk formats. The vector segment is written and parsed by your own code against the byte-level spec above. (Serde for the report’s JSON output is fine.)
- No single-number results. Every performance claim in the report carries a distribution (p50/p99) or a frontier; “1.7× faster” with no operating point is graded as unsupported.
- Allowed:
crossbeam,rayon,memmap2,crc32c, hand-written SIMD viastd::arch, and anything already in the skeleton’s lockfile. When in doubt, ask on the course forum before the due date, not in the regrade request.
Checkpoint Questions
Answer in ≤ 1 page each, in CHECKPOINTS.md, before the milestone they accompany. They are graded inside the report component.
Why per-level graphs instead of one global mutable HNSW over the whole store? Give the write-amplification argument from SSTable immutability, then compute: with level fan-out 10, L0 holding up to 4 files, and 100M vectors at steady state, how many graphs does one search touch in the worst case, and how does expected search cost scale with store size compared to a single graph? State your assumptions about per-graph search cost.
Derive a worst-case bound on the error of a dot product computed over per-dimension i8 scalar-quantized vectors with affine decode scale[j]·c + bias[j], as a function of D and the per-dimension value ranges. Using your Milestone 1 measurements, explain when f32 re-ranking of the top 4k candidates recovers essentially all lost recall — and construct (or describe) a data distribution where it cannot.
In merge_graphs, how do you choose the entry point and maximum layer of the merged graph, and why is naively keeping the largest source’s entry point risky after heavy deletes? Then give an asymptotic cost comparison of merge (edge repair + minority insertion) versus full rebuild as a function of the surviving fraction s and the size ratio between sources, and predict — before running the storm test — the s at which rebuilding wins. Check the prediction against your measurements and report the discrepancy.