Lab 1: VLSM — an LSM-Tree with Vector Segments

Learning objectives — after this lab you can…

Design and document a binary on-disk format with explicit alignment, endianness, and checksum rules — and write a reader that rejects corrupt input.
Implement HNSW construction and beam search from the paper, and reason about its M/ef knobs as memory–recall–latency trade-offs.
Extend leveled compaction so that an auxiliary index (a proximity graph) is merged, not rebuilt, and quantify what that buys you.
Build a benchmark harness for a mixed agentic workload and report results as Pareto frontiers and ablation tables rather than point estimates.
Argue, with your own numbers, where the LSM write path and the ANN read path fight each other — and which knob resolves each fight.

Overview & Goals

What You Are Building

Agents read and write the same store in the same breath: an upsert of a tool result is followed, milliseconds later, by a semantic search over everything written so far. Classical vector databases assume a mostly-static corpus and rebuild indexes offline; classical LSM-trees ingest fast but cannot answer “what is near this embedding?” In this lab you build VLSM, an LSM engine in which every SSTable carries a vector segment: a block of scalar-quantized embeddings plus a serialized HNSW graph over them. Searches fan out across the memtable and every level’s graphs; compaction merges graphs instead of throwing them away.

The lab has four milestones. Milestone 1 is a file format. Milestone 2 is an index. Milestone 3 is the heart of the lab — compaction that preserves index work. Milestone 4 turns your engine into evidence: a harness, a frontier, and an ablation table. Each milestone has an acceptance test in the grading suite (cargo test --features grading, distributed at the end of Week 2); passing it is necessary but not sufficient for full credit.

We grade the frontier, not the point. A system that can’t trade memory for recall isn’t tuned — it’s stuck.

Fix these workload parameters unless a milestone says otherwise: vector dimension D = 768 (f32 source vectors), cosine distance over L2-normalized inputs (so you may implement it as negative dot product), k = 10 for all recall figures, and the synthetic agent trace produced by the provided generator with seed 2027. “Recall@10” always means recall against exact f32 brute-force search over the live (non-deleted, latest-version) rows.

Provided Materials

The Skeleton

Clone data2027/vlsm-skeleton. It is a small but real engine — roughly 1,000 lines — and it already passes its own key–value test suite. You get:

Core types & Engine trait (lib.rs): Key/Value/DocId/Vector, EngineConfig/HnswParams, the error enum, and ParetoRow.
Memtable (memtable.rs): a BTreeMap mapping keys to values, with vectors in a DocId-keyed side-table and brute-force knn. Flush threshold 64 MiB.
SSTable writer & format spec (sstable.rs): the normative byte layout in its doc comments, a working SstWriter (codec 0 = raw f32), and a reference crc32c; SstReader is todo!() — Milestone 1.
Workload generator (workload.rs): emits the Milestone 4 trace, with brute-force ground truth per query.
Smoke harness (src/bin/bench.rs): replays the trace against the memtable; the seed of your Milestone 4 harness.

The public surface you extend, abridged:

// src/lib.rs — provided; the memtable implements it, your Db must too
pub trait Engine {
    fn put(&mut self, key: Key, value: Value, vector: Option<Vector>)
        -> Result<Option<DocId>>;
    fn get(&self, key: &[u8]) -> Result<Option<Value>>;
    fn scan(&self, start: &[u8], end: &[u8]) -> Result<Vec<(Key, Value)>>;
    fn knn(&self, query: &Vector, k: usize, ef: usize)
        -> Result<Vec<(DocId, f32)>>;
}

// src/sstable.rs — writer provided (codec 0)
impl SstWriter {
    pub fn new(path: impl Into<PathBuf>, dim: usize) -> SstWriter;
    pub fn add(&mut self, key: &[u8], value: &[u8],
               vector: Option<(DocId, &Vector)>) -> Result<()>;
    pub fn add_tombstone(&mut self, key: &[u8]) -> Result<()>;
    pub fn finish(self) -> Result<SstMeta>;
}

// src/sstable.rs — YOU WRITE (Milestones 1–2): the todo!() stubs
impl SstReader {
    pub fn open(path: &Path) -> Result<SstReader>;
    pub fn get(&self, key: &[u8]) -> Result<Option<Value>>;
    pub fn scan(&self, start: &[u8], end: &[u8]) -> Result<Vec<(Key, Option<Value>)>>;
    pub fn vector(&self, ordinal: u32) -> Result<Vector>;
    pub fn knn(&self, query: &Vector, k: usize, ef: usize)
        -> Result<Vec<(DocId, f32)>>;                   // Milestone 2
}

// new modules — YOU WRITE (Milestones 2–3): HNSW build/search/serialize,
// a Db (memtable + levels) implementing Engine, and graph-merging
// compaction, e.g.
pub fn merge_graphs(srcs: &[&SstReader], live: &DocIdFilter,
                    params: HnswParams) -> (SstWriter, MergeStats);

DocId is a u64, assigned sequentially in put order, one per vectored version; each segment’s row-id map takes a graph ordinal to its DocId. The workload generator pre-assigns the same sequence in Op::Upsert::doc_id — replay the trace in order, exactly once, and ground-truth ids match engine ids with no translation table.

Milestone 1

The Vector Segment Format

Every .sst file holds six sections — header, key block, vector codes, codebook, row-id map, HNSW adjacency — followed by a 64-byte footer at EOF that points backward at all of them. All integers are little-endian. Every section starts at a 64-byte-aligned offset (pad with zeros), so the reader can hand out aligned zero-copy slices from the mmap. The layout:

Fig. 1 — VLSM on-disk layout. Top: an SSTable with the vector sections between key block and footer. Bottom: the vector sections expanded. The footer is written last and points backward at every section, so a reader seeks to it first.

Section	Field	Type / size	Meaning
header (32 B)	`magic`	`[u8; 8]`	ASCII “VSST0001”; reject anything else
	`version`	`u32`	1
	`dim`	`u32`	vector dimension D (768 here)
	`entry_count`	`u32`	rows in the key block
	`vector_count`	`u32`	number of vectors N (may be 0)
	`codec`	`u8`	0 = raw f32, 1 = per-dimension i8 scalar quant, 2 = PQ
	`reserved`	7 B, zero	future codecs; readers must ignore
key block	—	`entry_count` records	per record: `key_len: u32 · val_len: u32 · flags: u32` (bit0 = tombstone) · `vec_ordinal: u32` (`0xFFFF_FFFF` = no vector), then key and value bytes; keys strictly increasing
codes	—	`N × D` codes	row-major vectors; codec 0 = f32, codec 1 = i8 — code `c` decodes as `scale[j]·c + bias[j]`; codec 2 = PQ u8
codebook	—	`m_pq: u32 · k_pq: u32` + params	codec 1: `D × f32` scales then `D × f32` biases, computed over this segment’s vectors; codec 2: PQ centroids
row-id map	—	`N × u64`	graph ordinal → `DocId`; sorted by ordinal
HNSW	—	see Fig. 1	empty in Milestone 1 (`num_layers = 0`); filled in Milestone 2
footer (64 B)	—	`6 × u64` + `u32` crc32c + `u32` magic “VSFT” + `u64` pad	section offsets from file start; CRC covers header through HNSW

Acceptance: round-trip property tests pass (write → read → decode within quantization error); the reader returns Err(Corrupt), never panics or UB, on the provided corpus of 200 mutated files; brute-force search over decoded i8 codes achieves recall@10 ≥ 0.95 against f32 brute force on the 100k-vector sample set.

Milestone 2

Per-Level HNSW: Build and Search

Implement HNSW (Malkov & Yashunin, 2018 — assigned in Week 2) from scratch. Construction runs at flush and at the end of compaction, over the quantized codes; drive it from the provided HnswParams { m, ef_construction, ef_search } and serialize the result into the segment’s HNSW section as per-layer CSR adjacency (Fig. 1). Distances during build and search are computed directly on i8 codes using the segment’s scale/bias — do not decode whole vectors to f32 buffers.

Db::knn must consult every live source: brute-force over the memtable, then beam search (width ef) in each level’s segment graphs, merging candidates in a single global heap. A candidate whose DocId is shadowed by a newer version or a tombstone above it must be filtered before it can occupy one of the k result slots. Optionally re-rank the final 4k candidates with f32 vectors — keep this behind a flag; it is one of your ablations.

Acceptance: on a 1M-vector single-segment fixture with m = 16, ef = 64: recall@10 ≥ 0.90 with mean query latency ≤ 2 ms on the grading machine (8 cores, AVX2); construction ≤ 8 minutes; the serialized graph reloads byte-identically.

Milestone 3

Graph-Merging Compaction

This is where VLSM differs from “LSM plus a vector index bolted on.” When compaction merges SSTables S₁…Sₙ into new output tables, the naive move is to rebuild each output graph from scratch — O(N log N) distance computations you already paid for once. Implement merge_graphs instead:

Remap, don’t rebuild. Surviving nodes keep their neighbor lists, with ordinals rewritten through the compaction remap table.
Repair, locally. Edges pointing at dropped rows (tombstoned or shadowed) are repaired by promoting neighbors-of-neighbors of the dropped node, re-selected with the standard HNSW heuristic; cap repair fan-out at 2m distance evaluations per dropped node.
Insert the minority. Nodes from all source graphs but the largest are inserted into the largest graph’s structure via normal HNSW insertion (over quantized codes). Layer assignments may be preserved or redrawn — measure both and report.
Requantize when forced. If the merged value range moves any dimension’s scale by more than 25%, recompute scale/bias and re-encode; otherwise keep the dominant segment’s parameters. Record which case occurred in MergeStats.

Acceptance: after the scripted compaction storm in tests/m3_storm.rs (50 compactions, 30% deletes), recall@10 of merged graphs is within 0.02 of rebuilt-from-scratch graphs at equal ef, and total compaction CPU time is at least 3× lower than the rebuild baseline — which you must also implement, behind --features rebuild-baseline, because it is the control arm of your ablation.

Milestone 4

Benchmark Harness and Pareto Report

The provided generator (workload.rs) emits a synthetic agent trace: sessions that interleave upserts (tool results, ~20% of ops), range scans (~10%), and vector searches (~70%) with Zipfian skew over both keys and query templates. Build a harness that replays it at a fixed arrival rate while sampling four axes: search latency p99, sustained update throughput, resident memory (RSS, plus your own accounting of graph bytes), and recall@10 (sampled: every 50th search is also answered by the brute-force oracle).

Sweep at least three knobs — m ∈ {8, 16, 32}, ef ∈ {16, 64, 256}, and re-ranking on/off at minimum; codec and level fan-out are encouraged — and plot the Pareto frontier in the recall–latency plane with memory encoded as mark size, plus one ablation table isolating each knob’s marginal effect with the others held at defaults. The report (≤ 6 pages, PDF) must state the hardware, the seed, and the exact commit; every figure must be regenerable by make report.

Grading

Rubric

Component	Points	What we look at
Correctness	40	Milestone acceptance tests (10 each): format round-trip and corruption handling; HNSW recall/latency floor; merge-vs-rebuild recall gap and CPU ratio; harness oracle agreement. Tests run on our machine, fresh clone, `cargo test --features grading`.
Performance frontier	30	Your frontier’s dominance region against the class. Full credit requires a frontier — at least 6 non-dominated configurations — not a single tuned point. A fast system with recall 0.6 and a slow one with recall 0.99 both earn less than a genuine trade-off curve.
Report	20	Pareto plot correctness, ablation table, honest negative results, reproducibility (`make report` works). Claims must cite your own measurements.
Code quality	10	No `unsafe` outside the `SstReader` cast paths (each with a safety comment); no panics on user input; clippy-clean; the format documented in `FORMAT.md` well enough for a stranger to write an independent reader.

Late policy: 10% per day, 3 days maximum. One milestone may be designated “best effort” in your report for a 5-point cap reduction instead of failing its tests silently — honesty is cheaper than hope.

Pitfalls

Where Last Year’s Cohorts Bled

Pitfall · the accidental rebuildIf your “merge” inserts every node of every source graph into a fresh graph, you have written the rebuild baseline twice and the merge zero times. The 3× CPU acceptance test exists precisely to catch this. Instrument distance evaluations per compaction from day one.

Pitfall · unaligned zero-copy castsCasting an mmap slice to &[u32] at an unaligned offset is undefined behavior that often “works” on x86 until it doesn’t. The 64-byte section alignment in the spec is not decorative — validate alignment in the reader and add a miri test for the cast paths.

Pitfall · grading recall against the wrong oracleRecall measured against brute force over quantized codes, or over a corpus that still contains tombstoned rows, will flatter you by several points and fail our oracle check. Ground truth is exact f32 over live rows only.

Pitfall · tombstones poisoning the beamIf shadowed rows are filtered after the result heap fills, deletes silently shrink your effective ef and recall decays as the trace ages. Filter at candidate-expansion time, and keep dropped-but-repaired nodes usable for routing until the next compaction if you want the easy fix.

Ground Rules

What You May Not Do

No off-the-shelf vector or ANN libraries. faiss, hnswlib, usearch, instant-distance, qdrant-anything — including “just for the baseline” or via FFI. The brute-force oracle, the quantizer, and the graph are the lab.
No serde-derived on-disk formats. The vector segment is written and parsed by your own code against the byte-level spec above. (Serde for the report’s JSON output is fine.)
No single-number results. Every performance claim in the report carries a distribution (p50/p99) or a frontier; “1.7× faster” with no operating point is graded as unsupported.
Allowed: crossbeam, rayon, memmap2, crc32c, hand-written SIMD via std::arch, and anything already in the skeleton’s lockfile. When in doubt, ask on the course forum before the due date, not in the regrade request.

Checkpoints

Checkpoint Questions

Answer in ≤ 1 page each, in CHECKPOINTS.md, before the milestone they accompany. They are graded inside the report component.

Checkpoint 1.1 · due with Milestone 2

Why per-level graphs instead of one global mutable HNSW over the whole store? Give the write-amplification argument from SSTable immutability, then compute: with level fan-out 10, L0 holding up to 4 files, and 100M vectors at steady state, how many graphs does one search touch in the worst case, and how does expected search cost scale with store size compared to a single graph? State your assumptions about per-graph search cost.

Checkpoint 1.2 · due with Milestone 2

Derive a worst-case bound on the error of a dot product computed over per-dimension i8 scalar-quantized vectors with affine decode scale[j]·c + bias[j], as a function of D and the per-dimension value ranges. Using your Milestone 1 measurements, explain when f32 re-ranking of the top 4k candidates recovers essentially all lost recall — and construct (or describe) a data distribution where it cannot.

Checkpoint 1.3 · due with Milestone 3

In merge_graphs, how do you choose the entry point and maximum layer of the merged graph, and why is naively keeping the largest source’s entry point risky after heavy deletes? Then give an asymptotic cost comparison of merge (edge repair + minority insertion) versus full rebuild as a function of the surviving fraction s and the size ratio between sources, and predict — before running the storm test — the s at which rebuilding wins. Check the prediction against your measurements and report the discrepancy.

❦

Provided Materials

Starter files

→

README.mdRepo layout, build/run instructions, provided-vs-TODO map, and the Pareto CSV submission schema

→

Cargo.tomlCrate manifest — rand, serde, csv only; the ANN-crate ban is enforced from here

→

src/lib.rsCore types, the Engine trait (put/get/scan/knn), EngineConfig/HnswParams, error enum, ParetoRow

→

src/memtable.rsComplete BTreeMap memtable with a DocId→Vector side-table, brute-force knn, and the flush iterator

→

src/sstable.rsNormative on-disk byte layout (header, key block, codes, PQ codebook, HNSW CSR, footer), working writer, reader stubs for Milestone 1

→

src/workload.rsDeterministic seeded agent trace — 70% Zipfian-repeat knn, 20% upserts, 10% scans — with brute-force ground truth per query

→

src/bin/bench.rsSmoke harness: replays the trace against the memtable (recall must print 1.0) and emits a schema-correct pareto.csv row