Week 6: Vector Indexes Are Access Methods, Not Products

Learning objectives — after this week you can…

Explain why exact kNN in high dimensions degenerates to a scan, using the concentration-of-distances argument with concrete numbers.
Work the product-quantization codebook math: derive the 32× compression of a 128-dim float vector into 16 bytes and state what it costs in distance distortion.
Trace a query through HNSW’s layered graph and predict the effect of turning the efSearch and efConstruction knobs.
Explain how DiskANN serves a billion vectors from one node by splitting the graph (SSD) from compressed vectors (RAM), and why beam width matters.
Argue, with the RUM conjecture extended by a recall axis, why every production vector system converged on LSM-like segment-and-merge designs.

Lecture 1 · Tuesday

ANN Search: IVF, PQ, HNSW

Every agent you have built this semester ends up issuing the same query: given this embedding, find the k most similar vectors among N. Agent memory is this query. RAG is this query. Tool retrieval, deduplication, semantic caching of LLM calls — all this query. It is now plausibly the single most-executed access pattern in new systems, which is remarkable for an operation that relational engines never had a native index for. Today we build up the three ideas — inverted files, product quantization, and navigable small-world graphs — that underlie essentially every ANN index in production, and we are honest about what each one trades away.

Why exact kNN at scale is hopeless

Start with the brute-force baseline. N = 100M vectors, d = 768 floats each: that is 307 GB of raw data, and one exact query must compute 100M dot products — roughly 154 GFLOPs (77 billion multiply-adds), plus the memory traffic to stream 307 GB through the CPU. At ~50 GB/s of effective bandwidth that is six seconds per query, before you’ve served a second user. So we want an index. But every spatial index you know — kd-trees, R-trees, grid files — works by pruning regions of space, and pruning dies in high dimensions.

The killer is concentration of distances. As d grows, the distance from a query to its nearest neighbor and to its farthest point converge: the ratio d_max/d_min → 1. Intuition: in d dimensions, the volume of a hypersphere concentrates in a thin shell near its surface — for d = 768, more than 99% of a unit ball’s volume lies in the outer 1% of radius. Everything is “far,” and everything is far by roughly the same amount, so a branch-and-bound index can almost never prove that a partition contains no candidates. A kd-tree on 768-dim data visits nearly every leaf; you have rebuilt the scan with extra pointer chasing. Real embeddings have lower intrinsic dimension than their ambient dimension — which is the only reason ANN works at all — but no exact structure exploits this with guarantees. Hence the field’s great surrender, and its great unlock: stop demanding the true neighbor. Accept recall@k — the fraction of true top-k results returned — as a quality dial, and suddenly orders of magnitude appear.

IVF: coarse quantization, the “shard by meaning” trick

The inverted file (IVF) index is k-means wearing a database hat. Offline: cluster the N vectors into nlist centroids (typical: nlist ≈ √N, so ~10,000 lists for 100M vectors), and store each vector in a posting list under its nearest centroid — exactly an inverted index from text retrieval, except the “terms” are learned regions of space. At query time: compare the query against all 10,000 centroids (cheap — that’s 10,000 dot products, not 100M), pick the closest nprobe lists, and scan only those. With nprobe = 10 of 10,000 lists you scan ~0.1% of the data — a 1000× reduction in work.

The catch is the edge problem: a query that lands near a Voronoi boundary has true neighbors sitting in cells you didn’t probe. Recall@10 with nprobe = 1 might be 40%; pushing to nprobe = 64 might reach 95% — at 64× the scan cost. nprobe is your first recall-vs-latency dial, and it is beautifully monotone: more probes, more recall, more time. Note what IVF does not fix: each probed list is still full of raw float vectors. For 100M × 768 dims you still hold 307 GB somewhere. That is the problem product quantization attacks.

Product quantization: 32× compression with arithmetic you can do in your head

Jégou, Douze, and Schmid’s 2011 TPAMI paper is one of those rare results where the entire idea fits in two sentences. Split each vector into m subvectors; vector-quantize each subspace independently with its own small codebook. Concretely, the canonical configuration: a 128-dim float vector (512 bytes) is split into m = 16 subvectors of 8 dims each. For each of the 16 subspaces, run k-means to learn 256 centroids (an 8-bit codebook). A vector is then encoded as 16 centroid IDs × 8 bits = 16 bytes — a 32× compression. The effective codebook is the Cartesian product of the 16 small ones: 256¹⁶ = 2¹²⁸ distinct representable points, from only 16 × 256 × 8 dims × 4 B ≈ 131 KB of stored centroids. You get the resolution of an astronomically large codebook at the training cost of sixteen tiny ones — that is the whole trick.

Distance computation is where PQ earns its keep. For a query q, precompute a lookup table: for each of the 16 subspaces, the squared distance from q’s subvector to each of the 256 centroids — a 16 × 256 table of floats, built once per query. The (asymmetric) distance to any database vector is then 16 table lookups and 15 adds. No floating-point multiplies per candidate; scanning millions of PQ codes becomes memory-bandwidth-bound on 16-byte codes instead of 512-byte vectors. The price is distortion: PQ distances are estimates, biased by quantization error, so recall suffers near decision boundaries. The standard remedy is re-ranking: use IVF-PQ to fetch, say, 200 candidates cheaply, then recompute exact distances on raw vectors for just those 200. FAISS’s IVFADC — inverted file plus asymmetric distance computation on PQ codes of the residual (vector minus its coarse centroid, which has lower variance and quantizes better) — is precisely this stack, and it is still the backbone of billion-scale deployments fifteen years later.

Field noteA team I advised stored raw 1536-dim OpenAI embeddings for 400M items in a key-value store “temporarily” and brute-forced shards of it with GPUs. Their monthly bill for what is, mathematically, an unindexed table scan exceeded the salary of the engineer who eventually replaced it with IVF-PQ — 2.4 TB of floats became 38 GB of codes, and p99 latency fell from 1.8 s to 19 ms at recall@10 = 0.96. Nobody had asked “what is the access method here?” because the vectors arrived through the ML door, not the database door.

HNSW: the skip list, rediscovered as a graph

Graph-based search takes a different bet: don’t partition space, walk it. Build a graph where each vector is a node connected to ~M of its near neighbors (typical M = 16–32); answer queries by greedy traversal — from some entry point, repeatedly move to the neighbor closest to the query, until no neighbor improves. On a pure kNN graph this gets stuck in local minima and takes O(N^1/d)-ish hop counts across the space. Malkov and Yashunin’s HNSW (TPAMI 2018) fixes both with two ideas. First, navigability: during construction, edges are chosen with a diversity heuristic (an edge to candidate c is kept only if c is closer to the new node than to any already-selected neighbor), which preserves long-range “highway” links instead of M redundant short ones. Second, the hierarchy: each node is assigned a maximum layer ℓ drawn geometrically (P(layer ≥ ℓ) = e^−ℓ/m_L), and with the paper’s recommended m_L = 1/ln M this produces a stack of graphs where each layer up has ~1/M the nodes of the one below. Search starts at the sparse top layer, greedily descends — coarse hops first, fine hops last. If that sounds exactly like a skip list, it should: HNSW is a skip list generalized from a sorted line to a metric space, and the expected logarithmic search depth survives the generalization.

Fig. 6.1 — HNSW search. Enter at the sparse top layer, greedy-walk the long-range links, drop a layer each time the walk stalls, finish with a beam of width efSearch on layer 0. A skip list, generalized to a metric space.

Two knobs govern everything. efConstruction (typical 100–500) is the candidate-beam width while building: bigger means better edges, slower builds. efSearch is the beam width at query time on layer 0 — the algorithm maintains a priority queue of the efSearch best candidates instead of pure greedy. efSearch = 10 might give recall@10 = 0.80 in 30 µs; efSearch = 200 gives 0.99 in 400 µs. It is IVF’s nprobe wearing graph clothes — every ANN family ends up with exactly one such dial, and learning to read its trade curve is the actual skill.

HNSW’s sin is memory. The graph is useless without the vectors, traversal does random access, so everything lives in RAM: full-precision vectors plus adjacency. Budget it: 100M vectors × 768 dims × 4 B = 307 GB of vectors, plus ~100M × M=32 edges × 4 B ≈ 13 GB of graph, plus allocator overhead — call it 350 GB of DRAM for one replica. At cloud prices, DRAM is ~20× the cost of NVMe per byte. That gap is the entire reason Thursday’s lecture exists.

Lecture 2 · Thursday

DiskANN and the Recall Axis

Tuesday ended on an invoice: 350 GB of DRAM to serve 100M vectors with HNSW, and a billion vectors simply off the table for a single node. The 2019 DiskANN paper from Microsoft Research asked the impolite question — what if the graph lived on SSD? — and answered it well enough that one 64 GB-RAM machine with an NVMe drive could serve a billion vectors at 95%+ recall and ~5 ms latency. The design is a small masterpiece of fitting an algorithm to a storage hierarchy, and it completes this course’s recurring argument: new workload, same physics. Today: how DiskANN works, how it learned to handle updates, and why every production system — purpose-built or bolted onto Postgres — converged on the same LSM-shaped answer.

Vamana: a flatter graph, built for block storage

DiskANN’s graph algorithm, Vamana, drops HNSW’s hierarchy entirely — one flat graph, searched greedily from a fixed medoid entry point. What it adds is a relaxed pruning rule: the α-parameter (typical α = 1.2) keeps an edge to candidate c only if no kept neighbor c′ satisfies α·dist(c′, c) < dist(node, c). With α > 1 the graph keeps more aggressive long-range shortcuts than HNSW’s heuristic would, deliberately overshooting toward the query. The payoff is fewer hops — and on SSD, hops are the whole cost model. DRAM lets HNSW afford a 60-hop stroll of 100 ns accesses; an SSD random read is ~100 µs, a thousand times worse, so the algorithm that wins on disk is the one that minimizes round trips, not distance computations. Same lesson as the B-tree in Week 2: fan-out and access count dominate when storage is slow. The layout is equally deliberate: each SSD block stores a node’s full-precision vector and its adjacency list together, so one 4 KB read yields both the geometry and the next hops.

PQ residuals in RAM, beam search on SSD

The second half of the design is where Lecture 1 pays off. Greedy traversal must rank neighbors — but reading every candidate’s vector from SSD just to rank it would be ruinous. So DiskANN keeps a PQ-compressed copy of every vector in RAM: at ~32 bytes per code, a billion vectors fit in ~32 GB. Navigation runs on cheap, slightly-wrong PQ distances in memory; the SSD is consulted only to fetch the adjacency lists along the path and the full-precision vectors of the final candidates for exact re-ranking. PQ is demoted from index to steering mechanism — its distortion barely matters because it only has to point the walk in roughly the right direction, and the exact re-rank at the end repairs the error.

To hide SSD latency, the search runs as beam search with beam width W (typical 4–8): issue W adjacency-list reads concurrently, merge results, repeat. NVMe drives deliver hundreds of thousands of random IOPS but only at queue depth — a beam of 4–8 keeps the device busy and cuts wall-clock hops nearly W-fold. A representative query: ~10 rounds × 4 parallel 4 KB reads ≈ 40 I/Os ≈ 1–4 ms, with PQ arithmetic essentially free beside it. Billion-scale, one node, five milliseconds, recall 0.95. The 2019 baseline for that workload was a DRAM cluster an order of magnitude more expensive.

Historical asideThe arc rhymes with 1970s hashing. Early hash tables assumed memory; databases needed disk; linear hashing and extendible hashing (1979–80) re-derived hashing under the constraint “one I/O per lookup.” DiskANN is the same move for similarity graphs — take an in-memory structure the workload demands, and re-derive it under the storage hierarchy you can afford. The access-method playbook hasn’t changed in fifty years; only the query has.

Updates, and the inevitable LSM-shaped convergence

Graphs hate deletes. Removing a node can sever the only navigable path through a region; in-place edge repair while queries traverse the graph is a concurrency nightmare. FreshDiskANN (2021) gave the streaming answer: inserts go to a small in-RAM Vamana index; deletes go to a tombstone set consulted at query time (deleted nodes are still traversed for navigation, just filtered from results — sever nothing, hide instead); periodically a background pass merges the RAM index into the SSD index and physically excises tombstoned nodes, repairing edges with the α-prune. Read it back slowly: a write-optimized in-memory delta, deletion markers, background compaction into a read-optimized structure. That is an LSM-tree. Nobody copied the design from Week 3 — the constraints forced it. Lucene (and so Elasticsearch and OpenSearch) builds an HNSW per immutable segment, tombstones deletes, and rebuilds graphs at segment merge. pgvector sits inside Postgres heap pages with MVCC handling versioning and vacuum doing the reclamation. Three independent codebases, one shape: immutable searchable units + tombstones + merge. When every lineage converges, you are looking at physics, not fashion.

Recall: RUM grows a fourth axis

The RUM conjecture (Week 4) said an access method balances Read, Update, and Memory overheads — optimize two, pay with the third. ANN indexes add a genuinely new axis: Recall. It is not a hidden constant; it is a knob the operator turns at query time. The same HNSW index serves recall 0.80 at 30 µs and recall 0.99 at 400 µs depending on one integer. This means a vector index is not characterized by a point (“QPS at p99”) but by a curve — recall-vs-latency at fixed memory, or recall-vs-memory at fixed latency — and comparing systems at unstated recall is the benchmark fraud of our decade. Treat recall as a resource to budget like CPU: an agent re-ranking 50 candidates through an LLM can buy cheap recall-0.85 retrieval and let the model absorb the noise; an agent doing exact-match memory lookup (“have I already filed this ticket?”) needs 0.99+ and should pay for it. Per-query recall targets are an API surface agents can actually use — they already set temperature; min_recall is the same kind of dial.

Property	IVF-PQ	HNSW	DiskANN (Vamana)
Memory, 100M × 768d	~3–6 GB (codes + centroids; raw vectors on disk for re-rank)	~350 GB (full vectors + graph, all DRAM)	~6–10 GB RAM (PQ codes) + ~340 GB SSD
Build cost	Low — k-means train + assign; hours, parallelizes trivially	High — sequential-ish graph insertion, ~tens of core-hours per 100M	Highest — Vamana needs 2 passes + α-pruning; often built sharded then merged
QPS / latency (1 node)	High QPS, ~1–10 ms; SIMD table scans, batches beautifully	Highest — 10⁴–10⁵ QPS at 0.1–1 ms; random RAM access	~10³ QPS at 2–10 ms; bounded by SSD IOPS × beam width
Recall behavior	Ceiling limited by quantization; needs re-rank stage above ~0.95	0.99+ reachable by raising efSearch alone	0.95–0.99 with re-rank; recall per dollar is the headline
Updates	Easy appends; centroid drift forces periodic retrain	Inserts fine; deletes poison the graph → segment + rebuild	FreshDiskANN: RAM delta + tombstones + background merge
Use when	Huge N, tight memory, batch/offline tolerance	Latency is king and DRAM is paid for	Billion-scale on commodity nodes; cost is king

The agent angle: vector-as-feature beat vector-as-product

Around 2021–2023, venture capital funded the proposition that ANN search was a database category. By 2025 the answer was in: Postgres grew pgvector, Elasticsearch and OpenSearch shipped HNSW, MongoDB and Redis and SQLite and every cloud warehouse added vector columns, and the standalone vendors pivoted toward “memory platforms.” The reason is the thesis of this course. An agent’s retrieval almost never arrives alone — it is WHERE tenant_id = ? AND created_at > ? ORDER BY embedding <=> ? LIMIT 20, similarity joined with predicates, freshness, and transactions. Filtered ANN is an open research problem precisely because the filter interacts with graph connectivity (filter out 99% of nodes and your navigable graph disconnects), and solving it requires sitting inside a query planner, next to the other indexes, with shared statistics — not across a network hop in a separate product. B-trees were never a company. Vector indexes are access methods, and access methods belong inside engines.

Recall is the fourth resource. You don’t maximize it — you budget it, query by query, like CPU.

Readings

Read Before Thursday

Product Quantization for Nearest Neighbor Search — Jégou, Douze & Schmid, IEEE TPAMI 2011.The codebook math behind every compressed vector index. Work through §III until the 256^16-codebook trick and the asymmetric distance tables feel obvious; skim the GIST results.

Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs — Malkov & Yashunin, IEEE TPAMI 2018.Focus on the neighbor-selection heuristic (Alg. 4) and the layer assignment — the skip-list analogy is in the paper. Note carefully what the memory model assumes.

DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node — Subramanya, Devvrit, Simhadri, Krishnaswamy & Kadekodi, NeurIPS 2019.Read for the systems decisions, not the graph theory: why α > 1, why PQ steers while SSD stores, why node + adjacency share a block. Compare its cost model against HNSW’s before class.

Exercises

This Week’s Problems

Exercise 6.1 · warm-up

Do the budgets by hand. (a) For N = 1B vectors at d = 768 float32: compute raw storage, HNSW DRAM footprint (M = 32 edges, 4-byte IDs), and the RAM needed for DiskANN’s PQ codes at 32 B/vector. (b) For PQ with m = 24 subvectors and 8-bit codebooks on d = 768: code size, compression ratio, total codebook storage, and the size of the per-query distance table. (c) At 50 GB/s memory bandwidth, what is the floor on brute-force latency for (a), and how many 100 µs SSD round trips equal it?

Exercise 6.2 · core

Implement IVF-PQ in ~300 lines (NumPy permitted, FAISS forbidden) on the 1M-vector GloVe-100 dataset from ann-benchmarks: k-means for nlist = 1024 coarse centroids, PQ with m = 10 on residuals, asymmetric distance tables, and optional exact re-ranking of the top-200. Produce one plot: recall@10 vs mean query latency, sweeping nprobe ∈ {1, 2, 4, …, 128}, with and without re-ranking. Then answer in one paragraph: where does the no-re-rank curve plateau, and which term in the PQ error analysis explains the ceiling?

Exercise 6.3 · stretch

Filtered ANN is unsolved; take a real swing at it. Using your Exercise 6.2 index or an HNSW you build, attach a uniformly random categorical label to each vector and study queries of the form “top-10 neighbors WHERE label = X” as label selectivity drops from 50% to 0.1%. Compare at least three strategies: post-filtering (search then discard), pre-filtering (brute-force the matching subset), and an integrated approach of your own design — e.g., label-aware edges during graph construction, per-label entry points, or selectivity-based planning that switches strategy at a crossover you derive. Report the recall-vs-latency curves at each selectivity, find empirical crossover points, and write one page on what a query optimizer would need (statistics, cost model, index metadata) to choose the strategy automatically. A serious attempt at the integrated method, with an honest account of where it breaks, earns full credit even if it loses to brute force.

❦