In 2005 the column store beat the general-purpose engine by 50× — this week we dissect why columns beat rows, and why batches beat tuples.
Lecture 1 — The column-store argument · Lecture 2 — Vectorized execution: from tuple-at-a-time to batches
Pick the workload, derive the architecture, measure the gap.
C-Store’s scan speedup over the general-purpose incumbent — one to two orders of magnitude. The generalist isn’t a compromise; it’s a category error.
AVG(price) WHERE shipdate > … touches 2 of 16 attributes.lineitem table.shipdate, table sorted on it: ~2,500 distinct days.(value, run_length): 2,500 pairs × 8 B ≈ 20 KB.4 GB → 20 KB. Absurd, and real — but only on the sort column. That’s why C-Store makes sort order a first-class physical-design decision.
ship_state, 50 values, ~9-byte strings.state = 'CA' → code = 4.quantity ranges 1–50, needs 6 bits, not 32.price only at surviving positions.Same 40-line revenue query: row-store warehouse vs. an early Vertica cluster on cheaper hardware. Not the cache — 14 columns never read, RLE-run arithmetic on the date predicate, tuples materialized only for the final 900 rows.
From tuple-at-a-time to batches — the inner loop meets the modern CPU.
open(), next(), close() — composes like Unix pipes.next() pulls one tuple from below.Share of CPU cycles commercial DBMSs spent on actual query work running TPC-H Q1 — ~3× lower IPC and an order of magnitude more cycles per tuple than hand-written C.
next(): a virtual call, ~20–40 cycles with pipeline flush.next() returns a vector of ~1000 values per column./* one virtual call delivered n ≈ 1024 values;
* below: branch-free, type-specialized, cache-resident,
* auto-vectorized (AVX2: 8 lanes per instruction) */
size_t sel_lt_int32(const int32_t *col, int32_t bound,
uint32_t *sel_out, size_t n)
{
size_t k = 0;
for (size_t i = 0; i < n; i++) {
sel_out[k] = (uint32_t)i; /* write position… */
k += (col[i] < bound); /* …keep it only on match */
}
return k; /* positions, not copies */
}
(data, sel, n).k += (cond) sidesteps the branch predictor.| Dimension | Volcano | Vectorized (X100 → DuckDB) | JIT (HyPer → Umbra) |
|---|---|---|---|
| Overhead / value | ~100s of cycles | ~0.03–0.5 cycles | ≈0 (fused loop) |
| Intermediates | One tuple in flight | Cache-resident vectors (~4 KB/col) | None — registers |
| SIMD | Impossible | Natural | Possible, harder |
| Startup latency | None | None — precompiled | ms–100s of ms / query |
| Profiling | Easy, per-operator | Easy, per-primitive | Hard — opaque loop |
| Engineering cost | Low | Medium | High |