DATA 2027 · Week 03 · Part I — Foundations Under New Workloads

One Size Fits None

In 2005 the column store beat the general-purpose engine by 50× — this week we dissect why columns beat rows, and why batches beat tuples.

Lecture 1 — The column-store argument · Lecture 2 — Vectorized execution: from tuple-at-a-time to batches

Lecture 1 · Tuesday

The Column-Store Argument

Pick the workload, derive the architecture, measure the gap.

L1 · Stonebraker’s method

Read the 2005 polemic for its method

L1 · Stonebraker’s method

The gap wasn’t 20%

50×

C-Store’s scan speedup over the general-purpose incumbent — one to two orders of magnitude. The generalist isn’t a compromise; it’s a category error.

L1 · Layout

Rows are an OLTP decision, not a law of nature

L1 · Layout

Same table, two physical layouts

ROW STORE (NSM) id name qty price date scan reads EVERY cell; cache line = 5 columns, 2 useful COLUMN STORE (DSM) id name qty price date scan reads 2 of 5 strips; cache line = 100% one column, SIMD-ready
Fig. 3.1 — Shaded cells: what SELECT AVG(price) … WHERE date > … must pull through the memory hierarchy.
L1 · Compression

Compression: the second order of magnitude

L1 · Compression

Run-length encoding (RLE)

L1 · Compression

RLE on the sort column

200,000×

4 GB → 20 KB. Absurd, and real — but only on the sort column. That’s why C-Store makes sort order a first-class physical-design decision.

L1 · Compression

Dictionary & bit-packing

  • Dictionary: ship_state, 50 values, ~9-byte strings.
  • ⌈log₂ 50⌉ = 6 bits/value: 9 GB → 750 MB, 12×.
  • Predicates rewrite into code space: state = 'CA'code = 4.
  • Bit-packing: quantity ranges 1–50, needs 6 bits, not 32.
  • 4 GB → 750 MB, 5.3× — just dropping leading zeros.
  • Encodings stack: dictionary → bit-pack → RLE; Parquet lands at 5–10× overall.
L1 · Late materialization

Operate on columns as long as possible

L1 · Projections

C-Store doesn’t store “the table”

L1 · Hardware

Why the hardware votes for columns

  • Cache lines: memory arrives in 64-byte lines.
  • Column of 4-byte ints: 16 useful values per line.
  • 200-byte rows: a fraction of one value per line.
  • Scans are bandwidth-bound (~25 GB/s per core).
  • SIMD: AVX-512 compares 16 packed 32-bit values per instruction.
  • Requires contiguous, same-typed values — columnar is the precondition.
  • Row layout forces a gather, forfeiting the win.
  • The column store is shaped like the machine.
L1 · Field note, 2009

11 minutes became

4.2 s

Same 40-line revenue query: row-store warehouse vs. an early Vertica cluster on cheaper hardware. Not the cache — 14 columns never read, RLE-run arithmetic on the date predicate, tuples materialized only for the final 900 rows.

Lecture 2 · Thursday

Vectorized Execution

From tuple-at-a-time to batches — the inner loop meets the modern CPU.

L2 · Volcano

The beloved villain: Volcano (Graefe, 1990)

L2 · Volcano

What X100 measured (CIDR 2005)

<10%

Share of CPU cycles commercial DBMSs spent on actual query work running TPC-H Q1 — ~3× lower IPC and an order of magnitude more cycles per tuple than hand-written C.

L2 · Volcano

The price of one tuple at a time

L2 · X100

X100’s fix: change the unit of exchange

L2 · X100

Why ~1000, not 1,000,000

vector size (log scale) query time 1 10² 10³ 10⁴+ interpretation- dominated vectors fall out of cache sweet spot ~10²–10³ 1000 ints = 4 KB; a dozen in flight ≈ 50 KB — resident in a 1 MB L2
Fig. 3.2 — X100’s sensitivity curve: big enough to amortize interpretation, small enough to stay cache-resident.
L2 · Primitives

An X100-style primitive

/* one virtual call delivered n ≈ 1024 values;
 * below: branch-free, type-specialized, cache-resident,
 * auto-vectorized (AVX2: 8 lanes per instruction) */
size_t sel_lt_int32(const int32_t *col, int32_t bound,
                    uint32_t *sel_out, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        sel_out[k] = (uint32_t)i;     /* write position…        */
        k += (col[i] < bound);        /* …keep it only on match */
    }
    return k;   /* positions, not copies */
}
L2 · Primitives

Selection vectors: positions, not copies

L2 · HyPer

Vectorize or compile? The HyPer counterpoint

L2 · HyPer

Three execution models

DimensionVolcanoVectorized (X100 → DuckDB)JIT (HyPer → Umbra)
Overhead / value~100s of cycles~0.03–0.5 cycles≈0 (fused loop)
IntermediatesOne tuple in flightCache-resident vectors (~4 KB/col)None — registers
SIMDImpossibleNaturalPossible, harder
Startup latencyNoneNone — precompiledms–100s of ms / query
ProfilingEasy, per-operatorEasy, per-primitiveHard — opaque loop
Engineering costLowMediumHigh
L2 · HyPer

The compile-time tax

L2 · The agent angle

Agents: hundreds of queries, not a dozen

L2 · The agent angle

The sizes became formats — and won

  • Parquet = C-Store’s storage chapter, standardized.
  • Dictionary + RLE + bit-packing per column chunk.
  • Min/max zone maps per row group skip predicates.
  • Arrow = the X100 vector, standardized in memory.
  • Engine, dataframe, tool-call boundary pass vectors by pointer.
  • DuckDB over Arrow on Parquet in S3 = the 2005 research stack, verbatim.
A 50× gap isn’t an optimization opportunity — it’s the sound of the wrong architecture hitting the right workload.
— Week 3 lecture notes
Checkpoint · Discussion

Before you leave

Readings

Read before Thursday