Every agent memory system shipped since 2023 is a storage engine wearing a trench coat. This week we take the coat off and grade what’s underneath.
Lecture 1 — Agent memory systems, read as database designs · Lecture 2 — Temporal knowledge, consolidation, and forgetting
Strip the branding off any memory product and you find a workload spec.
The physics didn’t change. The client did.
200k episodic appends/day for a support agent at 10k conversations × 20 turns. Trivial; a single Postgres instance yawns at it. The hard part is recall on every turn.
91% lower p95 end-to-end latency for retrieval over distilled memories than for full-history replay on LOCOMO-length conversations (Mem0 paper), with >90% fewer tokens billed. Not an optimization; the difference between a product and a demo.
26% relative improvement over OpenAI’s built-in memory on LOCOMO question answering (LLM-as-judge). The graph variant Mem0ᵍ adds a couple more points on temporal and multi-hop questions.
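The write-side arithmetic is short enough to check by hand. The sketch below assumes one episodic append per turn and traffic spread evenly over the day; both are simplifying assumptions, and real traffic will peak well above the average.

```python
# Back-of-envelope write rate for the support-agent workload.
# Assumptions (not from the lecture): one append per turn, uniform traffic.
conversations_per_day = 10_000
turns_per_conversation = 20

appends_per_day = conversations_per_day * turns_per_conversation
appends_per_second = appends_per_day / 86_400  # seconds in a day

print(f"{appends_per_day:,} appends/day, about {appends_per_second:.2f}/s on average")
```

A couple of inserts per second is far below what one relational instance sustains, which is why the interesting cost sits on the read path.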
UPDATE keeps no tombstone, no before-image, no provenance.
NOOP vs. “contact is Marcus.”
contact_for(account) per account; writes invalidate the old row.

| Property | Mem0 | MemGPT | Zep / Graphiti | A DBMS would say |
|---|---|---|---|---|
| Write path | LLM extract → LLM upsert | Self-directed tool calls | Edges into temporal graph | Log first, derive later |
| Read path | Vector top-k over facts | Model-issued search | Cosine + BM25 + graph | Optimizer picks the path |
| Durability | Lossy at extraction | Lossy at eviction | Edges invalidated, kept | WAL or it didn’t happen |
| Audit | Overwrites destroy history | Edits unversioned | Bi-temporal lineage | Every version AS OF |
| Consistency | Async → stale reads | Serial until you shard | Summaries can lag | Define isolation, enforce it |
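
The "log first, derive later" column can be sketched with an append-only episode log and a derived fact table that is rebuilt from it. Everything here is illustrative: the table names, the `extract_contact` stand-in for an LLM extractor, and the "Dana" example are assumptions, not any product's schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- The log is the source of truth: rows are only ever appended.
    CREATE TABLE episode_log (
        seq       INTEGER PRIMARY KEY AUTOINCREMENT,
        account   TEXT NOT NULL,
        utterance TEXT NOT NULL
    );
    -- The fact table is a derived view; it can be dropped and rebuilt.
    CREATE TABLE contact_fact (
        account    TEXT PRIMARY KEY,
        contact    TEXT NOT NULL,
        source_seq INTEGER NOT NULL  -- provenance: which log row produced it
    );
""")

def append_episode(account, utterance):
    cur = db.execute(
        "INSERT INTO episode_log (account, utterance) VALUES (?, ?)",
        (account, utterance),
    )
    return cur.lastrowid

def extract_contact(utterance):
    """Hypothetical stand-in for an LLM extractor: parses 'Contact is X.'"""
    prefix = "contact is "
    if utterance.lower().startswith(prefix):
        return utterance[len(prefix):].rstrip(".")
    return None

def rebuild_facts():
    """Replay the whole log in order; later rows win, the log is untouched."""
    db.execute("DELETE FROM contact_fact")
    rows = db.execute(
        "SELECT seq, account, utterance FROM episode_log ORDER BY seq"
    ).fetchall()
    for seq, account, utterance in rows:
        contact = extract_contact(utterance)
        if contact is not None:
            db.execute(
                "INSERT OR REPLACE INTO contact_fact VALUES (?, ?, ?)",
                (account, contact, seq),
            )

append_episode("acme", "Contact is Dana.")
append_episode("acme", "Thanks for the update!")
append_episode("acme", "Contact is Marcus.")
rebuild_facts()

current = db.execute(
    "SELECT contact, source_seq FROM contact_fact WHERE account = 'acme'"
).fetchone()
log_rows = db.execute("SELECT COUNT(*) FROM episode_log").fetchone()[0]
print(current, log_rows)
```

Because the derived row carries `source_seq`, a bad extraction can be traced to the utterance that caused it, and a better extractor can be rerun over the same log without losing anything.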
Mutable key–value memory is the wrong data model. The right one is from the 1990s.
t_valid / t_invalid
Textbook bi-temporality: Snodgrass’s TSQL2 (1995), with temporal tables standardized in SQL:2011.
t_invalid IS NULL.
94.8% for Zep on Deep Memory Retrieval vs. MemGPT’s 93.4%, plus up to 18.5% accuracy gains on LongMemEval at ~90% lower latency than full-context baselines. It retrieves the right version of a fact.
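
A minimal sketch of the bi-temporal idea, assuming a single `edge` table in SQLite. Only `t_valid` and `t_invalid` come from the slides; the `t_created` / `t_expired` columns, the helper names, and the "Dana" example are assumptions for illustration, and out-of-order arrivals are not handled.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE edge (
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        object    TEXT NOT NULL,
        t_valid   TEXT NOT NULL,  -- when the fact became true in the world
        t_invalid TEXT,           -- when it stopped being true; NULL = still true
        t_created TEXT NOT NULL,  -- when the system learned the fact
        t_expired TEXT            -- when the system learned that it had ended
    )
""")

def record(subject, predicate, obj, t_valid, t_created):
    """Invalidate the open edge instead of overwriting it, then add the new one."""
    db.execute(
        "UPDATE edge SET t_invalid = ?, t_expired = ? "
        "WHERE subject = ? AND predicate = ? AND t_invalid IS NULL",
        (t_valid, t_created, subject, predicate),
    )
    db.execute(
        "INSERT INTO edge VALUES (?, ?, ?, ?, NULL, ?, NULL)",
        (subject, predicate, obj, t_valid, t_created),
    )

def current(subject, predicate):
    row = db.execute(
        "SELECT object FROM edge "
        "WHERE subject = ? AND predicate = ? AND t_invalid IS NULL",
        (subject, predicate),
    ).fetchone()
    return row[0] if row else None

def as_of(subject, predicate, valid_on, known_on="9999-12-31"):
    """What was true on valid_on, judged by what the system knew on known_on."""
    row = db.execute(
        "SELECT object FROM edge "
        "WHERE subject = ? AND predicate = ? "
        "AND t_created <= ? AND t_valid <= ? "
        # An invalidation only counts if it was already recorded by known_on.
        "AND (t_invalid IS NULL OR ? < t_invalid OR ? < t_expired) "
        "ORDER BY t_valid DESC LIMIT 1",
        (subject, predicate, known_on, valid_on, valid_on, known_on),
    ).fetchone()
    return row[0] if row else None

# Dana is the contact from January; Marcus took over on March 1,
# but the system only heard about it on March 15.
record("acme", "contact", "Dana", t_valid="2024-01-10", t_created="2024-01-10")
record("acme", "contact", "Marcus", t_valid="2024-03-01", t_created="2024-03-15")

print(current("acme", "contact"))
print(as_of("acme", "contact", "2024-03-05"))
print(as_of("acme", "contact", "2024-03-05", known_on="2024-03-10"))
```

The last two calls ask about the same date and return different answers: one reports what was true, the other what the system believed at the time. Neither is wrong, which is the reason for keeping both time axes.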
e^(−Δt/τ) + use count is generational GC.
AS OF queries disagree for the same date?
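
The decay-plus-use-count score, e^(−Δt/τ) + use count, can be sketched as a retention function with a sweep over it. The `Memory` class, the 30-day τ, the 0.5 weight on use count, and the eviction threshold are all assumed values chosen for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    age_days: float   # Δt: time since the memory was written or last refreshed
    use_count: int    # how many times retrieval has returned it

def retention_score(m, tau_days=30.0, use_weight=0.5):
    """Recency decay plus a weighted use count (weights are assumptions)."""
    return math.exp(-m.age_days / tau_days) + use_weight * m.use_count

def sweep(memories, threshold=0.5):
    """Split memories into survivors and eviction candidates."""
    kept = [m for m in memories if retention_score(m) >= threshold]
    evicted = [m for m in memories if retention_score(m) < threshold]
    return kept, evicted

memories = [
    Memory("fresh, never used", age_days=0, use_count=0),
    Memory("a month old, never used", age_days=30, use_count=0),
    Memory("three months old, used twice", age_days=90, use_count=2),
]
kept, evicted = sweep(memories)
print([m.text for m in kept])
print([m.text for m in evicted])
```

The analogy to generational collection shows up in the output: a young memory survives on recency alone, an old one survives only if it has been used, and an old unused one is collected.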