Week 11: Memory Is a Database Problem

Learning objectives — after this week you can…

State the four workload requirements of agent memory (episodic append, semantic recall, namespace isolation, cheap point reads) and map each to a classical storage structure.
Trace a write through Mem0’s extract–consolidate pipeline and identify exactly where durability, determinism, and auditability are lost.
Explain MemGPT’s main/external context split as buffer-pool management, including what its “page replacement policy” actually is.
Design a bi-temporal edge schema (t_valid / t_invalid) and show how contradiction-as-invalidation reproduces Type-2 slowly-changing dimensions from 1990s warehousing.
Articulate the semantic phantom problem and argue why derived-belief consistency is an isolation question the field has not yet answered.

Lecture 1 · Tuesday

Agent Memory Systems, Read as Database Designs

Strip the branding off any agent memory product and you find a workload spec. An agent finishing a conversational turn must append an episode — cheap, sequential, never blocking the response. Before its next turn it must recall semantically: “what do I know that bears on this?”, which in practice means hybrid retrieval — vector similarity for paraphrase, BM25 for exact tokens like error codes and invoice numbers, graph traversal for multi-hop relations (“who manages the person who owns this service?”). It needs namespace isolation, because memory leaking between user A and user B is a privacy breach, not a recall improvement. And it needs cheap point reads, because “what’s this user’s timezone?” must cost microseconds, not an embedding call. Sequential append, secondary indexes, multi-tenancy, primary-key lookup: you have known these four requirements since the second week of your first databases course. The physics didn’t change. The client did.
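
The "hybrid retrieval" step needs some way to merge the three ranked lists into one. A minimal sketch follows, assuming reciprocal rank fusion (RRF), one common fusion method and not necessarily what any of this week's systems ship. The function name, the memory ids, and the constant k=60 are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id earns 1 / (k + rank) from every list it appears in;
    ids that several retrievers agree on float to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the three access paths.
vector_hits = ["m7", "m2", "m9"]   # paraphrase matches
bm25_hits = ["m2", "m4", "m7"]     # exact-token matches
graph_hits = ["m2", "m9"]          # multi-hop neighbours

fused = reciprocal_rank_fusion([vector_hits, bm25_hits, graph_hits])
print(fused)
```

RRF uses only ranks, never raw scores, which is why it can combine cosine similarity, BM25, and graph distance without calibrating them against each other.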

The workload, made precise

Put numbers on it. A support agent serving 10k conversations/day with 20 turns each generates ~200k episodic appends daily: trivial write volume; a single Postgres instance yawns at it. The hard part is the read side: every turn issues at least one recall query with a latency budget inside the human’s patience window, and recall quality directly moves task success. The Mem0 paper measures the naïve alternative (replay the entire history into the prompt) at a p95 end-to-end latency around 17 seconds on LOCOMO-length conversations, versus roughly 1.4 s for retrieval over distilled memories, with more than 90% fewer tokens billed. That is not a clever optimization; that is the difference between a usable product and a demo. Recall from Week 4: the moment you cannot afford to scan, you must index, and the moment you index, you must decide what the index is over. Agent memory systems differ mainly in that one decision.
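
The arithmetic behind those figures, as a quick check. The inputs are the rounded numbers quoted above; the averaging over a full 24-hour day is an assumption, and real traffic will peak well above the mean.

```python
conversations_per_day = 10_000
turns_per_conversation = 20
seconds_per_day = 86_400

appends_per_day = conversations_per_day * turns_per_conversation
appends_per_second = appends_per_day / seconds_per_day  # daily average, not peak

# Rounded p95 latencies quoted in the paragraph above.
full_context_p95_s = 17.0
retrieval_p95_s = 1.4
speedup = full_context_p95_s / retrieval_p95_s

print(f"{appends_per_day:,} appends/day = {appends_per_second:.1f}/s on average")
print(f"p95 latency ratio = {speedup:.1f}x")
```

A couple of appends per second is why the write side is uninteresting and the read side carries the design.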

Mem0: an LLM-driven upsert pipeline

Mem0 (arXiv 2504.19413) answers: index over extracted facts, not raw transcript. Its write path is a two-phase pipeline. Extraction: after each exchange, an LLM call reads the new messages plus a rolling conversation summary and emits candidate memories — short declarative facts (“user is vegetarian”, “project Helios ships March 4”). Consolidation: for each candidate, the system retrieves the most similar existing memories and a second LLM call picks one of four operations: ADD, UPDATE, DELETE, or NOOP. Read that list again. It is an upsert resolver — the same merge logic a CDC pipeline runs against a dimension table — except the comparator is a language model rather than a key equality test. The payoff is real: Mem0 reports a 26% relative improvement over OpenAI’s built-in memory on LOCOMO question answering (LLM-as-judge), and its graph-extended variant Mem0^g adds another couple of points on temporal and multi-hop questions by also writing facts as subject–predicate–object edges.
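
To make the upsert-resolver reading concrete, here is a minimal sketch of the consolidation step. It is not Mem0's code: the `Memory` type, the `resolve` function, and the `decide` rule are hypothetical, and `decide` deliberately swaps the LLM comparator for a key-equality test so the control flow is visible.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Op(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"


@dataclass
class Memory:
    key: str               # hypothetical slot the fact fills, e.g. "user.diet"
    value: Optional[str]   # None means the candidate retracts the fact


# Assumption: a deterministic stand-in for the second LLM call.
def decide(candidate: Memory, existing: Optional[Memory]) -> Op:
    if existing is None:
        return Op.NOOP if candidate.value is None else Op.ADD
    if candidate.value is None:
        return Op.DELETE
    if candidate.value == existing.value:
        return Op.NOOP
    return Op.UPDATE


def resolve(store: dict[str, Memory], candidate: Memory,
            decider: Callable[[Memory, Optional[Memory]], Op] = decide) -> Op:
    """Apply one candidate memory to the store and return the chosen operation."""
    op = decider(candidate, store.get(candidate.key))
    if op in (Op.ADD, Op.UPDATE):
        store[candidate.key] = candidate   # UPDATE overwrites; no before-image is kept
    elif op is Op.DELETE:
        del store[candidate.key]
    return op


store: dict[str, Memory] = {}
ops = [
    resolve(store, Memory("user.diet", "vegetarian")),
    resolve(store, Memory("user.diet", "vegetarian")),
    resolve(store, Memory("user.diet", "vegan")),
    resolve(store, Memory("user.diet", None)),
]
print([op.value for op in ops])
```

Swap `decide` for a model call and you have the pipeline's shape; everything that makes the real system non-deterministic lives inside that one callable.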

Now grade it as a database. Durability: the raw episode may be durable, but the memory is whatever the extractor happened to notice. A fact the LLM skips was never written; there is no redo log for attention. Determinism: run the same transcript twice and consolidation can produce different stores — temperature-zero helps but model updates break replay. Audit: when an UPDATE overwrites “user lives in Berlin” with “user lives in Lisbon”, default configurations keep no tombstone, no before-image, no provenance pointer back to the source turn. A bank examiner would fail this system in an afternoon. None of these are fatal — they are fixable, with a fact log, source citations on every memory row, and versioned writes — but you must see them, and the only way to see them is to read the system as a database.
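
One way the "versioned writes with source citations" fix could look, sketched with Python's built-in sqlite3 module. The table name, column names, and helper functions are hypothetical; this illustrates the idea and is not a schema any of the systems above ship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE memory_version (
    memory_id      TEXT    NOT NULL,  -- stable identity of the fact
    version        INTEGER NOT NULL,  -- 1, 2, 3, ... per memory_id
    content        TEXT,              -- NULL would mark a tombstone
    source_turn_id TEXT    NOT NULL,  -- provenance: the turn behind this write
    PRIMARY KEY (memory_id, version)
);
""")


def write_version(memory_id, content, source_turn_id):
    """Append a new version instead of overwriting; return its version number."""
    with conn:  # commits on success, rolls back on error
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM memory_version "
            "WHERE memory_id = ?",
            (memory_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO memory_version VALUES (?, ?, ?, ?)",
            (memory_id, latest + 1, content, source_turn_id),
        )
    return latest + 1


def current(memory_id):
    """Latest content for a memory, or None if it was never written."""
    row = conn.execute(
        "SELECT content FROM memory_version WHERE memory_id = ? "
        "ORDER BY version DESC LIMIT 1",
        (memory_id,),
    ).fetchone()
    return row[0] if row else None


def history(memory_id):
    """Every version with its provenance, oldest first."""
    return conn.execute(
        "SELECT version, content, source_turn_id FROM memory_version "
        "WHERE memory_id = ? ORDER BY version",
        (memory_id,),
    ).fetchall()


write_version("user.city", "Berlin", "turn-0012")
write_version("user.city", "Lisbon", "turn-0347")
print(current("user.city"))
print(history("user.city"))
```

The Berlin row survives the Lisbon write, and each row names the turn that produced it. This does nothing for facts the extractor never noticed; that gap needs the raw episode log.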

MemGPT: the buffer pool, rediscovered

MemGPT (Packer et al., 2023) takes the OS metaphor literally. The model’s context window is main context — scarce, fast, fixed-size, like RAM. Everything else is external context: recall storage (the full message log) and archival storage (arbitrary documents and notes), both unbounded, both reachable only through explicit function calls the model itself issues — archival_memory_search, conversation_search, and self-edits to a small pinned “working context.” When the FIFO message queue nears the token limit, the system raises a memory-pressure warning, the model summarizes and evicts, and the evicted pages live on in recall storage. Translate to our vocabulary: main context is the buffer pool, eviction-with-summary is page replacement where the “page” is rewritten lossily on the way out, and the warning at ~70% occupancy is a high-water mark triggering a flush. The genuinely novel move is who runs the replacement policy: the model does, via tool calls — the application program is its own buffer manager. DBAs spent thirty years learning why application-managed caching is hard (hint rot, working-set misestimation, no global view), and every one of those failure modes reappears here: MemGPT agents forget to page in what they need and confabulate instead of faulting.
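
A toy version of the high-water-mark behaviour, to make the buffer-pool analogy tangible. Everything here is an assumption made for illustration: the 100-token limit, the whitespace "tokenizer", the evict-the-oldest-half policy, and the `summarize` stub that stands in for a model call. Unlike the description above, the policy is hard-coded rather than driven by the model's own tool calls.

```python
from collections import deque

CONTEXT_LIMIT = 100       # illustrative token budget for main context
WARNING_FRACTION = 0.70   # high-water mark, matching the ~70% figure above


def count_tokens(text):
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def summarize(messages):
    # Assumption: a real system would ask the model; the loss is the point.
    return f"[summary of {len(messages)} messages]"


class MainContext:
    def __init__(self):
        self.queue = deque()  # FIFO message queue: the "buffer pool"
        self.recall = []      # recall storage: the full log, never truncated

    def used(self):
        return sum(count_tokens(m) for m in self.queue)

    def append(self, message):
        """Add a message; return True if it pushed us past the high-water mark."""
        self.queue.append(message)
        self.recall.append(message)
        if self.used() < WARNING_FRACTION * CONTEXT_LIMIT:
            return False
        self.evict()
        return True

    def evict(self):
        # Replace the oldest half of the queue with a single lossy summary.
        n = max(1, len(self.queue) // 2)
        victims = [self.queue.popleft() for _ in range(n)]
        self.queue.appendleft(summarize(victims))


ctx = MainContext()
evictions = [ctx.append("word " * 10) for _ in range(8)]
print(evictions)
print(len(ctx.queue), len(ctx.recall), ctx.used())
```

After the eviction the queue holds a summary where three messages used to be, while recall storage still holds all eight. Nothing in main context tells the model that the summary is lossy, which is the buffer-pool failure mode described above.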

Field note. A team I advised shipped a Mem0-style memory for a sales agent. Three weeks in, a customer asked why the agent kept calling him by his predecessor’s name. Root cause: the consolidation LLM had judged “new account contact is Dana” as a NOOP against “account contact is Marcus” (similar embeddings, different truth). The fix was not a better prompt. It was a unique constraint: one contact_for(account) fact per account, writes must invalidate the old row. They reinvented the primary key, eight months and one churned customer late.

Report card

Property	Mem0	MemGPT	Zep / Graphiti	What a DBMS would say
Write path	LLM extract → LLM upsert (ADD/UPDATE/DELETE/NOOP)	Self-directed tool calls; eviction summaries	Entity/edge extraction into temporal graph	Log first, derive later; never lose the base write
Read path	Vector top-k over facts (+ graph in Mem0^g)	Model-issued search over recall/archival	Hybrid: cosine + BM25 + graph traversal, fused	Optimizer picks access path; client shouldn’t have to
Durability	Lossy at extraction; no redo for missed facts	Lossy at eviction (summary ≠ page)	Episodes retained; edges invalidated, not deleted	WAL or it didn’t happen
Audit / provenance	Weak by default; overwrites destroy history	Working-context edits unversioned	Strong: bi-temporal lineage per edge	Every version queryable `AS OF`
Consistency	Async consolidation → stale reads possible	Single-agent serial; fine until you shard	Graph consistent; derived summaries can lag	Define the isolation level, then enforce it

An agent that cannot say why it believes something is just a cache with opinions.

Lecture 2 · Thursday

Temporal Knowledge, Consolidation, and Forgetting

Tuesday’s systems mostly treat memory as a mutable key–value store: new fact in, old fact gone. Thursday’s thesis is that this is the wrong data model, and that the right one was worked out by the database community in the 1990s while today’s agent framework authors were in elementary school. Facts about the world are not values to overwrite; they are intervals to close. “Alice is the on-call engineer” was true from March 3 to March 17. When Bob takes over, the Alice-edge does not become false retroactively — it becomes bounded. Zep’s Graphiti engine (arXiv 2501.13956) builds agent memory on exactly this insight: a temporally-aware knowledge graph in which every edge carries validity timestamps, contradiction closes intervals instead of deleting rows, and the past remains queryable. The result isn’t just tidier bookkeeping — Zep beats MemGPT on the Deep Memory Retrieval benchmark (94.8% vs 93.4%) and posts up to 18.5% accuracy gains on LongMemEval with ~90% lower latency than full-context baselines, precisely because temporal structure lets it retrieve the right version of a fact.

Bi-temporal edges: t_valid and t_invalid

Graphiti stores knowledge as a graph of entities and relationship edges, where each edge carries two timelines. The valid-time pair — t_valid, t_invalid — records when the fact held in the world. The transaction-time pair — when the system learned and when it superseded the record — tracks the database’s own epistemic history. This is textbook bi-temporality: Snodgrass’s TSQL2 proposal formalized it in 1995, and SQL:2011 finally standardized PERIOD columns and AS OF queries. The agent-flavored twist is the ingestion trigger. When a new episode yields an edge that contradicts an existing one — “Dana is the account contact” arriving while “Marcus is the account contact” is open — Graphiti does not delete Marcus. It sets the Marcus edge’s t_invalid to the new fact’s t_valid and inserts Dana as a fresh open-ended edge. Three queries become trivial that overwrite-based stores cannot answer at all: what is true now (filter t_invalid IS NULL), what was true on April 2 (interval containment), and what did we believe on April 2 (the transaction-time variant — essential for audit, because what the agent believed when it acted is what determines liability).
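
The invalidation rule and the three queries can be sketched in a few lines of in-memory Python. This is an illustration, not Graphiti's implementation: contradiction detection is reduced to "same subject and predicate, different object" (Graphiti uses an LLM for that judgment), the field names `tx_recorded` and `tx_superseded` are borrowed from this week's exercise wording, and the 2025 dates are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Edge:
    subject: str
    predicate: str
    obj: str
    t_valid: date                         # when the fact began to hold in the world
    tx_recorded: date                     # when the system learned the fact
    t_invalid: Optional[date] = None      # None means the interval is still open
    tx_superseded: Optional[date] = None  # when the system learned it had ended


def assert_fact(edges, subject, predicate, obj, t_valid, learned_on):
    """Insert a fact, closing (never deleting) any open edge it contradicts."""
    slot = (subject, predicate)
    for e in edges:
        if (e.subject, e.predicate) != slot or e.t_invalid is not None:
            continue
        if e.obj == obj:
            return  # already believed; nothing to do
        e.t_invalid = t_valid          # contradiction closes the interval
        e.tx_superseded = learned_on
    edges.append(Edge(subject, predicate, obj, t_valid, learned_on))


def _slot(edges, subject, predicate):
    return [e for e in edges if (e.subject, e.predicate) == (subject, predicate)]


def current(edges, subject, predicate):
    return [e.obj for e in _slot(edges, subject, predicate) if e.t_invalid is None]


def valid_as_of(edges, subject, predicate, day):
    """What was true in the world on `day` (valid time)."""
    return [e.obj for e in _slot(edges, subject, predicate)
            if e.t_valid <= day and (e.t_invalid is None or day < e.t_invalid)]


def believed_on(edges, subject, predicate, day):
    """What the system held to be current on `day` (transaction time)."""
    return [e.obj for e in _slot(edges, subject, predicate)
            if e.tx_recorded <= day
            and (e.tx_superseded is None or day < e.tx_superseded)]


edges = []
assert_fact(edges, "Acme", "contact", "Marcus", date(2025, 3, 1), date(2025, 3, 1))
# Dana took over on Mar 17, but the agent only heard about it on Apr 5.
assert_fact(edges, "Acme", "contact", "Dana", date(2025, 3, 17), date(2025, 4, 5))

print(current(edges, "Acme", "contact"))
print(valid_as_of(edges, "Acme", "contact", date(2025, 4, 2)))
print(believed_on(edges, "Acme", "contact", date(2025, 4, 2)))
```

The last two queries disagree for April 2: the world had already moved to Dana, but the system still believed Marcus. That gap is what an audit has to be able to reconstruct.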

Fig. 11.1 — Contradiction as invalidation. The arrival of the Dana edge closes the Marcus edge’s validity interval (t_invalid := Mar 17) but preserves the row. Point-in-time queries (AS OF) return the edge whose interval contains the query timestamp; “current truth” is simply t_invalid IS NULL.

The field did this in the 90s

Recognize the pattern from your warehousing reading: this is Kimball’s Type-2 slowly-changing dimension, circa 1996. When a customer moves, you don’t update the row — you expire it (row_effective_date, row_expiration_date, is_current flag) and insert a successor, so historical facts join to the dimension as it was. Graphiti’s edge invalidation is a Type-2 SCD with an LLM deciding when two rows are “the same dimension member,” which is both its power (it handles natural-language contradiction, fuzzy entity matching, partial updates) and its risk (the expiry trigger is now probabilistic). The lesson for you as designers: when an agent-memory vendor describes “temporal awareness” as novel, the right response is to ask which of TSQL2’s query classes they support and what their answer to coalescing adjacent intervals is. The vocabulary already exists. Use it, and you can spot the gaps in an afternoon.

Consolidation is compaction; forgetting is GC

Two more borrowed mechanisms hide in plain sight. Consolidation (merging episodic fragments into stable semantic facts, the thing cognitive scientists say sleep does) is structurally identical to LSM-tree compaction. Raw episodes are L0: small, recent, overlapping, fast to write. Periodic background passes merge them into wider, deduplicated, contradiction-resolved runs (entity summaries, community summaries in Graphiti’s case). The design questions transfer verbatim: when do you compact (write-triggered, size-tiered, scheduled), what does compaction cost (here, LLM tokens, a real dollar number you can put in a cost model), and what is read amplification before compaction versus write amplification after? Forgetting is garbage collection with a policy knob. TTL deletion (“drop episodes after 90 days”) is the blunt instrument; decay-scored eviction, which ranks memories by a function like score = recency_weight · e^(−Δt/τ) + use_count · w_u and evicts the tail, is generational GC by another name. And here the temporal model earns its keep: in a bi-temporal store you can forget retrievability (drop the index entry) without forgetting the record (keep the audit row), which is exactly the split GDPR-era systems need. Overwrite-based memories can’t make that distinction; they have only one delete, and it destroys evidence.
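
The decay score translates directly into code. In this sketch Δt is taken to be the memory's age in days, and the default weights, the 30-day τ, and the sample memories are all invented for illustration; tune them against your own recall metrics.

```python
import math


def decay_score(age_days, use_count,
                recency_weight=1.0, tau_days=30.0, use_weight=0.1):
    """score = recency_weight * e^(-age/tau) + use_count * use_weight"""
    return recency_weight * math.exp(-age_days / tau_days) + use_count * use_weight


def evict_tail(memories, keep):
    """Rank memories by score, best first; return (kept, evicted)."""
    ranked = sorted(
        memories,
        key=lambda m: decay_score(m["age_days"], m["use_count"]),
        reverse=True,
    )
    return ranked[:keep], ranked[keep:]


# Hypothetical memories: old but heavily used, fresh but unused, old and idle.
memories = [
    {"name": "old ticket", "age_days": 120, "use_count": 1},
    {"name": "timezone", "age_days": 200, "use_count": 40},
    {"name": "lunch order", "age_days": 2, "use_count": 0},
]

kept, evicted = evict_tail(memories, keep=2)
print([m["name"] for m in kept])
print([m["name"] for m in evicted])
```

The old-but-hot timezone fact outlives the newer idle ticket, which is the generational intuition: age alone does not condemn a memory, disuse does. In a bi-temporal store, "evicted" would mean dropped from the retrieval index, with the audit row left in place.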

Historical aside. Richard Snodgrass spent roughly a decade trying to get TSQL2 into the SQL standard: the proposal circulated in 1992, and the design committee formed in 1993. The valid-time and transaction-time change proposals were actually accepted by ANSI and forwarded to ISO in early 1997 as SQL/Temporal, but the project collapsed over committee disagreements and was formally cancelled in 2001. Bi-temporal support didn’t land until SQL:2011, by which time most practitioners had hand-rolled validity columns anyway. The agent-memory community is now re-deriving his taxonomy, point by point, in GitHub issues. There is a citation graph that should exist and doesn’t. Be the generation that fixes it.

The semantic phantom: derived beliefs and an open isolation question

Finish with the problem nobody has solved. Consolidated memories are derived data — summaries, entity profiles, community digests computed from base episodes. Derived data goes stale. Now picture the failure: at 09:00 the agent’s entity summary says “Acme’s contact is Marcus, renewal likely.” At 09:05 an episode arrives — Marcus has left the company. The bi-temporal graph updates correctly: Marcus’s edge is invalidated within seconds. But the summary is regenerated lazily, on the next consolidation pass, at 09:30. At 09:12 the agent reads the summary, drafts a renewal email to Marcus, and sends it. Every base read was consistent. The action was still wrong. We call this a semantic phantom, by analogy to phantom reads: a belief that no longer corresponds to any committed base fact, yet remains readable because it lives in a derived representation. Classical answers map awkwardly. Synchronous summary maintenance is the materialized-view answer — correct, but it puts an LLM call on the write path, multiplying latency and cost. Snapshot semantics would demand the summary declare the episode-LSN it was derived from, and the agent refuse to act when base facts have advanced past it — a freshness predicate on reads, which no shipping memory system exposes. Versioned beliefs with read-time validation is the OCC answer: act, then check the summary’s watermark before commit, and abort the action if stale. Notice what’s genuinely open: actions are not transactions — you cannot abort a sent email — so “abort on validation failure” must become “block before the side effect,” which requires the system to know which reads feed which irreversible actions. That is a dependency-tracking problem, an isolation-level problem, and an agent-architecture problem all at once. It is also, I’d argue, the most important unsolved database problem in this course — and next week, when an agent with stale beliefs also holds your production credentials, you will see why.
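
The freshness predicate can be stated in a dozen lines, which makes its limits easier to see. This is a sketch under strong assumptions: a single global episode LSN, a hypothetical `Summary` type that records its watermark, and a `max_lag` knob standing in for per-action-class tolerance. No shipping system is claimed to expose this interface.

```python
from dataclasses import dataclass


class StaleBeliefError(Exception):
    """Raised when a derived belief lags the base log beyond the allowed slack."""


@dataclass
class Summary:
    text: str
    watermark_lsn: int  # highest episode LSN this summary was derived from


def guard_side_effect(summary, log_head_lsn, max_lag=0):
    """Block before the side effect if base facts have advanced past the summary."""
    lag = log_head_lsn - summary.watermark_lsn
    if lag > max_lag:
        raise StaleBeliefError(f"summary is {lag} episode(s) behind the log")
    return summary.text


# 09:00 summary built at LSN 41; the 09:05 "Marcus left" episode is LSN 42.
summary = Summary("Acme's contact is Marcus, renewal likely", watermark_lsn=41)
log_head_lsn = 42

# Irreversible action (send email): tolerate no lag.
try:
    guard_side_effect(summary, log_head_lsn, max_lag=0)
    email_sent = True
except StaleBeliefError:
    email_sent = False

# Reversible action (answer a question): tolerate a few episodes of lag.
answer = guard_side_effect(summary, log_head_lsn, max_lag=5)
print(email_sent, answer)
```

The check is deliberately crude: any new episode, even one about an unrelated account, blocks the email. Tightening it means scoping the watermark to the entities a summary depends on, which is the dependency-tracking problem named above.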

Readings

Read Before Thursday

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. Chhikara, Khant, Aryan, Singh, Yadav; arXiv:2504.19413, 2025. Read §3 as a write-path spec: extraction then ADD/UPDATE/DELETE/NOOP consolidation. Focus on the latency/token tables and ask where an audit log would have to live.

Zep: A Temporal Knowledge Graph Architecture for Agent Memory. Rasmussen, Paliychuk, Beauvais, Ryan, Chalef; arXiv:2501.13956, 2025. The Graphiti bi-temporal edge model is the heart. Map t_valid/t_invalid onto SQL:2011 periods and Type-2 SCDs as you read; the correspondence is nearly exact.

MemGPT: Towards LLMs as Operating Systems. Packer, Wooders, Lin, Fang, Patil, Stoica, Gonzalez; arXiv:2310.08560, 2023. Read the OS metaphor adversarially: main/external context is a buffer pool with the application as its own buffer manager. List the failure modes that DBMSs solved by NOT doing this.

Exercises

This Week’s Problems

Exercise 11.1 · warm-up

Write the SQL DDL for a bi-temporal fact-edge table (subject, predicate, object, t_valid, t_invalid, tx_recorded, tx_superseded, source_episode_id). Then write three queries: (a) current truth for a given subject–predicate, (b) valid-time AS OF a past date, and (c) transaction-time “what did we believe on date D” — and explain in one paragraph why (b) and (c) can return different answers for the same D.

Exercise 11.2 · core

Grade Mem0 as a storage engine. Using the paper’s pipeline, identify: (1) the precise point where a fact can be permanently lost despite a durable episode log, (2) two distinct sources of non-determinism on replay, and (3) the minimal schema additions (be concrete — columns, constraints, triggers) that would give every memory row provenance back to its source turn and make every UPDATE auditable. Estimate the token-cost overhead of your fix per 1,000 conversational turns, stating your assumptions about extraction batch size and model pricing.

Exercise 11.3 · stretch

Design an isolation level for derived beliefs. Specify belief-snapshot isolation: every derived artifact (summary, entity profile) carries a watermark — the maximum episode LSN it reflects — and every irreversible action declares its read set. Define (a) the validation rule that must pass before a side effect executes, (b) the behavior on validation failure given that actions cannot abort, and (c) a relaxation lattice (which predicates may go stale, by how much, for which action classes — sending email vs. answering a question). Then argue the hard part: prove, or refute with a counterexample, that your scheme prevents all semantic phantoms without ever requiring a synchronous LLM call on the write path. We believe the unqualified claim is false; characterize precisely the class of workloads for which it holds.

❦