We built a miniature LongMemEval-style demo that compares three memory approaches on direct recall, stale updates, and abstention. It is intentionally small: one Python file, one Harbor task wrapper, three scenario sandboxes, and one shareable Islo page.
Retrieval benchmarks usually measure whether a system can find relevant text. Long-term memory needs more: the system must remember old facts, prefer newer corrections, and refuse to invent answers when the memory does not contain them. This demo compresses that problem into three tiny questions over five chat sessions.
The direct runner, demo.py, produces deterministic JSON results. Harbor
wraps the same benchmark as an eval task with a verifier and reward file. Islo runs
the three scenarios as separate sandbox trials and gives the result page a public share URL.
- **Scenario 1, direct lookup.** Question: "What snack should I buy for Mira's study group?" Expected: sea-salt pistachios. Basic memory should recover a fact stated in an earlier session.
- **Scenario 2, stale update.** Expected: OAK, not the stale SFO. This is the important failure: keyword retrieval can find a relevant old fact and still be wrong.
- **Scenario 3, abstention.** Expected: not enough information. A useful memory system should know when the answer is absent.
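The five-session, three-question setup can be pictured as a tiny fixture. This is a minimal sketch: the field names, session texts, and the question wordings for scenarios 2 and 3 are illustrative assumptions, not demo.py's actual schema.

```python
# Hypothetical fixture sketch (field names and most texts are invented
# for illustration; only scenario 1's question is quoted from the demo).
SESSIONS = [
    {"t": 1, "text": "Mira's study group loved the sea-salt pistachios."},
    {"t": 2, "text": "My flight next month leaves from SFO."},
    {"t": 3, "text": "Booked a dentist appointment for Tuesday."},
    {"t": 4, "text": "Update: the flight now leaves from OAK, not SFO."},
    {"t": 5, "text": "Picked up a new library card today."},
]

QUESTIONS = [
    # (question, expected answer)
    ("What snack should I buy for Mira's study group?", "sea-salt pistachios"),
    ("Which airport does my flight leave from?", "OAK"),
    ("What is my gym locker number?", "not enough information"),
]
```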
| Approach | Behavior | Result |
|---|---|---|
| no_memory | Has no chat history, so it abstains. It passes only the abstention case. | 1/3 |
| keyword_memory | Retrieves the top keyword-overlap session. It answers direct lookup but can pick stale evidence. | 2/3 |
| update_aware_memory | Uses lexical retrieval, prefers newer relevant facts for update questions, and abstains when evidence is missing. | 3/3 |
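The gap between keyword_memory and update_aware_memory comes down to which relevant session wins. A simplified sketch, not the demo's actual code: keyword-overlap scoring, abstention on zero overlap, and an optional newer-fact preference (the real strategy applies the recency preference only to update questions).

```python
def score(question, text):
    """Keyword overlap: count shared lowercase word tokens."""
    return len(set(question.lower().split()) & set(text.lower().split()))

def retrieve(question, sessions, prefer_newer=False):
    """Return the best session text, or None to abstain.

    sessions: list of (timestamp, text) pairs.
    prefer_newer=False mimics keyword_memory: take the highest overlap,
    which can surface a stale fact. prefer_newer=True mimics the
    update-aware tweak: among relevant sessions, take the newest one.
    """
    relevant = [(t, text) for t, text in sessions if score(question, text) > 0]
    if not relevant:
        return None  # abstain: the memory holds no matching evidence
    if prefer_newer:
        return max(relevant, key=lambda s: s[0])[1]
    return max(relevant, key=lambda s: score(question, s[1]))[1]
```

With a stale SFO session and a later OAK correction, the keyword mode picks whichever session overlaps the question most (often the older, wordier one), while the update-aware mode returns the correction.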
Harbor is the eval wrapper. The demo includes a minimal task under
harbor/longmem-mini/: task metadata, an instruction, a Python
environment image, a verifier script, and an oracle solution.
The verifier runs the benchmark, checks the expected 1/3, 2/3, 3/3
leaderboard, and writes /logs/verifier/reward.txt. That makes the
toy benchmark behave like a real agent-eval task. Docker was not running in
this local environment, so the Harbor task is packaged and ready but has not
been run end-to-end locally.
```
harbor/longmem-mini/
├── task.toml
├── instruction.md
├── environment/Dockerfile
├── environment/demo.py
├── tests/test.sh
└── solution/solve.sh
```
```
harbor run -p harbor/longmem-mini -a oracle
```
Islo is the execution and sharing layer. islo.yaml defines the
lightweight Python sandbox. Clone the repo into an Islo environment, run the
benchmark there, and serve the static report from inside the sandbox. We ran
the three scenarios across three Islo Python sandboxes and got the same
3/3, 2/3, 1/3 leaderboard.
The important demo move is islo share: it turns the sandbox-hosted
result page into a public URL, so the GitHub/HN post can point to a live,
reproducible run instead of a screenshot.
```
islo use longmem-mini \
  --source github://zozo123/longmem-mini-on-islo
python demo.py
python3 parallel_islo_eval.py --backend islo --prefix longmem-mini
python -m http.server 8080 -d site
islo share longmem-mini 8080 --public --ttl 24h
harbor run -p harbor/longmem-mini -a oracle --env islo
```
The result is not meant to be surprising. It is meant to be inspectable. A no-memory system can only abstain. A keyword system improves recall, but the update question exposes stale retrieval. The update-aware strategy wins because it treats time and abstention as part of the task.
That is the smallest useful lesson for long-memory evaluation: do not only ask whether memory found something. Ask whether it found the current thing, and whether it knows when nothing is there.
Tiny LongMemEval-style benchmark on Islo
This is a miniature deterministic demo, not an official LongMemEval score.
Leaderboard
- update_aware_memory: 3/3 (100%)
- keyword_memory: 2/3 (67%)
- no_memory: 1/3 (33%)
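The percentages are just pass counts over the three cases. A short helper, hypothetical but mirroring the report's formatting, reproduces the lines above:

```python
def format_leaderboard(counts, total=3):
    """Render pass counts as '- name: n/total (pct%)' lines, best first."""
    rows = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [f"- {name}: {n}/{total} ({round(100 * n / total)}%)" for name, n in rows]
```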
| Layer | Status | Artifact |
|---|---|---|
| Local Python eval | Completed. Produced the 3/2/1 leaderboard. | runs/latest/results.json |
| Parallel Islo eval | Completed across three live Islo Python sandboxes, one scenario per sandbox. | runs/latest/parallel_islo_results.json |
| Harbor wrapper | Packaged as a minimal Harbor task with verifier and reward-file contract. | harbor/longmem-mini/ |
| Harbor local run | Attempted, but Docker was not running on this machine. | EVAL_REPORT.md |
Title: Show HN: Tiny long-memory benchmark running in an Islo sandbox
Short text: I made a tiny LongMemEval-style benchmark that runs in seconds, is packaged as a Harbor task, and was run across three Islo sandboxes. It compares no memory, keyword memory, and update-aware memory on three cases: direct lookup, stale update, and abstention. The point is not a leaderboard score; it is an inspectable demo of what long-memory evals need to test before scaling to LongMemEval or LoCoMo.