Harbor / Islo / Long-Memory Smoke Test

A tiny long-memory benchmark, fanned out across Islo sandboxes.

We built a miniature LongMemEval-style demo that compares three memory approaches on direct recall, stale updates, and abstention. It is intentionally small: one Python file, one Harbor task wrapper, three scenario sandboxes, and one shareable Islo page.

- 3 memory questions
- 3 simple approaches
- 3 Islo scenario sandboxes
- 1 Harbor task wrapper
Abstract

What we did

Retrieval benchmarks usually measure whether a system can find relevant text. Long-term memory needs more: the system must remember old facts, prefer newer corrections, and refuse to invent answers when the memory does not contain them. This demo compresses that problem into three tiny questions over five chat sessions.

The direct runner, demo.py, produces deterministic JSON results. Harbor wraps the same benchmark as an eval task with a verifier and reward file. Islo runs the three scenarios as separate sandbox trials and gives the result page a public share URL.
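
Because demo.py is deterministic, the simplest sanity check is to run it twice and compare the JSON. A minimal sketch, assuming demo.py writes the runs/latest/results.json artifact named in the Run Status table below:

import json
import subprocess

def run_once():
    # Run the benchmark, then read back the JSON artifact it writes.
    subprocess.run(["python", "demo.py"], check=True)
    with open("runs/latest/results.json") as f:
        return json.load(f)

first = run_once()
second = run_once()
assert first == second, "demo.py is documented as deterministic"
print(json.dumps(first, indent=2))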

Built with

The two pieces this demo leans on: Harbor packages the benchmark as an eval task, and Islo runs it in sandboxes and shares the result page.

Direct Lookup

What snack should I buy for Mira's study group?

Expected: sea-salt pistachios

Basic memory should recover a fact stated in an earlier session.

Knowledge Update

For the Portland trip, which airport should I use now?

Expected: OAK, not stale SFO.

This is the important failure: keyword retrieval can find a relevant old fact and still be wrong.

Abstention

What is Sam's jacket size?

Expected: not enough information

A useful memory system should know when the answer is absent.
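
Taken together, the three cases fit in a handful of lines. A hedged sketch of how they could be encoded; the field names (kind, question, expected) are illustrative, not the demo's actual schema:

CASES = [
    {"kind": "direct_lookup",
     "question": "What snack should I buy for Mira's study group?",
     "expected": "sea-salt pistachios"},
    {"kind": "knowledge_update",
     "question": "For the Portland trip, which airport should I use now?",
     "expected": "OAK"},  # the newer correction; the stale answer is SFO
    {"kind": "abstention",
     "question": "What is Sam's jacket size?",
     "expected": "not enough information"},
]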

Method

The three approaches

| Approach | Behavior | Result |
| --- | --- | --- |
| no_memory | Has no chat history, so it abstains. It only passes the abstention case. | 1/3 |
| keyword_memory | Retrieves the top keyword-overlap session. It answers direct lookup but can pick stale evidence. | 2/3 |
| update_aware_memory | Uses lexical retrieval, prefers newer relevant facts for update questions, and abstains when evidence is missing. | 3/3 |
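
The winning strategy is small enough to sketch. This is a minimal illustration of the described behavior, not the code in demo.py; for simplicity it always prefers the newest relevant session, which is what lets corrections beat stale facts:

def update_aware_answer(question, sessions):
    # sessions: (timestamp, text) pairs, one per chat session.
    q_words = set(question.lower().split())
    # Lexical retrieval: keep every session with any keyword overlap.
    hits = [(ts, text) for ts, text in sessions
            if q_words & set(text.lower().split())]
    if not hits:
        # Abstain instead of inventing an answer.
        return "not enough information"
    # Prefer the newest relevant session so updates win over stale facts.
    return max(hits, key=lambda pair: pair[0])[1]
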
Harbor

How Harbor is used

Harbor is the eval wrapper. The demo includes a minimal task under harbor/longmem-mini/: task metadata, an instruction, a Python environment image, a verifier script, and an oracle solution.

The verifier runs the benchmark, checks the expected 1/3, 2/3, 3/3 leaderboard, and writes /logs/verifier/reward.txt. That makes the toy benchmark behave like a real agent-eval task. In this local environment, Docker was not running, so the Harbor task is packaged and ready but has not been executed end to end locally.
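
In Python terms, the verifier contract reduces to a few lines. A sketch only: the real check lives in tests/test.sh, the results.json shape is assumed, and the binary 0/1 reward value is an assumption about Harbor's reward-file contract:

import json
import pathlib

# Expected leaderboard, straight from the task description.
EXPECTED = {
    "no_memory": "1/3",
    "keyword_memory": "2/3",
    "update_aware_memory": "3/3",
}

with open("runs/latest/results.json") as f:  # assumed artifact path
    results = json.load(f)

ok = all(results.get(name) == score for name, score in EXPECTED.items())

reward = pathlib.Path("/logs/verifier/reward.txt")
reward.parent.mkdir(parents=True, exist_ok=True)
reward.write_text("1\n" if ok else "0\n")  # assumed binary reward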

harbor/longmem-mini/
  task.toml
  instruction.md
  environment/Dockerfile
  environment/demo.py
  tests/test.sh
  solution/solve.sh

harbor run -p harbor/longmem-mini -a oracle
Islo

How Islo is used

Islo is the execution and sharing layer. islo.yaml defines the lightweight Python sandbox. The repo can be cloned into an Islo environment, the benchmark can run there, and the static report can be served from inside the sandbox. We ran the three simple scenarios across three Islo Python sandboxes and got the same 3/3, 2/3, 1/3 leaderboard.

The important demo move is islo share: it turns the sandbox-hosted result page into a public URL, so the GitHub/HN post can point to a live, reproducible run instead of a screenshot.

islo use longmem-mini \
  --source github://zozo123/longmem-mini-on-islo

python demo.py
python3 parallel_islo_eval.py --backend islo --prefix longmem-mini
python -m http.server 8080 -d site
islo share longmem-mini 8080 --public --ttl 24h

harbor run -p harbor/longmem-mini -a oracle --env islo
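
Since each sandbox runs one scenario, the fan-in step is just folding three per-scenario results into the leaderboard. A sketch under assumed shapes; parallel_islo_eval.py's real input and output formats may differ:

import json
from collections import Counter

def merge(scenario_results):
    # scenario_results: one {approach: passed} dict per sandbox.
    passes = Counter()
    total = len(scenario_results)
    for result in scenario_results:
        for approach, passed in result.items():
            passes[approach] += int(passed)
    return {approach: f"{passes[approach]}/{total}" for approach in passes}

print(json.dumps(merge([
    {"no_memory": False, "keyword_memory": True,  "update_aware_memory": True},   # direct lookup
    {"no_memory": False, "keyword_memory": False, "update_aware_memory": True},   # knowledge update
    {"no_memory": True,  "keyword_memory": True,  "update_aware_memory": True},   # abstention
]), indent=2))
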
Results

Why the 3/2/1 result is the story

The result is not meant to be surprising. It is meant to be inspectable. A no-memory system can only abstain. A keyword system improves recall, but the update question exposes stale retrieval. The update-aware strategy wins because it treats time and abstention as part of the task.

That is the smallest useful lesson for long-memory evaluation: do not only ask whether memory found something. Ask whether it found the current thing, and whether it knows when nothing is there.

Tiny LongMemEval-style benchmark on Islo
This is a miniature deterministic demo, not an official LongMemEval score.

Leaderboard
- update_aware_memory: 3/3 (100%)
- keyword_memory: 2/3 (67%)
- no_memory: 1/3 (33%)
Run Status

What actually ran

| Layer | Status | Artifact |
| --- | --- | --- |
| Local Python eval | Completed; produced the 3/2/1 leaderboard. | runs/latest/results.json |
| Parallel Islo eval | Completed across three live Islo Python sandboxes, one scenario per sandbox. | runs/latest/parallel_islo_results.json |
| Harbor wrapper | Packaged as a minimal Harbor task with verifier and reward-file contract. | harbor/longmem-mini/ |
| Harbor local run | Attempted, but Docker was not running on this machine. | EVAL_REPORT.md |
HN Post

Suggested post

Title: Show HN: Tiny long-memory benchmark running in an Islo sandbox

Short text: I made a tiny LongMemEval-style benchmark that runs in seconds, is packaged as a Harbor task, and was run across three Islo sandboxes. It compares no memory, keyword memory, and update-aware memory on three cases: direct lookup, stale update, and abstention. The point is not a leaderboard score; it is an inspectable demo of what long-memory evals need to test before scaling to LongMemEval or LoCoMo.