Harbor / Islo / Long-Memory Smoke Test

A tiny long-memory benchmark, fanned out across Islo sandboxes.

We built a miniature LongMemEval-style demo that compares three memory approaches on direct recall, stale updates, and abstention. It is intentionally small: one Python file, one Harbor task wrapper, three scenario sandboxes, and one shareable Islo page.

- 3 memory questions
- 3 simple approaches
- 3 Islo scenario sandboxes
- 1 Harbor task wrapper
Abstract

What we did

Retrieval benchmarks usually measure whether a system can find relevant text. Long-term memory needs more: the system must remember old facts, prefer newer corrections, and refuse to invent answers when the memory does not contain them. This demo compresses that problem into three tiny questions over five chat sessions.

The direct runner, demo.py, produces deterministic JSON results. Harbor wraps the same benchmark as an eval task with a verifier and reward file. Islo runs the three scenarios as separate sandbox trials and gives the result page a public share URL.
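
Because demo.py is deterministic, the simplest sanity check is to run it twice and compare the JSON. A minimal sketch, assuming demo.py writes the runs/latest/results.json artifact named in the Run Status table below:

import json
import subprocess

def run_once():
    # Run the benchmark, then read back the JSON artifact it writes.
    subprocess.run(["python", "demo.py"], check=True)
    with open("runs/latest/results.json") as f:
        return json.load(f)

first = run_once()
second = run_once()
assert first == second, "demo.py is documented as deterministic"
print(json.dumps(first, indent=2))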

Built with

The two pieces this demo leans on: Harbor packages the benchmark as an eval task, and Islo runs it in sandboxes and shares the result page.

Direct Lookup

What snack should I buy for Mira's study group?

Expected: sea-salt pistachios

Basic memory should recover a fact stated in an earlier session.

Knowledge Update

For the Portland trip, which airport should I use now?

Expected: OAK, not stale SFO.

This is the important failure: keyword retrieval can find a relevant old fact and still be wrong.

Abstention

What is Sam's jacket size?

Expected: not enough information

A useful memory system should know when the answer is absent.
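
Taken together, the three cases fit in a handful of lines. A hedged sketch of how they could be encoded; the field names (kind, question, expected) are illustrative, not the demo's actual schema:

CASES = [
    {"kind": "direct_lookup",
     "question": "What snack should I buy for Mira's study group?",
     "expected": "sea-salt pistachios"},
    {"kind": "knowledge_update",
     "question": "For the Portland trip, which airport should I use now?",
     "expected": "OAK"},  # the newer correction; the stale answer is SFO
    {"kind": "abstention",
     "question": "What is Sam's jacket size?",
     "expected": "not enough information"},
]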

Method

The three approaches

| Approach | Behavior | Result |
| --- | --- | --- |
| no_memory | Has no chat history, so it abstains. It only passes the abstention case. | 1/3 |
| keyword_memory | Retrieves the top keyword-overlap session. It answers direct lookup but can pick stale evidence. | 2/3 |
| update_aware_memory | Uses lexical retrieval, prefers newer relevant facts for update questions, and abstains when evidence is missing. | 3/3 |
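
The winning strategy is small enough to sketch. This is a minimal illustration of the described behavior, not the code in demo.py; for simplicity it always prefers the newest relevant session, which is what lets corrections beat stale facts:

def update_aware_answer(question, sessions):
    # sessions: (timestamp, text) pairs, one per chat session.
    q_words = set(question.lower().split())
    # Lexical retrieval: keep every session with any keyword overlap.
    hits = [(ts, text) for ts, text in sessions
            if q_words & set(text.lower().split())]
    if not hits:
        # Abstain instead of inventing an answer.
        return "not enough information"
    # Prefer the newest relevant session so updates win over stale facts.
    return max(hits, key=lambda pair: pair[0])[1]
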
Harbor

How Harbor is used

Harbor is the eval wrapper. The demo includes a minimal task under harbor/longmem-mini/: task metadata, an instruction, a Python environment image, a verifier script, and an oracle solution.

The verifier runs the benchmark, checks the expected 1/3, 2/3, 3/3 leaderboard, and writes /logs/verifier/reward.txt. That makes the toy benchmark behave like a real agent-eval task. In this local environment, Docker was not running, so the Harbor task is packaged and ready but has not been executed end to end locally.
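
In Python terms, the verifier contract reduces to a few lines. A sketch only: the real check lives in tests/test.sh, the results.json shape is assumed, and the binary 0/1 reward value is an assumption about Harbor's reward-file contract:

import json
import pathlib

# Expected leaderboard, straight from the task description.
EXPECTED = {
    "no_memory": "1/3",
    "keyword_memory": "2/3",
    "update_aware_memory": "3/3",
}

with open("runs/latest/results.json") as f:  # assumed artifact path
    results = json.load(f)

ok = all(results.get(name) == score for name, score in EXPECTED.items())

reward = pathlib.Path("/logs/verifier/reward.txt")
reward.parent.mkdir(parents=True, exist_ok=True)
reward.write_text("1\n" if ok else "0\n")  # assumed binary reward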

harbor/longmem-mini/
  task.toml
  instruction.md
  environment/Dockerfile
  environment/demo.py
  tests/test.sh
  solution/solve.sh

harbor run -p harbor/longmem-mini -a oracle
Islo

How Islo is used

Islo is the execution and sharing layer. islo.yaml defines the lightweight Python sandbox. The repo can be cloned into an Islo environment, the benchmark can run there, and the static report can be served from inside the sandbox. We ran the three simple scenarios across three Islo Python sandboxes and got the same 3/3, 2/3, 1/3 leaderboard.

The important demo move is islo share: it turns the sandbox-hosted result page into a public URL, so the GitHub/HN post can point to a live, reproducible run instead of a screenshot.

islo use longmem-mini \
  --source github://zozo123/longmem-mini-on-islo

python demo.py
python3 parallel_islo_eval.py --backend islo --prefix longmem-mini
python -m http.server 8080 -d site
islo share longmem-mini 8080 --public --ttl 24h

harbor run -p harbor/longmem-mini -a oracle --env islo
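
Since each sandbox runs one scenario, the fan-in step is just folding three per-scenario results into the leaderboard. A sketch under assumed shapes; parallel_islo_eval.py's real input and output formats may differ:

import json
from collections import Counter

def merge(scenario_results):
    # scenario_results: one {approach: passed} dict per sandbox.
    passes = Counter()
    total = len(scenario_results)
    for result in scenario_results:
        for approach, passed in result.items():
            passes[approach] += int(passed)
    return {approach: f"{passes[approach]}/{total}" for approach in passes}

print(json.dumps(merge([
    {"no_memory": False, "keyword_memory": True,  "update_aware_memory": True},   # direct lookup
    {"no_memory": False, "keyword_memory": False, "update_aware_memory": True},   # knowledge update
    {"no_memory": True,  "keyword_memory": True,  "update_aware_memory": True},   # abstention
]), indent=2))
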
Results

Why the 3/2/1 result is the story

The result is not meant to be surprising. It is meant to be inspectable. A no-memory system can only abstain. A keyword system improves recall, but the update question exposes stale retrieval. The update-aware strategy wins because it treats time and abstention as part of the task.

That is the smallest useful lesson for long-memory evaluation: do not only ask whether memory found something. Ask whether it found the current thing, and whether it knows when nothing is there.

Tiny LongMemEval-style benchmark on Islo
This is a miniature deterministic demo, not an official LongMemEval score.

Leaderboard
- update_aware_memory: 3/3 (100%)
- keyword_memory: 2/3 (67%)
- no_memory: 1/3 (33%)
Run Status

What actually ran

| Layer | Status | Artifact |
| --- | --- | --- |
| Local Python eval | Completed; produced the 3/2/1 leaderboard. | runs/latest/results.json |
| Parallel Islo eval | Completed across three live Islo Python sandboxes, one scenario per sandbox. | runs/latest/parallel_islo_results.json |
| Harbor wrapper | Packaged as a minimal Harbor task with verifier and reward-file contract. | harbor/longmem-mini/ |
| Harbor local run | Attempted, but Docker was not running on this machine. | EVAL_REPORT.md |
HN Post

Suggested post

Title: Show HN: Tiny long-memory benchmark running in an Islo sandbox

Short text: I made a tiny LongMemEval-style benchmark that runs in seconds, is packaged as a Harbor task, and was run across three Islo sandboxes. It compares no memory, keyword memory, and update-aware memory on three cases: direct lookup, stale update, and abstention. The point is not a leaderboard score; it is an inspectable demo of what long-memory evals need to test before scaling to LongMemEval or LoCoMo.