Version 3 · Current

How it works

v3 hardens the v2 architecture rather than replacing it. The cycle-log schema, cycle budget, and three-tier memory system are unchanged. What changed is how checkpoints are verified, who gets to write to durable memory, and how cleanly benchmark runs are isolated from each other.

Evidence-gated checkpoints

In v2, a checkpoint was reached the moment the Superego decided it was. In v3, the Superego still proposes checkpoint flips, but a deterministic validator checks them against accumulated dialogue and screen history before they land.

Each checkpoint has a validator — a compiled regex over the actual observed game text and scene metadata. If the evidence is there, the flip is accepted and recorded. If not, it's rejected, logged, and the checkpoint stays closed. This catches the failure mode where a confident model hallucinates progress that never happened.

Example validator


"marle_met": dialogue contains "my pendant" or "return.*pendant" or "you bumped into me"

Rejection rate becomes a new signal: in a clean run nearly every proposed flip is accepted, while a run with many rejections points to a model that has been over-claiming progress.
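The accept-or-reject flow and the rejection-rate signal can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: `VALIDATORS`, `CheckpointGate`, `propose`, and `rejection_rate` are hypothetical names, the case-insensitive flag is an assumption, and only the "marle_met" alternatives come from the example above.

```python
import re
from dataclasses import dataclass, field

# Hypothetical validator table: checkpoint id -> compiled pattern.
# The alternatives for "marle_met" mirror the example validator;
# matching case-insensitively is an assumption of this sketch.
VALIDATORS = {
    "marle_met": re.compile(
        r"my pendant|return.*pendant|you bumped into me", re.IGNORECASE
    ),
}


@dataclass
class CheckpointGate:
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    def propose(self, checkpoint_id: str, evidence: str) -> bool:
        """Accept a proposed flip only if the evidence text matches."""
        pattern = VALIDATORS.get(checkpoint_id)
        if pattern is not None and pattern.search(evidence):
            self.accepted.append(checkpoint_id)
            return True
        # Rejected proposals are recorded; the checkpoint stays closed.
        self.rejected.append(checkpoint_id)
        return False

    def rejection_rate(self) -> float:
        """Fraction of proposed flips that the validator rejected."""
        proposed = len(self.accepted) + len(self.rejected)
        return len(self.rejected) / proposed if proposed else 0.0
```

In this sketch a proposal for a checkpoint with no registered validator is rejected rather than waved through, so a missing pattern can never count as evidence.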

Primary and secondary checkpoints

The mission is now explicit: progress the story AND explore the world. The checkpoint set has two tiers.

8 primary (story)

Main-path beats from bedroom exit through first time travel. Drives leaderboard rank. v3 added two intermediate beats (telepod demo announced, candy gate passed) that v2 glossed over.

10 secondary (exploration)

Optional fair activities and NPC interactions that reward models for noticing and engaging with the world. They are shown as a separate badge row on run pages and do not affect primary rank.
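One way to picture the split is a ranking function that reads only the primary count. The sketch below is an assumption about shape, not the leaderboard's real code: `RunResult` and `leaderboard` are hypothetical names, and tie-breaking between equal primary counts is not specified in this document, so ties simply keep their input order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    model: str
    primary_reached: int    # story checkpoints reached
    secondary_reached: int  # exploration badges earned


def leaderboard(runs):
    """Order runs by primary checkpoints only; secondary is display-only."""
    # sorted() is stable, so runs with equal primary counts keep input order.
    return sorted(runs, key=lambda r: r.primary_reached, reverse=True)
```

A run with many exploration badges but fewer story beats still ranks below a run that got further along the main path.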

Superego-only memory writes

v2 let the Ego write to CORE.md via a memory_write tool. In v3 that tool is gone. All durable-memory edits are now part of the Superego's directive processing — which means every CORE.md update is evaluated against the same strategic view that owns checkpoint proposals.

The upshot: the Ego focuses on button presses, and durable memory is curated by a single layer that already reasons about evidence and quest state.
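The single-writer rule can be illustrated with a small guard. v3 removes the Ego's tool outright rather than checking permissions at write time, so treat this as a sketch of the invariant only: `Layer`, `CoreMemory`, and `write` are hypothetical names, and the list stands in for CORE.md.

```python
from enum import Enum


class Layer(Enum):
    EGO = "ego"
    SUPEREGO = "superego"


class CoreMemory:
    """In-memory stand-in for CORE.md; real file handling is not shown."""

    def __init__(self) -> None:
        self.entries: list[str] = []

    def write(self, author: Layer, entry: str) -> None:
        # Only the Superego may edit durable memory in this sketch.
        if author is not Layer.SUPEREGO:
            raise PermissionError(
                f"{author.value} cannot write to durable memory"
            )
        self.entries.append(entry)
```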

Reflection layer removed

v2 ran a separate async reflection cycle that curated long-term beliefs between cycles. v3 folds that responsibility into the Superego and deletes the reflection module. One fewer LLM call per consolidation boundary, and one fewer place for behavior to drift.

Per-(agent, model) skill scoping

v2 stored skills at a flat data/skills/<game>/... path. Skills learned by one model could influence the benchmark run of the next model to be tested. v3 scopes skills per (agent, model): data/skills/<agent>/<model>/<game>/.... No cross-model bleed. Fair benchmarking.
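A path helper makes the scoping concrete. This is a sketch under assumptions: `skill_dir` and `SKILLS_ROOT` are hypothetical names, and the component check is an extra safeguard that the source does not describe. Whatever sits below the game directory is left out because the source elides it.

```python
from pathlib import PurePosixPath

SKILLS_ROOT = PurePosixPath("data/skills")


def skill_dir(agent: str, model: str, game: str) -> PurePosixPath:
    """Build the per-(agent, model) skill directory for a game."""
    for part in (agent, model, game):
        # Reject separators and dot segments so one pair cannot
        # address another pair's skills (assumed safeguard).
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return SKILLS_ROOT / agent / model / game
```

Two models run by the same agent on the same game resolve to different directories, which is the property that keeps one model's learned skills out of another's run.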

Cold-start seeds

Starting a new benchmark for a model used to require manually creating its soul file and SQLite memory DB. v3 does this automatically: when a previously-unused (agent, model) pair is invoked, the harness instantiates a fresh soul and memory from templates in data/templates/. Every run starts from the same baseline, which is what the leaderboard rank assumes.
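The seeding step amounts to "copy the templates if this pair has no state yet". The sketch below assumes the per-pair state lives in a single directory; `ensure_seeded` is a hypothetical name, and the real harness's layout for soul files and SQLite DBs is not described here.

```python
import shutil
from pathlib import Path


def ensure_seeded(templates: Path, target: Path) -> bool:
    """Copy the template tree into target on first use.

    Returns True if a fresh seed was created, False if target already existed.
    """
    if target.exists():
        return False  # pair has been used before; leave its state alone
    target.parent.mkdir(parents=True, exist_ok=True)
    # copytree creates target itself, so it must not exist beforehand.
    shutil.copytree(templates, target)
    return True
```

Calling it before every run is safe: the first call for a new pair copies the baseline, and later calls leave accumulated state untouched.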

Pure vision — no shortcuts

The agent still has no access to game memory, no tile maps, no scripted paths. It sees pixels and presses buttons. Evidence validation operates on accumulated dialogue and screen descriptions that the perception layer extracted — the agent isn't reading the ROM, and neither is the validator.

What changed from v2?

New evidence-validation gate on all checkpoint flips (rejects hallucinated progress).
Primary checkpoint set grew from 6 to 8 (added telepod-demo-announced and candy-gate-passed).
New 10-checkpoint exploration track; mission reframed as progress AND explore.
Durable memory writes are now Superego-only; Ego's memory_write tool removed.
Dedicated reflection cycle removed; Superego absorbs belief curation.
Skills scoped per (agent, model) on disk — no cross-model contamination.
Automated cold-start seeding of soul files and memory DBs from templates.

Read the v2 architecture →