How It Works
This page describes the v2 architecture, which is archived. The current architecture is documented under Version 3. v2 consolidated learnings from our earlier Hermes experiments — procedural knowledge, curated memory, and strategic reasoning that reacts to the game state instead of ticking on a fixed interval — on top of the original v1 vision-only premise.
The game agent loop
Every decision the agent makes is a vision LLM call — no hardcoded game logic, no emulator hooks, no ROM hacking. The agent sees the screen and decides what buttons to press.
Each cycle:
- Perceive: capture a screenshot via Win32 PrintWindow through MCP
- Analyze: the Phase 1 vision call extracts scene structure, dialogue text, character position, and NPCs
- Evaluate: score progress relative to the previous frame and flag stuck states
- Strategize: the Superego fires on events (screen change, dialogue end, stuck streak, checkpoint, heartbeat) and issues a directive
- Plan: the Ego integrates analysis, evaluator feedback, directive, retrieved memories, and context-matched skills
- Act: button presses are sent to the emulator
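The cycle above can be sketched as a plain function that wires the six steps together. This is a minimal illustration, not the harness's code: `run_cycle`, `CycleResult`, and every callable parameter are hypothetical names, and the lambdas stand in for the real screenshot, vision LLM, and emulator calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CycleResult:
    analysis: dict
    stuck: bool
    directive: Optional[dict]
    buttons: List[str]


def run_cycle(
    perceive: Callable[[], bytes],
    analyze: Callable[[bytes], dict],
    evaluate: Callable[[dict, Optional[dict]], bool],
    strategize: Callable[[dict, bool], Optional[dict]],
    plan: Callable[[dict, bool, Optional[dict]], List[str]],
    act: Callable[[List[str]], None],
    previous_analysis: Optional[dict],
) -> CycleResult:
    frame = perceive()                            # Perceive: raw screenshot bytes
    analysis = analyze(frame)                     # Analyze: structured scene description
    stuck = evaluate(analysis, previous_analysis) # Evaluate: compare with previous frame
    directive = strategize(analysis, stuck)       # Strategize: may return None (no event)
    buttons = plan(analysis, stuck, directive)    # Plan: choose button presses
    act(buttons)                                  # Act: send them to the emulator
    return CycleResult(analysis, stuck, directive, buttons)


# Stubbed usage: every callable here is a placeholder, not a real backend.
pressed: List[str] = []
result = run_cycle(
    perceive=lambda: b"fake-screenshot",
    analyze=lambda frame: {"scene": "dialog", "text": "Hello"},
    evaluate=lambda cur, prev: cur == prev,
    strategize=lambda analysis, stuck: None,
    plan=lambda analysis, stuck, directive: ["A"],
    act=pressed.extend,
    previous_analysis=None,
)
```

Passing the steps in as callables keeps the loop itself free of game logic, which matches the vision-only premise: all decisions live behind the `analyze`, `strategize`, and `plan` calls.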
Event-driven Superego
In v1 the Superego ran every N cycles regardless of what happened. In v2 it is triggered by concrete events, so it thinks when something has changed, not on a clock. Each firing records its trigger_reason so the mix can be inspected afterwards.
- screen_change: Scene type switched (overworld → dialog, dungeon → menu).
- dialogue_end: A dialogue sequence completed, which is a good time to reassess.
- stuck_streak: Evaluator flagged stuck for several consecutive cycles.
- checkpoint: A benchmark checkpoint was reached, so strategy recalibrates for the next one.
- heartbeat: Safety fallback that fires every 6 cycles even if no other event has occurred.
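One way to picture the trigger selection is a small function that returns the trigger_reason for a cycle, or nothing if the Superego should stay idle. This is a sketch under assumptions: the function name, its parameters, the priority order between events, and the stuck threshold of 3 are all hypothetical; only the five trigger names and the 6-cycle heartbeat come from the description above.

```python
from typing import Optional

HEARTBEAT_INTERVAL = 6  # stated above: heartbeat fires every 6 cycles
STUCK_THRESHOLD = 3     # assumption: "several consecutive cycles" is not quantified


def pick_trigger(
    prev_scene: Optional[str],
    scene: str,
    dialogue_just_ended: bool,
    stuck_streak: int,
    checkpoint_reached: bool,
    cycles_since_last_fire: int,
) -> Optional[str]:
    """Return the trigger_reason for this cycle, or None if no event applies."""
    # Assumed priority: the most strategically significant event wins.
    if checkpoint_reached:
        return "checkpoint"
    if stuck_streak >= STUCK_THRESHOLD:
        return "stuck_streak"
    if dialogue_just_ended:
        return "dialogue_end"
    if prev_scene is not None and scene != prev_scene:
        return "screen_change"
    if cycles_since_last_fire >= HEARTBEAT_INTERVAL:
        return "heartbeat"
    return None
```

Returning a single reason per firing keeps the recorded trigger_reason unambiguous, at the cost of hiding coinciding events (a dialogue ending usually also changes the scene type).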
A directive is a structured object — objective, priority, warnings, hints, quest_state — not a free-text dump. The Ego receives it as high-signal guidance.
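A structured directive might look like the dataclass below. The five field names come from the description above; the field types, the default values, the `to_prompt_block` helper, and the example contents are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Directive:
    objective: str
    priority: str = "normal"                          # assumed: a short label
    warnings: List[str] = field(default_factory=list)
    hints: List[str] = field(default_factory=list)
    quest_state: Dict[str, str] = field(default_factory=dict)

    def to_prompt_block(self) -> str:
        """Serialize to stable JSON so the Ego sees the same layout every time."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder content, not taken from a real run.
directive = Directive(
    objective="Exit the current building",
    priority="high",
    warnings=["Avoid re-opening the menu"],
    hints=["The exit is at the bottom edge of the room"],
    quest_state={"stage": "intro"},
)
```

Fixed fields make a directive easy to validate and diff between firings, which a free-text blob would not allow.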
Skill library — procedural knowledge
Skills are YAML-frontmatter markdown files describing generalizable procedures ("open the menu", "dismiss a dialogue box", "leave a shop"). They're indexed by context — screen type and location keywords — and the matching ones are injected into the Ego's prompt when they apply. Each skill tracks its success count and last-used timestamp, so the library improves over time.
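As a rough sketch of context matching, the code below parses a skill file and selects the skills that apply to the current screen. The frontmatter keys (`name`, `screen_type`, `keywords`, `success_count`), the sample skill, and the ranking by success count are assumptions, and the parser handles only flat `key: value` lines rather than full YAML.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Hypothetical skill file; the real frontmatter schema is not documented here.
SKILL_TEXT = """---
name: dismiss-dialogue
screen_type: dialog
keywords: shop, inn
success_count: 4
---
Press A until the dialogue box closes.
"""


@dataclass
class Skill:
    name: str
    screen_type: str
    keywords: FrozenSet[str]
    success_count: int
    body: str


def parse_skill(text: str) -> Skill:
    # Layout is "---\n<frontmatter>---\n<body>"; split on the two delimiters.
    _, front, body = text.split("---\n", 2)
    pairs = (line.split(":", 1) for line in front.strip().splitlines())
    meta = {key.strip(): value.strip() for key, value in pairs}
    keywords = frozenset(
        k.strip().lower() for k in meta.get("keywords", "").split(",") if k.strip()
    )
    return Skill(
        name=meta["name"],
        screen_type=meta["screen_type"],
        keywords=keywords,
        success_count=int(meta.get("success_count", 0)),
        body=body.strip(),
    )


def matching_skills(skills: List[Skill], screen_type: str, location: str) -> List[Skill]:
    """Skills for this screen type whose keywords (if any) appear in the location."""
    loc = location.lower()
    hits = [
        s for s in skills
        if s.screen_type == screen_type
        and (not s.keywords or any(k in loc for k in s.keywords))
    ]
    # Assumed ranking: skills that worked more often are injected first.
    return sorted(hits, key=lambda s: s.success_count, reverse=True)
```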
Skills are stored per-(agent, model) on disk so one model's learnings never leak into another model's benchmark run. The Superego can also propose new skills when it recognizes a pattern worth generalizing.
CORE.md — curated notes
CORE.md is a bounded, agent-writable markdown file (max 2500 chars) that holds durable facts the agent decided were worth remembering: NPC quirks, map landmarks, recurring game mechanics. It is separate from the episodic SQLite memory tiers.
CORE.md is loaded once at session start and frozen into the prompt prefix so it benefits from prompt caching. Every write is scanned for prompt-injection patterns before being accepted.
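A guarded write to CORE.md could be sketched as follows. The 2500-character limit comes from the description above; the function name, the reject-on-overflow behavior, and the two regular expressions are illustrative assumptions, since the real scanner's rules are not documented here.

```python
import re
from typing import Tuple

CORE_MAX_CHARS = 2500  # limit stated above

# Assumption: toy patterns for illustration, not the harness's actual rule set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|previous|prior) .*instructions", re.IGNORECASE),
    re.compile(r"^\s*(system|assistant)\s*:", re.IGNORECASE | re.MULTILINE),
]


def try_append_note(core_text: str, note: str) -> Tuple[str, bool]:
    """Return (new_text, accepted); a rejected write leaves the text unchanged."""
    if any(pattern.search(note) for pattern in INJECTION_PATTERNS):
        return core_text, False
    candidate = (core_text.rstrip("\n") + "\n" + note.strip() + "\n").lstrip("\n")
    if len(candidate) > CORE_MAX_CHARS:
        return core_text, False  # assumed policy: reject rather than truncate
    return candidate, True
```

Because the file is frozen into the prompt prefix for the whole session, a single bad write would influence every later call, which is why the check runs before the text is accepted rather than at load time.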
Three-tier memory with async consolidation
Episodic memory has three tiers — short-term, medium-term, long-term — and runs an asynchronous consolidation loop that compresses short-term observations upward as they accrue importance. Importance-weighted eviction keeps each tier bounded.
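The bounded tiers and upward promotion can be illustrated with a small in-memory model. Everything here is a simplification under assumptions: the class and function names, the capacities, and the promotion threshold are hypothetical, the real store is SQLite-backed, the consolidation runs asynchronously, and the compression step (summarizing observations before promotion) is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Memory:
    text: str
    importance: float


@dataclass
class Tier:
    capacity: int
    items: List[Memory] = field(default_factory=list)

    def add(self, memory: Memory) -> None:
        self.items.append(memory)
        # Importance-weighted eviction: when over capacity, keep the most
        # important entries and drop the rest.
        if len(self.items) > self.capacity:
            self.items.sort(key=lambda m: m.importance, reverse=True)
            del self.items[self.capacity:]


def consolidate(lower: Tier, upper: Tier, threshold: float) -> int:
    """Move memories whose importance reached `threshold` up one tier."""
    promoted = [m for m in lower.items if m.importance >= threshold]
    lower.items = [m for m in lower.items if m.importance < threshold]
    for memory in promoted:
        upper.add(memory)
    return len(promoted)


short_term = Tier(capacity=3)
medium_term = Tier(capacity=2)
for text, importance in [("a", 0.1), ("b", 0.9), ("c", 0.4), ("d", 0.7)]:
    short_term.add(Memory(text, importance))  # "a" is evicted on the fourth add

moved = consolidate(short_term, medium_term, threshold=0.6)
```

In this toy run the short-term tier ends up holding only "c", while "b" and "d" move to the medium-term tier.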
Two per-cycle health metrics surface on run pages: recall_topic_count (how many distinct topics the retrieval step pulled from) and recall_top_topic_share (what fraction of the retrieved memories clustered on one topic). Together they show whether memory is fragmenting or collapsing.
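Both metrics follow directly from the topic labels of the retrieved memories. The sketch below assumes each retrieved memory carries a single topic string; the function name and the handling of an empty retrieval are assumptions.

```python
from collections import Counter
from typing import Dict, List, Union


def recall_health(topics: List[str]) -> Dict[str, Union[int, float]]:
    """Compute both health metrics from the topics of one cycle's retrieved memories."""
    if not topics:
        # Assumed convention for an empty retrieval.
        return {"recall_topic_count": 0, "recall_top_topic_share": 0.0}
    counts = Counter(topics)
    top_count = counts.most_common(1)[0][1]
    return {
        "recall_topic_count": len(counts),
        "recall_top_topic_share": top_count / len(topics),
    }
```

For example, `recall_health(["shop", "shop", "shop", "npc"])` gives a topic count of 2 and a top-topic share of 0.75. Read together, many topics with a low share would suggest fragmentation, while one or two topics with a share near 1.0 would suggest collapse.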
Per-(agent, model) isolation
Every benchmark run starts from a blank slate. Each (agent, model) pair has its own soul file, memory database, skill library directory, CORE.md, and log folder. Runs are fully comparable across models with no cross-contamination.
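Isolation of this kind usually comes down to deriving every path from the (agent, model) pair. The layout below is hypothetical: the directory structure and the file names other than CORE.md are assumptions, and `provider-a/model-x` is a placeholder model identifier.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath


def _slug(name: str) -> str:
    """Make an agent or model name safe to use as a single directory name."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name)


@dataclass(frozen=True)
class RunPaths:
    root: PurePosixPath

    @property
    def soul(self) -> PurePosixPath:
        return self.root / "soul.md"        # assumed file name

    @property
    def memory_db(self) -> PurePosixPath:
        return self.root / "memory.sqlite"  # assumed file name

    @property
    def skills_dir(self) -> PurePosixPath:
        return self.root / "skills"

    @property
    def core_md(self) -> PurePosixPath:
        return self.root / "CORE.md"

    @property
    def logs_dir(self) -> PurePosixPath:
        return self.root / "logs"


def paths_for(base: str, agent: str, model: str) -> RunPaths:
    return RunPaths(PurePosixPath(base) / _slug(agent) / _slug(model))


run_a = paths_for("data", "ego-v2", "provider-a/model-x")
run_b = paths_for("data", "ego-v2", "provider-b/model-y")
```

Since nothing is shared above the per-pair root, two runs can only touch the same file if they use the same agent and the same model.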
The harness supports Anthropic direct, OpenRouter, and LM Studio. The reasoning_budget field caps thinking tokens for extended-reasoning models.
Pure vision — no shortcuts
The agent has no access to game memory, no tile maps, no scripted paths. It sees pixels and presses buttons. This makes ChronoBench a true test of visual reasoning, spatial navigation, and long-horizon planning.
What changed from v1?
- Superego runs on events (screen change, stuck streak, checkpoint) instead of a fixed interval.
- New procedural knowledge layer — skill library — injected into Ego prompts by context.
- Curated agent-writable notes in CORE.md, separate from episodic memory.
- Memory consolidation runs asynchronously in the background.
- New per-cycle metrics: memory health and skill usage are recorded and surfaced on run pages.