How It Works
This page describes the v1 architecture. The current architecture is documented under Version 3.
The game agent loop
Every decision the agent makes is a vision LLM call — there is no hardcoded game logic, no emulator hooks, and no ROM hacking. The agent sees the screen and decides what buttons to press, just like a human player would.
Each cycle follows this loop:
- Perceive: capture a screenshot via Win32 PrintWindow through MCP (no focus steal)
- Analyze: the vision LLM describes the scene as structured JSON
- Evaluate: a parallel evaluator compares before/after screenshots and scores progress
- Plan: the Ego integrates analysis, evaluation, memory, and directives to choose actions
- Act: button presses are sent to the emulator
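The loop above can be sketched roughly as follows. This is only an illustration: every function here (capture, analyze, evaluate, plan, act, run_cycle) is a hypothetical stand-in for the real MCP, vision LLM, and emulator calls, and the return shapes are assumptions rather than the harness's actual data formats.

```python
import asyncio
from typing import List, Optional, Tuple

# Hypothetical stand-ins: the real agent would call MCP, a vision LLM, and the emulator.
async def capture() -> bytes:
    return b"fake-screenshot"

async def analyze(shot: bytes) -> dict:
    return {"scene": "overworld"}

async def evaluate(prev: Optional[bytes], shot: bytes) -> dict:
    # Toy progress score: an unchanged screen counts as no progress.
    return {"progress": 0.0 if prev == shot else 1.0}

async def plan(analysis: dict, evaluation: dict) -> List[str]:
    return ["A"] if evaluation["progress"] > 0 else ["UP"]

async def act(buttons: List[str]) -> None:
    pass  # would send button presses to the emulator

async def run_cycle(prev: Optional[bytes], max_concurrent: int = 2) -> Tuple[bytes, List[str]]:
    shot = await capture()                                  # Perceive
    if max_concurrent >= 2:
        # Analyze and Evaluate run concurrently.
        analysis, evaluation = await asyncio.gather(
            analyze(shot), evaluate(prev, shot)
        )
    else:
        analysis = await analyze(shot)
        evaluation = await evaluate(prev, shot)
    buttons = await plan(analysis, evaluation)              # Plan
    await act(buttons)                                      # Act
    return shot, buttons
```

Returning the screenshot lets the caller pass it back in as prev on the next cycle, which is what gives the evaluator its before/after pair.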
Three-layer architecture
The agent uses a three-layer cognitive architecture inspired by Freudian psychology:
Evaluator
Runs in parallel with analysis (when max_concurrent ≥ 2).
Compares the current screenshot against the previous one, assigning a progress score
and detecting stuck states. Provides immediate feedback to the Ego.
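One plausible way to turn per-cycle progress scores into a stuck signal is a sliding window. The sketch below is an assumption about how such detection could work, not the evaluator's actual logic; StuckDetector, window, and threshold are hypothetical names and values.

```python
from collections import deque

class StuckDetector:
    """Flags a stuck state when every recent progress score is below a threshold."""

    def __init__(self, window: int = 3, threshold: float = 0.1):
        self.scores = deque(maxlen=window)   # keeps only the last `window` scores
        self.threshold = threshold

    def update(self, progress_score: float) -> bool:
        self.scores.append(progress_score)
        full = len(self.scores) == self.scores.maxlen
        # Only report stuck once the window is full and no score shows progress.
        return full and max(self.scores) < self.threshold
```

A single cycle with real progress clears the signal, because its score stays in the window until it ages out.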
Ego
The conscious planner. Weighs all inputs — scene analysis, evaluator feedback, Superego directives, and retrieved memories — to decide the next action. Includes a fallback retry mechanism when output is empty or malformed.
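The fallback retry can be pictured as a loop that re-asks the model until it returns a usable action list. This is a minimal sketch under assumptions: plan_with_retry, call_llm, the "buttons" key, and the three-attempt limit are all hypothetical and not taken from the harness.

```python
import json
from typing import Callable, List, Optional

def plan_with_retry(
    call_llm: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[List[str]]:
    """Return a non-empty button list, retrying on empty or malformed output."""
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue  # malformed or empty response: try again
        buttons = parsed.get("buttons") if isinstance(parsed, dict) else None
        if isinstance(buttons, list) and buttons:
            return buttons
    return None  # caller decides what to do when all attempts fail
```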
Superego
Runs every N cycles (configurable via think_interval).
Sets high-level strategy, tracks quest milestones, detects persistent stuck states, and
issues directives that override the Ego's local decisions when needed.
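The think_interval schedule and the override behaviour might look something like this. Only think_interval comes from the description above; should_think, merge_directive, and the "override" key are hypothetical names used for illustration.

```python
from typing import List, Optional

def should_think(cycle: int, think_interval: int) -> bool:
    """True on every think_interval-th cycle (cycles counted from 1)."""
    return think_interval > 0 and cycle > 0 and cycle % think_interval == 0

def merge_directive(ego_actions: List[str], directive: Optional[dict]) -> List[str]:
    """A directive carrying a non-empty override replaces the Ego's local choice."""
    if directive and directive.get("override"):
        return directive["override"]
    return ego_actions
```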
Memory System
The agent has a three-tier SQLite memory system with async consolidation:
- Short-term: recent cycle observations, which fade quickly
- Mid-term: patterns consolidated from short-term entries, deduplicated by Jaccard similarity to prevent bloat
- Long-term: persistent lessons and strategies
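Jaccard deduplication of the kind mentioned for the mid-term tier can be sketched as below. The word-level tokenization and the 0.8 threshold are assumptions chosen for illustration; the harness may tokenize and threshold differently.

```python
from typing import List

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two strings over lowercase word sets: |A & B| / |A | B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty entries are treated as identical
    return len(sa & sb) / len(sa | sb)

def dedupe(entries: List[str], threshold: float = 0.8) -> List[str]:
    """Keep an entry only if it is not too similar to any entry already kept."""
    kept: List[str] = []
    for entry in entries:
        if all(jaccard(entry, k) < threshold for k in kept):
            kept.append(entry)
    return kept
```

Because each new entry is compared against everything already kept, this is quadratic in the number of entries, which is acceptable for a small mid-term store.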
When the agent transitions from stuck to unstuck, an async LLM task extracts the lesson into long-term memory, so the agent can reuse what it learned for the rest of the session.
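The stuck-to-unstuck trigger can be illustrated with a small SQLite sketch. The table layout, the function names, and the extract_lesson callback are hypothetical. The code is synchronous for brevity, whereas the description above says the real extraction runs as an async LLM task.

```python
import sqlite3
from typing import Callable

def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a memory database with a single tiered table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS memory (tier TEXT NOT NULL, content TEXT NOT NULL)"
    )
    return db

def on_cycle(
    db: sqlite3.Connection,
    was_stuck: bool,
    is_stuck: bool,
    extract_lesson: Callable[[], str],
) -> bool:
    """Store a long-term lesson only on the stuck -> unstuck transition."""
    if was_stuck and not is_stuck:
        db.execute(
            "INSERT INTO memory (tier, content) VALUES (?, ?)",
            ("long", extract_lesson()),
        )
        db.commit()
    return is_stuck  # becomes was_stuck for the next cycle
```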
Model Isolation
Each model gets its own soul file, memory database, and log folder. This means every benchmark run starts from the same blank slate, and runs are fully comparable across models.
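Per-model isolation could be implemented by deriving every path from the model identifier. The directory layout and file names below (soul.md, memory.sqlite, logs) are assumptions for illustration, not the harness's actual layout.

```python
from pathlib import Path
from typing import Dict

def model_paths(root: Path, model_id: str) -> Dict[str, Path]:
    """Return a separate soul file, memory database, and log folder per model."""
    # Provider-style ids such as "vendor/model" contain characters that are unsafe in paths.
    safe = model_id.replace("/", "_").replace(":", "_")
    base = root / safe
    return {
        "soul": base / "soul.md",
        "memory_db": base / "memory.sqlite",
        "logs": base / "logs",
    }
```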
The harness supports multiple providers: Anthropic direct API, OpenRouter (200+ models),
and LM Studio for local models. The reasoning_budget field
caps thinking tokens for extended-reasoning models.
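A per-model configuration might be represented along these lines. The field names reasoning_budget, max_concurrent, and think_interval appear in this page; the provider keys, the model field, the defaults, and the validation rules are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical provider keys for the three supported backends.
PROVIDERS = {"anthropic", "openrouter", "lmstudio"}

@dataclass
class ModelConfig:
    provider: str
    model: str
    reasoning_budget: Optional[int] = None  # cap on thinking tokens; None means no cap set
    max_concurrent: int = 2                 # >= 2 lets the evaluator run alongside analysis
    think_interval: int = 10                # Superego runs every N cycles (assumed default)

    def __post_init__(self) -> None:
        if self.provider not in PROVIDERS:
            raise ValueError(f"unknown provider: {self.provider!r}")
        if self.reasoning_budget is not None and self.reasoning_budget < 0:
            raise ValueError("reasoning_budget must be non-negative")
```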
Pure Vision — No Shortcuts
The agent has no access to game memory, no tile maps, no scripted paths. It sees pixels and presses buttons. This makes ChronoBench a true test of visual reasoning, spatial navigation, and long-horizon planning — the same capabilities that matter for real-world autonomous agents.