Version 1 · Archive

How It Works

This page describes the v1 architecture. The current architecture is documented under Version 3.

The Game Agent Loop

Every decision the agent makes is a vision LLM call — there is no hardcoded game logic, no emulator hooks, and no ROM hacking. The agent sees the screen and decides what buttons to press, just like a human player would.

Each cycle follows this loop:

Perceive — capture a screenshot via Win32 PrintWindow through MCP (no focus steal)
Analyze — the vision LLM describes the scene as structured JSON
Evaluate — a parallel evaluator compares before/after screenshots and scores progress
Plan — the Ego integrates analysis, evaluation, memory, and directives to choose actions
Act — button presses are sent to the emulator
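The five steps above can be sketched as a single function. This is a minimal illustration, not the harness's actual code: the names run_cycle and CycleResult and all five callables are hypothetical stand-ins for the screenshot capture, the vision LLM calls, and the emulator input. Memory and directives, which the Ego also reads, are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class CycleResult:
    """Outcome of one cycle (hypothetical container, not from the harness)."""
    scene: dict
    progress: float
    buttons: List[str]


def run_cycle(
    capture: Callable[[], Any],
    analyze: Callable[[Any], dict],
    evaluate: Callable[[Optional[Any], Any], float],
    plan: Callable[[dict, float], List[str]],
    act: Callable[[List[str]], None],
    prev_frame: Optional[Any],
) -> Tuple[Any, CycleResult]:
    frame = capture()                        # Perceive: grab a screenshot
    scene = analyze(frame)                   # Analyze: scene as structured data
    progress = evaluate(prev_frame, frame)   # Evaluate: before/after comparison
    buttons = plan(scene, progress)          # Plan: the Ego chooses actions
    act(buttons)                             # Act: send button presses
    # The current frame becomes the "before" image of the next cycle.
    return frame, CycleResult(scene, progress, buttons)
```

Passing the callables in as arguments keeps the loop itself free of game logic, which matches the pure-vision design described on this page.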

Three-Layer Architecture

The agent uses a three-layer cognitive architecture inspired by Freudian psychology:

Evaluator

Runs in parallel with analysis (when max_concurrent ≥ 2). Compares the current screenshot against the previous one, assigning a progress score and detecting stuck states. Provides immediate feedback to the Ego.
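One way to run analysis and evaluation side by side under a concurrency cap is an asyncio semaphore. The sketch below assumes both steps are async functions; analyze_and_evaluate and its parameters are hypothetical names, and only max_concurrent comes from the description above. With max_concurrent set to 1 the two calls simply run one after the other.

```python
import asyncio
from typing import Any, Awaitable, Callable, Tuple


async def analyze_and_evaluate(
    analyze: Callable[[Any], Awaitable[dict]],
    evaluate: Callable[[Any, Any], Awaitable[dict]],
    prev_frame: Any,
    frame: Any,
    max_concurrent: int = 2,
) -> Tuple[dict, dict]:
    # The semaphore caps how many LLM calls are in flight at once.
    sem = asyncio.Semaphore(max(1, max_concurrent))

    async def limited(fn, *args):
        async with sem:
            return await fn(*args)

    # gather returns results in argument order, regardless of finish order.
    scene, evaluation = await asyncio.gather(
        limited(analyze, frame),
        limited(evaluate, prev_frame, frame),
    )
    return scene, evaluation
```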

Ego

The conscious planner. Weighs all inputs — scene analysis, evaluator feedback, Superego directives, and retrieved memories — to decide the next action. Includes a fallback retry mechanism when output is empty or malformed.
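A fallback retry for empty or malformed output might look like the following sketch. Everything here is an assumption made for illustration: the JSON shape {"actions": [...]}, the button names, the retry count, the reminder text, and the final fallback press are not taken from the harness.

```python
import json
from typing import Callable, List, Optional, Sequence

# Assumed button vocabulary; the real harness may use different names.
VALID_BUTTONS = {"A", "B", "X", "Y", "L", "R", "START", "SELECT",
                 "UP", "DOWN", "LEFT", "RIGHT"}


def parse_actions(raw: str) -> Optional[List[str]]:
    """Return a normalized button list, or None if the reply is unusable."""
    if not raw or not raw.strip():
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    actions = data.get("actions")
    if not isinstance(actions, list) or not actions:
        return None
    if not all(isinstance(a, str) and a.upper() in VALID_BUTTONS for a in actions):
        return None
    return [a.upper() for a in actions]


def plan_with_retry(
    call_llm: Callable[[str], str],
    prompt: str,
    max_retries: int = 2,
    fallback: Sequence[str] = ("A",),
) -> List[str]:
    reminder = '\nReply with JSON only, e.g. {"actions": ["UP"]}'
    for attempt in range(max_retries + 1):
        # Retries append a format reminder to the original prompt.
        raw = call_llm(prompt if attempt == 0 else prompt + reminder)
        actions = parse_actions(raw)
        if actions is not None:
            return actions
    # Every attempt failed: press a harmless default rather than stall.
    return list(fallback)
```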

Superego

Runs every N cycles (configurable via think_interval). Sets high-level strategy, tracks quest milestones, detects persistent stuck states, and issues directives that override the Ego's local decisions when needed.
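The "every N cycles" schedule reduces to a modulo check. In this sketch the function name superego_due is hypothetical, and it assumes cycles are counted from 1 and that a non-positive think_interval disables the Superego; the harness may count differently.

```python
def superego_due(cycle: int, think_interval: int) -> bool:
    """True when the Superego should run on this cycle (cycles start at 1)."""
    if think_interval <= 0:
        return False  # assumed convention: non-positive interval disables it
    return cycle > 0 and cycle % think_interval == 0


# With think_interval=5, the Superego runs on cycles 5 and 10 of the first 10.
schedule = [c for c in range(1, 11) if superego_due(c, 5)]
```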

Memory System

The agent has a three-tier SQLite memory system with async consolidation:

Short-term: recent cycle observations that fade quickly
Mid-term: patterns consolidated from short-term memory, deduplicated by Jaccard similarity to prevent bloat
Long-term: persistent lessons and strategies
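As an illustration of Jaccard-based deduplication in the mid-term tier, the sketch below compares word sets and skips an insert when an existing entry is too similar. The table layout, the 0.8 threshold, the word-level tokenization, and the function names are all assumptions, not the harness's schema.

```python
import re
import sqlite3


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' lowercase word sets."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta and not tb:
        return 1.0  # two empty texts count as identical
    return len(ta & tb) / len(ta | tb)


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memory ("
        "id INTEGER PRIMARY KEY, tier TEXT NOT NULL, text TEXT NOT NULL)"
    )
    return conn


def add_mid_term(conn: sqlite3.Connection, text: str, threshold: float = 0.8) -> bool:
    """Insert text into the mid-term tier unless a near-duplicate exists."""
    rows = conn.execute("SELECT text FROM memory WHERE tier = 'mid'").fetchall()
    if any(jaccard(text, existing) >= threshold for (existing,) in rows):
        return False  # near-duplicate: skip to keep the tier small
    conn.execute("INSERT INTO memory (tier, text) VALUES ('mid', ?)", (text,))
    conn.commit()
    return True
```

Scanning every mid-term row on each insert is linear in the tier size, which is acceptable for a sketch but would want an index or a cap on rows in a long session.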

When the agent transitions from stuck to unstuck, an async LLM task extracts the lesson into long-term memory — ensuring the agent learns from its mistakes across the entire session.
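The stuck-to-unstuck trigger can be sketched as edge detection plus a background task. The class LessonExtractor, its methods, and the two injected callables are hypothetical; extract_lesson stands in for the async LLM call and store for the long-term memory write. Note that observe must be called from inside a running event loop, because it uses asyncio.create_task.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Set


class LessonExtractor:
    def __init__(
        self,
        extract_lesson: Callable[[List[str]], Awaitable[Optional[str]]],
        store: Callable[[str], None],
    ) -> None:
        self._extract = extract_lesson
        self._store = store
        self._was_stuck = False
        self._tasks: Set[asyncio.Task] = set()

    def observe(self, is_stuck: bool, recent_history: List[str]) -> None:
        # Fire only on the stuck -> unstuck edge, not on every unstuck cycle.
        if self._was_stuck and not is_stuck:
            task = asyncio.create_task(self._run(list(recent_history)))
            self._tasks.add(task)  # keep a reference so the task is not lost
            task.add_done_callback(self._tasks.discard)
        self._was_stuck = is_stuck

    async def _run(self, history: List[str]) -> None:
        lesson = await self._extract(history)
        if lesson:
            self._store(lesson)

    async def drain(self) -> None:
        """Wait for pending extractions, e.g. at shutdown."""
        if self._tasks:
            await asyncio.gather(*self._tasks)
```

Because the extraction runs as a separate task, the main loop does not wait for the LLM call before starting the next cycle.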

Model Isolation

Each model gets its own soul file, memory database, and log folder. Because no state is shared between models, every benchmark run starts from the same blank slate, which keeps runs comparable across models.
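Per-model isolation can be as simple as deriving every path from the model identifier. The directory layout, file names, and the helper workspace_for below are assumptions for illustration; the harness's actual paths are not documented here.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ModelWorkspace:
    soul_file: Path
    memory_db: Path
    log_dir: Path


def workspace_for(model_id: str, root: Path = Path("runs")) -> ModelWorkspace:
    # Turn the model id into a filesystem-safe folder name.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", model_id).strip("_")
    base = root / slug
    return ModelWorkspace(
        soul_file=base / "soul.md",
        memory_db=base / "memory.sqlite3",
        log_dir=base / "logs",
    )
```

Two different model identifiers map to two different folders, so neither memories nor logs can leak from one run into another.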

The harness supports multiple providers: Anthropic direct API, OpenRouter (200+ models), and LM Studio for local models. The reasoning_budget field caps thinking tokens for extended-reasoning models.
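A per-model configuration could be represented roughly as follows. Only reasoning_budget, max_concurrent, and think_interval are field names mentioned on this page; the provider keys, the model field, the default values, and the validation rules are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed provider keys for the three backends named above.
PROVIDERS = {"anthropic", "openrouter", "lmstudio"}


@dataclass
class ModelConfig:
    provider: str
    model: str
    reasoning_budget: Optional[int] = None  # None: no cap on thinking tokens
    max_concurrent: int = 2                 # assumed default
    think_interval: int = 10                # assumed default

    def __post_init__(self) -> None:
        if self.provider not in PROVIDERS:
            raise ValueError(f"unknown provider: {self.provider!r}")
        if self.reasoning_budget is not None and self.reasoning_budget < 0:
            raise ValueError("reasoning_budget must be non-negative")
        if self.max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
```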

Pure Vision — No Shortcuts

The agent has no access to game memory, no tile maps, no scripted paths. It sees pixels and presses buttons. This makes ChronoBench a true test of visual reasoning, spatial navigation, and long-horizon planning — the same capabilities that matter for real-world autonomous agents.