Version 1 · Archive

Methodology

This page reflects v1 scoring. See v3 methodology for current runs.

Checkpoints

Each benchmark run is scored against named checkpoints in the opening sequence of Chrono Trigger. Checkpoints are ordered. A run ends when the agent either reaches the final checkpoint or exhausts its cycle budget.

#  Checkpoint        Description
1  Left Bedroom      Crono leaves the bedroom (top floor of house)
2  Exited House      Crono exits the front door onto the overworld
3  Reached the Fair  Agent reaches Leene Square / Millennial Fair
4  Met Marle         Agent encounters Marle and retrieves her pendant
5  Reached Telepod   Agent reaches Lucca's telepod demonstration
6  Time Traveled     Agent enters the time gate (first time travel)

What Cycles Mean

A cycle is one complete perception-action loop: the agent captures a screenshot, analyzes the scene, plans its next move, and executes button presses. Each cycle involves at least two LLM calls (analysis + ego planning), plus periodic evaluator and superego calls.

The cycle budget caps how many cycles a run is allowed. When the budget is exhausted, the session ends and a final summary is written. Fewer cycles to reach a checkpoint means the model navigated more efficiently.
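
The interaction between the cycle budget and the ordered checkpoints can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: `run_session`, `step`, and `toy_step` are hypothetical names, the whole perception-action cycle (analysis, ego planning, evaluator and superego calls) is collapsed into one `step` callback, and the sketch assumes that only the next checkpoint in order can be confirmed.

```python
from typing import Callable, Optional

# Checkpoint names from the table above, in order.
CHECKPOINTS = [
    "Left Bedroom",
    "Exited House",
    "Reached the Fair",
    "Met Marle",
    "Reached Telepod",
    "Time Traveled",
]


def run_session(step: Callable[[int], Optional[str]], cycle_budget: int) -> dict:
    """Run cycles until the final checkpoint is reached or the budget runs out.

    `step` is a hypothetical stand-in for one full cycle. It returns the name
    of a checkpoint confirmed during that cycle, or None.
    """
    reached = {}      # checkpoint name -> first cycle at which it was confirmed
    next_index = 0    # assumption: only the next checkpoint in order can be hit
    cycles_used = 0

    for cycle in range(1, cycle_budget + 1):
        cycles_used = cycle
        confirmed = step(cycle)
        if confirmed == CHECKPOINTS[next_index]:
            reached[confirmed] = cycle
            next_index += 1
            if next_index == len(CHECKPOINTS):
                break  # final checkpoint reached, so the run ends early

    return {
        "cycles_used": cycles_used,
        "reached": reached,
        "completed": next_index == len(CHECKPOINTS),
    }


# Toy agent for illustration: confirms one checkpoint every 10 cycles.
def toy_step(cycle: int) -> Optional[str]:
    if cycle % 10 == 0 and cycle // 10 <= len(CHECKPOINTS):
        return CHECKPOINTS[cycle // 10 - 1]
    return None


full = run_session(toy_step, cycle_budget=100)    # finishes before the budget
partial = run_session(toy_step, cycle_budget=35)  # budget exhausted first
```

With the toy agent, `full` ends at cycle 60 with all six checkpoints reached, while `partial` uses all 35 cycles and stops after the third checkpoint.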

Metrics

Cycles per Checkpoint

The number of the first cycle in which each checkpoint was confirmed as reached. Lower is better: the model found its way there in fewer cycles.
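
As a rough sketch of how this metric could be derived from a run log, the snippet below keeps the earliest cycle for each checkpoint. The `(cycle, checkpoint_name)` event format and the function name are assumptions made for illustration; the actual log format is not described on this page.

```python
def first_cycle_per_checkpoint(events):
    """Return {checkpoint_name: first_cycle} from (cycle, name) confirmations.

    `events` is a hypothetical log format. Confirmations may repeat or arrive
    out of order, so the minimum cycle per checkpoint is kept.
    """
    first = {}
    for cycle, name in events:
        if name not in first or cycle < first[name]:
            first[name] = cycle
    return first


events = [
    (4, "Left Bedroom"),
    (9, "Exited House"),
    (11, "Exited House"),      # a repeat confirmation is ignored
    (31, "Reached the Fair"),
]
result = first_cycle_per_checkpoint(events)
```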

Stuck Count

Number of times the evaluator detected that the agent was making no progress. A high stuck count suggests the model struggles with spatial reasoning or gets caught in loops.

Ego Retries

Number of times the ego planner produced empty or malformed output and had to retry. This indicates how reliably the model follows the structured-output constraints.
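
One way such a retry counter could work is sketched below. The page does not specify the planner's output format, so this example assumes JSON objects; `plan_with_retries`, `call_model`, `max_retries`, and the `"buttons"` field are all hypothetical.

```python
import json


def plan_with_retries(call_model, max_retries=3):
    """Call the planner until it returns a JSON object, counting retries.

    `call_model` is a hypothetical zero-argument function returning raw text.
    Returns (plan, retries); plan is None if every attempt failed.
    """
    retries = 0
    while True:
        raw = call_model()
        try:
            # Empty or whitespace-only output counts as a failure.
            plan = json.loads(raw) if raw and raw.strip() else None
        except json.JSONDecodeError:
            plan = None  # malformed output
        if isinstance(plan, dict):
            return plan, retries
        if retries >= max_retries:
            return None, retries
        retries += 1


# Simulated planner: empty output, then malformed output, then a valid plan.
responses = iter(["", "{not json", '{"buttons": ["A"]}'])
plan, ego_retries = plan_with_retries(lambda: next(responses))
```

Here the first two responses are rejected, so `ego_retries` ends at 2 before the valid plan is accepted.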

Tokens In / Out

Total input and output tokens consumed across all LLM calls in the session. Combined with per-token model pricing, these give the estimated cost of a run.
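
The cost arithmetic can be illustrated with a short sketch. The `Pricing` class, the function name, and the prices used are made up for the example; real prices depend on the provider and the date of the run.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical price sheet, in USD per one million tokens."""
    usd_per_million_input: float
    usd_per_million_output: float


def run_cost_usd(tokens_in: int, tokens_out: int, pricing: Pricing) -> float:
    """Estimate the cost of a run from its token totals."""
    total = (
        tokens_in * pricing.usd_per_million_input
        + tokens_out * pricing.usd_per_million_output
    )
    return total / 1_000_000


# Illustrative prices only, not taken from any provider's price list.
example_pricing = Pricing(usd_per_million_input=3.0, usd_per_million_output=15.0)
cost = run_cost_usd(tokens_in=2_000_000, tokens_out=100_000, pricing=example_pricing)
```

With these made-up numbers, the input tokens contribute 6.00 USD and the output tokens 1.50 USD, for an estimated 7.50 USD.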

Avg Cycle Time

Mean wall-clock time per cycle in milliseconds. Includes all LLM latency, screenshot capture, and emulator interaction.
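
A minimal sketch of how per-cycle wall-clock time could be measured and averaged is shown below. `timed_cycles`, `run_cycle`, and `avg_cycle_time_ms` are hypothetical names; the benchmark's own timing code is not shown on this page.

```python
import time


def timed_cycles(run_cycle, n):
    """Run `run_cycle` n times and return each duration in milliseconds.

    `run_cycle` is a hypothetical stand-in for one full cycle, including
    LLM calls, screenshot capture, and emulator interaction.
    """
    durations_ms = []
    for _ in range(n):
        start = time.perf_counter()
        run_cycle()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def avg_cycle_time_ms(durations_ms):
    """Mean duration in milliseconds, or 0.0 for an empty session."""
    return sum(durations_ms) / len(durations_ms) if durations_ms else 0.0


measured = timed_cycles(lambda: None, 3)  # a no-op cycle, for illustration
average = avg_cycle_time_ms([100.0, 200.0, 300.0])
```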

Caveats

  • Save state starting point — all runs begin from the same save state (Crono in bed). This controls the starting condition but means results reflect this specific game segment, not general gameplay ability.
  • Stochastic output — LLM responses are non-deterministic. Two runs with the same model can produce different results. Multiple runs per model give a more reliable picture.
  • Provider differences — the same model accessed through different providers (e.g., Anthropic direct vs. OpenRouter) may behave differently due to quantization, routing, or API-layer differences.
  • Cost comparison — pricing varies significantly across providers and models. Cost-per-checkpoint is an estimate based on published token prices at the time of the run.
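
Because single runs are noisy, results from several runs of the same model can be combined. The sketch below shows one possible aggregation (how many runs reached a checkpoint, and the median cycle count among those that did). The data layout, the function name, and the choice of the median are assumptions, not the benchmark's documented procedure.

```python
from statistics import median


def summarize_runs(runs, checkpoint):
    """Summarize one checkpoint across several runs of the same model.

    `runs` is a hypothetical list of dicts mapping checkpoint name to the
    first cycle at which it was reached. Runs that never reached the
    checkpoint simply omit the key.
    """
    cycles = [run[checkpoint] for run in runs if checkpoint in run]
    return {
        "runs": len(runs),
        "reached": len(cycles),
        "median_cycles": median(cycles) if cycles else None,
    }


# Made-up results for three runs.
runs = [
    {"Left Bedroom": 4, "Exited House": 9},
    {"Left Bedroom": 6, "Exited House": 15},
    {"Left Bedroom": 5},  # this run never exited the house
]
summary = summarize_runs(runs, "Exited House")
```

Reporting the reach count alongside the median keeps failed runs visible instead of silently dropping them from the average.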

Supported Providers

Anthropic

Direct API access to Claude models

OpenRouter

200+ models from multiple providers through a unified API

LM Studio

Local model inference for offline benchmarking