Version 2 · Archive

Methodology

How v2 runs were scored. v2 is archived; see v3 methodology for current runs. The checkpoint set was unchanged from v1, but v2's cycle budget (200) and reasoning loop differed, so cycle-efficiency metrics aren't directly comparable across versions.

Why separate versions?

Each major harness change gets its own leaderboard rather than mixing incompatible data. v2 ran on a 200-cycle budget with event-driven Superego; v3 added evidence-gated checkpoint validation, two additional primary checkpoints, and an 11-checkpoint exploration track.


Checkpoints

Each run is scored against the same six checkpoints used in v1 — the opening sequence of Chrono Trigger. A run ends when the agent reaches the final checkpoint or exhausts its cycle budget.

# | Checkpoint | Description
1 | Left Bedroom | Crono leaves the bedroom (top floor of house)
2 | Exited House | Crono exits the front door onto the overworld
3 | Reached the Fair | Agent reaches Leene Square / Millennial Fair
4 | Met Marle | Agent encounters Marle and retrieves her pendant
5 | Reached Telepod | Agent reaches Lucca's telepod demonstration
6 | Time Traveled | Agent enters the time gate (first time travel)
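
The stopping rule described above (a run ends at the final checkpoint or when the cycle budget is spent) can be sketched as a simple loop. This is an illustration only, not the harness's actual code: `run_until_done` and the `step` callback are hypothetical names, and `step` stands in for one full cycle that reports a newly confirmed checkpoint number or `None`.

```python
from typing import Callable, Optional

FINAL_CHECKPOINT = 6  # "Time Traveled" in the checkpoint table
CYCLE_BUDGET = 200    # v2 cycle budget


def run_until_done(step: Callable[[int], Optional[int]],
                   budget: int = CYCLE_BUDGET) -> dict[int, int]:
    """Run cycles until the final checkpoint is confirmed or the budget is spent.

    `step(cycle)` is a hypothetical stand-in for one perception-action cycle.
    It returns the number of a checkpoint confirmed on that cycle, or None.
    The result maps each confirmed checkpoint to the first cycle it was seen.
    """
    confirmed: dict[int, int] = {}
    for cycle in range(1, budget + 1):
        checkpoint = step(cycle)
        # Record only the first confirmation of each checkpoint.
        if checkpoint is not None and checkpoint not in confirmed:
            confirmed[checkpoint] = cycle
        # Stop early once the final checkpoint is reached.
        if FINAL_CHECKPOINT in confirmed:
            break
    return confirmed
```

The returned mapping is the kind of data a per-checkpoint cycle metric could be read from; a run that exhausts the budget simply returns fewer than six entries.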

Cycle budget

A cycle is one full perception-action loop: screenshot, Phase 1 analysis, evaluator, optional Superego call, Ego planning, and action. The v2 budget is 200 cycles.

Fewer cycles to reach a checkpoint still means more efficient navigation, but a v2 run with more cycles than a v1 run isn't necessarily "worse" — the per-cycle work differs.

Metrics on the leaderboard

Cycles per Checkpoint

The cycle number at which each checkpoint was first confirmed. Lower = faster navigation.

Stuck Count

Cycles the evaluator flagged as making no progress. High counts suggest spatial-reasoning struggles or action loops.
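
As a rough sketch of how this count could be derived, assuming each cycle record carries a boolean set by the evaluator (the `no_progress` field name and the record layout are hypothetical, not taken from the harness):

```python
def stuck_count(cycles: list[dict]) -> int:
    """Count the cycles whose evaluator verdict was 'no progress'.

    Each record is assumed to have a boolean `no_progress` key; a missing
    key is treated as "progress was made".
    """
    return sum(1 for record in cycles if record.get("no_progress", False))
```

Because the count is not normalised, it is most meaningful when compared between runs that used a similar number of cycles.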

Skills used (v2)

Distinct procedural skills the model invoked during the run — a rough proxy for how much the skill library contributed.

Superego calls (v2)

Total strategic-layer directive calls during the run. The per-run page breaks this down by trigger reason (screen_change, stuck_streak, heartbeat, etc.).
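
The per-reason breakdown amounts to a tally over the trigger reason recorded for each call. A minimal sketch, assuming the run log exposes one reason string per Superego call (the function name and input shape are hypothetical):

```python
from collections import Counter


def superego_breakdown(trigger_reasons: list[str]) -> tuple[int, dict[str, int]]:
    """Return the total number of Superego calls and a count per trigger reason."""
    by_reason = Counter(trigger_reasons)
    return sum(by_reason.values()), dict(by_reason)
```

For example, `superego_breakdown(["screen_change", "heartbeat", "screen_change"])` yields a total of 3 with `screen_change` counted twice.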

Memory health (v2)

Average number of recalled topics and average top-topic share per cycle. Fragmentation (too many topics) and collapse (one dominating topic) are both bad signals.
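
One plausible way to compute these two averages is sketched below. It assumes each cycle's recall is available as a list of topic labels, one per recalled item, that "top-topic share" means the fraction of recalled items belonging to the most common topic, and that cycles with no recall are skipped. All three are assumptions; the actual definitions may differ.

```python
from collections import Counter
from statistics import mean


def memory_health(recalls_per_cycle: list[list[str]]) -> tuple[float, float]:
    """Return (average distinct-topic count, average top-topic share).

    Each inner list holds the topic label of every item recalled in one cycle.
    Cycles with an empty recall are ignored (an assumption of this sketch).
    """
    topic_counts = []
    top_shares = []
    for topics in recalls_per_cycle:
        if not topics:
            continue
        counts = Counter(topics)
        topic_counts.append(len(counts))                       # distinct topics
        top_shares.append(max(counts.values()) / len(topics))  # dominant topic's share
    if not topic_counts:
        return 0.0, 0.0
    return mean(topic_counts), mean(top_shares)
```

Under these definitions, a high average topic count points toward fragmentation, while a top-topic share close to 1.0 points toward collapse.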

Tokens In / Out

Total tokens across all LLM calls in the session. Combined with pricing to compute run cost.
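
The cost arithmetic can be sketched as follows, assuming prices are quoted in US dollars per million tokens with separate input and output rates. The function names are hypothetical, and dividing run cost by the number of checkpoints reached is only one reasonable reading of cost-per-checkpoint.

```python
from typing import Optional


def run_cost_usd(tokens_in: int, tokens_out: int,
                 price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Estimate run cost from token totals and per-million-token prices."""
    return (tokens_in / 1_000_000 * price_in_per_mtok
            + tokens_out / 1_000_000 * price_out_per_mtok)


def cost_per_checkpoint(cost_usd: float, checkpoints_reached: int) -> Optional[float]:
    """Divide run cost by checkpoints reached; None if no checkpoint was reached."""
    if checkpoints_reached == 0:
        return None
    return cost_usd / checkpoints_reached
```

The prices passed in should be the published rates at the time of the run, so recomputing with current prices can give a different figure for the same token totals.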

Avg Cycle Time

Mean wall-clock milliseconds per cycle, including LLM latency, screenshot capture, and emulator interaction.

Caveats

  • Save state starting point — all runs begin from the same save state (Crono in bed). Results reflect this specific opening segment, not general gameplay ability.
  • Stochastic output — LLM responses are non-deterministic. Two runs with the same model can produce different results; multiple runs give a more reliable picture.
  • Provider differences — the same model served through different providers (Anthropic direct vs. OpenRouter) can behave differently due to quantization, routing, or API-layer variance.
  • Cost comparison — pricing varies by provider and moves over time. Cost-per-checkpoint is an estimate from published token prices at the time of the run.
  • Cross-version comparisons — checkpoint progress is comparable across v1 and v2; cycle counts and cost efficiency are not.

Supported providers

Anthropic

Direct API access to Claude models

OpenRouter

200+ models from multiple providers through a unified API

LM Studio

Local model inference for offline benchmarking