Methodology
How v2 runs were scored. v2 is archived; see v3 methodology for current runs. The checkpoint set was unchanged from v1, but v2's cycle budget (200) and reasoning loop differed, so cycle-efficiency metrics aren't directly comparable across versions.
Why separate versions?
Each major harness change gets its own leaderboard rather than mixing incompatible data. v2 ran on a 200-cycle budget with event-driven Superego; v3 added evidence-gated checkpoint validation, two additional primary checkpoints, and an 11-checkpoint exploration track.
Checkpoints
Each run is scored against the same six checkpoints used in v1 — the opening sequence of Chrono Trigger. A run ends when the agent reaches the final checkpoint or exhausts its cycle budget.
| # | Checkpoint | Description |
|---|---|---|
| 1 | Left Bedroom | Crono leaves the bedroom (top floor of house) |
| 2 | Exited House | Crono exits the front door onto the overworld |
| 3 | Reached the Fair | Agent reaches Leene Square / Millennial Fair |
| 4 | Met Marle | Agent encounters Marle and retrieves her pendant |
| 5 | Reached Telepod | Agent reaches Lucca's telepod demonstration |
| 6 | Time Traveled | Agent enters the time gate (first time travel) |
Cycle budget
A cycle is one full perception-action loop: screenshot, Phase 1 analysis, evaluator, optional Superego call, Ego planning, action. The v2 budget is 200 cycles.
Fewer cycles to reach a checkpoint still means more efficient navigation, but a v2 run that uses more cycles than a v1 run isn't necessarily "worse", because the work done per cycle differs between versions.
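The termination rule (stop at the final checkpoint or when the budget is exhausted) can be sketched as below. This is a minimal illustration, not the harness's actual code: `run_episode`, `RunResult`, and the `step` callback are hypothetical names, and `step` stands in for one full perception-action loop that reports which checkpoint, if any, was confirmed on that cycle.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

CYCLE_BUDGET = 200     # v2 budget stated above
FINAL_CHECKPOINT = 6   # "Time Traveled" in the checkpoint table


@dataclass
class RunResult:
    cycles_used: int
    # checkpoint number -> first cycle at which it was confirmed
    checkpoint_cycles: Dict[int, int] = field(default_factory=dict)


def run_episode(step: Callable[[int], Optional[int]],
                budget: int = CYCLE_BUDGET,
                final_checkpoint: int = FINAL_CHECKPOINT) -> RunResult:
    """Run cycles until the final checkpoint is confirmed or the budget runs out.

    `step` is a hypothetical stand-in for one perception-action loop; it
    returns the checkpoint number confirmed on that cycle, or None.
    """
    result = RunResult(cycles_used=0)
    for cycle in range(1, budget + 1):
        result.cycles_used = cycle
        confirmed = step(cycle)
        # Record only the first cycle at which each checkpoint is confirmed.
        if confirmed is not None and confirmed not in result.checkpoint_cycles:
            result.checkpoint_cycles[confirmed] = cycle
        if confirmed == final_checkpoint:
            break
    return result
```

A run that never confirms a checkpoint uses the full 200 cycles; a run that confirms checkpoint 6 stops on that cycle.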
Metrics on the leaderboard
Cycles per Checkpoint
The first cycle at which each checkpoint was confirmed. Lower = faster navigation.
Stuck Count
Number of cycles the evaluator flagged as making no progress. High counts suggest spatial-reasoning struggles or action loops.
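As a rough sketch, both metrics can be derived from a per-cycle log. The record shape here (`cycle`, `checkpoint`, `stuck`) is an assumption made for illustration, not the harness's actual log format.

```python
from typing import Dict, List, Optional, TypedDict


class CycleRecord(TypedDict):
    # Hypothetical log record; field names are illustrative only.
    cycle: int
    checkpoint: Optional[int]  # checkpoint confirmed on this cycle, if any
    stuck: bool                # True if the evaluator flagged no progress


def cycles_per_checkpoint(log: List[CycleRecord]) -> Dict[int, int]:
    """Map each checkpoint to the first cycle at which it was confirmed."""
    first_seen: Dict[int, int] = {}
    for rec in log:
        cp = rec["checkpoint"]
        if cp is not None and cp not in first_seen:
            first_seen[cp] = rec["cycle"]
    return first_seen


def stuck_count(log: List[CycleRecord]) -> int:
    """Count cycles the evaluator flagged as making no progress."""
    return sum(1 for rec in log if rec["stuck"])
```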
Skills used (v2)
Distinct procedural skills the model invoked during the run — a rough proxy for how much the skill library contributed.
Superego calls (v2)
Total strategic-layer directive calls during the run. The per-run page breaks this down by trigger reason (screen_change, stuck_streak, heartbeat, etc.).
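The total and the per-trigger breakdown could be tallied along these lines. This sketch assumes each Superego call is logged with a single trigger-reason string; `superego_breakdown` is a hypothetical helper, and only the trigger names quoted above come from the source.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def superego_breakdown(trigger_reasons: Iterable[str]) -> Tuple[int, Dict[str, int]]:
    """Return (total calls, calls per trigger reason).

    Assumes one trigger-reason string per Superego call.
    """
    counts = Counter(trigger_reasons)
    return sum(counts.values()), dict(counts)
```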
Memory health (v2)
Average recall topic count and top-topic share per cycle. Both fragmentation (too many topics) and collapse (one topic dominating) are bad signals.
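One plausible reading of these two numbers is sketched below. The exact definitions are assumptions: each cycle's recall is treated as a list of topic labels (one per recalled item), "topic count" as the number of distinct labels, and "top-topic share" as the fraction of recalled items carrying the most common label. Cycles with no recall are skipped.

```python
from collections import Counter
from statistics import mean
from typing import List, Tuple


def memory_health(recalls: List[List[str]]) -> Tuple[float, float]:
    """Return (average distinct-topic count, average top-topic share).

    `recalls` holds one list of topic labels per cycle (assumed format).
    """
    topic_counts = []
    top_shares = []
    for topics in recalls:
        if not topics:
            continue  # no recall this cycle; nothing to measure
        counts = Counter(topics)
        topic_counts.append(len(counts))
        # Share of recalled items that belong to the most common topic.
        top_shares.append(counts.most_common(1)[0][1] / len(topics))
    if not topic_counts:
        return 0.0, 0.0
    return mean(topic_counts), mean(top_shares)
```

Under this reading, a high average topic count points toward fragmentation, and a top-topic share near 1.0 points toward collapse.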
Tokens In / Out
Total tokens across all LLM calls in the session. Combined with pricing to compute run cost.
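The cost arithmetic can be sketched as follows, assuming prices are quoted per million tokens with separate input and output rates. The function names are hypothetical, and any prices passed in are the caller's own figures, not values from this leaderboard.

```python
from typing import Optional


def run_cost(tokens_in: int, tokens_out: int,
             price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Estimate run cost from token totals and per-million-token prices."""
    return (tokens_in * price_in_per_mtok
            + tokens_out * price_out_per_mtok) / 1_000_000


def cost_per_checkpoint(cost: float, checkpoints_reached: int) -> Optional[float]:
    """Divide run cost by checkpoints reached; None if none were reached."""
    if checkpoints_reached == 0:
        return None
    return cost / checkpoints_reached
```

For example, with placeholder prices of 3.0 per million input tokens and 15.0 per million output tokens, a run using 2,000,000 input and 500,000 output tokens comes to 6.0 + 7.5 = 13.5; across three checkpoints that is 4.5 per checkpoint.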
Avg Cycle Time
Mean wall-clock milliseconds per cycle, including LLM latency, screenshot capture, and emulator interaction.
Caveats
- Save state starting point — all runs begin from the same save state (Crono in bed). Results reflect this specific opening segment, not general gameplay ability.
- Stochastic output — LLM responses are non-deterministic. Two runs with the same model can produce different results; multiple runs give a more reliable picture.
- Provider differences — the same model served through different providers (Anthropic direct vs. OpenRouter) can behave differently due to quantization, routing, or API-layer variance.
- Cost comparison — pricing varies by provider and moves over time. Cost-per-checkpoint is an estimate from published token prices at the time of the run.
- Cross-version comparisons — checkpoint progress is comparable across v1 and v2; cycle counts and cost efficiency are not.
Supported providers
Anthropic
Direct API access to Claude models
OpenRouter
200+ models from multiple providers through a unified API
LM Studio
Local model inference for offline benchmarking