Methodology
How v3 runs are scored. The checkpoint set has been expanded and split into two tiers (primary and secondary), and an evidence-validation gate now controls when a checkpoint can be claimed. The cycle budget is unchanged from v2 (200 cycles).
Why v3?
The harness added an evidence-validation framework, two new intermediate primary checkpoints, and an 11-checkpoint exploration track. Rank semantics differ enough that mixing v2 and v3 runs on one leaderboard would be misleading. v2 is frozen; all new runs land here.
Primary checkpoints
Primary checkpoints are the main-story beats. The furthest primary reached drives the leaderboard rank; cycles-to-checkpoint is the tiebreaker.
| # | Checkpoint | ID |
|---|---|---|
| 1 | Left Bedroom | bedroom_exit |
| 2 | Exited House | house_exit |
| 3 | Reached the Fair | fair_entered |
| 4 | Met Marle | marle_met |
| 5 | Telepod Demo Announced | lucca_demo_announced |
| 6 | Passed Candy Gate | marle_candy_gate_passed |
| 7 | Reached Telepod | telepod_reached |
| 8 | Time Traveled | 1000ad_left |
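The ranking rule above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: the `Run` record, its `confirmed` field, and the sample runs are hypothetical, and the tiebreaker is assumed to mean "the cycle at which the furthest primary checkpoint was confirmed".

```python
from dataclasses import dataclass

# Primary checkpoint IDs in story order, copied from the table above.
PRIMARY_ORDER = [
    "bedroom_exit", "house_exit", "fair_entered", "marle_met",
    "lucca_demo_announced", "marle_candy_gate_passed",
    "telepod_reached", "1000ad_left",
]

@dataclass
class Run:  # hypothetical record shape, for illustration only
    model: str
    confirmed: dict[str, int]  # checkpoint ID -> first cycle it was confirmed

def rank_key(run):
    """Sort key: furthest primary first, then fewer cycles to reach it."""
    reached = [i for i, cid in enumerate(PRIMARY_ORDER) if cid in run.confirmed]
    if not reached:
        return (0, 0)  # runs with no primary checkpoint sort last
    furthest = max(reached)
    # Negate the position so that sorting ascending puts the furthest run first.
    return (-(furthest + 1), run.confirmed[PRIMARY_ORDER[furthest]])

# Invented sample data: model-a and model-b tie on the furthest checkpoint.
runs = [
    Run("model-a", {"bedroom_exit": 4, "house_exit": 11, "fair_entered": 30}),
    Run("model-b", {"bedroom_exit": 6, "house_exit": 15, "fair_entered": 24}),
    Run("model-c", {"bedroom_exit": 3, "house_exit": 9}),
]
leaderboard = sorted(runs, key=rank_key)
print([r.model for r in leaderboard])  # ['model-b', 'model-a', 'model-c']
```

Here model-b outranks model-a because both reached the fair, but model-b got there in fewer cycles.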
Secondary checkpoints (exploration)
Secondary checkpoints reward models that engage with the game world beyond the critical path — fair activities, NPC interactions, optional dialog. They're shown on each run page and counted on the leaderboard, but they do not affect primary rank.
Evidence validation
A checkpoint is only recorded as reached when a deterministic validator confirms matching evidence in the accumulated dialogue and screen history. The Superego proposes flips based on its strategic view; the validator decides whether the game actually showed the required dialogue or scene.
When the validator rejects a flip, the checkpoint stays closed and the rejection is counted. The run page shows the per-checkpoint rejection breakdown when present.
Evidence validation rate = accepted / (accepted + rejected). A clean run hovers near 1.0; a run that over-claims progress drops visibly.
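As a rough sketch of how such a gate and its rate could be tracked, the class below accepts a proposed flip only when every required phrase appears in the accumulated history. The `EvidenceGate` name, the phrase-matching rule, and the sample dialogue are assumptions made for illustration; the real validator's rules are not described here.

```python
from collections import Counter

class EvidenceGate:
    """Tracks accepted and rejected checkpoint flips for one run."""

    def __init__(self, required_evidence):
        # Hypothetical rule format: checkpoint ID -> phrases that must all
        # appear somewhere in the dialogue and screen history.
        self.required_evidence = required_evidence
        self.reached = set()
        self.accepted = 0
        self.rejections = Counter()  # per-checkpoint rejection breakdown

    def propose(self, checkpoint_id, history):
        """Accept the flip only if every required phrase is in the history."""
        text = "\n".join(history).lower()
        phrases = self.required_evidence[checkpoint_id]
        if all(p.lower() in text for p in phrases):
            self.reached.add(checkpoint_id)
            self.accepted += 1
            return True
        self.rejections[checkpoint_id] += 1  # checkpoint stays closed
        return False

    def validation_rate(self):
        """accepted / (accepted + rejected), or None if nothing was proposed."""
        rejected = sum(self.rejections.values())
        total = self.accepted + rejected
        return self.accepted / total if total else None

# Placeholder rules and history, not the harness's real evidence strings.
gate = EvidenceGate({"fair_entered": ["millennial fair"], "marle_met": ["marle"]})
history = ["Welcome to the Millennial Fair!"]
early = gate.propose("marle_met", history)    # rejected: no matching evidence yet
fair = gate.propose("fair_entered", history)  # accepted
rate = gate.validation_rate()                 # 1 accepted / 2 proposals = 0.5
```

Because the check is a pure function of the history, the same history always yields the same decision, which is what makes the gate deterministic.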
Cycle budget
A cycle is one perception-action loop: screenshot, Phase 1 analysis, evaluator, optional Superego call, Ego plan, actions. v3 keeps the 200-cycle budget from v2.
v3 removes the dedicated reflection cycle, so per-cycle work is slightly lighter than v2 on average. Cost comparisons across versions are not apples-to-apples.
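The cycle described in this section could be outlined roughly as follows. Everything here is a hypothetical skeleton: the step names, the `needs_superego` flag that triggers the optional Superego call, and the early-exit signal are assumptions, and the stub functions stand in for the real harness components.

```python
def run_episode(steps, budget=200):
    """Run perception-action cycles until the budget is spent or the run ends.

    `steps` maps hypothetical step names to callables; returns cycles used.
    """
    for cycle in range(1, budget + 1):
        frame = steps["screenshot"]()
        analysis = steps["analyze"](frame)       # Phase 1 analysis
        verdict = steps["evaluate"](analysis)    # evaluator
        directive = None
        if verdict.get("needs_superego"):        # Superego call is optional
            directive = steps["superego"](analysis)
        plan = steps["ego_plan"](analysis, directive)
        if steps["act"](plan):                   # assumed: True means run is over
            return cycle
    return budget

# Stub components so the skeleton can be exercised without a game or a model.
stub_steps = {
    "screenshot": lambda: "frame",
    "analyze": lambda frame: {"frame": frame},
    "evaluate": lambda analysis: {"needs_superego": False},
    "superego": lambda analysis: "directive",
    "ego_plan": lambda analysis, directive: ["press_a"],
    "act": lambda plan: False,  # never signals completion
}
cycles_used = run_episode(stub_steps)  # exhausts the full 200-cycle budget
```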
Metrics on the leaderboard
Last primary
Furthest primary checkpoint reached. Drives rank.
Cycles per primary checkpoint
The first cycle at which each primary checkpoint was confirmed. Lower = faster navigation.
Secondary reached (v3)
Number of exploration checkpoints reached out of 10.
Evidence rejections (v3)
Count of quest-state flips the validator rejected. High counts suggest a model over-claiming progress.
Stuck count
Number of cycles the evaluator flagged as making no progress.
Skills · Superego calls
Number of distinct procedural skills injected, and the total number of Superego directive calls. Both are carried over from v2.
Tokens · Cost · Avg cycle
Same accounting as v2. Combined with pricing to estimate per-run cost.
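A per-run cost estimate of this kind usually reduces to token counts multiplied by per-token prices. The sketch below assumes separate input and output token totals and prices quoted per million tokens; the function name and all figures are invented for illustration and are not actual pricing or benchmark results.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_mtok, price_out_per_mtok):
    """Estimated cost given token counts and prices per million tokens."""
    total = input_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok
    return total / 1_000_000

# Invented figures: 2.4M input tokens at 3.00 and 150k output tokens at 15.00.
cost = estimate_cost(2_400_000, 150_000, 3.00, 15.00)
print(f"{cost:.2f}")  # 9.45
```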
Caveats
- Save state starting point — every run begins from Crono in bed. Results reflect this specific opening segment.
- Stochastic output — LLM responses are non-deterministic; two runs of the same model can diverge.
- Provider differences — the same model via different providers (Anthropic direct vs. OpenRouter) can behave differently.
- Cross-version comparisons — checkpoints bedroom_exit, house_exit, fair_entered, marle_met, telepod_reached, and 1000ad_left exist in every version, so "how far did this model get" is comparable back to v1. Cycle counts and costs are not directly comparable across versions.
- Evidence rate interpretation — a low validation rate isn't always the model's fault; sometimes the Superego proposes on thin evidence. Treat it as directional.
Supported providers
Anthropic
Direct API access to Claude models
OpenRouter
200+ models from multiple providers through a unified API
LM Studio
Local model inference for offline benchmarking