Version 3 · Current

Methodology

This page describes how v3 runs are scored. The checkpoint set has been expanded and split into two tiers (primary and secondary), and an evidence-validation gate now controls when a checkpoint can be claimed. The cycle budget is unchanged from v2 (200 cycles).

Why v3?

The harness added an evidence-validation framework, two new intermediate primary checkpoints (lucca_demo_announced and marle_candy_gate_passed), and a 10-checkpoint exploration track. Rank semantics are different enough that mixing v2 and v3 runs would mislead. v2 is frozen; all new runs land here.


Primary checkpoints

Primary checkpoints are the main-story beats. The furthest primary reached drives the leaderboard rank; cycles-to-checkpoint is the tiebreaker.

No.  Checkpoint              ID
1    Left Bedroom            bedroom_exit
2    Exited House            house_exit
3    Reached the Fair        fair_entered
4    Met Marle               marle_met
5    Telepod Demo Announced  lucca_demo_announced
6    Passed Candy Gate       marle_candy_gate_passed
7    Reached Telepod         telepod_reached
8    Time Traveled           1000ad_left
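
The ranking rule can be sketched as a sort key. This is a minimal illustration, not the harness's actual code: the Run type, its field names, and the reading of "cycles-to-checkpoint" as the first cycle at which the furthest primary was confirmed are all assumptions.

```python
from dataclasses import dataclass

# Primary checkpoint IDs in story order, taken from the table above.
PRIMARY_ORDER = [
    "bedroom_exit", "house_exit", "fair_entered", "marle_met",
    "lucca_demo_announced", "marle_candy_gate_passed",
    "telepod_reached", "1000ad_left",
]


@dataclass
class Run:
    """Hypothetical run record; field names are illustrative only."""
    model: str
    confirmed: dict  # checkpoint ID -> first cycle at which it was confirmed


def rank_key(run):
    """Sort key: furthest primary first, then fewer cycles to reach it."""
    reached = [i for i, cp in enumerate(PRIMARY_ORDER) if cp in run.confirmed]
    if not reached:
        return (0, 0)  # no primary reached: sorts after every other run
    furthest = max(reached)
    # Negate the checkpoint number so that an ascending sort puts the
    # furthest-progressing runs at the top.
    return (-(furthest + 1), run.confirmed[PRIMARY_ORDER[furthest]])


def leaderboard(runs):
    """Return runs ordered best to worst under the assumed rule."""
    return sorted(runs, key=rank_key)
```

Under these assumptions, two runs that both stop at house_exit are ordered by which confirmed it in fewer cycles, and both rank above a run that only left the bedroom.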

Secondary checkpoints (exploration)

Secondary checkpoints reward models that engage with the game world beyond the critical path — fair activities, NPC interactions, optional dialog. They're shown on each run page and counted on the leaderboard, but they do not affect primary rank.

sec_allowance Collected allowance
sec_gato Defeated Gato
sec_soda Bought soda
sec_race_bet Placed race bet
sec_melchior Talked to Melchior
sec_cat_returned Returned the cat
sec_lunch_eaten Ate lunch
sec_pendant_sold Sold the pendant (red herring)
sec_bekkler_lab Visited Bekkler's Lab
sec_wait_battle_mode Switched to Wait battle mode

Evidence validation

A checkpoint is recorded as reached only when a deterministic validator confirms matching evidence in the accumulated dialogue and screen history. The Superego proposes flips (marking a checkpoint as reached) based on its strategic view; the validator decides whether the game actually showed the required dialogue or scene.

When the validator rejects a flip, the checkpoint stays closed and the rejection is counted. The run page shows the per-checkpoint rejection breakdown when present.
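
The gate described above might look roughly like the following. This is a sketch under stated assumptions: the class, its method names, and the keyword patterns are hypothetical, and the harness's real evidence rules are not documented on this page.

```python
import re
from collections import Counter

# Hypothetical evidence patterns, for illustration only. The real
# validator's rules are not given in this document.
EVIDENCE_PATTERNS = {
    "marle_met": re.compile(r"\bmarle\b", re.IGNORECASE),
    "sec_soda": re.compile(r"\bsoda\b", re.IGNORECASE),
}


class EvidenceValidator:
    """Deterministic gate between proposed flips and recorded checkpoints."""

    def __init__(self):
        self.history = []            # accumulated dialogue and screen text
        self.reached = set()         # checkpoints confirmed so far
        self.rejections = Counter()  # per-checkpoint rejection counts

    def observe(self, text):
        """Append one piece of dialogue or screen text to the history."""
        self.history.append(text)

    def propose(self, checkpoint_id):
        """Return True if the flip is accepted, False if it is rejected."""
        if checkpoint_id in self.reached:
            return True
        pattern = EVIDENCE_PATTERNS.get(checkpoint_id)
        if pattern is not None and any(pattern.search(t) for t in self.history):
            self.reached.add(checkpoint_id)
            return True
        # No matching evidence: the checkpoint stays closed and the
        # rejection is counted.
        self.rejections[checkpoint_id] += 1
        return False
```

Because the check is a pure function of the accumulated history, the same history always yields the same decision, which is what makes the gate deterministic even though the proposals come from an LLM.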

Evidence validation rate = accepted / (accepted + rejected). A clean run hovers near 1.0; a run that over-claims progress drops visibly.
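
As a worked version of the formula, here is a small helper. The function name is illustrative, and returning None when no flips were proposed is an assumption; the page does not say how that case is displayed.

```python
from typing import Optional


def evidence_validation_rate(accepted: int, rejected: int) -> Optional[float]:
    """Compute accepted / (accepted + rejected).

    Returns None when nothing was proposed (assumed convention), since
    the ratio is undefined in that case.
    """
    total = accepted + rejected
    if total == 0:
        return None
    return accepted / total
```

For example, a run with 9 accepted flips and 1 rejected flip has a rate of 0.9, while a run with 5 accepted and 5 rejected sits at 0.5.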

Cycle budget

A cycle is one perception-action loop: screenshot, Phase 1 analysis, evaluator, optional Superego call, Ego plan, actions. v3 keeps the 200-cycle budget from v2.

v3 removes the dedicated reflection cycle, so per-cycle work is slightly lighter than v2 on average. Cost comparisons across versions are not apples-to-apples.
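
The loop and its budget can be sketched as below. Every callable here is a hypothetical stand-in for a harness component, and the rule for when the optional Superego call fires (only when the evaluator reports no progress) is an assumption made for illustration.

```python
from typing import Callable, Optional

CYCLE_BUDGET = 200  # v3 budget, unchanged from v2


def run_cycles(
    capture: Callable[[], str],                   # take a screenshot
    analyze: Callable[[str], str],                # Phase 1 analysis
    evaluate: Callable[[str], bool],              # True if progress was made
    plan: Callable[[str, Optional[str]], list],   # Ego plan
    act: Callable[[list], None],                  # send actions to the game
    superego: Optional[Callable[[str], str]] = None,
    is_done: Callable[[], bool] = lambda: False,
    budget: int = CYCLE_BUDGET,
) -> dict:
    """Run perception-action cycles until the budget or goal is reached."""
    cycles = 0
    stuck = 0
    for cycle in range(1, budget + 1):
        cycles = cycle
        screen = capture()
        analysis = analyze(screen)
        progressing = evaluate(analysis)
        if not progressing:
            stuck += 1  # contributes to the stuck count
        # Assumed trigger: consult the Superego only when stuck.
        directive = None
        if superego is not None and not progressing:
            directive = superego(analysis)
        act(plan(analysis, directive))
        if is_done():
            break
    return {"cycles": cycles, "stuck": stuck}
```

Passing the components as callables keeps the sketch runnable with simple stubs and makes the fixed order of steps within a cycle explicit.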

Metrics on the leaderboard

Last primary

Furthest primary checkpoint reached. Drives rank.

Cycles per primary checkpoint

The first cycle at which each primary checkpoint was confirmed. Lower means faster navigation.

Secondary reached (v3)

Number of exploration checkpoints reached, out of 10.

Evidence rejections (v3)

Number of quest-state flips the validator rejected. A high count suggests over-claimed progress.

Stuck count

Number of cycles the evaluator flagged as making no progress.

Skills · Superego calls

Number of distinct procedural skills injected, and total number of Superego directive calls. Both are carried over from v2.

Tokens · Cost · Avg cycle

Same accounting as v2. Token counts are combined with provider pricing to estimate per-run cost.
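
A cost estimate of this kind usually reduces to simple arithmetic. The sketch below assumes separate per-million-token prices for input and output tokens; the function name and the prices in the usage note are placeholders, not figures from the leaderboard.

```python
def estimate_run_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate a run's cost in USD from token counts and pricing.

    Assumes pricing is quoted per one million tokens, with separate
    rates for input and output.
    """
    total = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    )
    return total / 1_000_000
```

With placeholder prices of 3.00 USD per million input tokens and 15.00 USD per million output tokens, a run that used 2,000,000 input tokens and 500,000 output tokens would be estimated at 13.50 USD.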

Caveats

  • Save state starting point — every run begins from Crono in bed. Results reflect this specific opening segment.
  • Stochastic output — LLM responses are non-deterministic; two runs of the same model can diverge.
  • Provider differences — the same model via different providers (Anthropic direct vs. OpenRouter) can behave differently.
  • Cross-version comparisons — checkpoints bedroom_exit, house_exit, fair_entered, marle_met, telepod_reached, and 1000ad_left exist in every version, so "how far did this model get" is comparable back to v1. Cycle counts and costs are not directly comparable across versions.
  • Evidence rate interpretation — a low validation rate isn't always the model's fault; sometimes the Superego proposes on thin evidence. Treat it as directional.

Supported providers

Anthropic

Direct API access to Claude models

OpenRouter

200+ models from multiple providers through a unified API

LM Studio

Local model inference for offline benchmarking