ChronoBench

Measuring how far different language models can progress in Chrono Trigger using an autonomous vision-based game agent.

The harness has changed enough between releases that runs aren't directly comparable. Each major version has its own leaderboard rather than mixing incompatible data.

The primary checkpoints, from bedroom-exit through time-traveled, are the same across all versions, so "how far did this model get" compares cleanly. v3 adds two intermediate primaries and a secondary exploration track. Cycle counts, cost efficiency, and evidence-rejection metrics are version-specific and should not be compared across versions.
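
As a rough illustration of why progress compares cleanly while other metrics do not, the sketch below ranks runs from different harness versions by the furthest shared primary checkpoint they reached. It is not the harness's actual code: the `Run` class, the `furthest_shared` helper, the model names, and the checkpoint names `hypothetical-midpoint` and `v3-extra-primary` are all assumptions made up for this example. Only `bedroom-exit` and `time-traveled` come from the text above.

```python
from dataclasses import dataclass

# Shared primary checkpoints in progression order.
# Only the first and last names appear in this README; the middle entry
# is a hypothetical placeholder standing in for the real list.
SHARED_PRIMARIES = ["bedroom-exit", "hypothetical-midpoint", "time-traveled"]


@dataclass
class Run:
    """One benchmark run (hypothetical record shape, not the harness schema)."""
    model: str
    harness_version: str
    reached: set  # names of every checkpoint the run hit


def furthest_shared(run: Run) -> int:
    """Return the 1-based rank of the furthest shared primary reached, or 0."""
    rank = 0
    for i, name in enumerate(SHARED_PRIMARIES, start=1):
        if name in run.reached:
            rank = i
    return rank


runs = [
    Run("model-a", "v2", {"bedroom-exit"}),
    # The v3-only checkpoint is ignored because it is not in the shared list.
    Run("model-b", "v3", {"bedroom-exit", "v3-extra-primary",
                          "hypothetical-midpoint", "time-traveled"}),
]

# Rank across versions using only the shared checkpoints.
ranked = sorted(runs, key=furthest_shared, reverse=True)
for run in ranked:
    print(run.model, run.harness_version, furthest_shared(run))
```

Version-specific checkpoints and metrics are deliberately left out of the sort key, which mirrors the rule that only the shared primaries are comparable across leaderboards.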