ChronoBench
Measuring how far different language models can progress in Chrono Trigger using an autonomous vision-based game agent.
The harness has changed enough between releases that runs from different versions aren't directly comparable. Each major version publishes its own leaderboard rather than mixing incompatible data.
Version 3
Evidence-gated checkpoint validation, 8 primary + 10 secondary checkpoints, Superego-only memory writes, per-(agent, model) skill scoping. Exploration is first-class.
View v3 leaderboard →
Version 2
Skill library, CORE.md curated notes, asynchronous memory consolidation, event-driven Superego. 200-cycle budget. Frozen when the harness moved to evidence-gated checkpoints.
View v2 archive →
Version 1
The original benchmark. Three-layer Freudian agent, fixed-interval Superego, single-tier planning, 100-cycle budget. Frozen for reference.
View v1 archive →
The primary checkpoints from bedroom-exit through time-traveled are shared by all versions, so "how far did this model get" compares cleanly across them. v3 adds two intermediate primaries and a secondary exploration track. Cycle counts, cost efficiency, and evidence-rejection metrics are version-specific.
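As a rough illustration of that cross-version comparison, the sketch below reduces a run to the furthest shared primary checkpoint it reached and ignores anything version-specific. It is not the harness's actual code. Only the names bedroom-exit and time-traveled come from the text above; the middle entries are hypothetical placeholders, and the count of six shared primaries is an assumption inferred from v3's eight primaries minus the two it adds.

```python
from typing import Iterable, Optional

# Ordered list of primaries assumed to be shared by v1, v2 and v3.
# Only the first and last names appear in the text; the middle names are
# hypothetical placeholders, not the benchmark's real checkpoint ids.
SHARED_PRIMARIES = [
    "bedroom-exit",
    "shared-checkpoint-2",
    "shared-checkpoint-3",
    "shared-checkpoint-4",
    "shared-checkpoint-5",
    "time-traveled",
]


def furthest_shared(reached: Iterable[str]) -> Optional[str]:
    """Return the latest shared primary in `reached`, or None if there is none.

    Checkpoints that are not in SHARED_PRIMARIES (for example v3-only
    intermediates or secondary exploration checkpoints) are ignored.
    """
    reached_set = set(reached)
    furthest = None
    for name in SHARED_PRIMARIES:
        if name in reached_set:
            furthest = name  # later entries overwrite earlier ones
    return furthest


# Hypothetical runs from two harness versions, compared on the shared scale.
v1_run = ["bedroom-exit", "shared-checkpoint-2"]
v3_run = ["bedroom-exit", "v3-only-intermediate", "shared-checkpoint-2", "shared-checkpoint-3"]
print(furthest_shared(v1_run), furthest_shared(v3_run))
```

Comparing only on the shared list keeps the "how far did it get" question meaningful across versions, while cycle counts and the other version-specific metrics stay on their own leaderboards.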