ChronoBench

Measuring how far different language models can progress in Chrono Trigger using an autonomous vision-based game agent.

The harness has changed enough between major versions that runs are not directly comparable across them. Each major version therefore publishes its own leaderboard rather than mixing incompatible data.

v3 · Current · 10 runs

Version 3

Evidence-gated checkpoint validation, 8 primary + 10 secondary checkpoints, Superego-only memory writes, per-(agent, model) skill scoping. Exploration is first-class.

View v3 leaderboard →

v2 · Archive · 6 runs

Version 2

Skill library, CORE.md curated notes, asynchronous memory consolidation, event-driven Superego. 200-cycle budget. Frozen when the harness moved to evidence-gated checkpoints.

View v2 archive →

v1 · Archive · 17 runs

Version 1

The original benchmark. Three-layer Freudian agent, fixed-interval Superego, single-tier planning, 100-cycle budget. Frozen for reference.

View v1 archive →

The primary checkpoints from bedroom-exit through time-traveled are shared by all versions, so "how far did this model get" compares cleanly across them. v3 adds two intermediate primaries and a secondary exploration track on top of that shared set. Cycle counts, cost efficiency, and evidence-rejection metrics are version-specific.
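One way to picture the cross-version comparison is as a lookup against the ordered list of shared primaries. The sketch below is illustrative only and is not the harness's actual code. The function names and the two middle checkpoint names are hypothetical placeholders; only bedroom-exit and time-traveled come from the text above. It also assumes "furthest" means the latest shared checkpoint a run reached, regardless of whether earlier ones were recorded.

```python
from typing import Iterable, Optional

# Shared primary checkpoints in progression order.
# Only the first and last names appear in the text; the middle
# entries are hypothetical placeholders for the real list.
SHARED_PRIMARIES = [
    "bedroom-exit",
    "shared-checkpoint-2",  # hypothetical placeholder
    "shared-checkpoint-3",  # hypothetical placeholder
    "time-traveled",
]


def furthest_shared(reached: Iterable[str]) -> Optional[str]:
    """Return the latest shared primary checkpoint in `reached`, or None.

    Version-specific checkpoints (for example v3's extra primaries or
    secondary exploration checkpoints) are ignored, so the result is
    comparable across versions.
    """
    reached_set = set(reached)
    furthest = None
    for name in SHARED_PRIMARIES:
        if name in reached_set:
            furthest = name
    return furthest


def progress_rank(reached: Iterable[str]) -> int:
    """Return the 1-based position of the furthest shared checkpoint, 0 if none."""
    name = furthest_shared(reached)
    return 0 if name is None else SHARED_PRIMARIES.index(name) + 1


if __name__ == "__main__":
    # Hypothetical runs from different harness versions.
    runs = {
        "v1-run": ["bedroom-exit"],
        "v3-run": ["bedroom-exit", "v3-only-primary", "shared-checkpoint-2"],
    }
    for run_id, reached in runs.items():
        print(run_id, furthest_shared(reached), progress_rank(reached))
```

Ranking on the shared list alone keeps v3's additional checkpoints from inflating a run's position relative to v1 or v2 runs.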