Version 3 · Current
ChronoBench
Measuring how far different language models can progress in Chrono Trigger using a vision-based agent with evidence-gated checkpoints and a first-class exploration track.
Last updated: 2026-05-10 · 200-cycle budget · 8 primary + 10 secondary checkpoints.
Leaderboard
| Model | Provider | Last Checkpoint | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Cycles | Stuck | Skills | Superego | Secondary | Evidence rejected | Tokens In | ms/cycle | Est. Cost | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/gemini-3-flash-preview | OpenRouter | 1000ad_left | 14 | 45 | 50 | 76 | 94 | 104 | 107 | 182 | 200* | 33 | 8 | 34 | 2/10 | — | 4,553,617 | 12,228 | $2.44 | 2026-04-20 |
| google/gemini-3-flash-preview | OpenRouter | telepod_reached | 14 | 29 | 70 | 82 | 176 | 185 | 188 | — | 200* | 27 | 1 | 30 | 1/10 | — | 4,787,830 | 13,034 | $2.59 | 2026-04-19 |
| google/gemma-4-26b-a4b | Local | marle_met | 23 | 40 | 52 | 136 | — | — | — | — | 200* | 39 | 1 | 40 | 1/10 | — | 2,787,833 | 26,434 | Free | 2026-04-20 |
| google/gemma-4-e4b | Local | fair_entered | 19 | 34 | 119 | — | — | — | — | — | 200* | 118 | 2 | 30 | — | — | 2,931,298 | 22,651 | Free | 2026-04-20 |
| x-ai/grok-4.3 | OpenRouter | fair_entered | 18 | 18 | 117 | — | — | — | — | — | 200* | 30 | — | 33 | 1/10 | — | 2,422,871 | 34,489 | $3.64 | 2026-05-09 |
| openai/gpt-5.4-nano | Local | house_exit | 48 | 58 | 134 | — | — | — | — | — | 200* | 102 | 4 | 37 | — | — | 2,490,333 | 16,049 | Free | 2026-04-19 |
| qwen/qwen3.6-35b-a3b | OpenRouter | house_exit | — | 5 | — | — | — | — | — | — | 200* | 13 | — | 20 | — | — | 2,537,849 | 41,733 | $0.00 | 2026-04-20 |
| google/gemini-3.1-flash-lite | OpenRouter | house_exit | 18 | 33 | — | — | — | — | — | — | 200* | 25 | — | 54 | 1/10 | — | 4,657,715 | 11,588 | $1.24 | 2026-05-08 |
| mistralai/mistral-medium-3-5 | OpenRouter | house_exit | 18 | 18 | — | — | — | — | — | — | 200* | 18 | 1 | 22 | 1/10 | — | 2,593,600 | 12,474 | $4.27 | 2026-05-09 |
| moonshotai/kimi-k2.6 | OpenRouter | — | — | — | — | — | — | — | — | — | 12 | 0 | — | 3 | — | — | 123,714 | 151,848 | $0.18 | 2026-04-20 |
* = cycle budget exhausted
Cycles per checkpoint
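As a rough illustration of how a cycles-per-checkpoint view could be derived from the leaderboard, the sketch below parses one row and computes the cycles spent between consecutive primary checkpoints. It assumes that columns 1 to 8 hold the cycle at which each primary checkpoint was first confirmed (the page states this explicitly only for secondary checkpoints). The helper names `parse_cell` and `checkpoint_deltas` are hypothetical and not part of ChronoBench.

```python
from typing import List, Optional

# One leaderboard row, copied from the table above.
ROW = ("| google/gemini-3-flash-preview | OpenRouter | 1000ad_left "
       "| 14 | 45 | 50 | 76 | 94 | 104 | 107 | 182 | 200* | 33 | 8 | 34 "
       "| 2/10 | — | 4,553,617 | 12,228 | $2.44 | 2026-04-20 |")


def parse_cell(cell: str) -> Optional[int]:
    """Turn a numeric cell into an int; '—' (not reached) becomes None."""
    text = cell.strip().rstrip("*").replace(",", "")
    return None if text in ("", "—") else int(text)


def checkpoint_deltas(cycles: List[Optional[int]]) -> List[Optional[int]]:
    """Cycles elapsed since the previously reached checkpoint (assumed reading)."""
    deltas: List[Optional[int]] = []
    previous = 0
    for cycle in cycles:
        if cycle is None:
            deltas.append(None)
        else:
            deltas.append(cycle - previous)
            previous = cycle
    return deltas


cells = [c.strip() for c in ROW.strip().strip("|").split("|")]
primary = [parse_cell(c) for c in cells[3:11]]   # columns 1..8
deltas = checkpoint_deltas(primary)
budget_exhausted = cells[11].endswith("*")       # '*' marks an exhausted cycle budget

print(deltas)            # [14, 31, 5, 26, 18, 10, 3, 75]
print(budget_exhausted)  # True
```

Under this reading, the top run spent most of its budget (75 cycles) between checkpoints 7 and 8.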
Estimated cost per checkpoint
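The page does not define how cost per checkpoint is computed. One plausible reading, shown here purely as an assumption, divides a run's estimated cost by the number of primary checkpoints it reached. The function names below are hypothetical.

```python
from typing import List, Optional


def parse_cost(cell: str) -> float:
    """Convert an 'Est. Cost' cell to dollars; 'Free' is treated as 0.0."""
    text = cell.strip()
    return 0.0 if text == "Free" else float(text.lstrip("$"))


def cost_per_checkpoint(est_cost: str, checkpoint_cells: List[str]) -> Optional[float]:
    """Assumed definition: estimated cost / primary checkpoints reached."""
    reached = sum(1 for c in checkpoint_cells if c.strip() != "—")
    if reached == 0:
        return None  # undefined when no checkpoint was reached
    return parse_cost(est_cost) / reached


# Values copied from the leaderboard rows above.
flash_best = cost_per_checkpoint(
    "$2.44", ["14", "45", "50", "76", "94", "104", "107", "182"]
)
kimi = cost_per_checkpoint("$0.18", ["—"] * 8)

print(round(flash_best, 3))  # 0.305
print(kimi)                  # None
```

Returning `None` for runs with no checkpoints avoids reporting a misleading zero or dividing by zero.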
Exploration track
Secondary checkpoints reached by each model's best run. These reward engaging with the world beyond the critical path and do not affect primary rank.
| Model | allowance | gato | soda | race_bet | melchior | cat_returned | lunch_eaten | pendant_sold | bekkler_lab | wait_battle_mode | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| google/gemini-3-flash-preview | 25 | — | — | — | — | — | — | — | — | 1 | 2/10 |
| google/gemma-4-26b-a4b | — | — | — | — | 81 | — | — | — | — | — | 1/10 |
| google/gemma-4-e4b | — | — | — | — | — | — | — | — | — | — | 0/10 |
| x-ai/grok-4.3 | 54 | — | — | — | — | — | — | — | — | — | 1/10 |
| openai/gpt-5.4-nano | — | — | — | — | — | — | — | — | — | — | 0/10 |
| qwen/qwen3.6-35b-a3b | — | — | — | — | — | — | — | — | — | — | 0/10 |
| google/gemini-3.1-flash-lite | 26 | — | — | — | — | — | — | — | — | — | 1/10 |
| mistralai/mistral-medium-3-5 | 137 | — | — | — | — | — | — | — | — | — | 1/10 |
| moonshotai/kimi-k2.6 | — | — | — | — | — | — | — | — | — | — | 0/10 |
Cells show the first cycle each secondary checkpoint was confirmed.
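The Total column can be reproduced from the cells themselves by counting the secondary checkpoints that have a confirmation cycle. The following is a minimal sketch of that count, using one row copied from the table; the function name `summarize` is hypothetical.

```python
from typing import Dict, Tuple

# Secondary checkpoint names, in the column order of the exploration table.
SECONDARY = [
    "allowance", "gato", "soda", "race_bet", "melchior", "cat_returned",
    "lunch_eaten", "pendant_sold", "bekkler_lab", "wait_battle_mode",
]


def summarize(row: str) -> Tuple[str, Dict[str, int], str]:
    """Return (model, {checkpoint: first confirmed cycle}, 'k/10' total)."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    model = cells[0]
    marks = cells[1:1 + len(SECONDARY)]
    confirmed = {
        name: int(cell) for name, cell in zip(SECONDARY, marks) if cell != "—"
    }
    return model, confirmed, f"{len(confirmed)}/{len(SECONDARY)}"


row = "| google/gemini-3-flash-preview | 25 | — | — | — | — | — | — | — | — | 1 | 2/10 |"
model, confirmed, total = summarize(row)

print(confirmed)  # {'allowance': 25, 'wait_battle_mode': 1}
print(total)      # 2/10
```

Because these totals are computed independently of the primary checkpoint columns, they can be reported alongside the leaderboard without changing primary rank.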