WeirdBench

Benchmark leaderboard

World State Tracking

Track entities, ownership, attributes, reversals, and conditional updates in a small simulated world, then answer exact final-state queries. Higher is better.

| Rank | Model | Score |
|---:|---|---:|
| 1 | qwen/qwen3.5-27b | 1.0000 |
| 2 | stepfun/step-3.5-flash:free | 1.0000 |
| 3 | anthropic/claude-opus-4.5 | 0.9844 |
| 4 | anthropic/claude-opus-4.6 | 0.9844 |
| 5 | anthropic/claude-sonnet-4.6 | 0.9844 |
| 6 | google/gemini-3.1-pro-preview | 0.9844 |
| 7 | meta-llama/llama-4-maverick | 0.9844 |
| 8 | minimax/minimax-m2.5 | 0.9844 |
| 9 | moonshotai/kimi-k2.5 | 0.9844 |
| 10 | openai/gpt-oss-120b | 0.9844 |
| 11 | openai/gpt-oss-20b | 0.9844 |
| 12 | qwen/qwen3.5-397b-a17b | 0.9844 |
| 13 | x-ai/grok-4.1-fast | 0.9844 |
| 14 | z-ai/glm-5 | 0.9844 |
| 15 | anthropic/claude-sonnet-4.5 | 0.9688 |
| 16 | inception/mercury-2 | 0.9688 |
| 17 | openai/gpt-5.3-chat | 0.9688 |
| 18 | openai/gpt-5.3-codex | 0.9688 |
| 19 | qwen/qwen3.5-122b-a10b | 0.9688 |
| 20 | minimax/minimax-m2.7 | 0.9531 |
| 21 | z-ai/glm-5-turbo | 0.9531 |
| 22 | google/gemini-3-flash-preview | 0.9375 |
| 23 | openai/gpt-5.4 | 0.9219 |
| 24 | anthropic/claude-haiku-4.5 | 0.9063 |
| 25 | openai/gpt-5.1 | 0.9063 |
| 26 | mistralai/mistral-medium-3.1 | 0.8750 |
| 27 | amazon/nova-pro-v1 | 0.8594 |
| 28 | mistralai/mistral-large-2512 | 0.8594 |
| 29 | openai/gpt-5.4-mini | 0.8594 |
| 30 | x-ai/grok-4.20-beta | 0.8594 |
| 31 | google/gemini-3.1-flash-lite-preview | 0.8438 |
| 32 | meta-llama/llama-4-scout | 0.8438 |
| 33 | amazon/nova-lite-v1 | 0.8281 |
| 34 | amazon/nova-2-lite-v1 | 0.8125 |
| 35 | deepseek/deepseek-v3.2 | 0.8125 |
| 36 | xiaomi/mimo-v2-pro | 0.7969 |
| 37 | mistralai/mistral-small-2603 | 0.7500 |
| 38 | amazon/nova-micro-v1 | 0.5938 |

Methodology

How scoring works

Generate a small formal world with entities, locations, ownership, reversible actions, and derived rule triggers; sample legal action sequences from a simulator; then score exact final-state query accuracy.
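To make the case-generation idea concrete, here is a minimal sketch of a simulated world with ownership, a reversible action, and an undo stack. All class and method names (`World`, `give`, `undo`, `query`) are illustrative assumptions, not the benchmark's actual code.

```python
# Hypothetical sketch of the simulated world: item ownership plus an
# undo stack so actions are reversible. Names are assumptions.
class World:
    def __init__(self, owners):
        # owners maps item -> owning entity, e.g. {"key": "alice"}
        self.owners = dict(owners)
        self.history = []  # undo stack of (item, previous_owner)

    def give(self, item, new_owner):
        # Record the previous owner so the action can be undone.
        self.history.append((item, self.owners.get(item)))
        self.owners[item] = new_owner

    def undo(self):
        # Reverse the most recent action.
        item, prev = self.history.pop()
        if prev is None:
            del self.owners[item]  # item did not exist before
        else:
            self.owners[item] = prev

    def query(self, item):
        # Final-state query: who owns the item now?
        return self.owners.get(item)

w = World({"key": "alice"})
w.give("key", "bob")
w.give("coin", "carol")
w.undo()  # reverses the coin transfer only
assert w.query("key") == "bob"
assert w.query("coin") is None
```

Sampling legal action sequences from such a simulator guarantees the gold final state is always computable exactly.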

Prompt

Given an initial world state, the world rules, and a sequence of actions, undo operations, and conditional events, the model must answer final-state queries and return only JSON.
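The following sketch shows the rough shape of a case and the JSON-only reply a model is expected to produce. The field names (`initial_state`, `actions`, `queries`, `answers`) are assumptions for illustration, not the benchmark's actual schema.

```python
import json

# Illustrative case shape; field names are hypothetical.
case = {
    "initial_state": {"key": "alice"},
    "actions": ["give key to bob", "undo last action"],
    "queries": ["Who owns key in the final state?"],
}

# The undo reverses the transfer, so the final owner is alice again.
# The model must return only this JSON, with no surrounding prose.
answer = json.dumps({"answers": ["alice"]})
```

Requiring JSON-only output keeps parsing deterministic and makes exact-match scoring unambiguous.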

Score

Higher is better. The main score is exact query accuracy across all generated queries; the exact full-case match rate is tracked separately in metadata.
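A minimal sketch of that scoring rule, assuming each case carries a list of predicted and gold answers (the `score` helper is hypothetical):

```python
def score(cases):
    """cases: list of (predicted_answers, gold_answers) pairs."""
    total = correct = full_matches = 0
    for pred, gold in cases:
        hits = sum(p == g for p, g in zip(pred, gold))
        correct += hits                      # per-query exact matches
        total += len(gold)
        full_matches += hits == len(gold)    # every query in the case right
    # main score: exact query accuracy; secondary: full-case match rate
    return correct / total, full_matches / len(cases)

acc, full_rate = score([
    (["alice", "bob"], ["alice", "bob"]),    # both queries right
    (["alice", "bob"], ["alice", "carol"]),  # one of two right
])
# acc = 3/4 = 0.75, full_rate = 1/2 = 0.5
```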

Execution

Cases are generated and solved locally by a simulator; model outputs are parsed with retries and provider fallbacks; final scores are cached in Neon, keyed by benchmark and model ID.
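The retry step can be sketched as follows. The `generate` callable is a hypothetical stand-in for a model/provider call; the benchmark's real interface is not shown here.

```python
import json

def parse_with_retries(generate, max_retries=3):
    """Call generate() until its output parses as JSON, up to max_retries.

    generate is a hypothetical zero-argument callable standing in for a
    model call; on exhaustion, return None so the caller can fall back
    to another provider.
    """
    for _ in range(max_retries):
        raw = generate()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output; retry
    return None

# Simulate a model that answers with prose first, then valid JSON.
attempts = iter(["Sure! The answer is alice.", '{"answers": ["alice"]}'])
result = parse_with_retries(lambda: next(attempts))
assert result == {"answers": ["alice"]}
```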