WeirdBench

Benchmark leaderboard

World State Tracking

Track entities, ownership, attributes, reversals, and conditional updates in a small simulated world, then answer exact final-state queries. Higher is better.

| Rank | Model | Score |
|---:|---|---:|
| 1 | qwen/qwen3.5-27b | 1.0000 |
| 2 | stepfun/step-3.5-flash:free | 1.0000 |
| 3 | anthropic/claude-opus-4.5 | 0.9844 |
| 4 | anthropic/claude-opus-4.6 | 0.9844 |
| 5 | anthropic/claude-sonnet-4.6 | 0.9844 |
| 6 | google/gemini-3.1-pro-preview | 0.9844 |
| 7 | meta-llama/llama-4-maverick | 0.9844 |
| 8 | minimax/minimax-m2.5 | 0.9844 |
| 9 | moonshotai/kimi-k2.5 | 0.9844 |
| 10 | openai/gpt-oss-120b | 0.9844 |
| 11 | openai/gpt-oss-20b | 0.9844 |
| 12 | qwen/qwen3.5-397b-a17b | 0.9844 |
| 13 | x-ai/grok-4.1-fast | 0.9844 |
| 14 | z-ai/glm-5 | 0.9844 |
| 15 | anthropic/claude-sonnet-4.5 | 0.9688 |
| 16 | inception/mercury-2 | 0.9688 |
| 17 | openai/gpt-5.3-chat | 0.9688 |
| 18 | openai/gpt-5.3-codex | 0.9688 |
| 19 | qwen/qwen3.5-122b-a10b | 0.9688 |
| 20 | minimax/minimax-m2.7 | 0.9531 |
| 21 | z-ai/glm-5-turbo | 0.9531 |
| 22 | google/gemini-3-flash-preview | 0.9375 |
| 23 | openai/gpt-5.4 | 0.9219 |
| 24 | anthropic/claude-haiku-4.5 | 0.9063 |
| 25 | openai/gpt-5.1 | 0.9063 |
| 26 | mistralai/mistral-medium-3.1 | 0.8750 |
| 27 | amazon/nova-pro-v1 | 0.8594 |
| 28 | mistralai/mistral-large-2512 | 0.8594 |
| 29 | openai/gpt-5.4-mini | 0.8594 |
| 30 | x-ai/grok-4.20-beta | 0.8594 |
| 31 | google/gemini-3.1-flash-lite-preview | 0.8438 |
| 32 | meta-llama/llama-4-scout | 0.8438 |
| 33 | amazon/nova-lite-v1 | 0.8281 |
| 34 | amazon/nova-2-lite-v1 | 0.8125 |
| 35 | deepseek/deepseek-v3.2 | 0.8125 |
| 36 | xiaomi/mimo-v2-pro | 0.7969 |
| 37 | mistralai/mistral-small-2603 | 0.7500 |
| 38 | amazon/nova-micro-v1 | 0.5938 |

Methodology

How scoring works

Generate a small formal world with entities, locations, ownership, reversible actions, and derived rule triggers; sample legal action sequences from a simulator; then score exact final-state query accuracy.
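To make the case-generation idea concrete, here is a minimal sketch of a simulated world with ownership, a reversible action, and an undo stack. All class and method names (`World`, `give`, `undo`, `query`) are illustrative assumptions, not the benchmark's actual code.

```python
# Hypothetical sketch of the simulated world: item ownership plus an
# undo stack so actions are reversible. Names are assumptions.
class World:
    def __init__(self, owners):
        # owners maps item -> owning entity, e.g. {"key": "alice"}
        self.owners = dict(owners)
        self.history = []  # undo stack of (item, previous_owner)

    def give(self, item, new_owner):
        # Record the previous owner so the action can be undone.
        self.history.append((item, self.owners.get(item)))
        self.owners[item] = new_owner

    def undo(self):
        # Reverse the most recent action.
        item, prev = self.history.pop()
        if prev is None:
            del self.owners[item]  # item did not exist before
        else:
            self.owners[item] = prev

    def query(self, item):
        # Final-state query: who owns the item now?
        return self.owners.get(item)

w = World({"key": "alice"})
w.give("key", "bob")
w.give("coin", "carol")
w.undo()  # reverses the coin transfer only
assert w.query("key") == "bob"
assert w.query("coin") is None
```

Sampling legal action sequences from such a simulator guarantees the gold final state is always computable exactly.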

Prompt

Given an initial world state, the world rules, and a sequence of actions, undo operations, and conditional events, the model must answer final-state queries and return only JSON.
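The following sketch shows the rough shape of a case and the JSON-only reply a model is expected to produce. The field names (`initial_state`, `actions`, `queries`, `answers`) are assumptions for illustration, not the benchmark's actual schema.

```python
import json

# Illustrative case shape; field names are hypothetical.
case = {
    "initial_state": {"key": "alice"},
    "actions": ["give key to bob", "undo last action"],
    "queries": ["Who owns key in the final state?"],
}

# The undo reverses the transfer, so the final owner is alice again.
# The model must return only this JSON, with no surrounding prose.
answer = json.dumps({"answers": ["alice"]})
```

Requiring JSON-only output keeps parsing deterministic and makes exact-match scoring unambiguous.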

Score

Higher is better. The main score is exact query accuracy across all generated queries; the exact full-case match rate is tracked separately in metadata.
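A minimal sketch of that scoring rule, assuming each case carries a list of predicted and gold answers (the `score` helper is hypothetical):

```python
def score(cases):
    """cases: list of (predicted_answers, gold_answers) pairs."""
    total = correct = full_matches = 0
    for pred, gold in cases:
        hits = sum(p == g for p, g in zip(pred, gold))
        correct += hits                      # per-query exact matches
        total += len(gold)
        full_matches += hits == len(gold)    # every query in the case right
    # main score: exact query accuracy; secondary: full-case match rate
    return correct / total, full_matches / len(cases)

acc, full_rate = score([
    (["alice", "bob"], ["alice", "bob"]),    # both queries right
    (["alice", "bob"], ["alice", "carol"]),  # one of two right
])
# acc = 3/4 = 0.75, full_rate = 1/2 = 0.5
```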

Execution

Cases are generated and solved locally by a simulator; model outputs are parsed with retries and provider fallbacks; final scores are cached in Neon, keyed by benchmark and model ID.
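The retry step can be sketched as follows. The `generate` callable is a hypothetical stand-in for a model/provider call; the benchmark's real interface is not shown here.

```python
import json

def parse_with_retries(generate, max_retries=3):
    """Call generate() until its output parses as JSON, up to max_retries.

    generate is a hypothetical zero-argument callable standing in for a
    model call; on exhaustion, return None so the caller can fall back
    to another provider.
    """
    for _ in range(max_retries):
        raw = generate()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output; retry
    return None

# Simulate a model that answers with prose first, then valid JSON.
attempts = iter(["Sure! The answer is alice.", '{"answers": ["alice"]}'])
result = parse_with_retries(lambda: next(attempts))
assert result == {"answers": ["alice"]}
```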