qwen/qwen3.5-27b
WeirdBench
Benchmark leaderboard
World State Tracking
Track entities, ownership, attributes, reversals, and conditional updates in a small simulated world, then answer exact final-state queries. Higher is better.
stepfun/step-3.5-flash:free
anthropic/claude-opus-4.5
anthropic/claude-opus-4.6
anthropic/claude-sonnet-4.6
google/gemini-3.1-pro-preview
meta-llama/llama-4-maverick
minimax/minimax-m2.5
moonshotai/kimi-k2.5
openai/gpt-oss-120b
openai/gpt-oss-20b
qwen/qwen3.5-397b-a17b
x-ai/grok-4.1-fast
z-ai/glm-5
anthropic/claude-sonnet-4.5
inception/mercury-2
openai/gpt-5.3-chat
openai/gpt-5.3-codex
qwen/qwen3.5-122b-a10b
minimax/minimax-m2.7
z-ai/glm-5-turbo
google/gemini-3-flash-preview
openai/gpt-5.4
anthropic/claude-haiku-4.5
openai/gpt-5.1
mistralai/mistral-medium-3.1
amazon/nova-pro-v1
mistralai/mistral-large-2512
openai/gpt-5.4-mini
x-ai/grok-4.20-beta
google/gemini-3.1-flash-lite-preview
meta-llama/llama-4-scout
amazon/nova-lite-v1
amazon/nova-2-lite-v1
deepseek/deepseek-v3.2
xiaomi/mimo-v2-pro
mistralai/mistral-small-2603
amazon/nova-micro-v1
Methodology
How scoring works
Each case is a small formal world with entities, locations, ownership, reversible actions, and derived rule triggers. A simulator samples a legal action sequence for that world, and the model is scored on exact accuracy over queries about the final state.
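To make the idea concrete, here is a minimal sketch of such a world. It is not the benchmark's actual simulator: the `World` class, the `give`/`move`/`undo` actions, and the "vault" rule are hypothetical names chosen for illustration, and random sampling of legal action sequences is left out.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Toy world: each item has an owner and a location (all names hypothetical)."""
    owner: dict[str, str]
    location: dict[str, str]
    history: list[tuple[dict[str, str], dict[str, str]]] = field(default_factory=list)

    def _snapshot(self) -> None:
        # Save copies of the state so the next action can be reversed later.
        self.history.append((dict(self.owner), dict(self.location)))

    def give(self, item: str, new_owner: str) -> None:
        self._snapshot()
        self.owner[item] = new_owner
        self._apply_rules()

    def move(self, item: str, place: str) -> None:
        self._snapshot()
        self.location[item] = place
        self._apply_rules()

    def undo(self) -> None:
        # Restore the state from before the most recent action, if there is one.
        if self.history:
            self.owner, self.location = self.history.pop()

    def _apply_rules(self) -> None:
        # Assumed derived rule: anything located in the "vault" is owned by "bank".
        for item, place in self.location.items():
            if place == "vault":
                self.owner[item] = "bank"


world = World(owner={"key": "alice", "coin": "bob"},
              location={"key": "hall", "coin": "hall"})
world.give("key", "bob")      # key: alice -> bob
world.move("coin", "vault")   # rule fires: coin is now owned by bank
world.undo()                  # reverses the move and the rule's effect
world.move("key", "vault")    # rule fires: key is now owned by bank

print({"owner": world.owner, "location": world.location})
```

After the undo, the coin is back in the hall and owned by bob, while the key ends up in the vault and owned by the bank. A final-state query such as "who owns the key?" is answered from that last state only.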
Prompt
Given an initial world state, world rules, and a sequence of actions, undo operations, and conditional events, answer final-state queries and return only JSON.
Score
Higher is better. The main score is exact-match accuracy over all generated queries. The exact full-case match rate, meaning the share of cases in which every query is answered correctly, is tracked in metadata.
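The two numbers can be computed as in the sketch below. It assumes each case is a pair of dictionaries mapping a query ID to an answer; that layout and the name `score_cases` are assumptions for illustration, not the benchmark's actual data format.

```python
def score_cases(cases):
    """cases: list of (expected, predicted) dicts mapping query ID -> answer."""
    total_queries = 0
    correct_queries = 0
    full_matches = 0
    for expected, predicted in cases:
        # A query counts only if the predicted answer equals the expected one exactly.
        hits = sum(1 for q, answer in expected.items() if predicted.get(q) == answer)
        correct_queries += hits
        total_queries += len(expected)
        if hits == len(expected):
            full_matches += 1
    return {
        "query_accuracy": correct_queries / total_queries if total_queries else 0.0,
        "full_case_match_rate": full_matches / len(cases) if cases else 0.0,
    }


cases = [
    ({"q1": "bank", "q2": "vault"}, {"q1": "bank", "q2": "vault"}),  # all correct
    ({"q1": "bob", "q2": "hall"}, {"q1": "bob", "q2": "vault"}),     # one wrong
]
result = score_cases(cases)
print(result)  # {'query_accuracy': 0.75, 'full_case_match_rate': 0.5}
```

Query accuracy gives partial credit within a case, whereas the full-case rate only rewards cases with no mistakes, so the second number is never higher than the first.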
Execution
Cases are generated and solved locally by a simulator. Model outputs are parsed with retries and provider fallbacks, and final scores are cached in Neon, keyed by benchmark and model ID.
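One plausible shape for the parsing step is sketched below, using only the Python standard library. The function names, the retry count, and the idea of passing providers as plain callables are assumptions. The benchmark's real client code and its Neon caching are not shown here.

```python
import json


def parse_json_answer(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


def ask_with_retries(providers, prompt, max_attempts=2):
    """Try each provider in order, retrying when the output is not parseable."""
    for call_model in providers:
        for _ in range(max_attempts):
            try:
                answer = parse_json_answer(call_model(prompt))
            except Exception:
                answer = None  # treat a provider error like an unparseable reply
            if answer is not None:
                return answer
    return None


# Stub providers standing in for real model endpoints.
def chatty_provider(prompt):
    return "Sorry, I cannot answer that."


def good_provider(prompt):
    return 'Here is the result: {"q1": "bank"}'


print(ask_with_retries([chatty_provider, good_provider], "example prompt"))
# {'q1': 'bank'}
```

Falling back to the next provider only after the retries are exhausted keeps the common case cheap while still tolerating a provider that wraps its JSON in extra text or fails outright.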