WeirdBench
Consolidated leaderboard
index
A single ranking across every WeirdBench benchmark. Raw scores are first converted into benchmark-local scores relative to each benchmark's leader, so small raw gaps stay small and lower-is-better and higher-is-better benchmarks can share one table. Models are ordered by final index (average benchmark score × coverage ratio), not by the displayed average alone.
anthropic/claude-opus-4.6
Coverage 5/5 · Avg benchmark score 91.5
openai/gpt-5.5
Coverage 5/5 · Avg benchmark score 89.7
openai/gpt-5.3-chat
Coverage 5/5 · Avg benchmark score 88.4
anthropic/claude-sonnet-4.5
Coverage 5/5 · Avg benchmark score 86.5
anthropic/claude-opus-4.7
Coverage 5/5 · Avg benchmark score 86.0
google/gemini-3-flash-preview
Coverage 5/5 · Avg benchmark score 81.1
anthropic/claude-opus-4.1
Coverage 5/5 · Avg benchmark score 80.0
openai/gpt-5.1
Coverage 5/5 · Avg benchmark score 79.1
anthropic/claude-haiku-4.5
Coverage 5/5 · Avg benchmark score 76.9
openai/gpt-5.4
Coverage 5/5 · Avg benchmark score 76.8
moonshotai/kimi-k2.6
Coverage 5/5 · Avg benchmark score 76.2
openai/gpt-oss-120b
Coverage 5/5 · Avg benchmark score 75.5
google/gemini-3.1-flash-lite-preview
Coverage 5/5 · Avg benchmark score 75.0
mistralai/mistral-small-2603
Coverage 5/5 · Avg benchmark score 74.5
moonshotai/kimi-k2.5
Coverage 5/5 · Avg benchmark score 74.4
meta-llama/llama-4-maverick
Coverage 5/5 · Avg benchmark score 73.2
anthropic/claude-opus-4.5
Coverage 5/5 · Avg benchmark score 72.3
openai/gpt-5.3-codex
Coverage 4/5 · Avg benchmark score 90.2
mistralai/mistral-large-2512
Coverage 5/5 · Avg benchmark score 71.9
mistralai/mistral-medium-3.1
Coverage 5/5 · Avg benchmark score 70.7
x-ai/grok-4.1-fast
Coverage 4/5 · Avg benchmark score 87.7
google/gemma-4-26b-a4b-it
Coverage 5/5 · Avg benchmark score 69.9
anthropic/claude-sonnet-4.6
Coverage 4/5 · Avg benchmark score 85.9
inception/mercury-2
Coverage 5/5 · Avg benchmark score 68.0
openai/gpt-5.4-mini
Coverage 5/5 · Avg benchmark score 67.1
amazon/nova-micro-v1
Coverage 5/5 · Avg benchmark score 65.0
deepseek/deepseek-v3.2
Coverage 5/5 · Avg benchmark score 63.5
openai/gpt-oss-20b
Coverage 4/5 · Avg benchmark score 78.6
google/gemma-4-31b-it
Coverage 5/5 · Avg benchmark score 62.6
amazon/nova-lite-v1
Coverage 5/5 · Avg benchmark score 60.0
z-ai/glm-5
Coverage 4/5 · Avg benchmark score 74.8
stepfun/step-3.5-flash:free
Coverage 4/5 · Avg benchmark score 73.7
minimax/minimax-m2.7
Coverage 4/5 · Avg benchmark score 72.4
amazon/nova-2-lite-v1
Coverage 5/5 · Avg benchmark score 56.8
qwen/qwen3.5-122b-a10b
Coverage 3/5 · Avg benchmark score 94.0
xiaomi/mimo-v2-pro
Coverage 4/5 · Avg benchmark score 70.3
x-ai/grok-4.20-beta
Coverage 5/5 · Avg benchmark score 56.2
minimax/minimax-m2.5
Coverage 5/5 · Avg benchmark score 55.5
amazon/nova-pro-v1
Coverage 5/5 · Avg benchmark score 55.1
google/gemini-3.1-pro-preview
Coverage 4/5 · Avg benchmark score 67.0
qwen/qwen3.5-397b-a17b
Coverage 3/5 · Avg benchmark score 73.8
meta-llama/llama-4-scout
Coverage 5/5 · Avg benchmark score 39.6
z-ai/glm-5-turbo
Coverage 2/5 · Avg benchmark score 93.4
qwen/qwen3.5-27b
Coverage 2/5 · Avg benchmark score 62.3
Methodology
Each benchmark is normalized relative to its current best raw score instead of being flattened into leaderboard positions. That means ties stay tied, near-ties stay near-ties, and lower-is-better benchmarks still compare cleanly with higher-is-better ones.
Normalization
Higher-is-better benchmarks are scored relative to the leader. Lower-is-better benchmarks use the inverse ratio to the leader, with a safe fallback if scores cross zero.
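The normalization rule above can be sketched as follows. This is a minimal illustration, not the leaderboard's actual code: the function name `normalize`, the 0-100 scale, and in particular the min-max fallback for non-positive scores are assumptions, since the page does not say what its "safe fallback" is.

```python
def normalize(raw_scores, higher_is_better=True):
    """Return benchmark-local scores (0-100) relative to the leader.

    raw_scores maps model name -> raw benchmark score.
    """
    values = list(raw_scores.values())
    best = max(values) if higher_is_better else min(values)
    worst = min(values) if higher_is_better else max(values)

    # Ratios to the leader are only meaningful when every score is positive.
    if min(values) > 0:
        if higher_is_better:
            return {m: 100.0 * v / best for m, v in raw_scores.items()}
        # Lower is better: inverse ratio, so the leader still gets 100.
        return {m: 100.0 * best / v for m, v in raw_scores.items()}

    # Hypothetical fallback (not specified by the source): min-max scaling.
    span = best - worst
    if span == 0:
        return {m: 100.0 for m in raw_scores}
    return {m: 100.0 * (v - worst) / span for m, v in raw_scores.items()}
```

Under these assumptions, a model scoring half the leader's raw score on a higher-is-better benchmark gets 50, and a model taking twice the leader's time on a lower-is-better benchmark also gets 50.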
Coverage
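Models missing benchmarks are not dropped. Their average is multiplied by the benchmark coverage ratio (benchmarks run ÷ total benchmarks), so partial coverage is both visible and penalized. For example, a model averaging 94.0 on 3 of 5 benchmarks gets a final index of 94.0 × 3/5 = 56.4.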
Final Score
Final index = average normalized benchmark score × coverage ratio. Higher is better.
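The final-index formula can be written as a short function. This is a sketch under stated assumptions: the name `final_index` and the use of `None` to mark a missing benchmark are illustrative choices, not taken from the leaderboard's implementation.

```python
def final_index(normalized_scores, total_benchmarks):
    """Average normalized score multiplied by the coverage ratio.

    normalized_scores holds one entry per benchmark; None marks a
    benchmark the model was not run on (an assumed convention).
    """
    covered = [s for s in normalized_scores if s is not None]
    if not covered or total_benchmarks <= 0:
        return 0.0
    average = sum(covered) / len(covered)        # shown as "Avg benchmark score"
    coverage = len(covered) / total_benchmarks   # shown as "Coverage n/5"
    return average * coverage
```

With three covered benchmarks averaging 94.0 out of five, `final_index([90.0, 100.0, 92.0, None, None], 5)` comes out at roughly 56.4, which is why a model with a high average but partial coverage can sit below models with lower averages and full coverage.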