WeirdBench
Consolidated leaderboard
index
A single ranking across every WeirdBench benchmark. Raw scores are first converted into benchmark-local rank scores so lower-is-better and higher-is-better benchmarks can live in the same table.
google/gemini-3.1-pro-preview
Coverage 3/3 · Avg benchmark score 90.6
qwen/qwen3.5-397b-a17b
Coverage 3/3 · Avg benchmark score 81.1
anthropic/claude-opus-4.6
Coverage 3/3 · Avg benchmark score 80.5
qwen/qwen3.5-27b
Coverage 3/3 · Avg benchmark score 78.2
openai/gpt-5.3-codex
Coverage 3/3 · Avg benchmark score 77.6
z-ai/glm-5
Coverage 3/3 · Avg benchmark score 72.9
x-ai/grok-4.1-fast
Coverage 3/3 · Avg benchmark score 72.0
openai/gpt-oss-120b
Coverage 3/3 · Avg benchmark score 71.6
anthropic/claude-opus-4.5
Coverage 3/3 · Avg benchmark score 70.1
anthropic/claude-sonnet-4.6
Coverage 3/3 · Avg benchmark score 67.1
qwen/qwen3.5-122b-a10b
Coverage 3/3 · Avg benchmark score 63.8
z-ai/glm-5-turbo
Coverage 3/3 · Avg benchmark score 60.2
moonshotai/kimi-k2.5
Coverage 3/3 · Avg benchmark score 59.2
stepfun/step-3.5-flash:free
Coverage 2/3 · Avg benchmark score 83.7
openai/gpt-oss-20b
Coverage 3/3 · Avg benchmark score 53.9
inception/mercury-2
Coverage 3/3 · Avg benchmark score 53.5
anthropic/claude-haiku-4.5
Coverage 3/3 · Avg benchmark score 50.7
openai/gpt-5.4
Coverage 3/3 · Avg benchmark score 50.2
openai/gpt-5.3-chat
Coverage 3/3 · Avg benchmark score 48.8
google/gemini-3.1-flash-lite-preview
Coverage 3/3 · Avg benchmark score 47.4
google/gemini-3-flash-preview
Coverage 3/3 · Avg benchmark score 47.1
meta-llama/llama-4-maverick
Coverage 3/3 · Avg benchmark score 45.6
xiaomi/mimo-v2-pro
Coverage 3/3 · Avg benchmark score 42.8
anthropic/claude-sonnet-4.5
Coverage 3/3 · Avg benchmark score 42.4
deepseek/deepseek-v3.2
Coverage 3/3 · Avg benchmark score 42.2
openai/gpt-5.4-mini
Coverage 3/3 · Avg benchmark score 40.5
minimax/minimax-m2.7
Coverage 2/3 · Avg benchmark score 56.7
openai/gpt-5.1
Coverage 3/3 · Avg benchmark score 37.2
mistralai/mistral-medium-3.1
Coverage 3/3 · Avg benchmark score 35.9
x-ai/grok-4.20-beta
Coverage 3/3 · Avg benchmark score 28.3
mistralai/mistral-large-2512
Coverage 3/3 · Avg benchmark score 25.7
mistralai/mistral-small-2603
Coverage 3/3 · Avg benchmark score 25.4
amazon/nova-pro-v1
Coverage 3/3 · Avg benchmark score 22.5
amazon/nova-lite-v1
Coverage 3/3 · Avg benchmark score 13.2
minimax/minimax-m2.5
Coverage 1/3 · Avg benchmark score 32.4
amazon/nova-2-lite-v1
Coverage 3/3 · Avg benchmark score 5.8
amazon/nova-micro-v1
Coverage 3/3 · Avg benchmark score 5.6
meta-llama/llama-4-scout
Coverage 2/3 · Avg benchmark score 4.2
Methodology
Each benchmark is ranked independently and converted into a normalized 0-100 scale, where 100 is first place and 0 is last place for that benchmark. This avoids mixing incompatible raw score scales and automatically respects both lower-is-better and higher-is-better benchmarks.
Normalization
A model gets a benchmark-local rank score from 0 to 100 based on its leaderboard position within that benchmark.
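The normalization step can be sketched roughly as follows. This is a minimal illustration, not WeirdBench's actual code: the function name `rank_scores`, the linear spacing between first and last place, and the tie handling (ties keep input order) are all assumptions, since the page only states that first place maps to 100 and last place to 0.

```python
def rank_scores(raw_scores, higher_is_better=True):
    """Map raw benchmark scores to 0-100 rank scores (hypothetical helper).

    raw_scores: dict of model name -> raw score on one benchmark.
    Assumption: positions are spaced linearly, so first place gets 100,
    last place gets 0, and ties are broken by input order.
    """
    # Best model first: descending for higher-is-better, ascending otherwise.
    ordered = sorted(raw_scores, key=raw_scores.get, reverse=higher_is_better)
    n = len(ordered)
    if n == 1:
        # A single entrant is both first and last; treat it as first place.
        return {ordered[0]: 100.0}
    return {
        model: 100.0 * (n - 1 - position) / (n - 1)
        for position, model in enumerate(ordered)
    }


# Toy data, not taken from the leaderboard.
accuracy = rank_scores({"a": 0.9, "b": 0.7, "c": 0.4})
latency = rank_scores({"a": 3.0, "b": 2.0, "c": 1.0}, higher_is_better=False)
```

Because only the position matters, a lower-is-better benchmark such as latency and a higher-is-better benchmark such as accuracy end up on the same scale.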
Coverage
Models with missing benchmarks are not dropped. Their average over the benchmarks they did complete is multiplied by the coverage ratio (benchmarks completed ÷ total benchmarks), so partial coverage is both visible and penalized.
Final Score
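Final index = average normalized benchmark score × coverage ratio. Higher is better. The leaderboard is ordered by this final index, not by the displayed average: a model with coverage 2/3 and an average of 83.7 has a final index of about 55.8, which is why it ranks below full-coverage models with lower displayed averages.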
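As a rough sketch of the formula above, assuming that a missing benchmark is represented as `None` and that the average is taken only over the benchmarks a model completed (the function name `final_index` and the sample values are hypothetical):

```python
def final_index(benchmark_scores, total_benchmarks):
    """Average normalized score multiplied by coverage ratio (hypothetical helper).

    benchmark_scores: list of 0-100 rank scores, with None for each
    benchmark the model did not run (an assumed representation).
    """
    completed = [score for score in benchmark_scores if score is not None]
    if not completed or total_benchmarks <= 0:
        return 0.0
    average = sum(completed) / len(completed)
    coverage = len(completed) / total_benchmarks
    return average * coverage


# Toy values, not taken from the leaderboard.
full = final_index([80.0, 60.0, 70.0], total_benchmarks=3)    # coverage 3/3
partial = final_index([80.0, 60.0, None], total_benchmarks=3)  # coverage 2/3
```

Under these assumptions the final index equals the sum of the completed scores divided by the total number of benchmarks, so a missing benchmark behaves like a score of 0.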