WeirdBench

Consolidated leaderboard

WeirdBench Intelligence Index

A single ranking across every WeirdBench benchmark. Raw scores are first converted into benchmark-local rank scores so lower-is-better and higher-is-better benchmarks can live in the same table.

3 benchmarks
1

google/gemini-3.1-pro-preview

Coverage 3/3 · Avg benchmark score 90.6

Final index 90.57
Semantic Diversity · raw 0.2326 · rank score 91.9
Hidden Rule Sequence Continuation · raw 0.9500 · rank score 94.1
World State Tracking · raw 0.9844 · rank score 85.7
2

qwen/qwen3.5-397b-a17b

Coverage 3/3 · Avg benchmark score 81.1

Final index 81.15
Semantic Diversity · raw 0.2364 · rank score 83.8
Hidden Rule Sequence Continuation · raw 0.9500 · rank score 88.2
World State Tracking · raw 0.9844 · rank score 71.4
3

anthropic/claude-opus-4.6

Coverage 3/3 · Avg benchmark score 80.5

Final index 80.48
Semantic Diversity · raw 0.2158 · rank score 100.0
Hidden Rule Sequence Continuation · raw 0.8000 · rank score 50.0
World State Tracking · raw 0.9844 · rank score 91.4
4

qwen/qwen3.5-27b

Coverage 3/3 · Avg benchmark score 78.2

Final index 78.17
Semantic Diversity · raw 0.2437 · rank score 75.7
Hidden Rule Sequence Continuation · raw 0.8333 · rank score 58.8
World State Tracking · raw 1.0000 · rank score 100.0
5

openai/gpt-5.3-codex

Coverage 3/3 · Avg benchmark score 77.6

Final index 77.55
Semantic Diversity · raw 0.2396 · rank score 78.4
Hidden Rule Sequence Continuation · raw 1.0000 · rank score 100.0
World State Tracking · raw 0.9688 · rank score 54.3
6

z-ai/glm-5

Coverage 3/3 · Avg benchmark score 72.9

Final index 72.86
Semantic Diversity · raw 0.2469 · rank score 67.6
Hidden Rule Sequence Continuation · raw 0.9500 · rank score 85.3
World State Tracking · raw 0.9844 · rank score 65.7
7

x-ai/grok-4.1-fast

Coverage 3/3 · Avg benchmark score 72.0

Final index 72.04
Semantic Diversity · raw 0.2320 · rank score 94.6
Hidden Rule Sequence Continuation · raw 0.8167 · rank score 52.9
World State Tracking · raw 0.9844 · rank score 68.6
8

openai/gpt-oss-120b

Coverage 3/3 · Avg benchmark score 71.6

Final index 71.58
Semantic Diversity · raw 0.2564 · rank score 40.5
Hidden Rule Sequence Continuation · raw 1.0000 · rank score 97.1
World State Tracking · raw 0.9844 · rank score 77.1
9

anthropic/claude-opus-4.5

Coverage 3/3 · Avg benchmark score 70.1

Final index 70.06
Semantic Diversity · raw 0.2352 · rank score 86.5
Hidden Rule Sequence Continuation · raw 0.7500 · rank score 29.4
World State Tracking · raw 0.9844 · rank score 94.3
10

anthropic/claude-sonnet-4.6

Coverage 3/3 · Avg benchmark score 67.1

Final index 67.10
Semantic Diversity · raw 0.2334 · rank score 89.2
Hidden Rule Sequence Continuation · raw 0.7500 · rank score 23.5
World State Tracking · raw 0.9844 · rank score 88.6
11

qwen/qwen3.5-122b-a10b

Coverage 3/3 · Avg benchmark score 63.8

Final index 63.75
Semantic Diversity · raw 0.2545 · rank score 48.6
Hidden Rule Sequence Continuation · raw 0.9500 · rank score 91.2
World State Tracking · raw 0.9688 · rank score 51.4
12

z-ai/glm-5-turbo

Coverage 3/3 · Avg benchmark score 60.2

Final index 60.15
Semantic Diversity · raw 0.2458 · rank score 73.0
Hidden Rule Sequence Continuation · raw 0.8500 · rank score 61.8
World State Tracking · raw 0.9531 · rank score 45.7
13

moonshotai/kimi-k2.5

Coverage 3/3 · Avg benchmark score 59.2

Final index 59.23
Semantic Diversity · raw 0.2490 · rank score 59.5
Hidden Rule Sequence Continuation · raw 0.8000 · rank score 38.2
World State Tracking · raw 0.9844 · rank score 80.0
14

stepfun/step-3.5-flash:free

Coverage 2/3 · Avg benchmark score 83.7

Final index 55.80
Semantic Diversity · raw 0.2464 · rank score 70.3
World State Tracking · raw 1.0000 · rank score 97.1
15

openai/gpt-oss-20b

Coverage 3/3 · Avg benchmark score 53.9

Final index 53.86
Semantic Diversity · raw 0.2738 · rank score 10.8
Hidden Rule Sequence Continuation · raw 0.8833 · rank score 76.5
World State Tracking · raw 0.9844 · rank score 74.3
16

inception/mercury-2

Coverage 3/3 · Avg benchmark score 53.5

Final index 53.52
Semantic Diversity · raw 0.2643 · rank score 27.0
Hidden Rule Sequence Continuation · raw 0.8500 · rank score 73.5
World State Tracking · raw 0.9688 · rank score 60.0
17

anthropic/claude-haiku-4.5

Coverage 3/3 · Avg benchmark score 50.7

Final index 50.70
Semantic Diversity · raw 0.2277 · rank score 97.3
Hidden Rule Sequence Continuation · raw 0.7000 · rank score 17.6
World State Tracking · raw 0.9063 · rank score 37.1
18

openai/gpt-5.4

Coverage 3/3 · Avg benchmark score 50.2

Final index 50.22
Semantic Diversity · raw 0.2552 · rank score 45.9
Hidden Rule Sequence Continuation · raw 0.8500 · rank score 64.7
World State Tracking · raw 0.9219 · rank score 40.0
19

openai/gpt-5.3-chat

Coverage 3/3 · Avg benchmark score 48.8

Final index 48.80
Semantic Diversity · raw 0.2678 · rank score 21.6
Hidden Rule Sequence Continuation · raw 0.8500 · rank score 67.6
World State Tracking · raw 0.9688 · rank score 57.1
20

google/gemini-3.1-flash-lite-preview

Coverage 3/3 · Avg benchmark score 47.4

Final index 47.45
Semantic Diversity · raw 0.2387 · rank score 81.1
Hidden Rule Sequence Continuation · raw 0.8000 · rank score 44.1
World State Tracking · raw 0.8438 · rank score 17.1
21

google/gemini-3-flash-preview

Coverage 3/3 · Avg benchmark score 47.1

Final index 47.09
Semantic Diversity · raw 0.2511 · rank score 51.4
Hidden Rule Sequence Continuation · raw 0.8000 · rank score 47.1
World State Tracking · raw 0.9375 · rank score 42.9
22

meta-llama/llama-4-maverick

Coverage 3/3 · Avg benchmark score 45.6

Final index 45.64
Semantic Diversity · raw 0.2503 · rank score 54.1
Hidden Rule Sequence Continuation · raw 0.3500 · rank score 0.0
World State Tracking · raw 0.9844 · rank score 82.9
23

xiaomi/mimo-v2-pro

Coverage 3/3 · Avg benchmark score 42.8

Final index 42.79
Semantic Diversity · raw 0.2562 · rank score 43.2
Hidden Rule Sequence Continuation · raw 0.9000 · rank score 79.4
World State Tracking · raw 0.7969 · rank score 5.7
24

anthropic/claude-sonnet-4.5

Coverage 3/3 · Avg benchmark score 42.4

Final index 42.39
Semantic Diversity · raw 0.2581 · rank score 37.8
Hidden Rule Sequence Continuation · raw 0.7500 · rank score 26.5
World State Tracking · raw 0.9688 · rank score 62.9
25

deepseek/deepseek-v3.2

Coverage 3/3 · Avg benchmark score 42.2

Final index 42.21
Semantic Diversity · raw 0.2484 · rank score 62.2
Hidden Rule Sequence Continuation · raw 0.8167 · rank score 55.9
World State Tracking · raw 0.8125 · rank score 8.6
26

openai/gpt-5.4-mini

Coverage 3/3 · Avg benchmark score 40.5

Final index 40.48
Semantic Diversity · raw 0.2687 · rank score 16.2
Hidden Rule Sequence Continuation · raw 0.9000 · rank score 82.4
World State Tracking · raw 0.8594 · rank score 22.9
27

minimax/minimax-m2.7

Coverage 2/3 · Avg benchmark score 56.7

Final index 37.81
Semantic Diversity · raw 0.2478 · rank score 64.9
World State Tracking · raw 0.9531 · rank score 48.6
28

openai/gpt-5.1

Coverage 3/3 · Avg benchmark score 37.2

Final index 37.21
Semantic Diversity · raw 0.2493 · rank score 56.8
Hidden Rule Sequence Continuation · raw 0.7500 · rank score 20.6
World State Tracking · raw 0.9063 · rank score 34.3
29

mistralai/mistral-medium-3.1

Coverage 3/3 · Avg benchmark score 35.9

Final index 35.91
Semantic Diversity · raw 0.2589 · rank score 35.1
Hidden Rule Sequence Continuation · raw 0.8000 · rank score 41.2
World State Tracking · raw 0.8750 · rank score 31.4
30

x-ai/grok-4.20-beta

Coverage 3/3 · Avg benchmark score 28.3

Final index 28.34
Semantic Diversity · raw 0.2624 · rank score 29.7
Hidden Rule Sequence Continuation · raw 0.8000 · rank score 35.3
World State Tracking · raw 0.8594 · rank score 20.0
31

mistralai/mistral-large-2512

Coverage 3/3 · Avg benchmark score 25.7

Final index 25.66
Semantic Diversity · raw 0.2679 · rank score 18.9
Hidden Rule Sequence Continuation · raw 0.7833 · rank score 32.4
World State Tracking · raw 0.8594 · rank score 25.7
32

mistralai/mistral-small-2603

Coverage 3/3 · Avg benchmark score 25.4

Final index 25.38
Semantic Diversity · raw 0.2864 · rank score 2.7
Hidden Rule Sequence Continuation · raw 0.8500 · rank score 70.6
World State Tracking · raw 0.7500 · rank score 2.9
33

amazon/nova-pro-v1

Coverage 3/3 · Avg benchmark score 22.5

Final index 22.53
Semantic Diversity · raw 0.2649 · rank score 24.3
Hidden Rule Sequence Continuation · raw 0.6500 · rank score 14.7
World State Tracking · raw 0.8594 · rank score 28.6
34

amazon/nova-lite-v1

Coverage 3/3 · Avg benchmark score 13.2

Final index 13.19
Semantic Diversity · raw 0.2687 · rank score 13.5
Hidden Rule Sequence Continuation · raw 0.6333 · rank score 11.8
World State Tracking · raw 0.8281 · rank score 14.3
35

minimax/minimax-m2.5

Coverage 1/3 · Avg benchmark score 32.4

Final index 10.81
Semantic Diversity · raw 0.2619 · rank score 32.4
36

amazon/nova-2-lite-v1

Coverage 3/3 · Avg benchmark score 5.8

Final index 5.77
Semantic Diversity · raw 0.2965 · rank score 0.0
Hidden Rule Sequence Continuation · raw 0.4167 · rank score 5.9
World State Tracking · raw 0.8125 · rank score 11.4
37

amazon/nova-micro-v1

Coverage 3/3 · Avg benchmark score 5.6

Final index 5.64
Semantic Diversity · raw 0.2742 · rank score 8.1
Hidden Rule Sequence Continuation · raw 0.4333 · rank score 8.8
World State Tracking · raw 0.5938 · rank score 0.0
38

meta-llama/llama-4-scout

Coverage 2/3 · Avg benchmark score 4.2

Final index 2.78
Semantic Diversity · raw 0.2768 · rank score 5.4
Hidden Rule Sequence Continuation · raw 0.4167 · rank score 2.9

Methodology

How index scoring works

Each benchmark is ranked independently, and each model's position in that ranking is converted to a normalized 0-100 scale, where 100 is first place and 0 is last place for that benchmark. This avoids mixing incompatible raw score scales and handles both lower-is-better and higher-is-better benchmarks.

Normalization

A model gets a benchmark-local rank score from 0 to 100 based on its leaderboard position within that benchmark.
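
The page does not give the exact formula, so the following is only a minimal Python sketch under an assumption: positions are spaced linearly, so that a model in 1-indexed position r out of N models scores 100 × (N − r) / (N − 1). That spacing matches the values in the table above (for example, second place out of 38 models on Semantic Diversity gives 97.3). The function name `rank_scores` and the `higher_is_better` flag are hypothetical, and tie-breaking here simply follows input order, which may differ from what WeirdBench does.

```python
def rank_scores(raw, higher_is_better=True):
    """Map {model: raw score} to {model: 0-100 rank score} for one benchmark.

    Assumption: linear spacing by leaderboard position, 100 for first
    place and 0 for last. Ties keep their input order (sorted() is stable).
    """
    # Best model first: descending raw score if higher is better, else ascending.
    ordered = sorted(raw, key=raw.get, reverse=higher_is_better)
    n = len(ordered)
    if n == 1:
        # A single model is both first and last; treat it as first place.
        return {ordered[0]: 100.0}
    return {
        model: 100.0 * (n - 1 - position) / (n - 1)
        for position, model in enumerate(ordered)
    }


# Lower-is-better example with three hypothetical models.
print(rank_scores({"a": 0.30, "b": 0.20, "c": 0.25}, higher_is_better=False))
# {'b': 100.0, 'c': 50.0, 'a': 0.0}
```

Because only the ordering is used, the size of the gap between two raw scores has no effect on the rank score.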

Coverage

Models missing benchmarks are not dropped. Their average is multiplied by the benchmark coverage ratio (benchmarks completed divided by total benchmarks), so partial coverage is visible and penalized.

Final Score

Final index = average normalized benchmark score × coverage ratio. Higher is better.
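
As a worked check of this formula, here is a short Python sketch. The function name `final_index` and its arguments are hypothetical; the inputs are rank scores taken from the leaderboard above. The results agree with the listed index values after rounding to two decimals.

```python
def final_index(benchmark_scores, total_benchmarks):
    """Average normalized benchmark score multiplied by the coverage ratio.

    benchmark_scores: rank scores (0-100) for the benchmarks a model completed.
    total_benchmarks: number of benchmarks in the index (3 on this page).
    """
    covered = len(benchmark_scores)
    if covered == 0:
        # No results at all: nothing to average, so the index is 0.
        return 0.0
    average = sum(benchmark_scores) / covered
    coverage = covered / total_benchmarks
    return average * coverage


# Full coverage (google/gemini-3.1-pro-preview): the index equals the average.
print(round(final_index([91.9, 94.1, 85.7], 3), 2))  # 90.57

# Coverage 2/3 (stepfun/step-3.5-flash:free): average 83.7 is scaled by 2/3.
print(round(final_index([70.3, 97.1], 3), 2))  # 55.8
```

Since the average is divided by the number of completed benchmarks and then multiplied by completed/total, the result equals the sum of the rank scores divided by the total number of benchmarks. In other words, a missing benchmark counts the same as a rank score of 0.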