WeirdBench

Consolidated leaderboard

WeirdBench Intelligence Index

A single ranking across every WeirdBench benchmark. Raw scores are first converted into benchmark-local scores relative to the leader, so small raw gaps stay small while lower-is-better and higher-is-better benchmarks can still live in the same table.

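5 benchmarks

1. anthropic/claude-opus-4.6

Coverage 5/5 · Avg benchmark score 91.5 · Final index 91.53

AI Writing Detection · raw 0.9800 · normalized 98.0
Nutrition Prediction · raw 19.5945 · normalized 82.2
Semantic Diversity · raw 0.2158 · normalized 100.0
Orthographic Diversity · raw 5.1383 · normalized 88.5
Wordle · raw 4.0500 · normalized 88.9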
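
2. openai/gpt-5.5

Coverage 5/5 · Avg benchmark score 89.7 · Final index 89.74

AI Writing Detection · raw 1.0000 · normalized 100.0
Nutrition Prediction · raw 21.2108 · normalized 89.0
Semantic Diversity · raw 0.2372 · normalized 91.0
Orthographic Diversity · raw 4.5700 · normalized 78.7
Wordle · raw 4.0000 · normalized 90.0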
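
3. openai/gpt-5.3-chat

Coverage 5/5 · Avg benchmark score 88.4 · Final index 88.38

AI Writing Detection · raw 0.9899 · normalized 99.0
Nutrition Prediction · raw 20.6772 · normalized 86.8
Semantic Diversity · raw 0.2678 · normalized 80.6
Orthographic Diversity · raw 4.6912 · normalized 80.8
Wordle · raw 3.8000 · normalized 94.7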
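
4. anthropic/claude-sonnet-4.5

Coverage 5/5 · Avg benchmark score 86.5 · Final index 86.49

AI Writing Detection · raw 0.9899 · normalized 99.0
Nutrition Prediction · raw 19.3172 · normalized 81.1
Semantic Diversity · raw 0.2581 · normalized 83.6
Orthographic Diversity · raw 5.3018 · normalized 91.3
Wordle · raw 4.6500 · normalized 77.4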
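
5. anthropic/claude-opus-4.7

Coverage 5/5 · Avg benchmark score 86.0 · Final index 85.98

AI Writing Detection · raw 1.0000 · normalized 100.0
Nutrition Prediction · raw 21.5195 · normalized 90.3
Semantic Diversity · raw 0.2240 · normalized 96.3
Orthographic Diversity · raw 3.4555 · normalized 59.5
Wordle · raw 4.3000 · normalized 83.7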
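
6. google/gemini-3-flash-preview

Coverage 5/5 · Avg benchmark score 81.1 · Final index 81.11

AI Writing Detection · raw 1.0000 · normalized 100.0
Nutrition Prediction · raw 19.9842 · normalized 83.9
Semantic Diversity · raw 0.2511 · normalized 85.9
Orthographic Diversity · raw 4.8053 · normalized 82.8
Wordle · raw 6.8000 · normalized 52.9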
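
7. anthropic/claude-opus-4.1

Coverage 5/5 · Avg benchmark score 80.0 · Final index 80.04

AI Writing Detection · raw 1.0000 · normalized 100.0
Nutrition Prediction · raw 21.5560 · normalized 90.5
Semantic Diversity · raw 0.2366 · normalized 91.2
Orthographic Diversity · raw 3.8281 · normalized 65.9
Wordle · raw 6.8500 · normalized 52.6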
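
8. openai/gpt-5.1

Coverage 5/5 · Avg benchmark score 79.1 · Final index 79.07

AI Writing Detection · raw 0.9804 · normalized 98.0
Nutrition Prediction · raw 19.9581 · normalized 83.8
Semantic Diversity · raw 0.2493 · normalized 86.5
Orthographic Diversity · raw 5.2269 · normalized 90.0
Wordle · raw 9.7500 · normalized 36.9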
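
9. anthropic/claude-haiku-4.5

Coverage 5/5 · Avg benchmark score 76.9 · Final index 76.89

AI Writing Detection · raw 0.8454 · normalized 84.5
Nutrition Prediction · raw 20.7097 · normalized 86.9
Semantic Diversity · raw 0.2277 · normalized 94.8
Orthographic Diversity · raw 4.5919 · normalized 79.1
Wordle · raw 9.2000 · normalized 39.1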
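
10. openai/gpt-5.4

Coverage 5/5 · Avg benchmark score 76.8 · Final index 76.76

AI Writing Detection · raw 0.8621 · normalized 86.2
Nutrition Prediction · raw 21.5840 · normalized 90.6
Semantic Diversity · raw 0.2552 · normalized 84.6
Orthographic Diversity · raw 4.8728 · normalized 83.9
Wordle · raw 9.3500 · normalized 38.5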
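
11. moonshotai/kimi-k2.6

Coverage 5/5 · Avg benchmark score 76.2 · Final index 76.23

AI Writing Detection · raw 0.9009 · normalized 90.1
Nutrition Prediction · raw 19.1032 · normalized 80.2
Semantic Diversity · raw 0.2505 · normalized 86.2
Orthographic Diversity · raw 4.8386 · normalized 83.3
Wordle · raw 8.7000 · normalized 41.4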
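
12. openai/gpt-oss-120b

Coverage 5/5 · Avg benchmark score 75.5 · Final index 75.49

AI Writing Detection · raw 0.8421 · normalized 84.2
Nutrition Prediction · raw 21.3529 · normalized 89.6
Semantic Diversity · raw 0.2564 · normalized 84.2
Orthographic Diversity · raw -49.0741 · normalized 33.7
Wordle · raw 4.2000 · normalized 85.7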
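
13. google/gemini-3.1-flash-lite-preview

Coverage 5/5 · Avg benchmark score 75.0 · Final index 75.02

AI Writing Detection · raw 0.8929 · normalized 89.3
Nutrition Prediction · raw 20.5872 · normalized 86.4
Semantic Diversity · raw 0.2387 · normalized 90.4
Orthographic Diversity · raw 3.7789 · normalized 65.1
Wordle · raw 8.2000 · normalized 43.9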
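
14. mistralai/mistral-small-2603

Coverage 5/5 · Avg benchmark score 74.5 · Final index 74.49

AI Writing Detection · raw 0.6849 · normalized 68.5
Nutrition Prediction · raw 23.8232 · normalized 100.0
Semantic Diversity · raw 0.2864 · normalized 75.4
Orthographic Diversity · raw 5.3747 · normalized 92.6
Wordle · raw 10.0000 · normalized 36.0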
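
15. moonshotai/kimi-k2.5

Coverage 5/5 · Avg benchmark score 74.4 · Final index 74.40

AI Writing Detection · raw 0.9899 · normalized 99.0
Nutrition Prediction · raw 20.1413 · normalized 84.5
Semantic Diversity · raw 0.2490 · normalized 86.7
Orthographic Diversity · raw -22.5246 · normalized 65.8
Wordle · raw 10.0000 · normalized 36.0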
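
16. meta-llama/llama-4-maverick

Coverage 5/5 · Avg benchmark score 73.2 · Final index 73.24

AI Writing Detection · raw 0.7321 · normalized 73.2
Nutrition Prediction · raw 22.0002 · normalized 92.3
Semantic Diversity · raw 0.2503 · normalized 86.2
Orthographic Diversity · raw 4.5439 · normalized 78.3
Wordle · raw 9.9500 · normalized 36.2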
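
17. anthropic/claude-opus-4.5

Coverage 5/5 · Avg benchmark score 72.3 · Final index 72.30

AI Writing Detection · raw 0.9278 · normalized 92.8
Nutrition Prediction · raw 18.8760 · normalized 79.2
Semantic Diversity · raw 0.2352 · normalized 91.7
Orthographic Diversity · raw 1.9091 · normalized 32.9
Wordle · raw 5.5500 · normalized 64.9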

18. openai/gpt-5.3-codex

Coverage 4/5 · Avg benchmark score 90.2 · Final index 72.17

Nutrition Prediction · raw 21.1440 · normalized 88.8
Semantic Diversity · raw 0.2396 · normalized 90.1
Orthographic Diversity · raw 4.7614 · normalized 82.0
Wordle · raw 3.6000 · normalized 100.0
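
19. mistralai/mistral-large-2512

Coverage 5/5 · Avg benchmark score 71.9 · Final index 71.87

AI Writing Detection · raw 0.6667 · normalized 66.7
Nutrition Prediction · raw 21.4432 · normalized 90.0
Semantic Diversity · raw 0.2679 · normalized 80.6
Orthographic Diversity · raw 4.9684 · normalized 85.6
Wordle · raw 9.8500 · normalized 36.5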
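
20. mistralai/mistral-medium-3.1

Coverage 5/5 · Avg benchmark score 70.7 · Final index 70.75

AI Writing Detection · raw 0.6087 · normalized 60.9
Nutrition Prediction · raw 22.9013 · normalized 96.1
Semantic Diversity · raw 0.2589 · normalized 83.3
Orthographic Diversity · raw 4.4394 · normalized 76.5
Wordle · raw 9.7500 · normalized 36.9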

21. x-ai/grok-4.1-fast

Coverage 4/5 · Avg benchmark score 87.7 · Final index 70.20

AI Writing Detection · raw 0.9804 · normalized 98.0
Nutrition Prediction · raw 18.3064 · normalized 76.8
Semantic Diversity · raw 0.2320 · normalized 93.0
Orthographic Diversity · raw 4.8228 · normalized 83.1
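
22. google/gemma-4-26b-a4b-it

Coverage 5/5 · Avg benchmark score 69.9 · Final index 69.92

AI Writing Detection · raw 0.9899 · normalized 99.0
Nutrition Prediction · raw 19.8426 · normalized 83.3
Semantic Diversity · raw 0.2576 · normalized 83.8
Orthographic Diversity · raw 2.7495 · normalized 47.4
Wordle · raw 9.9500 · normalized 36.2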

23. anthropic/claude-sonnet-4.6

Coverage 4/5 · Avg benchmark score 85.9 · Final index 68.75

AI Writing Detection · raw 0.9592 · normalized 95.9
Nutrition Prediction · raw 20.7120 · normalized 86.9
Semantic Diversity · raw 0.2334 · normalized 92.5
Orthographic Diversity · raw 3.9741 · normalized 68.5
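
24. inception/mercury-2

Coverage 5/5 · Avg benchmark score 68.0 · Final index 68.04

AI Writing Detection · raw 0.8750 · normalized 87.5
Nutrition Prediction · raw 21.8978 · normalized 91.9
Semantic Diversity · raw 0.2643 · normalized 81.6
Orthographic Diversity · raw -77.0000 · normalized 0.0
Wordle · raw 4.5500 · normalized 79.1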
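
25. openai/gpt-5.4-mini

Coverage 5/5 · Avg benchmark score 67.1 · Final index 67.11

AI Writing Detection · raw 0.6803 · normalized 68.0
Nutrition Prediction · raw 15.2246 · normalized 63.9
Semantic Diversity · raw 0.2687 · normalized 80.3
Orthographic Diversity · raw 5.0027 · normalized 86.2
Wordle · raw 9.7000 · normalized 37.1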
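
26. amazon/nova-micro-v1

Coverage 5/5 · Avg benchmark score 65.0 · Final index 65.01

AI Writing Detection · raw 0.6438 · normalized 64.4
Nutrition Prediction · raw 14.8388 · normalized 62.3
Semantic Diversity · raw 0.2742 · normalized 78.7
Orthographic Diversity · raw -7.7158 · normalized 83.7
Wordle · raw 10.0000 · normalized 36.0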
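
27. deepseek/deepseek-v3.2

Coverage 5/5 · Avg benchmark score 63.5 · Final index 63.48

AI Writing Detection · raw 0.4819 · normalized 48.2
Nutrition Prediction · raw 17.4072 · normalized 73.1
Semantic Diversity · raw 0.2484 · normalized 86.9
Orthographic Diversity · raw 4.1892 · normalized 72.2
Wordle · raw 9.7000 · normalized 37.1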

28. openai/gpt-oss-20b

Coverage 4/5 · Avg benchmark score 78.6 · Final index 62.90

Nutrition Prediction · raw 23.7467 · normalized 99.7
Semantic Diversity · raw 0.2738 · normalized 78.8
Orthographic Diversity · raw 5.8053 · normalized 100.0
Wordle · raw 10.0000 · normalized 36.0
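
29. google/gemma-4-31b-it

Coverage 5/5 · Avg benchmark score 62.6 · Final index 62.63

AI Writing Detection · raw 0.9899 · normalized 99.0
Nutrition Prediction · raw 19.6220 · normalized 82.4
Semantic Diversity · raw 0.2380 · normalized 90.7
Orthographic Diversity · raw -74.0000 · normalized 3.6
Wordle · raw 9.6000 · normalized 37.5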
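
30. amazon/nova-lite-v1

Coverage 5/5 · Avg benchmark score 60.0 · Final index 60.03

AI Writing Detection · raw 0.0000 · normalized 0.0
Nutrition Prediction · raw 21.6084 · normalized 90.7
Semantic Diversity · raw 0.2687 · normalized 80.3
Orthographic Diversity · raw 5.4071 · normalized 93.1
Wordle · raw 10.0000 · normalized 36.0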

31. z-ai/glm-5

Coverage 4/5 · Avg benchmark score 74.8 · Final index 59.83

Nutrition Prediction · raw 19.4215 · normalized 81.5
Semantic Diversity · raw 0.2469 · normalized 87.4
Orthographic Diversity · raw 5.4719 · normalized 94.3
Wordle · raw 10.0000 · normalized 36.0

32. stepfun/step-3.5-flash:free

Coverage 4/5 · Avg benchmark score 73.7 · Final index 58.97

Nutrition Prediction · raw 17.9260 · normalized 75.2
Semantic Diversity · raw 0.2464 · normalized 87.6
Orthographic Diversity · raw 5.5754 · normalized 96.0
Wordle · raw 10.0000 · normalized 36.0

33. minimax/minimax-m2.7

Coverage 4/5 · Avg benchmark score 72.4 · Final index 57.94

Nutrition Prediction · raw 17.5755 · normalized 73.8
Semantic Diversity · raw 0.2478 · normalized 87.1
Orthographic Diversity · raw 5.3912 · normalized 92.9
Wordle · raw 10.0000 · normalized 36.0
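
34. amazon/nova-2-lite-v1

Coverage 5/5 · Avg benchmark score 56.8 · Final index 56.82

AI Writing Detection · raw 0.4211 · normalized 42.1
Nutrition Prediction · raw 16.0830 · normalized 67.5
Semantic Diversity · raw 0.2965 · normalized 72.8
Orthographic Diversity · raw 3.8129 · normalized 65.7
Wordle · raw 10.0000 · normalized 36.0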

35. qwen/qwen3.5-122b-a10b

Coverage 3/5 · Avg benchmark score 94.0 · Final index 56.42

AI Writing Detection · raw 0.9899 · normalized 99.0
Nutrition Prediction · raw 23.4188 · normalized 98.3
Semantic Diversity · raw 0.2545 · normalized 84.8

36. xiaomi/mimo-v2-pro

Coverage 4/5 · Avg benchmark score 70.3 · Final index 56.21

Nutrition Prediction · raw 19.2432 · normalized 80.8
Semantic Diversity · raw 0.2562 · normalized 84.2
Orthographic Diversity · raw 4.6491 · normalized 80.1
Wordle · raw 10.0000 · normalized 36.0
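
37. x-ai/grok-4.20-beta

Coverage 5/5 · Avg benchmark score 56.2 · Final index 56.20

AI Writing Detection · raw 0.6667 · normalized 66.7
Nutrition Prediction · raw 22.5383 · normalized 94.6
Semantic Diversity · raw 0.2624 · normalized 82.2
Orthographic Diversity · raw -77.0000 · normalized 0.0
Wordle · raw 9.6000 · normalized 37.5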
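
38. minimax/minimax-m2.5

Coverage 5/5 · Avg benchmark score 55.5 · Final index 55.45

AI Writing Detection · raw 0.8000 · normalized 80.0
Nutrition Prediction · raw 14.6662 · normalized 61.6
Semantic Diversity · raw 0.2619 · normalized 82.4
Orthographic Diversity · raw -62.6741 · normalized 17.3
Wordle · raw 10.0000 · normalized 36.0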
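
39. amazon/nova-pro-v1

Coverage 5/5 · Avg benchmark score 55.1 · Final index 55.13

AI Writing Detection · raw 0.0357 · normalized 3.6
Nutrition Prediction · raw 21.1849 · normalized 88.9
Semantic Diversity · raw 0.2649 · normalized 81.5
Orthographic Diversity · raw 3.8140 · normalized 65.7
Wordle · raw 10.0000 · normalized 36.0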

40. google/gemini-3.1-pro-preview

Coverage 4/5 · Avg benchmark score 67.0 · Final index 53.59

Nutrition Prediction · raw 17.1302 · normalized 71.9
Semantic Diversity · raw 0.2326 · normalized 92.8
Orthographic Diversity · raw 3.9053 · normalized 67.3
Wordle · raw 10.0000 · normalized 36.0

41. qwen/qwen3.5-397b-a17b

Coverage 3/5 · Avg benchmark score 73.8 · Final index 44.29

AI Writing Detection · raw 0.9899 · normalized 99.0
Semantic Diversity · raw 0.2364 · normalized 91.3
Orthographic Diversity · raw -51.1667 · normalized 31.2
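
42. meta-llama/llama-4-scout

Coverage 5/5 · Avg benchmark score 39.6 · Final index 39.62

AI Writing Detection · raw 0.0241 · normalized 2.4
Nutrition Prediction · raw 16.6109 · normalized 69.7
Semantic Diversity · raw 0.2768 · normalized 78.0
Orthographic Diversity · raw -67.0715 · normalized 12.0
Wordle · raw 10.0000 · normalized 36.0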

43. z-ai/glm-5-turbo

Coverage 2/5 · Avg benchmark score 93.4 · Final index 37.37

Nutrition Prediction · raw 23.5975 · normalized 99.1
Semantic Diversity · raw 0.2458 · normalized 87.8

44. qwen/qwen3.5-27b

Coverage 2/5 · Avg benchmark score 62.3 · Final index 24.91

Semantic Diversity · raw 0.2437 · normalized 88.6
Wordle · raw 10.0000 · normalized 36.0

Methodology

How index scoring works

Each benchmark is normalized relative to its current best raw score instead of being flattened into leaderboard positions. That means ties stay tied, near-ties stay near-ties, and lower-is-better benchmarks still compare cleanly with higher-is-better ones.
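
To illustrate why leader-relative scoring keeps near-ties close where positional scoring would not, here is a small sketch. It uses three raw Wordle scores from the leaderboard (the values suggest lower is better there). Both function names and the evenly spaced rank-to-points mapping are hypothetical, chosen only for contrast; they are not taken from the WeirdBench code.

```python
def rank_points(raw_scores, lower_is_better=True):
    """Hypothetical positional scoring: evenly spaced points by rank."""
    order = sorted(raw_scores, reverse=not lower_is_better)
    n = len(order)
    # Best position gets 100, worst gets 0, evenly spaced in between.
    return [100.0 * (n - 1 - order.index(s)) / (n - 1) for s in raw_scores]


def leader_relative(raw_scores, lower_is_better=True):
    """Ratio to the best raw score on a 0-100 scale (positive scores only)."""
    best = min(raw_scores) if lower_is_better else max(raw_scores)
    if lower_is_better:
        return [100.0 * best / s for s in raw_scores]
    return [100.0 * s / best for s in raw_scores]


wordle = [3.6, 3.8, 10.0]  # raw Wordle scores copied from the leaderboard
print([round(x, 1) for x in rank_points(wordle)])      # [100.0, 50.0, 0.0]
print([round(x, 1) for x in leader_relative(wordle)])  # [100.0, 94.7, 36.0]
```

Under positional scoring the 3.8 result drops to 50 even though it is close to the leader; the leader-relative score keeps it at 94.7.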

Normalization

Higher-is-better benchmarks are scored as the ratio of a model's raw score to the leader's, expressed on a 0-100 scale. Lower-is-better benchmarks use the inverse ratio (the leader's raw score divided by the model's), with a safe fallback if scores cross zero.
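
A minimal sketch of this normalization, under stated assumptions: the function name `normalize` and its parameters are hypothetical, and the fallback is not specified on this page. The sketch assumes min-max scaling between the worst and best raw score whenever a score is zero or negative. That choice happens to reproduce several negative Orthographic Diversity rows in the leaderboard (for example, raw -49.0741 maps to 33.7), but it should be read as a guess rather than the site's actual rule.

```python
def normalize(raw, all_raw, lower_is_better=False):
    """Benchmark-local score on a 0-100 scale, relative to the leader.

    The ratio branch follows the description in the text. The min-max
    branch is an ASSUMED stand-in for the unspecified "safe fallback".
    """
    best = min(all_raw) if lower_is_better else max(all_raw)
    worst = max(all_raw) if lower_is_better else min(all_raw)
    if raw > 0 and best > 0:
        # Ratio to the leader, inverted when lower raw scores are better.
        ratio = best / raw if lower_is_better else raw / best
        return 100.0 * ratio
    # Assumed fallback: position of the score between worst and best.
    if best == worst:
        return 100.0
    return 100.0 * (raw - worst) / (best - worst)


# Raw values copied from the leaderboard; the direction of each benchmark
# is inferred from which raw value is shown with a score of 100.
print(round(normalize(19.5945, [23.8232, 19.5945]), 1))                      # 82.2
print(round(normalize(0.2372, [0.2158, 0.2372], lower_is_better=True), 1))   # 91.0
print(round(normalize(-49.0741, [5.8053, -49.0741, -77.0]), 1))              # 33.7
```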

Coverage

Models missing benchmarks are not dropped. Their average is multiplied by the benchmark coverage ratio (benchmarks completed divided by total benchmarks), so partial coverage is visible and penalized.

Final Score

Final index = average normalized benchmark score × coverage ratio. Higher is better.
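
The formula can be checked against a leaderboard entry with a short sketch. The function name `final_index` and the `total_benchmarks` parameter are hypothetical; the input values are the rounded normalized scores shown for openai/gpt-5.3-codex, which covers 4 of 5 benchmarks.

```python
def final_index(normalized_scores, total_benchmarks=5):
    """Average the available normalized scores, then scale by coverage."""
    if not normalized_scores:
        return 0.0
    average = sum(normalized_scores) / len(normalized_scores)
    coverage = len(normalized_scores) / total_benchmarks
    return average * coverage


# Normalized scores listed for openai/gpt-5.3-codex (no AI Writing Detection result).
codex = [88.8, 90.1, 82.0, 100.0]
print(round(final_index(codex), 2))  # about 72.18: average 90.225 times coverage 0.8
```

The leaderboard lists 72.17 for this model; the small gap is consistent with the site averaging unrounded normalized scores, though that is an inference from the displayed numbers.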