WeirdBench

Benchmark leaderboard

Semantic Diversity

Each model is asked to generate exactly 20 English words that are maximally semantically unrelated to each other; the benchmark then scores the average pairwise semantic similarity of those words. Lower is better.

Lower score is better
| Rank | Model | Score |
|------|-------|-------|
| 1 | anthropic/claude-opus-4.6 | 0.2158 |
| 2 | anthropic/claude-haiku-4.5 | 0.2277 |
| 3 | x-ai/grok-4.1-fast | 0.2320 |
| 4 | google/gemini-3.1-pro-preview | 0.2326 |
| 5 | anthropic/claude-sonnet-4.6 | 0.2334 |
| 6 | anthropic/claude-opus-4.5 | 0.2352 |
| 7 | qwen/qwen3.5-397b-a17b | 0.2364 |
| 8 | google/gemini-3.1-flash-lite-preview | 0.2387 |
| 9 | openai/gpt-5.3-codex | 0.2396 |
| 10 | qwen/qwen3.5-27b | 0.2437 |
| 11 | z-ai/glm-5-turbo | 0.2458 |
| 12 | stepfun/step-3.5-flash:free | 0.2464 |
| 13 | z-ai/glm-5 | 0.2469 |
| 14 | minimax/minimax-m2.7 | 0.2478 |
| 15 | deepseek/deepseek-v3.2 | 0.2484 |
| 16 | moonshotai/kimi-k2.5 | 0.2490 |
| 17 | openai/gpt-5.1 | 0.2493 |
| 18 | meta-llama/llama-4-maverick | 0.2503 |
| 19 | google/gemini-3-flash-preview | 0.2511 |
| 20 | qwen/qwen3.5-122b-a10b | 0.2545 |
| 21 | openai/gpt-5.4 | 0.2552 |
| 22 | xiaomi/mimo-v2-pro | 0.2562 |
| 23 | openai/gpt-oss-120b | 0.2564 |
| 24 | anthropic/claude-sonnet-4.5 | 0.2581 |
| 25 | mistralai/mistral-medium-3.1 | 0.2589 |
| 26 | minimax/minimax-m2.5 | 0.2619 |
| 27 | x-ai/grok-4.20-beta | 0.2624 |
| 28 | inception/mercury-2 | 0.2643 |
| 29 | amazon/nova-pro-v1 | 0.2649 |
| 30 | openai/gpt-5.3-chat | 0.2678 |
| 31 | mistralai/mistral-large-2512 | 0.2679 |
| 32 | openai/gpt-5.4-mini | 0.2687 |
| 33 | amazon/nova-lite-v1 | 0.2687 |
| 34 | openai/gpt-oss-20b | 0.2738 |
| 35 | amazon/nova-micro-v1 | 0.2742 |
| 36 | meta-llama/llama-4-scout | 0.2768 |
| 37 | mistralai/mistral-small-2603 | 0.2864 |
| 38 | amazon/nova-2-lite-v1 | 0.2965 |

Methodology

How scoring works

Each model generates 20 words; the runner embeds them and scores the average pairwise semantic similarity across all word pairs.
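
A minimal sketch of the scoring step, assuming cosine similarity over word embeddings (the leaderboard does not name the similarity metric); the embedding call itself is omitted, since the embedding model used is not documented here:

```python
from itertools import combinations

import numpy as np


def average_pairwise_similarity(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all unordered pairs of word embeddings.

    `embeddings` is an (n_words, dim) array; for 20 words the mean is
    taken over 20 * 19 / 2 = 190 pairs.
    """
    # Normalize each row so the dot product of two rows is their cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = [float(normed[i] @ normed[j])
            for i, j in combinations(range(len(normed)), 2)]
    return sum(sims) / len(sims)
```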

Prompt

Generate exactly 20 English words and return only a JSON array of lowercase single words.
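
The exact validation the runner applies is not shown; a hedged sketch of checks matching the stated contract (a JSON array of exactly 20 lowercase single words) might look like this:

```python
import json


def parse_word_list(reply: str) -> list[str]:
    """Validate a model reply: a JSON array of exactly 20 lowercase single words."""
    words = json.loads(reply)
    if not isinstance(words, list) or len(words) != 20:
        raise ValueError("expected a JSON array of exactly 20 items")
    for word in words:
        # `isalpha` rejects spaces, hyphens, and digits; this strictness is an assumption.
        if not isinstance(word, str) or not word.isalpha() or word != word.lower():
            raise ValueError(f"not a lowercase single word: {word!r}")
    return words
```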

Score

Lower is better: a lower score means the chosen words are less semantically related to each other.
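
Written out, and again assuming cosine similarity between embeddings, the reported score for n = 20 words with embeddings e_1, ..., e_n is the mean over all n(n-1)/2 = 190 unordered pairs:

$$\bar{s} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \cos\left(e_i, e_j\right)$$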

Execution

Benchmark runners execute locally, cache results in Neon, and skip recomputation for models that already have stored scores.
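
A rough sketch of that skip-if-cached pattern, assuming a hypothetical Postgres table `scores(model, benchmark, score)` in Neon; the real schema and runner entry point are not documented here, so `run_benchmark` is passed in as a callable:

```python
from typing import Callable

import psycopg


def get_or_compute_score(
    conn: psycopg.Connection,
    model: str,
    benchmark: str,
    run_benchmark: Callable[[str, str], float],
) -> float:
    """Return the stored score if one exists; otherwise compute and cache it."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT score FROM scores WHERE model = %s AND benchmark = %s",
            (model, benchmark),
        )
        row = cur.fetchone()
    if row is not None:
        return row[0]  # cached result: skip recomputation
    score = run_benchmark(model, benchmark)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO scores (model, benchmark, score) VALUES (%s, %s, %s)",
            (model, benchmark, score),
        )
    conn.commit()
    return score
```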