WeirdBench

Benchmark leaderboard


Semantic Diversity

Each model is prompted to generate exactly 20 English words that are maximally semantically unrelated to each other; the score is the average pairwise semantic similarity of those words. Lower is better.

| Rank | Model | Score |
| ---- | ----- | ----- |
| 1 | anthropic/claude-opus-4.6 | 0.2158 |
| 2 | anthropic/claude-opus-4.7 | 0.2240 |
| 3 | anthropic/claude-haiku-4.5 | 0.2277 |
| 4 | x-ai/grok-4.1-fast | 0.2320 |
| 5 | google/gemini-3.1-pro-preview | 0.2326 |
| 6 | anthropic/claude-sonnet-4.6 | 0.2334 |
| 7 | anthropic/claude-opus-4.5 | 0.2352 |
| 8 | qwen/qwen3.5-397b-a17b | 0.2364 |
| 9 | anthropic/claude-opus-4.1 | 0.2366 |
| 10 | openai/gpt-5.5 | 0.2372 |
| 11 | google/gemma-4-31b-it | 0.2380 |
| 12 | google/gemini-3.1-flash-lite-preview | 0.2387 |
| 13 | openai/gpt-5.3-codex | 0.2396 |
| 14 | qwen/qwen3.5-27b | 0.2437 |
| 15 | z-ai/glm-5-turbo | 0.2458 |
| 16 | stepfun/step-3.5-flash:free | 0.2464 |
| 17 | z-ai/glm-5 | 0.2469 |
| 18 | minimax/minimax-m2.7 | 0.2478 |
| 19 | deepseek/deepseek-v3.2 | 0.2484 |
| 20 | moonshotai/kimi-k2.5 | 0.2490 |
| 21 | openai/gpt-5.1 | 0.2493 |
| 22 | meta-llama/llama-4-maverick | 0.2503 |
| 23 | moonshotai/kimi-k2.6 | 0.2505 |
| 24 | google/gemini-3-flash-preview | 0.2511 |
| 25 | qwen/qwen3.5-122b-a10b | 0.2545 |
| 26 | openai/gpt-5.4 | 0.2552 |
| 27 | xiaomi/mimo-v2-pro | 0.2562 |
| 28 | openai/gpt-oss-120b | 0.2564 |
| 29 | google/gemma-4-26b-a4b-it | 0.2576 |
| 30 | anthropic/claude-sonnet-4.5 | 0.2581 |
| 31 | mistralai/mistral-medium-3.1 | 0.2589 |
| 32 | minimax/minimax-m2.5 | 0.2619 |
| 33 | x-ai/grok-4.20-beta | 0.2624 |
| 34 | inception/mercury-2 | 0.2643 |
| 35 | amazon/nova-pro-v1 | 0.2649 |
| 36 | openai/gpt-5.3-chat | 0.2678 |
| 37 | mistralai/mistral-large-2512 | 0.2679 |
| 38 | openai/gpt-5.4-mini | 0.2687 |
| 39 | amazon/nova-lite-v1 | 0.2687 |
| 40 | openai/gpt-oss-20b | 0.2738 |
| 41 | amazon/nova-micro-v1 | 0.2742 |
| 42 | meta-llama/llama-4-scout | 0.2768 |
| 43 | mistralai/mistral-small-2603 | 0.2864 |
| 44 | amazon/nova-2-lite-v1 | 0.2965 |

Methodology

How scoring works

Generate 20 words, embed them, and score the average pairwise semantic similarity.
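A minimal sketch of this scoring step, assuming cosine similarity over word embeddings (the embedding model, and cosine specifically, are assumptions; the page only states that pairwise semantic similarity is averaged):

```python
from itertools import combinations

import numpy as np

def average_pairwise_similarity(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all unordered pairs of word embeddings.

    `embeddings` is a (20, d) array, one row per generated word.
    """
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = [float(normed[i] @ normed[j])
            for i, j in combinations(range(len(normed)), 2)]
    return sum(sims) / len(sims)  # 20 words -> 190 pairs
```

With 20 words there are 190 unordered pairs, so the score is the mean of 190 similarity values.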

Prompt

Generate exactly 20 English words and return only a JSON array of lowercase single words.
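For illustration, a runner-side parse-and-validate step for that response might look like the sketch below; the exact validation rules (e.g. rejecting hyphenated words via `isalpha`) are assumptions, not documented benchmark behavior:

```python
import json

def parse_words(response: str) -> list[str]:
    """Parse a model response, enforcing the prompt's contract."""
    words = json.loads(response)
    if not (isinstance(words, list) and len(words) == 20):
        raise ValueError("expected a JSON array of exactly 20 words")
    for w in words:
        # "lowercase single words": one alphabetic token, no spaces or caps.
        if not (isinstance(w, str) and w == w.lower() and w.isalpha()):
            raise ValueError(f"not a lowercase single word: {w!r}")
    return words
```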

Score

Lower is better: a lower score means the model's chosen words are less semantically related to each other.

Execution

Benchmark runners execute locally, cache results in Neon, and skip recomputation for models that already have stored scores.
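Since Neon is managed Postgres, the skip-if-cached check can be an ordinary SQL lookup. A sketch, assuming a hypothetical `scores(model text, score float)` table and a `DATABASE_URL` connection string:

```python
import os

import psycopg2

def cached_score(model: str) -> float | None:
    """Return the stored score for `model`, or None if it hasn't been run.

    Table name and columns are assumptions; Neon speaks ordinary Postgres,
    so any Postgres driver works.
    """
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT score FROM scores WHERE model = %s", (model,))
            row = cur.fetchone()
            return row[0] if row else None
    finally:
        conn.close()

# Skip recomputation when a score is already stored.
if cached_score("anthropic/claude-opus-4.6") is None:
    ...  # run the benchmark, then INSERT the new score
```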