WeirdBench

Benchmark leaderboard


Semantic Diversity

Each model is prompted to generate exactly 20 English words that are maximally semantically unrelated to each other; the score is the average pairwise semantic similarity of those words. Lower is better.

| Rank | Model | Score |
| ---- | ----- | ----- |
| 1 | anthropic/claude-opus-4.6 | 0.2158 |
| 2 | anthropic/claude-opus-4.7 | 0.2240 |
| 3 | anthropic/claude-haiku-4.5 | 0.2277 |
| 4 | x-ai/grok-4.1-fast | 0.2320 |
| 5 | google/gemini-3.1-pro-preview | 0.2326 |
| 6 | anthropic/claude-sonnet-4.6 | 0.2334 |
| 7 | anthropic/claude-opus-4.5 | 0.2352 |
| 8 | qwen/qwen3.5-397b-a17b | 0.2364 |
| 9 | anthropic/claude-opus-4.1 | 0.2366 |
| 10 | openai/gpt-5.5 | 0.2372 |
| 11 | google/gemma-4-31b-it | 0.2380 |
| 12 | google/gemini-3.1-flash-lite-preview | 0.2387 |
| 13 | openai/gpt-5.3-codex | 0.2396 |
| 14 | qwen/qwen3.5-27b | 0.2437 |
| 15 | z-ai/glm-5-turbo | 0.2458 |
| 16 | stepfun/step-3.5-flash:free | 0.2464 |
| 17 | z-ai/glm-5 | 0.2469 |
| 18 | minimax/minimax-m2.7 | 0.2478 |
| 19 | deepseek/deepseek-v3.2 | 0.2484 |
| 20 | moonshotai/kimi-k2.5 | 0.2490 |
| 21 | openai/gpt-5.1 | 0.2493 |
| 22 | meta-llama/llama-4-maverick | 0.2503 |
| 23 | moonshotai/kimi-k2.6 | 0.2505 |
| 24 | google/gemini-3-flash-preview | 0.2511 |
| 25 | qwen/qwen3.5-122b-a10b | 0.2545 |
| 26 | openai/gpt-5.4 | 0.2552 |
| 27 | xiaomi/mimo-v2-pro | 0.2562 |
| 28 | openai/gpt-oss-120b | 0.2564 |
| 29 | google/gemma-4-26b-a4b-it | 0.2576 |
| 30 | anthropic/claude-sonnet-4.5 | 0.2581 |
| 31 | mistralai/mistral-medium-3.1 | 0.2589 |
| 32 | minimax/minimax-m2.5 | 0.2619 |
| 33 | x-ai/grok-4.20-beta | 0.2624 |
| 34 | inception/mercury-2 | 0.2643 |
| 35 | amazon/nova-pro-v1 | 0.2649 |
| 36 | openai/gpt-5.3-chat | 0.2678 |
| 37 | mistralai/mistral-large-2512 | 0.2679 |
| 38 | openai/gpt-5.4-mini | 0.2687 |
| 39 | amazon/nova-lite-v1 | 0.2687 |
| 40 | openai/gpt-oss-20b | 0.2738 |
| 41 | amazon/nova-micro-v1 | 0.2742 |
| 42 | meta-llama/llama-4-scout | 0.2768 |
| 43 | mistralai/mistral-small-2603 | 0.2864 |
| 44 | amazon/nova-2-lite-v1 | 0.2965 |

Methodology

How scoring works

Generate 20 words, embed them, and score the average pairwise semantic similarity.
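A minimal sketch of this scoring step, assuming cosine similarity over word embeddings (the embedding model, and cosine specifically, are assumptions; the page only states that pairwise semantic similarity is averaged):

```python
from itertools import combinations

import numpy as np

def average_pairwise_similarity(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all unordered pairs of word embeddings.

    `embeddings` is a (20, d) array, one row per generated word.
    """
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = [float(normed[i] @ normed[j])
            for i, j in combinations(range(len(normed)), 2)]
    return sum(sims) / len(sims)  # 20 words -> 190 pairs
```

With 20 words there are 190 unordered pairs, so the score is the mean of 190 similarity values.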

Prompt

Generate exactly 20 English words and return only a JSON array of lowercase single words.
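For illustration, a runner-side parse-and-validate step for that response might look like the sketch below; the exact validation rules (e.g. rejecting hyphenated words via `isalpha`) are assumptions, not documented benchmark behavior:

```python
import json

def parse_words(response: str) -> list[str]:
    """Parse a model response, enforcing the prompt's contract."""
    words = json.loads(response)
    if not (isinstance(words, list) and len(words) == 20):
        raise ValueError("expected a JSON array of exactly 20 words")
    for w in words:
        # "lowercase single words": one alphabetic token, no spaces or caps.
        if not (isinstance(w, str) and w == w.lower() and w.isalpha()):
            raise ValueError(f"not a lowercase single word: {w!r}")
    return words
```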

Score

Lower is better: a lower score means the model's chosen words are less semantically related to each other.

Execution

Benchmark runners execute locally, cache results in Neon, and skip recomputation for models that already have stored scores.
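Since Neon is managed Postgres, the skip-if-cached check can be an ordinary SQL lookup. A sketch, assuming a hypothetical `scores(model text, score float)` table and a `DATABASE_URL` connection string:

```python
import os

import psycopg2

def cached_score(model: str) -> float | None:
    """Return the stored score for `model`, or None if it hasn't been run.

    Table name and columns are assumptions; Neon speaks ordinary Postgres,
    so any Postgres driver works.
    """
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT score FROM scores WHERE model = %s", (model,))
            row = cur.fetchone()
            return row[0] if row else None
    finally:
        conn.close()

# Skip recomputation when a score is already stored.
if cached_score("anthropic/claude-opus-4.6") is None:
    ...  # run the benchmark, then INSERT the new score
```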