anthropic/claude-opus-4.6
WeirdBench
Benchmark leaderboard
Semantic Diversity
Each model is asked to generate exactly 20 English words that are maximally semantically unrelated to each other. The score is the average pairwise semantic similarity of those words; lower is better.
anthropic/claude-haiku-4.5
x-ai/grok-4.1-fast
google/gemini-3.1-pro-preview
anthropic/claude-sonnet-4.6
anthropic/claude-opus-4.5
qwen/qwen3.5-397b-a17b
google/gemini-3.1-flash-lite-preview
openai/gpt-5.3-codex
qwen/qwen3.5-27b
z-ai/glm-5-turbo
stepfun/step-3.5-flash:free
z-ai/glm-5
minimax/minimax-m2.7
deepseek/deepseek-v3.2
moonshotai/kimi-k2.5
openai/gpt-5.1
meta-llama/llama-4-maverick
google/gemini-3-flash-preview
qwen/qwen3.5-122b-a10b
openai/gpt-5.4
xiaomi/mimo-v2-pro
openai/gpt-oss-120b
anthropic/claude-sonnet-4.5
mistralai/mistral-medium-3.1
minimax/minimax-m2.5
x-ai/grok-4.20-beta
inception/mercury-2
amazon/nova-pro-v1
openai/gpt-5.3-chat
mistralai/mistral-large-2512
openai/gpt-5.4-mini
amazon/nova-lite-v1
openai/gpt-oss-20b
amazon/nova-micro-v1
meta-llama/llama-4-scout
mistralai/mistral-small-2603
amazon/nova-2-lite-v1
Methodology
How scoring works
Each model generates 20 words, the words are embedded as vectors, and the score is the average semantic similarity across all pairs of words.
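A minimal sketch of the scoring step, assuming cosine similarity as the similarity measure and embedding vectors that have already been computed. The page names neither the embedding model nor the metric, so both are assumptions, and the function names are hypothetical. With 20 words there are 190 unordered pairs to average over.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def average_pairwise_similarity(vectors):
    """Mean similarity over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

Each pair is counted once and no word is compared with itself, so identical words would push the score up rather than being ignored.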
Prompt
Generate exactly 20 English words and return only a JSON array of lowercase single words.
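A runner presumably has to check that a reply actually matches the requested format before scoring it. The sketch below shows one way to do that; the function name and the exact rules (ASCII letters only, no hyphens) are assumptions, not the benchmark's documented behavior.

```python
import json


def parse_word_list(raw, expected_count=20):
    """Parse a model reply and check it is a JSON array of lowercase single words."""
    words = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(words, list):
        raise ValueError("reply is not a JSON array")
    if len(words) != expected_count:
        raise ValueError(f"expected {expected_count} words, got {len(words)}")
    for word in words:
        # Assumed rule: a "single word" is ASCII letters only, all lowercase.
        if not (isinstance(word, str) and word.isascii()
                and word.isalpha() and word.islower()):
            raise ValueError(f"not a lowercase single word: {word!r}")
    return words
```

How a real runner treats replies that fail validation (retry, penalize, or discard) is not stated on the page.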
Score
Lower is better: a lower score means the chosen words are less semantically related to one another.
Execution
Benchmark runners execute locally, cache results in Neon, and skip recomputation for models that already have stored scores.
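The skip-if-cached behavior can be sketched as below. A plain dictionary stands in for the Neon-backed store, and `run_pending` and `score_model` are hypothetical names; the real schema and queries are not described on the page.

```python
def run_pending(models, cache, score_model):
    """Score only the models that have no stored result yet.

    `cache` maps model id to score and stands in for the database table;
    `score_model` is a hypothetical callable that runs the benchmark once.
    """
    for model in models:
        if model in cache:
            continue  # stored score found, skip recomputation
        cache[model] = score_model(model)
    return cache
```

Under this scheme a model is scored once and its stored score is reused on later runs, so any change to the prompt or scoring would need the cache cleared to take effect.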