openai/gpt-oss-20b
WeirdBench
Benchmark leaderboard
Orthographic Diversity
Find 20 real English words whose spellings differ from one another as much as possible, subject to hard validity rules and deterministic penalties. Higher is better.
stepfun/step-3.5-flash:free
z-ai/glm-5
amazon/nova-lite-v1
minimax/minimax-m2.7
mistralai/mistral-small-2603
anthropic/claude-sonnet-4.5
openai/gpt-5.1
anthropic/claude-opus-4.6
openai/gpt-5.4-mini
mistralai/mistral-large-2512
openai/gpt-5.4
moonshotai/kimi-k2.6
x-ai/grok-4.1-fast
google/gemini-3-flash-preview
openai/gpt-5.3-codex
openai/gpt-5.3-chat
xiaomi/mimo-v2-pro
anthropic/claude-haiku-4.5
openai/gpt-5.5
meta-llama/llama-4-maverick
mistralai/mistral-medium-3.1
deepseek/deepseek-v3.2
anthropic/claude-sonnet-4.6
google/gemini-3.1-pro-preview
anthropic/claude-opus-4.1
amazon/nova-pro-v1
amazon/nova-2-lite-v1
google/gemini-3.1-flash-lite-preview
anthropic/claude-opus-4.7
google/gemma-4-26b-a4b-it
anthropic/claude-opus-4.5
amazon/nova-micro-v1
moonshotai/kimi-k2.5
openai/gpt-oss-120b
qwen/qwen3.5-397b-a17b
minimax/minimax-m2.5
meta-llama/llama-4-scout
google/gemma-4-31b-it
inception/mercury-2
x-ai/grok-4.20-beta
Methodology
How scoring works
Each model generates 20 candidate words from one fixed prompt. The words are validated against an English word list installed from npm and against the format rules, then scored as the average pairwise Levenshtein distance minus deterministic penalties.
Prompt
Output exactly 20 real English words, one per line, 4 to 9 letters each, lowercase only, chosen to be as orthographically different from one another as possible.
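The format rules in the prompt can be checked mechanically. The following is a minimal sketch in Python, not the benchmark's own code: the function name `check_format`, the returned field names, and the choice to strip surrounding whitespace are all assumptions, and `dictionary` stands in for the npm word list, whose package is not named here.

```python
import re

# Format rule from the prompt: 4 to 9 letters, lowercase only.
WORD_RE = re.compile(r"[a-z]{4,9}")


def check_format(output: str, dictionary: set[str], expected: int = 20) -> dict:
    """Check a model's raw output against the prompt's format rules.

    `dictionary` is a hypothetical stand-in for the English word list.
    """
    # One word per line; surrounding whitespace is tolerated (an assumption).
    lines = [line.strip() for line in output.strip().splitlines()]

    # A word is invalid if it breaks the format or is not in the word list.
    invalid = [w for w in lines if not WORD_RE.fullmatch(w) or w not in dictionary]

    # Record every repeat of a word that has already appeared.
    seen: set[str] = set()
    duplicates = []
    for w in lines:
        if w in seen:
            duplicates.append(w)
        seen.add(w)

    return {
        "count_ok": len(lines) == expected,
        "invalid": invalid,
        "duplicates": duplicates,
    }
```

How invalid or duplicate words translate into penalty points is not specified here, so the sketch only reports them.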
Score
Higher is better. Raw score equals average pairwise Levenshtein distance minus penalties for invalid words, duplicates, trivial variants, shared prefixes and suffixes, and repeated character n-grams.
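As an illustration of the core metric, here is a minimal Python sketch of average pairwise Levenshtein distance. The function names are hypothetical, and the penalty is passed in as a single precomputed number because the individual penalty weights are not given here.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def mean_pairwise_distance(words: list[str]) -> float:
    """Average Levenshtein distance over all unordered pairs of words."""
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)


def raw_score(words: list[str], penalty: float = 0.0) -> float:
    """Average pairwise distance minus a precomputed total penalty.

    The penalty value is an assumption; its components are not modeled.
    """
    return mean_pairwise_distance(words) - penalty
```

With 20 words there are 190 unordered pairs, so the average is taken over 190 distances. Since every word has 4 to 9 letters, no single distance can exceed 9.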
Execution
Validation and scoring happen locally with no judge model and no human grading, and results are cached in Neon by benchmark and model ID.