openai/gpt-oss-20b
WeirdBench
Benchmark leaderboard
Orthographic Diversity
Search for 20 real English words that are maximally different in spelling under hard validity rules and deterministic penalties. Higher is better.
stepfun/step-3.5-flash:free
z-ai/glm-5
amazon/nova-lite-v1
minimax/minimax-m2.7
mistralai/mistral-small-2603
anthropic/claude-sonnet-4.5
openai/gpt-5.1
openai/gpt-5.4-mini
mistralai/mistral-large-2512
x-ai/grok-4.1-fast
google/gemini-3-flash-preview
openai/gpt-5.3-codex
anthropic/claude-haiku-4.5
openai/gpt-5.3-chat
xiaomi/mimo-v2-pro
meta-llama/llama-4-maverick
anthropic/claude-opus-4.6
mistralai/mistral-medium-3.1
deepseek/deepseek-v3.2
openai/gpt-5.4
anthropic/claude-sonnet-4.6
google/gemini-3.1-pro-preview
amazon/nova-pro-v1
amazon/nova-2-lite-v1
google/gemini-3.1-flash-lite-preview
anthropic/claude-opus-4.5
amazon/nova-micro-v1
moonshotai/kimi-k2.5
openai/gpt-oss-120b
qwen/qwen3.5-397b-a17b
minimax/minimax-m2.5
meta-llama/llama-4-scout
inception/mercury-2
x-ai/grok-4.20-beta
Methodology
How scoring works
Each model generates 20 candidate words from one fixed prompt. The words are validated against the installed npm English word list plus format rules, then scored as the average pairwise Levenshtein distance minus deterministic penalties.
Prompt
Output exactly 20 real English words, one per line, 4 to 9 letters each, lowercase only, chosen to be as orthographically different from one another as possible.
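The format rules in this prompt can be checked mechanically. The following is a minimal sketch of such a check, not the benchmark's own code: the function name `validate_output`, the `dictionary` argument (standing in for the npm word list), and the choice to strip surrounding whitespace before checking are all assumptions made for illustration.

```python
import re

# 4 to 9 lowercase ASCII letters, matching the prompt's format rules.
WORD_PATTERN = re.compile(r"[a-z]{4,9}")


def validate_output(raw_output, dictionary, expected_count=20):
    """Split a model response into lines and sort them into valid and invalid words.

    `dictionary` is a hypothetical stand-in for the English word list: any
    container supporting `in`. Blank lines are ignored and surrounding
    whitespace is stripped, which is an assumption of this sketch.
    """
    lines = [line.strip() for line in raw_output.splitlines() if line.strip()]
    valid, invalid = [], []
    for line in lines:
        # A word must satisfy the format rule and appear in the word list.
        if WORD_PATTERN.fullmatch(line) and line in dictionary:
            valid.append(line)
        else:
            invalid.append(line)
    return {
        "valid": valid,
        "invalid": invalid,
        "count_ok": len(lines) == expected_count,
    }
```

Duplicates are deliberately left in the valid list here, since the scoring description treats them as a penalty rather than a format failure.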
Score
Higher is better. Raw score equals average pairwise Levenshtein distance minus penalties for invalid words, duplicates, trivial variants, shared prefixes and suffixes, and repeated character n-grams.
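To make the raw score concrete, here is a hedged sketch of the distance term. The page does not publish the penalty weights, so `total_penalty` is a hypothetical placeholder passed in by the caller, and the function names are illustrative rather than the benchmark's own.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), kept to two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def average_pairwise_distance(words):
    """Mean Levenshtein distance over all unordered pairs of words."""
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)


def raw_score(words, total_penalty=0.0):
    """Average pairwise distance minus a caller-supplied penalty total.

    The real penalty terms (invalid words, duplicates, trivial variants,
    shared prefixes and suffixes, repeated n-grams) are not specified in
    enough detail to reproduce, so they are not modeled here.
    """
    return average_pairwise_distance(words) - total_penalty
```

With 20 words the average runs over 190 pairs. Because every word is 4 to 9 letters long, no pair can differ by more than 9 edits, so the distance term is bounded above by 9 before any penalties are subtracted.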
Execution
Validation and scoring happen locally with no judge model and no human grading, and results are cached in Neon by benchmark and model ID.