WeirdBench

Benchmark leaderboard

Orthographic Diversity

The model must find 20 real English words that are maximally different in spelling, under hard validity rules and deterministic penalties.

Higher score is better.

Rank  Model                                   Score
   1  openai/gpt-oss-20b                      5.8053
   2  stepfun/step-3.5-flash:free             5.5754
   3  z-ai/glm-5                              5.4719
   4  amazon/nova-lite-v1                     5.4071
   5  minimax/minimax-m2.7                    5.3912
   6  mistralai/mistral-small-2603            5.3747
   7  anthropic/claude-sonnet-4.5             5.3018
   8  openai/gpt-5.1                          5.2269
   9  openai/gpt-5.4-mini                     5.0027
  10  mistralai/mistral-large-2512            4.9684
  11  x-ai/grok-4.1-fast                      4.8228
  12  google/gemini-3-flash-preview           4.8053
  13  openai/gpt-5.3-codex                    4.7614
  14  anthropic/claude-haiku-4.5              4.7018
  15  openai/gpt-5.3-chat                     4.6912
  16  xiaomi/mimo-v2-pro                      4.6491
  17  meta-llama/llama-4-maverick             4.5439
  18  anthropic/claude-opus-4.6               4.4947
  19  mistralai/mistral-medium-3.1            4.4394
  20  deepseek/deepseek-v3.2                  4.1892
  21  openai/gpt-5.4                          4.0602
  22  anthropic/claude-sonnet-4.6             3.9741
  23  google/gemini-3.1-pro-preview           3.9053
  24  amazon/nova-pro-v1                      3.8140
  25  amazon/nova-2-lite-v1                   3.8129
  26  google/gemini-3.1-flash-lite-preview    3.7789
  27  anthropic/claude-opus-4.5               1.9091
  28  amazon/nova-micro-v1                   -7.7158
  29  moonshotai/kimi-k2.5                  -22.5246
  30  openai/gpt-oss-120b                   -49.0741
  31  qwen/qwen3.5-397b-a17b                -51.1667
  32  minimax/minimax-m2.5                  -62.6741
  33  meta-llama/llama-4-scout              -67.0715
  34  inception/mercury-2                   -77.0000
  35  x-ai/grok-4.20-beta                   -77.0000

Methodology

How scoring works

Each run generates 20 candidate words from one fixed prompt, validates them against an English word list installed from npm plus the format rules, then scores the average pairwise Levenshtein distance minus deterministic penalties.
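
The distance term is straightforward to reproduce. Here is a minimal TypeScript sketch, assuming a textbook dynamic-programming Levenshtein distance and a plain average over all unordered pairs; the function names are illustrative, not the benchmark's actual code:

```ts
// Classic dynamic-programming Levenshtein distance, two-row variant.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substCost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,             // deletion
        curr[j - 1] + 1,         // insertion
        prev[j - 1] + substCost, // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Average distance over all C(n, 2) unordered pairs of candidate words.
function averagePairwiseDistance(words: string[]): number {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < words.length; i++) {
    for (let j = i + 1; j < words.length; j++) {
      total += levenshtein(words[i], words[j]);
      pairs++;
    }
  }
  return pairs > 0 ? total / pairs : 0;
}
```

Since every valid word has 4 to 9 letters, no pair can be more than 9 edits apart, so the distance term is bounded above by 9; the clean scores in the 4 to 6 range sit comfortably inside that bound.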

Prompt

Output exactly 20 real English words, one per line, 4 to 9 letters each, lowercase only, chosen to be as orthographically different from one another as possible.

Score

Higher is better. The raw score is the average pairwise Levenshtein distance minus penalties for invalid words, duplicates, trivial variants, shared prefixes and suffixes, and repeated character n-grams.
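
The description names the penalty categories but not their weights. The sketch below shows the shape such a deterministic penalty pass could take; the weights and thresholds are invented for illustration and are not the benchmark's published constants:

```ts
// Illustrative only: these weights and thresholds are invented for the
// sketch and are not the benchmark's published constants.
const WEIGHTS = { invalid: 5, duplicate: 3, sharedAffix: 0.5, repeatedNgram: 0.25 };

function totalPenalty(words: string[], dictionary: Set<string>): number {
  let penalty = 0;
  const seen = new Set<string>();
  const ngramCounts = new Map<string, number>();

  for (const w of words) {
    // Hard validity rules from the prompt: lowercase, 4-9 letters, real word.
    if (!/^[a-z]{4,9}$/.test(w) || !dictionary.has(w)) penalty += WEIGHTS.invalid;
    if (seen.has(w)) penalty += WEIGHTS.duplicate;
    seen.add(w);
    // Count character trigrams across the whole list.
    for (let k = 0; k + 3 <= w.length; k++) {
      const gram = w.slice(k, k + 3);
      ngramCounts.set(gram, (ngramCounts.get(gram) ?? 0) + 1);
    }
  }

  // Every reuse of a trigram beyond its first occurrence costs a little.
  for (const count of ngramCounts.values()) {
    if (count > 1) penalty += (count - 1) * WEIGHTS.repeatedNgram;
  }

  // Pairs sharing a long prefix or suffix; this also catches trivial
  // variants such as "walk"/"walks".
  for (let i = 0; i < words.length; i++) {
    for (let j = i + 1; j < words.length; j++) {
      if (sharedEnd(words[i], words[j], true) >= 3 || sharedEnd(words[i], words[j], false) >= 3) {
        penalty += WEIGHTS.sharedAffix;
      }
    }
  }
  return penalty;
}

// Length of the common prefix (fromStart = true) or common suffix of a and b.
function sharedEnd(a: string, b: string, fromStart: boolean): number {
  let k = 0;
  const n = Math.min(a.length, b.length);
  while (k < n && (fromStart ? a[k] === b[k] : a[a.length - 1 - k] === b[b.length - 1 - k])) k++;
  return k;
}
```

A scheme of this shape also explains the large negative scores at the bottom of the table: a response that fails validation wholesale accumulates fixed penalties that can far outweigh the bounded distance term.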

Execution

Validation and scoring run locally, with no judge model and no human grading; results are cached in Neon, keyed by benchmark and model ID.
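
Caching keyed by benchmark and model ID reduces to a small lookup-or-upsert pattern. A sketch using Neon's serverless Postgres driver follows; the results table name and columns are assumptions, not the site's actual schema:

```ts
import { neon } from "@neondatabase/serverless";

// Hypothetical table: results(benchmark_id text, model_id text, score numeric,
// primary key (benchmark_id, model_id)). The real schema is not published.
const sql = neon(process.env.DATABASE_URL!);

async function getCachedScore(benchmarkId: string, modelId: string): Promise<number | null> {
  const rows = await sql`
    SELECT score FROM results
    WHERE benchmark_id = ${benchmarkId} AND model_id = ${modelId}
  `;
  return rows.length > 0 ? Number(rows[0].score) : null;
}

async function cacheScore(benchmarkId: string, modelId: string, score: number): Promise<void> {
  await sql`
    INSERT INTO results (benchmark_id, model_id, score)
    VALUES (${benchmarkId}, ${modelId}, ${score})
    ON CONFLICT (benchmark_id, model_id) DO UPDATE SET score = EXCLUDED.score
  `;
}
```

Because scoring is deterministic and fully local, a cached (benchmark, model) entry should never go stale unless the word list or the penalty constants change.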