WeirdBench

Benchmark leaderboard

Orthographic Diversity

The model must find 20 real English words that are maximally different in spelling, under hard validity rules and deterministic penalties.

Higher score is better.

Rank  Model                                   Score
   1  openai/gpt-oss-20b                      5.8053
   2  stepfun/step-3.5-flash:free             5.5754
   3  z-ai/glm-5                              5.4719
   4  amazon/nova-lite-v1                     5.4071
   5  minimax/minimax-m2.7                    5.3912
   6  mistralai/mistral-small-2603            5.3747
   7  anthropic/claude-sonnet-4.5             5.3018
   8  openai/gpt-5.1                          5.2269
   9  openai/gpt-5.4-mini                     5.0027
  10  mistralai/mistral-large-2512            4.9684
  11  x-ai/grok-4.1-fast                      4.8228
  12  google/gemini-3-flash-preview           4.8053
  13  openai/gpt-5.3-codex                    4.7614
  14  anthropic/claude-haiku-4.5              4.7018
  15  openai/gpt-5.3-chat                     4.6912
  16  xiaomi/mimo-v2-pro                      4.6491
  17  meta-llama/llama-4-maverick             4.5439
  18  anthropic/claude-opus-4.6               4.4947
  19  mistralai/mistral-medium-3.1            4.4394
  20  deepseek/deepseek-v3.2                  4.1892
  21  openai/gpt-5.4                          4.0602
  22  anthropic/claude-sonnet-4.6             3.9741
  23  google/gemini-3.1-pro-preview           3.9053
  24  amazon/nova-pro-v1                      3.8140
  25  amazon/nova-2-lite-v1                   3.8129
  26  google/gemini-3.1-flash-lite-preview    3.7789
  27  anthropic/claude-opus-4.5               1.9091
  28  amazon/nova-micro-v1                   -7.7158
  29  moonshotai/kimi-k2.5                  -22.5246
  30  openai/gpt-oss-120b                   -49.0741
  31  qwen/qwen3.5-397b-a17b                -51.1667
  32  minimax/minimax-m2.5                  -62.6741
  33  meta-llama/llama-4-scout              -67.0715
  34  inception/mercury-2                   -77.0000
  35  x-ai/grok-4.20-beta                   -77.0000

Methodology

How scoring works

Each run generates 20 candidate words from one fixed prompt, validates them against an English word list installed from npm plus the format rules, then scores the average pairwise Levenshtein distance minus deterministic penalties.
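
The distance term is straightforward to reproduce. Here is a minimal TypeScript sketch, assuming a textbook dynamic-programming Levenshtein distance and a plain average over all unordered pairs; the function names are illustrative, not the benchmark's actual code:

```ts
// Classic dynamic-programming Levenshtein distance, two-row variant.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substCost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,             // deletion
        curr[j - 1] + 1,         // insertion
        prev[j - 1] + substCost, // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Average distance over all C(n, 2) unordered pairs of candidate words.
function averagePairwiseDistance(words: string[]): number {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < words.length; i++) {
    for (let j = i + 1; j < words.length; j++) {
      total += levenshtein(words[i], words[j]);
      pairs++;
    }
  }
  return pairs > 0 ? total / pairs : 0;
}
```

Since every valid word has 4 to 9 letters, no pair can be more than 9 edits apart, so the distance term is bounded above by 9; the clean scores in the 4 to 6 range sit comfortably inside that bound.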

Prompt

Output exactly 20 real English words, one per line, 4 to 9 letters each, lowercase only, chosen to be as orthographically different from one another as possible.

Score

Higher is better. The raw score is the average pairwise Levenshtein distance minus penalties for invalid words, duplicates, trivial variants, shared prefixes and suffixes, and repeated character n-grams.
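
The description names the penalty categories but not their weights. The sketch below shows the shape such a deterministic penalty pass could take; the weights and thresholds are invented for illustration and are not the benchmark's published constants:

```ts
// Illustrative only: these weights and thresholds are invented for the
// sketch and are not the benchmark's published constants.
const WEIGHTS = { invalid: 5, duplicate: 3, sharedAffix: 0.5, repeatedNgram: 0.25 };

function totalPenalty(words: string[], dictionary: Set<string>): number {
  let penalty = 0;
  const seen = new Set<string>();
  const ngramCounts = new Map<string, number>();

  for (const w of words) {
    // Hard validity rules from the prompt: lowercase, 4-9 letters, real word.
    if (!/^[a-z]{4,9}$/.test(w) || !dictionary.has(w)) penalty += WEIGHTS.invalid;
    if (seen.has(w)) penalty += WEIGHTS.duplicate;
    seen.add(w);
    // Count character trigrams across the whole list.
    for (let k = 0; k + 3 <= w.length; k++) {
      const gram = w.slice(k, k + 3);
      ngramCounts.set(gram, (ngramCounts.get(gram) ?? 0) + 1);
    }
  }

  // Every reuse of a trigram beyond its first occurrence costs a little.
  for (const count of ngramCounts.values()) {
    if (count > 1) penalty += (count - 1) * WEIGHTS.repeatedNgram;
  }

  // Pairs sharing a long prefix or suffix; this also catches trivial
  // variants such as "walk"/"walks".
  for (let i = 0; i < words.length; i++) {
    for (let j = i + 1; j < words.length; j++) {
      if (sharedEnd(words[i], words[j], true) >= 3 || sharedEnd(words[i], words[j], false) >= 3) {
        penalty += WEIGHTS.sharedAffix;
      }
    }
  }
  return penalty;
}

// Length of the common prefix (fromStart = true) or common suffix of a and b.
function sharedEnd(a: string, b: string, fromStart: boolean): number {
  let k = 0;
  const n = Math.min(a.length, b.length);
  while (k < n && (fromStart ? a[k] === b[k] : a[a.length - 1 - k] === b[b.length - 1 - k])) k++;
  return k;
}
```

A scheme of this shape also explains the large negative scores at the bottom of the table: a response that fails validation wholesale accumulates fixed penalties that can far outweigh the bounded distance term.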

Execution

Validation and scoring run locally, with no judge model and no human grading; results are cached in Neon, keyed by benchmark and model ID.
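
Caching keyed by benchmark and model ID reduces to a small lookup-or-upsert pattern. A sketch using Neon's serverless Postgres driver follows; the results table name and columns are assumptions, not the site's actual schema:

```ts
import { neon } from "@neondatabase/serverless";

// Hypothetical table: results(benchmark_id text, model_id text, score numeric,
// primary key (benchmark_id, model_id)). The real schema is not published.
const sql = neon(process.env.DATABASE_URL!);

async function getCachedScore(benchmarkId: string, modelId: string): Promise<number | null> {
  const rows = await sql`
    SELECT score FROM results
    WHERE benchmark_id = ${benchmarkId} AND model_id = ${modelId}
  `;
  return rows.length > 0 ? Number(rows[0].score) : null;
}

async function cacheScore(benchmarkId: string, modelId: string, score: number): Promise<void> {
  await sql`
    INSERT INTO results (benchmark_id, model_id, score)
    VALUES (${benchmarkId}, ${modelId}, ${score})
    ON CONFLICT (benchmark_id, model_id) DO UPDATE SET score = EXCLUDED.score
  `;
}
```

Because scoring is deterministic and fully local, a cached (benchmark, model) entry should never go stale unless the word list or the penalty constants change.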