WeirdBench

Benchmark leaderboard

Orthographic Diversity

Search for 20 real English words that are maximally different in spelling under hard validity rules and deterministic penalties. Higher is better.

| Rank | Model | Score |
|------|-------|-------|
| 1 | openai/gpt-oss-20b | 5.8053 |
| 2 | stepfun/step-3.5-flash:free | 5.5754 |
| 3 | z-ai/glm-5 | 5.4719 |
| 4 | amazon/nova-lite-v1 | 5.4071 |
| 5 | minimax/minimax-m2.7 | 5.3912 |
| 6 | mistralai/mistral-small-2603 | 5.3747 |
| 7 | anthropic/claude-sonnet-4.5 | 5.3018 |
| 8 | openai/gpt-5.1 | 5.2269 |
| 9 | anthropic/claude-opus-4.6 | 5.1383 |
| 10 | openai/gpt-5.4-mini | 5.0027 |
| 11 | mistralai/mistral-large-2512 | 4.9684 |
| 12 | openai/gpt-5.4 | 4.8728 |
| 13 | moonshotai/kimi-k2.6 | 4.8386 |
| 14 | x-ai/grok-4.1-fast | 4.8228 |
| 15 | google/gemini-3-flash-preview | 4.8053 |
| 16 | openai/gpt-5.3-codex | 4.7614 |
| 17 | openai/gpt-5.3-chat | 4.6912 |
| 18 | xiaomi/mimo-v2-pro | 4.6491 |
| 19 | anthropic/claude-haiku-4.5 | 4.5919 |
| 20 | openai/gpt-5.5 | 4.5700 |
| 21 | meta-llama/llama-4-maverick | 4.5439 |
| 22 | mistralai/mistral-medium-3.1 | 4.4394 |
| 23 | deepseek/deepseek-v3.2 | 4.1892 |
| 24 | anthropic/claude-sonnet-4.6 | 3.9741 |
| 25 | google/gemini-3.1-pro-preview | 3.9053 |
| 26 | anthropic/claude-opus-4.1 | 3.8281 |
| 27 | amazon/nova-pro-v1 | 3.8140 |
| 28 | amazon/nova-2-lite-v1 | 3.8129 |
| 29 | google/gemini-3.1-flash-lite-preview | 3.7789 |
| 30 | anthropic/claude-opus-4.7 | 3.4555 |
| 31 | google/gemma-4-26b-a4b-it | 2.7495 |
| 32 | anthropic/claude-opus-4.5 | 1.9091 |
| 33 | amazon/nova-micro-v1 | -7.7158 |
| 34 | moonshotai/kimi-k2.5 | -22.5246 |
| 35 | openai/gpt-oss-120b | -49.0741 |
| 36 | qwen/qwen3.5-397b-a17b | -51.1667 |
| 37 | minimax/minimax-m2.5 | -62.6741 |
| 38 | meta-llama/llama-4-scout | -67.0715 |
| 39 | google/gemma-4-31b-it | -74.0000 |
| 40 | inception/mercury-2 | -77.0000 |
| 41 | x-ai/grok-4.20-beta | -77.0000 |

Methodology

How scoring works

Each run generates 20 candidate words from one fixed prompt, validates them against the installed npm English word list plus the format rules, then scores the average pairwise Levenshtein distance minus deterministic penalties.
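The validation step can be sketched as follows. This is an illustrative approximation, not the project's actual code: a small in-memory `Set` stands in for the installed npm English word list, and `isValidCandidate` is a hypothetical function name.

```typescript
// Stand-in for the npm English word list used by the real benchmark.
const WORD_LIST = new Set(["quartz", "jumbo", "banana", "syzygy"]);

// Hypothetical validity check mirroring the stated rules:
// real English word, 4-9 letters, lowercase a-z only, no duplicates.
function isValidCandidate(word: string, seen: Set<string>): boolean {
  if (!/^[a-z]{4,9}$/.test(word)) return false; // format rule
  if (seen.has(word)) return false;             // duplicate
  if (!WORD_LIST.has(word)) return false;       // must be a real word
  seen.add(word);
  return true;
}
```

Invalid entries are not silently dropped; per the scoring rule, they also feed into the penalty term.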

Prompt

Output exactly 20 real English words, one per line, 4 to 9 letters each, lowercase only, chosen to be as orthographically different from one another as possible.

Score

Higher is better. Raw score equals average pairwise Levenshtein distance minus penalties for invalid words, duplicates, trivial variants, shared prefixes and suffixes, and repeated character n-grams.
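The distance component of the raw score can be sketched like this (penalties omitted); `levenshtein` and `averagePairwiseDistance` are hypothetical names for illustration, not the benchmark's actual API.

```typescript
// Standard Levenshtein edit distance with a single rolling row, O(|a|*|b|).
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // holds dp[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1,     // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Mean edit distance over all unordered pairs of candidate words.
function averagePairwiseDistance(words: string[]): number {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < words.length; i++) {
    for (let j = i + 1; j < words.length; j++) {
      total += levenshtein(words[i], words[j]);
      pairs++;
    }
  }
  return pairs > 0 ? total / pairs : 0;
}
```

With 20 words there are 190 unordered pairs, so a single near-duplicate pair moves the average only slightly, which is why the deterministic penalties exist as a separate term.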

Execution

Validation and scoring run locally, with no judge model and no human grading; results are cached in Neon, keyed by benchmark and model ID.