WeirdBench

Benchmark leaderboard

Wordle

Each model plays 20 recent Wordle answers turn by turn with standard gray/yellow/green feedback. Invalid guesses still cost a turn, each puzzle's score is capped at 10 turns, and lower is better.

Lower score is better

Rank  Model                                   Score
   1  openai/gpt-5.3-codex                   3.6000
   2  openai/gpt-5.3-chat                    3.8000
   3  openai/gpt-5.5                         4.0000
   4  anthropic/claude-opus-4.6              4.0500
   5  openai/gpt-oss-120b                    4.2000
   6  anthropic/claude-opus-4.7              4.3000
   7  inception/mercury-2                    4.5500
   8  anthropic/claude-sonnet-4.5            4.6500
   9  anthropic/claude-opus-4.5              5.5500
  10  google/gemini-3-flash-preview          6.8000
  11  anthropic/claude-opus-4.1              6.8500
  12  google/gemini-3.1-flash-lite-preview   8.2000
  13  moonshotai/kimi-k2.6                   8.7000
  14  anthropic/claude-haiku-4.5             9.2000
  15  openai/gpt-5.4                         9.3500
  16  google/gemma-4-31b-it                  9.6000
  17  x-ai/grok-4.20-beta                    9.6000
  18  deepseek/deepseek-v3.2                 9.7000
  19  openai/gpt-5.4-mini                    9.7000
  20  mistralai/mistral-medium-3.1           9.7500
  21  openai/gpt-5.1                         9.7500
  22  mistralai/mistral-large-2512           9.8500
  23  google/gemma-4-26b-a4b-it              9.9500
  24  meta-llama/llama-4-maverick            9.9500
  25  amazon/nova-2-lite-v1                 10.0000
  26  amazon/nova-lite-v1                   10.0000
  27  amazon/nova-micro-v1                  10.0000
  28  amazon/nova-pro-v1                    10.0000
  29  google/gemini-3.1-pro-preview         10.0000
  30  meta-llama/llama-4-scout              10.0000
  31  minimax/minimax-m2.5                  10.0000
  32  minimax/minimax-m2.7                  10.0000
  33  mistralai/mistral-small-2603          10.0000
  34  moonshotai/kimi-k2.5                  10.0000
  35  openai/gpt-oss-20b                    10.0000
  36  qwen/qwen3.5-27b                      10.0000
  37  stepfun/step-3.5-flash:free           10.0000
  38  xiaomi/mimo-v2-pro                    10.0000
  39  z-ai/glm-5                            10.0000

Methodology

How scoring works

The benchmark uses a fixed set of 20 recent Wordle answers and runs a fresh chat loop for each puzzle. Feedback on each guess follows the standard duplicate-letter rules, and the reported score is the average number of turns needed to solve a puzzle.
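
The duplicate-letter rules are the part of the judge that is easiest to get wrong, so a minimal sketch may help. It assumes the conventional two-pass approach (greens first, then yellows limited by the unmatched letters left in the answer). The function name wordle_feedback and the G/Y/- output encoding are illustrative choices, not taken from the benchmark's code.

```python
from collections import Counter


def wordle_feedback(guess: str, answer: str) -> str:
    """Return one character per letter: G (green), Y (yellow), - (gray)."""
    guess, answer = guess.lower(), answer.lower()
    if len(guess) != len(answer):
        raise ValueError("guess and answer must have the same length")

    result = ["-"] * len(answer)

    # Pass 1: mark exact matches and count the answer letters left unmatched.
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            remaining[a] += 1

    # Pass 2: scanning left to right, a non-green letter is yellow only
    # while unmatched copies of it remain in the answer.
    for i, g in enumerate(guess):
        if result[i] != "G" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1

    return "".join(result)


print(wordle_feedback("speed", "abide"))  # --Y-Y
print(wordle_feedback("eerie", "there"))  # Y-Y-G
```

In the first call the answer contains a single "e", so only the first "e" in the guess is marked yellow and the second stays gray.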

Prompt

The model is told to reply with exactly one 5-letter word per turn, that any extra text is penalized, and that duplicate letters are allowed.
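
One way to enforce the one-word reply format is a strict full-string match. The sketch below is an assumption about how such a check could look: the name parse_guess is hypothetical, tolerating surrounding whitespace is a guess, and the source does not say whether the real runner also checks guesses against a word list.

```python
import re
from typing import Optional

# Exactly five ASCII letters and nothing else.
FIVE_LETTERS = re.compile(r"[A-Za-z]{5}")


def parse_guess(reply: str) -> Optional[str]:
    """Return the lowercase guess, or None if the reply breaks the format."""
    text = reply.strip()  # assumption: leading/trailing whitespace is tolerated
    if FIVE_LETTERS.fullmatch(text):
        return text.lower()
    return None  # the caller would treat this as an invalid guess


print(parse_guess(" Crane\n"))            # crane
print(parse_guess("My guess is crane"))   # None
```

Under this sketch a reply with any extra text is rejected outright, and the caller would count it as a used turn.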

Score

Lower is better. Each puzzle's score is the number of the turn on which the word is solved, or 10 if the model does not solve it within 10 turns. Invalid guesses still count as turns. The leaderboard score is the average over the 20 puzzles.
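
The scoring rule can be sketched in a few lines. The function names and the sample results below are made up for illustration. None stands for a puzzle that was never solved.

```python
from typing import Optional, Sequence

MAX_TURNS = 10  # cap stated in the scoring rules


def puzzle_score(solved_on: Optional[int]) -> int:
    """Turn on which the puzzle was solved, or the cap if it never was."""
    if solved_on is None or solved_on > MAX_TURNS:
        return MAX_TURNS
    return solved_on


def benchmark_score(solved_turns: Sequence[Optional[int]]) -> float:
    """Average puzzle score across all puzzles in the set."""
    scores = [puzzle_score(t) for t in solved_turns]
    return sum(scores) / len(scores)


# Illustrative data: three solves and one failure.
print(benchmark_score([3, 4, None, 5]))  # 5.5
```

With 20 puzzles and whole-number puzzle scores, every average is a multiple of 0.05, and a model that never solves a puzzle scores exactly 10.0.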

Execution

Benchmark runners execute locally, simulate the Wordle judge deterministically, cache results in Neon, and skip recomputation for models that already have stored scores.
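
The skip-if-cached behavior might look roughly like the sketch below. A plain dict stands in for the Neon-backed store, and run_missing, the runner callback, and the model names are all hypothetical. The real storage schema and client code are not described in the source.

```python
from typing import Callable, Dict, Iterable


def run_missing(
    models: Iterable[str],
    cache: Dict[str, float],
    run_benchmark: Callable[[str], float],
) -> Dict[str, float]:
    """Run the benchmark only for models that have no stored score."""
    for model in models:
        if model in cache:
            continue  # score already stored, skip recomputation
        cache[model] = run_benchmark(model)
    return cache


# Usage with a stand-in runner; a real runner would play the 20 puzzles.
stored = {"example/model-a": 3.6}
result = run_missing(
    ["example/model-a", "example/model-b"],
    stored,
    lambda model: 4.2,
)
print(result)  # {'example/model-a': 3.6, 'example/model-b': 4.2}
```

Because the judge is deterministic, a stored score only needs recomputing if the puzzle set or the prompt changes. This sketch does not handle that case.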