WeirdBench

Benchmark leaderboard

Wordle

Play 20 recent Wordle answers turn by turn with standard gray/yellow/green feedback. Invalid guesses still cost a turn, scores are capped at 10 turns per puzzle, and lower is better.

Lower score is better

Rank  Model                                   Score
1     openai/gpt-5.3-codex                   3.6000
2     openai/gpt-5.3-chat                    3.8000
3     anthropic/claude-opus-4.6              4.0500
4     openai/gpt-oss-120b                    4.2000
5     inception/mercury-2                    4.5500
6     anthropic/claude-sonnet-4.5            4.6500
7     anthropic/claude-opus-4.5              5.5500
8     google/gemini-3-flash-preview          6.8000
9     anthropic/claude-opus-4.1              6.8500
10    google/gemini-3.1-flash-lite-preview   8.2000
11    anthropic/claude-haiku-4.5             9.2000
12    openai/gpt-5.4                         9.3500
13    x-ai/grok-4.20-beta                    9.6000
14    deepseek/deepseek-v3.2                 9.7000
15    openai/gpt-5.4-mini                    9.7000
16    mistralai/mistral-medium-3.1           9.7500
17    openai/gpt-5.1                         9.7500
18    mistralai/mistral-large-2512           9.8500
19    meta-llama/llama-4-maverick            9.9500
20    amazon/nova-2-lite-v1                 10.0000
21    amazon/nova-lite-v1                   10.0000
22    amazon/nova-micro-v1                  10.0000
23    amazon/nova-pro-v1                    10.0000
24    meta-llama/llama-4-scout              10.0000
25    mistralai/mistral-small-2603          10.0000

Methodology

How scoring works

Use a fixed set of 20 recent Wordle answers and run a fresh chat loop for each puzzle, with feedback that follows the standard duplicate-letter rules. The score is the average number of turns needed to solve a puzzle.
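
The duplicate-letter rules are the part of Wordle feedback that is easiest to get wrong, so here is a minimal sketch of one common way to implement them. It is not the benchmark's actual judge: the function name wordle_feedback and the G/Y/- output encoding are assumptions made for illustration. The idea is to mark greens first, then hand out yellows only while unmatched copies of a letter remain in the answer.

```python
from collections import Counter


def wordle_feedback(guess: str, answer: str) -> str:
    """Return a 5-character pattern: G = green, Y = yellow, - = gray.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    guess, answer = guess.lower(), answer.lower()
    if len(guess) != 5 or len(answer) != 5:
        raise ValueError("guess and answer must both be 5 letters")

    marks = ["-"] * 5
    unmatched = Counter()  # answer letters not consumed by a green

    # Pass 1: mark exact matches and count the answer letters left over.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            marks[i] = "G"
        else:
            unmatched[a] += 1

    # Pass 2: a non-green letter is yellow only while unmatched copies remain.
    for i, g in enumerate(guess):
        if marks[i] != "G" and unmatched[g] > 0:
            marks[i] = "Y"
            unmatched[g] -= 1

    return "".join(marks)


print(wordle_feedback("speed", "abide"))  # --Y-Y
print(wordle_feedback("eerie", "three"))  # Y-G-G
```

In the first example the answer contains a single "e", so only the first "e" in "speed" is marked yellow and the second is gray.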

Prompt

The model is told to reply with exactly one 5-letter word per turn, that any extra text is penalized, and that duplicate letters are allowed.
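
A reply check consistent with this prompt could look like the sketch below. The page does not say how extra text is penalized or whether guesses are checked against a word list, so this version makes two assumptions: a reply is only accepted if it is exactly five ASCII letters after trimming whitespace, and anything else is treated as an invalid guess. The name parse_guess is hypothetical.

```python
import re
from typing import Optional


def parse_guess(reply: str) -> Optional[str]:
    """Return the lowercase guess, or None if the reply is not a bare 5-letter word.

    Assumption: strict matching. Dictionary membership is not checked here.
    """
    candidate = reply.strip()
    if re.fullmatch(r"[A-Za-z]{5}", candidate):
        return candidate.lower()
    return None


print(parse_guess(" CRANE\n"))            # crane
print(parse_guess("My guess is crane"))   # None
```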

Score

Lower is better. Each puzzle score is the turn the word is solved on, or 10 if the model never solves it within 10 turns. Invalid guesses still count as turns.
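
The scoring rule can be sketched as follows. The function names and the convention of representing an invalid guess as None are assumptions for illustration, not the benchmark's own code. Note that under the stated rule a puzzle solved on turn 10 and a puzzle never solved both score 10.

```python
from statistics import mean
from typing import Optional, Sequence

MAX_TURNS = 10  # cap stated in the scoring rule


def puzzle_score(guesses: Sequence[Optional[str]], answer: str) -> int:
    """Turn on which the answer was guessed, or MAX_TURNS if it never was.

    An invalid guess is represented as None (an assumption) and still
    occupies a turn.
    """
    for turn, guess in enumerate(guesses[:MAX_TURNS], start=1):
        if guess is not None and guess.lower() == answer.lower():
            return turn
    return MAX_TURNS


def benchmark_score(puzzle_scores: Sequence[int]) -> float:
    """Average the per-puzzle scores; lower is better."""
    return mean(puzzle_scores)


print(puzzle_score(["crane", None, "slate"], "slate"))  # 3
print(benchmark_score([3, 4, 10, 2]))                   # 4.75
```

With 20 puzzles, every leaderboard value is a multiple of 0.05; for example, 3.6000 corresponds to 72 total turns across the set.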

Execution

Benchmark runners execute locally, simulate the Wordle judge deterministically, cache results in Neon, and skip recomputation for models that already have stored scores.
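
The skip-if-cached behavior can be illustrated with a small sketch. A plain dictionary stands in for the Neon-backed store here, since the page does not describe the schema or client used; score_with_cache and the run_benchmark callback are hypothetical names.

```python
from typing import Callable, Dict


def score_with_cache(
    model: str,
    store: Dict[str, float],
    run_benchmark: Callable[[str], float],
) -> float:
    """Return the stored score for a model, running the benchmark only if missing.

    `store` is an in-memory stand-in for the real database (an assumption).
    """
    if model in store:
        return store[model]  # already scored: skip recomputation
    score = run_benchmark(model)
    store[model] = score
    return score


# Usage with a fake runner that records how often it is called.
calls = []

def fake_runner(model: str) -> float:
    calls.append(model)
    return 4.2

store: Dict[str, float] = {}
score_with_cache("example/model", store, fake_runner)
score_with_cache("example/model", store, fake_runner)
print(len(calls))  # 1
```

Because the judge is simulated deterministically, reusing a stored score gives the same result as rerunning the judge on the same guesses.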