WeirdBench

Benchmark leaderboard

Wordle

Name: Wordle
Creator: WeirdBench

Models play 20 recent Wordle puzzles turn by turn with standard gray/yellow/green feedback. Invalid guesses still cost a turn, each puzzle's score is capped at 10 turns, and lower is better.

Lower score is better

Model                                   Score
openai/gpt-5.3-codex                   3.6000
openai/gpt-5.3-chat                    3.8000
openai/gpt-5.5                         4.0000
anthropic/claude-opus-4.6              4.0500
openai/gpt-oss-120b                    4.2000
anthropic/claude-opus-4.7              4.3000
inception/mercury-2                    4.5500
anthropic/claude-sonnet-4.5            4.6500
anthropic/claude-opus-4.5              5.5500
google/gemini-3-flash-preview          6.8000
anthropic/claude-opus-4.1              6.8500
google/gemini-3.1-flash-lite-preview   8.2000
moonshotai/kimi-k2.6                   8.7000
anthropic/claude-haiku-4.5             9.2000
openai/gpt-5.4                         9.3500
google/gemma-4-31b-it                  9.6000
x-ai/grok-4.20-beta                    9.6000
deepseek/deepseek-v3.2                 9.7000
openai/gpt-5.4-mini                    9.7000
mistralai/mistral-medium-3.1           9.7500
openai/gpt-5.1                         9.7500
mistralai/mistral-large-2512           9.8500
google/gemma-4-26b-a4b-it              9.9500
meta-llama/llama-4-maverick            9.9500
amazon/nova-2-lite-v1                 10.0000
amazon/nova-lite-v1                   10.0000
amazon/nova-micro-v1                  10.0000
amazon/nova-pro-v1                    10.0000
google/gemini-3.1-pro-preview         10.0000
meta-llama/llama-4-scout              10.0000
minimax/minimax-m2.5                  10.0000
minimax/minimax-m2.7                  10.0000
mistralai/mistral-small-2603          10.0000
moonshotai/kimi-k2.5                  10.0000
openai/gpt-oss-20b                    10.0000
qwen/qwen3.5-27b                      10.0000
stepfun/step-3.5-flash:free           10.0000
xiaomi/mimo-v2-pro                    10.0000
z-ai/glm-5                            10.0000

Methodology

How scoring works

The benchmark uses a fixed set of 20 recent Wordle answers and runs a fresh chat loop for each puzzle. Feedback follows the standard duplicate-letter rules, and the reported score is the average number of turns needed to solve a puzzle.
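The duplicate-letter rule is the part of Wordle feedback that is easiest to get wrong, so here is a minimal sketch of one common way to implement it. The function name `wordle_feedback` and the `G`/`Y`/`-` output symbols are illustrative assumptions, not taken from the WeirdBench runner. Greens are assigned first, and yellows are then handed out only while unmatched copies of a letter remain in the answer.

```python
from collections import Counter


def wordle_feedback(guess: str, answer: str) -> str:
    """Return per-letter feedback: G = green, Y = yellow, - = gray.

    Hypothetical helper for illustration; the real judge may differ.
    """
    guess, answer = guess.lower(), answer.lower()
    result = ["-"] * len(guess)
    remaining = Counter()  # answer letters not consumed by a green match

    # Pass 1: mark exact-position matches and count leftover answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            remaining[a] += 1

    # Pass 2: mark yellows left to right while unmatched copies remain.
    for i, g in enumerate(guess):
        if result[i] != "G" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1

    return "".join(result)


print(wordle_feedback("speed", "abide"))  # --Y-Y
print(wordle_feedback("hello", "world"))  # ---GY
```

In the first call only one `e` in `speed` turns yellow, because `abide` contains a single `e`. In the second, the first `l` of `hello` stays gray because the only `l` in `world` is already consumed by the green match.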

Prompt

The model is told that it must reply with exactly one 5-letter word per turn, that any extra text is penalized, and that duplicate letters are allowed.
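A reply check consistent with this prompt might look like the sketch below. The name `parse_guess` is hypothetical, and the tolerance for surrounding whitespace and letter case is an assumption, since the page does not say how strictly the runner parses replies or whether guesses are checked against a word list.

```python
import re
from typing import Optional

# Exactly five ASCII letters and nothing else.
_GUESS_RE = re.compile(r"[A-Za-z]{5}")


def parse_guess(reply: str) -> Optional[str]:
    """Return the normalized guess, or None if the reply is invalid.

    Hypothetical helper: surrounding whitespace is ignored (an assumption),
    but any extra words or punctuation make the reply invalid.
    """
    candidate = reply.strip()
    if _GUESS_RE.fullmatch(candidate):
        return candidate.lower()
    return None


print(parse_guess("CRANE\n"))             # crane
print(parse_guess("My guess is crane."))  # None
```

Under the scoring rules described on this page, a reply that fails such a check would still use up a turn.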

Score

Lower is better. Each puzzle score is the turn the word is solved on, or 10 if the model never solves it within 10 turns. Invalid guesses still count as turns.
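The arithmetic can be sketched as follows, assuming each puzzle outcome is recorded as the turn it was solved on, or `None` if it was never solved. The names `puzzle_score` and `benchmark_score` and the sample outcomes are illustrative, not taken from the runner or from any model's actual results.

```python
from typing import Optional, Sequence

MAX_TURNS = 10  # per-puzzle cap stated in the scoring rules


def puzzle_score(solved_on: Optional[int]) -> int:
    """Turn the puzzle was solved on, or MAX_TURNS if it was never solved."""
    if solved_on is None:
        return MAX_TURNS
    return min(solved_on, MAX_TURNS)


def benchmark_score(solved_turns: Sequence[Optional[int]]) -> float:
    """Average per-puzzle score across all puzzles."""
    return sum(puzzle_score(t) for t in solved_turns) / len(solved_turns)


# Made-up outcomes for 20 puzzles: ten solved in 3, eight in 4, two unsolved.
turns = [3] * 10 + [4] * 8 + [None] * 2
print(f"{benchmark_score(turns):.4f}")  # (30 + 32 + 20) / 20 = 4.1000
```

With 20 puzzles and integer per-puzzle scores, every average is a multiple of 0.05, which matches the granularity of the values on the leaderboard.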

Execution

Benchmark runners execute locally, simulate the Wordle judge deterministically, cache results in Neon, and skip recomputation for models that already have stored scores.
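The skip-if-cached behavior can be illustrated with a small sketch. A plain dictionary stands in for the Neon-backed store, and `run_benchmark` and `evaluate` are hypothetical names. The actual storage schema and client code are not described on this page.

```python
from typing import Callable, Dict, Iterable


def run_benchmark(
    models: Iterable[str],
    cache: Dict[str, float],
    evaluate: Callable[[str], float],
) -> Dict[str, float]:
    """Score each model once, reusing any score already in the cache.

    `cache` is a stand-in for the persistent store (an assumption);
    `evaluate` plays all puzzles for one model and returns its average.
    """
    for model in models:
        if model in cache:
            continue  # stored score exists, so skip recomputation
        cache[model] = evaluate(model)
    return cache


# Usage with placeholder model names and a fake evaluator that records calls.
calls = []


def fake_evaluate(model: str) -> float:
    calls.append(model)
    return 5.0


stored = {"model-a": 4.25}
run_benchmark(["model-a", "model-b"], stored, fake_evaluate)
print(calls)   # ['model-b']
print(stored)  # {'model-a': 4.25, 'model-b': 5.0}
```

Because the judge is deterministic, a cached score only needs recomputing when the puzzle set, the prompt, or the model itself changes.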