openai/gpt-5.3-codex
WeirdBench
Benchmark leaderboard
Wordle
Play 20 recent Wordle answers turn by turn with standard gray/yellow/green feedback. Invalid guesses still cost a turn, scores are capped at 10 turns per puzzle, and lower is better.
openai/gpt-5.3-chat
anthropic/claude-opus-4.6
openai/gpt-oss-120b
inception/mercury-2
anthropic/claude-sonnet-4.5
anthropic/claude-opus-4.5
google/gemini-3-flash-preview
anthropic/claude-opus-4.1
google/gemini-3.1-flash-lite-preview
anthropic/claude-haiku-4.5
openai/gpt-5.4
x-ai/grok-4.20-beta
deepseek/deepseek-v3.2
openai/gpt-5.4-mini
mistralai/mistral-medium-3.1
openai/gpt-5.1
mistralai/mistral-large-2512
meta-llama/llama-4-maverick
amazon/nova-2-lite-v1
amazon/nova-lite-v1
amazon/nova-micro-v1
amazon/nova-pro-v1
meta-llama/llama-4-scout
mistralai/mistral-small-2603
Methodology
How scoring works
The benchmark uses a fixed set of 20 recent Wordle answers and runs a fresh chat loop for each puzzle. Feedback follows the standard duplicate-letter rules, and the reported score is the average number of turns needed to solve a puzzle.
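The duplicate-letter rule is the part of the judge that is easiest to get wrong, so here is a minimal sketch of how such feedback is commonly computed. The function name `wordle_feedback` and the `G`/`Y`/`-` output encoding are assumptions made for illustration; the benchmark's actual judge is not shown on this page.

```python
from collections import Counter


def wordle_feedback(guess: str, answer: str) -> str:
    """Return one character per letter: 'G' green, 'Y' yellow, '-' gray.

    Hypothetical helper: the name and output encoding are illustrative only.
    """
    guess, answer = guess.lower(), answer.lower()
    result = ["-"] * len(answer)

    # Answer letters that were NOT matched exactly stay available for yellows.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)

    # First pass: mark exact position matches as green.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"

    # Second pass: mark yellows left to right, consuming the remaining counts
    # so a repeated guess letter is never credited more often than it occurs.
    for i, g in enumerate(guess):
        if result[i] != "G" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1

    return "".join(result)


print(wordle_feedback("speed", "abide"))  # --Y-Y
print(wordle_feedback("geese", "these"))  # --GGG
```

In the first example only one of the two guessed "e" letters is marked yellow, because the answer contains a single "e".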
Prompt
The model is told to reply with exactly one 5-letter word per turn, that any extra text is penalized, and that duplicate letters are allowed.
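One way a runner could enforce the "exactly one 5-letter word" instruction is sketched below. The helper `parse_guess` is hypothetical, and the choices to trim surrounding whitespace and to skip any dictionary check are assumptions; the page does not say how the real runner validates replies.

```python
import re

# Exactly five ASCII letters and nothing else.
_FIVE_LETTERS = re.compile(r"[A-Za-z]{5}")


def parse_guess(reply: str):
    """Return the lowercase guess, or None if the reply is not a lone 5-letter word.

    Hypothetical helper: surrounding whitespace is tolerated (an assumption),
    but any extra text makes the reply invalid.
    """
    candidate = reply.strip()
    if _FIVE_LETTERS.fullmatch(candidate):
        return candidate.lower()
    return None


print(parse_guess("CRANE\n"))           # crane
print(parse_guess("My guess: crane"))   # None
```

A reply that returns `None` here would be treated as an invalid guess, which still costs a turn under the scoring rules.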
Score
Lower is better. Each puzzle score is the turn the word is solved on, or 10 if the model never solves it within 10 turns. Invalid guesses still count as turns.
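The scoring rule can be sketched as follows. The names `puzzle_score` and `benchmark_score` are illustrative, and representing an unsolved puzzle as `None` is an assumption of this sketch.

```python
MAX_TURNS = 10  # cap stated in the scoring rules


def puzzle_score(solved_on_turn):
    """Return the 1-based turn the puzzle was solved on, or MAX_TURNS if unsolved.

    `solved_on_turn` is None when the model never found the word (assumption).
    """
    if solved_on_turn is None or solved_on_turn > MAX_TURNS:
        return MAX_TURNS
    return solved_on_turn


def benchmark_score(solved_turns):
    """Average the per-puzzle scores; lower is better."""
    scores = [puzzle_score(turn) for turn in solved_turns]
    return sum(scores) / len(scores)


# Three solved puzzles and one failure: (3 + 4 + 10 + 5) / 4
print(benchmark_score([3, 4, None, 5]))  # 5.5
```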
Execution
Benchmark runners execute locally, simulate the Wordle judge deterministically, cache results in Neon, and skip recomputation for models that already have stored scores.
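The skip-if-cached behavior might look roughly like the sketch below. A plain dictionary stands in for the Neon-backed store, and `run_benchmark` and `evaluate` are hypothetical names; none of this reflects the runner's actual code or schema.

```python
def run_benchmark(models, cache, evaluate):
    """Evaluate only the models that have no stored score.

    `cache` maps model ID to score and stands in for the database (assumption).
    `evaluate` is any callable that plays the puzzles and returns a score.
    """
    for model in models:
        if model in cache:
            continue  # a stored score exists, so skip recomputation
        cache[model] = evaluate(model)
    return cache


# Usage with made-up model IDs and a stub evaluator.
stored = {"example/model-a": 4.2}
updated = run_benchmark(
    ["example/model-a", "example/model-b"],
    stored,
    evaluate=lambda model: 5.0,
)
print(updated)  # {'example/model-a': 4.2, 'example/model-b': 5.0}
```

Because the judge is deterministic, a stored score for a model on the fixed puzzle set does not depend on judge-side randomness, which is what makes skipping recomputation reasonable.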