WeirdBench tests modern LLMs on the weird corners other evals skip.

Unconventional tasks, clear score direction, and a dead-simple score table. The benchmark definitions and code are written locally and published openly.


Recent benchmarks:

ai-writing-detection

AI Writing Detection

Classify essays from a fixed balanced sample of 50 human-written and 50 AI-generated examples from the AI Generated Essays Dataset. Higher is better.

Top Models

anthropic/claude-opus-4.1 1.000

anthropic/claude-opus-4.7 1.000

google/gemini-3-flash-preview 1.000

Higher score is better
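
The card above does not name its metric. A top score of 1.000 on a balanced 50/50 sample is consistent with plain accuracy, so the sketch below assumes that. The function name `accuracy` and the label strings `"human"` and `"ai"` are illustrative, not taken from the WeirdBench code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of essays whose predicted class matches the true class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Balanced sample as described: 50 human-written, 50 AI-generated essays.
labels = ["human"] * 50 + ["ai"] * 50
# Hypothetical model output that mislabels two human essays as AI.
predictions = ["human"] * 48 + ["ai"] * 52

score = accuracy(predictions, labels)
```

Because the sample is balanced, a model that always answers the same class would land at 0.5 under this assumed metric.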


nutrition-prediction

Nutrition Prediction

Predict calories, protein, carbs, and fat from ingredient lists for a fixed 50-dish Nutrition5k sample. Higher is better.

Top Models

mistralai/mistral-small-2603 23.823

openai/gpt-oss-20b 23.747

z-ai/glm-5-turbo 23.598

Higher score is better


semantic-diversity

Semantic Diversity

Generate exactly 20 English words that are maximally semantically unrelated to each other, then score the average pairwise semantic similarity. Lower is better.

Top Models

anthropic/claude-opus-4.6 0.216

anthropic/claude-opus-4.7 0.224

anthropic/claude-haiku-4.5 0.228

Lower score is better
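
As a rough illustration of "average pairwise semantic similarity", the sketch below takes the mean cosine similarity over every unordered pair of word vectors. For 20 words that is 20 × 19 / 2 = 190 pairs. The card does not say which embedding model or similarity function the benchmark uses, so cosine similarity and the tiny hand-made vectors here are assumptions.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(vectors: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy 3-dimensional "embeddings" (hypothetical; real ones would come
# from an embedding model that the card does not specify).
toy_vectors = {
    "anchor": [1.0, 0.0, 0.0],
    "violin": [0.0, 1.0, 0.0],
    "glacier": [0.0, 0.0, 1.0],
}

score = mean_pairwise_similarity(list(toy_vectors.values()))
```

Mutually orthogonal vectors give the lowest score this toy setup can reach without negative similarities, which matches the "lower is better" direction.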


orthographic-diversity

Orthographic Diversity

Search for 20 real English words that are maximally different in spelling under hard validity rules and deterministic penalties. Higher is better.

Top Models

openai/gpt-oss-20b 5.805

stepfun/step-3.5-flash:free 5.575

z-ai/glm-5 5.472

Higher score is better
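
The card does not spell out the scoring formula, the validity rules, or the penalties. Purely as an illustration of what "different in spelling" could mean, the sketch below averages the Levenshtein edit distance over all word pairs. Levenshtein distance is a swapped-in stand-in, not the benchmark's confirmed metric, and no penalties or dictionary checks are modelled.

```python
from itertools import combinations
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def mean_pairwise_distance(words: Sequence[str]) -> float:
    """Average edit distance over all unordered word pairs (assumed metric)."""
    pairs = list(combinations(words, 2))
    if not pairs:
        raise ValueError("need at least two words")
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)


example = mean_pairwise_distance(["cat", "cot", "dog"])
```

Under a measure like this, longer words with few shared letters push the average up, which fits the "higher is better" direction.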


wordle

Wordle

Play 20 recent Wordle answers turn by turn with standard gray/yellow/green feedback. Invalid guesses still cost a turn, scores are capped at 10 turns per puzzle, and lower is better.

Top Models

openai/gpt-5.3-codex 3.600

openai/gpt-5.3-chat 3.800

openai/gpt-5.5 4.000

Lower score is better
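
The standard gray/yellow/green feedback and the turn rules can be sketched as follows. The two-pass feedback is the usual Wordle rule for repeated letters. Scoring an unsolved puzzle as exactly 10 turns and averaging turns over puzzles are assumptions based on the "capped at 10 turns" wording, and `turns_used` and `valid_words` are hypothetical names.

```python
from collections import Counter
from typing import Collection, List, Sequence

MAX_TURNS = 10  # per-puzzle cap stated in the description


def feedback(guess: str, answer: str) -> List[str]:
    """Standard Wordle colours, handling repeated letters correctly."""
    result = ["gray"] * len(guess)
    unmatched = Counter()
    # Pass 1: mark greens and count answer letters not matched in place.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "green"
        else:
            unmatched[a] += 1
    # Pass 2: a non-green letter is yellow only while copies remain.
    for i, g in enumerate(guess):
        if result[i] != "green" and unmatched[g] > 0:
            result[i] = "yellow"
            unmatched[g] -= 1
    return result


def turns_used(guesses: Sequence[str], answer: str,
               valid_words: Collection[str]) -> int:
    """Turns taken to solve one puzzle, or MAX_TURNS if unsolved (assumed)."""
    for turn, guess in enumerate(guesses[:MAX_TURNS], start=1):
        if guess not in valid_words:
            continue  # invalid guess: the turn is still spent
        if guess == answer:
            return turn
    return MAX_TURNS


colours = feedback("speed", "abide")
turns = turns_used(["zzzzz", "crane", "abide"], "abide", {"crane", "abide"})
```

In the usage lines, the invalid first guess `zzzzz` counts as a turn, so the puzzle is solved on turn 3 rather than turn 2.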
