WeirdBench

Unconventional LLM benchmarks.

WeirdBench tests modern LLMs on the weird corners other evals skip.

Each benchmark pairs an unconventional task with a clear score direction and a dead-simple score table. The benchmark definitions and code are written locally and published openly.

Recent benchmarks:

ai-writing-detection

AI Writing Detection

Classify essays from a fixed balanced sample of 50 human-written and 50 AI-generated examples from the AI Generated Essays Dataset. Higher is better.
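As a rough illustration of how a score on this task could be computed, here is a minimal sketch that assumes the score is plain classification accuracy (the fraction of the 100 essays labeled correctly). The description does not name the metric, so that is an assumption, as are the `accuracy` helper, the label strings, and the toy predictions.

```python
def accuracy(predictions, labels):
    """Fraction of essays whose predicted label matches the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Balanced sample as described: 50 human-written and 50 AI-generated essays.
labels = ["human"] * 50 + ["ai"] * 50

# Hypothetical model output: 3 human essays mislabeled as AI,
# 2 AI essays mislabeled as human, everything else correct.
predictions = ["ai"] * 3 + ["human"] * 47 + ["human"] * 2 + ["ai"] * 48

score = accuracy(predictions, labels)
print(f"{score:.3f}")  # 95 of 100 correct -> 0.950
```

Under this assumption, a score of 1.000 would mean all 100 essays were classified correctly.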

Top Models

anthropic/claude-opus-4.1        1.000
anthropic/claude-opus-4.7        1.000
google/gemini-3-flash-preview    1.000
Higher score is better

nutrition-prediction

Nutrition Prediction

Predict calories, protein, carbs, and fat from ingredient lists for a fixed 50-dish Nutrition5k sample. Higher is better.

Top Models

mistralai/mistral-small-2603    23.823
openai/gpt-oss-20b              23.747
z-ai/glm-5-turbo                23.598
Higher score is better

semantic-diversity

Semantic Diversity

Generate exactly 20 English words that are maximally semantically unrelated to each other. The score is the average pairwise semantic similarity of the 20 words. Lower is better.
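The averaging step can be sketched as follows. This assumes "semantic similarity" means cosine similarity between word embedding vectors, which the description does not confirm. The 3-dimensional vectors here are toy values rather than real embeddings, and `cosine_similarity` and `mean_pairwise_similarity` are hypothetical helper names.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average similarity over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (assumed values, for illustration only).
toy_embeddings = {
    "anchor": [1.0, 0.0, 0.0],
    "violin": [0.0, 1.0, 0.0],
    "glacier": [0.0, 0.0, 1.0],
    "harbor": [1.0, 1.0, 0.0],
}

score = mean_pairwise_similarity(list(toy_embeddings.values()))
print(f"{score:.3f}")  # 6 pairs, two of them at 1/sqrt(2) -> 0.236
```

With 20 words there are 190 unordered pairs, so a single pair of closely related words moves the average only slightly, while a cluster of related words raises it noticeably.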

Top Models

anthropic/claude-opus-4.6     0.216
anthropic/claude-opus-4.7     0.224
anthropic/claude-haiku-4.5    0.228
Lower score is better

orthographic-diversity

Orthographic Diversity

Search for 20 real English words that are maximally different in spelling under hard validity rules and deterministic penalties. Higher is better.

Top Models

openai/gpt-oss-20b             5.805
stepfun/step-3.5-flash:free    5.575
z-ai/glm-5                     5.472
Higher score is better

wordle

Wordle

Play 20 recent Wordle answers turn by turn with standard gray/yellow/green feedback. Invalid guesses still cost a turn, scores are capped at 10 turns per puzzle, and lower is better.
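The rules above can be sketched roughly as below. The feedback function follows standard Wordle behavior, including duplicate letters. The turn counting rests on two assumptions not confirmed by the description: an unsolved puzzle scores the full 10-turn cap, and an invalid guess consumes a turn without being checked against the answer. The names `feedback`, `turns_used`, and `valid_words` are hypothetical.

```python
from collections import Counter

MAX_TURNS = 10  # per-puzzle cap stated in the benchmark description


def feedback(guess, answer):
    """Return 'G' (green), 'Y' (yellow), or '-' (gray) for each letter."""
    result = ["-"] * len(answer)
    # Answer letters that were not matched exactly stay available for yellows.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        elif remaining[g] > 0:
            # Right letter, wrong position; consume one available copy.
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)


def turns_used(guesses, answer, valid_words):
    """Turns taken to solve one puzzle, capped at MAX_TURNS.

    Assumption: an invalid guess still counts as a turn, and a puzzle
    that is never solved scores MAX_TURNS.
    """
    for turn, guess in enumerate(guesses[:MAX_TURNS], start=1):
        if guess in valid_words and guess == answer:
            return turn
    return MAX_TURNS


valid_words = {"slate", "crane", "speed", "abide"}
print(feedback("speed", "abide"))  # --Y-Y
print(turns_used(["slate", "xxxxx", "crane"], "crane", valid_words))  # 3
```

The benchmark score would then be the mean of `turns_used` over the 20 puzzles, which is why lower is better.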

Top Models

openai/gpt-5.3-codex    3.600
openai/gpt-5.3-chat     3.800
openai/gpt-5.5          4.000
Lower score is better