WeirdBench
Benchmark leaderboard
ai-writing-detection
AI Writing Detection
Classify essays from a fixed, balanced sample of 50 human-written and 50 AI-generated examples drawn from the AI Generated Essays Dataset. Higher is better.
google/gemini-3-flash-preview
anthropic/claude-sonnet-4.5
google/gemma-4-26b-a4b-it
google/gemma-4-31b-it
openai/gpt-5.3-chat
qwen/qwen3.5-122b-a10b
qwen/qwen3.5-397b-a17b
openai/gpt-5.1
x-ai/grok-4.1-fast
anthropic/claude-opus-4.6
anthropic/claude-sonnet-4.6
anthropic/claude-opus-4.5
google/gemini-3.1-flash-lite-preview
inception/mercury-2
openai/gpt-5.4
anthropic/claude-haiku-4.5
openai/gpt-oss-120b
minimax/minimax-m2.5
meta-llama/llama-4-maverick
mistralai/mistral-small-2603
openai/gpt-5.4-mini
mistralai/mistral-large-2512
x-ai/grok-4.20-beta
amazon/nova-micro-v1
mistralai/mistral-medium-3.1
deepseek/deepseek-v3.2
amazon/nova-2-lite-v1
amazon/nova-pro-v1
meta-llama/llama-4-scout
amazon/nova-lite-v1
Methodology
How scoring works
The runner reads a fixed, deterministic sample of 50 human-written and 50 AI-generated essays from the downloaded Kaggle dataset, prompts the model once per essay to predict whether the essay is AI-generated, and then computes binary-classification precision, recall, and F1 for the AI class.
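The loop above can be sketched as follows. The helper names (`evaluate`, `predict`) and the seeded-sampling approach are illustrative assumptions, not the benchmark's actual code; the real runner calls a model API in place of `predict`.

```python
import random

def evaluate(human_essays, ai_essays, predict, n=50, seed=0):
    """Illustrative benchmark loop: fixed sample, one prediction per essay.

    `predict` is a stand-in for the model call (returns 0 or 1).
    The fixed seed makes the sample deterministic across runs.
    """
    rng = random.Random(seed)
    sample = [(essay, 0) for essay in rng.sample(human_essays, n)]   # human = label 0
    sample += [(essay, 1) for essay in rng.sample(ai_essays, n)]     # AI = label 1
    y_true, y_pred = [], []
    for essay, label in sample:
        y_true.append(label)
        y_pred.append(predict(essay))  # one prompt per essay
    return y_true, y_pred
```

Precision, recall, and F1 are then computed from `y_true` and `y_pred` with label 1 as the positive class.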
Prompt
Each essay is shown once, and the model must return exactly one character: "1" for AI-generated or "0" for human-written, with no explanation.
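A minimal sketch of this contract is shown below. The prompt wording and the fallback behavior for malformed replies are assumptions; the source only specifies the one-character "1"/"0" output format.

```python
# Hypothetical prompt template; the benchmark's exact wording is not published here.
PROMPT = (
    "Is the following essay AI-generated? "
    'Reply with exactly one character: "1" for AI-generated or "0" for '
    "human-written. Do not explain.\n\nEssay:\n{essay}"
)

def parse_prediction(reply: str) -> int:
    """Map the model's raw reply to a 0/1 label.

    Assumed fallback: anything other than a lone "1" counts as 0 (human).
    """
    return 1 if reply.strip() == "1" else 0
```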
Score
Higher is better. The benchmark score is the F1 score for detecting AI-generated essays, using label 1 as the positive class.
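For reference, F1 for the positive class (label 1) can be computed directly from the true and predicted labels; this is a standard formula, not benchmark-specific code.

```python
def f1_for_ai_class(y_true, y_pred):
    """F1 score treating label 1 (AI-generated) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, on the balanced 50/50 sample a model that labels every essay "1" gets recall 1.0 but precision 0.5, for an F1 of about 0.67, so blind guessing does not score well.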
Execution
Benchmark runners execute locally: they read the dataset from disk, request predictions through OpenRouter with reasoning disabled, cache results in Neon keyed by benchmark and model ID, and skip recomputation for any model that already has a stored score.
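The cache-and-skip behavior can be sketched as below, using SQLite as a local stand-in for Neon (which is Postgres); the `scores` table name and columns are assumptions, since the real schema is not described here.

```python
import sqlite3

def cached_score(conn, benchmark_id, model_id):
    """Return the stored score for (benchmark, model), or None if absent."""
    row = conn.execute(
        "SELECT score FROM scores WHERE benchmark = ? AND model = ?",
        (benchmark_id, model_id),
    ).fetchone()
    return row[0] if row else None

def store_score(conn, benchmark_id, model_id, score):
    """Upsert a score so later runs can skip recomputation."""
    conn.execute(
        "INSERT OR REPLACE INTO scores (benchmark, model, score) VALUES (?, ?, ?)",
        (benchmark_id, model_id, score),
    )
    conn.commit()
```

A runner would call `cached_score` first and only evaluate the model (and then `store_score`) when it returns `None`.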