WeirdBench

Benchmark leaderboard

AI Writing Detection

Classify essays from a fixed balanced sample of 50 human-written and 50 AI-generated examples from the AI Generated Essays Dataset. Higher is better.

Rank  Model                                  Score
1     anthropic/claude-opus-4.1              1.0000
2     google/gemini-3-flash-preview          1.0000
3     anthropic/claude-sonnet-4.5            0.9899
4     google/gemma-4-26b-a4b-it              0.9899
5     google/gemma-4-31b-it                  0.9899
6     openai/gpt-5.3-chat                    0.9899
7     qwen/qwen3.5-122b-a10b                 0.9899
8     qwen/qwen3.5-397b-a17b                 0.9899
9     openai/gpt-5.1                         0.9804
10    x-ai/grok-4.1-fast                     0.9804
11    anthropic/claude-opus-4.6              0.9800
12    anthropic/claude-sonnet-4.6            0.9592
13    anthropic/claude-opus-4.5              0.9278
14    google/gemini-3.1-flash-lite-preview   0.8929
15    inception/mercury-2                    0.8750
16    openai/gpt-5.4                         0.8621
17    anthropic/claude-haiku-4.5             0.8454
18    openai/gpt-oss-120b                    0.8421
19    minimax/minimax-m2.5                   0.8000
20    meta-llama/llama-4-maverick            0.7321
21    mistralai/mistral-small-2603           0.6849
22    openai/gpt-5.4-mini                    0.6803
23    mistralai/mistral-large-2512           0.6667
24    x-ai/grok-4.20-beta                    0.6667
25    amazon/nova-micro-v1                   0.6438
26    mistralai/mistral-medium-3.1           0.6087
27    deepseek/deepseek-v3.2                 0.4819
28    amazon/nova-2-lite-v1                  0.4211
29    amazon/nova-pro-v1                     0.0357
30    meta-llama/llama-4-scout               0.0241
31    amazon/nova-lite-v1                    0.0000

Methodology

How scoring works

The runner reads a fixed, deterministic sample of 50 human essays and 50 AI-generated essays from the downloaded Kaggle dataset, prompts the model once per essay to predict whether the essay is AI-generated, and then computes binary-classification precision, recall, and F1 for the AI class.
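The fixed, deterministic balanced sample could be drawn as sketched below. The helper name, field name ("generated"), and seed are assumptions for illustration; the benchmark's actual selection logic may differ, but any fixed-seed draw gives the same 100 essays on every run.

```python
import random

def balanced_sample(rows, n_per_class=50, seed=0):
    """Draw a reproducible sample of n human (label 0) and n AI-generated
    (label 1) essays. Field name and seed are illustrative assumptions."""
    rng = random.Random(seed)  # fixed seed -> identical sample on every run
    human = [r for r in rows if r["generated"] == 0]
    ai = [r for r in rows if r["generated"] == 1]
    return rng.sample(human, n_per_class) + rng.sample(ai, n_per_class)
```

Because the RNG is reseeded on every call, two runs over the same dataset always score the same essays, which keeps leaderboard scores comparable.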

Prompt

Each essay is shown once and the model must return exactly one character: "1" for AI-generated or "0" for human-written, with no explanation.
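Since the model must answer with exactly one character, the runner presumably validates the reply strictly. A minimal sketch of that parsing step (the function name and error handling are assumptions):

```python
def parse_label(reply: str) -> int:
    """Strictly parse the model's one-character verdict.
    Anything other than "0" or "1" (after trimming whitespace) is rejected."""
    reply = reply.strip()
    if reply not in ("0", "1"):
        raise ValueError(f"expected '0' or '1', got {reply!r}")
    return int(reply)
```

Trimming whitespace tolerates trailing newlines while still rejecting any reply that includes an explanation.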

Score

Higher is better. The benchmark score is the F1 score for detecting AI-generated essays, using label 1 as the positive class.
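With label 1 as the positive class, the reported score follows directly from the standard F1 definition. A self-contained sketch of that computation:

```python
def ai_f1(y_true, y_pred):
    """F1 for the positive class (label 1 = AI-generated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

A model that labels everything human scores 0.0 (zero recall), while a perfect classifier scores 1.0, matching the extremes on the leaderboard.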

Execution

Benchmark runners execute locally: they read the dataset from disk, request predictions via OpenRouter with reasoning disabled, cache results in Neon keyed by benchmark and model ID, and skip recomputation for any model that already has a stored score.
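The skip-if-cached logic can be sketched as below. The table and column names are hypothetical, and the example uses an in-memory SQLite database (with `?` placeholders) purely as a stand-in; the real runner talks to Neon, a hosted Postgres service, through its own driver.

```python
import sqlite3

def get_or_compute_score(conn, benchmark_id, model_id, compute):
    """Return the cached score for (benchmark, model) if one exists,
    otherwise compute, store, and return it. Schema is hypothetical."""
    cur = conn.execute(
        "SELECT score FROM scores WHERE benchmark_id = ? AND model_id = ?",
        (benchmark_id, model_id))
    row = cur.fetchone()
    if row is not None:
        return row[0]  # stored score found: skip recomputation
    score = compute()  # run the full benchmark only on a cache miss
    conn.execute(
        "INSERT INTO scores (benchmark_id, model_id, score) VALUES (?, ?, ?)",
        (benchmark_id, model_id, score))
    return score

# In-memory stand-in for the real scores table:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (benchmark_id TEXT, model_id TEXT, score REAL)")
```

Keying the cache on both benchmark and model ID lets one table serve every benchmark on the site without collisions.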