WeirdBench

Benchmark leaderboard

AI Writing Detection

Classify essays from a fixed balanced sample of 50 human-written and 50 AI-generated examples from the AI Generated Essays Dataset. Higher is better.

Rank  Model                                  Score
1     anthropic/claude-opus-4.1              1.0000
2     google/gemini-3-flash-preview          1.0000
3     anthropic/claude-sonnet-4.5            0.9899
4     google/gemma-4-26b-a4b-it              0.9899
5     google/gemma-4-31b-it                  0.9899
6     openai/gpt-5.3-chat                    0.9899
7     qwen/qwen3.5-122b-a10b                 0.9899
8     qwen/qwen3.5-397b-a17b                 0.9899
9     openai/gpt-5.1                         0.9804
10    x-ai/grok-4.1-fast                     0.9804
11    anthropic/claude-opus-4.6              0.9800
12    anthropic/claude-sonnet-4.6            0.9592
13    anthropic/claude-opus-4.5              0.9278
14    google/gemini-3.1-flash-lite-preview   0.8929
15    inception/mercury-2                    0.8750
16    openai/gpt-5.4                         0.8621
17    anthropic/claude-haiku-4.5             0.8454
18    openai/gpt-oss-120b                    0.8421
19    minimax/minimax-m2.5                   0.8000
20    meta-llama/llama-4-maverick            0.7321
21    mistralai/mistral-small-2603           0.6849
22    openai/gpt-5.4-mini                    0.6803
23    mistralai/mistral-large-2512           0.6667
24    x-ai/grok-4.20-beta                    0.6667
25    amazon/nova-micro-v1                   0.6438
26    mistralai/mistral-medium-3.1           0.6087
27    deepseek/deepseek-v3.2                 0.4819
28    amazon/nova-2-lite-v1                  0.4211
29    amazon/nova-pro-v1                     0.0357
30    meta-llama/llama-4-scout               0.0241
31    amazon/nova-lite-v1                    0.0000

Methodology

How scoring works

The runner reads a fixed, deterministic sample of 50 human essays and 50 AI-generated essays from the downloaded Kaggle dataset, prompts the model once per essay to predict whether the essay is AI-generated, and then computes binary-classification precision, recall, and F1 for the AI class.
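The fixed, deterministic balanced sample could be drawn as sketched below. The helper name, field name ("generated"), and seed are assumptions for illustration; the benchmark's actual selection logic may differ, but any fixed-seed draw gives the same 100 essays on every run.

```python
import random

def balanced_sample(rows, n_per_class=50, seed=0):
    """Draw a reproducible sample of n human (label 0) and n AI-generated
    (label 1) essays. Field name and seed are illustrative assumptions."""
    rng = random.Random(seed)  # fixed seed -> identical sample on every run
    human = [r for r in rows if r["generated"] == 0]
    ai = [r for r in rows if r["generated"] == 1]
    return rng.sample(human, n_per_class) + rng.sample(ai, n_per_class)
```

Because the RNG is reseeded on every call, two runs over the same dataset always score the same essays, which keeps leaderboard scores comparable.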

Prompt

Each essay is shown once and the model must return exactly one character: "1" for AI-generated or "0" for human-written, with no explanation.
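Since the model must answer with exactly one character, the runner presumably validates the reply strictly. A minimal sketch of that parsing step (the function name and error handling are assumptions):

```python
def parse_label(reply: str) -> int:
    """Strictly parse the model's one-character verdict.
    Anything other than "0" or "1" (after trimming whitespace) is rejected."""
    reply = reply.strip()
    if reply not in ("0", "1"):
        raise ValueError(f"expected '0' or '1', got {reply!r}")
    return int(reply)
```

Trimming whitespace tolerates trailing newlines while still rejecting any reply that includes an explanation.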

Score

Higher is better. The benchmark score is the F1 score for detecting AI-generated essays, using label 1 as the positive class.
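With label 1 as the positive class, the reported score follows directly from the standard F1 definition. A self-contained sketch of that computation:

```python
def ai_f1(y_true, y_pred):
    """F1 for the positive class (label 1 = AI-generated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

A model that labels everything human scores 0.0 (zero recall), while a perfect classifier scores 1.0, matching the extremes on the leaderboard.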

Execution

Benchmark runners execute locally: they read the dataset from disk, request predictions via OpenRouter with reasoning disabled, cache results in Neon keyed by benchmark and model ID, and skip recomputation for any model that already has a stored score.
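The skip-if-cached logic can be sketched as below. The table and column names are hypothetical, and the example uses an in-memory SQLite database (with `?` placeholders) purely as a stand-in; the real runner talks to Neon, a hosted Postgres service, through its own driver.

```python
import sqlite3

def get_or_compute_score(conn, benchmark_id, model_id, compute):
    """Return the cached score for (benchmark, model) if one exists,
    otherwise compute, store, and return it. Schema is hypothetical."""
    cur = conn.execute(
        "SELECT score FROM scores WHERE benchmark_id = ? AND model_id = ?",
        (benchmark_id, model_id))
    row = cur.fetchone()
    if row is not None:
        return row[0]  # stored score found: skip recomputation
    score = compute()  # run the full benchmark only on a cache miss
    conn.execute(
        "INSERT INTO scores (benchmark_id, model_id, score) VALUES (?, ?, ?)",
        (benchmark_id, model_id, score))
    return score

# In-memory stand-in for the real scores table:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (benchmark_id TEXT, model_id TEXT, score REAL)")
```

Keying the cache on both benchmark and model ID lets one table serve every benchmark on the site without collisions.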