WeirdBench

Benchmark leaderboard

AI Writing Detection

Classify essays from a fixed balanced sample of 50 human-written and 50 AI-generated examples from the AI Generated Essays Dataset. Higher is better.

Rank  Model                                 F1 score
1     anthropic/claude-opus-4.1             1.0000
2     anthropic/claude-opus-4.7             1.0000
3     google/gemini-3-flash-preview         1.0000
4     openai/gpt-5.5                        1.0000
5     anthropic/claude-sonnet-4.5           0.9899
6     google/gemma-4-26b-a4b-it             0.9899
7     google/gemma-4-31b-it                 0.9899
8     moonshotai/kimi-k2.5                  0.9899
9     openai/gpt-5.3-chat                   0.9899
10    qwen/qwen3.5-122b-a10b                0.9899
11    qwen/qwen3.5-397b-a17b                0.9899
12    openai/gpt-5.1                        0.9804
13    x-ai/grok-4.1-fast                    0.9804
14    anthropic/claude-opus-4.6             0.9800
15    anthropic/claude-sonnet-4.6           0.9592
16    anthropic/claude-opus-4.5             0.9278
17    moonshotai/kimi-k2.6                  0.9009
18    google/gemini-3.1-flash-lite-preview  0.8929
19    inception/mercury-2                   0.8750
20    openai/gpt-5.4                        0.8621
21    anthropic/claude-haiku-4.5            0.8454
22    openai/gpt-oss-120b                   0.8421
23    minimax/minimax-m2.5                  0.8000
24    meta-llama/llama-4-maverick           0.7321
25    mistralai/mistral-small-2603          0.6849
26    openai/gpt-5.4-mini                   0.6803
27    mistralai/mistral-large-2512          0.6667
28    x-ai/grok-4.20-beta                   0.6667
29    amazon/nova-micro-v1                  0.6438
30    mistralai/mistral-medium-3.1          0.6087
31    deepseek/deepseek-v3.2                0.4819
32    amazon/nova-2-lite-v1                 0.4211
33    amazon/nova-pro-v1                    0.0357
34    meta-llama/llama-4-scout              0.0241
35    amazon/nova-lite-v1                   0.0000

Methodology

How scoring works

The runner reads a fixed, deterministic sample of 50 human essays and 50 AI essays from the downloaded Kaggle dataset, prompts the model once per essay to predict whether it is AI-generated, then computes binary classification precision, recall, and F1 for the AI class.
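
A minimal sketch of how a fixed, balanced sample could be drawn so that every model sees the same 100 essays. The function name `balanced_sample`, the `(text, label)` row shape, and the seed value are assumptions for illustration; the page does not say how the runner actually selects its sample.

```python
import random


def balanced_sample(rows, per_class=50, seed=0):
    """Pick `per_class` essays for each label in a reproducible way.

    `rows` is any iterable of (text, label) pairs, where label 0 means
    human-written and label 1 means AI-generated (assumed layout).
    """
    rows = list(rows)
    rng = random.Random(seed)  # private, seeded generator: same picks every run
    sample = []
    for label in (0, 1):
        # Sorting makes the result independent of the order rows were read in.
        pool = sorted(text for text, row_label in rows if row_label == label)
        # random.Random.sample raises ValueError if the pool is too small.
        sample.extend((text, label) for text in rng.sample(pool, per_class))
    return sample
```

Seeding a private `random.Random` instance, rather than the module-level generator, keeps the selection stable even if other code draws random numbers in the same process.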

Prompt

Each essay is shown once and the model must return exactly one character: "1" for AI-generated or "0" for human-written, with no explanation.
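
The single-character reply format can be checked with a small parser. This is a sketch under stated assumptions: the prompt wording in `build_prompt` is hypothetical (the real prompt text is not shown on this page), and stripping surrounding whitespace and returning `None` for anything other than "0" or "1" are illustrative choices, not the benchmark's documented behavior.

```python
def build_prompt(essay):
    # Hypothetical wording; only the "1"/"0" answer format comes from the page.
    return (
        "Decide whether the following essay is AI-generated.\n"
        'Reply with exactly one character: "1" for AI-generated or '
        '"0" for human-written. Do not explain.\n\n'
        f"Essay:\n{essay}"
    )


def parse_prediction(raw):
    """Return 1 or 0 for a valid reply, or None if the reply is malformed."""
    answer = raw.strip()  # assumption: tolerate stray whitespace or newlines
    if answer in ("0", "1"):
        return int(answer)
    return None
```

How a malformed reply should count toward the score (for example, as a wrong answer or as a skipped essay) is a policy decision that the caller would need to make.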

Score

Higher is better. The benchmark score is the F1 score for detecting AI-generated essays, using label 1 as the positive class.
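
For reference, F1 for the positive class can be computed from true positives, false positives, and false negatives. The sketch below uses plain Python; the function name and the convention of returning 0.0 when a denominator is zero are assumptions, not details confirmed by the page.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumed convention: treat an undefined ratio as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative case: 50 human essays all labeled correctly,
# 49 of 50 AI essays caught, one AI essay missed.
y_true = [0] * 50 + [1] * 50
y_pred = [0] * 50 + [1] * 49 + [0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(f1, 4))  # → 0.9899
```

On a 100-essay sample the score moves in coarse steps: a single missed AI essay with no false alarms already lowers F1 from 1.0000 to 98/99, which rounds to 0.9899.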

Execution

Benchmark runners execute locally, read the dataset from disk, use OpenRouter for predictions with reasoning disabled, cache results in Neon by benchmark and model ID, and skip recomputation for models that already have stored scores.
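
The skip-if-cached behavior can be sketched as a lookup keyed by benchmark and model ID. Here an in-memory dict stands in for the Neon database, and `evaluate` is a hypothetical callable that would run the model over the sample and return its score; neither reflects the runner's actual code.

```python
def run_benchmark(benchmark, model_ids, evaluate, cache):
    """Score each model once, reusing any score already stored in `cache`.

    `cache` maps (benchmark, model_id) to a score. A dict is used here as a
    stand-in for a database table (assumption).
    """
    results = {}
    for model_id in model_ids:
        key = (benchmark, model_id)
        if key not in cache:
            # Only models without a stored score trigger a fresh evaluation.
            cache[key] = evaluate(model_id)
        results[model_id] = cache[key]
    return results
```

Keying on both the benchmark and the model ID lets one store hold scores for several benchmarks without collisions, at the cost of never refreshing a score unless the stored entry is removed.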