WeirdBench

Benchmark leaderboard

Hidden Rule Sequence Continuation

Infer latent rules from procedurally generated sequences of numbers, symbols, or mixed tokens, then predict the next items. Higher is better.

Higher score is better.

Rank  Model                                  Score
1     openai/gpt-5.3-codex                   1.0000
2     openai/gpt-oss-120b                    1.0000
3     google/gemini-3.1-pro-preview          0.9500
4     qwen/qwen3.5-122b-a10b                 0.9500
5     qwen/qwen3.5-397b-a17b                 0.9500
6     z-ai/glm-5                             0.9500
7     openai/gpt-5.4-mini                    0.9000
8     xiaomi/mimo-v2-pro                     0.9000
9     openai/gpt-oss-20b                     0.8833
10    inception/mercury-2                    0.8500
11    mistralai/mistral-small-2603           0.8500
12    openai/gpt-5.3-chat                    0.8500
13    openai/gpt-5.4                         0.8500
14    z-ai/glm-5-turbo                       0.8500
15    qwen/qwen3.5-27b                       0.8333
16    deepseek/deepseek-v3.2                 0.8167
17    x-ai/grok-4.1-fast                     0.8167
18    anthropic/claude-opus-4.6              0.8000
19    google/gemini-3-flash-preview          0.8000
20    google/gemini-3.1-flash-lite-preview   0.8000
21    mistralai/mistral-medium-3.1           0.8000
22    moonshotai/kimi-k2.5                   0.8000
23    x-ai/grok-4.20-beta                    0.8000
24    mistralai/mistral-large-2512           0.7833
25    anthropic/claude-opus-4.5              0.7500
26    anthropic/claude-sonnet-4.5            0.7500
27    anthropic/claude-sonnet-4.6            0.7500
28    openai/gpt-5.1                         0.7500
29    anthropic/claude-haiku-4.5             0.7000
30    amazon/nova-pro-v1                     0.6500
31    amazon/nova-lite-v1                    0.6333
32    amazon/nova-micro-v1                   0.4333
33    amazon/nova-2-lite-v1                  0.4167
34    meta-llama/llama-4-scout               0.4167
35    meta-llama/llama-4-maverick            0.3500

Methodology

How scoring works

Held-out sequence cases are procedurally generated across several rule families: arithmetic recurrences, alternations, nested cycles, grammar-like expansions, and mixed-attribute transitions. Models are then scored on exact next-item accuracy against the held-out continuations.
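As a minimal sketch of how two such rule families might be generated, here are hypothetical generators for an arithmetic recurrence and a nested cycle. The function names, sequence lengths, and parameter ranges are illustrative assumptions, not the benchmark's actual generator:

```python
import random

def gen_arithmetic(rng, length=8, k=3):
    # Hidden rule: a_n = a_{n-1} + d for a constant difference d.
    start, d = rng.randint(0, 9), rng.randint(1, 5)
    seq = [start + i * d for i in range(length + k)]
    return seq[:length], seq[length:]  # (shown prefix, held-out continuation)

def gen_nested_cycle(rng, length=8, k=3):
    # Hidden rule: an outer cycle of symbols repeated according to an
    # inner cycle of counts, flattened into one token stream.
    symbols, counts = ["A", "B", "C"], [1, 2]
    flat, i = [], 0
    while len(flat) < length + k:
        flat += [symbols[i % len(symbols)]] * counts[i % len(counts)]
        i += 1
    return flat[:length], flat[length:length + k]

rng = random.Random(0)
prefix, answer = gen_arithmetic(rng, length=8, k=3)
```

The key property is that the continuation is generated by the same hidden rule as the prefix, so exact-match scoring is well defined.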

Prompt

Given a short sequence generated by a hidden rule, predict the next K items and return only a JSON array or object containing those items.
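Since the prompt asks for "only a JSON array or object", the harness has to pull a prediction out of a possibly chatty reply. A lenient parser along these lines would work; the function name and the exact recovery rules are assumptions about the implementation:

```python
import json
import re

def parse_prediction(reply, k):
    """Extract a k-item continuation from a model reply.

    Accepts a bare JSON array, or a JSON object containing one
    (e.g. {"next": [...]}). Returns None if nothing usable is found.
    """
    match = re.search(r"\[.*\]|\{.*\}", reply, re.DOTALL)  # first JSON-looking span
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict):  # object wrapper: take the first list-valued field
        data = next((v for v in data.values() if isinstance(v, list)), None)
    if isinstance(data, list) and len(data) == k:
        return data
    return None
```

Rejecting replies with the wrong number of items (returning None rather than padding) keeps the scoring strict.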

Score

Higher is better. The main score is exact next-item accuracy across all held-out continuation slots, with full-sequence match rate tracked in metadata.
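The scoring described above can be sketched as follows. This assumes a macro-average over cases (each case contributes its per-slot accuracy equally); whether the benchmark averages per case or per slot is not stated, so treat the aggregation as an assumption:

```python
def score_case(predicted, expected):
    # Exact next-item accuracy: each held-out slot scores 1 only on exact match.
    if predicted is None:  # unparseable reply scores zero
        return 0.0, False
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected), hits == len(expected)  # (slot accuracy, full match)

def aggregate(results):
    # results: one (slot_accuracy, full_match) pair per case.
    score = sum(r[0] for r in results) / len(results)
    full = sum(r[1] for r in results) / len(results)
    return {"score": score, "full_sequence_match_rate": full}
```

The full-sequence match rate is the stricter metric tracked in metadata: a case counts only if every held-out slot is correct.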

Execution

Cases are generated locally, model outputs are parsed and scored locally, and final results are cached in Neon by benchmark and model ID.
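Since Neon is managed Postgres, caching "by benchmark and model ID" suggests an idempotent upsert keyed on that pair. The table and column names below are hypothetical, and the cursor is any DB-API cursor (e.g. psycopg connected to the Neon database):

```python
# Hypothetical schema: benchmark_results(benchmark_id, model_id, score, metadata)
# with a unique constraint on (benchmark_id, model_id).
UPSERT_SQL = """
INSERT INTO benchmark_results (benchmark_id, model_id, score, metadata)
VALUES (%s, %s, %s, %s)
ON CONFLICT (benchmark_id, model_id)
DO UPDATE SET score = EXCLUDED.score, metadata = EXCLUDED.metadata;
"""

def cache_result(cursor, benchmark_id, model_id, score, metadata):
    # Re-running a benchmark overwrites the cached row instead of duplicating it.
    cursor.execute(UPSERT_SQL, (benchmark_id, model_id, score, metadata))
```

Keying on the pair makes re-runs safe: a fresh evaluation simply replaces the cached score for that model on that benchmark.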