WeirdBench
Benchmark leaderboard
hidden-rule-sequence
Hidden Rule Sequence Continuation
Infer latent rules from procedurally generated sequences of numbers, symbols, or mixed tokens, then predict the next items. Higher is better.
openai/gpt-5.3-codex
openai/gpt-oss-120b
google/gemini-3.1-pro-preview
qwen/qwen3.5-122b-a10b
qwen/qwen3.5-397b-a17b
z-ai/glm-5
openai/gpt-5.4-mini
xiaomi/mimo-v2-pro
openai/gpt-oss-20b
inception/mercury-2
mistralai/mistral-small-2603
openai/gpt-5.3-chat
openai/gpt-5.4
z-ai/glm-5-turbo
qwen/qwen3.5-27b
deepseek/deepseek-v3.2
x-ai/grok-4.1-fast
anthropic/claude-opus-4.6
google/gemini-3-flash-preview
google/gemini-3.1-flash-lite-preview
mistralai/mistral-medium-3.1
moonshotai/kimi-k2.5
x-ai/grok-4.20-beta
mistralai/mistral-large-2512
anthropic/claude-opus-4.5
anthropic/claude-sonnet-4.5
anthropic/claude-sonnet-4.6
openai/gpt-5.1
anthropic/claude-haiku-4.5
amazon/nova-pro-v1
amazon/nova-lite-v1
amazon/nova-micro-v1
amazon/nova-2-lite-v1
meta-llama/llama-4-scout
meta-llama/llama-4-maverick
Methodology
How scoring works
Test cases with held-out continuations are procedurally generated across arithmetic recurrences, alternations, nested cycles, grammar-like expansions, and mixed-attribute transitions; each model is scored on exact next-item accuracy over the held-out slots.
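A minimal sketch of what two of these case families might look like. The function names, sequence lengths, and symbol alphabet are illustrative assumptions, not the benchmark's actual generators; the key idea is that each generator emits a shown prefix plus a held-out continuation of K items.

```python
import random

def arithmetic_case(length=8, k=2, rng=random):
    # Hypothetical arithmetic-recurrence generator: a_n = a_{n-1} + d,
    # with a random start and step. Returns (shown prefix, held-out items).
    start, step = rng.randint(0, 9), rng.randint(1, 5)
    seq = [start + i * step for i in range(length + k)]
    return seq[:length], seq[length:]

def nested_cycle_case(length=8, k=2, rng=random):
    # Hypothetical nested-cycle generator: an outer cycle of symbols,
    # each repeated an inner number of times, tiled to the target length.
    symbols, repeat = ["A", "B", "C"], rng.randint(1, 3)
    base = [s for s in symbols for _ in range(repeat)]
    seq = (base * ((length + k) // len(base) + 1))[:length + k]
    return seq[:length], seq[length:]
```

Holding out the final K items at generation time keeps the answer key local and makes exact-match scoring trivial.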
Prompt
Given a short sequence generated by a hidden rule, predict the next K items and return only a JSON array or object containing those items.
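A sketch of how such a prompt could be built and the reply parsed. The template wording and the lenient array extraction are assumptions about the harness, not its actual code; the point is tolerating prose or code fences around the JSON and rejecting answers with the wrong item count.

```python
import json
import re

def build_prompt(shown, k):
    # Hypothetical prompt template matching the description above.
    return (
        f"A hidden rule generated this sequence: {json.dumps(shown)}\n"
        f"Predict the next {k} items. Return ONLY a JSON array of {k} items."
    )

def parse_prediction(raw, k):
    # Extract the first JSON array from the raw model output, tolerating
    # surrounding prose or code fences. Returns None if unparseable or
    # if the array does not contain exactly k items.
    match = re.search(r"\[.*?\]", raw, re.DOTALL)
    if not match:
        return None
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return items if isinstance(items, list) and len(items) == k else None
```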
Score
Higher is better. The main score is exact next-item accuracy across all held-out continuation slots, with full-sequence match rate tracked in metadata.
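The two metrics can be sketched as follows: per-slot exact matches are pooled across all cases for the headline accuracy, while the whole-continuation match rate is tracked separately. Function names and the result-tuple shape are illustrative assumptions.

```python
def score_case(predicted, expected):
    # Per-slot exact matches plus a whole-continuation match flag.
    # An unparseable prediction (None) scores zero on every slot.
    if predicted is None:
        predicted = []
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits, len(expected), predicted == expected

def aggregate(results):
    # results: list of (hits, slots, full_match) tuples across all cases.
    hits = sum(r[0] for r in results)
    slots = sum(r[1] for r in results)
    full = sum(r[2] for r in results)
    return {
        "accuracy": hits / slots if slots else 0.0,          # main score
        "full_sequence_rate": full / len(results) if results else 0.0,
    }
```

Pooling slots (rather than averaging per-case accuracies) weights cases with longer continuations proportionally more, which matches "accuracy across all held-out continuation slots".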
Execution
Cases are generated locally, model outputs are parsed and scored locally, and final results are cached in Neon by benchmark and model ID.
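A sketch of the cache step, keyed by benchmark and model ID so reruns overwrite stale scores. The table name and columns are assumptions, and `sqlite3` stands in here for the actual Neon (serverless Postgres) connection; both support the same `ON CONFLICT ... DO UPDATE` upsert syntax.

```python
import sqlite3

# Stand-in for the Neon connection; the schema below is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS results (
           benchmark TEXT NOT NULL,
           model_id  TEXT NOT NULL,
           accuracy  REAL NOT NULL,
           PRIMARY KEY (benchmark, model_id)
       )"""
)

def cache_result(benchmark, model_id, accuracy):
    # Upsert keyed by (benchmark, model_id): a rerun of the same model
    # on the same benchmark replaces the previous cached score.
    conn.execute(
        "INSERT INTO results VALUES (?, ?, ?) "
        "ON CONFLICT(benchmark, model_id) DO UPDATE SET accuracy = excluded.accuracy",
        (benchmark, model_id, accuracy),
    )
```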