WeirdBench
Benchmark leaderboard
hidden-rule-sequence
Hidden Rule Sequence Continuation
Infer latent rules from procedurally generated sequences of numbers, symbols, or mixed tokens, then predict the next items. Higher is better.
openai/gpt-5.3-codex
openai/gpt-oss-120b
google/gemini-3.1-pro-preview
qwen/qwen3.5-122b-a10b
qwen/qwen3.5-397b-a17b
z-ai/glm-5
openai/gpt-5.4-mini
xiaomi/mimo-v2-pro
openai/gpt-oss-20b
inception/mercury-2
mistralai/mistral-small-2603
openai/gpt-5.3-chat
openai/gpt-5.4
z-ai/glm-5-turbo
qwen/qwen3.5-27b
deepseek/deepseek-v3.2
x-ai/grok-4.1-fast
anthropic/claude-opus-4.6
google/gemini-3-flash-preview
google/gemini-3.1-flash-lite-preview
mistralai/mistral-medium-3.1
moonshotai/kimi-k2.5
x-ai/grok-4.20-beta
mistralai/mistral-large-2512
anthropic/claude-opus-4.5
anthropic/claude-sonnet-4.5
anthropic/claude-sonnet-4.6
openai/gpt-5.1
anthropic/claude-haiku-4.5
amazon/nova-pro-v1
amazon/nova-lite-v1
amazon/nova-micro-v1
amazon/nova-2-lite-v1
meta-llama/llama-4-scout
meta-llama/llama-4-maverick
Methodology
How scoring works
Test cases with held-out continuations are procedurally generated across arithmetic recurrences, alternations, nested cycles, grammar-like expansions, and mixed-attribute transitions; each model is scored on exact next-item accuracy over the held-out slots.
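A minimal sketch of what two of these case families might look like. The function names, sequence lengths, and symbol alphabet are illustrative assumptions, not the benchmark's actual generators; the key idea is that each generator emits a shown prefix plus a held-out continuation of K items.

```python
import random

def arithmetic_case(length=8, k=2, rng=random):
    # Hypothetical arithmetic-recurrence generator: a_n = a_{n-1} + d,
    # with a random start and step. Returns (shown prefix, held-out items).
    start, step = rng.randint(0, 9), rng.randint(1, 5)
    seq = [start + i * step for i in range(length + k)]
    return seq[:length], seq[length:]

def nested_cycle_case(length=8, k=2, rng=random):
    # Hypothetical nested-cycle generator: an outer cycle of symbols,
    # each repeated an inner number of times, tiled to the target length.
    symbols, repeat = ["A", "B", "C"], rng.randint(1, 3)
    base = [s for s in symbols for _ in range(repeat)]
    seq = (base * ((length + k) // len(base) + 1))[:length + k]
    return seq[:length], seq[length:]
```

Holding out the final K items at generation time keeps the answer key local and makes exact-match scoring trivial.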
Prompt
Given a short sequence generated by a hidden rule, predict the next K items and return only a JSON array or object containing those items.
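A sketch of how such a prompt could be built and the reply parsed. The template wording and the lenient array extraction are assumptions about the harness, not its actual code; the point is tolerating prose or code fences around the JSON and rejecting answers with the wrong item count.

```python
import json
import re

def build_prompt(shown, k):
    # Hypothetical prompt template matching the description above.
    return (
        f"A hidden rule generated this sequence: {json.dumps(shown)}\n"
        f"Predict the next {k} items. Return ONLY a JSON array of {k} items."
    )

def parse_prediction(raw, k):
    # Extract the first JSON array from the raw model output, tolerating
    # surrounding prose or code fences. Returns None if unparseable or
    # if the array does not contain exactly k items.
    match = re.search(r"\[.*?\]", raw, re.DOTALL)
    if not match:
        return None
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return items if isinstance(items, list) and len(items) == k else None
```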
Score
Higher is better. The main score is exact next-item accuracy across all held-out continuation slots, with full-sequence match rate tracked in metadata.
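The two metrics can be sketched as follows: per-slot exact matches are pooled across all cases for the headline accuracy, while the whole-continuation match rate is tracked separately. Function names and the result-tuple shape are illustrative assumptions.

```python
def score_case(predicted, expected):
    # Per-slot exact matches plus a whole-continuation match flag.
    # An unparseable prediction (None) scores zero on every slot.
    if predicted is None:
        predicted = []
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits, len(expected), predicted == expected

def aggregate(results):
    # results: list of (hits, slots, full_match) tuples across all cases.
    hits = sum(r[0] for r in results)
    slots = sum(r[1] for r in results)
    full = sum(r[2] for r in results)
    return {
        "accuracy": hits / slots if slots else 0.0,          # main score
        "full_sequence_rate": full / len(results) if results else 0.0,
    }
```

Pooling slots (rather than averaging per-case accuracies) weights cases with longer continuations proportionally more, which matches "accuracy across all held-out continuation slots".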
Execution
Cases are generated locally, model outputs are parsed and scored locally, and final results are cached in Neon by benchmark and model ID.
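A sketch of the cache step, keyed by benchmark and model ID so reruns overwrite stale scores. The table name and columns are assumptions, and `sqlite3` stands in here for the actual Neon (serverless Postgres) connection; both support the same `ON CONFLICT ... DO UPDATE` upsert syntax.

```python
import sqlite3

# Stand-in for the Neon connection; the schema below is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS results (
           benchmark TEXT NOT NULL,
           model_id  TEXT NOT NULL,
           accuracy  REAL NOT NULL,
           PRIMARY KEY (benchmark, model_id)
       )"""
)

def cache_result(benchmark, model_id, accuracy):
    # Upsert keyed by (benchmark, model_id): a rerun of the same model
    # on the same benchmark replaces the previous cached score.
    conn.execute(
        "INSERT INTO results VALUES (?, ?, ?) "
        "ON CONFLICT(benchmark, model_id) DO UPDATE SET accuracy = excluded.accuracy",
        (benchmark, model_id, accuracy),
    )
```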