WeirdBench

Benchmark leaderboard

Nutrition Prediction

Predict calories, protein, carbs, and fat from ingredient lists alone, for a fixed 50-dish sample drawn from Nutrition5k. Higher is better.

Rank  Model                                 Score
   1  mistralai/mistral-small-2603          23.8232
   2  openai/gpt-oss-20b                    23.7467
   3  z-ai/glm-5-turbo                      23.5975
   4  qwen/qwen3.5-122b-a10b                23.4188
   5  mistralai/mistral-medium-3.1          22.9013
   6  x-ai/grok-4.20-beta                   22.5383
   7  meta-llama/llama-4-maverick           22.0002
   8  inception/mercury-2                   21.8978
   9  amazon/nova-lite-v1                   21.6084
  10  openai/gpt-5.4                        21.5840
  11  anthropic/claude-opus-4.1             21.5560
  12  anthropic/claude-opus-4.7             21.5195
  13  mistralai/mistral-large-2512          21.4432
  14  openai/gpt-oss-120b                   21.3529
  15  openai/gpt-5.5                        21.2108
  16  amazon/nova-pro-v1                    21.1849
  17  openai/gpt-5.3-codex                  21.1440
  18  anthropic/claude-sonnet-4.6           20.7120
  19  anthropic/claude-haiku-4.5            20.7097
  20  openai/gpt-5.3-chat                   20.6772
  21  google/gemini-3.1-flash-lite-preview  20.5872
  22  moonshotai/kimi-k2.5                  20.1413
  23  google/gemini-3-flash-preview         19.9842
  24  openai/gpt-5.1                        19.9581
  25  google/gemma-4-26b-a4b-it             19.8426
  26  google/gemma-4-31b-it                 19.6220
  27  anthropic/claude-opus-4.6             19.5945
  28  z-ai/glm-5                            19.4215
  29  anthropic/claude-sonnet-4.5           19.3172
  30  xiaomi/mimo-v2-pro                    19.2432
  31  moonshotai/kimi-k2.6                  19.1032
  32  anthropic/claude-opus-4.5             18.8760
  33  x-ai/grok-4.1-fast                    18.3064
  34  stepfun/step-3.5-flash:free           17.9260
  35  minimax/minimax-m2.7                  17.5755
  36  deepseek/deepseek-v3.2                17.4072
  37  google/gemini-3.1-pro-preview         17.1302
  38  meta-llama/llama-4-scout              16.6109
  39  amazon/nova-2-lite-v1                 16.0830
  40  openai/gpt-5.4-mini                   15.2246
  41  amazon/nova-micro-v1                  14.8388
  42  minimax/minimax-m2.5                  14.6662

Methodology

How scoring works

Fetch Nutrition5k dish metadata, deterministically sample 50 dishes that each have at least 3 ingredients and 100+ calories, prompt the model once per dish, then compute MAPE and Pearson correlation for each of the four nutrient fields.

Prompt

Given only the ingredient list, return JSON with numeric `calories`, `protein`, `carbs`, and `fat` fields and no extra text.
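A response in that shape might be handled like this. The prompt wording and the `parse_prediction` helper are illustrative assumptions, not the benchmark's actual strings or code; the JSON field names come from the description above.

```python
import json

# Illustrative prompt template; the benchmark's exact wording is not shown here.
PROMPT = (
    "Given only this ingredient list, return JSON with numeric "
    '"calories", "protein", "carbs", and "fat" fields and no extra text.\n'
    "Ingredients: {ingredients}"
)

def parse_prediction(raw: str) -> dict:
    """Parse the model's reply, requiring all four numeric fields."""
    data = json.loads(raw)
    return {k: float(data[k]) for k in ("calories", "protein", "carbs", "fat")}
```

A reply that adds extra text or omits a field raises an exception here; how the real runner handles malformed replies is not stated.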

Score

Higher is better. The overall score is a weighted sum of 60% accuracy and 40% average Pearson correlation, where accuracy = 100 / (1 + average MAPE percentage).
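As a sketch, the formula above might be computed like this. Note one assumption: the page does not say how the correlation term is scaled before weighting, so this sketch rescales r to a 0-100 range so both terms share the same units.

```python
def overall_score(avg_mape_pct: float, avg_corr: float) -> float:
    # accuracy = 100 / (1 + average MAPE percentage), per the methodology.
    accuracy = 100.0 / (1.0 + avg_mape_pct)
    # Assumption: Pearson r is rescaled to 0-100 before the 60/40 weighting;
    # the page does not state the exact scaling.
    return 0.6 * accuracy + 0.4 * (avg_corr * 100.0)
```

Under this reading, a hypothetical perfect model (0% MAPE, r = 1) scores 100, and the ~20-point leaderboard scores correspond to high MAPE with moderately strong correlation.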

Execution

Benchmark runners execute locally, use OpenRouter for predictions, fetch the fixed Nutrition5k metadata sample, cache results in Neon, and skip recomputation for models that already have stored scores.
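The skip-if-cached behavior reduces to a simple lookup. The sketch below mocks the Neon store with a dict; the real runner presumably queries Postgres, and `run_benchmark`/`score_fn` are hypothetical names.

```python
def run_benchmark(model: str, cache: dict, score_fn) -> float:
    """Return the cached score for `model`, computing and storing it only once."""
    if model in cache:
        # Model already has a stored score: skip recomputation.
        return cache[model]
    cache[model] = score_fn(model)
    return cache[model]
```

This also makes reruns cheap: adding one new model to the leaderboard only spends OpenRouter credits on that model.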