WeirdBench

Benchmark leaderboard

Nutrition Prediction

Predict calories, protein, carbs, and fat from ingredient lists for a fixed 50-dish Nutrition5k sample. Higher is better.

Rank  Model                                 Score
1     mistralai/mistral-small-2603          23.8232
2     openai/gpt-oss-20b                    23.7467
3     z-ai/glm-5-turbo                      23.5975
4     qwen/qwen3.5-122b-a10b                23.4188
5     mistralai/mistral-medium-3.1          22.9013
6     x-ai/grok-4.20-beta                   22.5383
7     meta-llama/llama-4-maverick           22.0002
8     inception/mercury-2                   21.8978
9     amazon/nova-lite-v1                   21.6084
10    openai/gpt-5.4                        21.5840
11    anthropic/claude-opus-4.1             21.5560
12    mistralai/mistral-large-2512          21.4432
13    openai/gpt-oss-120b                   21.3529
14    amazon/nova-pro-v1                    21.1849
15    openai/gpt-5.3-codex                  21.1440
16    anthropic/claude-sonnet-4.6           20.7120
17    anthropic/claude-haiku-4.5            20.7097
18    openai/gpt-5.3-chat                   20.6772
19    google/gemini-3.1-flash-lite-preview  20.5872
20    moonshotai/kimi-k2.5                  20.1413
21    google/gemini-3-flash-preview         19.9842
22    openai/gpt-5.1                        19.9581
23    anthropic/claude-opus-4.6             19.5945
24    z-ai/glm-5                            19.4215
25    anthropic/claude-sonnet-4.5           19.3172
26    xiaomi/mimo-v2-pro                    19.2432
27    anthropic/claude-opus-4.5             18.8760
28    x-ai/grok-4.1-fast                    18.3064
29    stepfun/step-3.5-flash:free           17.9260
30    minimax/minimax-m2.7                  17.5755
31    deepseek/deepseek-v3.2                17.4072
32    google/gemini-3.1-pro-preview         17.1302
33    meta-llama/llama-4-scout              16.6109
34    amazon/nova-2-lite-v1                 16.0830
35    openai/gpt-5.4-mini                   15.2246
36    amazon/nova-micro-v1                  14.8388
37    minimax/minimax-m2.5                  14.6662

Methodology

How scoring works

Fetch the Nutrition5k dish metadata, deterministically sample 50 dishes that each have at least 3 ingredients and at least 100 calories, prompt the model once per dish, then compute the mean absolute percentage error (MAPE) and Pearson correlation for each field.
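
The sampling and metric steps might look roughly like the sketch below. It is an illustration under stated assumptions, not the benchmark's actual code: the record layout (`dish_id`, `ingredients`, `calories`), the seed value, and the handling of zero ground-truth values are all hypothetical choices.

```python
import random
from math import sqrt

FIELDS = ("calories", "protein", "carbs", "fat")

def sample_dishes(dishes, k=50, seed=0):
    """Pick a repeatable sample of eligible dishes.

    Assumed record layout: {"dish_id": str, "ingredients": list, "calories": float, ...}.
    The seed value is an assumption; the source only says sampling is deterministic.
    """
    eligible = [d for d in dishes
                if len(d["ingredients"]) >= 3 and d["calories"] >= 100]
    eligible.sort(key=lambda d: d["dish_id"])  # fixed order, independent of input order
    rng = random.Random(seed)                  # seeded generator gives the same sample each run
    return rng.sample(eligible, min(k, len(eligible)))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Zero ground-truth values are skipped (assumption)."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        return float("nan")
    return 100.0 * sum(abs(p - a) / abs(a) for a, p in pairs) / len(pairs)

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def per_field_metrics(truth, preds):
    """Compute MAPE and Pearson correlation for each nutrition field."""
    return {
        f: {
            "mape": mape([t[f] for t in truth], [p[f] for p in preds]),
            "pearson": pearson([t[f] for t in truth], [p[f] for p in preds]),
        }
        for f in FIELDS
    }
```

Sorting before sampling means the result depends only on the seed and the set of eligible dishes, not on the order in which the metadata happens to arrive.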

Prompt

Given only the ingredient list, return JSON with numeric `calories`, `protein`, `carbs`, and `fat` fields and no extra text.
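
Because the prompt asks for bare JSON, a runner needs some way to validate each reply. The following is a minimal sketch of one possible validator; the function name `parse_prediction` and the choice to reject (rather than repair) malformed replies are assumptions, not details taken from the benchmark.

```python
import json

REQUIRED_FIELDS = ("calories", "protein", "carbs", "fat")

def parse_prediction(text):
    """Return a dict of floats for the four fields, or None if the reply is unusable.

    Hypothetical helper: the benchmark may instead retry or attempt to repair bad replies.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # reply was not pure JSON (e.g. extra prose around it)
    if not isinstance(data, dict):
        return None
    result = {}
    for field in REQUIRED_FIELDS:
        value = data.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        result[field] = float(value)
    return result
```

A reply such as `{"calories": 250, "protein": 12.5, "carbs": 30, "fat": 8}` passes, while a reply with surrounding commentary or a string-valued field is rejected.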

Score

Higher is better. The overall score weights accuracy at 60% and average correlation at 40%, where accuracy = 100 / (1 + average MAPE, expressed as a percentage).
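
The formula can be sketched as a small function. One point is not stated in the description and is assumed here: the correlation term is multiplied by 100 so that both components share a 0 to 100 scale. The `corr_scale` parameter makes that assumption explicit and adjustable.

```python
def overall_score(avg_mape_pct, avg_corr, corr_scale=100.0):
    """Combine accuracy and correlation into one score.

    avg_mape_pct: average MAPE across fields, in percent (e.g. 49.0 for 49%).
    avg_corr:     average Pearson correlation across fields.
    corr_scale:   ASSUMPTION, not stated in the source; rescales correlation to 0..100.
    """
    accuracy = 100.0 / (1.0 + avg_mape_pct)
    return 0.6 * accuracy + 0.4 * (avg_corr * corr_scale)
```

Under these assumptions, an average MAPE of 49% and an average correlation of 0.5 give an accuracy of 2.0 and an overall score of about 21.2.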

Execution

Benchmark runners execute locally, use OpenRouter for predictions, fetch the fixed Nutrition5k metadata sample, cache results in Neon, and skip recomputation for models that already have stored scores.
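
The skip-if-cached behaviour can be illustrated with a short sketch. Here a plain dictionary stands in for the Neon-backed store, and `score_model` is a hypothetical callable that would wrap the OpenRouter requests and scoring; neither reflects the runner's real interfaces.

```python
def run_benchmark(models, score_model, cache):
    """Score each model once, reusing stored scores where available.

    models:      iterable of model identifiers, e.g. "openai/gpt-oss-20b".
    score_model: hypothetical callable(model_id) -> float that runs the benchmark.
    cache:       dict-like store mapping model_id -> score (stand-in for the database).
    """
    results = {}
    for model_id in models:
        if model_id in cache:
            # A stored score exists, so skip recomputation
            results[model_id] = cache[model_id]
            continue
        score = score_model(model_id)
        cache[model_id] = score  # persist for future runs
        results[model_id] = score
    return results
```

On a second run with the same cache, `score_model` is not called for any model that already has an entry.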