WeirdBench

Benchmark leaderboard

Nutrition Prediction

Predict calories, protein, carbs, and fat from ingredient lists alone, for a fixed 50-dish sample drawn from Nutrition5k. Higher is better.

Rank  Model                                 Score
   1  mistralai/mistral-small-2603          23.8232
   2  openai/gpt-oss-20b                    23.7467
   3  z-ai/glm-5-turbo                      23.5975
   4  qwen/qwen3.5-122b-a10b                23.4188
   5  mistralai/mistral-medium-3.1          22.9013
   6  x-ai/grok-4.20-beta                   22.5383
   7  meta-llama/llama-4-maverick           22.0002
   8  inception/mercury-2                   21.8978
   9  amazon/nova-lite-v1                   21.6084
  10  openai/gpt-5.4                        21.5840
  11  anthropic/claude-opus-4.1             21.5560
  12  anthropic/claude-opus-4.7             21.5195
  13  mistralai/mistral-large-2512          21.4432
  14  openai/gpt-oss-120b                   21.3529
  15  openai/gpt-5.5                        21.2108
  16  amazon/nova-pro-v1                    21.1849
  17  openai/gpt-5.3-codex                  21.1440
  18  anthropic/claude-sonnet-4.6           20.7120
  19  anthropic/claude-haiku-4.5            20.7097
  20  openai/gpt-5.3-chat                   20.6772
  21  google/gemini-3.1-flash-lite-preview  20.5872
  22  moonshotai/kimi-k2.5                  20.1413
  23  google/gemini-3-flash-preview         19.9842
  24  openai/gpt-5.1                        19.9581
  25  google/gemma-4-26b-a4b-it             19.8426
  26  google/gemma-4-31b-it                 19.6220
  27  anthropic/claude-opus-4.6             19.5945
  28  z-ai/glm-5                            19.4215
  29  anthropic/claude-sonnet-4.5           19.3172
  30  xiaomi/mimo-v2-pro                    19.2432
  31  moonshotai/kimi-k2.6                  19.1032
  32  anthropic/claude-opus-4.5             18.8760
  33  x-ai/grok-4.1-fast                    18.3064
  34  stepfun/step-3.5-flash:free           17.9260
  35  minimax/minimax-m2.7                  17.5755
  36  deepseek/deepseek-v3.2                17.4072
  37  google/gemini-3.1-pro-preview         17.1302
  38  meta-llama/llama-4-scout              16.6109
  39  amazon/nova-2-lite-v1                 16.0830
  40  openai/gpt-5.4-mini                   15.2246
  41  amazon/nova-micro-v1                  14.8388
  42  minimax/minimax-m2.5                  14.6662

Methodology

How scoring works

Fetch Nutrition5k dish metadata, deterministically sample 50 dishes that each have at least 3 ingredients and 100+ calories, prompt the model once per dish, then compute MAPE and Pearson correlation for each of the four nutrient fields.

Prompt

Given only the ingredient list, return JSON with numeric `calories`, `protein`, `carbs`, and `fat` fields and no extra text.
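A response in that shape might be handled like this. The prompt wording and the `parse_prediction` helper are illustrative assumptions, not the benchmark's actual strings or code; the JSON field names come from the description above.

```python
import json

# Illustrative prompt template; the benchmark's exact wording is not shown here.
PROMPT = (
    "Given only this ingredient list, return JSON with numeric "
    '"calories", "protein", "carbs", and "fat" fields and no extra text.\n'
    "Ingredients: {ingredients}"
)

def parse_prediction(raw: str) -> dict:
    """Parse the model's reply, requiring all four numeric fields."""
    data = json.loads(raw)
    return {k: float(data[k]) for k in ("calories", "protein", "carbs", "fat")}
```

A reply that adds extra text or omits a field raises an exception here; how the real runner handles malformed replies is not stated.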

Score

Higher is better. The overall score is a weighted sum of 60% accuracy and 40% average Pearson correlation, where accuracy = 100 / (1 + average MAPE percentage).
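As a sketch, the formula above might be computed like this. Note one assumption: the page does not say how the correlation term is scaled before weighting, so this sketch rescales r to a 0-100 range so both terms share the same units.

```python
def overall_score(avg_mape_pct: float, avg_corr: float) -> float:
    # accuracy = 100 / (1 + average MAPE percentage), per the methodology.
    accuracy = 100.0 / (1.0 + avg_mape_pct)
    # Assumption: Pearson r is rescaled to 0-100 before the 60/40 weighting;
    # the page does not state the exact scaling.
    return 0.6 * accuracy + 0.4 * (avg_corr * 100.0)
```

Under this reading, a hypothetical perfect model (0% MAPE, r = 1) scores 100, and the ~20-point leaderboard scores correspond to high MAPE with moderately strong correlation.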

Execution

Benchmark runners execute locally, use OpenRouter for predictions, fetch the fixed Nutrition5k metadata sample, cache results in Neon, and skip recomputation for models that already have stored scores.
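The skip-if-cached behavior reduces to a simple lookup. The sketch below mocks the Neon store with a dict; the real runner presumably queries Postgres, and `run_benchmark`/`score_fn` are hypothetical names.

```python
def run_benchmark(model: str, cache: dict, score_fn) -> float:
    """Return the cached score for `model`, computing and storing it only once."""
    if model in cache:
        # Model already has a stored score: skip recomputation.
        return cache[model]
    cache[model] = score_fn(model)
    return cache[model]
```

This also makes reruns cheap: adding one new model to the leaderboard only spends OpenRouter credits on that model.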