mistralai/mistral-small-2603
WeirdBench
Benchmark leaderboard
nutrition-prediction
Nutrition Prediction
Predict calories, protein, carbs, and fat from ingredient lists for a fixed 50-dish Nutrition5k sample. Higher is better.
openai/gpt-oss-20b
z-ai/glm-5-turbo
qwen/qwen3.5-122b-a10b
mistralai/mistral-medium-3.1
x-ai/grok-4.20-beta
meta-llama/llama-4-maverick
inception/mercury-2
amazon/nova-lite-v1
openai/gpt-5.4
anthropic/claude-opus-4.1
mistralai/mistral-large-2512
openai/gpt-oss-120b
amazon/nova-pro-v1
openai/gpt-5.3-codex
anthropic/claude-sonnet-4.6
anthropic/claude-haiku-4.5
openai/gpt-5.3-chat
google/gemini-3.1-flash-lite-preview
moonshotai/kimi-k2.5
google/gemini-3-flash-preview
openai/gpt-5.1
anthropic/claude-opus-4.6
z-ai/glm-5
anthropic/claude-sonnet-4.5
xiaomi/mimo-v2-pro
anthropic/claude-opus-4.5
x-ai/grok-4.1-fast
stepfun/step-3.5-flash:free
minimax/minimax-m2.7
deepseek/deepseek-v3.2
google/gemini-3.1-pro-preview
meta-llama/llama-4-scout
amazon/nova-2-lite-v1
openai/gpt-5.4-mini
amazon/nova-micro-v1
minimax/minimax-m2.5
Methodology
How scoring works
Fetch Nutrition5k dish metadata, deterministically sample 50 dishes that each have at least 3 ingredients and at least 100 calories, prompt the model once per dish, then compute the mean absolute percentage error (MAPE) and the Pearson correlation for each field.
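The pipeline above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the record keys (`dish_id`, `ingredients`, `calories`), the seed value, and the choice to skip zero-valued ground truth in MAPE are all assumptions.

```python
import math
import random


def sample_dishes(dishes, k=50, seed=0):
    """Keep eligible dishes, then draw a reproducible sample of size k.

    Assumes each dish is a dict with hypothetical keys
    'dish_id', 'ingredients' (a list), and 'calories'.
    """
    eligible = [
        d for d in dishes
        if len(d["ingredients"]) >= 3 and d["calories"] >= 100
    ]
    # Sort first so the sample does not depend on input order.
    eligible.sort(key=lambda d: d["dish_id"])
    rng = random.Random(seed)  # fixed seed makes the draw deterministic
    return rng.sample(eligible, min(k, len(eligible)))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Pairs whose actual value is zero are skipped (an assumption).
    """
    terms = [abs(p - a) / abs(a) for a, p in zip(actual, predicted) if a != 0]
    if not terms:
        return float("nan")
    return 100.0 * sum(terms) / len(terms)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

Each of the four fields would get its own `mape` and `pearson` value, computed over the 50 sampled dishes.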
Prompt
Given only the ingredient list, return JSON with numeric `calories`, `protein`, `carbs`, and `fat` fields and no extra text.
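One way a runner might validate the model's reply is sketched below. The function name `parse_prediction` and the strict rejection of non-numeric values are assumptions; the source does not say how malformed replies are handled.

```python
import json

FIELDS = ("calories", "protein", "carbs", "fat")


def parse_prediction(reply):
    """Parse a model reply into a dict of floats for the four fields.

    Hypothetical helper: raises ValueError if the reply is not a JSON
    object or if any required field is missing or non-numeric.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError("reply is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")

    result = {}
    for field in FIELDS:
        value = data.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("field %r is missing or not numeric" % field)
        result[field] = float(value)
    return result
```

For example, `parse_prediction('{"calories": 250, "protein": 12.5, "carbs": 30, "fat": 8}')` returns a dict with those four values as floats, while a reply wrapped in extra text raises `ValueError`.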
Score
Higher is better. The overall score weights accuracy at 60% and the average per-field Pearson correlation at 40%, where accuracy = 100 / (1 + average MAPE percentage).
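The formula can be written out as a short sketch. Two points are assumptions, since the source does not spell them out: MAPE is taken as a percentage number (25 means 25%), and the correlation enters the weighted sum unscaled unless the hypothetical `corr_scale` parameter is changed.

```python
def accuracy(avg_mape_pct):
    """accuracy = 100 / (1 + average MAPE percentage)."""
    return 100.0 / (1.0 + avg_mape_pct)


def overall_score(mape_by_field, corr_by_field, corr_scale=1.0):
    """Combine per-field MAPE and Pearson correlation into one score.

    mape_by_field / corr_by_field: dicts keyed by field name.
    corr_scale is a hypothetical knob, not part of the stated formula;
    the default of 1.0 follows the formula literally.
    """
    avg_mape = sum(mape_by_field.values()) / len(mape_by_field)
    avg_corr = sum(corr_by_field.values()) / len(corr_by_field)
    return 0.6 * accuracy(avg_mape) + 0.4 * avg_corr * corr_scale
```

Under these assumptions, an average MAPE of 24 gives an accuracy of 100 / 25 = 4, so with an average correlation of 0.5 the overall score is 0.6 × 4 + 0.4 × 0.5 = 2.6.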
Execution
Benchmark runners execute locally, use OpenRouter for predictions, fetch the fixed Nutrition5k metadata sample, cache results in Neon, and skip recomputation for models that already have stored scores.
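The skip-if-cached behavior might look like the sketch below. A plain dict stands in for the Neon-backed store, and `score_model` is a hypothetical callable that would wrap the OpenRouter calls and scoring; neither reflects the runner's real interfaces.

```python
def run_benchmark(models, cache, score_model):
    """Score each model once, reusing stored scores where they exist.

    cache: dict-like mapping model id -> score (stand-in for the database).
    score_model: hypothetical callable that runs the 50 prompts for one
    model and returns its overall score.
    """
    results = {}
    for model in models:
        if model in cache:
            # Stored score found: skip recomputation.
            results[model] = cache[model]
            continue
        score = score_model(model)
        cache[model] = score  # persist so later runs skip this model
        results[model] = score
    return results
```

Because the dish sample is fixed, a stored score stays comparable with scores computed in later runs, which is what makes skipping safe.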