WeirdBench
Unconventional LLM benchmarks.
Unconventional tasks, clear score direction, and a dead-simple score table. The benchmark definitions and code are written locally and published openly.
ai-writing-detection
Classify essays from a fixed balanced sample of 50 human-written and 50 AI-generated examples from the AI Generated Essays Dataset. Higher is better.
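As a rough illustration of how a "higher is better" score on this task could be computed, the sketch below uses plain accuracy. The metric, the label strings "human" and "ai", and the function name `classification_accuracy` are all assumptions made for illustration; the benchmark description does not state them.

```python
def classification_accuracy(predictions, labels):
    """Fraction of essays whose predicted label matches the true label.

    Assumption: the benchmark scores by plain accuracy. The description
    only says "higher is better", so this is a guess at the metric.
    """
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical balanced sample: 50 human-written and 50 AI-generated essays.
labels = ["human"] * 50 + ["ai"] * 50

# A classifier that always answers "ai" gets exactly half of a balanced sample right.
always_ai = ["ai"] * 100
baseline = classification_accuracy(always_ai, labels)
```

Under this assumed metric, the balanced 50/50 sample means a constant guess scores 0.5, so only scores above that indicate real discrimination between the two classes.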
Top Models
nutrition-prediction
Predict calories, protein, carbs, and fat from ingredient lists for a fixed 50-dish Nutrition5k sample. Higher is better.
Top Models
semantic-diversity
Generate exactly 20 English words that are maximally semantically unrelated to each other, then score the average pairwise semantic similarity. Lower is better.
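The averaging step can be sketched as follows, assuming each word has already been mapped to an embedding vector and that "semantic similarity" means cosine similarity. Both are assumptions: the description does not name the embedding model or the similarity function, and `mean_pairwise_similarity` is a hypothetical helper name.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average similarity over every unordered pair of vectors (lower = more diverse)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings" standing in for real word vectors (assumption).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_score = mean_pairwise_similarity(toy_vectors)
```

With 20 words there are 190 unordered pairs, so a single closely related pair moves the average only slightly; a low score requires the whole set to be spread out.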
Top Models
orthographic-diversity
Search for 20 real English words that are maximally different in spelling under hard validity rules and deterministic penalties. Higher is better.
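One plausible way to quantify "different in spelling" is mean pairwise Levenshtein (edit) distance. This is only a sketch under that assumption: the description does not specify the distance measure, the validity rules, or the penalties, so none of those are modeled here, and the function names are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if letters match)
            ))
        prev = cur
    return prev[-1]


def mean_pairwise_edit_distance(words):
    """Average edit distance over every unordered pair (higher = more diverse)."""
    pairs = list(combinations(words, 2))
    if not pairs:
        raise ValueError("need at least two words")
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)


# Assumed stand-in for the real scoring; validity checks and penalties are omitted.
example_score = mean_pairwise_edit_distance(["cat", "cot", "dog"])
```

A real scorer would also have to reject non-words and apply the benchmark's penalties before averaging, which is why this sketch should not be read as the published scoring rule.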
Top Models
wordle
Solve 20 recent Wordle puzzles turn by turn with standard gray/yellow/green feedback. Invalid guesses still cost a turn, and each puzzle's score is the number of turns used, capped at 10. Lower is better.
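The standard feedback rule and the per-puzzle cap can be sketched as below. The feedback logic follows the usual Wordle convention for repeated letters. The function names, the `G`/`Y`/`-` encoding, and the choice to score an unsolved puzzle at the cap are assumptions for illustration, not the benchmark's published code.

```python
from collections import Counter

MAX_TURNS = 10  # per-puzzle cap stated in the description


def wordle_feedback(guess, answer):
    """Return one character per letter: 'G' green, 'Y' yellow, '-' gray."""
    guess, answer = guess.lower(), answer.lower()
    if len(guess) != len(answer):
        raise ValueError("guess and answer must have the same length")

    result = ["-"] * len(guess)
    # Answer letters not consumed by a green match stay available for yellows.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)

    # First pass: exact position matches.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"

    # Second pass: right letter, wrong position, limited by leftover counts.
    for i, g in enumerate(guess):
        if result[i] == "-" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1

    return "".join(result)


def puzzle_score(turns_used, solved):
    """Turns used, capped at MAX_TURNS. Assumption: unsolved puzzles score the cap."""
    return min(turns_used, MAX_TURNS) if solved else MAX_TURNS
```

The two-pass structure matters for repeated letters: guessing "speed" against "abide" marks only the first "e" yellow, because the answer contains a single "e".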
Top Models