llmpm benchmark
Evaluate any model against 68+ industry-standard tasks directly from the terminal. Compare models, track regressions, and reproduce leaderboard results — all with a single command.
CLI COMMANDS
Install benchmark backend (one-time setup)
Run a single benchmark on an installed model
Run the full Open LLM Leaderboard suite
Run multiple tasks in one pass
Benchmark any HuggingFace model directly (no install needed)
Quick smoke test — limit to 100 examples
Override few-shot count
Save full HTML report to ./results/
List all supported benchmark tasks
AVAILABLE TASKS (68)
LEADERBOARD SUITES
CORE BENCHMARKS
MATH & REASONING
CODE GENERATION
READING COMPREHENSION
KNOWLEDGE & FACTUALITY
COMMONSENSE REASONING
LONG CONTEXT
LANGUAGE MODELING & PERPLEXITY
MEDICAL & SCIENTIFIC
SAFETY & BIAS
MULTILINGUAL
SUMMARIZATION