OpenAI
gpt-5.2-2025-12-11
BENCHMARK SCORES
GPQA (Graduate-Level Google-Proof Q&A) consists of PhD-level multiple-choice questions in chemistry, biology, and physics, scored by accuracy.
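As a point of reference for the "scored by accuracy" metric, here is a minimal illustrative sketch of how accuracy is computed for a multiple-choice benchmark like GPQA; the prediction/answer-key format is an assumption, not the official grading harness.

```python
# Minimal sketch of accuracy scoring for a multiple-choice benchmark (assumed format:
# one predicted option letter per question, compared against the answer key).

def accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Fraction of questions where the chosen option matches the key."""
    assert len(predictions) == len(answer_key), "one prediction per question"
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, answer_key))
    return correct / len(answer_key)

# Example: 3 of 4 answers match the key -> 0.75 accuracy.
print(accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]))
```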
AIME 2025 is a mathematics competition benchmark drawn from the 2025 American Invitational Mathematics Examination, testing advanced problem-solving.
SWE-bench Verified evaluates models on real-world software engineering tasks from GitHub issues.
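For intuition on how a SWE-bench-style score aggregates, the sketch below shows a resolved-rate calculation: a task counts only if the model's patch applies and the issue's verification tests pass. The `TaskResult` structure and helper are hypothetical stand-ins for the real evaluation harness.

```python
from dataclasses import dataclass

# Hypothetical per-task outcome; the real SWE-bench harness runs repository tests
# inside a container, this only illustrates how the final percentage is formed.
@dataclass
class TaskResult:
    patch_applied: bool   # did the model's diff apply cleanly to the repo?
    tests_passed: bool    # did the issue's verification tests pass afterwards?

def resolved_rate(results: list[TaskResult]) -> float:
    """Fraction of GitHub-issue tasks the model fully resolved."""
    if not results:
        return 0.0
    resolved = sum(r.patch_applied and r.tests_passed for r in results)
    return resolved / len(results)

# Example: 2 of 3 tasks resolved -> ~0.667.
print(resolved_rate([TaskResult(True, True),
                     TaskResult(True, False),
                     TaskResult(True, True)]))
```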
Multilingual MMLU tests knowledge across many languages and subject areas.
BrowseComp tests web browsing comprehension, measuring the ability to locate and synthesize hard-to-find information from websites.
Humanity's Last Exam (HLE) is a multi-modal benchmark testing frontier knowledge across mathematics, humanities, and natural sciences with 2,500 expert-level questions.
MMMU-Pro is a more challenging version of MMMU with harder multimodal understanding tasks.