
Humanity's Last Exam

From Wikipedia, the free encyclopedia
Language model benchmark

Humanity's Last Exam (HLE) is a language model benchmark consisting of 2,500 questions across a broad range of subjects. It was created jointly by the Center for AI Safety and Scale AI.

Creation


Stanford HAI's AI Index 2025 Annual Report cites Humanity's Last Exam as one of the "more challenging benchmarks" developed in response to popular AI benchmarks having reached "saturation".[1] The test has been described as the brainchild of Dan Hendrycks, a machine learning researcher and the director of the Center for AI Safety, who stated that he was inspired to create the test after a conversation with Elon Musk, who thought the existing language model benchmarks, such as the MMLU, were too easy. Hendrycks worked with Scale AI to compile the questions.[2] The questions were crowdsourced from subject-matter experts at various institutions across the world.[3][4] Submitted questions were first filtered against leading AI models: if the models failed to answer a question, or did worse than random guessing on a multiple-choice question, it advanced to two rounds of review by human experts before approval for inclusion in the dataset. The submitters of the top-rated questions were awarded prize money from a pool of 500,000 U.S. dollars: $5,000 for each of the top 50 questions and $500 for each of the next 500. After the initial release, a "community feedback bug bounty program" was opened to "identify and remove major errors in the dataset".[4]
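The filtering stage described above can be illustrated with a short sketch. The Python below is not the organizers' actual tooling: the `Question` structure, the callable-model interface, and the trial count are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    text: str
    answer: str
    choices: list = field(default_factory=list)  # empty list => short-answer

def advances_to_expert_review(question, models, n_trials=5):
    """Return True if the question survives the automated model filter.

    Sketch only: `models` are callables mapping a prompt to an answer
    string; n_trials and the thresholds are illustrative assumptions.
    """
    for model in models:
        hits = sum(model(question.text) == question.answer
                   for _ in range(n_trials))
        if question.choices:
            # Multiple choice: the question advances only if the model
            # does worse than random guessing among the listed choices.
            if hits / n_trials >= 1.0 / len(question.choices):
                return False
        elif hits > 0:
            # Short answer: any correct attempt disqualifies the question.
            return False
    return True

# Toy usage: a "model" that always answers "5" fails this question, so it
# would advance to the two rounds of human expert review. Top-rated
# surviving questions shared the $500,000 pool (50 x $5,000 + 500 x $500).
q = Question(text="What is 2 + 2? Answer with a number.", answer="4")
print(advances_to_expert_review(q, [lambda prompt: "5"]))  # True
```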

Composition


The benchmark consists of 2,500 questions in the publicly released set. The questions "typically require graduate-level expertise or test knowledge of highly specific topics". The paper classifies the questions into the following broad subjects: mathematics (41%), physics (9%), biology/medicine (11%), humanities/social science (9%), computer science/artificial intelligence (10%), engineering (4%), chemistry (7%), and other (9%). Around 14% of the questions require the ability to understand both text and images, i.e., multi-modality. 24% of the questions are multiple-choice; the rest are short-answer, exact-match questions. A private set is also maintained to test for benchmark overfitting.[4]

An example question:[2]

Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
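Because most questions are short-answer and scored by exact match, grading reduces in principle to comparing a model's final answer string against the reference answer. The snippet below is a minimal sketch under that assumption; the normalization is hypothetical, and the benchmark's actual grading has to handle equivalent phrasings more robustly than a bare string comparison.

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Naive exact-match grader; the normalization here is hypothetical."""
    normalize = lambda s: s.strip().lower().rstrip(".")
    return normalize(prediction) == normalize(gold)

print(exact_match(" 7. ", "7"))   # True after trivial normalization
print(exact_match("seven", "7"))  # False: naive matching misses paraphrases
```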

An independent investigation by FutureHouse, published in July 2025, suggested that around 30% of the HLE answers for text-only chemistry and biology questions could be incorrect; the benchmark's team partially replicated the findings and said they hoped to institute a continuous revision process.[5]

Results

Performance of various models on the benchmark

Organization         | Model                        | Accuracy (%) ↑ | Calibration Error (%) ↓
---------------------|------------------------------|----------------|------------------------
Google DeepMind      | Gemini 3 Pro Preview         | 37.52          | 57
OpenAI               | GPT-5 Pro                    | 31.64          | 49
Anthropic            | Claude Opus 4.5 (Thinking)   | 25.20          | 55
Moonshot AI          | Kimi K2.5                    | 24.37          | 67
Z.ai                 | GLM 4.5                      |  8.32          | 79
Meta AI              | Llama 4 Maverick             |  5.68          | 83
Mistral AI           | Mistral Medium 3             |  4.52          | 77
Amazon Web Services  | Nova Pro                     |  4.40          | 80

Source: Scale AI, 14 February 2026.
Performance of various non-multimodal models on the text-only subset of the benchmark

Organization         | Model                          | Accuracy (%) ↑ | Calibration Error (%) ↓
---------------------|--------------------------------|----------------|------------------------
OpenAI               | gpt-oss-120b                   | 15.48          | 76
Alibaba Cloud        | Qwen3-235B-A22B-Thinking-2507  | 15.43          | 78
DeepSeek             | DeepSeek-R1-0528               | 14.04          | 78
Moonshot AI          | Kimi-K2-Instruct               |  4.68          | 82
Amazon Web Services  | Nova Micro                     |  4.41          | 84

Source: Scale AI, 30 August 2025.
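Both tables pair accuracy with a calibration error, which measures how far a model's stated confidence drifts from its actual accuracy; a well-calibrated model is right about 80% of the time when it reports 80% confidence, and lower is better. The sketch below computes a standard binned RMS calibration error as an illustration; the leaderboard's exact binning and weighting may differ.

```python
import numpy as np

def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS calibration error, as a fraction (x100 for percent).

    confidences: model-reported confidence per question, in [0, 1]
    correct:     1 if the answer was right, 0 otherwise
    Illustrative sketch; not the leaderboard's exact procedure.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    total, err = len(confidences), 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Squared gap between mean confidence and accuracy in this bin,
            # weighted by the fraction of questions that fall in the bin.
            gap = confidences[mask].mean() - correct[mask].mean()
            err += (mask.sum() / total) * gap ** 2
    return err ** 0.5

# A model that claims 90% confidence on every question but is right only
# half the time has a calibration error of 40 percentage points.
conf = [0.9] * 10
hits = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(round(rms_calibration_error(conf, hits) * 100, 1))  # 40.0
```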

References

  1. Maslej, Nestor; et al. (April 2025). The AI Index 2025 Annual Report (PDF) (Report). Institute for Human-Centered AI. pp. 141–142.
  2. Roose, Kevin (23 January 2025). "When A.I. Passes This Test, Look Out". New York Times. Archived from the original on 29 January 2025. Retrieved 24 January 2025.
  3. Dastin, Jeffrey; Paul, Katie (16 September 2024). "AI experts ready 'Humanity's Last Exam' to stump powerful tech". Reuters. Archived from the original on 8 April 2025. Retrieved 24 January 2025.
  4. Center for AI Safety; Scale AI; HLE Contributors Consortium (2026). "A benchmark of expert-level academic questions to assess AI capabilities". Nature. 649: 1139–1146. doi:10.1038/s41586-025-09962-4.
  5. Skarlinski, Michael; Laurent, Jon; Bou, Albert; White, Andrew (16 September 2025). "About 30% of Humanity's Last Exam chemistry/biology answers are likely wrong". FutureHouse. Retrieved 15 October 2025.
