METR

From Wikipedia, the free encyclopedia
AI model evaluation nonprofit

METR
Formation: 2022
Founder: Beth Barnes
Type: Nonprofit research institute
Legal status: 501(c)(3) tax-exempt charity
Purpose: AI safety research and model evaluation
Location: Berkeley, California
Website: metr.org

Model Evaluation and Threat Research (METR) (MEE-tər) is a nonprofit research institute based in Berkeley, California,[1] that evaluates frontier AI models' capabilities to carry out long-horizon, agentic tasks that some researchers argue could pose catastrophic risks to society.[2][3] It has worked with leading AI companies to conduct pre-deployment model evaluations and contribute to system cards, including OpenAI's o3, o4-mini, GPT-4o, and GPT-4.5, and Anthropic's Claude models.[3][4][5][6][7]

METR's CEO and founder is Beth Barnes, a former alignment researcher at OpenAI who left in 2022 to form ARC Evals, the evaluation division of Paul Christiano's Alignment Research Center. In December 2023, ARC Evals was spun off into an independent 501(c)(3) nonprofit and renamed METR.[8][9][10]

Research


A substantial amount of METR's research is focused on evaluating the capabilities of AI systems to conduct research and development of AI systems themselves, including RE-Bench, a benchmark designed to test whether AIs can "solve research engineering tasks and accelerate AI R&D".[11][12]

Doubling time estimates

A graph showing that the length of tasks frontier models are capable of executing at a 50% success rate doubled every 7 months from 2019 to 2024. The shaded region represents a 95% confidence interval.[13]

In March 2025, METR published a paper noting that the length of software engineering tasks that the leading AI model could complete had a doubling time of around 7 months between 2019 and 2024.[14]

In January 2026, METR released a new version of its time-horizon estimation model (Time Horizon 1.1). According to the new model, the rate of progress in AI capabilities has increased since 2023: the post-2023 doubling time is estimated at 130.8 days (about 4.3 months), making progress roughly 20% more rapid than previously estimated.[15]
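A doubling-time model implies exponential growth in task horizons. The following is a minimal illustrative sketch (the function name and the 60-minute starting horizon are hypothetical, chosen only to show the arithmetic) of how a fixed doubling time extrapolates a horizon forward and converts between days and months:

```python
import math

def horizon_after(h0_minutes: float, elapsed_days: float, doubling_days: float) -> float:
    """Project a time horizon forward under a fixed doubling time."""
    return h0_minutes * 2 ** (elapsed_days / doubling_days)

# With the post-2023 doubling time of 130.8 days, a (hypothetical)
# 60-minute horizon doubles after exactly one doubling period:
print(horizon_after(60, 130.8, 130.8))  # 120.0

# Converting the doubling time to months (~30.44 days per month):
print(round(130.8 / 30.44, 1))  # 4.3
```

Under this model, growth compounds: two doubling periods quadruple the horizon, three octuple it, and so on.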

Time horizon measurements


METR releases a "task-completion time horizon" for analysed AI models. This measures the "task duration (measured by human expert completion time) at which an AI agent is predicted to succeed with a given level of reliability."[16] It is released in two variants: the 50%-time horizon, which gives the task duration at which an AI model is estimated to succeed 50% of the time, and the 80%-time horizon, which gives the task duration at which it is estimated to succeed 80% of the time.[16] There are two versions of the horizon estimates: Time Horizon 1.1, introduced in January 2026, and the original Time Horizon 1.0.[16]
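METR's published methodology models an agent's success probability as a logistic curve in the logarithm of human completion time, and reads horizons off that curve. The sketch below (not METR's code; the 60-minute 50% horizon and unit slope are hypothetical parameters) shows how a 50%- and an 80%-time horizon could be recovered from such a curve:

```python
import math

def horizon_at(p: float, h50_minutes: float, slope: float) -> float:
    """Task duration (minutes) at which predicted success probability equals p,
    for a logistic success curve in log2(human completion time):
        p(t) = 1 / (1 + exp(slope * (log2(t) - log2(h50))))
    """
    logit = math.log(p / (1 - p))
    return h50_minutes * 2 ** (-logit / slope)

# Hypothetical model with a 60-minute 50% horizon and slope 1.0:
print(round(horizon_at(0.5, 60, 1.0)))  # 60
print(round(horizon_at(0.8, 60, 1.0)))  # 23
```

Because the curve is decreasing in task duration, the 80%-time horizon is always shorter than the 50%-time horizon for the same model, which matches the pattern in the table below.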

As of February 2026, the best-performing model is GPT-5.2 (high), with a 50%-time horizon of 6 hours 34 minutes and an 80%-time horizon of 55 minutes.[16] The following table lists the time horizon estimates ordered by model release date:[16]

Task durations are human expert completion times.

| Model | Release date | Time Horizon 1.1 (50%) | Time Horizon 1.1 (80%) | Time Horizon 1.0 (50%) | Time Horizon 1.0 (80%) |
| --- | --- | --- | --- | --- | --- |
| GPT-2 | February 2019 | 2 seconds | 0 seconds | | |
| GPT-3 | May 2020 | 9 seconds | 2 seconds | | |
| GPT-3.5 | March 2022 | 36 seconds | 10 seconds | | |
| GPT-4 | March 2023 | 4 minutes | 37 seconds | 5 minutes | 1 minute |
| GPT-4 (November 2023) | November 2023 | 4 minutes | 34 seconds | 9 minutes | 1 minute |
| Claude 3 Opus | March 2024 | 4 minutes | 29 seconds | 6 minutes | 1 minute |
| GPT-4 Turbo | April 2024 | 3 minutes | 37 seconds | 7 minutes | 2 minutes |
| GPT-4o | May 2024 | 6 minutes | 57 seconds | 9 minutes | 2 minutes |
| Qwen2-72B | June 2024 | 2 minutes | 25 seconds | | |
| Claude 3.5 Sonnet (Old) | June 2024 | 11 minutes | 1 minute | 19 minutes | 3 minutes |
| Qwen2.5-72B | September 2024 | 5 minutes | 56 seconds | | |
| o1-preview | September 2024 | 19 minutes | 3 minutes | 22 minutes | 5 minutes |
| Claude 3.5 Sonnet (New) | October 2024 | 20 minutes | 2 minutes | 30 minutes | 5 minutes |
| DeepSeek-V3 | December 2024 | 18 minutes | 4 minutes | | |
| o1 | December 2024 | 38 minutes | 6 minutes | 41 minutes | 6 minutes |
| Claude 3.7 Sonnet | February 2025 | 60 minutes | 10 minutes | 56 minutes | 15 minutes |
| o3 | April 2025 | 2 hours 1 minute | 24 minutes | 1 hour 34 minutes | 21 minutes |
| o4-mini | April 2025 | 1 hour 19 minutes | 16 minutes | | |
| Claude Opus 4 | May 2025 | 1 hour 41 minutes | 17 minutes | 1 hour 26 minutes | 21 minutes |
| DeepSeek-R1-0528 | May 2025 | 32 minutes | 4 minutes | | |
| Gemini 2.5 Pro Preview | June 2025 | 40 minutes | 9 minutes | | |
| Grok 4 | July 2025 | 1 hour 49 minutes | 15 minutes | | |
| Claude Opus 4.1 | August 2025 | 1 hour 41 minutes | 19 minutes | | |
| GPT-5 | August 2025 | 3 hours 34 minutes | 32 minutes | 2 hours 18 minutes | 27 minutes |
| gpt-oss-120b | August 2025 | 45 minutes | 7 minutes | | |
| Claude Sonnet 4.5 | September 2025 | 2 hours 2 minutes | 21 minutes | | |
| Gemini 3 Pro | November 2025 | 3 hours 57 minutes | 43 minutes | | |
| Claude Opus 4.5 | November 2025 | 5 hours 20 minutes | 42 minutes | 4 hours 49 minutes | 27 minutes |
| GPT-5.1-Codex-Max | November 2025 | 3 hours 57 minutes | 41 minutes | 2 hours 53 minutes | 32 minutes |
| Kimi K2 Thinking (inference via Novita AI) | November 2025 | 58 minutes | 12 minutes | | |
| GPT-5.2 (high) | December 2025 | 6 hours 34 minutes | 55 minutes | | |

References

  1. ^ Witt, Stephen (10 October 2025). "The A.I. Prompt That Could End the World". The New York Times. Archived from the original on 29 October 2025. Retrieved 29 October 2025.
  2. ^ "About METR". METR. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  3. ^ a b "OpenAI o3 and o4-mini System Card". OpenAI. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  4. ^ "GPT-4.5 system card". OpenAI. Retrieved 15 June 2025.
  5. ^ "Introducing Claude 3.5 Sonnet". Anthropic. Archived from the original on 6 February 2025. Retrieved 15 June 2025.
  6. ^ "Details about METR's preliminary evaluation of Claude 3.7". METR's Autonomy Evaluation Resources. 4 April 2025. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  7. ^ Robison, Kylie (8 August 2024). "OpenAI says its latest GPT-4o model is 'medium' risk". The Verge. Archived from the original on 21 October 2025. Retrieved 29 October 2025.
  8. ^ "ARC Evals is now METR". METR Blog. 4 December 2023. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  9. ^ Booth, Harry (5 September 2024). "TIME100 AI 2024: Beth Barnes". TIME. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  10. ^ Henshall, Will (21 March 2024). "Nobody Knows How to Safety-Test AI". TIME. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  11. ^ "Claude 3.7 Sonnet System Card". Anthropic. 24 February 2025. Retrieved 15 June 2025.
  12. ^ "Gemini 2.5 Pro Preview Model Card". Google. 6 June 2025. Archived from the original on 28 May 2025. Retrieved 15 June 2025.
  13. ^ "Measuring AI Ability to Complete Long Tasks". METR Blog. 19 March 2025. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
  14. ^ Lovely, Garrison (19 March 2025). "AI could soon tackle projects that take humans weeks". Nature. doi:10.1038/d41586-025-00831-8. ISSN 1476-4687. Archived from the original on 1 July 2025. Retrieved 15 June 2025.
  15. ^ "Time Horizon 1.1". METR Blog. 29 January 2026. Archived from the original on 12 February 2026. Retrieved 14 February 2026.
  16. ^ a b c d e "Task-Completion Time Horizons of Frontier AI Models". METR. February 2026. Retrieved 14 February 2026.
