
METR finds Claude Mythos increasingly hard to measure: 50% time horizon exceeds 16 hours

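An evaluation by METR (Model Evaluation and Threat Research) shows that Claude Mythos's ability to complete tasks has reached a level that existing measurement methods can no longer capture. Its 50% time horizon (the length of task, measured by how long it takes a human, that the model completes with a 50% success rate) now exceeds 16 hours, and current benchmarks can no longer adequately measure reasoning ability over such long horizons.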
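To make the "50% time horizon" concrete, here is a minimal sketch of how such a number can be estimated, assuming the approach METR has described publicly: fit a logistic curve of success probability against the base-2 logarithm of each task's human completion time, then report the task length at which predicted success is 50%. The task data, the function names `fit_logistic` and `horizon_50`, and the choice of Newton's method for the fit are all hypothetical illustrations, not METR's code.

```python
import math

# Hypothetical data: human completion time of each task (minutes)
# and whether the model succeeded (1) or failed (0) on it.
minutes = [1, 2, 4, 8, 16, 64, 128, 256, 512, 1024]
outcomes = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]


def fit_logistic(xs, ys, iters=50):
    """Fit p(success) = sigmoid(a + b*x) by maximum likelihood (Newton's method)."""
    a = b = 0.0
    for _ in range(iters):
        ga = gb = 0.0             # gradient of the log-likelihood
        haa = hab = hbb = 0.0     # negative Hessian entries
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            ga += y - p
            gb += (y - p) * x
            haa += w
            hab += w * x
            hbb += w * x * x
        det = haa * hbb - hab * hab
        # Newton step: solve H * delta = g using the 2x2 inverse.
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    return a, b


def horizon_50(task_minutes, successes):
    """Task length (minutes) at which the fitted success probability is 50%."""
    xs = [math.log2(m) for m in task_minutes]
    a, b = fit_logistic(xs, successes)
    # sigmoid(a + b*x) = 0.5 exactly where a + b*x = 0, i.e. x = -a / b.
    return 2 ** (-a / b)


print(f"{horizon_50(minutes, outcomes):.1f}")  # prints 32.0
```

One plausible reading of why measurement becomes difficult (an interpretation, not a claim made in the source): once a model succeeds on nearly every task in a suite, few observations remain near its 50% point, and the fitted horizon becomes an extrapolation beyond the longest tasks available. If it succeeds on every task, the data are perfectly separable and the maximum-likelihood fit does not converge at all.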

Related articles

Misplaced panic over AI progress
Gary Marcus critiques the interpretation of METR's latest "time horizon" graph, arguing that fears over rapid AI progress are misplaced. He breaks down what the data actually shows versus the overblown claims about AI taking over human tasks, emphasizing that the graph measures specific task completion times rather than general intelligence or autonomous capabilities.
