METR can barely measure Claude Mythos – 50% task horizon now exceeds 16 hours
METR's latest evaluation finds that Claude Mythos's 50% task horizon, the task length (measured in human working time) at which the model succeeds about half the time, now exceeds 16 hours, making it increasingly difficult for current benchmarks to measure the model's capabilities accurately.