From martinalderson.com

Are we in a GPT-4-style leap that evals can't see?

Gemini 3 Pro's design capabilities and Opus 4.5's reduced babysitting needs represent a subtle but significant leap that traditional benchmarks completely miss.