Writing an LLM from scratch, part 32h -- Interventions: full fat float32
The author tested training a GPT-2 small model with PyTorch's automatic mixed precision (AMP) and lower-precision matrix multiplication optimizations switched off. The full float32 run took over 8 hours and cost $135, more than double the time and triple the cost of the baseline runs. The resulting model improved test loss by only 0.013, suggesting that these mixed-precision optimizations deliver large speed and cost savings with minimal impact on model quality.
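For context on what such an intervention involves, here is a minimal sketch (not the author's actual training loop) of how one might toggle between the two configurations in PyTorch. The model, data, and hyperparameters below are hypothetical stand-ins; the relevant pieces are `torch.set_float32_matmul_precision`, `torch.autocast`, and the gradient scaler, which are standard PyTorch APIs for controlling these optimizations.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real GPT-2 small model and training data (hypothetical).
model = nn.Linear(768, 768).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 768, device="cuda")
target = torch.randn(8, 768, device="cuda")

USE_AMP = False  # False approximates the "full fat float32" configuration

# "highest" forces true float32 matmuls; "high"/"medium" allow lower-precision
# (TF32) matmuls on supported GPUs -- the matmul optimization being disabled here.
torch.set_float32_matmul_precision("highest" if not USE_AMP else "high")

# The scaler is a no-op when AMP is disabled.
scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    # autocast runs eligible ops in float16 when enabled; with enabled=False
    # everything stays in float32.
    with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=USE_AMP):
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

With `USE_AMP = False`, this is the slow, pure-float32 path the post measures; flipping it to `True` restores the mixed-precision setup used in the baseline runs.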