Writing an LLM from scratch, part 32i -- Interventions: what is in the noise?
The author tested various training interventions on a GPT-2-style model and found that learning rate scheduling gave the largest improvement. Random seed experiments showed that weight initialization alone has a noticeable effect on results, with losses ranging from 3.653 to 3.692 (a spread of 0.039), so the effects of some interventions may fall within the noise.
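One way to make the "within noise" question concrete is to compare the change an intervention produces against the seed-to-seed spread. The following is a minimal sketch, not the author's method: the function names `seed_noise_range` and `within_noise` are hypothetical, the only real numbers are the two endpoints quoted above (3.653 and 3.692), and the baseline/intervention losses in the usage lines are made-up values for illustration only.

```python
def seed_noise_range(seed_losses):
    """Return (min, max, spread) for losses from runs that differ only in seed."""
    lo, hi = min(seed_losses), max(seed_losses)
    return lo, hi, hi - lo


def within_noise(baseline_loss, intervention_loss, seed_losses):
    """True if the loss change is no larger than the seed-to-seed spread."""
    _, _, spread = seed_noise_range(seed_losses)
    return abs(baseline_loss - intervention_loss) <= spread


# Only the two endpoints reported in the summary are used here; the
# individual per-seed losses are not given, so they are not invented.
seed_losses = [3.653, 3.692]
lo, hi, spread = seed_noise_range(seed_losses)
print(f"seed range: {lo:.3f} to {hi:.3f}, spread {spread:.3f}")

# Hypothetical (made-up) loss pairs, purely to show how the check behaves.
small_change = within_noise(3.690, 3.680, seed_losses)  # 0.010 change
large_change = within_noise(3.700, 3.600, seed_losses)  # 0.100 change
print(small_change, large_change)
```

A min-to-max range is a crude yardstick: with more seeds, a standard deviation or a confidence interval over repeated runs of both the baseline and the intervention would give a fairer comparison.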