Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud
The author tested multiple training interventions on a 163M-parameter GPT-2-style model to improve its performance. The best result came from combining four of them: gradient clipping, removing dropout, a higher scheduled learning rate, and a weight decay of 0.01. Together they achieved a test loss of 3.577761, better than the baseline's 3.691526 but still above the original GPT-2's 3.500.
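The four winning interventions can be sketched together in a PyTorch training step. This is a minimal illustration, not the author's actual code: the model here is a tiny stand-in for the 163M-parameter network, and the peak learning rate and cosine schedule length are assumed values, since the summary does not give the exact schedule.

```python
import torch
import torch.nn as nn

# Tiny stand-in dimensions for illustration; the real model is a
# 163M-parameter GPT-2-style transformer.
vocab_size, d_model = 1000, 64
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Dropout(0.1),               # removed by the intervention below
    nn.Linear(d_model, vocab_size),
)

# Intervention 1: remove dropout by zeroing every Dropout module's p.
for module in model.modules():
    if isinstance(module, nn.Dropout):
        module.p = 0.0

# Intervention 2: weight decay of 0.01, applied via AdamW.
# The peak LR of 6e-4 is an assumption standing in for the post's
# "higher scheduled learning rate".
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.01)

# Intervention 3: run that higher LR on a schedule (cosine decay is one
# plausible choice; the summary does not specify the schedule's shape).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

def training_step(batch_x, batch_y):
    optimizer.zero_grad()
    logits = model(batch_x)
    loss = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), batch_y.view(-1)
    )
    loss.backward()
    # Intervention 4: clip the global gradient norm before the step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()
```

Each intervention is independent of the others, which is what made it possible to test them individually before combining the ones that helped.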