Writing an LLM from scratch, part 32e -- Interventions: the learning rate
The author explores learning rate scheduling for training an LLM from scratch, examining why fixed learning rates can fail and discussing various decay methods including step, exponential, and cosine decay. The post focuses on implementing a cosine learning rate scheduler with warmup, following recommendations from the Chinchilla paper.
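The scheduler described — linear warmup followed by cosine decay — can be sketched as a small function. This is a minimal illustration, not the post's actual code: the parameter names (`max_lr`, `min_lr`, `warmup_steps`, `total_steps`) and the linear warmup shape are assumptions for the example.

```python
import math

def lr_at_step(step, max_lr, min_lr, warmup_steps, total_steps):
    """Hypothetical cosine schedule with linear warmup (illustrative sketch).

    Ramps linearly from ~0 to max_lr over warmup_steps, then follows a
    half-cosine from max_lr down to min_lr by total_steps.
    """
    if step < warmup_steps:
        # Linear warmup: fraction of max_lr proportional to progress.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps: 0 at warmup end, 1 at the end.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At the end of warmup the rate peaks at `max_lr`, and by `total_steps` it has decayed to `min_lr`; a common rule of thumb (echoing the Chinchilla setup) is to set `min_lr` to roughly a tenth of `max_lr` and decay over the full training run.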