Writing an LLM from scratch, part 32f -- Interventions: weight decay
The article examines weight decay as a regularization technique while training a GPT-2 small model from scratch. It explains that weight decay adds a penalty proportional to the squared L2 norm of the model's weights to the loss function, discouraging large weights and thereby reducing overfitting. The author then works through the mathematical formulation and looks at how the AdamW optimizer implements it.
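The idea can be sketched in a few lines. The snippet below is a minimal illustration (not code from the article), simplified to a plain SGD-style step with a single scalar weight so the decay term is easy to see; all names and numbers are illustrative. It contrasts classic L2 regularization, which folds the penalty `λ‖w‖²` into the gradient, with the decoupled form that AdamW applies, where the weight is shrunk directly each step.

```python
# Minimal sketch of weight decay, reduced to one weight and an SGD-style
# update. Values are illustrative, not from the article.

lr = 0.1    # learning rate
wd = 0.01   # weight decay coefficient (the lambda in lr * wd * w)
w = 1.0     # a single model weight
grad = 0.0  # pretend the loss gradient is zero this step

# Classic L2 regularization adds the penalty's gradient to the loss gradient:
#   grad_total = grad + wd * w
# Decoupled weight decay (AdamW-style) instead shrinks the weight directly:
w = w - lr * grad - lr * wd * w

# Even with a zero loss gradient, the weight is pulled toward zero.
print(w)
```

In a real training loop this is what `torch.optim.AdamW(params, weight_decay=0.01)` does for you, with the gradient step replaced by Adam's moment-based update; the decay term is kept separate from (not scaled by) the adaptive learning rates.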