Writing an LLM from scratch, part 32d -- Interventions: adding attention bias
The author experiments with adding bias terms to the query, key, and value projection matrices in the attention layers of a GPT-2 small model trained from scratch. Surprisingly, this intervention reduced test loss by 0.023 relative to the baseline, contradicting the conventional wisdom that attention bias doesn't help modern LLMs. Training was also slightly more stable, and the bias terms add only a negligible number of extra parameters.
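To make the intervention concrete, here is a minimal NumPy sketch of causal self-attention where the Q/K/V biases are optional, mimicking the `qkv_bias`-style flag common in from-scratch GPT implementations. This is an illustrative sketch, not the author's code: the function name, the tiny dimensions in the demo, and the parameter-count estimate in the comments (12 layers, d_model = 768, Q/K/V biases only) are assumptions, not taken from the post.

```python
import numpy as np

def attention(x, W_q, W_k, W_v, b_q=None, b_k=None, b_v=None):
    """Single-head causal self-attention with optional projection biases."""
    # Project inputs; biases are the optional extra parameters under study.
    q = x @ W_q + (b_q if b_q is not None else 0.0)
    k = x @ W_k + (b_k if b_k is not None else 0.0)
    v = x @ W_v + (b_v if b_v is not None else 0.0)
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    # Causal mask: each position may only attend to itself and earlier tokens.
    T = scores.shape[-1]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Demo with tiny dimensions so it runs instantly; GPT-2 small would use
# d_model = 768 (an assumption about the setup, not stated in the post).
rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.standard_normal((1, T, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
no_bias = attention(x, W_q, W_k, W_v)
with_bias = attention(x, W_q, W_k, W_v,
                      b_q=np.zeros(d), b_k=np.zeros(d), b_v=np.ones(d))
# Rough overhead if only Q/K/V biases are added across GPT-2 small's
# 12 layers: 12 * 3 * 768 = 27,648 parameters, a tiny fraction of ~124M.
```

Because each attention row's weights sum to 1, a value bias of all ones simply shifts every output element by exactly 1.0 here, which makes the effect of the flag easy to verify by eye; the query and key biases instead reshape the attention pattern itself.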