The author completed training a GPT-2-like model in 44 hours on a local machine, achieving performance close to GPT-2 small. Through systematic testing of various interventions, they identified learning rate adjustments and dropout removal as most effective for improving model loss. The author plans to next implement an LLM from scratch using JAX without reference to their book.
12 items from gilesthomas-com
Updated instruction fine-tuning tests on GPT-2-style models show OpenAI's models performed best. Some custom models with similar test loss scores showed unexpected variations in instruction-following ability, with no clear pattern emerging across all tested models.
A researcher trained a GPT-2-style language model on 3.2 billion tokens and tracked its progress through 57 checkpoints. The model evolved from generating incoherent text to producing coherent, motivational content by processing about one-third of the training data.
The author explains gradient accumulation techniques to match cloud training batch sizes locally. By accumulating gradients over multiple forward-backward passes before optimizer updates, they achieve the stabilization benefits of larger batches without requiring more GPU memory. This allows local training with effective batch sizes comparable to cloud setups.
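The trick can be sketched numerically with a toy one-parameter least-squares model (all names here are illustrative, not the author's code): if each micro-batch gradient is scaled by micro_batch/total_batch before being summed, the accumulated value matches the full-batch mean gradient, so the optimizer step is equivalent to one taken with the larger batch.

```python
def grad(w, xs, ys):
    # Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w.
    n = len(xs)
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

# Full-batch gradient: what one pass on a large cloud GPU would compute.
full = grad(w, xs, ys)

# Accumulated gradient: two micro-batches of 2, each scaled so the
# sum reproduces the full-batch mean (equivalently, divide each
# micro-batch loss by the number of accumulation steps).
micro = 2
accum = 0.0
for i in range(0, len(xs), micro):
    accum += grad(w, xs[i:i + micro], ys[i:i + micro]) * (micro / len(xs))

assert abs(accum - full) < 1e-9
```

In a real PyTorch loop the same scaling is applied to the loss before each `backward()`, and `optimizer.step()` plus `optimizer.zero_grad()` run only once per accumulation cycle.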
Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud
The author tested multiple training interventions on a 163M-parameter GPT-2-style model to improve its performance. The best result came from combining gradient clipping, removing dropout, using a higher scheduled learning rate, and changing weight decay to 0.01, achieving a test loss of 3.577761. This was better than the baseline loss of 3.691526 but still above the original GPT-2's 3.500.
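Of those interventions, gradient clipping is the simplest to sketch in isolation. A toy pure-Python version of clip-by-global-norm (illustrative only; in PyTorch the equivalent is `torch.nn.utils.clip_grad_norm_`):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale a flat list of gradients so their global L2 norm
    is at most max_norm; leave them unchanged if already smaller."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads

# A gradient vector of norm 5 clipped to norm 1 keeps its direction:
clipped = clip_grad_norm([3.0, 4.0], max_norm=1.0)
assert abs(clipped[0] - 0.6) < 1e-12 and abs(clipped[1] - 0.8) < 1e-12
```

Clipping caps the size of any single update, which guards against the occasional huge gradient destabilizing training without changing well-behaved steps at all.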
The author tested various training interventions on a GPT-2-style model, finding learning rate scheduling provided the best improvement. Random seed experiments showed weight initialization significantly affects results, with losses ranging from 3.653 to 3.692, suggesting some intervention effects may be within noise levels.
The author tested training a GPT-2 small model without PyTorch's AMP and lower-precision matrix multiplication optimizations. The full float32 training took over 8 hours and cost $135, more than double the time and triple the cost of baseline runs. The resulting model showed only a tiny test loss improvement of 0.013, suggesting AMP provides significant speed benefits with minimal quality impact.
The author created a tool called lambda-manager to automate launching Lambda Labs instances. It monitors availability of specific instance types and launches them when they become available, then sends Telegram notifications. The tool has been running for six hours without finding the desired 8x A100 instance.
The article examines weight tying in LLMs, a technique that reduces parameters by sharing weights between input and output layers. The author tests this approach on a GPT-2 style model to see if it improves performance, despite research suggesting it typically worsens model quality.
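The mechanism is simply sharing one parameter tensor between two layers. A toy sketch (class and field names are hypothetical, and plain nested lists stand in for tensors):

```python
class TiedLM:
    """Toy sketch of weight tying: the output projection reuses the
    token-embedding matrix, so there is only one parameter tensor
    instead of two vocab x dim matrices."""
    def __init__(self, vocab, dim):
        # Embedding: vocab x dim matrix (nested lists stand in for a tensor).
        self.embedding = [[0.01 * (i + j) for j in range(dim)]
                          for i in range(vocab)]
        # Tied: the same object, not a copy -- gradients from both the
        # input lookup and the output logits flow into one matrix.
        self.output_weight = self.embedding

lm = TiedLM(vocab=4, dim=2)
lm.embedding[0][0] = 42.0
assert lm.output_weight[0][0] == 42.0  # one update is visible in both roles
```

In PyTorch this is typically a single assignment such as `out_head.weight = tok_emb.weight`, which for GPT-2 small removes roughly 50257 x 768 ≈ 38.6M parameters.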
The article examines weight decay as a regularization technique in training a GPT-2 small model from scratch. It explains that weight decay adds a penalty based on the squared L2 norm of model weights to the loss function to prevent overfitting. The author explores the mathematical formulation and its implementation in the AdamW optimizer.
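The penalty itself is small enough to write out directly. A minimal sketch of the classic L2 formulation, with hypothetical function names (lam is the decay coefficient λ):

```python
def l2_penalty(weights, lam):
    """Weight-decay penalty added to the loss: (lam/2) * ||w||^2."""
    return 0.5 * lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam):
    # d/dw of (lam/2) * w^2 is lam * w: each weight is pulled
    # toward zero in proportion to its own size.
    return [lam * w for w in weights]

# For weights [3, 4] and lam = 0.1: penalty = 0.05 * 25 = 1.25.
assert abs(l2_penalty([3.0, 4.0], 0.1) - 1.25) < 1e-12
```

Note that AdamW does not literally add this term to the loss; it applies the equivalent `lam * w` shrinkage directly in the update step, decoupled from the adaptive gradient scaling, which is the distinction the article works through.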
The author explores learning rate scheduling for training an LLM from scratch, examining why fixed learning rates can fail and discussing various decay methods including step, exponential, and cosine decay. The post focuses on implementing a cosine learning rate scheduler with warmup, following recommendations from the Chinchilla paper.
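A cosine schedule with linear warmup fits in a few lines. This is a generic sketch of the shape described, not the author's exact implementation (parameter names are illustrative):

```python
import math

def cosine_lr(step, max_steps, max_lr, min_lr, warmup_steps):
    """Linear warmup from ~0 to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Warmup: ramp linearly so early, noisy gradients take small steps.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay: progress goes 0 -> 1 over the remaining steps,
    # so the LR sweeps smoothly from max_lr down to min_lr.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Peak right at the end of warmup, floor at the final step:
assert abs(cosine_lr(100, 1000, 6e-4, 6e-5, 100) - 6e-4) < 1e-12
assert abs(cosine_lr(1000, 1000, 6e-4, 6e-5, 100) - 6e-5) < 1e-12
```

The same function can be handed to PyTorch via `torch.optim.lr_scheduler.LambdaLR` by returning a multiplier of the base LR instead of an absolute value.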
The author experiments with adding bias to attention weight matrices in a GPT-2 small model trained from scratch. Surprisingly, this intervention improved test loss by 0.023 compared to the baseline, contradicting conventional wisdom that such bias doesn't help modern LLMs. The model showed slightly better stability during training despite adding only minimal extra parameters.
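The "minimal extra parameters" claim is easy to check with arithmetic. A small sketch counting the parameters in a block's four attention projections (Q, K, V, and output), assuming square d_model x d_model projections as in GPT-2:

```python
def attn_param_count(d_model, bias=True):
    """Parameters in the four attention projections (Q, K, V, output):
    each is a d_model x d_model weight plus an optional bias vector."""
    per_proj = d_model * d_model + (d_model if bias else 0)
    return 4 * per_proj

# For GPT-2 small's d_model = 768, bias adds just 4 * 768 = 3072
# parameters per block, against ~2.36M weight parameters:
delta = attn_param_count(768, bias=True) - attn_param_count(768, bias=False)
assert delta == 4 * 768
```

Across 12 blocks that is about 37K extra parameters in a 163M-parameter model, roughly 0.02%, which is why any measurable loss change from the bias terms is notable.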