The author completed training a GPT-2-like model in 44 hours on a local machine, achieving performance close to GPT-2 small. Through systematic testing of various interventions, they identified learning rate adjustments and dropout removal as most effective for improving model loss. The author plans to next implement an LLM from scratch using JAX without reference to their book.
12 items from gilesthomas-com
Updated instruction fine-tuning tests on GPT-2-style models show OpenAI's models performed best. Some custom models with similar test loss scores showed unexpected variations in instruction-following ability, with no clear pattern emerging across all tested models.
A researcher trained a GPT-2-style language model on 3.2 billion tokens and tracked its progress through 57 checkpoints. The model evolved from generating incoherent text to producing coherent, motivational content by processing about one-third of the training data.
The author explains gradient accumulation techniques to match cloud training batch sizes locally. By accumulating gradients over multiple forward-backward passes before optimizer updates, they achieve the stabilization benefits of larger batches without requiring more GPU memory. This allows local training with effective batch sizes comparable to cloud setups.
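The trick can be sketched numerically with a toy one-parameter least-squares model (all names here are illustrative, not the author's code): if each micro-batch gradient is scaled by micro_batch/total_batch before being summed, the accumulated value matches the full-batch mean gradient, so the optimizer step is equivalent to one taken with the larger batch.

```python
def grad(w, xs, ys):
    # Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w.
    n = len(xs)
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

# Full-batch gradient: what one pass on a large cloud GPU would compute.
full = grad(w, xs, ys)

# Accumulated gradient: two micro-batches of 2, each scaled so the
# sum reproduces the full-batch mean (equivalently, divide each
# micro-batch loss by the number of accumulation steps).
micro = 2
accum = 0.0
for i in range(0, len(xs), micro):
    accum += grad(w, xs[i:i + micro], ys[i:i + micro]) * (micro / len(xs))

assert abs(accum - full) < 1e-9
```

In a real PyTorch loop the same scaling is applied to the loss before each `backward()`, and `optimizer.step()` plus `optimizer.zero_grad()` run only once per accumulation cycle.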
Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud
The author tested multiple training interventions on a 163M-parameter GPT-2-style model to improve its performance. The best result came from combining gradient clipping, removing dropout, using a higher scheduled learning rate, and changing weight decay to 0.01, achieving a test loss of 3.577761. This was better than the baseline loss of 3.691526 but still above the original GPT-2's 3.500.
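Of those interventions, gradient clipping is the simplest to sketch in isolation. A toy pure-Python version of clip-by-global-norm (illustrative only; in PyTorch the equivalent is `torch.nn.utils.clip_grad_norm_`):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale a flat list of gradients so their global L2 norm
    is at most max_norm; leave them unchanged if already smaller."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads

# A gradient vector of norm 5 clipped to norm 1 keeps its direction:
clipped = clip_grad_norm([3.0, 4.0], max_norm=1.0)
assert abs(clipped[0] - 0.6) < 1e-12 and abs(clipped[1] - 0.8) < 1e-12
```

Clipping caps the size of any single update, which guards against the occasional huge gradient destabilizing training without changing well-behaved steps at all.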
The author tested various training interventions on a GPT-2-style model, finding learning rate scheduling provided the best improvement. Random seed experiments showed weight initialization significantly affects results, with losses ranging from 3.653 to 3.692, suggesting some intervention effects may be within noise levels.
The author tested training a GPT-2 small model without PyTorch's AMP and lower-precision matrix multiplication optimizations. The full float32 training took over 8 hours and cost $135, more than double the time and triple the cost of baseline runs. The resulting model showed only a tiny test loss improvement of 0.013, suggesting AMP provides significant speed benefits with minimal quality impact.
The author created a tool called lambda-manager to automate launching Lambda Labs instances. It monitors availability of specific instance types and launches them when they become available, then sends Telegram notifications. The tool has been running for six hours without finding the desired 8x A100 instance.
The article examines weight tying in LLMs, a technique that reduces parameters by sharing weights between input and output layers. The author tests this approach on a GPT-2 style model to see if it improves performance, despite research suggesting it typically worsens model quality.
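The mechanism is simply sharing one parameter tensor between two layers. A toy sketch (class and field names are hypothetical, and plain nested lists stand in for tensors):

```python
class TiedLM:
    """Toy sketch of weight tying: the output projection reuses the
    token-embedding matrix, so there is only one parameter tensor
    instead of two vocab x dim matrices."""
    def __init__(self, vocab, dim):
        # Embedding: vocab x dim matrix (nested lists stand in for a tensor).
        self.embedding = [[0.01 * (i + j) for j in range(dim)]
                          for i in range(vocab)]
        # Tied: the same object, not a copy -- gradients from both the
        # input lookup and the output logits flow into one matrix.
        self.output_weight = self.embedding

lm = TiedLM(vocab=4, dim=2)
lm.embedding[0][0] = 42.0
assert lm.output_weight[0][0] == 42.0  # one update is visible in both roles
```

In PyTorch this is typically a single assignment such as `out_head.weight = tok_emb.weight`, which for GPT-2 small removes roughly 50257 x 768 ≈ 38.6M parameters.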
The article examines weight decay as a regularization technique in training a GPT-2 small model from scratch. It explains that weight decay adds a penalty based on the squared L2 norm of model weights to the loss function to prevent overfitting. The author explores the mathematical formulation and its implementation in the AdamW optimizer.
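The penalty itself is small enough to write out directly. A minimal sketch of the classic L2 formulation, with hypothetical function names (lam is the decay coefficient λ):

```python
def l2_penalty(weights, lam):
    """Weight-decay penalty added to the loss: (lam/2) * ||w||^2."""
    return 0.5 * lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam):
    # d/dw of (lam/2) * w^2 is lam * w: each weight is pulled
    # toward zero in proportion to its own size.
    return [lam * w for w in weights]

# For weights [3, 4] and lam = 0.1: penalty = 0.05 * 25 = 1.25.
assert abs(l2_penalty([3.0, 4.0], 0.1) - 1.25) < 1e-12
```

Note that AdamW does not literally add this term to the loss; it applies the equivalent `lam * w` shrinkage directly in the update step, decoupled from the adaptive gradient scaling, which is the distinction the article works through.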
The author explores learning rate scheduling for training an LLM from scratch, examining why fixed learning rates can fail and discussing various decay methods including step, exponential, and cosine decay. The post focuses on implementing a cosine learning rate scheduler with warmup, following recommendations from the Chinchilla paper.
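A cosine schedule with linear warmup fits in a few lines. This is a generic sketch of the shape described, not the author's exact implementation (parameter names are illustrative):

```python
import math

def cosine_lr(step, max_steps, max_lr, min_lr, warmup_steps):
    """Linear warmup from ~0 to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Warmup: ramp linearly so early, noisy gradients take small steps.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay: progress goes 0 -> 1 over the remaining steps,
    # so the LR sweeps smoothly from max_lr down to min_lr.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Peak right at the end of warmup, floor at the final step:
assert abs(cosine_lr(100, 1000, 6e-4, 6e-5, 100) - 6e-4) < 1e-12
assert abs(cosine_lr(1000, 1000, 6e-4, 6e-5, 100) - 6e-5) < 1e-12
```

The same function can be handed to PyTorch via `torch.optim.lr_scheduler.LambdaLR` by returning a multiplier of the base LR instead of an absolute value.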
The author experiments with adding bias to attention weight matrices in a GPT-2 small model trained from scratch. Surprisingly, this intervention improved test loss by 0.023 compared to the baseline, contradicting conventional wisdom that such bias doesn't help modern LLMs. The model showed slightly better stability during training despite adding only minimal extra parameters.
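The "minimal extra parameters" claim is easy to check with arithmetic. A small sketch counting the parameters in a block's four attention projections (Q, K, V, and output), assuming square d_model x d_model projections as in GPT-2:

```python
def attn_param_count(d_model, bias=True):
    """Parameters in the four attention projections (Q, K, V, output):
    each is a d_model x d_model weight plus an optional bias vector."""
    per_proj = d_model * d_model + (d_model if bias else 0)
    return 4 * per_proj

# For GPT-2 small's d_model = 768, bias adds just 4 * 768 = 3072
# parameters per block, against ~2.36M weight parameters:
delta = attn_param_count(768, bias=True) - attn_param_count(768, bias=False)
assert delta == 4 * 768
```

Across 12 blocks that is about 37K extra parameters in a 163M-parameter model, roughly 0.02%, which is why any measurable loss change from the bias terms is notable.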