TAG · #DEEP-LEARNING

30 items

HOTNESS

Writing an LLM from scratch, part 34b -- from bigrams to GPT-2, one component at a time (in JAX)
3.0
The author builds and trains a GPT-2 small model from scratch in JAX, starting from a basic bigram-style model and incrementally adding components such as LayerNorm and Transformer blocks. The model achieved a final loss of 3.418, beating the author's PyTorch version (3.538) and the original GPT-2 small (3.499) on the same test dataset.
gilesthomas-com · Jul 8, 2026 · #Tech
From bigrams to GPT-2, one component at a time (in Jax)
1.0
The article walks through building and training a GPT-2 Small-scale model from scratch using JAX, progressing from simple bigram models to a full transformer architecture component by component.
hn · Jul 8, 2026 · #Tech
FlashAttention-4: Algorithm and Kernel Pipelining
5.0
FlashAttention-4 introduces a new algorithm and kernel pipelining design that addresses asymmetric hardware scaling, improving performance on modern GPU architectures by better managing memory and compute resources.
hn · Jul 3, 2026 · #Tech
FlashAttention-4: Algorithm and Kernel Pipelining
6.0
FlashAttention-4 introduces a co-designed algorithm and kernel pipelining approach to improve attention computation on hardware with asymmetric scaling properties, enhancing efficiency and throughput on modern accelerators.
hn · Jul 2, 2026 · #Tech
Is One Layer Enough? A Single Transformer Layer Matches Full-Parameter RL Train
2.0
A new study shows that a single transformer layer can match the performance of full-parameter reinforcement learning training, questioning the necessity of deep architectures for certain RL tasks.
hn · Jul 2, 2026 · #Tech
Modern AI: Foundations, Learning, and Systems – Videos
2.0
A playlist titled "Modern AI: Foundations, Learning, and Systems" containing videos that cover core concepts in artificial intelligence, including foundational theories, machine learning approaches, and AI system architectures.
hn · Jul 1, 2026 · #Tech
RayTention – Self-Attention via Geometric Signal Extraction
3.0
RayTention introduces a novel self-attention mechanism that replaces traditional dot-product attention with geometric signal extraction, aiming to improve efficiency and interpretability in transformer models by leveraging spatial-angular decomposition of attention patterns.
hn · Jul 1, 2026 · #Tech
Matrix Orthogonalization Improves Memory in Recurrent Models
3.0
The article discusses how applying matrix orthogonalization techniques to recurrent neural network models improves their long-term memory retention and training stability.
hn · Jul 1, 2026 · #Tech
On the Efficacy of PyTorch for High-Performance Computing
4.0
This paper evaluates PyTorch's performance in high-performance computing (HPC) contexts, analyzing its efficiency for large-scale scientific computing workloads compared to traditional HPC frameworks. It examines memory usage, computational throughput, and scalability across different hardware configurations.
hn · Jun 30, 2026 · #Tech
Building a Jax training loop for an LLM training run
7.0
The article details the process of constructing a JAX training loop for training a large language model (LLM) from scratch. It covers key components such as data loading, model initialization, forward/backward passes, and optimization steps within the JAX framework. The post serves as a practical guide for implementing efficient and scalable LLM training runs using JAX.
hn · Jun 30, 2026 · #Tech
Fail Fast, Run Faster: Shape Safe Deep Learning in Rust on Apple Silicon [pdf]
0.0
A Rust-based deep learning framework for Apple Silicon uses shape-safe tensors and compile-time checks to prevent runtime errors, leveraging Metal Performance Shaders for faster training and inference with memory safety.
hn · Jun 30, 2026 · #Tech
It's Always the Learning Rates
2.0
The blog post argues that learning rates are often the root cause of training failures in machine learning, emphasizing that tuning learning rates—rather than more complex architectural changes—frequently resolves issues like divergence or slow convergence. The author shares practical advice on learning rate schedules, warmup, and debugging training runs.
hn · Jun 29, 2026 · #Tech
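A learning rate schedule with warmup, of the kind the summary above mentions, might look like the sketch below. This is not taken from the post: the function name `lr_at_step`, the linear-warmup-then-cosine-decay shape, and every default value are assumptions chosen for illustration.

```python
import math


def lr_at_step(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5):
    """Hypothetical schedule: linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # Hold the floor once the schedule is finished.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for step in (0, 50, 99, 100, 550, 1000):
        print(step, lr_at_step(step))
```

Printing the value at a few steps, as the final lines do, is a cheap first check when a run diverges or stalls.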
Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch
3.0
A developer built NanoEuler, a GPT-2-scale language model written entirely in pure C and CUDA from scratch, without any intermediary frameworks. The project aims to understand the low-level composition of LLMs, the correlation between parameters and data, and GPU optimization. Models of up to 23 million parameters were trained on Shakespeare.txt, and the project includes SFT for chatbot-like behavior.
hn · Jun 28, 2026 · #Tech
Foveon – Bayer to Foveon X3, learned, Mac App using deep learning
1.0
Foveon is a Mac application that uses deep learning to convert Bayer-pattern sensor images (standard digital camera raw files) into the look of Foveon X3 sensor output, aiming to reproduce the distinct color and detail characteristics of Foveon captures.
hn · Jun 28, 2026 · #Tech
Attention is all we have
0.0
David Bessis argues that "attention" is the fundamental cognitive ability underlying intelligence, creativity, and even mathematics. He suggests that attention, rather than logic or reasoning alone, is the core mechanism that allows humans to learn, think, and innovate.
hn · Jun 28, 2026 · #Tech
Sequence Modeling with CTC
6.5
This article explains Connectionist Temporal Classification (CTC), a method for sequence modeling that allows training recurrent neural networks on sequence data without requiring alignment between input and target sequences. It describes how CTC works, its applications in speech recognition and handwriting recognition, and its advantages over traditional approaches.
hn · Jun 27, 2026 · #Tech
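CTC avoids needing an alignment by mapping many frame-level label paths to one output sequence: merge consecutive repeats, then drop a special blank symbol. The sketch below shows only that collapsing rule, not the CTC loss itself. The helper name `ctc_collapse` and the choice of `-` as the blank are assumptions for illustration and do not come from the article.

```python
from itertools import groupby

BLANK = "-"  # hypothetical blank symbol; real systems usually reserve an integer id


def ctc_collapse(path, blank=BLANK):
    """Collapse a frame-level path: merge repeated symbols, then remove blanks."""
    # groupby yields one key per run of identical consecutive symbols.
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != blank)


if __name__ == "__main__":
    print(ctc_collapse("hh-e-ll-lo--"))
    print(ctc_collapse("aa-a"))
```

The blank between the two runs of `l` is what lets the output keep a genuine double letter; without it the repeats would merge into one.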
Ask HN: Has Ilya Sutskever spoken publicly lately?
2.0
A user on Hacker News asks whether Ilya Sutskever has made any public appearances, talks, interviews, or published papers in the past year, noting his recent absence from the public eye.
hn · Jun 27, 2026 · #Tech
Modern GPU Programming for MLSys
2.0
This course covers modern GPU programming techniques for machine learning systems, including CUDA, Triton, and other GPU programming models, aimed at helping developers build and optimize efficient ML workloads on GPUs.
hn · Jun 27, 2026 · #Tech
Transformers Explained for Software Engineers
1.0
The article explains the Transformer architecture and attention mechanism for software engineers, breaking down key concepts like self-attention, multi-head attention, and the encoder-decoder structure that underpin modern large language models.
hn · Jun 26, 2026 · #Tech
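For readers who prefer code to diagrams, scaled dot-product attention can be sketched in plain Python as below. This is a single-head illustration with no masking, no batching, and no learned projections. The function names `softmax` and `attention` and the list-of-lists layout are assumptions, not the article's code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


if __name__ == "__main__":
    print(attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 6.0]]))
```

When every key is identical, as in the usage line, the weights are uniform and the output is simply the mean of the value rows.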
Scaling Laws, Carefully
7.0
The article reviews scaling laws in deep learning, emphasizing careful empirical measurement to predict model performance. It discusses how loss scales with compute, data, and model size, but warns that the laws break down when key assumptions are not met. It also provides practical advice for conducting and interpreting scaling experiments.
hn · Jun 26, 2026 · #Tech
Beyond Objects
2.0
The paper "Beyond Objects" explores limitations in current object-centric AI models and proposes new approaches to represent and reason about non-object-centric aspects of the world, such as fluids, materials, and continuous phenomena, to achieve more comprehensive scene understanding.
hn · Jun 26, 2026 · #Science
Show HN: A Transformer Is All You Need
1.0
The article introduces "A Transformer Is All You Need," a project or paper that explores the transformer architecture as a foundational model for various machine learning tasks, highlighting its simplicity and effectiveness compared to previous approaches.
hn · Jun 26, 2026 · #Tech
Mapping Networks: CVPR 2026 Best Paper Award Nominee
6.0
Mapping Networks has been nominated for the CVPR 2026 Best Paper Award.
hn · Jun 26, 2026 · #Tech
Modern GPU Programming for MLSys Book
4.0
This book provides an introduction to modern GPU programming techniques with a focus on machine learning systems. It covers GPU architecture, CUDA, and optimization strategies relevant for building efficient ML models and frameworks.
hn · Jun 26, 2026 · #Tech
Scaling Laws, Carefully
2.0
A careful review of neural network scaling laws covering empirical findings, token-counting nuances, compute-optimal training (Chinchilla law), and loss measurement methodology, with recommendations for avoiding common pitfalls.
hn · Jun 25, 2026 · #Tech
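As a rough worked example of compute-optimal sizing, the sketch below splits a FLOP budget between parameters and tokens. It rests on two commonly quoted approximations that are assumptions here, not figures from the article: training compute C ≈ 6·N·D, and about 20 training tokens per parameter as the Chinchilla rule of thumb. The function name `compute_optimal` is hypothetical.

```python
import math


def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens), assuming C = 6*N*D and D = r*N."""
    # Substituting D = r*N into C = 6*N*D gives N = sqrt(C / (6*r)).
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    n, d = compute_optimal(1.2e20)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

Both constants shift with the dataset, tokenizer, and how parameters and tokens are counted, which is exactly the kind of nuance the review warns about.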
World Action Models: A Survey
4.0
This survey reviews World Action Models (WAMs), which integrate world models with action generation for decision-making tasks. It covers key components, architectures, and training methods, and discusses applications in robotics, gaming, and autonomous systems, as well as open challenges.
hn · Jun 24, 2026 · #Science
An ECG biomarker for sudden cardiac death discovered with deep learning
7.5
Researchers used deep learning to discover an ECG-based biomarker that can identify individuals at high risk of sudden cardiac death, potentially enabling earlier prevention and intervention strategies.
hn · Jun 24, 2026 · #Science
Puzzling Success of Overparameterization: Lottery Tickets or Escape Dimensions?
2.0
The paper investigates why overparameterized neural networks generalize well, proposing that overparameterization creates "escape dimensions" that help gradient descent find good solutions, contrasting with the lottery ticket hypothesis.
hn · Jun 24, 2026 · #Science
Show HN: ReflexConv2d – 57% less blur in image reconstruction
3.0
ReflexConv2D is a new convolutional layer designed for image reconstruction tasks, claiming to achieve 57% less blur compared to standard methods, with the project open-sourced on GitHub.
hn · Jun 23, 2026 · #Tech
Model Size Scaling in 2023-2031
8.0
The article analyzes trends in AI model size scaling over 2023-2031, projecting that continued exponential growth in parameter counts may lead to models with trillions of parameters by the early 2030s. It also discusses the limits and challenges of such scaling, including compute costs and data availability.
hn · Jun 23, 2026 · #Tech
