The article explains that LLM inference is primarily memory-bound rather than compute-bound due to the autoregressive decoding process. Each step generates only one token and requires loading the entire model's parameters from memory, making memory bandwidth the key bottleneck. This contrasts with training, which is compute-bound because of batched processing of many tokens simultaneously.
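As a rough illustration of the memory-bound argument, the sketch below compares two lower bounds on the time of one forward step: the time to stream the weights from memory and the time to do the arithmetic. The 7B-parameter model, 2 bytes per weight, 1 TB/s bandwidth, 100 TFLOP/s peak, and the rule of thumb of roughly 2 FLOPs per parameter per token are illustrative assumptions, not figures from the article. KV-cache and activation traffic are ignored.

```python
def step_time_bounds(n_params, bytes_per_param, tokens_per_step,
                     mem_bandwidth, peak_flops):
    """Return (memory_time, compute_time) lower bounds in seconds for one step.

    Assumes every weight is read once per step and that a forward pass
    costs about 2 FLOPs per parameter per token (a common rule of thumb).
    """
    weight_bytes = n_params * bytes_per_param
    flops = 2 * n_params * tokens_per_step
    t_memory = weight_bytes / mem_bandwidth
    t_compute = flops / peak_flops
    return t_memory, t_compute


def bottleneck(n_params, bytes_per_param, tokens_per_step,
               mem_bandwidth, peak_flops):
    """Name the larger of the two bounds: 'memory' or 'compute'."""
    t_memory, t_compute = step_time_bounds(
        n_params, bytes_per_param, tokens_per_step, mem_bandwidth, peak_flops
    )
    return "memory" if t_memory > t_compute else "compute"


if __name__ == "__main__":
    # Hypothetical hardware and model numbers, chosen only for illustration.
    hw = dict(n_params=7e9, bytes_per_param=2,
              mem_bandwidth=1e12, peak_flops=1e14)
    for tokens in (1, 4096):  # single-token decode vs. a large training-style batch
        print(tokens, bottleneck(tokens_per_step=tokens, **hw))
```

Under these assumed numbers, a one-token decode step is limited by weight loading, while a step that processes thousands of tokens at once is limited by arithmetic, which matches the inference-versus-training contrast in the summary.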
#deep-learning
30 items
A technical breakdown of how a Large Language Model processes text: input is tokenized, embedded, and passed through attention and feed-forward layers to generate output predictions.
Vision Language Models (VLMs) combine computer vision and natural language processing, allowing AI to understand images and generate text descriptions. They work by aligning visual features from images with textual representations from language models, enabling tasks like image captioning and visual question answering.
DeepSeek-OCR is a visual language model for complex OCR tasks like scene text, math, tables, and handwriting. It converts images to visual tokens and uses two-stage training for improved reading and reasoning.
MixFont trained a Transformer-based font generation model on 20,000 commercial fonts, achieving high-quality vector font creation from single-character input, and developed a font reranker. The write-up shares key technical lessons about data quality, model architecture, and the engineering challenges of building a frontier generative model for typography.
Physics Informed Neural Networks (PINNs) embed physical laws directly into neural network training, enabling the model to learn from both data and governing equations. This approach allows accurate predictions even with limited or noisy data, useful in scientific computing and engineering.
This article explains how AI agents operate through a three-layer architecture: perception, reasoning, and action. It details how agents perceive their environment, process information using large language models, and execute tasks via tools and APIs. The piece also covers key components like memory, planning, and feedback loops that enable autonomous decision-making.
Next-token prediction has proven surprisingly effective for training capable language models, but the approach faces fundamental limitations in planning, reasoning, and factual accuracy that scaling alone may not resolve.
KlongPy now includes a PyTorch back end, enabling array operations on both CPU and GPU via PyTorch tensors, along with an autograd engine for automatic differentiation. This allows KlongPy to leverage PyTorch's performance and GPU acceleration for numerical computing and machine learning.
The video argues that common intuitive explanations of tensors, such as "a tensor is just a multi-dimensional array," are oversimplified and misleading, and explores the deeper mathematical concept of tensors as multilinear maps that transform according to specific rules.
The article outlines key methodologies used in training frontier AI models, including data curation, scaling laws, reinforcement learning from human feedback (RLHF), and emerging techniques like process-supervised reward models. It discusses how these methods contribute to improved model performance, alignment, and reasoning capabilities.
The article explores options for local LLM inference beyond expensive NVIDIA setups, focusing on Mac hardware and distributed inference methods like layer splitting, expert parallelism, and model ensembling as alternative approaches.
The article introduces the Self Teaching Autoencoder (STAE), a neural network architecture that learns to compress and reconstruct data without requiring labeled training data or a separate pre-training phase. It combines autoencoding with self-supervised learning principles, allowing the model to teach itself effective representations directly from raw input data.
The article introduces ML-PICO, a practical system for learned image compression, focusing on what design choices matter for real-world deployment rather than just rate-distortion performance. It presents findings on architecture, entropy modeling, and engineering trade-offs to make learned compression viable in practice.
A new finding shows that matrix multiplications on GPUs execute significantly faster when the input data has predictable, regular patterns compared to random data. This performance difference arises from how GPUs handle memory access patterns and cache behavior, with structured data enabling more efficient parallel processing.
The article explains how to accelerate deep learning training from first principles, covering GPU memory hierarchy, kernel fusion, parallelization strategies, and practical techniques to maximize hardware utilization, ultimately showing that understanding these fundamentals can lead to order-of-magnitude speed improvements.
Codeep.dev is a tool that allows users to deeply explore and analyze codebases, providing insights into code structure, dependencies, and implementation details. It helps developers navigate and understand complex code more effectively.
The article compares the computational complexity of the human brain versus deep learning models, estimating that the brain operates at roughly 10^16 FLOPS while modern GPUs achieve around 10^12 FLOPS, and discusses the implications for achieving artificial general intelligence and the singularity.
The paper introduces CODA, a framework that rewrites transformer blocks as GEMM-epilogue programs, enabling flexible and efficient execution by exposing matrix multiplication and post-processing as a unified, composable computation pattern.
DashAttention introduces a differentiable and adaptable sparse hierarchical attention mechanism that improves efficiency in transformer models by learning sparse attention patterns end-to-end, reducing computational cost while maintaining model performance.
PyTorch 2.12 has been released, introducing new features and improvements including enhanced performance for torch.compile, expanded support for Intel GPU and other hardware backends, and updates to the PyTorch API. Key highlights include better integration with the TorchInductor compiler and various bug fixes and optimizations across the framework.
The article provides a formula to calculate GPU memory requirements for running large language models, helping users determine which models fit on their specific GPU hardware. It covers key factors like model parameters, quantization, activations, and context length for the 2026 generation of LLMs.
An interactive website visually explains the concepts of KV Cache and Flash Attention, two key optimization techniques used in transformer-based language models to improve inference efficiency and memory usage.
Andrew Ng's DeepLearning.AI launched a course on AI agents for image and video generation, built with Google Cloud. It teaches three evaluation techniques (image-text similarity, LLM judging, and structured rubrics) that agents can use to improve the quality of their own output.
Kaiser, Kosowski, Jones, and Lechner debate the strengths and limitations of Transformer architectures versus emerging post-Transformer alternatives in machine learning, covering efficiency, scalability, and future directions for sequence modeling.
This paper introduces scalable packed layouts for vector-length-agnostic machine learning code generation. It proposes techniques to efficiently pack and unpack data for variable-length vector architectures, enabling ML kernels that are both portable and performant across different hardware.
RigidFormer introduces a transformer-based architecture designed to learn rigid-body dynamics from data, offering a novel approach to modeling physical interactions and object motion in 3D environments.
This blog post explores how language models (LMs) generalize during pre-training, examining the interplay between training data, model architecture, and learning dynamics that leads to emergent abilities.
This article explains State Space Models (SSMs) by walking through their mathematical formulation alongside Python code implementations. It covers the continuous-time representation, discretization process, and how SSMs map input sequences to outputs using hidden states, drawing parallels to recurrent and convolutional neural networks.
Nous Research introduces Lighthouse Attention, a new method that improves attention mechanisms in transformer models by enabling more efficient processing of long sequences without sacrificing accuracy or speed.
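To make the State Space Model entry above more concrete, here is a minimal sketch of the pipeline it describes: a continuous-time model is discretized, then run either as a recurrence over hidden states or as a convolution over the input. The diagonal state matrix, the zero-order-hold discretization, and all function names are assumptions chosen for brevity; the article's own code may take a different route.

```python
import math


def discretize_zoh(a, b, dt):
    """Zero-order-hold discretization of a diagonal SSM h'(t) = a*h(t) + b*x(t).

    `a` and `b` are lists holding the diagonal of A and the entries of B.
    Every entry of `a` must be nonzero (negative values give a stable system).
    """
    a_bar = [math.exp(dt * ai) for ai in a]
    b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, b)]
    return a_bar, b_bar


def run_recurrent(a_bar, b_bar, c, xs):
    """RNN-style view: h_t = a_bar*h_{t-1} + b_bar*x_t, y_t = c . h_t."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        h = [ab * hi + bb * x for ab, bb, hi in zip(a_bar, b_bar, h)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


def run_convolutional(a_bar, b_bar, c, xs):
    """CNN-style view: y = K * x with kernel K_k = c . (a_bar**k * b_bar)."""
    n = len(xs)
    kernel = [
        sum(ci * ab ** k * bb for ci, ab, bb in zip(c, a_bar, b_bar))
        for k in range(n)
    ]
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1)) for t in range(n)]


if __name__ == "__main__":
    # Hypothetical two-state system and a short input sequence.
    a_bar, b_bar = discretize_zoh(a=[-1.0, -0.5], b=[1.0, 1.0], dt=0.1)
    c = [0.5, 2.0]
    xs = [1.0, 0.0, -1.0, 2.0]
    print(run_recurrent(a_bar, b_bar, c, xs))
    print(run_convolutional(a_bar, b_bar, c, xs))
```

The two functions compute the same outputs up to floating-point error, which is the parallel to recurrent and convolutional networks that the summary mentions: the recurrence suits step-by-step generation, while the convolutional form can process a whole sequence at once.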