Matrix Multiplication on Blackwell

This article introduces a series on optimizing matrix multiplication for NVIDIA's Blackwell architecture. It explains why efficient matrix math matters for AI workloads and outlines the hardware changes in Blackwell that enable faster computation than previous architectures.

Background

NVIDIA's "Blackwell" (announced March 2024, shipping in late 2024) is the architecture powering its next-generation data-center GPUs, such as the B200. It succeeds the "Hopper" architecture (H100 GPU), which has dominated AI training and inference for the past two years. This article is part 1 of a technical series on implementing matrix multiplication, the fundamental math operation behind neural networks, efficiently on Blackwell hardware.

Blackwell is not merely a faster H100. It introduces a new Tensor Core design that supports lower-precision data types (such as FP4 and FP6) and changes how computation is scheduled (e.g., asynchronous thread groups). These low-level details matter because matrix multiplication performance is a dominant factor in how fast large language models (GPT, Claude, Gemini, etc.) can run and how much they cost to serve. Companies like Meta, Microsoft, and OpenAI are already ordering Blackwell GPUs in volume.
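For readers who want the operation itself in front of them, here is a minimal reference sketch in plain Python. It is not taken from the series and says nothing about how Blackwell executes the work; the function names `matmul` and `matmul_flops` are illustrative assumptions. It shows only the arithmetic that a GPU kernel must reproduce and the usual way of counting its cost (one multiply and one add per inner-loop step).

```python
def matmul(a, b):
    """Naive reference: C = A @ B for row-major lists of lists.

    A is M x K, B is K x N, and the result C is M x N.
    """
    m, k = len(a), len(a[0])
    k_b, n = len(b), len(b[0])
    if k != k_b:
        raise ValueError("inner dimensions must match")

    c = [[0.0] * n for _ in range(m)]
    for i in range(m):          # row of A and C
        for j in range(n):      # column of B and C
            acc = 0.0
            for p in range(k):  # reduction over the shared dimension
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def matmul_flops(m, n, k):
    """Floating-point operations for an M x K by K x N product.

    Each of the M * N outputs needs K multiplies and K adds.
    """
    return 2 * m * n * k


if __name__ == "__main__":
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(matmul_flops(4096, 4096, 4096))
```

The operation count grows with the product of all three dimensions, which is why the throughput of this one operation weighs so heavily on the speed and cost of large models.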