From Hacker News

Why LLM decode is memory-bound, not compute-bound

The article explains that LLM inference is primarily memory-bound rather than compute-bound because decoding is autoregressive. Each decode step generates only one token per sequence, yet it must read all of the model's parameters from memory, so memory bandwidth, not arithmetic throughput, becomes the key bottleneck. Training, by contrast, is compute-bound because it processes many tokens in a batch at once, reusing each loaded weight across all of them.
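
The argument can be made concrete with a rough roofline-style estimate. The sketch below is not from the article: it assumes about 2 FLOPs per parameter per token, 2-byte (16-bit) weights, and that each weight is read from memory once per forward pass, and it ignores KV-cache and activation traffic. The hardware figures (300 TFLOP/s peak, 2 TB/s bandwidth) and the 7-billion-parameter model are hypothetical round numbers chosen for illustration, as are the function names.

```python
def arithmetic_intensity(batch_tokens: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs (multiply + add) per parameter per token, and every
    weight is read from memory once per pass regardless of batch size.
    The parameter count cancels out, so it does not appear here.
    """
    return 2.0 * batch_tokens / bytes_per_param


def is_memory_bound(batch_tokens: int, peak_flops: float, mem_bandwidth: float,
                    bytes_per_param: float = 2.0) -> bool:
    """True if the workload sits below the hardware's FLOPs-per-byte ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte it can load
    return arithmetic_intensity(batch_tokens, bytes_per_param) < ridge


def min_decode_seconds_per_token(n_params: float, mem_bandwidth: float,
                                 bytes_per_param: float = 2.0) -> float:
    """Lower bound on per-token latency from weight reads alone."""
    return n_params * bytes_per_param / mem_bandwidth


# Hypothetical accelerator, not a specific product.
PEAK_FLOPS = 300e12      # FLOP/s
MEM_BANDWIDTH = 2e12     # bytes/s

decode_bound = is_memory_bound(1, PEAK_FLOPS, MEM_BANDWIDTH)       # one token per step
training_bound = is_memory_bound(4096, PEAK_FLOPS, MEM_BANDWIDTH)  # many tokens per pass
latency_floor = min_decode_seconds_per_token(7e9, MEM_BANDWIDTH)   # hypothetical 7B model

print(f"decode memory-bound: {decode_bound}")
print(f"training memory-bound: {training_bound}")
print(f"decode latency floor: {latency_floor * 1e3:.1f} ms/token")
```

Under these assumptions, single-token decode performs about 1 FLOP per byte loaded, far below the 150 FLOPs per byte the hypothetical chip could sustain, while a 4096-token pass lands well above it. The latency floor also shows why faster arithmetic alone would not speed up decode: the bound depends only on model size and memory bandwidth.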
