Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors
This article examines why language models hallucinate and attributes the behavior to a conflict between learned priors and reasoning. In its experiments, models often default to strong prior associations instead of reasoning about the novel input, which produces fabricated output.
Background
The "reasoning against priors" theory of hallucination in large language models (LLMs) starts from one observation: an LLM has no separate "truth" module. It predicts the most likely next token from patterns learned during training, which act as the model's "priors." When a model is prompted to perform a reasoning task, such as solving a math problem, it must override its default priors (for example, what it "knows" about 2+2=4) to follow the specific chain of thought the task requires. Hallucinations occur when the model fails to suppress its prior knowledge and instead generates fluent but incorrect text that matches what the prompt seems to "want." The article tests this by inserting deliberate typos or contradictions into prompts and observing where the model reverts to its priors and where it follows the reasoning path.
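One way to picture the contradiction test is the minimal Python sketch below. It is an illustration under stated assumptions, not the article's actual harness: the function names, the prompt wording, and the three answer labels are all hypothetical, and the "model" is any callable that maps a prompt string to an answer string. Here it is replaced by a stub so the example runs without an LLM.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical helpers for illustration only; none of these names come from the article.

def make_counterfactual_prompt(a: int, b: int, stated_sum: int) -> str:
    """Build a prompt whose stated premise contradicts ordinary arithmetic."""
    return (
        f"In this puzzle, assume that {a} + {b} = {stated_sum}. "
        f"Under that assumption, what is ({a} + {b}) + 1? "
        "Answer with a single integer."
    )

def classify_answer(answer: str, a: int, b: int, stated_sum: int) -> str:
    """Label an answer as following the prompt, reverting to the prior, or neither.

    Assumes stated_sum != a + b; otherwise the two cases are indistinguishable.
    """
    try:
        value = int(answer.strip())
    except ValueError:
        return "other"
    if value == stated_sum + 1:
        return "followed_prompt"    # reasoned from the stated premise
    if value == a + b + 1:
        return "reverted_to_prior"  # fell back on ordinary arithmetic
    return "other"

def run_probe(
    model: Callable[[str], str],
    cases: Iterable[Tuple[int, int, int]],
) -> Dict[str, int]:
    """Count how often the model follows the premise versus its prior."""
    counts = {"followed_prompt": 0, "reverted_to_prior": 0, "other": 0}
    for a, b, stated_sum in cases:
        prompt = make_counterfactual_prompt(a, b, stated_sum)
        counts[classify_answer(model(prompt), a, b, stated_sum)] += 1
    return counts

def prior_only_stub(prompt: str) -> str:
    """Stand-in for a model that ignores the premise and always uses 2 + 2 = 4."""
    return "5"

if __name__ == "__main__":
    print(run_probe(prior_only_stub, [(2, 2, 5), (2, 2, 7)]))
```

To probe a real model, `prior_only_stub` would be swapped for a function that sends the prompt to that model and returns its text reply. A high share of `reverted_to_prior` labels would be the pattern the theory predicts.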