Explaining Attention with Program Synthesis
Researchers propose using program synthesis to generate interpretable explanations for attention mechanisms in transformer models. The approach addresses the difficulty of understanding how these models make decisions by producing human-readable programs that replicate attention patterns.
Background
This paper introduces a method for explaining how attention mechanisms (a core component of large language models such as those behind ChatGPT) work, by converting their behavior into short, human-readable computer programs.
- Attention mechanisms decide which parts of an input (e.g. words in a sentence) the model should "focus on" when making a prediction. They are powerful but often opaque: it is hard to tell why the model attends to certain words.
- "Program synthesis" is a technique that automatically generates programs satisfying a specification, often given as observed input-output examples. Here it's used to produce a short program that mimics the attention pattern, making the reasoning process explicit and inspectable.
- Current interpretability methods for attention (like attention heatmaps or probing classifiers) show correlation but don't reveal the underlying rule the model is following. This approach aims to produce a precise, symbolic explanation instead.
- The work sits at the intersection of mechanistic interpretability (reverse-engineering neural networks) and program synthesis, and targets a longstanding tension: models that become more capable also become harder to explain.
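To make the idea of "a short program that mimics the attention pattern" concrete, here is a minimal sketch of example-driven synthesis by enumeration. It is an illustration only, not the paper's method: the summary does not describe the actual program language or search procedure, so the four candidate rules, the `score` and `synthesize` helpers, and the toy attention data below are all hypothetical. Each candidate rule predicts which position a head attends to from each query position, and the search keeps the rule that agrees most often with the observed most-attended positions.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A candidate "program" maps (tokens, query position) to the position it
# predicts the head attends to, or None when the rule does not apply.
# These rules form a hypothetical toy language, chosen for illustration.
Program = Callable[[List[str], int], Optional[int]]


def attend_self(tokens: List[str], i: int) -> Optional[int]:
    return i


def attend_previous(tokens: List[str], i: int) -> Optional[int]:
    return i - 1 if i > 0 else None


def attend_first(tokens: List[str], i: int) -> Optional[int]:
    return 0


def attend_last_same_token(tokens: List[str], i: int) -> Optional[int]:
    # Most recent earlier position holding the same token, if any.
    for j in range(i - 1, -1, -1):
        if tokens[j] == tokens[i]:
            return j
    return None


CANDIDATES: Dict[str, Program] = {
    "attend_self": attend_self,
    "attend_previous": attend_previous,
    "attend_first": attend_first,
    "attend_last_same_token": attend_last_same_token,
}

# One example pairs a token sequence with the observed most-attended
# position for each query position.
Example = Tuple[List[str], List[int]]


def score(program: Program, tokens: List[str], observed: List[int]) -> float:
    """Fraction of query positions where the program matches the observation."""
    if not observed:
        return 0.0
    hits = sum(1 for i, target in enumerate(observed) if program(tokens, i) == target)
    return hits / len(observed)


def synthesize(examples: List[Example]) -> Tuple[str, float]:
    """Return the candidate rule with the highest mean agreement."""
    best_name, best_score = "", -1.0
    for name, program in CANDIDATES.items():
        mean = sum(score(program, toks, obs) for toks, obs in examples) / len(examples)
        if mean > best_score:
            best_name, best_score = name, mean
    return best_name, best_score


# Toy data (made up): a head that looks at the previous token, with the
# first position attending to itself because it has no predecessor.
examples: List[Example] = [
    (["the", "cat", "saw", "the", "dog"], [0, 0, 1, 2, 3]),
]

best_name, best_score = synthesize(examples)
print(best_name, round(best_score, 2))  # prints: attend_previous 0.8
```

The returned rule name is the "explanation": unlike a heatmap, it states a rule that can be checked on new inputs, and the agreement score shows how faithfully the rule reproduces the observed pattern. A realistic system would search a much larger space of composed programs rather than a fixed list of four.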