Explaining Attention with Program Synthesis
A new approach called Programmatic Attention Explanation (PAE) uses program synthesis to generate interpretable programs that replicate a neural network's attention patterns, offering explanations that are both precise and human-readable.
Background
- This paper proposes a novel approach to explaining attention mechanisms in transformer models (the AI architecture behind systems like ChatGPT) by translating attention patterns into human-readable programs.
- Attention mechanisms are a core component of modern AI, but it has been difficult to determine what individual attention heads actually "pay attention to," and they are often described as inscrutable "black boxes."
- The authors use program synthesis, a technique where a computer automatically generates simple programs that replicate the behavior of more complex systems, to produce readable explanations of what each attention head computes.
- This work aims to improve interpretability of AI models, which matters for safety, debugging, and trust. It connects the fields of mechanistic interpretability (reverse-engineering AI internals) and program synthesis.
- The paper was released on arXiv, a preprint server, in June 2025 or later. The date is inferred from the arXiv identifier, which encodes the submission year and month in YYMM format.
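
To make the program-synthesis idea in the list above concrete, here is a minimal Python sketch of one simple form of it: enumerating a handful of candidate programs and keeping the one that best reproduces an observed attention pattern. This is not the paper's PAE method. The candidate rules, the function names (`attend_previous`, `synthesize`, and so on), and the simplification of attention to a single "most attended" position per token are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A candidate "program" maps (tokens, query index) to the key index attended to.
# Hypothetical simplification: each query attends to exactly one position.
Program = Callable[[List[str], int], int]


def attend_first(tokens: List[str], i: int) -> int:
    """Always attend to the first token."""
    return 0


def attend_self(tokens: List[str], i: int) -> int:
    """Attend to the current token."""
    return i


def attend_previous(tokens: List[str], i: int) -> int:
    """Attend to the token immediately before (position 0 attends to itself)."""
    return max(i - 1, 0)


def attend_prev_same(tokens: List[str], i: int) -> int:
    """Attend to the most recent earlier occurrence of the same token, else 0."""
    for j in range(i - 1, -1, -1):
        if tokens[j] == tokens[i]:
            return j
    return 0


# The (hypothetical) search space: a tiny, hand-written set of programs.
CANDIDATES: Dict[str, Program] = {
    "attend_first": attend_first,
    "attend_self": attend_self,
    "attend_previous": attend_previous,
    "attend_prev_same": attend_prev_same,
}


def accuracy(program: Program, tokens: List[str], observed: List[int]) -> float:
    """Fraction of query positions where the program matches the observed pattern."""
    hits = sum(program(tokens, i) == observed[i] for i in range(len(tokens)))
    return hits / len(tokens)


def synthesize(tokens: List[str], observed: List[int]) -> Tuple[str, float]:
    """Return the name and score of the candidate that best fits the observations."""
    best = max(CANDIDATES, key=lambda name: accuracy(CANDIDATES[name], tokens, observed))
    return best, accuracy(CANDIDATES[best], tokens, observed)


if __name__ == "__main__":
    tokens = ["the", "cat", "saw", "the", "cat"]
    observed = [0, 0, 1, 2, 3]  # made-up pattern: each token attends to its predecessor
    print(synthesize(tokens, observed))  # ('attend_previous', 1.0)
```

The returned name doubles as the human-readable explanation: a head whose pattern is matched by `attend_previous` can be described as "looks at the preceding token." A real synthesizer would search a much larger space of composed programs and would have to handle soft attention weights instead of a single index per token.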