Interpretability research for large language models faces a combinatorial problem: the number of possible interactions among input features, training data points, or internal model components grows exponentially with the number of elements considered, so exhaustive analysis is computationally infeasible. A blog post from Berkeley AI Research describes SPEX and ProxySPEX, two algorithms that make interaction discovery tractable by exploiting structural properties of how real models behave.
The post frames the problem across three parallel interpretability approaches: feature attribution (which input segments drive a prediction), data attribution (which training examples influence behavior), and mechanistic interpretability (which internal components are responsible). In each case, the fundamental challenge is the same: isolating the drivers of complex behavior requires systematically perturbing the system, and each perturbation (called an ablation) is expensive.
Attribution through ablation
The post defines a unified framework based on ablation. For feature attribution, you mask or remove specific segments of an input prompt and measure how the model’s output shifts. For data attribution, you retrain the model on different subsets of the training data and observe how test predictions change. For mechanistic interpretability, you intervene on the model’s forward pass by removing or bypassing specific internal components.
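The feature-attribution case can be sketched in a few lines. This is a minimal illustration, not code from the post: `toy_model`, `ablate`, `ablation_effect`, and the `[MASK]` placeholder are all hypothetical names, and the toy scoring function stands in for an expensive model call.

```python
from typing import Callable, Sequence


def ablate(tokens: Sequence[str], keep: Sequence[bool], mask_token: str = "[MASK]") -> list[str]:
    """Replace every token whose keep flag is False with a mask token."""
    return [tok if k else mask_token for tok, k in zip(tokens, keep)]


def toy_model(tokens: Sequence[str]) -> float:
    """Hypothetical stand-in for an expensive model call: a toy sentiment score."""
    score = 0.0
    if "good" in tokens:
        score += 1.0
    if "not" in tokens and "good" in tokens:
        score -= 2.0  # the negation flips the sentiment
    return score


def ablation_effect(model: Callable[[Sequence[str]], float], tokens: Sequence[str], index: int) -> float:
    """Change in output caused by masking a single token (full minus ablated)."""
    keep = [i != index for i in range(len(tokens))]
    return model(list(tokens)) - model(ablate(tokens, keep))


tokens = ["the", "film", "was", "not", "good"]
effects = {tok: ablation_effect(toy_model, tokens, i) for i, tok in enumerate(tokens)}
print(effects)
```

Each entry in `effects` costs one extra model call, which is the expense the framework has to budget for.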
The post emphasizes that “model behavior is rarely the result of isolated components; rather, it emerges from complex dependencies and patterns.” Single-feature attribution is often insufficient because the real driver of a prediction can be an interaction between features, such as a double negative that flips sentiment or an answer that requires synthesizing multiple documents in a retrieval-augmented generation task.
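One standard way to make "interaction" concrete for a pair of features is an inclusion-exclusion difference over four ablations. The sketch below assumes a hypothetical value function `toy_value` that returns the model output when only a given set of features is kept; the post does not prescribe this particular formula.

```python
from typing import Callable, FrozenSet


def toy_value(kept: FrozenSet[int]) -> float:
    """Hypothetical model output when only the features in `kept` are present.

    Features 3 and 4 stand in for the words "not" and "good".
    """
    score = 1.0 if 4 in kept else 0.0
    if 3 in kept and 4 in kept:
        score -= 2.0
    return score


def pair_interaction(value: Callable[[FrozenSet[int]], float], i: int, j: int, n: int) -> float:
    """Joint effect of features i and j beyond the sum of their separate effects."""
    everything = frozenset(range(n))
    return (
        value(everything)
        - value(everything - {i})
        - value(everything - {j})
        + value(everything - {i, j})
    )


print(pair_interaction(toy_value, 3, 4, n=5))  # -2.0: "not" and "good" interact
print(pair_interaction(toy_value, 0, 4, n=5))  # 0.0: feature 0 is independent of "good"
```

A nonzero result means the pair cannot be explained by adding up single-feature effects, which is the situation the post describes.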
The post notes that each ablation carries significant cost: “expensive inference calls or retrainings.”
SPEX: spectral interaction discovery
SPEX (Spectral Explainer) is built on two structural observations about real models. The first is sparsity: “relatively few interactions truly drive the output.” The second is low-degreeness: “influential interactions typically involve only a small subset of features.” These properties transform an otherwise intractable exponential search into a sparse recovery problem.
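A quick count shows why the low-degree property matters. The helper names below are illustrative, and the choice of 100 features with interactions of at most three features is an arbitrary example, not a figure from the post.

```python
from math import comb


def total_interactions(n: int) -> int:
    """Number of feature subsets (candidate interactions) among n features."""
    return 2 ** n


def low_degree_interactions(n: int, max_degree: int) -> int:
    """Number of subsets that involve at most `max_degree` features."""
    return sum(comb(n, k) for k in range(max_degree + 1))


print(total_interactions(100))          # 1267650600228229401496703205376
print(low_degree_interactions(100, 3))  # 166751
```

Sparsity then says that only a handful of those remaining candidates carry meaningful weight.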
The algorithm draws on tools from signal processing and coding theory. Rather than testing interactions one at a time, SPEX uses strategically selected ablations that combine many candidate interactions together. Efficient decoding algorithms then disentangle these combined signals to isolate the specific interactions responsible for the observed behavior. The post describes this as advancing “interaction discovery to scales orders of magnitude greater than prior methods.”
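The idea of recovering a sparse set of interactions from ablations that each mix many candidates can be illustrated with a much simpler estimator than the one SPEX uses. The sketch below swaps in plain Monte Carlo correlation against parity functions over random masks; it is not the SPEX decoding algorithm, and `toy_model`, `parity`, `estimate_interactions`, and all numeric settings are assumptions made for illustration.

```python
import itertools
import random


def parity(mask, subset):
    """+1 if an even number of features in `subset` are ablated, else -1."""
    ablated = sum(1 for i in subset if not mask[i])
    return -1.0 if ablated % 2 else 1.0


def toy_model(mask):
    """Hypothetical black box that is sparse and low-degree by construction."""
    return 2.0 * parity(mask, (0, 1)) - 1.5 * parity(mask, (3,))


def estimate_interactions(model, n, max_degree, num_samples, threshold, seed=0):
    """Estimate low-degree coefficients from random ablations and keep the large ones."""
    rng = random.Random(seed)
    # Each random mask keeps or ablates every feature with probability 1/2,
    # so a single model call carries signal about many interactions at once.
    masks = [[rng.random() < 0.5 for _ in range(n)] for _ in range(num_samples)]
    outputs = [model(m) for m in masks]
    found = {}
    for degree in range(max_degree + 1):
        for subset in itertools.combinations(range(n), degree):
            coeff = sum(y * parity(m, subset) for m, y in zip(masks, outputs)) / num_samples
            if abs(coeff) > threshold:
                found[subset] = coeff
    return found


found = estimate_interactions(toy_model, n=16, max_degree=2, num_samples=1000, threshold=0.75)
print(sorted(found))
```

Here 1,000 random ablations out of 65,536 possible masks are enough to separate the two planted interactions from the 135 null candidates. This naive estimator still loops over every low-degree candidate, which is the step that efficient decoding is meant to avoid.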
For feature attribution, the post reports that SPEX “matches the high faithfulness of existing interaction techniques (Faith-Shap, Faith-Banzhaf) on short inputs” and “uniquely retains this performance” on longer inputs where prior methods degrade. Faithfulness here measures how accurately recovered attributions predict the model’s output on unseen test ablations.
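One common way to score this kind of faithfulness is an R²-style measure on held-out ablations. The sketch assumes that form of the metric; the post's summary above does not spell out the exact formula, and the function name is hypothetical.

```python
from typing import Sequence


def faithfulness(model_outputs: Sequence[float], surrogate_outputs: Sequence[float]) -> float:
    """R^2 of surrogate predictions against true model outputs on held-out ablations."""
    if len(model_outputs) != len(surrogate_outputs):
        raise ValueError("both sequences must cover the same test ablations")
    mean = sum(model_outputs) / len(model_outputs)
    residual = sum((y - p) ** 2 for y, p in zip(model_outputs, surrogate_outputs))
    total = sum((y - mean) ** 2 for y in model_outputs)
    if total == 0:
        raise ValueError("model outputs are constant; R^2 is undefined")
    return 1.0 - residual / total


print(faithfulness([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0: perfect reconstruction
print(faithfulness([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0: no better than the mean
```

A score near 1 means the recovered attributions reproduce the model's behavior on masks it was not fitted to.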
ProxySPEX: hierarchy cuts ablations by 10x
ProxySPEX adds a third structural observation: hierarchy. In the post’s words, “where a higher-order interaction is important, its lower-order subsets are likely to be important as well.” The post presents this as an empirical property of complex machine learning models, not an assumption built into the model architecture.
Exploiting hierarchy means ProxySPEX can focus ablation budget on interactions whose lower-order components are already known to be important, rather than exhaustively testing all possible combinations. The post reports that ProxySPEX “matches the performance of SPEX with around 10x fewer ablations.”
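The effect of hierarchy on the search space can be shown with a small candidate-generation routine. This is an illustration of the property only, under the strict assumption that every lower-order subset must already be flagged as important; it is not ProxySPEX's actual procedure, and `grow_candidates` is a hypothetical helper.

```python
from itertools import combinations


def grow_candidates(important, n):
    """Propose interactions one order higher whose lower-order subsets are all important."""
    important = {frozenset(s) for s in important}
    candidates = set()
    for s in important:
        for i in range(n):
            if i in s:
                continue
            bigger = s | {i}
            # Keep the candidate only if every subset one order lower is already important.
            if all(frozenset(sub) in important for sub in combinations(bigger, len(s))):
                candidates.add(bigger)
    return candidates


singles = [(0,), (1,), (2,)]
pairs = grow_candidates(singles, n=6)
print(len(pairs))  # 3 candidate pairs instead of all 15 pairs among 6 features
```

Because the post describes hierarchy as a tendency rather than a guarantee, a hard filter like this one could miss a genuine high-order interaction whose parts look unimportant on their own.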
Applications across interpretability domains
The post describes applications in all three interpretability domains. For feature attribution, SPEX can identify not just that individual words matter but that specific word combinations drive the prediction. In one example from the post, a test on a modified trolley problem revealed “a dominant high-order synergy” involving four specific words. For retrieval-augmented generation tasks, the post illustrates that the necessary interaction might span multiple retrieved documents.
For data attribution, the framework identifies which training examples, in combination, influence model behavior on a test point. For mechanistic interpretability, it identifies which internal components, such as attention heads and MLP layers, interact to produce specific outputs. The post notes that ProxySPEX-informed pruning “can actually improve model performance on the target task.”
The post frames interpretability not as an academic exercise but as a practical step toward “safer and more trustworthy AI,” enabling both model builders and affected humans to understand how decisions are made. The underlying papers were published at ICML 2025 and NeurIPS 2025.