ARC publishes overview of its research programme connecting interpretability to AI alignment

The Alignment Research Center has published a post mapping its research programme, tracing how published work on heuristic explanations, mechanism distinction, and alignment robustness fits together as a unified response to the problem of AI misalignment.

The post, which thanks Ryan Greenblatt, Victor Lecomte, Eric Neyman, Jeff Wu, and Mark Xu for comments, accompanies an interactive diagram showing how ARC’s research problems relate, with arrows marking where solving one problem is expected to help with another. The post notes that not every arrow represents a conjunctive dependency, in which all subproblems must be solved; some represent disjunctive dependencies, in which solving any one subproblem is sufficient.
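
The difference between the two kinds of dependency can be made concrete with a small sketch. The Python below is illustrative only: the `Problem` class, the `is_solved` function, and the placeholder names `goal`, `A`, `B`, `B1` and `B2` are assumptions introduced here and do not correspond to nodes in ARC’s diagram.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Problem:
    """A node in a hypothetical research-dependency graph."""
    name: str
    mode: str = "all"  # "all" = conjunctive, "any" = disjunctive
    subproblems: List[str] = field(default_factory=list)


def is_solved(name: str, graph: Dict[str, Problem], solved: Set[str]) -> bool:
    """Return True if `name` counts as solved given the solved leaf problems."""
    node = graph[name]
    if not node.subproblems:  # leaf: solved only if listed explicitly
        return name in solved
    results = [is_solved(s, graph, solved) for s in node.subproblems]
    # Conjunctive nodes need every subproblem; disjunctive nodes need one.
    return all(results) if node.mode == "all" else any(results)


# Placeholder names only; these are not nodes from ARC's diagram.
GRAPH = {
    "goal": Problem("goal", "all", ["A", "B"]),
    "A": Problem("A"),
    "B": Problem("B", "any", ["B1", "B2"]),
    "B1": Problem("B1"),
    "B2": Problem("B2"),
}

if __name__ == "__main__":
    print(is_solved("goal", GRAPH, {"A", "B1"}))
    print(is_solved("goal", GRAPH, {"A"}))
```

In this toy graph, `goal` is conjunctive and so needs both `A` and `B`, while `B` is disjunctive and is satisfied by either `B1` or `B2`.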

ARC describes its research as organised around a problem it calls alignment robustness: ensuring that a model behaves as intended even when confronted with unusual or adversarial inputs. The post traces a path from mechanistic interpretability tools to that outcome via two intermediate steps, heuristic explanations and mechanism distinction.

The post covers ARC’s significant published work in chronological order, assigning each to a node in its research diagram. Several pieces are summarised:

“Eliciting latent knowledge” defines the ELK problem — how to determine whether a model’s outputs reflect what its internals represent — and surveys approaches that ARC has since moved away from.

“Formalizing the presumption of independence” lays out the problem of producing a formal notion of heuristic explanations, which the post describes as structures that account for a model’s high performance by reasoning from the model’s weights and architecture rather than from its outputs.

A cluster of late 2022 posts introduced mechanistic anomaly detection, described as ARC’s currently preferred approach to mechanism distinction — using heuristic explanations to identify when a model is executing a mechanism different from the one observed during training.

“Formal verification, heuristic explanations and surprise accounting” introduced the surprise-accounting framework for evaluating explanation quality, comparing ARC’s approach to formal verification for neural networks and to mechanistic interpretability research more broadly.

“Backdoors as an analogy for deceptive alignment,” and the associated paper, explored backdoors in ML models as an analogy for the problem of a model that behaves well during evaluation but pursues a different objective otherwise. The post maps that analogy onto ARC’s research subgraph.

“Low Probability Estimation in Language Models” and its associated paper describe an empirical study of estimating the probability of rare outputs in small transformer language models. The post notes that the method inspired by heuristic explanations outperformed naive sampling in that setting but did not outperform red-teaming, that is, searching for inputs that produce rare behaviours. Cases where red-teaming is theoretically insufficient are left open.
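
To see why naive sampling struggles with rare events, a toy example may help. The sketch below is not a reproduction of the study or of any method it compares: it swaps in a standard Gaussian tail probability as the rare event and textbook importance sampling as the alternative estimator, and the function names, threshold and sample count are all assumptions chosen for illustration.

```python
import math
import random


def true_tail(t: float) -> float:
    """Exact P(X > t) for a standard normal X, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2))


def naive_estimate(t: float, n: int, rng: random.Random) -> float:
    """Draw n samples from N(0, 1) and count how often the rare event occurs."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > t)
    return hits / n


def importance_estimate(t: float, n: int, rng: random.Random) -> float:
    """Sample from a proposal N(t, 1) centred on the rare region, then reweight.

    The weight is the density ratio p(x) / q(x) = exp(-t * x + t**2 / 2).
    """
    total = 0.0
    for _ in range(n):
        x = rng.gauss(t, 1.0)
        if x > t:
            total += math.exp(-t * x + 0.5 * t * t)
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    t, n = 4.0, 20_000      # illustrative threshold and sample budget
    print("exact:     ", true_tail(t))
    print("naive:     ", naive_estimate(t, n, rng))
    print("importance:", importance_estimate(t, n, rng))
```

With a sample budget this small relative to the event’s probability, the naive estimator can only return multiples of 1/n, so it is usually either zero or a substantial overestimate. The reweighted estimator concentrates its samples where the event actually occurs.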

The post also catalogues active subproblems within ARC’s heuristic-explanations agenda. These include measuring explanation quality via surprise accounting; capacity allocation, or directing explanation effort toward behaviours with potentially catastrophic consequences; avoiding cherry-picking in how explanations are found; the form of representation that explanations should take; formal desiderata; a no-coincidence principle establishing that all possible behaviours are amenable to explanation; and handling empirical regularities, such as model weights tuned to match observed data distributions, for which no clean structural explanation exists.

The post was published alongside related ARC posts on the matching sampling principle and the computational no-coincidence conjecture, which address two of the subproblems named in this overview.