ARC releases AlgZoo, a benchmark of small neural networks that mechanistic interpretability has not fully explained

The Alignment Research Center has released AlgZoo, a collection of small neural networks — ranging from 8 to 1,408 parameters — trained on algorithmic tasks, which ARC is publishing as a set of test cases for mechanistic interpretability research.

The release is authored by Zihao Chen, George Robinson, David Matolcsi, Jacob Stavrianos, Jiawei Li, and Michael Sklar, with contributions from others at and affiliated with ARC. It is framed as a challenge to what the researchers describe as the mechanistic interpretability field’s underinvestment in fully understanding slightly complex models, as opposed to partially understanding very large ones.

According to the ARC post, the largest model the team believes it more or less fully understands has 32 parameters. The next model the team has put substantial effort into — a 432-parameter RNN — has not been fully interpreted. “The models are RNNs and transformers trained to perform algorithmic tasks, and range in size from 8 to 1,408 parameters,” the post states. The models are available publicly through the ARC website.

The research team argues that reaching a point where fully understanding models with a few hundred parameters is straightforward is a prerequisite for eventually being able to interpret multi-billion-parameter language models. AlgZoo is presented as a resource to spur that work, or to help the field “reckon with the magnitude of the challenge we face,” in the post’s phrasing.

ARC evaluates the quality of an interpretation through what it calls a mechanistic estimate: a deductive account of why a model achieves the accuracy it does, derived from the model’s internal structure rather than from sampling its outputs. The post distinguishes two metrics for assessing such estimates. The first is mean squared error versus compute: how close the mechanistic estimate gets to the model’s actual accuracy as the computational budget increases. The second is surprise accounting, an information-theoretic measure of how much remains unexplained after the mechanistic estimate is applied. ARC’s current working definition of “full understanding” is a mechanistic estimate that leaves no more bits of total surprise than the number of bits of optimization used to train the model.
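The sampling baseline that a mechanistic estimate is compared against can be sketched in a few lines of Python. This is an illustration rather than ARC’s code: the function names and the toy correctness check are hypothetical, and the only fact relied on is the standard result that averaging k independent pass/fail checks with true accuracy p has expected squared error p(1 - p)/k.

```python
import random


def sampling_estimate(is_correct, draw_input, k, rng):
    """Average k independent correctness checks to estimate accuracy."""
    hits = sum(1 for _ in range(k) if is_correct(draw_input(rng)))
    return hits / k


def sampling_mse(p, k):
    """Expected squared error of the k-sample estimate when true accuracy is p."""
    return p * (1.0 - p) / k


# Toy stand-in for "run the model on a random input and check the answer":
# the hypothetical model is correct whenever the uniform draw falls below 0.9.
rng = random.Random(0)
estimate = sampling_estimate(lambda u: u < 0.9, lambda r: r.random(), 10_000, rng)
print(estimate)                   # close to 0.9
print(sampling_mse(0.9, 10_000))  # about 9e-06
```

Under this reading, doubling the number of samples halves the expected squared error, and a mechanistic estimate would have to reach comparable error for comparable compute in order to match sampling.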

The post walks through three case studies from the second-max family, whose models are trained to find the position of the second-largest number in a sequence. These are M(2,2), a 10-parameter model with near-perfect accuracy; M(4,3), the 32-parameter model that the team has fully reverse-engineered and counts as the largest it more or less completely understands; and M(16,10), a 432-parameter model for which, ARC says, a mechanistic estimate that matched random sampling under the mean-squared-error metric would represent “a major research breakthrough.”
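For readers unfamiliar with the task, the target that a second-max model is trained to predict can be written as a short Python function. The name `second_argmax` and the zero-based indexing are assumptions made for illustration; the article does not say how positions are numbered.

```python
def second_argmax(xs):
    """Return the index of the second-largest value in xs.

    Ties keep their original order; with continuous Gaussian inputs
    exact ties occur with probability zero.
    """
    if len(xs) < 2:
        raise ValueError("need at least two values")
    # Indices sorted from largest value to smallest (sorted() is stable).
    order = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)
    return order[1]


print(second_argmax([0.3, -1.2, 2.5, 0.9]))  # 3, since 0.9 is second to 2.5
```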

For M(16,10), the post states that current interpretability methods cannot yet match the performance of random sampling, let alone achieve full understanding under surprise accounting.

ARC describes the AlgZoo models as composed of input-to-hidden, hidden-to-hidden, and hidden-to-output parameter matrices in a standard ReLU RNN architecture. The models are parameterized by hidden size and sequence length, and trained using softmax cross-entropy loss on standard Gaussian input sequences.
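A minimal pure-Python sketch of such a network follows. Several details are assumptions rather than facts from the post: a scalar input at each step, no bias terms, a zero initial hidden state, and logits read from the final hidden state. The first two were chosen because the resulting count, hidden + hidden² + hidden × sequence length, reproduces the 10, 32, and 432 parameters quoted for M(2,2), M(4,3), and M(16,10) if the first index is read as hidden size. All function names are hypothetical, and the weights are random placeholders, not trained AlgZoo weights.

```python
import math
import random


def param_count(hidden, seq_len):
    """Parameters in the three matrices, assuming scalar inputs and no biases."""
    return hidden + hidden * hidden + hidden * seq_len


def init_params(hidden, seq_len, rng):
    """Random placeholder weights; a real model would load trained values."""
    w_ih = [rng.gauss(0, 1) for _ in range(hidden)]  # input-to-hidden
    w_hh = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(hidden)]
    w_ho = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(seq_len)]
    return w_ih, w_hh, w_ho


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def forward(params, xs):
    """Run the ReLU RNN over xs and return one logit per sequence position."""
    w_ih, w_hh, w_ho = params
    h = [0.0] * len(w_ih)  # assumed zero initial state
    for x in xs:
        pre = [r + w * x for r, w in zip(matvec(w_hh, h), w_ih)]
        h = [max(0.0, p) for p in pre]  # ReLU
    return matvec(w_ho, h)


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


print(param_count(2, 2), param_count(4, 3), param_count(16, 10))  # 10 32 432

rng = random.Random(0)
params = init_params(4, 3, rng)
xs = [rng.gauss(0, 1) for _ in range(3)]  # standard Gaussian input sequence
logits = forward(params, xs)
print(len(logits))  # 3
```

Training would minimize `cross_entropy(logits, target)` averaged over many sampled sequences, with `target` set to the position of the second-largest input.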

The post’s conclusion positions AlgZoo not as a solved benchmark but as an open research problem. The team invites other researchers either to make progress toward full interpretability of the included models or to show how large the gap between current methods and that goal actually is.