Top logits can leak task-irrelevant image information as readily as full residual stream projections

When a model produces outputs, it exposes more than just its final answer. The scores it assigns to candidate tokens (the logits, which a softmax turns into a ranked probability distribution) contain information about the model’s internal state. Prior work has shown that probing model internals can reveal information not apparent from generations. A paper from Apple ML Research on logit information leakage systematically compares how much information is retained at different representational levels as it passes through two natural bottlenecks between the model’s rich internal representation and its output.

The paper is authored by Masha Fedzechkina, Eleonora Gualdoni, Rita Ramos, and Sinead Williamson.

The information cascade

A transformer’s residual stream encodes rich, high-dimensional information about the input. As computation progresses toward the output, this information passes through two compression points. The first is a low-dimensional projection of the residual stream, obtained using the tuned lens, a technique for examining what information the model holds at intermediate layers. The second is the set of top-k logits most likely to influence the model’s answer.

These two bottlenecks represent points at which the full internal representation is compressed for different purposes. The tuned lens projection is a diagnostic tool; the top logits are what get exposed through the model’s API in practice. A model owner might reasonably assume that the top logits — designed to represent the model’s answer distribution — reveal only information relevant to the task.
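To make the second bottleneck concrete, here is a minimal sketch of how a vocabulary-sized logit vector is reduced to the handful of top-k log probabilities an API typically returns. It is an illustration only: the random weights, the sizes `d_model` and `vocab_size`, and the helper names `log_softmax` and `top_k_logprobs` are assumptions made for this example, not anything taken from the paper.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def top_k_logprobs(logits: np.ndarray, k: int):
    """Return (token_ids, logprobs) for the k highest-scoring tokens, best first."""
    logprobs = log_softmax(logits)
    # argpartition selects the k largest entries without sorting the whole vocabulary.
    idx = np.argpartition(logprobs, -k)[-k:]
    # Sort only those k entries, highest log probability first.
    idx = idx[np.argsort(logprobs[idx])[::-1]]
    return idx, logprobs[idx]

# Hypothetical stand-in for a real model: random weights, toy sizes.
rng = np.random.default_rng(0)
d_model, vocab_size = 64, 1000
hidden = rng.normal(size=d_model)                 # residual stream at the last position
unembed = rng.normal(size=(d_model, vocab_size))  # hypothetical unembedding matrix
logits = hidden @ unembed                         # one score per vocabulary token

token_ids, lp = top_k_logprobs(logits, k=5)
print(token_ids, lp)
```

Only the five pairs in `token_ids` and `lp` would reach the caller; the remaining scores and the hidden state itself stay inside the model. The question the paper asks is how much about the input survives in those few numbers.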

What logits actually leak

Using vision-language models as a testbed, the paper shows that this assumption is wrong. The top logit values can leak task-irrelevant information present in the image-based query — information the model owner did not intend to expose and that is not needed for the model’s answer.

In some cases, the paper finds, the top logits reveal as much information as direct projections of the full residual stream. Residual stream projections are generally understood to be information-rich and would not typically be exposed to model users. The practical concern the paper identifies for model deployment is that the top logits, which are routinely accessible through standard model APIs, can match that level of information disclosure.
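One common way to quantify this kind of leakage is to train a simple probe that tries to predict a task-irrelevant attribute from the exposed values alone. The sketch below does this on purely synthetic data. It is not the paper's experimental setup: the toy vocabulary, the way the hidden attribute `attr` nudges certain logits, and the helpers `top_k_features` and `fit_logistic` are all assumptions chosen to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, N = 50, 10, 2000   # toy vocabulary size, top-k cutoff, number of queries

# Synthetic stand-in for model outputs (an assumption, not real model data).
answer = rng.integers(0, 5, size=N)    # task-relevant answer token, ids 0..4
attr = rng.integers(0, 2, size=N)      # task-irrelevant binary attribute
logits = rng.normal(size=(N, V))
logits[np.arange(N), answer] += 4.0    # the answer dominates the distribution
# Assumption: the attribute nudges one of two disjoint groups of non-answer tokens.
group0, group1 = np.arange(10, 15), np.arange(20, 25)
logits[:, group0] += 2.0 * (attr == 0)[:, None]
logits[:, group1] += 2.0 * (attr == 1)[:, None]

def top_k_features(logits, k):
    """Keep logits above the k-th largest per row, measured relative to that cutoff."""
    kth = np.partition(logits, -k, axis=1)[:, -k][:, None]  # k-th largest per row
    # Differences between logits equal differences between log probabilities,
    # so these values are recoverable from top-k logprobs alone.
    return np.where(logits >= kth, logits - kth, 0.0)

def fit_logistic(X, y, lr=0.1, steps=500):
    """Plain full-batch gradient descent on the logistic loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

X = top_k_features(logits, K)
train, test = slice(0, 1500), slice(1500, N)
w, b = fit_logistic(X[train], attr[train].astype(float))
pred = (X[test] @ w + b) > 0
accuracy = (pred == attr[test].astype(bool)).mean()
print(f"held-out probe accuracy: {accuracy:.2f} (chance is about 0.50)")
```

A probe accuracy well above chance means the attribute can be read off the top-k values even though it plays no role in the answer. Running the same probe on a richer representation, such as a residual stream projection, would give the kind of level-by-level comparison the paper describes.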

Implications for model deployment

The finding has direct relevance for multi-modal model APIs that return logprob information. Many inference APIs expose top-k log probabilities as an optional output, sometimes used for downstream tasks like calibration, uncertainty estimation, or constrained decoding. The paper’s result suggests that when the input includes images, this logprob exposure may inadvertently reveal image-derived information that the model owner assumed was not accessible to the API caller.

The risk the paper names is “unintentional or malicious information leakage, where model users are able to learn information that the model owner assumed was inaccessible.” Both forms are relevant: the unintentional case covers legitimate users inadvertently gaining access to private information in a multi-tenant system, while the malicious case covers adversarial probing through the logprob interface.

Why vision-language models

The paper describes the vision-language model setting as suited to this measurement because images are encoded into rich representations that must be compressed as they flow toward text token prediction. The gap between image content and task-required output is where task-irrelevant information can persist in the logit distribution.

The paper describes the work as the first systematic measurement of how much of this leakage occurs across representational levels. Full experimental details and numerical results are in the paper (arXiv:2604.09885), published in April 2026.