Top logits can leak task-irrelevant image information as readily as full residual stream projections
Researchers at Apple ML Research show that the information available from a model's top-k logits can reveal image content that model owners expect to remain inaccessible, in some cases matching what is exposed by direct projections of the residual stream.
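To make the two access levels concrete, here is a minimal sketch, assuming the common setup in which logits are a linear projection of the final residual stream through an unembedding matrix. The names `project_residual`, `top_k_logits`, `w_u` and `h`, and the toy sizes, are hypothetical illustrations. This is not the researchers' experimental setup. The full projection returns one score per vocabulary entry, while a top-k interface returns only k (token id, score) pairs.

```python
from typing import List, Tuple


def project_residual(w_u: List[List[float]], h: List[float]) -> List[float]:
    """Full logit vector: one dot product per row of a hypothetical unembedding matrix w_u."""
    return [sum(w * x for w, x in zip(row, h)) for row in w_u]


def top_k_logits(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Keep only the k largest logits as (token_id, value) pairs, largest first."""
    ranked = sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy sizes for illustration only: vocabulary of 5 tokens, residual width 3.
    w_u = [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [1.0, 1.0, 0.0],
        [0.5, 0.5, 0.5],
    ]
    h = [0.2, 0.9, -0.4]  # hypothetical final-position residual stream vector

    full = project_residual(w_u, h)  # what a full residual projection exposes
    print(len(full), "logits:", full)
    print("top-2 view:", top_k_logits(full, 2))  # what a top-k interface exposes
```

In this toy view the top-k interface exposes strictly fewer numbers per query than the full projection. That is why the reported result is notable: a nominally narrower view can, in some cases, leak as much image content as the full projection.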