Generating plausible motion is central to visual intelligence systems, but existing approaches face a fundamental efficiency problem: exploring multiple possible futures through full video synthesis is prohibitively expensive. Apple ML Research’s work on long-term motion embeddings sidesteps this by operating directly in a compressed motion representation rather than in pixel space, achieving what the authors describe as orders-of-magnitude more efficient generation of long, realistic motions.
The paper is authored by Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Angel Bautista, Josh Susskind, and Björn Ommer, with the first three listed as equal contributors affiliated with CompVis at LMU Munich and the Munich Center for Machine Learning.
The core approach
The method works in two stages. First, a motion embedding is learned from large-scale trajectory data obtained from tracker models. The embedding uses a temporal compression factor of 64x, so 64 frames of raw trajectory data are compressed into a single embedding vector. This high compression ratio is what enables the efficiency gains: the generative model operates in a dramatically smaller space than either raw video or full kinematics representations.
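To make the compression concrete, here is a minimal sketch of what such an encoder could look like, assuming point tracks from an off-the-shelf tracker and a stack of strided temporal convolutions; the layer choices, names, and dimensions are hypothetical illustrations rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Hypothetical sketch: compress tracker trajectories 64x along time.

    Input:  (batch, T, num_points, 2) point tracks from an off-the-shelf tracker.
    Output: (batch, T // 64, latent_dim) motion latents.
    All names and sizes are illustrative, not taken from the paper.
    """
    def __init__(self, num_points=256, latent_dim=512):
        super().__init__()
        in_dim = num_points * 2  # flatten (x, y) per tracked point
        # Six strided temporal convolutions, each halving the sequence: 2^6 = 64x.
        layers, dim = [], in_dim
        for _ in range(6):
            layers += [nn.Conv1d(dim, latent_dim, kernel_size=4, stride=2, padding=1),
                       nn.GELU()]
            dim = latent_dim
        self.net = nn.Sequential(*layers)

    def forward(self, tracks):
        b, t, n, _ = tracks.shape
        x = tracks.reshape(b, t, n * 2).transpose(1, 2)  # (b, in_dim, T)
        z = self.net(x)                                   # (b, latent_dim, T // 64)
        return z.transpose(1, 2)                          # (b, T // 64, latent_dim)

# 64 frames of raw trajectories map to a single latent vector.
tracks = torch.randn(1, 64, 256, 2)
print(TrajectoryEncoder()(tracks).shape)  # torch.Size([1, 1, 512])
```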
Second, a conditional flow-matching model is trained to generate motion latents within this compressed space. The conditioning signals supported are text prompts (describing the intended motion in natural language) and spatial pokes (physical position inputs that specify where motion should occur in the scene). This dual conditioning gives the model two complementary interfaces: semantic control through language and spatial control through direct position specification.
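A minimal sketch of what one conditional flow-matching training step in this latent space could look like, assuming linear (rectified-flow-style) interpolation paths between noise and motion latents; `model` and the conditioning vector `cond` (for example pooled text embeddings or encoded pokes) are placeholders, not the paper's architecture.

```python
import torch

def flow_matching_loss(model, z1, cond):
    """One conditional flow-matching step (illustrative, not the paper's exact objective).

    z1:   (batch, seq, dim) target motion latents from the frozen embedding.
    cond: conditioning features, e.g. pooled text embeddings or encoded pokes.
    """
    z0 = torch.randn_like(z1)                            # noise sample
    t = torch.rand(z1.shape[0], 1, 1, device=z1.device)  # random time in [0, 1]
    zt = (1 - t) * z0 + t * z1                           # linear probability path
    target_velocity = z1 - z0                            # velocity along that path
    pred_velocity = model(zt, t.view(-1), cond)          # network predicts the velocity field
    return ((pred_velocity - target_velocity) ** 2).mean()
```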
Why not use video models
Modern video generation models have strong scene understanding and can produce visually coherent dynamics. The paper’s framing positions direct video synthesis as the wrong tool for exploring motion possibilities: it is computationally expensive, and most of the video computation is spent on appearance rather than the motion dynamics that are the actual object of interest.
By abstracting away appearance and working in a pure motion embedding space, the approach can generate many possible motion trajectories quickly. The paper reports that the resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches for motion generation. The comparison spans both categories — general video synthesis and dedicated motion models — which suggests the compressed embedding approach is not simply trading one kind of quality for another.
Efficiency as a design goal
The 64x temporal compression factor is the design choice that drives the efficiency claim. A model generating sequences in this embedding space works with sequences 64 times shorter than the underlying trajectories, which reduces both memory and computation for the generative model. Flow matching as the generative framework is well-suited to this setting: it learns to transform a simple distribution into the target motion distribution through a continuous-time process that can be sampled efficiently.
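Under the same assumptions as the training sketch above, sampling reduces to integrating the learned velocity field from noise to motion latents with an ODE solver; the plain Euler loop and step count below are illustrative, not the paper's sampler.

```python
import torch

@torch.no_grad()
def sample_motion_latents(model, cond, shape, steps=20, device="cpu"):
    """Integrate the learned ODE dz/dt = v(z, t, cond) from noise to motion latents."""
    z = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        z = z + dt * model(z, t, cond)  # simple Euler step; higher-order solvers also work
    return z  # compressed motion latents, 64x shorter than the raw trajectory sequence
```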
The combination — compressed embedding from real trajectory data, flow-matching generation conditioned on text or spatial inputs — produces a system where generating and comparing multiple plausible motion futures is tractable in a way that video synthesis is not.
The paper was published in April 2026. Source code and model weights are not mentioned in the available excerpt.