Google DeepMind has published a paper introducing Decoupled DiLoCo (Distributed Low-Communication), a distributed training architecture designed to run large language model training across geographically separated data centers using bandwidth achievable over ordinary internet connectivity rather than custom high-speed interconnects. The announcement describes the system as more resilient, more flexible, and capable of training across mixed hardware generations, addressing a set of practical constraints that grow more acute as model scale increases.

The core problem being solved is synchronization. Conventional large-scale training requires chips to run in near-perfect lockstep, which works when they’re physically co-located but becomes a significant logistical bottleneck across distant facilities. Decoupled DiLoCo sidesteps this by splitting training across separate “islands” of compute, called learner units, that communicate asynchronously. A failure in one island doesn’t stall the others: training continues, and the failed unit reintegrates automatically when it comes back online.
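To make that structure concrete, here is a minimal toy sketch, in plain NumPy, of the general pattern described: independent learner units each run many local optimization steps, then contribute a parameter delta to a shared model. Everything concrete in it (the quadratic loss, four learner units, the inner-step budget, plain averaging of deltas) is an illustrative assumption, not DeepMind’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: each "learner unit" holds its own data shard.
DIM, UNITS, SHARD_SIZE = 8, 4, 256
true_w = rng.normal(size=DIM)
shards = []
for _ in range(UNITS):
    X = rng.normal(size=(SHARD_SIZE, DIM))
    y = X @ true_w + 0.1 * rng.normal(size=SHARD_SIZE)
    shards.append((X, y))

def local_training(w_start, shard, inner_steps=50, lr=0.05):
    """One learner unit: many local gradient steps with no communication."""
    X, y = shard
    w = w_start.copy()
    for _ in range(inner_steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w - w_start          # only this delta crosses the WAN, once per round

global_w = np.zeros(DIM)
for outer_round in range(20):
    # Each island trains independently for a long stretch, then reports in.
    deltas = [local_training(global_w, shard) for shard in shards]
    global_w += np.mean(deltas, axis=0)     # simple averaged outer update

print("final loss:",
      np.mean([np.mean((X @ global_w - y) ** 2) for X, y in shards]))
```

In the real system the learners are whole accelerator islands and the outer update is more sophisticated (the earlier DiLoCo papers describe a momentum-based outer optimizer rather than plain averaging), but the division of labor, long stretches of local work punctuated by infrequent cross-site updates, is what keeps the bandwidth requirement low.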

How it builds on prior work

The announcement traces Decoupled DiLoCo’s lineage to two earlier Google systems. Pathways introduced a distributed AI infrastructure based on asynchronous data flow. DiLoCo, a predecessor method, dramatically reduced the bandwidth required between distributed sites, making it practical in principle to train large models across distant locations. Decoupled DiLoCo, according to the post, “brings those ideas together” by layering asynchronous learner-unit training on top of Pathways.

The key architectural insight is how communication is handled. Rather than requiring frequent synchronization that blocks progress, the system folds the required communication into longer windows of local computation, so updates are exchanged while useful work continues. This eliminates the “blocking” bottleneck where one part of the system must wait on another, which the post identifies as the reason previous data-parallel methods didn’t scale to global distances.
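The post doesn’t spell out the mechanics, but the general shape of non-blocking communication is easy to sketch: ship the delta from one computation window in the background, keep computing, and fold the result in once it arrives. The thread pool, the sleep standing in for WAN latency, and the one-window delay below are illustrative assumptions rather than details from the announcement.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def local_compute(w, steps=200_000):
    """Stand-in for a long window of local training on one learner unit."""
    for _ in range(steps):
        w = w * 0.99999 + 0.00001   # dummy arithmetic; the real work is SGD
    return w

def wan_exchange(delta, latency_s=1.0):
    """Stand-in for shipping a delta across a slow wide-area link."""
    time.sleep(latency_s)           # pretend the network is slow
    return delta                    # in reality: reduced across all sites

w = np.zeros(4)
pending = None                      # the in-flight transfer, if any
with ThreadPoolExecutor(max_workers=1) as pool:
    for window in range(3):
        w_new = local_compute(w)    # long computation window, never blocked
        delta = w_new - w

        if pending is not None:
            # The previous window's delta has had a whole computation window
            # to cross the network; fold it in now instead of waiting earlier.
            w_new = w_new + pending.result()

        # Start this window's communication without blocking on it.
        pending = pool.submit(wan_exchange, delta)
        w = w_new

    w = w + pending.result()        # drain the last in-flight transfer
print(w)
```

The structural point is the one the post makes: because each transfer has an entire computation window in which to complete, a link far too slow for per-step synchronization only has to keep up with occasional, already-in-flight exchanges.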

What the tests showed

DeepMind reports training a 12-billion-parameter model across four separate US regions using 2–5 Gbps of wide-area network bandwidth. The post notes that this bandwidth range is “relatively achievable using existing internet connectivity between datacenter facilities, rather than requiring new custom network infrastructure.” According to the announcement, the training result was achieved more than 20 times faster than it would have been with conventional synchronization methods.
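Some back-of-envelope arithmetic shows why that bandwidth range is the notable number. The figures below assume a full exchange of all 12 billion parameters at 2 bytes each (bf16); the actual system may ship compressed or partial updates, so treat this only as a sense of scale.

```python
params = 12e9                # 12B-parameter model from the announcement
bytes_per_param = 2          # assume bf16; an assumption, not from the post
payload_bits = params * bytes_per_param * 8

for gbps in (2, 5):
    seconds = payload_bits / (gbps * 1e9)
    print(f"full parameter exchange over {gbps} Gbps: ~{seconds:.0f} s")

# Roughly 96 s at 2 Gbps and 38 s at 5 Gbps per full exchange. Synchronizing
# every training step over such a link would leave the chips idle almost all
# the time; communicating rarely, and overlapping those exchanges with
# computation, is what makes this bandwidth class workable.
```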

To stress-test fault tolerance, the team used “chaos engineering” — deliberately introducing hardware failures during live training runs. The announcement states that Decoupled DiLoCo continued training after the loss of entire learner units and reintegrated them seamlessly when they came back online.
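Operationally, the fault-tolerance claim amounts to something like the following toy loop: in any outer round a learner unit may be down, the outer update averages over whoever reported in, and a unit that returns simply starts its next round from the current global parameters. The failure probability, the toy objective, and the catch-up rule are illustrative assumptions, not details from the post.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, UNITS = 8, 4
targets = [rng.normal(size=DIM) for _ in range(UNITS)]  # per-unit toy objective

def local_delta(w_start, target, inner_steps=50, lr=0.1):
    """One learner unit: pull the parameters toward its own target."""
    w = w_start.copy()
    for _ in range(inner_steps):
        w -= lr * (w - target)
    return w - w_start

global_w = np.zeros(DIM)
for outer_round in range(30):
    # Chaos-engineering stand-in: each unit has a 25% chance of being down.
    alive = [i for i in range(UNITS) if rng.random() > 0.25]
    alive = alive or [0]             # keep at least one unit per round

    # Surviving units train from the *current* global parameters, so a unit
    # that was down last round reintegrates simply by starting its next
    # round from wherever the others have taken the model.
    deltas = [local_delta(global_w, targets[i]) for i in alive]
    global_w += np.mean(deltas, axis=0)

print("distance to the mean of all targets:",
      np.linalg.norm(global_w - np.mean(targets, axis=0)))
```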

Testing was also conducted with Gemma 4 models. The post says the system maintained greater availability of its learning clusters than traditional training methods did when hardware failed, while ultimately delivering the same benchmarked ML performance. That last point matters: the architecture doesn’t require a quality tradeoff for the resilience it provides.

Mixed hardware generations as a practical capability

One aspect the announcement highlights that goes beyond fault tolerance is the ability to mix hardware generations in a single training run. The post specifically mentions TPU v6e and TPU v5p running together. According to DeepMind, chips from different generations running at different speeds still matched the ML performance of single-chip-type training runs.
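The announcement doesn’t describe how the speed mismatch is absorbed, but the decoupled structure makes one simple possibility easy to illustrate: give every learner unit the same wall-clock window, let the faster chip fit more inner steps into it, and merge the resulting deltas exactly as before. The relative throughputs and the merge rule below are assumptions made for the sake of the sketch, not TPU v6e or v5p figures.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 8
target = rng.normal(size=DIM)        # shared toy objective for both units

# Hypothetical relative throughputs for a newer and an older chip generation:
# the faster unit completes more inner steps in the same wall-clock window.
steps_per_window = {"fast_unit": 80, "slow_unit": 30}

def local_delta(w_start, inner_steps, lr=0.02):
    """Inner steps toward the toy objective; more steps, larger delta."""
    w = w_start.copy()
    for _ in range(inner_steps):
        w -= lr * (w - target)
    return w - w_start

global_w = np.zeros(DIM)
for outer_round in range(40):
    # Neither unit ever waits on the other mid-window; they only meet at the
    # infrequent outer update, where their differently sized deltas merge.
    deltas = [local_delta(global_w, n) for n in steps_per_window.values()]
    global_w += np.mean(deltas, axis=0)

print("distance to target:", np.linalg.norm(global_w - target))
```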

The practical implication the post draws: new hardware doesn’t arrive everywhere simultaneously, so the ability to train across generations “can alleviate recurring logistical and capacity bottlenecks.” The announcement frames this more broadly as turning “stranded resources into useful capacity” — idle compute at any location, on any compatible hardware, becomes available for training jobs.

DeepMind describes the approach as part of a “full-stack” strategy spanning hardware, software infrastructure, and research, where gains increasingly come from rethinking how those layers interact. Decoupled DiLoCo is presented as one concrete instance of that approach — not a single optimization, but a rearchitecting of how training jobs relate to the physical infrastructure underneath them.

The work was done by a team spanning Google DeepMind and Google Research; leads include Arthur Douillard, Keith Rush, Yani Donchev, Zachary Charles, and several others listed in the acknowledgements.