Google DeepMind has released Decoupled DiLoCo, a distributed training architecture that allows AI models to be trained across geographically separated data centers with dramatically reduced bandwidth requirements.
The system divides large training runs into separate "islands" of compute that operate asynchronously. This approach isolates hardware failures so other parts of the system continue learning uninterrupted.
Decoupled DiLoCo requires just 0.84 Gbps of bandwidth across eight data centers, compared to 198 Gbps needed by conventional data-parallel training methods. The system maintained 88% training efficiency even when simulating 1.2 million chips with high failure rates, while standard methods dropped to 27% efficiency.
Real-world validation with Gemma models
Google tested the architecture by training 12 billion parameter models across four separate US regions using only 2-5 Gbps of standard internet connectivity. The distributed training completed more than 20 times faster than conventional synchronization methods.
Benchmark tests with Gemma 4 models showed Decoupled DiLoCo achieved 64.1% average accuracy, nearly matching the 64.4% of baseline training approaches. The system demonstrated "self-healing" capabilities, seamlessly reintegrating failed compute units when they came back online.
The architecture builds on Google's earlier Pathways system and DiLoCo research. It enables training across different hardware generations, mixing TPU v6e and TPU v5p chips in single training runs without performance degradation.
"This training paradigm unlocks the ability to mix different hardware generations in a single training run," the DeepMind team wrote in their paper published on arXiv.
The system's fault tolerance was validated through "chaos engineering" tests that deliberately introduced hardware failures during training. Unlike tightly coupled systems where chip failures can halt entire training runs, Decoupled DiLoCo continued operating with minimal disruption.
Google positions the technology as essential for future AI training at scale, where maintaining synchronization across thousands of chips becomes increasingly challenging. The approach could enable organizations to utilize stranded compute resources across multiple locations for large model training.
💬 Discussion
Sign in to join the discussion.
Sign in →No comments yet — be the first.