This document provides a comprehensive deep dive into Fully Sharded Data Parallelism (FSDP), PyTorch's native implementation of ZeRO-3, for scaling large model training. It explains FSDP's internal mechanisms, memory efficiency, and communication costs through a step-by-step walkthrough of a training iteration.

FSDP shards model parameters, gradients, and optimizer states across all GPUs. For a model with 8 GB of parameters, this reduces per-GPU memory from 32 GB under DDP to 8 GB under FSDP with 4 GPUs. The sharding is achieved through vertical partitioning of the model into "units" and horizontal sharding of each unit's entities across devices.

A training iteration under FSDP proceeds as follows:
1) Initial setup: split the dataset and shard the model.
2) Forward pass: All-Gather the parameters of each unit, compute in parallel, save activations, then reshard.
3) Backward pass: All-Gather parameters, compute local gradients, Reduce-Scatter the gradients, free activations.
4) Optimizer step: each GPU performs an independent local update on its shard.

FSDP2 improves on the original with per-parameter sharding and native DTensor support. The document demonstrates an implementation using PyTorch FSDP2 and Ray Train, covering model definition (a Vision Transformer), sharding, distributed checkpointing (DCP, for parallel I/O and automatic resharding), and the training loop. It also introduces DeepSpeed as an alternative, configurable via JSON for the various ZeRO stages.

Finally, a real-world project fine-tunes the 1.7B-parameter Qwen3-TTS model for voice cloning. The pipeline includes data processing (Whisper transcription, Ray Data), audio code extraction (Qwen3-TTS-Tokenizer, 12 Hz, 16 codebooks), distributed SFT training (freezing most parameters, training the talker, conditioning on speaker embeddings), and inference. The Qwen3-TTS talker has 847,234,560 trainable parameters.
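The 32 GB vs. 8 GB figures follow from simple back-of-envelope arithmetic: an fp32 training state holds the parameters, the gradients, and two Adam moment buffers (4x the parameter bytes in total), and FSDP divides that total across the data-parallel ranks. A minimal sketch of this accounting (the helper function is hypothetical, not from the article; activation memory and mixed-precision variants are ignored):

```python
def per_gpu_memory_gb(param_gb: float, world_size: int = 1, sharded: bool = False) -> float:
    """Rough fp32 training footprint in GB per GPU.

    Counts parameters + gradients + two Adam moment buffers (m and v),
    i.e. 4x the parameter bytes. Activations are deliberately ignored.
    With FSDP (sharded=True) the whole state is split across ranks;
    with DDP every rank holds a full replica.
    """
    total = param_gb * (1 + 1 + 2)  # params + grads + Adam m/v states
    return total / world_size if sharded else total

# 8 GB of fp32 parameters, as in the article's example:
ddp = per_gpu_memory_gb(8)                                # 32.0 GB per GPU
fsdp = per_gpu_memory_gb(8, world_size=4, sharded=True)   # 8.0 GB per GPU
```

This is why ZeRO-3/FSDP memory savings scale linearly with the number of GPUs: the sharded state shrinks as 1/world_size, while DDP's replica stays constant.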