This document provides a comprehensive deep dive into Fully Sharded Data Parallelism (FSDP), PyTorch's native implementation of ZeRO-3, for scaling large model training. It explains FSDP's internal mechanisms, memory efficiency, and communication costs through a step-by-step walkthrough of a training iteration.

FSDP shards model parameters, gradients, and optimizer states across all GPUs. For a model with 8 GB of parameters, this reduces per-GPU memory from 32 GB under DDP to 8 GB under FSDP with 4 GPUs. The sharding is achieved through vertical partitioning of the model into "units" and horizontal sharding of each unit's entities across devices.

A training iteration under FSDP proceeds as follows:
1) Initial setup: split the dataset and shard the model.
2) Forward pass: All-Gather the parameters of each unit, compute in parallel, save activations, then reshard.
3) Backward pass: All-Gather parameters, compute local gradients, Reduce-Scatter the gradients, free activations.
4) Optimizer step: each GPU performs an independent local update on its shard.

FSDP2 improves on the original with per-parameter sharding and native DTensor support. The document demonstrates an implementation using PyTorch FSDP2 and Ray Train, covering model definition (a Vision Transformer), sharding, distributed checkpointing (DCP, for parallel I/O and automatic resharding), and the training loop. It also introduces DeepSpeed as an alternative, configurable via JSON for the various ZeRO stages.

Finally, a real-world project fine-tunes the 1.7B-parameter Qwen3-TTS model for voice cloning. The pipeline includes data processing (Whisper transcription, Ray Data), audio code extraction (Qwen3-TTS-Tokenizer, 12 Hz, 16 codebooks), distributed SFT training (freezing most parameters, training the talker, conditioning on speaker embeddings), and inference. The Qwen3-TTS talker has 847,234,560 trainable parameters.
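The 32 GB vs. 8 GB figures follow from simple back-of-envelope arithmetic: an fp32 training state holds the parameters, the gradients, and two Adam moment buffers (4x the parameter bytes in total), and FSDP divides that total across the data-parallel ranks. A minimal sketch of this accounting (the helper function is hypothetical, not from the article; activation memory and mixed-precision variants are ignored):

```python
def per_gpu_memory_gb(param_gb: float, world_size: int = 1, sharded: bool = False) -> float:
    """Rough fp32 training footprint in GB per GPU.

    Counts parameters + gradients + two Adam moment buffers (m and v),
    i.e. 4x the parameter bytes. Activations are deliberately ignored.
    With FSDP (sharded=True) the whole state is split across ranks;
    with DDP every rank holds a full replica.
    """
    total = param_gb * (1 + 1 + 2)  # params + grads + Adam m/v states
    return total / world_size if sharded else total

# 8 GB of fp32 parameters, as in the article's example:
ddp = per_gpu_memory_gb(8)                                # 32.0 GB per GPU
fsdp = per_gpu_memory_gb(8, world_size=4, sharded=True)   # 8.0 GB per GPU
```

This is why ZeRO-3/FSDP memory savings scale linearly with the number of GPUs: the sharded state shrinks as 1/world_size, while DDP's replica stays constant.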