Distributed Training Part 5: Introduction to GPU. Contents: GPU Architecture; How to Improve Performance with Kernels; Fused Kernels; Flash Attention. Liz · About 6 min · Tags: LLM, Distributed, Parallelism
Distributed Training Part 4: Parallel Strategies. Contents: Five Dimensions of Parallelization Strategies (batch, hidden_state, sequence, model_layer, and model_expert dimensions); Optimal Training Configuration; Tensor Parallelism (TP); Sequence Parallelism (SP); Context Parallelism (CP); Pipeline Parallelism (PP); Expert Parallelism (EP). Liz · About 9 min · Tags: LLM, Distributed, Parallelism
Distributed Training Part 3: Data Parallelism. Contents: Data Parallelism (DP); DP Optimization; DP Practice; ZeRO-1 / ZeRO-2 / ZeRO-3 (FSDP). Liz · About 9 min · Tags: LLM, Distributed, Parallelism
Distributed Training Part 2: Parallel Programming. Contents: Broadcast; Reduce & AllReduce; Gather & AllGather; Scatter & ReduceScatter. Liz · About 5 min · Tags: LLM, Distributed, Parallel
Distributed Training Part 1: Memory Usage in Model Training. Contents: Model Training Process and Important Hyperparameters; Memory Usage in Model Training; Memory Optimization Suggestions; Activation Recomputation / Gradient Checkpointing; Gradient Accumulation; Mixed Precision Training. Liz · About 10 min · Tags: LLM, Distributed, Parallel