Distributed Training Part 5: Introduction to GPU
GPU Architecture; How to Improve Performance with Kernels; Fused Kernels; Flash Attention
Liz | About 6 min | LLM, Distributed, Parallelism
Distributed Training Part 4: Parallel Strategies
Five Dimensions of Parallelization Strategies (batch, hidden_state, sequence, model_layer, and model_expert dimensions); Optimal Training Configuration; Tensor Parallelism (TP); Sequence Parallelism (SP); Context Parallelism (CP); Pipeline Parallelism (PP); Expert Parallelism (EP)
Liz | About 9 min | LLM, Distributed, Parallelism
Distributed Training Part 3: Data Parallelism
Data Parallelism (DP); DP Optimization; DP Practice; ZeRO-1 / ZeRO-2 / ZeRO-3 (FSDP)
Liz | About 9 min | LLM, Distributed, Parallelism