Distributed Training Part 4: Parallel Strategies

  • Five Dimensions of Parallelization Strategies
    • batch dimension
    • hidden_state dimension
    • sequence dimension
    • model_layer dimension
    • model_expert dimension
  • Optimal Training Configuration
  • Tensor Parallelism (TP)
  • Sequence Parallelism (SP)
  • Context Parallelism (CP)
  • Pipeline Parallelism (PP)
  • Expert Parallelism (EP)

Liz · About 9 min · LLM · Distributed · Parallelism
Distributed Training Part 1: Memory Usage in Model Training

  • Model Training Process and Important Hyperparameters
  • Memory Usage in Model Training
  • Memory Optimization Suggestions
    • Activation Recomputation / Gradient Checkpointing
    • Gradient Accumulation
    • Mixed Precision Training

Liz · About 10 min · LLM · Distributed · Parallel