A comprehensive guide and implementation for understanding distributed training in PyTorch - from low-level primitives to production deployment.
This project serves as an educational resource for understanding distributed training in PyTorch, implementing both DistributedDataParallel (DDP) and Fully Sharded Data Parallel (FSDP) from scratch using PyTorch primitives. What sets this repository apart is the complete infrastructure setup guide alongside the training implementation, bridging the gap between theoretical understanding and practical deployment.
- Implementation of DDP from scratch using PyTorch primitives
- Understanding data parallelism and gradient synchronization
- Process group management and initialization (see the sketch after this list)
- (Coming Soon) FSDP implementation and memory optimization
- Complete Terraform configurations for GPU cluster deployment
- Automated node discovery and coordination
- Network configuration for distributed training
- Shared filesystem setup for checkpointing
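The DDP items above follow the standard PyTorch pattern: one process per GPU joins a process group, and the `DistributedDataParallel` wrapper keeps replicas in sync by all-reducing gradients during the backward pass. The sketch below is a minimal, self-contained illustration (the toy model and optimizer are placeholders, not this repository's code) and assumes it is launched with torchrun on CUDA devices:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup() -> int:
    """Join the process group; torchrun sets RANK/WORLD_SIZE/LOCAL_RANK."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank


def main() -> None:
    local_rank = setup()
    # Toy model; the real training script defines its own.
    model = torch.nn.Linear(10, 1).to(local_rank)
    # DDP broadcasts parameters at construction and all-reduces
    # gradients during backward(), keeping replicas in sync.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(32, 10, device=local_rank)
    loss = model(x).square().mean()
    loss.backward()  # gradients are averaged across ranks here
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```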
## Prerequisites
- Python ≥ 3.10
- Poetry for dependency management
- Terraform for infrastructure setup
- Access to Nebius Cloud (infrastructure code can be adapted for other providers)
- Clone and set up the project:

```bash
git clone https://github.com/erfanMhi/distributed_training.git
cd distributed_training
poetry install
```
- Run training on multiple GPUs:

```bash
poetry run torchrun \
  --nproc_per_node=NUM_GPUS \
  src/multigpu_multi_node.py EPOCHS SAVE_FREQUENCY \
  --batch_size BATCH_SIZE
```
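For example, `poetry run torchrun --nproc_per_node=4 src/multigpu_multi_node.py 50 10 --batch_size 32` trains for 50 epochs on four local GPUs, saving a checkpoint every 10 epochs (the numbers are illustrative).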
- Set up cloud credentials:

```bash
export NB_AUTHKEY_PRIVATE_PATH="path/to/private/key"
export NB_AUTHKEY_PUBLIC_ID="your-public-key-id"
export NB_SA_ID="your-service-account-id"
```
> ⚠️ **Important:** For multi-node training on Nebius Cloud, ensure you have sufficient quota allocation. You'll need quota for at least 2 GPU nodes in your target region. Check your quotas in the Nebius Cloud Console and request increases if needed before deployment.
- Deploy infrastructure:

```bash
cd infrastructure
terraform init
terraform apply
```
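The infrastructure parameters listed further below can be overridden at apply time with standard Terraform flags, e.g. `terraform apply -var="cluster_size=2"` (assuming the variable names in variables.tf match the parameter table).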
## Repository Layout
```text
distributed_training/
├── src/                       # Training implementation
│   ├── multigpu_multi_node.py # DDP training script
│   └── data_utils.py          # Dataset utilities
├── infrastructure/            # Cloud deployment code
│   ├── main.tf                # Main Terraform configuration
│   ├── variables.tf           # Infrastructure variables
│   └── scripts/               # Deployment scripts
└── docs/                      # (Coming Soon) Detailed documentation
```
We maintain strict code quality standards using automated tools:
- Line length: 79 characters
- Style: PEP 8 with Black formatting
- Docstrings: Google convention
- Import order: PEP 8 style with isort
- Type hints: Required for all functions
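As a quick illustration of these conventions (an example written for this README, not code from the repository), a compliant function looks like:

```python
def mean_gradient_norm(norms: list[float]) -> float:
    """Return the arithmetic mean of per-rank gradient norms.

    Args:
        norms: Gradient norms collected from each rank.

    Returns:
        The mean of ``norms``.

    Raises:
        ValueError: If ``norms`` is empty.
    """
    if not norms:
        raise ValueError("norms must be non-empty")
    return sum(norms) / len(norms)
```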
```bash
# Install development dependencies
poetry install --with dev

# Run all checks locally
poetry run black .    # Code formatting
poetry run flake8 .   # Style and docstring checks
poetry run isort .    # Import sorting
poetry run mypy .     # Type checking
```
All PRs are automatically verified for:
- ✅ Code formatting (Black)
- ✅ Import ordering (isort)
- ✅ Type hints (mypy)
- ✅ Style compliance (flake8)
For VS Code users, add to `settings.json`:

```json
{
  "python.linting.flake8Enabled": true,
  "python.linting.enabled": true,
  "python.formatting.provider": "black",
  "editor.formatOnSave": true,
  "editor.rulers": [79]
}
```
- Install dev dependencies: `poetry install --with dev`
- Format code: `poetry run black .`
- Sort imports: `poetry run isort .`
- Run type checks: `poetry run mypy .`
- Verify style: `poetry run flake8 .`
All PRs must pass CI checks before merging.
## Training Parameters
| Parameter | Description | Default |
|---|---|---|
| `total_epochs` | Number of training epochs | - |
| `save_every` | Checkpoint frequency | - |
| `batch_size` | Batch size per GPU | 32 |
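For reference, a plausible argparse layout that matches the torchrun command above and this table (illustrative; the actual parser in src/multigpu_multi_node.py may differ in detail):

```python
import argparse


def parse_args() -> argparse.Namespace:
    """CLI for the training script: two positionals plus batch size."""
    parser = argparse.ArgumentParser(description="DDP training job")
    parser.add_argument("total_epochs", type=int,
                        help="Number of training epochs")
    parser.add_argument("save_every", type=int,
                        help="Checkpoint frequency")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="Batch size per GPU")
    return parser.parse_args()
```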
## Infrastructure Parameters
| Parameter | Description | Default |
|---|---|---|
| `cluster_size` | Number of nodes | 1 |
| `training_epochs` | Total epochs | 10 |
| `save_frequency` | Checkpoint frequency | 5 |
- Process group initialization and management
- Gradient synchronization across nodes
- Efficient data loading with DistributedSampler (see the sketch below)
- Checkpoint management for fault tolerance
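A condensed sketch of the data-loading and checkpointing patterns from this list, assuming an already-initialized process group; the dataset, file names, and helper functions are illustrative placeholders rather than this repository's actual code:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_loader(batch_size: int) -> DataLoader:
    """Give each rank a disjoint shard of the dataset per epoch."""
    dataset = TensorDataset(torch.randn(2048, 10), torch.randn(2048, 1))
    sampler = DistributedSampler(dataset)  # shards by rank / world size
    # Inside the training loop, call loader.sampler.set_epoch(epoch)
    # so each epoch reshuffles the shards consistently across ranks.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


def save_checkpoint(model: torch.nn.Module, epoch: int) -> None:
    """Write a checkpoint from rank 0 only; replicas are identical."""
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), f"checkpoint_epoch_{epoch}.pt")
    dist.barrier()  # keep all ranks in step around the write
```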
- H100 GPU cluster orchestration
- Inter-node networking setup
- Shared filesystem configuration
- Automatic training coordination
| Status | Feature |
|---|---|
| ✅ | Basic DDP implementation |
| ✅ | Multi-node training support |
| ✅ | Infrastructure automation |
| 🚧 | FSDP implementation |
| 📝 | Performance optimization guides |
| 🎯 | Multi-cloud support |
We welcome contributions of all kinds! Here's how you can help:
- 📝 Improve documentation
- 🐛 Report or fix bugs
- ✨ Propose or implement new features
- 🚀 Optimize performance
Please feel free to submit issues and pull requests.