🚀 PyTorch Distributed Training from Ground Up

Python 3.10+ · Poetry · License: MIT

A comprehensive guide and implementation for understanding distributed training in PyTorch, from low-level primitives to production deployment.

Getting Started · Features · Documentation · Contributing


🎯 Introduction

This project serves as an educational resource for understanding distributed training in PyTorch, implementing both DistributedDataParallel (DDP) and Fully Sharded Data Parallel (FSDP) from scratch using PyTorch primitives. What sets this repository apart is the complete infrastructure setup guide alongside the training implementation, bridging the gap between theoretical understanding and practical deployment.

📚 Learning Path

1. Distributed Training Fundamentals

  • Implementation of DDP from scratch using PyTorch primitives
  • Understanding data parallelism and gradient synchronization (see the sketch after this list)
  • Process group management and initialization
  • (Coming Soon) FSDP implementation and memory optimization
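
Conceptually, a from-scratch DDP step boils down to averaging gradients across ranks after the backward pass. The sketch below illustrates that idea with plain torch.distributed calls; it is a minimal illustration of the technique, not the repository's exact code, and it assumes the process group has already been initialized.

import torch
import torch.distributed as dist


def sync_gradients(model: torch.nn.Module, world_size: int) -> None:
    """Average every parameter's gradient across all ranks (DDP's core idea)."""
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size


def train_step(model, optimizer, inputs, targets, world_size) -> float:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    sync_gradients(model, world_size)  # keeps every rank's update identical
    optimizer.step()
    return loss.item()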

2. Infrastructure Setup

  • Complete Terraform configurations for GPU cluster deployment
  • Automated node discovery and coordination
  • Network configuration for distributed training
  • Shared filesystem setup for checkpointing

🚀 Quick Start

Prerequisites
  • Python ≥ 3.10
  • Poetry for dependency management
  • Terraform for infrastructure setup
  • Access to Nebius Cloud (infrastructure code can be adapted for other providers)

🖥 Local Development

  1. Clone and set up:
git clone https://github.com/erfanMhi/distributed_training.git
cd distributed_training
poetry install
  2. Run training on multiple GPUs:
poetry run torchrun \
    --nproc_per_node=NUM_GPUS \
    src/multigpu_multi_node.py EPOCHS SAVE_FREQUENCY \
    --batch_size BATCH_SIZE
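
torchrun starts one worker process per GPU and exports RANK, LOCAL_RANK, and WORLD_SIZE to each of them. A minimal sketch of how such a worker typically bootstraps itself is shown below; it reflects the common pattern rather than the exact contents of src/multigpu_multi_node.py.

import os

import torch
import torch.distributed as dist


def setup_distributed() -> int:
    """Join the default process group using the environment set by torchrun."""
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # rank/world size read from env vars
    return local_rank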

Cloud Deployment

  1. Set up cloud credentials:
export NB_AUTHKEY_PRIVATE_PATH="path/to/private/key"
export NB_AUTHKEY_PUBLIC_ID="your-public-key-id"
export NB_SA_ID="your-service-account-id"

⚠️ Important: Multi-node training on Nebius Cloud requires quota for at least 2 GPU nodes in your target region. Check your quotas in the Nebius Cloud Console and request increases before deployment if needed.

  2. Deploy infrastructure:
cd infrastructure
terraform init
terraform apply

📂 Project Structure

Repository Layout
distributed_training/
├── src/                      # Training implementation
│   ├── multigpu_multi_node.py  # DDP training script
│   └── data_utils.py           # Dataset utilities
├── infrastructure/           # Cloud deployment code
│   ├── main.tf              # Main Terraform configuration
│   ├── variables.tf         # Infrastructure variables
│   └── scripts/             # Deployment scripts
└── docs/                    # (Coming Soon) Detailed documentation

🎨 Code Style Standards

We maintain strict code quality standards using automated tools:

Python Style Guide

  • Line length: 79 characters
  • Style: PEP 8 with Black formatting
  • Docstrings: Google convention
  • Import order: PEP 8 style with isort
  • Type hints: Required for all functions

Tools and Configuration

# Install development dependencies
poetry install --with dev

# Run all checks locally
poetry run black .        # Code formatting
poetry run flake8 .      # Style and docstring checks
poetry run isort .       # Import sorting
poetry run mypy .        # Type checking

Pre-commit Checks

All PRs are automatically verified for:

  • ✅ Code formatting (Black)
  • ✅ Import ordering (isort)
  • ✅ Type hints (mypy)
  • ✅ Style compliance (flake8)

IDE Setup

For VS Code users, add to settings.json:

{
    "python.linting.flake8Enabled": true,
    "python.linting.enabled": true,
    "python.formatting.provider": "black",
    "editor.formatOnSave": true,
    "editor.rulers": [79]
}

Contributing Code

  1. Install dev dependencies: poetry install --with dev
  2. Format code: poetry run black .
  3. Sort imports: poetry run isort .
  4. Run type checks: poetry run mypy .
  5. Verify style: poetry run flake8 .

All PRs must pass CI checks before merging.

⚙️ Configuration

Training Parameters

  Parameter        Description                  Default
  total_epochs     Number of training epochs    -
  save_every       Checkpoint frequency         -
  batch_size       Batch size per GPU           32

Infrastructure Parameters

  Parameter          Description             Default
  cluster_size       Number of nodes         1
  training_epochs    Total epochs            10
  save_frequency     Checkpoint frequency    5
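
Based on the torchrun command in the Quick Start, the training parameters above map to two positional arguments and one optional flag. The argparse sketch below shows one plausible way to wire them up; the exact argument names in src/multigpu_multi_node.py may differ.

import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Distributed training job")
    parser.add_argument("total_epochs", type=int,
                        help="Number of training epochs")
    parser.add_argument("save_every", type=int,
                        help="Checkpoint frequency (in epochs)")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="Batch size per GPU (default: 32)")
    return parser.parse_args()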

📖 Implementation Details

Distributed Training

  • Process group initialization and management
  • Gradient synchronization across nodes
  • Efficient data loading with DistributedSampler (see the sketch after this list)
  • Checkpoint management for fault tolerance
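
A minimal sketch of the data-loading and checkpointing pattern referenced above: DistributedSampler hands each rank a disjoint shard of the dataset, and only rank 0 writes checkpoints so processes do not race on the same file. The checkpoint path and snapshot format here are illustrative assumptions, not the repository's exact layout.

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler


def build_loader(dataset, batch_size: int) -> DataLoader:
    # Each rank gets a different, non-overlapping shard of the dataset.
    sampler = DistributedSampler(dataset)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                      pin_memory=True, shuffle=False)


def maybe_save_checkpoint(model: torch.nn.Module, epoch: int,
                          save_every: int, path: str = "checkpoint.pt") -> None:
    # Only rank 0 writes; other ranks would otherwise clobber the same file.
    if epoch % save_every == 0 and dist.get_rank() == 0:
        torch.save({"epoch": epoch, "model_state": model.state_dict()}, path)

In the training loop, sampler.set_epoch(epoch) should be called at the start of each epoch so that shuffling differs across epochs.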

Infrastructure

  • H100 GPU cluster orchestration
  • Inter-node networking setup
  • Shared filesystem configuration
  • Automatic training coordination

🛣 Roadmap

  Status   Feature
  ✅       Basic DDP implementation
  ✅       Multi-node training support
  ✅       Infrastructure automation
  🚧       FSDP implementation
  📝       Performance optimization guides
  🎯       Multi-cloud support

🤝 Contributing

We welcome contributions of all kinds! Here's how you can help:

  • 📝 Improve documentation
  • 🐛 Report or fix bugs
  • ✨ Propose or implement new features
  • 🚀 Optimize performance

Please feel free to submit issues and pull requests.

📬 Contact

Email · GitHub


If you find this project helpful, please consider giving it a ⭐!

