A comprehensive guide and implementation for understanding distributed training in PyTorch - from low-level primitives to production deployment.
This project serves as an educational resource for understanding distributed training in PyTorch, implementing both DistributedDataParallel (DDP) and Fully Sharded Data Parallel (FSDP) from scratch using PyTorch primitives. What sets this repository apart is the complete infrastructure setup guide alongside the training implementation, bridging the gap between theoretical understanding and practical deployment.
- Implementation of DDP from scratch using PyTorch primitives
- Understanding data parallelism and gradient synchronization
- Process group management and initialization (see the sketch after this list)
- (Coming Soon) FSDP implementation and memory optimization
- Complete Terraform configurations for GPU cluster deployment
- Automated node discovery and coordination
- Network configuration for distributed training
- Shared filesystem setup for checkpointing
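The DDP items above follow the standard PyTorch pattern: one process per GPU joins a process group, and the `DistributedDataParallel` wrapper keeps replicas in sync by all-reducing gradients during the backward pass. The sketch below is a minimal, self-contained illustration (the toy model and optimizer are placeholders, not this repository's code) and assumes it is launched with torchrun on CUDA devices:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup() -> int:
    """Join the process group; torchrun sets RANK/WORLD_SIZE/LOCAL_RANK."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank


def main() -> None:
    local_rank = setup()
    # Toy model; the real training script defines its own.
    model = torch.nn.Linear(10, 1).to(local_rank)
    # DDP broadcasts parameters at construction and all-reduces
    # gradients during backward(), keeping replicas in sync.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(32, 10, device=local_rank)
    loss = model(x).square().mean()
    loss.backward()  # gradients are averaged across ranks here
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```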
## Prerequisites
- Python ≥ 3.10
- Poetry for dependency management
- Terraform for infrastructure setup
- Access to Nebius Cloud (infrastructure code can be adapted for other providers)
- Clone and set up the project:

```bash
git clone https://github.com/erfanMhi/distributed_training.git
cd distributed_training
poetry install
```
- Run training on multiple GPUs:

```bash
poetry run torchrun \
  --nproc_per_node=NUM_GPUS \
  src/multigpu_multi_node.py EPOCHS SAVE_FREQUENCY \
  --batch_size BATCH_SIZE
```
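For example, `poetry run torchrun --nproc_per_node=4 src/multigpu_multi_node.py 50 10 --batch_size 32` trains for 50 epochs on four local GPUs, saving a checkpoint every 10 epochs (the numbers are illustrative).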
- Set up cloud credentials:

```bash
export NB_AUTHKEY_PRIVATE_PATH="path/to/private/key"
export NB_AUTHKEY_PUBLIC_ID="your-public-key-id"
export NB_SA_ID="your-service-account-id"
```
> ⚠️ **Important:** For multi-node training on Nebius Cloud, ensure you have sufficient quota allocation. You'll need quota for at least 2 GPU nodes in your target region. Check your quotas in the Nebius Cloud Console and request increases if needed before deployment.
- Deploy infrastructure:

```bash
cd infrastructure
terraform init
terraform apply
```
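The infrastructure parameters listed further below can be overridden at apply time with standard Terraform flags, e.g. `terraform apply -var="cluster_size=2"` (assuming the variable names in variables.tf match the parameter table).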
## Repository Layout
```text
distributed_training/
├── src/                       # Training implementation
│   ├── multigpu_multi_node.py # DDP training script
│   └── data_utils.py          # Dataset utilities
├── infrastructure/            # Cloud deployment code
│   ├── main.tf                # Main Terraform configuration
│   ├── variables.tf           # Infrastructure variables
│   └── scripts/               # Deployment scripts
└── docs/                      # (Coming Soon) Detailed documentation
```
We maintain strict code quality standards using automated tools:
- Line length: 79 characters
- Style: PEP 8 with Black formatting
- Docstrings: Google convention
- Import order: PEP 8 style with isort
- Type hints: Required for all functions
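As a quick illustration of these conventions (an example written for this README, not code from the repository), a compliant function looks like:

```python
def mean_gradient_norm(norms: list[float]) -> float:
    """Return the arithmetic mean of per-rank gradient norms.

    Args:
        norms: Gradient norms collected from each rank.

    Returns:
        The mean of ``norms``.

    Raises:
        ValueError: If ``norms`` is empty.
    """
    if not norms:
        raise ValueError("norms must be non-empty")
    return sum(norms) / len(norms)
```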
```bash
# Install development dependencies
poetry install --with dev

# Run all checks locally
poetry run black .    # Code formatting
poetry run flake8 .   # Style and docstring checks
poetry run isort .    # Import sorting
poetry run mypy .     # Type checking
```
All PRs are automatically verified for:
- ✅ Code formatting (Black)
- ✅ Import ordering (isort)
- ✅ Type hints (mypy)
- ✅ Style compliance (flake8)
For VS Code users, add to `settings.json`:

```json
{
  "python.linting.flake8Enabled": true,
  "python.linting.enabled": true,
  "python.formatting.provider": "black",
  "editor.formatOnSave": true,
  "editor.rulers": [79]
}
```
- Install dev dependencies: `poetry install --with dev`
- Format code: `poetry run black .`
- Sort imports: `poetry run isort .`
- Run type checks: `poetry run mypy .`
- Verify style: `poetry run flake8 .`
All PRs must pass CI checks before merging.
## Training Parameters
| Parameter | Description | Default |
|---|---|---|
| `total_epochs` | Number of training epochs | - |
| `save_every` | Checkpoint frequency | - |
| `batch_size` | Batch size per GPU | 32 |
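For reference, a plausible argparse layout that matches the torchrun command above and this table (illustrative; the actual parser in src/multigpu_multi_node.py may differ in detail):

```python
import argparse


def parse_args() -> argparse.Namespace:
    """CLI for the training script: two positionals plus batch size."""
    parser = argparse.ArgumentParser(description="DDP training job")
    parser.add_argument("total_epochs", type=int,
                        help="Number of training epochs")
    parser.add_argument("save_every", type=int,
                        help="Checkpoint frequency")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="Batch size per GPU")
    return parser.parse_args()
```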
## Infrastructure Parameters
| Parameter | Description | Default |
|---|---|---|
| `cluster_size` | Number of nodes | 1 |
| `training_epochs` | Total epochs | 10 |
| `save_frequency` | Checkpoint frequency | 5 |
- Process group initialization and management
- Gradient synchronization across nodes
- Efficient data loading with DistributedSampler (see the sketch below)
- Checkpoint management for fault tolerance
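A condensed sketch of the data-loading and checkpointing patterns from this list, assuming an already-initialized process group; the dataset, file names, and helper functions are illustrative placeholders rather than this repository's actual code:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_loader(batch_size: int) -> DataLoader:
    """Give each rank a disjoint shard of the dataset per epoch."""
    dataset = TensorDataset(torch.randn(2048, 10), torch.randn(2048, 1))
    sampler = DistributedSampler(dataset)  # shards by rank / world size
    # Inside the training loop, call loader.sampler.set_epoch(epoch)
    # so each epoch reshuffles the shards consistently across ranks.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


def save_checkpoint(model: torch.nn.Module, epoch: int) -> None:
    """Write a checkpoint from rank 0 only; replicas are identical."""
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), f"checkpoint_epoch_{epoch}.pt")
    dist.barrier()  # keep all ranks in step around the write
```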
- H100 GPU cluster orchestration
- Inter-node networking setup
- Shared filesystem configuration
- Automatic training coordination
| Status | Feature |
|---|---|
| ✅ | Basic DDP implementation |
| ✅ | Multi-node training support |
| ✅ | Infrastructure automation |
| 🚧 | FSDP implementation |
| 📝 | Performance optimization guides |
| 🎯 | Multi-cloud support |
We welcome contributions of all kinds! Here's how you can help:
- 📝 Improve documentation
- 🐛 Report or fix bugs
- ✨ Propose or implement new features
- 🚀 Optimize performance
Please feel free to submit issues and pull requests.