MLX Distributed Training (Beta)

A privacy-first distributed training framework built on MLX for Apple Silicon, enabling secure and efficient AI model training across multiple devices while preserving data privacy.


What We're Building

We're training a decoder-only transformer model from scratch, optimized for Apple Silicon:

  • Architecture: Decoder-only transformer (similar to GPT-4/Llama 3); a configuration sketch follows this list

    • 22 transformer layers
    • 2048 embedding dimensions
    • 16 attention heads
    • 8192 max sequence length
    • Training optimizations: Flash Attention, Grouped Query Attention (GQA), RoPE embeddings, SwiGLU activations
  • Goal: Train a competitive 1B parameter model that can match or exceed Llama 3.2's performance using distributed consumer hardware instead of traditional GPU clusters. Overall, we're aiming to push the boundaries of what's possible with Apple Silicon and see how performance scales with increasing model size on consumer hardware.
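
The architecture above maps to a small set of hyperparameters. The sketch below is illustrative only: the dataclass and field names, the GQA grouping, the vocabulary size, and the feed-forward width are assumptions not stated in the spec; only the layer count, embedding dimension, head count, and sequence length come from the list above. The rough parameter count shows that such a configuration lands near the stated 1B target.

# Hypothetical configuration sketch; names, GQA grouping, vocab size, and FFN width are assumptions.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layers: int = 22        # transformer layers (from the spec above)
    d_model: int = 2048       # embedding dimensions
    n_heads: int = 16         # attention heads
    n_kv_heads: int = 4       # assumed GQA grouping; not stated above
    max_seq_len: int = 8192   # maximum sequence length
    vocab_size: int = 32_000  # assumed; depends on the tokenizer
    d_ff: int = 5632          # assumed SwiGLU hidden width (~2.75x d_model, Llama-style)

def approx_params(cfg: ModelConfig) -> float:
    """Back-of-the-envelope parameter count, in billions."""
    head_dim = cfg.d_model // cfg.n_heads
    attn = cfg.d_model * (2 * cfg.d_model + 2 * cfg.n_kv_heads * head_dim)  # Q, O + GQA-shrunk K, V
    mlp = 3 * cfg.d_model * cfg.d_ff                                        # SwiGLU: gate, up, down
    emb = cfg.vocab_size * cfg.d_model                                      # tied embeddings
    return (cfg.n_layers * (attn + mlp) + emb) / 1e9

print(f"~{approx_params(ModelConfig()):.2f}B parameters")  # comes out near the 1B goal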

System Architecture

Features

  • Privacy-First: All training happens on your devices, keeping sensitive data under your control
  • Efficient: Optimized for Apple Silicon using MLX, enabling fast training on consumer hardware
  • Distributed: Scale training across multiple Macs for better performance
  • Flexible: Support for various model architectures and training configurations

Introduction to Distributed Training with MLX

This project explores the potential of distributed training on Apple Silicon, specifically targeting the development of large language models. By leveraging MLX's distributed communication framework, we're pushing the boundaries of what's possible with consumer hardware.

The primary goal is ambitious yet practical: train a 1B parameter model on a network of Mac devices that matches or exceeds state-of-the-art results from comparable models such as Llama 3.2. Traditional approaches to training models at this scale typically require expensive cloud resources or specialized hardware. This implementation demonstrates that, with efficient distributed algorithms and Apple's unified memory architecture, comparable results can be achieved using devices many developers already own.

This framework is designed for ML engineers and researchers interested in:

  • Implementing and optimizing distributed training systems
  • Exploring novel approaches to model parallelism and gradient synchronization
  • Understanding the practical aspects of training large language models
  • Contributing to the advancement of decentralized ML infrastructure

Why MLX for Distributed Training?

After extensive experimentation with various frameworks, MLX emerged as the optimal choice for distributed training on Apple Silicon for several compelling reasons:

  1. Native Silicon Architecture Integration

    • Direct compilation to Metal, maximizing M-series chip performance
    • Seamless utilization of unified memory shared by the CPU and GPU
    • Optimized memory bandwidth and computational throughput
    • Performance that consistently outpaces traditional frameworks on Apple hardware
  2. Advanced Communication Architecture (see the sketch after this list)

    • High-efficiency MPI-based inter-device communication
    • Zero-copy gradient synchronization through optimized all-reduce operations
    • Network stack specifically tuned for Apple's hardware ecosystem
    • Minimal overhead in multi-device coordination
  3. Sophisticated Memory Management

    • Leverages unified memory architecture for optimal resource utilization
    • Implements dynamic batch size adjustment based on device capabilities
    • Advanced gradient checkpointing for memory-constrained scenarios
    • Comprehensive monitoring and profiling capabilities
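
As a concrete illustration of the communication layer described in item 2, the following minimal sketch uses MLX's distributed API (mx.distributed.init and mx.distributed.all_sum) to average an array across processes. It is a generic MLX example, not code from this repository, and the launch command at the bottom assumes an MPI installation such as the one set up in the Installation section.

# Minimal MLX distributed example: average an array across all ranks.
import mlx.core as mx

group = mx.distributed.init()        # joins the process group this script was launched into
rank, size = group.rank(), group.size()

x = mx.ones((4,)) * rank             # each rank contributes a different value
total = mx.distributed.all_sum(x)    # all-reduce: every rank receives the elementwise sum
mean = total / size                  # the same pattern is used to average gradients
print(f"rank {rank}/{size}: {mean}")

# Example launch on a single machine (multi-host launches additionally need a hostfile,
# whose flag depends on the MPI implementation):
#   mpirun -np 2 python hello_distributed.py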

Our research and development focus on several key areas:

  • Scaling transformer architectures to 1B-3B parameters across distributed Mac systems
  • Implementing novel data streaming and caching strategies
  • Exploring hybrid parallelism techniques (data, model, and pipeline)
  • Developing robust distributed training protocols

This project serves as both a practical implementation and a research platform, enabling the ML community to explore distributed training techniques without the traditional barriers to entry. We welcome contributions from engineers and researchers interested in advancing the field of distributed ML training.

Installation

System Requirements

  • macOS Sonoma 14.0+ (Apple Silicon)
  • Python 3.11+
  • MLX 0.20.0+
  • High-speed network connection (10Gbps recommended)
  • SSH access configured between devices

Setup and Installation

# Install system dependencies
xcode-select --install
brew install mpich

# Clone repository
git clone https://github.com/jbarnes850/mlx_distributed
cd mlx_distributed

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -e ".[dev]"

# Verify setup
python scripts/verify_setup.py
python scripts/test_network.py
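
If scripts/verify_setup.py reports problems, the core requirements can also be checked by hand with a few lines of generic Python (this is not the project's verification script):

# Quick manual environment check (run inside the activated virtual environment)
import platform
import mlx.core as mx

print("python:", platform.python_version())  # expect 3.11+
print("macOS: ", platform.mac_ver()[0])      # expect 14.0+
print("mlx:   ", mx.__version__)             # expect 0.20.0+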

Start Training

# On primary device (e.g., Mac Studio M2 Ultra)
./scripts/start_training.sh --role primary

# On secondary device (e.g., MacBook M3 Max)
./scripts/start_training.sh --role secondary

Monitor Progress

# Open dashboard
open http://localhost:8050

# Watch logs
tail -f logs/training.log

Network Requirements

  • High-speed connection (10Gbps+ recommended)
  • Low latency (<1ms between devices)
  • SSH access configured between devices

Documentation

Implementation Details

Our distributed training implementation follows MLX's recommended practices:

  1. Data Parallelism (a minimal sketch follows this list):

    • Each device maintains a complete model copy
    • Data is sharded across devices
    • Gradients synchronized using mx.distributed.all_sum
    • Weights broadcast periodically for consistency
  2. Memory Management:

    • Dynamic batch sizing based on device capabilities
    • Gradient accumulation for effective larger batches
    • Activation checkpointing for memory efficiency
    • Streaming data loading to manage memory usage
  3. Performance Optimization:

    • Mixed precision training
    • Separate compute/memory streams
    • Flash Attention implementation
    • Grouped Query Attention (GQA)
    • Optimized memory layout
  4. Monitoring and Recovery:

    • Real-time performance dashboard
    • Automatic error recovery
    • Checkpoint management
    • Network health monitoring
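
The data-parallel step in item 1 can be sketched with MLX's public API as shown below. This is a simplified, self-contained illustration (a stand-in linear model, no gradient accumulation, mixed precision, or checkpointing), not this repository's training loop; the gradient averaging via mx.distributed.all_sum is the key pattern.

# Simplified data-parallel training step: each rank computes gradients on its own
# shard of the batch, gradients are summed across ranks and divided by the world
# size, then every rank applies the same update.
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlx.utils import tree_map

group = mx.distributed.init()
world_size = group.size()

model = nn.Linear(16, 4)                 # stand-in for the real transformer
optimizer = optim.AdamW(learning_rate=1e-3)

def loss_fn(model, inputs, targets):
    return nn.losses.cross_entropy(model(inputs), targets, reduction="mean")

loss_and_grad = nn.value_and_grad(model, loss_fn)

def train_step(inputs, targets):
    loss, grads = loss_and_grad(model, inputs, targets)
    # Average gradients across devices before the optimizer update.
    grads = tree_map(lambda g: mx.distributed.all_sum(g) / world_size, grads)
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state, loss)
    return loss

# Each rank trains on its own shard of the data.
inputs = mx.random.normal((8, 16))       # local shard of the batch
targets = mx.random.randint(0, 4, (8,))  # integer class labels
print(f"rank {group.rank()}: loss = {train_step(inputs, targets).item()}")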

For more details on MLX's distributed capabilities, see the MLX documentation on distributed communication.

Troubleshooting

Common Issues

  1. Network Communication Errors

    • Verify SSH keys are properly configured between devices
    • Check network bandwidth using scripts/test_network.py
    • Ensure all devices are on the same subnet
    • Try reducing batch_size if experiencing timeouts
  2. Memory Issues

    • Enable gradient checkpointing in config
    • Reduce model size or batch size
    • Monitor memory usage with dashboard
    • Use streaming dataset loading
  3. Performance Problems

    • Verify Metal is properly configured (see the diagnostic sketch after this list)
    • Check CPU/GPU utilization
    • Monitor network bandwidth
    • Adjust number of worker processes
  4. Installation Issues

    • Verify Python version compatibility
    • Check MLX installation
    • Review system requirements
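
For the memory and performance checks above, a few generic MLX calls can confirm that Metal is visible and show how much memory MLX is using. This is not the project's dashboard or test scripts, and these metal helpers may move between modules in future MLX releases:

# Quick Metal and memory diagnostics (generic MLX; adjust the workload to taste)
import mlx.core as mx

print("Metal available:", mx.metal.is_available())

a = mx.random.normal((4096, 4096))
mx.eval(a @ a)                                        # run a small GPU workload
print("active memory (bytes):", mx.metal.get_active_memory())
print("peak memory (bytes):  ", mx.metal.get_peak_memory())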

For more detailed troubleshooting:

Performance Tuning

For detailed information about our hardware configuration, training process, and performance optimizations, please see our Performance Tuning Guide. This guide includes:

  • Current hardware specifications and configurations
  • Training time estimates and comparisons
  • Detailed performance optimization strategies
  • Memory management techniques
  • Monitoring and stability measures

Contributing

  1. Fork the repository
  2. Create your feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

MIT License - See LICENSE for details.
