These scripts provide an easy way to get started with multi-node FSDP training on EKS. They are designed to be as simple as possible, require no data preparation, and use a container image. If you would like to run FSDP with SLURM, please refer to README.md.
Before running this training, you'll need to create an Amazon EKS or a SageMaker HyperPod EKS cluster. Instructions can be found in 1.architectures, the aws-do-eks project, or the eks-blueprints project.
Clone this repo.
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/10.FSDP
This example requires building a container image, which uses the nccl-tests container image as its base; that base image is therefore a prerequisite. It can either be pulled from gallery.ecr.aws/hpc-cloud:
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/hpc-cloud
docker pull public.ecr.aws/hpc-cloud/nccl-tests:latest
or it can be built using the code below.
pushd ../../micro-benchmarks/nccl-tests
docker build -t nccl-tests:latest -f nccl-tests.Dockerfile .
popd
It is recommended to use the public image as the base, since building the base image from source can take longer than 30 minutes. If you decide to build the base image from source, change the `FROM` line in the Dockerfile to point to your local base image before building the fsdp image.
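For example, assuming you tagged your locally built base image `nccl-tests:latest` as in the build step above and the Dockerfile has a single `FROM` line, the change could be made along these lines (a sketch using GNU sed; adjust the tag to whatever you used):

```bash
# Rewrite the FROM line of the fsdp Dockerfile to use the locally built base image
# (assumes a single-stage Dockerfile and the hypothetical local tag nccl-tests:latest)
sed -i 's|^FROM .*|FROM nccl-tests:latest|' Dockerfile
```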
If the envsubst utility is not available in your environment, please install it, following the instructions appropriate for your operating system.
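`envsubst` is provided by the GNU gettext package; for example, depending on your platform:

```bash
# Install gettext, which provides envsubst (pick the command for your platform)
sudo apt-get install -y gettext     # Debian/Ubuntu
# sudo yum install -y gettext       # Amazon Linux / RHEL
# brew install gettext              # macOS (envsubst may require: brew link --force gettext)
```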
Deploy the Kubeflow training operator
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
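Before submitting jobs, you can confirm the operator came up; the standalone manifest typically installs it into the kubeflow namespace:

```bash
# Verify the training operator pod is running
kubectl get pods -n kubeflow
```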
Build a container image for this example using the code below:
export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
docker build -t ${REGISTRY}fsdp:pytorch2.2 .
In this step we create a container registry if one does not exist, and push the container image to it.
# Create registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"fsdp\" | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
aws ecr create-repository --repository-name fsdp
fi
# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
# Push image to registry
docker image push ${REGISTRY}fsdp:pytorch2.2
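Optionally, verify that the image landed in the repository:

```bash
# List images in the fsdp repository to confirm the push succeeded
aws ecr describe-images --repository-name fsdp
```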
For this example, we'll be using the C4 dataset, which is several hundred gigabytes. Instead of downloading the whole thing, the `create_streaming_dataloaders` function streams the dataset from HuggingFace, so there's no data preparation required to run this training. If you'd like to use your own dataset instead, you can do so by formatting it as a HuggingFace dataset and passing its location to the `--dataset_path` argument.
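As a rough sketch, a local corpus could be converted into a HuggingFace dataset on disk and its path passed to the job. The paths below are hypothetical, and whether `--dataset_path` expects an on-disk dataset or a HuggingFace Hub dataset name should be verified against the training script:

```bash
# Hypothetical example: build a HuggingFace dataset from a local JSONL corpus on a
# shared filesystem, then pass its location via --dataset_path (paths are illustrative)
python -c "
from datasets import load_dataset
ds = load_dataset('json', data_files='/fsx/data/corpus.jsonl')
ds.save_to_disk('/fsx/data/my_hf_dataset')
"
```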
Set the environment variables below to match your setup, then generate the Kubernetes manifest and apply it to the cluster.
export IMAGE_URI=${REGISTRY}fsdp:pytorch2.2
export INSTANCE_TYPE=
export NUM_NODES=
export GPU_PER_NODE=
export EFA_PER_NODE=
export FI_PROVIDER=efa
cat fsdp.yaml-template | envsubst > fsdp.yaml
kubectl apply -f ./fsdp.yaml
EFA-level variables are available for adjustment in `fsdp.yaml-template`. Keep the `FI_*` values commented out for non-EFA instances (G5, G4dn, P3) and for P5; uncomment the `FI_*` values for P4d instances.
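As an illustration of the exports above only, a four-node P4d cluster might use values like the following; substitute the values for your own nodes:

```bash
# Hypothetical example values for a 4-node p4d.24xlarge cluster
# (8 GPUs and 4 EFA interfaces per node); adjust to your own instance type
export INSTANCE_TYPE=p4d.24xlarge
export NUM_NODES=4
export GPU_PER_NODE=8
export EFA_PER_NODE=4
```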
You can also adjust the training parameters in `TRAINING_ARGS` (for example, to train Llama 2 70B). Additional parameters can be found in `model/arguments.py`. Note that we use the same directory for both `--checkpoint_dir` and `--resume_from_checkpoint`. If there are multiple checkpoints, `--resume_from_checkpoint` will automatically select the most recent one. This way, if our training is interrupted for any reason, it will automatically pick up the most recent checkpoint.
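For instance, a minimal sketch of how the two flags line up, assuming a shared checkpoint directory mounted into the pods (the path and variable names here are hypothetical):

```bash
# Hypothetical: both flags point at the same shared directory so a restarted job
# automatically resumes from the most recent checkpoint found there
CHECKPOINT_DIR=/checkpoints/llama2-7b
TRAINING_CHECKPOINT_ARGS="--checkpoint_dir=${CHECKPOINT_DIR} --resume_from_checkpoint=${CHECKPOINT_DIR}"
echo "${TRAINING_CHECKPOINT_ARGS}"
```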
To see the status of your job, use the commands below
kubectl get pytorchjob
kubectl get pods
NAME STATE AGE
fsdp Running 5m
NAME READY STATUS RESTARTS AGE
etcd-7787559c74-l9g92 1/1 Running 0 5m
fsdp-worker-0 1/1 Running 0 5m
fsdp-worker-1 1/1 Running 0 5m
fsdp-worker-2 1/1 Running 0 5m
fsdp-worker-3 1/1 Running 0 5m
fsdp-worker-4 1/1 Running 0 5m
fsdp-worker-5 1/1 Running 0 5m
fsdp-worker-6 1/1 Running 0 5m
fsdp-worker-7 1/1 Running 0 5m
Each of the pods produces job logs. One of the pods is elected master during job initialization. Only this pod will show the progress of the training job in its log. To find out which pod is currently the master, run the command below.
kubectl logs fsdp-worker-0 | grep master_addr=
[2024-06-11 18:59:56,193] torch.distributed.elastic.agent.server.api: [INFO] master_addr=fsdp-worker-1
This shows that the pod fsdp-worker-1 is currently the master. To look at the current job logs, use the command below:
kubectl logs -f fsdp-worker-1
...
2024-06-12 00:08:25 I [train.py:102] Batch 979 Loss: 5.63272, Speed: 0.43 samples/sec, lr: 0.000091
2024-06-12 00:08:44 I [train.py:102] Batch 980 Loss: 5.63327, Speed: 0.43 samples/sec, lr: 0.000091
2024-06-12 00:09:03 I [train.py:102] Batch 981 Loss: 5.95147, Speed: 0.43 samples/sec, lr: 0.000091
2024-06-12 00:09:21 I [train.py:102] Batch 982 Loss: 5.45894, Speed: 0.43 samples/sec, lr: 0.000091
To stop the current training job, use the following command.
kubectl delete -f ./fsdp.yaml
If you wish to launch a new job, you must first stop the previous one, even if it is in the `Completed` state.
To modify training for a 13B or 70B Llama 2 model, just change the corresponding parameters based on the values in the Llama 2 paper:
Param | 7B | 13B | 70B |
---|---|---|---|
intermediate_size | 11008 | 13824 | 28672 |
num_key_value_heads | 32 | 40 | 8 |
hidden_width | 4096 | 5120 | 8192 |
num_layers | 32 | 40 | 80 |
num_heads | 32 | 40 | 64 |
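For example, a 70B configuration would use the values from the last column. The sketch below assumes the argument names match the parameter names in the table, so verify them against `model/arguments.py` before merging them into `TRAINING_ARGS` in `fsdp.yaml-template`:

```bash
# Hypothetical Llama 2 70B arguments built from the table above; confirm exact
# flag names in model/arguments.py before editing TRAINING_ARGS
LLAMA2_70B_ARGS="--hidden_width=8192 \
  --num_layers=80 \
  --num_heads=64 \
  --num_key_value_heads=8 \
  --intermediate_size=28672"
echo "${LLAMA2_70B_ARGS}"
```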