This module builds on PyTorch/XLA and enables you to reliably instantiate a PyTorch Distributed Cloud TPU training environment using Google Cloud Build.
This module does the following:
- Creates a Cloud TPU pod
- Creates an NFS share that allows code to be shared between compute instances
- Creates a GCE Managed Instance Group (MIG) sized to match the Cloud TPU pod
- Allows you to specify a script that customizes the image in the instance group
- Creates a shared persistent disk (PD) that hosts the dataset used for training
- Allows you to specify a script that prepares the data before loading it onto the shared persistent disk
Clone the repo to your local environment.
git clone https://github.com/mugithi/google-terraform-pytorch-tpu.git
cd google-terraform-pytorch-tpu
Enable the required GCP services using the following command:
gcloud services enable cloudbuild.googleapis.com \
compute.googleapis.com \
iam.googleapis.com \
tpu.googleapis.com \
file.googleapis.com
export PROJECT=$(gcloud info --format='value(config.project)')
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT --format 'value(projectNumber)')
export CB_SA_EMAIL=${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:$CB_SA_EMAIL --role='roles/iam.serviceAccountUser'
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:$CB_SA_EMAIL --role='roles/compute.admin'
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:$CB_SA_EMAIL --role='roles/iam.serviceAccountActor'
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:$CB_SA_EMAIL --role='roles/file.editor'
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:$CB_SA_EMAIL --role='roles/compute.securityAdmin'
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:$CB_SA_EMAIL --role='roles/storage.admin'
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:$CB_SA_EMAIL --role='roles/tpu.admin'
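Optionally, you can confirm the bindings were applied by listing the roles granted to the Cloud Build service account; this sanity check uses standard gcloud flags:
gcloud projects get-iam-policy $PROJECT --flatten="bindings[].members" --filter="bindings.members:${CB_SA_EMAIL}" --format="table(bindings.role)"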
Modify the values file and set the training environment build ID and project values. Initialize the environment using the command below.
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=initialize
Initializing the training environment creates GCS buckets that store both configuration information and training information, as follows:
- tf_backend: Terraform state for the Filestore share, Cloud TPU, and MIG
- tf_backend/workspace: Workspace that stores the environment variables used by Cloud Build in values.env
- tf_backend/workspace/env_setup/scripts/: Scripts used to modify the instance group
- tf_backend/workspace/env_setup/models/: Scripts loaded at start time to configure the instance group for training a particular model. It comes preloaded with RoBERTa on Fairseq
- tf_backend: Model-specific training scripts
- dataset: Bucket that stores the training dataset
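To confirm the buckets were created, you can list the buckets in your project; the exact bucket names include a suffix derived from your project and environment, so they will vary:
gsutil ls -p ${PROJECT}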
Each version of the environment is tracked using the ENV_BUILD_NAME variable, which is unique to each environment. To create a separate environment, specify a new ENV_BUILD_NAME.
It is recommended that you keep a separate copy of the cloned Cloud Build repo for each environment so that you can easily version your scripts and values file.
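As an illustration, the variables referenced throughout this README could be set in values.env along the lines of the snippet below. The values shown are placeholders taken from examples elsewhere in this README, and your values file may contain additional settings not covered here:
# Illustrative values.env snippet (placeholders only; not exhaustive)
ENV_BUILD_NAME="my-roberta-env"      # unique name per environment
TPU_ACCELERATOR_TYPE="v3-32"         # Cloud TPU pod slice size
TPU_PYTORCH_VERSION="pytorch-1.5"    # or a torch-nightly build
GCE_IMAGE_VERSION=""                 # empty = latest nightly image
SHARED_PD_DISK_ATTACH="true"         # attach the shared PD to the MIG
SHARED_PD_DISK_SIZE="1024"           # shared PD size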
Modify the values file and set the Cloud TPU, managed instance group, and shared NFS parameters. Create the training environment using the command below.
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=create
Running this command creates the Filestore share, Cloud TPU, and Managed Instance Group using the values in the values file.
Please note that if the SHARED_PD_DISK_ATTACH variable is set to true and the shared persistent disk has not been initialized, you will see a notFound error:
Step #5 - "terraform-google-mig": Step #1 - "terraform-google-mig": Error: Error creating InstanceGroupManager: googleapi: Error 404: The resource 'projects/pytorch-tpu-cb-test/zones/europe-west4-a/disks/pd-ssd-2055-20200430' was not found, notFound
To resolve this error, change the SHARED_PD_DISK_ATTACH variable to false, or create the shared persistent disk first using the _BUILD_ACTION=create,_DISK=true substitution.
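For example, the full command to create the environment together with the shared persistent disk, following the same command pattern used throughout this README, would be:
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=create,_DISK=true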
After initializing the environment, you can begin training your PyTorch models on Cloud TPU. The following example models are available in this repo for you to start with:
- Test model using ImageNet and synthetic data
- RoBERTa model on Fairseq
- COMING SOON: wav2vec model on Fairseq
Modify the values file and set the shared persistent disk and GCS training dataset parameters. Initialize the shared persistent disk using the command below.
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=update,_DISK=true,_MIG=true
When you run the _BUILD_ACTION=update,_DISK=true command:
- A new persistent disk is created if none exists; if one already exists, it is destroyed and a new one is created
- The disk is mounted to a temporary GCE instance that runs any data preparation specified in the data_prep_seed_shared_disk_pd.sh script
- The new disk is formatted as ext4 and mounted at the path $MOUNT_POINT/shared_pd specified by the mount point variable in the values file
- The new shared persistent disk is mounted to the managed instance group as a read-only volume
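As a minimal sketch of what a data preparation step might look like, the snippet below stages training data from the GCS dataset bucket onto the mounted shared disk. The bucket name and mount path are illustrative assumptions, not the repo's actual script:
#!/bin/bash
# Hypothetical data-prep payload: copy the training dataset from the
# GCS dataset bucket onto the freshly formatted shared persistent disk.
# DATASET_BUCKET and the mount path are illustrative assumptions.
set -euo pipefail
DATASET_BUCKET="gs://your_project_id-dataset"   # adjust to your bucket
gsutil -m cp -r "${DATASET_BUCKET}/*" "${MOUNT_POINT}/shared_pd/"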
The _BUILD_ACTION=update,_DISK=true update command can be used to reload new training data onto the shared persistent disk.
Please note that updates to the shared persistent disk only take place if you change the SHARED_PD_DISK_SIZE="XXXX" variable. If you do not change the size of the persistent disk when running an update, you will see the following error:
Step #1 - "terraform-google-disk": Step #0 - "terraform-google-disk-seed": Error: Error creating instance: googleapi: Error 400: The disk resource 'projects/xxxx' is already being used by 'projects/xxxx', resourceInUseByAnotherResource
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=update,_TPU=true
When this command is run, a new Cloud TPU is created or the existing one is updated.
The update command can be used to upgrade from a v3-8 to a v3-128 by changing the TPU_ACCELERATOR_TYPE="v3-32" variable, or to move the Cloud TPU PyTorch version from torch-1.5 to torch nightly by changing the TPU_PYTORCH_VERSION="pytorch-1.5" variable.
The update command can also be used to recreate the Cloud TPU after destroying it with the gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=destroy,_TPU=true command.
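For example, the destroy-then-recreate sequence is (both substitutions appear elsewhere in this README):
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=destroy,_TPU=true
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=update,_TPU=true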
If you specify a GCE torch-nightly version using the GCE_IMAGE_VERSION="20200427" variable, Cloud Build will configure the Cloud TPU runtime to match the MIG GCE image version. If no value is set in the GCE_IMAGE_VERSION="" variable, the latest nightly version is used.
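To check the Cloud TPU after an update, you can list the TPU nodes; the zone below is taken from the example error earlier and may differ for your deployment:
gcloud compute tpus list --zone=europe-west4-a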
Please note that updating the Cloud TPU pod does not modify the MIG. To change both the Cloud TPU and the MIG, both need to be explicitly included in the Cloud Build substitutions as follows: _BUILD_ACTION=update,_TPU=true,_MIG=true
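For example, the full command to update both in one build is:
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=update,_TPU=true,_MIG=true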
Update the MIG, either together with the shared persistent disk or on its own, using one of the commands below.
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=update,_DISK=true,_MIG=true
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=update,_MIG=true
When this command is run, a new MIG is created or the existing one is updated.
The update command can be used to change the number of VMs in the MIG by changing the TPU_ACCELERATOR_TYPE="v3-32" variable, or the size of the shared persistent disk that stores the training data by changing the SHARED_PD_DISK_SIZE='1024' variable.
If you specify a GCE torch-nightly version using the GCE_IMAGE_VERSION="20200427" variable and set the PyTorch version in the TPU_PYTORCH_VERSION="pytorch-1.5" variable, Cloud Build will provision a MIG using the specified torch-nightly GCE_IMAGE version. In all other cases, Cloud Build will use the latest nightly version.
Please note that updating the Cloud TPU environment does not modify the MIG size. To change both the Cloud TPU and the MIG, both need to be explicitly included in the Cloud Build substitutions as follows: _BUILD_ACTION=update,_TPU=true,_MIG=true
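To verify the MIG after an update, you can list the managed instance groups in the project; the zone filter below reuses the zone from the earlier example error and is an assumption for your deployment:
gcloud compute instance-groups managed list --filter="zone:europe-west4-a"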
Destroy the training environment using the command below.
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=destroy
Please note that destroying the environment does not remove the GCS buckets or the shared persistent disk. You can recreate the training environment by simply rerunning the _BUILD_ACTION=create command.
To completely destroy the entire environment, you need to run the step above and then destroy the shared persistent disk and the GCS buckets.
To delete the shared persistent disk, run the command below.
gcloud builds submit --config=cloudbuild.yaml . --substitutions _BUILD_ACTION=destroy,_DISK=true
To delete the GCS buckets, navigate to Cloud Storage in the Google Cloud Console and delete the buckets titled:
- your_project_id*-dataset
- your_project_id*-tf-backend
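Alternatively, the buckets can be removed from the command line. The wildcard pattern below mirrors the bucket names above and is an assumption about your exact bucket naming; note that this command is destructive and irreversible:
gsutil -m rm -r "gs://your_project_id*-dataset" "gs://your_project_id*-tf-backend"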