
GKE Cluster with NVIDIA GPU Operator

Google Cloud Platform account and project
gcloud CLI
Terraform CLI
Running this module requires elevated permissions (Kubernetes Engine Admin) in your GCP account, specifically permissions to create VPC networks, GKE clusters, and Compute Engine nodes. It will not work on accounts using the "free plan", because GPU nodes are unavailable until a billing account is attached and activated.
You will need both the Kubernetes Engine API and the Compute Engine API enabled. Click the GKE tab in the GCP panel for your project and enable the GKE API, which also enables the Compute Engine API at the same time.
Ensure you have GPU quota in your desired region/zone. You can [request](GPU Quota) it if it is not enabled in a new account. You will need quota for both GPUS_ALL_REGIONS and the specific GPU in the desired region.
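
The prerequisites above can also be checked from the command line. The following is a minimal sketch and is not part of this module: the function names `missing_tools`, `enable_apis`, and `show_gpu_quota` are hypothetical helpers, and `<PROJECT_ID>` and `<REGION>` are placeholders. It assumes the GKE and Compute Engine APIs are enabled through the `container.googleapis.com` and `compute.googleapis.com` services.

```shell
# Print the name of every listed tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Enable the GKE and Compute Engine APIs for the project given as $1.
enable_apis() {
  gcloud services enable container.googleapis.com compute.googleapis.com \
    --project "$1"
}

# List GPU-related quotas for project $1: first the project-wide quotas
# (which include GPUS_ALL_REGIONS), then the quotas of region $2.
show_gpu_quota() {
  gcloud compute project-info describe --project "$1" \
    --flatten="quotas[]" \
    --format="table(quotas.metric,quotas.limit,quotas.usage)" | grep GPU
  gcloud compute regions describe "$2" --project "$1" \
    --flatten="quotas[]" \
    --format="table(quotas.metric,quotas.limit,quotas.usage)" | grep GPU
}

# Example usage (placeholders, replace with your own values):
#   missing_tools gcloud terraform
#   enable_apis <PROJECT_ID>
#   show_gpu_quota <PROJECT_ID> <REGION>
```

If `missing_tools` prints nothing, both CLIs are installed. A limit of 0 in the quota output means a quota request is needed before the cluster can be created.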

Copy terraform.tfvars.example to terraform.tfvars and update the values as desired.

cp terraform.tfvars.example terraform.tfvars

gcloud auth application-default login

terraform init

terraform plan

terraform apply
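
An optional variation on the plan and apply steps, sketched here as an assumption and not something this module requires, is to save the plan to a file so that `terraform apply` executes exactly what was reviewed. The file name `tfplan` and the function names `save_plan` and `apply_saved_plan` are hypothetical.

```shell
# Hypothetical file name for the saved plan.
plan_file="tfplan"

# Write the execution plan to $plan_file so it can be reviewed first.
save_plan() {
  terraform plan -out="$plan_file"
}

# Apply exactly the plan that was saved, without re-planning.
apply_saved_plan() {
  terraform apply "$plan_file"
}

# Example usage:
#   save_plan
#   apply_saved_plan
```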

After the cluster has been created, you can connect to it with kubectl by running the following two commands:

gcloud components install gke-gcloud-auth-plugin
gcloud container clusters get-credentials <CLUSTER_NAME> --region=<REGION>
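
Once credentials are in place, a quick way to confirm that the GPU nodes are usable is to list the nodes with their allocatable GPUs. This is a minimal sketch, assuming `kubectl` is installed and that the GPU nodes advertise the standard `nvidia.com/gpu` resource once the operator's device plugin is running. The function name `show_gpu_nodes` is hypothetical.

```shell
# List cluster nodes with the number of allocatable NVIDIA GPUs on each.
# The dot in "nvidia.com" is escaped so it is not read as a path separator.
show_gpu_nodes() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Example usage:
#   show_gpu_nodes
```

Nodes without GPUs typically show `<none>` in the GPUS column. A GPU node may also show it for a few minutes while the operator is still starting.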

Note: Steps 4-6 can be done automatically by running the setup.sh script in this directory.

chmod +x setup.sh
./setup.sh

You can delete the cluster with the following command:

terraform destroy