diff --git a/infrastructure/deployments/README.md b/infrastructure/deployments/README.md
new file mode 100644
index 0000000..575a83f
--- /dev/null
+++ b/infrastructure/deployments/README.md
@@ -0,0 +1,124 @@
+# Deploying standalone KFP pipelines on top of the cluster
+
+## Deploying KFP pipeline
+
+[kubeflow pipelines documentation](https://www.kubeflow.org/docs/components/pipelines/installation/standalone-deployment/#disable-the-public-endpoint)
+
+[kubeflow pipelines github](https://github.com/kubeflow/pipelines/tree/0.5.1)
+
+The documentation instructs you to deploy Kubeflow Pipelines on a GKE cluster; we have already deployed the cluster with Terraform.
+
+Make sure you are authenticated to the GKE cluster hosting KFP:
+```commandline
+gcloud container clusters get-credentials fondant-cluster --zone=europe-west4-a
+```
+
+## Customizing GCS and Cloud SQL for Artifact, Pipeline and Metadata storage
+
+[GCP services setup](https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/sample)
+
+We will set up `GCS` to store artifacts and pipeline specs. This is done by deploying the `minio-gcs gateway` service.
+We will also use `CloudSQL` to store all the ML metadata associated with our pipeline runs. This guarantees that you are
+able to retrieve metadata (experiment outputs, lineage visualization, ...) from previous pipeline runs in case the GKE cluster where KFP is deployed gets deleted.
+
+First clone the [Kubeflow Pipelines GitHub repository](https://github.com/kubeflow/pipelines), and use it as your working directory.
+
+```commandline
+git clone https://github.com/kubeflow/pipelines
+```
+
+Next, we need to customize our own values in the deployment manifests before deploying the additional services.
+
+**Note**: You will need the `CloudSQL` root user password that was created in the Terraform setup. In your GCP project, go to
+`Secret Manager` and retrieve the `sql-key` password that was stored there.
+
+### Customize values
+
+Make sure to modify the following files:
+
+* `manifests/kustomize/env/gcp/params.env`
+
+```bash
+pipelineDb=pipelinedb
+mlmdDb=metadb
+cacheDb=cachedb
+bucketName=-kfp-artifacts # bucket to store the artifacts and pipeline specs (created in TF)
+gcsProjectId=::kfp-metadata # Metadata db (created in TF)
+```
+
+* `manifests/kustomize/base/installs/generic/mysql-secret.yaml`
+
+Specify the `root` user password retrieved from `Secret Manager` in the designated fields. Make sure **you do not commit** the secret to your git history.
+
+### Applying the customized resources
+
+After specifying the required parameters, you can now install the additional services:
+
+```commandline
+kubectl apply -k manifests/kustomize/cluster-scoped-resources
+kubectl wait crd/applications.app.k8s.io --for condition=established --timeout=60s
+kubectl apply -k manifests/kustomize/env/gcp
+```
+
+Deploying the GCP resources may take 3-5 minutes. You can check the status of the newly deployed services with:
+```commandline
+kubectl -n kubeflow get pods
+```
+
+## Installing GPU drivers
+
+Next, we need to install the GPU drivers on the GPU node pools to use them in KFP.
+For the driver installation to take effect, you have to manually scale the GPU
+pool up from 0 to 1 (more on this issue [here](https://github.com/kubeflow/pipelines/issues/2561)).
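+For example, assuming the GPU node pool is called `gpu-pool` (a hypothetical name; use the actual pool, cluster and zone from your setup), you could scale it up with:
+
+```commandline
+# gpu-pool is a placeholder; replace it with the name of your GPU node pool
+gcloud container clusters resize fondant-cluster --node-pool gpu-pool --num-nodes 1 --zone europe-west4-a
+```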
+
+After that, apply:
+
+```commandline
+kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
+```
+
+After installing the drivers, you can set the pool back to 0. The pool will scale
+back up again when it is needed by pipeline steps, since autoscaling is enabled.
+
+## Installing the Spark operator (Optional)
+An additional installation is required to set up the [Kubernetes operator for Spark](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator#installation).
+
+```commandline
+helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
+helm install my-release spark-operator/spark-operator --namespace kubeflow
+```
+This will install the Kubernetes Operator for Apache Spark into the namespace `kubeflow`.
+By default, the operator watches and handles SparkApplications in every namespace.
+
+## Deleting the KFP services
+Run the following commands to delete the deployed KFP services from your GKE cluster.
+
+```commandline
+kubectl delete -k manifests/kustomize/cluster-scoped-resources
+kubectl delete -k manifests/kustomize/env/gcp
+```
+
+## Accessing the KFP UI
+There are three ways to connect to the KFP UI. First, make sure you are authenticated to the GKE cluster hosting KFP.
+
+**1) Port-forwarding to access the Kubernetes service**
+```commandline
+kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
+```
+
+Open the Kubeflow Pipelines UI at http://localhost:8080
+
+**2) Using the IAP URL of KFP**
+```commandline
+kubectl describe configmap inverse-proxy-config -n kubeflow | grep googleusercontent.com
+```
+
+**3) AI Platform**
+
+Alternatively, you can visualize your pipeline from the AI Platform UI in the `pipelines` section.
+
diff --git a/infrastructure/terraform/.gitignore b/infrastructure/terraform/.gitignore
new file mode 100644
index 0000000..c035e72
--- /dev/null
+++ b/infrastructure/terraform/.gitignore
@@ -0,0 +1,2 @@
+.terraform
+.terraform.lock.hcl
diff --git a/infrastructure/terraform/README.md b/infrastructure/terraform/README.md
new file mode 100644
index 0000000..5b01df7
--- /dev/null
+++ b/infrastructure/terraform/README.md
@@ -0,0 +1,79 @@
+# Terraform by Nimbus
+
+Nimbus can generate [Terraform](https://www.terraform.io/) configuration to help you set up the
+necessary infrastructure at the start of your project and for each generated component.
+
+### Directory structure
+
+The Terraform directory consists of 4 subdirectories:
+- **backend**
+  Contains the configuration for the Terraform backend which stores the state of the main
+  configuration. The backend configuration and state are separated to prevent chicken-and-egg
+  issues. For more info, check the [README.md](./backend/README.md) in the subdirectory.
+- **environment**
+  Contains the `.tfvars` files with the global variables for the project. Originally, this
+  directory contains a single `project.tfvars` file, but when moving to multiple environments,
+  you'll want separate `.tfvars` files per environment. You can pass these files to any Terraform
+  command using the `--var-file` argument.
+- **main**
+  The main Terraform configuration of the project.
+- **modules**
+  [Modules](https://developer.hashicorp.com/terraform/language/modules) that can be shared across
+  configurations or just be used to package logical resources together.
+
+### Terraform version
+
+The supported Terraform version can be found in [versions.tf](./versions.tf).
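+To check which Terraform version you have installed locally (a quick sanity check before running any commands), you can run:
+
+```bash
+terraform version
+```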
+We recommend [tfenv](https://github.com/tfutils/tfenv) to manage different
+Terraform versions on your machine.
+
+### Remote state
+
+By default, Terraform stores state locally in a file named `terraform.tfstate`. When working with
+Terraform in a team, use of a local file makes Terraform usage complicated because each user must
+make sure they always have the latest state data before running Terraform and make sure that nobody
+else runs Terraform at the same time. To solve this, Terraform allows you to store the state remotely.
+Storing the state remotely has more advantages than just making teamwork easier. It
+also increases reliability in terms of backups, keeps your infrastructure secure by making the
+process of applying Terraform updates stable and controlled, and facilitates the use of Terraform
+in automated pipelines. At ML6 we believe using the remote state is the way of working with Terraform.
+
+The Nimbus setup enforces the remote state by generating a `backend.tf` file, which makes sure Terraform
+automatically writes the state data to the cloud. The `backend` subdirectory contains the
+configuration of this backend infrastructure.
+
+### Multiple environments
+
+#### Terraform variables
+
+To keep your Terraform commands nice and short, we recommend one `.tfvars` file per environment.
+If you want to start using multiple environments, duplicate the `project.tfvars` file and rename it appropriately.
+
+```
+.                            .
+├── terraform/               ├── terraform/
+|   ├── environment/         |   ├── environment/
+|   └── project.tfvars ----->|   ├── sbx.tfvars
+|                       └--->|   └── dev.tfvars
+|   ├── modules/             |   ├── modules/
+|   └── main.tf              |   └── main.tf
+└── ...                      └── ...
+```
+
+You can now change variables per environment. For example, you could change the project of your dev environment.
+
+To execute Terraform commands, you now provide one of the tfvars files:
+
+```bash
+terraform plan --var-file environment/sbx.tfvars
+```
+
+#### Terraform workspaces
+
+When you create resources with Terraform, they are recorded in a Terraform state.
+To have multiple environments running at the same time, you will have to create multiple states,
+one for each environment. To do this, use the `terraform workspace` command.
+
+### Component documentation
+
+For more detailed documentation about certain components, check the [Nimbus documentation](https://nimbus-documentation-dot-ml6-internal-tools.uc.r.appspot.com/).
\ No newline at end of file
diff --git a/infrastructure/terraform/README/cluster.md b/infrastructure/terraform/README/cluster.md
new file mode 100644
index 0000000..9ec306a
--- /dev/null
+++ b/infrastructure/terraform/README/cluster.md
@@ -0,0 +1,23 @@
+# README for cluster.tf
+
+## Workload identities
+
+Using workload identities is the recommended way to authenticate GKE pods with Google service accounts. To do this, `cluster.tf` includes all the required boilerplate code that maps Google service accounts to Kubernetes service accounts. This is all you need to start using workload identities.
+
+## Configure your service accounts in Terraform
+
+In `cluster.tf` you will find a variable `service_accounts`. By updating this variable you can configure one or more service accounts. They will automatically be created and initialized to work with workload identities. By default we create a simple service account `primary-pod-sa` without any permissions.
+
+## Assigning the service account to pods
+
+By default, pods in a GKE cluster will not use the service account specified here.
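+To see which Kubernetes service account a pod is currently running as, you can inspect its spec (the pod name below is just a placeholder):
+```
+# replace <pod-name> with the name of one of your pods
+kubectl get pod <pod-name> -o jsonpath='{.spec.serviceAccountName}'
+```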
+Update the yaml definition of the deployment to assign a service account:
+```
+spec:
+  serviceAccountName: KSA_NAME
+```
+
+Note that the name should be the name of the Kubernetes service account that is linked to the Google service account. If you are unsure of the name, run `kubectl get serviceaccounts` to list all available Kubernetes service accounts.
+
+## Relevant documentation
+https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to
diff --git a/infrastructure/terraform/README/wif.md b/infrastructure/terraform/README/wif.md
new file mode 100644
index 0000000..21dd9eb
--- /dev/null
+++ b/infrastructure/terraform/README/wif.md
@@ -0,0 +1,33 @@
+
+# README for Workload Identity Federation
+
+## Prerequisites
+
+You need to have the following IAM roles to apply the Terraform configuration:
+
+- roles/iam.workloadIdentityPoolAdmin
+- roles/iam.serviceAccountAdmin
+
+## Usage
+
+You can use Nimbus to generate the components required for workload identity federation via `nimbus gcp init`.
+
+### Variables
+
+Note: the default value for the `bitbucket_repo_ids` local in `wif.tf` is an empty list.
+You have to update it with the UUIDs of the Bitbucket repositories that should have
+access!
+You can find more info in the Bitbucket documentation linked at the end.
+
+All other variables are set by Nimbus to use the ML6 Bitbucket workspace by default. If you need to connect a different
+workspace, you can check the descriptions of these variables in the generated `wif` module (`modules/wif/variables.tf`).
+
+## Background information
+
+If you want to learn more about the concepts behind workload identity federation, you can check out the following sources:
+
+- [Chapter Conference Talk (April 2022)](https://docs.google.com/presentation/d/1liJi-QdurS1cJ2W57CYy0kbbSLx5GR8YlQPDJXcy3cw/edit?usp=sharing)
+- [Google Documentation](https://cloud.google.com/iam/docs/workload-identity-federation)
+- [Bitbucket Documentation](https://support.atlassian.com/bitbucket-cloud/docs/integrate-pipelines-with-resource-servers-using-oidc/)
+- Google Cloud blogposts: [1](https://cloud.google.com/blog/products/identity-security/enable-keyless-access-to-gcp-with-workload-identity-federation) and [2](https://cloud.google.com/blog/products/identity-security/enabling-keyless-authentication-from-github-actions)
+
diff --git a/infrastructure/terraform/backend/.gitignore b/infrastructure/terraform/backend/.gitignore
new file mode 100644
index 0000000..33d9722
--- /dev/null
+++ b/infrastructure/terraform/backend/.gitignore
@@ -0,0 +1,2 @@
+.terraform*
+terraform.tfstate.backup
\ No newline at end of file
diff --git a/infrastructure/terraform/backend/README.md b/infrastructure/terraform/backend/README.md
new file mode 100644
index 0000000..d01cb9c
--- /dev/null
+++ b/infrastructure/terraform/backend/README.md
@@ -0,0 +1,28 @@
+# Terraform backend infrastructure
+
+This directory defines the infrastructure for the backend where Terraform stores its state. The
+state of the backend cannot be stored in the backend itself, which is why it is separated from the
+main Terraform configuration.
+
+The state of the backend configuration is instead stored locally and can be tracked in git. This is
+an acceptable solution since the backend should rarely be changed.
+
+This means that you now have two separate Terraform configurations and states:
+
+- The backend configuration: for which the state is stored locally and tracked in git.
+- The main configuration: for which the state is stored remotely in the backend. + +## Deployment + +To deploy, run the following steps from this directory: + +```commandline +terraform init +terraform apply --var-file ../environment/project.tfvars +``` + +## Tracking the state + +The state of this infrastructure needs to be tracked separately by adding the +`terraform.tfstate` file to git. Next to the `main.tf` file and `README.md`, all other files can be +ignored. A `.gitignore` file is automatically generated to handle this. diff --git a/infrastructure/terraform/backend/main.tf b/infrastructure/terraform/backend/main.tf new file mode 100644 index 0000000..f7078ab --- /dev/null +++ b/infrastructure/terraform/backend/main.tf @@ -0,0 +1,44 @@ +/****************************************** + Variables + *****************************************/ + +variable "project" { + description = "GCP project name" + type = string +} + +variable "region" { + description = "Default GCP region for resources" + type = string + default = "europe-west1" +} + +variable "zone" { + description = "Default GCP zone for resources" + type = string + default = "europe-west1-b" +} + +/****************************************** + Google provider configuration + *****************************************/ + +provider "google" { + project = var.project + region = var.region + zone = var.zone +} + +/****************************************** + State storage configuration + *****************************************/ + +resource "google_storage_bucket" "terraform_state" { + name = "${var.project}_terraform" + location = var.region + uniform_bucket_level_access = true + + versioning { + enabled = true + } +} diff --git a/infrastructure/terraform/backend/terraform.tfstate b/infrastructure/terraform/backend/terraform.tfstate new file mode 100644 index 0000000..a4a8878 --- /dev/null +++ b/infrastructure/terraform/backend/terraform.tfstate @@ -0,0 +1,52 @@ +{ + "version": 4, + "terraform_version": "1.4.6", + "serial": 1, + "lineage": "7e6b8586-fe78-96ad-aaca-ba5027c2c6e4", + "outputs": {}, + "resources": [ + { + "mode": "managed", + "type": "google_storage_bucket", + "name": "terraform_state", + "provider": "provider[\"registry.terraform.io/hashicorp/google\"]", + "instances": [ + { + "schema_version": 0, + "attributes": { + "autoclass": [], + "cors": [], + "custom_placement_config": [], + "default_event_based_hold": false, + "encryption": [], + "force_destroy": false, + "id": "boreal-array-387713_terraform", + "labels": {}, + "lifecycle_rule": [], + "location": "EUROPE-WEST1", + "logging": [], + "name": "boreal-array-387713_terraform", + "project": "boreal-array-387713", + "public_access_prevention": "inherited", + "requester_pays": false, + "retention_policy": [], + "self_link": "https://www.googleapis.com/storage/v1/b/boreal-array-387713_terraform", + "storage_class": "STANDARD", + "timeouts": null, + "uniform_bucket_level_access": true, + "url": "gs://boreal-array-387713_terraform", + "versioning": [ + { + "enabled": true + } + ], + "website": [] + }, + "sensitive_attributes": [], + "private": "eyJlMmJmYjczMC1lY2FhLTExZTYtOGY4OC0zNDM2M2JjN2M0YzAiOnsiY3JlYXRlIjo2MDAwMDAwMDAwMDAsInJlYWQiOjI0MDAwMDAwMDAwMCwidXBkYXRlIjoyNDAwMDAwMDAwMDB9fQ==" + } + ] + } + ], + "check_results": null +} diff --git a/infrastructure/terraform/environment/project.tfvars b/infrastructure/terraform/environment/project.tfvars new file mode 100644 index 0000000..5aa8698 --- /dev/null +++ b/infrastructure/terraform/environment/project.tfvars @@ -0,0 +1,3 
@@ +project = "boreal-array-387713" +region = "europe-west4" +zone = "europe-west4-a" \ No newline at end of file diff --git a/infrastructure/terraform/main/artifactregistry.tf b/infrastructure/terraform/main/artifactregistry.tf new file mode 100644 index 0000000..874a06a --- /dev/null +++ b/infrastructure/terraform/main/artifactregistry.tf @@ -0,0 +1,33 @@ +resource "google_project_service" "artifactregistry" { + project = var.project + service = "artifactregistry.googleapis.com" +} + +resource "google_artifact_registry_repository" "docker_artifact" { + provider = google-beta + location = var.region + project = var.project + repository_id = "${var.project}-default-repository" + format = "DOCKER" + + depends_on = [google_project_service.artifactregistry] +} + +resource "google_artifact_registry_repository" "kfp_template_artifact" { + location = var.region + project = var.project + repository_id = "${var.project}-kfp-template-repository" + format = "KFP" + + depends_on = [google_project_service.artifactregistry] +} + + +resource "google_storage_bucket" "cloudbuild" { + name = "${var.project}_cloudbuild_artifacts" + location = var.region + uniform_bucket_level_access = true + versioning { + enabled = true + } +} diff --git a/infrastructure/terraform/main/backend.tf b/infrastructure/terraform/main/backend.tf new file mode 100644 index 0000000..6311f44 --- /dev/null +++ b/infrastructure/terraform/main/backend.tf @@ -0,0 +1,13 @@ + +/****************************************** + Remote backend configuration + *****************************************/ + +# setup of the backend gcs bucket that will keep the remote state + +terraform { + backend "gcs" { + bucket = "boreal-array-387713_terraform" + prefix = "terraform/state" + } +} diff --git a/infrastructure/terraform/main/cluster.tf b/infrastructure/terraform/main/cluster.tf new file mode 100644 index 0000000..f8442ba --- /dev/null +++ b/infrastructure/terraform/main/cluster.tf @@ -0,0 +1,94 @@ +/****************************************** + Variables + *****************************************/ + +variable "master_authorized_networks" { + description = "IPv4 CIDR blocks that are authorized to connect to cluster master." 
+ type = list(object({ cidr_block = string, display_name = string })) + default = [ + { + cidr_block = "81.245.5.156/32" + display_name = "ML6-office-1" + }, + { + cidr_block = "84.198.172.145/32" + display_name = "ML6-office-2" + }, + { + cidr_block = "81.245.214.29/32" + display_name = "Philippe-home" + }, + { + cidr_block = "109.130.56.145/32" + display_name = "Robbe-home" + }, + { + cidr_block = "178.117.11.246/32" + display_name = "Niels-home" + }, + ] +} + +locals { + node_pools = [ + { + name = "default-pool", + machine_type = "n1-standard-2", + disk_type = "pd-standard", + disk_size = 100, + autoscaling = false, + preemptible = false, + # standalone kfp takes at minimum 3 n1-standard-2 nodes + node_count = 3, + min_count = 3, + max_count = 3, + accelerator_type = "", + accelerator_count = 0, + local_ssd_ephemeral_count = 0 + }, + { + name = "work-pool", + machine_type = "n1-standard-2", + disk_type = "pd-standard", + disk_size = 100, + autoscaling = true, + preemptible = false, + node_count = 3, + min_count = 0, + max_count = 100, + accelerator_type = "", + accelerator_count = 0, + local_ssd_ephemeral_count = 0 + }, + ] +} + + +/****************************************** + GKE configuration + *****************************************/ +module "gke_cluster" { + source = "../modules/kubernetes" + project = var.project + zone = var.zone + region = var.region + node_pools = local.node_pools + cluster_name = "fondant-cluster" + master_ipv4_cidr_block = "172.16.0.0/28" + master_authorized_networks = var.master_authorized_networks + ip_range_pods_name = module.vpc-network.subnets_secondary_ranges[0] + ip_range_services_name = module.vpc-network.subnets_secondary_ranges[1] + network_name = module.vpc-network.network_name + subnetwork_name = module.vpc-network.subnets_names[0] +} + +resource "google_project_service" "iam_credentials" { + project = var.project + service = "iamcredentials.googleapis.com" +} + +resource "google_project_service" "iam" { + project = var.project + service = "iam.googleapis.com" + depends_on = [google_project_service.iam_credentials] +} \ No newline at end of file diff --git a/infrastructure/terraform/main/iam.tf b/infrastructure/terraform/main/iam.tf new file mode 100644 index 0000000..730b463 --- /dev/null +++ b/infrastructure/terraform/main/iam.tf @@ -0,0 +1,25 @@ +resource "google_project_iam_member" "user_roles" { + for_each = toset(var.user_roles) + role = each.key + project = var.project + member = "group:team@skyhaus.com" +} + +variable "user_roles" { + description = "IAM roles to bind to users" + type = list(string) + default = [ + "roles/container.clusterViewer", + "roles/storage.admin", + "roles/artifactregistry.repoAdmin", + "roles/cloudbuild.builds.editor", + "roles/serviceusage.serviceUsageConsumer", + "roles/viewer" + ] +} + +resource "google_service_account_iam_member" "user_svc_user" { + service_account_id = module.gke_cluster.service_account.id + role = "roles/iam.serviceAccountUser" + member = "group:team@skyhaus.com" +} \ No newline at end of file diff --git a/infrastructure/terraform/main/main.tf b/infrastructure/terraform/main/main.tf new file mode 100644 index 0000000..5a2701e --- /dev/null +++ b/infrastructure/terraform/main/main.tf @@ -0,0 +1,59 @@ +/****************************************** + Google provider configuration + *****************************************/ + +provider "google" { + project = var.project + region = var.region + zone = var.zone +} + +provider "google-beta" { + project = var.project + region = var.region + zone = 
var.zone +} + +/****************************************** + Variables + *****************************************/ + +variable "project" { + description = "GCP project name" + type = string +} + +variable "region" { + description = "Default GCP region for resources" + type = string + default = "europe-west1" +} + +variable "zone" { + description = "Default GCP zone for resources" + type = string + default = "europe-west1-b" +} + +locals { + network_name = "main-network" + subnetwork_name = var.region +} + +/****************************************** + VPC configuration + *****************************************/ + +module "vpc-network" { + source = "../modules/vpc" + project = var.project + region = var.region + network_name = local.network_name + subnetwork_name = local.subnetwork_name + subnetwork_ip_range = "10.10.10.0/24" + ip_range_pods = "10.28.0.0/16" + ip_range_pods_name = "${local.subnetwork_name}-gke-pods" + ip_range_services = "10.0.0.0/20" + ip_range_services_name = "${local.subnetwork_name}-gke-services" + vpc_connector = "true" +} diff --git a/infrastructure/terraform/main/services.tf b/infrastructure/terraform/main/services.tf new file mode 100644 index 0000000..25de217 --- /dev/null +++ b/infrastructure/terraform/main/services.tf @@ -0,0 +1,19 @@ +variable "gcp_service_list" { + description = "The list of apis necessary for the project" + type = list(string) + default = [ + "cloudbuild.googleapis.com", + "notebooks.googleapis.com", + "secretmanager.googleapis.com", + "servicenetworking.googleapis.com", + "sqladmin.googleapis.com", + "compute.googleapis.com" + ] +} +resource "google_project_service" "gcp_services" { + for_each = toset(var.gcp_service_list) + project = var.project + service = each.key + disable_dependent_services = true +} + diff --git a/infrastructure/terraform/main/sql.tf b/infrastructure/terraform/main/sql.tf new file mode 100644 index 0000000..778ceaf --- /dev/null +++ b/infrastructure/terraform/main/sql.tf @@ -0,0 +1,82 @@ +/****************************************** + VPC Peering configuration + *****************************************/ + +resource "google_compute_global_address" "private_ip_address" { + provider = google-beta + + name = "private-ip-address" + purpose = "VPC_PEERING" + address_type = "INTERNAL" + prefix_length = 16 + network = module.vpc-network.network_name +} + +resource "google_service_networking_connection" "private_vpc_connection" { + network = module.vpc-network.network_name + service = "servicenetworking.googleapis.com" + reserved_peering_ranges = [google_compute_global_address.private_ip_address.name] +} + +/****************************************** + database configuration + *****************************************/ + +resource "google_sql_database_instance" "metadata-database" { + name = "kfp-metadata" + database_version = "MYSQL_5_7" + region = var.region + deletion_protection = true + depends_on = [google_service_networking_connection.private_vpc_connection] + + # Be careful here as the disk size can automatically increase which can cause Terraform to delete + # the database if the disk_size specified is smaller than the resized amount + settings { + tier = "db-n1-standard-1" + disk_size = 50 + disk_type = "PD_SSD" + disk_autoresize = false + + ip_configuration { + ipv4_enabled = false + private_network = "projects/${var.project}/global/networks/main-network" + } + backup_configuration { + enabled = true + location = "eu" + start_time = "00:00" + } + } +} + +/****************************************** + Default user 
configuration + *****************************************/ +# Store password in secret manager and create user + +resource "google_secret_manager_secret" "sql-key" { + secret_id = "sql-key" + + replication { + automatic = true + } + depends_on = [google_project_service.gcp_services] +} + +resource "random_password" "sql_password" { + length = 16 + special = true + override_special = "!#$%&*()-_=+[]{}<>:?" +} + +resource "google_secret_manager_secret_version" "sql-key-1" { + secret = google_secret_manager_secret.sql-key.id + secret_data = random_password.sql_password.result + depends_on = [google_secret_manager_secret.sql-key] +} + +resource "google_sql_user" "sql-user" { + name = "root" + instance = google_sql_database_instance.metadata-database.name + password = google_secret_manager_secret_version.sql-key-1.secret_data +} \ No newline at end of file diff --git a/infrastructure/terraform/main/versions.tf b/infrastructure/terraform/main/versions.tf new file mode 100644 index 0000000..9d276f4 --- /dev/null +++ b/infrastructure/terraform/main/versions.tf @@ -0,0 +1,12 @@ +terraform { + required_version = ">= 1.0.0" + + required_providers { + google = { + source = "hashicorp/google" + } + google-beta = { + source = "hashicorp/google-beta" + } + } +} diff --git a/infrastructure/terraform/modules/kubernetes/main.tf b/infrastructure/terraform/modules/kubernetes/main.tf new file mode 100644 index 0000000..f3c9f2b --- /dev/null +++ b/infrastructure/terraform/modules/kubernetes/main.tf @@ -0,0 +1,174 @@ +/****************************************** + Resources + *****************************************/ + +resource "google_project_service" "container" { + project = var.project + service = "container.googleapis.com" +} + +resource "google_project_service" "containerregistry" { + project = var.project + service = "containerregistry.googleapis.com" + disable_on_destroy = false +} + +resource "google_project_service" "iam_credentials" { + project = var.project + service = "iamcredentials.googleapis.com" +} + +resource "google_project_service" "iam" { + project = var.project + service = "iam.googleapis.com" + depends_on = [google_project_service.iam_credentials] +} + +/****************************************** + GSA and bucket access + *****************************************/ + +# create user GSA +resource "google_service_account" "kfp-pipeline-user" { + project = var.project + account_id = "svc-kfp-user" + description = "Service account for the KFP pipelines" +} + + +# set roles +resource "google_project_iam_member" "kfp_pipeline_roles" { + for_each = toset(var.kfp_pipeline_roles) + role = each.key + project = var.project + member = "serviceAccount:${google_service_account.kfp-pipeline-user.email}" +} + + +variable "kfp_pipeline_roles" { + description = "IAM roles to bind on service account" + type = list(string) + default = [ + "roles/artifactregistry.reader", + "roles/logging.logWriter", + "roles/monitoring.metricWriter", + "roles/monitoring.viewer", + "roles/stackdriver.resourceMetadata.writer", + "roles/storage.objectViewer", + "roles/cloudsql.client" + ] +} + +resource "google_storage_bucket" "datasets" { + name = "${var.project}_datasets" + location = var.region + uniform_bucket_level_access = true + versioning { + enabled = true + } +} + + +resource "google_storage_bucket" "models" { + name = "${var.project}_models" + location = var.region + uniform_bucket_level_access = true + versioning { + enabled = true + } +} + +resource "google_storage_bucket" "kfp-artifacts" { + name = 
"${var.project}_kfp-artifacts" + location = var.region + uniform_bucket_level_access = true + versioning { + enabled = true + } +} + +resource "google_storage_bucket_iam_binding" "dataset_binding" { + bucket = google_storage_bucket.datasets.name + + members = [ + "serviceAccount:${google_service_account.kfp-pipeline-user.email}" + ] + + role = "roles/storage.objectViewer" +} + +resource "google_storage_bucket_iam_binding" "models_binding" { + bucket = google_storage_bucket.models.name + + members = [ + "serviceAccount:${google_service_account.kfp-pipeline-user.email}" + ] + + role = "roles/storage.objectViewer" +} + +resource "google_storage_bucket_iam_binding" "artifact_binding" { + bucket = google_storage_bucket.kfp-artifacts.name + + members = [ + "serviceAccount:${google_service_account.kfp-pipeline-user.email}" + ] + + role = "roles/storage.admin" +} + +/****************************************** + Cluster + *****************************************/ + +module "gke-cluster" { + source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster" + version = "24.0.0" + + project_id = google_project_service.container.project + name = var.cluster_name + region = var.region + regional = false + zones = [var.zone] + release_channel = "STABLE" + + network = var.network_name + subnetwork = var.subnetwork_name + ip_range_pods = var.ip_range_pods_name + ip_range_services = var.ip_range_services_name + service_account = google_service_account.kfp-pipeline-user.email + master_ipv4_cidr_block = var.master_ipv4_cidr_block + enable_private_nodes = true + master_authorized_networks = var.master_authorized_networks + grant_registry_access = true + remove_default_node_pool = true + network_policy = true + maintenance_start_time = "00:00" + identity_namespace = "" + node_metadata = "UNSPECIFIED" + + + node_pools = [ + for node_pool_spec in var.node_pools : + { + name = node_pool_spec.name + machine_type = node_pool_spec.machine_type + disk_size_gb = node_pool_spec.disk_size + disk_type = node_pool_spec.disk_type + image_type = "COS_CONTAINERD" + auto_repair = true + auto_upgrade = true + preemptible = node_pool_spec.preemptible + node_count = node_pool_spec.node_count + autoscaling = node_pool_spec.autoscaling + min_count = node_pool_spec.min_count + max_count = node_pool_spec.max_count + accelerator_type = node_pool_spec.accelerator_type + accelerator_count = node_pool_spec.accelerator_count + local_ssd_ephemeral_count = node_pool_spec.local_ssd_ephemeral_count #375 Gb per ssd + } + ] + + node_pools_taints = var.node_pools_taints + +} diff --git a/infrastructure/terraform/modules/kubernetes/outputs.tf b/infrastructure/terraform/modules/kubernetes/outputs.tf new file mode 100644 index 0000000..fb91f32 --- /dev/null +++ b/infrastructure/terraform/modules/kubernetes/outputs.tf @@ -0,0 +1,16 @@ +output "endpoint" { + description = "The cluster endpoint" + sensitive = true + value = module.gke-cluster.endpoint +} + +output "ca_certificate" { + description = "The cluster ca certificate (base64 encoded)" + sensitive = true + value = module.gke-cluster.ca_certificate +} + +output "service_account" { + description = "main cluster service account" + value = google_service_account.kfp-pipeline-user +} \ No newline at end of file diff --git a/infrastructure/terraform/modules/kubernetes/variables.tf b/infrastructure/terraform/modules/kubernetes/variables.tf new file mode 100644 index 0000000..61691bd --- /dev/null +++ b/infrastructure/terraform/modules/kubernetes/variables.tf @@ -0,0 +1,62 @@ 
+variable "project" { + description = "GCP project name" + type = string +} + +variable "region" { + description = "Default GCP region for resources" + type = string + default = "europe-west4" +} + +variable "zone" { + description = "Default GCP zone for resources" + type = string + default = "europe-west4-a" +} + +variable "cluster_name" { + description = "GKE cluster name." + type = string +} + +variable "master_authorized_networks" { + description = "IPv4 CIDR blocks that are authorized to connect to cluster master." + type = list(object({ cidr_block = string, display_name = string })) +} + +variable "ip_range_pods_name" { + description = "Name of the secondary IP range to be used by GKE pods." + type = string +} + +variable "ip_range_services_name" { + description = "Name of the secondary IP range to be used by GKE services." + type = string +} + +variable "master_ipv4_cidr_block" { + description = "IPv4 CIDR block to be used by cluster master." + type = string +} + +variable "network_name" { + description = "Name of the VPC network." + type = string +} + +variable "subnetwork_name" { + description = "Name of the subnetwork to be used by GKE nodes." + type = string +} + +variable "node_pools" { + description = "List of specs for GKE node pools to be created in the cluster." + type = list(map(string)) +} + +variable "node_pools_taints" { + description = "List of maps, each representing a node pool taint" + type = map(list(object({ key = string, value = string, effect = string }))) + default = {} +} diff --git a/infrastructure/terraform/modules/vpc/main.tf b/infrastructure/terraform/modules/vpc/main.tf new file mode 100644 index 0000000..9906efd --- /dev/null +++ b/infrastructure/terraform/modules/vpc/main.tf @@ -0,0 +1,90 @@ +resource "google_project_service" "compute" { + project = var.project + service = "compute.googleapis.com" + disable_dependent_services = true + disable_on_destroy = false +} + +module "vpc-network" { + source = "terraform-google-modules/network/google" + version = "5.1.0" + + project_id = var.project + network_name = var.network_name + + description = "Custom network created with Nimbus" + + subnets = [ + { + subnet_name = var.subnetwork_name + subnet_ip = var.subnetwork_ip_range + subnet_region = var.region + subnet_private_access = "true" + } + ] + + secondary_ranges = { + (var.subnetwork_name) = [ + { + range_name = var.ip_range_pods_name + ip_cidr_range = var.ip_range_pods + }, + { + range_name = var.ip_range_services_name + ip_cidr_range = var.ip_range_services + } + ] + } + + depends_on = [google_project_service.compute] +} + +resource "google_compute_router" "default" { + name = "nat-router-${var.region}" + region = var.region + network = module.vpc-network.network_name + +} + +resource "google_compute_router_nat" "default" { + name = "nat-config-${var.region}" + router = google_compute_router.default.name + region = var.region + nat_ip_allocate_option = "AUTO_ONLY" + source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES" +} + +resource "google_compute_firewall" "allow-ssh-from-iap" { + name = "allow-ssh-from-iap" + network = module.vpc-network.network_name + project = var.project + + source_ranges = [ + "35.235.240.0/20", + ] + + allow { + protocol = "tcp" + ports = ["22", ] + } +} + +resource "google_vpc_access_connector" "connector" { + count = var.vpc_connector ? 
1 : 0 + name = var.subnetwork_name + region = var.region + ip_cidr_range = var.vpc_connector_ip_range + network = module.vpc-network.network_name + min_throughput = 200 + max_throughput = 300 + + depends_on = [ + google_project_service.vpcaccess + ] +} + +resource "google_project_service" "vpcaccess" { + count = var.vpc_connector ? 1 : 0 + project = var.project + service = "vpcaccess.googleapis.com" +} diff --git a/infrastructure/terraform/modules/vpc/outputs.tf b/infrastructure/terraform/modules/vpc/outputs.tf new file mode 100644 index 0000000..4826c07 --- /dev/null +++ b/infrastructure/terraform/modules/vpc/outputs.tf @@ -0,0 +1,14 @@ +output "network_name" { + value = module.vpc-network.network_name + description = "Network name" +} + +output "subnets_names" { + value = module.vpc-network.subnets_names + description = "Subnet name" +} + +output "subnets_secondary_ranges" { + value = module.vpc-network.subnets_secondary_ranges[0][*].range_name + description = "Subnet secondary ranges" +} diff --git a/infrastructure/terraform/modules/vpc/variables.tf b/infrastructure/terraform/modules/vpc/variables.tf new file mode 100644 index 0000000..f8f9c63 --- /dev/null +++ b/infrastructure/terraform/modules/vpc/variables.tf @@ -0,0 +1,56 @@ +variable "project" { + description = "GCP project name" + type = string +} + +variable "region" { + description = "Default GCP region for resources" + type = string + default = "europe-west1" +} + +variable "network_name" { + description = "Name of the VPC network." + type = string +} + +variable "subnetwork_name" { + description = "Name of the subnetwork to be used by GKE nodes." + type = string +} + +variable "subnetwork_ip_range" { + description = "IPv4 CIDR block to be used by GKE nodes." + type = string +} + +variable "ip_range_pods_name" { + description = "Name of the secondary IP range to be used by GKE pods." + type = string +} + +variable "ip_range_pods" { + description = "IPv4 CIDR block to be used by GKE pods." + type = string +} + +variable "ip_range_services_name" { + description = "Name of the secondary IP range to be used by GKE services." + type = string +} + +variable "ip_range_services" { + description = "IPv4 CIDR block to be used by GKE services." + type = string +} + +variable "vpc_connector" { + description = "Whether or not to add a vpc_connector." + type = bool +} + +variable "vpc_connector_ip_range" { + description = "An unreserved /28 internal IP range used by the VPC connector" + type = string + default = "10.8.0.0/28" +}