Terraform setup for CC demo #4

Open · wants to merge 4 commits into `main`
124 changes: 124 additions & 0 deletions infrastructure/deployments/README.md
@@ -0,0 +1,124 @@
# Deploying standalone KFP pipelines on top of the cluster

## Deploying KFP pipeline

[kubeflow pipelines documentation](https://www.kubeflow.org/docs/components/pipelines/installation/standalone-deployment/#disable-the-public-endpoint)

[kubeflow pipelines github](https://github.com/kubeflow/pipelines/tree/0.5.1)


The documentation instructs you to deploy Kubeflow Pipelines on a GKE cluster; we've already deployed that cluster with Terraform.

Make sure you are authenticated to the GKE cluster hosting KFP:
```commandline
gcloud container clusters get-credentials fondant-cluster --zone=europe-west4-a
```

## Customizing GCS and Cloud SQL for Artefact, Pipeline and Metadata storage

[GCP services setup](https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/sample)

We will set up `GCS` to store artefacts and pipeline specs. This is done by deploying the `minio-gcs gateway` service.
We will also use `CloudSQL` to store all the ML metadata associated with our pipeline runs. This guarantees that you can
retrieve metadata (experiment outputs, lineage visualization, ...) from previous pipeline runs in case the GKE cluster where KFP is deployed gets deleted.


First clone the [Kubeflow Pipelines GitHub repository](https://github.com/kubeflow/pipelines), and use it as your working directory.

```commandline
git clone https://github.com/kubeflow/pipelines
```

Next, we will need to customize our own values in the deployment manifest before deploying the additional services.

**Note**: You will need the `CloudSQL` root user password that was created in the Terraform setup. In your GCP project, go to
`Secret Manager` and retrieve the `sql-key` secret that was stored there.
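You can also fetch it directly with `gcloud` (assuming the secret is named `sql-key`, as above):

```commandline
gcloud secrets versions access latest --secret=sql-key
```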

### Customize values

Make sure to modify the following files:

* `manifests/kustomize/env/gcp/params.env`

```bash
pipelineDb=pipelinedb
mlmdDb=metadb
cacheDb=cachedb
bucketName=<PROJECT_ID>-kfp-artifacts # bucket to store the artifacts and pipeline specs (created in TF)
gcsProjectId=<PROJECT_ID> # GCP project ID
gcsCloudSqlInstanceName=<PROJECT_ID>:<DB_REGION>:kfp-metadata # metadata DB (created in TF)
```

* `manifests/kustomize/base/installs/generic/mysql-secret.yaml`



Specify the `root` user password retrieved from `Secret Manager` in the designated fields. Make sure **you do not commit** the secret to your git history.
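As an illustrative sketch of what the edited secret might look like — the exact key names are defined in the file itself, so only fill in the password value:

```
apiVersion: v1
kind: Secret
metadata:
  name: mysql-secret
stringData:
  username: root
  password: <SQL_ROOT_PASSWORD> # value from Secret Manager — never commit the real password
```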

### Applying the customized resources

After specifying the required parameters, you can now install the additional services:

```commandline
kubectl apply -k manifests/kustomize/cluster-scoped-resources
kubectl wait crd/applications.app.k8s.io --for condition=established --timeout=60s
kubectl apply -k manifests/kustomize/env/gcp
```

Deploying the GCP resources may take 3-5 minutes. You can check the status of the
newly deployed services with:
```commandline
kubectl -n kubeflow get pods
```

## Installing GPU drivers

Next, we need to install the GPU drivers on the GPU node pools to use them in KFP.
To install the drivers, you have to manually scale up the GPU
pool from 0 to 1 to ensure that the installation takes effect (more on this issue [here](https://github.com/kubeflow/pipelines/issues/2561)).
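For example, assuming the GPU node pool is named `gpu-pool` (use the actual pool name from the Terraform setup):

```commandline
gcloud container clusters resize fondant-cluster --node-pool=gpu-pool --num-nodes=1 --zone=europe-west4-a
```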

After that, apply:

```commandline
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```

After installing the drivers, you can set the pool back to 0. The pool will scale
back up again when it is needed by the pipeline steps, since autoscaling is enabled.

## Installing the Spark operator (Optional)
An additional installation is required to set up the [Kubernetes operator for Spark](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator#installation).

```commandline
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install my-release spark-operator/spark-operator --namespace kubeflow
```
This will install the Kubernetes Operator for Apache Spark into the `kubeflow` namespace.
By default, the operator watches and handles SparkApplications in every namespace.

## Deleting the KFP services
Run the following commands to delete the deployed KFP services from your GKE cluster.


```commandline
kubectl delete -k manifests/kustomize/cluster-scoped-resources
kubectl delete -k manifests/kustomize/env/gcp
```

## Accessing KFP pipeline
There are three ways to connect to the KFP UI. First, make sure you are authenticated to the GKE cluster hosting KFP:

**1) Port-forwarding to access the kubernetes service**
```commandline
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
```

Open the Kubeflow Pipelines UI at http://localhost:8080

**2) Using the IAP URL of KFP**
```commandline
kubectl describe configmap inverse-proxy-config -n kubeflow | grep googleusercontent.com
```

**3) AI Platform**

Alternatively, you can visualize your pipelines from the AI Platform UI in the `Pipelines` section.

2 changes: 2 additions & 0 deletions infrastructure/terraform/.gitignore
@@ -0,0 +1,2 @@
.terraform
.terraform.lock.hcl
79 changes: 79 additions & 0 deletions infrastructure/terraform/README.md
@@ -0,0 +1,79 @@
# Terraform by Nimbus

Nimbus can generate [Terraform](https://www.terraform.io/) configuration to help you set up the
necessary infrastructure at the start of your project and for each generated component.

### Directory structure

The Terraform directory consists of 4 subdirectories:
- **backend**
Contains the configuration for the Terraform backend which stores the state of the main
configuration. The backend configuration and state are separated to prevent chicken-and-egg
issues. For more info, check the [README.md](./backend/README.md) in the subdirectory.
- **environment**
Contains the `.tfvars` files with the global variables for the project. Originally, this
directory contains a single `project.tfvars` file, but when moving to multiple environments,
you'll want separate `.tfvars` files per environment. You can pass these files to any Terraform
command using the `--var-file` argument.
- **main**
The main Terraform configuration of the project.
- **modules**
[Modules](https://developer.hashicorp.com/terraform/language/modules) that can be shared across
configurations or just be used to package logical resources together.

### Terraform version

The supported Terraform version can be found in [versions.tf](./versions.tf).
We recommend [tfenv](https://github.com/tfutils/tfenv) to manage different
Terraform versions on your machine.

### Remote state

By default, Terraform stores state locally in a file named `terraform.tfstate`. When working with
Terraform in a team, a local file makes Terraform usage complicated: each user must make sure they
always have the latest state data before running Terraform, and nobody else may run Terraform at
the same time. To solve this, Terraform allows you to store the state remotely.
Remote state has more advantages than just enabling teamwork: it also increases reliability in
terms of backups, keeps your infrastructure secure by making the process of applying Terraform
updates stable and secure, and facilitates the use of Terraform in automated pipelines.
At ML6 we believe remote state is the way to work with Terraform.

The Nimbus setup enforces remote state by generating a `backend.tf` file, which makes sure Terraform
automatically writes the state data to the cloud. The `backend` subdirectory contains the
configuration of this backend infrastructure.
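As a sketch of what such a generated `backend.tf` looks like — the bucket name follows the pattern created by the backend configuration, and the `prefix` shown here is only illustrative:

```hcl
terraform {
  backend "gcs" {
    bucket = "<PROJECT_ID>_terraform" # state bucket created by the backend configuration
    prefix = "main"                   # illustrative prefix for the main configuration's state
  }
}
```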

### Multiple environments

#### Terraform variables

To keep your Terraform commands nice and short, we recommend one `.tfvars` file per environment.
If you want to start using multiple environments, duplicate the `project.tfvars` file and rename the copies appropriately.

```
.                                    .
├── terraform/                       ├── terraform/
|   ├── environment/                 |   ├── environment/
|   |   └── project.tfvars  ----->   |   |   ├── sbx.tfvars
|   |                        └-->    |   |   └── dev.tfvars
|   ├── modules/                     |   ├── modules/
|   └── main.tf                      |   └── main.tf
└── ...                              └── ...
```

You can now change variables per environment. For example, you could change the project of your dev environment.

To execute Terraform commands, you now provide one of the `.tfvars` files:

```bash
terraform plan --var-file environment/sbx.tfvars
```

#### Terraform workspaces

When you create resources with Terraform, they are recorded in a Terraform state.
To have multiple environments running at the same time, you have to create multiple states,
one for each environment. To do this, use the `terraform workspace` command, as shown below.
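A minimal example (the environment names are just an illustration):

```commandline
terraform workspace new dev       # create a separate state for the dev environment
terraform workspace select dev    # switch to it
terraform plan --var-file environment/dev.tfvars
```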

### Component documentation

For more detailed documentation about certain components, check the [Nimbus documentation](https://nimbus-documentation-dot-ml6-internal-tools.uc.r.appspot.com/).
23 changes: 23 additions & 0 deletions infrastructure/terraform/README/cluster.md
@@ -0,0 +1,23 @@
# README for cluster.tf

## Workload identities

Using workload identities is the recommended way to authenticate GKE pods to Google service accounts. To support this, `cluster.tf` includes all the required boilerplate code that maps Google service accounts to Kubernetes service accounts. This is all you need to start using workload identities.

## Configure your service accounts in Terraform

In `cluster.tf` you will find a variable `service_accounts`. By updating this variable you can configure one or more service accounts, which will automatically be created and initialized to work with workload identities. By default, we create a simple service account `primary-pod-sa` without any permissions.

## Assigning the service account to pods

By default, pods in a GKE cluster will not use the service account configured here.
Update the YAML definition of the deployment to assign a service account:
```
spec:
  serviceAccountName: KSA_NAME
```

Note that the name should be the name of the Kubernetes service account that is linked to the Google service account. If you are unsure of the name, run `kubectl get serviceaccounts` to list all available Kubernetes service accounts.
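If you want to verify which Google service account a Kubernetes service account is linked to, you can inspect its workload identity annotation:

```commandline
kubectl describe serviceaccount KSA_NAME
# the linked Google service account appears in the iam.gke.io/gcp-service-account annotation
```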

## Relevant documentation
https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to
33 changes: 33 additions & 0 deletions infrastructure/terraform/README/wif.md
@@ -0,0 +1,33 @@

# README for Workload Identity Federation

## Prerequisites

You need to have the following IAM roles to apply the terraform configuration:

- roles/iam.workloadIdentityPoolAdmin
- roles/iam.serviceAccountAdmin

## Usage

You can use Nimbus to generate the components required for workload identity federation via `nimbus gcp init`.

### Variables

Note: the default value for the `bitbucket_repo_ids` local in `wif.tf` is an empty list.
You have to update it with the UUIDs of the Bitbucket repositories that should have
access!
You can find more info in the Bitbucket documentation linked at the end.
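For illustration, setting the local could look like the snippet below; the UUID is hypothetical, so replace it with the UUIDs of your own repositories (including the curly braces):

```hcl
locals {
  # hypothetical repository UUID — replace with your own
  bitbucket_repo_ids = ["{a1b2c3d4-e5f6-7890-abcd-ef1234567890}"]
}
```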

All other variables are set by Nimbus to use the ML6 Bitbucket workspace by default. If you need to connect a different
workspace, you can check the descriptions of these variables in the generated `wif` module (`modules/wif/variables.tf`).

## Background information

If you want to learn more about the concepts behind workload identity federation, you can check out the following sources:

- [Chapter Conference Talk (April 2022)](https://docs.google.com/presentation/d/1liJi-QdurS1cJ2W57CYy0kbbSLx5GR8YlQPDJXcy3cw/edit?usp=sharing)
- [Google Documentation](https://cloud.google.com/iam/docs/workload-identity-federation)
- [Bitbucket Documentation](https://support.atlassian.com/bitbucket-cloud/docs/integrate-pipelines-with-resource-servers-using-oidc/)
- Google Cloud blogposts: [1](https://cloud.google.com/blog/products/identity-security/enable-keyless-access-to-gcp-with-workload-identity-federation) and [2](https://cloud.google.com/blog/products/identity-security/enabling-keyless-authentication-from-github-actions)

2 changes: 2 additions & 0 deletions infrastructure/terraform/backend/.gitignore
@@ -0,0 +1,2 @@
.terraform*
terraform.tfstate.backup
28 changes: 28 additions & 0 deletions infrastructure/terraform/backend/README.md
@@ -0,0 +1,28 @@
# Terraform backend infrastructure

This directory defines the infrastructure for the backend where Terraform stores its state. The
state of the backend cannot be stored in the backend itself, which is why it is separated from the
main Terraform configuration.

The state of the backend configuration is instead stored locally and can be tracked in git. This is
an acceptable solution since the backend should rarely be changed.

This means that you now have two separate Terraform configurations and states:

- The backend configuration: for which the state is stored locally and tracked in git.
- The main configuration: for which the state is stored remotely in the backend.

## Deployment

To deploy, run the following steps from this directory:

```commandline
terraform init
terraform apply --var-file ../environment/project.tfvars
```

## Tracking the state

The state of this infrastructure needs to be tracked separately by adding the
`terraform.tfstate` file to git. Besides the `main.tf` file and `README.md`, all other files can be
ignored. A `.gitignore` file is automatically generated to handle this.
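For example, after the initial `terraform apply`:

```commandline
git add main.tf README.md terraform.tfstate
git commit -m "Add Terraform backend configuration and state"
```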
44 changes: 44 additions & 0 deletions infrastructure/terraform/backend/main.tf
@@ -0,0 +1,44 @@
/******************************************
  Variables
 *****************************************/

variable "project" {
  description = "GCP project name"
  type        = string
}

variable "region" {
  description = "Default GCP region for resources"
  type        = string
  default     = "europe-west1"
}

variable "zone" {
  description = "Default GCP zone for resources"
  type        = string
  default     = "europe-west1-b"
}

/******************************************
  Google provider configuration
 *****************************************/

provider "google" {
  project = var.project
  region  = var.region
  zone    = var.zone
}

/******************************************
  State storage configuration
 *****************************************/

resource "google_storage_bucket" "terraform_state" {
  name                        = "${var.project}_terraform"
  location                    = var.region
  uniform_bucket_level_access = true

  versioning {
    enabled = true
  }
}
52 changes: 52 additions & 0 deletions infrastructure/terraform/backend/terraform.tfstate
@@ -0,0 +1,52 @@
{
  "version": 4,
  "terraform_version": "1.4.6",
  "serial": 1,
  "lineage": "7e6b8586-fe78-96ad-aaca-ba5027c2c6e4",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "google_storage_bucket",
      "name": "terraform_state",
      "provider": "provider[\"registry.terraform.io/hashicorp/google\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "autoclass": [],
            "cors": [],
            "custom_placement_config": [],
            "default_event_based_hold": false,
            "encryption": [],
            "force_destroy": false,
            "id": "boreal-array-387713_terraform",
            "labels": {},
            "lifecycle_rule": [],
            "location": "EUROPE-WEST1",
            "logging": [],
            "name": "boreal-array-387713_terraform",
            "project": "boreal-array-387713",
            "public_access_prevention": "inherited",
            "requester_pays": false,
            "retention_policy": [],
            "self_link": "https://www.googleapis.com/storage/v1/b/boreal-array-387713_terraform",
            "storage_class": "STANDARD",
            "timeouts": null,
            "uniform_bucket_level_access": true,
            "url": "gs://boreal-array-387713_terraform",
            "versioning": [
              {
                "enabled": true
              }
            ],
            "website": []
          },
          "sensitive_attributes": [],
          "private": "eyJlMmJmYjczMC1lY2FhLTExZTYtOGY4OC0zNDM2M2JjN2M0YzAiOnsiY3JlYXRlIjo2MDAwMDAwMDAwMDAsInJlYWQiOjI0MDAwMDAwMDAwMCwidXBkYXRlIjoyNDAwMDAwMDAwMDB9fQ=="
        }
      ]
    }
  ],
  "check_results": null
}