Terraform setup for CC demo #4

Open · wants to merge 4 commits into `main`
124 changes: 124 additions & 0 deletions infrastructure/deployments/README.md
@@ -0,0 +1,124 @@
# Deploying standalone KFP pipelines on top of the cluster

## Deploying KFP pipeline

[kubeflow pipelines documentation](https://www.kubeflow.org/docs/components/pipelines/installation/standalone-deployment/#disable-the-public-endpoint)

[kubeflow pipelines github](https://github.com/kubeflow/pipelines/tree/0.5.1)


The documentation instructs you to deploy Kubeflow Pipelines on a GKE cluster; we've already deployed that cluster with Terraform.

Make sure you are authenticated to the GKE cluster hosting KFP:
```commandline
gcloud container clusters get-credentials fondant-cluster --zone=europe-west4-a
```

## Customizing GCS and Cloud SQL for Artefact, Pipeline and Metadata storage

[GCP services setup](https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/sample)

We will set up `GCS` to store artefacts and pipeline specs. This is done by deploying the `minio-gcs gateway` service.
We will also use `CloudSQL` to store all the ML metadata associated with our pipeline runs. This guarantees that you can
retrieve metadata (experiment outputs, lineage visualization, ...) from previous pipeline runs in case the GKE cluster where KFP is deployed gets deleted.


First clone the [Kubeflow Pipelines GitHub repository](https://github.com/kubeflow/pipelines), and use it as your working directory.

```commandline
git clone https://github.com/kubeflow/pipelines
```

Next, we will need to customize our own values in the deployment manifest before deploying the additional services.

**Note**: You will need the `CloudSQL` root user password that was created in the Terraform setup. In your GCP project, go to
`Secret Manager` and retrieve the `sql-key` secret that was stored there.
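You can also fetch it directly with `gcloud` (assuming the secret is named `sql-key`, as above):

```commandline
gcloud secrets versions access latest --secret=sql-key
```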

### Customize values

Make sure to modify the following files:

* `manifests/kustomize/env/gcp/params.env`

```bash
pipelineDb=pipelinedb
mlmdDb=metadb
cacheDb=cachedb
bucketName=<PROJECT_ID>-kfp-artifacts # bucket to store the artifacts and pipeline specs (created in TF)
gcsProjectId=<PROJECT_ID> # GCP project ID
gcsCloudSqlInstanceName=<PROJECT_ID>:<DB_REGION>:kfp-metadata # metadata DB (created in TF)
```

* `manifests/kustomize/base/installs/generic/mysql-secret.yaml`



Specify the `root` user password retrieved from `Secret Manager` in the designated fields. Make sure **you do not commit** the secret to your git history.
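As an illustrative sketch of what the edited secret might look like — the exact key names are defined in the file itself, so only fill in the password value:

```
apiVersion: v1
kind: Secret
metadata:
  name: mysql-secret
stringData:
  username: root
  password: <SQL_ROOT_PASSWORD> # value from Secret Manager — never commit the real password
```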

### Applying the customized resources

After specifying the required parameters, you can now install the additional services:

```commandline
kubectl apply -k manifests/kustomize/cluster-scoped-resources
kubectl wait crd/applications.app.k8s.io --for condition=established --timeout=60s
kubectl apply -k manifests/kustomize/env/gcp
```

Deploying the GCP resources may take 3-5 minutes. You can check the status of the
newly deployed services with:
```commandline
kubectl -n kubeflow get pods
```

## Installing GPU drivers

Next, we need to install the GPU drivers on the GPU node pools to use them in KFP.
To install the drivers, you have to manually scale up the GPU
pool from 0 to 1 to ensure that the installation takes effect (more on this issue [here](https://github.com/kubeflow/pipelines/issues/2561)).
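For example, assuming the GPU node pool is named `gpu-pool` (use the actual pool name from the Terraform setup):

```commandline
gcloud container clusters resize fondant-cluster --node-pool=gpu-pool --num-nodes=1 --zone=europe-west4-a
```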

After that, apply:

```commandline
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```

After installing the drivers, you can set the pool back to 0. The pool will scale
back up again when it is needed by the pipeline steps, since autoscaling is enabled.

## Installing the Spark operator (Optional)
An additional installation is required to set up the [Kubernetes operator for Spark](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator#installation).

```commandline
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install my-release spark-operator/spark-operator --namespace kubeflow
```
This will install the Kubernetes Operator for Apache Spark into the `kubeflow` namespace.
By default, the operator watches and handles SparkApplications in every namespace.

## Deleting the KFP services
Run the following commands to delete the deployed KFP services from your GKE cluster.


```commandline
kubectl delete -k manifests/kustomize/cluster-scoped-resources
kubectl delete -k manifests/kustomize/env/gcp
```

## Accessing KFP pipeline
There are three ways to connect to the KFP UI. First, make sure you are authenticated to the GKE cluster hosting KFP:

**1) Port-forwarding to access the kubernetes service**
```commandline
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
```

Open the Kubeflow Pipelines UI at http://localhost:8080

**2) Using the IAP URL of KFP**
```commandline
kubectl describe configmap inverse-proxy-config -n kubeflow | grep googleusercontent.com
```

**3) AI Platform**

Alternatively, you can visualize your pipelines from the AI Platform UI in the `Pipelines` section.

2 changes: 2 additions & 0 deletions infrastructure/terraform/.gitignore
@@ -0,0 +1,2 @@
.terraform
.terraform.lock.hcl
79 changes: 79 additions & 0 deletions infrastructure/terraform/README.md
@@ -0,0 +1,79 @@
# Terraform by Nimbus

Nimbus can generate [Terraform](https://www.terraform.io/) configuration to help you set up the
necessary infrastructure at the start of your project and for each generated component.

### Directory structure

The Terraform directory consists of 4 subdirectories:
- **backend**
Contains the configuration for the Terraform backend which stores the state of the main
configuration. The backend configuration and state are separated to prevent chicken-and-egg
issues. For more info, check the [README.md](./backend/README.md) in the subdirectory.
- **environment**
Contains the `.tfvars` files with the global variables for the project. Originally, this
directory contains a single `project.tfvars` file, but when moving to multiple environments,
you'll want separate `.tfvars` files per environment. You can pass these files to any Terraform
command using the `--var-file` argument.
- **main**
The main Terraform configuration of the project.
- **modules**
[Modules](https://developer.hashicorp.com/terraform/language/modules) that can be shared across
configurations or just be used to package logical resources together.

### Terraform version

The supported Terraform version can be found in [versions.tf](./versions.tf).
We recommend [tfenv](https://github.com/tfutils/tfenv) to manage different
Terraform versions on your machine.

### Remote state

By default, Terraform stores state locally in a file named `terraform.tfstate`. When working with
Terraform in a team, a local file makes Terraform usage complicated: each user must make sure they
always have the latest state data before running Terraform, and nobody else may run Terraform at
the same time. To solve this, Terraform allows you to store the state remotely.
Remote state has more advantages than just enabling teamwork: it also increases reliability in
terms of backups, keeps your infrastructure secure by making the process of applying Terraform
updates stable and secure, and facilitates the use of Terraform in automated pipelines.
At ML6 we believe remote state is the way to work with Terraform.

The Nimbus setup enforces remote state by generating a `backend.tf` file, which makes sure Terraform
automatically writes the state data to the cloud. The `backend` subdirectory contains the
configuration of this backend infrastructure.
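As a sketch of what such a generated `backend.tf` looks like — the bucket name follows the pattern created by the backend configuration, and the `prefix` shown here is only illustrative:

```hcl
terraform {
  backend "gcs" {
    bucket = "<PROJECT_ID>_terraform" # state bucket created by the backend configuration
    prefix = "main"                   # illustrative prefix for the main configuration's state
  }
}
```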

### Multiple environments

#### Terraform variables

To keep your Terraform commands nice and short, we recommend one `.tfvars` file per environment.
If you want to start using multiple environments, duplicate the `project.tfvars` file and rename the copies appropriately.

```
.                                    .
├── terraform/                       ├── terraform/
|   ├── environment/                 |   ├── environment/
|   |   └── project.tfvars  ----->   |   |   ├── sbx.tfvars
|   |                        └-->    |   |   └── dev.tfvars
|   ├── modules/                     |   ├── modules/
|   └── main.tf                      |   └── main.tf
└── ...                              └── ...
```

You can now change variables per environment. For example, you could change the project of your dev environment.

To execute Terraform commands, you now provide one of the `.tfvars` files:

```bash
terraform plan --var-file environment/sbx.tfvars
```

#### Terraform workspaces

When you create resources with Terraform, they are recorded in a Terraform state.
To have multiple environments running at the same time, you have to create multiple states,
one for each environment. To do this, use the `terraform workspace` command, as shown below.
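A minimal example (the environment names are just an illustration):

```commandline
terraform workspace new dev       # create a separate state for the dev environment
terraform workspace select dev    # switch to it
terraform plan --var-file environment/dev.tfvars
```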

### Component documentation

For more detailed documentation about certain components, check the [Nimbus documentation](https://nimbus-documentation-dot-ml6-internal-tools.uc.r.appspot.com/).
23 changes: 23 additions & 0 deletions infrastructure/terraform/README/cluster.md
@@ -0,0 +1,23 @@
# README for cluster.tf

## Workload identities

Using workload identities is the recommended way to authenticate GKE pods to Google service accounts. To support this, `cluster.tf` includes all the required boilerplate code that maps Google service accounts to Kubernetes service accounts. This is all you need to start using workload identities.

## Configure your service accounts in Terraform

In `cluster.tf` you will find a variable `service_accounts`. By updating this variable you can configure one or more service accounts, which will automatically be created and initialized to work with workload identities. By default, we create a simple service account `primary-pod-sa` without any permissions.

## Assigning the service account to pods

By default, pods in a GKE cluster will not use the service account configured here.
Update the YAML definition of the deployment to assign a service account:
```
spec:
  serviceAccountName: KSA_NAME
```

Note that the name should be the name of the Kubernetes service account that is linked to the Google service account. If you are unsure of the name, run `kubectl get serviceaccounts` to list all available Kubernetes service accounts.
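If you want to verify which Google service account a Kubernetes service account is linked to, you can inspect its workload identity annotation:

```commandline
kubectl describe serviceaccount KSA_NAME
# the linked Google service account appears in the iam.gke.io/gcp-service-account annotation
```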

## Relevant documentation
https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to
33 changes: 33 additions & 0 deletions infrastructure/terraform/README/wif.md
@@ -0,0 +1,33 @@

# README for Workload Identity Federation

## Prerequisites

You need to have the following IAM roles to apply the terraform configuration:

- roles/iam.workloadIdentityPoolAdmin
- roles/iam.serviceAccountAdmin

## Usage

You can use Nimbus to generate the components required for workload identity federation via `nimbus gcp init`.

### Variables

Note: the default value for the `bitbucket_repo_ids` local in `wif.tf` is an empty list.
You have to update it with the UUIDs of the Bitbucket repositories that should have
access!
You can find more info in the Bitbucket documentation linked at the end.
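For illustration, setting the local could look like the snippet below; the UUID is hypothetical, so replace it with the UUIDs of your own repositories (including the curly braces):

```hcl
locals {
  # hypothetical repository UUID — replace with your own
  bitbucket_repo_ids = ["{a1b2c3d4-e5f6-7890-abcd-ef1234567890}"]
}
```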

All other variables are set by Nimbus to use the ML6 Bitbucket workspace by default. If you need to connect a different
workspace, you can check the descriptions of these variables in the generated `wif` module (`modules/wif/variables.tf`).

## Background information

If you want to learn more about the concepts behind workload identity federation, you can check out the following sources:

- [Chapter Conference Talk (April 2022)](https://docs.google.com/presentation/d/1liJi-QdurS1cJ2W57CYy0kbbSLx5GR8YlQPDJXcy3cw/edit?usp=sharing)
- [Google Documentation](https://cloud.google.com/iam/docs/workload-identity-federation)
- [Bitbucket Documentation](https://support.atlassian.com/bitbucket-cloud/docs/integrate-pipelines-with-resource-servers-using-oidc/)
- Google Cloud blogposts: [1](https://cloud.google.com/blog/products/identity-security/enable-keyless-access-to-gcp-with-workload-identity-federation) and [2](https://cloud.google.com/blog/products/identity-security/enabling-keyless-authentication-from-github-actions)

2 changes: 2 additions & 0 deletions infrastructure/terraform/backend/.gitignore
@@ -0,0 +1,2 @@
.terraform*
terraform.tfstate.backup
28 changes: 28 additions & 0 deletions infrastructure/terraform/backend/README.md
@@ -0,0 +1,28 @@
# Terraform backend infrastructure

This directory defines the infrastructure for the backend where Terraform stores its state. The
state of the backend cannot be stored in the backend itself, which is why it is separated from the
main Terraform configuration.

The state of the backend configuration is instead stored locally and can be tracked in git. This is
an acceptable solution since the backend should rarely be changed.

This means that you now have two separate Terraform configurations and states:

- The backend configuration: for which the state is stored locally and tracked in git.
- The main configuration: for which the state is stored remotely in the backend.

## Deployment

To deploy, run the following steps from this directory:

```commandline
terraform init
terraform apply --var-file ../environment/project.tfvars
```

## Tracking the state

The state of this infrastructure needs to be tracked separately by adding the
`terraform.tfstate` file to git. Besides the `main.tf` file and `README.md`, all other files can be
ignored. A `.gitignore` file is automatically generated to handle this.
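For example, after the initial `terraform apply`:

```commandline
git add main.tf README.md terraform.tfstate
git commit -m "Add Terraform backend configuration and state"
```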
44 changes: 44 additions & 0 deletions infrastructure/terraform/backend/main.tf
@@ -0,0 +1,44 @@
/******************************************
  Variables
 *****************************************/

variable "project" {
  description = "GCP project name"
  type        = string
}

variable "region" {
  description = "Default GCP region for resources"
  type        = string
  default     = "europe-west1"
}

variable "zone" {
  description = "Default GCP zone for resources"
  type        = string
  default     = "europe-west1-b"
}

/******************************************
  Google provider configuration
 *****************************************/

provider "google" {
  project = var.project
  region  = var.region
  zone    = var.zone
}

/******************************************
  State storage configuration
 *****************************************/

resource "google_storage_bucket" "terraform_state" {
  name                        = "${var.project}_terraform"
  location                    = var.region
  uniform_bucket_level_access = true

  versioning {
    enabled = true
  }
}
52 changes: 52 additions & 0 deletions infrastructure/terraform/backend/terraform.tfstate
@@ -0,0 +1,52 @@
{
  "version": 4,
  "terraform_version": "1.4.6",
  "serial": 1,
  "lineage": "7e6b8586-fe78-96ad-aaca-ba5027c2c6e4",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "google_storage_bucket",
      "name": "terraform_state",
      "provider": "provider[\"registry.terraform.io/hashicorp/google\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "autoclass": [],
            "cors": [],
            "custom_placement_config": [],
            "default_event_based_hold": false,
            "encryption": [],
            "force_destroy": false,
            "id": "boreal-array-387713_terraform",
            "labels": {},
            "lifecycle_rule": [],
            "location": "EUROPE-WEST1",
            "logging": [],
            "name": "boreal-array-387713_terraform",
            "project": "boreal-array-387713",
            "public_access_prevention": "inherited",
            "requester_pays": false,
            "retention_policy": [],
            "self_link": "https://www.googleapis.com/storage/v1/b/boreal-array-387713_terraform",
            "storage_class": "STANDARD",
            "timeouts": null,
            "uniform_bucket_level_access": true,
            "url": "gs://boreal-array-387713_terraform",
            "versioning": [
              {
                "enabled": true
              }
            ],
            "website": []
          },
          "sensitive_attributes": [],
          "private": "eyJlMmJmYjczMC1lY2FhLTExZTYtOGY4OC0zNDM2M2JjN2M0YzAiOnsiY3JlYXRlIjo2MDAwMDAwMDAwMDAsInJlYWQiOjI0MDAwMDAwMDAwMCwidXBkYXRlIjoyNDAwMDAwMDAwMDB9fQ=="
        }
      ]
    }
  ],
  "check_results": null
}