docs: Add recommended GCP roles and privileges
nj1973 committed Jul 18, 2024
1 parent 6130e07 commit 529eedc
Showing 2 changed files with 111 additions and 16 deletions.
21 changes: 12 additions & 9 deletions README.md
@@ -7,9 +7,9 @@ At present GOE is a command line tool. Alongside installing the Python package w
# Offload Home
In addition to the GOE software we need a supporting directory tree called the Offload Home. This is identified using the `OFFLOAD_HOME` environment variable. In this directory we keep configuration files, logs and the GOE software if you choose not to run scripts directly out of the cloned repo. The Offload Home will also typically contain a Python virtual environment into which the GOE package and its dependencies will be installed. You can run these out of the repository directory but, for separation of duties purposes, may choose to keep the source code away from users of the tool.
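
As a minimal illustration (the path below is an assumption, not a requirement), the Offload Home can be prepared like this:
```
# Example location only; choose a path that suits your environment
export OFFLOAD_HOME=/opt/goe/offload
mkdir -p ${OFFLOAD_HOME}
```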

# Installation from a package
# Installation From a Package

## Installing the GOE Python package
## Installing the GOE Python Package
1) Copy the package to the target host; this may not be the same host the repository was cloned to.
2) Create the Offload Home directory, for example:
```
@@ -37,7 +37,7 @@ GOE_WHEEL="goe_framework-<goe-version>-py3-none-any.whl"
python3 -m pip install lib/${GOE_WHEEL}
```

## Configuration file
## Configuration File
Create `offload.env` in the Offload Home; this file contains the necessary configuration specific to your environment:
```
cp ${OFFLOAD_HOME}/conf/oracle-bigquery-offload.env.template ${OFFLOAD_HOME}/conf/offload.env
@@ -64,7 +64,7 @@ If using Dataproc Batches to provide Spark:
- GOOGLE_DATAPROC_REGION
- GOOGLE_DATAPROC_BATCHES_SUBNET
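
For illustration, the corresponding `offload.env` entries might look like the following; the values are placeholders only and should be replaced to match your project:
```
GOOGLE_DATAPROC_REGION=<your-region>
GOOGLE_DATAPROC_BATCHES_SUBNET=<your-subnet>
```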

## Install database objects
## Install Database Objects
To install the supporting database objects you need access to a database admin account that can create users, grant them system privileges and create objects in the newly created schemas. SYSTEM is used in the example below but this is *not* a necessity:
```
cd ${OFFLOAD_HOME}/setup
@@ -78,11 +78,11 @@ alter user goe_adm identified by ...;
alter user goe_app identified by ...;
```

# Building a custom package
# Building a Custom Package

If you want to test with the latest commits that have not yet been included in a GitHub release, you can build a custom package from the repository.

## Prepare the host/cloned repository
## Prepare the Host/Cloned Repository
Debian prerequisites:
```
sudo apt-get -y install make python3-venv
@@ -101,13 +101,13 @@ curl -fL https://github.com/coursier/coursier/releases/latest/download/cs-x86_64
. ~/.bash_profile
```

## Make a GOE package
## Make a GOE Package
To create a package which contains all required artifacts for running GOE commands, use the `make` target below:
```
make clean && make package
```

# Install for development
# Install for Development
To create a Python virtual environment and install all required dependencies into the repository directory:
```
make clean && make install-dev
@@ -122,7 +122,10 @@ Note only the Python dependencies for Oracle and BigQuery are installed by defaul
make install-dev-extras
```

# Running commands
# Supporting Infrastructure
GOE requires access to cloud storage and Spark. For a Google Cloud installation, this is described here: [Google Cloud Platform Setup for a BigQuery Target](docs/gcp_setup.md)

# Running Commands
Activate the GOE Python virtual environment:
```
source ./.venv/bin/activate
```
106 changes: 99 additions & 7 deletions docs/gcp_setup.md
@@ -4,15 +4,22 @@

This page details the Google Cloud components required, with recommended minimal privileges, to use GOE in your GCP project.

## Service Account
### Service Account

A service account should be provisioned from the GCP project. This service account can be used by any service that will execute GOE commands; for example, it could be attached to a GCE virtual machine.

## Cloud Storage Bucket
### Cloud Storage Bucket

A cloud storage bucket is required to stage data before ingesting it into BigQuery. Ensure the bucket is in a location compatible with the target BigQuery dataset.

## Roles
### Dataproc (Spark)

For non-trivially sized tables, GOE uses Spark to copy data from the source database to cloud storage. In a GCP setting this is likely to be provided by one of two services:

1. Dataproc Batches
1. Dataproc

### Roles

The role names below are used throughout this page but can be changed to suit company policies. These roles will provide adequate access to stage data in cloud storage and load it into BigQuery.

@@ -30,10 +37,12 @@ The role names below are used throughout this page but can be changed to suit co
| `goe_dataproc_role` | N | Permissions to interact with a permanent Dataproc cluster. |
| `goe_batches_role` | N | Permissions to interact with Dataproc Batches service. |

## Compute Engine Virtual Machine
### Compute Engine Virtual Machine

To work interactively with GOE you need to be able to run commands with appropriate permissions. Rather than downloading service account keys, we believe it is better to attach the service account to a GCE virtual machine and run all commands from there.

The virtual machine does not need to be heavily resourced; most of the heavy lifting is done by Spark and BigQuery.

## Example Commands

These commands are examples that can be used to create the components described above.
@@ -44,9 +53,10 @@ Note the location below must be compatible with the BigQuery dataset location.

```
PROJECT=<your-project>
REGION=<your-region>
SVC_ACCOUNT=<your-service-account-name>
BUCKET=<your-bucket>
LOCATION=<your-location>
LOCATION=<your-location> # EU, US or ${REGION}
TARGET_DATASET=<your-target-dataset>
```

@@ -69,6 +79,60 @@ gcloud storage buckets create gs://${BUCKET} --project ${PROJECT} \
--uniform-bucket-level-access
```

### Dataproc Batches

Optional commands if using Dataproc Batches.

```
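# SUBNET is assumed to be defined as a placeholder, for example:
SUBNET=<your-subnet>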
gcloud compute networks subnets update ${SUBNET} \
--project=${PROJECT} --region=${REGION} \
--enable-private-ip-google-access
gcloud projects add-iam-policy-binding ${PROJECT} \
--member=serviceAccount:${SVC_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com \
--role=roles/dataproc.worker
```

### Dataproc

Optional commands if using Dataproc.

Enable required services:
```
gcloud services enable dataproc.googleapis.com --project ${PROJECT}
gcloud services enable iamcredentials.googleapis.com --project=${PROJECT}
```

Values supplied below are examples only; changes will likely be required for each use case:
```
SUBNET=<your-subnet>
CLUSTER_NAME=<cluster-name>
DP_SVC_ACCOUNT=goe-dataproc
ZONE=<your-zone>
gcloud iam service-accounts create ${DP_SVC_ACCOUNT} \
--project=${PROJECT} \
--description="GOE Dataproc service account"
gcloud projects add-iam-policy-binding ${PROJECT} \
--member=serviceAccount:${DP_SVC_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com \
--role=roles/dataproc.worker
gcloud compute networks subnets update ${SUBNET} \
--project=${PROJECT} --region=${REGION} \
--enable-private-ip-google-access
gcloud dataproc clusters create ${CLUSTER_NAME} \
--project ${PROJECT} --region ${REGION} --zone ${ZONE} \
--bucket ${BUCKET} \
--subnet projects/${PROJECT}/regions/${REGION}/subnetworks/${SUBNET} \
--no-address \
--single-node \
--service-account=${DP_SVC_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com \
--master-machine-type n2-standard-16 --master-boot-disk-size 1000 \
--image-version 2.1-debian11
```
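
To confirm the cluster reached a running state, a standard `gcloud` check such as the one below can be used; this is optional and not GOE-specific:
```
gcloud dataproc clusters describe ${CLUSTER_NAME} \
  --project ${PROJECT} --region ${REGION} \
  --format="value(status.state)"
```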

### Roles

#### goe_gcs_role
@@ -140,7 +204,6 @@ gcloud iam roles update goe_bq_app_role --project ${PROJECT} \
--add-permissions=bigquery.tables.delete
```


#### goe_bq_stg_role
Note that the role grant is bound to the staging BigQuery dataset (which has the same name as the target dataset but with an "_load" suffix); no project-wide access is granted. The `bq` utility is used to grant the role because `gcloud` does not support these granular grants.

@@ -154,12 +217,41 @@ bigquery.tables.delete,\
bigquery.tables.getData \
--stage=GA
ROLE=
echo "GRANT \`projects/${PROJECT}/roles/goe_bq_stg_role\` ON SCHEMA ${TARGET_DATASET}_load
TO \"serviceAccount:${SVC_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com\";
" | bq query --project_id=${PROJECT} --nouse_legacy_sql --location=${LOCATION}
```

#### goe_dataproc_role
```
gcloud iam roles create goe_dataproc_role --project ${PROJECT} \
--title="GOE Dataproc Access" --description="GOE Dataproc Access" \
--permissions=dataproc.clusters.get,dataproc.clusters.use,\
dataproc.jobs.create,dataproc.jobs.get,\
iam.serviceAccounts.getAccessToken \
--stage=GA
gcloud projects add-iam-policy-binding ${PROJECT} \
--member=serviceAccount:${SVC_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com \
--role=projects/${PROJECT}/roles/goe_dataproc_role
gcloud projects add-iam-policy-binding ${PROJECT} \
--member=serviceAccount:${SVC_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com \
--role=roles/iam.serviceAccountUser
```

#### goe_batches_role
```
gcloud iam roles create goe_batches_role --project ${PROJECT} \
--title="GOE Dataproc Access" --description="GOE Dataproc Access" \
--permissions=dataproc.batches.create,dataproc.batches.get \
--stage=GA
gcloud projects add-iam-policy-binding ${PROJECT} \
--member=serviceAccount:${SVC_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com \
--role=projects/${PROJECT}/roles/goe_batches_role
```

## Compute Engine Virtual Machine
Values supplied below are examples only; changes will likely be required for each use case:
```
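# Illustrative sketch only: instance name, machine type and zone are assumptions, not recommendations
VM_NAME=<your-vm-name>
ZONE=<your-zone>
gcloud compute instances create ${VM_NAME} \
  --project=${PROJECT} --zone=${ZONE} \
  --machine-type=e2-standard-4 \
  --subnet=projects/${PROJECT}/regions/${REGION}/subnetworks/${SUBNET} \
  --service-account=${SVC_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com \
  --scopes=https://www.googleapis.com/auth/cloud-platform
```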
