Deployment

One of the key features of BigFlow is the full automation of the build and deployment process. BigFlow dockerizes your workflows and deploys them to Google Cloud Composer.

GCP runtime environment

BigFlow GCP runtime environment consists of two services:

  1. Google Cloud Composer,
  2. Docker Registry.

Typically, for one software project, teams use one or more GCP projects (for dev, test, and prod environments) and one long-running Composer instance per GCP project.

We recommend using a single Docker Registry instance per software project, shared by all environments. Docker images are heavy files, so pushing them to GCP only once greatly reduces subsequent deployment time (this is safe because images are immutable). Moreover, this approach ensures that artifacts are environment-independent.

There are two deployment artifacts:

  1. Airflow DAG files with workflow definitions,
  2. a Docker image with the workflow computation code.

During deployment, BigFlow uploads your DAG files to Composer's DAGs folder and pushes your Docker image to Docker Registry.

Read more about deployment artifacts in Project setup and build.
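As a quick sketch of the whole flow (assuming the bigflow build and bigflow deploy commands described in Project setup and build, and a deployment_config.py like the one shown later in this document):

# Build the deployment artifacts: DAG files and the Docker image.
bigflow build

# Upload the DAG files to Composer's DAGs folder and push the image to the Docker repository.
# --config selects an environment from deployment_config.py, like in the deploy-dags example below.
bigflow deploy --config dev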

Cloud Composer

Put simply, Cloud Composer is a managed Airflow.

Unfortunately for Python users, Python libraries required by DAGs have to be installed manually on Composer. To make things worse, installing dependencies forces a Composer instance to restart. This not only takes time but often fails because of dependency clashes. In the worst case, you need to spawn a new Composer instance.

BigFlow fixes these problems by using Docker. Each of your jobs is executed in a stable and isolated runtime environment — a Docker container.

On GCP you execute Docker images on Kubernetes. BigFlow leverages the fact that each Composer instance stands on its own Google Kubernetes Engine (GKE) cluster and reuses that cluster to run your jobs.

Composer's service account

Before you start, you will need a GCP project and a service account.

This is important: all permissions required by Composer itself and by your jobs have to be granted to this account.

We recommend using the default service account as the Composer's account. This account is created automatically for each GCP project and has an email of the form <project-number>-compute@developer.gserviceaccount.com.
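If you are unsure which service accounts exist in your project, you can list them with plain gcloud (nothing BigFlow-specific here; the project ID is a placeholder):

# The Compute Engine default account is the one ending with
# -compute@developer.gserviceaccount.com.
gcloud iam service-accounts list --project my_gcp_dev_project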

Setting up a Composer Instance

Create a new Composer instance. Set only these properties (leave the others blank or at their defaults):

  • Location — close to you,
  • Machine type — n1-standard-2 or higher (we recommend using n1-standard-2),
  • Disk size (GB) — 50 is enough.

BigFlow-generated DAGs are compatible with Composer 1.X + Airflow 1.X and with Composer >= 2.1.0 + Airflow 2.X.
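The same environment can also be created from the command line; a minimal sketch in the Composer 1 style, mirroring the properties listed above (the environment name and location are placeholders, and Composer 2 uses --image-version instead of machine and disk settings):

gcloud composer environments create my-first-composer \
    --location europe-west1 \
    --machine-type n1-standard-2 \
    --disk-size 50GB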

That's it. Wait until the new Composer instance is ready. It should look like this:

brand_new_composer

Composer's DAGs Folder

Composer's DAGs Folder is a Cloud Storage bucket mounted to Airflow. This is the place where BigFlow uploads your DAG files.

Go to the Composer's DAGs folder:

dags_bucket

and note the bucket's name (here europe-west1-my-first-compo-ba6e3418-bucket).

Put this bucket name into the dags_bucket property in your deployment_config.py. For example:

'dags_bucket': 'europe-west1-my-first-compo-ba6e3418-bucket'
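The bucket name can also be read from the environment description with gcloud (environment name and location are placeholders); the command returns a gs://<bucket>/dags prefix, and dags_bucket needs only the bucket part:

gcloud composer environments describe my-first-composer \
    --location europe-west1 \
    --format="value(config.dagGcsPrefix)"
# e.g. gs://europe-west1-my-first-compo-ba6e3418-bucket/dags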

Airflow Variables

Create the env variable in the Airflow web UI:

airflow_env_variable

It is used by BigFlow to select the proper configuration from Config objects in your project.
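If you prefer not to click through the UI, the variable can usually be set through the Airflow CLI wrapped by gcloud; a sketch for Airflow 2 (environment name and location are placeholders, and the sub-command syntax differs slightly on Airflow 1):

# Set the Airflow variable "env" to "dev".
gcloud composer environments run my-first-composer \
    --location europe-west1 \
    variables set -- env dev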

Docker Registry

Docker Registry is a repository where Docker images are stored.

We recommend using Google Cloud Artifact Registry because it integrates seamlessly with Composer.

Docker repository name

One Artifact Registry can host many image repositories. We recommend having one image repository per BigFlow project.

You don't need to create image repositories explicitly; they are merely namespaces. All you need is to put the full repository name into the docker_repository property in deployment_config.py. For example:

'docker_repository': 'europe-west1-docker.pkg.dev/my_gcp_dev_project/my-repo-name/my-bigflow-project'

Docker Registry permissions

Ensure that your Composers have permission to pull images from a Registry.

If the Composer's service account is the default service account and it pulls from an Artifact Registry located in the same GCP project, it has the pull permission by default.

Otherwise, you have to grant read access to the Registry; for Artifact Registry, the Artifact Registry Reader role is enough.
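One way to do that is a project-level grant of the Artifact Registry Reader role; a sketch (the project ID and service account email are placeholders):

# Allow the Composer service account to pull images from Artifact Registry.
gcloud projects add-iam-policy-binding my_gcp_dev_project \
    --member="serviceAccount:123456789-compute@developer.gserviceaccount.com" \
    --role="roles/artifactregistry.reader"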

Read more about Artifact Registry access control.

Container Registry deprecation

Since BigFlow 1.9.0, Artifact Registry is used instead of Container Registry due to the latter's deprecation. Please consult the Google Cloud docs for more information.

Managing configuration in deployment_config.py

Deployment requires configuration properties. You can pass them to BigFlow directly as command-line arguments, but it's better to save them in a deployment_config.py file. We recommend this approach (called Configuration as Code) for local development and CI/CD.

The deployment_config.py can be placed in the main folder of your project. It has to contain a bigflow.Config object stored in the deployment_config variable.

The following properties are read by BigFlow from deployment_config.py if not provided as command-line arguments:

  1. gcp_project_id — Composer's project ID,
  2. dags_bucket — Composer's DAGs Folder,
  3. docker_repository — full name of a Docker repository,
  4. vault_endpoint — a Vault endpoint used to obtain an OAuth token; used only if authentication with Vault is chosen.

Here is the recommended structure of the deployment_config.py file:

from bigflow import Config
deployment_config = Config(name='dev',                    
                           properties={
                               'gcp_project_id': 'my_gcp_dev_project',
                               'docker_repository_project': '{gcp_project_id}',
                               'docker_repository': 'europe-west1-docker.pkg.dev/{docker_repository_project}/my-repository-name/my-bigflow-project',
                               'vault_endpoint': 'https://example.com/vault',
                               'dags_bucket': 'europe-west1-my-first-compo-ba6e3418-bucket'
                           })\
        .add_configuration(name='prod', properties={
                               'gcp_project_id': 'my_gcp_prod_project',
                               'dags_bucket': 'europe-west1-my-first-compo-1111111-bucket'})

With that in place, you can run a concise deploy command, for example:

bigflow deploy-dags --config dev

Authentication methods

BigFlow supports two GCP authentication methods: local account and service account.

Local Account Authentication

The local account method is used mostly for local development. It relies on your local gcloud user account, which is typically your personal account. A service account can also be used locally if you have installed its credentials (see authenticating as a service account).

Check if a local account is authenticated by typing:

gcloud info
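If no account is listed, you can authenticate with standard gcloud commands (not BigFlow-specific):

# Log in with your personal account...
gcloud auth login
# ...and set up Application Default Credentials, which client libraries use.
gcloud auth application-default login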

Authentication with Vault

Authentication with Vault is designed to automate BigFlow deployment from CI/CD servers when you can't (or don't want to) install service account credentials on them. It is based on the Vault secrets engine. Think of Vault as an additional layer of indirection between your code and GCP service accounts.

Key concepts:

  • Vault secrets engine — an additional server (not provided by GCP) used by BigFlow to authenticate as a given service account.
  • service account — a technical account not intended to be used by humans; it should have the permissions required to execute a given BigFlow command.
  • Vault endpoint — a REST endpoint with a unique URL, exposed by Vault for each service account. When queried (GET), it generates a short-lived OAuth token.
  • secret Vault-Token — a secret that protects access to a Vault endpoint. It shouldn't be stored in Git.

Vault integration

To use authentication with Vault, you have to pass two configuration parameters to the BigFlow CLI: vault_endpoint and vault_secret. While the vault_endpoint parameter can (and should) be stored in deployment_config.py, vault_secret shouldn't be stored in Git. We recommend keeping it encrypted on your CI/CD server.
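As a sketch of a CI/CD invocation (this assumes the BigFlow CLI exposes --auth-method and --vault-secret options for the deploy commands; check bigflow deploy-dags --help in your version):

# Deploy DAGs, authenticating through Vault; the secret comes from a CI variable.
bigflow deploy-dags \
    --config dev \
    --auth-method vault \
    --vault-secret "$VAULT_SECRET"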

Deployment permission

Deployment means uploading various files to Cloud Storage buckets:

  1. Docker images are pushed to Artifact Registry, which requires the predefined writer role,
  2. DAG files are uploaded to the bucket behind Composer's DAGs Folder, which requires write and delete access to this bucket.

We recommend granting project-level roles to the user account or service account used for deployment: Storage Object Admin for the DAGs bucket and Artifact Registry Writer for pushing images.

Of course, you can also scope access more narrowly, for example to the DAGs bucket and the Docker repository only.
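A sketch of the project-level grants for a dedicated deployment account (the project ID and account email are placeholders):

# Write DAG files to Composer's DAGs bucket...
gcloud projects add-iam-policy-binding my_gcp_dev_project \
    --member="serviceAccount:deployer@my_gcp_dev_project.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

# ...and push Docker images to Artifact Registry.
gcloud projects add-iam-policy-binding my_gcp_dev_project \
    --member="serviceAccount:deployer@my_gcp_dev_project.iam.gserviceaccount.com" \
    --role="roles/artifactregistry.writer"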

Dataflow

If you want to run a Dataflow process, you need to prepare a Cloud Storage bucket for the two required folders:

  • The staging_location folder, which Dataflow uses to store all the assets needed to run a job.
  • The temp_location folder, which Dataflow uses to store temporary files during the execution.

You only need to create the bucket; Dataflow creates the required folders in it.
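The bucket can be created either in the console, as described below, or from the command line; a sketch (the bucket name and location are placeholders):

# Create a regional bucket for Dataflow staging and temp files.
gsutil mb -l europe-west1 gs://my-bigflow-dataflow-bucket

Your staging_location and temp_location would then point at paths inside this bucket, for example gs://my-bigflow-dataflow-bucket/staging and gs://my-bigflow-dataflow-bucket/temp.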

To create a bucket, go to the Cloud Storage browser and choose the project where you run your Dataflow pipeline. Next, click the "Create Bucket" button to open the bucket creation form.

Since the created bucket isn't cleaned automatically, it may lead to significant costs. To avoid this, you can use Object Lifecycle Management to automatically remove data after some time or move it to a cheaper storage class. You can also set this up using Terraform.
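A sketch of the lifecycle approach with gsutil (the 30-day threshold is only an example):

# Delete objects older than 30 days from the Dataflow bucket.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 30}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://my-bigflow-dataflow-bucket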

Dataflow bucket

Provide a unique ID for the bucket and choose the location closest to you. You can leave the rest of the form fields at their defaults. After you hit the "Create" button, your bucket is ready.