One of the key features of BigFlow is the full automation of the build and deployment process. BigFlow dockerizes your workflows and deploys them to Google Cloud Composer.
The BigFlow GCP runtime environment consists of two services: a Cloud Composer instance and a Docker Registry.
Typically, for one software project, teams use one or more GCP projects (for dev, test, and prod environments) and one long-running Composer instance per GCP project.
We recommend using a single Docker Registry instance per software project, shared by all environments. Docker images are heavy files, so pushing them only once to GCP greatly reduces subsequent deployment times (this is safe because images are immutable). Moreover, this approach ensures that artifacts are environment-independent.
There are two deployment artifacts:
- Airflow DAG files with workflow definitions,
- a Docker image with the workflow computation code.
During deployment, BigFlow uploads your DAG files to Composer's DAGs folder and pushes your Docker image to Docker Registry.
Read more about deployment artifacts in Project setup and build.
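For orientation, a typical build-and-deploy round trip from the command line looks roughly like this (the `dev` config name matches the example configuration shown later in this document):

```bash
# Build both artifacts: the Docker image and the Airflow DAG files.
bigflow build

# Push the image to the Docker Registry and upload the DAGs to Composer,
# using the "dev" deployment configuration.
bigflow deploy --config dev
```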
In short, Cloud Composer is a managed Airflow.
Unfortunately for Python users, Python libraries required by DAGs have to be installed manually on Composer. To make it worse, installing dependencies forces the Composer instance to restart. This not only takes time but often fails because of dependency clashes. In the worst case, you need to spawn a new Composer instance.
BigFlow fixes these problems by using Docker. Each of your jobs is executed in a stable and isolated runtime environment — a Docker container.
On GCP you execute Docker images on Kubernetes. BigFlow leverages the fact that each Composer instance stands on its own Google Kubernetes Engine (GKE) cluster, and reuses this cluster to run your jobs.
Before you start, you will need a GCP project and a service account. This is important: all permissions required by Composer itself and by your jobs have to be granted to this account.
We recommend using the default service account as the Composer's account. This account is created automatically for each GCP project. It has the following email:
Create a new Composer instance. Set only these properties (leave the others blank or default):

- Location — close to you,
- Machine type — `n1-standard-2` or higher (we recommend using `n1-standard-2`),
- Disk size (GB) — 50 is enough.
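If you prefer the command line, a roughly equivalent environment can be created with `gcloud`. This is a minimal sketch, assuming a Composer 1 environment; the name and location are placeholders, and Composer 2 environments are configured differently (they don't take a machine type):

```bash
# Minimal sketch: create a Composer 1 environment matching the form settings above.
# The environment name and location are placeholders; pin a Composer 1 image
# version with --image-version if your project defaults to Composer 2/3.
gcloud composer environments create my-first-composer \
    --location=europe-west1 \
    --machine-type=n1-standard-2 \
    --disk-size=50GB
```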
BigFlow-generated DAGs are compatible with Composer 1.X + Airflow 1.X and Composer >= 2.1.0 + Airflow 2.X.
That's it. Wait until the new Composer instance is ready. It should look like this:
Composer's DAGs Folder is a Cloud Storage bucket mounted to Airflow. This is the place where BigFlow uploads your DAG files.
Go to the Composer's DAGs folder and note the bucket's name (here `europe-west1-my-first-compo-ba6e3418-bucket`).
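You can also read the DAGs bucket from the command line. A small sketch, assuming the environment is named `my-first-composer` and lives in `europe-west1`:

```bash
# Prints the gs:// prefix of the Composer DAGs folder; the bucket name is the host part.
gcloud composer environments describe my-first-composer \
    --location=europe-west1 \
    --format="value(config.dagGcsPrefix)"
```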
Put this bucket name into the `dags_bucket` property in your `deployment_config.py`. For example:

```python
'dags_bucket': 'europe-west1-my-first-compo-ba6e3418-bucket'
```
Create the `env` variable in the Airflow web UI. It is used by BigFlow to select the proper configuration from the `Config` objects in your project.
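If you prefer not to click through the UI, the same variable can be set via `gcloud`. A sketch, assuming an Airflow 2.x environment named `my-first-composer` in `europe-west1`:

```bash
# Sets the Airflow variable "env" to "dev" (Airflow 2.x CLI syntax).
gcloud composer environments run my-first-composer \
    --location=europe-west1 \
    variables set -- env dev
```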
Docker Registry is a repository where Docker images are stored.
We recommend using Google Cloud Artifact Registry because it integrates seamlessly with Composer.
One Artifact Registry can host many image repositories. We recommend having one image repository per BigFlow project.
You don't need to create repositories explicitly; they are merely namespaces.
All you need is to put the full repository name into the `docker_repository` property in `deployment_config.py`. For example:

```python
'docker_repository': 'europe-west1-docker.pkg.dev/my_gcp_dev_project/my-repo-name/my-bigflow-project'
```
Ensure that your Composers have permission to pull images from the Registry. If a Composer's service account is a default service account and it pulls from an Artifact Registry located in the same GCP project, it has the pull permission by default. Otherwise, you have to grant read permission on the bucket that underlies your Registry (Storage Object Viewer is enough).
Read more about Artifact Registry access control.
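With Artifact Registry, the read grant can also be expressed on the repository itself rather than on a bucket. A hedged sketch; the repository name, location, project, and service account email are placeholders:

```bash
# Grants the Artifact Registry reader role on the repository to the Composer's service account.
# All names below are placeholders.
gcloud artifacts repositories add-iam-policy-binding my-repo-name \
    --location=europe-west1 \
    --project=my_gcp_dev_project \
    --member="serviceAccount:123456789-compute@developer.gserviceaccount.com" \
    --role="roles/artifactregistry.reader"
```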
Since BigFlow 1.9.0, Artifact Registry is used instead of Container Registry due to the latter's deprecation. Please consult the Google Cloud docs for more information:
- Container Registry deprecation
- Prepare for Container Registry shutdown
- Transition from Container Registry
Deployment requires configuration properties. You can pass them to BigFlow directly as command-line arguments, but it's better to save them in a `deployment_config.py` file. We recommend this approach (called Configuration as Code) for local development and CI/CD.
The `deployment_config.py` file can be placed in the main folder of your project. It has to contain a `bigflow.Config` object stored in the `deployment_config` variable.
The following properties are read by BigFlow from `deployment_config.py` if not provided as command-line arguments:

- `gcp_project_id` — Composer's project ID,
- `dags_bucket` — Composer's DAGs Folder,
- `docker_repository` — full name of a Docker repository,
- `vault_endpoint` — a Vault endpoint used to obtain an OAuth token, used only if authentication with Vault is chosen.
Here is the recommended structure of the `deployment_config.py` file:

```python
from bigflow import Config

deployment_config = Config(
    name='dev',
    properties={
        'gcp_project_id': 'my_gcp_dev_project',
        'docker_repository_project': '{gcp_project_id}',
        'docker_repository': 'europe-west1-docker.pkg.dev/{docker_repository_project}/my-repository-name/my-bigflow-project',
        'vault_endpoint': 'https://example.com/vault',
        'dags_bucket': 'europe-west1-my-first-compo-ba6e3418-bucket',
    },
).add_configuration(
    name='prod',
    properties={
        'gcp_project_id': 'my_gcp_prod_project',
        'dags_bucket': 'europe-west1-my-first-compo-1111111-bucket',
    },
)
```
Having that, you can run the extremely concise `deploy` command, for example:

```bash
bigflow deploy-dags --config dev
```
BigFlow supports two GCP authentication methods: local account and service account.
The local account method is used mostly for local development.
It relies on a local user `gcloud` account, which is typically your personal account.
A service account can also be used locally if you have installed its credentials (see authenticating as a service account).
Check if a local account is authenticated by typing:

```bash
gcloud info
```
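If no account is authenticated, you can log in with the standard `gcloud` auth commands (the second one sets up Application Default Credentials, which client libraries and tools pick up):

```bash
# Log the gcloud CLI in with your personal account.
gcloud auth login

# Set up Application Default Credentials for client libraries and tools.
gcloud auth application-default login
```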
Authentication with Vault is designed to automate your BigFlow deployments from CI/CD servers when you can't (or don't want to) install service account credentials on them. It is based on the Vault secrets engine. Think of Vault as an additional layer of indirection between your code and GCP service accounts.
Key concepts:
- Vault secrets engine — an additional server (not provided by GCP) used by BigFlow to authenticate as a given service account.
- service account — a technical account, not intended to be used by humans, which should have the permissions required to execute a given BigFlow command.
- Vault endpoint — a REST endpoint with a unique URL, exposed by Vault for each service account. When queried (GET), it generates a short-lived OAuth token.
- secret Vault-Token — a secret that protects access to a Vault endpoint. It shouldn't be stored in Git.
To use authentication with Vault, you have to pass two configuration parameters to the BigFlow CLI: `vault_endpoint` and `vault_secret`. While the `vault_endpoint` parameter can (and should) be stored in `deployment_config.py`, `vault_secret` shouldn't be stored in Git. We recommend keeping it encrypted on your CI/CD server.
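For illustration, a CI/CD step could pass the secret at deploy time. This is only a sketch: the exact flag names (`--auth-method`, `--vault-secret`) are assumptions, so verify them with `bigflow deploy-dags --help` for your BigFlow version:

```bash
# Hypothetical CI/CD step; flag names are assumptions, check `bigflow deploy-dags --help`.
# VAULT_SECRET should come from the CI server's encrypted secret store, not from Git.
bigflow deploy-dags \
    --config prod \
    --auth-method vault \
    --vault-secret "$VAULT_SECRET"
```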
Deployment means uploading various files to Cloud Storage buckets:
- Docker images are pushed to Artifact Registry, which requires a predefined writer role,
- DAG files are uploaded to the bucket behind Composer's DAGs Folder; write and delete access to this bucket is required.
In both cases, we recommend granting the project-level Storage Object Admin role to a user account or a service account used for deployment.
Of course, you can also grant bucket-level access only to these two buckets.
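For example, the project-level grant could look like this; the project ID and the deployer service account email are hypothetical:

```bash
# Grants the project-level Storage Object Admin role to the deployment account.
# The project ID and service account email are placeholders.
gcloud projects add-iam-policy-binding my_gcp_dev_project \
    --member="serviceAccount:deployer@my_gcp_dev_project.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"
```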
If you want to run a Dataflow process, you need to prepare a Cloud Storage bucket for the two required folders:
- The `staging_location` folder, which Dataflow uses to store all the assets needed to run a job.
- The `temp_location` folder, which Dataflow uses to store temporary files during the execution.

You only need to create a bucket; Dataflow creates the required folders in the specified bucket.
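If you'd rather use the command line, a single `gsutil` call is enough; the bucket name and location below are placeholders:

```bash
# Creates a regional bucket for Dataflow's staging_location and temp_location folders.
# Bucket names are global, so pick a unique name.
gsutil mb -l europe-west1 gs://my-bigflow-project-dataflow
```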
Alternatively, you can create the bucket in the console: go to the Cloud Storage browser and choose the project where you run your Dataflow pipeline. Next, click the "Create Bucket" button to open the bucket creation form.
Since the created bucket isn't cleaned automatically, it may lead to significant costs. To avoid this, you can use Object Lifecycle Management to automatically remove data after some time or move it to a cheaper storage class. You can also set it up using Terraform; a gsutil-based sketch follows the form walkthrough below.
Provide a unique ID for the bucket and choose the location closest to you. You can leave the rest of the form fields at their defaults. After you hit the "Create" button, your bucket is ready.
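A minimal sketch of such a lifecycle rule, using `gsutil` and an arbitrary 30-day retention (the bucket name is a placeholder):

```bash
# lifecycle.json: delete objects older than 30 days (the age is an arbitrary example).
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30}
    }
  ]
}
EOF

# Apply the lifecycle configuration to the Dataflow bucket (placeholder name).
gsutil lifecycle set lifecycle.json gs://my-bigflow-project-dataflow
```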