A Terraform module to programmatically deploy end-to-end ELT flows to BigQuery on Airbyte. It supports custom sources, integrates with Secret Manager to keep sensitive configuration out of your code, and lets you specify flows as YAML.
- Terraform. Tested with v1.5.3. Install Terraform
- An authenticated gcloud CLI
- Install the gcloud CLI
```sh
gcloud init
gcloud auth application-default login
```
- An up and running Airbyte instance on GCP
- GCP permissions
- Broad roles that will work, but are not recommended for service accounts or even human users:
roles/owner
roles/editor
- Recommended roles that respect the principle of least privilege:
roles/bigquery.dataOwner
roles/secretmanager.admin
roles/storage.admin
- Granular permissions needed to build a custom role specific to this deployment (see the sketch after this list):
bigquery.datasets.create
bigquery.datasets.delete
bigquery.datasets.update
secretmanager.secrets.create
secretmanager.secrets.delete
secretmanager.versions.add
secretmanager.versions.destroy
secretmanager.versions.enable
storage.buckets.create
storage.buckets.delete
storage.buckets.getIamPolicy
storage.buckets.setIamPolicy
storage.hmacKeys.create
storage.hmacKeys.delete
storage.hmacKeys.update
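As an illustration (not part of the module), these permissions could be bundled into a custom role with the `google_project_iam_custom_role` resource. The project ID and role ID below are placeholders:

```hcl
# Sketch: a custom role grouping the granular permissions listed above.
resource "google_project_iam_custom_role" "airbyte_flows_deployer" {
  project = "your-project-id"      # placeholder
  role_id = "airbyteFlowsDeployer" # placeholder
  title   = "Airbyte flows deployer"
  permissions = [
    "bigquery.datasets.create",
    "bigquery.datasets.delete",
    "bigquery.datasets.update",
    "secretmanager.secrets.create",
    "secretmanager.secrets.delete",
    "secretmanager.versions.add",
    "secretmanager.versions.destroy",
    "secretmanager.versions.enable",
    "storage.buckets.create",
    "storage.buckets.delete",
    "storage.buckets.getIamPolicy",
    "storage.buckets.setIamPolicy",
    "storage.hmacKeys.create",
    "storage.hmacKeys.delete",
    "storage.hmacKeys.update",
  ]
}
```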
Go to the examples directory to view all the code samples.
Get started with the module through a minimal flow example.
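The following is a minimal sketch, assuming the PokeAPI source from the Airbyte catalog; the project ID, service account, dataset, and bucket names are placeholders:

```hcl
module "airbyte_flows" {
  source  = "artefactory/airbyte-flows/google"
  version = "~> 0"

  project_id                    = "your-project-id"                                 # placeholder
  airbyte_service_account_email = "airbyte@your-project-id.iam.gserviceaccount.com" # placeholder

  flows_configuration = {
    pokeapi_to_bigquery = {
      flow_name   = "PokeAPI to BigQuery"
      source_name = "PokeAPI"

      source_specification = {
        pokemon_name = "ditto"
      }

      tables_to_sync = {
        pokemon = {} # defaults: full_refresh / append
      }

      destination_specification = {
        dataset_name        = "pokeapi_data"      # existing dataset (placeholder)
        dataset_location    = "EU"
        staging_bucket_name = "my-staging-bucket" # existing bucket (placeholder)
      }
    }
  }
}
```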
Most sources need to be configured with secrets (DB passwords, API keys, tokens, etc.). This example shows how to configure the module to fetch secret values from GCP Secret Manager so you don't have to hard-code them in your configuration.
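For illustration, the sketch below (to be passed as `flows_configuration` in the module call; connection details and the secret name `postgres_password` are placeholders) references a Secret Manager secret instead of a literal password:

```hcl
flows_configuration = {
  postgres_to_bigquery = {
    flow_name   = "Postgres to BigQuery"
    source_name = "Postgres"

    source_specification = {
      host     = "10.1.2.3"    # placeholder
      port     = "5432"
      database = "my_database" # placeholder
      username = "airbyte"
      # Instead of a hard-coded value, the module fetches the secret
      # named "postgres_password" from Secret Manager at deployment time.
      password = "secret:postgres_password"
    }

    tables_to_sync = {
      customers = {}
    }

    destination_specification = {
      dataset_name        = "postgres_replica"  # existing dataset (placeholder)
      dataset_location    = "EU"
      staging_bucket_name = "my-staging-bucket" # existing bucket (placeholder)
    }
  }
}
```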
If the source you want to integrate is not in the Airbyte catalog, you can create a custom connector and use it in the module.
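A hypothetical flow using a custom connector image could look like the following; the source name, image repository, tag, documentation URL, and specification keys are placeholders:

```hcl
flows_configuration = {
  custom_source_to_bigquery = {
    flow_name   = "My custom source to BigQuery"
    source_name = "my-custom-source" # not in the Airbyte catalog

    # Tells Airbyte which connector image to pull.
    custom_source = {
      docker_repository = "eu.gcr.io/your-project-id/source-custom" # placeholder
      docker_image_tag  = "0.1.0"                                   # placeholder
      documentation_url = "https://example.com/source-custom-docs"  # placeholder
    }

    source_specification = {
      api_key = "secret:custom_source_api_key" # placeholder
    }

    tables_to_sync = {
      events = {}
    }

    destination_specification = {
      dataset_name        = "custom_source_data" # placeholder
      dataset_location    = "EU"
      staging_bucket_name = "my-staging-bucket"  # placeholder
    }
  }
}
```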
You can set your ELT pipelines to run on a cron schedule by setting `cron_schedule` and, optionally, `cron_timezone`.
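For example, inside a flow definition (values are illustrative):

```hcl
# Runs the sync every day at 12:00 PM, Paris time.
cron_schedule = "0 0 12 * * ?"
cron_timezone = "Europe/Paris"
```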
This module is designed to be compatible with external YAML configuration files. It is a convenient way for users not proficient in Terraform to specify/modify ELT pipelines programmatically, or to integrate this module with other tools that can generate YAML files.
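One way to wire this up is to decode the YAML file when calling the module; a sketch assuming a `flows.yaml` file next to your Terraform code that mirrors the `flows_configuration` structure:

```hcl
module "airbyte_flows" {
  source  = "artefactory/airbyte-flows/google"
  version = "~> 0"

  project_id                    = local.project_id
  airbyte_service_account_email = local.airbyte_service_account

  # Decode the YAML file into the map structure the module expects.
  # "flows.yaml" is a placeholder path.
  flows_configuration = yamldecode(file("${path.module}/flows.yaml"))
}
```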
The module is highly opinionated to reduce the design burden on users. Within a few minutes or hours, you should be able to build data flows from your sources to BigQuery.
Deploying through Terraform rather than the Airbyte UI lets you benefit from all the advantages of config-based deployments:
- Easier and less error-prone to upgrade environments.
- Automatable in a CI/CD pipeline for better scalability, consistency, and efficiency.
- All the configuration is centralized and versioned in git for reviews and tests.
Most sources need to be set up with sensitive information such as API keys, database passwords, and other secrets. To avoid having these in clear text in your repo, this module integrates with Secret Manager to fetch sensitive data at deployment time.
Airbyte has a lot of sources, but in the event yours is not officially supported, you can create your own and this module will be able to use it.
Even though this module is most likely to be used by data engineers who are proficient with Terraform, it can be useful to decouple the ELT configuration details from the TF code through a YAML file.
- Users who don't know Terraform can update the config files themselves more easily.
- It becomes possible to have a front end or form that generates these YAML files, which are then automatically deployed by Terraform.
- It separates concerns and avoids very long Terraform files if you have a lot of flows.
Under the hood, the data going from your sources through Airbyte to BigQuery is always staged in a GCS bucket as Avro files. This is important for disaster recovery, reprocessing, backfills, archival, compliance, etc.
A lot of attention went into providing useful error messages when you misconfigure a source. If you're stuck, refer to the Airbyte connector catalog, or to the full connectors spec, to check what your source requires.
As this module depends on an available Airbyte deployment at plan time, it cannot live in the same Terraform state as the Airbyte infrastructure deployment itself. You will first need to deploy the Airbyte VM/cluster, and then deploy the ELT flows separately.
This module is very difficult to use from Terraform Cloud. You would either need to expose the Airbyte instance to the public internet, or find a way to create an SSH tunnel to it from the Terraform Cloud runner. If you find a neat way to work around this issue, hit me up at [email protected].
When calling the module, you will need to specify a `flows_configuration`. This page documents that structure.
module "airbyte_flows" {
source = "artefactory/airbyte-flows/google"
version = "~> 0"
project_id = local.project_id
airbyte_service_account_email = local.airbyte_service_account
flows_configuration = {} # <-- This right here
}
```hcl
map(object({
  flow_name   = string # Display name for your data flow
  source_name = string # Name of the source. Either one from https://docs.airbyte.com/category/sources or a custom one.

  custom_source = optional(object({ # Default: null. If source_name is not in the Airbyte sources catalog, you need to specify where to find it
    docker_repository = string      # Docker Repository URL (e.g. 112233445566.dkr.ecr.us-east-1.amazonaws.com/source-custom) or DockerHub identifier (e.g. airbyte/source-postgres)
    docker_image_tag  = string      # Docker image tag
    documentation_url = string      # Custom source documentation URL
  }))

  cron_schedule = optional(string, "manual") # Default: manual. Cron expression for when syncs should run (ex. "0 0 12 * * ?" => will sync at 12:00 PM every day)
  cron_timezone = optional(string, "UTC")    # Default: UTC. One of the TZ identifiers at https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

  normalize = optional(bool, true) # Default: true. Whether Airbyte should normalize the data after ingestion. https://docs.airbyte.com/understanding-airbyte/basic-normalization/

  tables_to_sync = map(object({ # All streams to extract from the source and load to BigQuery
    sync_mode             = optional(string, "full_refresh") # Allowed: full_refresh | incremental. Default: full_refresh
    destination_sync_mode = optional(string, "append")       # Allowed: append | overwrite | append_dedup. Default: append
    cursor_field          = optional(string)                  # Path to the field that will be used to determine if a record is new or modified since the last sync. REQUIRED if sync_mode is incremental. Otherwise it is ignored.
    primary_key           = optional(string)                  # List of the fields that will be used as primary key (multiple fields can be listed for a composite PK). REQUIRED if destination_sync_mode is *_dedup. Otherwise it is ignored.
  }))

  source_specification = map(string) # Source-specific configurations. Refer to the connectors catalog for more info. For any string like "secret:<secret_name>", the module will fetch the value of `secret_name` in Secret Manager.

  destination_specification = object({
    dataset_name        = string # Existing dataset to which your data will be written
    dataset_location    = string # Allowed: EU | US | Any valid BQ region as specified here https://cloud.google.com/bigquery/docs/locations
    staging_bucket_name = string # Existing bucket in which your data will be written as Avro files at each connection run
  })
}))
```
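For instance, a stream synced incrementally and deduplicated in BigQuery could be declared like this (stream and field names are placeholders):

```hcl
tables_to_sync = {
  orders = {
    sync_mode             = "incremental"
    destination_sync_mode = "append_dedup"
    cursor_field          = "updated_at" # required because sync_mode is incremental
    primary_key           = "order_id"   # required because destination_sync_mode is append_dedup
  }
}
```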
Requirements:

| Name | Version |
|---|---|
| airbyte | ~>0.1 |
| google | ~>4.75 |
| http | ~>3.4 |

Providers:

| Name | Version |
|---|---|
| http | ~>3.4 |

Modules:

| Name | Source | Version |
|---|---|---|
| airbyte_bigquery_flow | ./airbyte_bigquery_flow | n/a |

Resources:

| Name | Type |
|---|---|
| http_http.connectors_catalog | data source |

Inputs:

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| airbyte_service_account_email | Email address of the service account used by the Airbyte VM | `string` | n/a | yes |
| flows_configuration | Definition of all the flows to BigQuery that will be Terraformed to your Airbyte instance | `map(object({...}))` (see structure above) | n/a | yes |
| project_id | GCP project id in which the existing Airbyte instance resides | `string` | n/a | yes |

Outputs:

No outputs.