terraform-google-bigquery-loader-pubsub-ce

A Terraform module that deploys the microservices required to load Snowplow data into BigQuery on Google Cloud, running on Compute Engine. If you want to use a custom image for this deployment, ensure it is based on Ubuntu 20.04.
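As a rough sketch of the custom-image case, the block below looks up an Ubuntu 20.04 image with the google_compute_image data source and passes it to the module through the ubuntu_20_04_source_image input. The data source label, the var.* variables, and the choice of passing self_link are assumptions for illustration; confirm which image reference format the module expects before relying on it.

```hcl
# Assumption: Canonical's public Ubuntu 20.04 LTS family is used here as a stand-in.
# Replace project/family with the location of your own Ubuntu 20.04 based image.
data "google_compute_image" "ubuntu_20_04" {
  family  = "ubuntu-2004-lts"
  project = "ubuntu-os-cloud"
}

module "bigquery_loader_pubsub" {
  source = "snowplow-devops/bigquery-loader-pubsub-ce/google"

  accept_limited_use_license = true

  name       = "bq-loader-server"
  project_id = var.project_id
  network    = var.network
  region     = var.region

  # Assumption: these variables are declared elsewhere in your configuration
  input_topic_name            = var.input_topic_name
  bad_rows_topic_name         = var.bad_rows_topic_name
  gcs_dead_letter_bucket_name = var.gcs_dead_letter_bucket_name
  bigquery_dataset_id         = var.bigquery_dataset_id

  # Overrides the default (latest community Ubuntu 20.04 image)
  ubuntu_20_04_source_image = data.google_compute_image.ubuntu_20_04.self_link
}
```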

Source acknowledgement

This module was originally sourced from a community contribution by Teghan Nightengale in 2022 - big thanks for the help in getting this one started!

Telemetry

By default, this module collects and forwards telemetry information to Snowplow so that we can understand how our applications are being used. No identifying information about your sub-account or account fingerprints is ever forwarded to us; we only receive basic information about which modules and applications are deployed and active.

If you wish to subscribe to our mailing list for updates to these modules or for security advisories, please set the user_provided_id variable to a valid email address at which we can reach you.

How do I disable it?

To disable telemetry, set the variable telemetry_enabled = false.
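A minimal sketch of what that could look like in a module block follows. The telemetry_enabled and user_provided_id inputs come from this module's Inputs; the var.* references are placeholders that are assumed to be declared elsewhere in your configuration.

```hcl
module "bigquery_loader_pubsub" {
  source = "snowplow-devops/bigquery-loader-pubsub-ce/google"

  accept_limited_use_license = true

  name       = "bq-loader-server"
  project_id = var.project_id
  network    = var.network
  region     = var.region

  # Assumption: these variables are declared elsewhere in your configuration
  input_topic_name            = var.input_topic_name
  bad_rows_topic_name         = var.bad_rows_topic_name
  gcs_dead_letter_bucket_name = var.gcs_dead_letter_bucket_name
  bigquery_dataset_id         = var.bigquery_dataset_id

  # Turn telemetry off entirely
  telemetry_enabled = false

  # Alternatively, leave telemetry on and identify the stack with a contact email:
  # user_provided_id = "you@example.com" # hypothetical address
}
```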

What are you collecting?

For details on what information is collected please see this module: https://github.com/snowplow-devops/terraform-snowplow-telemetry

Usage

This module will deploy three separate instance groups:

  1. mutator: Attempts to create the events table if it doesn't exist and then listens for new types to update the table with as custom events and entities are tracked
  2. repeater: Events that were sent with custom events and entities that have not yet been added to the events table will be re-tried later by the repeater
  3. streamloader: Core application which pulls data from an Enriched events topic and loads into BigQuery

The mutator is deployed as a singleton instance but both the repeater and streamloader can be scaled horizontally if higher throughput is needed.

# NOTE: Needs to be fed by the enrich module with valid Snowplow Events
module "enriched_topic" {
  source  = "snowplow-devops/pubsub-topic/google"
  version = "0.3.0"

  name = "enriched-topic"
}

module "bad_rows_topic" {
  source  = "snowplow-devops/pubsub-topic/google"
  version = "0.3.0"

  name = "bad-rows-topic"
}

resource "google_bigquery_dataset" "pipeline_db" {
  dataset_id = "pipeline_db"
  location   = var.region
}

resource "google_storage_bucket" "dead_letter_bucket" {
  name          = "bq-loader-dead-letter"
  location      = var.region
  force_destroy = true
}

module "bigquery_loader_pubsub" {
  source  = "snowplow-devops/bigquery-loader-pubsub-ce/google"

  accept_limited_use_license = true

  name       = "bq-loader-server"
  project_id = var.project_id

  network    = var.network
  subnetwork = var.subnetwork
  region     = var.region

  input_topic_name            = module.enriched_topic.name
  bad_rows_topic_name         = module.bad_rows_topic.name
  gcs_dead_letter_bucket_name = google_storage_bucket.dead_letter_bucket.name
  bigquery_dataset_id         = google_bigquery_dataset.pipeline_db.dataset_id

  ssh_key_pairs    = []
  ssh_ip_allowlist = ["0.0.0.0/0"]

  # Linking in the custom Iglu Server here
  custom_iglu_resolvers = [
    {
      name            = "Iglu Server"
      priority        = 0
      uri             = "http://your-iglu-server-endpoint/api"
      api_key         = var.iglu_super_api_key
      vendor_prefixes = []
    }
  ]
}
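To illustrate the horizontal scaling mentioned above, here is a hedged variation of the same module block with the target_size_repeater, target_size_streamloader, and machine_type_streamloader inputs set. The sizes and machine type are arbitrary illustrative values, not tuning recommendations, and this block is meant to replace the module block above rather than sit alongside it.

```hcl
module "bigquery_loader_pubsub" {
  source = "snowplow-devops/bigquery-loader-pubsub-ce/google"

  accept_limited_use_license = true

  name       = "bq-loader-server"
  project_id = var.project_id

  network    = var.network
  subnetwork = var.subnetwork
  region     = var.region

  input_topic_name            = module.enriched_topic.name
  bad_rows_topic_name         = module.bad_rows_topic.name
  gcs_dead_letter_bucket_name = google_storage_bucket.dead_letter_bucket.name
  bigquery_dataset_id         = google_bigquery_dataset.pipeline_db.dataset_id

  # The mutator always runs as a single instance, so only the
  # repeater and streamloader expose a target size.
  target_size_repeater     = 2 # illustrative value
  target_size_streamloader = 3 # illustrative value

  # Optionally give the streamloader more headroom per instance
  machine_type_streamloader = "e2-medium" # illustrative value
}
```

Whether to add instances or move to a larger machine type depends on your event volume; the defaults (one instance of each, e2-small) are listed under Inputs.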

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.0.0 |
| google | >= 3.44.0 |

Providers

| Name | Version |
|------|---------|
| google | >= 3.44.0 |

Modules

| Name | Source | Version |
|------|--------|---------|
| service | snowplow-devops/service-ce/google | 0.1.0 |
| telemetry | snowplow-devops/telemetry/snowplow | 0.5.0 |

Resources

| Name | Type |
|------|------|
| google_bigquery_dataset_iam_member.dataset_bigquery_data_editor_binding | resource |
| google_compute_firewall.egress | resource |
| google_compute_firewall.ingress_ssh | resource |
| google_project_iam_member.sa_bigquery_data_editor | resource |
| google_project_iam_member.sa_logging_log_writer | resource |
| google_project_iam_member.sa_pubsub_publisher | resource |
| google_project_iam_member.sa_pubsub_subscriber | resource |
| google_project_iam_member.sa_pubsub_viewer | resource |
| google_project_iam_member.sa_storage_object_viewer | resource |
| google_pubsub_subscription.failed_inserts | resource |
| google_pubsub_subscription.input | resource |
| google_pubsub_subscription.types | resource |
| google_pubsub_topic.failed_inserts | resource |
| google_pubsub_topic.types | resource |
| google_service_account.sa | resource |
| google_storage_bucket_iam_binding.dead_letter_storage_object_admin_binding | resource |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| bad_rows_topic_name | The name of the output topic for all bad data | `string` | n/a | yes |
| bigquery_dataset_id | The ID of the BigQuery dataset to load data into | `string` | n/a | yes |
| gcs_dead_letter_bucket_name | The name of the GCS bucket to dump unloadable events into | `string` | n/a | yes |
| input_topic_name | The name of the input topic that contains enriched data to load | `string` | n/a | yes |
| name | A name which will be prepended to the resources created | `string` | n/a | yes |
| network | The name of the network to deploy within | `string` | n/a | yes |
| project_id | The project ID in which the stack is being deployed | `string` | n/a | yes |
| region | The name of the region to deploy within | `string` | n/a | yes |
| accept_limited_use_license | Acceptance of the SLULA terms (https://docs.snowplow.io/limited-use-license-1.0/) | `bool` | `false` | no |
| app_version | App version to use. This variable facilitates dev flow; the module may not work with anything other than the default value. | `string` | `"1.7.0"` | no |
| associate_public_ip_address | Whether to assign a public IP address to this instance; if false this instance must be behind a Cloud NAT to connect to the internet | `bool` | `true` | no |
| bigquery_partition_column | The partition column to use in the dataset | `string` | `"collector_tstamp"` | no |
| bigquery_require_partition_filter | Whether to require a filter on the partition column in all queries | `bool` | `true` | no |
| bigquery_table_id | The ID of the table within a dataset to load data into (will be created if it doesn't exist) | `string` | `"events"` | no |
| custom_iglu_resolvers | The custom Iglu Resolvers that will be used by the loader to resolve and validate events | `list(object({ name = string, priority = number, uri = string, api_key = string, vendor_prefixes = list(string) }))` | `[]` | no |
| default_iglu_resolvers | The default Iglu Resolvers that will be used by the loader to resolve and validate events | `list(object({ name = string, priority = number, uri = string, api_key = string, vendor_prefixes = list(string) }))` | `[{ "api_key": "", "name": "Iglu Central", "priority": 10, "uri": "http://iglucentral.com", "vendor_prefixes": [] }, { "api_key": "", "name": "Iglu Central - Mirror 01", "priority": 20, "uri": "http://mirror01.iglucentral.com", "vendor_prefixes": [] }]` | no |
| gcp_logs_enabled | Whether application logs should be reported to GCP Logging | `bool` | `true` | no |
| java_opts | Custom JAVA Options | `string` | `"-XX:InitialRAMPercentage=75 -XX:MaxRAMPercentage=75"` | no |
| labels | The labels to append to this resource | `map(string)` | `{}` | no |
| machine_type_mutator | The machine type to use | `string` | `"e2-small"` | no |
| machine_type_repeater | The machine type to use | `string` | `"e2-small"` | no |
| machine_type_streamloader | The machine type to use | `string` | `"e2-small"` | no |
| network_project_id | The project ID of the shared VPC in which the stack is being deployed | `string` | `""` | no |
| ssh_block_project_keys | Whether to block project wide SSH keys | `bool` | `true` | no |
| ssh_ip_allowlist | The list of CIDR ranges to allow SSH traffic from | `list(any)` | `["0.0.0.0/0"]` | no |
| ssh_key_pairs | The list of SSH key-pairs to add to the servers | `list(object({ user_name = string, public_key = string }))` | `[]` | no |
| subnetwork | The name of the sub-network to deploy within; if populated will override the 'network' setting | `string` | `""` | no |
| target_size_repeater | The number of servers to deploy | `number` | `1` | no |
| target_size_streamloader | The number of servers to deploy | `number` | `1` | no |
| telemetry_enabled | Whether or not to send telemetry information back to Snowplow Analytics Ltd | `bool` | `true` | no |
| ubuntu_20_04_source_image | The source image to use, which must be based on Ubuntu 20.04; by default the latest community version is used | `string` | `""` | no |
| user_provided_id | An optional unique identifier to identify the telemetry events emitted by this stack | `string` | `""` | no |
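Several of these inputs interact. The description of associate_public_ip_address notes that instances without a public IP must sit behind a Cloud NAT, and ssh_ip_allowlist defaults to allowing SSH from anywhere. The sketch below shows one way these might be combined. The router and NAT resource names and the CIDR range are hypothetical, the var.* references are assumed to be declared elsewhere, and var.network is assumed to be a network the router can attach to.

```hcl
# Cloud NAT so instances without public IPs can still reach the internet
resource "google_compute_router" "nat_router" {
  name    = "bq-loader-nat-router" # hypothetical name
  region  = var.region
  network = var.network
}

resource "google_compute_router_nat" "nat" {
  name                               = "bq-loader-nat" # hypothetical name
  router                             = google_compute_router.nat_router.name
  region                             = var.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}

module "bigquery_loader_pubsub" {
  source = "snowplow-devops/bigquery-loader-pubsub-ce/google"

  accept_limited_use_license = true

  name       = "bq-loader-server"
  project_id = var.project_id
  network    = var.network
  region     = var.region

  input_topic_name            = var.input_topic_name
  bad_rows_topic_name         = var.bad_rows_topic_name
  gcs_dead_letter_bucket_name = var.gcs_dead_letter_bucket_name
  bigquery_dataset_id         = var.bigquery_dataset_id

  # No public IPs; outbound traffic goes through the Cloud NAT above
  associate_public_ip_address = false

  # Restrict SSH to a known range instead of the 0.0.0.0/0 default
  ssh_ip_allowlist = ["203.0.113.0/24"] # hypothetical documentation range
}
```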

Outputs

| Name | Description |
|------|-------------|
| instance_group_url | The full URL of the instance group created by the manager |
| manager_id | Identifier for the instance group manager |
| manager_self_link | The URL for the instance group manager |
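As a small sketch, these outputs could be re-exported from a root configuration as shown below. It assumes the module block is labelled bigquery_loader_pubsub, as in the Usage example; the output names on the left are arbitrary.

```hcl
# Re-export the module outputs so they appear in `terraform output`
output "bq_loader_instance_group_url" {
  description = "Full URL of the instance group created by the manager"
  value       = module.bigquery_loader_pubsub.instance_group_url
}

output "bq_loader_manager_id" {
  description = "Identifier for the instance group manager"
  value       = module.bigquery_loader_pubsub.manager_id
}

output "bq_loader_manager_self_link" {
  description = "URL for the instance group manager"
  value       = module.bigquery_loader_pubsub.manager_self_link
}
```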

Copyright and license

Copyright 2022-present Snowplow Analytics Ltd.

Licensed under the Snowplow Limited Use License Agreement. (If you are uncertain how it applies to your use case, check our answers to frequently asked questions.)