
Scheduled Vertex Pipelines

This repo contains a Terraform module for scheduling a Vertex Pipeline using Google Cloud Scheduler, without the need for a Cloud Function or other "glue code".

This module is available in the Terraform Registry.

Examples

Check out the examples directory.
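
For orientation, a minimal sketch of invoking the module is shown below. This is an illustration rather than a tested configuration: the registry source is inferred from the repository name following Terraform's standard naming convention, and the project, bucket, and schedule values are placeholders.

```hcl
# Minimal sketch (not a tested example): all values below are placeholders,
# and the registry source is inferred from the repository name.
module "scheduled_pipeline" {
  source = "teamdatatonic/scheduled-vertex-pipelines/google"

  project                  = "my-gcp-project"
  vertex_region            = "europe-west2"
  cloud_scheduler_region   = "europe-west2"
  cloud_scheduler_job_name = "my-daily-pipeline"

  # Unix-cron schedule, interpreted in time_zone: run every day at 09:00.
  schedule  = "0 9 * * *"
  time_zone = "UTC"

  # Compiled KFP pipeline spec: a local file, GCS path, or Artifact Registry path.
  pipeline_spec_path = "gs://my-bucket/pipelines/my-pipeline.json"

  # Root output directory for pipeline artifacts.
  gcs_output_directory = "gs://my-bucket/pipeline-outputs"
}
```

All other inputs are optional; see the Inputs table below for their types and defaults.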

Limitations

Pipeline job names

Vertex Pipeline jobs created using the Python SDK are given names of the format <pipeline name>-<timestamp>. This naming is implemented in the SDK itself (not by the API). Pipeline jobs created using this Terraform module instead just have a numeric ID as the job name, for two reasons:

  • The ability to specify the job name is only available in the gRPC API, not the HTTP API (Cloud Scheduler jobs can only target the HTTP API, not the gRPC API)
  • Regardless, Cloud Scheduler jobs cannot dynamically alter the HTTP payload based on a timestamp

Caching

Using the SDK, you can override the caching behaviour of Vertex Pipeline steps when submitting the job. This override is not available in the HTTP API used by this module. Instead, specify the caching behaviour in the pipeline definition itself (for example, using set_caching_options on a task in the Kubeflow Pipelines SDK).

Development

Local setup

  • Install pre-commit
  • Install the pre-commit hooks - pre-commit install

README

The README file is autogenerated using terraform-docs. This is done when you create a pull request (or push to an existing PR).

You can customise the template (including this text for example) in .github/workflows/pr-checks.yml.

Requirements

| Name | Version |
|------|---------|
| google | >= 4.0.0 |

Providers

| Name | Version |
|------|---------|
| google | >= 4.0.0 |

Modules

No modules.

Resources

| Name | Type |
|------|------|
| google_cloud_scheduler_job.job | resource |
| google_compute_default_service_account.default | data source |
| google_storage_bucket_object_content.pipeline_spec | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| cloud_scheduler_job_attempt_deadline | The deadline for Cloud Scheduler job attempts. If the request handler does not respond by this deadline, the request is cancelled and the attempt is marked as a DEADLINE_EXCEEDED failure. The failed attempt can be viewed in execution logs. Cloud Scheduler will retry the job according to the RetryConfig. The allowed duration for this deadline is between 15 seconds and 30 minutes. A duration in seconds with up to nine fractional digits, terminated by 's'. Example: "3.5s" | string | "320s" | no |
| cloud_scheduler_job_description | A human-readable description for the Cloud Scheduler job. This string must not contain more than 500 characters. | string | null | no |
| cloud_scheduler_job_name | The name of the Cloud Scheduler job. | string | n/a | yes |
| cloud_scheduler_region | The GCP region where the Cloud Scheduler job should be executed. | string | n/a | yes |
| cloud_scheduler_retry_count | The number of attempts that the system will make to run a Cloud Scheduler job using the exponential backoff procedure described by maxDoublings. Values greater than 5 and negative values are not allowed. | number | 1 | no |
| cloud_scheduler_sa_email | Service account email to be used for executing the Cloud Scheduler job. The service account must be within the same project as the job. | string | null | no |
| display_name | The display name of the Pipeline. The name can be up to 128 characters long and can consist of any UTF-8 characters. | string | null | no |
| gcs_output_directory | Required. A path in a Cloud Storage bucket, which will be treated as the root output directory of the pipeline. It is used by the system to generate the paths of output artifacts. The artifact paths are generated with a sub-path pattern {job_id}/{taskId}/{output_key} under the specified output directory. The service account specified in this pipeline must have the storage.objects.get and storage.objects.create permissions for this bucket. | string | n/a | yes |
| kms_key_name | The Cloud KMS resource identifier of the customer-managed encryption key used to protect a resource. Has the form projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as the Vertex Pipeline execution. | string | null | no |
| labels | The labels with user-defined metadata to organize the PipelineJob. Label keys and values can be no longer than 64 characters (Unicode codepoints) and can only contain lowercase letters, numeric characters, underscores and dashes. International characters are allowed. See https://goo.gl/xmQnxf for more information and examples of labels. | map(string) | {} | no |
| network | The full name of the Compute Engine network to which the Pipeline Job's workload should be peered, e.g. projects/12345/global/networks/myVPC. The format is projects/{project}/global/networks/{network}, where {project} is a project number (as in 12345) and {network} is a network name. Private services access must already be configured for the network. If set, the pipeline job will apply the network configuration to the GCP resources being launched, such as Vertex AI Training or Dataflow jobs. If left unspecified, the workload is not peered with any network. | string | null | no |
| parameter_values | The runtime parameters of the PipelineJob. The parameters will be passed into PipelineJob.pipeline_spec to replace the placeholders at runtime. This field is used by pipelines built using PipelineJob.pipeline_spec.schema_version 2.1.0, such as pipelines built using the Kubeflow Pipelines SDK 1.9 or higher and the v2 DSL. | map(any) | null | no |
| parameters | Deprecated. Use RuntimeConfig.parameter_values instead. The runtime parameters of the PipelineJob. The parameters will be passed into PipelineJob.pipeline_spec to replace the placeholders at runtime. This field is used by pipelines built using PipelineJob.pipeline_spec.schema_version 2.0.0 or lower, such as pipelines built using the Kubeflow Pipelines SDK 1.8 or lower. | map(any) | null | no |
| pipeline_spec_path | Path to the KFP pipeline spec file (YAML or JSON). This can be a local file, GCS path, or Artifact Registry path. | string | n/a | yes |
| project | The GCP project ID where the Cloud Scheduler job and Vertex Pipeline should be deployed. | string | n/a | yes |
| schedule | Describes the schedule on which the job will be executed. | string | n/a | yes |
| time_zone | Specifies the time zone to be used in interpreting the schedule. The value of this field must be a time zone name from the tz database. | string | "UTC" | no |
| vertex_region | The GCP region where the Vertex Pipeline should be executed. | string | n/a | yes |
| vertex_service_account_email | The service account that the pipeline workload runs as. If not specified, the Compute Engine default service account in the project will be used. See https://cloud.google.com/compute/docs/access/service-accounts#default_service_account. Users starting the pipeline must have the iam.serviceAccounts.actAs permission on this service account. | string | null | no |
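
The sketch below (again with placeholder values) illustrates two of the optional inputs: passing runtime parameters to the pipeline via parameter_values, and running the pipeline workload as a dedicated service account.

```hcl
# Sketch only: parameter_values keys must match placeholders defined in the
# compiled pipeline spec (schema_version 2.1.0, i.e. KFP SDK v2 pipelines).
module "scheduled_pipeline_with_params" {
  source = "teamdatatonic/scheduled-vertex-pipelines/google"

  project                  = "my-gcp-project"
  vertex_region            = "europe-west2"
  cloud_scheduler_region   = "europe-west2"
  cloud_scheduler_job_name = "weekly-training-pipeline"
  schedule                 = "0 6 * * 1" # 06:00 every Monday
  pipeline_spec_path       = "gs://my-bucket/pipelines/training.json"
  gcs_output_directory     = "gs://my-bucket/pipeline-outputs"

  # Replaces the matching placeholders in PipelineJob.pipeline_spec at runtime.
  parameter_values = {
    dataset = "my_dataset"
    epochs  = "10"
  }

  # The pipeline workload runs as this service account rather than the Compute
  # Engine default; callers need iam.serviceAccounts.actAs on this account.
  vertex_service_account_email = "vertex-pipelines@my-gcp-project.iam.gserviceaccount.com"
}
```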

Outputs

| Name | Description |
|------|-------------|
| id | An identifier for the Cloud Scheduler job resource, with format projects/{{project}}/locations/{{region}}/jobs/{{name}} |
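
If useful, the id output can be surfaced from a calling module. A minimal sketch, assuming the module "scheduled_pipeline" block from the earlier example:

```hcl
output "scheduler_job_id" {
  # Format: projects/{{project}}/locations/{{region}}/jobs/{{name}}
  value = module.scheduled_pipeline.id
}
```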
