[PRODUCTION] Update production Airflow environment #428

Merged
merged 13 commits on Jul 12, 2024
@@ -1,4 +1,4 @@
-name: CI
+name: CI-CD-DEV

on:
  pull_request:
@@ -1,4 +1,4 @@
-name: release
+name: CI-CD-PROD

on:
  pull_request:
45 changes: 26 additions & 19 deletions README.md
@@ -31,6 +31,7 @@ This repository contains the Airflow DAGs for the [Stellar ETL](https://github.c
- [build_time_task](#build_time_task)
- [build_export_task](#build_export_task)
- [build_gcs_to_bq_task](#build_gcs_to_bq_task)
- [build_del_ins_from_gcs_to_bq_task](#build_del_ins_from_gcs_to_bq_task)
- [build_apply_gcs_changes_to_bq_task](#build_apply_gcs_changes_to_bq_task)
- [build_batch_stats](#build_batch_stats)
- [bq_insert_job_task](#bq_insert_job_task)
@@ -449,25 +450,26 @@ The `airflow_variables_*.txt` files provide a set of default values for variable

### **DBT Variables**

| Variable name | Description | Should be changed? |
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------- |
| dbt_full_refresh_models | JSON object. Each key should be a DBT model, and the value is a boolean controlling if the model should be run with `--full-refresh` | Yes, if desired for models that need to be full-refreshed |
| dbt_image_name | name of the `stellar-dbt` image to use | No, unless you need a specific image version |
| dbt_job_execution_timeout_seconds | timeout for dbt tasks in seconds | No, unless you want a different timeout |
| dbt_job_retries | number of times dbt_jobs will retry | No, unless you want a different retry limit |
| dbt_mart_dataset | Name of the BigQuery [dataset](https://cloud.google.com/bigquery/docs/datasets) for DBT marts | Yes. Change to your dataset name |
| dbt_maximum_bytes_billed | the max number of BigQuery bytes that can be billed when running DBT | No, unless you want a different limit |
| dbt_project                       | name of the BigQuery [project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console)                        | Yes. Change to your project name                          |
| dbt_target                        | the `target` that will be used to run dbt                                                                                                 | No, unless you want a different target                    |
| dbt_threads | the number of threads that dbt will spawn to build a model | No, unless you want a different thread count |
| dbt_tables                        | names of the dbt tables to copy to the sandbox                                                                                            | No                                                        |
| dbt_internal_source_db | Name of the BigQuery [project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console) | Yes. Change to your project name. |
| dbt_internal_source_schema | Name of the BigQuery [dataset](https://cloud.google.com/bigquery/docs/datasets) | Yes. Change to your dataset name. |
| dbt_public_source_db | Name of the BigQuery [project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console) | Yes. Change to your project name. |
| dbt_public_source_schema | Name of the BigQuery [dataset](https://cloud.google.com/bigquery/docs/datasets) | Yes. Change to your dataset name. |
| dbt_slack_elementary_channel      | Name of the Slack channel to send Elementary alerts to                                                                                    | Yes. Change to your Slack channel name.                   |
| dbt_elementary_dataset | Name of the BigQuery [dataset](https://cloud.google.com/bigquery/docs/datasets) | Yes. Change to your dataset name. |
| dbt_elementary_secret             | Required argument for the Elementary task                                                                                                 | No                                                        |
| dbt_transient_errors_patterns     | Dictionary mapping the name of a known dbt transient error to the list of string sentences that identify its pattern (see the sketch below) | Yes, for every known error added                          |
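
For reference, here is a minimal sketch of how these patterns could be evaluated against dbt task output, assuming the variable is read with Airflow's `Variable` API. The helper itself is hypothetical and not the repository's actual retry hook; the variable value matches the one added to the `airflow_variables_*.json` files further down in this diff.

```python
from airflow.models import Variable


def is_transient_dbt_error(log_text: str) -> bool:
    """Return True if the dbt log output matches any known transient-error pattern."""
    patterns = Variable.get(
        "dbt_transient_errors_patterns", deserialize_json=True, default_var={}
    )
    # Each key names a known transient error; the value is a list of sentences
    # that must all appear in the log for that error to be considered a match.
    for _error_name, sentences in patterns.items():
        if all(sentence in log_text for sentence in sentences):
            return True
    return False
```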

### **Kubernetes-Specific Variables**

@@ -542,6 +544,7 @@ This section contains information about the Airflow setup. It includes our DAG d
- [build_export_task](#build_export_task)
- [build_gcs_to_bq_task](#build_gcs_to_bq_task)
- [build_apply_gcs_changes_to_bq_task](#build_apply_gcs_changes_to_bq_task)
- [build_del_ins_from_gcs_to_bq_task](#build_del_ins_from_gcs_to_bq_task)
- [build_batch_stats](#build_batch_stats)
- [bq_insert_job_task](#bq_insert_job_task)
- [cross_dependency_task](#cross_dependency_task)
@@ -668,6 +671,10 @@ This section contains information about the Airflow setup. It includes our DAG d

[This file](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/stellar_etl_airflow/build_gcs_to_bq_task.py) contains methods for creating tasks that append information from a Google Cloud Storage file to a BigQuery table. These tasks create a new table if one does not exist. They are used for history archive data structures, as Stellar wants to keep a complete record of the ledger's entire history.
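
As an illustration only (not the repository's exact implementation), a task of this kind could be built with the Google provider's `GCSToBigQueryOperator`; the object path and destination table below are placeholders, while the bucket name and `dag-exported` prefix come from the dev variables in this PR.

```python
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Append one exported batch file to its history table.
send_ledgers_to_bq = GCSToBigQueryOperator(
    task_id="send_ledgers_to_bq",
    bucket="us-central1-test-hubble-2-5f1f2dbf-bucket",        # gcs_exported_data_bucket_name (dev)
    source_objects=["dag-exported/{{ run_id }}/ledgers.txt"],  # hypothetical exported object path
    destination_project_dataset_table="my-project.my_dataset.history_ledgers",  # placeholder
    source_format="NEWLINE_DELIMITED_JSON",
    write_disposition="WRITE_APPEND",       # history tables only grow, so rows are appended
    create_disposition="CREATE_IF_NEEDED",  # create the table if it does not exist yet
)
```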

### **build_del_ins_from_gcs_to_bq_task**

[This file](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/stellar_etl_airflow/build_del_ins_from_gcs_to_bq_task.py) contains methods for creating tasks that delete data for the batch interval from a specified BigQuery table and then import the corresponding data from GCS into that table. These tasks create a new table if one does not exist. They are used for history and state data structures, as Stellar wants to keep a complete record of the ledger's entire history.
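
A minimal sketch of the delete-then-insert pattern, assuming a `BigQueryInsertJobOperator` for the interval-scoped delete followed by a `GCSToBigQueryOperator` load. Table names, the `batch_run_date` column, and the object path are placeholders rather than the repository's actual schema.

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# 1) Delete any rows already loaded for this batch interval, so re-runs stay idempotent.
delete_ledgers_batch = BigQueryInsertJobOperator(
    task_id="delete_ledgers_batch",
    configuration={
        "query": {
            "query": """
                DELETE FROM `my-project.my_dataset.history_ledgers`
                WHERE batch_run_date = '{{ ds }}'
            """,
            "useLegacySql": False,
        }
    },
)

# 2) Re-import the batch from GCS into the same table.
insert_ledgers_batch = GCSToBigQueryOperator(
    task_id="insert_ledgers_batch",
    bucket="us-central1-test-hubble-2-5f1f2dbf-bucket",
    source_objects=["dag-exported/{{ run_id }}/ledgers.txt"],
    destination_project_dataset_table="my-project.my_dataset.history_ledgers",
    source_format="NEWLINE_DELIMITED_JSON",
    write_disposition="WRITE_APPEND",
    create_disposition="CREATE_IF_NEEDED",
)

delete_ledgers_batch >> insert_ledgers_batch
```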

### **build_apply_gcs_changes_to_bq_task**

[This file](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/stellar_etl_airflow/build_apply_gcs_changes_to_bq_task.py) contains methods for creating apply tasks. Apply tasks are used to merge a file from Google Cloud Storage into a BigQuery table. They differ from the append tasks above in that they apply changes: they update, delete, and insert rows. These tasks are used for accounts, offers, and trustlines, as the BigQuery table represents the point-in-time state of these data structures. For example, a merge task could alter the account balance field in the table if a user performed a transaction, delete a row if a user deleted their account, or add a new row if a new account was created.
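
For intuition, here is a hedged sketch of the kind of `MERGE` statement such an apply task could run against a point-in-time state table like accounts; the table and column names are placeholders, not the repository's actual schema.

```python
# Illustrative BigQuery MERGE for a state table; schema names are placeholders.
APPLY_ACCOUNT_CHANGES = """
MERGE `my-project.my_dataset.accounts` AS target
USING `my-project.my_dataset.accounts_staging` AS source
ON target.account_id = source.account_id
WHEN MATCHED AND source.deleted THEN
  DELETE                                      -- the account was removed on the ledger
WHEN MATCHED THEN
  UPDATE SET target.balance = source.balance  -- e.g. a transaction changed the balance
WHEN NOT MATCHED THEN
  INSERT (account_id, balance)
  VALUES (source.account_id, source.balance)  -- a new account was created
"""
```
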
8 changes: 8 additions & 0 deletions airflow_variables_dev.json
@@ -146,6 +146,12 @@
},
"dbt_target": "test",
"dbt_threads": 12,
"dbt_transient_errors_patterns": {
"elementary_concurrent_access": [
"Could not serialize access to table",
"due to concurrent update"
]
},
"gcs_exported_data_bucket_name": "us-central1-test-hubble-2-5f1f2dbf-bucket",
"gcs_exported_object_prefix": "dag-exported",
"image_name": "stellar/stellar-etl:98bea9a",
@@ -334,6 +340,7 @@
"asset_stats": 720,
"build_batch_stats": 840,
"build_bq_insert_job": 1080,
"build_del_ins_from_gcs_to_bq_task": 2000,
"build_delete_data_task": 1020,
"build_export_task": 840,
"build_gcs_to_bq_task": 960,
@@ -367,6 +374,7 @@
"build_bq_insert_job": 180,
"build_copy_table": 180,
"build_dbt_task": 960,
"build_del_ins_from_gcs_to_bq_task": 400,
"build_delete_data_task": 180,
"build_export_task": 420,
"build_gcs_to_bq_task": 300,
8 changes: 8 additions & 0 deletions airflow_variables_prod.json
@@ -147,6 +147,12 @@
},
"dbt_target": "prod",
"dbt_threads": 12,
"dbt_transient_errors_patterns": {
"elementary_concurrent_access": [
"Could not serialize access to table",
"due to concurrent update"
]
},
"gcs_exported_data_bucket_name": "us-central1-hubble-14c4ca64-bucket",
"gcs_exported_object_prefix": "dag-exported",
"image_name": "stellar/stellar-etl:98bea9a",
@@ -332,6 +338,7 @@
"asset_stats": 420,
"build_batch_stats": 600,
"build_bq_insert_job": 840,
"build_del_ins_from_gcs_to_bq_task": 2000,
"build_delete_data_task": 780,
"build_export_task": 600,
"build_gcs_to_bq_task": 660,
@@ -365,6 +372,7 @@
"build_bq_insert_job": 180,
"build_copy_table": 180,
"build_dbt_task": 1800,
"build_del_ins_from_gcs_to_bq_task": 400,
"build_delete_data_task": 180,
"build_export_task": 300,
"build_gcs_to_bq_task": 300,