Add CI/CD automation (#130)
* fix `isort` hook and the imports orders

* fix `sqlfluff` configuration

* update `airflow.cfg` to mirror dev composer

* update `sentry-sdk` to overcome warnings

* add script to test all dags integrity

* add script to upload dags and schema files to gcs

* add requirements to run airflow from ci

* add ci workflow

* hardcode `dags/` since it won't change

* change script name

* add ci/cd to test and update files in airflow

* fix `python-version` on `pre-commit` step

* fix `isort` and `black` compatibility

* fix imports hopefully for the last time

* autofix from `pre-commit`

* remove unnecessary file

* rename `airflow.cfg` to specify the environment

* add airflow configuration file from prod composer

* rename file

* add logger and env selection

* update ci/cd files

* fix formatting

* update ci/cd workflow

* fix ci/cd

* rearrange dependencies

* modify authentication method to gcp

* trigger ci only when pr comes from official repo

* lint merged files

* add new variables

* pause reset testnet dag when on prod composer

* fix uploading `airflow.cfg` to gcs
lucaszanotelli authored Apr 25, 2023
1 parent 9e8257d commit 3e34347
Showing 25 changed files with 777 additions and 261 deletions.
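
Among the changes is a script that checks the integrity of every DAG; the CI workflow below runs it with `pytest dags/`. As a minimal sketch of what such a check typically looks like (the actual test added in this commit may differ), built on Airflow's `DagBag`:

```python
from airflow.models import DagBag


def test_dag_integrity():
    # Parse everything under dags/ exactly as the scheduler would,
    # skipping Airflow's bundled example DAGs.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)

    # Syntax errors, missing imports, and missing Airflow variables
    # all surface here as import errors.
    assert not dag_bag.import_errors, f"DAG import failures: {dag_bag.import_errors}"

    # Sanity check that at least one DAG was discovered.
    assert len(dag_bag.dags) > 0
```

Running a check like this in CI catches broken DAG files before anything is uploaded to Composer.
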
113 changes: 113 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,113 @@
name: CI

on:
  pull_request:
    types:
      - opened
      - reopened
      - synchronize
      - closed
    branches:
      - master

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    if: >-
      github.event.pull_request.merged == false &&
      github.event.pull_request.state == 'open'
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.8

      - uses: pre-commit/[email protected]

  tests:
    runs-on: ubuntu-latest
    needs: [pre-commit]

    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.8

      - name: Install dependencies
        run: |
          cat airflow_variables_dev.json | sed -e s/\\/home\\/airflow\\/gcs\\/dags\\/// > airflow_variables_ci.json
          python -m pip install --upgrade pip
          pip install -r requirements-ci.txt

      - name: Init Airflow SQLite database
        run: airflow db init

      - name: Import Airflow variables
        run: airflow variables import airflow_variables_ci.json

      - name: Pytest
        run: pytest dags/

  deploy-to-dev:
    runs-on: ubuntu-latest
    needs: [tests]
    # deploy to dev occurs every time
    # someone submits a pr targeting `master`
    # from a branch at `stellar/stellar-etl-airflow` repo
    if: github.repository == 'stellar/stellar-etl-airflow'
    # known caveats:
    # if there's more than 1 person working
    # in the same file this won't behave nicely

    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.8

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install google-cloud-storage==2.1.0

      - name: Authenticate to test-hubble GCP
        uses: google-github-actions/auth@v1
        with:
          credentials_json: "${{ secrets.CREDS_TEST_HUBBLE }}"

      - name: Upload files to dev GCS bucket
        run: python dags/stellar_etl_airflow/add_files_to_composer.py --bucket $BUCKET
        env:
          GOOGLE_CLOUD_PROJECT: test-hubble-319619
          BUCKET: us-central1-hubble-1pt5-dev-7db0e004-bucket

  promote-to-prod:
    runs-on: ubuntu-latest
    # deploy only occurs when pr is merged
    if: github.event.pull_request.merged == true
    permissions:
      pull-requests: write

    steps:
      - uses: actions/checkout@v3

      - name: Create pull request
        run: |
          gh pr create \
            --base release \
            --head master \
            --reviewer stellar/platform-committers \
            --title "[PRODUCTION] Update production Airflow environment" \
            --body "This PR was auto-generated by GitHub Actions.
          After merged and closed, this PR will trigger an action that updates DAGs, libs and schemas files from prod Airflow."
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
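
A detail worth noting in the `tests` job above: the `sed` one-liner rewrites `airflow_variables_dev.json` into `airflow_variables_ci.json` by stripping the Composer path prefix `/home/airflow/gcs/dags/`, so file paths stored in the variables resolve inside the CI checkout instead of the Composer bucket mount. A rough Python equivalent, shown only for illustration (it replaces every occurrence, whereas the `sed` expression replaces the first per line):

```python
from pathlib import Path

# Strip the Composer-specific prefix so paths point at the repo checkout in CI.
text = Path("airflow_variables_dev.json").read_text()
Path("airflow_variables_ci.json").write_text(text.replace("/home/airflow/gcs/dags/", ""))
```
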
67 changes: 67 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,67 @@
name: release

on:
  pull_request:
    types:
      - closed
    branches:
      - release

jobs:
  tests:
    runs-on: ubuntu-latest
    if: github.event.pull_request.merged == true
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.8

      - name: Install dependencies
        run: |
          cat airflow_variables_dev.json | sed -e s/\\/home\\/airflow\\/gcs\\/dags\\/// > airflow_variables_ci.json
          python -m pip install --upgrade pip
          pip install -r requirements-ci.txt

      - name: Init Airflow SQLite database
        run: airflow db init

      - name: Import Airflow variables
        run: airflow variables import airflow_variables_ci.json

      - name: Pytest
        run: pytest dags/

  release:
    runs-on: ubuntu-latest
    needs: [tests]
    # deploy only occurs when pr is merged
    if: >-
      github.event.pull_request.merged == true &&
      github.repository == 'stellar/stellar-etl-airflow'
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.8

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install google-cloud-storage==2.1.0

      - name: Authenticate to hubble GCP
        uses: google-github-actions/auth@v1
        with:
          credentials_json: "${{ secrets.CREDS_PROD_HUBBLE }}"

      - name: Upload files to prod GCS bucket
        run: python dags/stellar_etl_airflow/add_files_to_composer.py --bucket $BUCKET --env prod
        env:
          GOOGLE_CLOUD_PROJECT: hubble-261722
          BUCKET: us-central1-hubble-2-d948d67b-bucket
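
Both workflows delegate the actual upload to `dags/stellar_etl_airflow/add_files_to_composer.py`, with `google-cloud-storage==2.1.0` installed and credentials supplied by `google-github-actions/auth`. That script is not part of this excerpt; the sketch below is only a guess at its general shape (walking local folders and copying them into the Composer bucket passed via `--bucket`), not the repository's actual implementation:

```python
import argparse
import glob
import os

from google.cloud import storage


def upload_directory(bucket_name: str, local_dir: str, prefix: str) -> None:
    """Upload every file under local_dir to gs://<bucket_name>/<prefix>/..."""
    client = storage.Client()  # picks up the credentials configured by the auth step
    bucket = client.bucket(bucket_name)
    for path in glob.glob(os.path.join(local_dir, "**", "*"), recursive=True):
        if os.path.isfile(path):
            blob = bucket.blob(os.path.join(prefix, os.path.relpath(path, local_dir)))
            blob.upload_from_filename(path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--bucket", required=True)
    parser.add_argument("--env", default="dev")  # assumed default; the prod run passes --env prod
    args = parser.parse_args()

    # Composer picks DAGs up from the bucket's dags/ folder.
    upload_directory(args.bucket, "dags", "dags")
```

The `--env prod` flag presumably selects the production `airflow.cfg` and variable set referenced in the commit message ("rename `airflow.cfg` to specify the environment", "add logger and env selection").
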
51 changes: 28 additions & 23 deletions README.md
@@ -40,6 +40,7 @@ This repository contains the Airflow DAGs for the [Stellar ETL](https://github.c
<br>

# Installation and Setup

- [Google Cloud Platform](#google-cloud-platform)
- [Cloud Composer](#cloud-composer)
- [Airflow Variables Explanation](#airflow-variables-explanation)
@@ -75,6 +76,7 @@ Below are instructions to initialize the Google Cloud SDK and create the GCP proj

- Open the [Cloud Storage browser](https://console.cloud.google.com/storage/browser)
- [Create](https://cloud.google.com/storage/docs/creating-buckets) a new Google Storage bucket that will store exported files

> **_NOTE:_** Creating a new Cloud Composer environment will automatically create a new GCS bucket.
> **_NOTE:_** The dataset name you choose corresponds to the Airflow variable "gcs_exported_data_bucket_name".
@@ -85,7 +87,7 @@ Below are instructions to initialize the Google Cloud SDK and create the GCP proj
<br>

-***
+---

## **Cloud Composer**

Expand All @@ -102,7 +104,7 @@ Create a new Cloud Composer environment using the [UI](https://console.cloud.goo
> Composer 2 environments use autopilot exclusively for resource management.
> **_Note_**: If no service account is provided, GCP will use the default GKE service account. For quick setup this is an easy option.
-Remember to adjust the disk size, machine type, and node count to fit your needs. The python version must be 3, and the image must be `composer-1.16.11-airflow-1.10.14` or later. GCP deprecates support for older versions of composer and airflow. It is recommended that you select a stable, latest version to avoid an environment upgrade. See [the command reference page](https://cloud.google.com/sdk/gcloud/reference/composer/environments/create) for a detailed list of parameters.
+> Remember to adjust the disk size, machine type, and node count to fit your needs. The python version must be 3, and the image must be `composer-1.16.11-airflow-1.10.14` or later. GCP deprecates support for older versions of composer and airflow. It is recommended that you select a stable, latest version to avoid an environment upgrade. See [the command reference page](https://cloud.google.com/sdk/gcloud/reference/composer/environments/create) for a detailed list of parameters.
> **_TROUBLESHOOTING:_** If the environment creation fails because the "Composer Backend timed out" try disabling and enabling the Cloud Composer API. If the creation fails again, try creating a service account with Owner permissions and use it to create the Composer environment.
@@ -322,7 +324,7 @@ name: etl-data
<summary>Add Poststart Script to Airflow Workers</summary>
Find the namespace name in the airflow-worker config file. It should be near the top of the file, and may look like `composer-1-12-0-airflow-1-10-10-2fca78f7`. This value will be used in later commands.

Next, open the cloud shell. Keep your airflow-worker configuration file open, or save it. In the cloud shell, create a text file called `poststart.sh` by running the command: `nano poststart.sh`. Then, copy the text from the `poststart.sh` file in this repository into the newly opened file.

- If you changed the path for the local folder in the previous step, make sure that you edit line 13:

@@ -334,21 +336,21 @@ Next, open the cloud shell. Keep your airflow-worker configuration file open, or

```bash
gcloud container clusters get-credentials <cluster_name> --region=<composer_region>

kubectl create configmap start-config --from-file poststart.sh -n <namespace_name>
```

- Return to the airflow-worker config file. Add a new volumeMount to /etc/scripts.

```
...
volumeMounts:
...
- mountPath: /etc/scripts
  name: config-volume
...
```

- Then, add a new Volume that links to the configMap you created.
@@ -415,7 +417,7 @@ The `airflow_variables.txt` file provides a set of default values for variables.

<br>

-***
+---

## **Airflow Variables Explanation**

@@ -438,14 +440,13 @@ The `airflow_variables.txt` file provides a set of default values for variables.
| owner | the name of the owner of the Airflow DAGs | Yes. |
| schema_filepath | file path to schema folder | No, unless schemas are in a different location |
| table_ids | JSON object. Each key should be a data structure, and the value should be the name of the BigQuery table | Yes, if desired. Make sure each type has a different table name. |
| cluster_fields | JSON object. Each key should be a BigQuery table, and the value is a list of columns that the table is clustered by | Yes, if desired for tables that want clustering |
| parititon_fields | JSON object. Each key should be a BigQuery table, and the value is a JSON object of type and field to partition by | Yes, if desired for tables that want partitioning |
| gcs_exported_object_prefix | String to prefix run_id export task output path with | Yes, if desired to prefix run_id |
| sentry_dsn | Sentry Data Source Name to tell where Sentry SDK should send events | Yes |
| sentry_environment | Environment that sentry alerts will fire | Yes |
| use_testnet | Flag to use testnet data instead of mainnet | Yes, if desired to use testnet data |
| task_timeout | JSON object. Each key should be the airflow util task name, and the value is the timeout in seconds | Yes, if desired to give tasks timeout |
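
Several of these variables (`table_ids`, `cluster_fields`, `task_timeout`) hold JSON objects. As an illustration of how a DAG typically reads them (the keys shown are hypothetical; the real values live in the `airflow_variables_*.json` files):

```python
from airflow.models import Variable

# JSON-valued variables are deserialized into Python dicts when the DAG file is parsed.
table_ids = Variable.get("table_ids", deserialize_json=True)  # e.g. {"ledgers": "history_ledgers"}
cluster_fields = Variable.get("cluster_fields", deserialize_json=True)
task_timeout = Variable.get("task_timeout", deserialize_json=True)

ledgers_table = table_ids["ledgers"]
```
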


### **Kubernetes-Specific Variables**

Expand Down Expand Up @@ -476,7 +477,7 @@ Here are some example `volume_config` values. Note that a ReadWriteMany volume i
## **Starting Up**

> **_NOTE:_** The Google Cloud Composer instance of Airflow has limited CLI support.
-[Supported Airflow CLI commands](https://cloud.google.com/composer/docs/composer-2/access-airflow-cli#supported-commands)
+> [Supported Airflow CLI commands](https://cloud.google.com/composer/docs/composer-2/access-airflow-cli#supported-commands)
First, this image shows the Airflow web UI components for pausing and triggering DAGs:
![Airflow UI](documentation/images/AirflowUI.png)
@@ -518,34 +519,38 @@ This section contains information about the Airflow setup. It includes our DAG d
### **History Archive with Captive Core DAG**

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/history_archive_with_captive_core_dag.py):

- exports transactions, operations, trades, and effects from Stellar using CaptiveCore
- inserts into BigQuery
-> *_NOTE:_* SDF writes to both a private dataset and public dataset. Non-SDF instances will probably only need to write to a single private dataset.
+> _*NOTE:*_ SDF writes to both a private dataset and public dataset. Non-SDF instances will probably only need to write to a single private dataset.
![History Archive with Captive Core Dag](documentation/images/history_archive_with_captive_core.png)

### **History Archive without Captive Core DAG**

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/history_archive_without_captive_core_dag.py):

- exports assets and ledgers from Stellar's history archives
- inserts into BigQuery
-> *_NOTE:_* SDF writes to both a private dataset and public dataset. Non-SDF instances will probably only need to write to a single private dataset.
+> _*NOTE:*_ SDF writes to both a private dataset and public dataset. Non-SDF instances will probably only need to write to a single private dataset.
![History Archive Dag](documentation/images/history_archive_without_captive_core.png)

### **State Table Export DAG**

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/state_table_dag.py)

- exports accounts, account_signers, offers, claimable_balances, liquidity pools, and trustlines
- inserts into BigQuery

![Bucket List DAG](documentation/images/state_table_export.png)

### **Bucket List DAG (Unsupported)**

-> *_NOTE:_* Bucket List DAG is unsupported.
+> _*NOTE:*_ Bucket List DAG is unsupported.
[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/bucket_list_dag.py):

- exports from Stellar's bucket list, which contains data on accounts, offers, trustlines, account signers, liquidity pools, and claimable balances
- inserts into BigQuery

@@ -575,12 +580,12 @@ Apply tasks can also be used to insert unique values only. This behavior is used

### **build_batch_stats**

[This file](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/stellar_etl_airflow/build_batch_stats.py) pulls and inserts batch stats into BigQuery.
Data is inserted into `history_archives_dag_runs`.

### **bq_insert_job_task**

[This file](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/stellar_etl_airflow/build_bq_insert_job_task.py) contains methods for creating BigQuery insert job tasks.
The task will read the query from the specified sql file and will return a BigQuery job operator configured to the GCP project and datasets defined.
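
As a hedged illustration (the helper name, signature, and placeholder convention here are assumptions, not necessarily the repository's actual API), such a builder based on the Google provider's `BigQueryInsertJobOperator` could look like:

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator


def build_bq_insert_job_task(dag, sql_path: str, project: str, dataset: str):
    """Illustrative only: read a query from a .sql file and wrap it in an insert job."""
    with open(sql_path) as f:
        # Assumes the .sql files use {project_id}/{dataset_id} placeholders.
        query = f.read().format(project_id=project, dataset_id=dataset)

    return BigQueryInsertJobOperator(
        task_id=f"insert_records_{dataset}",
        configuration={"query": {"query": query, "useLegacySql": False}},
        project_id=project,
        dag=dag,
    )
```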

### **cross_dependency_task**
@@ -591,7 +596,7 @@

[This file](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/stellar_etl_airflow/build_delete_data_task.py) deletes data from a specified BigQuery `project.dataset.table` according to the batch interval.

-> *_NOTE:_* If the batch interval is changed, the deleted data might not align with the prior batch intervals.
+> _*NOTE:*_ If the batch interval is changed, the deleted data might not align with the prior batch intervals.
<br>

@@ -691,4 +696,4 @@ Once you make a change, you can test it using the Airflow command line interface

This guide can also be useful for testing deployment in a new environment. Follow this testing process for all the tasks in your DAGs to ensure that they work end-to-end.

An alternative to the testing flow above is to `trigger` the task in the Airflow UI. From there you are able to view the task status, log, and task details.