Add CI/CD automation (#130)
* fix `isort` hook and the imports orders

* fix `sqlfluff` configuration

* update `airflow.cfg` to mirror dev composer

* update `sentry-sdk` to overcome warnings

* add script to test all dags integrity

* add script to upload dags and schema files to gcs

* add requirements to run airflow from ci

* add ci workflow

* hardcode `dags/` since it won't change

* change script name

* add ci/cd to test and update files in airflow

* fix `python-version` on `pre-commit` step

* fix `isort` and `black` compatibility

* fix imports hopefully for the last time

* autofix from `pre-commit`

* remove unnecessary file

* rename `airflow.cfg` to specify the environment

* add airflow configuration file from prod composer

* rename file

* add logger and env selection

* update ci/cd files

* fix formatting

* update ci/cd workflow

* fix ci/cd

* rearrange dependencies

* modify authentication method to gcp

* trigger ci only when pr comes from official repo

* lint merged files

* add new variables

* pause reset testnet dag when on prod composer

* fix uploading `airflow.cfg` to gcs
lucaszanotelli authored Apr 25, 2023
1 parent 9e8257d commit 3e34347
Showing 25 changed files with 777 additions and 261 deletions.
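
Among the changes is a script that checks the integrity of every DAG; the CI workflow below runs it with `pytest dags/`. As a minimal sketch of what such a check typically looks like (the actual test added in this commit may differ), built on Airflow's `DagBag`:

```python
from airflow.models import DagBag


def test_dag_integrity():
    # Parse everything under dags/ exactly as the scheduler would,
    # skipping Airflow's bundled example DAGs.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)

    # Syntax errors, missing imports, and missing Airflow variables
    # all surface here as import errors.
    assert not dag_bag.import_errors, f"DAG import failures: {dag_bag.import_errors}"

    # Sanity check that at least one DAG was discovered.
    assert len(dag_bag.dags) > 0
```

Running a check like this in CI catches broken DAG files before anything is uploaded to Composer.
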
113 changes: 113 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,113 @@
name: CI

on:
  pull_request:
    types:
      - opened
      - reopened
      - synchronize
      - closed
    branches:
      - master

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    if: >-
      github.event.pull_request.merged == false &&
      github.event.pull_request.state == 'open'
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.8

      - uses: pre-commit/[email protected]

  tests:
    runs-on: ubuntu-latest
    needs: [pre-commit]

    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.8

      - name: Install dependencies
        run: |
          cat airflow_variables_dev.json | sed -e s/\\/home\\/airflow\\/gcs\\/dags\\/// > airflow_variables_ci.json
          python -m pip install --upgrade pip
          pip install -r requirements-ci.txt

      - name: Init Airflow SQLite database
        run: airflow db init

      - name: Import Airflow variables
        run: airflow variables import airflow_variables_ci.json

      - name: Pytest
        run: pytest dags/

  deploy-to-dev:
    runs-on: ubuntu-latest
    needs: [tests]
    # deploy to dev occurs every time
    # someone submits a pr targeting `master`
    # from a branch at `stellar/stellar-etl-airflow` repo
    if: github.repository == 'stellar/stellar-etl-airflow'
    # known caveats:
    # if there's more than 1 person working
    # in the same file this won't behave nicely

    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.8

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install google-cloud-storage==2.1.0

      - name: Authenticate to test-hubble GCP
        uses: google-github-actions/auth@v1
        with:
          credentials_json: "${{ secrets.CREDS_TEST_HUBBLE }}"

      - name: Upload files to dev GCS bucket
        run: python dags/stellar_etl_airflow/add_files_to_composer.py --bucket $BUCKET
        env:
          GOOGLE_CLOUD_PROJECT: test-hubble-319619
          BUCKET: us-central1-hubble-1pt5-dev-7db0e004-bucket

  promote-to-prod:
    runs-on: ubuntu-latest
    # deploy only occurs when pr is merged
    if: github.event.pull_request.merged == true
    permissions:
      pull-requests: write

    steps:
      - uses: actions/checkout@v3

      - name: Create pull request
        run: |
          gh pr create \
            --base release \
            --head master \
            --reviewer stellar/platform-committers \
            --title "[PRODUCTION] Update production Airflow environment" \
            --body "This PR was auto-generated by GitHub Actions.
          After merged and closed, this PR will trigger an action that updates DAGs, libs and schemas files from prod Airflow."
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
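
A detail worth noting in the `tests` job above: the `sed` one-liner rewrites `airflow_variables_dev.json` into `airflow_variables_ci.json` by stripping the Composer path prefix `/home/airflow/gcs/dags/`, so file paths stored in the variables resolve inside the CI checkout instead of the Composer bucket mount. A rough Python equivalent, shown only for illustration (it replaces every occurrence, whereas the `sed` expression replaces the first per line):

```python
from pathlib import Path

# Strip the Composer-specific prefix so paths point at the repo checkout in CI.
text = Path("airflow_variables_dev.json").read_text()
Path("airflow_variables_ci.json").write_text(text.replace("/home/airflow/gcs/dags/", ""))
```
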
67 changes: 67 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,67 @@
name: release

on:
  pull_request:
    types:
      - closed
    branches:
      - release

jobs:
  tests:
    runs-on: ubuntu-latest
    if: github.event.pull_request.merged == true
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.8

      - name: Install dependencies
        run: |
          cat airflow_variables_dev.json | sed -e s/\\/home\\/airflow\\/gcs\\/dags\\/// > airflow_variables_ci.json
          python -m pip install --upgrade pip
          pip install -r requirements-ci.txt

      - name: Init Airflow SQLite database
        run: airflow db init

      - name: Import Airflow variables
        run: airflow variables import airflow_variables_ci.json

      - name: Pytest
        run: pytest dags/

  release:
    runs-on: ubuntu-latest
    needs: [tests]
    # deploy only occurs when pr is merged
    if: >-
      github.event.pull_request.merged == true &&
      github.repository == 'stellar/stellar-etl-airflow'
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.8

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install google-cloud-storage==2.1.0

      - name: Authenticate to hubble GCP
        uses: google-github-actions/auth@v1
        with:
          credentials_json: "${{ secrets.CREDS_PROD_HUBBLE }}"

      - name: Upload files to prod GCS bucket
        run: python dags/stellar_etl_airflow/add_files_to_composer.py --bucket $BUCKET --env prod
        env:
          GOOGLE_CLOUD_PROJECT: hubble-261722
          BUCKET: us-central1-hubble-2-d948d67b-bucket
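
Both workflows delegate the actual upload to `dags/stellar_etl_airflow/add_files_to_composer.py`, with `google-cloud-storage==2.1.0` installed and credentials supplied by `google-github-actions/auth`. That script is not part of this excerpt; the sketch below is only a guess at its general shape (walking local folders and copying them into the Composer bucket passed via `--bucket`), not the repository's actual implementation:

```python
import argparse
import glob
import os

from google.cloud import storage


def upload_directory(bucket_name: str, local_dir: str, prefix: str) -> None:
    """Upload every file under local_dir to gs://<bucket_name>/<prefix>/..."""
    client = storage.Client()  # picks up the credentials configured by the auth step
    bucket = client.bucket(bucket_name)
    for path in glob.glob(os.path.join(local_dir, "**", "*"), recursive=True):
        if os.path.isfile(path):
            blob = bucket.blob(os.path.join(prefix, os.path.relpath(path, local_dir)))
            blob.upload_from_filename(path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--bucket", required=True)
    parser.add_argument("--env", default="dev")  # assumed default; the prod run passes --env prod
    args = parser.parse_args()

    # Composer picks DAGs up from the bucket's dags/ folder.
    upload_directory(args.bucket, "dags", "dags")
```

The `--env prod` flag presumably selects the production `airflow.cfg` and variable set referenced in the commit message ("rename `airflow.cfg` to specify the environment", "add logger and env selection").
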
51 changes: 28 additions & 23 deletions README.md
@@ -40,6 +40,7 @@ This repository contains the Airflow DAGs for the [Stellar ETL](https://github.c
<br>

# Installation and Setup

- [Google Cloud Platform](#google-cloud-platform)
- [Cloud Composer](#cloud-composer)
- [Airflow Variables Explanation](#airflow-variables-explanation)
@@ -75,6 +76,7 @@ Below are instructions to initialize the Google Cloud SDK and create the GCP proj

- Open the [Cloud Storage browser](https://console.cloud.google.com/storage/browser)
- [Create](https://cloud.google.com/storage/docs/creating-buckets) a new Google Storage bucket that will store exported files

> **_NOTE:_** Creating a new Cloud Composer environment will automatically create a new GCS bucket.
> **_NOTE:_** The dataset name you choose corresponds to the Airflow variable "gcs_exported_data_bucket_name".
@@ -85,7 +87,7 @@ Below are instructions to initialize the Google Cloud SDK and create the GCP proj
<br>

-***
+---

## **Cloud Composer**

Expand All @@ -102,7 +104,7 @@ Create a new Cloud Composer environment using the [UI](https://console.cloud.goo
> Composer 2 environments use autopilot exclusively for resource management.
> **_Note_**: If no service account is provided, GCP will use the default GKE service account. For quick setup this is an easy option.
-Remember to adjust the disk size, machine type, and node count to fit your needs. The python version must be 3, and the image must be `composer-1.16.11-airflow-1.10.14` or later. GCP deprecates support for older versions of composer and airflow. It is recommended that you select a stable, latest version to avoid an environment upgrade. See [the command reference page](https://cloud.google.com/sdk/gcloud/reference/composer/environments/create) for a detailed list of parameters.
+> Remember to adjust the disk size, machine type, and node count to fit your needs. The python version must be 3, and the image must be `composer-1.16.11-airflow-1.10.14` or later. GCP deprecates support for older versions of composer and airflow. It is recommended that you select a stable, latest version to avoid an environment upgrade. See [the command reference page](https://cloud.google.com/sdk/gcloud/reference/composer/environments/create) for a detailed list of parameters.
> **_TROUBLESHOOTING:_** If the environment creation fails because the "Composer Backend timed out" try disabling and enabling the Cloud Composer API. If the creation fails again, try creating a service account with Owner permissions and use it to create the Composer environment.
@@ -322,7 +324,7 @@ name: etl-data
<summary>Add Poststart Script to Airflow Workers</summary>
Find the namespace name in the airflow-worker config file. It should be near the top of the file, and may look like `composer-1-12-0-airflow-1-10-10-2fca78f7`. This value will be used in later commands.

Next, open the cloud shell. Keep your airflow-worker configuration file open, or save it. In the cloud shell, create a text file called `poststart.sh` by running the command: `nano poststart.sh`. Then, copy the text from the `poststart.sh` file in this repository into the newly opened file.

- If you changed the path for the local folder in the previous step, make sure that you edit line 13:

@@ -334,21 +336,21 @@ Next, open the cloud shell. Keep your airflow-worker configuration file open, or

```bash
gcloud container clusters get-credentials <cluster_name> --region=<composer_region>

kubectl create configmap start-config --from-file poststart.sh -n <namespace_name>
```

- Return to the airflow-worker config file. Add a new volumeMount to /etc/scripts.

```
...
volumeMounts:
...
- mountPath: /etc/scripts
  name: config-volume
...
```

- Then, add a new Volume that links to the configMap you created.
@@ -415,7 +417,7 @@ The `airflow_variables.txt` file provides a set of default values for variables.

<br>

-***
+---

## **Airflow Variables Explanation**

@@ -438,14 +440,13 @@ The `airflow_variables.txt` file provides a set of default values for variables.
| owner | the name of the owner of the Airflow DAGs | Yes. |
| schema_filepath | file path to schema folder | No, unless schemas are in a different location |
| table_ids | JSON object. Each key should be a data structure, and the value should be the name of the BigQuery table | Yes, if desired. Make sure each type has a different table name. |
| cluster_fields | JSON object. Each key should be a BigQuery table, and the value is a list of columns that the table is clustered by | Yes, if desired for tables that want clustering |
| parititon_fields | JSON object. Each key should be a BigQuery table, and the value is a JSON object of type and field to partition by | Yes, if desired for tables that want partitioning |
| gcs_exported_object_prefix | String to prefix run_id export task output path with | Yes, if desired to prefix run_id |
| sentry_dsn | Sentry Data Source Name to tell where Sentry SDK should send events | Yes |
| sentry_environment | Environment that sentry alerts will fire | Yes |
| use_testnet | Flag to use testnet data instead of mainnet | Yes, if desired to use testnet data |
| task_timeout | JSON object. Each key should be the airflow util task name, and the value is the timeout in seconds | Yes, if desired to give tasks timeout |
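
Several of these variables (`table_ids`, `cluster_fields`, `task_timeout`) hold JSON objects. As an illustration of how a DAG typically reads them (the keys shown are hypothetical; the real values live in the `airflow_variables_*.json` files):

```python
from airflow.models import Variable

# JSON-valued variables are deserialized into Python dicts when the DAG file is parsed.
table_ids = Variable.get("table_ids", deserialize_json=True)  # e.g. {"ledgers": "history_ledgers"}
cluster_fields = Variable.get("cluster_fields", deserialize_json=True)
task_timeout = Variable.get("task_timeout", deserialize_json=True)

ledgers_table = table_ids["ledgers"]
```
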


### **Kubernetes-Specific Variables**

Expand Down Expand Up @@ -476,7 +477,7 @@ Here are some example `volume_config` values. Note that a ReadWriteMany volume i
## **Starting Up**

> **_NOTE:_** The Google Cloud Composer instance of Airflow has limited CLI support.
-[Supported Airflow CLI commands](https://cloud.google.com/composer/docs/composer-2/access-airflow-cli#supported-commands)
+> [Supported Airflow CLI commands](https://cloud.google.com/composer/docs/composer-2/access-airflow-cli#supported-commands)
First, this image shows the Airflow web UI components for pausing and triggering DAGs:
![Airflow UI](documentation/images/AirflowUI.png)
@@ -518,34 +519,38 @@ This section contains information about the Airflow setup. It includes our DAG d
### **History Archive with Captive Core DAG**

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/history_archive_with_captive_core_dag.py):

- exports transactions, operations, trades, and effects from Stellar using CaptiveCore
- inserts into BigQuery
-> *_NOTE:_* SDF writes to both a private dataset and public dataset. Non-SDF instances will probably only need to write to a single private dataset.
+> _*NOTE:*_ SDF writes to both a private dataset and public dataset. Non-SDF instances will probably only need to write to a single private dataset.
![History Archive with Captive Core Dag](documentation/images/history_archive_with_captive_core.png)

### **History Archive without Captive Core DAG**

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/history_archive_without_captive_core_dag.py):

- exports assets and ledgers from Stellar's history archives
- inserts into BigQuery
-> *_NOTE:_* SDF writes to both a private dataset and public dataset. Non-SDF instances will probably only need to write to a single private dataset.
+> _*NOTE:*_ SDF writes to both a private dataset and public dataset. Non-SDF instances will probably only need to write to a single private dataset.
![History Archive Dag](documentation/images/history_archive_without_captive_core.png)

### **State Table Export DAG**

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/state_table_dag.py)

- exports accounts, account_signers, offers, claimable_balances, liquidity pools, and trustlines
- inserts into BigQuery

![Bucket List DAG](documentation/images/state_table_export.png)

### **Bucket List DAG (Unsupported)**

-> *_NOTE:_* Bucket List DAG is unsupported.
+> _*NOTE:*_ Bucket List DAG is unsupported.
[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/bucket_list_dag.py):

- exports from Stellar's bucket list, which contains data on accounts, offers, trustlines, account signers, liquidity pools, and claimable balances
- inserts into BigQuery

@@ -575,12 +580,12 @@ Apply tasks can also be used to insert unique values only. This behavior is used

### **build_batch_stats**

[This file](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/stellar_etl_airflow/build_batch_stats.py) pulls and inserts batch stats into BigQuery.
Data is inserted into `history_archives_dag_runs`.

### **bq_insert_job_task**

[This file](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/stellar_etl_airflow/build_bq_insert_job_task.py) contains methods for creating BigQuery insert job tasks.
The task will read the query from the specified sql file and will return a BigQuery job operator configured to the GCP project and datasets defined.
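
As a hedged illustration (the helper name, signature, and placeholder convention here are assumptions, not necessarily the repository's actual API), such a builder based on the Google provider's `BigQueryInsertJobOperator` could look like:

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator


def build_bq_insert_job_task(dag, sql_path: str, project: str, dataset: str):
    """Illustrative only: read a query from a .sql file and wrap it in an insert job."""
    with open(sql_path) as f:
        # Assumes the .sql files use {project_id}/{dataset_id} placeholders.
        query = f.read().format(project_id=project, dataset_id=dataset)

    return BigQueryInsertJobOperator(
        task_id=f"insert_records_{dataset}",
        configuration={"query": {"query": query, "useLegacySql": False}},
        project_id=project,
        dag=dag,
    )
```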

### **cross_dependency_task**
@@ -591,7 +596,7 @@

[This file](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/stellar_etl_airflow/build_delete_data_task.py) deletes data from a specified BigQuery `project.dataset.table` according to the batch interval.

-> *_NOTE:_* If the batch interval is changed, the deleted data might not align with the prior batch intervals.
+> _*NOTE:*_ If the batch interval is changed, the deleted data might not align with the prior batch intervals.
<br>

@@ -691,4 +696,4 @@ Once you make a change, you can test it using the Airflow command line interface

This guide can also be useful for testing deployment in a new environment. Follow this testing process for all the tasks in your DAGs to ensure that they work end-to-end.

An alternative to the testing flow above is to `trigger` the task in the Airflow UI. From there you are able to view the task status, log, and task details.