Add workflow to calculate schema changes #517

Merged 20 commits on Oct 29, 2024
Changes from 17 commits
73 changes: 73 additions & 0 deletions .github/workflows/update_dbt_marts_schema_changelog.yml
@@ -0,0 +1,73 @@
name: Update changelog for DBT marts

on:
  schedule:
    - cron: "0 23 * * 1-5" # Runs at 23:00 UTC, Monday through Friday (5 PM CST, 6 PM CDT)

concurrency:
  group: ${{ github.workflow }}-${{ github.ref_protected == 'true' && github.sha || github.ref }}-${{ github.event_name }}
  cancel-in-progress: true

env:
  IS_RECENCY_AIRFLOW_TASK: "false"

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v2

      - name: Authenticate to crypto-stellar GCP
        uses: "google-github-actions/auth@v2"
        with:
          project_id: hubble-261722
          credentials_json: "${{ secrets.CREDS_PROD_HUBBLE }}"

      - name: Set up Google Cloud SDK
        run: |
          echo "Installing Google Cloud SDK..."
          echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
          curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
          sudo apt-get update && sudo apt-get install -y google-cloud-sdk

      - name: Create new branch
        id: create_branch
        run: |
          git config --local user.email "[email protected]"
          git config --local user.name "GitHub Action"
          BRANCH_NAME="update-data-schema-changelog-${{ github.run_id }}"
          git checkout -b "$BRANCH_NAME"
          # Write the step output through $GITHUB_OUTPUT; ::set-output is deprecated.
          echo "branch=$BRANCH_NAME" >> "$GITHUB_OUTPUT"

      - name: Run Bash Script
        run: |
          cd "$GITHUB_WORKSPACE"
          output=$(. scripts/update_dbt_marts_schema_changelog.sh)
          echo "$output" > changelog/dbt_marts.md
Contributor:

Hmm, is it possible to make this append to the changelog instead of overwriting it entirely on each run?

There might be a time when we delete the elementary.alerts_schema_changes table, like the times we deleted the whole elementary dataset because it was broken or elementary was doing something unexpected.

If it's too hard to make this append instead of overwrite, we should remember to keep a copy of elementary.alerts_schema_changes; otherwise we'd lose the changelog history.
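
One way to keep such a copy is to snapshot the table with `bq cp` before any cleanup of the elementary dataset. The following is a minimal sketch, not part of this PR: the backup dataset `elementary_backup`, the dated table suffix, and the `DRY_RUN` switch are all assumptions made for illustration, and the backup dataset would need to exist already.

```shell
#!/bin/bash
# Sketch only: the destination dataset and the DRY_RUN switch are hypothetical.

SOURCE_TABLE="hubble-261722:elementary.alerts_schema_changes"
BACKUP_DATASET="hubble-261722:elementary_backup"  # hypothetical dataset

# Build a dated destination table name from a YYYYMMDD string.
backup_destination() {
  local day="$1"
  echo "${BACKUP_DATASET}.alerts_schema_changes_${day}"
}

# Copy the table, or only print the command when DRY_RUN=1.
backup_schema_changes() {
  local day="${1:-$(date -u +%Y%m%d)}"
  local dest
  dest=$(backup_destination "$day")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "bq cp ${SOURCE_TABLE} ${dest}"
  else
    bq cp "${SOURCE_TABLE}" "${dest}"
  fi
}
```

Running `DRY_RUN=1 backup_schema_changes 20241029` only prints the `bq cp` command, which makes it easy to check the source and destination before copying anything.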

Contributor Author:

Hmm, this process is pretty dependent on elementary.alerts_schema_changes. If we ever deleted that dataset, the deletion itself would affect how the changelog is computed, so ideally we should not delete it.

However, as far as appending is concerned, we could:

  • parse tracked date if any
  • add date filter to get data between tracked date and today
  • update the tracked date as max(date)

Structure of .md file:

Date: 2024-10-10

Table 1

Date: 2024-09-01

Table 2
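
The three steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: it assumes the single-table layout of changelog/dbt_marts.md (every data row starts with `| YYYY-MM-DD `) rather than the per-date sections proposed here, and the function names `latest_tracked_date`, `date_filter`, and `prepend_rows` are hypothetical.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical and the changelog is assumed
# to be one markdown table whose data rows start with "| YYYY-MM-DD ".

# Print the newest date already present in the changelog (empty if none).
latest_tracked_date() {
  local file="$1"
  [ -f "$file" ] || return 0
  grep -oE '^\| [0-9]{4}-[0-9]{2}-[0-9]{2} ' "$file" | tr -d '| ' | sort -r | head -n 1
}

# Print a WHERE clause limiting the query to rows newer than the tracked date.
# Strictly-greater avoids duplicate rows but would miss changes detected
# later on the tracked day itself; a real version would need to handle that.
date_filter() {
  local tracked="$1"
  if [ -n "$tracked" ]; then
    echo "WHERE date(detected_at) > '${tracked}'"
  fi
}

# Insert new rows directly under the table separator so the newest stay on top.
# $1 = changelog file, $2 = new rows already formatted as markdown table rows.
prepend_rows() {
  local file="$1"
  local new_rows="$2"
  [ -n "$new_rows" ] || return 0
  local tmp
  tmp=$(mktemp)
  # Pass the rows through the environment so embedded newlines survive.
  ROWS="$new_rows" awk '
    { print }
    /^\|-/ && !done { print ENVIRON["ROWS"]; done = 1 }
  ' "$file" > "$tmp"
  mv "$tmp" "$file"
}
```

With helpers like these, the tracked date never needs to be stored separately: it is re-read from the newest row of the changelog on each run, so history already in the file survives even if the source table is later dropped.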

Contributor Author:

But also, I think it's quite a bit of text formatting. We can leave it as is and revisit later if needed. In the first place, I don't think we are going to see a lot of schema changes in general.

Contributor:

Yup sounds good


      - name: Commit changes
        id: commit_changes
        run: |
          git add changelog/dbt_marts.md
          if git commit -m "Update changelog for DBT marts"; then
            echo "Changes committed."
            echo "changes_committed=true" >> "$GITHUB_OUTPUT"
          else
            echo "No changes to commit."
            echo "changes_committed=false" >> "$GITHUB_OUTPUT"
          fi

      - name: Push branch
        if: steps.commit_changes.outputs.changes_committed == 'true'
        run: |
          git push origin ${{ steps.create_branch.outputs.branch }}

      - name: Create Pull Request
        if: steps.commit_changes.outputs.changes_committed == 'true'
        run: |
          gh pr create -B master -H ${{ steps.create_branch.outputs.branch }} \
            --title 'Merge ${{ steps.create_branch.outputs.branch }} into master' \
            --body 'Created by GitHub action'
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
1 change: 1 addition & 0 deletions .prettierignore
@@ -0,0 +1 @@
**/*.md
1 change: 1 addition & 0 deletions .sqlfluffignore
@@ -1 +1,2 @@
dags/ddls/queries
changelog/dbt_marts.md
19 changes: 19 additions & 0 deletions changelog/dbt_marts.md
@@ -0,0 +1,19 @@
# Changes in DBT marts schema

| Date | Table Name | Operation | Columns |
|------------|---------------------------------|---------------|--------------------------|
| 2024-09-12 | ASSET_STATS_AGG | column_added | airflow_start_ts |
| 2024-09-12 | DIM_DATES | column_added | airflow_start_ts |
| 2024-09-12 | DIM_MGI_WALLETS | column_added | airflow_start_ts |
| 2024-09-12 | FCT_MGI_CASHFLOW | column_added | airflow_start_ts |
| 2024-09-12 | LIQUIDITY_POOLS_VALUE | column_added | airflow_start_ts |
| 2024-09-12 | LIQUIDITY_POOLS_VALUE_HISTORY | column_added | airflow_start_ts |
| 2024-09-12 | LIQUIDITY_POOL_TRADE_VOLUME | column_added | airflow_start_ts |
| 2024-09-12 | LIQUIDITY_PROVIDERS | column_added | airflow_start_ts |
| 2024-09-12 | MGI_MONTHLY_USD_BALANCE | column_added | airflow_start_ts |
| 2024-09-12 | MGI_NETWORK_STATS_AGG | column_added | airflow_start_ts |
| 2024-09-12 | NETWORK_STATS_AGG | column_added | airflow_start_ts |
| 2024-09-12 | OHLC_EXCHANGE_FACT | column_added | airflow_start_ts |
| 2024-09-12 | PARTNERSHIP_ASSETS__ACCOUNT_HOLDERS_ACTIVITY_FACT | column_added | airflow_start_ts |
| 2024-09-12 | PARTNERSHIP_ASSETS__ASSET_ACTIVITY_FACT | column_added | airflow_start_ts |
| 2024-09-12 | PARTNERSHIP_ASSETS__MOST_ACTIVE_FACT | column_added | airflow_start_ts |
24 changes: 24 additions & 0 deletions scripts/update_dbt_marts_schema_changelog.sh
@@ -0,0 +1,24 @@
#!/bin/bash

# Aggregate schema changes per day, table, and operation type.
result=$(bq query --format=prettyjson --nouse_legacy_sql \
'SELECT
    date(detected_at) as date
    , table_name
    , sub_type as operation
    , ARRAY_AGG(column_name) as columns
FROM
    `hubble-261722`.elementary.alerts_schema_changes
GROUP BY
    1, 2, 3
ORDER BY 1 DESC, 2 ASC
')

echo "# Changes in DBT marts schema"

echo ""

echo "| Date | Table Name | Operation | Columns |"
echo "|------------|---------------------------------|---------------|--------------------------|"

echo "$result" | jq -r '.[] | "| \(.date) | \(.table_name ) | \(.operation) | \(.columns | join(", ")) |"'
echo ""
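
To see what the `jq` filter in this script produces without querying BigQuery, it can be run against a hand-written sample. The sketch below assumes the query result is a JSON array of objects with `date`, `table_name`, `operation`, and `columns` keys; the `format_rows` wrapper, the sample row, and the `batch_id` column are made up for illustration.

```shell
#!/bin/bash
# Sketch only: sample_rows is made-up input in the shape the script expects.

# Turn a JSON array of {date, table_name, operation, columns[]} into table rows.
format_rows() {
  jq -r '.[] | "| \(.date) | \(.table_name) | \(.operation) | \(.columns | join(", ")) |"'
}

sample_rows='[
  {"date": "2024-09-12", "table_name": "DIM_DATES", "operation": "column_added",
   "columns": ["airflow_start_ts", "batch_id"]}
]'
```

Piping the sample through the wrapper with `echo "$sample_rows" | format_rows` yields one markdown row per array element, with multiple columns joined by a comma and a space.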