
[ETL-647] Only submit most recent ~2 weeks of exports for integration tests #115

Merged · 1 commit · May 31, 2024

70 changes: 48 additions & 22 deletions .github/workflows/upload-and-deploy.yaml
@@ -22,7 +22,6 @@ jobs:
- uses: actions/setup-python@v4
- uses: pre-commit/[email protected]


upload-files:
name: Upload files to S3 bucket in development
runs-on: ubuntu-latest
@@ -265,6 +264,8 @@ jobs:
permissions:
id-token: write
contents: read
env:
EXPORT_S3_KEY_PREFIX: pilot-data
steps:
- name: Setup code, pipenv, aws
uses: Sage-Bionetworks/action-pipenv-aws-setup@v3
@@ -277,11 +278,30 @@
if: github.ref_name != 'main'
run: echo "NAMESPACE=$GITHUB_REF_NAME" >> $GITHUB_ENV

- name: Copies over test files from ingestion bucket
- name: Fetch the most recent exports.
Contributor:

Do we have ~2 weeks of exports in the develop input bucket to fetch? Or will this just get whatever is available, even if it's not 2 weeks' worth?

Contributor (Author):

Yeah, it acts like a list slice in Python (`[:28]`), so it just returns whatever is available if there are fewer than 28 exports.
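
For reference, a minimal illustration of that slice behavior using jq, which follows the same Python-style semantics as the JMESPath `[:28]` in the query below (illustration only, not part of the workflow):

```bash
# A slice never fails when fewer items exist -- it just returns what's there.
echo '["a","b","c"]' | jq -c '.[:28]'
# ["a","b","c"]
```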

Contributor:

Nit: Gotcha, should we set this as an env var? That way we can use it in both integration test jobs (staging and development), and it's easier to change in the future.

Contributor (Author):

Ohh, that's a good idea. I accidentally merged before reading this comment. I can fix it as part of #117 (see the sketch after this step).

id: recent-exports
run: |
# Retrieve the last ~2 weeks of exports from each cohort
# Ignore keys which end with "/" and which match "owner.txt"
echo "KEYS=$(
aws s3api list-objects-v2 \
--bucket $DEV_INPUT_BUCKET \
--prefix $EXPORT_S3_KEY_PREFIX \
--query '((sort_by(Contents[? !ends_with(Key, `/`) && !contains(Key, `owner.txt`)], &LastModified)[::-1])[:28])[*].Key' |
jq -c
)" >> "$GITHUB_OUTPUT"

- name: Copy most recent exports to this namespace
run: >
echo '${{ steps.recent-exports.outputs.KEYS }}' | jq -r '.[]' | while read -r key; do
aws s3 cp "s3://$DEV_INPUT_BUCKET/$key" "s3://$DEV_INPUT_BUCKET/$NAMESPACE/${key#"$EXPORT_S3_KEY_PREFIX"/}";
done
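
An aside on the `${key#...}` expansion in the copy step above: it strips the export prefix so each object lands directly under the namespace. A minimal illustration with a hypothetical key:

```bash
# Hypothetical key, illustration only: the "#" expansion removes the leading
# "$EXPORT_S3_KEY_PREFIX/" so the object is copied to $NAMESPACE/<rest-of-key>.
EXPORT_S3_KEY_PREFIX=pilot-data
key="pilot-data/2024-05-30/example_export.zip"
echo "${key#"$EXPORT_S3_KEY_PREFIX"/}"
# 2024-05-30/example_export.zip
```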

- name: Write most recent exports to S3 cloudformation bucket
run: >
aws s3 cp s3://recover-dev-ingestion/pilot-data/ s3://$DEV_INPUT_BUCKET/$NAMESPACE/
--recursive
--exclude "owner.txt"
echo '${{ steps.recent-exports.outputs.KEYS }}' |
jq --arg bucket "s3://$DEV_INPUT_BUCKET/" '.[] |= $bucket + .' |
aws s3 cp - "s3://${{ vars.CFN_BUCKET }}/$NAMESPACE/integration_test_exports.json"
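
An aside on the step above: the jq filter rewrites each key into a full S3 URI, and `aws s3 cp -` then uploads the resulting JSON manifest from stdin. A minimal illustration of the jq part, with hypothetical keys and bucket:

```bash
# Illustration only (hypothetical keys/bucket): |= updates each array element,
# prepending the bucket URI so the manifest holds full S3 URIs.
echo '["2024-05-30/a.zip","2024-05-30/b.zip"]' |
  jq -c --arg bucket "s3://example-input-bucket/" '.[] |= $bucket + .'
# ["s3://example-input-bucket/2024-05-30/a.zip","s3://example-input-bucket/2024-05-30/b.zip"]
```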

sceptre-deploy-staging:
name: Deploys to staging of prod using sceptre
@@ -293,7 +313,6 @@
permissions:
id-token: write
contents: read

steps:
- name: Setup code, pipenv, aws
uses: Sage-Bionetworks/action-pipenv-aws-setup@v3
@@ -325,6 +344,8 @@ jobs:
permissions:
id-token: write
contents: read
env:
EXPORT_S3_KEY_PREFIX: main
steps:
- name: Setup code, pipenv, aws
uses: Sage-Bionetworks/action-pipenv-aws-setup@v3
@@ -333,22 +354,27 @@
role_session_name: integration-test-${{ github.run_id }}
python_version: ${{ env.PYTHON_VERSION }}

- name: generate test events
- name: Fetch the most recent exports.
id: recent-exports
run: |
# Retrieve the last ~2 weeks of exports from each cohort
# Ignore keys which end with "/" and which match "owner.txt"
echo "KEYS=$(
aws s3api list-objects-v2 \
--bucket $PROD_INPUT_BUCKET \
--prefix "$EXPORT_S3_KEY_PREFIX/" \
--query '((sort_by(Contents[? !ends_with(Key, `/`) && !contains(Key, `owner.txt`)], &LastModified)[::-1])[:28])[*].Key' |
Contributor:

If we happen to have more or fewer cohorts in the future, what would we modify here to support that?

I know some data types (like Google Fit samples?) only have adult data, so based on this logic, would this just pull 28 days of adult data?

Contributor (Author):

> If we happen to have more or fewer cohorts in the future, what would we modify here to support that?

Assuming that these new cohorts are also exported on a daily basis, we would still see equal representation of the cohorts in the 28 most recent exports. 28 is more or less an arbitrary amount to submit for integration tests, so having fewer pediatric_v1 exports, for example, is not a big deal. But we could increase this number proportionally to the number of new cohorts if we like.

I don't think the number of cohorts would decrease, since we want to support the processing of data from any cohort in perpetuity.

> I know some data types (like Google Fit samples?) only have adult data, so based on this logic, would this just pull 28 days of adult data?

I'm not sure what you mean. This code pulls the 28 most recent ZIP exports across cohorts, so if there's a data type that's only present in one of the cohorts, then we would expect to see fewer JSON files of that data type. (A quick way to check the cohort mix is sketched just after this step.)

jq -c
)" >> "$GITHUB_OUTPUT"

- name: Copy most recent exports to this namespace
run: >
pipenv run python src/lambda_function/s3_to_glue/events/generate_test_event.py
--input-bucket $PROD_INPUT_BUCKET
--input-key-prefix $NAMESPACE
--output-directory ./src/lambda_function/s3_to_glue/events/
echo '${{ steps.recent-exports.outputs.KEYS }}' | jq -r '.[]' | while read -r key; do
aws s3 cp "s3://$PROD_INPUT_BUCKET/$key" "s3://$PROD_INPUT_BUCKET/staging/${key#"$EXPORT_S3_KEY_PREFIX"/}";
done

- name: Setup sam
uses: aws-actions/setup-sam@v2

- name: sam build lambda
- name: Write most recent exports to S3 cloudformation bucket
run: >
sam build
-t src/lambda_function/s3_to_glue/template.yaml

- name: Invoke Lambda
run: |
cd src/lambda_function/s3_to_glue/
sam local invoke -e events/records.json --parameter-overrides "S3ToJsonWorkflowName=staging-S3ToJsonWorkflow"
echo '${{ steps.recent-exports.outputs.KEYS }}' |
jq --arg bucket "s3://$PROD_INPUT_BUCKET/" '.[] |= $bucket + .' |
aws s3 cp - "s3://${{ vars.CFN_BUCKET }}/staging/integration_test_exports.json"