[ETL-647] Only submit most recent ~2 weeks of exports for integration tests #115
```diff
@@ -22,7 +22,6 @@ jobs:
       - uses: actions/setup-python@v4
       - uses: pre-commit/[email protected]

   upload-files:
     name: Upload files to S3 bucket in development
     runs-on: ubuntu-latest
```
```diff
@@ -265,6 +264,8 @@ jobs:
     permissions:
       id-token: write
       contents: read
+    env:
+      EXPORT_S3_KEY_PREFIX: pilot-data
     steps:
       - name: Setup code, pipenv, aws
         uses: Sage-Bionetworks/action-pipenv-aws-setup@v3
```
```diff
@@ -277,11 +278,30 @@ jobs:
         if: github.ref_name != 'main'
         run: echo "NAMESPACE=$GITHUB_REF_NAME" >> $GITHUB_ENV

-      - name: Copies over test files from ingestion bucket
-        run: >
-          aws s3 cp s3://recover-dev-ingestion/pilot-data/ s3://$DEV_INPUT_BUCKET/$NAMESPACE/
-          --recursive
-          --exclude "owner.txt"
+      - name: Fetch the most recent exports.
+        id: recent-exports
+        run: |
+          # Retrieve the last ~2 weeks of exports from each cohort
+          # Ignore keys which end with "/" and which match "owner.txt"
+          echo "KEYS=$(
+            aws s3api list-objects-v2 \
+              --bucket $DEV_INPUT_BUCKET \
+              --prefix $EXPORT_S3_KEY_PREFIX \
+              --query '((sort_by(Contents[? !ends_with(Key, `/`) && !contains(Key, `owner.txt`)], &LastModified)[::-1])[:28])[*].Key' |
+            jq -c
+          )" >> "$GITHUB_OUTPUT"
+
+      - name: Copy most recent exports to this namespace
+        run: >
+          echo '${{ steps.recent-exports.outputs.KEYS }}' | jq -r '.[]' | while read -r key; do
+            aws s3 cp "s3://$DEV_INPUT_BUCKET/$key" "s3://$DEV_INPUT_BUCKET/$NAMESPACE/${key#"$EXPORT_S3_KEY_PREFIX"/}";
+          done
+
+      - name: Write most recent exports to S3 cloudformation bucket
+        run: >
+          echo '${{ steps.recent-exports.outputs.KEYS }}' |
+          jq --arg bucket "s3://$DEV_INPUT_BUCKET/" '.[] |= $bucket + .' |
+          aws s3 cp - "s3://${{ vars.CFN_BUCKET }}/$NAMESPACE/integration_test_exports.json"

   sceptre-deploy-staging:
     name: Deploys to staging of prod using sceptre
```
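The `--query` expression does the selection client-side in JMESPath: drop placeholder keys ending in `/` and any `owner.txt`, sort ascending by `LastModified`, reverse to newest-first, and keep the first 28 keys. Below is a minimal local sketch of the same filter/sort/slice logic, re-expressed in jq against a canned `list-objects-v2` response; the cohort folder names, keys, and timestamps are made up for illustration.

```bash
# Canned stand-in for an `aws s3api list-objects-v2` response (hypothetical keys).
cat > /tmp/list_response.json <<'EOF'
{
  "Contents": [
    {"Key": "pilot-data/adults_v1/2023-10-01.zip",    "LastModified": "2023-10-01T04:00:00Z"},
    {"Key": "pilot-data/pediatric_v1/2023-10-01.zip", "LastModified": "2023-10-01T04:05:00Z"},
    {"Key": "pilot-data/adults_v1/2023-09-30.zip",    "LastModified": "2023-09-30T04:00:00Z"},
    {"Key": "pilot-data/owner.txt",                   "LastModified": "2023-01-01T00:00:00Z"},
    {"Key": "pilot-data/adults_v1/",                  "LastModified": "2023-01-01T00:00:00Z"}
  ]
}
EOF

# Keep real export objects only, sort newest-first, take at most the 28 most recent keys.
jq -c '[.Contents[]
        | select((.Key | endswith("/") | not) and (.Key | contains("owner.txt") | not))]
       | sort_by(.LastModified) | reverse | .[:28] | map(.Key)' /tmp/list_response.json
# => ["pilot-data/pediatric_v1/2023-10-01.zip","pilot-data/adults_v1/2023-10-01.zip","pilot-data/adults_v1/2023-09-30.zip"]
```

With two cohorts each exporting daily, 28 objects comes out to roughly two weeks per cohort, which appears to be where the "~2 weeks" in the step's comment comes from.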
```diff
@@ -293,7 +313,6 @@ jobs:
     permissions:
       id-token: write
       contents: read

     steps:
       - name: Setup code, pipenv, aws
         uses: Sage-Bionetworks/action-pipenv-aws-setup@v3
```
```diff
@@ -325,6 +344,8 @@ jobs:
     permissions:
       id-token: write
       contents: read
+    env:
+      EXPORT_S3_KEY_PREFIX: main
     steps:
       - name: Setup code, pipenv, aws
         uses: Sage-Bionetworks/action-pipenv-aws-setup@v3
```
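Both "Copy most recent exports to this namespace" steps (development above, staging below) rebuild each destination key with bash prefix removal: `${key#"$EXPORT_S3_KEY_PREFIX"/}` strips the leading prefix and its slash, so the object keeps the same relative path under the new namespace. A quick sketch of just that expansion; the bucket and namespace values here are hypothetical.

```bash
# ${key#pattern} removes the shortest match of `pattern` from the front of $key.
EXPORT_S3_KEY_PREFIX=pilot-data
NAMESPACE=my-feature-branch   # hypothetical branch-based namespace
key=pilot-data/adults_v1/2023-10-01.zip

echo "${key#"$EXPORT_S3_KEY_PREFIX"/}"
# => adults_v1/2023-10-01.zip

# The copy step therefore maps the source key to the same path under $NAMESPACE:
echo "s3://my-dev-bucket/$NAMESPACE/${key#"$EXPORT_S3_KEY_PREFIX"/}"
# => s3://my-dev-bucket/my-feature-branch/adults_v1/2023-10-01.zip
```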
```diff
@@ -333,22 +354,27 @@ jobs:
           role_session_name: integration-test-${{ github.run_id }}
           python_version: ${{ env.PYTHON_VERSION }}

-      - name: generate test events
-        run: >
-          pipenv run python src/lambda_function/s3_to_glue/events/generate_test_event.py
-          --input-bucket $PROD_INPUT_BUCKET
-          --input-key-prefix $NAMESPACE
-          --output-directory ./src/lambda_function/s3_to_glue/events/
-
-      - name: Setup sam
-        uses: aws-actions/setup-sam@v2
-
-      - name: sam build lambda
-        run: >
-          sam build
-          -t src/lambda_function/s3_to_glue/template.yaml
-
-      - name: Invoke Lambda
-        run: |
-          cd src/lambda_function/s3_to_glue/
-          sam local invoke -e events/records.json --parameter-overrides "S3ToJsonWorkflowName=staging-S3ToJsonWorkflow"
+      - name: Fetch the most recent exports.
+        id: recent-exports
+        run: |
+          # Retrieve the last ~2 weeks of exports from each cohort
+          # Ignore keys which end with "/" and which match "owner.txt"
+          echo "KEYS=$(
+            aws s3api list-objects-v2 \
+              --bucket $PROD_INPUT_BUCKET \
+              --prefix "$EXPORT_S3_KEY_PREFIX/" \
+              --query '((sort_by(Contents[? !ends_with(Key, `/`) && !contains(Key, `owner.txt`)], &LastModified)[::-1])[:28])[*].Key' |
+            jq -c
+          )" >> "$GITHUB_OUTPUT"
+
+      - name: Copy most recent exports to this namespace
+        run: >
+          echo '${{ steps.recent-exports.outputs.KEYS }}' | jq -r '.[]' | while read -r key; do
+            aws s3 cp "s3://$PROD_INPUT_BUCKET/$key" "s3://$PROD_INPUT_BUCKET/staging/${key#"$EXPORT_S3_KEY_PREFIX"/}";
+          done
+
+      - name: Write most recent exports to S3 cloudformation bucket
+        run: >
+          echo '${{ steps.recent-exports.outputs.KEYS }}' |
+          jq --arg bucket "s3://$PROD_INPUT_BUCKET/" '.[] |= $bucket + .' |
+          aws s3 cp - "s3://${{ vars.CFN_BUCKET }}/staging/integration_test_exports.json"
```

Review thread on the `--query` expression:

Reviewer: If we happen to have more cohorts in the future, or fewer, what would we modify here to support that? I know some data types (like Google Fit samples?) only have adult data, so based on this logic would this pull 28 days of adult data?

Author: Assuming that these new cohorts are also exported on a daily basis, we would still see equal representation of the cohorts in the 28 most recent exports. 28 is more or less an arbitrary amount to submit for integration tests, so having fewer from each cohort wouldn't hurt. I don't think the number of cohorts would decrease, since we want to support the processing of data from any cohort in perpetuity.

Author: I'm not sure what you mean. This code pulls the 28 most recent ZIP exports across cohorts, so if there's a data type that's only present in one of the cohorts then we would expect to see fewer JSON files of that data type.
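In the "Write most recent exports to S3 cloudformation bucket" step, jq prepends the bucket URI to every key and `aws s3 cp -` uploads the result straight from stdin, so no temporary file is needed. A small sketch of the jq transform, with a made-up bucket name:

```bash
# '.[] |= $bucket + .' update-assigns each array element, prepending the bucket
# URI, which yields the contents of integration_test_exports.json.
echo '["main/adults_v1/2023-10-01.zip", "main/pediatric_v1/2023-10-01.zip"]' |
  jq --arg bucket "s3://my-prod-bucket/" '.[] |= $bucket + .'
# => [
#      "s3://my-prod-bucket/main/adults_v1/2023-10-01.zip",
#      "s3://my-prod-bucket/main/pediatric_v1/2023-10-01.zip"
#    ]
```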
Review thread on the fetch step:

Reviewer: Do we have ~2 weeks of exports from the develop input bucket to fetch? Or will this just get whatever is available, even if it's not 2 weeks' worth?

Author: Yeah, it acts like a list slice in Python: `[:28]`.
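To illustrate that clamping, here is the same slice in jq (JMESPath slices follow the same Python-style semantics): with only three keys available, `[:28]` simply returns all three.

```bash
# Slices clamp to the array length instead of erroring, so a short listing
# just yields whatever is available.
echo '["export-1.zip", "export-2.zip", "export-3.zip"]' | jq -c '.[:28]'
# => ["export-1.zip","export-2.zip","export-3.zip"]
```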
Reviewer: Nit: Gotcha, should we set this as an env var? That way we can use it in both integration tests (staging and development), and it would be easier to change in the future.

Author: Ohh, that's a good idea. I accidentally merged before reading this comment. I can fix it as part of #117.