-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ETL-647] Only submit most recent ~2 weeks of exports for integration tests #115
Conversation
Quality Gate passedIssues Measures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 LGTM! It does look like there's some duplicated code logic, we could look to make these composite Github actions in the future but that's low priority since thats just aesthetics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just a couple of Qs
@@ -277,11 +278,30 @@ jobs: | |||
if: github.ref_name != 'main' | |||
run: echo "NAMESPACE=$GITHUB_REF_NAME" >> $GITHUB_ENV | |||
|
|||
- name: Copies over test files from ingestion bucket | |||
- name: Fetch the most recent exports. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have ~2 weeks of exports from the develop input bucket to fetch? Or will this just get whatever is available even if it's not 2 weeks worth?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it acts like a list slice in Python [:28]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Gotcha, should we set this as an env var? so this way we can use it in both the two integration test for staging and development, and easier to change in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh that's a good idea. I accidentally merged before reading this comment. I can fix as part of #117
aws s3api list-objects-v2 \ | ||
--bucket $PROD_INPUT_BUCKET \ | ||
--prefix "$EXPORT_S3_KEY_PREFIX/" \ | ||
--query '((sort_by(Contents[? !ends_with(Key, `/`) && !contains(Key, `owner.txt`)], &LastModified)[::-1])[:28])[*].Key' | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we happen to have more cohorts in the future or less, what do would modify here to support that?
I know some data types (like google fit samples?) only has adult data, so then based on this logic would this pull the 28 days of adult data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we happen to have more cohorts in the future or less, what do would modify here to support that?
Assuming that these new cohorts are also exported on a daily basis, we would still see equal representation of the cohorts in the 28 most recent exports. 28 is more or less an arbitrary amount to submit for integration tests, so having fewer pediatric_v1
exports, for example, is not a big deal. But we could increase this number proportionally to the number of new cohorts if we like.
I don't think the number of cohorts would decrease, since we want to support the processing of data from any cohort in perpetuity.
I know some data types (like google fit samples?) only has adult data, so then based on this logic would this pull the 28 days of adult data?
I'm not sure what you mean. This code pulls the 28 most recent ZIP exports across cohorts, so if there's a data type that's only present in one of the cohorts than we would expect to see fewer JSON files of that data type.
{namespace}/integration_test_exports.json
as a JSON array. This is for later reference in the comparison jobs. e.g.,