
[ETL-647] Only submit most recent ~2 weeks of exports for integration tests #115

Merged
merged 1 commit into from
May 31, 2024

Conversation

philerooski
Contributor

@philerooski philerooski commented May 30, 2024

  • Rather than copying over all input data, only copy the 28 (~2 weeks from each cohort) most recent exports to the namespaced location in the input bucket.
  • The S3 URIs of these 28 exports are uploaded to the S3 cloudformation bucket under the key {namespace}/integration_test_exports.json as a JSON array, for later reference in the comparison jobs, e.g.:
[
  "s3://recover-dev-input-data/pilot-data/adults_v1/2023-01-12T22--02--17Z_77fefff8-b0e2-4c1b-b0c5-405554c92368",
  "s3://recover-dev-input-data/pilot-data/adults_v1/2023-01-03T20--19--27Z_6fb88c91-eb34-4278-a603-4de03dfd9c70",
  "s3://recover-dev-input-data/pilot-data/pediatric_v1/2023-06-26T22_35_22Z_8ae9d391-5c5d-4d79-bdce-6d86314caf8c",
 ...
]
  • We now initiate integration tests for staging data by copying over exports from the main namespace, rather than creating and submitting test events.
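The manifest-upload step described above can be sketched as a small helper. This is an illustrative function (the name `build_manifest` and the namespace value are hypothetical; the actual workflow does this with the AWS CLI):

```python
import json

def build_manifest(namespace, export_uris):
    """Build the S3 key and JSON body for the integration-test manifest.

    The key follows the {namespace}/integration_test_exports.json convention
    from the PR description; the body is a JSON array of export S3 URIs.
    """
    key = f"{namespace}/integration_test_exports.json"
    body = json.dumps(export_uris, indent=2)
    return key, body

# Hypothetical namespace; the URI is one of the examples above.
key, body = build_manifest(
    "etl-647",
    ["s3://recover-dev-input-data/pilot-data/adults_v1/2023-01-12T22--02--17Z_77fefff8-b0e2-4c1b-b0c5-405554c92368"],
)
```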

@philerooski philerooski requested a review from a team as a code owner May 30, 2024 00:24
Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

Member

@thomasyu888 thomasyu888 left a comment


🔥 LGTM! It does look like there's some duplicated code logic; we could look to make these composite GitHub Actions in the future, but that's low priority since it's just aesthetics.

Contributor

@rxu17 rxu17 left a comment


LGTM! Just a couple of Qs

@@ -277,11 +278,30 @@ jobs:
if: github.ref_name != 'main'
run: echo "NAMESPACE=$GITHUB_REF_NAME" >> $GITHUB_ENV

-      - name: Copies over test files from ingestion bucket
+      - name: Fetch the most recent exports.
Contributor


Do we have ~2 weeks of exports from the develop input bucket to fetch? Or will this just get whatever is available even if it's not 2 weeks worth?

Contributor Author


Yeah, it acts like a list slice in Python ([:28]) — it gets whatever is available, up to 28.
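The slice semantics mean the query never fails when fewer than 28 exports exist — a quick illustration:

```python
# Slicing never raises when the list is shorter than the slice bound;
# it simply returns everything that is available.
exports = ["export_1", "export_2", "export_3"]
recent = exports[:28]
print(recent)  # ['export_1', 'export_2', 'export_3']
```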

Contributor


Nit: Gotcha — should we set this as an env var? That way we can use it in both the staging and development integration tests, and it's easier to change in the future.

Contributor Author


Ohh, that's a good idea. I accidentally merged before reading this comment. I can fix this as part of #117.

@rxu17 rxu17 changed the title [ET-647] Only submit most recent ~2 weeks of exports for integration tests [ETL-647] Only submit most recent ~2 weeks of exports for integration tests May 31, 2024
aws s3api list-objects-v2 \
--bucket $PROD_INPUT_BUCKET \
--prefix "$EXPORT_S3_KEY_PREFIX/" \
--query '((sort_by(Contents[? !ends_with(Key, `/`) && !contains(Key, `owner.txt`)], &LastModified)[::-1])[:28])[*].Key' |
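The JMESPath query above does three things: it filters out "directory" placeholder keys and owner.txt files, sorts the remaining objects by LastModified newest-first, and keeps the first 28. A rough Python equivalent, as an illustrative sketch only (the real job uses the CLI query as shown):

```python
def most_recent_exports(contents, n=28):
    """Mirror the JMESPath query: drop keys that end in '/' or contain
    'owner.txt', sort newest-first by LastModified, keep at most n keys.

    `contents` is a list of dicts shaped like the Contents entries
    returned by s3api list-objects-v2.
    """
    valid = [
        obj for obj in contents
        if not obj["Key"].endswith("/") and "owner.txt" not in obj["Key"]
    ]
    # ISO-8601 timestamps sort correctly as strings; reverse for newest-first.
    valid.sort(key=lambda obj: obj["LastModified"], reverse=True)
    return [obj["Key"] for obj in valid[:n]]
```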
Contributor


If we happen to have more or fewer cohorts in the future, what would we modify here to support that?

I know some data types (like Google Fit samples?) only have adult data, so based on this logic would this pull the 28 days of adult data?

Contributor Author


If we happen to have more or fewer cohorts in the future, what would we modify here to support that?

Assuming that these new cohorts are also exported on a daily basis, we would still see equal representation of the cohorts in the 28 most recent exports. 28 is more or less an arbitrary amount to submit for integration tests, so having fewer pediatric_v1 exports, for example, is not a big deal. But we could increase this number proportionally to the number of new cohorts if we like.

I don't think the number of cohorts would decrease, since we want to support the processing of data from any cohort in perpetuity.

I know some data types (like Google Fit samples?) only have adult data, so based on this logic would this pull the 28 days of adult data?

I'm not sure what you mean. This code pulls the 28 most recent ZIP exports across cohorts, so if there's a data type that's only present in one of the cohorts, then we would expect to see fewer JSON files of that data type.

@philerooski philerooski merged commit fc290e3 into main May 31, 2024
15 checks passed
@thomasyu888 thomasyu888 deleted the etl-647 branch October 30, 2024 01:02
3 participants