ingest: Remove use of ncov-ingest geolocation rules #79

Merged
joverlee521 merged 1 commit into main from remove-ncov-ingest-geolocation-rules on Mar 4, 2025

joverlee521
Contributor

Description of proposed changes

Remove the use of the ncov-ingest geolocation rules since Augur now uses the built-in geolocation rules by default.

Depends on the release of nextstrain/augur#1745.

Related issue(s)

Part of nextstrain/public#17

Checklist

  • Checks pass

@joverlee521
Contributor Author

Tested locally with my local version of Augur that includes the changes from nextstrain/augur#1745

nextstrain build --augur ../augur ingest 

Diffed the output against the production metadata. The geolocation fields changed as expected:

| region | country | count |
| --- | --- | --- |
| North America | USA->Puerto Rico | 107 |
| Europe->Oceania | France->New Caledonia | 8 |
| Europe->South America | France->French Guiana | 8 |

There were fewer changes than I had expected. I realized this is because many annotations already update the geolocation fields.
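For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of how the change counts could be tallied from two metadata TSVs. It is not the exact command I ran. The `accession` ID column and the toy rows are assumptions for illustration; swap in whatever ID column the metadata actually uses.

```python
import csv
import io
from collections import Counter


def geolocation_changes(old_tsv, new_tsv, id_column="accession",
                        fields=("region", "country")):
    """Count (old values -> new values) changes for records present in both files.

    `id_column` is an assumed name; adjust it to the real metadata ID column.
    """
    old = {row[id_column]: row for row in csv.DictReader(old_tsv, delimiter="\t")}
    changes = Counter()
    for row in csv.DictReader(new_tsv, delimiter="\t"):
        before = old.get(row[id_column])
        if before is None:
            continue  # record exists only in the new file, so there is nothing to compare
        old_values = tuple(before[field] for field in fields)
        new_values = tuple(row[field] for field in fields)
        if old_values != new_values:
            changes[(old_values, new_values)] += 1
    return changes


# Toy data (hypothetical accessions), mirroring the kinds of changes in the table.
old_metadata = io.StringIO(
    "accession\tregion\tcountry\n"
    "A1\tNorth America\tUSA\n"
    "A2\tEurope\tFrance\n"
    "A3\tAsia\tJapan\n"
)
new_metadata = io.StringIO(
    "accession\tregion\tcountry\n"
    "A1\tNorth America\tPuerto Rico\n"
    "A2\tOceania\tNew Caledonia\n"
    "A3\tAsia\tJapan\n"
)

for (before, after), count in geolocation_changes(old_metadata, new_metadata).items():
    print(before, "->", after, count)
```

With real files, pass open file handles instead of the `io.StringIO` objects.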


Ran the results through the phylogenetic workflow with the Augur changes

mv ingest/results/* phylogenetic/data/
nextstrain build --augur ../augur phylogenetic

The build has been uploaded to staging. Since subsampling includes grouping by country, the build now contains more sequences, because samples from Puerto Rico are grouped separately from samples from the USA.
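To illustrate why splitting a country increases the sample size, here is a simplified sketch of grouped subsampling with a fixed cap per group. This is not Augur's actual implementation, and the cap of 3 and the toy records are made up. The workflow's real subsampling settings may differ (for example, additional grouping columns).

```python
import random
from collections import defaultdict


def subsample_by_group(records, group_by, max_per_group, seed=0):
    """Keep at most `max_per_group` randomly chosen records from each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in group_by)
        groups[key].append(record)
    sampled = []
    for key in sorted(groups):
        members = groups[key]
        sampled.extend(rng.sample(members, min(max_per_group, len(members))))
    return sampled


# Hypothetical records: 10 samples all labeled USA before the change.
before = [{"strain": f"s{i}", "country": "USA"} for i in range(10)]
# After the change, 4 of them are relabeled as Puerto Rico.
after = [dict(r, country="Puerto Rico") if i < 4 else r for i, r in enumerate(before)]

print(len(subsample_by_group(before, ["country"], max_per_group=3)))
print(len(subsample_by_group(after, ["country"], max_per_group=3)))
```

One group capped at 3 becomes two groups each capped at 3, so the relabeled dataset yields more sequences even though the input is the same size.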

@joverlee521 joverlee521 marked this pull request as ready for review February 26, 2025 22:26
@joverlee521
Contributor Author

Will merge this tomorrow if there are no comments.

Member

@jameshadfield jameshadfield left a comment

This is a really nice improvement to our analysis.

P.S. There are no New Caledonia sequences in the staging dataset. Since there are n=8 originally it's possible they were all excluded by the subsampling (especially if the 8 samples were well spread temporally), and I presume this is what happened, but perhaps double check?

@joverlee521
Contributor Author

> P.S. There are no New Caledonia sequences in the staging dataset. Since there are n=8 originally it's possible they were all excluded by the subsampling (especially if the 8 samples were well spread temporally), and I presume this is what happened, but perhaps double check?

Thanks for flagging! It looks like all of the New Caledonia sequences are filtered out because they are shorter than the min_length threshold.
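A check along these lines can confirm it. This is a minimal sketch, assuming the metadata has `country` and `length` columns. The `min_length` value and the records below are placeholders, not the workflow's real configuration or data.

```python
from collections import Counter


def count_short_by_country(records, min_length):
    """Return {country: (number shorter than min_length, total)}."""
    total = Counter()
    short = Counter()
    for record in records:
        total[record["country"]] += 1
        if int(record["length"]) < min_length:
            short[record["country"]] += 1
    return {country: (short[country], total[country]) for country in total}


# Placeholder records and threshold, for illustration only.
records = [
    {"strain": "x1", "country": "New Caledonia", "length": "450"},
    {"strain": "x2", "country": "New Caledonia", "length": "500"},
    {"strain": "x3", "country": "French Guiana", "length": "10500"},
]
for country, (n_short, n_total) in count_short_by_country(records, min_length=5000).items():
    print(f"{country}: {n_short} of {n_total} below min_length")
```

If every record for a country falls below the threshold, that country disappears from the build before subsampling even runs.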

@joverlee521 joverlee521 merged commit 1daa5e0 into main Mar 4, 2025
4 checks passed
@joverlee521 joverlee521 deleted the remove-ncov-ingest-geolocation-rules branch March 4, 2025 18:07
3 participants