Add a sequence_metadata attribute to CladeTime #27

Merged on Oct 4, 2024 (2 commits)

Collaborator

@bsweger bsweger commented Oct 3, 2024

Background

This is the first step towards saving daily sequence counts by location: reichlab/variant-nowcast-hub#50

We don't have to download sequence metadata files from S3 before working with them, so this PR adds an attribute to the CladeTime class that exposes a Polars LazyFrame pointing to a Nextstrain sequence metadata file.

Next Step

Once this new feature is merged, we can add code to variant-nowcast-hub to instantiate a CladeTime object and use the LazyFrame reference to create the location/data information outlined in the above issue.

Testing

To test this new feature as a code reviewer, you'll need to install virus_clade_utils from this feature branch:

pip install "git+https://github.com/reichlab/virus-clade-utils.git@bsweger/sequence-by-state-date/50"

Then from a Python session:

import polars as pl
from virus_clade_utils.cladetime import CladeTime

# Get a CladeTime object for the most recent Nextstrain sequence metadata
ct = CladeTime()

# ct.sequence_metadata is the new attribute (a LazyFrame)
filtered_metadata = (
    ct.sequence_metadata
    .select(["country", "division", "host", "date", "clade_nextstrain"])
    .filter(
        pl.col("country") == "USA"
    )
).collect()

The first iteration of CladeTime contained a url_sequence_metadata
attribute that points to the S3 link for Nextstrain's sequence
metadata file. This PR adds a sequence_metadata attribute that
supplies users with a Polars LazyFrame pointing to the S3 file.

Note: doing the sorting and filtering on the S3 LazyFrame (i.e.,
without downloading the file first) saves time for those
interested in only a subset of the metadata (e.g., US only,
Homo sapiens).
Collaborator

@rogersbw rogersbw left a comment

Suggest raising a warning or error in the @sequence_metadata.getter if no url_sequence_metadata is provided

If CladeTime doesn't have a value for url_sequence_metadata,
there's no point in proceeding.
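
A minimal sketch of the kind of guard discussed here, assuming the getter raises an error rather than emitting a warning. `MetadataSource` is a hypothetical stand-in for CladeTime, and the placeholder return value replaces the real LazyFrame; neither is taken from the library's code.

```python
class MetadataSource:
    """Hypothetical stand-in for CladeTime; not the library's actual code."""

    def __init__(self, url_sequence_metadata=None):
        self.url_sequence_metadata = url_sequence_metadata

    @property
    def sequence_metadata(self):
        # Fail early: without a URL there is nothing to point a LazyFrame at.
        if not self.url_sequence_metadata:
            raise ValueError(
                "url_sequence_metadata is not set; cannot build sequence_metadata"
            )
        # Placeholder: real code would return a Polars LazyFrame here.
        return f"LazyFrame for {self.url_sequence_metadata}"


try:
    MetadataSource().sequence_metadata
except ValueError as err:
    print(f"Error: {err}")
```

Raising in the getter means a caller finds out about the missing URL at the point of access, instead of later when a query against the LazyFrame fails.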
@bsweger bsweger requested a review from rogersbw October 4, 2024 18:31
@@ -59,6 +59,9 @@ filterwarnings = [
"ignore::DeprecationWarning",
'ignore:polars found a filename',
]
testpaths = [
Collaborator Author

Speed up pytest by telling it where the tests are

@bsweger bsweger merged commit a15a7f3 into main Oct 4, 2024
1 check passed
@bsweger bsweger deleted the bsweger/sequence-by-state-date/50 branch October 4, 2024 19:27