Add function to retrieve metadata from Nextstrain's ncov pipeline run at a specific point in time #26

Merged
bsweger merged 8 commits into main from bsweger/get_nextstrain_ncov_metadata
Oct 2, 2024

Conversation

bsweger

@bsweger bsweger commented Oct 1, 2024

closes #20

Background

This PR addresses the issue above and also introduces a new class to simplify the interface for working with Nextstrain files: CladeTime (because we can get sequences and clade assignments at any point in time, get it?)

The only thing CladeTime currently does is instantiate with:

  1. a specific as_of date for SARS-CoV-2 sequences and sequence metadata
  2. a specific as_of date for the reference tree used to assign sequences to clades

CladeTime also has an ncov_metadata attribute, because that's what we need right now to get the variant nowcast hub ready.
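
To illustrate the two-date idea above, here is a minimal sketch. It is not the CladeTime implementation: the class name `AsOfPair`, its field names, the default-to-now behavior, and the validation rules are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _utc_now() -> datetime:
    """Current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


@dataclass(frozen=True)
class AsOfPair:
    """Hypothetical stand-in for the two dates a CladeTime-like object holds."""

    # as_of date for sequences and sequence metadata (assumed to default to now)
    sequence_as_of: datetime = field(default_factory=_utc_now)
    # as_of date for the reference tree used for clade assignment
    tree_as_of: datetime = field(default_factory=_utc_now)

    def __post_init__(self) -> None:
        # Assumption: reject naive datetimes and dates in the future.
        for name in ("sequence_as_of", "tree_as_of"):
            value = getattr(self, name)
            if value.tzinfo is None:
                raise ValueError(f"{name} must be timezone-aware")
            if value > _utc_now():
                raise ValueError(f"{name} cannot be in the future")
```

Keeping the two dates as separate fields makes it possible to pair older sequence data with a newer reference tree, or the reverse, without changing the interface.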

Review Notes

The commits are reasonably organized, so starting at the first one and reviewing commit by commit is recommended.

Demo

There's more information in the README, but to try CladeTime as a code reviewer, you'll need to install the package from this branch:

pip install "git+https://github.com/reichlab/virus-clade-utils.git@bsweger/get_nextstrain_ncov_metadata"

Then, from a Python console or script (the code below is what we'll add to get_clades_to_model.py in the variant nowcast repo, so we can save the metadata along with the clade list):

from virus_clade_utils.cladetime import CladeTime
ct = CladeTime()
ct.ncov_metadata

The above should return a Python dictionary similar to:

{
    "schema_version": "v1",
    "nextclade_version": "nextclade 3.8.2",
    "nextclade_dataset_name": "SARS-CoV-2",
    "nextclade_dataset_version": "2024-09-25--21-50-30Z",
    "nextclade_tsv_sha256sum": "fbe579554e925e4dfaf74cfb4e72b52c702e671f0f0374d896f1e30ae4fe5566",
    "metadata_tsv_sha256sum": "5a4fd84a5cd3c4ead9cf730d4df10b8734898c6c3e0cae1c8c0acf432325d22c",
}
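
The description mentions saving this metadata along with the clade list. One way that could look is sketched below. The function name `clades_with_metadata_json`, the output layout, and the clade names are hypothetical and are not taken from get_clades_to_model.py.

```python
import json


def clades_with_metadata_json(clades: list[str], ncov_metadata: dict) -> str:
    """Bundle a clade list and the pipeline metadata into one JSON document.

    The layout (a "clades" key plus a "meta" key) is an assumption made
    for this sketch, not the variant nowcast hub's actual format.
    """
    payload = {
        "clades": sorted(clades),
        "meta": {"ncov": ncov_metadata},
    }
    return json.dumps(payload, indent=2)


# Example input: a subset of the dictionary shown above, plus placeholder clades.
example_metadata = {
    "schema_version": "v1",
    "nextclade_version": "nextclade 3.8.2",
    "nextclade_dataset_name": "SARS-CoV-2",
    "nextclade_dataset_version": "2024-09-25--21-50-30Z",
}
document = clades_with_metadata_json(["24B", "24A"], example_metadata)
```

Storing the metadata next to the clade list records which Nextclade version and dataset produced the clade assignments behind that list.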

session.headers.update(headers)

if retry:

Now that we're using package functions in an interactive context (in addition to using them in batch scripts), let's provide a way to make retries optional when using a requests session.
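
For readers who have not seen the pattern, optional retries on a requests session can be sketched roughly as follows. This is an illustration, not the code in this PR: the function name `get_session`, the placeholder header, and the retry settings are assumptions.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_session(retry: bool = True) -> requests.Session:
    """Return a requests session, mounting a retrying adapter only on request."""
    session = requests.Session()
    # Placeholder header; the real headers are not shown in this hunk.
    session.headers.update({"User-Agent": "example-client"})

    if retry:
        # Assumed settings: up to 5 attempts with exponential backoff
        # on common transient server errors.
        retries = Retry(
            total=5,
            backoff_factor=1,
            status_forcelist=[500, 502, 503, 504],
        )
        session.mount("https://", HTTPAdapter(max_retries=retries))

    return session
```

With `retry=False`, the session keeps the default adapter, which does not retry, so a failed request returns quickly in an interactive console instead of waiting through the backoff schedule.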


def test_cladetime_ncov_metadata():
    ct = CladeTime()
    ct.url_ncov_metadata = "https://httpstat.us/200"

return self._ncov_metadata

@ncov_metadata.getter
def ncov_metadata(self) -> dict:

Nextstrain only began publishing their pipeline metadata on 8/1/24, so we won't find it for earlier dates.
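
A guard for that cutoff could look like the sketch below. The names `METADATA_START` and `ncov_metadata_expected` are hypothetical, and treating the cutoff as midnight UTC on 8/1/24 is an assumption. How the getter in this PR handles earlier dates is not shown here.

```python
from datetime import datetime, timezone

# Assumed cutoff: Nextstrain pipeline metadata exists from 2024-08-01 (UTC).
METADATA_START = datetime(2024, 8, 1, tzinfo=timezone.utc)


def ncov_metadata_expected(as_of: datetime) -> bool:
    """Return True if pipeline metadata should exist for the given as_of date."""
    if as_of.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        as_of = as_of.replace(tzinfo=timezone.utc)
    return as_of >= METADATA_START
```

Checking the date before making a request avoids a lookup that cannot succeed and lets the caller return an empty result for dates before the cutoff.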

@bsweger bsweger requested a review from rogersbw October 1, 2024 21:40
@bsweger bsweger merged commit 5ace911 into main Oct 2, 2024
1 check passed
@bsweger bsweger deleted the bsweger/get_nextstrain_ncov_metadata branch October 2, 2024 20:03
Development

Successfully merging this pull request may close these issues.

Add a function to get metadata for the latest Nextclade ncov-ingest pipeline
2 participants