Add ingest script to link GISAID records #211

Open
wants to merge 3 commits into
base: master

joverlee521
Contributor

Description of proposed changes

Links the metadata and sequence records downloaded from GISAID EpiFlu and outputs NDJSON that can be processed in an `augur curate` pipeline. Each output record represents a single GISAID record with its metadata and 10 segment sequences. From what I've seen, the "HE" and "P3" segments are empty, but they are kept for a complete representation of the downloaded data.

Related issue(s)

Related to nextstrain/fauna#162

Checklist

  • Checks pass

In my testing, an unexpected number of GISAID records could not be
linked to their segment sequences. It turns out the metadata can still
list a segment accession even though the sequence itself has been
removed from GISAID.

This commit changes the script to output warnings per record rather
than per segment sequence so that the logs are easier to read through.
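The per-record warning behavior described above could look roughly like the following. This is only a sketch, not the script's actual code: the `link_record` helper, the `id` key, the `*_sequence` output keys, and the shortened `SEGMENTS` list are all hypothetical and chosen for illustration.

```python
import sys

# Hypothetical, shortened segment list for illustration only.
SEGMENTS = ["HA", "NA", "HE", "P3"]


def link_record(record, sequences, warn=sys.stderr):
    """Attach segment sequences to one metadata record.

    `record` maps segment names to accessions (assumed layout) and
    `sequences` maps accessions to sequence strings. Segments that have an
    accession but no sequence are collected and reported in a single
    warning for the whole record.
    """
    linked = dict(record)
    missing = []
    for segment in SEGMENTS:
        accession = record.get(segment)
        sequence = sequences.get(accession) if accession else None
        if accession and sequence is None:
            # The metadata lists an accession, but the sequence is gone.
            missing.append(f"{segment}={accession}")
        linked[f"{segment}_sequence"] = sequence
    if missing:
        details = ", ".join(missing)
        print(f"WARNING: {record.get('id')}: no sequence for {details}", file=warn)
    return linked
```

Collecting the missing segments first and printing once keeps the warnings file at one line per affected record, which is the readability gain this commit is after.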

Fail loudly to guard against potential data issues:

1. If not all segment columns are present, then GISAID might have
changed the format of the Excel file and the expected `SEGMENTS` or
`SEGMENT_COLUMNS` need to be updated.

2. If all records are missing segment accessions, then GISAID might
have changed the format of the segment ids and the
`SEGMENT_ACCESSION_PATTERN` needs to be updated.

3. If all records are missing segment sequences, then the GISAID Excel
file and FASTA file might be from different downloads and therefore
do not contain matching records.
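
The three checks above might be sketched as follows. The constant names `SEGMENT_COLUMNS` and `SEGMENT_ACCESSION_PATTERN` come from this commit message, but their values here are placeholders, and the helper functions and count-based interface are assumptions rather than the script's real structure.

```python
import re

# Placeholder values; the real script's constants may differ.
SEGMENT_COLUMNS = ["HA Segment_Id", "NA Segment_Id"]
SEGMENT_ACCESSION_PATTERN = re.compile(r"EPI\d+")


def has_accession(cell):
    """Return True if a metadata cell contains something shaped like an accession."""
    return bool(SEGMENT_ACCESSION_PATTERN.search(cell or ""))


def check_segment_columns(columns):
    """Check 1: every expected segment column must be present."""
    missing = [col for col in SEGMENT_COLUMNS if col not in columns]
    if missing:
        raise ValueError(
            f"Missing segment columns {missing}; the Excel format may have changed."
        )


def check_link_counts(total, with_accessions, with_sequences):
    """Checks 2 and 3: fail if *all* records lack accessions or sequences."""
    if total == 0:
        return
    if with_accessions == 0:
        raise ValueError(
            "No record has a segment accession; "
            "SEGMENT_ACCESSION_PATTERN may need to be updated."
        )
    if with_sequences == 0:
        raise ValueError(
            "No record has a segment sequence; "
            "the Excel and FASTA files may be from different downloads."
        )
```

Raising only when every record is affected keeps the per-record warnings for the routine case of a few removed sequences, while still stopping the run when the inputs are systematically wrong.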
@joverlee521
Contributor Author

Running on a fresh download from GISAID.

  1. Filter down by submission date to match your download limit
  2. Download "Isolates as XLS (virus metadata only)" and save to data/gisaid_epiflu_isolates.xls
  3. Download "Sequences (DNA) as FASTA" and save to data/gisaid_epiflu_sequences.fasta. Make sure to only include the DNA Accession no. field in the FASTA header.
  4. Start a Nextstrain shell: nextstrain shell .
  5. Run the script within the Nextstrain shell:
python3 ./ingest/scripts/link-gisaid-metadata-and-fasta.py \
  --metadata data/gisaid_epiflu_isolates.xls \
  --sequences data/gisaid_epiflu_sequences.fasta \
  > data/linked-gisaid.ndjson \
  2> data/link-warnings.txt
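
To spot-check the output, something like the following can be used. It is a minimal sketch that relies only on the NDJSON convention of one JSON object per line; the `read_ndjson` helper is hypothetical and assumes nothing about the field names in the records.

```python
import json


def read_ndjson(lines):
    """Yield one parsed record per non-empty line of NDJSON input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_records(path):
    """Count the records in an NDJSON file such as data/linked-gisaid.ndjson."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in read_ndjson(handle))
```

Comparing `count_records("data/linked-gisaid.ndjson")` against the number of rows in the downloaded Excel file is a quick sanity check that no records were dropped.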
