If you have questions about the methods used for dataset curation, or want clarification, please submit a pull request and we will add to this document over time. Full details on methods and QC metrics will be provided in a forthcoming publication, but we felt the dataset was important enough to release before the paper was completely written.
First, we ran FastQC to evaluate basic read quality. Next, we used Samtools, with Wuhan-1 (NCBI Reference Sequence: NC_045512.2) as the reference, to calculate the mean and standard deviation of the depth per nucleotide. Additionally, we provided a count of the number of nucleotide positions that fall below our depth cutoff (<10 for Illumina and <20 for Nanopore).
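The depth summary described above can be sketched in a few lines of Python. This is a minimal illustration, not the script we ran: it assumes per-position depth in the three-column, tab-separated layout that `samtools depth -a` writes (reference name, position, depth), and the function name `summarize_depth`, the example records, and the choice of population standard deviation are all assumptions made for the example.

```python
import statistics


def summarize_depth(depth_lines, cutoff):
    """Summarize per-position depth from `samtools depth -a`-style lines.

    depth_lines: iterable of tab-separated "reference<TAB>position<TAB>depth" strings.
    cutoff: positions with depth strictly below this value are counted.
    """
    depths = []
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _ref, _pos, depth = line.split("\t")[:3]
        depths.append(int(depth))
    if not depths:
        raise ValueError("no depth records found")
    return {
        "mean_depth": statistics.fmean(depths),
        # Assumption: population SD; use statistics.stdev for the sample SD.
        "sd_depth": statistics.pstdev(depths),
        "n_below_cutoff": sum(d < cutoff for d in depths),
    }


# Toy records for illustration only (not real data from this dataset).
example = [
    "NC_045512.2\t1\t5",
    "NC_045512.2\t2\t15",
    "NC_045512.2\t3\t25",
    "NC_045512.2\t4\t35",
]
summary = summarize_depth(example, cutoff=10)  # 10 is the Illumina cutoff
print(summary)
```

For Nanopore data the same function would be called with `cutoff=20`. Running `samtools depth` with `-a` matters here, because without it zero-depth positions are omitted and would not be counted as below the cutoff.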
Further QC on all the datasets was done as part of Titan v1.4.4, a pipeline that has its origins in the public health (StaPH-B) community and is now under active development by Theiagen. The pipeline is containerized and is available on Bioconda. There is one pipeline for Illumina data and one for ONT data, so we used whichever matched the data type. Details about the pipeline and its outputs are found on their Read the Docs.
NOTE: The default for Titan is to use UShER for lineage calls, and we kept this default parameter. The container for Pangolin v3.1.3 (PangoLEARN 2021-06-15) includes the following dependencies:
- UShER v0.3.1
- Scorpio v0.3.1
- Constellations v0.0.5
- PangoLEARN 2021-06-15
- Pangolin v3.1.3
- pango-designations v1.2.13 used for pangoLEARN and UShER training
As stated in our presentation for SPHERES, important lineages in this paper are defined as CDC-defined variant of concern (VOC) or variant of interest (VOI) lineages as of May 30, 2021. We selected for lineage-determining spike mutations while minimizing the number of SNP differences (as determined by Snippy) to the rest of the lineage. The non-VOC/VOI sequences came from RefSeq, which is why we feel they are quite valuable.
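One way to read the selection rule above is as a filtered "medoid" choice: keep only the candidates that carry the lineage-determining spike mutations, then pick the one with the smallest total SNP distance to the other members of the lineage. The sketch below illustrates that reading only; it is not the procedure we ran. The function name `pick_representative`, the sample names, the mutation sets, and the distances are all hypothetical, and it assumes pairwise SNP counts have already been derived (for example from Snippy output).

```python
def pick_representative(candidates, snp_distance, required_mutations):
    """Pick the candidate closest (in total SNPs) to the rest of its lineage.

    candidates: dict mapping sample name -> set of spike mutations it carries.
    snp_distance: dict mapping frozenset({a, b}) -> pairwise SNP count.
    required_mutations: set of lineage-determining spike mutations.
    """
    # Keep only samples that carry every required spike mutation.
    eligible = [
        name for name, muts in candidates.items() if required_mutations <= muts
    ]
    if not eligible:
        raise ValueError("no candidate carries all required mutations")

    def total_distance(name):
        # Sum of SNP differences to every other member of the lineage.
        return sum(
            snp_distance[frozenset((name, other))]
            for other in candidates
            if other != name
        )

    return min(eligible, key=total_distance)


# Hypothetical toy lineage; names, mutations, and distances are made up.
candidates = {
    "sampleA": {"N501Y", "P681H"},
    "sampleB": {"N501Y"},
    "sampleC": {"N501Y", "P681H"},
}
snp_distance = {
    frozenset(("sampleA", "sampleB")): 2,
    frozenset(("sampleA", "sampleC")): 5,
    frozenset(("sampleB", "sampleC")): 4,
}
chosen = pick_representative(candidates, snp_distance, {"N501Y", "P681H"})
print(chosen)
```

In this toy case `sampleB` has the smallest total distance but lacks one required mutation, so the filter removes it before the distance comparison. That ordering (mutations first, distance second) is the assumption this sketch makes about the rule.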