EPITOME condenses a diverse set of DNA sequences (species-level or closer) into discrete, composite sequences that represent the overarching diversity of the dataset. In other words, EPITOME creates sequences that are the epitome of the dataset's diversity. This is accomplished by clustering the input based on pairwise genetic distances and then selecting the most common nucleotide at each genomic position within each cluster (ties are broken at random). When the genetic distance is based on read mapping efficiency, EPITOME creates a set of reference genomes for consensus-based assembly pipelines, like VAPER or viralrecon.
See the wiki for more information.
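The consensus step described above can be illustrated with a minimal sketch. This is not EPITOME's actual implementation; it assumes the sequences in a cluster have already been aligned to the same length, and the function name `consensus` is hypothetical.

```python
import random
from collections import Counter


def consensus(aligned, rng=None):
    """Return the most common base at each column of equal-length sequences.

    Illustrative only: assumes the input is one cluster of pre-aligned
    sequences. Ties are broken at random, as described in the text.
    """
    rng = rng or random.Random()
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be non-empty and aligned to one length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        top = max(counts.values())
        # Sort the tied bases so a seeded rng gives reproducible output.
        winners = sorted(base for base, n in counts.items() if n == top)
        result.append(rng.choice(winners))
    return "".join(result)


cluster = ["ACGTAC", "ACGTTC", "ACGAAC"]
print(consensus(cluster))  # prints "ACGTAC" (no ties in this example)
```

Passing a seeded `random.Random` makes tie-breaking reproducible, which may be useful when comparing runs.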
Note: Nextflow requires absolute paths in samplesheets.
Create a samplesheet containing the taxa name, genome segment, path to a multi-fasta file of sequences for the taxa, and the expected sequence length (within 25%).
samplesheet.csv:
taxa,segment,assembly,length
Influenza_A,HA,flu-a_HA_NCBI_2024-4-1.fasta,1950
Influenza_A,NA,flu-a_NA_NCBI_2024-4-1.fasta,1400
Measles,wg,measles_NCBI_2024-4-1.fasta,16000
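Because Nextflow requires absolute paths, it may help to generate the samplesheet with a small script that resolves each FASTA path. The sketch below is one possible approach and is not part of EPITOME; the helper name `write_samplesheet` is hypothetical, and the rows reuse the example values above.

```python
import csv
from pathlib import Path


def write_samplesheet(rows, out_path):
    """Write rows of (taxa, segment, fasta_path, length) as a samplesheet.

    Each FASTA path is resolved to an absolute path before being written.
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["taxa", "segment", "assembly", "length"])
        for taxa, segment, fasta, length in rows:
            absolute = Path(fasta).expanduser().resolve()
            writer.writerow([taxa, segment, str(absolute), length])


rows = [
    ("Influenza_A", "HA", "flu-a_HA_NCBI_2024-4-1.fasta", 1950),
    ("Influenza_A", "NA", "flu-a_NA_NCBI_2024-4-1.fasta", 1400),
    ("Measles", "wg", "measles_NCBI_2024-4-1.fasta", 16000),
]
write_samplesheet(rows, "samplesheet.csv")
```

Relative paths are resolved against the directory the script is run from, so run it from the directory that holds the FASTA files.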
Run EPITOME using the command below.
Note: See the wiki for how to assign references with existing subtype classifications (e.g., H1-H9) using the --seeds parameter.
nextflow run DOH-JDJ0303/epitome \
-r main \
-profile singularity \
--input samplesheet.csv \
--outdir results