- 1.1. Software
- 1.2. Set up directory structure
- 2.1. Download Illumina data from NCBI SRA
- 2.2. Check read quality and trim adapters
- 2.3. Correct reads and assemble
- 2.4. Filter ascomata genomes
- 2.5. Rescue Gyromitra leucoxantha MICH25407
- 2.6. Further contaminant filtering with NCBI Foreign Contamination Screen
- 2.7. TO DO: Organize genomes
- 2.8. TO DO: Download outgroup genomes
- 2.9. Calculate genome coverage
- 2.10. Extract rDNA sequences
- 2.11. Run BUSCO
- 2.12. Make phylogenomic tree
- TO DO: Figures and statistical analysis
- 5.1. TO DO: Copy data
- 6.1. Transfer data from UM Advanced Genomics Core to Great Lakes computing cluster
    - 6.1.1. Globus
    - 6.1.2. Check the demultiplexing statistics (Illumina)
    - 6.1.3. Check file completeness
- 6.2. Upload data to NCBI
    - 6.2.1. BioProject
    - 6.2.2. BioSample
    - 6.2.3. SRA
    - 6.2.4. Genomes
The following software needs to be installed and findable in `$PATH`.
Published software used in this pipeline

- antiSMASH 7.1.0 and dependencies (including meme 4.11.2 for RiPP detection), contained in a Conda environment called `antismash`
- ASTRAL 5.15.5
- BBTools 38.96
- bedtools 2.30.0
- BUSCO 5.5.0
- Clinker 0.0.28
- CUDA 11.5.1 (for computing on GPUs with SCGid)
- FastQC 0.11.9
- FastTree 2.1.10
- Funannotate 1.8.13 and dependencies, contained in a Conda environment called `funannotate`
- IQTREE2 2.2.6 (built with LSD2 for time divergence estimation)
- ITSx 1.1.3
- MUSCLE 3.8.31
- NCBI BLAST+ 2.12.0
- Open MPI 4.1.4 (for parallel processing with SCGid)
- Python 3.10.4
- QUAST 5.2.0
- R 4.2.0
- SCGid 0.9b0
- Seqtk 1.3
- SPAdes 3.15.5
- SRA Toolkit 2.10.9
- trimAl 1.4.rev22
- Trimmomatic 0.36
Other software used in this pipeline
- BUSCO phylogenomics pipeline, BUSCO V3 (Old) release; make sure `BUSCO_phylogenomics.py` is findable in `$PATH`
- Extract-ITS-sequences-from-a-fungal-genome, patch-1 branch, forked from pwkooij/Extract-ITS-sequences-from-a-fungal-genome, forked from fantin-mesny/Extract-ITS-sequences-from-a-fungal-genome; make sure `extractITS.py` is findable in `$PATH`
The directory containing this README.md file will be the project directory. Change to the parent directory and run the code below to clone the GitHub repository `discinaceae_phylogenomics`.
git clone https://github.com/aldendirks/discinaceae_phylogenomics.git
export PROJECT_DIR="$PWD/discinaceae_phylogenomics"
cd $PROJECT_DIR
mkdir -p antismash busco coverage blastdb data fastqc funannotate genomes itsx mat phylogenomics quast rdna scgid seqs spades summary
Make sure the list of samples is Unix compatible and ends with a newline.
export SAMPLE_LIST="$PROJECT_DIR/misc/samples.tsv"
dos2unix $SAMPLE_LIST
# Append a trailing newline if the file lacks one, so `while read` loops do not skip the last sample
function file_ends_with_newline() {
[[ $(tail -c1 "$1" | wc -l) -gt 0 ]]
}
if ! file_ends_with_newline $SAMPLE_LIST
then
echo "" >> $SAMPLE_LIST
fi
Skip to step 2.7 if you'd like to go right to downloading the assembled genomes from NCBI and preparing them for phylogenomics analysis.
Download raw reads for processing. Further information on downloading data from NCBI SRA using SRA Toolkit can be found here.
bash $PROJECT_DIR/scripts/00_download_sra_data.sh
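The per-sample download logic lives in the script above; as a rough sketch of the idea, assuming SRA Toolkit's `prefetch` and `fasterq-dump` with a placeholder accession variable `$SRR` (not the script's actual contents):

# Hypothetical per-accession download with SRA Toolkit; $SRR is a placeholder accession
prefetch "$SRR" -O $PROJECT_DIR/data
fasterq-dump "$SRR" --split-files -O $PROJECT_DIR/data

Next, check read quality and trim adapters: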
sbatch $PROJECT_DIR/scripts/01_check_read_quality_and_trim.sh
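For reference, the QC/trim script plausibly runs commands along these lines, assuming FastQC and Trimmomatic with illustrative adapter and quality settings ($R1/$R2, output names, and the adapter file are placeholders):

# Hypothetical QC and adapter trimming for one paired-end sample; settings are illustrative
fastqc "$R1" "$R2" -o $PROJECT_DIR/fastqc
trimmomatic PE "$R1" "$R2" \
    R1_paired.fastq.gz R1_unpaired.fastq.gz \
    R2_paired.fastq.gz R2_unpaired.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36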
First, run read error correction for ascomata samples. This can be a memory-intensive step and might require adjustment for different samples (memory requirement up to 200 GB). By doing this first, SPAdes assembly can proceed in a more predictable fashion.
while IFS=$'\t' read -r -a samples_array
do
if [ "${samples_array[7]}" = "ingroup" ] && [ "${samples_array[14]}" != "culture" ]; then
BATCH="${samples_array[1]}"
SAMPLE="${samples_array[2]}"
sbatch $PROJECT_DIR/scripts/02_correct_read_errors.sh $BATCH $SAMPLE
fi
done < $SAMPLE_LIST
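The error-correction script is not reproduced in this README; a minimal sketch of what it plausibly wraps, assuming SPAdes' standalone error-correction mode with placeholder read paths (the 200 GB figure matches the memory note above):

# Hypothetical read error correction without assembly, assuming SPAdes' --only-error-correction mode
spades.py --only-error-correction \
    -1 "$R1_trimmed" -2 "$R2_trimmed" \
    -m 200 -t 16 \
    -o $PROJECT_DIR/spades/Sample_${BATCH}-AD-${SAMPLE}-correction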
Then assemble genomes with SPAdes.
while IFS=$'\t' read -r -a samples_array
do
BATCH="${samples_array[1]}"
SAMPLE="${samples_array[2]}"
if [ "$BATCH" != "JGI" ] && [ "$BATCH" != "NCBI" ]; then
if [ "${samples_array[14]}" = "culture" ]; then
sbatch $PROJECT_DIR/scripts/03a_assemble_genome_isolate.sh $BATCH $SAMPLE
else
sbatch $PROJECT_DIR/scripts/03b_assemble_genome_meta.sh $BATCH $SAMPLE
fi
fi
done < $SAMPLE_LIST
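The two assembly scripts presumably differ mainly in SPAdes mode; a sketch under the assumption that cultures use isolate mode and ascomata use metagenome mode (read paths are placeholders):

# Hypothetical SPAdes calls; $R1_corrected/$R2_corrected are placeholder corrected-read paths
spades.py --isolate -1 "$R1_corrected" -2 "$R2_corrected" -o $PROJECT_DIR/spades/Sample_${BATCH}-AD-${SAMPLE}-isolate # pure cultures
spades.py --meta -1 "$R1_corrected" -2 "$R2_corrected" -o $PROJECT_DIR/spades/Sample_${BATCH}-AD-${SAMPLE}-meta # metagenome-like ascomata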
SCGid removes contaminant contigs by triangulating evidence from three different modules. Make sure you have activated the SCGid virtual environment before submitting the jobs with SLURM.
First, run the GCT module.
while IFS=$'\t' read -r -a samples_array
do
BATCH="${samples_array[1]}"
SAMPLE="${samples_array[2]}"
if [ "$BATCH" != "JGI" ] && [ "$BATCH" != "NCBI" ] && [ "${samples_array[14]}" != "culture" ]; then
sbatch $PROJECT_DIR/scripts/04a_run_scgid_gct.sh $BATCH $SAMPLE
fi
done < $SAMPLE_LIST
Then, the codons module.
while IFS=$'\t' read -r -a samples_array
do
BATCH="${samples_array[1]}"
SAMPLE="${samples_array[2]}"
if [ "$BATCH" != "JGI" ] && [ "$BATCH" != "NCBI" ] && [ "${samples_array[14]}" != "culture" ]; then
sbatch $PROJECT_DIR/scripts/04b_run_scgid_codons.sh $BATCH $SAMPLE
fi
done < $SAMPLE_LIST
Next, the first part of the kmers module.
while IFS=$'\t' read -r -a samples_array
do
BATCH="${samples_array[1]}"
SAMPLE="${samples_array[2]}"
if [ "$BATCH" != "JGI" ] && [ "$BATCH" != "NCBI" ] && [ "${samples_array[14]}" != "culture" ]; then
sbatch $PROJECT_DIR/scripts/04c_run_scgid_kmers.sh $BATCH $SAMPLE
fi
done < $SAMPLE_LIST
Manually select regions from the ESOM map. See the SCGid GitHub page for more information. Some additions to the notes there:

- A large `.wts` file may take a while to load.
- After loading the `.wts` file, you also need to load the `.bm` file.
- It is useful to select "Tiled display" in the ESOM window and change Zoom to 1x.
- Under the "Classes" tab, I like to change the color of Fungi to something that pops and is easy to trace, like neon yellow. Change the color of any class by selecting the color next to the text.
- Bold the Fungi dots so they are easier to see by clicking the box on the left.
- Right click to finalize your selection after you click around the fungi blobs.
- I'd recommend using an external mouse to make your selections. It is easy to make an accidental right click when you are clicking with a trackpad.
Finally, finish the kmers module and generate a consensus sequence. Note, the ESOM selected class ID is set to "6" within the script. While this should usually be the correct class, you might need to change it so it matches your selection.
while IFS=$'\t' read -r -a samples_array
do
BATCH="${samples_array[1]}"
SAMPLE="${samples_array[2]}"
if [ "$BATCH" != "JGI" ] && [ "$BATCH" != "NCBI" ] && [ "${samples_array[14]}" != "culture" ]; then
sbatch $PROJECT_DIR/scripts/04d_run_scgid_kmers_extract.sh $BATCH $SAMPLE
fi
done < $SAMPLE_LIST
This genome is the only one with significant fungal contamination, and SCGid does not filter out the contaminant fungal genome. The Gyromitra leucoxantha genome is extracted by BLASTing the raw reads against the assembled and filtered G. leucoxantha MICH352087 genome and using those hits in assembly.
sbatch $PROJECT_DIR/scripts/05_rescue_gyromitra_leucoxantha.sh
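A rough sketch of that read-baiting approach, assuming NCBI BLAST+, Seqtk, and SPAdes from the software list (file names are placeholders, not the script's actual paths):

# Hypothetical read baiting against the filtered G. leucoxantha MICH352087 assembly
makeblastdb -in MICH352087_filtered.fasta -dbtype nucl -out gleu_db
seqtk seq -a raw_reads.fastq > raw_reads.fasta # convert reads to FASTA for BLAST
blastn -query raw_reads.fasta -db gleu_db -outfmt "6 qseqid" -evalue 1e-20 | sort -u > hit_ids.txt
seqtk subseq raw_reads.fastq hit_ids.txt > baited_reads.fastq # keep only reads that hit the target genome
spades.py -s baited_reads.fastq -o rescue_assembly # reassemble from the baited reads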
Upon submission to NCBI, it was found that SCGid did not completely filter out contaminant contigs according to NCBI's Foreign Contamination Screen (FCS) tool, which is automatically run on submitted genomes. The automatically filtered genomes were downloaded and used in subsequent steps. The FCS tool suite can be installed locally but requires very high memory and large hard drive storage to run efficiently. It may be that filtering programs like SCGid can be skipped entirely and contaminant filtering done automatically by uploading genomes to NCBI. Genomes need to pass this screening step to be uploaded to NCBI anyway, so one might as well use their tool.
TO DO: Get GenBank accession numbers and make sure they can be downloaded
Skip these first two blocks of code if you would like to go right to downloading the assembled genomes.
Softlink the assembled genomes into the `genomes` directory with a new name.
export GENOMES_DIR="$PROJECT_DIR/genomes"
while IFS=$'\t' read -r -a samples_array
do
BATCH="${samples_array[1]}"
SAMPLE="${samples_array[2]}"
NAME="$(echo ${samples_array[15]} | tr " " "_" | sed 's/[_[:space:]]*$//')" # substitute spaces with underscores and remove any trailing spaces or underscores
if [ "$BATCH" != "JGI" ] && [ "$BATCH" != "NCBI" ]; then
if [ "${samples_array[14]}" = "culture" ]; then
FILE="$PROJECT_DIR/spades/Sample_${BATCH}-AD-${SAMPLE}-isolate/scaffolds_trimmed.fasta"
else
FILE="$PROJECT_DIR/scgid/Sample_${BATCH}-AD-${SAMPLE}-meta/meta_scgid_output/consensus/meta.consensus.filtered.assembly.fasta"
fi
ln -s $FILE $GENOMES_DIR/$NAME.fasta # link inside the if block so JGI/NCBI samples are not linked with a stale $FILE
fi
done < $SAMPLE_LIST
Download the JGI MycoCosm Gyromitra genomes. You will need to have a profile set up with JGI. Replace "[email_address]" and "[password]" (brackets inclusive) with your JGI login email and password, respectively.
# Log in to JGI
curl 'https://signon.jgi.doe.gov/signon/create' --data-urlencode 'login=[email_address]' --data-urlencode 'password=[password]' -c ~/cookies > /dev/null
# Download genomes
while IFS=$'\t' read -r -a samples_array
do
if [ "${samples_array[7]}" = "ingroup" ] && [ "${samples_array[1]}" = "JGI" ]; then
PORTAL="${samples_array[2]}"
NAME="$(echo ${samples_array[15]} | tr " " "_" | sed 's/[_[:space:]]*$//')"
curl "https://genome.jgi.doe.gov/portal/$PORTAL/download/${PORTAL}_AssemblyScaffolds.fasta.gz" -b ~/cookies | gunzip > $GENOMES_DIR/$NAME.fasta
fi
done < $SAMPLE_LIST
Alternatively, if you skipped genome assembly, you can download the assembled genomes from JGI and NCBI. See the instructions for the previous block of code on how to download genomes from JGI.
export GENOMES_DIR="$PROJECT_DIR/genomes"
while IFS=$'\t' read -r -a samples_array
do
NAME="$(echo ${samples_array[15]} | tr " " "_" | sed 's/[_[:space:]]*$//')"
if [ "${samples_array[7]}" = "ingroup" ]; then
if [ "${samples_array[1]}" = "JGI" ]; then
PORTAL="${samples_array[2]}"
curl "https://genome.jgi.doe.gov/portal/$PORTAL/download/${PORTAL}_AssemblyScaffolds.fasta.gz" -b ~/cookies | gunzip > $GENOMES_DIR/$NAME.fasta
else
ACCESSION="${samples_array[6]}"
curl "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/$ACCESSION/download?include_annotation_type=GENOME_FASTA" > $GENOMES_DIR/$NAME
unzip $GENOMES_DIR/$NAME -d $GENOMES_DIR # extract into the genomes directory, not the current working directory
mv $GENOMES_DIR/ncbi_dataset/data/$ACCESSION/*.fna $GENOMES_DIR/$NAME.fasta
rm -r $GENOMES_DIR/ncbi_dataset $GENOMES_DIR/$NAME $GENOMES_DIR/README.md
fi
fi
done < $SAMPLE_LIST
TO DO: Maybe change 7th column (WGS accession) to GenBank accessions when I have it and change 2 for samples_array NCBI to 6.
while IFS=$'\t' read -r -a samples_array
do
NAME="$(echo ${samples_array[15]} | tr " " "_" | sed 's/[_[:space:]]*$//')"
if [ "${samples_array[7]}" = "outgroup" ]; then
if [ "${samples_array[1]}" = "JGI" ]; then
PORTAL="${samples_array[2]}"
curl "https://genome.jgi.doe.gov/portal/$PORTAL/download/${PORTAL}_AssemblyScaffolds.fasta.gz" -b ~/cookies | gunzip > $GENOMES_DIR/$NAME.fasta
elif [ "${samples_array[1]}" = "NCBI" ]; then
ACCESSION="${samples_array[2]}"
curl "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/$ACCESSION/download?include_annotation_type=GENOME_FASTA" > $GENOMES_DIR/$NAME
unzip $GENOMES_DIR/$NAME -d $GENOMES_DIR # extract into the genomes directory, not the current working directory
mv $GENOMES_DIR/ncbi_dataset/data/$ACCESSION/*.fna $GENOMES_DIR/$NAME.fasta
rm -r $GENOMES_DIR/ncbi_dataset $GENOMES_DIR/$NAME $GENOMES_DIR/README.md
fi
fi
done < $SAMPLE_LIST
sbatch $PROJECT_DIR/scripts/09_calculate_coverage.sh
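For reference, such a coverage script could map the trimmed reads back to each assembly; one plausible approach, assuming BBTools' `bbmap.sh` (an assumption, with placeholder read paths, not necessarily the script's method):

# Hypothetical coverage calculation by mapping reads back to the assembly with BBTools
bbmap.sh ref="$GENOMES_DIR/$NAME.fasta" in1="$R1_trimmed" in2="$R2_trimmed" \
    covstats=$PROJECT_DIR/coverage/${NAME}_covstats.txt nodisk
# the covstats output reports average fold coverage per contig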
sbatch $PROJECT_DIR/scripts/10_extract_rdna.sh
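The rDNA step presumably relies on `extractITS.py` and/or ITSx from the software list; a sketch with ITSx, assuming fungal HMM profiles and placeholder paths:

# Hypothetical ITS extraction from one assembly with ITSx
ITSx -i "$GENOMES_DIR/$NAME.fasta" -o $PROJECT_DIR/itsx/$NAME -t F --cpu 4
# -t F restricts the search to fungal HMM profiles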
BUSCO stands for Benchmarking Universal Single-Copy Orthologs. BUSCOs are used in the phylogenomic analysis.
cd $PROJECT_DIR/busco
busco --download ascomycota_odb10
for SAMPLE in $PROJECT_DIR/genomes/*
do
sbatch $PROJECT_DIR/scripts/06_run_busco.sh $SAMPLE
done
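The per-genome call inside the script is presumably similar to the following (output naming is illustrative, not the script's actual arguments):

# Hypothetical BUSCO run for one genome against the ascomycota_odb10 lineage
busco -i "$SAMPLE" -l ascomycota_odb10 -m genome -o "$(basename $SAMPLE .fasta)" -c 8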
The BUSCO phylogenomics pipeline is used for the coalescence analysis and to prepare the dataset for IQTREE2 concatenation analysis. You will need to define the path to the ASTRAL library files and ASTRAL jar file, e.g., `/home/adirks/apps/ASTRAL/Astral/lib/` and `/home/adirks/apps/ASTRAL/Astral/astral.5.15.5.jar`.
export ASTRAL_LIB_PATH="/path/to/Astral/lib/"
export ASTRAL_JAR_PATH="/path/to/astral.5.15.5.jar"
# consider running the following code to find the paths if available, although find can take a while
# find / -name "astral.5.15.5.jar"
sbatch $PROJECT_DIR/scripts/07_phylogenomics.sh
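With those two variables set, the coalescent step inside the script presumably invokes ASTRAL roughly as follows (tree file names are placeholders):

# Hypothetical ASTRAL invocation on the set of per-locus BUSCO gene trees
java -Djava.library.path="$ASTRAL_LIB_PATH" -jar "$ASTRAL_JAR_PATH" \
    -i gene_trees.tre -o astral_species_tree.tre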
Genome annotation is completed with Funannotate, available through GitHub and documented here.
First, predict genes. The JGI Gyromitra esculenta expressed sequence tags (EST) file is used to help with gene prediction.
conda deactivate
conda activate funannotate
# Get EST file from JGI
curl "https://genome.jgi.doe.gov/portal/Gyresc1/download/Gyresc1_ESTs_20160326_est.fasta.gz" -b ~/cookies | gunzip > $PROJECT_DIR/funannotate/Gyresc1_ESTs_20160326_est.fasta
# Run Funannotate
while IFS=$'\t' read -r -a samples_array
do
SPECIES="${samples_array[9]}"
ACCESSION="${samples_array[11]}"
NAME="$(echo ${samples_array[15]} | tr " " "_" | sed 's/[_[:space:]]*$//')"
if [ "${samples_array[7]}" = "ingroup" ]; then
sbatch $PROJECT_DIR/scripts/08a_run_funannotate_predict.sh -s "'"$SPECIES"'" -i $ACCESSION -g $GENOMES_DIR/$NAME.fasta
fi
done < $SAMPLE_LIST
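The prediction script wraps `funannotate predict`; a minimal sketch of the underlying call, assuming the JGI EST file downloaded above is passed as transcript evidence (arguments are illustrative):

# Hypothetical funannotate predict call for one genome
funannotate predict -i "$GENOMES_DIR/$NAME.fasta" -o $PROJECT_DIR/funannotate/$NAME \
    -s "$SPECIES" --isolate "$ACCESSION" \
    --transcript_evidence $PROJECT_DIR/funannotate/Gyresc1_ESTs_20160326_est.fasta --cpus 16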
At this point, an NCBI BioProject number, individual BioSample numbers, and a submission template (`.sbt` file) were generated before continuing with annotation. The `template.sbt` file I used is available in the `misc` directory.
Identify secondary metabolism clusters with antiSMASH.
conda deactivate
conda activate antismash
while IFS=$'\t' read -r -a samples_array
do
SPECIES="${samples_array[9]}"
ACCESSION="${samples_array[11]}"
NAME="$(echo ${samples_array[15]} | tr " " "_" | sed 's/[_[:space:]]*$//')"
if [ "${samples_array[7]}" = "ingroup" ]; then
sbatch $PROJECT_DIR/scripts/08b_run_antismash7.sh -s "'"$SPECIES"'" -i $ACCESSION -g $GENOMES_DIR/$NAME.fasta
fi
done < $SAMPLE_LIST
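A sketch of the underlying antiSMASH 7 call, assuming it runs on the annotated GenBank file produced by Funannotate ($GBK_FILE is a placeholder):

# Hypothetical antiSMASH run on one annotated genome
antismash "$GBK_FILE" --taxon fungi --output-dir $PROJECT_DIR/antismash/$NAME --cpus 8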
Use InterProScan for functional analysis of proteins.
conda deactivate
conda activate funannotate
while IFS=$'\t' read -r -a samples_array
do
SPECIES="${samples_array[9]}"
ACCESSION="${samples_array[11]}"
NAME="$(echo ${samples_array[15]} | tr " " "_" | sed 's/[_[:space:]]*$//')"
if [ "${samples_array[7]}" = "ingroup" ]; then
sbatch $PROJECT_DIR/scripts/08c_run_interproscan.sh -s "'"$SPECIES"'" -i $ACCESSION -g $GENOMES_DIR/$NAME.fasta
fi
done < $SAMPLE_LIST
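The script likely runs the standalone `interproscan.sh` on each predicted proteome; a sketch with placeholder paths:

# Hypothetical InterProScan run on a predicted proteome ($PROTEINS is a placeholder FASTA path)
interproscan.sh -i "$PROTEINS" -d $PROJECT_DIR/funannotate/$NAME/iprscan -f XML --goterms

Finally, combine the prediction, InterProScan, and antiSMASH results with `funannotate annotate`: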
while IFS=$'\t' read -r -a samples_array
do
SPECIES="${samples_array[9]}"
ACCESSION="${samples_array[11]}"
LOCUS_TAG="${samples_array[3]}"
NAME="$(echo ${samples_array[15]} | tr " " "_" | sed 's/[_[:space:]]*$//')"
if [ "${samples_array[7]}" = "ingroup" ]; then
sbatch $PROJECT_DIR/scripts/08d_run_funannotate_annotate.sh -g $GENOMES_DIR/$NAME.fasta -l $LOCUS_TAG
fi
done < $SAMPLE_LIST
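Inside the annotation script, the call is presumably along these lines, assuming the standard `funannotate annotate` interface and the evidence files from the previous steps (paths are placeholders):

# Hypothetical funannotate annotate call combining all functional evidence
funannotate annotate -i $PROJECT_DIR/funannotate/$NAME \
    --sbt $PROJECT_DIR/misc/template.sbt \
    --iprscan "$IPRSCAN_XML" --antismash "$ANTISMASH_GBK" --cpus 8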
Get genome stats.
bash $PROJECT_DIR/scripts/11a_summarize_genome_stats.sh
Get antiSMASH secondary metabolite gene cluster counts.
bash $PROJECT_DIR/scripts/11b_summarize_antismash_counts.sh
Get CAZyme family counts.
bash $PROJECT_DIR/scripts/11c_summarize_cazyme_counts.sh
Link the proteomes, build a BLAST database, and BLAST for mating-type (MAT) genes; see the sketch after this block.

bash $PROJECT_DIR/scripts/12_ln_proteomes.sh
bash $PROJECT_DIR/scripts/13_make_blast_db.sh
bash $PROJECT_DIR/scripts/14_blast_mat_genes.sh
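A sketch of the core BLAST+ commands those scripts presumably run, assuming protein queries for known MAT genes (file names are placeholders):

# Hypothetical MAT gene search against a combined proteome BLAST database
makeblastdb -in $PROJECT_DIR/seqs/all_proteomes.fasta -dbtype prot -out $PROJECT_DIR/blastdb/proteomes
blastp -query mat_reference_proteins.fasta -db $PROJECT_DIR/blastdb/proteomes \
    -outfmt 6 -evalue 1e-5 -out $PROJECT_DIR/mat/mat_blast_hits.tsv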
Get data into one directory.
cp $PROJECT_DIR/summary/counts_antismash.txt $PROJECT_DIR/data
cp $PROJECT_DIR/summary/counts_cazy.txt $PROJECT_DIR/data
cp $PROJECT_DIR/summary/genome_stats.txt $PROJECT_DIR/data
Click on the Globus link in the email you receive from the Advanced Genomics Core data team with your data. This will take you to Globus File Manager. You should see two panels. The left panel should be the Core's Globus endpoint and should show your data.
On the right panel, type or select umich#greatlakes as the "Collection". This should then take you to your home directory in Great Lakes. Enter "/nfs/turbo/lsa-tyjames/mycology/next_gen_seq_data/" as the "Path". Create or navigate to a directory to store your raw data.
On the left panel, select all of your data. Click "Start" to begin the data transfer. You can close the window; the transfer happens in the background. You can check the status of the transfer under the "Activity" button on the far left menu in Globus. A dataset of around 50 GB should only take a few minutes to complete.
Please add a plain text README file in the directory that describes what your data are.
The file `DemuxStats.csv` is included with Illumina data and shows various information about your samples, including the number of reads. Multiply the number of reads by the read length to get the total number of base pairs per sample. If you have an estimate of the total genome size, you can get a sense of whether your coverage goals were met. For example, 20 million read pairs of 2 × 150 bp reads amount to 6 Gbp, or about 120× coverage of a 50 Mbp genome.
This website has information on checking file completeness using the md5 file: https://brcf.medicine.umich.edu/cores/advanced-genomics/data-delivery/
Under Frequently Asked Questions, "How do I know I have a complete copy of the data?", it says:
cd <service-request-name>
md5sum --quiet --check <service-request-name>.md5
The `--quiet` option means md5sum only prints output when there is a problem, so if everything is OK the command will run for a while and finish with no output.
Create a BioProject to receive an accession number that will be used to link all your data. The BioProject submission portal is here.
A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project.
Once you have a BioProject accession, proceed to creating BioSamples. For a fungal genome, use the "MIGS Eukaryotic" data template package, even if the genome was filtered from a mushroom and thus the raw data are more like a metagenome. Locus tag prefixes are generated for each BioSample. You should receive an email within a few days with those numbers, which then become available on the BioProject page. Funannotate uses locus tags as prefixes in the FASTA headers, which are required for annotated genome submission.
The BioSample database contains descriptions of biological source materials used in experimental assays.
Once you have BioSample accessions, proceed to the NCBI SRA submission portal to get SRA accession numbers and to upload the raw reads. Do not modify the name of the SRA template file that you download from NCBI.
Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys. SRA stores raw sequencing data and alignment information to enhance reproducibility and facilitate new discoveries through data analysis.
I recommend uploading the raw read data to NCBI from Great Lakes via FTP. Using this option, you will need to put all the files in a single directory, for example one named `ncbi_sra` in your scratch space. The file names need to match what you included in the SRA data file.
Follow these instructions to upload your data:
- Navigate to the local directory with your data
- Establish an FTP connection with `ftp -i`
- Type `open ftp-private.ncbi.nlm.nih.gov`
- Provide the username listed in the Submission Portal, likely `subftp`
- Enter the password listed in the Submission Portal (unique to you)
- Navigate to your account folder listed in the Submission Portal (unique to you)
- Make a new directory (e.g., `mkdir sra_submission`)
- Navigate to that directory (`cd sra_submission`)
- Transfer all your local files to this directory with `mput *`
- Go back to the Submission Portal page and select the upload folder. It takes at least 10 minutes for transferred files to appear in the preload option.
- Exit FTP with the command `bye`
Submit genomes here. You are going to upload a `.fasta` file for an unannotated genome or a `.sqn` file for an annotated genome (generated by Funannotate). Follow the instructions above for SRA files to upload the genome files. Make sure to put all your `.fasta` or `.sqn` files together in the same directory. I recommend only uploading the genomes (unannotated FASTA file) because NCBI will probably identify some issues with the annotation that might be difficult to resolve.