A Copy-number Variant and Structural Variant finder for short read sequencing that uses split and discordant reads as well read depth to identify potential breakpoints.
(version 2.0)
Important Step Zero for NYU HPC users
srun -c 1 -t 1:00:00 --mem 100000 --pty /bin/bash
# or use -run with the -sbatch demo/sbatch_template.txt argument to run with sbatch
source demo/module_load.sh
CVish can be run using a single Illumina paired-end sequencing library (.fastq) along with a reference genome (.fa), it will produce variant calls in both a GFF3 (.gff) and VCFv4.2 (.vcf) format by just using the command line as.
python cvish.py -run -fa <reference_genome.fa> -fastq_1 <n01_fastq.gz> -fastq_2 <n02.fastq.gz> -run_name <unique_sample_name>
# output is stored in results/<unique_sample_name>/output/<unique_sample_name>_SV_CNV.gff
CVish intends to be a requirement lite tool and as such is contained entirely in single python script. It requires the following programs to be installed in the environment:
- python (tested on version: 3.8.6) [package requirements: os, argparse, subprocess, pickle, pandas, numpy, json, re, datetime, scipy.stats, warnings]
- bwa (tested on version: 0.7.17) NB bwa is required and is not compatible with bwa-mem2
- samtools (tested on version: 1.14)
- bedtools (tested on version: 2.29.2)
- blast (tested on version: 2.11.0)
- samblaster (tested on version: 0.1.26)
- mafft (tested on version: 7.475)
- emboss (tested on version: 6.6.0)
git clone https://github.com/pspealman/CVish.git
cd CVish
wget https://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz -O S288C_Ensembl.fa.gz
gunzip S288C_Ensembl.fa
This generates the required split and discordant sam files from fastq files.
python cvish.py -run -fa S288C_Ensembl.fa -fastq_1 demo/n01_ancestor.fastq.gz -fastq_2 demo/n02_ancestor.fastq.gz -run_name ancestor_demo
head results/ancestor_demo/output/ancestor_demo_SV_CNV.gff
**To understand the .gff output check the "Step Four: Interpreting the output" section below.
In instances where an ancestor and evolved strain are sequenced the results of one run can be used to filter .
python cvish.py -run -fa S288C_Ensembl.fa -fastq_1 demo/n01_evolved.fastq.gz -fastq_2 demo/n02_evolved.fastq.gz -exclude results/ancestor_demo/output/ancestor_demo_SV_CNV.gff -run_name evolved_demo
head results/evolved_demo/output/evolved_demo_SV_CNV.gff
Before the analysis we need to download a reference genome file - this can be done using the wget command
wget https://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz -O S288C_Ensembl.fa
gunzip S288C_Ensembl.fa
head S288C_Ensembl.fa
Note that Saccharomyces cereivisiae's genome in Ensembl uses the 'Roman Numeral' (ie. 'I', 'II', 'XI') chromosome naming convention with the mitochondrial genome named 'Mito'. We can download the matching GFF from the same source as well.
wget https://ftp.ensembl.org/pub/release-112/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gff3.gz -O S288C_Ensembl.gff.gz
gunzip S288C_Ensembl.gff.gz
# Taking a subset (ie the right arm of chromosome XI) of features for demonstration purposes.
grep 'YKR' S288C_Ensembl.gff > S288C_Ensembl_XI_R.gff
head S288C_Ensembl_XI_R.gff
We can verify that the gff is also using the 'Roman Numeral' (ie. 'I', 'II', 'XI') chromosome naming convention. Note that, for demonstration purposes the gff is subset to only focus on the right arm of chromosome XI. This _XI_R.gff will be used later to calculate relative read depth for each gene in the gff, so it is truncated here to save time.
Since we want to exclude regions like the rDNA locus from analysis we want to make sure that we select the appropriate 'excluded regions' file.
head filter_files/S288C_Ensembl_exclude_regions.bed
We find the BED file uses the same naming convention, so we can proceed with Step Two
We're first going to analyze the sequencing data for the Ancestor while fitlering regions listed in the excluded_regions file using the -exclude argument.
python cvish.py -run -fa S288C_Ensembl.fa \
-fastq_1 demo/n01_ancestor.fastq.gz \
-fastq_2 demo/n02_ancestor.fastq.gz \
-exclude filter_files/S288C_Ensembl_exclude_regions.bed \
-run_name ancestor_tutorial
head results/ancestor_tutorial/output/ancestor_tutorial_SV_CNV.gff
A quick look at the ancestor_tutorial_SV_CNV.gff will show a combination of hits and excluded_regions.
Now we're going to analyze the sequencing data for the Evolved strain this time filtering regions in the ancestor_tutorial_SV_CNV.gff
python cvish.py -run \
-fa S288C_Ensembl.fa \
-fastq_1 demo/n01_evolved.fastq.gz \
-fastq_2 demo/n02_evolved.fastq.gz \
-gff S288C_Ensembl_XI_R.gff \
-exclude results/ancestor_tutorial/output//ancestor_tutorial_SV_CNV.gff \
-run_name evolved_tutorial
head results/evolved_tutorial/output/evolved_tutorial_SV_CNV.gff
Checking the evolved_tutorial_SV_CNV.gff we find that the excluded regions from both the original S288C_Ensembl_exclude_regions.bed and the ancestor_tutorial_SV_CNV.gff are now present.
head results/evolved_tutorial/output/evolved_tutorial_feature_copy_number.tsv
Because we included the -gff command CVish also generated a read depth estimation for each gene, the results of which are shown in the feature_copy_number.tsv file. note: because the demo files only contain a small number of reads on ChrXI, the read depth estimations are not meaningful, they are only included here as an example.
The SV_CNV.gff file follows a simple convention for highlighting breakpoints. Each breakpoint has at least two components: an anchor and a breezepoint. The anchor, by definition, contains a uniquely mapping sequence. Conversely, the breezepoint can map to numerous places with no restriction, this allows for potential breezepoints to map to repeat or low-complexity regions, such as gene homologs and transposon elements. When possible the assemdbled contig that generated the breakpoint is included.
[chromosome] [source] [unique_id] [start] [stop] [dot] [dot] [score] [details]
XI cvish 14_anchor_split 513833 513945 . . 32893 node_uid=14;filter=PASS;otherside=VI:55618-55685_breeze;no_ref_gff;contig=CAGATGATATTTCAACATCATGTTTTGCTTAGTAGACTCTTGCGGGCGTTCCATCCGTGTGAAATACATCATTTACACCTCGCTCTGGGTCAAGTAATCAAAAAATACCTCGTGTGGCTGCAAGAGATTGATCGGTATGCAACCTCAACAGtgttgaagctattggtaggtaacaagtgt;read_weight=1207.0;rel_chromosome_rd=1440.0;depth_median=12.0
VI cvish 14_breeze_split 55618 55685 . . 32893 node_uid=14;filter=PASS;otherside=XI:513833-513945;no_ref_gff;contig=CAGATGATATTTCAACATCATGTTTTGCTTAGTAGACTCTTGCGGGCGTTCCATCCGTGTGAAATACATCATTTACACCTCGCTCTGGGTCAAGTAATCAAAAAATACCTCGTGTGGCTGCAAGAGATTGATCGGTATGCAACCTCAACAGtgttgaagctattggtaggtaacaagtgt;read_weight=204.0;rel_chromosome_rd=202.0;depth_median=2.0
In this example the proposed splitread breakpoint "14" spans from chrXI:513833-513945 (anchor) to chrVI:55618-55685 (breezepoint). It has a score of 32893. The contig generated by the reads associated with this proposed breakpoint is CAGATGATATTTCAACATCATGTTTTGCTTAGTAGACTCTTGCGGGCGTTCCATCCGTGTGAAATACATCATTTACACCTCGCTCTGGGTCAAGTAATCAAAAAATACCTCGTGTGGCTGCAAGAGATTGATCGGTATGCAACCTCAACAGtgttgaagctattggtaggtaacaagtgt
. Note that this is in the ancestor strain, it is actually the signature of an ACT1 promoter placed before the mCitrine reporter (Lauer et al. 2018).
[chromosome] [source] [unique_id] [start] [stop] [dot] [dot] [score] [details]
#Mito cvish region_excluded 1 100000 . . 0 node_uid=Mito_0;filter=excluded_by_region
In addition to potential breakpoints, the gff file also includes regions excluded from analysis. In this case these are imported from the S288C_Ensembl_exclude_regions.bed file using the -exclude argument. These are commented out using '#' and will not be visualized if the gff is loaded in IGV (or equivalent genome browsers).
[chromosome] [source] [unique_id] [start] [stop] [dot] [dot] [score] [details]
XI cvish 3_peak 479685 479731 . . 24 node_uid=peak_3;filter=PASS;otherside=unknown;peak_detected_no_unique_sequence;contig=na;read_count=47.0;rel_chromosome_rd=213.0;depth_median=1.0
Finally, we can analyze the output from the evolved strain. In addition to filtered regions imported from ancestor_tutorial_SV_CNV.gff using the -exclude command, we also find novel potential hits, such as '3_peak' shown above. Unlike split read breakpoints based on unique_sequences "peaks" are substantial pileups of splitreads (by default, approximately >1% of splitreads). Because peaks are not derived from unique sequences, no contig is reported for their identification.
Many parameters can and should be modified for optimal performance, these will be enumerated in full below in the Configuration section but several important components are discussed here:
You can always make a template configuration file by using the make_template_file
command:
python cvish.py -template example.tsv
CVish runs as haploid by defualt, this can be adjusted by setting the expected_ploidy
value to some non-zero integer.
Several parameters can have a profound effect on the analysis. min_confidence_score
sets the lower limit for potential breakpoints to be reported. Breakpoints scoring beneath these are output as FILTERED. To limit analysis to specifc chromosomes use the filter_chromosome
parameter followed by a space or comma seperated list of permitted chromosomes.
Finally, breakpoint predictions can be generated by both split-reads and discordant reads. Utilizing both is more informative, and increases specificity and accuracy, but can increase run times 10-20% above what is seen if split-reads alone are used. To force only split reads to be used set the with_disco
parameter to false.
Preloaded alternative defaults are available in the configuration file, high_sensitivity_mode
has lower thresholds for inclusion and analysis and should detect weaker signals, at the risk of increasing FDR. While low_sensitivity_mode
has even higher thresholds than the default and should exclude more, and have lower FDR.
The run command can be used incongunction with modules and SLURM's sbatch job scheduler using the module_filename
and sbatch_filename
parameters. To generate a shell executable, but not run it, use the -no_run
argument from the command line.
python cvish.py -run -sbatch demo/sbatch_template.txt -fa S288C_Ensembl.fa -fastq_1 demo/n01_ancestor.fastq.gz -fastq_2 demo/n02_ancestor.fastq.gz -run_name ancestor_sbatch