Skip to content

pspealman/CVish

Folders and files

NameName
Last commit message
Last commit date

Latest commit

f3978d5 · Aug 21, 2024
Aug 21, 2024
Aug 12, 2024
Nov 27, 2019
Nov 27, 2019
Aug 15, 2024
Aug 9, 2024

Repository files navigation

CVish

A Copy-number Variant and Structural Variant finder for short read sequencing that uses split and discordant reads as well read depth to identify potential breakpoints.

(version 2.0)

Important Step Zero for NYU HPC users

srun -c 1 -t 1:00:00 --mem 100000 --pty /bin/bash
# or use -run with the -sbatch demo/sbatch_template.txt argument to run with sbatch
source demo/module_load.sh

Quick Start:

CVish can be run using a single Illumina paired-end sequencing library (.fastq) along with a reference genome (.fa), it will produce variant calls in both a GFF3 (.gff) and VCFv4.2 (.vcf) format by just using the command line as.

python cvish.py -run -fa <reference_genome.fa> -fastq_1 <n01_fastq.gz> -fastq_2 <n02.fastq.gz> -run_name <unique_sample_name>
# output is stored in results/<unique_sample_name>/output/<unique_sample_name>_SV_CNV.gff

Example run:

Requirements

CVish intends to be a requirement lite tool and as such is contained entirely in single python script. It requires the following programs to be installed in the environment:

  • python (tested on version: 3.8.6) [package requirements: os, argparse, subprocess, pickle, pandas, numpy, json, re, datetime, scipy.stats, warnings]
  • bwa (tested on version: 0.7.17) NB bwa is required and is not compatible with bwa-mem2
  • samtools (tested on version: 1.14)
  • bedtools (tested on version: 2.29.2)
  • blast (tested on version: 2.11.0)
  • samblaster (tested on version: 0.1.26)
  • mafft (tested on version: 7.475)
  • emboss (tested on version: 6.6.0)

Install and run preparation:

git clone https://github.com/pspealman/CVish.git
cd CVish
wget https://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz -O S288C_Ensembl.fa.gz
gunzip S288C_Ensembl.fa

Run whole analysis with -run command

This generates the required split and discordant sam files from fastq files.

 python cvish.py -run -fa S288C_Ensembl.fa -fastq_1 demo/n01_ancestor.fastq.gz -fastq_2 demo/n02_ancestor.fastq.gz -run_name ancestor_demo
 head results/ancestor_demo/output/ancestor_demo_SV_CNV.gff

**To understand the .gff output check the "Step Four: Interpreting the output" section below.

Use first results as filter for second sample [optional]

In instances where an ancestor and evolved strain are sequenced the results of one run can be used to filter .

 python cvish.py -run -fa S288C_Ensembl.fa -fastq_1 demo/n01_evolved.fastq.gz -fastq_2 demo/n02_evolved.fastq.gz -exclude results/ancestor_demo/output/ancestor_demo_SV_CNV.gff -run_name evolved_demo
 head results/evolved_demo/output/evolved_demo_SV_CNV.gff

Basic Analysis Tutorial:

Step One: Preparation

Before the analysis we need to download a reference genome file - this can be done using the wget command

  wget https://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz -O S288C_Ensembl.fa
  gunzip S288C_Ensembl.fa
  head S288C_Ensembl.fa

Note that Saccharomyces cereivisiae's genome in Ensembl uses the 'Roman Numeral' (ie. 'I', 'II', 'XI') chromosome naming convention with the mitochondrial genome named 'Mito'. We can download the matching GFF from the same source as well.

  wget https://ftp.ensembl.org/pub/release-112/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gff3.gz -O S288C_Ensembl.gff.gz
  gunzip S288C_Ensembl.gff.gz
  # Taking a subset (ie the right arm of chromosome XI) of features for demonstration purposes.
  grep 'YKR' S288C_Ensembl.gff > S288C_Ensembl_XI_R.gff 
  head S288C_Ensembl_XI_R.gff

We can verify that the gff is also using the 'Roman Numeral' (ie. 'I', 'II', 'XI') chromosome naming convention. Note that, for demonstration purposes the gff is subset to only focus on the right arm of chromosome XI. This _XI_R.gff will be used later to calculate relative read depth for each gene in the gff, so it is truncated here to save time.

Since we want to exclude regions like the rDNA locus from analysis we want to make sure that we select the appropriate 'excluded regions' file.

  head filter_files/S288C_Ensembl_exclude_regions.bed

We find the BED file uses the same naming convention, so we can proceed with Step Two

Step Two: Run Command on Ancestor Strain

We're first going to analyze the sequencing data for the Ancestor while fitlering regions listed in the excluded_regions file using the -exclude argument.

  python cvish.py -run -fa S288C_Ensembl.fa \
   -fastq_1 demo/n01_ancestor.fastq.gz \
   -fastq_2 demo/n02_ancestor.fastq.gz \
   -exclude filter_files/S288C_Ensembl_exclude_regions.bed \
   -run_name ancestor_tutorial

  head results/ancestor_tutorial/output/ancestor_tutorial_SV_CNV.gff

A quick look at the ancestor_tutorial_SV_CNV.gff will show a combination of hits and excluded_regions.

Step Three: Run Command on Evolved Strain

Now we're going to analyze the sequencing data for the Evolved strain this time filtering regions in the ancestor_tutorial_SV_CNV.gff

  python cvish.py -run \
   -fa S288C_Ensembl.fa \
   -fastq_1 demo/n01_evolved.fastq.gz \
   -fastq_2 demo/n02_evolved.fastq.gz \
   -gff S288C_Ensembl_XI_R.gff \
   -exclude results/ancestor_tutorial/output//ancestor_tutorial_SV_CNV.gff \
   -run_name evolved_tutorial

  head results/evolved_tutorial/output/evolved_tutorial_SV_CNV.gff

Checking the evolved_tutorial_SV_CNV.gff we find that the excluded regions from both the original S288C_Ensembl_exclude_regions.bed and the ancestor_tutorial_SV_CNV.gff are now present.

 head results/evolved_tutorial/output/evolved_tutorial_feature_copy_number.tsv

Because we included the -gff command CVish also generated a read depth estimation for each gene, the results of which are shown in the feature_copy_number.tsv file. note: because the demo files only contain a small number of reads on ChrXI, the read depth estimations are not meaningful, they are only included here as an example.

Step Four: Interpreting the output

How to make sense of the results in the SV_CNV.gff file

The SV_CNV.gff file follows a simple convention for highlighting breakpoints. Each breakpoint has at least two components: an anchor and a breezepoint. The anchor, by definition, contains a uniquely mapping sequence. Conversely, the breezepoint can map to numerous places with no restriction, this allows for potential breezepoints to map to repeat or low-complexity regions, such as gene homologs and transposon elements. When possible the assemdbled contig that generated the breakpoint is included.

Example of a potential hit using split read unique_sequences in ancestor_tutorial_SV_CNV.gff
[chromosome]    [source]    [unique_id]   [start]   [stop]    [dot] [dot]   [score]   [details]
XI	cvish	14_anchor_split	513833	513945	.	.	32893	node_uid=14;filter=PASS;otherside=VI:55618-55685_breeze;no_ref_gff;contig=CAGATGATATTTCAACATCATGTTTTGCTTAGTAGACTCTTGCGGGCGTTCCATCCGTGTGAAATACATCATTTACACCTCGCTCTGGGTCAAGTAATCAAAAAATACCTCGTGTGGCTGCAAGAGATTGATCGGTATGCAACCTCAACAGtgttgaagctattggtaggtaacaagtgt;read_weight=1207.0;rel_chromosome_rd=1440.0;depth_median=12.0
VI	cvish	14_breeze_split	55618	55685	.	.	32893	node_uid=14;filter=PASS;otherside=XI:513833-513945;no_ref_gff;contig=CAGATGATATTTCAACATCATGTTTTGCTTAGTAGACTCTTGCGGGCGTTCCATCCGTGTGAAATACATCATTTACACCTCGCTCTGGGTCAAGTAATCAAAAAATACCTCGTGTGGCTGCAAGAGATTGATCGGTATGCAACCTCAACAGtgttgaagctattggtaggtaacaagtgt;read_weight=204.0;rel_chromosome_rd=202.0;depth_median=2.0

In this example the proposed splitread breakpoint "14" spans from chrXI:513833-513945 (anchor) to chrVI:55618-55685 (breezepoint). It has a score of 32893. The contig generated by the reads associated with this proposed breakpoint is CAGATGATATTTCAACATCATGTTTTGCTTAGTAGACTCTTGCGGGCGTTCCATCCGTGTGAAATACATCATTTACACCTCGCTCTGGGTCAAGTAATCAAAAAATACCTCGTGTGGCTGCAAGAGATTGATCGGTATGCAACCTCAACAGtgttgaagctattggtaggtaacaagtgt. Note that this is in the ancestor strain, it is actually the signature of an ACT1 promoter placed before the mCitrine reporter (Lauer et al. 2018).

Example of a filtered excluded_region in ancestor_tutorial_SV_CNV.gff
[chromosome]    [source]    [unique_id]   [start]   [stop]    [dot] [dot]   [score]   [details]
#Mito	cvish	region_excluded	1	100000	.	.	0	node_uid=Mito_0;filter=excluded_by_region

In addition to potential breakpoints, the gff file also includes regions excluded from analysis. In this case these are imported from the S288C_Ensembl_exclude_regions.bed file using the -exclude argument. These are commented out using '#' and will not be visualized if the gff is loaded in IGV (or equivalent genome browsers).

Example of a potential hit using peaks in evolved_tutorial_SV_CNV.gff
[chromosome]    [source]    [unique_id]   [start]   [stop]    [dot] [dot]   [score]   [details]
XI	cvish	3_peak	479685	479731	.	.	24	node_uid=peak_3;filter=PASS;otherside=unknown;peak_detected_no_unique_sequence;contig=na;read_count=47.0;rel_chromosome_rd=213.0;depth_median=1.0

Finally, we can analyze the output from the evolved strain. In addition to filtered regions imported from ancestor_tutorial_SV_CNV.gff using the -exclude command, we also find novel potential hits, such as '3_peak' shown above. Unlike split read breakpoints based on unique_sequences "peaks" are substantial pileups of splitreads (by default, approximately >1% of splitreads). Because peaks are not derived from unique sequences, no contig is reported for their identification.

Useful optional arguments

Many parameters can and should be modified for optimal performance, these will be enumerated in full below in the Configuration section but several important components are discussed here:

Make an example configuration file

You can always make a template configuration file by using the make_template_file command:

python cvish.py -template example.tsv

Ploidy

CVish runs as haploid by defualt, this can be adjusted by setting the expected_ploidy value to some non-zero integer.

Analysis parameters

Several parameters can have a profound effect on the analysis. min_confidence_score sets the lower limit for potential breakpoints to be reported. Breakpoints scoring beneath these are output as FILTERED. To limit analysis to specifc chromosomes use the filter_chromosome parameter followed by a space or comma seperated list of permitted chromosomes. Finally, breakpoint predictions can be generated by both split-reads and discordant reads. Utilizing both is more informative, and increases specificity and accuracy, but can increase run times 10-20% above what is seen if split-reads alone are used. To force only split reads to be used set the with_disco parameter to false. Preloaded alternative defaults are available in the configuration file, high_sensitivity_mode has lower thresholds for inclusion and analysis and should detect weaker signals, at the risk of increasing FDR. While low_sensitivity_mode has even higher thresholds than the default and should exclude more, and have lower FDR.

-sbatch to launch SLURM scheduled job

The run command can be used incongunction with modules and SLURM's sbatch job scheduler using the module_filename and sbatch_filename parameters. To generate a shell executable, but not run it, use the -no_run argument from the command line.

python cvish.py -run -sbatch demo/sbatch_template.txt -fa S288C_Ensembl.fa -fastq_1 demo/n01_ancestor.fastq.gz -fastq_2 demo/n02_ancestor.fastq.gz -run_name ancestor_sbatch

About

Structural variant breakpoint identifier.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages