Inverted Repeat Junction identifier for use with ONT (MinION) sequencing.
Mugio is intended to aid in the identification of potential inverted repeats in seqeuncing data from Oxford-Nanopore Technologies Nanopore seqeuncing platform (MinION, PromethION, GridION, etc.).
Important note for NYU HPC users: You can load all necessary modules via:
source demo/module_load.sh
Spealman P, Burrell J, Gresham D. Inverted duplicate DNA sequences increase translocation rates through sequencing nanopores resulting in reduced base calling accuracy. Nucleic Acids Res. 2020 May 21;48(9):4940-4945. doi: 10.1093/nar/gkaa206. PMID: 32255181; PMCID: PMC7229812.
Mugio reguires the user supplied data:
- Fastq generated by Albacore or Guppy
- Aligned bam file generated by Minimap2
Mugio reguires the following programs:
- gzip 1.5+
- python 2.7+ or 3.6+
- samtools 1.6+
- bedtools 2.26.0+
- (optional) The --blast command requires blast+ 2.9.0+
Mugio reguires the following python packages:
- numpy
- scipy.stats
- matplotlib
- pandas
Mugio is a stand alone python script as such it can be run locally by merely downloading the script. Installation through git clone is the preferred method. To download
git clone https://github.com/pspealman/mugio.git
To test installation
cd mugio
python mugio.py -demo
- Purpose: Identifies loci likely to be inverted repeat junctions associated with inverted duplications.
- Format:
python mugio.py -bprd -f <fastq_file> -s <sam_file> -bam <bam_file> -o <output_path_and/or_file_prefix>
- Demo:
python mugio.py -bprd -f demo/demo.fastq -s demo/demo.sam -bam demo/demo.bam -o demo_output/demo_bprd
- Results: Identified likely candidates are recorded in the out_put path as a bed file with the suffix '_bprd'. Therefore if '-o demo_output/demo_bprd' the results will be stored in 'demo_output/demo_bprd_bprd.bed'
- Purpose: Calculates the correlation (Spearman's rho) between pre-breakpoint seqeunce length and post-breakpoint low scoring region length.
- Format:
python mugio.py --evaluate [-bpf bprd.bed | -snf sniffles.vcf] -s <sam_file> -o <output_path_and/or_file_prefix>
- Demo:
python mugio.py --evaluate -bpf demo_output/demo_bprd_bprd.bed -f demo/demo.fastq -s demo/demo.sam -o demo/demo_bprd_lengths
- Results: Identified candidates that have closed low-phred regions with trace figures generated with low scoring regions identified. These will be saved in the out_path path as sub folders named after the inverted repeat junctions coordinates.
- Purpose: Generate a figure showing the Phred score of each nucleotide (blue) as well as the median phread score calculated over a 1K window (orange).
- Format (for a single read):
python mugio.py -pp -f <fastq_file> -uid <read uid> -o <output_path_and/or_file_name>
- Demo:
python mugio.py -pp -f demo/demo.fastq -uid c1d4f6dc-cc1f-4431-a6a0-1b9f5109342c -o c1d4f6dc.png
- Format (for all reads in a fastq):
python mugio.py -pp -f <fastq_file> -o <output_path_and/or_file_name>
- Demo:
python mugio.py -pp -f demo/demo.fastq -o demo_
- Results: Figure showing and overlay of the immediate phred score and a trace of the median phred score across a rolling window. Note: the because the window is 1000 nt (by defualt) the median trace will stop 100 nt before the end of the phred score trace.
- Purpose: Generate a table with read depth metircs such as median depth, standard deviation, genome wide relative median depth, etc.
- Format:
python mugio.py -cc [-filter <chromo_name>] -f <fastq_file> -s <sam_file> -o <output_path_and/or_file_prefix>
- Demo:
python mugio.py -cc -filter NC_001224.1 -f ont_DGY1657/BC01_1657.fastq -s ont_DGY1657/BC01_1657.sam -o ont_DGY1657/cc_BC01_1657
- Results: Tab-delimited file containing rows for each scaffold, chromosome, as well as total (assumbed to be the genome). Columns are various metrics of the aligned read-depth.
- Purpse: Ge
- Format:
python mugio.py --get -f <sample_fastq> -s <sample_aligned.sam> -pct <float> -o <output_prefix>
- Demo:
python mugio.py --get_discordant -f fastq/DGY1726.fastq -s bam/DNA_DGY1726.sam -pct 0 -o mugio/DGY1726_1
- Results: Any read that is over some percentage (-pct) of non-mapping nuceltotides will get extracted and resolved into a bam alignment file.