Inverted Repeat Junction identifier for use with ONT (MinION) sequencing.
Mugio is intended to aid in the identification of potential inverted repeats in sequencing data from Oxford Nanopore Technologies sequencing platforms (MinION, PromethION, GridION, etc.).
Important note for NYU HPC users: You can load all necessary modules via:
source demo/
Spealman P, Burrell J, Gresham D. Inverted duplicate DNA sequences increase translocation rates through sequencing nanopores resulting in reduced base calling accuracy. Nucleic Acids Res. 2020 May 21;48(9):4940-4945. doi: 10.1093/nar/gkaa206. PMID: 32255181; PMCID: PMC7229812.
Mugio requires the following user-supplied data:
- Fastq generated by Albacore or Guppy
- Aligned bam file generated by Minimap2
Mugio requires the following programs:
- gzip 1.5+
- python 2.7+ or 3.6+
- samtools 1.6+
- bedtools 2.26.0+
- (optional) The --blast command requires blast+ 2.9.0+
Mugio requires the following Python packages:
- numpy
- scipy.stats
- matplotlib
- pandas
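The requirements above can be checked before a first run. The following is a minimal sketch, not part of Mugio: it assumes Python 3, the function name `check_requirements` is hypothetical, and it only tests whether each package or program can be found, not whether the version is recent enough.

```python
import importlib.util
import shutil

# Names taken from the requirement lists above; scipy.stats ships with scipy.
PYTHON_PACKAGES = ["numpy", "scipy", "matplotlib", "pandas"]
PROGRAMS = ["gzip", "samtools", "bedtools"]


def check_requirements():
    """Return {name: True/False} for each package and program (hypothetical helper)."""
    status = {}
    for name in PYTHON_PACKAGES:
        # find_spec returns None when a top-level package is not installed
        status[name] = importlib.util.find_spec(name) is not None
    for name in PROGRAMS:
        # shutil.which returns None when the executable is not on PATH
        status[name] = shutil.which(name) is not None
    return status


for name, found in check_requirements().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Version numbers still need to be confirmed by hand, for example with `samtools --version`.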
Mugio is a standalone Python script, so it can be run locally by simply downloading the script. Installation through git clone is the preferred method. To download:
git clone
To test installation
cd mugio
python -demo
- Purpose: Identifies loci likely to be inverted repeat junctions associated with inverted duplications.
- Format:
python -bprd -f <fastq_file> -s <sam_file> -bam <bam_file> -o <output_path_and/or_file_prefix>
- Demo:
python -bprd -f demo/demo.fastq -s demo/demo.sam -bam demo/demo.bam -o demo_output/demo_bprd
- Results: Likely candidates are recorded in the output path as a bed file with the suffix '_bprd'. For example, with '-o demo_output/demo_bprd' the results are stored in 'demo_output/demo_bprd_bprd.bed'.
- Purpose: Calculates the correlation (Spearman's rho) between pre-breakpoint sequence length and post-breakpoint low-scoring region length.
- Format:
python --evaluate [-bpf bprd.bed | -snf sniffles.vcf] -s <sam_file> -o <output_path_and/or_file_prefix>
- Demo:
python --evaluate -bpf demo_output/demo_bprd_bprd.bed -f demo/demo.fastq -s demo/demo.sam -o demo/demo_bprd_lengths
- Results: Candidates that have closed low-Phred regions are reported, and trace figures are generated with the low-scoring regions marked. These are saved under the output path in subfolders named after each inverted repeat junction's coordinates.
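For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation of the ranks of two variables. The sketch below is illustrative only and is not Mugio's code: Mugio lists scipy.stats as a dependency, whereas this version uses only the standard library so it runs on its own. The function names and the length values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the two rank vectors (undefined if either is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical lengths in nt, one pair per read
pre_breakpoint_len = [1200, 3400, 560, 7800, 2500]
low_score_len = [900, 3100, 700, 6900, 2000]
rho = spearman_rho(pre_breakpoint_len, low_score_len)
print(rho)
```

A rho near 1 means that reads with longer pre-breakpoint sequence tend to have longer low-scoring regions after the breakpoint.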
- Purpose: Generate a figure showing the Phred score of each nucleotide (blue) as well as the median Phred score calculated over a 1K window (orange).
- Format (for a single read):
python -pp -f <fastq_file> -uid <read uid> -o <output_path_and/or_file_name>
- Demo:
python -pp -f demo/demo.fastq -uid c1d4f6dc-cc1f-4431-a6a0-1b9f5109342c -o c1d4f6dc.png
- Format (for all reads in a fastq):
python -pp -f <fastq_file> -o <output_path_and/or_file_name>
- Demo:
python -pp -f demo/demo.fastq -o demo_
- Results: Figure showing an overlay of the per-nucleotide Phred score and a trace of the median Phred score across a rolling window. Note: because the window is 1000 nt (by default), the median trace will stop 100 nt before the end of the Phred score trace.
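The two traces described above can be sketched in a few lines. This is a simplified illustration rather than Mugio's implementation: it assumes Phred+33 quality encoding, computes the median over full windows only, and leaves out the plotting. The function names and the quality string are hypothetical.

```python
from statistics import median


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def rolling_median(scores, window=1000):
    """Median of every full window; returns len(scores) - window + 1 values."""
    return [median(scores[i:i + window]) for i in range(len(scores) - window + 1)]


# Hypothetical quality string: high, then low, then high quality
quality = "IIIIIIII++++++++IIIIIIII"
scores = phred_scores(quality)
trace = rolling_median(scores, window=8)  # small window for the toy example
print(len(scores), len(trace))
```

Because only full windows are used, the median trace is shorter than the per-nucleotide trace, which is why a rolling-median line ends before the raw scores do.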
- Purpose: Generate a table with read depth metrics such as median depth, standard deviation, genome-wide relative median depth, etc.
- Format:
python -cc [-filter <chromo_name>] -f <fastq_file> -s <sam_file> -o <output_path_and/or_file_prefix>
- Demo:
python -cc -filter NC_001224.1 -f ont_DGY1657/BC01_1657.fastq -s ont_DGY1657/BC01_1657.sam -o ont_DGY1657/cc_BC01_1657
- Results: Tab-delimited file containing a row for each scaffold or chromosome, as well as a total (assumed to be the genome). Columns are various metrics of the aligned read depth.
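As an illustration of the kind of metrics listed above, here is a minimal sketch that derives them from per-position depth values. It is not Mugio's code: the column names are hypothetical, and "relative median depth" is assumed here to mean a chromosome's median depth divided by the genome-wide median depth.

```python
from statistics import median, pstdev


def depth_metrics(depth_by_chrom):
    """Per-chromosome and genome-wide depth summaries (hypothetical column names)."""
    # Pool every position to represent the whole genome
    genome = [d for depths in depth_by_chrom.values() for d in depths]
    genome_median = median(genome)
    rows = {}
    for name, depths in list(depth_by_chrom.items()) + [("total", genome)]:
        chrom_median = median(depths)
        rows[name] = {
            "median_depth": chrom_median,
            "std_depth": pstdev(depths),
            # Assumed definition: chromosome median relative to genome median
            "relative_median_depth": (
                chrom_median / genome_median if genome_median else float("nan")
            ),
        }
    return rows


# Hypothetical per-position depths for two chromosomes
depths = {"chrI": [10, 12, 11, 9, 10], "chrII": [20, 22, 19, 21, 20]}
rows = depth_metrics(depths)
for name, metrics in rows.items():
    print(name, metrics)
```

Under this assumed definition, a relative median depth well above 1 for a chromosome or scaffold would be consistent with extra copies of that sequence.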
- Purpose: Ge
- Format:
python --get -f <sample_fastq> -s <sample_aligned.sam> -pct <float> -o <output_prefix>
- Demo:
python --get_discordant -f fastq/DGY1726.fastq -s bam/DNA_DGY1726.sam -pct 0 -o mugio/DGY1726_1
- Results: Any read in which more than a given percentage (-pct) of nucleotides are non-mapping is extracted and resolved into a bam alignment file.
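One way to estimate the non-mapping portion of a read is from the CIGAR string of its SAM record. The sketch below rests on assumptions and is not Mugio's method: it treats soft-clipped bases (S) as the non-mapping nucleotides and takes the threshold as a fraction between 0 and 1. The function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operations that consume read bases in the SAM format


def unaligned_fraction(cigar):
    """Fraction of read bases that are soft-clipped (assumed to mean non-mapping)."""
    clipped = total = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in QUERY_OPS:
            total += int(length)
        if op == "S":
            clipped += int(length)
    return clipped / total if total else 0.0


def exceeds_threshold(cigar, threshold):
    """True when the soft-clipped fraction is above the threshold (a 0-1 fraction here)."""
    return unaligned_fraction(cigar) > threshold


# Hypothetical CIGAR strings
for cigar in ["100M", "600S400M", "50S40M10I"]:
    print(cigar, unaligned_fraction(cigar), exceeds_threshold(cigar, 0.25))
```

With a threshold of 0, any read with at least one soft-clipped base would be selected under these assumptions.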