-
Notifications
You must be signed in to change notification settings - Fork 4
User Manual
mrFAST is a read mapper that is designed to map short reads to reference genome with a special emphasis on the discovery of structural variation and segmental duplications. mrFAST maps short reads with respect to user defined error threshold, including indels up to 4+4 bp. This manual, describes how to choose the parameters and tune mrFAST with respect to the library settings. mrFAST is designed to find 'all' mappings for a given set of reads, however it can return one "best" map location if the relevant parameter is invoked.
NOTE: mrFAST is developed for Illumina, thus requires all reads to be at the same length. For paired-end reads, lengths of mates may be different from each other, but each "side" should have a uniform length.
If the lengths of the reads are not the same in a file, it may crash or output incorrectly formatted SAM files.
Personalized copy number and segmental duplication maps using next-generation sequencing. Can Alkan, Jeffrey M. Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci, Fereydoun Hormozdiari, Jacob O. Kitzman, Carl Baker, Maika Malig, Onur Mutlu, S. Cenk Sahinalp, Richard A. Gibbs, Evan E. Eichler. Nature Genetics, Oct, 41(10):1061-1067, 2009.
Accelerating read mapping with FastHASH. H. Xin, D. Lee, F. Hormozdiari, S. Yedkar, O. Mutlu, C. Alkan. BMC Genomics, 14(Suppl 1):S13, 2013.
Please pull the latest version from GitHub or download from Sourceforge page. Run 'make' to build mrFAST.
mrFAST:
- generates an index of the reference genome and
- maps the reads to reference genome.
Requirements:
- zlib for the ability to read compressed FASTQ and write compressed SAM files.
- C compiler (mrFAST is developed with gcc versions > 4.1.2)
Building
On Unix/Linux systems, we recommend using GNU gcc version > 4.1.2 as your compiler and type 'make' to build. Example:
linux> make
gcc -c -O3 baseFAST.c -o baseFAST.o
gcc -c -O3 CommandLineParser.c -o CommandLineParser.o
gcc -c -O3 Common.c -o Common.o
gcc -c -O3 HashTable.c -o HashTable.o
gcc -c -O3 MrFAST.c -o MrFAST.o
gcc -c -O3 Output.c -o Output.o
gcc -c -O3 Reads.c -o Reads.o
gcc -c -O3 RefGenome.c -o RefGenome.o
gcc baseFAST.o CommandLineParser.o Common.o HashTable.o MrFAST.o Output.o Reads.o RefGenome.o
-o mrFAST -lz -lm
rm -rf *.o
Parallelization: The best way to optimize mrFAST is to split the reads into chunks that fit into the memory of the cluster nodes, and implement an MPI wrapper in an embarrassingly parallel fashion. The multithreaded version of mrFAST will be released after extensive testing. We recommend the following criteria to split the reads (assuming ~2 GB main memory):
- Single End Mode: The number of reads should be approximately ((M-600)/(4L)) million where M is the size of the memory for the cluster node (in megabytes) and L is the read length. If you have more nodes, you can make the chunks smaller to use the nodes efficiently. For example, if the library length is 50bp and the memory of nodes is 2 GB, each chunk should contain (2000-600)/(450)= 7 million reads.
- Paired End Mode: The number of reads in each file should not exceed 1 million (500,000 pairs), however chunk size of 500,000 reads (250,000 pairs) is recommended.
By default mrFAST uses the window size of 12 characters to generate its index. Please be advised that if you do not choose the window size carefully, you will lose sensitivity. Longer seed length doesn't always mean faster mapping speed, as it alters the Cheap Key Selection module of the FastHASH algorithm.
How to choose the right window size: For a given read length (l) and error threshold (e), the window size is floor(l/(e+1)). For example if the reads length is 36 and the maximum number of mismatches allowed is 2, the window size is 12. if your calculated window size is greater than default, you can use the default window size without losing the sensitivity. For example, for the read length of 64 and error threshold of 2, the windows size should be 21. You can use the default window size 12. However you cannot use 12 as window size for read length of 30 and error threshold of 2.
To index a reference genome like "refgen.fasta" run the following command:
$>./mrfast --index refgen.fasta
Upon the completion of the indexing phase, you can find "refgen.fasta.index" in the same directory as "refgen.fasta". mrFAST uses a window size of 12 (default) to make the index of the genome, this windows size can be modified with "--ws". There is a restriction on the maximum of the window size as the window size directly affects the memory usage.
$>./mrfast --index refgen.fasta --ws 13
mrFAST can map single-end reads and paired-end reads to a reference genome. mrFAST supports both fasta and fastq formats.
NOTE: mrFAST will assume that the read lengths are uniform within a file (paired-ends may differ from each other). If the lengths of the reads are not the same in a file, it may crash or output incorrectly formatted SAM files.
Single-end Reads
To map single reads to a reference genome, run the following command. Use "--seq" to specify the input file. refgen.fa and refgen.fa.index should be in the same folder. You can load a multi-sequence FASTA file as the reference genome.
$>./mrfast --search refgen.fa --seq reads.fastq
The reported locations will be saved into "output" by default. If you want to save it somewhere else, use "-o" to specify another file. mrFAST can report the unmapped reads in fasta/fastq format.
$>./mrfast --search refgen.fasta --seq reads.fastq -o my.map.sam
By default, mrFAST reports all the locations per read. If you need one "best" mapping add the "--best" parameter to the command line:
$>./mrfast --search refgen.fasta --seq reads.fastq -e 3 --best
Paired-end Reads
To map paired-end reads, use "--pe" option. If the reads are in two different files, you have to use "--seq1/--seq2" to indicate the files. If the reads are interleaved, use "--seq" to indicated the file. The distance allowed between the paired-end reads should be specified with "--min" and "--max". "--min" and "--max" specify the minmum and maximum of the inferred size (the distance between outer edges of the mapping mates).
$>./mrfast --search refgen.fasta --pe --seq reads.fastq --min 150 --max 250
Discordant Mapping
mrFAST can report the discordant mapping for use of Variation Hunter. The --min and --max options will define the minimum and maximum inferred size for concordant mapping. This is enabled by default since version 2.1.0.6
$>./mrfast --search refgen.fasta --pe --discordant-vh --seq reads.fastq --min 150 --max 750
General Options:
-v | --version Shows the current version.
-h Shows the help screen.
Indexing Options:
--index [file] Generate an index from the specified fasta file.
--ws [int] Set window size for indexing (default:12 max:15).
Searching Options:
--search [file] Search the specified genome. Index file should be in same directory as the fasta file.
--pe Search will be done in paired-end mode
--mp Search will be done in matepair mode
--seq [file] Input sequences in fasta/fastq format [file]. If pairend reads are interleaved, use this option.
--seq1 [file] Input sequences in fasta/fastq format [file] (First file). Use this option to indicate the first file of paired-end reads
--seq2 [file] Input sequences in fasta/fastq format [file] (Second file). Use this option to indicate the second file of paired-end reads.
-o [file] Output of the mapped sequences (SAM format). The default is "output".
-u [file] FASTA/FASTQ file for the unmapped sequences. The default is "unmapped".
-e [int] Maximum allowed edit distance (default 4% of the read length). Note that although the current version is limited with up to 4+4 indels, it supports any number of substitution errors.
--min [int] Min inferred distance allowed between two pairend sequences.
--max [int] Max inferred distance allowed between two pairend sequences.
--discordant-vh To return all discordant map locations ready for the Variation Hunter program, and OEA map locations ready for the NovelSeq.
--best Return "best" location only (single-end mode).
--seqcomp Indicates that the input sequences are compressed (gz).
--outcomp Indicates that output file should be compressed (gz).
--maxoea [int] Max number of One End Anchored (OEA) returned for each read pair. Minimum of 100 is recommendded for NovelSeq use.
--maxdis [int] Max number of discordant map locations returned for each read pair.
--crop [int] Crop the input reads at position [int].
--sample [string] Sample name to be added to the SAM header (optional).
--rg [string] Read group ID to be added to the SAM header (optional).
--lib [string] Library name to be added to the SAM header (optional).
Single-End Mode: In the single-end mode mrFAST will generate two files as specified by the "-o" and "-u" parameters. Default filenename if the "-o" parameter is not specified is "output"; and default filename for the "-u" parameter is "unmapped".
- output file ("-o"): Contains the map locations of the sequences in the specified genome in SAM format. mrFAST returns all possible map locations within the given edit distance ("-e") by default. If the "--best" parameter is invoked, then it will select one "best" location that has the minimum edit distance to the genome.
- unmappped file ("-u"): Contains the unmapped reads in FASTQ or FASTA format, depending on the format of the input sequences.
Paired-End and Matepair Modes: In paired-end and matepair modes, mrFAST will generate a SAM file in the paired-end mode that will store best mapping locations while utilizing the paired-end span information. In addition, it will generate a DIVET file and and OEA file (SAM format). See below:
- output file ("-o"): Contains the map locations of the sequences in the specified genome in SAM format. This file will include:
- If a read pair can be mapped concordantly, the "best" (minimum total edit distance and minimum differential from the average span) map location for the pair.
- If the read pair can not be mapped concordantly, again, the "best" (minimum total edit distance and minimum differential from the average span) map location for the pair.
- unmapped file ("-u"): Contains the orphan (both ends unmapped) reads in FASTQ or FASTA format, depending on the format of the input sequences.
- output.DIVET.vh file ("-o" option changes the prefix "output"): This file includes all possible map locations for the read pairs that cannot be concordantly mapped. This file can be loaded by VariationHunter tool for structural variation discovery.
- output_OEA.sam file: Contains the OEA (One-End-Anchored) reads (paired-end reads where only one read can be mapped to the genome). The output is in SAM format, contains the map location of read that can be mapped to the genome. The unmapped reads of an OEA read pair are not reported in separate lines; instead the sequence and quality information is given in the line that specifies the map location of the mapped read. We use optional fields NS and NQ to specify the unmapped sequence and unmapped quality information. This file can be loaded by NovelSeq tool for novel sequence discovery, however format conversion might be required; please see the NovelSeq documentation.
NOTE: mrFAST will report many (up to 100 by default) possible map locations for the "mapped" read of OEA matepais. This will generate a large file due to repeats and duplications. This file can be limited through the --maxoea parameter (version 2.1.0.0 and above)._
mrFAST mapping output format is in SAM format. For detail about the definition of the fields please refer to SAM Manual. All locations of discordant paired-end reads will be reported in DIVET format as required by the VariationHunter package. Unmapped reads (or, "orphan" read pairs in the PE mode) will be outputted in FASTQ or FASTA format, depending on the input sequence file format.