.TH miniprot 1 "5 March 2024" "miniprot-0.13 (r248)" "Bioinformatics tools"
.SH NAME
.PP
miniprot - protein-to-genome alignment with splicing and frameshifts
.SH SYNOPSIS
* Indexing a genome (recommended as indexing can be slow and memory hungry):
.RS 4
miniprot
.RB [ -t
.IR nThreads ]
.B -d
.I ref.mpi
.I ref.fna
.RE
* Aligning proteins to a genome:
.RS 4
miniprot
.RB [ -t
.IR nThreads ]
.I ref.mpi
.I protein.faa
>
.I output.paf
.br
miniprot
.RB [ -t
.IR nThreads ]
.I ref.fna
.I protein.faa
>
.I output.paf
.RE
.SH DESCRIPTION
Miniprot aligns protein sequences to a genome, allowing for splicing and potential frameshifts.
.SH OPTIONS
.SS Indexing options
.TP 10
.BI -k \ INT
K-mer size for genome-wide indexing [6]
.TP
.BI -M \ INT
Sample k-mers at a rate
.RI 1/2** INT
[1]. Increasing this option reduces peak memory but decreases sensitivity.
.TP
.BI -L \ INT
Minimum ORF length to index [30]
.TP
.BI -T \ INT
NCBI translation table (1 through 33 except 7-8 and 17-20) [1]
.TP
.BI -b \ INT
Number of bits per bin [8]. With the default, miniprot splits the genome into non-overlapping bins of 2^8 = 256 bp.
.TP
.BI -d \ FILE
Write the index to
.I FILE
[].
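.PP
As an illustration of the arithmetic behind
.B -M
and
.BR -b ,
the sketch below computes the k-mer sampling rate and the bin size from the option values. The function names are hypothetical, and mapping a coordinate to a bin with a right shift is an assumption made for illustration; it is not taken from the miniprot source.
.nf
```python
def sampling_rate(m):
    """Fraction of k-mers sampled for -M m, i.e. 1/2**m."""
    return 1.0 / 2 ** m


def bin_size(b):
    """Bin size in bp for -b b, i.e. 2**b."""
    return 2 ** b


def bin_index(pos, b):
    """Assumed mapping of a 0-based genome coordinate to its bin."""
    return pos >> b


# Defaults quoted in this section: -M 1 and -b 8.
print(sampling_rate(1), bin_size(8), bin_index(1000, 8))
```
.fi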
.SS Chaining options
.TP 10
.B -S
Disable splicing. It applies
.RB ` -G1k
.B -J1k
.BR -e1k '
at the same time.
.TP
.BI -c \ NUM
Ignore k-mers occurring
.I NUM
times or more [50k]
.TP
.BI -G \ NUM
Max intron size [200k]. This option overrides
.BR -I .
.TP
.B -I
Set max intron size to
.RI min(max(3.6*sqrt( refLen ),10000),300000)
where
.I refLen
is the total length of the input genome.
.TP
.BI -n \ NUM
Min number of syncmers in a chain [10]
.TP
.BI -m \ NUM
Min chaining score [0]
.TP
.BI -l \ INT
K-mer size for the second round of chaining [5]
.TP
.BI -e \ NUM
Max extension from chain ends for alignment or the second round of chaining [10k]
.TP
.BI -p \ FLOAT
Filter out a secondary chain/alignment if its score is below
.I FLOAT
fraction of the best chain score [0.5]
.TP
.BI -N \ NUM
Retain at most
.I NUM
secondary chains/alignments [30]
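.PP
The formula given for
.B -I
can be evaluated directly. The following is a minimal sketch; the function name is hypothetical and truncating the result to an integer is an assumption, since this page does not say how the value is rounded.
.nf
```python
import math


def auto_max_intron(ref_len):
    """Max intron size implied by -I for a genome of ref_len bases.

    Implements min(max(3.6*sqrt(refLen),10000),300000) as quoted above.
    Truncation to int is an assumption of this sketch.
    """
    return int(min(max(3.6 * math.sqrt(ref_len), 10000), 300000))


# Example genome sizes chosen only to show the floor, the middle range
# and the ceiling of the formula.
for ref_len in (1000000, 100000000, 10000000000):
    print(ref_len, auto_max_intron(ref_len))
```
.fi
Small genomes hit the 10000 floor and very large ones the 300000 ceiling; only genomes in between get a size that scales with the square root of their length.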
.SS Alignment options
.TP 10
.BI -O \ INT
Gap open penalty [11]
.TP
.BI -E \ INT
Gap extension penalty [1]. A gap of size
.B g
costs
.RB { -O }+{ -E }* g .
.TP
.BI -J \ INT
Intron open penalty [29]
.TP
.BI -F \ INT
Penalty for frameshifts or in-frame stop codons [23]
.TP
.BI -C \ FLOAT
Weight of splicing penalty [1]. Set to 0 to ignore splicing signals.
.TP
.BI -B \ INT
Bonus score for an alignment reaching the ends of the protein [5]
.TP
.BI -j \ INT
Splice model for the target genome: 2=mammal, 1=general, 0=none [1]. The mammal
model considers `G|GTR...YYYNYAG|' as the optimal splicing sequence and
penalizes other sequences based on profiles in Sibley et al (2016). According
to Irimia and Roy (2008) and Sheth et al (2006), the first `G' in the donor
exon and the poly-Y close to the acceptor may not be conserved in some species.
The general model takes `|GTR...YAG|' as the optimal sequence. Both models also
consider less frequent splice sites including `G|GC...YAG|' and `|AT...AC|'.
.TP
.BI --io-coef \ FLOAT
Logarithm intron length penalty (EXPERIMENTAL) [0.5]
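.PP
The affine gap cost stated under
.B -E
can be written out as below. This is a sketch of that one formula only, with hypothetical names; it does not model the intron, frameshift or splice-signal penalties.
.nf
```python
def gap_cost(g, gap_open=11, gap_ext=1):
    """Cost of a gap of size g: {-O} + {-E} * g.

    The defaults mirror the -O and -E defaults listed above.
    """
    if g < 1:
        raise ValueError("gap size must be at least 1")
    return gap_open + gap_ext * g


for g in (1, 5, 12):
    print(g, gap_cost(g))
```
.fi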
.SS Input/Output options
.TP 10
.BI -t \ INT
Number of threads [4]
.TP
.B --gff
Output in the GFF3 format. `##PAF' lines in the output provide detailed
alignments.
.TP
.B --gff-only
Output in the GFF3 format without `##PAF' lines.
.TP
.B --aln
Output the residue alignment in three lines: the `##ATN' line gives the target
nucleotide sequence, the `##ATA' line the translated amino acid sequence and
the `##AQA' line the query protein sequence. On a `##ATA' line, `!' denotes a
frameshift insertion corresponding to the `F' CIGAR operator and `$' denotes a
frameshift substitution corresponding to the `G' operator.
.TP
.B --trans
Output translated protein sequences on `##STA' lines.
.TP
.B --no-cs
Do not output the cs tag
.TP
.BI --max-intron-out \ NUM
In the
.B --aln
format, if an intron is longer than
.IR NUM ,
only output
.RI ceil( NUM /2)
basepairs at the donor or the acceptor sites and write the full intron length
.I LEN
as
.RI ~ LEN ~
in the middle [200].
.TP
.BI -P \ STR
Prefix for IDs in GFF3 or GTF [MP].
.B --gff-delim
overrides this option.
.TP
.BI --gff-delim \ CHAR
Change the ID field in GFF3 to
.RI QueryName CHAR HitIndex
[]. If not specified, the default ID looks like `MP000012'. This option is only
applicable to the GFF3 output format.
.TP
.B --gtf
Output in the GTF format
.TP
.B -u
Print unmapped query proteins
.TP
.BI --outn \ NUM
Output up to
.RI min{ NUM ,
.BR -N }
alignments per query [1000].
.TP
.BI --outs \ FLOAT
Output an alignment only if its score is at least
.IR FLOAT *bestScore,
where bestScore is the best alignment score of the protein [0.99]
.TP
.BI --outc \ FLOAT
Output an alignment only if
.I FLOAT
fraction of the query protein is aligned [0.1]
.TP
.BI -K \ NUM
Query batch size [2M]
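.PP
One way to read
.BR --outn ,
.B --outs
and
.B --outc
together is sketched below. The record layout, the function name, the order in which the filters are applied and the definition of the aligned fraction as (end - start) / length are all assumptions of this sketch, not a description of miniprot internals.
.nf
```python
def select_alignments(hits, outn=1000, outs=0.99, outc=0.1, max_secondary=30):
    """Filter the hits of one query protein.

    hits: list of dicts with hypothetical keys score, qlen, qstart, qend.
    max_secondary stands in for the value of -N.
    """
    if not hits:
        return []
    best = max(h["score"] for h in hits)
    kept = []
    for h in sorted(hits, key=lambda h: h["score"], reverse=True):
        aligned_frac = (h["qend"] - h["qstart"]) / h["qlen"]
        # --outs: score relative to the best hit; --outc: query coverage
        if h["score"] >= outs * best and aligned_frac >= outc:
            kept.append(h)
    # --outn: report at most min{NUM, -N} alignments
    return kept[:min(outn, max_secondary)]


example = [
    {"score": 100.0, "qlen": 200, "qstart": 0, "qend": 200},
    {"score": 99.5, "qlen": 200, "qstart": 0, "qend": 150},
    {"score": 99.5, "qlen": 200, "qstart": 0, "qend": 10},
    {"score": 50.0, "qlen": 200, "qstart": 0, "qend": 200},
]
print([h["score"] for h in select_alignments(example)])
```
.fi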
.SH OUTPUT FORMAT
.SS The GFF3 Format
Miniprot outputs alignments in the extended Pairwise mApping Format (PAF) by
default (see the next subsection). It can also output GFF3 with option
.BR --gff .
Miniprot may output three features: `mRNA', `CDS' or `stop_codon'. Here, a
stop_codon is only reported if the alignment reaches the C-terminus of the
protein and the next codon is a stop codon. Per the GENCODE rule, stop_codon is
not part of CDS but it is part of mRNA or exon.
.PP
Miniprot may output the following attributes in GFF3:
.TS
center box;
cb | cb | cb
l | c | l .
Attribute	Type	Description
_
ID	str	mRNA identifier
Parent	str	Identifier of the parent feature
Rank	int	Rank among all hits of the query
Identity	real	Fraction of exact amino acid matches
Positive	real	Fraction of positive amino acid matches
Donor	str	2bp at the donor site if not GT
Acceptor	str	2bp at the acceptor site if not AG
Frameshift	int	Number of frameshift events in alignment
StopCodon	int	Number of in-frame stop codons
Target	str	Protein coordinate in alignment
.TE
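.PP
The attributes above appear in the ninth GFF3 column as semicolon-separated key=value pairs, which is the general GFF3 convention. The sketch below parses such a column and converts the numeric attributes according to the Type column of the table. The function name and the example string are hypothetical.
.nf
```python
INT_ATTRS = ("Rank", "Frameshift", "StopCodon")
REAL_ATTRS = ("Identity", "Positive")


def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column into a dict with typed values."""
    attrs = {}
    for item in column9.strip().split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        if key in INT_ATTRS:
            value = int(value)
        elif key in REAL_ATTRS:
            value = float(value)
        attrs[key] = value
    return attrs


# Made-up attribute string, for illustration only.
print(parse_gff3_attributes("ID=MP000012;Rank=1;Identity=0.9500;Frameshift=0"))
```
.fi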
.SS The PAF Format
PAF gives detailed alignments. It is a TAB-delimited text format with each line
consisting of at least 12 fields, as described in the following table:
.TS
center box;
cb | cb | cb
r | c | l .
Col	Type	Description
_
1	string	Protein sequence name
2	int	Protein sequence length
3	int	Protein start coordinate (0-based)
4	int	Protein end coordinate (0-based)
5	char	`+' for forward strand; `-' for reverse
6	string	Contig sequence name
7	int	Contig sequence length
8	int	Contig start coordinate on the original strand
9	int	Contig end coordinate on the original strand
10	int	Number of matching nucleotides
11	int	Number of nucleotides in alignment excl. introns
12	int	Mapping quality (0-255 with 255 for missing)
.TE
.TE
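.PP
The twelve mandatory columns can be read with a few lines of code. The following is a minimal sketch; the field names in the named tuple are hypothetical labels for the columns in the table above, and the example record is made up.
.nf
```python
from collections import namedtuple

TAB = chr(9)  # the TAB character, written without an escape sequence

PafHit = namedtuple(
    "PafHit",
    "qname qlen qstart qend strand tname tlen tstart tend nmatch alnlen mapq tags",
)


def parse_paf_line(line):
    """Parse one PAF line into a PafHit; optional tags are kept as raw strings."""
    f = line.rstrip().split(TAB)
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 fields")
    return PafHit(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]),
        int(f[9]), int(f[10]), int(f[11]), f[12:],
    )


# Made-up record, for illustration only.
record = TAB.join(
    ["prot1", "300", "0", "300", "+", "chr1", "1000000",
     "5000", "7000", "850", "900", "60", "AS:i:1500", "fs:i:0"]
)
hit = parse_paf_line(record)
print(hit.qname, hit.tname, hit.nmatch / hit.alnlen)
```
.fi
Dividing column 10 by column 11 gives a nucleotide-level identity over the aligned region with introns excluded.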
.PP
PAF may optionally have additional fields in the SAM-like typed key-value
format. Miniprot may output the following tags:
.TS
center box;
cb | cb | cb
r | c | l .
Tag	Type	Description
_
AS	i	Alignment score from dynamic programming
ms	i	Alignment score excluding introns
np	i	Number of amino acid matches with positive scores
fs	i	Number of frameshifts
st	i	Number of in-frame stop codons
da	i	Distance to the nearest start codon
do	i	Distance to the nearest stop codon
cg	Z	Protein CIGAR
cs	Z	Difference string
.TE
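.PP
Optional fields follow the SAM-like KEY:TYPE:VALUE layout, so they can be converted into a dictionary as sketched below. The function name and the example values are hypothetical; only the `i' and `Z' types listed in the table are expected, and the `f' branch is included as an assumption for completeness.
.nf
```python
def parse_tags(fields):
    """Convert a list of KEY:TYPE:VALUE strings into a dict."""
    tags = {}
    for field in fields:
        # Split on the first two colons only: a cs value may start with ":".
        key, typ, value = field.split(":", 2)
        if typ == "i":
            value = int(value)
        elif typ == "f":
            value = float(value)
        tags[key] = value
    return tags


# Made-up tag values, for illustration only.
print(parse_tags(["AS:i:1500", "fs:i:0", "cg:Z:30M", "cs:Z::30"]))
```
.fi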
.PP
A protein CIGAR consists of the following operators:
.TS
center box;
cb | cb
r | l .
Op	Description
_
nM	Alignment match. Consuming n*3 nucleotides and n amino acids
nI	Insertion. Consuming n amino acids
nD	Deletion. Consuming n*3 nucleotides
nF	Frameshift deletion. Consuming n nucleotides
nG	Frameshift match. Consuming n nucleotides and 1 amino acid
nN	Phase-0 intron. Consuming n nucleotides
nU	Phase-1 intron. Consuming n nucleotides and 1 amino acid
nV	Phase-2 intron. Consuming n nucleotides and 1 amino acid
.TE
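.PP
The consumption rules in the table are enough to work out how many nucleotides and amino acids a protein CIGAR spans. The sketch below applies them operator by operator; the function name and the example strings are hypothetical.
.nf
```python
import re

CIGAR_OP = re.compile("([0-9]+)([MIDFGNUV])")


def cigar_consumed(cigar):
    """Return (nucleotides, amino_acids) consumed by a protein CIGAR."""
    nt = aa = pos = 0
    while pos < len(cigar):
        m = CIGAR_OP.match(cigar, pos)
        if m is None:
            raise ValueError("unrecognised CIGAR text at offset %d" % pos)
        n, op = int(m.group(1)), m.group(2)
        if op == "M":      # n*3 nucleotides and n amino acids
            nt += 3 * n
            aa += n
        elif op == "I":    # n amino acids
            aa += n
        elif op == "D":    # n*3 nucleotides
            nt += 3 * n
        elif op in "FN":   # n nucleotides only
            nt += n
        else:              # G, U, V: n nucleotides and 1 amino acid
            nt += n
            aa += 1
        pos = m.end()
    return nt, aa


for example in ("10M", "10M100N5M", "3M50U2M"):
    print(example, cigar_consumed(example))
```
.fi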
.PP
The
.B cs
tag encodes difference sequences. It consists of a series of operations:
.TS
center box;
cb | cb | cb
r | l | l .
Op	Regex	Description
_
:	[0-9]+	Number of identical amino acids
*	[acgtn]+[A-Z*]	Substitution: ref to query
+	[A-Z]+	# aa inserted to the reference
-	[acgtn]+	# nt deleted from the reference
~	[acgtn]{2}[0-9]+[acgtn]{2}	Intron length and splice signal
.TE
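.PP
The regular expressions in the table can be combined into a small tokenizer for the
.B cs
tag. This is a sketch under the assumption that a cs string is a plain concatenation of the five operations listed above; the function name and the example string are hypothetical.
.nf
```python
import re

# One alternative per row of the table; character classes are used
# instead of escapes for the "*" and "+" operators.
CS_OP = re.compile(
    ":[0-9]+"
    "|[*][acgtn]+[A-Z*]"
    "|[+][A-Z]+"
    "|-[acgtn]+"
    "|~[acgtn]{2}[0-9]+[acgtn]{2}"
)


def parse_cs(cs):
    """Split a cs string into (operator, payload) pairs."""
    ops = []
    pos = 0
    while pos < len(cs):
        m = CS_OP.match(cs, pos)
        if m is None:
            raise ValueError("unrecognised cs operation at offset %d" % pos)
        token = m.group(0)
        ops.append((token[0], token[1:]))
        pos = m.end()
    return ops


def identical_residues(cs):
    """Sum of the ':' operations, i.e. the number of identical amino acids."""
    return sum(int(payload) for op, payload in parse_cs(cs) if op == ":")


# Made-up cs string, for illustration only.
print(parse_cs(":12*gcaA:3+KL-acg~gt85ag:7"))
```
.fi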
.SH LIMITATIONS
.TP 2
*
The DP alignment score (the AS tag) is not accurate.
.TP
*
Need to introduce more heuristics for improved accuracy.