-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsimple_errant_kmer_formula.py
executable file
·112 lines (79 loc) · 2.93 KB
/
simple_errant_kmer_formula.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
#!/usr/bin/env python
"""
Compute what the errant kmer distribution can be expected to look like, using a
recurrence formula (formula below).
That models the process used in recoverY, though it leaves out such
complications as determining which kmers are good/bad, and that some errant
kmers are actually present elsewhere in the real data.
The output is the distribution of the score-- the fraction of reads expected to
be observed to have a particular score.
=== Recurrence ===
given:
k = kmer length
L = read length
p = per-base error rate (errors assumed to be idependent)
computes:
P(S) = probability that a read will have exactly S good kmers
We assume there are no doppleganger kmers (kmers which contain errors but
which exist elsewhere in the sequenced subject).
We define
P(L,S,r) = probability, among all reads of length L, of having score S with
the leftmost error at position r; r starts at 1; r>=k
indicate the leftmost error is far enough that prepending a non-
error will increase S, so we represent any r>=k by r=k.
Conceptually we begin with a read of length 0 with an error immediately
following it. Though no such position exists, putting an error there keeps us
from counting any kmer as good until the length reaches k.
P(0,0,1) = 1
P(0,0,r) = 0 otherwise.
Now extend the length of a general read by 1, on the left.
P(L,S,r) with probabiliy p becomes P(L+1,S,1)
with probabiliy 1-p becomes P(L+1,S,r+1) if (r<k)
or P(L+1,S+1,k) otherwise
P(S) is just the sum of P(L,S,r) over all r, for L = the read length.
"""
from sys import argv,stderr
def main():
assert (len(argv) == 4), "need kmer_size read_length error_rate and nothing else"
kmerSize = int (argv[1])
readLength = int (argv[2])
errorRate = float(argv[3])
assert (0 < kmerSize <= readLength)
assert (0 <= errorRate < 1)
# compute the recurrence
k = kmerSize
p = errorRate
q = 1-p
P = {}
L = 0
P[L] = {}
P[L][(0,1)] = 1.0
for L in xrange(0,readLength):
P[L+1] = {}
for (S,r) in P[L]:
pShort = P[L][(S,r)]
# prepend error
outcome = (S,1)
if (outcome not in P[L+1]): P[L+1][outcome] = 0.0
P[L+1][outcome] += p*pShort
# prepend non-error
if (r < k):
outcome = (S,r+1)
if (outcome not in P[L+1]): P[L+1][outcome] = 0.0
P[L+1][outcome] += q*pShort
else:
outcome = (S+1,k)
if (outcome not in P[L+1]): P[L+1][outcome] = 0.0
P[L+1][outcome] += q*pShort
# compute (by summing) and print the distribution for the final length
# nota bene: we could print the distribution for every length
L = readLength
PSUM = {}
for (S,r) in P[L]:
if (S not in PSUM): PSUM[S] = 0.0
PSUM[S] += P[L][(S,r)]
print "%s\t%s" % ("score_%d_%d" % (kmerSize,readLength),"prob")
for S in xrange(readLength+2-kmerSize):
if (S not in PSUM): print "%d\tNA" % S
else: print "%d\t%.6f" % (S,PSUM[S])
if __name__ == "__main__": main()