-
Notifications
You must be signed in to change notification settings - Fork 0
/
lec22.570.tex
548 lines (428 loc) · 23.1 KB
/
lec22.570.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
\documentclass[10pt]{article}
%%%
%\usepackage{latexsym}
\usepackage{epsfig}
\usepackage{amssymb}
\usepackage{amsmath}
\begin{document}
\textwidth=8in
\textheight=9in
\parindent=10pt
\hoffset=-1in
\voffset=-.5in
\parskip=.065in
\newtheorem{problem}{Problem}
\begin{center}
\begin{tabular}{|lcr|}
\hline
.\hspace{1in}$$&$$\hspace{2in}$$&$$\hspace{1in}. \\
{\large \textsf{Fall 2011}} &
{\large \textsf{\textbf{CSC 570:} Bioinformatics}} &
{\large \textsf{ Alexander Dekhtyar}}\\
.\hspace{1in}$$&$$\hspace{2.5in}$$&$$\hspace{1in}. \\
\hline
\end{tabular}
\end{center}
\begin{center}
\textsf{\large Approximate String Matching}
\end{center}
{\large \textbf{Prepared by:} \textit{Ryan Schmitt, Bob Somers, \& Jim Langlois.}
}
%%
%% BOB'S TEXT SECTION BEGINS HERE
%%
\section*{Approximate String Matching with Dynamic Programming}
\paragraph{}The core idea behind using dynamic programming to solve the
approximate string matching problem is very straightforward and has been
``discovered" by a variety of authors between 1968 and 1992. All of these,
however, focus on using dynamic programming solely to compute the edit distance.
Therefore, the seminal work in this area is usually attributed to Sellers who
expanded the technique from an edit distance computation to an approximate
string search \cite{sellers80}.
The original algorithm is not particularly efficient, but it's
one of the most flexible for adaptation to different distance and cost
functions. The additional work in this area since the original paper has been
focused on small efficiency improvements to the original algorithm.
\subsection*{Computing the Edit Distance}
\paragraph{}In this example, we'll look at computing the edit distance between
``survey" and ``surgery" with dynamic programming. We'll use the
\textit{Levenshtein} or \textit{edit distance} as our distance function which
allows for three possible operations, \textit{insertion}, \textit{deletion},
and \textit{replacement}. We'll also assume that these operations have a
uniform cost of 1. As we'll see later, these are not crippling assumptions,
since the algorithm handles other distance and cost functions very well.
To compute the edit distance $ed(x, y)$ between two strings $x$ and
$y$, we begin by constructing a matrix $C_{0 \ldots |x|, 0 \ldots |y|}$ where
$C_{i, j}$ represents the minimum number of operations needed to match
$x_{1 \ldots i}$ to $y_{1 \ldots j}$.
\begin{center}
\begin{tabular}{c|c|c|c|c|c|c|c|c|}
& & s & u & r & g & e & r & y \\ \hline
& 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \hline
s & 1 & & & & & & & \\ \hline
u & 2 & & & & & & & \\ \hline
r & 3 & & & & & & & \\ \hline
v & 4 & & & & & & & \\ \hline
e & 5 & & & & & & & \\ \hline
y & 6 & & & & & & & \\ \hline
\end{tabular}
\end{center}
Row 0 represents the empty string. Thus, each successive cell in row 0 requires
an additional deletion to match the empty string, so it's easy to see why this
initialization makes sense. The same logic follows for column 0. More formally,
we initialize our matrix as follows:
\begin{align*}
C_{i, 0} = i \\
C_{0, j} = j
\end{align*}
To fill in the rest of the matrix, we simply proceed either row-wise left to
right or column-wise top to bottom. When evaluating a particular cell $C_{i, j}$,
we notice that the edit distances of all shorter strings have been computed
already, so it's simply a matter of inspecting the edit distances from the
preceding strings at $C_{i - 1, j}$, $C_{i - 1, j - 1}$, and $C_{i, j - 1}$ and
incrementing the minimum edit distance if the two characters in question $x_{i}$
and $y_{j}$ don't match.
More formally, the rules for filling in a cell $C_{i, j}$ are as follows:
\begin{align*}
C_{i, j} = \begin{cases}
C_{i - 1, j - 1} & \text{if } x_{i} = y_{j} \\
1 + \text{min}(C_{i - 1, j}, C_{i - 1, j - 1}, C_{i, j - 1}) & \text{otherwise}
\end{cases}
\end{align*}
For this example we'll move row-wise left to right. The first cell that needs
to be filled in is $C_{1, 1}$. We compare $x_{1}$ (``s") with $y_{1}$ (``s") and
find that they match. Thus, we copy the edit distance from $C_{0, 0}$ into this
cell.
\begin{center}
\begin{tabular}{c|c|c|c|c|c|c|c|c|}
& & s & u & r & g & e & r & y \\ \hline
& 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \hline
s & 1 & \textbf{\underline{0}} & & & & & & \\ \hline
u & 2 & & & & & & & \\ \hline
r & 3 & & & & & & & \\ \hline
v & 4 & & & & & & & \\ \hline
e & 5 & & & & & & & \\ \hline
y & 6 & & & & & & & \\ \hline
\end{tabular}
\end{center}
Next we'll consider $C_{1, 2}$. We compare $x_{1}$ (``s") with $y_{2}$ (``u")
find that they do not match. Thus, we examine the edit distances from the three
directions we could have arrived at this cell, take the minimum, and increment
it to account for this replacement. We find that the minimum of its left,
top-left, and top neighbors is 0 in $C_{1, 1}$, so we set this cell's edit
distance to 1.
\begin{center}
\begin{tabular}{c|c|c|c|c|c|c|c|c|}
& & s & u & r & g & e & r & y \\ \hline
& 0 & \textbf{1} & \textbf{2} & 3 & 4 & 5 & 6 & 7 \\ \hline
s & 1 & \textbf{0} & \textbf{\underline{1}} & & & & & \\ \hline
u & 2 & & & & & & & \\ \hline
r & 3 & & & & & & & \\ \hline
v & 4 & & & & & & & \\ \hline
e & 5 & & & & & & & \\ \hline
y & 6 & & & & & & & \\ \hline
\end{tabular}
\end{center}
By continuing to traverse the empty cells in this fashion, we can complete the
edit distance matrix as follows. The bold path shows one optimal path for
transforming one string into the other.
\begin{center}
\begin{tabular}{c|c|c|c|c|c|c|c|c|}
& & s & u & r & g & e & r & y \\ \hline
& \textbf{\underline{0}} & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \hline
s & 1 & \textbf{\underline{0}} & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline
u & 2 & 1 & \textbf{\underline{0}} & 1 & 2 & 3 & 4 & 5 \\ \hline
r & 3 & 2 & 1 & \textbf{\underline{0}} & 1 & 2 & 3 & 4 \\ \hline
v & 4 & 3 & 2 & 1 & \textbf{\underline{1}} & 2 & 3 & 4 \\ \hline
e & 5 & 4 & 3 & 2 & 2 & \textbf{\underline{1}} & \textbf{\underline{2}} & 3 \\ \hline
y & 6 & 5 & 4 & 3 & 3 & 2 & 2 & \textbf{\underline{2}} \\ \hline
\end{tabular}
\end{center}
\subsection*{Text Searching}
\paragraph{}Extending this algorithm to text searching is actually extremely
easy. In this case we're interested in finding the locations of a pattern $P$
inside a larger text $T$ with up to $k$ possible errors. Since we're still
considering the simple edit distance, those possible errors consist of
\textit{insertions}, \textit{deletions}, and \textit{replacements}.
The algorithm is nearly identical to computing the edit distance matrix. We
simply assign $x = P$ and $y = T$. We also modify our initialization such that
row 0 is initialized as follows:
\begin{align*}
C_{0, j} = 0 \text{ for all } j \in 0 \ldots n
\end{align*}
Thus, our initial matrix for seaching for the pattern ``survey" inside the text
``surgery" would look like this:
\begin{center}
\begin{tabular}{c|c|c|c|c|c|c|c|c|}
& & s & u & r & g & e & r & y \\ \hline
& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \hline
s & 1 & & & & & & & \\ \hline
u & 2 & & & & & & & \\ \hline
r & 3 & & & & & & & \\ \hline
v & 4 & & & & & & & \\ \hline
e & 5 & & & & & & & \\ \hline
y & 6 & & & & & & & \\ \hline
\end{tabular}
\end{center}
The intuition behind this is as follows: Since row 0 represents the empty
string, each cell $C_{0, j}$ represents a location in the text $T_{j}$ where
the empty string matches. Since the length of the matched substring is zero, it
matches everywhere with zero errors.
We fill the rest of the matrix as we did before, computing the edit distance
at each cell. However this time, note that the edit distance represents the
\textbf{cumulative total number of errors} when searching for the pattern $P$
in the text $T$.
\begin{center}
\begin{tabular}{c|c|c|c|c|c|c|c|c|}
& & s & u & r & g & e & r & y \\ \hline
& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \hline
s & 1 & 0 & 1 & 1 & 1 & 1 & 1 & 1 \\ \hline
u & 2 & 1 & 0 & 1 & 2 & 2 & 2 & 2 \\ \hline
r & 3 & 2 & 1 & 0 & 1 & 2 & 2 & 3 \\ \hline
v & 4 & 3 & 2 & 1 & 1 & 2 & 3 & 3 \\ \hline
e & 5 & 4 & 3 & 2 & 2 & 1 & 2 & 3 \\ \hline
y & \textbf{6} & \textbf{5} & \textbf{4} & \textbf{3} & \textbf{3} & \textbf{\underline{2}} & \textbf{\underline{2}} & \textbf{\underline{2}} \\ \hline
\end{tabular}
\end{center}
To find approximate matches of the pattern, we simply look at the row which is
at the end of the pattern we're interested in. In this case, we look at row 6,
which is the end of ``survey". Note that this means the method also works for
searching for all prefixes of the pattern, by examining the row corresponding
to the last character in the prefix.
In that row we look for cells with an edit distance $\leq k$, where $k$ is the
maximum number of errors we allow. The location where these matches occur is at
position $j$, where $C_{|P|, j} \leq k$. Note that $j$ marks the \textit{end}
of the match, not the beginning.
In our example above, if we're interested in matches where $k \leq 2$, we find
three matches at positions 5, 6, and 7. These correspond to the following
approximate matches for ``survey" in ``surgery":
\begin{enumerate}
\item survey $\longrightarrow$ surge
\begin{itemize}
\item Replacement of \textbf{v} with \textbf{g}.
\item Deletion of \textbf{y}.
\end{itemize}
\item survey $\longrightarrow$ surger
\begin{itemize}
\item Replacement of \textbf{v} with \textbf{g}.
\item Replacement of \textbf{y} with \textbf{r}.
\end{itemize}
\item survey $\longrightarrow$ surgery
\begin{itemize}
\item Replacement of \textbf{v} with \textbf{g}.
\item Insertion of \textbf{r}.
\end{itemize}
\end{enumerate}
\subsection*{Additional Notes}
\begin{itemize}
\item \textbf{We don't need to store the entire matrix.} We only need to store
the previous row (for row-wise traversal) or the previous column (for column-wise
traversal). This can substantially improve the space requirements of the
algorithm.
\item \textbf{Text searches should use column-wise traversal.} Since generally
$|P| \ll |T|$, it is beneficial to traverse the matrix column-wise so that the
storage requirements are only $O(|P|)$ instead of $O(|T|)$.
\item \textbf{The time complexity isn't that great.} It's $O(|P||T|)$ in both
the worst and average case.
\item \textbf{It works well with other distance and cost functions.} For example,
if your insertions and deletions have a different cost than your replacements,
simply increment the cell by the appropriate cost depending on which operation
you choose.
\end{itemize}
%%
%% BOB'S TEXT SECTION ENDS HERE
%%
%%
%% RYAN'S TEXT SECTION BEGINS HERE
%%
\section*{Approximate String Matching and Filtering}
\paragraph{}Filtering is another method for approximate string matching. Below is a simple overview of how filtering algorithms are designed:
\begin{list}{}{}
\item 1) Filtering Phase
\begin{list}{*}{}
\item input: source text, pattern
\item algorithm quickly scans source for potential pattern matches
\item output: list of potential matches
\end{list}
\item 2) Verification Phase
\begin{list}{*}{}
\item input: list of potential matches
\item algorithm slowly verifies potential matches as exact matches
\item output: list of exact matches
\end{list}
\end{list}
The goal here is to mask out sections of the text that cannot contain matches, and only use the expensive algorithm on sections that could have matches.
For an overview of the twelve filtering algorithms please see \cite{navarro99}.
For specific algorithms see the following list:
\begin{list}{*}{}
\item Faster approximate string matching \cite{baeza99}
\item Sublinear approximate string matching and biological applications \cite{chang94a}
\item Approximate string matching and local similarity \cite{chang94b}
\item A comparison of approximate string matching algorithms \cite{jokinen96}
\item A Guided Tour to Approximate String Matching \cite{navarro99}
\item Improving an algorithm for approximate pattern matching \cite{navarro98}
\item Fast and flexible string matching by combining bit-parallelism and suffix automata \cite{navarro00}
\item Fast approximate string matching with q-blocks sequences \cite{shi96}
\item On using q-gram locations in approximate string matching \cite{sutinen95}
\item Approximate pattern matching with samples \cite{takaoka94}
\item Approximate Boyer-Moore string matching \cite{taruk93}
\item Approximate string matching with q-grams and maximal matches \cite{ukkonen92}
\item Fast text searching allowing errors \cite{wu92}
\end{list}
One of the forms of filtering is based on the Boyer-Moore algorithm that we have
studied previously (specifically the Boyer-Moore-Horspool (BMH) variation, where
the match heuristic is ignored, and only a simplified occurrence heuristic is used
\cite{taruk93}). And the genesis of this type of filtering is a 1993 paper by Tarhio
and Ukkonen, where they extend the BMH algorithm for the k-differences problem \cite{taruk93}.
For a review of the Boyer-Moore algorithm, please see the String Matching lecture notes.
\subsection*{The K-Differences Problem}
\paragraph*{Problem Statement}Given a text, T, and a pattern, P, find all occurrences of P
in T with at most k mismatches \cite{taruk93}. This is related to the Hamming Distance.
\subsubsection*{Example}
\noindent T = ``emample example''
\noindent P = ``example''
The k-differences problem with k=0 has only one solution, where ``example'' matches ``example.''
However, with k=1, there are two occurrences of ``example'' in ``emample example'' with at most 1 mismatch.
\subsection*{Approximate BMH Algorithm}
This extension to the Boyer-Moore-Horspool algorithm solves the k-differences problem, and, while not a
filtering algorithm itself, was the basis for Tarhio and Ukkonen to develop one of the first filtering algorithms
\cite{taruk93}. The crux of this extension is the generalization of both the right-to-left scanning of the pattern
against the text, and the shift computation.
The naming conventions we will use are as follows: for a text, T, of length n, the position index is h, and for a
pattern, P, of length m, the position index is i. We will find all occurrences of P in T with at most k mismatches.
\paragraph{Right-to-Left Scanning}
Instead of scanning until one mismatch is found, we scan until $k+1$ mismatches are found.
For example, if our pattern is ``abbb'' our text is ``abaacbb'' we have this as our first alignment:
\vspace{12pt}
\noindent \verb/a b b b/
\noindent \verb/a b a a c b b/
\vspace{12pt}
If k=1, we know it's not a match after we’ve found two mismatches:
\vspace{12pt}
\noindent \verb/a b B B/
\noindent \verb/a b A A c b b/
\vspace{12pt}
However, if k=2, we have a match, since we only found two mismatches.
\paragraph{Shift Computation}
Instead of only basing the shift on the text character paired with the last pattern character, we base the shift on the
last $k+1$ characters. The maximum shift is $m-k$. The shift is computed as the minimum shift out of those defined by the
$k+1$ text characters that are paired with the $k+1$ last characters of the pattern.
For example, if our pattern is ``abbb'' our text is ``abaacbb'' and $k=1$, we have the following as our first alignment:
\vspace{12pt}
\noindent \verb/a b b b/
\noindent \verb/a b a a c b b/
\vspace{12pt}
Here we have greater than k mismatches (2 mismatches) so we don't have a match. We look at the text characters below the
last $k+1$ characters of the pattern to determine our shift. Since $k = 1$, then $k+1 = 2$, and the text characters below the
last 2 characters of the pattern are `a' and `a':
\vspace{12pt}
\noindent \verb/a b B B/
\noindent \verb/a b A A c b b/
\vspace{12pt}
The `a' located at T$_{4}$ prescribes a shift of 3, in order to pair with the `a' at P$_{1}$, and the `a' located at T$_{3}$ prescribes
a shift of 2 to pair with the `a' at P$_{1}$. Because we minimize over all the shifts ($k+1$ of them), we choose the shift of
2, which leaves us with this:
\vspace{12pt}
\noindent \verb/ a b b b/
\noindent \verb/a b a a c b b/
\vspace{12pt}
Again, we have two mismatches (greater than the allowed 1 mismatch), so we don't have a match. We look at the text
characters below the last $k+1$ pattern characters:
\vspace{12pt}
\noindent \verb/ a b B B/
\noindent \verb/a b a a C B b/
\vspace{12pt}
The `b' located at T$_{6}$ prescribes a shift of 1, in order to pair with the `b' at P$_{3}$, and the `c' located at T$_{5}$ prescribes
a shift of 3, because `c' does not occur in the pattern so we can shift past it. We chose the minimum of the prescribed
shifts, which means we choose the shift of 1.
\vspace{12pt}
\noindent \verb/ a b b b/
\noindent \verb/a b a a c b b/
\vspace{12pt}
Here we only have one mismatch, which is not greater than our k=1. Therefore we have a match!
\vspace{12pt}
The full algorithms are as follows:
\fbox{
\begin{minipage}{7.8cm}
\textsc{Algorithm 2} \textsf{Computation of the shift table. \cite{taruk93}}\\
\textbf{begin}\\
$\mbox{ }\mbox{ }$ \textbf{for} \textsf{a in} $\Sigma$ \textbf{do} \textsf{ready[a] := m + 1;}\\
$\mbox{ }\mbox{ }$ \textbf{for} \textsf{a in} $\Sigma$ \textbf{do}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textbf{for} \textsf{i := m downto m - k} \textbf{do}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textsf{shiftTable[i, a] := m;}\\
$\mbox{ }\mbox{ }$ \textbf{for} \textsf{i := m - 1 downto 1} \textbf{do begin}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textbf{for} \textsf{j := ready[p}$_{\textsf{i}}$\textsf{] - 1 downto max(i, m - k)} \textbf{do}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textsf{shiftTable[j, p}$_{\textsf{i}}$\textsf{] := j - i;}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textsf{ready[p}$_{\textsf{i}}$\textsf{] := max(i, m - k)} \textbf{end}\\
\textbf{end}\\
\end{minipage}
}
\fbox{
\begin{minipage}{10.5cm}
\textsc{Algorithm 2} \textsf{Approximate string matching with k mismatches. \cite{taruk93}}\\
\textbf{begin}\\
$\mbox{ }\mbox{ }$ \textsf{compute shiftTable from P with Algorithm 1;}\\
$\mbox{ }\mbox{ }$ \textsf{j := m;}\\
$\mbox{ }\mbox{ }$ \textbf{while} \textsf{j} $\le$ \textsf{n + k} \textbf{do begin}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textsf{h := j; i := m; neq := 0;}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textsf{d := m - k;}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textbf{while} \textsf{i} $>$ \textsf{0 and neq} $\le$ \textsf{k} \textbf{do begin}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textbf{if} \textsf{i} $\ge$ \textsf{m - k} \textbf{then} \textsf{d := min(d, shiftTable[i, t}$_{\textsf{h}}$\textsf{]);}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textbf{if} \textsf{t}$_{\textsf{h}} \neq$ \textsf{p}$_{\textsf{i}}$ \textbf{then} \textsf{neq := neq + 1;}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textsf{i := i - 1; h := h - 1} \textbf{end}\textsf{;}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textbf{if} \textsf{neq} $\le$ \textsf{k} \textbf{then} \textsf{report match at postition j;}\\
$\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }\mbox{ }$ \textsf{j := j + d} \textbf{end}\textsf{;}\\
\textbf{end}\\
\end{minipage}
}
\vspace{12pt.}
The preprocessing time to create the shift table is $O(m + kc)$, where $c$ is the size of the alphabet. The storage space
required is $O(kc)$. The running time of the approximate BMH algorithm in the worst case is $O(mn)$, however, in
\cite{taruk93} they prove the expected running time is $O(nk (\frac{k}{c} + \frac{1}{m-k})$ if the lengths of all the shifts in the
shift table are mutually independent. \cite{taruk93}
\subsection*{Filtering and Approximate BMH}
We will only mention the basics of how the ideas in this algorithm are applied to filtering, and leave it to the reader
to research it further in \cite{taruk93}.
The authors' algorithm has two phases that they call scanning and checking \cite{taruk93}, which are analogous to the
filtering and verification phase, respectively, mentioned earlier. The verification phase uses the standard dynamic
programming method for checking whether the edit distance between two strings is within some tolerance. It is the
scanning phase we are interested in, which uses the ideas developed in the approximate BMH algorithm to compute the
shift once an alignment of text and pattern is marked as a potential match or not.
The scanning phase performs two operations iteratively: mark and shift. The mark operation marks the current alignment
of text and pattern as promising or not, prompting a check via dynamic programming if it looks like a potential match.
We will not explore the details of how a particular alignment is deemed promising, which can be reviewed in \cite{taruk93}.
The shift operation uses the exact same heuristic as the approximate BMH algorithm. That is, examining the text
characters under the last $k+1$ characters of the pattern.
%%
%% RYAN'S TEXT SECTION ENDS HERE
%%
\subsection*{Bibiliography}
\begin{thebibliography}{999}
%%
%% BOB'S BIBLIOGRAPHY SECTION STARTS HERE
%%
\bibitem{sellers80} Sellers, P., \textit{The theory and computation of evolutionary distances: Pattern recognition}, Journal of Algorithms, 1980.
%%
%% BOB'S BIBLIOGRAPHY SECTION ENDS HERE
%%
%%
%% RYAN'S BIBLIOGRAPHY SECTION STARTS HERE
%%
\bibitem{baeza99} Baeza-Yates, R. and Navarro, G., \textit{Faster approximate string matching}, Algorithmica, 1999.
\bibitem{chang94a} Chang, W. and Lawler, E., \textit{Sublinear approximate string matching and biological applications}, Algorithmica, 1994.
\bibitem{chang94b} Chang, W. and Marr, T., \textit{Approximate string matching and local similarity}, Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, 1994.
\bibitem{jokinen96} Jokinen, P., Tarhio, J., and Ukkonen, E., \textit{A comparison of approximate string matching algorithms}, Software Practice Exper., 1996.
\bibitem{navarro99} Navarro, G., \textit{A Guided Tour to Approximate String Matching}, ACM Computing Surveys, 1999.
\bibitem{navarro98} Navarro, G. and Baeza-Yates, R., \textit{Improving an algorithm for approximate pattern matching}, Algorithmica, 1998.
\bibitem{navarro00} Navarro, G. and Raffinot, M., \textit{Fast and flexible string matching by combining bit-parallelism and suffix automata}, ACM J. Exp. Algor. 5, 2000.
\bibitem{shi96} Shi, F., \textit{Fast approximate string matching with q-blocks sequences}, Proceedings of the 3rd Annual South American Workshop on String Processing, 1996.
\bibitem{sutinen95} Sutinen, E. and Tarhio, J., \textit{On using q-gram locations in approximate string matching}, Proceedings of the 3rd Annual European Symposium on Algorithms, 1995.
\bibitem{takaoka94} Takaoka, T., \textit{Approximate pattern matching with samples}, Proceedings of ISAAC '94, 1994.
\bibitem{taruk93} Tarhio, J. and Ukkonen, E., \textit{Approximate Boyer-Moore string matching}, SIAM J. Comput., 1993.
\bibitem{ukkonen92} Ukkonen, E., \textit{Approximate string matching with q-grams and maximal matches}, Theor. Comput. Sci, 1992.
\bibitem{wu92} Wu, S., and Manber, U., \textit{Fast text searching allowing errors}, Commun. ACM, 1992.
%%
%% RYAN'S BIBLIOGRAPHY SECTION ENDS HERE
%%
\end{thebibliography}
\end{document}