% icfp29ex-daniels.tex
\newif\ifpagetuning
\pagetuningtrue % trying to get page breaks in the best spots
\newif\ifnotcutting \notcuttingfalse % set if not cutting down to size
\newif\iffinaldraft \finaldrafttrue % true if it's the final version
\newif\iftimestamp\timestamptrue % show MD5 stamp of paper
% \timestampfalse % it's post-submission time
\IfFileExists{timestamp.tex}{}{\timestampfalse}
\newif\ifpdfmadness
\ifx\pdfoutput\undefined
\pdfmadnessfalse
\else
\ifnum\pdfoutput=1
\pdfmadnesstrue
\else
\pdfmadnessfalse
\fi
\fi
\documentclass[nonatbib,blockstyle,times]{sigplanconf}
% at 9pt anything but block style is unreadable
\newcommand\mrfy{MRFy} % damn you Noah
\newcommand\txprobj[3][]{a#1_{{#2}_{j-1}{#3}_j}}
\newcommand\txprobjj[3][]{a#1_{{#2}_{j}{#3}_j}}
\newcommand\alignwidth{\ensuremath C} % number of columns in an alignment
\newcommand\pairedwith[1]{{\pi(#1)}}
\newcommand\primcons{\texttt{\small(:)}}
\newcommand\mono[1]{\texttt{#1}} % inline code, used below
\ifpagetuning
\newcommand\tuningbox{\vbox}
\else
\newcommand\tuningbox{\relax}
\fi
\newcommand\naive{na\"\i ve}
\usepackage{url}
\usepackage{amsmath}
\usepackage{array}
\usepackage{listings}
\usepackage{graphicx}
\lstset{
tabsize = 2,
basicstyle = \ttfamily,
language = Haskell
}
\newcommand\figref[1]{Figure~\ref{#1}}
\newcommand\secref[1]{Section~\ref{sec:#1}}
\newcommand\seclabel[1]{\label{sec:#1}}
\usepackage{verbatim}
\newif\ifverbatimsmall
\verbatimsmallfalse
\makeatletter
\def\verbatim@font{%
\normalfont\ttfamily
\ifverbatimsmall\small\fi
\hyphenchar\font\m@ne
\@noligs}
\makeatother
\newskip\presvtopsep
\newenvironment{smallverbatim}{\par\small\verbatimsmalltrue\verbatim}{\endverbatim}
\newcommand\smallverbatiminput[1]{%
\verbatimsmalltrue
\presvtopsep=\topsep
\topsep=0.78\topsep
\verbatiminput{#1}%
\verbatimsmallfalse
\topsep=\presvtopsep
}
\newcommand\smallfuzzverbatiminput[2]{%
\hfuzz=#1 \smallverbatiminput{#2}\hfuzz=0pt }
\renewcommand\ttdefault{aett}
\long\def\remark#1{%
\ifvmode
\marginpar{\raggedright\hbadness=10000
\parindent=8pt \parskip=2pt
\def\baselinestretch{0.8}\tiny
\itshape\noindent #1\par}%
\else
\unskip\raisebox{-3.5pt}{\rlap{$\scriptstyle\diamond$}}%
\marginpar{\raggedright\hbadness=10000
\parindent=8pt \parskip=2pt
\def\baselinestretch{0.8}\tiny
\itshape\noindent #1\par}%
\fi}
\iffinaldraft
\newcommand\finalremark{\remark}
\else
\newcommand\finalremark[1]{\relax}
\fi
\usepackage[authoryear]{natbib}
\bibpunct();A{},
\let\cite\citep
\let\citeyearnopar=\citeyear
\let\citeyear=\citeyearpar
\IfFileExists{timestamp.tex}{\input{timestamp}}{}
\renewcommand{\floatpagefraction}{0.9} % must be less than \topfraction
\renewcommand{\topfraction}{0.95}
\renewcommand{\textfraction}{0.05}
\begin{document}
%\parskip=0.9\parskip % cheating
%\advance \parskip by 0pt plus 0.1pt
\overfullrule=10pt
\conferenceinfo{ICFP'12,} {September 9--15, 2012, Copenhagen, Denmark.}
\CopyrightYear{2012}
\copyrightdata{978-1-4503-1054-3/12/09}
%\titlebanner{banner above paper title} % These are ignored unless
\preprintfooter{Haskell in Computational Biology} % 'preprint' option specified.
\iftimestamp
\preprintfooter{\mdfivestamp}
\fi
\title{Experience Report: Haskell in Computational Biology}
% \subtitle{Subtitle Text, if any}
\authorinfo{Noah M. Daniels \and Andrew Gallant \and Norman Ramsey}
{Department of Computer Science, Tufts University}
{{\rmfamily\{}ndaniels, agallant, nr{\rmfamily\}}@cs.tufts.edu}
\maketitle
%
% It can't hurt to remind everyone involved exactly what an Experience
% Report is.
%
%
\begin{abstract}
%An Experience Report provides {evidence} that functional
%programming really works, or it describes obstacles that prevent
%functional programming from working.
Haskell gives computational biologists
the flexibility and rapid prototyping of a scripting
language, plus the performance of native code.
In~our experience, higher-order functions, lazy evaluation, and
monads really worked, but
profiling and debugging presented obstacles.
Also, Haskell libraries vary greatly:
memoization combinators and parallel-evaluation
strategies helped us a lot,
but other, nameless libraries mostly got in our way.
Despite the obstacles and the uncertain quality of some libraries,
Haskell's ecosystem
made it easy for us to develop new algorithms in computational
biology.
\end{abstract}
\begingroup
\ifpagetuning
\makeatletter
\@paragraphaboveskip = 0.5\@paragraphaboveskip
\makeatother
\fi
\category{D.1.1}{Applicative (Functional) Programming}{}
\category{J.3}{Biology and genetics}{}
% \category{CR-number}{subcategory}{third-level}
%
% \terms
% term1, term2
%
\keywords
memoization, stochastic search, parallel strategies, QuickCheck,
remote homology detection
{\raggedright\par}
\endgroup
\section{Introduction}
Computational biologists write software that answers questions about
sequences of nucleic acids (genomic data) or sequences of amino
acids (proteomic data).
When performance is paramount,
software is usually written in C~or~C++.
When convenience, readability, and
productivity are more important,
software is usually written in a dynamically typed
or domain-specific language like
Perl, Python, Ruby, SPSS, or~R.
In~this paper, we report on experience using a third kind of language,
Haskell:
\begin{itemize}
\item
We had to reimplement an
algorithm already implemented in~C++,
and the Haskell code is slower.
But the Haskell code was easy to write, % coherent subject!
clearly implements the underlying mathematics
(\secref{viterbi}),
was easy to parallelize,
and performs well enough
(\secref{perf}).
\ifpagetuning
And our new tool solves a problem that could not be solved by
the~C++ tool which preceded~it.
\else
\ifpagetuning
And overall,
\else
Moreover, measured as a whole,
\fi
the software that incorporates our new
code
performs significantly
better than the~C++ software that preceded~it.
\fi
\item
Higher-order functions made it unusually easy to
create and experiment with new stochastic-search algorithms
(\secref{hofs}).
\item
Haskell slowed us down in only one area: understanding and
improving
performance (\secref{awkward-profiling}).
%% \item
%% The only significant impediment that Haskell presented to our
%% research was the difficulty of understanding and improving the
%% performance of Haskell programs (\secref{awkward-profiling}).
\item
Although the first two authors are computational
biologists
with little functional-programming experience,
Haskell made it easy for us to explore new research ideas.
By~contrast,
our group's C++ code has
made it hard to explore
% been a good vehicle for exploring
new ideas
(\secref{comparo}).
\item
The Haskell community offers libraries and tools that
promise powerful abstractions.
Some kept the promise, saved us lots of effort, and were a pleasure
to~use.
Others, not so much.
We couldn't tell in advance which
\ifpagetuning
would be which
\else
experience we were likely to have
\fi
(\secref{penumbra}).
\looseness=-1
\end{itemize}
\section{The biology}
Proteins, by
interacting with one another and with other
molecules, carry out the functions of living cells: metabolism, regulation,
signaling, and so on.
A~protein's function is determined by its structure,
and its structure is determined by the sequence of amino acids that
form the protein.
The amino-acid sequence is ultimately determined by a sequence of
nucleic acids in DNA, which we call a gene.
Given a gene, biologists wish to know the cellular
function of the protein the gene codes for.
One of the best known methods of discovering such function is
to find other proteins of
similar structure, which likely share similar function.
Proteins that share structure and function are expected to be
descended from a common ancestor---in biological terms, \emph{homologous}---and
thus
the problem of identifying proteins similar to a \textit{query sequence} is called
\textit{homology detection}.
Computational biologists detect homologies by building
algorithms which, given a {query sequence}, %% of amino acids,
compare it with known proteins.
When the known proteins have amino-acid sequences that
are not too different from the query sequence, homology can be
detected by
a family of algorithms called
\textit{hidden Markov models}~\cite{Eddy:1998ut}.
But in real biological systems,
proteins with similar structure and function may be formed from significantly
different amino-acid sequences, which are not close in edit distance.
Our~research software, MRFy (pronounced
``Murphy''), can detect homologies
in amino-acid sequences that are only distantly related.
MRFy is available at \url{mrfy.cs.tufts.edu}.
%
%We will explain an
%algorithm for detecting reasonably similar sequences, and then move on to
%explain MRFy, our novel approach for detecting more distantly related sequences
%for proteins that share similar structure and function.
%
%
% I hope the new intro renders these two sentences unnecessary. ---NR
%
%
%%
%% To~enable you to understand our experience with Haskell,
%% we~sketch not only the MRFy algorithm (\secref{mrfy})
%% but also one of the older hidden-Markov algorithms (\secref{viterbi}),
%% which is incorporated into~MRFy.
\section{The software}
Homology-detection software is most often used in one of two ways:
to test a hypothesis about
the function of a single, newly discovered protein, or
to compare every protein in a genome against a library of known protein
structures.
Either way,
the software is \emph{trained}
on a group of proteins that share function and structure.
These proteins are identified by a biologist, who puts
their amino-acid sequences into an
\emph{alignment}.
This alignment relates individual amino acids in a set of homologous proteins.
An~alignment may be represented as a matrix
in which each row corresponds to the amino-acid sequence of a protein,
and each column groups amino acids that play similar roles in
different proteins (Figure~\ref{alignment}).
An alignment may contain \emph{gaps}, which in
\figref{alignment} are shown as dashes.
A~gap in row~2, column~$j$ indicates that as proteins evolved, either
protein~2 lost its amino acid in position~$j$, or
other proteins gained an amino acid in position~$j$.
If~column~$j$ contains few gaps,
it~is considered a \emph{consensus column},
and the few proteins with gaps probably lost amino acids via
\emph{deletions}.
%% \remark{NMD: Missing is the fact that all these models
%% are directionless; they don't care whether something was gained or lost
%% over time. But this doesn't matter.}
If~column~$j$ contains \emph{mostly} gaps,
it~is considered a \emph{non-consensus column},
and the few proteins without gaps probably gained amino acids via
\emph{insertions}.
Once a protein alignment is constructed, it~is used to train a
\emph{hidden Markov model}.
A~hidden Markov model is a probabilistic finite-state machine which
can assign a probability to any query sequence.
A~protein whose query sequence has a higher probability is more likely to %
be homologous to the proteins in the alignment.
We~write a query sequence as $x_1, \ldots, x_{\scriptscriptstyle N}$,
where each $x_i$~is
an amino acid.
The number of amino acids, $N$, can differ from the number of columns
in the alignment,~\alignwidth.
\begin{figure}
\ifpdfmadness
\centerline{\includegraphics[width=6cm]{alignment.pdf}}
\else
\centerline{\includegraphics[width=6cm]{alignment.eps}}
\fi
\caption{A~structural alignment of four proteins $(\alignwidth = 12)$}
\label{alignment}
\end{figure}
A~hidden Markov model carries probabilities on some states and on all
state transitions.
Both the probabilities and the states are determined by the alignment:
\begin{itemize}
\item
For each column~$j$ of the alignment, the hidden Markov model has a
\emph{match state}~$M_j$.
The match state contains a table $e_{M_j}(x)$ which gives the
probability that a homologous protein has amino acid~$x$ in
column~$j$.
\item
For each column~$j$ of the alignment, the hidden Markov model has an
\emph{insertion state}~$I_j$.
The insertion state contains a table $e_{I_j}(x)$ which gives the
probability that a homologous protein has gained amino acid~$x$ by
insertion at column~$j$.
\item
For each column~$j$ of the alignment, the hidden Markov model has a
\emph{deletion state}~$D_j$.
The deletion state determines the probability that a homologous protein
has lost an amino acid by deletion from column~$j$.
\end{itemize}
The probabilities $e_{M_j}(x)$ and $e_{I_j}(x)$ are \emph{emission probabilities}.
A hidden Markov model also has distinguished ``begin'' and ``end''
states.
%%
%% for consistency with the text above 'begin' and 'end' are neither
%% capitalized nor italicized.
%%
In~our representation, each state contains a probability or a table of
probabilities, and it is also labeled with one of these labels:
\smallverbatiminput{statelabel}
We~use the ``Plan7'' hidden Markov model, which forbids direct
transitions between insertion
states and deletion states \cite{Eddy:1998ut}.
\ifpagetuning
``Plan7'' implies that
\else
The model gets its name because
\fi
there are exactly~7 possible transitions
into the states of any column~$j$.
Each transition has its own probability:
\begin{itemize}
\item
A~transition into a match state
is more likely when column~$j$ is a consensus column.
Depending on the predecessor state,
the probability of such a transition is
$\txprobj M M$, $\txprobj I M$, or~$\txprobj D M$.
\item
A~transition into a deletion state
is more likely when column~$j$ is a non-consensus column.
The probability of such a transition is
$\txprobj M D$~or~$\txprobj D D$.
\item
A~transition into an insertion state
is more likely when column~$j$ is a non-consensus column.
The probability of such a transition is
$\txprobjj M I$~or~$\txprobjj I I$.
\end{itemize}
\begin{figure}
\ifpdfmadness
\centerline{\includegraphics[width=8cm]{Plan7.pdf}}
\else
\centerline{\includegraphics[width=8cm]{Plan7.eps}}
\fi
This model
has begin and end states $B$~and~$E$,
as well as four
nodes, each containing
an insertion state~$I$,
a match state~$M$, and a
deletion state~$D$.
\caption{A hidden Markov model $(\alignwidth = 4)$}
\label{plan7} \end{figure}
\begin{figure*}
\newcommand\vsum[2]{#2&{}+{}& #1}
\def\goo{18pt}
\def\gum{14pt}
\[
\def\maxiquad{\hskip 1.2em\relax}
%\mskip -5mu
\begin{array}{@{}l@{}c@{}l}
V_{j}^{M}(i) &{}={}& \log\frac{e_{M_{j}}(x_{i})}{q_{x_{i}}} + \max \left\{
\begin{array}{l@{}c@{}l}
\vsum{V_{j-1}^{M}(i - 1)} {\log a_{M_{j-1}M_{j}}}\\
\vsum{V_{j-1}^{I}(i - 1)} {\log a_{I_{j-1}M_{j}}}\\
\vsum{V_{j-1}^{D}(i - 1)} {\log a_{D_{j-1}M_{j}}}\\
\end{array} \right.\\[\goo]
V_{j}^{I}(i) &=& \log\frac{e_{I_{j}}(x_{i})}{q_{x_{i}}} + \max \left\{
\begin{array}{l@{}c@{}l}
\vsum{V_{j}^{M}(i - 1)} {\log a_{M_{j}I_{j}}}\\
\vsum{V_{j}^{I}(i - 1)} {\log a_{I_{j}I_{j}}}\\
\end{array} \right.\\[\gum]
V_{j}^{D}(i) &=& \max \left\{
\begin{array}{l@{}c@{}l}
\vsum{V_{j-1}^{M}(i)} {\log a_{M_{j-1}D_{j}}}\\
\vsum{V_{j-1}^{D}(i)} {\log a_{D_{j-1}D_{j}}}\\
\end{array} \right.\\
\end{array}
%
\hskip 1.5em\relax % \qquad too big
%
\begin{array}{l@{}c@{}l@{}}
V_{j}^{\prime M}(i) &{}={}& e^{\prime}_{M_{j}}(x_{i}) + \min \left\{
\begin{array}{l@{}c@{}l}
\vsum{V_{j-1}^{\prime M}(i - 1)} {a^{\prime}_{M_{j-1}M_{j}}}\\
\vsum{V_{j-1}^{\prime I}(i - 1)} {a^{\prime}_{I_{j-1}M_{j}}}\\
\vsum{V_{j-1}^{\prime D}(i - 1)} {a^{\prime}_{D_{j-1}M_{j}}}\\
\end{array} \right.\\[\goo]
V_{j}^{\prime I}(i) &=& e^{\prime}_{I_{j}}(x_{i}) + \min \left\{
\begin{array}{l@{}c@{}l}
\vsum{V_{j}^{\prime M}(i - 1)} {a^{\prime}_{M_{j}I_{j}}}\\
\vsum{V_{j}^{\prime I}(i - 1)} {a^{\prime}_{I_{j}I_{j}}}\\
\end{array} \right.\\[\gum]
V_{j}^{\prime D}(i) &=& \min \left\{
\begin{array}{l@{}c@{}l}
\vsum{V_{j-1}^{\prime M}(i)} {a^{\prime}_{M_{j-1}D_{j}}}\\
\vsum{V_{j-1}^{\prime D}(i)} {a^{\prime}_{D_{j-1}D_{j}}}\\
\end{array} \right.\\[\gum]
%
\multicolumn3{l@{}}{%
a'_{s \hat{s}} = - \log a_{s \hat{s}}
\qquad
e^{\prime}_{s}(x) = - \log\frac{e_{s}(x)}{q_{x}}
\qquad
V_j^{\prime M}(i) = - V_j^{M}(i)
}
\\
\end{array}
%\mskip -5mu
\]
\caption{Viterbi's equations, in original and negated forms}
\label{viterbi}
\end{figure*}
\subsection{Computing probabilities using perspicuous Haskell}
\seclabel{viterbi}
Given a hidden Markov model,
an~established software package called HMMER (pronounced ``hammer'')
can compute the probability
that a new protein shares structure
\ifnotcutting and evolutionary history \fi
with the proteins used to train the model.
The computation finds the most likely path through the hidden Markov model.
To~make best use of floating-point arithmetic, the software computes
the \emph{logarithm} of the probability of each path, by summing
the logs of the
probabilities on the states and edges of the path \cite{Viterbi:1967hq}.
The~path that maximizes the
log of the probability is the most likely path.
The computation is specified on the left-hand side of \figref{viterbi}.
A~probability $V_j^M(i)$ represents the probability of the most
likely path of the first~$i$ amino acids in the query sequence,
terminating with placement of amino acid $x_i$ in state~$M_j$.
Probabilities $V_j^I(i)$ and $V_j^D(i)$ are similar.
The
equations are explained very clearly by
\citet[Chapter~5]{Durbin:1998wz}.
To~be able to use Haskell, we had to reimplement the standard
algorithm for solving Viterbi's equations.
Haskell made it possible for us to write code that looks like the
math,
which made the code easy to write and gives us confidence that it is
correct.
Our code represents a query sequence as an immutable array of amino
acids.
In~idiomatic Haskell,
we might represent an individual amino acid~$x_i$
using a value of algebraic data type:
\begin{smallverbatim}
data Amino = Ala | Cys | Asp | Glu | ... -- not used
\end{smallverbatim}
But our models use a
legacy file format in which each amino acid is a small integer
used only to index arrays.
%which we use only as an array index.
We therefore chose
\smallverbatiminput{aa}
%% \remark{I'm not sure if this is the whole story, since the reader might simply
%% ask, ``Why not map 0 to Ala, 1 to Cys, etc.?''. The real reason we keep the
%% ``Int'' here I think, is that it provides easy indexing into the match and
%% insertion emission vectors.}
Our legacy file format also \emph{negates}
the log of each probability, making it a positive number.
%%To~make these numbers more pleasant to read, we negate them,
%%which eliminates hordes of minus signs from our file formats.
The negated logarithm of a probability is called a \emph{score}.
\smallverbatiminput{score}
Type \texttt{Score} has a limited \texttt{Num} instance which permits
scores to be added and subtracted but not multiplied.
In a real hidden Markov model, each probability is represented as a score.
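For concreteness, a definition consistent with this description might
look as follows (an illustrative sketch only, assuming scores are stored
as \texttt{Double}; the actual definition is in the listing above):
\begin{smallverbatim}
newtype Score = Score Double
  deriving (Eq, Ord)

instance Num Score where
  Score x + Score y = Score (x + y)
  Score x - Score y = Score (x - y)
  negate (Score x)  = Score (negate x)
  fromInteger n     = Score (fromInteger n)
  (*)    = error "scores (negated log probabilities) cannot be multiplied"
  abs    = error "abs is not meaningful for scores"
  signum = error "signum is not meaningful for scores"
\end{smallverbatim}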
Our code implements a transformed version of Viterbi's equations
which operates on scores.
The transformed equations are
shown on the right-hand side of \figref{viterbi}.
They minimize the
score (the negated log probability) for each combination of
column~$j$, amino
acid~$x_i$, and state $M_j$, $I_j$, or~$D_j$.
A~model is represented as a sequence of \emph{nodes};
node~$j$ includes states $M_j$, $I_j$, and~$D_j$,
as well as the probabilities of transitions out of that node.
Each node contains tables of emission scores
$e'_{M_j}$~and~$e'_{I_j}$.
These tables are read by function
\texttt{eScore}, whose specification
is % parallel structure for line breaking below
\mbox{$\mathtt{eScore}\;s\;j\;i = e'_{s_j}(x_i)$}.
%
We place the transition probabilities
into a record in which each field is labeled
\texttt{$s$\_$\mskip 1mu\hat s$},
where $s$~and~$\hat s$ form one of the 7~permissible pairs of state
labels:
\smallverbatiminput{tprob-tprobs}
These scores are read by
function \texttt{aScore},
whose specification is
\mbox{$\mathtt{aScore}\;s\;\hat s\;(j-1) = \txprobj s {\hat s}$}.
%% DROP THIS PARA??
%% Nodes are numbered, and the representation of a node includes
%% the tables
%% of emission probabilities for the match and insertion states,
%% plus
%% the
%% probabilities of transitions
%% into the states of that node.
%% \smallverbatiminput{hmmnode}
Scores can be usefully attached to many types of values,
so we have defined a small abstraction:
\smallfuzzverbatiminput{10.8pt}{vscore}
Think of a value of type \mono{Scored~a} as a container holding
an~``\texttt a'' with a score written on the side.
The \texttt{/+/} function adds to the score without touching the
container.
Function \texttt{fmap} is also defined; it~applies a function to a
container's contents.
%
Finally, we~made \texttt{Scored} an instance of~\texttt{Ord}.
Containers are ordered by score alone, so applying
\texttt{minimum} to a list of scored things chooses the thing with the
smallest (and therefore best) score.
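The operations just described might be realized roughly as follows
(an illustrative sketch; field names and the argument order of
\texttt{/+/} are our guesses, and the actual definition is in the
listing above):
\begin{smallverbatim}
data Scored a = Scored { unScored :: a, scoreOf :: Score }

(/+/) :: Scored a -> Score -> Scored a
Scored a s /+/ s' = Scored a (s + s')   -- adjust the score only

instance Functor Scored where
  fmap f (Scored a s) = Scored (f a) s  -- adjust the contents only

instance Eq (Scored a) where
  x == y = scoreOf x == scoreOf y
instance Ord (Scored a) where           -- order by score alone
  compare x y = compare (scoreOf x) (scoreOf y)
\end{smallverbatim}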
Armed with our models and with the \texttt{Scored} abstraction, we
attacked Viterbi's equations.
The probability in each state is a function of the probabilities
in its predecessor states,
and all probabilities can be computed by a classic dynamic-programming
algorithm.
This algorithm starts at the {begin} state,
computes probabilities in nodes $1$~through~$\alignwidth$ in
succession, and terminates at the {end} state.
One of us implemented this algorithm, storing the probabilities in an array.
The cost was
$O(N\times\alignwidth)$;
in~MRFy, $\alignwidth$~and~$N$ range from several hundred to a few
thousand.
%% For
%% reasons including floating point underflow, the HMMER software (with which we
%% maintain file format compatibility) stores all probabilities in a trained HMM
%% file as negative natural logs.
%% Thus, the Viterbi recurrence relations are
%% simplified from the form in \ref{viterbi_eqn} to that in \ref{viterbi_log_eqn},
%% and because they are \textit{negative} logs, the problem transforms from
%% maximization to minimization.
\seclabel{cons}
\seclabel{vee-prime}
Another of us was curious to try coding Viterbi's equations
directly as recursive functions.
Like a recursive Fibonacci function, Viterbi's functions,
when implemented \naive ly,
take exponential time.
But like the Fibonacci function, Viterbi's functions can be
\emph{memoized}.
For example, to compute $V_j^{\prime M}(i)$ using the equation at the top
right of \figref{viterbi}, we define
\mbox{\texttt{vee\char`\'} \texttt{Mat} $j$ $i$}.
The equation adds~$e^{\prime}_{M_{j}}(x_{i})$, computed
with \texttt{eScore}, to a minimum of sums.
The~sum of an~$a'_{s\hat s}$~term and a $V^{\prime s}_{j-1}(i-1)$ term
is computed by function \texttt{avSum}, in which
the terms are computed by \texttt{aScore} and \texttt{vee''}, respectively:
\smallfuzzverbatiminput{2.9pt}{vfix}
%%%%%%%%%% horrible line break
% The right-hand side of \verb+vee'+ is wrapped in a call to
What about the call to
\mono{\mbox{fmap (Mat `cons`)}}?
This call performs a computation \emph{not} shown in \figref{viterbi}:
\mrfy\ computes not only the {probability} of the most likely path but
also the path itself.
Function \mono{(Mat `cons`)} adds~$M$ to a path;
we~avoid \mbox{\mono{(Mat :)}} for reasons explained
in \secref{performance} below.
Function~\verb+vee''+ is the {memoized} version of~\verb+vee'+.
Calling~\verb+vee''+ produces the same result as calling~\verb+vee'+,
but faster:
\smallverbatiminput{memo}
Functions \texttt{Memo.memo3} and \texttt{Memo.arrayRange} come from
Luke Palmer's
\path{Data.MemoCombinators} package.
The value
\texttt{numNodes} represents~$\alignwidth$,
and \texttt{seqlen} represents~$N$.
Memoization makes \verb+vee'+ perform as well as our classic
dynamic-programming code.
And~the call to \texttt{Memo.memo3} is the \emph{only} part of the code
devoted to dynamic programming.
By~contrast, standard implementations of Viterbi's algorithm, such as in HMMER,
spend much of their code
managing dynamic-programming tables.
Haskell enabled us to write simple, performant code with little effort.
%
Because the memoized version so faithfully resembles the equations in
\figref{viterbi}, we~retired the classic version.
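The same pattern can be seen in miniature in a memoized Fibonacci
function (illustrative only, not \mrfy\ code): the recurrence
\verb+fib'+ recurses through its memoized twin \verb+fib''+, just as
\verb+vee'+ recurses through \verb+vee''+.
\begin{smallverbatim}
import qualified Data.MemoCombinators as Memo

fib'' :: Int -> Integer      -- memoized version of fib'
fib'' = Memo.arrayRange (0, 10000) fib'

fib' :: Int -> Integer       -- the bare recurrence
fib' n | n < 2     = fromIntegral n
       | otherwise = fib'' (n - 1) + fib'' (n - 2)
\end{smallverbatim}
Because \verb+fib''+ is a top-level value applied to no arguments, its
memo table is shared by every call, so each value of the recurrence is
computed at most once.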
%%
%%
%% In this, we were grateful for the resemblance
%% between the mathematical description of the algorithm and the top-down
%% dynamic-programming approach in Haskell, which resulted in perspicuous code.
%%
%%
\subsection{Exploring new algorithms using higher-order functions}
\seclabel{hofs}
\seclabel{mrfy}
We use Viterbi's algorithm to help detect
homologies in proteins with specific kinds of structure.
When a real protein folds in three dimensions,
amino acids
that are far away in the one-dimensional sequence can be
adjacent in three-dimensional space.
Some groups of such acids are called \emph{beta strands}.
Beta strands
can be hydrogen-bonded to each other,
making them ``stuck together.''
These beta strands help identify groups of homologous
proteins.
MRFy detects homologous proteins that include hydrogen-bonded beta
strands; using prior methods, many instances of this problem are
intractable.
Beta strands require new equations and
richer models of protein structure.
%% The~richer model recognizes that only certain pairs of amino acids are
%% able to bond to each other.
%% The presence of beta strands changes both the natu
%% the probability that a given
%% query sequence fit
%% A~mutation in a paired beta strand may break the pairing, in which
%% case the mutated protein will not fold properly and is unlikely to
%% survive natural selection.
%% But a mutation in a paired beta strand \emph{can} survive if it is
%% accompanied by a \emph{compensatory} mutation in the strand with which it is
%% paired.
When column~$j$ of an alignment is part of a beta strand and is paired
with another column $\pairedwith j$,
the probability of finding amino acid~$x_i$ in column~$j$
\ifpagetuning
depends on the amino acid~$x'$ in column~${\pairedwith j}$.
\else
depends on whatever amino acid~$x'$ is in column~${\pairedwith j}$.
\fi
If~$x'$ is in position~$i'$ in the query sequence, Viterbi's
equations are altered; for example,
$V_{j}^{\prime M}(i)$ depends not only on
$V_{j-1}^{\prime M}(i-1)$ but also on
$V_{\pairedwith j}^{\prime M}(i')$.
%% \remark{I can't get an equation I
%% believe in. ---NR I think it's fine, unless we want to introduce
%% notation to distinguish paired amino acids from paired
%% model nodes --ND}
The distance between $j$~and~$\pairedwith j$ can be as small as a few
columns or as large as a few hundred columns.
Because $V_j^{\prime M}(i)$~depends not only on nearby values but also on
$V_{\pairedwith j}^{\prime M}(i')$,
dynamic programming cannot compute the maximum likelihood quickly
\cite{Menke:2010ti,Daniels:2012}.
The new equations are accompanied by a new model.
Within a beta strand, amino acids are not inserted or deleted, so~a
bonded pair of beta strands is modeled by
a pair of sequences of match states.
Between beta strands, the model is structured as before.
The~combined model, an~example of which is shown in \figref{mrf}, is called a
\textit{Markov random field}.
%
% N.B. very tight paragraph...
%
MRFy treats the beta
strands in the model
as ``beads'' which can slide along the query sequence.
A~positioning of the beta strands is called
a \emph{placement}.
A~placement's likelihood is computed
based on frequencies of amino-acid pairs observed in
hydrogen-bonded beta strands~\cite{Cowen:2002p588}.
Given a placement, the maximum likelihood of the rest of
the query sequence, between and around beta strands,
is computed quickly and exactly
using Viterbi's algorithm.
\ifpagetuning
This likelihood is \emph{conditioned} on the placement.
\else
The exact result is a probability that is
\emph{conditioned} on the placement.
\fi
MRFy searches for likely placements stochastically.
\mrfy\ implements
random hill
climbing, simulated
annealing, multistart simulated annealing, and a genetic algorithm.
These algorithms share much code, and \mrfy\ implements them using
higher-order functions,
existentially
quantified types, and lazy evaluation.
%
% placement at top left column is not ideal, but right column results
% in a bad page break elsewhere
%
\begin{figure}
\ifpdfmadness
\centerline{\includegraphics[width=7.8cm]{mrf_interleave_diagram.pdf}}
\else
\centerline{\includegraphics[width=7.8cm]{mrf_interleave_diagram.eps}}
\fi
Each shaded node represents a beta-strand position.
Nodes connected by dashed edges are hydrogen-bonded.
\caption{A Markov random field with two beta-strand pairs}
\label{mrf}
\end{figure}
We~describe \mrfy's search abstractly: \mrfy\ computes a sequence
of \emph{points} in a search space.
The type of point is existentially quantified, but it is typically
a single placement or perhaps a population of placements.
Each point also has a \texttt{Score}; \mrfy\ looks for points with
good scores.
Ideally, MRFy would use the now-classic, lazy, modular
technique advocated by \citet{hughes:why}, in which one function
computes an infinite sequence of points, and another function
uses a finite prefix to decide on an approximation.
But because MRFy's search is stochastic,
making MRFy's search modular is not so easy.
To illustrate the difficulties, we~discuss our simplest search:
random hill climbing.
%The space of beta-strand placements has no notion of slope or
%gradient,
From any given point in the search space, this search moves a random
distance in a random
direction.
If~the move leads to a better point,
we call it \emph{useful};
otherwise it is \emph{useless}.
\smallverbatiminput{utility}
(We~also use \texttt{Useful} and \texttt{Useless} to tag \emph{points}.)
With luck, an~infinite sequence of useful moves converges
at a local optimum.
\mrfy's search path follows only useful moves;
if a move is useless,
\mrfy\ abandons it and moves again (in a new random direction) from the
previous point.
Ideally, \mrfy\ would search by composing a \emph{generator}
that produces an infinite sequence of moves,
a \emph{filter} that selects the useful moves, and
a \emph{test} function that enumerates finitely many useful moves
and returns the final destination.
But a generator may produce an infinite sequence of useless moves.
(For example, if \mrfy\ should stumble upon a global optimum, every move from
that point would be useless.)
Given an infinite sequence of useless inputs, a filter would not
produce any values, and the search would diverge.
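In code, the fragile composition is essentially this
(illustrative only; \texttt{useful} stands for whatever predicate
identifies a useful move):
\begin{smallverbatim}
-- If 'moves' is eventually all useless, 'filter' yields nothing
-- further, and forcing the result beyond that point diverges.
naiveSearch :: Int -> (a -> Bool) -> [a] -> [a]
naiveSearch n useful moves = take n (filter useful moves)
\end{smallverbatim}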
We~address this problem by combining ``generate and filter'' into a
single abstraction, which has type \mono{SearchGen pt r}.
Type variable~\texttt{pt} is a point in the search
space, and \texttt{r} is a
random-number generator.
\mono{Rand~r} is a lazy monad of stochastic computations:
\smallverbatiminput{gen.tex}
The monadic computation \texttt{pt0} randomly selects a starting point
for search;
\texttt{nextPt} produces a new point from
an existing point.
Because scoring can be expensive, both \texttt{pt0}
and \texttt{nextPt} use \emph{scored} points, and they can
reuse scores from previous points.
To tell if a point returned by \texttt{nextPt} is useful, we call
the \texttt{utility} function,
which scrutinizes a move represented as follows:
\smallverbatiminput{move}
The decision about utility uses not only a source of randomness but
also the \emph{cumulative cost} of the point, which we define to be
the number of points explored previously.
The~cumulative cost of the current point is also the age of the
search,
and
in~simulated annealing, for example, as the search ages,
the \texttt{utility} function becomes less likely to accept a move
that worsens the score.
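Putting these pieces together, the record might have roughly the
following shape (an illustrative sketch; the actual definition appears
in the earlier listing, and \texttt{Age} and the exact signature of
\texttt{utility} are our guesses):
\begin{smallverbatim}
type Age = Int     -- cumulative cost: number of points explored so far

data SearchGen pt r = SearchGen
  { pt0     :: Rand r (Scored pt)                -- random starting point
  , nextPt  :: Scored pt -> Rand r (Scored pt)   -- random move from a point
  , utility :: Move pt -> Age
            -> Rand r (Utility (Scored pt))      -- keep or abandon a move?
  }
\end{smallverbatim}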
\noindent
\vbox{
Using these pieces, function \texttt{everyPt} produces an infinite
sequence containing a mix of useful and useless moves:
\smallfuzzverbatiminput{10.8pt}{everygen}
}
\noindent
Both \texttt{nextPt} and \texttt{utility} are monadic, but
we can still exploit laziness:
from its starting point, \texttt{everyPt} produces an infinite list of
randomly chosen successor points, then calls \texttt{costedUtility} to
tag each one with a cumulative cost and a utility.
We~hope that if you look carefully at how \texttt{successors} is computed,
you will understand why we
separate \texttt{pt0}~from~\texttt{nextPt} instead of using a single
function that produces an infinite list:
We~don't \emph{want} the infinite list that would result from
applying \texttt{nextPt} to many points in succession;
we~want the infinite list that results from applying \texttt{nextPt}
to \texttt{startPt} many times in succession, each time with a
different source of randomness.
Once the successors have been computed and tagged,
\texttt{span} finds the first useful successor.
In~case there \emph{is} no useful successor, \texttt{everyPt} also returns
all the useless successors.
If~we do find a useful successor, we start searching anew from that
point, with a recursive call to \texttt{everyPt}.
(Because \texttt{everyPt} is monadic, the points accumulated so far are
appended to its result using the \verb+<$>+~operator.)
%
The most informative part of \texttt{everyPt} is the last expression of
the \texttt{do} block,