//// TODO: Make Unconditional Baldi / Siegel faster by precomputing and saving the score coefficients at each position...
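A bare-bones sketch of what that precompute-and-save idea might look like (Python, mine only; the names profile, sequences, and score_coefficient_at are placeholders, not identifiers from this code):

import numpy as np

def precompute_score_coefficients(profile, sequences, score_coefficient_at):
    # Compute the coefficient for every profile position once and save it,
    # so the unconditional update can reuse the cached values instead of
    # recomputing them at every trial step.
    num_positions = len(profile)
    coefficients = np.empty(num_positions)
    for pos in range(num_positions):
        coefficients[pos] = score_coefficient_at(profile, sequences, pos)
    return coefficients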
In the code, I've been calling this "Baldi / Siegel" to honor Andy Siegel for his inspiration and for our work together on this early on; I'm sure it was he who suggested the particular hill-climbing approach. Really, without him, not only would this not have happened, but I would never have gone in this direction, and it's very unlikely I'd have become a statistician. He deserves equal credit for the original method. I recall him once saying that maybe one day we'll know the gradient analytically, but for now it suffices to sample to determine the gradient.

Later -- much later -- I realized that the Baldi-Chauvin paper does derive the exact analytical gradient we were after. Luckily, what Baldi's method seemed to lack was a determination of the distance to move in the direction of the gradient. They did provide one, but it never gained traction. (I have never been able to get it to work with the suggested step length; this could just be a matter of finding the correct learning rate, but it seems the learning rate may have to vary by context. It is implemented here as "Baldi", though I'm no longer maintaining it.)

I had meanwhile discovered that the Baum-Welch approach worked well as a conditional algorithm, better than as an unconditional one, but I always wondered whether the original method would do as well, or even better; I'd already convinced myself that for the original method, the conditional approach would be superior to the unconditional one. It turns out that it does -- at least for amino acids. For DNA, right now it's not as clear as I'd originally thought it would be. It seems that the setting I had been using (which limits the finding-the-peak part to one step) was getting trapped where CBW wasn't -- but in a single data point that I used to test it, I observed that raising that limit (I set it to 1000, which will not be reached before the "close enough" criterion kicks in) makes the method work much better (tied with CBW, in that one instance) -- but alas, for the unconditional variant it is disastrously slow. In initial results it seems that, at least at the lowest conservation level (.3, which of course is extremely difficult), there are still cases of CBW getting out of traps that hold CBS (aka CQA).
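To make the step-limit point above concrete, here is a rough Python sketch (mine, not the actual code here) of a finding-the-peak loop along the gradient direction, where a cap of 1 reproduces the one-step setting and a high cap like 1000 lets the "close enough" criterion end the loop. The objective, gradient, and step size below are toy placeholders.

import numpy as np

def climb_along_gradient(score, grad, theta, step_size=0.1,
                         max_peak_steps=1000, close_enough=1e-6):
    # Hill-climb along the (fixed) gradient direction at theta, taking up to
    # max_peak_steps trial steps and stopping once the improvement in score
    # falls below the close_enough tolerance.
    direction = grad(theta)
    best_score = score(theta)
    for _ in range(max_peak_steps):
        candidate = theta + step_size * direction
        candidate_score = score(candidate)
        if candidate_score <= best_score + close_enough:
            break  # no meaningful improvement: treat this as the peak
        theta, best_score = candidate, candidate_score
    return theta, best_score

# Toy usage on a concave quadratic, just to show the two settings:
score = lambda t: -np.sum((t - 1.0) ** 2)
grad = lambda t: -2.0 * (t - 1.0)
theta0 = np.zeros(3)
print(climb_along_gradient(score, grad, theta0, max_peak_steps=1))
print(climb_along_gradient(score, grad, theta0, max_peak_steps=1000))

With max_peak_steps=1 the move ends after a single step whether or not the score is still improving; with a high cap it is the close_enough tolerance that actually stops the loop.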
The other major discovery, which is pretty overwhelming honestly, is to start with even emission distributions rather than uniform ones (that is, rather than drawing them uniformly at random). Or (as I do) draw from a Dirichlet prior with a constant high value for the parameters. I use 100 for all of them, which still yields some real variation (values in the .23-.27 range), but nothing entrapping. With nearly-even starting values, the parameters are much more consistently trained in all methods and across methods.
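As an illustration (my own Python sketch, not code from this repo), here is the Dirichlet start with all parameters at 100 for a 4-symbol DNA alphabet, next to draws from a Dirichlet with all parameters at 1 (uniform over the simplex) for contrast:

import numpy as np

rng = np.random.default_rng(0)
num_positions = 5

# All concentration parameters set to 100: draws hover near 0.25
# (roughly the .23-.27 range mentioned above), so there is some real
# variation but nothing entrapping.
near_even = rng.dirichlet([100.0] * 4, size=num_positions)

# All parameters set to 1 (uniform over the simplex): draws vary wildly,
# which is the kind of start that can trap training.
highly_varied = rng.dirichlet([1.0] * 4, size=num_positions)

print(np.round(near_even, 3))
print(np.round(highly_varied, 3))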