Bleualign scales badly #9
On a quick reading, I fail to see any fundamental difference from the
regular Wagner–Fischer implementation of Levenshtein edit distance. In
fact, Hirschberg's algorithm seems to be a variation of the
Needleman–Wunsch algorithm, which in turn looks a lot like a clever
trick on top of Levenshtein edit distance to store only two rows or two
columns (whichever is smaller). Time complexities are O(nm) for all
three algorithms (unless one assumes that alignments fall on a narrow diagonal
strip, in which case the complexity drops to O(m) or O(n), whichever is smaller).
On 19/8/20 at 18:43, Jelmer wrote:
https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm ?
-- Mikel L. Forcada, Universitat d'Alacant
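For concreteness, here is the two-row trick in a minimal form (a sketch with illustrative names, not code from this repository). Note that this distance-only version is exactly what gives up the alignment: in O(min(n,m)) space you get the score, but without back pointers you cannot reconstruct the path, which is the gap Hirschberg's divide & conquer fills.

```cpp
// Two-row Levenshtein: O(n*m) time, O(min(n,m)) space. Illustrative sketch.
#include <algorithm>
#include <string>
#include <vector>

int edit_distance(const std::string& a, const std::string& b) {
    // Recurse once so the inner dimension (and thus the rows) is the smaller.
    if (a.size() < b.size()) return edit_distance(b, a);
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1,    // deletion
                               cur[j - 1] + 1, // insertion
                               sub});          // match / substitution
        }
        std::swap(prev, cur);  // the two rows are all the state we keep
    }
    return prev[b.size()];
}
```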
Indeed, it doesn't change the time complexity, but it might help bring the runtime down on beefier machines by using them a bit more optimally. I'm running into this on CSD3, where I can run a couple of bleualign instances in parallel using GNU parallel, but am still limited by the runtime and memory usage of each individual one.

I think it comes down to this: my two main issues with the current algorithm for searching for alignments are that it cannot scale by throwing more hardware at it. The memory usage is quadratic in the input lengths, and the runtime is single-threaded.

The memory issue is due to the back-pointer array that's used to find the shortest "edit distance", the one filled by process() and read by extract_matches(). As far as I understand Hirschberg, no back-pointer matrix is necessary, as it is more of a divide & conquer approach, remembering the shortest path only for sections of the search space.

The runtime issue I cannot solve as-is, because in the loop that fills the back-pointer matrix each cell depends on its previously calculated neighbours. Hirschberg, again, splits each task into subtasks that don't depend on each other, so it would be much simpler to compute those in parallel.
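To make the divide & conquer concrete, here is a minimal, self-contained Hirschberg sketch (this is not bleualign's actual process()/extract_matches() code; the base case and the match extraction are simplified assumptions). The split point of b is found by combining a forward score pass over the first half of a with a backward pass over the second half; after that, the two subproblems are fully independent.

```cpp
// hirschberg_sketch.cpp -- illustrative only, not bleualign's implementation.
#include <algorithm>
#include <iostream>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// Forward score pass: last DP row of the Levenshtein table between a and b,
// using only two rows of storage (the same two-row trick as above).
static std::vector<int> nw_score(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
        }
        std::swap(prev, cur);
    }
    return prev;
}

// Recursively align a and b; (ai, bi) are global offsets, and `out` collects
// matched index pairs (a simplified stand-in for extract_matches' output).
static void hirschberg(const std::string& a, const std::string& b,
                       std::size_t ai, std::size_t bi,
                       std::vector<std::pair<std::size_t, std::size_t>>& out) {
    if (a.empty() || b.empty()) return;
    if (a.size() == 1 || b.size() == 1) {
        // Simplified base case: record the first character match, if any.
        for (std::size_t i = 0; i < a.size(); ++i)
            for (std::size_t j = 0; j < b.size(); ++j)
                if (a[i] == b[j]) { out.emplace_back(ai + i, bi + j); return; }
        return;
    }
    std::size_t mid = a.size() / 2;
    const std::string a_lo = a.substr(0, mid), a_hi = a.substr(mid);
    const std::string a_hi_rev(a_hi.rbegin(), a_hi.rend());
    const std::string b_rev(b.rbegin(), b.rend());
    std::vector<int> fwd = nw_score(a_lo, b);          // a_lo vs b[0..j)
    std::vector<int> bwd = nw_score(a_hi_rev, b_rev);  // a_hi vs b[j..)
    // Choose the split of b that minimises the combined cost.
    std::size_t split = 0;
    int best = std::numeric_limits<int>::max();
    for (std::size_t j = 0; j <= b.size(); ++j) {
        int cost = fwd[j] + bwd[b.size() - j];
        if (cost < best) { best = cost; split = j; }
    }
    // The two subproblems share no back pointers, so they could run in
    // parallel -- which is the scaling argument above.
    hirschberg(a_lo, b.substr(0, split), ai, bi, out);
    hirschberg(a_hi, b.substr(split), ai + mid, bi + split, out);
}

int main() {
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    hirschberg("AGTACGCA", "TATGC", 0, 0, matches);
    for (const auto& m : matches)
        std::cout << m.first << " - " << m.second << "\n";
}
```

Each recursion level halves the portion of a being scored, so the extra passes only double the total work (nm + nm/2 + nm/4 + … ≈ 2nm) while memory stays linear in the shorter input.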
If you give bleualign a large input, it will just run forever, and it can hold up the whole pipeline. It would be great if we either find a way to make it perform better for large document pairs, or add an option for it to bail out when it knows it's not going to finish in a reasonable amount of time.
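One cheap shape such a bail-out could take, sketched purely as an assumption (the function name and the cell budget are made up, not an existing bleualign option): refuse a document pair whose full DP table would blow past a configurable budget.

```cpp
// Hypothetical guard: bail out before allocating an n*m back-pointer table
// that cannot fit in the memory/time budget. The threshold is illustrative.
#include <cstddef>
#include <stdexcept>

void check_alignment_budget(std::size_t n, std::size_t m,
                            std::size_t max_cells = 1000000000) {
    if (n != 0 && m > max_cells / n)  // i.e. n * m > max_cells, overflow-safe
        throw std::runtime_error(
            "bleualign: document pair too large to align within budget");
}
```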
Really, this issue is here so I can track and brain-dump my findings, and attach this trouble.gz file for reference. This is what stalled some bleualign processes for a couple of hours: it contains 174741 vs 175306 lines. Luckily that's not too typical for Paracrawl, but it happens.
Just ran the file on my local machine: it takes 50 minutes and uses up to 28 GB of memory. (And Apple Instruments can't export…, so here's a screenshot.)
Most time & memory is spent in search.cpp. I suspect the memory is due to the M*N back_pointer matrix. Most time is spent in boost's find_node, called N*M times in the loop.
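Back-of-the-envelope check (my arithmetic, not a measurement): 174741 × 175306 ≈ 3.06 × 10^10 cells, so even at a single byte per cell the back-pointer matrix alone would come to roughly 28.5 GiB, which lines up with the 28 GB peak observed above.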