Bleualign scales badly #9

Open · jelmervdl opened this issue Aug 18, 2020 · 3 comments

jelmervdl (Member) commented Aug 18, 2020:

If you give bleualign large input, it will just run forever and hold up the pipeline. It would be great to either find a way to make it perform better for large document pairs, or to give it an option to bail out when it knows it's not going to finish in a reasonable amount of time.

Really this issue is here so I can track/brain dump my findings & attach this trouble.gz file for reference. This is what stalled some bleualign processes for a couple of hours. It contains 174741 vs 175306 lines. Luckily that's not too typical for Paracrawl, but it happens.

Just ran the file on my local machine: it takes 50 minutes and uses up to 28 GB of memory. (And Apple Instruments can't export…, so here's a screenshot.)
[Screenshot: Apple Instruments profile, 2020-08-18 18:30]

Most time & memory is spent in search.cpp. I suspect the memory is due to the M×N back_pointer matrix. Most time is spent in boost's find_node, called M×N times in the loop.
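
A quick back-of-the-envelope check on that suspicion, assuming (as a lower bound) a single byte per cell; the boost unordered container behind find_node will spend far more than that per entry:

```cpp
#include <cstdint>
#include <cstdio>

// Lower bound on the M×N back-pointer matrix for the trouble.gz pair:
// even one byte per cell already lands at the observed ~28 GB peak, and
// hash-map nodes cost far more than a byte per entry.
int main() {
  const std::uint64_t m = 174741, n = 175306;  // line counts from above
  const std::uint64_t cells = m * n;           // ≈ 3.06e10 cells
  std::printf("%llu cells = %.1f GiB at 1 byte/cell\n",
              static_cast<unsigned long long>(cells),
              cells / (1024.0 * 1024.0 * 1024.0));
  // Prints: 30633145746 cells = 28.5 GiB at 1 byte/cell
}
```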

mlforcada commented Aug 24, 2020 via email

jelmervdl (Member Author) commented:

Indeed, it doesn't change the time complexity, but it might help bring the runtime down on beefier machines by using them a bit more optimally. I'm running into this on CSD3, where I can run a couple of bleualign processes in parallel using GNU parallel, but I'm still limited by the runtime and memory usage of each individual one.

I think it comes down to this: my two main issues with the current algorithm for searching for alignments are that it cannot scale by throwing more hardware at it. The memory usage grows quadratically with the input (that M×N matrix), and the runtime can't be spread out because the search is single-threaded.

The memory issue is due to the back pointer array that's used to find the shortest "edit distance", the one filled by process() and read by extract_matches(). As far as I understand Hirschberg's algorithm, no back pointer matrix is necessary: it's a divide & conquer approach that only remembers where the shortest path crosses the middle of each section of the search space.
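
For reference, here's a minimal sketch of that idea, using plain edit distance as the per-cell cost. Bleualign's real cost is BLEU-based and its cells can merge multiple sentences, so treat this purely as an illustration of the linear-space trick, not a drop-in replacement:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Doc = std::vector<std::string>;             // one sentence per entry
using Cut = std::pair<std::size_t, std::size_t>;  // (row, column) on the optimal path

// Last row of the edit-distance DP between a and b, in O(|b|) space:
// only the previous row is kept, no back-pointer matrix.
static std::vector<int> score_row(const Doc &a, const Doc &b) {
  std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
  for (std::size_t i = 1; i <= a.size(); ++i) {
    cur[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      const int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
    }
    std::swap(prev, cur);
  }
  return prev;
}

// Hirschberg: find where the optimal path crosses the middle row of `a`
// by combining a forward pass over the top half with a backward pass over
// the bottom half, record that cut, and recurse on the two halves.
static void hirschberg(const Doc &a, const Doc &b,
                       std::size_t arow, std::size_t bcol,
                       std::vector<Cut> &cuts) {
  if (a.size() < 2 || b.empty()) return;  // base case: solve directly in a real impl.
  const std::size_t mid = a.size() / 2;
  const Doc top(a.begin(), a.begin() + mid), bot(a.begin() + mid, a.end());
  const Doc bot_r(bot.rbegin(), bot.rend()), b_r(b.rbegin(), b.rend());
  const std::vector<int> fwd = score_row(top, b);
  const std::vector<int> bwd = score_row(bot_r, b_r);
  std::size_t best = 0;
  for (std::size_t j = 1; j <= b.size(); ++j)
    if (fwd[j] + bwd[b.size() - j] < fwd[best] + bwd[b.size() - best]) best = j;
  cuts.emplace_back(arow + mid, bcol + best);  // path crosses row mid at column best
  const Doc bl(b.begin(), b.begin() + best), br(b.begin() + best, b.end());
  hirschberg(top, bl, arow, bcol, cuts);       // two independent subproblems
  hirschberg(bot, br, arow + mid, bcol + best, cuts);
}
```

The cut list gives points the optimal path is guaranteed to pass through; stitching the base cases between consecutive cuts yields the full alignment, with only two score rows live at any time.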

The runtime issue I can't solve as-is, because in the loop that fills the back pointer matrix each cell depends on its previously calculated neighbours. Hirschberg again splits the work into subtasks that don't depend on each other, so it would be much simpler to compute those in parallel, as in the sketch below.
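
A sketch of that as well, reusing Doc, Cut and score_row from the snippet above; the depth argument is a made-up knob to bound the number of spawned threads:

```cpp
#include <future>
#include <vector>

// Parallel variant of the sketch above (the split computation is identical
// to hirschberg()): after the cut is found, the two halves touch disjoint
// slices of a and b, so one can run on another thread. Each task appends
// to its own vector, so no locking is needed.
static void hirschberg_par(const Doc &a, const Doc &b,
                           std::size_t arow, std::size_t bcol,
                           std::vector<Cut> &cuts, int depth) {
  if (a.size() < 2 || b.empty()) return;
  const std::size_t mid = a.size() / 2;
  const Doc top(a.begin(), a.begin() + mid), bot(a.begin() + mid, a.end());
  const Doc bot_r(bot.rbegin(), bot.rend()), b_r(b.rbegin(), b.rend());
  const std::vector<int> fwd = score_row(top, b);
  const std::vector<int> bwd = score_row(bot_r, b_r);
  std::size_t best = 0;
  for (std::size_t j = 1; j <= b.size(); ++j)
    if (fwd[j] + bwd[b.size() - j] < fwd[best] + bwd[b.size() - best]) best = j;
  cuts.emplace_back(arow + mid, bcol + best);
  const Doc bl(b.begin(), b.begin() + best), br(b.begin() + best, b.end());
  if (depth <= 0) {  // thread budget spent: stay sequential
    hirschberg_par(top, bl, arow, bcol, cuts, 0);
    hirschberg_par(bot, br, arow + mid, bcol + best, cuts, 0);
    return;
  }
  std::vector<Cut> left;  // private output for the async half
  auto task = std::async(std::launch::async, [&] {
    hirschberg_par(top, bl, arow, bcol, left, depth - 1);
  });
  hirschberg_par(bot, br, arow + mid, bcol + best, cuts, depth - 1);
  task.wait();
  cuts.insert(cuts.end(), left.begin(), left.end());
}
```

Each level of recursion can double the number of workers until the depth budget runs out, and because the halves write to separate vectors there is no shared state to lock.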
