Parallelizing ashlar #191

ahnsws · 2023-04-27T20:04:47Z

Hello, I wanted to bring up the performance of ashlar. We use ashlar very heavily in our image pre-processing pipeline, but it has been pretty slow because it is single-threaded, which is understandable given memory concerns. I took a stab at parallelizing ashlar in this gist, without the pyramid step:

https://gist.github.com/ahnsws/b82ed163c773c5d841585e182825f472

I used viztracer to identify slow loops and wrapped them in a ThreadPoolExecutor context. I did have to place a lock around the reader, since reading from it without a lock seemed to give bad results, as expected with multithreading. With the lock, we have exact concordance between images stitched with single-threaded vs parallelized ashlar.
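As a minimal sketch of the pattern described above (this is not the code from the gist), the example below runs a per-tile loop in a ThreadPoolExecutor and serializes only the reads with a lock. `FakeReader`, `process_tile`, and `process_all` are hypothetical names invented for illustration; the reader's shared "cursor" stands in for whatever state makes a real reader unsafe to call from several threads at once.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeReader:
    """Hypothetical stand-in for a tile reader with shared, non-thread-safe state."""

    def __init__(self, n_tiles):
        self.n_tiles = n_tiles
        self._position = None  # shared "file cursor", the reason a lock is needed

    def read(self, index):
        self._position = index        # seek
        return [self._position] * 4   # read at the current position


def process_tile(reader, lock, index):
    # Only the read is serialized; the per-tile computation runs concurrently.
    with lock:
        tile = reader.read(index)
    return sum(tile)


def process_all(reader, max_workers=4):
    lock = threading.Lock()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with tile indices.
        return list(
            pool.map(lambda i: process_tile(reader, lock, i), range(reader.n_tiles))
        )


reader = FakeReader(n_tiles=16)
serial = [sum(reader.read(i)) for i in range(reader.n_tiles)]
parallel = process_all(reader)
print(parallel == serial)
```

Comparing the parallel result against the serial one mirrors the concordance check mentioned above.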

For three rounds of 287 tiles, each with six channels of CyCIF data, single-threaded ashlar takes around 36 minutes, but the parallelized version in the gist runs in three minutes (using 20 cores on our workstation). Based on the merging step, I think the run time will probably scale with ceil(n_channels / n_cores). I've also used tqdm to show progress bars in verbose mode.
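The ceil(n_channels / n_cores) estimate can be made concrete with a small calculation. This is only a sketch of the arithmetic, assuming one channel is merged per core at a time; `merge_batches` is a hypothetical helper name, not part of ashlar.

```python
import math


def merge_batches(n_channels, n_cores):
    """Number of sequential merge 'waves' if each core merges one channel at a time."""
    return math.ceil(n_channels / n_cores)


# Three rounds of six channels on 20 cores: every channel fits in one wave.
print(merge_batches(3 * 6, 20))  # 1

# A run with 35 channels on the same 20 cores would need a second wave.
print(merge_batches(35, 20))  # 2
```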

There is a tradeoff here between compute time and memory usage, but because RAM is more plentiful nowadays we can speed things up by quite a bit, at least for our use case.

Thanks,
Seb

ahnsws · 2023-04-28T15:37:15Z

I used dask distributed to profile the program. On my machine (20 cores, 128 GB of RAM), memory consumption peaks at around 14 GiB, which makes sense since each channel ends up being about 900 MB. I can also try including more channels than the number of available cores.

ahnsws · 2023-04-28T16:27:24Z

With six rounds and 35 total channels, with 287 tiles each, we see the following:

Not sure why the memory usage is logistic-like compared to before, but previously stitched channels seem to be properly gc'ed, and the total memory usage seems to be reasonable for modern computing environments.

josenimo · 2023-06-15T08:57:11Z

Hey @ahnsws,

Would you think that this parallelization would work for HPC runs?
I am a newbie, but trying to improve run times for large WSI images.
Thanks in advance

ahnsws · 2023-06-20T23:50:15Z

Hi @josenimo, it's hard to say without knowing more about your pipeline. Have you tried profiling the steps in your pipeline and identifying bottlenecks? If it is possible to trade off computation time with memory in the context of your clusters, the parallelization should work.

josenimo · 2023-06-21T07:27:52Z

@ahnsws, it is just mcmicro with a few small changes; see josenimo/mcmicro.
ASHLAR is a bottleneck time-wise; I know this from running various datasets.
(1) Illumination takes about 2 hours per cycle, but executor = "sge" in nextflow.config parallelizes this across different nodes wonderfully
(2) ASHLAR takes from 7 to 12 hours, with very little RAM and just one core
(3) Coreograph takes ~2 hours
(4) Segmentation and quantification take about 25 minutes in total per core, and executor = "sge" takes care of parallelizing each core into a separate segmentation job.

Theoretically, for this parallelization I would just ask for more cores specifically for the ASHLAR step. I think asking for less time and more cores is a great tradeoff with our current HPC solution.

I will perform some tests later :)

mdposkus · 2024-08-08T19:19:08Z

Hi @ahnsws,

Thank you for contributing this file! ASHLAR is also the longest step of MCMICRO for me, so this implementation would be incredibly helpful. I'm wondering what the workflow is for using the gist file to run ASHLAR. Should it be downloaded separately and called from the command line? Can it be implemented in MCMICRO?

Thank you.

ciszew · 2024-11-14T19:30:39Z

> Hey @ahnsws,
>
> Would you think that this parallelization would work for HPC runs? I am a newbie, but trying to improve run times for large WSI images. Thanks in advance

Hi @josenimo,
I'm wondering if you were able to implement parallelization on HPC? I'm also looking for ways to speed it up, so if you have any advice in this regard it would be great to hear it.
Thanks

sophiamaedler · 2024-11-14T21:30:40Z

We recently implemented a parallelized version of Ashlar, based on the gist shared here plus some additional changes, that multithreads as much of the workflow as possible and exposes it through an easily accessible Python API.

As of now, the longest step, which we have only been able to parallelize to some degree, is mosaic assembly. While the assembly of each channel can be put into its own thread, we currently can't multithread the assembly within a single channel, as this introduces image artefacts in overlapping image regions. Read and write IO can also become an issue for this step.
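The per-channel threading described above can be sketched roughly as follows. This is an illustration only, not scPortrait's or Ashlar's implementation: the 1D "canvas", the tile offsets, and the max-based blend in the overlap are all assumptions made to keep the example small. The point is that each channel owns its own canvas and pastes its tiles one at a time, so overlapping pixels are never written by two threads at once.

```python
from concurrent.futures import ThreadPoolExecutor

TILE_WIDTH = 4
POSITIONS = [0, 3, 6]  # hypothetical tile offsets; neighbouring tiles overlap by one pixel


def assemble_channel(tiles):
    """Paste one channel's tiles into its own canvas, one tile at a time."""
    canvas = [0] * (POSITIONS[-1] + TILE_WIDTH)
    for offset, tile in zip(POSITIONS, tiles):
        for i, value in enumerate(tile):
            # Read-modify-write on pixels shared by neighbouring tiles;
            # doing this from several threads at once could race.
            canvas[offset + i] = max(canvas[offset + i], value)
    return canvas


def assemble_all(channels, max_workers=4):
    # Channels never share a canvas, so each one can run in its own thread.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(assemble_channel, channels))


# Three hypothetical channels, each with three constant-valued tiles.
channels = [[[c + 1] * TILE_WIDTH for _ in POSITIONS] for c in range(3)]
mosaics = assemble_all(channels)
print(mosaics[0])
```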

Our code uses out-of-memory computation to assemble even extremely large whole slide images (I've tried up to 500 GB of raw input data, which I could stitch into a whole slide image in a couple of hours), so RAM shouldn't be an issue.

Currently, we've only implemented the assembly of individual stitched images using Ashlar, not the alignment of multiple image cycles.

And since I mainly work with data acquired on an Opera Phenix microscope, which exports individual tif files, all code has been heavily optimized for the FilePatternReader. Parallelization with the BioformatsReader might need some additional code. We are always happy to receive contributions though!
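One generic way to do out-of-memory assembly is to back the mosaic with a file on disk, so only the region being written needs to be in RAM. The sketch below uses `numpy.memmap` purely as an illustration of that idea; whether the package uses memmap, zarr, or something else is not stated here, and the mosaic shape and tile layout are made-up values.

```python
import os
import tempfile

import numpy as np

# Hypothetical mosaic size and tile layout, chosen only for illustration.
MOSAIC_SHAPE = (512, 512)
TILE = 256

path = os.path.join(tempfile.mkdtemp(), "mosaic.dat")

# mode="w+" creates (or overwrites) a file of the right size on disk.
mosaic = np.memmap(path, dtype=np.uint16, mode="w+", shape=MOSAIC_SHAPE)

for row in range(0, MOSAIC_SHAPE[0], TILE):
    for col in range(0, MOSAIC_SHAPE[1], TILE):
        # Stand-in for a tile loaded from the reader.
        tile = np.full((TILE, TILE), row + col, dtype=np.uint16)
        mosaic[row:row + TILE, col:col + TILE] = tile
mosaic.flush()  # push pending writes to disk

# Reopen read-only to confirm the data lives in the file, not just in RAM.
reopened = np.memmap(path, dtype=np.uint16, mode="r", shape=MOSAIC_SHAPE)
print(reopened.shape, int(reopened[300, 300]))
```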

Feel free to check out an example notebook here:

https://mannlabs.github.io/scPortrait/pages/tools/stitching/example_stitching_notebook.html#Multi-threaded-Stitching

Or the entire package here:

https://github.com/MannLabs/scPortrait/tree/main

We are happy to get feedback!

ahnsws · 2024-11-15T03:22:36Z

Hi @mdposkus, so sorry I missed your message. Currently the gist as it stands is not exactly a script; it was meant more as a proof of concept. We do use a modified version of the gist that is more script-like, e.g. has arguments, tries to be general enough, etc., but I am not sure if the way our script works would match your workflow. Actually, if I do have some time I could try to write a script-like version that would match the signature of how ashlar is typically called (and account for changes in ashlar since I wrote it more than a year ago...).

Yes, actually, now that our images keep getting bigger and bigger 😆 (and also have more channels), I think it would be great to shift more permanently to thread-safe out-of-memory methods, as shown by @sophiamaedler (thank you for that notebook! although as a minor point the last cell could be made a bit more concise by using np.allclose or np.array_equal, returning a single bool). That is something we would be very interested in incorporating!
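For reference, a small sketch of that comparison, using made-up arrays in place of the stitched images from the notebook: `np.array_equal` checks for exact equality (including shape), while `np.allclose` allows a small floating-point tolerance. Both return a single bool.

```python
import numpy as np

# Hypothetical stand-ins for the single-threaded and multi-threaded results.
single_threaded = np.arange(12, dtype=np.uint16).reshape(3, 4)
multi_threaded = single_threaded.copy()

# Exact, element-wise equality collapsed to one bool.
print(np.array_equal(single_threaded, multi_threaded))

# Equality within a tolerance, useful for floating-point images.
print(np.allclose(single_threaded, multi_threaded))
```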
