
Using future_map instead of loops in the main scripts to improve performance with multisession #101

Open
lboel opened this issue Nov 15, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@lboel
Collaborator

lboel commented Nov 15, 2023

I was just playing around with redesigning the current loops in the main scripts as purrr::map calls and tried simply swapping in furrr::future_map to make them multisession. With the right number of workers I ended up at around half of the current runtime with the demo data.
Logging seems to be an issue, but at least during raw file read-in and clean-up (everything that ends with a temp file), multisession should improve run speed by around 30–50%, and with furrr I had no errors on Windows.

https://furrr.futureverse.org/
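
Roughly what I tried, as a minimal sketch (the file path and process_tracker() are just placeholders for the real per-file code in the main scripts):

```r
library(purrr)
library(furrr)
library(future)

# Placeholders: neither `tracker_files` nor `process_tracker()` exists
# under these names in the repo.
tracker_files <- list.files("data/raw", full.names = TRUE)
process_tracker <- function(path) {
  # read one raw tracker file, clean it, write a temp file, return its path
  path
}

# Step 1: the existing for loop expressed as purrr::map
results <- map(tracker_files, process_tracker)

# Step 2: the same call parallelised with furrr under a multisession plan
plan(multisession, workers = 4)
results <- future_map(tracker_files, process_tracker)
plan(sequential)  # switch back when done
```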

So the idea would be:

  • rewrite any loop to purrr::map (either way, this would do no harm)
  • find out the best places to replace map with future_map

This would require thinking about some future::plan issues if we want to nest future_map calls:
https://stackoverflow.com/questions/61506909/nested-furrrfuture-map
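
For the nested case, future lets you declare one plan per nesting level; a minimal sketch (the worker counts are placeholders, not tuned values):

```r
library(future)

plan(list(
  tweak(multisession, workers = 2),  # outer future_map over tracker files
  tweak(multisession, workers = 4)   # inner future_map within one tracker file
))
```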

So I think we will win in terms of readability with purrr::map instead of loops, and any furrr multisession improvement will then be easy to implement.

@lboel lboel added the enhancement New feature or request label Nov 15, 2023
@pmayd
Collaborator

pmayd commented Dec 12, 2023

@lboel I think we should revisit this idea and give it some priority, because it will again change quite a bit, but I really think we should parallelize again. To let Oleg still run the code locally, we could simply split the main functions into a code path that executes the pipeline in parallel and one that doesn't, based on a parameter set to TRUE or FALSE (see the sketch below).
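
A minimal sketch of that toggle, assuming a hypothetical per-tracker function run_pipeline() and a `parallel` argument:

```r
library(purrr)
library(furrr)
library(future)

# `run_pipeline()` is a placeholder for whatever the main script does per tracker file.
run_main <- function(tracker_files, parallel = FALSE, workers = 4) {
  if (parallel) {
    plan(multisession, workers = workers)
    on.exit(plan(sequential), add = TRUE)
    future_map(tracker_files, run_pipeline)
  } else {
    map(tracker_files, run_pipeline)
  }
}

# run_main(tracker_files)                   # sequential, e.g. for local runs
# run_main(tracker_files, parallel = TRUE)  # multisession
```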
I guess the easiest way to get everything working is to turn a4d into a proper package, so that we can call its functions with a4d:: instead of having to source everything, correct? Let's try this.

@lboel
Collaborator Author

lboel commented Dec 12, 2023

Indeed, a proper package would help a lot. Once we have a real package, multisession processing becomes easy, especially because the tracker files are independent: we could simply chunk the task by tracker file.

I could imagine that having a real package would allow us to start the process for chunks of tracker files and then run some kind of merge at the end, without even implementing furrr or other parallel code in the package itself, but rather in a runner script for the package (sketched below).
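
A rough sketch of such a runner script, assuming hypothetical exports a4d::process_tracker_file() and a4d::merge_results() (neither exists yet under these names):

```r
library(furrr)
library(future)

plan(multisession, workers = 4)

tracker_files <- list.files("data/tracker", full.names = TRUE)
# split the independent tracker files into 4 chunks
chunks <- split(tracker_files, rep_len(seq_len(4), length(tracker_files)))

# each chunk is processed independently; the merge happens once at the end
chunk_results <- future_map(chunks, function(chunk) {
  lapply(chunk, a4d::process_tracker_file)  # hypothetical export
})
final <- a4d::merge_results(unlist(chunk_results, recursive = FALSE))  # hypothetical export

plan(sequential)
```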

@pmayd
Collaborator

pmayd commented Dec 12, 2023

I don't really get the last point: how would you process the chunks if not in parallel to speed up the code? So yes, we don't need it IN the package, of course, but as now this would happen at the outer level, in the main pipeline scripts that iterate over the tracker list. And this outer for loop is the most natural point for adding parallelisation, I guess.

The only problem in the past was that the code in the newly spawned processes did not have the packages that were loaded with devtools::load_all(), so we had to source() everything so that those processes had the functions available. I hope this is not necessary with a real package, because we can simply call all functions with a4d::. The only remaining thing is to load a4d once inside each worker, or something like that, depending on the package we use for parallelisation (see the sketch below).
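
A minimal sketch of that last point, assuming a hypothetical a4d::process_tracker_file() export: fully qualified calls let each multisession worker load the installed package itself, and furrr can additionally attach it in every worker via furrr_options():

```r
library(furrr)
library(future)

plan(multisession, workers = 4)

results <- future_map(
  tracker_files,
  ~ a4d::process_tracker_file(.x),            # hypothetical export, fully qualified
  .options = furrr_options(packages = "a4d")  # attach a4d in every worker
)
```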

So could you please open a new issue about making a4d a proper package and what we need for this?

@pmayd
Collaborator

pmayd commented Dec 12, 2023

I guess Konrad already mentioned important points in the Slack discussion.
