Using future_map instead of loops in the main scripts to improve performance with multisession #101
Comments
@lboel I think we should revisit this idea and give it some priority, because it will again change quite a bit. But I really think we should parallelize again. So that Oleg can still run the code locally, we could simply split the main functions into one code path that executes the pipeline in parallel and one that doesn't, selected by a parameter set to True or False.
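A minimal sketch of such a switch, assuming the future and furrr packages. With these packages the same `future_map` call can serve both cases, because only the plan changes. `process_tracker`, `run_pipeline`, `parallel` and `workers` are hypothetical names used for illustration, not part of the existing scripts.

```r
library(future)
library(furrr)

# Hypothetical stand-in for the real per-tracker processing step.
process_tracker <- function(path) {
  toupper(basename(path))
}

# `parallel` and `workers` are assumed parameter names.
run_pipeline <- function(tracker_files, parallel = FALSE, workers = 2) {
  # Choose the backend; plan() returns the previous plan so it can be restored.
  if (parallel) {
    old_plan <- plan(multisession, workers = workers)
  } else {
    old_plan <- plan(sequential)
  }
  on.exit(plan(old_plan), add = TRUE)

  # Identical call for both modes.
  future_map(tracker_files, process_tracker)
}

tracker_files <- c("data/tracker_a.xlsx", "data/tracker_b.xlsx")
results_seq <- run_pipeline(tracker_files, parallel = FALSE)
```

Calling `run_pipeline(tracker_files, parallel = TRUE)` would run the same code in background R sessions, so the sequential path stays available for local debugging.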
Indeed, a proper package will help a lot. Once we have a real package, multisession runs become easy. Because the tracker files are independent, we could chunk the task by tracker file. I could imagine that a real package would let us start the process for chunks of tracker files and then run some kind of merge at the end, without even implementing furrr or other parallel code in the package itself, but rather in a runner script for the package.
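The chunk-and-merge idea could be sketched in a runner script roughly as follows. This uses base R only. `process_chunk`, the file names and `chunk_size` are placeholders, and `lapply` stands in for whatever mechanism would start each chunk (for example separate R processes or jobs).

```r
# Hypothetical stand-in for running the packaged pipeline on one chunk of files.
process_chunk <- function(files) {
  data.frame(file = files, n_chars = nchar(files), stringsAsFactors = FALSE)
}

# Placeholder tracker list and chunk size.
tracker_files <- sprintf("tracker_%02d.xlsx", 1:5)
chunk_size <- 2

# Split the file list into chunks of at most `chunk_size` files.
chunks <- split(tracker_files, ceiling(seq_along(tracker_files) / chunk_size))

# Process each chunk independently; no chunk depends on another.
chunk_results <- lapply(chunks, process_chunk)

# Merge step: bind the per-chunk results into one table.
merged <- do.call(rbind, chunk_results)
rownames(merged) <- NULL
```

Because the chunks share no state, the `lapply` call is the only line that would need to change to run them concurrently.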
I don't really get the last point: how will you process the chunks if not in parallel to speed up the code? So yes, we don't need it IN the package, of course, but as now, this would happen at the outer level, in the main pipeline scripts that iterate over the tracker list. And this outer for loop is the most natural point for adding parallelisation, I guess. The only problem in the past was that the code in the new worker sessions did not have the packages loaded that were loaded with So could you please open a new issue about making a4d a proper package and what we need for this?
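For the missing-packages problem in workers, one option furrr offers is the `packages` argument of `furrr_options()`, which attaches the listed packages in each worker before the function runs. A small sketch, assuming furrr is used at the outer level. The `tools` package and the file names are only placeholders for whatever the pipeline actually needs.

```r
library(future)
library(furrr)

plan(multisession, workers = 2)

# Attach the listed packages in every worker; "tools" is a placeholder.
opts <- furrr_options(packages = "tools")

tracker_files <- c("tracker_a.xlsx", "tracker_b.csv")  # placeholder file names

# file_ext() comes from tools, which is not attached in a fresh R session.
extensions <- future_map_chr(
  tracker_files,
  function(path) file_ext(path),
  .options = opts
)

plan(sequential)
```

Once a4d is a real package, the same argument could name that package instead, so that each worker has its functions available.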
I guess Konrad already mentioned important points in the Slack discussion |
I was just playing around with redesigning the current loops in the main scripts as purrr::map calls and tried a simple furrr::future_map to make them multisession. With the right number of workers, I ended up at roughly half the current runtime on the demo data.
Logging seems to be an issue, but at least during raw file read-in and cleanup (everything that ends in a temp file), multisession should improve run speed by around 30%-50%, and with furrr I had no errors on Windows.
https://furrr.futureverse.org/
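The step from a loop to purrr::map to furrr::future_map might look roughly like this. It is a sketch under assumptions: `clean_tracker` and the file paths are hypothetical stand-ins for the real read-in and cleanup code.

```r
library(purrr)
library(future)
library(furrr)

# Hypothetical stand-in for reading and cleaning one tracker file.
clean_tracker <- function(path) {
  data.frame(file = basename(path), n_chars = nchar(path))
}

# Placeholder paths.
tracker_files <- c("trackers/2020_a.xlsx", "trackers/2021_b.xlsx", "trackers/2022_c.xlsx")

# 1. Current style: an explicit loop that fills a list.
loop_results <- vector("list", length(tracker_files))
for (i in seq_along(tracker_files)) {
  loop_results[[i]] <- clean_tracker(tracker_files[[i]])
}

# 2. Same work expressed with purrr::map.
map_results <- map(tracker_files, clean_tracker)

# 3. Parallel version: set a plan, then swap map() for future_map().
plan(multisession, workers = 2)
future_results <- future_map(tracker_files, clean_tracker)
plan(sequential)
```

All three variants return the same list, so the purrr rewrite can be checked against the loop before any parallel backend is switched on.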
So the idea would be:
This would require thinking about some future::plan issues if we want to nest future_map calls:
https://stackoverflow.com/questions/61506909/nested-furrrfuture-map
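For nested calls, future::plan accepts a list with one backend per nesting level. The sketch below parallelises only the outer level and keeps the inner level sequential, which is one conservative choice and not necessarily what the pipeline should use. The `trackers` list is placeholder data standing in for tracker files and their sheets.

```r
library(future)
library(furrr)

# One entry per nesting level:
# outer future_map gets 2 background sessions, inner calls run sequentially.
plan(list(
  tweak(multisession, workers = 2),
  sequential
))

# Placeholder data: each element stands for the sheets of one tracker file.
trackers <- list(a = 1:3, b = 4:6)

nested <- future_map(trackers, function(sheets) {
  # Inner call resolves with the second plan entry (sequential here).
  future_map_dbl(sheets, function(x) x * 10)
})

plan(sequential)
```

If nested parallelism is really wanted, the second list entry could be replaced by another parallel backend, at the cost of more worker processes to manage.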
So I think in terms of readability we will win with purrr::map instead of loops and any furrr multisession improvement will be easy to implement.