
Using future_map instead of loops in the main scripts to improve performance with multisession #101

Open
lboel opened this issue Nov 15, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@lboel
Collaborator

lboel commented Nov 15, 2023

I was just playing around with redesigning the current loops in the main scripts as purrr::map calls and tried simply swapping in furrr::future_map to make them multisession. With the right number of workers I ended up at around half of the current runtime with the demo data.
Logging seems to be an issue, but at least during raw file read-in and clean-up (everything that ends with a temp file), multisession should improve run speed by around 30–50%, and with furrr I had no errors on Windows.

https://furrr.futureverse.org/
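
Roughly what I tried, as a minimal sketch (the file path and process_tracker() are just placeholders for the real per-file code in the main scripts):

```r
library(purrr)
library(furrr)
library(future)

# Placeholders: neither `tracker_files` nor `process_tracker()` exists
# under these names in the repo.
tracker_files <- list.files("data/raw", full.names = TRUE)
process_tracker <- function(path) {
  # read one raw tracker file, clean it, write a temp file, return its path
  path
}

# Step 1: the existing for loop expressed as purrr::map
results <- map(tracker_files, process_tracker)

# Step 2: the same call parallelised with furrr under a multisession plan
plan(multisession, workers = 4)
results <- future_map(tracker_files, process_tracker)
plan(sequential)  # switch back when done
```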

So the idea would be:

  • rewrite any loop to purrr::map (either way, this would do no harm)
  • find out the best places to replace map with future_map

This would require thinking about some future::plan issues if we want to nest future_map calls:
https://stackoverflow.com/questions/61506909/nested-furrrfuture-map
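
For the nested case, future lets you declare one plan per nesting level; a minimal sketch (the worker counts are placeholders, not tuned values):

```r
library(future)

plan(list(
  tweak(multisession, workers = 2),  # outer future_map over tracker files
  tweak(multisession, workers = 4)   # inner future_map within one tracker file
))
```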

So I think we will win in terms of readability with purrr::map instead of loops, and any furrr multisession improvement will then be easy to implement.

@lboel lboel added the enhancement New feature or request label Nov 15, 2023
@pmayd
Collaborator

pmayd commented Dec 12, 2023

@lboel I think we should revisit this idea and give it some priority, because it will again change quite a bit, but I really think we should parallelize again. To let Oleg still run the code locally, we could simply split the main functions into a code path that executes the pipeline in parallel and one that doesn't, based on a parameter set to TRUE or FALSE (see the sketch below).
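
A minimal sketch of that toggle, assuming a hypothetical per-tracker function run_pipeline() and a `parallel` argument:

```r
library(purrr)
library(furrr)
library(future)

# `run_pipeline()` is a placeholder for whatever the main script does per tracker file.
run_main <- function(tracker_files, parallel = FALSE, workers = 4) {
  if (parallel) {
    plan(multisession, workers = workers)
    on.exit(plan(sequential), add = TRUE)
    future_map(tracker_files, run_pipeline)
  } else {
    map(tracker_files, run_pipeline)
  }
}

# run_main(tracker_files)                   # sequential, e.g. for local runs
# run_main(tracker_files, parallel = TRUE)  # multisession
```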
I guess the easiest way to get everything working is to turn a4d into a proper package, so that we can call its functions with a4d:: instead of having to source everything, correct? Let's try this.

@lboel
Collaborator Author

lboel commented Dec 12, 2023

Indeed, a proper package would help a lot. Once we have a real package, multisession processing becomes easy, especially because the tracker files are independent: we could simply chunk the task by tracker file.

I could imagine that having a real package would allow us to start the process for chunks of tracker files and then run some kind of merge at the end, without even implementing furrr or other parallel code in the package itself, but rather in a runner script for the package (sketched below).
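
A rough sketch of such a runner script, assuming hypothetical exports a4d::process_tracker_file() and a4d::merge_results() (neither exists yet under these names):

```r
library(furrr)
library(future)

plan(multisession, workers = 4)

tracker_files <- list.files("data/tracker", full.names = TRUE)
# split the independent tracker files into 4 chunks
chunks <- split(tracker_files, rep_len(seq_len(4), length(tracker_files)))

# each chunk is processed independently; the merge happens once at the end
chunk_results <- future_map(chunks, function(chunk) {
  lapply(chunk, a4d::process_tracker_file)  # hypothetical export
})
final <- a4d::merge_results(unlist(chunk_results, recursive = FALSE))  # hypothetical export

plan(sequential)
```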

@pmayd
Collaborator

pmayd commented Dec 12, 2023

I don't really get the last point: how would you process the chunks if not in parallel to speed up the code? So yes, we don't need it IN the package, of course, but as now this would happen at the outer level, in the main pipeline scripts that iterate over the tracker list. And this outer for loop is the most natural point for adding parallelisation, I guess.

The only problem in the past was that the code in the newly spawned processes did not have the packages that were loaded with devtools::load_all(), so we had to source() everything so that those processes had the functions available. I hope this is not necessary with a real package, because we can simply call all functions with a4d::. The only remaining thing is to load a4d once inside each worker, or something like that, depending on the package we use for parallelisation (see the sketch below).
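
A minimal sketch of that last point, assuming a hypothetical a4d::process_tracker_file() export: fully qualified calls let each multisession worker load the installed package itself, and furrr can additionally attach it in every worker via furrr_options():

```r
library(furrr)
library(future)

plan(multisession, workers = 4)

results <- future_map(
  tracker_files,
  ~ a4d::process_tracker_file(.x),            # hypothetical export, fully qualified
  .options = furrr_options(packages = "a4d")  # attach a4d in every worker
)
```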

So could you please open a new issue about making a4d a proper package and what we need for this?

@pmayd
Collaborator

pmayd commented Dec 12, 2023

I guess Konrad already mentioned important points in the Slack discussion.
