egs-parallel and run_user_code_batch issues when using massive egsphant geometries #777
-
Thanks @MartinMartinov. I am reluctant to add more hard-coded wait times and loop counts like this, because such assumptions invariably cause bottlenecks, race conditions, or other issues down the line when we move to faster computers, more cores, etc. I would rather diagnose the root cause of why the simulations take so long to hatch now. If I understand correctly, these same simulations launched as expected with v2020? Can you launch with the PS:
-
Here's some output from my earlier testing:
The "Killed" was me killing the one job that didn't fail, and even though there are only 7 error messages, there were in fact 8 jobs at one point. The worst part is that I only have the egsjob and parallel log files to look at: no egsdat or egslog files, and no egsrun folder.

And while I do agree that adding a hard-coded time limit is problematic and not forward-thinking, as it currently stands the code essentially has 0 seconds as its hard-coded time limit, so I think the 1 minute is an improvement and makes the RCO code more robust. And since the loop breaks as soon as it's successful, in the best case it's only slowed down by constructing those two integers. Possible alternatives to the hard-coding itself could be a global variable in the code (RCO_WAIT_TIME, or something that could be used throughout the code), a flag you pass when executing an application (-rcowait 60), or a value auto-calculated from how long it took to hatch the simulation (I'm not sure if those timers are already implemented, but making the wait time equivalent to the hatch time would provide a very reasonable and elegant buffer, I think).
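To make the last of those ideas concrete, here is a minimal sketch of the hatch-time version, with everything except the timing logic stubbed out. hatchGeometry() and tryReadControlFile() are hypothetical placeholders, not the real RCO interface, and the same limit could equally come from an RCO_WAIT_TIME-style variable or a -rcowait option:

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Stand-in for the (potentially slow) egsphant hatch; hypothetical.
static void hatchGeometry() {
    std::this_thread::sleep_for(std::chrono::seconds(2));
}

// Stand-in for one attempt at reading the job control file; hypothetical.
static bool tryReadControlFile() {
    return false;
}

int main() {
    auto t0 = Clock::now();
    hatchGeometry();
    auto hatchTime = std::chrono::duration_cast<std::chrono::seconds>(Clock::now() - t0);

    // Use the hatch time itself as the wait budget (with a small floor), so a
    // 36 s hatch automatically gets a ~36 s buffer and a 1 s hatch gets ~5 s.
    auto limit = std::max(hatchTime, std::chrono::seconds(5));
    auto deadline = Clock::now() + limit;

    bool ok = false;
    while (Clock::now() < deadline) {
        if (tryReadControlFile()) { ok = true; break; }  // stop as soon as it works
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return ok ? 0 : 1;
}
```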
-
Yeah, I've been using 2020 for the better part of two years now, and I had no issues with either run_user_code_batch or egs-parallel. The only issue I foresee with the delay timer is that, based on my earlier testing, I needed 36 seconds to make sure the control file was generated; after that I could launch all the other jobs in very quick succession. So using a 36-second delay before every submission, only because the first job needs it, would make the script take about 5 minutes to submit 8 jobs (roughly 8 × 36 s).
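As a rough illustration of the alternative (the real submission scripts are shell; submitJob(), waitForFile(), and the control-file name below are only placeholders), only the first submission would pay the wait, and the remaining jobs would go out back-to-back:

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

// Placeholder for whatever actually submits a job (qsub, sbatch, ...).
static void submitJob(int i) {
    std::cout << "submitting job " << i << "\n";
}

// Poll until the named file exists or the limit runs out.
static bool waitForFile(const std::string &fname, int limitSeconds) {
    for (int s = 0; s < limitSeconds; ++s) {
        std::ifstream in(fname);
        if (in.good()) return true;                  // file is there, stop waiting
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return false;
}

int main() {
    const int njobs = 8;
    submitJob(1);                                    // only the first job pays the wait
    if (!waitForFile("simulation.egsjob", 60)) {     // placeholder file name
        std::cerr << "control file never appeared\n";
        return 1;
    }
    for (int i = 2; i <= njobs; ++i) {
        submitJob(i);                                // the rest launch back-to-back
    }
    return 0;
}
```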
-
Hi everyone,
Since swapping over from the 2020 to the 2021 distribution, I've been having an issue with parallel submission using egsinp files with massive egsphants attached to them. I've tried it with several different egs++ applications, and it seems that, as long as the simulation takes upwards of 20 seconds to 'hatch', all but my first parallel job tend to fail immediately. No egsrun folder is generated, no _w# files appear, and the processes themselves seem to die immediately after hatching.
My best guess is that the lock-file generation now takes too long on my system (did the URCO addition increase the overall control-file startup time?), and all the subsequent parallel jobs fail when checking for it. To address this, I originally made some changes to the egs-parallel script so that it waits for the lock file to be generated before launching the following jobs, but after talking to Frederic, it seems it wasn't a great idea to address this externally. So, for a different fix, I went to the run control object and added a buffer when reading in the control file for the first time,
and it seems to have resolved my issues. I just wanted to start this discussion because I am pretty unfamiliar with the RCO parts of the code, and I wasn't sure if this could cause some issues down the road. Or maybe there was a better way to address my issue. Any input would be much appreciated.
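Roughly, the buffer amounts to a bounded retry around the first read of the control file, something along these lines (a minimal sketch, not the exact patch; openControlFile() and the default 60 s limit are illustrative):

```cpp
#include <chrono>
#include <fstream>
#include <string>
#include <thread>

// Placeholder for whatever the run control object does to open/read the file.
static bool openControlFile(const std::string &fname) {
    std::ifstream in(fname);
    return in.good();
}

// Retry the first read for up to limitSeconds, returning as soon as it works.
bool firstReadWithBuffer(const std::string &fname, int limitSeconds = 60) {
    auto deadline = std::chrono::steady_clock::now()
                  + std::chrono::seconds(limitSeconds);
    while (std::chrono::steady_clock::now() < deadline) {
        if (openControlFile(fname)) {
            return true;                             // success: stop immediately
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return false;                                    // give up, as the code did before
}
```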
PS: Just looking at the code for parallel submission, I spotted a few small things:
- EGSnrc/HEN_HOUSE/cutils/egs_c_utils.c, line 286 in a6fc389
- `/bin/sleep $delay` in EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu, line 152 in a6fc389