egs-parallel and run_user_code_batch issues when using massive egsphant geometries #777
-
Thanks @MartinMartinov. I am reluctant to add more hard-coded wait times and loop counts like this, because such assumptions invariably cause bottlenecks, race conditions, or other issues down the line when we move to faster computers, more cores, etc. I would rather diagnose the root cause of why the simulations take so long to hatch now. If I understand correctly, these same simulations launched as expected with v2020? Can you launch with the PS:
-
Here's some output from my earlier testing:
The "Killed" was me killing the one job that didn't fail, and even though there are only 7 error messages, there were in fact 8 jobs at one point. The worst part is that I only have the egsjob and parallel log files to look at: no egsdat or egslog files, and no egsrun folder.

And while I do agree that adding a hard-coded time limit is problematic and not forward-thinking, as it currently stands the code essentially has 0 seconds as its hard-coded time limit, so I think the 1 minute is an improvement and makes the RCO code more robust. And since the loop breaks as soon as it's successful, in the best case it's only slowed down by constructing those two integers. Possible alternatives to the hard-coding itself could be a global variable in the code (RCO_WAIT_TIME, or something that could be used throughout the code), a flag you pass when executing an application (-rcowait 60), or a value auto-calculated from how long it took to hatch the simulation (I'm not sure if those timers are already implemented, but making the wait time equivalent to the hatch time would provide a very reasonable and elegant buffer, I think).
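To make the last of those ideas concrete, here is a minimal sketch of the hatch-time version, with everything except the timing logic stubbed out. hatchGeometry() and tryReadControlFile() are hypothetical placeholders, not the real RCO interface, and the same limit could equally come from an RCO_WAIT_TIME-style variable or a -rcowait option:

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Stand-in for the (potentially slow) egsphant hatch; hypothetical.
static void hatchGeometry() {
    std::this_thread::sleep_for(std::chrono::seconds(2));
}

// Stand-in for one attempt at reading the job control file; hypothetical.
static bool tryReadControlFile() {
    return false;
}

int main() {
    auto t0 = Clock::now();
    hatchGeometry();
    auto hatchTime = std::chrono::duration_cast<std::chrono::seconds>(Clock::now() - t0);

    // Use the hatch time itself as the wait budget (with a small floor), so a
    // 36 s hatch automatically gets a ~36 s buffer and a 1 s hatch gets ~5 s.
    auto limit = std::max(hatchTime, std::chrono::seconds(5));
    auto deadline = Clock::now() + limit;

    bool ok = false;
    while (Clock::now() < deadline) {
        if (tryReadControlFile()) { ok = true; break; }  // stop as soon as it works
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return ok ? 0 : 1;
}
```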
-
Yeah, I've been using 2020 for the better part of two years now, and I had no issues with either run_user_code_batch or egs-parallel. The only issue I foresee with the delay timer is that, based on my earlier testing, I needed 36 seconds to make sure the control file was generated; after that I could launch all the other jobs in very quick succession. So using a 36-second delay before every submission, only because the first job needs it, would make the script take about 5 minutes to submit 8 jobs (roughly 8 × 36 s).
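As a rough illustration of the alternative (the real submission scripts are shell; submitJob(), waitForFile(), and the control-file name below are only placeholders), only the first submission would pay the wait, and the remaining jobs would go out back-to-back:

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

// Placeholder for whatever actually submits a job (qsub, sbatch, ...).
static void submitJob(int i) {
    std::cout << "submitting job " << i << "\n";
}

// Poll until the named file exists or the limit runs out.
static bool waitForFile(const std::string &fname, int limitSeconds) {
    for (int s = 0; s < limitSeconds; ++s) {
        std::ifstream in(fname);
        if (in.good()) return true;                  // file is there, stop waiting
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return false;
}

int main() {
    const int njobs = 8;
    submitJob(1);                                    // only the first job pays the wait
    if (!waitForFile("simulation.egsjob", 60)) {     // placeholder file name
        std::cerr << "control file never appeared\n";
        return 1;
    }
    for (int i = 2; i <= njobs; ++i) {
        submitJob(i);                                // the rest launch back-to-back
    }
    return 0;
}
```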
-
Hi everyone,
Since swapping over from the 2020 to the 2021 distribution, I've been having an issue with parallel submission using egsinp files with massive egsphants attached to them. I've tried it with several different egs++ applications, and it seems that, as long as the simulation takes upwards of 20 seconds to 'hatch', all but my first parallel job tend to fail immediately. No egsrun folder is generated, no _w# files appear, and the processes themselves seem to die immediately after hatching.
My best guess is that the lock-file generation now takes too long on my system (did the URCO addition increase the overall control-file startup time?), and all the subsequent parallel jobs fail when checking for it. To address this, I originally made some changes to the egs-parallel script so that it waits for the lock file to be generated before launching the following jobs, but after talking to Frederic, it seems it wasn't a great idea to address this externally. So, for a different fix, I went to the run control object and added a buffer when reading in the control file for the first time,
and it seems to have resolved my issues. I just wanted to start this discussion because I am pretty unfamiliar with the RCO parts of the code, and I wasn't sure if this could cause some issues down the road. Or maybe there was a better way to address my issue. Any input would be much appreciated.
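Roughly, the buffer amounts to a bounded retry around the first read of the control file, something along these lines (a minimal sketch, not the exact patch; openControlFile() and the default 60 s limit are illustrative):

```cpp
#include <chrono>
#include <fstream>
#include <string>
#include <thread>

// Placeholder for whatever the run control object does to open/read the file.
static bool openControlFile(const std::string &fname) {
    std::ifstream in(fname);
    return in.good();
}

// Retry the first read for up to limitSeconds, returning as soon as it works.
bool firstReadWithBuffer(const std::string &fname, int limitSeconds = 60) {
    auto deadline = std::chrono::steady_clock::now()
                  + std::chrono::seconds(limitSeconds);
    while (std::chrono::steady_clock::now() < deadline) {
        if (openControlFile(fname)) {
            return true;                             // success: stop immediately
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return false;                                    // give up, as the code did before
}
```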
PS: Just looking at the code for parallel submission, I spotted a few small things:
- EGSnrc/HEN_HOUSE/cutils/egs_c_utils.c, line 286 in a6fc389
- `/bin/sleep $delay` in EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu, line 152 in a6fc389