Executable MPI tasks corrupted profiles #3104
Comments
What was the difference between the two sample executions above? Also, are there lines in the latter profile without a final comma, or are those single-comma lines in addition to the correct profile? I am quite surprised by this. The profile lines are written by something like
Thanks @AymenFJA! |
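One way to answer the question about lines with a missing or extra comma is to scan the profile mechanically. The sketch below is a hypothetical helper (`check_profile` is not part of RADICAL-Pilot). It makes no assumption about how many fields a correct profile line has: it takes the most common comma count in the file as the reference and reports every line that deviates from it.

```sh
#!/bin/sh
# Hypothetical helper, not part of RADICAL-Pilot: report lines of a
# comma-separated profile whose comma count differs from the most common
# comma count found in the same file.
check_profile() {
    awk '
        {
            n = gsub(/,/, ",")   # count commas; the line itself is unchanged
            commas[NR] = n
            freq[n]++
        }
        END {
            # find the most frequent comma count (the "mode")
            max = 0
            for (c in freq)
                if (freq[c] > max) { max = freq[c]; mode = c + 0 }
            # report every line that deviates from it
            for (i = 1; i <= NR; i++)
                if (commas[i] != mode)
                    printf "line %d: %d commas (expected %d)\n", i, commas[i], mode
        }
    ' "$1"
}
```

Running `check_profile "$RP_PROF_TGT"` on the task profile would list the suspicious lines. If truncated lines show up next to lines carrying the leftover fragment, that points to interleaved writes rather than lost data.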
Hey @andre-merzky, to answer your questions:
|
Thanks a lot @AymenFJA , that helps. Can you please attach an |
@andre-merzky Here it is:
#!/bin/sh
# ------------------------------------------------------------------------------
export RP_TASK_ID="task.000089"
export RP_TASK_NAME="task.000089"
export RP_PILOT_ID="pilot.0000"
export RP_SESSION_ID="rp.session.udc-aw32-7c0.vaf8uz.019704.0000"
export RP_RESOURCE="uva.rivanna"
export RP_RESOURCE_SANDBOX="/scratch/vaf8uz/radical.pilot.sandbox"
export RP_SESSION_SANDBOX="$RP_RESOURCE_SANDBOX/$RP_SESSION_ID/"
export RP_PILOT_SANDBOX="$RP_SESSION_SANDBOX/pilot.0000/"
export RP_TASK_SANDBOX="$RP_PILOT_SANDBOX/task.000089"
export RP_REGISTRY_ADDRESS="tcp://10.153.50.62:10002"
export RP_CORES_PER_RANK=1
export RP_GPUS_PER_RANK=0
export RP_GTOD="$RP_PILOT_SANDBOX/gtod"
export RP_PROF="$RP_PILOT_SANDBOX/prof"
export RP_PROF_TGT="$RP_PILOT_SANDBOX/task.000089/task.000089.prof"
rp_error() {
    echo "$1 failed" 1>&2
    exit 1
}
# ------------------------------------------------------------------------------
# rank ID
export RP_RANKS=60
test -z "$SLURM_PROCID" || export RP_RANK=$SLURM_PROCID
test -z "$MPI_RANK" || export RP_RANK=$MPI_RANK
test -z "$PMIX_RANK" || export RP_RANK=$PMIX_RANK
rp_sync_ranks() {
    sig=$1
    echo $RP_RANK >> $sig.sig
    while test $(cat $sig.sig | wc -l) -lt $RP_RANKS; do
        sleep 1
    done
}
# ------------------------------------------------------------------------------
$RP_PROF exec_start
# ------------------------------------------------------------------------------
# pre-exec commands
$RP_PROF exec_pre
export OMPI_MCA_memory_ptmalloc2_disable=1 || rp_error pre_exec
source /home/vaf8uz/scratch/Cylon/cylon/cy-rp-env/bin/activate || rp_error pre_exec
export LD_LIBRARY_PATH=/home/vaf8uz/scratch/Cylon/cylon/build/arrow/install/lib64:/home/vaf8uz/scratch/Cylon/cylon/build/glog/install/lib64:/home/vaf8uz/scratch/Cylon/cylon/build/lib64:/home/vaf8uz/scratch/Cylon/cylon/build/lib:$LD_LIBRARY_PATH || rp_error pre_exec
# ------------------------------------------------------------------------------
# execute rank
$RP_PROF rank_start
python "cylon_scaling.py" "-n" "100000000" "-i" "4" "-s" "s"
RP_RET=$?
$RP_PROF rank_stop
# ------------------------------------------------------------------------------
# post-exec commands
$RP_PROF exec_post
# ------------------------------------------------------------------------------
$RP_PROF exec_stop
exit $RP_RET
# ------------------------------------------------------------------------------
|
What is the output of
The script looks as expected. My assumption would be that the shared FS is not doing atomic writes for multi-node runs. The above command should tell us the file system type so we can have a look at the documentation. |
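The atomic-write hypothesis can be probed in isolation with a small stress test. The sketch below assumes that each profile entry is appended to a shared file as one short line. That is an assumption for illustration, since the thread does not show how `$RP_PROF` actually writes. The function name `append_test` and the four-field line layout are made up. Several background writers append to the same file, and the result is checked for line count and field count.

```sh
#!/bin/sh
# Hypothetical stress test: $2 concurrent writers each append $3 short
# lines to the file $1, then the file is checked for lost or mangled lines.
append_test() {
    out=$1; writers=$2; lines=$3
    : > "$out"                          # start from an empty file
    w=1
    while [ "$w" -le "$writers" ]; do
        (
            i=1
            while [ "$i" -le "$lines" ]; do
                # one short line per printf, opened in append mode (>>)
                printf '%s,%s,payload,\n' "$w" "$i" >> "$out"
                i=$((i + 1))
            done
        ) &
        w=$((w + 1))
    done
    wait                                # wait for all background writers
    total=$(wc -l < "$out" | tr -d ' ')
    # a well-formed line has 4 comma-separated fields (the last one empty)
    bad=$(awk -F, 'NF != 4' "$out" | wc -l | tr -d ' ')
    echo "lines=$total malformed=$bad"
}
```

On a single node with a local file system this is expected to report no malformed lines. To test the hypothesis, the writers would have to run on different nodes (for example one per rank) against a file in the task sandbox. The Linux `open(2)` man page notes that `O_APPEND` may lead to corrupted files on NFS when more than one process appends at once, so a network file system behaving this way would not be surprising.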
@andre-merzky this is what I got when running
bash-4.4$ mount | grep scratch
/dev/sda on /localscratch type ext4 (rw,relatime)
bash-4.4$ |
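Note that the `grep scratch` output lists `/localscratch`, while the sandbox paths in the task script are under `/scratch/vaf8uz`. Querying the path itself avoids that ambiguity. The sketch below uses a hypothetical helper name (`fs_type_of`) and relies on `df -T`, which GNU and busybox `df` support but which is not part of POSIX.

```sh
#!/bin/sh
# Hypothetical helper: print the file system type backing a given path.
# -P forces one output line per file system, -T adds the type column
# (a GNU/busybox extension, not POSIX).
fs_type_of() {
    df -PT "$1" | awk 'NR == 2 { print $2 }'
}
```

Calling `fs_type_of "$RP_TASK_SANDBOX"` on a compute node should show which file system holds the profile files, whether that is `ext4` or a network file system.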
Woah, we get data reshuffled on an ext4??? What the heck?? Let me read up a bit more, that I did not expect at all... |