
Auto ppn set up might get the wrong value in SLURM adaptor for versions later than 17.11.5 #692

Open
iparask opened this issue Nov 16, 2018 · 2 comments

iparask commented Nov 16, 2018

On Bridges, for example, we have SLURM 17.11.7, and the machine offers at least two different cores-per-node counts.

[paraskev@br006 ~]$ scontrol --version
slurm 17.11.7

That is, 28 cores for the RM queue and 32 for the GPU queue when the nodes with P100 GPUs are used.

Lines 389-391 will select 28 cores per node, based on the output of line 388:

[paraskev@br006 ~]$ scontrol show nodes | grep CPUTot| sed -e 's/.*\(CPUTot=[0-9]*\).*/\1/g'| sort | uniq -c | cut -f 2 -d = | xargs echo
28 288 32 352 40 64 80 96

We need to come up with a way to select the correct value. In addition, Bridges does not require the -N flag in the SLURM script, while Stampede2 with SLURM 18.08.3 does.
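
For reference, here is a minimal Python sketch (not the adaptor's actual code, just an illustration of what the current detection amounts to) that collects the distinct CPUTot values the way the command above does; on a heterogeneous machine like Bridges it yields several values, so picking any single one as the ppn is wrong for at least one partition:

import subprocess

# Illustration only: gather the distinct CPUTot values reported by
# 'scontrol show nodes', as the command above does.
out = subprocess.check_output(['scontrol', 'show', 'nodes']).decode()

cpu_counts = set()
for line in out.splitlines():
    for token in line.split():
        if token.startswith('CPUTot='):
            cpu_counts.add(int(token.split('=', 1)[1]))

# On Bridges this yields several values (28, 32, ...), so any single
# choice is wrong for at least one partition.
print(sorted(cpu_counts))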

iparask commented Dec 5, 2018

Hey @andre-merzky, do you remember how we said we were going to tackle this? I remember I was supposed to pick it up, but I missed it in my todo list.

iparask commented Dec 5, 2018

It took me a second, but I remembered it!

The solution was to execute scontrol show partitions | grep -E 'PartitionName|TotalCPUs|TotalNodes' instead of the command currently executed.

On Bridges this returns:

PartitionName=RM
   State=UP TotalCPUs=20160 TotalNodes=720 SelectTypeParameters=NONE
PartitionName=RM-shared
   State=UP TotalCPUs=1932 TotalNodes=69 SelectTypeParameters=NONE
PartitionName=RM-small
   State=UP TotalCPUs=140 TotalNodes=5 SelectTypeParameters=NONE
PartitionName=GPU
   State=UP TotalCPUs=1344 TotalNodes=44 SelectTypeParameters=NONE
PartitionName=GPU-shared
   State=UP TotalCPUs=700 TotalNodes=23 SelectTypeParameters=NONE
PartitionName=GPU-small
   State=UP TotalCPUs=128 TotalNodes=4 SelectTypeParameters=NONE
PartitionName=GPU-AI
   State=UP TotalCPUs=456 TotalNodes=10 SelectTypeParameters=NONE
PartitionName=LM
   State=UP TotalCPUs=4512 TotalNodes=46 SelectTypeParameters=NONE
PartitionName=XLM
   State=UP TotalCPUs=1280 TotalNodes=4 SelectTypeParameters=NONE
PartitionName=DBMI
   State=UP TotalCPUs=256 TotalNodes=8 SelectTypeParameters=NONE
PartitionName=DBMI-GPU
   State=UP TotalCPUs=64 TotalNodes=2 SelectTypeParameters=NONE

If we take every partition and calculate the ppn by dividing TotalCPUs by TotalNodes, I get the following dictionary:

{'DBMI': 32,
 'DBMI-GPU': 32,
 'GPU': 31,
 'GPU-AI': 46,
 'GPU-shared': 31,
 'GPU-small': 32,
 'LM': 99,
 'RM': 28,
 'RM-shared': 28,
 'RM-small': 28,
 'XLM': 320}
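
For illustration, a small Python sketch (assuming the partition listing above; not the adaptor's code) that reproduces this dictionary. Note that the values above correspond to rounding the division up (e.g. GPU: 1344/44 -> 31), so the sketch uses ceiling division:

import math
import subprocess

# Illustration only: estimate a per-partition ppn from
# 'scontrol show partitions' (TotalCPUs is reported before TotalNodes,
# as in the listing above).
out = subprocess.check_output(['scontrol', 'show', 'partitions']).decode()

ppn = {}
name = None
for line in out.splitlines():
    for token in line.split():
        if token.startswith('PartitionName='):
            name = token.split('=', 1)[1]
        elif token.startswith('TotalCPUs='):
            cpus = int(token.split('=', 1)[1])
        elif token.startswith('TotalNodes='):
            nodes = int(token.split('=', 1)[1])
            ppn[name] = math.ceil(cpus / nodes)  # e.g. RM: 20160/720 = 28

print(ppn)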

That approach is wrong, though, because the GPU nodes have either 28 or 32 cores depending on the type of GPU being selected. I propose to keep the ppn check as it was and to restrict the version check to Stampede2, as it already was. Is that okay with you?
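
If it helps, here is a minimal sketch (hypothetical, not the adaptor's real logic) of gating the -N flag on the SLURM version reported by scontrol, so that only versions like Stampede2's 18.08.x emit it; the threshold and the way the flag is written out are assumptions:

import re
import subprocess

def slurm_version():
    # Parse e.g. 'slurm 17.11.7' from 'scontrol --version' into (17, 11, 7).
    out = subprocess.check_output(['scontrol', '--version']).decode()
    return tuple(int(x) for x in re.findall(r'\d+', out)[:3])

# Hypothetical gate: only write the '-N <nodes>' line for SLURM versions
# that require it (the threshold here is an assumption).
if slurm_version() >= (18, 8, 0):
    print('#SBATCH -N 2')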
