
Auto ppn set up might get the wrong value in SLURM adaptor for versions later than 17.11.5 #692

Open
iparask opened this issue Nov 16, 2018 · 2 comments

iparask commented Nov 16, 2018

On Bridges, for example, we have SLURM 17.11.7, and the machine offers at least two different cores-per-node counts.

[paraskev@br006 ~]$ scontrol --version
slurm 17.11.7

That is, 28 cores for the RM queue and 32 for the GPU queue when the nodes with P100 GPUs are used.

Lines 389-391 will select 28 cores per node, based on the output of line 388:

[paraskev@br006 ~]$ scontrol show nodes | grep CPUTot| sed -e 's/.*\(CPUTot=[0-9]*\).*/\1/g'| sort | uniq -c | cut -f 2 -d = | xargs echo
28 288 32 352 40 64 80 96

We need to come up with a way to select the correct value. In addition, Bridges does not require the -N flag in the SLURM script, while Stampede2 with SLURM 18.08.3 does.
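
For reference, here is a minimal Python sketch (not the adaptor's actual code, just an illustration of what the current detection amounts to) that collects the distinct CPUTot values the way the command above does; on a heterogeneous machine like Bridges it yields several values, so picking any single one as the ppn is wrong for at least one partition:

import subprocess

# Illustration only: gather the distinct CPUTot values reported by
# 'scontrol show nodes', as the command above does.
out = subprocess.check_output(['scontrol', 'show', 'nodes']).decode()

cpu_counts = set()
for line in out.splitlines():
    for token in line.split():
        if token.startswith('CPUTot='):
            cpu_counts.add(int(token.split('=', 1)[1]))

# On Bridges this yields several values (28, 32, ...), so any single
# choice is wrong for at least one partition.
print(sorted(cpu_counts))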

iparask commented Dec 5, 2018

Hey @andre-merzky, do you remember how we said we were going to tackle this? I remember I was supposed to pick it up, but I missed it in my todo list.

iparask commented Dec 5, 2018

It took me a second, but I remembered it!

The solution was to execute scontrol show partitions | grep -E 'PartitionName|TotalCPUs|TotalNodes' instead of the command currently executed.

On Bridges this returns:

PartitionName=RM
   State=UP TotalCPUs=20160 TotalNodes=720 SelectTypeParameters=NONE
PartitionName=RM-shared
   State=UP TotalCPUs=1932 TotalNodes=69 SelectTypeParameters=NONE
PartitionName=RM-small
   State=UP TotalCPUs=140 TotalNodes=5 SelectTypeParameters=NONE
PartitionName=GPU
   State=UP TotalCPUs=1344 TotalNodes=44 SelectTypeParameters=NONE
PartitionName=GPU-shared
   State=UP TotalCPUs=700 TotalNodes=23 SelectTypeParameters=NONE
PartitionName=GPU-small
   State=UP TotalCPUs=128 TotalNodes=4 SelectTypeParameters=NONE
PartitionName=GPU-AI
   State=UP TotalCPUs=456 TotalNodes=10 SelectTypeParameters=NONE
PartitionName=LM
   State=UP TotalCPUs=4512 TotalNodes=46 SelectTypeParameters=NONE
PartitionName=XLM
   State=UP TotalCPUs=1280 TotalNodes=4 SelectTypeParameters=NONE
PartitionName=DBMI
   State=UP TotalCPUs=256 TotalNodes=8 SelectTypeParameters=NONE
PartitionName=DBMI-GPU
   State=UP TotalCPUs=64 TotalNodes=2 SelectTypeParameters=NONE

If we take every partition and calculate the ppn by dividing TotalCPUs by TotalNodes, I get the following dictionary:

{'DBMI': 32,
 'DBMI-GPU': 32,
 'GPU': 31,
 'GPU-AI': 46,
 'GPU-shared': 31,
 'GPU-small': 32,
 'LM': 99,
 'RM': 28,
 'RM-shared': 28,
 'RM-small': 28,
 'XLM': 320}
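
For illustration, a small Python sketch (assuming the partition listing above; not the adaptor's code) that reproduces this dictionary. Note that the values above correspond to rounding the division up (e.g. GPU: 1344/44 -> 31), so the sketch uses ceiling division:

import math
import subprocess

# Illustration only: estimate a per-partition ppn from
# 'scontrol show partitions' (TotalCPUs is reported before TotalNodes,
# as in the listing above).
out = subprocess.check_output(['scontrol', 'show', 'partitions']).decode()

ppn = {}
name = None
for line in out.splitlines():
    for token in line.split():
        if token.startswith('PartitionName='):
            name = token.split('=', 1)[1]
        elif token.startswith('TotalCPUs='):
            cpus = int(token.split('=', 1)[1])
        elif token.startswith('TotalNodes='):
            nodes = int(token.split('=', 1)[1])
            ppn[name] = math.ceil(cpus / nodes)  # e.g. RM: 20160/720 = 28

print(ppn)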

That approach is wrong, though, because the GPU nodes have either 28 or 32 cores depending on the type of GPU being selected. I propose to keep the ppn check as it was and to restrict the version check to Stampede2, as it already was. Is that okay with you?
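
If it helps, here is a minimal sketch (hypothetical, not the adaptor's real logic) of gating the -N flag on the SLURM version reported by scontrol, so that only versions like Stampede2's 18.08.x emit it; the threshold and the way the flag is written out are assumptions:

import re
import subprocess

def slurm_version():
    # Parse e.g. 'slurm 17.11.7' from 'scontrol --version' into (17, 11, 7).
    out = subprocess.check_output(['scontrol', '--version']).decode()
    return tuple(int(x) for x in re.findall(r'\d+', out)[:3])

# Hypothetical gate: only write the '-N <nodes>' line for SLURM versions
# that require it (the threshold here is an assumption).
if slurm_version() >= (18, 8, 0):
    print('#SBATCH -N 2')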
