Creating an issue more for discussion than anything else, since I'd like to get some input from you @andre-merzky ~
Currently the Cobalt adaptor has these three job description parameters to control the number of nodes, CPUs, and MPI ranks:
# Example: 16 ranks per node, 8192 ranks in total, which equates to 8192 CPUs (1 per rank)
jd.processes_per_host = 16
jd.number_of_processes = 8192
jd.total_cpu_count = 8192 # 16 cores per node == 512 nodes
Here, we want to run a task on 512 nodes at 16 ranks per node and 1 rank per core. Cobalt (specifically Mira) has a static cores-per-node count of 16.
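The arithmetic behind that description can be sketched in a few lines. This is only an illustration: `jd` here is a plain `SimpleNamespace` standing in for the real job description object, and `CORES_PER_NODE` is a hypothetical name for Mira's 16 cores per node.

```python
from types import SimpleNamespace

CORES_PER_NODE = 16  # assumption: Mira's static cores-per-node count

# Stand-in for the job description; NOT the real adaptor object.
jd = SimpleNamespace(
    processes_per_host=16,
    number_of_processes=8192,
    total_cpu_count=8192,
)

# Node count derived from the CPU count ...
nodes_from_cpus = jd.total_cpu_count // CORES_PER_NODE
# ... and the same node count derived from the two rank parameters alone.
nodes_from_ranks = jd.number_of_processes // jd.processes_per_host
ranks_per_core = jd.processes_per_host / CORES_PER_NODE

print(nodes_from_cpus, nodes_from_ranks, ranks_per_core)
```

Both derivations land on the same node count, which is what suggests that one of the three parameters is redundant.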
The original idea was to use the jd.total_cpu_count parameter, since you can have multiple ranks per core (up to 4 ranks per core, which means up to 64 ranks per node), and having the core count would make node calculations easy. But after some thinking I believe we may be able to rely on just jd.processes_per_host and jd.number_of_processes, since the number of nodes can be extracted from these parameters (with some assumptions~).
Here is what I mean:
Example 1: Single rank per node
Ranks-per-node: 1
Total Number of Ranks: 512
Here, since we have only one rank per node and a total of 512 ranks, we will use 512 nodes, and since the number of cores per node is static in Mira (16 cores-per-node as per documentation), we can assume we need a total of 8192 cores.
We get that: cores = 16 * (jd.number_of_processes / jd.processes_per_host)
Example 2: Single rank per core
Ranks-per-node: 16
Total Number of Ranks: 8192
Similar to the above, with the slight difference that each rank now uses up one core (instead of one rank using all cores in a node), but the calculation uses the same formula.
cores = 16 * ( 8192 / 16 ) = 8192
Example 3: Multiple ranks per core
Ranks-per-node: 64
Total Number of Ranks: 32,768
This example is also similar, but now we have 4 ranks per core (this is the upper limit in Mira). The calculations use the same formula.
cores = 16 * ( 32,768 / 64 ) = 8192
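Examples 1 to 3 can be checked together with a small sketch. The function name `cores_exact` and the constant `CORES_PER_NODE` are made up for illustration; the formula itself is the one from Example 1, and it is only applied here because every division is exact.

```python
from fractions import Fraction

CORES_PER_NODE = 16  # assumption: taken from the Mira figures quoted above

# (ranks_per_node, total_ranks) for Examples 1, 2 and 3
examples = [(1, 512), (16, 8192), (64, 32768)]


def cores_exact(total_ranks, ranks_per_node):
    """Formula from Example 1; only valid when the division has no remainder."""
    if total_ranks % ranks_per_node != 0:
        raise ValueError("total ranks must be a multiple of ranks per node")
    return CORES_PER_NODE * (total_ranks // ranks_per_node)


for rpn, total in examples:
    # Ranks per core goes from 1/16 (one rank owns a node) up to 4.
    ranks_per_core = Fraction(rpn, CORES_PER_NODE)
    print(rpn, total, cores_exact(total, rpn), ranks_per_core)
```

All three cases come out at 8192 cores, even though the ranks-per-core ratio differs in each.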
Example 4: Multiple ranks per core, node is not fully utilized
Ranks-per-node: 64
Total Number of Ranks: 32,705
This example is different: the total number of ranks does not fill every node completely, which actually breaks our previous formula, because the division is rounded down to 511 nodes (as integer division does).
cores = 16 * ( 32,705 / 64 ) = 16 * 511 = 8176 # should be 16 * 512 = 8192 !!!
This is where the documentation helps again: in Mira you cannot request partial nodes, so in cases where only part of a node is needed, we can use the following generalized formula.
cores = 16 * ( ceiling[32,705 / 64] ) = 16 * 512 = 8192 # Correct!
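A minimal sketch of the rounding issue in Python, assuming a hypothetical helper called `ceil_div`. It stays in integer arithmetic on purpose: in Python 2.7, `/` between two integers already rounds down, so something like `math.ceil(32705 / 64)` would be handed 511 and could not recover the lost node.

```python
CORES_PER_NODE = 16  # assumption: Mira's static cores-per-node count


def ceil_div(a, b):
    """Integer ceiling of a / b, behaving the same on Python 2.7 and 3."""
    return -(-a // b)


total_ranks, ranks_per_node = 32705, 64

# Floor division: what the original formula effectively computes.
floor_nodes = total_ranks // ranks_per_node
# Ceiling division: partial nodes are rounded up to whole nodes.
ceil_nodes = ceil_div(total_ranks, ranks_per_node)

print(floor_nodes, ceil_nodes, CORES_PER_NODE * ceil_nodes)
```

The `-(-a // b)` form is one common way to get a ceiling from floor division; `(a + b - 1) // b` would work just as well for positive values.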
Example 5: For completeness' sake. Multiple ranks per node, node is not fully utilized
Ranks-per-node: 2
Total Number of Ranks: 1,023
cores = 16 * ( ceiling[1,023 / 2] ) = 16 * 512 = 8192 # Correct!
Ranks-per-node: 4
Total Number of Ranks: 2,046
cores = 16 * ( ceiling[2,046 / 4] ) = 16 * 512 = 8192 # Correct!
Ranks-per-node: 8
Total Number of Ranks: 4,091
cores = 16 * ( ceiling[4,091 / 8] ) = 16 * 512 = 8192 # Correct!
Ranks-per-node: 16
Total Number of Ranks: 8,177
cores = 16 * ( ceiling[8,177 / 16] ) = 16 * 512 = 8192 # Correct!
Ranks-per-node: 32
Total Number of Ranks: 16,353
cores = 16 * ( ceiling[16,353 / 32] ) = 16 * 512 = 8192 # Correct!
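The Example 5 cases can also be run through a short loop. Besides the node and core counts, this sketch reports how many rank slots stay empty on the allocated nodes, which is the cost of not being able to schedule partial nodes. The names `nodes_needed` and `summary` are illustrative only.

```python
CORES_PER_NODE = 16  # assumption: Mira's static cores-per-node count

# (ranks_per_node, total_ranks) for the five cases in Example 5
rows = [(2, 1023), (4, 2046), (8, 4091), (16, 8177), (32, 16353)]


def nodes_needed(total_ranks, ranks_per_node):
    """Whole nodes required, rounding any partial node up."""
    return -(-total_ranks // ranks_per_node)


summary = []
for rpn, total in rows:
    nodes = nodes_needed(total, rpn)
    idle_slots = nodes * rpn - total  # rank slots left unused
    summary.append((rpn, total, nodes, CORES_PER_NODE * nodes, idle_slots))
    print(rpn, total, nodes, CORES_PER_NODE * nodes, idle_slots)
```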
In short
We can drop the parameter jd.total_cpu_count and calculate it with the other two parameters since the following are reliable assumptions for Mira Blue Gene/Q:
Cores per node: 16
Max ranks per node: 64
Acceptable values for ranks per node: 1, 2, 4, 8, 16, 32, 64
Nodes cannot be partially scheduled
We can use the following general formula to calculate the number of nodes, and from it the number of cores:
nodes = ceiling[ total-number-of-ranks / ranks-per-node ]
cores = 16 * nodes
Finally, we can also provide defaults for the other two parameters (as Mira does).
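Putting the assumptions and the formula together, the calculation could look roughly like the sketch below. This is not the adaptor's actual code: `mira_resources` is a hypothetical helper, and the two default values are placeholders chosen for the example rather than values confirmed from the Mira documentation.

```python
CORES_PER_NODE = 16
VALID_RANKS_PER_NODE = (1, 2, 4, 8, 16, 32, 64)

# Placeholder defaults for illustration; NOT verified against Mira's defaults.
DEFAULT_RANKS_PER_NODE = 1
DEFAULT_TOTAL_RANKS = 1


def mira_resources(number_of_processes=None, processes_per_host=None):
    """Return (nodes, cores) from the two rank parameters alone."""
    total = DEFAULT_TOTAL_RANKS if number_of_processes is None else number_of_processes
    rpn = DEFAULT_RANKS_PER_NODE if processes_per_host is None else processes_per_host

    if rpn not in VALID_RANKS_PER_NODE:
        raise ValueError("ranks per node must be one of %s" % (VALID_RANKS_PER_NODE,))
    if total < 1:
        raise ValueError("total number of ranks must be at least 1")

    nodes = -(-total // rpn)  # ceiling: nodes cannot be partially scheduled
    return nodes, CORES_PER_NODE * nodes


print(mira_resources(number_of_processes=32705, processes_per_host=64))
print(mira_resources())
```

Rejecting unsupported ranks-per-node values up front keeps the ceiling formula from silently producing a node count for a configuration the machine would refuse.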
Source: Mira Documentation