Trials and tribulations prior to the start of training.
For the trials and tribulations during the training, see: chronicles.
200B
torch.optim.Adam:
16 nodes:
- 1st node: 61GB
- all nodes: 47GB
- performance: XXX
apex.optimizers.FusedAdam
16 nodes:
- 1st node: 51GB
- all nodes: 44GB
- performance: XXX
Here are some existing models around the same size with NLAYERS / NHIDDEN and their ratio:
origin | size | layers | hidden | ratio |
---|---|---|---|---|
bs | 104B | 64 | 11600 | 180 |
meg-lm | 145B | 80 | 12288 | 154 |
openai | 175B | 96 | 12288 | 128 |
meg-lm | 310B | 96 | 16384 | 170 |
msft | 530B | 105 | 20480 | 195 |
Possible ideas:
- 205B: 112 / 12288 (ratio: 109) narrow
- 206B: 96 / 13312 (ratio: 139) closer to typical 150-200 ratio
Formula to get model size, used 150k dict roughly - need to update:
NHIDDEN=12288; NLAYERS=112; SEQ_LEN=2048; VOCAB_SIZE=150257; python -c "h=$NHIDDEN; l=$NLAYERS; s=$SEQ_LEN; v=$VOCAB_SIZE; print(f'Model size: {(l*(12*h**2 + 13*h) + v*h + s*h + 2*h) / 10**9 :.0f}B, ratio={int(h/l)}')"
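The same formula can be kept as a small function, which makes it easier to try several layers/hidden combinations in one go. This is only a sketch of the one-liner above; the function name `model_size` and its default arguments are illustrative and not part of any existing script.

```python
def model_size(nhidden, nlayers, seq_len=2048, vocab_size=150257):
    """Return (parameter count, hidden/layers ratio) using the formula above."""
    h, l, s, v = nhidden, nlayers, seq_len, vocab_size
    params = l * (12 * h**2 + 13 * h) + v * h + s * h + 2 * h
    return params, int(h / l)


# The two candidate shapes listed under "Possible ideas"
for nlayers, nhidden in [(112, 12288), (96, 13312)]:
    params, ratio = model_size(nhidden, nlayers)
    print(f"{nlayers} x {nhidden}: {params / 10**9:.0f}B, ratio={ratio}")
```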
Looking at the current 104B topology to try to estimate the 200B model, though many things are different.
NLAYERS=64
NHIDDEN=11600
NHEADS=80
SEQ_LEN=2048
VOCAB_SIZE=50257
32 GB gpus.
TP=4, PP=32
breakdown:
104B:
- embedding size: v*h: 50257*11600 = 582_981_200 / 4 (TP=4) => 145_745_300 params per gpu for embedding
- one layer size: 12*h**2 + 13*h: 1_614_870_800 / 4 (TP=4) => 403_717_700 params per gpu per layer
64 layers over PP=32 => 2 layers per gpu
Total params per gpu:
- gpu w/ emb: 2*403_717_700 + 145_745_300 = 953_180_700 params * 18 bytes = 17_157_252_600 bytes (17GB)
- gpu w/o emb: 2*403_717_700 = 807_435_400 params * 18 bytes = 14_533_837_200 bytes (15GB)
plus activations memory
Checking the actual GPU allocations (nvidia-smi) - also need to take into account the cuda kernels (1271MiB)
- 22GB (w/ embed) (4GB activations memory)
- 18GB (w/o embed) (2GB activations memory)
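The per-gpu arithmetic for the 104B setup can be reproduced with a few lines of Python. This is a rough sketch that only counts weights, grads and optimizer states at 18 bytes per parameter, as this document does; activations and CUDA kernels are not included, and the helper name `per_gpu_bytes` is made up for illustration.

```python
BYTES_PER_PARAM = 18  # weights + grads + optimizer states, as used in this document


def per_gpu_bytes(nhidden, vocab_size, tp, layers_per_gpu, with_embedding):
    """Bytes of weights/grads/optim states held by one gpu of a TP group."""
    emb = vocab_size * nhidden // tp                 # embedding shard
    layer = (12 * nhidden**2 + 13 * nhidden) // tp   # one transformer layer shard
    params = layers_per_gpu * layer + (emb if with_embedding else 0)
    return params * BYTES_PER_PARAM


# 104B: NHIDDEN=11600, VOCAB_SIZE=50257, TP=4, 64 layers over PP=32 => 2 layers per gpu
for with_emb in (True, False):
    n = per_gpu_bytes(11600, 50257, tp=4, layers_per_gpu=2, with_embedding=with_emb)
    print(f"with embedding={with_emb}: {n:_} bytes ({n / 10**9:.0f}GB)")
```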
384 A100s 80GB / 8 gpus per node
We can plan to use 384 gpus out of 416 as 4 nodes of 8 gpus need to remain reserved for when some nodes happen to be down.
Initially we will have only 144 gpus and then around mid-Feb we should have the rest.
So a possible config is
- a single replica needs to fit 96 gpus and then we can do DP=4 to a full 384 gpus
- extrapolating from the current 104B setup we can have: TP=4/PP=24 @ 80GB + 150K vocab size (which is different from the 50k vocab in 104B - a 3x bigger embed matrix plus a bigger hidden size).
- most likely the embedding layer will now need to be partitioned together with the transformer blocks to balance resources well. e.g. in the current 1.3B ml setup, the 1st and last gpus use all of DRAM, but the rest of the gpus use only 1/2 of DRAM - and TFLOPs are ~21, which is very underutilized.
206B:
NLAYERS=96
NHIDDEN=13312
NHEADS=XXX
SEQ_LEN=2048
VOCAB_SIZE=150_000
Overall we know that DP is the fastest, then PP, then TP - but for PP to be efficient we need a big bs.
The following math is trying various topologies to fit into 80GB gpus
- TP=4, PP=24
- embedding size: v*h: 150257*13312 = 2_000_221_184 / 4 (TP=4) => 500_055_296 params per gpu for embedding
- one layer size: 12*h**2 + 13*h: 2_126_685_184 / 4 (TP=4) => 531_671_296 params per gpu per layer
In other words 2B params per layer w/o TP, or 38GB (2.12*18) per layer.
So here we definitely need to balance the embedding layer with the transformer layers as they are about the same size, so overall there are layers+2 blocks to balance - and the constraint won't be Layers % PP = 0 but (Layers+2) % PP = 0
So probably should do 94 layers?
94+2 layers over PP=24 => 4 layers per gpu
Total params per gpu (considering emb layer on par with transformers block):
4*531_671_296 = 2_126_685_184 params * 18 = 38_280_333_312 bytes plus activations memory
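To pick a layer count that satisfies this balancing constraint (transformer layers plus the 2 embedding blocks divisible by PP), one can simply enumerate the candidates. A minimal sketch; the helper name and the search range are arbitrary choices for illustration.

```python
def balanced_layer_counts(pp, lo, hi, extra_blocks=2):
    """Layer counts n in [lo, hi] for which n + extra_blocks splits evenly over pp stages."""
    return [n for n in range(lo, hi + 1) if (n + extra_blocks) % pp == 0]


# PP=24, looking around the ~96 layers of the 206B candidate
print(balanced_layer_counts(pp=24, lo=80, hi=120))
```

With PP=24 this yields 94 as the closest count below 96, which is where the "94 layers" suggestion comes from.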
40GB A100 takes 1573MiB for cuda kernels (probably about the same for 80GB? may be a bit larger)
python -c "import torch; import time; torch.ones(1).cuda(); time.sleep(30)"
+ check nvidia-smi output.
- TP=1, PP=96
~2B params per layer w/o TP, or 38GB (2.12*18) per layer.
but DS breaks if there isn't at least one transformer block per gpu :( otherwise could do a very efficient:
1 | 2 | 3 ... | 95 | 96
emb | transf | transf ....| transf | emb
So in this scenario no TP is needed, which should make the assembly much faster. But it will require DS to fix their side. Or perhaps we could somehow hack in a dummy layer which will look like a transformer block? e.g. if it's the first or last layer it'd be an identity forward.
Also the pipeline will be super long here, which to make efficient will require a huge global batch size.
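To get a feel for why a long pipeline needs a huge batch, here is a rough sketch using the commonly quoted idle-time ("bubble") estimate for a non-interleaved pipeline schedule: with `p` stages and `m` micro-batches per replica, roughly `(p-1)/(m+p-1)` of the time is spent idle. This formula is not from this document and ignores communication costs, so treat the numbers as an illustration only; the DP values assume all 384 gpus are used.

```python
def bubble_fraction(pp, global_batch_size, dp=1, mbs=1):
    """Approximate idle fraction of a non-interleaved pipeline (assumed model)."""
    m = global_batch_size // (dp * mbs)  # micro-batches each replica processes per step
    return (pp - 1) / (m + pp - 1)


# GBS=2048, MBS=1 on 384 gpus
for tp, pp in [(1, 96), (2, 48), (4, 24)]:
    dp = 384 // (tp * pp)
    print(f"TP={tp} PP={pp} DP={dp}: bubble ~{bubble_fraction(pp, 2048, dp=dp):.1%}")
```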
- with TP=2, PP=48
1_063_342_592 params per layer, 19_140_166_656 bytes (19GB) per layer
perhaps could squeeze 3 layers per gpu - but of course each gpu will be less efficient since it will have to do 3 pipe stages.
- Other considerations
Of course, we could make the model wider and shallower, so for example with TP=1 perhaps we could fit a bit more width and use fewer layers. e.g. the 530B model was NLAYERS=105, NHIDDEN=20480 - so it's much wider.
After discussing the above plans with the NVIDIA and DeepSpeed experts it appears that:
- on A100 and especially with much larger models TP>1 is much more beneficial and typically NVIDIA almost always uses TP=gpus_per_node for large models.
- A very deep PP (96) would be very difficult to keep efficient unless the batch size per replica is huge.
- Too many layers isn't great either:
Jared Casper writes:
Regarding hidden size vs transformer layer (width vs depth), some feedback I got is that there isn't really a magic formula/process. We increase depth with the width but not as drastically as a typical vision model scaling. So you shouldn't go too crazy with depth. The width is somewhat constrained by sizes good for the GPU, so it seems a strategy is to push out the width but keep it nice numbers, then fill out with depth. You'll notice even at 530B params we only went to 105 layers.
Let's first analyse a few existing models and see how they fit 80GB A100 8-gpu nodes.
- 145B meg-lm
NHIDDEN=12288; NLAYERS=80; SEQ_LEN=2048; VOCAB_SIZE=50257; python -c "h=$NHIDDEN; l=$NLAYERS; s=$SEQ_LEN; v=$VOCAB_SIZE; print(f'Model size: {(l*(12*h**2 + 13*h) + v*h + s*h + 2*h) / 10**9 :.0f}B, ratio={int(h/l)}')"
Model size: 146B, ratio=153
NHIDDEN=12288; VOCAB_SIZE=50257; TP=8; python -c "h=$NHIDDEN; v=$VOCAB_SIZE; tp=$TP; emb=v*h/10**6; blk=(12*h**2+13*h)/10**6; print(f'emb size: {emb:.2f}M/{emb*18:.2f}MB, per gpu {emb/tp:.2f}M/{emb*18/tp:.2f}MB'); print(f'blk size: {blk:.2f}M/{blk*18:.2f}MB, per gpu {blk/tp:.2f}M/{blk*18/tp:.2f}MB')"
emb size: 617.56M/11116.04MB, per gpu 77.19M/1389.51MB
blk size: 1812.10M/32617.78MB, per gpu 226.51M/4077.22MB
MP=64: TP=8, PP=8: one replica 64 gpus
so 80/8=10 PP stages per gpu: 10*4 = 40GB of weights/optim states/grads per gpu
- 310B meg-lm
NHIDDEN=16384; NLAYERS=96; SEQ_LEN=2048; VOCAB_SIZE=50257; python -c "h=$NHIDDEN; l=$NLAYERS; s=$SEQ_LEN; v=$VOCAB_SIZE; print(f'Model size: {(l*(12*h**2 + 13*h) + v*h + s*h + 2*h) / 10**9 :.0f}B, ratio={int(h/l)}')"
Model size: 310B, ratio=170
MP=128: TP=8, PP=16: one replica 128 gpus
NHIDDEN=16384; VOCAB_SIZE=50257; TP=8; python -c "h=$NHIDDEN; v=$VOCAB_SIZE; tp=$TP; emb=v*h/10**6; blk=(12*h**2+13*h)/10**6; print(f'emb size: {emb:.2f}M/{emb*18:.2f}MB, per gpu {emb/tp:.2f}M/{emb*18/tp:.2f}MB'); print(f'blk size: {blk:.2f}M/{blk*18:.2f}MB, per gpu {blk/tp:.2f}M/{blk*18/tp:.2f}MB')"
emb size: 823.41M/14821.39MB, per gpu 102.93M/1852.67MB
blk size: 3221.44M/57985.89MB, per gpu 402.68M/7248.24MB
so 96/16=6 PP stages per gpu: 6*7.3 = ~44GB of weights/optim states/grads per gpu
- 530B msft
NHIDDEN=20480; NLAYERS=105; SEQ_LEN=2048; VOCAB_SIZE=50257; python -c "h=$NHIDDEN; l=$NLAYERS; s=$SEQ_LEN; v=$VOCAB_SIZE; print(f'Model size: {(l*(12*h**2 + 13*h) + v*h + s*h + 2*h) / 10**9 :.0f}B, ratio={int(h/l)}')"
Model size: 530B, ratio=195
MP=280: TP=8, PP=35: one replica 280 gpus
(actually don't know the vocab size here, but it doesn't matter much)
NHIDDEN=20480; VOCAB_SIZE=50257; TP=8; python -c "h=$NHIDDEN; v=$VOCAB_SIZE; tp=$TP; emb=v*h/10**6; blk=(12*h**2+13*h)/10**6; print(f'emb size: {emb:.2f}M/{emb*18:.2f}MB, per gpu {emb/tp:.2f}M/{emb*18/tp:.2f}MB'); print(f'blk size: {blk:.2f}M/{blk*18:.2f}MB, per gpu {blk/tp:.2f}M/{blk*18/tp:.2f}MB')"
emb size: 1029.26M/18526.74MB, per gpu 128.66M/2315.84MB
blk size: 5033.43M/90601.76MB, per gpu 629.18M/11325.22MB
so 105/35=3 PP stages per gpu: 3*11.3 = ~33.9GB of weights/optim states/grads per gpu
To summarize, we can see that these setups load about half of each gpu with weights / optim states / grads (*18).
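The "about half the gpu" observation can be checked for all three reference models at once. A minimal sketch that counts transformer blocks only (the embedding is ignored) at 18 bytes per parameter, with 1GB taken as 10**9 bytes; the helper name and the `models` dict are just for illustration.

```python
GPU_MEM_GB = 80


def weights_gb_per_gpu(nhidden, nlayers, tp, pp):
    """GB of weights/optim states/grads per gpu, transformer blocks only."""
    layer_bytes = (12 * nhidden**2 + 13 * nhidden) * 18 / tp
    return (nlayers / pp) * layer_bytes / 10**9


# name: (NHIDDEN, NLAYERS, TP, PP) as listed above
models = {
    "145B meg-lm": (12288, 80, 8, 8),
    "310B meg-lm": (16384, 96, 8, 16),
    "530B msft": (20480, 105, 8, 35),
}
for name, (h, l, tp, pp) in models.items():
    gb = weights_gb_per_gpu(h, l, tp, pp)
    print(f"{name}: {gb:.1f}GB per gpu = {gb / GPU_MEM_GB:.0%} of {GPU_MEM_GB}GB")
```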
So first let's try to come up with a wider and shallower model to fit 200B, or just a wider one if shallow doesn't work out too well topology/efficiency-wise
199B: 80 x 14336 (layers x hidden)
NHIDDEN=14336; NLAYERS=80; SEQ_LEN=2048; VOCAB_SIZE=150257; python -c "h=$NHIDDEN; l=$NLAYERS; s=$SEQ_LEN; v=$VOCAB_SIZE; print(f'Model size: {(l*(12*h**2 + 13*h) + v*h + s*h + 2*h) / 10**9 :.0f}B, ratio={int(h/l)}')"
Model size: 199B, ratio=179
which gives us:
NHIDDEN=14336; VOCAB_SIZE=150257; TP=8; python -c "h=$NHIDDEN; v=$VOCAB_SIZE; tp=$TP; emb=v*h/10**6; blk=(12*h**2+13*h)/10**6; print(f'emb size: {emb:.2f}M/{emb*18:.2f}MB, per gpu {emb/tp:.2f}M/{emb*18/tp:.2f}MB'); print(f'blk size: {blk:.2f}M/{blk*18:.2f}MB, per gpu {blk/tp:.2f}M/{blk*18/tp:.2f}MB')"
emb size: 2154.08M/38773.52MB, per gpu 269.26M/4846.69MB
blk size: 2466.44M/44395.87MB, per gpu 308.30M/5549.48MB
TP=8, PP=10 - 80 gpus for one replica, can fit DP=4 (320/384)
so with PP=10, we get 80/10 = 8 stages per gpu = 44GB for normal layer gpus and 50GB for the 1st/last gpus due to 5G embedding, the remaining 28GB for activations (2GB is cuda kernels) - could be enough, but not sure.
If we are tight, consider giving the embedding its own layer so the total layers will be NLAYERS+2. In that case NLAYERS needs to be 2 less than the wanted number so that the layers can be spread out evenly across gpus.
Also consider that the more tightly we pack each gpu the more PP stages it'll have - the slower it'll run.
And fewer GPUs means less processing power - so overall it's likely to be slower.
206B: 96 x 13312 (layers x hidden)
NHIDDEN=13312; NLAYERS=96; SEQ_LEN=2048; VOCAB_SIZE=150257; python -c "h=$NHIDDEN; l=$NLAYERS; s=$SEQ_LEN; v=$VOCAB_SIZE; print(f'Model size: {(l*(12*h**2 + 13*h) + v*h + s*h + 2*h) / 10**9 :.0f}B, ratio={int(h/l)}')"
Model size: 206B, ratio=138
NHIDDEN=13312; VOCAB_SIZE=150257; TP=8; python -c "h=$NHIDDEN; v=$VOCAB_SIZE; tp=$TP; emb=v*h/10**6; blk=(12*h**2+13*h)/10**6; print(f'emb size: {emb:.2f}M/{emb*18:.2f}MB, per gpu {emb/tp:.2f}M/{emb*18/tp:.2f}MB'); print(f'blk size: {blk:.2f}M/{blk*18:.2f}MB, per gpu {blk/tp:.2f}M/{blk*18/tp:.2f}MB')"
emb size: 2000.22M/36003.98MB, per gpu 250.03M/4500.50MB
blk size: 2126.69M/38280.33MB, per gpu 265.84M/4785.04MB
TP=8, PP=12 => 96 gpus for one replica, can fit DP=4 (384/384)
96/12 = 8 stages per gpu = ~40GB per gpu, same number of PP stages per gpu and more spare memory
This might be a better fit memory-wise if the one above is too close to being full, especially on gpu 0 and -1.
It also uses the full 384 gpu allocation in a snug way.
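Checking how a TP/PP choice fills the allocation is simple arithmetic, sketched below. The helper name `fit` is hypothetical; it only computes the replica size, the largest DP that fits, and the gpus actually used.

```python
def fit(total_gpus, tp, pp):
    """Return (gpus per replica, max DP, gpus used) for a TP x PP replica."""
    replica = tp * pp
    dp = total_gpus // replica
    return replica, dp, dp * replica


for tp, pp in [(8, 10), (8, 12)]:
    replica, dp, used = fit(384, tp, pp)
    print(f"TP={tp} PP={pp}: {replica} gpus per replica, DP={dp}, {used}/384 gpus used")
```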
The A100 spec is 312 TFLOPS for BF16, so the best we can hope for is probably about 50% of that, i.e. ~150 TFLOPs. We probably won't reach it, so yes, I agree 150 is a bit too optimistic, but let's use it as the best case scenario.
Also we still don't know how many gpus we will end up using, but let's say we use them all - 350. Once we decide on the topology we will be able to replace 350 with the actual number of gpus we plan to use.
$ python -c 'print(f"{8*300*200_000_000/(350*150)/(3600*24):0.2f}", "days")'
105.82 days
so 3.5 months in the best case scenario. But more likely 150-200 days since it'll be less of everything plus potential issues. We will know more once we get access to 1 replica as then we should get a much better TFLOPs estimation, which will then be less for DP>1.
And this estimate is w/o encountering any problems, which is unlikely, so add more overhead for rollbacks and restarts.
Additionally this number is too optimistic since we won't have the full number of GPUs until some time around the end of February.
See Estimate total training time for details of the math.
XXX: actually are we training for 300B or 400B tokens because of Multi-Lingual? in which case it'll be 1/3 longer!
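The estimate is easy to re-run for both token budgets. A small sketch of the same `8 * tokens * params / (gpus * TFLOPs)` formula used in the one-liner above, with explicit units; the function name is illustrative, and 350 gpus at 150 TFLOPs are the same best-case placeholders as above.

```python
def training_days(tokens_b, params_b, gpus, tflops):
    """Best-case training time in days; tokens and params are given in billions."""
    seconds = 8 * (tokens_b * 1e9) * (params_b * 1e9) / (gpus * tflops * 1e12)
    return seconds / (3600 * 24)


for tokens_b in (300, 400):
    days = training_days(tokens_b, params_b=200, gpus=350, tflops=150)
    print(f"{tokens_b}B tokens: {days:0.2f} days")
```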
We currently have about 3M gpu hours left in our allocation.
Let's see how many total gpu hours the best-case estimate needs:
python -c 'print(f"{8*300*200_000_000/150/3600:0.2f}", "compute hours")'
888888.89 compute hours
So if it takes 2x longer than the best case scenario, then we would need about 2M hours, so we are fine there.
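To see how much slack the allocation leaves, the same calculation can be compared against the ~3M gpu hours budget for a few slowdown factors. A rough sketch; the slowdown factors are arbitrary and the function name is made up.

```python
def gpu_hours(tokens_b, params_b, tflops):
    """Total gpu hours at a sustained per-gpu TFLOPs; tokens/params in billions."""
    return 8 * (tokens_b * 1e9) * (params_b * 1e9) / (tflops * 1e12) / 3600


BUDGET_HOURS = 3_000_000  # approximate remaining allocation
best = gpu_hours(300, 200, 150)
for slowdown in (1, 2, 3):
    need = best * slowdown
    status = "ok" if need <= BUDGET_HOURS else "over budget"
    print(f"{slowdown}x slower than best case: {need:,.0f} gpu hours ({status})")
```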
Important nuance:
We will have exclusive access only until May, and from May we will have to share with others.
So at the moment we will have only about 3 months of having access to all gpus.
To measure the best TFLOPs possible use a single node, so that it uses all the intra-node connections (NVLink) and doesn't touch the network:
- 1 node, 1 replica
20B model: TP=8, PP=1, NLAYERS=8, NHIDDEN=14400, NHEADS=32, SEQ_LEN=2048, VOCAB_LENGTH=250k, GBS=2048
iteration 2/ 95367 | consumed samples: 4096 | consumed tokens: 8388608 | elapsed time per iteration (s): 769.99 | learning rate: 3.787E-06 | global batch size: 2048 | lm loss: 6.384045E+01 | loss scale: 4096.0 | grad norm: 15906.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.660 | TFLOPs: 108.47 |
- 10 nodes, 1 replica
200B model: TP=8, PP=10, NLAYERS=80, NHIDDEN=14400, NHEADS=96, SEQ_LEN=2048, VOCAB_LENGTH=250k, GBS=2048
iteration 2/ 95367 | consumed samples: 4096 | consumed tokens: 8388608 | elapsed time per iteration (s): 844.81 | learning rate: 3.787E-06 | global batch size: 2048 | lm loss: 6.373861E+01 | loss scale: 4096.0 | grad norm: 34132.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.424 | TFLOPs: 98.87 |
- 20 nodes, 2 replicas
iteration 2/ 95367 | consumed samples: 4096 | consumed tokens: 8388608 | elapsed time per iteration (s): 430.21 | learning rate: 3.787E-06 | global batch size: 2048 | lm loss: 6.373876E+01 | loss scale: 4096.0 | grad norm: 34027.311 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 4.761 | TFLOPs: 97.07 |
It was puzzling why much less memory was used for identical set up with DP=2 over DP=1 - but it's because of ZeRO-1 that saves a lot of memory across all GPUs!
GPUs | Size | DP | TP | PP | MBS | Mem | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|
8 | 20B | 1 | 8 | 1 | 1 | 67GB | 108.47 | 02-17 |
80 | 200B | 1 | 8 | 10 | 1 | 73GB | 98.87 | 02-17 |
160 | 200B | 2 | 8 | 10 | 1 | 51GB | 97.07 | 02-17 |
*Mem = max memory used by the first (last) nodes with the word embedding matrix - max is 77GB
- 1 node, 1 replica
20B model: TP=8, PP=1, NLAYERS=8, NHIDDEN=14400, NHEADS=32, SEQ_LEN=2048, VOCAB_LENGTH=250k, GBS=2048
iteration 2/ 95367 | consumed samples: 4096 | consumed tokens: 8388608 | elapsed time per iteration (s): 777.09 | learning rate: 3.787E-06 | global batch size: 2048 | lm loss: 6.381926E+01 | loss scale: 1.0 | grad norm: 2.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.635 | TFLOPs: 107.48 |
- 10 nodes, 1 replica
200B model: TP=8, PP=10, NLAYERS=80, NHIDDEN=14400, NHEADS=96, SEQ_LEN=2048, VOCAB_LENGTH=250k, GBS=2048
iteration 2/ 95367 | consumed samples: 4096 | consumed tokens: 8388608 | elapsed time per iteration (s): 853.81 | learning rate: 3.787E-06 | global batch size: 2048 | lm loss: 6.369443E+01 | loss scale: 1.0 | grad norm: 4.461 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.399 | TFLOPs: 97.82 |
- 20 nodes, 2 replicas
iteration 2/ 95367 | consumed samples: 4096 | consumed tokens: 8388608 | elapsed time per iteration (s): 434.14 | learning rate: 3.787E-06 | global batch size: 2048 | lm loss: 6.369444E+01 | loss scale: 1.0 | grad norm: 6.314 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 4.717 | TFLOPs: 96.19 |
GPUs | Size | DP | TP | PP | MBS | Mem | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|
8 | 20B | 1 | 8 | 1 | 1 | 68GB | 107.48 | 02-17 |
80 | 200B | 1 | 8 | 10 | 1 | 75GB | 97.82 | 02-17 |
160 | 200B | 2 | 8 | 10 | 1 | 53GB | 96.19 | 02-17 |
*Mem = max memory used by the first (last) nodes with the word embedding matrix - max is 77GB
So we can load more stages as we get higher DP as ZeRO spreads out over more gpus - smaller shards.
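A back-of-the-envelope sketch of why memory drops as DP grows. It assumes the textbook ZeRO stage 1 accounting, where 12 of the 18 bytes per parameter are optimizer states that get sharded across the DP group and the remaining 6 bytes are replicated; the exact split in this setup may differ, so `sharded_bytes` is left as a parameter.

```python
def bytes_per_param(dp, total_bytes=18, sharded_bytes=12):
    """Per-gpu bytes per parameter when `sharded_bytes` are split across `dp` replicas.

    The 12/6 split is an assumption (standard ZeRO-1 accounting), not a measurement.
    """
    return (total_bytes - sharded_bytes) + sharded_bytes / dp


for dp in (1, 2, 4, 8):
    print(f"DP={dp}: {bytes_per_param(dp):.1f} bytes per param")
```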
The following setting overcomes the hanging, but in general it should lead to slower throughput, since all CUDA operations become synchronous and block until they are done:
export CUDA_LAUNCH_BLOCKING=1
200B, measuring 2nd iter:
GPUs | async | GBS | TFLOPs | Notes |
---|---|---|---|---|
80 | no | 512 | 91.04 | |
80 | yes | 512 | 97.20 | |
160 | no | 512 | 84.59 | |
160 | yes | 512 | 84.44 | |
160 | no | 2048 | 90.29 | |
160 | yes | 2048 | 90.25 | may hang |
320 | no | 2048 | 87.78 | |
320 | yes | 2048 | xxxx | always hangs |
async/yes == CUDA_LAUNCH_BLOCKING=0
Interesting. Sometimes CUDA_LAUNCH_BLOCKING=1 impacts the speed, at other times it doesn't. Perhaps with larger setups it barely has an impact since there is a lot more comms than in the small setup.
Benchmarking the fastest 3D topology. Constraint: can use at most 48 nodes of 8 gpu a100 80gb nodes.
Note that we want not the highest TFLOPs but the highest speed per iteration, since one can get high TFLOPs on fewer GPUs with an overall slower speed, and we only care about how fast we can finish the training.
Also note that the model size isn't always the same, as the number of layers had to be tweaked to fit PP while NHIDDEN was fixed - so speed/TFLOPs can't be exactly compared - but the sizes can be brought back in line by tweaking NHIDDEN. Also, to finish this process quickly I take the snapshot of a single iteration (always the 2nd), so the data isn't exact and can fluctuate a bit. But the point of this exercise is to get a feel for which topology is superior.
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 200B | 12 | 8 | 4 | 1 | 2040 | 47GB | 189.06 | 91.67 | 02-20 |
45 | 200B | 9 | 8 | 5 | 1 | 2043 | 44GB | 208.40 | 88.84 | 02-20 |
48 | 194B | 8 | 8 | 6 | 1 | 2048 | 39GB | 183.64 | 92.38 | 02-20 |
42 | 191B | 6 | 8 | 7 | 1 | 2046 | 39GB | 202.99 | 94.20 | 02-20 |
48 | 200B | 6 | 8 | 8 | 1 | 2046 | 36GB | 185.75 | 93.59 | 02-20 |
45 | 205B | 5 | 8 | 9 | 1 | 2045 | 37GB | 199.14 | 94.23 | 02-20 |
40 | 200B | 4 | 8 | 10 | 1 | 2048 | 35GB | 221.21 | 94.39 | 02-20 |
44 | 195B | 4 | 8 | 11 | 1 | 2048 | 32GB | 197.15 | 92.67 | 02-20 |
48 | 183B | 4 | 8 | 12 | 1 | 2048 | 30GB | 172.40 | 90.84 | 02-20 |
- Sec/it throughput at iteration 2
As you can see the 80GB is totally unnecessary for MBS=1, as we are bound by the compute of each gpu, we barely use half the gpu memory, and trying to pack more on each gpu slows the ensemble down. This is of course thanks to ZeRO which shards all fp32 optim+grad+params over all gpus - so the more gpus you use the less memory is needed to accommodate the same model size, regardless of DP/TP/PP topology. (with MBS=1 that is, so that the activations don't take too much memory)
This table doesn't take into account the batch size rampup, which needs to be divisible by DP as it progresses from 32, 64, ... so really we have an additional constraint of DP % 4 = 0 and GBS % 32 = 0.
This means that from the above list only a few configs are suitable, and these are:
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 194B | 8 | 8 | 6 | 1 | 2048 | 39GB | 183.64 | 92.38 | 02-20 |
40 | 200B | 4 | 8 | 10 | 1 | 2048 | 35GB | 221.21 | 94.39 | 02-20 |
44 | 195B | 4 | 8 | 11 | 1 | 2048 | 32GB | 197.15 | 92.67 | 02-20 |
48 | 183B | 4 | 8 | 12 | 1 | 2048 | 30GB | 172.40 | 90.84 | 02-20 |
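The filtering that produced this shortlist can be written down explicitly. A minimal sketch: the tuples are (Nodes, DP, PP, GBS) copied from the full benchmark table above, and `fits_rampup` is a hypothetical helper that applies the two rampup constraints.

```python
# (nodes, dp, pp, gbs) rows from the topology benchmark table
configs = [
    (48, 12, 4, 2040), (45, 9, 5, 2043), (48, 8, 6, 2048),
    (42, 6, 7, 2046), (48, 6, 8, 2046), (45, 5, 9, 2045),
    (40, 4, 10, 2048), (44, 4, 11, 2048), (48, 4, 12, 2048),
]


def fits_rampup(dp, gbs, dp_multiple=4, gbs_multiple=32):
    """True if the config satisfies DP % 4 == 0 and GBS % 32 == 0."""
    return dp % dp_multiple == 0 and gbs % gbs_multiple == 0


suitable = [c for c in configs if fits_rampup(dp=c[1], gbs=c[3])]
for nodes, dp, pp, gbs in suitable:
    print(f"nodes={nodes} DP={dp} PP={pp} GBS={gbs}")
```

Note that the DP=12 row passes the DP check but is dropped because its GBS of 2040 is not a multiple of 32.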
Increasing MBS will speed things up a bit and we have a ton of spare memory to accommodate a larger MBS, but we have to ensure we get the batch size rampup sorted out. So if the rampup steps are in increments of 32, then with DP=4 the highest MBS is 8 (32/4 = 8 samples per replica), and MBS has to be a power of 2 so that it divides those 8 samples evenly.
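The usable MBS values follow from the smallest rampup batch. A small sketch, assuming each replica must be able to split its share of that smallest GBS into whole micro-batches; the helper name is made up for illustration.

```python
def valid_mbs(first_gbs, dp):
    """Micro-batch sizes that evenly divide one replica's share of the smallest GBS."""
    per_replica = first_gbs // dp
    return [m for m in range(1, per_replica + 1) if per_replica % m == 0]


print("DP=4:", valid_mbs(32, 4))
print("DP=8:", valid_mbs(32, 8))
```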
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 194B | 8 | 8 | 6 | 1 | 2048 | 39GB | 183.64 | 92.38 | 02-20 |
48 | 194B | 8 | 8 | 6 | 2 | 2048 | 45GB | 172.36 | 98.43 | 02-20 |
48 | 194B | 8 | 8 | 6 | 4 | 2048 | 56GB | 173.92 | 97.55 | 02-20 |
48 | 194B | 8 | 8 | 6 | 8 | 2048 | 75GB | 192.42 | 88.17 | 02-20 |
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
40 | 200B | 4 | 8 | 10 | 1 | 2048 | 35GB | 221.21 | 94.39 | 02-20 |
40 | 200B | 4 | 8 | 10 | 2 | 2048 | 43GB | 207.92 | 100.43 | 02-20 |
40 | 200B | 4 | 8 | 10 | 4 | 2048 | 55GB | 208.18 | 100.30 | 02-20 |
40 | 200B | 4 | 8 | 10 | 8 | 2048 | 76GB | 229.69 | 90.91 | 02-20 too close to OOM |
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
44 | 195B | 4 | 8 | 11 | 1 | 2048 | 32GB | 197.15 | 92.67 | 02-20 |
44 | 195B | 4 | 8 | 11 | 2 | 2048 | 41GB | 186.65 | 97.89 | 02-20 |
44 | 195B | 4 | 8 | 11 | 4 | 2048 | 53GB | 185.79 | 98.34 | 02-20 |
44 | 195B | 4 | 8 | 11 | 8 | 2048 | 75GB | 206.62 | 88.42 | 02-20 |
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 183B | 4 | 8 | 12 | 1 | 2048 | 30GB | 172.40 | 90.84 | 02-20 |
48 | 183B | 4 | 8 | 12 | 2 | 2048 | 39GB | 161.96 | 96.69 | 02-20 |
48 | 183B | 4 | 8 | 12 | 4 | 2048 | 50GB | 163.32 | 95.89 | 02-20 |
The models are slightly different in size so can't compare absolute numbers.
But clearly MBS=2 is about the best, MBS=4 is close by.
If we utilize all 48 nodes then we have PP6 and PP12 as contenders.
A100 80GB has 108 SMs
nhidden % 128 = 0
nhidden % 108 = 0
TP=8:
nhidden % 8 = 0
Combining all 3:
nhidden = 108*8*c = 864*c
which gives 864*16 = 13824 (187B) => so let's try to compare with 14400 (200B)
XXX: This is a total guesstimate - need proper math
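The divisibility checks are easy to enumerate. This sketch simply lists the NHIDDEN values in a range that satisfy all three conditions above at once, and shows which of them 14400 fails; it says nothing about whether these conditions are the right ones, which is still the open question.

```python
def candidates(lo, hi, multiples=(128, 108, 8)):
    """NHIDDEN values in [lo, hi] divisible by every number in `multiples`."""
    return [h for h in range(lo, hi + 1) if all(h % m == 0 for m in multiples)]


print(candidates(12000, 15000))
print({m: 14400 % m for m in (128, 108, 8)})  # non-zero remainder = constraint violated
```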
Nodes | Size | NHIDDEN | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
40 | 200B | 14400 | 4 | 8 | 10 | 1 | 2048 | 35GB | 221.21 | 94.39 | 02-20 |
40 | 187B | 13824 | 4 | 8 | 10 | 1 | 2048 | 33GB | 160.29 | 120.05 | 02-20 |
40 | 187B | 13824 | 4 | 8 | 10 | 2 | 2048 | 39GB | 151.07 | 127.38 | 02-20 |
40 | 187B | 13824 | 4 | 8 | 10 | 4 | 2048 | 53GB | 147.43 | 130.53 | 02-20 |
40 | 187B | 13824 | 4 | 8 | 10 | 8 | 2048 | 73GB | 152.51 | 126.18 | 02-20 |
Until now we used an estimated TFLOPs calculator which was under-reporting the real TFLOPs. And we couldn't compare those to the TFLOPs reported by Megatron-LM.
Deepak Narayanan fixed this here: bigscience-workshop/Megatron-DeepSpeed#251
So from here on all the TFLOPs reports will be about 3% higher - so they can't be exactly compared to the earlier numbers in this document.
So we have 2 set ups that fit well into 48 nodes - and that's PP=6/DP=8 or PP=12/DP=4
NHIDDEN=14336 / NLAYERS=72
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 181B | 4 | 8 | 12 | 1 | 2048 | 29GB | 143.31 | 112.49 | 02-21 |
48 | 181B | 4 | 8 | 12 | 2 | 2048 | 37GB | 134.02 | 120.29 | 02-21 |
48 | 181B | 4 | 8 | 12 | 4 | 2048 | 49GB | 123.69 | 130.34 | 02-21 |
48 | 181B | 4 | 8 | 12 | 8 | 2048 | 69GB | 129.26 | 124.72 | 02-21 |
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 181B | 8 | 8 | 6 | 1 | 2048 | 38GB | 139.82 | 115.31 | 02-21 |
48 | 181B | 8 | 8 | 6 | 2 | 2048 | 44GB | 131.02 | 123.05 | 02-21 |
48 | 181B | 8 | 8 | 6 | 4 | 2048 | 56GB | 121.48 | 132.71 | 02-21 |
So it's either:
- DP=4, PP=12, MBS=4: 123 secs/it | 130 TFLOPS
- DP=8, PP=06, MBS=4: 121 secs/it | 133 TFLOPS
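Since iteration time is what matters, it helps to translate sec/it into wall-clock days. A rough sketch that takes the total of 95367 iterations from the log lines earlier in this document and assumes a constant GBS=2048, i.e. it ignores the batch size rampup and any restarts.

```python
TOTAL_ITERATIONS = 95367  # from the "iteration 2/ 95367" log lines above


def training_days(sec_per_iteration, iterations=TOTAL_ITERATIONS):
    """Wall-clock days if every iteration took `sec_per_iteration` seconds."""
    return iterations * sec_per_iteration / (3600 * 24)


for name, sec in [("DP=4, PP=12, MBS=4", 123.69), ("DP=8, PP=6, MBS=4", 121.48)]:
    print(f"{name}: {training_days(sec):.1f} days")
```

At this scale a difference of about 2 seconds per iteration adds up to roughly two and a half days of training.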
Let's compare again with another setup:
NHIDDEN=13824 / NLAYERS=84
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 196B | 4 | 8 | 12 | 2 | 2048 | 39GB | 143.89 | 121.45 | 02-21 |
48 | 196B | 4 | 8 | 12 | 4 | 2048 | 52GB | 133.12 | 131.27 | 02-21 |
48 | 196B | 8 | 8 | 6 | 2 | 2048 | 65GB | 141.41 | 123.58 | 02-21 |
48 | 196B | 8 | 8 | 6 | 4 | 2048 | 56GB | 130.31 | 134.11 | 02-21 |
This one has ~17% more layers (84 vs 72) than the previous tables, so here the setup with fewer PP stages wins, that is:
- DP=8, PP=06, MBS=4: 130.31 secs/it | 134.11 TFLOPS
The following has so far given the highest TFLOPs, as we are packing more into fewer GPUs, so 64 gpus are left out, and of course the time per iteration is much slower. So the key is the iteration speed and not TFLOPs.
NHIDDEN=13824 / NLAYERS=80
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
40 | 187B | 4 | 8 | 10 | 4 | 2048 | GB | 147.04 | 135.92 | 02-21 |
Max possible TFLOPs check for NHIDDEN=14336:
NHIDDEN=14336 / NLAYERS=6 / GBS=512
Nodes | Size | Layers | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
1 | 18B | 6 | 8 | 1 | 2 | 2048 | 54GB | 130.43 | 143.48 | 02-21 |
1 | 18B | 6 | 8 | 1 | 2 | 2048 | 54GB | 119.19 | 157.02 | 02-21 |
1 | 18B | 10 | 8 | 1 | 1 | 2048 | 80GB | 205.52 | 142.59 | 02-21 |
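For context, these single-node numbers can be expressed as a fraction of the 312 TFLOPS BF16 peak quoted earlier. A trivial sketch; the measured values are copied from the table above.

```python
A100_BF16_PEAK_TFLOPS = 312


def utilization(measured_tflops, peak=A100_BF16_PEAK_TFLOPS):
    """Fraction of the theoretical peak reached by a measured TFLOPs figure."""
    return measured_tflops / peak


for tflops in (143.48, 157.02, 142.59):
    print(f"{tflops} TFLOPs = {utilization(tflops):.1%} of peak")
```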
Trying with ZeRO_STAGE=0/1
NHIDDEN=14336 / NLAYERS=72
Nodes | Size | ZS | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 181B | 1 | 4 | 8 | 12 | 2 | 2048 | 37GB | 134.02 | 120.29 | 02-21 |
48 | 181B | 0 | 4 | 8 | 12 | 2 | 2048 | 72GB | 137.34 | 113.02 | 02-21 |
- ZS = ZERO_STAGE
XXX: currently can't test ZeRO_STAGE=0 on master, or ZeRO_STAGE=1 on the special branch - so need to retest the above on the same branch.
all NHEADS=64 (above too)
NHIDDEN=12288 / NLAYERS=96
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 177B | 8 | 8 | 6 | 2 | 2048 | GB | 136.73 | 115.73 | 02-23 |
48 | 177B | 8 | 8 | 6 | 4 | 2048 | GB | 122.96 | 128.69 | 02-23 |
NHIDDEN=13312 / NLAYERS=84
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 182B | 4 | 8 | 12 | 4 | 2048 | GB | 125.52 | 129.29 | 02-23 |
48 | 182B | 8 | 8 | 6 | 2 | 2048 | GB | 135.55 | 119.72 | 02-23 |
48 | 182B | 8 | 8 | 6 | 4 | 2048 | GB | 122.93 | 132.00 | 02-23 |
NHIDDEN=13824 / NLAYERS=78
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 182B | 8 | 8 | 6 | 4 | 2048 | GB | 121.28 | 133.93 | 02-23 |
NHIDDEN=14336 / NLAYERS=72
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 181B | 4 | 8 | 12 | 4 | 2048 | GB | 123.79 | 130.24 | 02-23 |
48 | 181B | 8 | 8 | 6 | 4 | 2048 | GB | 120.85 | 133.40 | 02-23 |
NHIDDEN=14336 / NLAYERS=72
There are not many variations around 100, as 14336 = 2**11*7 and the constraint is (HEADS/TP)*MBS % 4 = 0, or, for MBS=4, TP=8: HEADS % 16 = 0
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 181B | 8 | 8 | 6 | 4 | 16 | 2048 | 54GB | 121.03 | 133.20 | 02-24 |
48 | 181B | 8 | 8 | 6 | 4 | 32 | 2048 | 55GB | 124.01 | 130.00 | 02-23 |
48 | 181B | 8 | 8 | 6 | 4 | 64 | 2048 | 55GB | 120.18 | 134.15 | 02-23 |
48 | 181B | 8 | 8 | 6 | 4 | 112 | 2048 | 53GB | 138.72 | 116.21 | 02-23 |
48 | 181B | 8 | 8 | 6 | 4 | 128 | 2048 | 55GB | 124.89 | 129.08 | 02-23 |
48 | 181B | 8 | 8 | 6 | 4 | 256 | 2048 | 54GB | 132.85 | 121.35 | 02-24 |
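The head counts worth testing can be enumerated from the factorization. A minimal sketch: it keeps values that divide NHIDDEN evenly (so that the per-head dimension is an integer) and that satisfy the HEADS % 16 = 0 rule stated above for MBS=4, TP=8. The search range of 16 to 256 is an arbitrary choice.

```python
def head_candidates(nhidden, lo=16, hi=256, multiple=16):
    """Head counts in [lo, hi] that divide nhidden and are a multiple of `multiple`."""
    return [n for n in range(lo, hi + 1) if n % multiple == 0 and nhidden % n == 0]


for nhidden in (14336, 13824):
    print(nhidden, head_candidates(nhidden))
```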
NHIDDEN=13824 / NLAYERS=78
here 13824 = 2**9*3**3
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 182B | 8 | 8 | 6 | 4 | 64 | 2048 | GB | 121.28 | 133.93 | 02-23 |
48 | 182B | 8 | 8 | 6 | 4 | 96 | 2048 | 59GB | 124.75 | 130.21 | 02-23 |
48 | 182B | 8 | 8 | 6 | 4 | 128 | 2048 | 54GB | 162.72 | 99.82 | 02-23 |
NHEADS=108 breaks constraints for invoking optimized fused softmax kernel
NHIDDEN=13312 / NLAYERS=84
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 182B | 8 | 8 | 6 | 4 | 64 | 2048 | GB | 122.93 | 132.00 | 02-23 |
48 | 182B | 8 | 8 | 6 | 4 | 128 | 2048 | GB | 129.17 | 125.63 | 02-23 |
NHIDDEN=12288 / NLAYERS=96
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 177B | 8 | 8 | 6 | 4 | 64 | 2048 | GB | 122.96 | 128.69 | 02-24 |
48 | 177B | 8 | 8 | 6 | 4 | 96 | 2048 | GB | 145.40 | 108.83 | 02-24 |
48 | 177B | 8 | 8 | 6 | 4 | 128 | 2048 | GB | 129.42 | 122.27 | 02-24 |
Note: A100s PCI-Express/NUMA was improved today so all TFLOPs have changed for the better (1-5%) - thus do not compare today's numbers to yesterday's.
NLAYERS=72 NHIDDEN=14336 NHEADS=64
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 181B | 8 | 8 | 6 | 4 | 1568 | 56GB | 113.01 | 109.22 | 02-25 |
48 | 181B | 8 | 8 | 6 | 4 | 2048 | 55GB | 114.11 | 141.27 | 02-25 |
48 | 181B | 8 | 8 | 6 | 6 | 2016 | 66GB | 123.57 | 128.43 | 02-25 |
48 | 181B | 4 | 8 | 12 | 4 | 1568 | GB | 92.75 | 133.08 | 02-25 |
48 | 181B | 4 | 8 | 12 | 4 | 2048 | 49GB | 117.07 | 137.70 | 02-25 |
48 | 181B | 4 | 8 | 12 | 2 | 1568 | GB | 99.93 | 123.51 | 02-25 |
48 | 181B | 4 | 8 | 12 | 2 | 2048 | GB | 128.82 | 125.15 | 02-25 |
some more configs with lower PP:
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 181B | 6 | 8 | 8 | 4 | 2016 | 52GB | 113.16 | 140.24 | 02-25 |
48 | 181B | 12 | 8 | 4 | 2 | 2016 | 53GB | 125.52 | 126.43 | 02-25 |
48 | 181B | 12 | 8 | 4 | 4 | 2016 | 59GB | 114.81 | 138.22 | 02-25 |
48 | 181B | 24 | 8 | 2 | 1 | 2016 | 65GB | 145.45 | 109.11 | 02-25 |
48 | 181B | 24 | 8 | 2 | 2 | 2016 | 76GB | 136.13 | 116.58 | 02-25 |
48 | 181B | 48 | 8 | 1 | 1 | 2016 | OOM | | | 02-25 |
Tweaking TP for the first time, moving away from the assumption that TP=8 is best. If the model fits into a smaller TP it should be faster!
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 181B | 8 | 4 | 12 | 4 | 2048 | 60GB | 111.89 | 144.08 | 02-25 |
48 | 181B | 8 | 4 | 12 | 2 | 2048 | 44GB | 110.48 | 145.92 | 02-25 |
48 | 181B | 8 | 4 | 12 | 2 | 2048 | 38GB | 113.54 | 141.99 | 02-25 |
48 | 181B | 16 | 4 | 6 | 4 | 2048 | 75GB | 117.11 | 137.66 | 02-25 |
48 | 181B | 16 | 4 | 6 | 2 | 2048 | 57GB | 111.71 | 144.31 | 02-25 |
48 | 181B | 16 | 2 | 12 | 2 | 2048 | 63GB | 112.50 | 143.30 | 02-25 |
48 | 181B | 32 | 2 | 6 | 2 | 2048 | OOM | | | 02-25 |
48 | 181B | 32 | 2 | 6 | 1 | 2048 | OOM | | | 02-25 |
48 | 181B | 8 | 2 | 24 | 1 | 2048 | 44GB | 119.53 | 134.88 | 02-25 |
48 | 181B | 8 | 2 | 24 | 2 | 2048 | 53GB | 122.75 | 131.33 | 02-25 |
48 | 181B | 4 | 4 | 24 | 1 | 2048 | GB | 130.60 | 123.44 | 02-25 |
NHIDDEN=12288 / NLAYERS=96
Nodes | Size | DP | TP | PP | MBS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|
48 | 177B | 8 | 1 | 48 | 1 | 2048 | 58GB | 142.17 | 111.30 | 02-25 |
to retest with TP<8 variations
NHIDDEN=13824 / NLAYERS=78
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 182B | 8 | 4 | 12 | 1 | 64 | 2048 | | 148.24 | 109.57 | 02-26 |
48 | 182B | 8 | 4 | 12 | 2 | 64 | 2048 | 48GB | 103.51 | 156.92 | 02-26 |
48 | 182B | 8 | 4 | 12 | 2 | 96 | 2048 | 48GB | 107.12 | 151.64 | 02-26 |
48 | 182B | 8 | 4 | 12 | 2 | 128 | 2048 | | 147.41 | 110.19 | 02-26 |
48 | 182B | 8 | 4 | 12 | 4 | 64 | 2048 | | 106.72 | 152.21 | 02-26 |
48 | 182B | 8 | 4 | 12 | 4 | 96 | 2048 | | 110.31 | 147.25 | 02-26 |
48 | 182B | 8 | 4 | 12 | 4 | 128 | 2048 | | 153.90 | 105.54 | 02-26 |
48 | 182B | 8 | 8 | 6 | 4 | 96 | 2048 | | 118.12 | 137.51 | 02-26 |
48 | 182B | 8 | 8 | 6 | 4 | 128 | 2048 | | 156.84 | 103.56 | 02-26 |
NHIDDEN=14336 / NLAYERS=72
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 181B | 8 | 4 | 12 | 2 | 64 | 2048 | | 110.42 | 146.00 | 02-26 |
48 | 181B | 8 | 4 | 12 | 2 | 128 | 2048 | | 114.02 | 141.39 | 02-26 |
48 | 181B | 8 | 4 | 12 | 4 | 128 | 2048 | | 137.53 | 117.23 | 02-26 |
48 | 181B | 8 | 8 | 6 | 4 | 64 | 2048 | | 113.95 | 141.47 | 02-26 |
48 | 181B | 8 | 8 | 6 | 4 | 128 | 2048 | | 116.06 | 138.90 | 02-26 |
NHIDDEN=13312 / NLAYERS=84
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 182B | 8 | 4 | 12 | 2 | 64 | 2048 | | 103.82 | 156.46 | 02-26 |
48 | 182B | 8 | 4 | 12 | 4 | 64 | 2048 | | 113.21 | 143.34 | 02-26 |
48 | 182B | 8 | 8 | 6 | 2 | 64 | 2048 | | 129.61 | 125.21 | 02-26 |
NHIDDEN=13824 / NLAYERS=78
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 182B | 8 | 4 | 12 | 2 | 96 | 512 | | 35.77 | 113.52 | 02-26 |
48 | 182B | 8 | 4 | 12 | 2 | 96 | 1024 | | 59.65 | 136.15 | 02-26 |
48 | 182B | 8 | 4 | 12 | 2 | 96 | 1536 | | 83.11 | 146.59 | 02-26 |
48 | 182B | 8 | 4 | 12 | 2 | 96 | 2048 | | 107.12 | 151.64 | 02-26 |
78/12=6.5 - so the last stage has 1 block, while the rest have 7 - which is uneven. So that config is not optimal as it wastes gpus.
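The uneven split can be checked with a few lines of Python. This is a minimal sketch which assumes that each stage greedily takes `ceil(NLAYERS / PP)` blocks until none are left, which reproduces the 7/1 split described above; the framework's actual partitioner may distribute blocks differently, and `split_layers` is a hypothetical helper, not part of Megatron-DeepSpeed.

```python
import math


def split_layers(nlayers, pp):
    """Assign transformer blocks to pipeline stages, ceil(nlayers/pp) at a time.

    Assumption: greedy fill from the first stage onwards; this is only an
    illustration, not the framework's partitioning code.
    """
    per_stage = math.ceil(nlayers / pp)
    stages = []
    remaining = nlayers
    for _ in range(pp):
        take = min(per_stage, remaining)
        stages.append(take)
        remaining -= take
    return stages


# 78 layers over 12 stages: 11 stages hold 7 blocks, the last one only 1
print(split_layers(78, 12))
# 84 layers over 12 stages divides evenly: 7 blocks on every stage
print(split_layers(84, 12))
```

A nearly empty last stage still has to wait for the full ones on every micro-batch, which is where the wasted GPU time comes from.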
NHIDDEN=13824 / NLAYERS=78
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 182B | 8 | 8 | 6 | 2 | 96 | 2048 | GB | 133.57 | 121.61 | 02-27 |
48 | 182B | 8 | 8 | 6 | 4 | 96 | 2048 | 59GB | 118.24 | 137.38 | 02-27 |
48 | 182B | 16 | 4 | 6 | 2 | 96 | 2048 | GB | | | 02-27 |
48 | 182B | 16 | 4 | 6 | 4 | 96 | 2048 | 75GB | 115.55 | 140.57 | 02-27 |
NHIDDEN=12288; NLAYERS=106 (regex partition_method='type:transformer|embed')
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 195B | 8 | 4 | 12 | 2 | 96 | 2048 | 44GB | 112.69 | 154.86 | 02-27 |
48 | 195B | 8 | 4 | 12 | 2 | 64 | 2048 | GB | 110.96 | 157.27 | 02-27 |
Do not compare these numbers to the previous ones. For 2 reasons:
- First, from now on the testing is happening with the BF16 optimizer that was just written to accumulate gradients in fp32, so it is more memory heavy and a bit slower - this is compared to fp16, which accumulates gradients in fp16. The additional memory usage is 4 bytes x params and it's not sharded across gpus.
- I implemented and enabled `--pp-partition-method 'type:transformer|embedding'` so we use 2 layers less, to match the `2+nlayers*PP` math and get a perfect balance, giving each embedding layer its own slot on par with transformer layers. This is because the 250k embedding matrix takes as much space as a single transformer layer.
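A quick way to check whether a given layer count balances once the embeddings get their own slots is sketched below. It assumes that each of the 2 embedding layers (front and tied end) counts as exactly one slot and that slots are spread evenly over the PP stages; `slots_per_stage` is a hypothetical helper.

```python
def slots_per_stage(nlayers, pp, n_embed_slots=2):
    """Return the number of slots per pipeline stage, or None if uneven.

    Assumption: each embedding layer occupies one slot, the same as a
    transformer layer.
    """
    total = nlayers + n_embed_slots
    if total % pp != 0:
        return None
    return total // pp


# NLAYERS=106 with PP=12: (106 + 2) / 12 = 9 slots on every stage
print(slots_per_stage(106, 12))
# NLAYERS=108 with PP=12: 110 slots can't be split evenly, so None
print(slots_per_stage(108, 12))
```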
NHIDDEN=12288; NLAYERS=106; Model size: 195B, ratio=115
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 195B | 8 | 4 | 12 | 2 | 64 | 2048 | 67GB | 116.54 | 149.75 | 02-28 |
48 | 195B | 8 | 4 | 12 | 2 | 96 | 2048 | 65GB | 118.79 | 146.90 | 02-28 |
48 | 195B | 8 | 4 | 12 | 2 | 128 | 2048 | 67GB | 121.42 | 143.73 | 02-28 |
48 | 195B | 8 | 4 | 12 | 4 | 96 | 2048 | 79GB | 120.34 | 145.01 | 02-28 |
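The "Model size" and "ratio" figures in these table captions can be reproduced with the model size formula given earlier in this document. A minimal Python version follows; the `vocab_size=250_000` default is an assumption based on the 250k vocabulary mentioned in these notes, and `model_size_b` is a hypothetical name.

```python
def model_size_b(nhidden, nlayers, seq_len=2048, vocab_size=250_000):
    """Approximate parameter count in billions.

    Same formula as the shell one-liner earlier in this document;
    vocab_size=250_000 is an assumed round number.
    """
    h, l, s, v = nhidden, nlayers, seq_len, vocab_size
    return (l * (12 * h**2 + 13 * h) + v * h + s * h + 2 * h) / 10**9


for h, l in [(12288, 106), (12288, 100), (13312, 82), (14336, 70)]:
    size = model_size_b(h, l)
    print(f"NHIDDEN={h}; NLAYERS={l}; Model size: {size:.0f}B, ratio={int(h / l)}")
```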
NHIDDEN=12288; NLAYERS=100; Model size: 184B, ratio=122
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 184B | 16 | 4 | 6 | 2 | 64 | 2048 | OOM | x | x | 02-28 |
48 | 184B | 16 | 4 | 6 | 1 | 64 | 2048 | OOM | x | x | 02-28 |
48 | 184B | 8 | 8 | 6 | 2 | 64 | 2048 | 61GB | 139.72 | 117.91 | 02-28 |
48 | 184B | 8 | 8 | 6 | 4 | 64 | 2048 | 72GB | 120.96 | 136.20 | 02-28 |
NHIDDEN=13312; NLAYERS=82; Model size: 178B, ratio=162
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 178B | 4 | 8 | 12 | 4 | 64 | 2048 | 52GB | 111.79 | 141.76 | 02-28 |
48 | 178B | 8 | 4 | 12 | 2 | 64 | 2048 | 63GB | 104.45 | 151.71 | 02-28 |
48 | 178B | 8 | 4 | 12 | 2 | 104 | 2048 | 62GB | 123.71 | 128.10 | 02-28 |
48 | 178B | 8 | 4 | 12 | 2 | 128 | 2048 | 60GB | 108.78 | 145.68 | 02-28 |
48 | 178B | 8 | 4 | 12 | 4 | 64 | 2048 | 74GB | 104.82 | 151.18 | 02-28 |
NHIDDEN=13312; NLAYERS=94; Model size: 203B, ratio=141
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 203B | 8 | 4 | 12 | 2 | 128 | 2048 | 67GB | 124.10 | 146.12 | 02-28 |
NHIDDEN=14336; NLAYERS=70; Model size: 176B, ratio=204
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 176B | 4 | 8 | 12 | 2 | 64 | 2048 | 40GB | 121.63 | 128.92 | 02-28 |
48 | 176B | 8 | 4 | 12 | 2 | 64 | 2048 | 59GB | 102.03 | 153.68 | 02-28 |
48 | 176B | 8 | 4 | 12 | 2 | 112 | 2048 | 59GB | 104.50 | 150.05 | 02-28 |
48 | 176B | 8 | 4 | 12 | 2 | 128 | 2048 | 60GB | 105.89 | 148.08 | 02-28 |
48 | 176B | 8 | 4 | 12 | 4 | 64 | 2048 | 73GB | 102.27 | 153.33 | 02-28 |
NHIDDEN=14336; NLAYERS=82; Model size: 206B, ratio=174
Nodes | Size | DP | TP | PP | MBS | NHEADS | GBS | Mem | Sec/it | TFLOPs | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
48 | 206B | 8 | 4 | 12 | 2 | 128 | 2048 | OOM | | | 02-28 |
(was quickly getting the memory snapshot with: `pdsh -w jean-zay-iam01 "source ~/.pdshrc; nvidia-smi"`)
Here we are dealing with 320-384 A100 GPUs working in ensemble.
It appears that the system can't handle heavy NCCL traffic or something of the sort. It can handle a model under 100B over 40 nodes (TP=8/PP=10/DP=4). It can handle 200B over 10 nodes. At 100B over 20-40 nodes random GPUs stop responding and the whole system hangs until it times out. I was able to test with the same NHIDDEN while growing the model on the layer dimension:
- 10 layers - 25B works
- 20 layers - 50B works
- 40 layers - 100B hangs after succeeding iteration 1
I was just starting to diagnose on the hidden dimension when 13 of the 52 nodes went down, so I can't continue with this line of work: 40 nodes gave me a reliable failure, while 20 nodes fail only intermittently, which is not good for diagnosing.
This is for a single replica of 10 nodes with 200B model + 250k vocab.
I think the nodes that crashed and didn't recover are prime suspects for having internal problems, even though when I tested in groups of 10 nodes everything was dandy - note: with the same 200B model. One more data point: Deepspeed ZeRO shards data over all gpus, so the more GPUs are involved the more communication happens. This is totally orthogonal to DP.
The next day:
Most of the nodes have come back this morning so continuing the dimensional growing experiments.
As a reminder, growing on the layer dimension while keeping hidden at `1024*14` worked until 40 layers were reached, where it was hanging. So it couldn't handle a 100B model in this dimension.
Now I'm keeping the layers dimension frozen at 80 and growing the nhidden dimension, starting from `1024*4` - proving that it works and then incrementing the size until it hangs:
- `1024*10` works (100B model)
- `1024*12` hangs (145B model)
So these 2 experiments both show that when the inter-node traffic exceeds a certain level, the system fails.
So it's not the size of each `all_reduce`/`broadcast` packet, since at full NHIDDEN but only 1/4 of the layers everything is just fine.
And BTW to get a quick success/failure indication I'm working with `GLOBAL_BATCH_SIZE=64`, so PP is very inefficient, but it doesn't matter for the purpose of this experiment.
Using `py-spy` on the processes to dump python call stacks I have derived the same story on each node:
On each node with TP=8 - i.e. each node is only TP - the same situation: (checked nodes 0 and 1 only)
6 processes are in:
Thread 835990 (active): "MainThread"
train (megatron/training.py:915)
pretrain (megatron/training.py:187)
<module> (pretrain_gpt.py:239)
2 processes are in:
Thread 835995 (active): "MainThread"
broadcast (torch/distributed/distributed_c10d.py:1191)
_aggregate_total_loss (deepspeed/runtime/pipe/engine.py:540)
train_batch (deepspeed/runtime/pipe/engine.py:330)
train_step (megatron/training.py:436)
train (megatron/training.py:851)
pretrain (megatron/training.py:187)
<module> (pretrain_gpt.py:239)
so 6 processes finished `train_step` and now are trying to:
torch.distributed.all_reduce(
    done_cuda, op=torch.distributed.ReduceOp.MAX)
but for some reason 2 processes never finished the `train_step` and are stuck broadcasting, I presume, to the other 6 processes, which are long gone.
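To illustrate why this kind of mismatch hangs forever, here is a toy simulation using only the Python standard library. It is a sketch, not the real NCCL behavior: each `threading.Barrier` stands in for a collective that completes only when all 8 ranks enter it, and the timeout stands in for the distributed timeout. All names are hypothetical.

```python
import threading


def simulate_mismatched_collectives(n_in_all_reduce=6, n_in_broadcast=2, timeout=0.5):
    """Simulate ranks that are waiting in two different collectives.

    Assumption: a collective behaves like a barrier over all ranks, so it can
    only complete if every rank enters the same one.
    """
    world = n_in_all_reduce + n_in_broadcast
    all_reduce = threading.Barrier(world)
    broadcast = threading.Barrier(world)
    results = {}

    def rank(idx, barrier, name):
        try:
            barrier.wait(timeout=timeout)
            results[idx] = f"{name}: done"
        except threading.BrokenBarrierError:
            # not all ranks showed up in this collective before the timeout
            results[idx] = f"{name}: hung (timed out)"

    threads = [
        threading.Thread(target=rank, args=(i, all_reduce, "all_reduce"))
        for i in range(n_in_all_reduce)
    ]
    threads += [
        threading.Thread(target=rank, args=(n_in_all_reduce + i, broadcast, "broadcast"))
        for i in range(n_in_broadcast)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


for idx, status in sorted(simulate_mismatched_collectives().items()):
    print(idx, status)
```

With the 6/2 split neither collective ever gets all of its participants, so every rank times out; with all 8 ranks in the same collective everything completes.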
So this hanging happens partially in Deepspeed and partially in Megatron-LM; somehow processes get out of sync even though everything works just fine on a smaller scale. The issue could be brought on by apex's `FusedAdam`, as we dealt with a serious issue in it a week earlier, but it could also be pytorch, NCCL or some internal system issue. It's very hard to find the cause.
As I shared earlier the problem doesn't exist or goes away if either of 2 things happens:
- the model is under 100B (short stack of layers or narrow hidden) and 20 or more nodes are used in a single job
- `CUDA_LAUNCH_BLOCKING=1`
Topology is TP=8, PP=10, DP=4
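For reference, the GPU and node counts follow directly from the topology. A tiny sketch, assuming 8 GPUs per node as stated earlier in this document (`gpus_and_nodes` is a hypothetical helper):

```python
def gpus_and_nodes(tp, pp, dp, gpus_per_node=8):
    """Total GPUs and nodes used by a TP x PP x DP topology."""
    gpus = tp * pp * dp
    return gpus, gpus // gpus_per_node


print(gpus_and_nodes(8, 10, 4))  # the hanging setup: 320 GPUs on 40 nodes
print(gpus_and_nodes(4, 12, 8))  # the 48-node setup from the tables: 384 GPUs
```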
It has been very difficult to work on diagnosing this issue since every time I run the hanging setup I would lose a few nodes and since I'm 10h behind JeanZay, nobody is around there to reboot the nodes.
So first of all it appears that `CUDA_LAUNCH_BLOCKING=1` removes the hanging issue. I did several performance checks and it surprisingly has no impact on this framework at this scale. Normally, it should make things much slower as it makes CUDA ops synchronous.
After discussing this issue with Samyam I first ran `py-spy` on all processes, but alas several processes weren't responding, so we had no idea how to tell where they were hanging.
For posterity here is the process:
In one console, first allocate the gpus:
salloc --partition=gpu_p5 --constraint=a100 --reservation=hug --nodes=2 --ntasks-per-node=1 --cpus-per-task=64 --hint=nomultithread --gres=gpu:8 --time 20:00:00 --account=six@a100
We are doing that so that if SLURM kills the processes we could still access those.
Now run the training job, which calls the main `srun` with all the gpus:
bash 200B-n40-bf16-mono.slurm
Wait till the program hangs.
Now in another console get the `SLURM_JOBID` (or get it from the `salloc` log):
squeue -u `whoami` -o "%.16i %.9P %.26j %.8T %.10M %.8l %.6D %.20S %R"
Adjust jobid with `SLURM_JOBID` from above:
srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'ps aux | grep python | egrep -v "grep|srun" | grep `whoami` | awk "{print \$2}" | xargs -I {} py-spy dump --native --pid {}' || echo "failed"
Must use `--gres=gpu:0` for the monitor `srun`, otherwise it will block until the first `srun` exits.
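With one dump per node it helps to summarize where every process is sitting instead of reading the files by hand. The following is a rough sketch that assumes the thread header format shown in the stacks above (`Thread <id> (active): "MainThread"`) and counts the innermost frame of each main thread; `innermost_frames` is a hypothetical helper and other py-spy versions may print slightly different headers.

```python
import re
from collections import Counter

# Thread header as printed in the dumps above, e.g.:
#   Thread 835990 (active): "MainThread"
THREAD_HEADER = re.compile(r'^Thread \S+ \([^)]*\): "MainThread"')


def innermost_frames(dump_text):
    """Count the innermost frame of every MainThread found in py-spy output."""
    counts = Counter()
    lines = [line.strip() for line in dump_text.splitlines()]
    for i, line in enumerate(lines):
        if not THREAD_HEADER.match(line):
            continue
        # the first non-empty line after the header is the innermost frame
        for frame in lines[i + 1:]:
            if not frame:
                continue
            if not frame.startswith("Thread "):
                counts[frame] += 1
            break
    return counts


sample = '''
Thread 835990 (active): "MainThread"
    train (megatron/training.py:915)
    pretrain (megatron/training.py:187)
Thread 835991 (active): "MainThread"
    train (megatron/training.py:915)
    pretrain (megatron/training.py:187)
Thread 835995 (active): "MainThread"
    broadcast (torch/distributed/distributed_c10d.py:1191)
    _aggregate_total_loss (deepspeed/runtime/pipe/engine.py:540)
'''

for frame, n in innermost_frames(sample).most_common():
    print(n, frame)
```

To use it on the real output, read each `trace-*.out` file written by the monitor `srun` and pass its contents to `innermost_frames`.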
I also attempted using `pdsh` via `ds_ssh`, but somehow I wasn't able to run `py-spy` remotely - the main issue was that the remote `ssh` command wasn't giving the same env as when I was logged in interactively via `ssh`. But if you have `sudo` access on the compute nodes then you could do:
First prepare `hostfile`:
function makehostfile() {
perl -e '$slots=split /,/, $ENV{"SLURM_STEP_GPUS"};
$slots=8 if $slots==0; # workaround 8 gpu machines
@nodes = split /\n/, qx[scontrol show hostnames $ENV{"SLURM_JOB_NODELIST"}];
print map { "$b$_ slots=$slots\n" } @nodes'
}
makehostfile > hostfile
Now run the `py-spy` extraction command over all nodes:
ds_ssh -f hostfile "source ~/.pdshrc; ps aux | grep python | grep -v grep | grep `whoami` | awk '{print \$2}' | xargs -I {} sudo py-spy dump --pid {} "
So next came the idea of tracing all calls like one does with `strace(1)`. I researched python call tracing facilities and discovered that python has a `trace` sub-system.
This code will trace all python calls and log them to the console and into a dedicated per-process log file, via a custom `Tee` module I added.
This can then help to understand where some processes stopped responding, since we will have the log of the last call before it went unresponsive.
$ cat pretrain_gpt.py
[...]
def main():
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
             args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
import re
class Tee:
    """
    A helper class to tee print's output into a file.
    Usage:
    sys.stdout = Tee(filename)
    """
    def __init__(self, filename):
        self.stdout = sys.stdout
        self.file = open(filename, "a")
    def __getattr__(self, attr):
        return getattr(self.stdout, attr)
    def write(self, msg):
        self.stdout.write(msg)
        self.file.write(msg)
        self.file.flush()
    def flush(self):
        self.stdout.flush()
        self.file.flush()
if __name__ == "__main__":
    import sys
    import trace
    import socket
    import os
    # enable to trace
    if 0:
        cwd = os.path.realpath('.')
        pid = os.getpid()
        hostname = socket.gethostname()
        local_rank = int(os.environ["LOCAL_RANK"])
        trace_output_file = f"{cwd}/trace-{hostname}-{local_rank}-{pid}.txt"
        # create a Trace object, telling it what to ignore, and whether to
        # do tracing or line-counting or both.
        tracer = trace.Trace(
            ignoredirs=[sys.prefix, sys.exec_prefix],
            trace=1,
            count=1,
        )
        #    outfile=trace_output_file)
        # run the new command using the given tracer
        sys.stdout = Tee(trace_output_file)
        tracer.run('main()')
    else:
        main()
This code doesn't require any special handling other than enabling the trace by changing `if 0` to `if 1`.
Of course, this will now dump all python calls. I was worried that the slowdown will mask the issue causing the hanging, but surprisingly it didn't.
I got 14GB (!) of data logged of just python calls from 320 processes.
In retrospect I probably should have started the tracing at a later place, probably just before `train_step` - otherwise we get a lot of useless traces of the dataloader and other preliminary code.
I wish I could tell `trace` which packages to follow, but alas it only supports dirs to ignore, which is much more difficult to set, and thus you end up with a lot more data than one needs. But still this is a super useful tool for debugging hanging processes.
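One possible way around both limitations is the lower-level `sys.settrace` hook, which lets you filter on anything you like and can be switched on at any point in the program. The following is a minimal sketch and not what was used in this investigation; `make_tracer` and `traced_call` are hypothetical helpers, and the path fragments to follow (for example `"/megatron/"` or `"/deepspeed/"`) are assumptions about where the packages are installed.

```python
import io
import sys


def make_tracer(follow, log):
    """Build a trace function that logs only calls from files matching `follow`.

    `follow` is a list of substrings matched against the source file path.
    """
    def tracer(frame, event, arg):
        if event == "call":
            filename = frame.f_code.co_filename
            if any(part in filename for part in follow):
                log.write(f"{filename}:{frame.f_lineno} {frame.f_code.co_name}\n")
                log.flush()
        return None  # no line-by-line tracing inside the called function

    return tracer


def traced_call(func, follow):
    """Run func() with call tracing enabled only for its duration."""
    log = io.StringIO()
    previous = sys.gettrace()
    sys.settrace(make_tracer(follow, log))
    try:
        func()
    finally:
        sys.settrace(previous)
    return log.getvalue()


def helper():
    return 42


def work():
    return helper()


# follow only this file: logs the calls to work() and helper()
print(traced_call(work, [work.__code__.co_filename]))
# follow a path that matches nothing: the log stays empty
print(repr(traced_call(work, ["/no/such/package/"])))
```

Note that `sys.settrace` affects only the current thread; `threading.settrace` is the corresponding hook for threads started afterwards. In a real run the log would be a per-process file rather than a `StringIO`.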
We needed to do some more tweaks to get to the root of it.
Unfortunately I had to pause here, since I had to switch to testing the final version of the code and I couldn't risk losing nodes.
With the `CUDA_LAUNCH_BLOCKING=1` workaround providing a robust solution, we will use that for the time being.
While the final data is being cleaned up we are doing a few preliminary runs with data that still has some issues.
GBS ramp up of `--rampup-batch-size 16 16 9_765_625` - the first few stages starting with GBS=16 are really slow (8 TFLOPs). The pipeline doesn't have enough data to even fill all the stages once, so it's super inefficient and it'll take days until we start hitting 100 TFLOPs.
But there were no spikes during this brief experiment.
Trying `--rampup-batch-size 384 16 9_765_625` since 384 is the first GBS where the pipe is fully filled. `12*2*4=384` (`PP*MBS*DP`). The throughput starts at 100 TFLOPs right away (and it should be 150 TFLOPs once we reach GBS=2048).
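To get a feel for how long the ramp-up stays at small batch sizes, here is a rough sketch of the schedule. It assumes the three values of `--rampup-batch-size` are the starting GBS, the GBS increment and the number of samples over which to ramp up, and that the increments are spread evenly over those samples; it is an illustration, not Megatron's implementation, and `rampup_gbs` is a hypothetical name.

```python
def rampup_gbs(consumed_samples, start, increment, rampup_samples, final_gbs=2048):
    """Global batch size in effect after `consumed_samples` samples.

    Assumption: (final_gbs - start) / increment equally long steps spread
    over `rampup_samples`; integer math avoids float rounding at the edges.
    """
    if consumed_samples >= rampup_samples:
        return final_gbs
    num_increments = (final_gbs - start) // increment
    steps = consumed_samples * num_increments // rampup_samples
    return start + steps * increment


RAMPUP_SAMPLES = 9_765_625
for start in (16, 384):
    schedule = [
        rampup_gbs(n, start, 16, RAMPUP_SAMPLES)
        for n in (0, RAMPUP_SAMPLES // 4, RAMPUP_SAMPLES // 2, RAMPUP_SAMPLES)
    ]
    print(f"start={start}: {schedule}")
```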
Found a bug: tied weights weren't getting reduced - was getting a spike on restart, fixed at https://github.com/microsoft/DeepSpeed/pull/1801/commits/37011a92bad42b07c2cb742751873ef7073d84b8
So only the front embedding matrix's grad updates were being applied; the end one's were ignored.
Will do a totally new run to compare that it's similar or better.
Trying the rebased-to-master version 61d51fd62141ddb51b629b785af256fac407e048 and it has serious issues - the learning is much much slower.
So rolling back the `olruwase/bf16-updates` branch to the fix:
37011a92bad42b07c2cb742751873ef7073d84b8 Reduce tied weight gradients
This time the learning is just a tad slower than main-2, so either deepspeed@master introduced some regression or the merge didn't go well.
Additionally going to try the latest checkpoint from `main-3`, as it has progressed further, to check `main-4` for spikes.
same spike.
After analyzing the module weights, it's clear we have 2 distinct issues:
- `module.tied_modules.embed.word_embeddings.norm` is not taken care of at all in pp rank -1 and its params get reset to defaults on load
- `module.tied_modules.embed.word_embeddings.weight` is mismatching on pp rank -1 between the after-iteration state of the last step before save and the before-iteration state of the first step after load
This was derived with this debug instrumentation:
diff --git a/megatron/training.py b/megatron/training.py
index fd65ae9..fd76d28 100644
--- a/megatron/training.py
+++ b/megatron/training.py
@@ -58,6 +58,23 @@ from megatron.data.dataset_utils import analyze_data_prefix
import deepspeed
+def dump_weights(preamble, iteration, model):
+
+ import os
+ import socket
+ hostname = socket.gethostname()
+ pid = os.getpid()
+
+ tp_rank = mpu.get_tensor_model_parallel_rank()
+ pp_rank = mpu.get_pipeline_model_parallel_rank()
+ dp_rank = mpu.get_data_parallel_rank()
+ global_rank = torch.distributed.get_rank()
+
+ fn = f"debug-{iteration}-pp{pp_rank}-tp{tp_rank}-dp{dp_rank}-global{global_rank}-{preamble}-{pid}.txt"
+ #print(fn)
+ with open(fn, "w") as fh:
+ for n, p in model[0].named_parameters():
+ fh.write(f"{n}={p}\n")
def print_datetime(string):
"""Note that this call will sync across all ranks."""
@@ -426,6 +443,8 @@ def setup_model_and_optimizer(model_provider_func):
if args.fp16:
optimizer.reload_model_params()
+ #optimizer.update_lp_params()
+
return model, optimizer, lr_scheduler
@@ -848,12 +867,18 @@ def train(forward_step_func, model, optimizer, lr_scheduler,
args.pipeline_model_parallel_size >= 1:
args.curriculum_seqlen = args.curriculum_scheduler.update_difficulty( \
args.iteration + 1)
+
+ dump_weights("before-iteration", iteration+1, model)
+
loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = \
train_step(forward_step_func,
train_data_iterator,
model,
optimizer,
lr_scheduler)
+
+ dump_weights("after-iteration", iteration+1, model)
+
iteration += 1
args.iteration = iteration
new_samples = mpu.get_data_parallel_world_size() * \
and then
- ran 5 iterations and saved a checkpoint, then ran:
mkdir a; mv debug-* a
- restarted and ran a few iterations, then ran:
mkdir b; mv debug-* b
I basically dumped weights for all ranks before and after `train_step`.
Now let's compare them all. Comparing:
- the after iteration of the last step before save (iteration 805 in this example)
- the before iteration step after the load (on restart) (iteration 806 in this example)
with the help of:
perl -le 'print qx[diff -u a/debug-805-*global$_-after-iteration-*.txt b/debug-806-*-global$_-before-iteration-*.txt] for 0..383'
Result: all `a/debug-805-pp11-*-after-iteration-*.txt` and corresponding `b/debug-806-pp11-*-before-iteration-*.txt` mismatch.
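The same comparison can be done in Python, which makes it easier to collect only the mismatching ranks. This is a sketch that assumes the file naming produced by the `dump_weights` instrumentation above and the `a`/`b` directory layout; `compare_dumps` is a hypothetical helper.

```python
import difflib
import glob
import os


def compare_dumps(dir_a, dir_b, iter_a, iter_b, n_ranks):
    """Return {global_rank: unified diff} for ranks whose weight dumps differ.

    Compares the after-iteration dump of `iter_a` in `dir_a` with the
    before-iteration dump of `iter_b` in `dir_b`.
    """
    mismatches = {}
    for rank in range(n_ranks):
        pat_a = f"debug-{iter_a}-*-global{rank}-after-iteration-*.txt"
        pat_b = f"debug-{iter_b}-*-global{rank}-before-iteration-*.txt"
        files_a = glob.glob(os.path.join(dir_a, pat_a))
        files_b = glob.glob(os.path.join(dir_b, pat_b))
        if len(files_a) != 1 or len(files_b) != 1:
            continue  # missing or ambiguous dump for this rank
        with open(files_a[0]) as fa, open(files_b[0]) as fb:
            diff = list(
                difflib.unified_diff(
                    fa.readlines(), fb.readlines(), files_a[0], files_b[0]
                )
            )
        if diff:
            mismatches[rank] = "".join(diff)
    return mismatches


if __name__ == "__main__":
    # iteration numbers and rank count as in the example in this section
    for rank, diff in compare_dumps("a", "b", 805, 806, 384).items():
        print(f"global rank {rank} mismatches")
```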
so here is a sample diff:
--- a/debug-805-pp11-tp1-dp4-global369-after-iteration-377074.txt 2022-03-06 05:44:06.074835000 +0100
+++ b/debug-806-pp11-tp1-dp4-global369-before-iteration-378990.txt 2022-03-06 05:48:24.842635000 +0100
@@ -1,21 +1,15 @@
module.tied_modules.embed.word_embeddings.weight=Parameter containing:
-tensor([[-3.1090e-04, 4.6082e-03, -2.3499e-03, ..., -1.1292e-02,
- 2.1667e-03, -2.7313e-03],
- [-1.1353e-02, 9.9487e-03, -1.9684e-03, ..., -5.4550e-04,
- -2.3460e-04, 4.2114e-03],
- [ 3.2806e-03, -3.4332e-04, -5.5847e-03, ..., 7.6294e-03,
- 1.7853e-03, 2.5868e-05],
+tensor([[-0.0006, 0.0046, -0.0024, ..., -0.0114, 0.0014, -0.0030],
+ [-0.0109, 0.0096, -0.0020, ..., -0.0005, -0.0001, 0.0041],
+ [ 0.0027, -0.0004, -0.0056, ..., 0.0070, 0.0017, 0.0003],
...,
- [ 1.6098e-03, 4.1809e-03, -2.4567e-03, ..., -4.6692e-03,
- -4.5776e-03, 1.7090e-03],
- [ 5.7373e-03, 3.5858e-03, -1.7471e-03, ..., 2.3041e-03,
- -6.4392e-03, 1.0223e-03],
- [-1.6937e-03, -1.4038e-02, 2.1057e-03, ..., -3.6011e-03,
- 1.3275e-03, -5.8594e-03]], device='cuda:1', dtype=torch.bfloat16,
- requires_grad=True)
+ [ 0.0018, 0.0039, -0.0026, ..., -0.0051, -0.0043, 0.0016],
+ [ 0.0051, 0.0039, -0.0015, ..., 0.0027, -0.0063, 0.0008],
+ [-0.0018, -0.0142, 0.0021, ..., -0.0035, 0.0015, -0.0060]],
+ device='cuda:1', dtype=torch.bfloat16, requires_grad=True)
module.tied_modules.embed.word_embeddings.norm.weight=Parameter containing:
-tensor([0.9961, 0.9961, 0.9961, ..., 0.9961, 0.9961, 0.9961], device='cuda:1',
- dtype=torch.bfloat16, requires_grad=True)
+tensor([1., 1., 1., ..., 1., 1., 1.], device='cuda:1', dtype=torch.bfloat16,
+ requires_grad=True)
module.tied_modules.embed.word_embeddings.norm.bias=Parameter containing:
tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:1', dtype=torch.bfloat16,
requires_grad=True)
trying a new baseline with rampup starting from 192
trying bigscience-workshop/Megatron-DeepSpeed#260 - comparing with main-5
tracks exactly main-5 - merged.
Running with bigscience-workshop/Megatron-DeepSpeed#261
Don't allocate embed LN on pp rank -1, - different checkpoint
still spikes on restart
disable `--embed-layernorm` completely, check if spikes on restart
no spikes on restart
At 1438 switched to deepspeed@ab61edb02a137d91b61bd416b4e8d3eb287b0eba of olruwase/bf16-updates - let's see if it tracks still the previous runs - yes it does.
So the restart spike's cause was this: the framework was putting the `LayerNorm` that I added for the embedding layer into the wrong param group here. It should have been in `no_weight_decay_params` but ended up in `weight_decay_params` because in this module `LayerNorm` is an alias for `MixedFusedLayerNorm`, so `if isinstance(module_, LayerNorm)` was `False`.
So if we want to use `torch.nn.LayerNorm` we have to change the code above to additionally check for `or isinstance(module_, torch.nn.LayerNorm)`.
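The aliasing pitfall is easy to reproduce without any of the real libraries. In the toy sketch below the two classes are plain stand-ins for `MixedFusedLayerNorm` and `torch.nn.LayerNorm`, and `param_group` is a simplified, hypothetical version of the grouping check (the real code also deals with biases and other details).

```python
class MixedFusedLayerNorm:
    """Stand-in for the fused LayerNorm implementation."""


class TorchLayerNorm:
    """Stand-in for torch.nn.LayerNorm."""


# inside the module the name LayerNorm is just an alias for the fused class
LayerNorm = MixedFusedLayerNorm


def param_group(module_, also_check=()):
    """Pick the param group for a module, optionally checking extra classes."""
    if isinstance(module_, (LayerNorm, *also_check)):
        return "no_weight_decay_params"
    return "weight_decay_params"


# the bug: a torch-style LayerNorm is not an instance of the alias
print(param_group(TorchLayerNorm()))
# the fix: additionally check for the torch-style class
print(param_group(TorchLayerNorm(), also_check=(TorchLayerNorm,)))
# the fused implementation was always grouped correctly
print(param_group(MixedFusedLayerNorm()))
```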
re-running with deepspeed@77b649d160c1cd86f33415e2a7deab50c45fba16 of olruwase/bf16-updates which fixed the tied-embedding desynchronization bug due to clip grads not running on the last pp rank for tied embeddings.