Below is Table 2 from the paper Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. llm-analysis is run with the setups described in the paper, and its outputs match the "Training time for 300B tokens (days)" reported for the different schemes.
- For the PTD Parallelism scheme, `run_train.sh` is used and the output summaries are under `outputs_train`.
- For the ZeRO-3 without Model Parallelism scheme, `run_train_zero.sh` is used and the output summaries are under `outputs_train_zero`.
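
To regenerate the summaries, a minimal invocation looks like the following, assuming llm-analysis and its dependencies are installed and the commands are run from this example directory:

```sh
# PTD Parallelism scheme; writes summaries under outputs_train
bash run_train.sh

# ZeRO-3 without Model Parallelism scheme; writes summaries under outputs_train_zero
bash run_train_zero.sh
```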