[QST] Found no performance gain training Mixtral-8x7B with FP8 on H800 #11959

Closed
umiswing opened this issue Jan 26, 2025 · 8 comments

Comments

@umiswing

umiswing commented Jan 26, 2025

I hacked NeMo as in this PR umiswing#1 to make it run on a single node, and ran Mixtral-8x7B with nemo llm pretrain --factory mixtral_8x7b. However, I found no performance gain training Mixtral-8x7B with FP8 on H800, using the NGC container nvcr.io/nvidia/nemo:24.12. Is there something wrong with my experiment?
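For context, the single-node hack amounts to shrinking the default 8x7B recipe before launching. A minimal sketch (not the actual diff in umiswing#1; the recipe factory and attribute paths are assumptions based on NeMo 2.0-style recipe objects):

```python
# Minimal sketch: shrink the Mixtral-8x7B pretrain recipe to a single node.
# The attribute paths (model.config.num_layers, trainer.strategy.*) are
# assumptions about the NeMo 2.0 recipe layout; adjust to the real recipe.
from nemo.collections import llm

recipe = llm.mixtral_8x7b.pretrain_recipe(
    name="mixtral_8x7b_1node",
    num_nodes=1,
    num_gpus_per_node=8,
)
recipe.model.config.num_layers = 8                        # cut depth so it fits on one node
recipe.trainer.strategy.pipeline_model_parallel_size = 1  # no pipeline parallelism on one node
```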

@akoumpa
Member

akoumpa commented Jan 30, 2025

@malay-nagda

@umiswing
Author

umiswing commented Feb 6, 2025

@akoumpa @malay-nagda hello, any update on this issue?

@malay-nagda
Collaborator

Hi @umiswing!
I think the nvcr.io/nvidia/nemo:24.12 container is missing this fix for MoE FP8: 9a2f0bd.

This fix is likely to be present in a more recent NeMo container.
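One way to check whether a given fix commit is already in the container's NeMo checkout (the /opt/NeMo path is an assumption about where the NGC image keeps the source):

```python
# Sketch: test whether commit 9a2f0bd is an ancestor of the checked-out NeMo HEAD.
# /opt/NeMo is an assumed source location inside the NGC container.
import subprocess

result = subprocess.run(
    ["git", "-C", "/opt/NeMo", "merge-base", "--is-ancestor", "9a2f0bd", "HEAD"]
)
print("fix already included" if result.returncode == 0 else "fix missing")
```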

@umiswing
Author

umiswing commented Feb 6, 2025

@malay-nagda Thanks! I checked out 9a2f0bd in the nvcr.io/nvidia/nemo:24.12 container, and Mixtral-8x7B is much faster with FP8 now.

However, I am concerned about the fairness of the experiment. I hacked some code in the recipes and collections to run Mixtral-8x7B on a single node. Could you take a look at my modifications?

I started the experiment with nemo llm pretrain --factory mixtral_8x7b and calculated the average train_step_timing from global_step: 5 to global_step: 14.
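The averaging is just a mean over the logged step timings; a small sketch, assuming log lines that contain both "global_step: N" and "train_step_timing in s: X" (the actual log format and file name may differ):

```python
# Sketch: average train_step_timing over global_step 5..14 from a training log.
# The regexes assume both fields appear on the same log line; adjust as needed.
import re

step_re = re.compile(r"global_step:\s*(\d+)")
time_re = re.compile(r"train_step_timing in s:\s*([\d.]+)")

timings = []
with open("pretrain.log") as f:  # hypothetical log file name
    for line in f:
        step, timing = step_re.search(line), time_re.search(line)
        if step and timing and 5 <= int(step.group(1)) <= 14:
            timings.append(float(timing.group(1)))

print(f"avg train_step_timing over steps 5-14: {sum(timings) / len(timings):.3f} s")
```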

@malay-nagda
Collaborator

@akoumpa and @gdengk are the MoE experts here; maybe they can comment on your modifications.

@akoumpa
Member

akoumpa commented Feb 6, 2025

Hi @umiswing, IIRC you need 4 nodes with 8x80G GPUs (or similar) to run a full pretrain of Mixtral 8x7B.

As long as you apply the code modifications to both containers, any comparison between them is valid; however, extrapolating from these numbers to the full model may introduce some degree of error.
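As a rough sanity check on the multi-node requirement (back-of-the-envelope only; the ~47B total parameter count and ~16 bytes/param for BF16 weights + grads + FP32 master weights + Adam moments are assumptions, and activation memory is ignored):

```python
# Back-of-the-envelope: why full Mixtral-8x7B pretraining needs several 8x80G nodes.
# ~47B total params and ~16 bytes/param are rough assumptions; activations ignored.
params = 47e9
bytes_per_param = 16
model_and_optimizer_gb = params * bytes_per_param / 1e9   # ≈ 750 GB

single_node_gb = 8 * 80                                    # 640 GB per 8x80G node
four_node_gb = 4 * single_node_gb                          # 2560 GB across 4 nodes

print(f"model+optimizer ≈ {model_and_optimizer_gb:.0f} GB, "
      f"1 node = {single_node_gb} GB, 4 nodes = {four_node_gb} GB")
```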

@gdengk
Contributor

gdengk commented Feb 6, 2025

Agreed with @akoumpa.
Running the same config on both containers would be a fair comparison.
The full Mixtral-8x7B model cannot fit on a single node, but I saw you reduced num_layers to 8, which should fit.
However, the MFU from the 8-layer Mixtral-8x7B cannot be directly translated to the full model, since the pipeline bubble is not accounted for in this small test. You should still expect some gain from the optimization in the small model.
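For intuition on the pipeline-bubble point, the standard 1F1B/GPipe estimate puts the bubble fraction at (p - 1) / (m + p - 1) for p pipeline stages and m microbatches, which is zero in a single-node run without pipeline parallelism but non-trivial for a full multi-node config. A sketch under those assumptions (the stage/microbatch counts below are illustrative, not the actual recipe values):

```python
# Sketch: pipeline bubble fraction (p - 1) / (m + p - 1) for p pipeline stages
# and m microbatches (standard 1F1B/GPipe estimate, ignoring comm overlap).
def bubble_fraction(pipeline_stages: int, microbatches: int) -> float:
    return (pipeline_stages - 1) / (microbatches + pipeline_stages - 1)

# Single-node 8-layer run with no pipeline parallelism: no bubble at all.
print(bubble_fraction(pipeline_stages=1, microbatches=8))   # 0.0
# A hypothetical full-model config with 4 pipeline stages and 8 microbatches:
print(bubble_fraction(pipeline_stages=4, microbatches=8))   # ≈ 0.27
```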

@umiswing
Author

Thanks for your help!
