[QST] Found no performance gain training Mixtral-8x7B with FP8 on H800 #11959

Closed
umiswing opened this issue Jan 26, 2025 · 8 comments

Comments

@umiswing

umiswing commented Jan 26, 2025

I hacked NeMo as in this PR umiswing#1 to make it run on a single node, and ran Mixtral-8x7B with nemo llm pretrain --factory mixtral_8x7b. However, I found no performance gain training Mixtral-8x7B with FP8 on H800, using the NGC container nvcr.io/nvidia/nemo:24.12. Is there something wrong with my experiment?
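For context, the single-node hack amounts to shrinking the default 8x7B recipe before launching. A minimal sketch (not the actual diff in umiswing#1; the recipe factory and attribute paths are assumptions based on NeMo 2.0-style recipe objects):

```python
# Minimal sketch: shrink the Mixtral-8x7B pretrain recipe to a single node.
# The attribute paths (model.config.num_layers, trainer.strategy.*) are
# assumptions about the NeMo 2.0 recipe layout; adjust to the real recipe.
from nemo.collections import llm

recipe = llm.mixtral_8x7b.pretrain_recipe(
    name="mixtral_8x7b_1node",
    num_nodes=1,
    num_gpus_per_node=8,
)
recipe.model.config.num_layers = 8                        # cut depth so it fits on one node
recipe.trainer.strategy.pipeline_model_parallel_size = 1  # no pipeline parallelism on one node
```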

@akoumpa
Member

akoumpa commented Jan 30, 2025

@malay-nagda

@umiswing
Author

umiswing commented Feb 6, 2025

@akoumpa @malay-nagda hello, any update on this issue?

@malay-nagda
Collaborator

Hi @umiswing!
I think the nvcr.io/nvidia/nemo:24.12 container is missing this fix for MoE FP8: 9a2f0bd.

This fix is likely to be present in a more recent NeMo container.
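One way to check whether a given fix commit is already in the container's NeMo checkout (the /opt/NeMo path is an assumption about where the NGC image keeps the source):

```python
# Sketch: test whether commit 9a2f0bd is an ancestor of the checked-out NeMo HEAD.
# /opt/NeMo is an assumed source location inside the NGC container.
import subprocess

result = subprocess.run(
    ["git", "-C", "/opt/NeMo", "merge-base", "--is-ancestor", "9a2f0bd", "HEAD"]
)
print("fix already included" if result.returncode == 0 else "fix missing")
```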

@umiswing
Author

umiswing commented Feb 6, 2025

@malay-nagda Thanks! I checked out 9a2f0bd in the nvcr.io/nvidia/nemo:24.12 container, and Mixtral-8x7B is much faster with FP8 now.

However, I am concerned about the fairness of the experiment. I hacked some code in the recipes and collections to run Mixtral-8x7B on a single node. Could you take a look at my modifications?

I started the experiment with nemo llm pretrain --factory mixtral_8x7b and calculated the average train_step_timing from global_step: 5 to global_step: 14.
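The averaging is just a mean over the logged step timings; a small sketch, assuming log lines that contain both "global_step: N" and "train_step_timing in s: X" (the actual log format and file name may differ):

```python
# Sketch: average train_step_timing over global_step 5..14 from a training log.
# The regexes assume both fields appear on the same log line; adjust as needed.
import re

step_re = re.compile(r"global_step:\s*(\d+)")
time_re = re.compile(r"train_step_timing in s:\s*([\d.]+)")

timings = []
with open("pretrain.log") as f:  # hypothetical log file name
    for line in f:
        step, timing = step_re.search(line), time_re.search(line)
        if step and timing and 5 <= int(step.group(1)) <= 14:
            timings.append(float(timing.group(1)))

print(f"avg train_step_timing over steps 5-14: {sum(timings) / len(timings):.3f} s")
```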

@malay-nagda
Collaborator

@akoumpa and @gdengk are the MoE experts here; maybe they can comment on your modifications.

@akoumpa
Member

akoumpa commented Feb 6, 2025

Hi @umiswing, IIRC you need 4 nodes with 8x80G GPUs (or similar) to run a full pretrain of Mixtral 8x7B.

As long as you apply the code modifications to both containers, any comparison between them is valid; however, extrapolating from these numbers to the full model may introduce some degree of error.
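As a rough sanity check on the multi-node requirement (back-of-the-envelope only; the ~47B total parameter count and ~16 bytes/param for BF16 weights + grads + FP32 master weights + Adam moments are assumptions, and activation memory is ignored):

```python
# Back-of-the-envelope: why full Mixtral-8x7B pretraining needs several 8x80G nodes.
# ~47B total params and ~16 bytes/param are rough assumptions; activations ignored.
params = 47e9
bytes_per_param = 16
model_and_optimizer_gb = params * bytes_per_param / 1e9   # ≈ 750 GB

single_node_gb = 8 * 80                                    # 640 GB per 8x80G node
four_node_gb = 4 * single_node_gb                          # 2560 GB across 4 nodes

print(f"model+optimizer ≈ {model_and_optimizer_gb:.0f} GB, "
      f"1 node = {single_node_gb} GB, 4 nodes = {four_node_gb} GB")
```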

@gdengk
Contributor

gdengk commented Feb 6, 2025

Agreed with @akoumpa.
Running the same config on both containers would be a fair comparison.
The full Mixtral-8x7B model cannot fit on a single node, but I saw you reduced num_layers to 8, which should fit.
However, the MFU from the 8-layer Mixtral-8x7B cannot be directly translated to the full model, since the pipeline bubble is not accounted for in this small test. You should still expect some gain from the optimization in the small model.
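For intuition on the pipeline-bubble point, the standard 1F1B/GPipe estimate puts the bubble fraction at (p - 1) / (m + p - 1) for p pipeline stages and m microbatches, which is zero in a single-node run without pipeline parallelism but non-trivial for a full multi-node config. A sketch under those assumptions (the stage/microbatch counts below are illustrative, not the actual recipe values):

```python
# Sketch: pipeline bubble fraction (p - 1) / (m + p - 1) for p pipeline stages
# and m microbatches (standard 1F1B/GPipe estimate, ignoring comm overlap).
def bubble_fraction(pipeline_stages: int, microbatches: int) -> float:
    return (pipeline_stages - 1) / (microbatches + pipeline_stages - 1)

# Single-node 8-layer run with no pipeline parallelism: no bubble at all.
print(bubble_fraction(pipeline_stages=1, microbatches=8))   # 0.0
# A hypothetical full-model config with 4 pipeline stages and 8 microbatches:
print(bubble_fraction(pipeline_stages=4, microbatches=8))   # ≈ 0.27
```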

@umiswing
Author

Thanks for your help!
