[QST] Found no performance gain training Mixtral-8x7B with FP8 on H800 #11959
Comments
@akoumpa @malay-nagda Hello, any update on this issue?
@malay-nagda Thanks! I checked out 9a2f0bd in the … However, I am concerned about the fairness of the experiment. I hacked some code in the recipes and collections to run Mixtral-8x7B on a single node. Could you take a look at my modification? I start the experiment with …
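For context, this is roughly the kind of single-node override I mean, a minimal sketch assuming the NeMo 2.0 recipe API (the function names, parallelism fields, and values below are assumptions for illustration, not the exact patch in umiswing#1):

```python
# Sketch only: assumes the NeMo 2.0 recipe API; names/values may differ
# from the actual modifications in umiswing#1.
import nemo_run as run
from nemo.collections import llm


def single_node_mixtral_recipe():
    # Build the stock Mixtral-8x7B pretraining recipe, then shrink it to one node.
    recipe = llm.mixtral_8x7b.pretrain_recipe(
        name="mixtral_8x7b_single_node",
        num_nodes=1,
        num_gpus_per_node=8,
    )
    # With only 8 GPUs the default parallelism layout no longer fits, so the
    # parallel sizes (and possibly the layer count) have to be reduced as well.
    recipe.trainer.strategy.tensor_model_parallel_size = 4
    recipe.trainer.strategy.pipeline_model_parallel_size = 1
    recipe.trainer.strategy.expert_model_parallel_size = 2
    return recipe


if __name__ == "__main__":
    # Run locally on the single node.
    run.run(single_node_mixtral_recipe(), executor=run.LocalExecutor())
```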
Hi @umiswing, IIRC you need 4 nodes with 8x80G GPUs (or similar) to run a full pretrain of Mixtral 8x7B. As long as you apply the code modifications to both containers, any comparison between them is valid; however, extrapolating from these numbers to the full model may include some degree of error.
Agreed with @akoumpa.
Thanks for your help!
I hacked NeMo as in this PR umiswing#1 to make it run on a single node, and ran Mixtral-8x7B with
nemo llm pretrain --factory mixtral_8x7b
However, I found no performance gain when training Mixtral-8x7B with FP8 on H800, using the NGC container nvcr.io/nvidia/nemo:24.12. Is there something wrong with my experiment?
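For reference, this is how I understand FP8 gets enabled in NeMo 2.0, through the Megatron mixed-precision plugin; a minimal sketch, assuming the delayed-scaling parameters exposed by Transformer Engine (the exact argument names and defaults here are assumptions and may differ from the recipe's performance mode):

```python
# Sketch only: assumed NeMo 2.0 mixed-precision API; parameter names follow
# Transformer Engine's delayed-scaling recipe and may not match the recipe
# defaults exactly.
from nemo import lightning as nl

fp8_precision = nl.MegatronMixedPrecision(
    precision="bf16-mixed",      # base/master precision stays BF16
    fp8="hybrid",                # E4M3 forward, E5M2 backward gradients
    fp8_margin=0,
    fp8_amax_history_len=1024,   # window for delayed amax-based scaling
    fp8_amax_compute_algo="max",
)

# Attach to the recipe's trainer before launching, e.g.:
# recipe.trainer.plugins = fp8_precision
```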