Running into ValueError when running moe/dmoe scripts #134
Is this during eval? Can you provide a minimal repro?
I don't have a repro yet. I am in the process of porting this to work on AMD/ROCm, and I run into this error only when running the dmoe or moe scripts. The Megablocks exp/gpt scripts all work fine. Here is the log from running exp/gpt2_46m_8gpu.sh:

```
validation loss at iteration 1000 | lm loss value: 5.058095E+00 | lm loss PPL: 1.572906E+02 |
iteration 1100/ 100000 | consumed samples: 563200 | elapsed time per iteration (ms): 365.4 | learning rate: 6.000E-04 | global batch size: 512 | lm loss: 5.032094E+00 | loss scale: 1.0 | grad norm: 1.338 | number of skipped iterations: 0 | number of nan iterations: 0 |
```

Which makes sense, because the error can be traced back to moe.py.
@mvpatel2000 to answer your question: yes, this looks like it happens during eval. I set the eval-interval to 500 and ran into this after 500 iters.
Ah, this is because you don't store the load-balancing loss (LBL) during eval. You should set the model to eval mode. We should give a friendlier error... CC: @eitanturok
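For context, here is a minimal toy sketch of that failure mode. This is not the actual megablocks source; the assumption (matching the function names in the traceback below) is that MoE layers append their stats to a module-level buffer only during training-mode forward passes:

```python
# Toy stand-in for the buffer megablocks is assumed to keep in
# megablocks/layers/moe.py.
_LOAD_BALANCING_LOSS = []

def save_load_balancing_loss(aux):
    # Called by each MoE layer during a training-mode forward pass.
    _LOAD_BALANCING_LOSS.append(aux)

def get_load_balancing_loss():
    return _LOAD_BALANCING_LOSS

def batched_load_balancing_loss():
    # With an empty buffer, zip(*[]) yields zero values, so this
    # two-way unpack raises the exact error from this issue.
    tokens_per_expert, expert_scores = zip(*get_load_balancing_loss())
    return tokens_per_expert, expert_scores

# During eval, no forward pass saves anything, so the buffer is empty
# when the loop still asks for the batched loss:
batched_load_balancing_loss()
# ValueError: not enough values to unpack (expected 2, got 0)
```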
Ohh, OK. What changes do I make to the training script? Right now I just run exp/moe/moe_46m_8gpu.sh with the default args and a few changes for the dataset.
Before eval, you'll need to call `model.eval()`.
@mvpatel2000 can you point out where this change has to be made? I tried playing around with the scripts but could not figure it out. Thanks!
@eitanturok maybe?
This might have to happen in the evaluation path of the training loop; a rough sketch is below.
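A hypothetical sketch of what that change could look like, assuming a Megatron-style loop; `run_eval`, `data_iterator`, and `loss_fn` are stand-ins for whatever your scripts actually use, not names from the megablocks or Megatron-LM code:

```python
import torch

def run_eval(model, data_iterator, num_batches, loss_fn):
    # Put the model in eval mode so MoE layers stop participating in
    # the load-balancing-loss bookkeeping.
    model.eval()
    losses = []
    with torch.no_grad():
        for _ in range(num_batches):
            batch = next(data_iterator)
            output = model(batch)
            # Do NOT call batched_load_balancing_loss() here: in eval
            # mode the MoE layers never saved anything, so it would
            # raise the ValueError from this issue.
            losses.append(loss_fn(output, batch))
    model.train()  # restore training mode before the next train step
    return sum(losses) / len(losses)
```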
The training begins and, after running for 1000 iterations, I get:

```
iteration 1000/ 20000 | consumed samples: 512000 | elapsed time per iteration (ms): 336.3 | learning rate: 1.495E-04 | global batch size: 512 | load balancing loss: 9.743530E-02 | lm loss: 5.181638E+00 | loss scale: 32768.0 | grad norm: 1.003 | number of skipped iterations: 0 | number of nan iterations: 0
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/megablocks-0.5.1-py3.9-linux-x86_64.egg/megablocks/layers/moe.py", line 37, in batched_load_balancing_loss
    tokens_per_expert, expert_scores = zip(*get_load_balancing_loss())
ValueError: not enough values to unpack (expected 2, got 0)
```
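And the "friendlier error" suggested above could be a simple guard on that unpack. This is a hypothetical patch against the toy sketch earlier in the thread, not the actual megablocks code:

```python
def batched_load_balancing_loss():
    saved = get_load_balancing_loss()  # the buffer from the sketch above
    if not saved:
        # Fail with an actionable message instead of a bare unpack error.
        raise RuntimeError(
            'No load-balancing loss was saved. This usually means '
            'batched_load_balancing_loss() was called during evaluation; '
            'put the model in eval mode and skip this loss outside of '
            'training.')
    tokens_per_expert, expert_scores = zip(*saved)
    return tokens_per_expert, expert_scores
```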