Running into ValueError when running moe/dmoe scripts #134

Open · rtmadduri opened this issue Aug 5, 2024 · 9 comments

@rtmadduri commented Aug 5, 2024

The training begins and runs fine for 1000 iterations:

iteration 1000/ 20000 | consumed samples: 512000 | elapsed time per iteration (ms): 336.3 | learning rate: 1.495E-04 | global batch size: 512 | load balancing loss: 9.743530E-02 | lm loss: 5.181638E+00 | loss scale: 32768.0 | grad norm: 1.003 | number of skipped iterations: 0 | number of nan iterations: 0 |

and then it fails with:

File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/megablocks-0.5.1-py3.9-linux-x86_64.egg/megablocks/layers/moe.py", line 37, in batched_load_balancing_loss
ValueError: not enough values to unpack (expected 2, got 0)
tokens_per_expert, expert_scores = zip(*get_load_balancing_loss())
ValueError: not enough values to unpack (expected 2, got 0)
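
For reference, the unpack fails simply because get_load_balancing_loss() is returning an empty list at that point. A minimal, standalone Python sketch of the same failure (no megablocks needed; `recorded_losses` just stands in for that empty return value):

```python
# zip(*...) over an empty list produces zero tuples, so unpacking the result
# into two names raises exactly the ValueError shown in the traceback above.
recorded_losses = []  # stands in for get_load_balancing_loss() returning nothing

try:
    tokens_per_expert, expert_scores = zip(*recorded_losses)
except ValueError as e:
    print(e)  # not enough values to unpack (expected 2, got 0)
```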

@mvpatel2000 (Contributor)

Is this during eval? Can you provide a minimum repro?

@rtmadduri (Author) commented Aug 5, 2024

I don't have a repro yet. I am in the process of porting this to work on AMD/ROCm, and that is where I run into this issue.

I only run into this error when running the dmoe or moe scripts; the MegaBlocks exp/gpt scripts all work fine. Here is the log from running exp/gpt2_46m_8gpu.sh:


validation loss at iteration 1000 | lm loss value: 5.058095E+00 | lm loss PPL: 1.572906E+02 |

iteration 1100/ 100000 | consumed samples: 563200 | elapsed time per iteration (ms): 365.4 | learning rate: 6.000E-04 | global batch size: 512 | lm loss: 5.032094E+00 | loss scale: 1.0 | grad norm: 1.338 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 1200/ 100000 | consumed samples: 614400 | elapsed time per iteration (ms): 276.4 | learning rate: 6.000E-04 | global batch size: 512 | lm loss: 4.857907E+00 | loss scale: 1.0 | grad norm: 0.746 | number of skipped iterations: 0 | number of nan iterations: 0 |

That makes sense, because the error traces back to moe.py.

@rtmadduri (Author)

@mvpatel2000 to answer your question: yes, this looks like it happens during eval. I set --eval-interval to 500 and ran into this after 500 iterations.

@mvpatel2000 (Contributor)

Ah, this is because the load-balancing loss (LBL) is not stored during eval. You should set the model to eval mode. We should give a friendlier error... CC: @eitanturok
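
A rough sketch of the failure mode being described (illustrative only, not the actual megablocks/Megatron code): if the MoE layers only record the load-balancing loss while the model is in training mode, then an aggregation step that runs unconditionally during validation sees an empty buffer. Guarding the aggregation on model.training is the pattern; `add_moe_losses` and `aggregate_lbl` below are hypothetical names.

```python
import torch

def add_moe_losses(model: torch.nn.Module, lm_loss: torch.Tensor, aggregate_lbl):
    """Hypothetical helper: `aggregate_lbl` stands in for whatever call the
    training script uses to sum the recorded load-balancing losses (e.g.
    megablocks' batched_load_balancing_loss from the traceback). Nothing is
    recorded while the model is in eval mode, so only aggregate during training."""
    if model.training:
        return lm_loss + aggregate_lbl()
    return lm_loss
```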

@rtmadduri (Author)

Ohh, OK. What changes do I make to the training script? Right now I just run exp/moe/moe_46m_8gpu.sh with the default args and some changes for the dataset.

@mvpatel2000 (Contributor)

Before eval, you'll need to call model.eval(). @eitanturok, can you look at tweaking the scripts?
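
A minimal illustration of that suggestion, with placeholder names (`run_validation`, `valid_iterator`, `forward_step` are not from the actual scripts); the real change belongs wherever the training script runs its validation pass:

```python
import torch

def run_validation(model: torch.nn.Module, valid_iterator, forward_step):
    """Sketch only: toggle eval mode around the validation loop, then restore
    training mode. `valid_iterator` and `forward_step` stand in for whatever
    the training script actually uses."""
    model.eval()  # switch to eval behaviour so the load-balancing-loss path is skipped
    with torch.no_grad():
        for batch in valid_iterator:
            forward_step(batch, model)
    model.train()  # restore training mode before resuming training
```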

@rtmadduri (Author)

@mvpatel2000 can you point out where this change has to be made? I tried playing around with the scripts but could not figure it out. Thanks!

@rtmadduri (Author)

@eitanturok maybe?

@mvpatel2000 (Contributor)

This might have to happen in third_party/Megatron-LM/pretrain_gpt.py, which is the script being called...
