Running into ValueError when running moe/dmoe scripts #134

Open · rtmadduri opened this issue Aug 5, 2024 · 9 comments

@rtmadduri commented Aug 5, 2024

The training begins and runs fine for 1000 iterations:

iteration 1000/ 20000 | consumed samples: 512000 | elapsed time per iteration (ms): 336.3 | learning rate: 1.495E-04 | global batch size: 512 | load balancing loss: 9.743530E-02 | lm loss: 5.181638E+00 | loss scale: 32768.0 | grad norm: 1.003 | number of skipped iterations: 0 | number of nan iterations: 0 |

and then it fails with:

File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/megablocks-0.5.1-py3.9-linux-x86_64.egg/megablocks/layers/moe.py", line 37, in batched_load_balancing_loss
ValueError: not enough values to unpack (expected 2, got 0)
tokens_per_expert, expert_scores = zip(*get_load_balancing_loss())
ValueError: not enough values to unpack (expected 2, got 0)
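
For reference, the unpack fails simply because get_load_balancing_loss() is returning an empty list at that point. A minimal, standalone Python sketch of the same failure (no megablocks needed; `recorded_losses` just stands in for that empty return value):

```python
# zip(*...) over an empty list produces zero tuples, so unpacking the result
# into two names raises exactly the ValueError shown in the traceback above.
recorded_losses = []  # stands in for get_load_balancing_loss() returning nothing

try:
    tokens_per_expert, expert_scores = zip(*recorded_losses)
except ValueError as e:
    print(e)  # not enough values to unpack (expected 2, got 0)
```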

@mvpatel2000 (Contributor)

Is this during eval? Can you provide a minimum repro?

@rtmadduri (Author) commented Aug 5, 2024

I don't have a repro yet. I am in the process of porting this to work on AMD/ROCm, and that is where I run into this issue.

I only run into this error when running the dmoe or moe scripts; the MegaBlocks exp/gpt scripts all work fine. Here is the log from running exp/gpt2_46m_8gpu.sh:


validation loss at iteration 1000 | lm loss value: 5.058095E+00 | lm loss PPL: 1.572906E+02 |

iteration 1100/ 100000 | consumed samples: 563200 | elapsed time per iteration (ms): 365.4 | learning rate: 6.000E-04 | global batch size: 512 | lm loss: 5.032094E+00 | loss scale: 1.0 | grad norm: 1.338 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 1200/ 100000 | consumed samples: 614400 | elapsed time per iteration (ms): 276.4 | learning rate: 6.000E-04 | global batch size: 512 | lm loss: 4.857907E+00 | loss scale: 1.0 | grad norm: 0.746 | number of skipped iterations: 0 | number of nan iterations: 0 |

That makes sense, because the error traces back to moe.py.

@rtmadduri (Author)

@mvpatel2000 to answer your question: yes, this looks like it happens during eval. I set --eval-interval to 500 and ran into this after 500 iterations.

@mvpatel2000 (Contributor)

Ah, this is because the load-balancing loss (LBL) is not stored during eval. You should set the model to eval mode. We should give a friendlier error... CC: @eitanturok
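
A rough sketch of the failure mode being described (illustrative only, not the actual megablocks/Megatron code): if the MoE layers only record the load-balancing loss while the model is in training mode, then an aggregation step that runs unconditionally during validation sees an empty buffer. Guarding the aggregation on model.training is the pattern; `add_moe_losses` and `aggregate_lbl` below are hypothetical names.

```python
import torch

def add_moe_losses(model: torch.nn.Module, lm_loss: torch.Tensor, aggregate_lbl):
    """Hypothetical helper: `aggregate_lbl` stands in for whatever call the
    training script uses to sum the recorded load-balancing losses (e.g.
    megablocks' batched_load_balancing_loss from the traceback). Nothing is
    recorded while the model is in eval mode, so only aggregate during training."""
    if model.training:
        return lm_loss + aggregate_lbl()
    return lm_loss
```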

@rtmadduri (Author)

Ohh, OK. What changes do I make to the training script? Right now I just run exp/moe/moe_46m_8gpu.sh with the default args and some changes for the dataset.

@mvpatel2000 (Contributor)

Before eval, you'll need to call model.eval(). @eitanturok, can you look at tweaking the scripts?
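
A minimal illustration of that suggestion, with placeholder names (`run_validation`, `valid_iterator`, `forward_step` are not from the actual scripts); the real change belongs wherever the training script runs its validation pass:

```python
import torch

def run_validation(model: torch.nn.Module, valid_iterator, forward_step):
    """Sketch only: toggle eval mode around the validation loop, then restore
    training mode. `valid_iterator` and `forward_step` stand in for whatever
    the training script actually uses."""
    model.eval()  # switch to eval behaviour so the load-balancing-loss path is skipped
    with torch.no_grad():
        for batch in valid_iterator:
            forward_step(batch, model)
    model.train()  # restore training mode before resuming training
```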

@rtmadduri (Author)

@mvpatel2000 can you point out where this change has to be made? I tried playing around with the scripts but could not figure it out. Thanks!

@rtmadduri (Author)

@eitanturok maybe?

@mvpatel2000 (Contributor)

This might have to happen in third_party/Megatron-LM/pretrain_gpt.py, which is the script being called...
