amp_C undefined symbol after installing Megablocks #157
Comments
I've seen this a few times if you build for the wrong version of PyTorch and it installs funny. I would print the whole install logs and see if there's any reinstalling going on.
I am using the nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container, which already has PyTorch installed. Do you suggest installing a specific alternate version?
We use and recommend these images: https://github.com/mosaicml/composer/tree/main/docker
I was able to fix this. It happens because megablocks lists torch as a dependency in its setup.py, so `pip install megablocks` automatically tries to reinstall torch on top of the container's copy. The way to fix this is to install both stanford-stk and megablocks from source using `python setup.py install`, which prevents Megablocks from reinstalling torch.
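A minimal sketch of the from-source install described above. The repository URLs are assumptions based on the public GitHub projects; adjust them (and any branch/tag) to your environment:

```shell
# Sketch: install stanford-stk and megablocks from source so pip's
# dependency resolver never reinstalls the container's preinstalled torch.
# Repo URLs below are assumptions, not taken from this thread.
git clone https://github.com/stanford-futuredata/stk.git
cd stk && python setup.py install && cd ..

git clone https://github.com/stanford-futuredata/megablocks.git
cd megablocks && python setup.py install && cd ..

# Sanity check: torch should still be the container's original build.
python -c "import torch; print(torch.__version__)"
```

Building from source this way compiles the CUDA extensions against the torch already in the container, so their symbols stay ABI-compatible.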
I am trying to set up and use megablocks to train MoE models, but I see the following error:

I am working on NGC's nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container.

When I try running GPT-2 training (using `exp/gpt2/gpt2_gpt2_46m_1gpu.sh`) before doing a `pip install megablocks`, it works totally fine, while the MoE script (`exp/moe/moe_125m_8gpu_interactive.sh`) gives the error `Megablocks not available`.

However, after I do a `pip install megablocks` or `pip install .` in the container, even the GPT-2 script (and the MoE one) starts giving the above error about amp_C and an undefined symbol.
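To see whether the relevant modules are even locatable after an install (amp_C is the compiled extension that ships with NVIDIA Apex in the NGC containers), a small diagnostic like the following can help. This is a sketch; the module names checked are assumptions about this setup:

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located on the current sys.path."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # An "undefined symbol" when importing amp_C usually means the
    # extension was compiled against a different torch ABI than the
    # torch now installed (e.g. after pip reinstalled torch).
    for mod in ("torch", "apex", "amp_C"):
        print(f"{mod}: {'found' if module_available(mod) else 'missing'}")
```

If `amp_C` is found but importing it still fails with an undefined symbol, that points to an ABI mismatch rather than a missing package, which matches the reinstalled-torch explanation above.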