amp_C undefined symbol after installing Megablocks #157

Open
RachitBansal opened this issue Oct 11, 2024 · 4 comments

RachitBansal commented Oct 11, 2024

I am trying to set up and use megablocks to train MoE models, but I see the following error:

Traceback (most recent call last):
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/pretrain_gpt.py", line 8, in <module>
    from megatron import get_args
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/__init__.py", line 13, in <module>
    from .initialize  import initialize_megatron
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/initialize.py", line 19, in <module>
    from megatron.checkpointing import load_args_from_checkpoint
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/checkpointing.py", line 15, in <module>
    from .utils import (unwrap_model,
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/utils.py", line 11, in <module>
    import amp_C
ImportError: /usr/local/lib/python3.10/dist-packages/amp_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

I am working on NGC's nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container.

Before I do a pip install megablocks, GPT-2 training (using exp/gpt2/gpt2_gpt2_46m_1gpu.sh) works totally fine, while the MoE script (exp/moe/moe_125m_8gpu_interactive.sh) gives the error Megablocks not available.

However, after I run pip install megablocks or pip install . in the container, both the GPT-2 script and the MoE one start failing with the amp_C undefined-symbol error above.
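
For reference, this is the kind of sanity check I can run inside the container to see whether the pip install swapped out the torch build that apex's amp_C extension was compiled against (just a generic check, not specific to megablocks):

# show the active torch build and key related packages
python -c "import torch; print(torch.__version__, torch.version.cuda)"
pip list 2>/dev/null | grep -iE "^(torch|apex|megablocks)"

# re-trigger the failing import in isolation
python -c "import amp_C"

If the reported torch version no longer matches the one the container shipped with, that would explain the undefined symbol.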

mvpatel2000 (Contributor) commented

I've seen this a few times when the package is built against the wrong version of PyTorch and the install ends up in a bad state. I would print the whole install log and check whether anything is getting reinstalled.
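
Something along these lines should show it (just a sketch; the log file name is arbitrary, and you can substitute pip install . if you are installing from a checkout):

# capture the full install log, then look for torch being collected or replaced
pip install megablocks 2>&1 | tee megablocks_install.log
grep -iE "collecting torch|uninstalling torch|successfully installed" megablocks_install.log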

RachitBansal (Author) commented

I am using the nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container, which already has PyTorch installed. Do you suggest installing a specific alternate version?

mvpatel2000 (Contributor) commented

We use and recommend these images: https://github.com/mosaicml/composer/tree/main/docker

rtmadduri commented

I was able to fix this. It happens because installing megablocks also installs torch, since torch is listed as a dependency in the setup.py file here.

When you do pip install megablocks, pip automatically tries to install those dependencies, replacing the torch build that ships with the container. The fix is to comment out the torch requirement in setup.py and then do the same in the stanford-stk repo here.

After that, install both stanford-stk and megablocks from source using python setup.py install.

This will prevent Megablocks from reinstalling torch on top of the container's existing build.
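
Roughly, the steps look like this (the directory names are just placeholders for local clones of the two repos; the key part is that the torch entry in each setup.py is commented out before installing):

# in a clone of stanford-stk, after commenting out torch in its setup.py
cd stk && python setup.py install && cd ..

# in a clone of megablocks, after commenting out torch in its setup.py
cd megablocks && python setup.py install && cd ..

Depending on your pip version, pip install --no-deps . in each repo might be an alternative to editing setup.py, but the setup.py edit is what worked for me.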
