
Segmentation fault when using the dev container #16

Open
jeffchy opened this issue Aug 28, 2024 · 8 comments

@jeffchy

jeffchy commented Aug 28, 2024

Segmentation fault when using the dev container to run the LLM finetune recipe:

nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 strategies:244] Fixing mis-match between ddp-config & mcore-optimizer config
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:314] Rank 0 has data parallel group : [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:325] All data parallel group ranks with context parallel combined: [[0], [1], [2], [3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:328] Ranks 0 has data parallel rank: 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:336] Rank 0 has context parallel group: [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:339] All context parallel group ranks: [[0], [1], [2], [3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:340] Ranks 0 has context parallel rank: 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:347] Rank 0 has model parallel group: [0, 1, 2, 3]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:348] All model parallel group ranks: [[0, 1, 2, 3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:357] Rank 0 has tensor model parallel group: [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:361] All tensor model parallel group ranks: [[0], [1], [2], [3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:362] Rank 0 has tensor model parallel rank: 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:382] Rank 0 has pipeline model parallel group: [0, 1, 2, 3]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:394] Rank 0 has embedding group: [0, 3]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:400] All pipeline model parallel group ranks: [[0, 1, 2, 3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:401] Rank 0 has pipeline model parallel rank 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:402] All embedding group ranks: [[0, 1, 2, 3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:403] Rank 0 has embedding rank: 0
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.bf16  False -> True
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.params_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.pipeline_dtype  None -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.autocast_dtype  None -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.bf16  False -> True
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.params_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.pipeline_dtype  None -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.autocast_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote OptimizerConfig.params_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote DistributedDataParallelConfig.grad_reduce_in_fp32  False -> True
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
nemo.collections.llm.api.finetune/0 `zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
nemo.collections.llm.api.finetune/0 `zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
nemo.collections.llm.api.finetune/0 `zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
nemo.collections.llm.api.finetune/0 [08/28/2024-07:01:35] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
nemo.collections.llm.api.finetune/0 [TensorRT-LLM] TensorRT-LLM version: 0.11.0
nemo.collections.llm.api.finetune/0 [08/28/2024-07:01:36] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
nemo.collections.llm.api.finetune/0 [TensorRT-LLM] TensorRT-LLM version: 0.11.0
nemo.collections.llm.api.finetune/0 [08/28/2024-07:01:36] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
nemo.collections.llm.api.finetune/0 [TensorRT-LLM] TensorRT-LLM version: 0.11.0
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
nemo.collections.llm.api.finetune/0 ----------------------------------------------------------------------------------------------------
nemo.collections.llm.api.finetune/0 distributed_backend=nccl
nemo.collections.llm.api.finetune/0 All distributed processes registered. Starting with 4 processes
nemo.collections.llm.api.finetune/0 ----------------------------------------------------------------------------------------------------
nemo.collections.llm.api.finetune/0
nemo.collections.llm.api.finetune/0 [10-7-133-247:16170:0:17316] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 [10-7-133-247:16172:0:17317] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 [10-7-133-247:16171:0:17318] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 [10-7-133-247:15836:0:17315] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 ==== backtrace (tid:  17316) ====
nemo.collections.llm.api.finetune/0  0 0x0000000000042520 __sigaction()  ???:0
nemo.collections.llm.api.finetune/0  1 0x00000000000736aa pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  2 0x00000000000766f1 pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  3 0x000000000005a30a ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  4 0x000000000005fe72 ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  5 0x000000000004cf9c pncclRedOpDestroy()  ???:0
nemo.collections.llm.api.finetune/0  6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
nemo.collections.llm.api.finetune/0  7 0x0000000000125a04 clone()  ???:0
nemo.collections.llm.api.finetune/0 =================================
nemo.collections.llm.api.finetune/0 ==== backtrace (tid:  17317) ====
nemo.collections.llm.api.finetune/0  0 0x0000000000042520 __sigaction()  ???:0
nemo.collections.llm.api.finetune/0  1 0x00000000000736aa pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  2 0x00000000000766f1 pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  3 0x000000000005a30a ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  4 0x000000000005fe72 ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  5 0x000000000004cf9c pncclRedOpDestroy()  ???:0
nemo.collections.llm.api.finetune/0  6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
nemo.collections.llm.api.finetune/0  7 0x0000000000125a04 clone()  ???:0
nemo.collections.llm.api.finetune/0 =================================
nemo.collections.llm.api.finetune/0 ==== backtrace (tid:  17318) ====
nemo.collections.llm.api.finetune/0  0 0x0000000000042520 __sigaction()  ???:0
nemo.collections.llm.api.finetune/0  1 0x00000000000736aa pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  2 0x00000000000766f1 pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  3 0x000000000005a30a ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  4 0x000000000005fe72 ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  5 0x000000000004cf9c pncclRedOpDestroy()  ???:0
nemo.collections.llm.api.finetune/0  6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
nemo.collections.llm.api.finetune/0  7 0x0000000000125a04 clone()  ???:0
@jeffchy
Author

jeffchy commented Aug 28, 2024

Solved by using the 24.07 image, installing NeMo-Run, and manually upgrading NeMo (built from source).

@hemildesai
Collaborator

Thanks @jeffchy for creating the issue. Glad to know you were able to fix it. Please let us know if you run into this issue again. Is it OK to close the issue for now, since you were able to solve it?

@jeffchy
Author

jeffchy commented Aug 29, 2024

I'm able to get past the phase I mentioned above, but it then raises a CheckpointError.

@ericharper
Collaborator

@jeffchy is that the same error as above or a new one? Could you share it if it's new?

@jeffchy
Author

jeffchy commented Aug 31, 2024

It's a new one; I'll try to reproduce the error.

@jeffchy
Author

jeffchy commented Sep 2, 2024

Update: I can successfully run the newest pretrain recipe https://github.com/NVIDIA/NeMo/blob/main/examples/llm/run/llama3_pretraining.py, but it failed when I tried to use finetune_recipe with my own model.
I replaced hf_resume() with:

def hf_resume() -> Config[nl.AutoResume]:
    return Config(nl.AutoResume, import_path="hf://{my local model path}")

And I got:

llama3-8b/0 [default3]:[rank3]:     self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
llama3-8b/0 [default3]:[rank3]:   File "/workspace/NeMo/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 636, in load_optimizer_state_dict
llama3-8b/0 [default3]:[rank3]:     optimizer_states = checkpoint["optimizer"]
llama3-8b/0 [default3]:[rank3]: KeyError: 'optimizer'

I'm not familiar with NeMo; maybe I got something wrong?

@marcromeyn
Collaborator

import_path is a special argument intended only for HF -> NeMo model conversion. If your model was already trained using NeMo, you don't need it; in that case you can use path instead of import_path.
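
For illustration, a minimal sketch of both variants (the checkpoint locations and imports below are hypothetical placeholders, mirroring the hf_resume() snippet above; the path argument for NeMo-trained checkpoints follows the description in this comment):

from nemo_run import Config   # assumed import, matching the hf_resume() snippet above
import nemo.lightning as nl

def nemo_resume() -> Config[nl.AutoResume]:
    # Resume from a checkpoint already trained and saved with NeMo:
    # no HF conversion is involved, so `path` is used instead of `import_path`.
    return Config(nl.AutoResume, path="/results/my_nemo_checkpoint")  # hypothetical local path

def hf_import_resume() -> Config[nl.AutoResume]:
    # Start from a Hugging Face checkpoint: `import_path` triggers the HF -> NeMo conversion.
    return Config(nl.AutoResume, import_path="hf://meta-llama/Meta-Llama-3-8B")  # example HF model id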

@jeffchy
Author

jeffchy commented Sep 9, 2024

Thanks for your reply, but if I have a custom fine-tuned HF model (stored locally), how do I start from it? Do I need to convert it in advance?
