
Segmentation fault when using the dev container #16

Open
jeffchy opened this issue Aug 28, 2024 · 8 comments

@jeffchy

jeffchy commented Aug 28, 2024

Segmentation fault when using the dev container to run the LLM finetune recipe:

nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 strategies:244] Fixing mis-match between ddp-config & mcore-optimizer config
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:314] Rank 0 has data parallel group : [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:325] All data parallel group ranks with context parallel combined: [[0], [1], [2], [3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:328] Ranks 0 has data parallel rank: 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:336] Rank 0 has context parallel group: [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:339] All context parallel group ranks: [[0], [1], [2], [3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:340] Ranks 0 has context parallel rank: 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:347] Rank 0 has model parallel group: [0, 1, 2, 3]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:348] All model parallel group ranks: [[0, 1, 2, 3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:357] Rank 0 has tensor model parallel group: [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:361] All tensor model parallel group ranks: [[0], [1], [2], [3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:362] Rank 0 has tensor model parallel rank: 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:382] Rank 0 has pipeline model parallel group: [0, 1, 2, 3]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:394] Rank 0 has embedding group: [0, 3]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:400] All pipeline model parallel group ranks: [[0, 1, 2, 3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:401] Rank 0 has pipeline model parallel rank 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:402] All embedding group ranks: [[0, 1, 2, 3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:403] Rank 0 has embedding rank: 0
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.bf16  False -> True
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.params_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.pipeline_dtype  None -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.autocast_dtype  None -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.bf16  False -> True
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.params_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.pipeline_dtype  None -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.autocast_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote OptimizerConfig.params_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote DistributedDataParallelConfig.grad_reduce_in_fp32  False -> True
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
nemo.collections.llm.api.finetune/0 `zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
nemo.collections.llm.api.finetune/0 `zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
nemo.collections.llm.api.finetune/0 `zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
nemo.collections.llm.api.finetune/0 [08/28/2024-07:01:35] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
nemo.collections.llm.api.finetune/0 [TensorRT-LLM] TensorRT-LLM version: 0.11.0
nemo.collections.llm.api.finetune/0 [08/28/2024-07:01:36] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
nemo.collections.llm.api.finetune/0 [TensorRT-LLM] TensorRT-LLM version: 0.11.0
nemo.collections.llm.api.finetune/0 [08/28/2024-07:01:36] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
nemo.collections.llm.api.finetune/0 [TensorRT-LLM] TensorRT-LLM version: 0.11.0
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
nemo.collections.llm.api.finetune/0 ----------------------------------------------------------------------------------------------------
nemo.collections.llm.api.finetune/0 distributed_backend=nccl
nemo.collections.llm.api.finetune/0 All distributed processes registered. Starting with 4 processes
nemo.collections.llm.api.finetune/0 ----------------------------------------------------------------------------------------------------
nemo.collections.llm.api.finetune/0
nemo.collections.llm.api.finetune/0 [10-7-133-247:16170:0:17316] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 [10-7-133-247:16172:0:17317] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 [10-7-133-247:16171:0:17318] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 [10-7-133-247:15836:0:17315] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 ==== backtrace (tid:  17316) ====
nemo.collections.llm.api.finetune/0  0 0x0000000000042520 __sigaction()  ???:0
nemo.collections.llm.api.finetune/0  1 0x00000000000736aa pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  2 0x00000000000766f1 pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  3 0x000000000005a30a ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  4 0x000000000005fe72 ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  5 0x000000000004cf9c pncclRedOpDestroy()  ???:0
nemo.collections.llm.api.finetune/0  6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
nemo.collections.llm.api.finetune/0  7 0x0000000000125a04 clone()  ???:0
nemo.collections.llm.api.finetune/0 =================================
nemo.collections.llm.api.finetune/0 ==== backtrace (tid:  17317) ====
nemo.collections.llm.api.finetune/0  0 0x0000000000042520 __sigaction()  ???:0
nemo.collections.llm.api.finetune/0  1 0x00000000000736aa pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  2 0x00000000000766f1 pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  3 0x000000000005a30a ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  4 0x000000000005fe72 ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  5 0x000000000004cf9c pncclRedOpDestroy()  ???:0
nemo.collections.llm.api.finetune/0  6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
nemo.collections.llm.api.finetune/0  7 0x0000000000125a04 clone()  ???:0
nemo.collections.llm.api.finetune/0 =================================
nemo.collections.llm.api.finetune/0 ==== backtrace (tid:  17318) ====
nemo.collections.llm.api.finetune/0  0 0x0000000000042520 __sigaction()  ???:0
nemo.collections.llm.api.finetune/0  1 0x00000000000736aa pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  2 0x00000000000766f1 pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  3 0x000000000005a30a ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  4 0x000000000005fe72 ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  5 0x000000000004cf9c pncclRedOpDestroy()  ???:0
nemo.collections.llm.api.finetune/0  6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
nemo.collections.llm.api.finetune/0  7 0x0000000000125a04 clone()  ???:0
@jeffchy
Author

jeffchy commented Aug 28, 2024

Solved by using the 24.07 image, installing NeMo-Run, and manually upgrading NeMo (built from source).

@hemildesai
Collaborator

Thanks @jeffchy for creating the issue. Glad to know you were able to fix it. Please let us know if you run into this issue again. Is it OK to close the issue for now, since you were able to solve it?

@jeffchy
Author

jeffchy commented Aug 29, 2024

I'm able to get past the phase I mentioned above, but it then raises a CheckpointError.

@ericharper
Collaborator

@jeffchy is that the same error as above or a new one? Could you share it if it's new?

@jeffchy
Author

jeffchy commented Aug 31, 2024

It's a new one; I'll try to reproduce the error.

@jeffchy
Author

jeffchy commented Sep 2, 2024

Update: I can successfully run the newest pretrain recipe https://github.com/NVIDIA/NeMo/blob/main/examples/llm/run/llama3_pretraining.py, but it failed when I tried to use finetune_recipe with my own model.
I replaced hf_resume() with:

def hf_resume() -> Config[nl.AutoResume]:
    return Config(nl.AutoResume, import_path="hf://{my local model path}")

And I got:

llama3-8b/0 [default3]:[rank3]:     self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
llama3-8b/0 [default3]:[rank3]:   File "/workspace/NeMo/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 636, in load_optimizer_state_dict
llama3-8b/0 [default3]:[rank3]:     optimizer_states = checkpoint["optimizer"]
llama3-8b/0 [default3]:[rank3]: KeyError: 'optimizer'

I'm not familiar with NeMo; maybe I got something wrong?

@marcromeyn
Collaborator

import_path is a special argument intended only for HF -> NeMo model conversion. If your model was already trained using NeMo, you don't need it; in that case you can use path instead of import_path.
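
For illustration, a minimal sketch of both variants (the checkpoint locations and imports below are hypothetical placeholders, mirroring the hf_resume() snippet above; the path argument for NeMo-trained checkpoints follows the description in this comment):

from nemo_run import Config   # assumed import, matching the hf_resume() snippet above
import nemo.lightning as nl

def nemo_resume() -> Config[nl.AutoResume]:
    # Resume from a checkpoint already trained and saved with NeMo:
    # no HF conversion is involved, so `path` is used instead of `import_path`.
    return Config(nl.AutoResume, path="/results/my_nemo_checkpoint")  # hypothetical local path

def hf_import_resume() -> Config[nl.AutoResume]:
    # Start from a Hugging Face checkpoint: `import_path` triggers the HF -> NeMo conversion.
    return Config(nl.AutoResume, import_path="hf://meta-llama/Meta-Llama-3-8B")  # example HF model id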

@jeffchy
Author

jeffchy commented Sep 9, 2024

Thanks for your reply, but if I have a custom fine-tuned HF model (stored locally), how do I start from it? Do I need to convert it in advance?
