Hi NVIDIA,

Environment:
Image: nvcr.io/ea-bignlp/nemofw-training:23.04.1-py3
Inside the container, /workspace/Megatron-LM was updated to the latest version (commit 992da75a1fd90989eb1a97be8d9ff3eca993aa83).
When I use FP8 for training, I hit the following problem.
Error Log:
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/NeMo/nemo/collections/nlp/modules/common/megatron/language_model.py", line 701, in forward
encoder_output = self.encoder(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/NeMo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1431, in forward
fp8_group = parallel_state.get_amax_reduction_group()
File "/workspace/Megatron-LM/megatron/core/parallel_state.py", line 333, in get_amax_reduction_group
assert _AMAX_REDUCTION_GROUP is not None, \
AssertionError: FP8 amax reduction group is not initialized
For _AMAX_REDUCTION_GROUP, these are all of its occurrences in megatron/core/parallel_state.py:
megatron/core/parallel_state.py:24:_AMAX_REDUCTION_GROUP = None
megatron/core/parallel_state.py:245: global _AMAX_REDUCTION_GROUP
megatron/core/parallel_state.py:246: assert _AMAX_REDUCTION_GROUP is None, \
megatron/core/parallel_state.py:259: _AMAX_REDUCTION_GROUP = group
megatron/core/parallel_state.py:333: assert _AMAX_REDUCTION_GROUP is not None, \
megatron/core/parallel_state.py:335: return _AMAX_REDUCTION_GROUP
megatron/core/parallel_state.py:603: global _AMAX_REDUCTION_GROUP
megatron/core/parallel_state.py:604: _AMAX_REDUCTION_GROUP = None
So it should be set by this line, since it is the only place where _AMAX_REDUCTION_GROUP appears on the left-hand side of an assignment:
megatron/core/parallel_state.py:259: _AMAX_REDUCTION_GROUP = group
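From reading the code, my understanding is that the group only gets created if initialize_model_parallel() is called with FP8 enabled before any forward pass queries it. A minimal sketch of the expected order (argument names taken from the code below; the exact signature may differ between Megatron-LM versions):

import torch
from megatron.core import parallel_state

# torch.distributed must already be initialized (NeMo / the launcher does this).
torch.distributed.init_process_group(backend="nccl")

# _AMAX_REDUCTION_GROUP is only assigned inside this call when use_fp8 is True.
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
    use_fp8=True,  # without this flag the group stays None
)

# Only after the call above does this stop raising the AssertionError.
fp8_group = parallel_state.get_amax_reduction_group()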
Then I added some logging around this part, but nothing is printed:
Function initialize_model_parallel:

print(f"========================= use_fp8 {use_fp8}")
if use_fp8:
    amax_group_size: int = tensor_model_parallel_size * data_parallel_size
    num_amax_groups: int = world_size // amax_group_size
    for i in range(num_amax_groups):
        start_rank = i * amax_group_size
        end_rank = (i + 1) * amax_group_size
        ranks = range(start_rank, end_rank)
        group = torch.distributed.new_group(ranks)
        print(f"========================= rank {rank} ranks {ranks}")
        if rank in ranks:
            _AMAX_REDUCTION_GROUP = group
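Since the prints above never appear, one thing I am also unsure about (a hypothetical debugging step on my side, not something from either repo) is whether the patched copy under /workspace/Megatron-LM is actually the one being imported at runtime. A quick check:

import megatron.core.parallel_state as parallel_state

# Which file did the interpreter actually load? If this is not under
# /workspace/Megatron-LM, the patched initialize_model_parallel (and its
# debug prints) never runs.
print("parallel_state loaded from:", parallel_state.__file__)

# Peek at the private global to confirm whether the group was ever created.
print("_AMAX_REDUCTION_GROUP:", getattr(parallel_state, "_AMAX_REDUCTION_GROUP", "<missing>"))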
The lines around the failing call (/opt/NeMo/nemo/collections/nlp/modules/common/megatron/transformer.py, line 1431) are:

with rng_context:
    # fp8_autocast will not do anything if TE or FP8 isn't used
    fp8_group = None
    if self.fp8 and parallel_state.model_parallel_is_initialized():
        fp8_group = parallel_state.get_amax_reduction_group()
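For context on why that group matters: as far as I understand, the fp8_group retrieved here is later handed to Transformer Engine's fp8_autocast so that amax statistics are reduced across ranks. A standalone sketch of that pattern (my own illustration, not the NeMo code; it assumes initialize_model_parallel(..., use_fp8=True) has already succeeded):

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe
from megatron.core import parallel_state

# This is exactly the call that raises the AssertionError in the traceback
# above when the group was never created.
fp8_group = parallel_state.get_amax_reduction_group()

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
layer = te.Linear(1024, 1024).cuda()
inp = torch.randn(16, 1024, device="cuda")

# Amax values collected during this forward pass are all-reduced over fp8_group.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe, fp8_group=fp8_group):
    out = layer(inp)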
So initialize_model_parallel() should be called (with FP8 enabled) before get_amax_reduction_group(), but judging from the missing debug prints, that does not seem to happen here.
Could you give me some tips on how to fix this error?
Thanks
Aaron