
AssertionError: FP8 amax reduction group is not initialized #81

Open
starlitsky2010 opened this issue Jun 11, 2023 · 1 comment

@starlitsky2010

Hi NVIDIA,

Environment:
Image: nvcr.io/ea-bignlp/nemofw-training:23.04.1-py3
Inside the container, /workspace/Megatron-LM was updated to the latest commit 992da75a1fd90989eb1a97be8d9ff3eca993aa83.
When I run training with FP8 enabled, I hit the following problem.

Error Log:

  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/NeMo/nemo/collections/nlp/modules/common/megatron/language_model.py", line 701, in forward
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/NeMo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1431, in forward
    fp8_group = parallel_state.get_amax_reduction_group()
  File "/workspace/Megatron-LM/megatron/core/parallel_state.py", line 333, in get_amax_reduction_group
    assert _AMAX_REDUCTION_GROUP is not None, \
AssertionError: FP8 amax reduction group is not initialized

Grepping for _AMAX_REDUCTION_GROUP in megatron/core/parallel_state.py gives:

megatron/core/parallel_state.py:24:_AMAX_REDUCTION_GROUP = None
megatron/core/parallel_state.py:245:    global _AMAX_REDUCTION_GROUP
megatron/core/parallel_state.py:246:    assert _AMAX_REDUCTION_GROUP is None, \
megatron/core/parallel_state.py:259:                _AMAX_REDUCTION_GROUP = group
megatron/core/parallel_state.py:333:    assert _AMAX_REDUCTION_GROUP is not None, \
megatron/core/parallel_state.py:335:    return _AMAX_REDUCTION_GROUP
megatron/core/parallel_state.py:603:    global _AMAX_REDUCTION_GROUP
megatron/core/parallel_state.py:604:    _AMAX_REDUCTION_GROUP = None

So _AMAX_REDUCTION_GROUP should be set by the following line, since it is the only place where it is assigned an actual group (the other assignments only set it to None):

megatron/core/parallel_state.py:259:                _AMAX_REDUCTION_GROUP = group

I then added some debug prints around this part of initialize_model_parallel, but nothing is printed:

    # debug print added for this issue
    print(f"========================= use_fp8 {use_fp8}")
    if use_fp8:
        # group size = tensor-parallel size x data-parallel size
        amax_group_size: int = tensor_model_parallel_size * data_parallel_size
        num_amax_groups: int = world_size // amax_group_size
        for i in range(num_amax_groups):
            start_rank = i * amax_group_size
            end_rank = (i + 1) * amax_group_size
            ranks = range(start_rank, end_rank)
            group = torch.distributed.new_group(ranks)
            # debug print added for this issue
            print(f"========================= rank {rank} ranks {ranks}")
            if rank in ranks:
                _AMAX_REDUCTION_GROUP = group
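
For reference, this is roughly the call I would expect to reach that branch. This is only my own minimal sketch: apart from use_fp8, which appears in the snippet above, the keyword names are assumptions about the initialize_model_parallel signature.

    # Sketch only, not taken from the repo: the amax reduction group is created
    # inside initialize_model_parallel, so FP8 has to be requested at this point.
    import torch
    from megatron.core import parallel_state

    # assumes the usual torchrun environment variables (MASTER_ADDR, RANK, ...)
    torch.distributed.init_process_group(backend="nccl")
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=2,    # assumed keyword name
        pipeline_model_parallel_size=1,  # assumed keyword name
        use_fp8=True,  # without this, the "if use_fp8" branch above never runs
    )
    # only after this should get_amax_reduction_group() return a group
    print(parallel_state.get_amax_reduction_group())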

The lines around the failing call in transformer.py are:

/opt/NeMo/nemo/collections/nlp/modules/common/megatron/transformer.py +1431

        with rng_context:
            # fp8_autocast will not do anything if TE or FP8 isn't used
            fp8_group = None
            if self.fp8 and parallel_state.model_parallel_is_initialized():
                fp8_group = parallel_state.get_amax_reduction_group()

So initialize_model_parallel() should set up the amax reduction group before get_amax_reduction_group() is called, but that does not seem to happen here.
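
To narrow it down, I can run a small check right after model-parallel setup. This is my own diagnostic sketch, using only the functions that appear in the traceback above:

    # Diagnostic sketch, not from NeMo/Megatron-LM: report whether the FP8 amax
    # reduction group exists once model parallelism has been initialized.
    from megatron.core import parallel_state

    def check_amax_group():
        print("model parallel initialized:",
              parallel_state.model_parallel_is_initialized())
        try:
            print("amax reduction group:",
                  parallel_state.get_amax_reduction_group())
        except AssertionError as err:
            # this is the same assertion that fires in the traceback above
            print("amax reduction group missing:", err)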

Could you give me some tips about how to fix this error?

Thanks
Aaron

@ericharper
Collaborator

Apologies. This is a known issue and is fixed in our 23.05 container, which is going to be released very soon.
