
AssertionError: FP8 amax reduction group is not initialized #81

Open
starlitsky2010 opened this issue Jun 11, 2023 · 1 comment

@starlitsky2010

Hi NVIDIA,

Environment:
Image: nvcr.io/ea-bignlp/nemofw-training:23.04.1-py3
Inside the container, /workspace/Megatron-LM was updated to the latest commit 992da75a1fd90989eb1a97be8d9ff3eca993aa83.
When I run training with FP8 enabled, I hit the following problem.

Error Log:

  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/NeMo/nemo/collections/nlp/modules/common/megatron/language_model.py", line 701, in forward
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/NeMo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1431, in forward
    fp8_group = parallel_state.get_amax_reduction_group()
  File "/workspace/Megatron-LM/megatron/core/parallel_state.py", line 333, in get_amax_reduction_group
    assert _AMAX_REDUCTION_GROUP is not None, \
AssertionError: FP8 amax reduction group is not initialized

Grepping for _AMAX_REDUCTION_GROUP in megatron/core/parallel_state.py gives:

megatron/core/parallel_state.py:24:_AMAX_REDUCTION_GROUP = None
megatron/core/parallel_state.py:245:    global _AMAX_REDUCTION_GROUP
megatron/core/parallel_state.py:246:    assert _AMAX_REDUCTION_GROUP is None, \
megatron/core/parallel_state.py:259:                _AMAX_REDUCTION_GROUP = group
megatron/core/parallel_state.py:333:    assert _AMAX_REDUCTION_GROUP is not None, \
megatron/core/parallel_state.py:335:    return _AMAX_REDUCTION_GROUP
megatron/core/parallel_state.py:603:    global _AMAX_REDUCTION_GROUP
megatron/core/parallel_state.py:604:    _AMAX_REDUCTION_GROUP = None

So _AMAX_REDUCTION_GROUP should be set by the following line, since it is the only place where it is assigned an actual group (the other assignments only set it to None):

megatron/core/parallel_state.py:259:                _AMAX_REDUCTION_GROUP = group

I then added some debug prints around this part of initialize_model_parallel, but nothing is printed:

    # debug print added for this issue
    print(f"========================= use_fp8 {use_fp8}")
    if use_fp8:
        # group size = tensor-parallel size x data-parallel size
        amax_group_size: int = tensor_model_parallel_size * data_parallel_size
        num_amax_groups: int = world_size // amax_group_size
        for i in range(num_amax_groups):
            start_rank = i * amax_group_size
            end_rank = (i + 1) * amax_group_size
            ranks = range(start_rank, end_rank)
            group = torch.distributed.new_group(ranks)
            # debug print added for this issue
            print(f"========================= rank {rank} ranks {ranks}")
            if rank in ranks:
                _AMAX_REDUCTION_GROUP = group
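
For reference, this is roughly the call I would expect to reach that branch. This is only my own minimal sketch: apart from use_fp8, which appears in the snippet above, the keyword names are assumptions about the initialize_model_parallel signature.

    # Sketch only, not taken from the repo: the amax reduction group is created
    # inside initialize_model_parallel, so FP8 has to be requested at this point.
    import torch
    from megatron.core import parallel_state

    # assumes the usual torchrun environment variables (MASTER_ADDR, RANK, ...)
    torch.distributed.init_process_group(backend="nccl")
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=2,    # assumed keyword name
        pipeline_model_parallel_size=1,  # assumed keyword name
        use_fp8=True,  # without this, the "if use_fp8" branch above never runs
    )
    # only after this should get_amax_reduction_group() return a group
    print(parallel_state.get_amax_reduction_group())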

The lines around the failing call in transformer.py are:

/opt/NeMo/nemo/collections/nlp/modules/common/megatron/transformer.py +1431

        with rng_context:
            # fp8_autocast will not do anything if TE or FP8 isn't used
            fp8_group = None
            if self.fp8 and parallel_state.model_parallel_is_initialized():
                fp8_group = parallel_state.get_amax_reduction_group()

So initialize_model_parallel() should set up the amax reduction group before get_amax_reduction_group() is called, but that does not seem to happen here.
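
To narrow it down, I can run a small check right after model-parallel setup. This is my own diagnostic sketch, using only the functions that appear in the traceback above:

    # Diagnostic sketch, not from NeMo/Megatron-LM: report whether the FP8 amax
    # reduction group exists once model parallelism has been initialized.
    from megatron.core import parallel_state

    def check_amax_group():
        print("model parallel initialized:",
              parallel_state.model_parallel_is_initialized())
        try:
            print("amax reduction group:",
                  parallel_state.get_amax_reduction_group())
        except AssertionError as err:
            # this is the same assertion that fires in the traceback above
            print("amax reduction group missing:", err)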

Could you give me some tips about how to fix this error?

Thanks
Aaron

@ericharper
Collaborator

Apologies. This is a known issue and is fixed in our 23.05 container, which is going to be released very soon.
