
Why the magatron_v4.patch is needed? #14

Open
hxdtest opened this issue Nov 18, 2024 · 4 comments

Comments

@hxdtest

hxdtest commented Nov 18, 2024

https://github.com/volcengine/verl/blob/main/patches/megatron_v4.patch

For example:

  • case 1
-    tensor_shape = [seq_length, micro_batch_size, config.hidden_size]
+    tensor_shape = [seq_length, micro_batch_size, hidden_size]

What is the difference between hidden_size and config.hidden_size?

  • case 2
     # Run 1F1B in steady state.
     for i in range(num_microbatches_remaining):
         last_iteration = i == (num_microbatches_remaining - 1)
+        next_forward_k = num_warmup_microbatches + i + 1
+        backward_k = i
 

Why do you need next_forward_k and backward_k?

  • case 3
-        return FusedLayerNormAffineFunction.apply(input, weight, self.bias, self.normalized_shape, self.eps)
+        return FusedLayerNormAffineFunction.apply(input, weight, self.bias, self.normalized_shape, self.eps, False)

Why is False needed?
And in current apex, memory_efficient seems to be set to False by default (fused_layer_norm.py); a quick check is sketched after this list.

  • case 4
+        self.overlap_param_gather = overlap_param_gather
         if self.overlap_param_gather:
             self.remove_pre_hook_handle = torch.nn.modules.module.register_module_forward_pre_hook(
                 self._make_forward_pre_hook())

Why do you need overlap_param_gather? Does it have side effects on training?
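
For reference on case 3, here is a quick sanity check I would run against the installed apex to see whether memory_efficient exists and what it defaults to. It is illustrative only and assumes apex's fused layer norm module imports cleanly; it is not part of the patch:

    # Sanity check (illustrative): print apex's fused layer norm forward signature
    # to confirm whether the memory_efficient parameter exists and its default.
    import inspect
    from apex.normalization.fused_layer_norm import FusedLayerNormAffineFunction

    print(inspect.signature(FusedLayerNormAffineFunction.forward))
    # On recent apex versions this is expected to look roughly like:
    # (ctx, input, weight, bias, normalized_shape, eps, memory_efficient=False)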

Many thanks!

@hxdtest changed the title from "Why the magatron_v4.patch is needed" to "Why the magatron_v4.patch is needed?" on Nov 18, 2024
@PeterSH6
Collaborator

Hi @hxdtest, the megatron_v4.patch is necessary for veRL for two main reasons:

  1. In veRL, we don't initialize Megatron-LM with initialize_megatron, which is what sets up the global args. We only build the necessary process groups with mpu.initialize_model_parallel, so we have to remove the usages of get_args(). Case 4 is one of the places where get_args() is removed, and overlap_param_gather is set to False by default (a minimal initialization sketch appears at the end of this comment).
  2. We fix the vpp (virtual pipeline parallelism) hanging problem that occurs when applying remove-padding techniques in model training. Case 2 is part of that fix; a sketch of the steady-state index bookkeeping follows below.
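
To make the index bookkeeping in case 2 concrete, here is a minimal, illustrative sketch (not the patch code itself) of how microbatch indices line up in the 1F1B steady state, presumably so that per-microbatch information (e.g. sequence lengths that differ after padding removal) can be looked up on both the forward and the backward side:

    # Illustrative sketch: which microbatch runs forward vs. backward at each
    # steady-state step of a 1F1B schedule.
    def steady_state_schedule(num_microbatches: int, num_warmup_microbatches: int):
        num_microbatches_remaining = num_microbatches - num_warmup_microbatches
        schedule = []
        for i in range(num_microbatches_remaining):
            next_forward_k = num_warmup_microbatches + i + 1  # microbatch fed forward next
            backward_k = i                                    # microbatch whose gradients flow back
            schedule.append((next_forward_k, backward_k))
        return schedule

    print(steady_state_schedule(num_microbatches=8, num_warmup_microbatches=3))
    # [(4, 0), (5, 1), (6, 2), (7, 3), (8, 4)]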

For case 1, config.hidden_size should be equal to hidden_size.
The False in case 3 could be removed, as the default value is already False and there seems to be no way to change it in v0.4.
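
Regarding point 1, a minimal sketch of the kind of initialization veRL relies on instead of initialize_megatron (illustrative only; the parallel sizes and backend here are placeholders):

    # Illustrative sketch: build only the Megatron process groups, without
    # initialize_megatron(), so no global args exist and get_args() cannot be used.
    import torch
    import megatron.core.parallel_state as mpu

    def init_model_parallel(tp_size: int = 2, pp_size: int = 1) -> None:
        # torch.distributed must already be initialized (e.g. launched via torchrun).
        if not torch.distributed.is_initialized():
            torch.distributed.init_process_group(backend="nccl")
        mpu.initialize_model_parallel(
            tensor_model_parallel_size=tp_size,
            pipeline_model_parallel_size=pp_size,
        )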

@hxdtest
Author

hxdtest commented Nov 20, 2024

Many thanks for your reply.

@hxdtest
Author

hxdtest commented Nov 20, 2024

@PeterSH6
Have you tested verl with model sizes larger than 300B? For example, have you tested Llama 3 405B PPO training on verl?

@PeterSH6
Collaborator

@hxdtest, we haven't tested verl on the 405B model.

I think we can try it by using a larger TP size in rollout or implementing pipeline parallelism in vLLM rollout. This is one of our plans.
