[REQUEST] Why is the column linear layer with all-gather not implemented in DeepSpeed Inference? #7037

zhangvia · 2025-02-14T07:09:38Z

Is your feature request related to a problem? Please describe.
Without a column-parallel linear layer that all-gathers its output, we can't handle a standalone linear layer.

I can see the row-parallel linear with all-reduce, i.e. LinearAllreduce, but there is no implementation of a column-parallel linear layer with all-gather.

How should I set the linear type when running DiT models like this one:

    (transformer_blocks): ModuleList(
      (0-18): 19 x FluxTransformerBlock(
        (norm1): AdaLayerNormZero(
          (silu): SiLU()
          (linear): LinearLayer(in_features=3072, out_features=9216, bias=True, dtype=torch.bfloat16)
          (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
        )
        (norm1_context): AdaLayerNormZero(
          (silu): SiLU()
          (linear): LinearLayer(in_features=3072, out_features=9216, bias=True, dtype=torch.bfloat16)
          (norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
        )
        (attn): Attention(
          (norm_q): RMSNorm()
          (norm_k): RMSNorm()
          (to_q): LinearLayer(in_features=3072, out_features=1536, bias=True, dtype=torch.bfloat16)
          (to_k): LinearLayer(in_features=3072, out_features=1536, bias=True, dtype=torch.bfloat16)
          (to_v): LinearLayer(in_features=3072, out_features=1536, bias=True, dtype=torch.bfloat16)
          (add_k_proj): LinearLayer(in_features=3072, out_features=1536, bias=True, dtype=torch.bfloat16)
          (add_v_proj): LinearLayer(in_features=3072, out_features=1536, bias=True, dtype=torch.bfloat16)
          (add_q_proj): LinearLayer(in_features=3072, out_features=1536, bias=True, dtype=torch.bfloat16)
          (to_out): ModuleList(
            (0): LinearLayer(in_features=3072, out_features=1536, bias=True, dtype=torch.bfloat16)
            (1): Dropout(p=0.0, inplace=False)
          )
          (to_add_out): LinearLayer(in_features=3072, out_features=1536, bias=True, dtype=torch.bfloat16)
          (norm_added_q): RMSNorm()
          (norm_added_k): RMSNorm()
        )
        (norm2): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
        (ff): FeedForward(
          (net): ModuleList(
            (0): GELU(
              (proj): LinearLayer(in_features=3072, out_features=6144, bias=True, dtype=torch.bfloat16)
            )
            (1): Dropout(p=0.0, inplace=False)
            (2): LinearLayer(in_features=12288, out_features=1536, bias=True, dtype=torch.bfloat16)
          )
        )
        (norm2_context): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
        (ff_context): FeedForward(
          (net): ModuleList(
            (0): GELU(
              (proj): LinearLayer(in_features=3072, out_features=6144, bias=True, dtype=torch.bfloat16)
            )
            (1): Dropout(p=0.0, inplace=False)
            (2): LinearLayer(in_features=12288, out_features=1536, bias=True, dtype=torch.bfloat16)
          )
        )
      )
    )

I can set attn.to_out.0, attn.to_add_out, ff.net.2, and ff_context.net.2 to LinearAllreduce, but how should I deal with norm1.linear and norm1_context.linear? I need to all-gather the output of each of these single linear layers. Otherwise it causes an error, because the inputs to both norm1 and norm1_context are the full hidden_states.
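
To make the request concrete, here is a minimal single-process sketch of what I mean by a column linear layer with all-gather. It is not DeepSpeed's API: `column_parallel_forward` is a hypothetical helper, the sizes are small made-up values rather than the real Flux dimensions, and the ranks are simulated in one process, with `torch.cat` standing in for the `torch.distributed.all_gather` plus concatenation that a real multi-GPU run would need.

```python
import torch
import torch.nn.functional as F


def column_parallel_forward(x, weight, bias, world_size):
    """Simulate a column-parallel linear layer followed by an all-gather.

    Hypothetical helper for illustration only. Each simulated rank owns a
    contiguous slice of the output features, i.e. a block of rows of
    `weight`, because nn.Linear stores its weight as [out_features, in_features].
    """
    weight_shards = torch.chunk(weight, world_size, dim=0)
    bias_shards = torch.chunk(bias, world_size, dim=0)

    # Every rank receives the full (replicated) input and computes
    # out_features / world_size of the output columns.
    partial_outputs = [
        F.linear(x, w, b) for w, b in zip(weight_shards, bias_shards)
    ]

    # Stand-in for the collective: in a real run each rank would call
    # torch.distributed.all_gather on its partial output and then
    # concatenate the gathered pieces along the last dimension.
    return torch.cat(partial_outputs, dim=-1)


torch.manual_seed(0)
hidden, out_features, world_size = 8, 24, 2  # illustrative sizes only
x = torch.randn(4, hidden)
weight = torch.randn(out_features, hidden)
bias = torch.randn(out_features)

full = F.linear(x, weight, bias)  # unsharded reference
gathered = column_parallel_forward(x, weight, bias, world_size)

print(gathered.shape)
print(torch.allclose(full, gathered, atol=1e-5))
```

After the gather, every rank holds the complete output, so the modulation parameters that norm1 and norm1_context derive from it would have their full width again.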

zhangvia added the enhancement New feature or request label Feb 14, 2025
