
Fix backward_dense_test #3702

Open · wants to merge 5 commits into main

Conversation

@avbokovoy (Contributor) commented Feb 18, 2025

Attempt to fix the following five existing issues in the dense unit test:

  1. OOM (only on A100; hard to catch due to the randomness of the UTs)
  2. Memory access error (both MI300X and A100; hard to catch)
  3. Assertion failure in the PoolingMode.MEAN test (both ROCm and Nvidia)
  4. Assertion failure in the PoolingMode.SUM test (both ROCm and Nvidia)
  5. Wrong indexing in the vbe test

Issues 1, 2, and 5 are also observed in pytorch/pytorch#141904.

The original intent of the aligned_grad_output_tensor_for_cuda_backwards() function is unclear to me, so this particular fix might be sub-optimal. I am therefore asking for reviews.
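For context, a minimal sketch of the alignment guard as I read it; the function name, the 16-byte threshold, and the contiguity check are assumptions for illustration, not the verbatim FBGEMM source. The change in this PR effectively makes the copy branch unconditional:

#include <cstdint>
#include <ATen/ATen.h>

// Sketch only: hand the CUDA backward kernels a grad_output whose data pointer
// is 16-byte aligned (for vectorized loads), copying only when the incoming
// tensor is misaligned or non-contiguous.
inline at::Tensor aligned_grad_output_sketch(const at::Tensor& grad_output) {
  auto aligned = grad_output;
  const bool misaligned =
      reinterpret_cast<std::uintptr_t>(grad_output.data_ptr()) % 16 != 0;
  if (misaligned || !grad_output.is_contiguous()) {
    // empty_like() allocates a fresh contiguous buffer; PyTorch's CUDA caching
    // allocator aligns allocations to at least 512 bytes, so the copy is aligned.
    aligned = at::empty_like(grad_output).copy_(grad_output);
  }
  return aligned;
}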

netlify bot commented Feb 18, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

  Latest commit:    a83abef
  Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67b6f37ec70a520008ab3c7b
  Deploy Preview:   https://deploy-preview-3702--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor) commented

@q10 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -296,17 +296,14 @@ std::string tensor_on_same_gpu_if_not_optional_check(

 inline at::Tensor aligned_grad_output_tensor_for_cuda_backwards(
     const at::Tensor& grad_output) {
-  auto aligned_grad_output = grad_output;
+  auto aligned_grad_output = at::empty_like(grad_output).copy_(grad_output);
Contributor commented on this change:

We should not do this every time. It will be costly. Is there a reason why you would like to do this?

Contributor Author (@avbokovoy) replied:

> We should not do this every time. It will be costly. Is there a reason why you would like to do this?

@sryap Agreed that we shouldn't do this every time. However, code bisecting showed that this change allows the unit test to pass. I need some help from your side:

  1. What is the intention of the aligned_grad_output_tensor_for_cuda_backwards() function? My assumption is that we return grad_output without a copy if its data is aligned to 16B, and otherwise build an "aligned" tensor from the input with a potential memory copy. Is a tensor constructed with .contiguous() or empty_like() guaranteed to be aligned?
  2. Could you please clarify what is tested here:
    https://github.com/pytorch/FBGEMM/pull/3702/files#diff-dc94c00639d812c6bddd3a893aa08255d1ca5819cc8c3cfa524706d5a21a65baR331-R340
    Do we want to make sure that sequential calls of bwd produce the same gradient w.r.t. feature_requires_grad?
  3. Are there any possible sync issues that might occur in this test scenario?
  4. The parameter set that reproduces the failure is:
(
    T=1,
    D=2,
    B=2,
    log_E=1,
    L=1,
    weights_precision=SparseType.FP16,
    weighted=False,
    mixed=False,
    mixed_B=True,
    long_segments=False,
    pooling_mode=PoolingMode.SUM,
    use_cpu=False,
    output_dtype=SparseType.FP32,
)

Also, the random seed needs to be fixed at the start of test_backward_dense:

np.random.seed(2007)
torch.manual_seed(2007)

Are those parameters valid?
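For reference, a self-contained sketch of how this case could be pinned down outside of hypothesis' random sampling; the import paths and the commented-out direct call are assumptions about the fbgemm_gpu/test layout, not verified entry points:

# Reproduction sketch (assumed import paths; adjust to the local fbgemm_gpu layout).
import numpy as np
import torch
from fbgemm_gpu.split_embedding_configs import SparseType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import PoolingMode

# Fix the seeds, as described above.
np.random.seed(2007)
torch.manual_seed(2007)

# The failing parameter set from the discussion above.
failing_case = dict(
    T=1,
    D=2,
    B=2,
    log_E=1,
    L=1,
    weights_precision=SparseType.FP16,
    weighted=False,
    mixed=False,
    mixed_B=True,
    long_segments=False,
    pooling_mode=PoolingMode.SUM,
    use_cpu=False,
    output_dtype=SparseType.FP32,
)

# Hypothetical direct call into the test body, bypassing hypothesis:
# BackwardDenseTest().execute_backward_dense_(**failing_case)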

@avbokovoy requested a review from @sryap on February 20, 2025 15:44