
Fix backward_dense_test #3702

Open · wants to merge 5 commits into main

Conversation

@avbokovoy (Contributor) commented Feb 18, 2025

Attempt to fix the following five existing issues in the dense unit test:

  1. OOM (only on A100; hard to catch due to the randomness of the UTs)
  2. Memory access error (both MI300X and A100; hard to catch)
  3. Assertion failure in the PoolingMode.MEAN test (both ROCm and Nvidia)
  4. Assertion failure in the PoolingMode.SUM test (both ROCm and Nvidia)
  5. Wrong indexing in the vbe test

Issues 1, 2, and 5 are also observed in pytorch/pytorch#141904.

The original intent of the aligned_grad_output_tensor_for_cuda_backwards() function is unclear to me, so this particular fix might be sub-optimal. I am therefore asking for reviews.
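For context, a minimal sketch of the alignment guard as I read it; the function name, the 16-byte threshold, and the contiguity check are assumptions for illustration, not the verbatim FBGEMM source. The change in this PR effectively makes the copy branch unconditional:

#include <cstdint>
#include <ATen/ATen.h>

// Sketch only: hand the CUDA backward kernels a grad_output whose data pointer
// is 16-byte aligned (for vectorized loads), copying only when the incoming
// tensor is misaligned or non-contiguous.
inline at::Tensor aligned_grad_output_sketch(const at::Tensor& grad_output) {
  auto aligned = grad_output;
  const bool misaligned =
      reinterpret_cast<std::uintptr_t>(grad_output.data_ptr()) % 16 != 0;
  if (misaligned || !grad_output.is_contiguous()) {
    // empty_like() allocates a fresh contiguous buffer; PyTorch's CUDA caching
    // allocator aligns allocations to at least 512 bytes, so the copy is aligned.
    aligned = at::empty_like(grad_output).copy_(grad_output);
  }
  return aligned;
}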

netlify bot commented Feb 18, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

  Latest commit:    a83abef
  Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67b6f37ec70a520008ab3c7b
  Deploy Preview:   https://deploy-preview-3702--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor) commented

@q10 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -296,17 +296,14 @@ std::string tensor_on_same_gpu_if_not_optional_check(

 inline at::Tensor aligned_grad_output_tensor_for_cuda_backwards(
     const at::Tensor& grad_output) {
-  auto aligned_grad_output = grad_output;
+  auto aligned_grad_output = at::empty_like(grad_output).copy_(grad_output);
Contributor commented on this change:

We should not do this every time. It will be costly. Is there a reason why you would like to do this?

Contributor Author (@avbokovoy) replied:

> We should not do this every time. It will be costly. Is there a reason why you would like to do this?

@sryap Agreed that we shouldn't do this every time. However, code bisecting showed that this change allows the unit test to pass. I need some help from your side:

  1. What is the intention of the aligned_grad_output_tensor_for_cuda_backwards() function? My assumption is that we return grad_output without a copy if its data is aligned to 16B, and otherwise build an "aligned" tensor from the input with a potential memory copy. Is a tensor constructed with .contiguous() or empty_like() guaranteed to be aligned?
  2. Could you please clarify what is tested here:
    https://github.com/pytorch/FBGEMM/pull/3702/files#diff-dc94c00639d812c6bddd3a893aa08255d1ca5819cc8c3cfa524706d5a21a65baR331-R340
    Do we want to make sure that sequential calls of bwd produce the same gradient w.r.t. feature_requires_grad?
  3. Are there any possible sync issues that might occur in this test scenario?
  4. The parameter set that reproduces the failure is:
(
    T=1,
    D=2,
    B=2,
    log_E=1,
    L=1,
    weights_precision=SparseType.FP16,
    weighted=False,
    mixed=False,
    mixed_B=True,
    long_segments=False,
    pooling_mode=PoolingMode.SUM,
    use_cpu=False,
    output_dtype=SparseType.FP32,
)

Also, the random seed needs to be fixed at the start of test_backward_dense:

np.random.seed(2007)
torch.manual_seed(2007)

Are those parameters valid?
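For reference, a self-contained sketch of how this case could be pinned down outside of hypothesis' random sampling; the import paths and the commented-out direct call are assumptions about the fbgemm_gpu/test layout, not verified entry points:

# Reproduction sketch (assumed import paths; adjust to the local fbgemm_gpu layout).
import numpy as np
import torch
from fbgemm_gpu.split_embedding_configs import SparseType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import PoolingMode

# Fix the seeds, as described above.
np.random.seed(2007)
torch.manual_seed(2007)

# The failing parameter set from the discussion above.
failing_case = dict(
    T=1,
    D=2,
    B=2,
    log_E=1,
    L=1,
    weights_precision=SparseType.FP16,
    weighted=False,
    mixed=False,
    mixed_B=True,
    long_segments=False,
    pooling_mode=PoolingMode.SUM,
    use_cpu=False,
    output_dtype=SparseType.FP32,
)

# Hypothetical direct call into the test body, bypassing hypothesis:
# BackwardDenseTest().execute_backward_dense_(**failing_case)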

@avbokovoy requested a review from @sryap on February 20, 2025 15:44