Skip the failing unit tests from the FusedRMSNorm PR #85

Merged 2 commits on Aug 8, 2022

hubertlu-tw

FusedRMSNorm PR: #78
The failing unit tests that this PR skips are listed in #78 (comment)
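For readers unfamiliar with skipping tests, here is a minimal sketch of the general technique using only the standard `unittest` module. It is not the diff from this PR: the class name, test names, and the `ON_ROCM` flag are all hypothetical and stand in for whatever condition the real test suite checks.

```python
import unittest

# Hypothetical flag; a real suite would detect the platform instead.
ON_ROCM = True


class FusedRMSNormSkipSketch(unittest.TestCase):
    @unittest.skipIf(ON_ROCM, "known failure on this platform, skipped for now")
    def test_known_failure(self):
        # The body never runs while the skip condition is true.
        self.fail("would fail if it were executed")

    def test_still_runs(self):
        # Tests without the decorator are unaffected.
        self.assertEqual(1 + 1, 2)


def run_sketch():
    """Run the sketch test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FusedRMSNormSkipSketch)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    outcome = run_sketch()
    print("skipped:", len(outcome.skipped), "failures:", len(outcome.failures))
```

A skipped test is reported separately from passes and failures, so the skip stays visible in CI output instead of silently disappearing.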

@hubertlu-tw
Author

The two tests that fail in the CI checks are both flaky.

rocm-pytorch-master

FAIL: test_loss_scale_decrease (test_checkpointing.TestCheckpointing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/apex/tests/L0/run_amp/test_checkpointing.py", line 212, in test_loss_scale_decrease
    self.assertEqual(update_ls, init_ls / 2**factor)
AssertionError: 32768.0 != 16384.0
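The assertion above compares the updated loss scale against `init_ls / 2**factor`. The sketch below works only from the two numbers printed in the failure message. The values of `init_ls` and `factor` do not appear in the log, so they are not assumed here, and the helper name `missing_halvings` is hypothetical.

```python
import math

# Values copied from the AssertionError above.
observed_scale = 32768.0  # update_ls, the left-hand side
expected_scale = 16384.0  # init_ls / 2**factor, the right-hand side


def missing_halvings(observed, expected):
    """Number of additional halvings needed to bring `observed` down to `expected`."""
    return math.log2(observed / expected)


if __name__ == "__main__":
    print(missing_halvings(observed_scale, expected_scale))
```

The observed scale is exactly one halving short of the expected value. The log does not say why that halving did not happen.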

rocm-pytorch-release

FAIL: test_layer_norm (test_fused_layer_norm.TestFusedLayerNormElemWiseBFloat16) (contiguous=True)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/apex/tests/L0/run_fused_layer_norm/test_fused_layer_norm.py", line 69, in _test_same_output
    self._check_same_output(batch_size, contiguous)
  File "/apex/tests/L0/run_fused_layer_norm/test_fused_layer_norm.py", line 62, in _check_same_output
    out_cpu_.to(device="cuda", dtype=self.dtype), out_cuda_, **self.fwd_thresholds)
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_deprecated.py", line 32, in inner_wrapper
    return_value = fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_deprecated.py", line 89, in assert_allclose
    msg=msg or None,
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1321, in assert_close
    msg=msg,
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1074, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 8192 (0.0%)
Greatest absolute difference: 0.0006103515625 at index (8, 12, 10) (up to 0.0003 allowed)
Greatest relative difference: 0.03546099290780142 at index (8, 12, 10) (up to 0.016 allowed)
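To see how narrowly the single mismatched element misses, the check can be redone by hand. This sketch assumes the combined criterion documented for `torch.testing.assert_close`, `|actual - expected| <= atol + rtol * |expected|`, and assumes the reported relative difference is the absolute difference divided by `|expected|`. The reference magnitude is therefore inferred from the message rather than read from the test data. The bfloat16 spacing uses the format's 8 bits of significand precision.

```python
import math

# Values copied from the failure message above.
abs_diff = 0.0006103515625
rel_diff = 0.03546099290780142
atol = 0.0003
rtol = 0.016

# Assumption: rel_diff == abs_diff / |expected|, so |expected| can be recovered.
expected_mag = abs_diff / rel_diff


def allowed_difference(expected_magnitude, atol, rtol):
    """Largest absolute difference accepted by the combined atol/rtol criterion."""
    return atol + rtol * expected_magnitude


def bfloat16_spacing(magnitude):
    """Gap between adjacent bfloat16 values near `magnitude` (7 stored mantissa bits)."""
    return 2.0 ** (math.floor(math.log2(magnitude)) - 7)


if __name__ == "__main__":
    limit = allowed_difference(expected_mag, atol, rtol)
    print("inferred |expected|:", expected_mag)
    print("allowed difference: ", limit)
    print("passes:", abs_diff <= limit)
    print("difference in bfloat16 steps:", abs_diff / bfloat16_spacing(expected_mag))
```

Under these assumptions the element sits only a few bfloat16 steps away from the reference and slightly outside the allowed difference, on 1 of 8192 elements.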

@jithunnair-amd
Collaborator

Just to elaborate, based on offline discussion, the above two tests are flaky even on upstream Apex.

@jithunnair-amd jithunnair-amd merged commit 87fc412 into master Aug 8, 2022