[Windows][ARC][UT] test_rms_norm.py::TestNNMethod::test_rms_norm_bw failed with error AssertionError: Tensor-likes are not close! #1400

Open
huaiyuzh opened this issue Feb 24, 2025 · 0 comments
[Windows][ARC][UT] tests/gpu/examples/test_rms_norm.py::TestNNMethod::test_rms_norm_bw failed with error AssertionError: Tensor-likes are not close!

[Error Details]

================= Session test_rms_norm.py =================
test_rms_norm.py::TestNNMethod::test_rms_norm_bw, FAILED, E AssertionError: Tensor-likes are not close!, 1.960

self = <test_rms_norm.TestNNMethod testMethod=test_rms_norm_bw>

def test_rms_norm_bw(self):
    def test_rms_norm_fwd_bwd(dtype):
        print("test_rms_norm_fw_bw", dtype)
        torch.manual_seed(13)
        modelb = RMSNormRef(64)
        model0 = RMSNormRef(768)
        model1 = RMSNormRef(2048)
        model2 = RMSNormRef(4096)
        model3 = RMSNormRef(16384)
        model4 = RMSNormRef(16384 * 4 + 123)
        hszs = [64, 768, 2048, 4096, 16384, 16384 * 4 + 123]
        ls = [modelb, model0, model1, model2, model3, model4]
        for i, model in enumerate(ls):
            model = model.to(dtype)
            hsz = hszs[i]
            input_case = torch.rand(4, 1024, hsz).to(dtype)
            input_case.requires_grad_(True)
            grad = torch.rand(4, 1024, hsz).to(dtype)
            output_ref = model(input_case)
            output_ref.backward(grad)
            grad_wei = model.weight.grad.clone()
            input_grad_cpu = input_case.grad.clone()
            w = model.weight.clone()

            input_case_xpu = input_case.clone().xpu()
            input_case_xpu.retain_grad()
            input_case_xpu.requires_grad_(True)
            grad_xpu = grad.xpu()
            w = w.xpu()
            w.retain_grad()
            w.requires_grad_(True)
            output1 = torch.xpu.IpexRmsNorm(input_case_xpu, [hsz], w, 1e-5)
            output1.backward(grad_xpu)
            grad_wei_xpu = w.grad

            self.assertEqual(grad_wei_xpu.cpu(), grad_wei, atol=10e-2, rtol=10e-2)
            self.assertEqual(
                input_case_xpu.grad.cpu(), input_grad_cpu, atol=10e-2, rtol=10e-2
            )
>   test_rms_norm_fwd_bwd(torch.bfloat16)

test_rms_norm.py:94:


dtype = torch.bfloat16

def test_rms_norm_fwd_bwd(dtype):
    print("test_rms_norm_fw_bw", dtype)
    torch.manual_seed(13)
    modelb = RMSNormRef(64)
    model0 = RMSNormRef(768)
    model1 = RMSNormRef(2048)
    model2 = RMSNormRef(4096)
    model3 = RMSNormRef(16384)
    model4 = RMSNormRef(16384 * 4 + 123)
    hszs = [64, 768, 2048, 4096, 16384, 16384 * 4 + 123]
    ls = [modelb, model0, model1, model2, model3, model4]
    for i, model in enumerate(ls):
        model = model.to(dtype)
        hsz = hszs[i]
        input_case = torch.rand(4, 1024, hsz).to(dtype)
        input_case.requires_grad_(True)
        grad = torch.rand(4, 1024, hsz).to(dtype)
        output_ref = model(input_case)
        output_ref.backward(grad)
        grad_wei = model.weight.grad.clone()
        input_grad_cpu = input_case.grad.clone()
        w = model.weight.clone()

        input_case_xpu = input_case.clone().xpu()
        input_case_xpu.retain_grad()
        input_case_xpu.requires_grad_(True)
        grad_xpu = grad.xpu()
        w = w.xpu()
        w.retain_grad()
        w.requires_grad_(True)
        output1 = torch.xpu.IpexRmsNorm(input_case_xpu, [hsz], w, 1e-5)
        output1.backward(grad_xpu)
        grad_wei_xpu = w.grad
>       self.assertEqual(grad_wei_xpu.cpu(), grad_wei, atol=10e-2, rtol=10e-2)

E AssertionError: Tensor-likes are not close!
E
E Mismatched elements: 128 / 16384 (0.8%)
E Greatest absolute difference: 488.0 at index (9973,) (up to 0.1 allowed)
E Greatest relative difference: 0.265625 at index (9860,) (up to 0.1 allowed)
E
E To execute this test run the following from the base repo dir:
E python test_rms_norm.py TestNNMethod.test_rms_norm_bw
E
E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

test_rms_norm.py:89: AssertionError
---------------------------- Captured stdout call -----------------------------
test_rms_norm_fw_bw torch.bfloat16
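
To put the reported numbers in context, here is a minimal sketch of the element-wise closeness rule documented for `torch.testing.assert_close`, which is `|actual - expected| <= atol + rtol * |expected|`. It is plain Python with no PyTorch dependency, and the helper names `is_close` and `min_expected_magnitude` are hypothetical, introduced only for illustration. It assumes the test's `self.assertEqual` applies that same rule, which the "up to 0.1 allowed" wording in the message suggests but the report does not state outright.

```python
import math

# Tolerances exactly as written in the test. Note that 10e-2 is 0.1, not 0.01.
ATOL = 10e-2
RTOL = 10e-2


def is_close(actual: float, expected: float,
             rtol: float = RTOL, atol: float = ATOL) -> bool:
    """Element-wise rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def min_expected_magnitude(abs_diff: float,
                           rtol: float = RTOL, atol: float = ATOL) -> float:
    """Smallest |expected| for which an absolute difference still passes."""
    return max(0.0, (abs_diff - atol) / rtol)


if __name__ == "__main__":
    # How large must the reference value be for a difference of 488.0 to pass?
    print(min_expected_magnitude(488.0))
    # A relative difference of 0.265625 on a reference value of 1.0 is rejected.
    print(is_close(1.265625, 1.0))
```

Under this rule, an absolute difference of 488.0 would only be accepted where the reference gradient has magnitude of roughly 4879 or more. The 128 mismatched elements out of 16384 therefore lie outside even this loose tolerance, and the element count matches the weight gradient for `hsz = 16384`.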
