[Win] UR Error when OOM and break the tensor context #1324

Stonepia · 2025-01-24T02:54:35Z

🐛 Describe the bug

We found that when a model runs out of memory (OOM), a UR error is raised, and this UR error breaks the tensor context: tensors that were allocated before the OOM can no longer be accessed.

# On an LNL machine with 16.5 GB of host memory
>>> import torch

# 1. First, fill all of the GPU memory
>>> x1 = torch.ones(1024*1024*1024, dtype=torch.float64, device='xpu')  # 8 GB
>>> x2 = torch.ones(1024*1024*1024, dtype=torch.float64, device='xpu')  # 8 GB
>>> x3 = torch.ones(128*1024*1024, dtype=torch.float32, device='xpu')   # 0.5 GB

# 1.1 This access works
>>> x1[100]
tensor(1., device='xpu:0', dtype=torch.float64)

# 1.2 Allocate another tensor. This hits OOM, as expected, so x4 should not be created.
>>> x4 = torch.ones(128*1024*1024, dtype=torch.float32, device='xpu')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
torch.OutOfMemoryError: XPU out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 16.50 GiB. Of the allocated memory 16.50 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.
 
# 2. Re-access x1. It fails with a UR error (OUT_OF_RESOURCES). This is unexpected: the tensor context of x1 should be unaffected.
 
>>> x1[100]
File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 146, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
RuntimeError: UR backend failed. UR backend returns:40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
 
# 3. Re-access x1 again. It fails with a UR error that carries no useful information
>>> x1[100]
File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 704, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 621, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 353, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 146, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
RuntimeError: UR error
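
As a sanity check on the sizes in the transcript, the arithmetic below (a small sketch, not part of the original report; the variable names are illustrative) shows that x1, x2, and x3 together add up to the 16.50 GiB capacity quoted in the OOM message, so the 512 MiB request for x4 has nowhere to go.

```python
# Tensor sizes from the transcript: element count * bytes per element.
GIB = 1024 ** 3
MIB = 1024 ** 2

FLOAT64_BYTES = 8
FLOAT32_BYTES = 4

allocations = {
    "x1": 1024 * 1024 * 1024 * FLOAT64_BYTES,
    "x2": 1024 * 1024 * 1024 * FLOAT64_BYTES,
    "x3": 128 * 1024 * 1024 * FLOAT32_BYTES,
}
x4_request = 128 * 1024 * 1024 * FLOAT32_BYTES

total = sum(allocations.values())
for name, size in allocations.items():
    print(f"{name}: {size / GIB:.2f} GiB")
print(f"total allocated: {total / GIB:.2f} GiB")  # matches "16.50 GiB" in the error
print(f"x4 request: {x4_request / MIB:.2f} MiB")  # matches "512.00 MiB" in the error
```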

For comparison, CUDA handles all of the above. Its logic is as follows:

If dedicated memory runs out, it tries to allocate from host memory. This behavior can be disabled; see https://forums.developer.nvidia.com/t/cuda-unified-memory-oversubscription-in-windows-systems/58391
If CUDA's unified memory oversubscription is disabled, it throws OOM, but re-accessing x1 in the example still works: its tensor context is not affected.
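
The expected behavior (an OOM is raised, but earlier tensors stay usable) can be expressed as a small check. This is only a sketch under assumptions: the function name `check_oom_recovery`, its return strings, and the `max_chunks` cap are hypothetical and not from the report, and it allocates 512 MiB chunks rather than the exact tensors above. It returns "skipped" when PyTorch or the device is unavailable.

```python
def check_oom_recovery(device="xpu", max_chunks=256):
    """Fill device memory until OOM, then re-access an earlier tensor.

    Returns one of (hypothetical labels):
      "skipped": torch or the device is not available
      "no_oom":  max_chunks were allocated without hitting OOM
      "ok":      OOM was raised and the earlier tensor is still readable
      "broken":  a non-OOM RuntimeError was raised (e.g. the UR error)
    """
    try:
        import torch
    except ImportError:
        return "skipped"

    backend = getattr(torch, device, None)
    if backend is None or not backend.is_available():
        return "skipped"

    # torch.OutOfMemoryError exists in recent releases; fall back if missing.
    oom_error = getattr(torch, "OutOfMemoryError", RuntimeError)

    first = torch.ones(1024, dtype=torch.float32, device=device)
    held = []  # keep references so the memory stays allocated
    chunk_elems = 128 * 1024 * 1024  # 512 MiB of float32 per chunk
    hit_oom = False
    try:
        for _ in range(max_chunks):
            held.append(torch.ones(chunk_elems, dtype=torch.float32, device=device))
    except oom_error:
        hit_oom = True
    except RuntimeError:
        return "broken"

    if not hit_oom:
        return "no_oom"

    # Mirror the report: printing an element runs kernels on the device.
    try:
        repr(first[100])
    except RuntimeError:
        return "broken"
    return "ok"


if __name__ == "__main__":
    print(check_oom_recovery())
```

On a machine affected by this issue, the re-access step would be expected to return "broken"; with a fixed driver, or on CUDA with oversubscription disabled, "ok".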

Versions


guangyey · 2025-02-12T03:26:26Z

It seems like a driver bug. @riverliuintel, please help by asking the driver team to cherry-pick the hotfix for the release branch used by PT2.6.

riverliuintel · 2025-02-12T04:05:39Z

Could you please submit a JIRA ticket for the driver team? I will push for this fix in the Windows driver.

guangyey · 2025-02-13T07:33:38Z

Targeted for agama 25.11.

daisyden · 2025-02-20T03:16:55Z

@Stonepia, please try the new driver in 2.7.

Stonepia assigned guangyey Jan 24, 2025

Stonepia added the os: Windows Windows Platform label Jan 24, 2025

daisyden added this to the PT2.7 milestone Feb 20, 2025

Stonepia mentioned this issue Feb 24, 2025

[LNL Windows][Test by CD Nightly Wheels] BlenderbotSmallForCausalLM failed with UR Error #1158

Open
