[Win] UR Error when OOM and break the tensor context #1324

Open
Stonepia opened this issue Jan 24, 2025 · 4 comments

Stonepia commented Jan 24, 2025

🐛 Describe the bug

When a model runs out of memory (OOM) on XPU, we get a UR error, and this UR error breaks the tensor context: tensors that were allocated before the OOM can no longer be accessed.

# On an LNL machine with 16.5 GB of host memory
>>> import torch

# 1. First, fill all of the GPU memory
>>> x1 = torch.ones(1024*1024*1024, dtype=torch.float64, device='xpu')  # 8 GB
>>> x2 = torch.ones(1024*1024*1024, dtype=torch.float64, device='xpu')  # 8 GB
>>> x3 = torch.ones(128*1024*1024, dtype=torch.float32, device='xpu')   # 0.5 GB

# 1.1 Accessing x1 works at this point
>>> x1[100]
tensor(1., device='xpu:0', dtype=torch.float64)

# 1.2 Allocating another tensor fails with OOM, as expected. x4 should not be created.
>>> x4 = torch.ones(128*1024*1024, dtype=torch.float32, device='xpu')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
torch.OutOfMemoryError: XPU out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 16.50 GiB. Of the allocated memory 16.50 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.
 
# 2. Re-accessing x1 now fails with a UR error (OUT_OF_RESOURCES). This is unexpected: the failed allocation should not affect the tensor context of x1.
 
>>> x1[100]
File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 146, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
RuntimeError: UR backend failed. UR backend returns:40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
 
# 3. Re-accessing x1 again fails with a UR error that carries no useful information
>>> x1[100]
File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 704, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 621, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 353, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 146, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
RuntimeError: UR error
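
To make the sizes in the transcript easier to check, here is a minimal sketch of the arithmetic in plain Python (no PyTorch needed). The helper `tensor_bytes` and the `ITEMSIZE` table are illustrative names, not part of any library, and the calculation ignores any allocator overhead.

```python
GIB = 1024 ** 3

# Bytes per element for the dtypes used in the transcript.
ITEMSIZE = {"float64": 8, "float32": 4}


def tensor_bytes(numel: int, dtype: str) -> int:
    """Raw payload size of a dense tensor, ignoring allocator overhead."""
    return numel * ITEMSIZE[dtype]


# The three allocations that succeed in step 1.
allocations = [
    ("x1", 1024 * 1024 * 1024, "float64"),
    ("x2", 1024 * 1024 * 1024, "float64"),
    ("x3", 128 * 1024 * 1024, "float32"),
]

total_bytes = sum(tensor_bytes(n, dt) for _, n, dt in allocations)
x4_bytes = tensor_bytes(128 * 1024 * 1024, "float32")

print(f"allocated: {total_bytes / GIB} GiB")  # allocated: 16.5 GiB
print(f"x4 needs:  {x4_bytes / GIB} GiB")     # x4 needs:  0.5 GiB
```

The total matches the 16.50 GiB capacity reported in the OOM message, so the 512 MiB request for x4 cannot be satisfied.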

By comparison, CUDA handles all of the above. Its behavior is as follows:

  1. If dedicated memory is insufficient, it tries to allocate from host memory. This behavior can be disabled; see https://forums.developer.nvidia.com/t/cuda-unified-memory-oversubscription-in-windows-systems/58391
  2. With CUDA's unified memory oversubscription disabled, it throws OOM, and re-accessing x1 as in the example still works; its tensor context is not affected.
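
The expected behavior (an OOM on a new allocation leaves existing tensors usable) could be checked with a script along these lines. This is only a sketch: `survives_oom` and `run_xpu_repro` are hypothetical helper names, the tensor sizes are copied from the transcript and assume the same 16.5 GB machine, and it assumes a PyTorch build that exposes `torch.OutOfMemoryError`, as the traceback above does.

```python
from typing import Callable, Tuple, Type


def survives_oom(
    allocate: Callable[[], object],
    probe: Callable[[], object],
    oom_types: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> bool:
    """Return True if `probe` still works after `allocate` fails with OOM."""
    probe()  # sanity check: the existing tensor is readable before the OOM
    try:
        allocate()
    except oom_types:
        pass  # expected: the device is full
    else:
        raise RuntimeError("allocation succeeded; device memory was not full")
    try:
        probe()  # this is the access that fails with "UR error" in the report
    except RuntimeError:
        return False
    return True


def run_xpu_repro() -> bool:
    """Hypothetical wiring of the check to the XPU repro from this issue."""
    import torch  # imported lazily so survives_oom works without PyTorch

    n = 1024 * 1024 * 1024
    # x2 and x3 are unused below, but they must stay alive to keep memory full.
    x1 = torch.ones(n, dtype=torch.float64, device="xpu")
    x2 = torch.ones(n, dtype=torch.float64, device="xpu")
    x3 = torch.ones(128 * 1024 * 1024, dtype=torch.float32, device="xpu")
    return survives_oom(
        allocate=lambda: torch.ones(
            128 * 1024 * 1024, dtype=torch.float32, device="xpu"
        ),
        probe=lambda: repr(x1[100]),  # same code path as printing in the REPL
        oom_types=(torch.OutOfMemoryError,),
    )
```

Calling `run_xpu_repro()` on an affected machine would be expected to return `False`; with a fixed driver it should return `True`.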

Versions

@Stonepia Stonepia added the os: Windows Windows Platform label Jan 24, 2025
@guangyey
Contributor

It seems like a driver bug. @riverliuintel, please help by asking the driver team to cherry-pick the hotfix for the release branch used by PT2.6.

@riverliuintel
Contributor

Could you please submit a JIRA ticket for the driver team? I will push for this fix in the Windows driver.

@guangyey
Contributor

Targeted for agama 25.11.

@daisyden
Contributor

@Stonepia, please try the new driver in 2.7.
