We found that when a model runs out of device memory on XPU, we get a UR Error, and this UR Error breaks the tensor context: previously allocated tensors can no longer be accessed.
```python
>>> import torch
# On an LNL system with 16.5 GB host memory
# 1. First fill all the GPU memory
>>> x1 = torch.ones(1024*1024*1024, dtype=torch.float64, device='xpu')  # 8 GB
>>> x2 = torch.ones(1024*1024*1024, dtype=torch.float64, device='xpu')  # 8 GB
>>> x3 = torch.ones(128*1024*1024, dtype=torch.float32, device='xpu')   # 0.5 GB
# 1.1 You can see this access is OK
>>> x1[100]
tensor(1., device='xpu:0', dtype=torch.float64)
# 1.2 Fill another tensor; this is OOM, as expected. x4 should not be created.
>>> x4 = torch.ones(128*1024*1024, dtype=torch.float32, device='xpu')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
torch.OutOfMemoryError: XPU out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 16.50 GiB. Of the allocated memory 16.50 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.
# 2. Re-access the tensor; it gets a UR Error with OUT_OF_RESOURCES.
#    This is unexpected: the tensor context of x1 should still be intact.
>>> x1[100]
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 146, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
RuntimeError: UR backend failed. UR backend returns: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
# 3. Re-access again; it gets a UR Error without any useful information
>>> x1[100]
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 704, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 621, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 353, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "C:\Users\sdp\miniforge3\envs\tongsu_onednn_37_no_fix\lib\site-packages\torch\_tensor_str.py", line 146, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
RuntimeError: UR error
```
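The size annotations in the transcript follow from element count times element size; a quick check of the arithmetic:

```python
# Byte sizes behind the (8GB) / (0.5GB) annotations in the repro above.
GiB = 1024 ** 3
x1_bytes = 1024 * 1024 * 1024 * 8   # 2^30 elements, float64 is 8 bytes
x3_bytes = 128 * 1024 * 1024 * 4    # 2^27 elements, float32 is 4 bytes
print(x1_bytes / GiB)  # 8.0
print(x3_bytes / GiB)  # 0.5
```

So x1 and x2 together occupy 16 GiB, x3 brings the total to 16.5 GiB, and the failing x4 allocation (512 MiB) matches the "Tried to allocate 512.00 MiB" in the OOM message.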
As a comparison, CUDA handles all of the above. Its logic is as follows: with CUDA's unified-memory oversubscription disabled, the failing allocation throws OOM, and re-accessing `x1` in the example still works; its tensor context is not affected.
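The expected semantics can be illustrated with a toy allocator (a hedged sketch only; `ToyCachingAllocator` is hypothetical and not PyTorch's actual caching allocator): an allocation that cannot be satisfied should raise before mutating any allocator state, so earlier allocations remain valid, which is the CUDA behavior described above.

```python
class ToyCachingAllocator:
    """Illustrative stand-in for a device caching allocator (not the real XPU/CUDA one)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = {}
        self._next_id = 0

    def alloc(self, nbytes):
        if self.used + nbytes > self.capacity:
            # Fail *before* touching any state, so existing blocks stay valid.
            raise MemoryError(
                f"tried to allocate {nbytes} bytes, "
                f"only {self.capacity - self.used} free"
            )
        self._next_id += 1
        self.blocks[self._next_id] = nbytes
        self.used += nbytes
        return self._next_id

    def valid(self, block_id):
        return block_id in self.blocks


alloc = ToyCachingAllocator(capacity_bytes=16)
x1 = alloc.alloc(12)
try:
    x4 = alloc.alloc(8)  # over capacity: OOM, expected
except MemoryError:
    pass
assert alloc.valid(x1)  # earlier allocation is unaffected, as on CUDA
```

On XPU, the repro shows the opposite: after the failed allocation, even reads of the earlier `x1` raise UR errors, as if the failed allocation had corrupted the device context.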
Versions