
rocminfo generates error HSA_STATUS_ERROR_OUT_OF_RESOURCES #5

Open
echosalik opened this issue Dec 31, 2021 · 9 comments

Comments

@echosalik

echosalik commented Dec 31, 2021

After installing ROCm 4.5.0 I followed the method here: I added rocm-dkms and rocm-libs, and then installed the rocBLAS package downloaded from here for ROCm 4.5.0.

When I run rocm-smi I get this:

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr   SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    45.0c  14.127W  760Mhz  1750Mhz  0%   auto  48.0W    11%   0%    
================================================================================
============================= End of ROCm SMI Log ==============================

But when I run rocminfo I get this:

ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1143
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

I am already part of the render and video groups:

salik@salik-pc:~$ groups
salik adm cdrom sudo dip video plugdev kvm render lpadmin lxd sambashare libvirt docker

Any help would be appreciated.

Kernel: Linux 5.11.0-43-generic #47
OS: Ubuntu 20.04.2
ROCm: 4.5

@xuhuisheng
Owner

I suggest testing whether the CPU and motherboard support PCIe atomics. Just run dmesg | grep kfd and check whether there are any messages like:

kfd: skipped device 1002:7300, PCI rejects atomics
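
For reference, a more direct way to check the PCIe atomics capability is to look at the AtomicOps bits with lspci (a sketch, assuming the pciutils package is installed; the exact capability strings vary between kernel and pciutils versions):

# show the AtomicOps capability/control bits for PCIe devices (root is usually needed for full output)
sudo lspci -vvv | grep -i atomic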

@echosalik
Author

Hi,
It returns nothing, just an empty response.

@xuhuisheng
Owner

There should be a /dev/kfd device initialized at system startup.
/dev/kfd is the virtual device node that maps to the AMD GPU.
You can run ll /dev/kfd to check whether it exists.
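
For completeness, a quick sketch to verify that the device node exists and that the current user can actually open it (this assumes a standard ROCm setup where /dev/kfd belongs to the render group):

# list the node and its ownership/permissions
ls -l /dev/kfd /dev/dri/renderD*
# check whether the current user has read/write access to it
test -r /dev/kfd && test -w /dev/kfd && echo "/dev/kfd accessible" || echo "/dev/kfd NOT accessible"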

@echosalik
Author

echosalik commented Dec 31, 2021

My bad (I typed kdf instead of kfd); dmesg does return kfd output.
dmesg:

[    1.397124] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    1.397246] kfd kfd: amdgpu: added device 1002:67ef

ls for /dev/kfd:

salik@salik-pc:~$ ls -la /dev/kfd 
crw-rw---- 1 root render 238, 0 Dec 31 09:22 /dev/kfd

rocminfo (with and without sudo):

ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1143
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

@echosalik
Author

I ran this code, just in case ROCm was being weird.

import tensorflow as tf
print(tf.test.gpu_device_name())

On running the above code I get this:

2021-12-31 18:14:15.175273: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-31 18:14:15.176885: E tensorflow/stream_executor/rocm/rocm_driver.cc:983] could not retrieve ROCM device count: HIP_ERROR_NoDevice
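
For what it's worth, a slightly more verbose check (a sketch, assuming a TensorFlow 2.x ROCm build is installed) that lists the physical GPU devices TensorFlow can see; an empty list points to the same HIP_ERROR_NoDevice problem as above:

# prints something like [PhysicalDevice(name='/physical_device:GPU:0', ...)] when the GPU is visible
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"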

@Lunatik00

Try running it inside a Docker container.
Once Docker is installed, run
docker pull rocm/dev-ubuntu-20.04:4.5-complete to download the ROCm image,
then
docker run -it -v $HOME/dockerx:dockerx --privileged --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:4.5-complete bash
to enter the container with the GPU enabled. You can create the folder dockerx in your home directory; it will be shared between the host and the container.

The next part is done inside the container.
If you haven't already downloaded the files, you need to install wget to download them.

wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm452/rocblas_2.41.0-337552f0.dirty_amd64.deb
dpkg -i rocblas_2.41.0-337552f0.dirty_amd64.deb

With that you have ROCm, and the only other thing you need is the AMD driver on the host.
Now, this container has Python 3.8. The person who made these fixes compiled TensorFlow and PyTorch for Python 3.8; if you need another version you must compile them yourself (and hopefully share the result afterwards).
So, for TensorFlow:

wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm43/tensorflow-2.6.0-cp38-cp38-linux_x86_64.whl
pip3 install tensorflow-2.6.0-cp38-cp38-linux_x86_64.whl

and for pytorch:

wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm43/torch-1.9.0a0+gitd69c22d-cp38-cp38-linux_x86_64.whl
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm43/torchvision-0.10.0a0+300a8a4-cp38-cp38-linux_x86_64.whl
pip3 install torch-1.9.0a0+gitd69c22d-cp38-cp38-linux_x86_64.whl
pip3 install torchvision-0.10.0a0+300a8a4-cp38-cp38-linux_x86_64.whl

I have not used PyTorch, so I don't know whether it works. Either way, this lets you troubleshoot whether your problem comes from your installation of ROCm, since the container comes preinstalled with everything you need, at least for TensorFlow 2.6.
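
As a quick sanity check (a sketch assembled from the same flags as above, not an official one-liner), you can also run rocminfo through the container without entering it:

# run rocminfo inside the ROCm container with the GPU devices passed through
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:4.5-complete rocminfo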

@echosalik
Author

@Lunatik00 Hey, I tried it out. Same error:

root@0f607de254bf:/# apt show rocblas
Package: rocblas
Version: 2.41.0-337552f0~dirty
Status: install ok installed
Priority: optional
Section: devel
Maintainer: rocBLAS Maintainer <[email protected]>
Installed-Size: 260 MB
Depends: hip-rocclr (>= 4.0.0), rocm-core
Recommends: rocblas-dev (>= 2.41.0)
Download-Size: unknown
APT-Manual-Installed: no
APT-Sources: /var/lib/dpkg/status
Description: rocBLAS is AMD's library for BLAS on ROCm. It is implemented in HIP and optimized for AMD GPUs
root@0f607de254bf:/# rocminfo 
ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1143
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

@Lunatik00

Then it is likely something with your driver on the host machine. Do you have the mesa and amdgpu packages?
I have just those two installed and the Docker setup works well for me. I have not installed the ROCm packages on the host, so maybe you need to remove them from the host. At least you now know that it is not your ROCm install; it is an issue with the drivers. Do you have an Intel APU?
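
A sketch of the host-side checks this suggests (package name patterns are assumptions for Ubuntu 20.04; adjust as needed):

# is the amdgpu kernel driver loaded on the host?
lsmod | grep amdgpu
# which mesa/amdgpu-related packages are installed on the host?
dpkg -l | grep -Ei 'amdgpu|mesa'
# any amdgpu or kfd messages during boot that point at a driver problem?
dmesg | grep -Ei 'amdgpu|kfd'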

@tejasraman

I have the same issue with ROCm 5.2 #8
