
rocminfo generates error HSA_STATUS_ERROR_OUT_OF_RESOURCES #5

Open
echosalik opened this issue Dec 31, 2021 · 9 comments

Comments

@echosalik

echosalik commented Dec 31, 2021

After installing ROCm 4.5.0 I followed the method here: I added rocm-dkms and rocm-libs, and then installed the rocBLAS package downloaded from here for ROCm 4.5.0.

When I run rocm-smi I get this:

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr   SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    45.0c  14.127W  760Mhz  1750Mhz  0%   auto  48.0W    11%   0%    
================================================================================
============================= End of ROCm SMI Log ==============================

But when I run rocminfo I get this:

ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1143
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

I am already part of the render and video groups:

salik@salik-pc:~$ groups
salik adm cdrom sudo dip video plugdev kvm render lpadmin lxd sambashare libvirt docker

Any help would be appreciated.

Kernel: Linux 5.11.0-43-generic #47
OS: Ubuntu 20.04.2
ROCm: 4.5

@xuhuisheng
Owner

I suggest testing whether the CPU and motherboard support PCIe atomics. Just run dmesg | grep kfd and check whether there are any messages like:

kfd: skipped device 1002:7300, PCI rejects atomics
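
For reference, a more direct way to check the PCIe atomics capability is to look at the AtomicOps bits with lspci (a sketch, assuming the pciutils package is installed; the exact capability strings vary between kernel and pciutils versions):

# show the AtomicOps capability/control bits for PCIe devices (root is usually needed for full output)
sudo lspci -vvv | grep -i atomic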

@echosalik
Author

Hi,
It returns nothing, just an empty response.

@xuhuisheng
Owner

There should be a /dev/kfd device initialized at system startup.
/dev/kfd is the virtual device node that maps to the AMD GPU.
You can run ll /dev/kfd to check whether it exists.
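
For completeness, a quick sketch to verify that the device node exists and that the current user can actually open it (this assumes a standard ROCm setup where /dev/kfd belongs to the render group):

# list the node and its ownership/permissions
ls -l /dev/kfd /dev/dri/renderD*
# check whether the current user has read/write access to it
test -r /dev/kfd && test -w /dev/kfd && echo "/dev/kfd accessible" || echo "/dev/kfd NOT accessible"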

@echosalik
Author

echosalik commented Dec 31, 2021

My bad (I typed kdf instead of kfd); dmesg does return kfd output.
dmesg:

[    1.397124] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    1.397246] kfd kfd: amdgpu: added device 1002:67ef

ls for /dev/kfd:

salik@salik-pc:~$ ls -la /dev/kfd 
crw-rw---- 1 root render 238, 0 Dec 31 09:22 /dev/kfd

rocminfo (with and without sudo):

ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1143
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

@echosalik
Author

I ran this code, just in case ROCm was being weird.

import tensorflow as tf
print(tf.test.gpu_device_name())

On running the above code I get this:

2021-12-31 18:14:15.175273: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-31 18:14:15.176885: E tensorflow/stream_executor/rocm/rocm_driver.cc:983] could not retrieve ROCM device count: HIP_ERROR_NoDevice
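
For what it's worth, a slightly more verbose check (a sketch, assuming a TensorFlow 2.x ROCm build is installed) that lists the physical GPU devices TensorFlow can see; an empty list points to the same HIP_ERROR_NoDevice problem as above:

# prints something like [PhysicalDevice(name='/physical_device:GPU:0', ...)] when the GPU is visible
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"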

@Lunatik00

Try running it inside a Docker container.
Once Docker is installed, run
docker pull rocm/dev-ubuntu-20.04:4.5-complete to download the ROCm image,
then
docker run -it -v $HOME/dockerx:dockerx --privileged --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:4.5-complete bash
to enter the container with the GPU enabled. You can create the folder dockerx in your home directory; it will be shared between the host and the container.

The next part is done inside the container.
If you haven't already downloaded the files, you need to install wget to download them.

wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm452/rocblas_2.41.0-337552f0.dirty_amd64.deb
dpkg -i rocblas_2.41.0-337552f0.dirty_amd64.deb

With that you have ROCm, and the only other thing you need is the AMD driver on the host.
Now, this container has Python 3.8. The person who made these fixes compiled TensorFlow and PyTorch for Python 3.8; if you need another version you must compile them yourself (and hopefully share the result afterwards).
So, for TensorFlow:

wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm43/tensorflow-2.6.0-cp38-cp38-linux_x86_64.whl
pip3 install tensorflow-2.6.0-cp38-cp38-linux_x86_64.whl

and for pytorch:

wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm43/torch-1.9.0a0+gitd69c22d-cp38-cp38-linux_x86_64.whl
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm43/torchvision-0.10.0a0+300a8a4-cp38-cp38-linux_x86_64.whl
pip3 install torch-1.9.0a0+gitd69c22d-cp38-cp38-linux_x86_64.whl
pip3 install torchvision-0.10.0a0+300a8a4-cp38-cp38-linux_x86_64.whl

I have not used PyTorch, so I don't know whether it works. Either way, this lets you troubleshoot whether your problem comes from your installation of ROCm, since the container comes preinstalled with everything you need, at least for TensorFlow 2.6.
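
As a quick sanity check (a sketch assembled from the same flags as above, not an official one-liner), you can also run rocminfo through the container without entering it:

# run rocminfo inside the ROCm container with the GPU devices passed through
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:4.5-complete rocminfo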

@echosalik
Author

@Lunatik00 Hey, I tried it out. Same error:

root@0f607de254bf:/# apt show rocblas
Package: rocblas
Version: 2.41.0-337552f0~dirty
Status: install ok installed
Priority: optional
Section: devel
Maintainer: rocBLAS Maintainer <[email protected]>
Installed-Size: 260 MB
Depends: hip-rocclr (>= 4.0.0), rocm-core
Recommends: rocblas-dev (>= 2.41.0)
Download-Size: unknown
APT-Manual-Installed: no
APT-Sources: /var/lib/dpkg/status
Description: rocBLAS is AMD's library for BLAS on ROCm. It is implemented in HIP and optimized for AMD GPUs
root@0f607de254bf:/# rocminfo 
ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1143
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

@Lunatik00

Then it is likely something with your driver on the host machine. Do you have the mesa and amdgpu packages?
I have just those two installed and the Docker setup works well for me. I have not installed the ROCm packages on the host, so maybe you need to remove them from the host. At least you now know that it is not your ROCm install; it is an issue with the drivers. Do you have an Intel APU?
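
A sketch of the host-side checks this suggests (package name patterns are assumptions for Ubuntu 20.04; adjust as needed):

# is the amdgpu kernel driver loaded on the host?
lsmod | grep amdgpu
# which mesa/amdgpu-related packages are installed on the host?
dpkg -l | grep -Ei 'amdgpu|mesa'
# any amdgpu or kfd messages during boot that point at a driver problem?
dmesg | grep -Ei 'amdgpu|kfd'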

@tejasraman

I have the same issue with ROCm 5.2 #8
