Replies: 6 comments 1 reply
-
Manual installation guide. OneTrainer installation on AMD GPU on Debian 12
Prerequisites
-
Looks like I've either run into a GPU hardware problem (overheating? power?) or I'm hitting this or a similar error on every use of the compute (inference or training): https://gitlab.freedesktop.org/mesa/mesa/-/issues/7504 The GPU was stable for over a month (mostly inference, only just getting into training), so I suspect it's more of a hardware problem.
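When it gets into that state, a quick check I'd run (my own snippet, not something from the linked issue) to see whether the GPU is still usable from PyTorch at all:

# Not from the linked Mesa issue: a minimal probe to see whether the ROCm build of
# PyTorch still sees the GPU and can run a trivial kernel after a hang/reset.
import torch

print("HIP runtime:", torch.version.hip)        # None on CUDA-only or CPU-only builds
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.ones(1024, 1024, device="cuda")
    print("Simple kernel result:", (x @ x).sum().item())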
-
Ok, got it working again, this time on Fedora 39. The procedure was basically the same, but instead of venv I had to use conda because the system Python is 3.12. Ran into one issue with the Tkinter GUI: it was basically unusable because no system fonts were available due to conda packaging. Fortunately this solution worked: ContinuumIO/anaconda-issues#6833 (comment) Otherwise it looks like everything works. The low GPU utilization is related to the GUI too; when the exported script is run it utilizes the GPU at close to 100%.
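As a quick way to confirm that font symptom from inside the conda environment (my own snippet, not part of the linked fix):

# Not part of the linked fix: just counts the font families Tk can see from inside
# the conda environment. A near-empty list reproduces the "unusable GUI" symptom.
import tkinter as tk
from tkinter import font

root = tk.Tk()
root.withdraw()  # no visible window needed, we only want the font list
print(len(font.families()), "font families visible to Tk")
root.destroy()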
-
One more note: I had to do this to get training running. Not sure if it's related to Linux, ROCm, the specific dataset, or the training options. I got errors that ScaleImage is not defined and replaced it with Upscale:
--- a/modules/dataLoader/StableDiffusionBaseDataLoader.py
+++ b/modules/dataLoader/StableDiffusionBaseDataLoader.py
@@ -215 +215 @@ class StableDiffusionBaseDataLoader(BaseDataLoader):
- downscale_mask = ScaleImage(in_name='mask', out_name='latent_mask', factor=0.125)
+ downscale_mask = Upscale(in_name='mask', out_name='latent_mask', factor=0.125)
@@ -217 +217 @@ class StableDiffusionBaseDataLoader(BaseDataLoader):
- downscale_depth = ScaleImage(in_name='depth', out_name='latent_depth', factor=0.125)
+ downscale_depth = Upscale(in_name='depth', out_name='latent_depth', factor=0.125)
@@ -321 +321 @@ class StableDiffusionBaseDataLoader(BaseDataLoader):
- upscale_mask = ScaleImage(in_name='latent_mask', out_name='decoded_mask', factor=8)
+ upscale_mask = Upscale(in_name='latent_mask', out_name='decoded_mask', factor=8) Looks like the reason is here: So needed to upgrade pip packages. |
-
@aa956 How does VRAM usage compare to kohya's sd-scripts? My experience with sd-scripts was that SDP attention would OOM while the plain PyTorch implementation of flash attention (
-
If I recall correctly, there were no significant differences in VRAM usage between kohya's scripts and OneTrainer. I've jumped ship (replaced the RX 6700 XT 12 GB with an RTX 4060 Ti 16 GB) so unfortunately I can't re-test, but there were no OOMs with SD1.5 LoRA training at 512px resolution and batch sizes 1, 2, and 4, if I recall correctly. But do sd-scripts support SDP at all? I only see --xformers or --mem_eff_attn in this doc: So I've used --mem_eff_attn with kohya's scripts.
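For reference, the backend distinction being discussed can be poked at directly through PyTorch's own scaled_dot_product_attention; this is just an illustrative sketch of the mechanism, not code from OneTrainer or sd-scripts, and whether the memory-efficient kernel is actually available on a given ROCm build varies:

# Illustrative only (not from either trainer): restrict PyTorch's SDPA to its
# memory-efficient kernel, the kind of trade-off the --mem_eff_attn vs. SDP
# discussion above is about. Raises an error if that kernel isn't available.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                    enable_math=False,
                                    enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)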
-
Tried to install OneTrainer on a Debian 12 desktop with an AMD RX 6700 XT GPU.
Got it working, and now that the first SD1.5 LoRA training is running I have a few questions: