Add a Dockerfile for AMD ROCm #3750

Open
wants to merge 4 commits into base: dev
Conversation

dark-penguin

Description

Provide a Dockerfile for AMD ROCm. Finding a good base image is not trivial because, unlike the Torch image for CUDA, the Torch image for ROCm is 71 GB for whatever reason.

Additionally, having a Dockerfile that "works" is a great reference for when you are trying to install something on bare metal.

Notes

Build with: docker build -t sdnext -f Dockerfile.rocm .
Run with (example): docker run -it --rm --device /dev/dri --group-add video -v /sdnext:/mnt -p 7860:7860 sdnext

  • --device /dev/dri - "mounts" the graphics card devices into the container (instead of using the NVIDIA Container Toolkit)
  • --group-add video - the user inside the container needs access to that device
  • -v /sdnext:/mnt - mount a volume or a directory to keep persistent data
  • -p 7860:7860 - publish the port

The Dockerfile is derived from the "official" NVIDIA Dockerfile with as few changes as possible, to keep the difference minimal.

Since the Torch image for ROCm is 71 GB, one difference I had to make was using a smaller image with only the essentials of ROCm installed (3 GB). Torch is then installed at build time (~2 GB download size). The total size of the built image is 23 GB (apparently Torch is packed really well).
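A minimal sketch of that approach (the base image tag and the ROCm/Torch versions below are illustrative assumptions, not necessarily the exact ones used in this PR):

```dockerfile
# Hypothetical sketch: start from a small ROCm runtime image (~3 GB)
# instead of the full 71 GB ROCm Torch image. The image tag is an assumption.
FROM rocm/dev-ubuntu-22.04:6.2

# Install Torch at build time from PyTorch's ROCm wheel index (~2 GB download).
# Pin the index suffix (rocm6.2 here) to match the ROCm version of the base image.
RUN pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2
```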

Environment and Testing

Tested on Debian 12 Bookworm (I had to remove the --skip-all option from the CMD while testing since it's currently broken in master).

@dark-penguin
Author

Oops, I guess I should have opened the PR against the dev branch...

@vladmandic vladmandic changed the base branch from master to dev February 8, 2025 21:19
Dockerfile.rocm Outdated
LABEL org.opencontainers.image.licenses="AGPL-3.0"
LABEL org.opencontainers.image.title="SD.Next"
LABEL org.opencontainers.image.description="SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models"
LABEL org.opencontainers.image.base.name="https://hub.docker.com/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime"
Contributor

This doesn't seem correct here, does it? (given that this uses ROCm)

Author

Right!

@lbeltrame
Contributor

You may want to add a comment at the top of the Dockerfile mentioning the *GFX_OVERRIDE (forgot the complete name) env variable, because it needs to be set for people who are not running an officially supported card.
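For reference, the ROCm override variable in question is presumably HSA_OVERRIDE_GFX_VERSION; a hedged sketch of how such a comment might look in the Dockerfile (the example value is only illustrative):

```dockerfile
# For GPUs without an officially supported GFX target, override the detected version.
# 10.3.0 corresponds to RDNA2 (gfx1030); pick the value matching your card's generation.
# ENV HSA_OVERRIDE_GFX_VERSION=10.3.0
```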

@dark-penguin
Author

Good point, but that's up to @vladmandic I guess. A comment in the Dockerfile or a note in the Wiki?

@vladmandic
Owner

a) yes, rocm overrides should be exposed as it's quite a common thing.
b) if we're to include docker for anything except cuda, the wiki page needs a rewrite as well. adding a dockerfile without that is pointless. https://github.com/vladmandic/sdnext/wiki/Docker

@vladmandic
Owner

ok, i've pretty much rewritten https://github.com/vladmandic/sdnext/wiki/Docker so it's not cuda-specific
this pr should target this file, not create a new one in the root: https://github.com/vladmandic/sdnext/blob/dev/configs/Dockerfile.rocm

@Disty0
Collaborator

Disty0 commented Feb 12, 2025

Added Dockerfile.rocm: https://github.com/vladmandic/sdnext/blob/dev/configs/Dockerfile.rocm

Went with a different approach than CUDA because of flash attention.

We can save 30 GB of disk space by installing flash attention in the rocm-complete image and then sharing the venv with the smaller rocm runtime image.
The venv can also be shared between different instances if you have multiple GPUs.

Also using Ubuntu 24 with Python 3.12, because onnxruntime-rocm needs Python 3.12.
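One way the shared-venv idea could look with Docker Compose (the service names, image tag, and venv path below are assumptions for illustration, not taken from configs/Dockerfile.rocm):

```yaml
# Hypothetical sketch: two instances (one per GPU) sharing a single venv volume,
# so flash attention and Torch are installed once and reused by both.
services:
  sdnext-gpu0:
    image: sdnext-rocm            # assumed image tag
    devices: ["/dev/dri:/dev/dri"]
    group_add: ["video"]
    ports: ["7860:7860"]
    volumes:
      - sdnext-venv:/app/venv     # assumed venv path inside the container
  sdnext-gpu1:
    image: sdnext-rocm
    devices: ["/dev/dri:/dev/dri"]
    group_add: ["video"]
    ports: ["7861:7860"]
    volumes:
      - sdnext-venv:/app/venv
volumes:
  sdnext-venv:
```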

If you want to make changes, please target the new file.
