amd build improvements #156

dtrifiro · 2024-09-11T16:27:53Z

enable ccache for AMD builds ([CI/Build] enable ccache/scccache for HIP builds vllm-project/vllm#8327)

Dockerfile.rocm.ubi:

- get rid of non-essential dependencies
- reduce image size (currently down to ~32GB from ~50GB)
- bump flash-attention to 2.6.2
- use flash-attention with triton backend by default:
  - clone main_perf branch (waiting on [AMD] Triton Backend for ROCm Dao-AILab/flash-attention#1203 to switch to upstream)
  - build rocm target
  - set up triton rocm env var
- use built triton wheel from pytorch
- configure numba, outlines and triton cache directory
- add vllm-tgis-adapter layer
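
For readers unfamiliar with the cache directory item, here is a minimal sketch of what such a configuration could look like. It is not taken from this PR's diff: the base image, the `CACHE_ROOT` variable and its path are assumptions for illustration. `NUMBA_CACHE_DIR`, `OUTLINES_CACHE_DIR` and `TRITON_CACHE_DIR` are the environment variables those libraries read to locate their caches.

```dockerfile
# Assumed base image, for illustration only.
FROM registry.access.redhat.com/ubi9/ubi:latest

# Hypothetical writable location; the PR description does not state the actual paths.
ENV CACHE_ROOT=/tmp/cache

# Point each library's on-disk cache at a directory under CACHE_ROOT.
ENV NUMBA_CACHE_DIR=${CACHE_ROOT}/numba \
    OUTLINES_CACHE_DIR=${CACHE_ROOT}/outlines \
    TRITON_CACHE_DIR=${CACHE_ROOT}/triton

# Create the directories up front and make them group-writable,
# so a non-root runtime user in the same group can still write to them.
RUN mkdir -p "${NUMBA_CACHE_DIR}" "${OUTLINES_CACHE_DIR}" "${TRITON_CACHE_DIR}" && \
    chmod -R g+rwX "${CACHE_ROOT}"
```

A likely reason for setting these explicitly is that the container may run as a user who cannot write to the libraries' default cache locations.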

https://issues.redhat.com/browse/RHOAIENG-12611

openshift-ci · 2024-09-11T16:27:58Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

dtrifiro · 2024-09-11T17:22:52Z

/test all

NickLucche · 2024-09-12T12:33:57Z

Dockerfile.rocm.ubi

+RUN rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
+    rpm -ql epel-release && \


Do we need EPEL and the `-ql` listing?

We need EPEL for ccache; the `-ql` listing is not required.
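
A rough sketch of how the step might look once the listing is dropped. The base image, the use of `dnf`, and the final stats command are assumptions, not the PR's actual content; the `CCACHE_DIR` value is the one that appears elsewhere in this review.

```dockerfile
# Assumed base image, for illustration only.
FROM registry.access.redhat.com/ubi9/ubi:latest

# EPEL provides the ccache package; the `rpm -ql epel-release` listing is omitted.
RUN rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
    dnf install -y ccache && \
    dnf clean all

# Directory where ccache stores compiled objects.
ENV CCACHE_DIR=/root/.cache/ccache

# A BuildKit cache mount keeps the ccache directory across image builds.
# Printing the statistics here stands in for the real compile step.
RUN --mount=type=cache,target=/root/.cache/ccache \
    ccache --show-stats
```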

NickLucche · 2024-09-12T12:37:46Z

Dockerfile.rocm.ubi

 ENV CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/libtorch/include:/libtorch/include/torch/csrc/api/include:/opt/rocm/include
-ENV PYTORCH_ROCM_ARCH="gfx908;gfx90a;gfx942;gfx1100"
-ENV CCACHE_DIR=/root/.cache/ccache
+ENV PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}


Isn't this variable only used by vLLM later on?

Yes, we can remove this.
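
If the variable is instead set only in the stage that builds vLLM, one possible shape is the sketch below. The stage name `build_vllm` and the base image are hypothetical; the default architecture list is the one from the removed line above.

```dockerfile
# Global build argument; override with --build-arg PYTORCH_ROCM_ARCH=...
ARG PYTORCH_ROCM_ARCH="gfx908;gfx90a;gfx942;gfx1100"

# Hypothetical stage name and assumed base image.
FROM registry.access.redhat.com/ubi9/ubi:latest AS build_vllm

# An ARG declared before the first FROM is not visible inside a stage
# until it is re-declared there; the default value carries over.
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}
```

Without the `ARG` re-declaration inside the stage, `${PYTORCH_ROCM_ARCH}` would expand to an empty string in the `ENV` line.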

NickLucche · 2024-09-12T12:44:36Z

Dockerfile.rocm.ubi

+        torch==2.5.0.dev20240726+rocm6.1 \
+        torchvision==0.20.0.dev20240726+rocm6.1 && \


We already installed torch at line 77; does this just copy files from the mounted cache?

Moved the torch install to the rocm_base layer instead.

openshift-ci · 2024-09-13T14:33:36Z

@dtrifiro: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/smoke-test | `8f1fcff` | link | true | `/test smoke-test` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

dtrifiro · 2024-09-13T15:26:30Z

Smoke test failure is unrelated

- get rid of non-essential dependencies
- consolidate package installs
- do not copy wheels in final stage
- fix ccache usage
- use flashattention with triton backend by default:
  - clone main_perf branch
  - build rocm target
  - set up triton rocm env var
- configure numba, outlines and triton cache directory

NickLucche · 2024-09-16T14:40:41Z

/lgtm

openshift-ci · 2024-09-16T14:42:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dtrifiro, NickLucche

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [NickLucche,dtrifiro]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the do-not-merge/work-in-progress label Sep 11, 2024

openshift-ci bot added the approved label Sep 11, 2024

dtrifiro force-pushed the amd-build-improvements branch from cf62620 to 64a8df2 Compare September 12, 2024 11:15

dtrifiro requested a review from NickLucche September 12, 2024 11:25

dtrifiro marked this pull request as ready for review September 12, 2024 11:25

openshift-ci bot removed the do-not-merge/work-in-progress label Sep 12, 2024

openshift-ci bot requested review from fialhocoelho and Xaenalt September 12, 2024 11:25

dtrifiro force-pushed the amd-build-improvements branch from 64a8df2 to d7a74df Compare September 12, 2024 11:25

NickLucche approved these changes Sep 12, 2024

fialhocoelho removed their request for review September 12, 2024 12:59

dtrifiro force-pushed the amd-build-improvements branch 3 times, most recently from e5f6c41 to 0f2d1ef Compare September 12, 2024 16:08

NickLucche requested changes Sep 13, 2024

dtrifiro force-pushed the amd-build-improvements branch from 0f2d1ef to a25f69f Compare September 13, 2024 08:57

dtrifiro force-pushed the main branch from 3bd5180 to d02a789 Compare September 13, 2024 09:27

openshift-merge-robot added the needs-rebase label Sep 13, 2024

dtrifiro force-pushed the amd-build-improvements branch from a25f69f to 8f1fcff Compare September 13, 2024 10:32

openshift-merge-robot removed the needs-rebase label Sep 13, 2024

dtrifiro force-pushed the amd-build-improvements branch from 8f1fcff to 1e8d6df Compare September 16, 2024 13:35

dtrifiro added 2 commits September 16, 2024 15:37

add vllm-tgis-adapter layer

1e8d6df

openshift-ci bot assigned NickLucche Sep 16, 2024

openshift-ci bot added the lgtm label Sep 16, 2024

NickLucche approved these changes Sep 16, 2024

dtrifiro merged commit 66984d4 into opendatahub-io:main Sep 16, 2024
19 of 21 checks passed

dtrifiro deleted the amd-build-improvements branch September 16, 2024 15:47
