Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

amd build improvements #156

Merged
merged 2 commits into from
Sep 16, 2024
Merged

Conversation

dtrifiro
Copy link

@dtrifiro dtrifiro commented Sep 11, 2024

Dockerfile.rocm.ubi:

  • get rid of non-essential dependencies
  • reduce image size (currently down to ~32GB from ~50GB)
  • bump flash-attention to 2.6.2
  • use flash-attention with triton backend by default:
  • use built triton wheel from pytorch
  • configure numba, outlines and triton cache directory
  • add vllm-tgis-adapter layer

https://issues.redhat.com/browse/RHOAIENG-12611

Copy link

openshift-ci bot commented Sep 11, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@dtrifiro
Copy link
Author

/test all

@dtrifiro dtrifiro force-pushed the amd-build-improvements branch from cf62620 to 64a8df2 Compare September 12, 2024 11:15
@dtrifiro dtrifiro requested a review from NickLucche September 12, 2024 11:25
@dtrifiro dtrifiro marked this pull request as ready for review September 12, 2024 11:25
@dtrifiro dtrifiro force-pushed the amd-build-improvements branch from 64a8df2 to d7a74df Compare September 12, 2024 11:25
Comment on lines +55 to +61
RUN rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
rpm -ql epel-release && \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need epel and -ql listing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need epel for ccache, the -ql listing is not required

ENV CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/libtorch/include:/libtorch/include/torch/csrc/api/include:/opt/rocm/include
ENV PYTORCH_ROCM_ARCH="gfx908;gfx90a;gfx942;gfx1100"
ENV CCACHE_DIR=/root/.cache/ccache
ENV PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this var used by vllm only later on..?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can remove this.

Comment on lines 178 to 179
torch==2.5.0.dev20240726+rocm6.1 \
torchvision==0.20.0.dev20240726+rocm6.1 && \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already installed torch at line 77, do we just copy files from mounted cache?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the torch install to the rocm_base layer instead

Dockerfile.rocm.ubi Show resolved Hide resolved
@fialhocoelho fialhocoelho removed their request for review September 12, 2024 12:59
@dtrifiro dtrifiro force-pushed the amd-build-improvements branch 3 times, most recently from e5f6c41 to 0f2d1ef Compare September 12, 2024 16:08
Dockerfile.rocm.ubi Outdated Show resolved Hide resolved
Copy link

openshift-ci bot commented Sep 13, 2024

@dtrifiro: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/smoke-test 8f1fcff link true /test smoke-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dtrifiro
Copy link
Author

Smoke test failure is unrelated

@dtrifiro dtrifiro force-pushed the amd-build-improvements branch from 8f1fcff to 1e8d6df Compare September 16, 2024 13:35
- get rid of non-essential dependencies
- consolidate package installs
- do not copy wheels in final stage
- fix ccache usage
- use flashattention with triton backend by default:
    - clone main_perf branch
    - build rocm target
    - set up triton rocm env var
- configure numba, outlines and triton cache directory
@NickLucche
Copy link

/lgtm

Copy link

openshift-ci bot commented Sep 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dtrifiro, NickLucche

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [NickLucche,dtrifiro]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dtrifiro dtrifiro merged commit 66984d4 into opendatahub-io:main Sep 16, 2024
19 of 21 checks passed
@dtrifiro dtrifiro deleted the amd-build-improvements branch September 16, 2024 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants