Apply temp. patch to Triton code to resolve conflicting cache dirs in TP case (#34)

We are seeing Mixtral pods with TP>1 failing with errors like:
```
FileNotFoundError: [Errno 2] No such file or directory: '/home/vllm/.triton/cache/c926ad2ef143810ed738a313c473c7b2/fused_moe_kernel.cubin.tmp.pid_72_945989'
```
There appears to be a conflict in the Triton cache directories
when using multi-processing. This has already been
[fixed](triton-lang/triton#3544) upstream in
Triton, but the fix hasn't made it into Triton v2.3.0, which is what vLLM
is currently using.

This change essentially applies, inside our container, the same fix
that has made it into the Triton main branch.

---------

Signed-off-by: Thomas Parnell <[email protected]>
tdoublep authored May 28, 2024
1 parent 066041a commit 4af59d3
Showing 2 changed files with 16 additions and 0 deletions.
8 changes: 8 additions & 0 deletions Dockerfile.ubi
@@ -264,6 +264,14 @@ RUN --mount=type=cache,target=/root/.cache/pip \
RUN microdnf install -y gcc \
&& microdnf clean all

# patch triton (fix for #720)
COPY triton_patch/cache_fix.patch .
RUN microdnf install -y patch \
&& patch /opt/vllm/lib/python3.11/site-packages/triton/runtime/cache.py cache_fix.patch \
&& microdnf remove -y patch \
&& microdnf clean all \
&& rm cache_fix.patch

ENV HF_HUB_OFFLINE=1 \
PORT=8000 \
GRPC_PORT=8033 \
8 changes: 8 additions & 0 deletions triton_patch/cache_fix.patch
@@ -0,0 +1,8 @@
4c4
< import random
---
> import uuid
117c117
< rnd_id = random.randint(0, 1000000)
---
> rnd_id = str(uuid.uuid4())
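
To illustrate why swapping `random.randint` for `uuid.uuid4` can resolve the conflict, here is a minimal sketch. It rests on assumptions that the commit does not confirm: the temp-file suffix format is inferred from the error message above, the helper names are hypothetical, and the failure mode shown (workers sharing a PID value and an identically seeded RNG) is only one plausible way two processes could pick the same temp name.

```python
import random
import uuid


def temp_suffix_random(rng: random.Random, pid: int) -> str:
    """Hypothetical helper: suffix built the pre-patch way (seedable RNG)."""
    rnd_id = rng.randint(0, 1000000)
    return f"tmp.pid_{pid}_{rnd_id}"


def temp_suffix_uuid(pid: int) -> str:
    """Hypothetical helper: suffix built the post-patch way (uuid4)."""
    rnd_id = str(uuid.uuid4())
    return f"tmp.pid_{pid}_{rnd_id}"


# Assumption: two workers end up with the same RNG seed and the same PID
# value (for example, if they run in separate PID namespaces).
worker_a = random.Random(0)
worker_b = random.Random(0)

old_a = temp_suffix_random(worker_a, pid=72)
old_b = temp_suffix_random(worker_b, pid=72)

# uuid4 draws from the OS entropy source, so RNG seeding does not affect it.
new_a = temp_suffix_uuid(pid=72)
new_b = temp_suffix_uuid(pid=72)

print("pre-patch suffixes collide:", old_a == old_b)
print("post-patch suffixes collide:", new_a == new_b)
```

Under these assumptions the pre-patch suffixes are identical, so both workers would write to and rename the same temp file, whereas the UUID-based suffixes differ regardless of how the `random` module is seeded.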
