[FRONTEND] let CacheManager write to temp dir instead of temp file #4295

yundai424 · 2024-07-09T21:27:17Z

Summary

There have been multiple issues reporting a FileNotFoundError during compilation, raised when CompiledKernel tries to read the ASM files it has listed: #2688, #4002, vllm-project/vllm#6103, and others. There have also been some attempts to address it, such as #3544. This PR attempts to explain the root cause and suggests a fix.

Why

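When a kernel is compiled, Triton first writes the IRs to the Triton cache dir (ref: https://github.com/triton-lang/triton/blob/78091647fccb6825ed9956ff7c0300859856d261/python/triton/compiler/compiler.py#L289). Inside the write operation, the process first writes to a temp file unique to the current process (plus a uuid to distinguish between multiple processes with the same PID on different hosts sharing the same underlying FS) (ref: https://github.com/triton-lang/triton/blob/c14b033cd979d5c39e5fdb3847c022fa5d71a0c1/python/triton/runtime/cache.py#L124-L130), and then atomically moves it to the final file name with os.replace. Afterwards, CompiledKernel lists all the IRs and reads them (ref: https://github.com/triton-lang/triton/blob/78091647fccb6825ed9956ff7c0300859856d261/python/triton/compiler/compiler.py#L362-L367).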
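For readers unfamiliar with the pattern, here is a minimal sketch of the write-to-temp-file-then-rename step described above. It is not the code from runtime/cache.py: the function name put_via_temp_file is hypothetical, and the temp file name is only modeled on the tmp.pid_* pattern mentioned later in this description.

```python
import os
import tempfile
import uuid


def put_via_temp_file(cache_dir, filename, data):
    """Write data to a per-process temp file in cache_dir, then rename it into place.

    Hypothetical helper for illustration; not Triton's actual API.
    """
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, filename)
    # The pid plus a random suffix keeps concurrent writers from colliding,
    # including same-pid processes on different hosts sharing one filesystem.
    temp_path = f"{final_path}.tmp.pid_{os.getpid()}_{uuid.uuid4().hex}"
    with open(temp_path, "wb") as f:
        f.write(data)
    # os.replace is atomic when source and destination are on the same filesystem.
    os.replace(temp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as cache_dir:
    path = put_via_temp_file(cache_dir, "kernel.ttir", b"module {}")
    with open(path, "rb") as f:
        content = f.read()
    remaining = os.listdir(cache_dir)
```

Note that while the write is in progress, the temp file sits next to the final files in the same directory, which is what makes it visible to anything listing that directory.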

In a multiprocess setup, however, this can result in a race condition. Let's focus on a case where there's one host with 2 processes on it.

At the time pid 1 lists the ASMs, the dir may contain temp files generated by another process, pid 2. But by the time pid 1 proceeds to read bytes from the listed files, pid 2 may already have called os.replace on its temp files, so pid 1 hits FileNotFoundError when it tries to read a temp file generated by pid 2. IBM/vllm#35 (comment) also identifies this as the root cause.
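The interleaving can be replayed deterministically in a single process. The sketch below is only an illustration of the ordering described above, not a reproduction of the Triton code path: the two "processes" are just comments marking whose step it is, and the file names are made up.

```python
import os
import tempfile


def simulate_list_then_read_race(cache_dir):
    """Replay the list-then-read interleaving step by step; return the outcome."""
    final_path = os.path.join(cache_dir, "kernel.ptx")
    temp_path = final_path + ".tmp.pid_2_deadbeef"  # hypothetical temp name

    # pid 2: has written its temp file but has not renamed it yet.
    with open(temp_path, "wb") as f:
        f.write(b"ptx")

    # pid 1: lists the cache dir and picks up pid 2's temp file.
    listed = [os.path.join(cache_dir, name) for name in os.listdir(cache_dir)]

    # pid 2: renames the temp file to its final name.
    os.replace(temp_path, final_path)

    # pid 1: reads everything it listed earlier; the temp path is gone by now.
    try:
        for path in listed:
            with open(path, "rb") as f:
                f.read()
    except FileNotFoundError:
        return "FileNotFoundError"
    return "ok"


with tempfile.TemporaryDirectory() as d:
    outcome = simulate_list_then_read_race(d)
print(outcome)
```

In a real run the window between listing and reading is small, which would explain why the error shows up only intermittently.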

How

There are multiple potential solutions, also mentioned in IBM/vllm#35 (comment):

- let each process write to a private temp dir instead, so glob never picks up the temp files
- or, exclude tmp.pid_* from glob

This PR goes with the 1st approach, to avoid adding an assumption about the tmp file pattern (which is only used in runtime/cache.py) to compiler/compiler.py, but is open to any suggestion. Thanks!
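A rough sketch of the 1st approach, under stated assumptions: the helper names (stage_in_private_dir, commit, list_kernel_files) are hypothetical and not taken from the diff, and the listing is assumed to be a non-recursive glob on a kernel-name prefix. The point is only that a file staged inside a private subdirectory is not a direct child of the cache dir, so such a glob cannot see it before the rename.

```python
import glob
import os
import tempfile
import uuid


def stage_in_private_dir(cache_dir, filename, data):
    """Write data into a fresh per-process subdirectory of cache_dir."""
    temp_dir = os.path.join(cache_dir, f"tmp.pid_{os.getpid()}_{uuid.uuid4().hex}")
    os.makedirs(temp_dir)
    temp_path = os.path.join(temp_dir, filename)
    with open(temp_path, "wb") as f:
        f.write(data)
    return temp_path


def commit(temp_path, cache_dir):
    """Atomically move the staged file into cache_dir, then drop the temp dir."""
    final_path = os.path.join(cache_dir, os.path.basename(temp_path))
    os.replace(temp_path, final_path)
    os.rmdir(os.path.dirname(temp_path))
    return final_path


def list_kernel_files(cache_dir, kernel_name):
    """Non-recursive listing of '<kernel_name>.*' directly under cache_dir."""
    pattern = os.path.join(glob.escape(cache_dir), f"{kernel_name}.*")
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as cache_dir:
    staged = stage_in_private_dir(cache_dir, "kernel.ptx", b"ptx")
    # Another process listing at this moment sees no half-written entry.
    before = list_kernel_files(cache_dir, "kernel")
    commit(staged, cache_dir)
    after = list_kernel_files(cache_dir, "kernel")
```

With this layout the reader side needs no knowledge of the temp naming scheme, which is the motivation given above for preferring it over filtering tmp.pid_* out of the glob.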

Complete the following tasks before sending your PR, and replace [ ] with [x] to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these rules (https://cbea.ms/git-commit/#why-not-how).
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
  - [ ] I have added tests.
    - /test for lit tests
    - /unittest for C++ tests
    - /python/test for end-to-end tests
  - [x] This PR does not need a test because not applicable.
- Select one of the following.
  - [x] I have not added any lit tests.
  - [ ] The lit tests I have added follow these best practices (https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

ThomasRaoux

Looks reasonable to me

tdoublep · 2024-07-10T11:54:19Z

awesome thanks @yundai424 - this looks like it should fix the issue we have in vLLM.

is there a rough ETA for when this would be in a Triton release? trying to plan whether we need to merge a work-around in the meantime.

let cachemgr write to different temp dir instead

accfdf7

yundai424 requested a review from ptillet as a code owner July 9, 2024 21:27

ThomasRaoux approved these changes Jul 10, 2024

ThomasRaoux merged commit b674269 into triton-lang:main Jul 10, 2024
7 checks passed

tdoublep mentioned this pull request Jul 10, 2024

[Bug]: fused_moe_kernel compile bug vllm-project/vllm#6103

Closed

tdoublep mentioned this pull request Jul 10, 2024

[Bugfix] Require triton >= 3.0.0 to resolve issue with MoE and TP>1 vllm-project/vllm#6304

Closed

hnyls2002 mentioned this pull request Jul 25, 2024

Trouble Shooting sgl-project/sglang#548

Closed

jlebar mentioned this pull request Sep 3, 2024

Build LLVMAarch64CodeGen if CMAKE_OSX_ARCHITECTURES is arm64. #4637

Merged
