[FRONTEND] let CacheManager write to temp dir instead of temp file #4295

Merged: 1 commit, Jul 10, 2024


@yundai424 yundai424 commented Jul 9, 2024

Summary

There have been multiple issues discussing the FileNotFoundError raised during compilation when CompiledKernel tries to read the listed ASM files (#2688, #4002, vllm-project/vllm#6103, etc.), and there have been some attempts to address it, such as #3544. This PR attempts to explain the root cause and suggest a fix.

Why

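When a kernel is compiled, Triton first writes the IRs to the Triton cache dir (ref: https://github.com/triton-lang/triton/blob/78091647fccb6825ed9956ff7c0300859856d261/python/triton/compiler/compiler.py#L289). Inside the write operation, the process first writes the data to a temp file unique to the current process (plus a UUID to distinguish between processes that share a PID on different hosts using the same underlying FS) (ref: https://github.com/triton-lang/triton/blob/c14b033cd979d5c39e5fdb3847c022fa5d71a0c1/python/triton/runtime/cache.py#L124-L130) and then atomically moves it to the final file name with os.replace. Afterwards, CompiledKernel lists all the IRs and reads them (ref: https://github.com/triton-lang/triton/blob/78091647fccb6825ed9956ff7c0300859856d261/python/triton/compiler/compiler.py#L362-L367).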
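The write-then-rename pattern described above can be sketched roughly as follows. This is a minimal illustration, not Triton's actual code: the function name `put_atomic` and the exact temp-file naming scheme are assumptions made for the example.

```python
import os
import uuid


def put_atomic(cache_dir: str, filename: str, data: bytes) -> str:
    """Write data to cache_dir/filename via a temp file in the same directory."""
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, filename)
    # Hypothetical naming scheme: the PID plus a random UUID, so that two
    # processes with the same PID on different hosts sharing one filesystem
    # do not write to the same temp file.
    temp_path = f"{final_path}.tmp.pid_{os.getpid()}_{uuid.uuid4()}"
    with open(temp_path, "wb") as f:
        f.write(data)
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(temp_path, final_path)
    return final_path
```

Note that between `open` and `os.replace` the temp file is visible to anything that lists `cache_dir`, which is what makes the race described next possible.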

In a multiprocess setup, however, this may result in a race condition. Let's focus on a case where one host runs 2 processes.
[Figure: Triton RC (1)]

When pid 1 lists the ASMs, the dir may contain temp files generated by another process, pid 2. However, by the time pid 1 proceeds to read bytes from the listed files, pid 2 may have already moved its temp files to their final names with os.replace, so pid 1 will encounter FileNotFoundError when trying to read a temp file generated by pid 2. IBM/vllm#35 (comment) also believes this is the root cause.
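The interleaving can be replayed deterministically in a single process. The sketch below is only an illustration of the ordering of events; the file names and the function `demonstrate_race` are made up for the example, and the two "processes" are simulated by running their steps in sequence.

```python
import glob
import os
import tempfile


def demonstrate_race() -> bool:
    """Return True if the reader hits FileNotFoundError on a stale temp path."""
    with tempfile.TemporaryDirectory() as cache_dir:
        final_path = os.path.join(cache_dir, "kernel.ptx")
        # "pid 2" is mid-write: its temp file sits in the shared cache dir.
        temp_path = os.path.join(cache_dir, "kernel.ptx.tmp.pid_2_example")
        with open(temp_path, "wb") as f:
            f.write(b"asm")

        # "pid 1" lists the directory and sees the other process's temp file.
        listed = glob.glob(os.path.join(cache_dir, "*"))

        # "pid 2" finishes and renames its temp file to the final name.
        os.replace(temp_path, final_path)

        # "pid 1" now reads every path from its stale listing.
        try:
            for path in listed:
                with open(path, "rb") as f:
                    f.read()
        except FileNotFoundError:
            return True
        return False
```

In a real multiprocess run the failure is intermittent, because it depends on the rename landing between the listing and the read.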

How

There are multiple potential solutions to this, as also mentioned in IBM/vllm#35 (comment):

  • let each process write to a private temp dir instead, so that glob never picks up temp files
  • or, exclude tmp.pid_* from glob
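For comparison, the second option might look roughly like the sketch below. The helper name `list_final_files` is hypothetical, and the check assumes temp file names contain the `.tmp.pid_` marker; the reader would have to know that naming convention.

```python
import os


def list_final_files(cache_dir: str) -> list[str]:
    """List cache entries, skipping names that look like in-flight temp files."""
    return sorted(
        name
        for name in os.listdir(cache_dir)
        # Assumption: temp files carry a ".tmp.pid_" marker in their name.
        if ".tmp.pid_" not in name
    )
```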

This PR goes with the 1st approach, which avoids adding an assumption about the tmp file pattern (used only in runtime/cache.py) to compiler/compiler.py, but I'm open to any suggestion. Thanks!
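A minimal sketch of the first approach, under stated assumptions: the names `put_via_private_dir` and `list_regular_files` are hypothetical, the private temp dir is placed inside the cache dir so that `os.replace` stays on one filesystem, and the reader is assumed to consider regular files only. It is not the code merged in this PR.

```python
import os
import uuid


def put_via_private_dir(cache_dir: str, filename: str, data: bytes) -> str:
    """Write into a per-process temp dir, then atomically move into cache_dir."""
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, filename)
    # Hypothetical naming: one private directory per process and call.
    temp_dir = os.path.join(cache_dir, f"tmp.pid_{os.getpid()}_{uuid.uuid4()}")
    os.makedirs(temp_dir, exist_ok=True)
    temp_path = os.path.join(temp_dir, filename)
    with open(temp_path, "wb") as f:
        f.write(data)
    # Same filesystem, so the move into the cache dir is atomic.
    os.replace(temp_path, final_path)
    os.rmdir(temp_dir)  # the private dir is empty after the move
    return final_path


def list_regular_files(cache_dir: str) -> list[str]:
    """Reader side: only regular files directly inside cache_dir are considered."""
    return sorted(
        name
        for name in os.listdir(cache_dir)
        if os.path.isfile(os.path.join(cache_dir, name))
    )
```

With this layout, an in-flight temp file is never a direct child of the cache dir, so a listing of files there only ever sees final names.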

Complete the following tasks before sending your PR, and replace [ ] with [x] to indicate you have done them.

  • [x] I am not making a trivial change, such as fixing a typo in a comment.

  • [x] I have written a PR description following these rules (https://cbea.ms/git-commit/#why-not-how).

  • [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • [ ] I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • [x] This PR does not need a test because not applicable.
  • Select one of the following.

    • [x] I have not added any lit tests.
    • [ ] The lit tests I have added follow these best practices (https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

@yundai424 yundai424 requested a review from ptillet as a code owner July 9, 2024 21:27

@ThomasRaoux ThomasRaoux left a comment


Looks reasonable to me

@ThomasRaoux ThomasRaoux merged commit b674269 into triton-lang:main Jul 10, 2024
7 checks passed

tdoublep commented Jul 10, 2024

awesome thanks @yundai424 - this looks like it should fix the issue we have in vLLM.

is there a rough ETA for when this would be in a Triton release? trying to plan whether we need to merge a work-around in the meantime.

bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
…riton-lang#4295)
