rpc : enable async operations #7915

Open · rgerganov wants to merge 1 commit into base: master

Conversation

rgerganov (Collaborator)

Start a dedicated backend thread in the rpc-server and use a message-passing interface for submitting work to it. This will enable backend async operations and cross-server communication.

  • Self-Reported Review Complexity:
    • Review Complexity : Low
    • Review Complexity : Medium
    • Review Complexity : High
  • I have read the contributing guidelines

@mofosyne added the Review Complexity : Low label on Jun 13, 2024

slaren (Collaborator) commented Jun 16, 2024

I may be wrong, but I suspect that the async queue will need to be implemented on the client side instead.

rgerganov (Collaborator, Author)

If we want to copy tensors across RPC servers then we need to handle at least two connections on the server side -- one from the scheduler and one from another RPC server. I considered the following options for implementing this:

  1. Using a single thread and async IO. I think this would be hard to implement in a cross-platform way without third-party libraries.
  2. Using multiple threads and blocking IO. My assumption is that backend implementations are not guaranteed to be thread-safe, so we would need to add synchronization when accessing the backend from multiple threads.
  3. Using a single thread for all backend ops and submitting work to it via a thread-safe message queue. No synchronization of the backend is needed, since it is confined to a single thread.

I think option 3 brings less complexity than option 2, so I opted for it, but I am open to discussion.
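
For illustration only (this is not the PR's actual code): a minimal C++ sketch of option 3, where a single worker thread owns the backend and other threads hand it work through a thread-safe queue. The BackendWorker name and the std::function-based task type are assumptions made for this sketch.

```cpp
// Illustrative sketch (not the PR's implementation): a dedicated thread owns
// the backend and executes tasks submitted through a thread-safe queue, so the
// backend itself never needs locking. BackendWorker is a hypothetical name.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

struct BackendWorker {
    BackendWorker() : worker([this] { run(); }) {}

    ~BackendWorker() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_one();
        worker.join();
    }

    // Called from any connection-handling thread; the task itself runs on the
    // single backend thread (e.g. a graph compute or a tensor copy).
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push(std::move(task));
        }
        cv.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [this] { return stop || !tasks.empty(); });
                if (tasks.empty()) {
                    return; // stop requested and no work left
                }
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();
        }
    }

    std::mutex mtx;
    std::condition_variable cv;
    std::queue<std::function<void()>> tasks;
    bool stop = false;
    std::thread worker; // declared last so the members above exist before run() starts
};
```

The mutex here only guards the queue; every connection handler just calls submit() and the backend sees all work serialized on one thread.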

> I may be wrong, but I suspect that the async queue will need to be implemented on the client side instead.

Could you please elaborate?

slaren (Collaborator) commented Jun 17, 2024

I wouldn't say that the message queue doesn't require synchronization; it still locks a mutex for every message. Whether that's more efficient than the other methods, I don't know, but it is probably not going to be the bottleneck regardless. Another option could be using select/poll, which is still a single thread with blocking I/O.
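
Not from this PR, just to make the select/poll alternative concrete: a minimal POSIX-only sketch that multiplexes two already-accepted connections (a scheduler socket and a peer-server socket, both hypothetical here) on a single thread. Windows would need WSAPoll or select instead, and handle_message is a stand-in for the real RPC dispatch.

```cpp
// Illustrative POSIX sketch of the poll() alternative: one thread, blocking
// reads, with poll() deciding which connection has a message ready. The fds
// and handle_message are hypothetical stand-ins for the real rpc-server logic.
#include <poll.h>
#include <unistd.h>
#include <vector>

// Hypothetical handler: in a real server this would read one RPC command and
// execute it against the backend, which is only ever touched from this thread.
static void handle_message(int fd) {
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    (void) n; // parsing and dispatch omitted
}

static void serve(int sched_fd, int peer_fd) {
    std::vector<pollfd> fds = {
        { sched_fd, POLLIN, 0 },
        { peer_fd,  POLLIN, 0 },
    };
    for (;;) {
        if (poll(fds.data(), fds.size(), -1) < 0) {
            break; // error handling (EINTR, etc.) omitted
        }
        for (auto & pfd : fds) {
            if (pfd.revents & POLLIN) {
                handle_message(pfd.fd);
            }
        }
    }
}
```

Like option 3 above, this keeps the backend confined to a single thread, but it avoids the second thread and the queue at the cost of platform-specific I/O multiplexing.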

To implement the async interface of ggml-backend, my intuition is that it would be simpler to implement the queue on the client side, but I am not completely sure of that. I think it should be possible to create a generic adapter that sits on top of another backend and implements the asynchronous operations by running an asynchronous queue in a different thread. For APIs that support multi-device synchronization natively such as CUDA, it is still going to be more efficient to use the native implementation, but for other backends it should be possible to provide a generic implementation.
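
To make that idea concrete, here is a rough sketch of such a generic client-side adapter. This is not ggml-backend's actual interface; SyncBackend, AsyncAdapter, set_tensor_async and synchronize are hypothetical names standing in for whatever the real backend API exposes. Async calls enqueue work on a worker thread and return immediately; synchronize blocks until everything queued so far has finished.

```cpp
// Rough sketch of a generic async adapter over a synchronous backend; the
// types and method names are hypothetical, not ggml-backend's real API.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

struct SyncBackend { // stand-in for any synchronous backend implementation
    virtual void set_tensor(void * dst, const void * src, size_t size) = 0;
    virtual ~SyncBackend() = default;
};

class AsyncAdapter {
public:
    explicit AsyncAdapter(SyncBackend & b) : backend(b), worker([this] { run(); }) {}

    ~AsyncAdapter() {
        enqueue(nullptr); // an empty task acts as the shutdown marker
        worker.join();
    }

    // Returns immediately; the copy happens later on the worker thread.
    void set_tensor_async(void * dst, const void * src, size_t size) {
        enqueue([this, dst, src, size] { backend.set_tensor(dst, src, size); });
    }

    // Blocks until every operation queued before this call has completed.
    void synchronize() {
        std::unique_lock<std::mutex> lock(mtx);
        cv_done.wait(lock, [this] { return queue.empty() && !busy; });
    }

private:
    void enqueue(std::function<void()> op) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            queue.push(std::move(op));
        }
        cv_work.notify_one();
    }

    void run() {
        for (;;) {
            std::function<void()> op;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv_work.wait(lock, [this] { return !queue.empty(); });
                op = std::move(queue.front());
                queue.pop();
                if (!op) {
                    cv_done.notify_all();
                    return;
                }
                busy = true;
            }
            op();
            {
                std::lock_guard<std::mutex> lock(mtx);
                busy = false;
            }
            cv_done.notify_all();
        }
    }

    SyncBackend & backend;
    std::mutex mtx;
    std::condition_variable cv_work, cv_done;
    std::queue<std::function<void()>> queue;
    bool busy = false;
    std::thread worker; // declared last so everything above is ready before run() starts
};
```

As noted above, a native implementation (e.g. CUDA streams) would still be more efficient, but a wrapper along these lines could give any synchronous backend working async ops and a synchronize point without touching the backend itself.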

rgerganov (Collaborator, Author)

PR #8032 builds on this work and tries to make copying tensors across servers more efficient. However, I am observing a performance degradation with TinyLlama and two CUDA servers running on localhost.

@slaren maybe we should close this PR and continue the discussion in PR #8032?
