[CudaIpc 2/3]: Ipc handle exchange #3910

Open · wants to merge 4 commits into base: add_backend_type_to_p2p_comm
Conversation

samnordmann (Collaborator) commented Feb 17, 2025

On top of

Prerequisite to:

What

  • Set up the infrastructure needed for ipc handle exchange and caching
  • Add an Expr node hir::ShareMemHandles to represent this op. We cannot embed the op in the Send/Recv semantics because the handle exchange must be grouped across matching sends and recvs to avoid deadlocks.

How

Most of the implementation is in multidevice/ipc_handle.cpp

  • Define the class IpcHandle, representing the IPC handle that is exchanged. This class is supplemented with a semaphore, which is a local CUDA buffer allocated on the exporter's device.
  • Define IpcHandleCache, which handles exchanging and caching the IPC handles. Caching is done with respect to a combination of runtime and symbolic ingredients: (runtime peer, at::Tensor, Expr*). This caching allows an arbitrary number of p2p comms between pairs of ranks.


github-actions bot commented Feb 17, 2025

Review updated until commit c047576

Description

  • Added ShareMemHandles class for handling shared memory IPC.

  • Implemented IpcHandle and IpcHandleCache for CUDA IPC memory management.

  • Updated HostIrEvaluator to handle ShareMemHandles.

  • Included IpcHandle in CMakeLists.txt for compilation.


Changes walkthrough 📝

Relevant files (Enhancement, 9 files):

| File | Change | Diff |
| --- | --- | --- |
| executor.cpp | Added handler for ShareMemHandles | +5/-0 |
| host_ir.cpp | Introduced ShareMemHandles class | +28/-0 |
| ipc_handle.cpp | Implemented IPC handle management | +150/-0 |
| dispatch.h | Added ShareMemHandles to dispatch macros | +2/-1 |
| executor.h | Added ShareMemHandles handler declaration | +3/-0 |
| host_ir.h | Added ShareMemHandles class declaration | +25/-0 |
| communicator.h | Added TCP store access method | +4/-0 |
| ipc_handle.h | Defined IPC handle classes and cache | +163/-0 |
| CMakeLists.txt | Added ipc_handle.cpp to build | +1/-0 |

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
⚡ Recommended focus areas for review

Error Handling

The code does not handle potential errors from CUDA API calls, such as cudaIpcGetMemHandle and cudaMalloc. It would be beneficial to add error handling to ensure that the program can gracefully handle failures.

NVFUSER_CUDA_RT_SAFE_CALL(
    cudaIpcGetMemHandle(&ipc_handle_, tensor.data_ptr()));
NVFUSER_CUDA_RT_SAFE_CALL(
    cudaMalloc((void**)&semaphore_, sizeof(IpcSemaphore)));
static_assert(
    sizeof(IpcSemaphore) == sizeof(int),
    "IpcSemaphore must be same size as int");
NVFUSER_CUDA_RT_SAFE_CALL(cudaMemset(
    (void*)semaphore_, (int)IpcSemaphore::kReady, sizeof(IpcSemaphore)));
NVFUSER_CUDA_RT_SAFE_CALL(
    cudaIpcGetMemHandle(&semaphore_ipc_handle_, semaphore_));
Memory Management

The code uses cudaMalloc and cudaFree for semaphore memory allocation and deallocation. It is crucial to ensure that all allocated memory is properly freed to avoid memory leaks.

      cudaMalloc((void**)&semaphore_, sizeof(IpcSemaphore)));
  static_assert(
      sizeof(IpcSemaphore) == sizeof(int),
      "IpcSemaphore must be same size as int");
  NVFUSER_CUDA_RT_SAFE_CALL(cudaMemset(
      (void*)semaphore_, (int)IpcSemaphore::kReady, sizeof(IpcSemaphore)));
  NVFUSER_CUDA_RT_SAFE_CALL(
      cudaIpcGetMemHandle(&semaphore_ipc_handle_, semaphore_));
}

IpcHandle::IpcHandle(std::vector<uint8_t> data) {
  const IpcHandle& imported_buffer = fromBytes<IpcHandle>(data);

  storage_offset_ = imported_buffer.storage_offset_;
  element_size_ = imported_buffer.element_size_;
  ipc_handle_ = imported_buffer.ipc_handle_;
  semaphore_ipc_handle_ = imported_buffer.semaphore_ipc_handle_;
  rank_ = imported_buffer.rank_;

  NVFUSER_CUDA_RT_SAFE_CALL(
      cudaIpcOpenMemHandle(&ptr_, ipc_handle_, cudaIpcMemLazyEnablePeerAccess));
  ptr_ = (void*)((uint8_t*)ptr_ + storage_offset_ * element_size_);

  NVFUSER_CUDA_RT_SAFE_CALL(cudaIpcOpenMemHandle(
      (void**)&semaphore_,
      semaphore_ipc_handle_,
      cudaIpcMemLazyEnablePeerAccess));
}

IpcHandle::~IpcHandle() {
  if (rank_ == Communicator::getInstance().deviceId()) {
    NVFUSER_CUDA_RT_SAFE_CALL(cudaFree((void*)semaphore_));
  } else {
Performance Considerations

The code uses a barrier to synchronize all ranks after pushing their memory handles to the store. This can be a performance bottleneck. It would be beneficial to investigate more efficient synchronization mechanisms or selectively synchronize only the necessary ranks.

// barrier to ensure all ranks have pushed their memhandles to the store
// TODO: precisely select what ranks need to wait on that barrier.
communicator->barrier();

@samnordmann samnordmann changed the title Ipc handle infra [CudaIpc 2/3]: Ipc handle exchange Feb 17, 2025
samnordmann (Collaborator, Author):
!test

storage_offset_(tensor.storage_offset()),
element_size_(tensor.element_size()),
rank_(Communicator::getInstance().deviceId()) {
NVFUSER_CUDA_RT_SAFE_CALL(
Review comment (samnordmann):
assert that the tensor is not strided


std::unordered_map<KeyType, std::unique_ptr<P2pIpcHandle>, KeyHash, KeyEqual>
handles_;
std::unordered_set<std::string> keys_;
Review comment (samnordmann):
remove (unnecessary)

}

private:
using KeyType = std::tuple<int64_t, at::Tensor, P2PCommunication*>;
Review comment (samnordmann):
maybe we don't need P2PCommunication* here

Reply (samnordmann):
We actually need it in the following case: rank 0 sends a buffer to rank 1's buffer1 and, concurrently, rank 0 sends the same buffer to rank 1's buffer2.
