
[core][compiled-graphs] Minimize the overhead of shared memory in NCCL benchmark #48860

Open
wants to merge 1 commit into master
Conversation

@kevin85421 kevin85421 commented Nov 22, 2024

Why are these changes needed?

This PR reduces the shared memory overhead in the NCCL benchmark: exec_ray_dag_gpu_nccl_static_shape_direct_return improves from 3079 executions/sec before this PR to 5737 executions/sec after it.
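
To make the measured path concrete, here is a rough sketch of the kind of sender/receiver DAG this benchmark exercises (driver -> sender -> NCCL transfer -> receiver -> driver). The actor names are illustrative, and the type-hint API (with_type_hint(TorchTensorType(transport="nccl"))) is an assumption based on the Ray Compiled Graphs API around the time of this PR; this is not the benchmark's actual code and needs two GPUs to run.

    import torch
    import ray
    from ray.dag import InputNode
    from ray.experimental.channel.torch_tensor_type import TorchTensorType

    @ray.remote(num_gpus=1)
    class Sender:
        def send(self, shape, dtype, _):
            return torch.ones(shape, dtype=dtype, device="cuda") * 1

    @ray.remote(num_gpus=1)
    class Receiver:
        def recv(self, t):
            # Return a small summary so only a tiny object goes back through shared memory.
            return (t[0].item(), tuple(t.shape))

    sender, receiver = Sender.remote(), Receiver.remote()

    with InputNode() as inp:
        out = sender.send.bind((10,), torch.float16, inp)
        out = out.with_type_hint(TorchTensorType(transport="nccl"))  # sender -> receiver over NCCL
        dag = receiver.recv.bind(out)

    compiled_dag = dag.experimental_compile()
    # The driver's input and the final output go through shared memory channels.
    print(ray.get(compiled_dag.execute(b"x")))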

The reasons are:

  • Reason 1: exec_ray_dag_gpu_nccl_static_shape_direct_return includes the following data transfers: (1) driver to sender, (2) sender to receiver, and (3) receiver to driver.

  • Reason 2: We also found that the shared memory overhead in a DAG with NCCL is higher than that in a DAG without NCCL. To elaborate, [core][experimental] Higher than expected overhead for shared memory channels with NCCL #45319 (comment) uses a very small tensor with shape = (1, 1) to minimize the NCCL data transfer, so the measured time should be close to the shared memory overhead, which is 0.14 ms. However, compiled single-actor DAG calls take only 0.05 ms per execution.

    • Reason 2-1: The former case has one more actor.
      • former (0.14 ms): driver -> sender -> receiver -> driver
      • latter (compiled single-actor DAG calls): driver -> a.echo -> driver
    • Reason 2-2: The former sends an integer from the driver to the sender and from the receiver to the driver, while the latter sends a byte string from the driver to the actor and from the actor to the driver. Transferring a byte string in Ray Compiled Graphs is much faster than transferring an integer.
  • Experiments for "Reason 2-1" and "Reason 2-2" (a reproduction sketch follows this list):

    • experiment-1:
      • driver -> actor -> driver
      • input: integer
      • avg execution time: 0.090 ms
    • experiment-2:
      • driver -> actor 1 -> actor 2 -> driver
      • input: integer
      • avg execution time: 0.130 ms
    • experiment-3:
      • driver -> actor -> driver
      • input: b"x"
      • avg execution time: 0.051 ms
    • experiment-4:
      • driver -> actor 1 -> actor 2 -> driver
      • input: b"x"
      • avg execution time: 0.072 ms
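
For reference, a minimal sketch of how experiments 1-4 could be reproduced with Ray Compiled Graphs. The Echo actor and bench helper are illustrative names rather than the benchmark's actual code, and the absolute timings will vary by machine.

    import time
    import ray
    from ray.dag import InputNode

    @ray.remote
    class Echo:
        def echo(self, x):
            return x

    a = Echo.remote()

    # Single-actor DAG: driver -> actor -> driver (experiments 1 and 3).
    with InputNode() as inp:
        dag = a.echo.bind(inp)
    compiled_dag = dag.experimental_compile()

    def bench(value, n=1000):
        start = time.perf_counter()
        for _ in range(n):
            ray.get(compiled_dag.execute(value))
        return (time.perf_counter() - start) / n * 1000  # ms per execution

    print("integer input:", bench(1), "ms")        # experiment-1-style
    print("byte string input:", bench(b"x"), "ms")  # experiment-3-style
    # Experiments 2 and 4 chain a second actor: b.echo.bind(a.echo.bind(inp)).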

Related issue number

Part of #45319

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai-Hsun Chen <[email protected]>
-    def send(self, shape, dtype, value: int):
-        t = torch.ones(shape, dtype=dtype, device=self.device) * value
+    def send(self, shape, dtype, _):
+        t = torch.ones(shape, dtype=dtype, device=self.device) * 1
Member Author


In the pure NCCL benchmark exec_nccl_gpu, the function do_send_recv also multiplies a tensor.
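
For context, here is a hedged sketch (not the repo's actual code) of what a do_send_recv-style helper in a pure NCCL baseline typically does, assuming torch.distributed has already been initialized with the "nccl" backend across two ranks. The baseline also pays for the tensor multiply, which is why the `* 1` above is kept.

    import torch
    import torch.distributed as dist

    def do_send_recv(shape, dtype, value, src=0, dst=1):
        rank = dist.get_rank()
        # The baseline also constructs and multiplies the tensor before the transfer,
        # so keeping `* 1` in the compiled-graph sender keeps the benchmarks comparable.
        t = torch.ones(shape, dtype=dtype, device=f"cuda:{rank}") * value
        if rank == src:
            dist.send(t, dst=dst)
        elif rank == dst:
            dist.recv(t, src=src)
        return t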

@kevin85421 kevin85421 marked this pull request as ready for review November 22, 2024 02:41
@kevin85421 kevin85421 added the go label (add ONLY when ready to merge, run all tests) Nov 22, 2024