
[core][compiled-graphs] Minimize the overhead of shared memory in NCCL benchmark #48860

Open
wants to merge 1 commit into master
Conversation

@kevin85421 kevin85421 commented Nov 22, 2024

Why are these changes needed?

This PR reduces the shared memory overhead in the NCCL benchmark: exec_ray_dag_gpu_nccl_static_shape_direct_return improves from 3079 executions/sec before this PR to 5737 executions/sec after it.
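
To make the measured path concrete, here is a rough sketch of the kind of sender/receiver DAG this benchmark exercises (driver -> sender -> NCCL transfer -> receiver -> driver). The actor names are illustrative, and the type-hint API (with_type_hint(TorchTensorType(transport="nccl"))) is an assumption based on the Ray Compiled Graphs API around the time of this PR; this is not the benchmark's actual code and needs two GPUs to run.

    import torch
    import ray
    from ray.dag import InputNode
    from ray.experimental.channel.torch_tensor_type import TorchTensorType

    @ray.remote(num_gpus=1)
    class Sender:
        def send(self, shape, dtype, _):
            return torch.ones(shape, dtype=dtype, device="cuda") * 1

    @ray.remote(num_gpus=1)
    class Receiver:
        def recv(self, t):
            # Return a small summary so only a tiny object goes back through shared memory.
            return (t[0].item(), tuple(t.shape))

    sender, receiver = Sender.remote(), Receiver.remote()

    with InputNode() as inp:
        out = sender.send.bind((10,), torch.float16, inp)
        out = out.with_type_hint(TorchTensorType(transport="nccl"))  # sender -> receiver over NCCL
        dag = receiver.recv.bind(out)

    compiled_dag = dag.experimental_compile()
    # The driver's input and the final output go through shared memory channels.
    print(ray.get(compiled_dag.execute(b"x")))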

The reasons are:

  • Reason 1: exec_ray_dag_gpu_nccl_static_shape_direct_return includes the following data transfers: (1) driver to sender, (2) sender to receiver, and (3) receiver to driver.

  • Reason 2: We also found that the shared memory overhead in a DAG with NCCL is higher than that in a DAG without NCCL. To elaborate, [core][experimental] Higher than expected overhead for shared memory channels with NCCL #45319 (comment) uses a very small tensor with shape = (1, 1) to minimize the NCCL data transfer, so the measured time should be close to the shared memory overhead, which is 0.14 ms. However, compiled single-actor DAG calls take only 0.05 ms per execution.

    • Reason 2-1: The former case has one more actor.
      • former (0.14 ms): driver -> sender -> receiver -> driver
      • latter (compiled single-actor DAG calls): driver -> a.echo -> driver
    • Reason 2-2: The former sends an integer from the driver to the sender and from the receiver to the driver, while the latter sends a byte string from the driver to the actor and from the actor to the driver. Transferring a byte string in Ray Compiled Graphs is much faster than transferring an integer.
  • Experiments for "Reason 2-1" and "Reason 2-2" (a reproduction sketch follows this list):

    • experiment-1:
      • driver -> actor -> driver
      • input: integer
      • avg execution time: 0.090 ms
    • experiment-2:
      • driver -> actor 1 -> actor 2 -> driver
      • input: integer
      • avg execution time: 0.130 ms
    • experiment-3:
      • driver -> actor -> driver
      • input: b"x"
      • avg execution time: 0.051 ms
    • experiment-4:
      • driver -> actor 1 -> actor 2 -> driver
      • input: b"x"
      • avg execution time: 0.072 ms
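
For reference, a minimal sketch of how experiments 1-4 could be reproduced with Ray Compiled Graphs. The Echo actor and bench helper are illustrative names rather than the benchmark's actual code, and the absolute timings will vary by machine.

    import time
    import ray
    from ray.dag import InputNode

    @ray.remote
    class Echo:
        def echo(self, x):
            return x

    a = Echo.remote()

    # Single-actor DAG: driver -> actor -> driver (experiments 1 and 3).
    with InputNode() as inp:
        dag = a.echo.bind(inp)
    compiled_dag = dag.experimental_compile()

    def bench(value, n=1000):
        start = time.perf_counter()
        for _ in range(n):
            ray.get(compiled_dag.execute(value))
        return (time.perf_counter() - start) / n * 1000  # ms per execution

    print("integer input:", bench(1), "ms")        # experiment-1-style
    print("byte string input:", bench(b"x"), "ms")  # experiment-3-style
    # Experiments 2 and 4 chain a second actor: b.echo.bind(a.echo.bind(inp)).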

Related issue number

Part of #45319

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai-Hsun Chen <[email protected]>
-    def send(self, shape, dtype, value: int):
-        t = torch.ones(shape, dtype=dtype, device=self.device) * value
+    def send(self, shape, dtype, _):
+        t = torch.ones(shape, dtype=dtype, device=self.device) * 1
Member Author


In the pure NCCL benchmark exec_nccl_gpu, the function do_send_recv also multiplies a tensor.
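
For context, here is a hedged sketch (not the repo's actual code) of what a do_send_recv-style helper in a pure NCCL baseline typically does, assuming torch.distributed has already been initialized with the "nccl" backend across two ranks. The baseline also pays for the tensor multiply, which is why the `* 1` above is kept.

    import torch
    import torch.distributed as dist

    def do_send_recv(shape, dtype, value, src=0, dst=1):
        rank = dist.get_rank()
        # The baseline also constructs and multiplies the tensor before the transfer,
        # so keeping `* 1` in the compiled-graph sender keeps the benchmarks comparable.
        t = torch.ones(shape, dtype=dtype, device=f"cuda:{rank}") * value
        if rank == src:
            dist.send(t, dst=dst)
        elif rank == dst:
            dist.recv(t, src=src)
        return t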

@kevin85421 kevin85421 marked this pull request as ready for review November 22, 2024 02:41
@kevin85421 kevin85421 added the go label (add ONLY when ready to merge, run all tests) Nov 22, 2024