[core][compiled-graphs] Minimize the overhead of shared memory in NCCL benchmark #48860
+10
−9
Why are these changes needed?
exec_ray_dag_gpu_nccl_static_shape_direct_return ran at 3079 executions/sec before this PR and at 5737 executions/sec after it. This PR reduces the shared memory overhead in the NCCL benchmark.
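To put the throughput numbers in per-execution terms, here is a small arithmetic sketch. It assumes executions run back to back, so mean latency is simply the inverse of throughput. That assumption is not stated in the PR, and the helper name `per_execution_ms` is made up for illustration.

```python
def per_execution_ms(executions_per_sec: float) -> float:
    """Convert throughput (executions/sec) to mean time per execution in ms.

    Assumes executions are strictly sequential (an assumption, not from the PR).
    """
    return 1000.0 / executions_per_sec


before = 3079.0  # executions/sec before this PR (from the description)
after = 5737.0   # executions/sec after this PR (from the description)

before_ms = per_execution_ms(before)
after_ms = per_execution_ms(after)
speedup = after / before

print(f"before:  {before_ms:.3f} ms/execution")
print(f"after:   {after_ms:.3f} ms/execution")
print(f"saved:   {before_ms - after_ms:.3f} ms/execution")
print(f"speedup: {speedup:.2f}x")
```

Under that assumption, the change saves roughly 0.15 ms per execution, a speedup of about 1.86x.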
The reasons are:
Reason 1: exec_ray_dag_gpu_nccl_static_shape_direct_return includes the following data transfers: (1) driver to sender, (2) sender to receiver, and (3) receiver to driver.
Reason 2: We also found that the shared memory overhead in a DAG with NCCL is higher than in a DAG without NCCL. To elaborate, [core][experimental] Higher than expected overhead for shared memory channels with NCCL #45319 (comment) uses a very small tensor with shape = (1, 1) to minimize the NCCL data transfer. As a result, the measured time, 0.14 ms, should be close to the shared memory overhead. However, compiled single-actor DAG calls takes only 0.05 ms per execution.
compiled single-actor DAG calls: driver -> a.echo -> driver
ray/python/ray/_private/serialization.py, line 550 in 335bd66
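The gap described in Reason 2 can be quantified directly from the two measurements quoted above. This is only arithmetic on the reported numbers. The variable names are illustrative, and it does not attribute the gap to any specific cause.

```python
# Per-execution times quoted in the description (milliseconds).
nccl_dag_ms = 0.14          # DAG with NCCL, tensor of shape (1, 1)
single_actor_dag_ms = 0.05  # compiled single-actor DAG calls (no NCCL)

gap_ms = nccl_dag_ms - single_actor_dag_ms
ratio = nccl_dag_ms / single_actor_dag_ms

print(f"unexplained overhead: {gap_ms:.2f} ms per execution")
print(f"NCCL DAG is {ratio:.1f}x slower than the single-actor DAG")
```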
Experiments for "Reason 2-1" and "Reason 2-2"
part of #45319
Related issue number
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.