
Support disabling automatic sync on task switch #2662

Open · wants to merge 1 commit into master

Conversation

vchuravy (Member) commented:

@maleadt is that what you had in mind for #2617?

One of the tricky things is whether we should flip the stream or not. But we are about to set the dirty bit, so I think we must; that of course means it is possible to "miss" logical sync events within a task:

```julia
@spawn begin
    # operation A
    # task switch -- synchronize on a different task
    # operation B
end
```
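To make the concern concrete, here is a hedged sketch of that scenario (`Threads.@spawn` stands in for the `@spawn` above; the exact internals and sync points are assumptions, not the PR's behavior):

```julia
using CUDA

a = CUDA.rand(Float32, 1024)

t1 = Threads.@spawn begin
    a .+= 1    # operation A, enqueued on this task's stream
    yield()    # task switch: another task may now touch `a`
    a .*= 2    # operation B: if the other task flipped `a`'s owning stream
               # without synchronizing, a logical sync point between A and B
               # can be "missed", as described above
end

t2 = Threads.@spawn sum(a)   # cross-task access: with `task_sync` enabled this
                             # synchronizes and takes ownership of the stream

wait(t1)
wait(t2)
```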

Contributor commented:

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (`git runic master`) to apply these changes.

Suggested changes:
```diff
diff --git a/src/memory.jl b/src/memory.jl
index fe7808c1f..0d015aa88 100644
--- a/src/memory.jl
+++ b/src/memory.jl
@@ -509,13 +509,13 @@ mutable struct Managed{M}
   # whether the memory has been captured in a way that would make the dirty bit unreliable
   captured::Bool
 
-  # whether memory accessed from another task causes implicit syncrhonization
-  task_sync::Bool
+    # whether memory accessed from another task causes implicit syncrhonization
+    task_sync::Bool
 
-  function Managed(mem::AbstractMemory; stream = CUDA.stream(), dirty = true, captured = false, task_sync = true)
+    function Managed(mem::AbstractMemory; stream = CUDA.stream(), dirty = true, captured = false, task_sync = true)
     # NOTE: memory starts as dirty, because stream-ordered allocations are only
     #       guaranteed to be physically allocated at a synchronization event.
-    return new{typeof(mem)}(mem, stream, dirty, captured, task_sync)
+        return new{typeof(mem)}(mem, stream, dirty, captured, task_sync)
   end
 end
 
@@ -569,13 +569,13 @@ function Base.convert(::Type{CuPtr{T}}, managed::Managed{M}) where {T,M}
 
   # accessing memory on another stream: ensure the data is ready and take ownership
   if managed.stream != state.stream
-    # Synchronize the array if task_sync is enabled
-    managed.task_sync && maybe_synchronize(managed)
-    # We must still switch the stream, since we are about to set ".dirty=true"
-    # and otherwise subsequent operations will synchronize against the wrong stream.
-    # XXX: What to do when an array is used on multiple tasks concurrently?
-    #      We could have the situation that we will synchronize against the "wrong" stream
-    #      But that would mean that we need a mapping from task to stream per Managed object...
+        # Synchronize the array if task_sync is enabled
+        managed.task_sync && maybe_synchronize(managed)
+        # We must still switch the stream, since we are about to set ".dirty=true"
+        # and otherwise subsequent operations will synchronize against the wrong stream.
+        # XXX: What to do when an array is used on multiple tasks concurrently?
+        #      We could have the situation that we will synchronize against the "wrong" stream
+        #      But that would mean that we need a mapping from task to stream per Managed object...
     managed.stream = state.stream
   end
 
```
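Setting the formatting noise aside, the behavioral core of the second hunk boils down to the following (simplified excerpt of the diff above, not a drop-in replacement):

```julia
# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
    # synchronize only when the per-array flag allows it
    managed.task_sync && maybe_synchronize(managed)
    # still switch streams: ".dirty = true" is about to be set, and later
    # synchronization must target the stream that will own the new work
    managed.stream = state.stream
end
```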

maleadt (Member) commented Feb 17, 2025:

I would widen the scope: what if you have a unified array that you're using on multiple devices? But basically, yes, that's what I was thinking of.
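A hedged sketch of that widened scope (assuming CUDA.jl's `unified = true` keyword for `cu` and a two-GPU setup; the schedule is purely illustrative):

```julia
using CUDA

a = cu(rand(Float32, 1024); unified = true)  # unified buffer, visible to all devices

device!(0)
a .+= 1      # work ordered on a stream of device 0

device!(1)
sum(a)       # access from device 1: the same "synchronize, or just flip
             # ownership?" question arises across devices, not just tasks
```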

codecov bot commented Feb 17, 2025:

Codecov Report

Attention: Patch coverage is 42.85714% with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.01%. Comparing base (3d42ca2) to head (c1e04f2).
Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/array.jl | 0.00% | 4 Missing ⚠️ |
Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #2662   +/-   ##
=======================================
  Coverage   74.01%   74.01%
=======================================
  Files         158      158
  Lines       15257    15261    +4
=======================================
+ Hits        11293    11296    +3
- Misses       3964     3965    +1
```


github-actions bot (Contributor) left a comment:

CUDA.jl Benchmarks

| Benchmark suite | Current: c1e04f2 | Previous: 3d42ca2 | Ratio |
|-----------------|------------------|-------------------|-------|
| latency/precompile | 46474076514.5 ns | 46520878096 ns | 1.00 |
| latency/ttfp | 6956156543 ns | 7008237160 ns | 0.99 |
| latency/import | 3631584313 ns | 3631961055 ns | 1.00 |
| integration/volumerhs | 9624598.5 ns | 9623329 ns | 1.00 |
| integration/byval/slices=1 | 147057 ns | 146619 ns | 1.00 |
| integration/byval/slices=3 | 425280 ns | 424765 ns | 1.00 |
| integration/byval/reference | 144868 ns | 144768 ns | 1.00 |
| integration/byval/slices=2 | 286168 ns | 285820 ns | 1.00 |
| integration/cudadevrt | 103450 ns | 103203 ns | 1.00 |
| kernel/indexing | 14050.5 ns | 13905 ns | 1.01 |
| kernel/indexing_checked | 14697 ns | 14547 ns | 1.01 |
| kernel/occupancy | 631.9235294117647 ns | 666.4058823529411 ns | 0.95 |
| kernel/launch | 1997.8 ns | 2007.2 ns | 1.00 |
| kernel/rand | 18179 ns | 16740 ns | 1.09 |
| array/reverse/1d | 19600 ns | 19411 ns | 1.01 |
| array/reverse/2d | 23498.5 ns | 23115.5 ns | 1.02 |
| array/reverse/1d_inplace | 10204 ns | 9745.333333333334 ns | 1.05 |
| array/reverse/2d_inplace | 11681 ns | 11354 ns | 1.03 |
| array/copy | 21035.5 ns | 20871 ns | 1.01 |
| array/iteration/findall/int | 157996 ns | 157360 ns | 1.00 |
| array/iteration/findall/bool | 138733 ns | 137944 ns | 1.01 |
| array/iteration/findfirst/int | 153655 ns | 152667 ns | 1.01 |
| array/iteration/findfirst/bool | 154472 ns | 154321 ns | 1.00 |
| array/iteration/scalar | 72458.5 ns | 73050 ns | 0.99 |
| array/iteration/logical | 212821 ns | 211424.5 ns | 1.01 |
| array/iteration/findmin/1d | 40754 ns | 40711.5 ns | 1.00 |
| array/iteration/findmin/2d | 93781 ns | 93411 ns | 1.00 |
| array/reductions/reduce/1d | 34849 ns | 43368 ns | 0.80 |
| array/reductions/reduce/2d | 40485 ns | 49343.5 ns | 0.82 |
| array/reductions/mapreduce/1d | 32887 ns | 35682 ns | 0.92 |
| array/reductions/mapreduce/2d | 40794 ns | 43698 ns | 0.93 |
| array/broadcast | 20879 ns | 20818 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 13707 ns | 13676 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 208093 ns | 207803 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 244906 ns | 243284 ns | 1.01 |
| array/accumulate/1d | 108414.5 ns | 108135 ns | 1.00 |
| array/accumulate/2d | 79828 ns | 79400 ns | 1.01 |
| array/construct | 1278.1 ns | 1294.3000000000002 ns | 0.99 |
| array/random/randn/Float32 | 43467 ns | 43629.5 ns | 1.00 |
| array/random/randn!/Float32 | 26256 ns | 26224 ns | 1.00 |
| array/random/rand!/Int64 | 26918 ns | 27041 ns | 1.00 |
| array/random/rand!/Float32 | 8736 ns | 8637.333333333334 ns | 1.01 |
| array/random/rand/Int64 | 29891 ns | 37854 ns | 0.79 |
| array/random/rand/Float32 | 13125 ns | 12706 ns | 1.03 |
| array/permutedims/4d | 60821.5 ns | 60584 ns | 1.00 |
| array/permutedims/2d | 54853 ns | 54607 ns | 1.00 |
| array/permutedims/3d | 55989 ns | 55510 ns | 1.01 |
| array/sorting/1d | 2776394.5 ns | 2777205.5 ns | 1.00 |
| array/sorting/by | 3367046 ns | 3369336 ns | 1.00 |
| array/sorting/2d | 1084585 ns | 1083969 ns | 1.00 |
| cuda/synchronization/stream/auto | 1051.2 ns | 996.3846153846154 ns | 1.06 |
| cuda/synchronization/stream/nonblocking | 6429.4 ns | 6421.1 ns | 1.00 |
| cuda/synchronization/stream/blocking | 822.3030303030303 ns | 821.9183673469388 ns | 1.00 |
| cuda/synchronization/context/auto | 1153 ns | 1154.2 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 6583.8 ns | 6536 ns | 1.01 |
| cuda/synchronization/context/blocking | 897.1590909090909 ns | 899.2982456140351 ns | 1.00 |

This comment was automatically generated by a workflow using github-action-benchmark.

luraess (Contributor) commented Feb 17, 2025:

The current proposition seems to fix the overlap (async execution) in Chmy (see PTsolvers/Chmy.jl#65). I'm unsure, though, whether it's the best way to handle this: besides having to call `unsafe_disable_task_sync!` for each array creation, one also has to call it after each `resize!` or `unsafe_wrap`, and potentially other operations, which makes the usage rather fragile and error-prone.
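A hedged sketch of the fragility being described (assuming the PR's `unsafe_disable_task_sync!` and that reallocation does not currently carry the flag over):

```julia
using CUDA

a = CUDA.zeros(Float32, 1024)
unsafe_disable_task_sync!(a)   # opt this array out of implicit cross-task sync

resize!(a, 2048)               # may allocate fresh backing memory...
unsafe_disable_task_sync!(a)   # ...so the flag has to be re-applied

b = unsafe_wrap(CuArray, pointer(a), 2048)
unsafe_disable_task_sync!(b)   # wrapping constructs a new array object, too
```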

maleadt (Member) commented Feb 18, 2025:

> one also has to call it after each `resize!`

`resize!` should propagate the flag.

> or `unsafe_wrap`

That's not relevant here, as it's basically a constructor creating a new array. The idea is that this property is set on an array object, which is the only safe scope to do so.
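A hedged sketch of what "propagating the flag" could look like from the outside (illustrative helper, not the PR's implementation; the `a.data[].task_sync` field access assumes the `Managed` struct from the diff above):

```julia
function resize_keeping_task_sync!(a::CuArray, n::Integer)
    flag = a.data[].task_sync   # remember the per-array setting
    resize!(a, n)               # may swap in a fresh Managed object
    a.data[].task_sync = flag   # restore the setting on the new one
    return a
end
```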

Labels: performance (How fast can we go?)