
Support disabling automatic sync on task switch #2662

Open · wants to merge 1 commit into master

Conversation

vchuravy (Member) commented:

@maleadt is that what you had in mind for #2617?

One of the tricky things is whether we should flip the stream or not. But we are about to set the dirty bit, so I think we must; that of course means it is possible to "miss" logical sync events within a task:

```julia
@spawn begin
    # operation A
    # task switch -- synchronize on a different task
    # operation B
end
```
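To make the concern concrete, here is a hedged sketch of that scenario (`Threads.@spawn` stands in for the `@spawn` above; the exact internals and sync points are assumptions, not the PR's behavior):

```julia
using CUDA

a = CUDA.rand(Float32, 1024)

t1 = Threads.@spawn begin
    a .+= 1    # operation A, enqueued on this task's stream
    yield()    # task switch: another task may now touch `a`
    a .*= 2    # operation B: if the other task flipped `a`'s owning stream
               # without synchronizing, a logical sync point between A and B
               # can be "missed", as described above
end

t2 = Threads.@spawn sum(a)   # cross-task access: with `task_sync` enabled this
                             # synchronizes and takes ownership of the stream

wait(t1)
wait(t2)
```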

Contributor commented:

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (`git runic master`) to apply these changes.

Suggested changes:
```diff
diff --git a/src/memory.jl b/src/memory.jl
index fe7808c1f..0d015aa88 100644
--- a/src/memory.jl
+++ b/src/memory.jl
@@ -509,13 +509,13 @@ mutable struct Managed{M}
   # whether the memory has been captured in a way that would make the dirty bit unreliable
   captured::Bool
 
-  # whether memory accessed from another task causes implicit syncrhonization
-  task_sync::Bool
+    # whether memory accessed from another task causes implicit syncrhonization
+    task_sync::Bool
 
-  function Managed(mem::AbstractMemory; stream = CUDA.stream(), dirty = true, captured = false, task_sync = true)
+    function Managed(mem::AbstractMemory; stream = CUDA.stream(), dirty = true, captured = false, task_sync = true)
     # NOTE: memory starts as dirty, because stream-ordered allocations are only
     #       guaranteed to be physically allocated at a synchronization event.
-    return new{typeof(mem)}(mem, stream, dirty, captured, task_sync)
+        return new{typeof(mem)}(mem, stream, dirty, captured, task_sync)
   end
 end
 
@@ -569,13 +569,13 @@ function Base.convert(::Type{CuPtr{T}}, managed::Managed{M}) where {T,M}
 
   # accessing memory on another stream: ensure the data is ready and take ownership
   if managed.stream != state.stream
-    # Synchronize the array if task_sync is enabled
-    managed.task_sync && maybe_synchronize(managed)
-    # We must still switch the stream, since we are about to set ".dirty=true"
-    # and otherwise subsequent operations will synchronize against the wrong stream.
-    # XXX: What to do when an array is used on multiple tasks concurrently?
-    #      We could have the situation that we will synchronize against the "wrong" stream
-    #      But that would mean that we need a mapping from task to stream per Managed object...
+        # Synchronize the array if task_sync is enabled
+        managed.task_sync && maybe_synchronize(managed)
+        # We must still switch the stream, since we are about to set ".dirty=true"
+        # and otherwise subsequent operations will synchronize against the wrong stream.
+        # XXX: What to do when an array is used on multiple tasks concurrently?
+        #      We could have the situation that we will synchronize against the "wrong" stream
+        #      But that would mean that we need a mapping from task to stream per Managed object...
     managed.stream = state.stream
   end
 
```
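Setting the formatting noise aside, the behavioral core of the second hunk boils down to the following (simplified excerpt of the diff above, not a drop-in replacement):

```julia
# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
    # synchronize only when the per-array flag allows it
    managed.task_sync && maybe_synchronize(managed)
    # still switch streams: ".dirty = true" is about to be set, and later
    # synchronization must target the stream that will own the new work
    managed.stream = state.stream
end
```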

maleadt (Member) commented Feb 17, 2025:

I would widen the scope: what if you have a unified array that you're using on multiple devices? But basically, yes, that's what I was thinking of.
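A hedged sketch of that widened scope (assuming CUDA.jl's `unified = true` keyword for `cu` and a two-GPU setup; the schedule is purely illustrative):

```julia
using CUDA

a = cu(rand(Float32, 1024); unified = true)  # unified buffer, visible to all devices

device!(0)
a .+= 1      # work ordered on a stream of device 0

device!(1)
sum(a)       # access from device 1: the same "synchronize, or just flip
             # ownership?" question arises across devices, not just tasks
```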

codecov bot commented Feb 17, 2025:

Codecov Report

Attention: Patch coverage is 42.85714% with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.01%. Comparing base (3d42ca2) to head (c1e04f2).
Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/array.jl | 0.00% | 4 Missing ⚠️ |
Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #2662   +/-   ##
=======================================
  Coverage   74.01%   74.01%
=======================================
  Files         158      158
  Lines       15257    15261    +4
=======================================
+ Hits        11293    11296    +3
- Misses       3964     3965    +1
```


github-actions bot (Contributor) left a comment:

CUDA.jl Benchmarks

| Benchmark suite | Current: c1e04f2 | Previous: 3d42ca2 | Ratio |
|-----------------|------------------|-------------------|-------|
| latency/precompile | 46474076514.5 ns | 46520878096 ns | 1.00 |
| latency/ttfp | 6956156543 ns | 7008237160 ns | 0.99 |
| latency/import | 3631584313 ns | 3631961055 ns | 1.00 |
| integration/volumerhs | 9624598.5 ns | 9623329 ns | 1.00 |
| integration/byval/slices=1 | 147057 ns | 146619 ns | 1.00 |
| integration/byval/slices=3 | 425280 ns | 424765 ns | 1.00 |
| integration/byval/reference | 144868 ns | 144768 ns | 1.00 |
| integration/byval/slices=2 | 286168 ns | 285820 ns | 1.00 |
| integration/cudadevrt | 103450 ns | 103203 ns | 1.00 |
| kernel/indexing | 14050.5 ns | 13905 ns | 1.01 |
| kernel/indexing_checked | 14697 ns | 14547 ns | 1.01 |
| kernel/occupancy | 631.9235294117647 ns | 666.4058823529411 ns | 0.95 |
| kernel/launch | 1997.8 ns | 2007.2 ns | 1.00 |
| kernel/rand | 18179 ns | 16740 ns | 1.09 |
| array/reverse/1d | 19600 ns | 19411 ns | 1.01 |
| array/reverse/2d | 23498.5 ns | 23115.5 ns | 1.02 |
| array/reverse/1d_inplace | 10204 ns | 9745.333333333334 ns | 1.05 |
| array/reverse/2d_inplace | 11681 ns | 11354 ns | 1.03 |
| array/copy | 21035.5 ns | 20871 ns | 1.01 |
| array/iteration/findall/int | 157996 ns | 157360 ns | 1.00 |
| array/iteration/findall/bool | 138733 ns | 137944 ns | 1.01 |
| array/iteration/findfirst/int | 153655 ns | 152667 ns | 1.01 |
| array/iteration/findfirst/bool | 154472 ns | 154321 ns | 1.00 |
| array/iteration/scalar | 72458.5 ns | 73050 ns | 0.99 |
| array/iteration/logical | 212821 ns | 211424.5 ns | 1.01 |
| array/iteration/findmin/1d | 40754 ns | 40711.5 ns | 1.00 |
| array/iteration/findmin/2d | 93781 ns | 93411 ns | 1.00 |
| array/reductions/reduce/1d | 34849 ns | 43368 ns | 0.80 |
| array/reductions/reduce/2d | 40485 ns | 49343.5 ns | 0.82 |
| array/reductions/mapreduce/1d | 32887 ns | 35682 ns | 0.92 |
| array/reductions/mapreduce/2d | 40794 ns | 43698 ns | 0.93 |
| array/broadcast | 20879 ns | 20818 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 13707 ns | 13676 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 208093 ns | 207803 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 244906 ns | 243284 ns | 1.01 |
| array/accumulate/1d | 108414.5 ns | 108135 ns | 1.00 |
| array/accumulate/2d | 79828 ns | 79400 ns | 1.01 |
| array/construct | 1278.1 ns | 1294.3000000000002 ns | 0.99 |
| array/random/randn/Float32 | 43467 ns | 43629.5 ns | 1.00 |
| array/random/randn!/Float32 | 26256 ns | 26224 ns | 1.00 |
| array/random/rand!/Int64 | 26918 ns | 27041 ns | 1.00 |
| array/random/rand!/Float32 | 8736 ns | 8637.333333333334 ns | 1.01 |
| array/random/rand/Int64 | 29891 ns | 37854 ns | 0.79 |
| array/random/rand/Float32 | 13125 ns | 12706 ns | 1.03 |
| array/permutedims/4d | 60821.5 ns | 60584 ns | 1.00 |
| array/permutedims/2d | 54853 ns | 54607 ns | 1.00 |
| array/permutedims/3d | 55989 ns | 55510 ns | 1.01 |
| array/sorting/1d | 2776394.5 ns | 2777205.5 ns | 1.00 |
| array/sorting/by | 3367046 ns | 3369336 ns | 1.00 |
| array/sorting/2d | 1084585 ns | 1083969 ns | 1.00 |
| cuda/synchronization/stream/auto | 1051.2 ns | 996.3846153846154 ns | 1.06 |
| cuda/synchronization/stream/nonblocking | 6429.4 ns | 6421.1 ns | 1.00 |
| cuda/synchronization/stream/blocking | 822.3030303030303 ns | 821.9183673469388 ns | 1.00 |
| cuda/synchronization/context/auto | 1153 ns | 1154.2 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 6583.8 ns | 6536 ns | 1.01 |
| cuda/synchronization/context/blocking | 897.1590909090909 ns | 899.2982456140351 ns | 1.00 |

This comment was automatically generated by a workflow using github-action-benchmark.

luraess (Contributor) commented Feb 17, 2025:

The current proposition seems to fix the overlap (async execution) in Chmy (see PTsolvers/Chmy.jl#65). I'm unsure, though, whether it's the best way to handle this: besides having to call `unsafe_disable_task_sync!` for each array creation, one also has to call it after each `resize!` or `unsafe_wrap`, and potentially other operations, which makes the usage rather fragile and error-prone.
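A hedged sketch of the fragility being described (assuming the PR's `unsafe_disable_task_sync!` and that reallocation does not currently carry the flag over):

```julia
using CUDA

a = CUDA.zeros(Float32, 1024)
unsafe_disable_task_sync!(a)   # opt this array out of implicit cross-task sync

resize!(a, 2048)               # may allocate fresh backing memory...
unsafe_disable_task_sync!(a)   # ...so the flag has to be re-applied

b = unsafe_wrap(CuArray, pointer(a), 2048)
unsafe_disable_task_sync!(b)   # wrapping constructs a new array object, too
```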

maleadt (Member) commented Feb 18, 2025:

> one also has to call it after each `resize!`

`resize!` should propagate the flag.

> or `unsafe_wrap`

That's not relevant here, as it's basically a constructor creating a new array. The idea is that this property is set on an array object, which is the only safe scope to do so.
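A hedged sketch of what "propagating the flag" could look like from the outside (illustrative helper, not the PR's implementation; the `a.data[].task_sync` field access assumes the `Managed` struct from the diff above):

```julia
function resize_keeping_task_sync!(a::CuArray, n::Integer)
    flag = a.data[].task_sync   # remember the per-array setting
    resize!(a, n)               # may swap in a fresh Managed object
    a.data[].task_sync = flag   # restore the setting on the new one
    return a
end
```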

Labels: performance (How fast can we go?)