Support disabling automatic sync on task switch #2662
base: master
Conversation
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:
diff --git a/src/memory.jl b/src/memory.jl
index fe7808c1f..0d015aa88 100644
--- a/src/memory.jl
+++ b/src/memory.jl
@@ -509,13 +509,13 @@ mutable struct Managed{M}
# whether the memory has been captured in a way that would make the dirty bit unreliable
captured::Bool
- # whether memory accessed from another task causes implicit syncrhonization
- task_sync::Bool
+ # whether memory accessed from another task causes implicit syncrhonization
+ task_sync::Bool
- function Managed(mem::AbstractMemory; stream = CUDA.stream(), dirty = true, captured = false, task_sync = true)
+ function Managed(mem::AbstractMemory; stream = CUDA.stream(), dirty = true, captured = false, task_sync = true)
# NOTE: memory starts as dirty, because stream-ordered allocations are only
# guaranteed to be physically allocated at a synchronization event.
- return new{typeof(mem)}(mem, stream, dirty, captured, task_sync)
+ return new{typeof(mem)}(mem, stream, dirty, captured, task_sync)
end
end
@@ -569,13 +569,13 @@ function Base.convert(::Type{CuPtr{T}}, managed::Managed{M}) where {T,M}
# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
- # Synchronize the array if task_sync is enabled
- managed.task_sync && maybe_synchronize(managed)
- # We must still switch the stream, since we are about to set ".dirty=true"
- # and otherwise subsequent operations will synchronize against the wrong stream.
- # XXX: What to do when an array is used on multiple tasks concurrently?
- # We could have the situation that we will synchronize against the "wrong" stream
- # But that would mean that we need a mapping from task to stream per Managed object...
+ # Synchronize the array if task_sync is enabled
+ managed.task_sync && maybe_synchronize(managed)
+ # We must still switch the stream, since we are about to set ".dirty=true"
+ # and otherwise subsequent operations will synchronize against the wrong stream.
+ # XXX: What to do when an array is used on multiple tasks concurrently?
+ # We could have the situation that we will synchronize against the "wrong" stream
+ # But that would mean that we need a mapping from task to stream per Managed object...
managed.stream = state.stream
end
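To illustrate the behaviour the suggested diff encodes, here is a minimal, self-contained Julia sketch of the decision made on a stream switch. It only models the logic: the types and the `maybe_synchronize` stub below are simplified stand-ins, not CUDA.jl's actual internals.

```julia
# Minimal stand-alone model of the task_sync gate from the diff above.
# `FakeStream`, `FakeManaged`, and `maybe_synchronize` are stand-ins.
struct FakeStream
    id::Int
end

mutable struct FakeManaged
    stream::FakeStream
    dirty::Bool
    task_sync::Bool   # whether access from another task triggers implicit sync
end

maybe_synchronize(m::FakeManaged) = println("synchronizing stream $(m.stream.id)")

# Roughly what the pointer conversion does when the accessing stream differs:
function access!(m::FakeManaged, current_stream::FakeStream)
    if m.stream != current_stream
        # only synchronize when the opt-out flag is not set
        m.task_sync && maybe_synchronize(m)
        # always take ownership of the new stream, since the memory is marked dirty
        m.stream = current_stream
    end
    m.dirty = true
    return m
end

m = FakeManaged(FakeStream(1), true, false)   # task_sync disabled
access!(m, FakeStream(2))                     # switches stream without synchronizing
```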
I would widen the scope: what if you have a unified array you're using on multiple devices? But basically that's what I was thinking of, yes.
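For context, the multi-device case referred to here can be set up with public CUDA.jl API (`cu(...; unified=true)`, `device!`); whether and how the proposed flag would apply to it is exactly the open question. A sketch of the scenario, assuming at least two devices are available:

```julia
using CUDA

# A unified-memory array is accessible from every device, so the same
# "who synchronizes when?" question arises across devices, not just tasks.
if CUDA.ndevices() >= 2
    A = cu(rand(Float32, 1024); unified = true)

    device!(0)
    A .+= 1            # queued on device 0's stream

    device!(1)
    s = sum(A)         # used from device 1: should this implicitly synchronize?
    @show s
end
```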
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files:

@@           Coverage Diff           @@
##           master    #2662   +/- ##
=======================================
  Coverage   74.01%   74.01%
=======================================
  Files         158      158
  Lines       15257    15261    +4
=======================================
+ Hits        11293    11296    +3
- Misses       3964     3965    +1

☔ View full report in Codecov by Sentry.
CUDA.jl Benchmarks
Benchmark suite | Current: c1e04f2 | Previous: 3d42ca2 | Ratio |
---|---|---|---|
latency/precompile | 46474076514.5 ns | 46520878096 ns | 1.00 |
latency/ttfp | 6956156543 ns | 7008237160 ns | 0.99 |
latency/import | 3631584313 ns | 3631961055 ns | 1.00 |
integration/volumerhs | 9624598.5 ns | 9623329 ns | 1.00 |
integration/byval/slices=1 | 147057 ns | 146619 ns | 1.00 |
integration/byval/slices=3 | 425280 ns | 424765 ns | 1.00 |
integration/byval/reference | 144868 ns | 144768 ns | 1.00 |
integration/byval/slices=2 | 286168 ns | 285820 ns | 1.00 |
integration/cudadevrt | 103450 ns | 103203 ns | 1.00 |
kernel/indexing | 14050.5 ns | 13905 ns | 1.01 |
kernel/indexing_checked | 14697 ns | 14547 ns | 1.01 |
kernel/occupancy | 631.9235294117647 ns | 666.4058823529411 ns | 0.95 |
kernel/launch | 1997.8 ns | 2007.2 ns | 1.00 |
kernel/rand | 18179 ns | 16740 ns | 1.09 |
array/reverse/1d | 19600 ns | 19411 ns | 1.01 |
array/reverse/2d | 23498.5 ns | 23115.5 ns | 1.02 |
array/reverse/1d_inplace | 10204 ns | 9745.333333333334 ns | 1.05 |
array/reverse/2d_inplace | 11681 ns | 11354 ns | 1.03 |
array/copy | 21035.5 ns | 20871 ns | 1.01 |
array/iteration/findall/int | 157996 ns | 157360 ns | 1.00 |
array/iteration/findall/bool | 138733 ns | 137944 ns | 1.01 |
array/iteration/findfirst/int | 153655 ns | 152667 ns | 1.01 |
array/iteration/findfirst/bool | 154472 ns | 154321 ns | 1.00 |
array/iteration/scalar | 72458.5 ns | 73050 ns | 0.99 |
array/iteration/logical | 212821 ns | 211424.5 ns | 1.01 |
array/iteration/findmin/1d | 40754 ns | 40711.5 ns | 1.00 |
array/iteration/findmin/2d | 93781 ns | 93411 ns | 1.00 |
array/reductions/reduce/1d | 34849 ns | 43368 ns | 0.80 |
array/reductions/reduce/2d | 40485 ns | 49343.5 ns | 0.82 |
array/reductions/mapreduce/1d | 32887 ns | 35682 ns | 0.92 |
array/reductions/mapreduce/2d | 40794 ns | 43698 ns | 0.93 |
array/broadcast | 20879 ns | 20818 ns | 1.00 |
array/copyto!/gpu_to_gpu | 13707 ns | 13676 ns | 1.00 |
array/copyto!/cpu_to_gpu | 208093 ns | 207803 ns | 1.00 |
array/copyto!/gpu_to_cpu | 244906 ns | 243284 ns | 1.01 |
array/accumulate/1d | 108414.5 ns | 108135 ns | 1.00 |
array/accumulate/2d | 79828 ns | 79400 ns | 1.01 |
array/construct | 1278.1 ns | 1294.3000000000002 ns | 0.99 |
array/random/randn/Float32 | 43467 ns | 43629.5 ns | 1.00 |
array/random/randn!/Float32 | 26256 ns | 26224 ns | 1.00 |
array/random/rand!/Int64 | 26918 ns | 27041 ns | 1.00 |
array/random/rand!/Float32 | 8736 ns | 8637.333333333334 ns | 1.01 |
array/random/rand/Int64 | 29891 ns | 37854 ns | 0.79 |
array/random/rand/Float32 | 13125 ns | 12706 ns | 1.03 |
array/permutedims/4d | 60821.5 ns | 60584 ns | 1.00 |
array/permutedims/2d | 54853 ns | 54607 ns | 1.00 |
array/permutedims/3d | 55989 ns | 55510 ns | 1.01 |
array/sorting/1d | 2776394.5 ns | 2777205.5 ns | 1.00 |
array/sorting/by | 3367046 ns | 3369336 ns | 1.00 |
array/sorting/2d | 1084585 ns | 1083969 ns | 1.00 |
cuda/synchronization/stream/auto | 1051.2 ns | 996.3846153846154 ns | 1.06 |
cuda/synchronization/stream/nonblocking | 6429.4 ns | 6421.1 ns | 1.00 |
cuda/synchronization/stream/blocking | 822.3030303030303 ns | 821.9183673469388 ns | 1.00 |
cuda/synchronization/context/auto | 1153 ns | 1154.2 ns | 1.00 |
cuda/synchronization/context/nonblocking | 6583.8 ns | 6536 ns | 1.01 |
cuda/synchronization/context/blocking | 897.1590909090909 ns | 899.2982456140351 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
The current proposition seems to fix the overlap (async execution) in Chmy (see PTsolvers/Chmy.jl#65). I'm unsure though whether it's the best way to handle this, as besides having to call …
That's not relevant here, as it's basically a constructor creating a new array. The idea is that this property is set on an array object, which is the only safe scope to do so.
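A hypothetical sketch of what "setting the property on an array object" could look like; the helper name `enable_task_sync!` and the `arr.data[]` accessor to the underlying `Managed` wrapper are assumptions for illustration, not an API this PR is confirmed to add.

```julia
using CUDA

# Hypothetical helper: flip the per-array flag on the underlying Managed object.
# Assumes `arr.data[]` exposes the Managed wrapper carrying the task_sync field.
function enable_task_sync!(arr::CuArray, flag::Bool)
    arr.data[].task_sync = flag
    return arr
end

A = CUDA.rand(Float32, 1024)
enable_task_sync!(A, false)   # opt this array out of implicit sync on task switch
```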
@maleadt is that what you had in mind for #2617?
One of the tricky things is whether we should flip the stream or not.
Since we are about to set dirty I think we must, but that of course means it is possible to "miss" logical sync events within a task.
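As a concrete illustration of the trade-off, here is a sketch using the public CUDA.jl task/stream model (each Julia task gets its own stream); the comment marks where the implicit synchronization would be skipped if an array opts out, which is the case being discussed above.

```julia
using CUDA

# Passing an array between tasks crosses streams. Today, the first access from
# the consumer task implicitly synchronizes against the producer's stream; with
# task_sync disabled it would not, and the consumer must order the work itself.
A = CUDA.zeros(Float32, 1024)

producer = Threads.@spawn begin
    A .= 1f0                 # queued on the producer task's stream
    A
end

consumer = Threads.@spawn begin
    B = fetch(producer)
    # <- this access switches B to the consumer's stream; with task_sync = false
    #    the implicit synchronization here would be skipped, so explicit ordering
    #    (e.g. CUDA.synchronize()) would be required first.
    sum(B)
end

@show fetch(consumer)
```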