colexecdisk: create disk-backed operators lazily in diskSpiller #137373
Conversation
Here is the change on
Hm, we might need mutex protection now for the monitor and closer registries.
Added concurrency-safety to both registries, should be RFAL.
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Reviewed 14 of 14 files at r1, 30 of 30 files at r6, 16 of 16 files at r7, 8 of 8 files at r8, 1 of 1 files at r9, 20 of 21 files at r10, 1 of 1 files at r11, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @yuzefovich)
pkg/sql/colexec/colexecargs/monitor_registry.go
line 33 at r11 (raw file):
```go
// Note that Close and Reset are not protected.
func (r *MonitorRegistry) SetMutex(mu *syncutil.Mutex) {
	if r.optionalMu == nil {
```
I have the same concern here: if two operators spill to disk at the same time, this would be a race condition, wouldn't it?
pkg/sql/colexec/colexecargs/closer_registry.go
line 27 at r11 (raw file):
```go
// Note that Close and Reset are not protected.
func (r *CloserRegistry) SetMutex(mu *syncutil.Mutex) {
	if r.optionalMu == nil {
```
Couldn't we have a race if two goroutines simultaneously try to set this mutex?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @mgartner)
pkg/sql/colexec/colexecargs/closer_registry.go
line 27 at r11 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Couldn't we have a race if two goroutines simultaneously try to set this mutex?
`SetMutex` is called during the operator planning, which is done in a single goroutine - this happens in `makeConstructorConcurrencySafe`, which is called whenever we create a disk-spilling operator. Only the `diskBackedConstructor` function can be called during the execution, and at that point the mutex will have been set, protecting the registries.
I tried to deduplicate the code a bit with `makeConstructorConcurrencySafe`, but perhaps it makes it harder to reason about - should we just explicitly call `SetMutex` right before calling the disk spiller constructors?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @yuzefovich)
pkg/sql/colexec/colexecargs/closer_registry.go
line 27 at r11 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
`SetMutex` is called during the operator planning, which is done in a single goroutine - this happens in `makeConstructorConcurrencySafe`, which is called whenever we create a disk-spilling operator. Only the `diskBackedConstructor` function can be called during the execution, and at that point the mutex will have been set, protecting the registries. I tried to deduplicate the code a bit with `makeConstructorConcurrencySafe`, but perhaps it makes it harder to reason about - should we just explicitly call `SetMutex` right before calling the disk spiller constructors?
You mean before calling the functions like `NewOneInputDiskSpiller`? I don't have a very strong opinion here, so up to you. Not sure `makeConstructorConcurrencySafe` adds much value, but either way seems potentially easy to misuse.
Do you think there are any valuable assertions we could add, e.g., is there a way that `AddCloser` could assert that the mutex has been set when it should be?
pkg/sql/colexec/colexecargs/op_creation.go
line 109 at r11 (raw file):
```go
func (r *NewColOperatorArgs) MakeConcurrencySafeForDiskBackedOp() {
	r.MonitorRegistry.SetMutex(r.RegistriesMu)
	r.CloserRegistry.SetMutex(r.RegistriesMu)
```
Do these need to be the same mutex? Would using a mutex embedded in the registries with a boolean indicating whether it should be used or not be simpler?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @mgartner)
pkg/sql/colexec/colexecargs/closer_registry.go
line 27 at r11 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
You mean before calling the functions like `NewOneInputDiskSpiller`? I don't have a very strong opinion here, so up to you. Not sure `makeConstructorConcurrencySafe` adds much value, but either way seems potentially easy to misuse. Do you think there are any valuable assertions we could add, e.g., is there a way that `AddCloser` could assert that the mutex has been set when it should be?
I decided to remove this optionality and just make the registries concurrency-safe unconditionally. Mutex operations should take on the order of tens of nanoseconds, and during the operator planning we only have a single goroutine, so there shouldn't be any contention on the hot path. In other words, let's make it safe first, and if we see the mutex overhead show up in profiles, then we'll optimize it.
pkg/sql/colexec/colexecargs/op_creation.go
line 109 at r11 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Do these need to be the same mutex? Would using a mutex embedded in the registries with a boolean indicating whether it should be used or not be simpler?
The mutexes can be different. Yeah, perhaps a boolean to indicate whether the concurrency protection is enabled would be cleaner. I put this thought aside for now (see previous comment).
Whenever we have the monitor registry in scope, we can just leave this field unset, and an account will be created lazily via the monitor registry. This allows for some code cleanup but also fixes an issue in `verifyColOperator` where a memory account would be closed before those owned by the monitor registry (which would be problematic in the following commit where closers are closed right before the monitor registry cleanup, but happened to be after this single acc closure). Release note: None
This commit introduces `colexecargs.CloserRegistry` which is now responsible for tracking all `colexecop.Closer`s we create during the vectorized planning. Previously, we would accumulate them in `NewColOperatorResult.ToClose` slice to be appended to the `closers` slice in the vectorized flow. However, the following commit will make creation of some closers lazy, meaning that the captured `result.ToClose` slice might have incomplete information, and this commit goes around that issue. This commit should be a no-op change as of right now. Release note: None
The change in the previous commit made sure that we now close all closers in the tests (if the closer registry is specified), so we can clean up some logic around that in tests. The main idea is rather than rely on the RunTests harness to try to close the necessary operators, each test will now be responsible for doing so if it might create a closer. Release note: None
Previously, the hash group joiner inherited the reset behavior from `TwoInputInitHelper` that it embeds, which could leave the hash joiner and the hash aggregator encapsulated in the hash group joiner in the non-reset state. As a result, if the external hash group joiner spilled to disk, we could get incorrect results. This is now fixed. There is no explicit test added for this because the change in the following commit (where `ChildCount` on the disk spiller now returns one less) will exercise this code path (via "reusing after reset" part of the RunTests harness). This is also an experimental feature, so there is no release note. Release note: None
This commit makes the creation of the disk-backed operator lazy: it now happens when the diskSpiller spills to disk for the first time in its lifetime. This allows us to avoid allocations in the common case when we don't spill to disk.

This required adjusting some tests that verify that the expected closers are accumulated. In some cases we know exactly how many closers will be added if the operator tree is forced to spill to disk (external sort, external distinct, disk-backed hash group join), whereas in others (external hash join and external hash aggregator) this number is not easy to calculate. In order to have some coverage for the latter case, this commit introduces some sanity checks to ensure that at least in some test cases the number of added closers seems reasonable.

Note that this change is not without its risks. In particular, the disk-backed operator constructor must execute correctly when delayed. The function captures references to miscellaneous state, which - if mutated - can lead to problems. One such problem was the capture of `result.ColumnTypes` when creating the external distinct, and this commit fixes it by keeping a separate reference to the input schema. All constructor functions have been audited to ensure that no captured arguments would be modified later. The constructors also capture some other functions (most commonly around the monitor registry, the closer registry, and the constructor for the disk-backed sort), but those should be safe.

An additional complication is that the delayed constructor functions can now run concurrently (if we have concurrency within the flow, most likely due to the plan being distributed). To account for that, the monitor and the closer registries have been adjusted to be concurrency-safe. Note that we could have made the mutex protection optional (i.e. "enabled" mutex protection only when we create the disk spiller), but for safety we decided to add unconditional protection. If mutex operations show up in profiles, we'll adjust this.

Also note that the disk-backed operator chain will now be excluded from EXPLAIN (VEC, VERBOSE). This seems ok.

Release note: None
Reviewed 9 of 32 files at r13, 6 of 16 files at r14, 5 of 8 files at r15, 1 of 1 files at r16, 20 of 20 files at r17, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball and @yuzefovich)
pkg/sql/colexec/colexecargs/closer_registry.go
line 27 at r11 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I decided to remove this optionality and just make the registries concurrency-safe unconditionally. Mutex operations should take on the order of tens nanoseconds, and during the operator planning we only have a single goroutine, so there shouldn't be any contention on the hot path. In other words, let's make it safe first, and if we see the mutex overhead show up in profiles, then we'll optimize it.
SGTM 👍
pkg/sql/colexec/colexecargs/op_creation.go
line 109 at r11 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
The mutexes can be different. Yeah, perhaps a boolean to indicate whether the concurrency protection is enabled would be cleaner. I put this thought aside for now (see previous comment).
👍
@yuzefovich Did you run any benchmarks with this PR? I can run some quick sysbench microbenchmarks if not.
Yeah, I included some microbenchmarks in this comment. I also just reran the benchmarks with the latest version, and the results show an improvement: https://gist.github.com/yuzefovich/c5b074f1804ddd2fc92470eeaddfd622.
TFTR!
bors r+
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)
This PR refactors the disk spilling machinery in the vectorized engine to create the disk-backed operator lazily, when the spilling to disk actually occurs, rather than eagerly as is done right now. The main motivation is that in many cases we won't need the disk-backed operator at all, so this reduces allocations on the hot path. Some care had to be taken to ensure that the construction - which is a planning-time operation - works correctly when performed at execution time. See each commit for details.
Fixes: #137323.
Release note: None