
colexecdisk: create disk-backed operators lazily in diskSpiller #137373

Merged (5 commits) Jan 8, 2025

Conversation

@yuzefovich (Member) commented Dec 13, 2024

This PR refactors the disk spilling machinery in the vectorized engine to create the disk-backed operator lazily, when spilling to disk actually occurs, rather than eagerly as is done right now. The main motivation is that in many cases we won't need the disk-backed operator at all, so this reduces allocations on the hot path. Some care had to be taken to ensure that the construction, which is a planning-time operation, works correctly when performed at execution time. See each commit for details.

Fixes: #137323.
Release note: None
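
For illustration, here is a minimal sketch of the lazy-construction pattern described above; all names (Operator, diskSpiller, diskBackedConstructor, etc.) are simplified stand-ins, not the actual colexec API:

package main

import "fmt"

type Operator interface{ Next() bool }

type noopOp struct{}

func (noopOp) Next() bool { return false }

type diskSpiller struct {
	// diskBackedConstructor is captured at planning time but only
	// invoked on the first spill, keeping allocations off the hot path.
	diskBackedConstructor func() Operator
	diskBackedOp          Operator // nil until the first spill
}

func (s *diskSpiller) spillToDisk() {
	if s.diskBackedOp == nil {
		// Lazily construct the disk-backed operator chain.
		s.diskBackedOp = s.diskBackedConstructor()
	}
}

func main() {
	s := &diskSpiller{diskBackedConstructor: func() Operator {
		fmt.Println("constructing disk-backed operator")
		return noopOp{}
	}}
	s.spillToDisk() // constructs on the first spill
	s.spillToDisk() // reuses the already-constructed operator
}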

@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich yuzefovich force-pushed the lazy-disk-spilling branch 7 times, most recently from 02ad865 to 5b552cc Compare December 14, 2024 00:13
@yuzefovich yuzefovich changed the title WIP on lazy disk-backed operator creation colexecdisk: create disk-backed operators lazily in diskSpiller Dec 14, 2024
@yuzefovich (Member, Author)

Here is the impact on BenchmarkExternalDistinct. As expected, when we don't spill we see a noticeable reduction in allocations:

name                                                                                                                old allocs/op  new allocs/op  delta
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.001/rows=1024/ordCols=0/type=bytes-24          150 ± 0%        66 ± 0%  -56.00%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.100/rows=1024/ordCols=0/type=bytes-24          150 ± 0%        66 ± 0%  -56.00%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.001/rows=1024/ordCols=0/type=int-24            154 ± 0%        70 ± 0%  -54.55%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.100/rows=1024/ordCols=0/type=int-24            168 ± 0%        84 ± 0%  -50.00%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.001/rows=65536/ordCols=0/type=bytes-24         657 ± 0%       573 ± 0%  -12.79%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.001/rows=65536/ordCols=0/type=int-24           677 ± 0%       593 ± 0%  -12.41%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.100/rows=65536/ordCols=0/type=bytes-24         681 ± 0%       597 ± 0%  -12.33%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.100/rows=65536/ordCols=0/type=int-24           835 ± 0%       751 ± 0%  -10.06%  (p=0.000 n=10+10)

@yuzefovich yuzefovich requested review from DrewKimball and a team December 14, 2024 00:22
@yuzefovich yuzefovich marked this pull request as ready for review December 14, 2024 00:22
@yuzefovich (Member, Author)

Hm, we might need mutex protection now for the monitor and closer registries.

@yuzefovich yuzefovich force-pushed the lazy-disk-spilling branch 2 times, most recently from ef08126 to 023bd35 Compare December 19, 2024 22:09
@yuzefovich (Member, Author)

Added concurrency safety to both registries; should be RFAL.

@cockroachdb cockroachdb deleted a comment from blathers-crl bot Dec 19, 2024
blathers-crl (bot) commented Dec 19, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@mgartner mgartner added the o-perf-efficiency Related to performance efficiency label Dec 20, 2024
@mgartner (Collaborator) left a comment

Reviewed 14 of 14 files at r1, 30 of 30 files at r6, 16 of 16 files at r7, 8 of 8 files at r8, 1 of 1 files at r9, 20 of 21 files at r10, 1 of 1 files at r11, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @yuzefovich)


pkg/sql/colexec/colexecargs/monitor_registry.go line 33 at r11 (raw file):

// Note that Close and Reset are not protected.
func (r *MonitorRegistry) SetMutex(mu *syncutil.Mutex) {
	if r.optionalMu == nil {

I have the same concern here: if two operators spill to disk at the same time, this would be a race condition, wouldn't it?


pkg/sql/colexec/colexecargs/closer_registry.go line 27 at r11 (raw file):

// Note that Close and Reset are not protected.
func (r *CloserRegistry) SetMutex(mu *syncutil.Mutex) {
	if r.optionalMu == nil {

Couldn't we have a race if two goroutines simultaneously try to set this mutex?

@yuzefovich (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @mgartner)


pkg/sql/colexec/colexecargs/closer_registry.go line 27 at r11 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Couldn't we have a race if two goroutines simultaneously try to set this mutex?

SetMutex is called during operator planning, which is done in a single goroutine - it happens in makeConstructorConcurrencySafe, which is called whenever we create a disk-spilling operator. Only the diskBackedConstructor function can be called during execution, and by that point the mutex will have been set, protecting the registries.

I tried to deduplicate the code a bit with makeConstructorConcurrencySafe, but perhaps it makes things harder to reason about - should we just explicitly call SetMutex right before calling the disk spiller constructors?
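
As a rough sketch of the optional-mutex pattern under discussion (the CloserRegistry and optionalMu names follow the quoted snippets, but the method bodies are guesses rather than the actual implementation, and sync.Mutex stands in for syncutil.Mutex):

package main

import "sync"

type Closer interface{ Close() error }

type CloserRegistry struct {
	optionalMu *sync.Mutex // nil until SetMutex is called
	toClose    []Closer
}

// SetMutex is called once, during single-goroutine planning, so it
// needs no synchronization itself. Note that Close and Reset are not
// protected.
func (r *CloserRegistry) SetMutex(mu *sync.Mutex) {
	if r.optionalMu == nil {
		r.optionalMu = mu
	}
}

// AddCloser locks only when the registry was made concurrency-safe.
func (r *CloserRegistry) AddCloser(c Closer) {
	if r.optionalMu != nil {
		r.optionalMu.Lock()
		defer r.optionalMu.Unlock()
	}
	r.toClose = append(r.toClose, c)
}

func main() {
	var mu sync.Mutex
	r := &CloserRegistry{}
	r.SetMutex(&mu)  // planning time, single goroutine
	r.AddCloser(nil) // now safe from concurrent delayed constructors
}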

@mgartner (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @yuzefovich)


pkg/sql/colexec/colexecargs/closer_registry.go line 27 at r11 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

SetMutex is called during operator planning, which is done in a single goroutine - it happens in makeConstructorConcurrencySafe, which is called whenever we create a disk-spilling operator. Only the diskBackedConstructor function can be called during execution, and by that point the mutex will have been set, protecting the registries.

I tried to deduplicate the code a bit with makeConstructorConcurrencySafe, but perhaps it makes things harder to reason about - should we just explicitly call SetMutex right before calling the disk spiller constructors?

You mean before calling functions like NewOneInputDiskSpiller? I don't have a very strong opinion here, so up to you. I'm not sure makeConstructorConcurrencySafe adds much value, but either way seems potentially easy to misuse.

Do you think there are any valuable assertions we could add, e.g., is there a way that AddCloser could assert that the mutex has been set when it should be?


pkg/sql/colexec/colexecargs/op_creation.go line 109 at r11 (raw file):

func (r *NewColOperatorArgs) MakeConcurrencySafeForDiskBackedOp() {
	r.MonitorRegistry.SetMutex(r.RegistriesMu)
	r.CloserRegistry.SetMutex(r.RegistriesMu)

Do these need to be the same mutex? Would using a mutex embedded in the registries with a boolean indicating whether it should be used or not be simpler?

@yuzefovich yuzefovich force-pushed the lazy-disk-spilling branch 2 times, most recently from 58e6c94 to 1b615b6 Compare December 23, 2024 20:59
@yuzefovich yuzefovich requested a review from mgartner December 23, 2024 21:00
@yuzefovich (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @mgartner)


pkg/sql/colexec/colexecargs/closer_registry.go line 27 at r11 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

You mean before calling functions like NewOneInputDiskSpiller? I don't have a very strong opinion here, so up to you. I'm not sure makeConstructorConcurrencySafe adds much value, but either way seems potentially easy to misuse.

Do you think there are any valuable assertions we could add, e.g., is there a way that AddCloser could assert that the mutex has been set when it should be?

I decided to remove this optionality and just make the registries concurrency-safe unconditionally. Mutex operations should take on the order of tens of nanoseconds, and during operator planning we only have a single goroutine, so there shouldn't be any contention on the hot path. In other words, let's make it safe first, and if we see the mutex overhead show up in profiles, then we'll optimize it.
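
For illustration, the unconditional version might look like this minimal sketch (a hypothetical simplification reusing the Closer type from the earlier sketch, not the actual code):

type CloserRegistry struct {
	mu      sync.Mutex
	toClose []Closer
}

// AddCloser always locks: planning is single-threaded, so the
// uncontended lock cost (tens of nanoseconds) is negligible there,
// while lazily constructed operators in concurrent flows stay safe.
func (r *CloserRegistry) AddCloser(c Closer) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.toClose = append(r.toClose, c)
}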


pkg/sql/colexec/colexecargs/op_creation.go line 109 at r11 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Do these need to be the same mutex? Would using a mutex embedded in the registries with a boolean indicating whether it should be used or not be simpler?

The mutexes can be different. Yeah, perhaps a boolean to indicate whether the concurrency protection is enabled would be cleaner. I put this thought aside for now (see previous comment).

@yuzefovich yuzefovich force-pushed the lazy-disk-spilling branch 2 times, most recently from 3887d35 to b132f0e Compare December 23, 2024 22:33
Whenever we have the monitor registry in scope, we can just leave this
field unset, and an account will be created lazily via the monitor
registry. This allows for some code cleanup but also fixes an issue in
`verifyColOperator` where a memory account would be closed before those
owned by the monitor registry (which would become problematic in the
following commit, where closers are closed right before the monitor
registry cleanup yet after this single account's closure).

Release note: None
This commit introduces `colexecargs.CloserRegistry`, which is now
responsible for tracking all `colexecop.Closer`s we create during
vectorized planning. Previously, we would accumulate them in the
`NewColOperatorResult.ToClose` slice to be appended to the `closers`
slice in the vectorized flow. However, the following commit will make
the creation of some closers lazy, meaning that the captured
`result.ToClose` slice might have incomplete information; this commit
works around that issue.

This commit should be a no-op change as of right now.

Release note: None
The change in the previous commit made sure that we now close all
closers in the tests (if the closer registry is specified), so we
can clean up some logic around that in the tests. The main idea is
that rather than relying on the RunTests harness to close the
necessary operators, each test will now be responsible for doing so
if it might create a closer.

Release note: None
Previously, the hash group joiner inherited the reset behavior from
the `TwoInputInitHelper` that it embeds, which could leave the hash
joiner and the hash aggregator encapsulated in the hash group joiner
in a non-reset state. As a result, if the external hash group joiner
spilled to disk, we could get incorrect results. This is now fixed; a
toy sketch of the pattern follows below. There is no explicit test
added for this because the change in the following commit (where
`ChildCount` on the disk spiller now returns one less) will exercise
this code path (via the "reusing after reset" part of the RunTests
harness). This is also an experimental feature, so there is no release
note.

Release note: None
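
As a toy illustration of the reset pitfall fixed here (hypothetical types, not the actual colexec code): an operator embedding an init helper must override Reset to also reach its encapsulated children.

package main

type Resetter interface{ Reset() }

type twoInputInitHelper struct{}

// Reset only resets the two inputs; it knows nothing about operators
// encapsulated by whoever embeds it.
func (twoInputInitHelper) Reset() {}

type hashGroupJoiner struct {
	twoInputInitHelper
	hj, ha Resetter // encapsulated hash joiner and hash aggregator
}

// Reset overrides the embedded helper's method so that a
// spilled-and-reset hash group join starts from a clean state.
func (h *hashGroupJoiner) Reset() {
	h.twoInputInitHelper.Reset()
	h.hj.Reset()
	h.ha.Reset()
}

func main() {}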
This commit makes the creation of the disk-backed operator lazy: it
now happens when the diskSpiller spills to disk for the first time
throughout its lifetime. This allows us to avoid allocations in the
common case when we don't spill to disk.

This required adjustments to some tests that verify that the expected
closers are accumulated. In some cases we know exactly how many
closers will be added if the operator tree is forced to spill to disk
(external sort, external distinct, disk-backed hash group join),
whereas in others (external hash join and external hash aggregator)
this number is not easy to calculate. In order to have some coverage
for the latter case, this commit introduces some sanity checks to
ensure that at least in some test cases the number of added closers
seems reasonable.

Note that this change is not without its risks. In particular, the
disk-backed operator constructor must execute correctly when delayed.
The function captures references to miscellaneous state which, if
mutated, can lead to problems. One such problem was the capture of
`result.ColumnTypes` when creating the external distinct; this commit
fixes it by keeping a separate reference to the input schema (a toy
illustration of the hazard follows after this commit message). All
constructor functions have been audited to ensure that no captured
arguments are modified later. The constructors also capture some
other functions (most commonly around the monitor registry, the
closer registry, and the constructor for the disk-backed sort), but
those should be safe.

An additional complication is that the delayed constructor functions
can now run concurrently (if we have concurrency within the flow,
most likely due to the plan being distributed). To account for that,
the monitor and closer registries have been adjusted to be
concurrency-safe. Note that we could have made the mutex protection
optional (i.e. "enabled" it only when we create the disk spiller),
but for safety we decided to add unconditional protection. If mutex
operations show up in profiles, we'll adjust this.

Also, note that the disk-backed operator chain will now be excluded
from the output of EXPLAIN (VEC, VERBOSE). This seems ok.

Release note: None
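
As a toy illustration of the capture hazard mentioned above (not the actual code), compare capturing a mutable field with snapshotting it at planning time:

package main

import "fmt"

type result struct{ ColumnTypes []string }

func main() {
	r := &result{ColumnTypes: []string{"int"}}

	// Buggy: the closure captures r, so when invoked lazily at
	// execution time it observes later mutations of r.ColumnTypes.
	lazyBad := func() []string { return r.ColumnTypes }

	// Fixed: keep a separate reference (a copy) of the input schema.
	inputTypes := append([]string(nil), r.ColumnTypes...)
	lazyGood := func() []string { return inputTypes }

	// Planning continues and mutates the result's schema...
	r.ColumnTypes = append(r.ColumnTypes, "bytes")

	fmt.Println(lazyBad())  // [int bytes] - sees the mutation
	fmt.Println(lazyGood()) // [int] - the schema as of planning time
}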
@mgartner (Collaborator) left a comment

:lgtm_strong: Thanks for doing this!

Reviewed 9 of 32 files at r13, 6 of 16 files at r14, 5 of 8 files at r15, 1 of 1 files at r16, 20 of 20 files at r17, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball and @yuzefovich)


pkg/sql/colexec/colexecargs/closer_registry.go line 27 at r11 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I decided to remove this optionality and just make the registries concurrency-safe unconditionally. Mutex operations should take on the order of tens of nanoseconds, and during operator planning we only have a single goroutine, so there shouldn't be any contention on the hot path. In other words, let's make it safe first, and if we see the mutex overhead show up in profiles, then we'll optimize it.

SGTM 👍


pkg/sql/colexec/colexecargs/op_creation.go line 109 at r11 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

The mutexes can be different. Yeah, perhaps a boolean to indicate whether the concurrency protection is enabled would be cleaner. I put this thought aside for now (see previous comment).

👍

@mgartner (Collaborator) commented Jan 8, 2025

@yuzefovich Did you run any benchmarks with this PR? I can run some quick sysbench microbenchmarks if not.

@yuzefovich (Member, Author) left a comment

Yeah, I included some microbenchmarks in this comment above. I also just reran the benchmarks with the latest version, and the results show an improvement: https://gist.github.com/yuzefovich/c5b074f1804ddd2fc92470eeaddfd622.

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)

@craig craig bot merged commit c3eb6e5 into cockroachdb:master Jan 8, 2025
22 checks passed
@yuzefovich yuzefovich deleted the lazy-disk-spilling branch January 8, 2025 19:47
Labels
o-perf-efficiency Related to performance efficiency
Development

Successfully merging this pull request may close these issues.

sql: disk spiller responsible for 2% allocations in sysbench
3 participants