
colexecdisk: create disk-backed operators lazily in diskSpiller #137373

Merged (5 commits) Jan 8, 2025

Conversation

@yuzefovich (Member) commented Dec 13, 2024

This PR refactors the disk spilling machinery in the vectorized engine to create the disk-backed operator lazily, when spilling to disk actually occurs, rather than eagerly as is done right now. The main motivation is that in many cases we won't need the disk-backed operator at all, so this reduces allocations on the hot path. Some care had to be taken to ensure that the construction, which is a planning-time operation, works correctly when performed at execution time. See each commit for details.

Fixes: #137323.
Release note: None
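
For illustration, here is a minimal sketch of the lazy-construction pattern described above; all names (Operator, diskSpiller, diskBackedConstructor, etc.) are simplified stand-ins, not the actual colexec API:

package main

import "fmt"

type Operator interface{ Next() bool }

type noopOp struct{}

func (noopOp) Next() bool { return false }

type diskSpiller struct {
	// diskBackedConstructor is captured at planning time but only
	// invoked on the first spill, keeping allocations off the hot path.
	diskBackedConstructor func() Operator
	diskBackedOp          Operator // nil until the first spill
}

func (s *diskSpiller) spillToDisk() {
	if s.diskBackedOp == nil {
		// Lazily construct the disk-backed operator chain.
		s.diskBackedOp = s.diskBackedConstructor()
	}
}

func main() {
	s := &diskSpiller{diskBackedConstructor: func() Operator {
		fmt.Println("constructing disk-backed operator")
		return noopOp{}
	}}
	s.spillToDisk() // constructs on the first spill
	s.spillToDisk() // reuses the already-constructed operator
}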

@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich yuzefovich force-pushed the lazy-disk-spilling branch 7 times, most recently from 02ad865 to 5b552cc Compare December 14, 2024 00:13
@yuzefovich yuzefovich changed the title WIP on lazy disk-backed operator creation colexecdisk: create disk-backed operators lazily in diskSpiller Dec 14, 2024
@yuzefovich (Member, Author)

Here is the impact on BenchmarkExternalDistinct. As expected, when we don't spill we see a noticeable reduction in allocations:

name                                                                                                                old allocs/op  new allocs/op  delta
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.001/rows=1024/ordCols=0/type=bytes-24          150 ± 0%        66 ± 0%  -56.00%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.100/rows=1024/ordCols=0/type=bytes-24          150 ± 0%        66 ± 0%  -56.00%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.001/rows=1024/ordCols=0/type=int-24            154 ± 0%        70 ± 0%  -54.55%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.100/rows=1024/ordCols=0/type=int-24            168 ± 0%        84 ± 0%  -50.00%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.001/rows=65536/ordCols=0/type=bytes-24         657 ± 0%       573 ± 0%  -12.79%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.001/rows=65536/ordCols=0/type=int-24           677 ± 0%       593 ± 0%  -12.41%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.100/rows=65536/ordCols=0/type=bytes-24         681 ± 0%       597 ± 0%  -12.33%  (p=0.000 n=10+10)
ExternalDistinct/spilled=false/ordering=false/shuffled/newTupleProbability=0.100/rows=65536/ordCols=0/type=int-24           835 ± 0%       751 ± 0%  -10.06%  (p=0.000 n=10+10)

@yuzefovich yuzefovich requested review from DrewKimball and a team December 14, 2024 00:22
@yuzefovich yuzefovich marked this pull request as ready for review December 14, 2024 00:22
@yuzefovich (Member, Author)

Hm, we might need mutex protection now for the monitor and closer registries.

@yuzefovich yuzefovich force-pushed the lazy-disk-spilling branch 2 times, most recently from ef08126 to 023bd35 Compare December 19, 2024 22:09
@yuzefovich (Member, Author)

Added concurrency safety to both registries; should be RFAL.

@cockroachdb cockroachdb deleted a comment from blathers-crl bot Dec 19, 2024
blathers-crl (bot) commented Dec 19, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@mgartner mgartner added the o-perf-efficiency Related to performance efficiency label Dec 20, 2024
@mgartner (Collaborator) left a comment

Reviewed 14 of 14 files at r1, 30 of 30 files at r6, 16 of 16 files at r7, 8 of 8 files at r8, 1 of 1 files at r9, 20 of 21 files at r10, 1 of 1 files at r11, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @yuzefovich)


pkg/sql/colexec/colexecargs/monitor_registry.go line 33 at r11 (raw file):

// Note that Close and Reset are not protected.
func (r *MonitorRegistry) SetMutex(mu *syncutil.Mutex) {
	if r.optionalMu == nil {

I have the same concern here: if two operators spill to disk at the same time, this would be a race condition, wouldn't it?


pkg/sql/colexec/colexecargs/closer_registry.go line 27 at r11 (raw file):

// Note that Close and Reset are not protected.
func (r *CloserRegistry) SetMutex(mu *syncutil.Mutex) {
	if r.optionalMu == nil {

Couldn't we have a race if two goroutines simultaneously try to set this mutex?

@yuzefovich (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @mgartner)


pkg/sql/colexec/colexecargs/closer_registry.go line 27 at r11 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Couldn't we have a race if two goroutines simultaneously try to set this mutex?

SetMutex is called during operator planning, which is done in a single goroutine - it happens in makeConstructorConcurrencySafe, which is called whenever we create a disk-spilling operator. Only the diskBackedConstructor function can be called during execution, and by that point the mutex will have been set, protecting the registries.

I tried to deduplicate the code a bit with makeConstructorConcurrencySafe, but perhaps it makes things harder to reason about - should we just explicitly call SetMutex right before calling the disk spiller constructors?
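
As a rough sketch of the optional-mutex pattern under discussion (the CloserRegistry and optionalMu names follow the quoted snippets, but the method bodies are guesses rather than the actual implementation, and sync.Mutex stands in for syncutil.Mutex):

package main

import "sync"

type Closer interface{ Close() error }

type CloserRegistry struct {
	optionalMu *sync.Mutex // nil until SetMutex is called
	toClose    []Closer
}

// SetMutex is called once, during single-goroutine planning, so it
// needs no synchronization itself. Note that Close and Reset are not
// protected.
func (r *CloserRegistry) SetMutex(mu *sync.Mutex) {
	if r.optionalMu == nil {
		r.optionalMu = mu
	}
}

// AddCloser locks only when the registry was made concurrency-safe.
func (r *CloserRegistry) AddCloser(c Closer) {
	if r.optionalMu != nil {
		r.optionalMu.Lock()
		defer r.optionalMu.Unlock()
	}
	r.toClose = append(r.toClose, c)
}

func main() {
	var mu sync.Mutex
	r := &CloserRegistry{}
	r.SetMutex(&mu)  // planning time, single goroutine
	r.AddCloser(nil) // now safe from concurrent delayed constructors
}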

@mgartner (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @yuzefovich)


pkg/sql/colexec/colexecargs/closer_registry.go line 27 at r11 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

SetMutex is called during operator planning, which is done in a single goroutine - it happens in makeConstructorConcurrencySafe, which is called whenever we create a disk-spilling operator. Only the diskBackedConstructor function can be called during execution, and by that point the mutex will have been set, protecting the registries.

I tried to deduplicate the code a bit with makeConstructorConcurrencySafe, but perhaps it makes things harder to reason about - should we just explicitly call SetMutex right before calling the disk spiller constructors?

You mean before calling functions like NewOneInputDiskSpiller? I don't have a very strong opinion here, so up to you. I'm not sure makeConstructorConcurrencySafe adds much value, but either way seems potentially easy to misuse.

Do you think there are any valuable assertions we could add, e.g., is there a way that AddCloser could assert that the mutex has been set when it should be?


pkg/sql/colexec/colexecargs/op_creation.go line 109 at r11 (raw file):

func (r *NewColOperatorArgs) MakeConcurrencySafeForDiskBackedOp() {
	r.MonitorRegistry.SetMutex(r.RegistriesMu)
	r.CloserRegistry.SetMutex(r.RegistriesMu)

Do these need to be the same mutex? Would using a mutex embedded in the registries with a boolean indicating whether it should be used or not be simpler?

@yuzefovich yuzefovich force-pushed the lazy-disk-spilling branch 2 times, most recently from 58e6c94 to 1b615b6 Compare December 23, 2024 20:59
@yuzefovich yuzefovich requested a review from mgartner December 23, 2024 21:00
@yuzefovich (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @mgartner)


pkg/sql/colexec/colexecargs/closer_registry.go line 27 at r11 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

You mean before calling functions like NewOneInputDiskSpiller? I don't have a very strong opinion here, so up to you. I'm not sure makeConstructorConcurrencySafe adds much value, but either way seems potentially easy to misuse.

Do you think there are any valuable assertions we could add, e.g., is there a way that AddCloser could assert that the mutex has been set when it should be?

I decided to remove this optionality and just make the registries concurrency-safe unconditionally. Mutex operations should take on the order of tens of nanoseconds, and during operator planning we only have a single goroutine, so there shouldn't be any contention on the hot path. In other words, let's make it safe first, and if we see the mutex overhead show up in profiles, then we'll optimize it.
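
For illustration, the unconditional version might look like this minimal sketch (a hypothetical simplification reusing the Closer type from the earlier sketch, not the actual code):

type CloserRegistry struct {
	mu      sync.Mutex
	toClose []Closer
}

// AddCloser always locks: planning is single-threaded, so the
// uncontended lock cost (tens of nanoseconds) is negligible there,
// while lazily constructed operators in concurrent flows stay safe.
func (r *CloserRegistry) AddCloser(c Closer) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.toClose = append(r.toClose, c)
}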


pkg/sql/colexec/colexecargs/op_creation.go line 109 at r11 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Do these need to be the same mutex? Would using a mutex embedded in the registries with a boolean indicating whether it should be used or not be simpler?

The mutexes can be different. Yeah, perhaps a boolean to indicate whether the concurrency protection is enabled would be cleaner. I put this thought aside for now (see previous comment).

@yuzefovich yuzefovich force-pushed the lazy-disk-spilling branch 2 times, most recently from 3887d35 to b132f0e Compare December 23, 2024 22:33
Whenever we have the monitor registry in scope, we can just leave this
field unset, and an account will be created lazily via the monitor
registry. This allows for some code cleanup but also fixes an issue in
`verifyColOperator` where a memory account would be closed before those
owned by the monitor registry (which would become problematic in the
following commit, where closers are closed right before the monitor
registry cleanup yet after this single account's closure).

Release note: None
This commit introduces `colexecargs.CloserRegistry`, which is now
responsible for tracking all `colexecop.Closer`s we create during
vectorized planning. Previously, we would accumulate them in the
`NewColOperatorResult.ToClose` slice to be appended to the `closers`
slice in the vectorized flow. However, the following commit will make
the creation of some closers lazy, meaning that the captured
`result.ToClose` slice might have incomplete information; this commit
works around that issue.

This commit should be a no-op change as of right now.

Release note: None
The change in the previous commit made sure that we now close all
closers in the tests (if the closer registry is specified), so we
can clean up some logic around that in the tests. The main idea is
that rather than relying on the RunTests harness to close the
necessary operators, each test will now be responsible for doing so
if it might create a closer.

Release note: None
Previously, the hash group joiner inherited the reset behavior from
the `TwoInputInitHelper` that it embeds, which could leave the hash
joiner and the hash aggregator encapsulated in the hash group joiner
in a non-reset state. As a result, if the external hash group joiner
spilled to disk, we could get incorrect results. This is now fixed; a
toy sketch of the pattern follows below. There is no explicit test
added for this because the change in the following commit (where
`ChildCount` on the disk spiller now returns one less) will exercise
this code path (via the "reusing after reset" part of the RunTests
harness). This is also an experimental feature, so there is no release
note.

Release note: None
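
As a toy illustration of the reset pitfall fixed here (hypothetical types, not the actual colexec code): an operator embedding an init helper must override Reset to also reach its encapsulated children.

package main

type Resetter interface{ Reset() }

type twoInputInitHelper struct{}

// Reset only resets the two inputs; it knows nothing about operators
// encapsulated by whoever embeds it.
func (twoInputInitHelper) Reset() {}

type hashGroupJoiner struct {
	twoInputInitHelper
	hj, ha Resetter // encapsulated hash joiner and hash aggregator
}

// Reset overrides the embedded helper's method so that a
// spilled-and-reset hash group join starts from a clean state.
func (h *hashGroupJoiner) Reset() {
	h.twoInputInitHelper.Reset()
	h.hj.Reset()
	h.ha.Reset()
}

func main() {}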
This commit makes the creation of the disk-backed operator lazy: it
now happens when the diskSpiller spills to disk for the first time
throughout its lifetime. This allows us to avoid allocations in the
common case when we don't spill to disk.

This required adjustments to some tests that verify that the expected
closers are accumulated. In some cases we know exactly how many
closers will be added if the operator tree is forced to spill to disk
(external sort, external distinct, disk-backed hash group join),
whereas in others (external hash join and external hash aggregator)
this number is not easy to calculate. In order to have some coverage
for the latter case, this commit introduces some sanity checks to
ensure that at least in some test cases the number of added closers
seems reasonable.

Note that this change is not without its risks. In particular, the
disk-backed operator constructor must execute correctly when delayed.
The function captures references to miscellaneous state which, if
mutated, can lead to problems. One such problem was the capture of
`result.ColumnTypes` when creating the external distinct; this commit
fixes it by keeping a separate reference to the input schema (a toy
illustration of the hazard follows after this commit message). All
constructor functions have been audited to ensure that no captured
arguments are modified later. The constructors also capture some
other functions (most commonly around the monitor registry, the
closer registry, and the constructor for the disk-backed sort), but
those should be safe.

An additional complication is that the delayed constructor functions
can now run concurrently (if we have concurrency within the flow,
most likely due to the plan being distributed). To account for that,
the monitor and closer registries have been adjusted to be
concurrency-safe. Note that we could have made the mutex protection
optional (i.e. "enabled" it only when we create the disk spiller),
but for safety we decided to add unconditional protection. If mutex
operations show up in profiles, we'll adjust this.

Also, note that the disk-backed operator chain will now be excluded
from the output of EXPLAIN (VEC, VERBOSE). This seems ok.

Release note: None
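
As a toy illustration of the capture hazard mentioned above (not the actual code), compare capturing a mutable field with snapshotting it at planning time:

package main

import "fmt"

type result struct{ ColumnTypes []string }

func main() {
	r := &result{ColumnTypes: []string{"int"}}

	// Buggy: the closure captures r, so when invoked lazily at
	// execution time it observes later mutations of r.ColumnTypes.
	lazyBad := func() []string { return r.ColumnTypes }

	// Fixed: keep a separate reference (a copy) of the input schema.
	inputTypes := append([]string(nil), r.ColumnTypes...)
	lazyGood := func() []string { return inputTypes }

	// Planning continues and mutates the result's schema...
	r.ColumnTypes = append(r.ColumnTypes, "bytes")

	fmt.Println(lazyBad())  // [int bytes] - sees the mutation
	fmt.Println(lazyGood()) // [int] - the schema as of planning time
}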
@mgartner (Collaborator) left a comment

:lgtm_strong: Thanks for doing this!

Reviewed 9 of 32 files at r13, 6 of 16 files at r14, 5 of 8 files at r15, 1 of 1 files at r16, 20 of 20 files at r17, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball and @yuzefovich)


pkg/sql/colexec/colexecargs/closer_registry.go line 27 at r11 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I decided to remove this optionality and just make the registries concurrency-safe unconditionally. Mutex operations should take on the order of tens of nanoseconds, and during operator planning we only have a single goroutine, so there shouldn't be any contention on the hot path. In other words, let's make it safe first, and if we see the mutex overhead show up in profiles, then we'll optimize it.

SGTM 👍


pkg/sql/colexec/colexecargs/op_creation.go line 109 at r11 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

The mutexes can be different. Yeah, perhaps a boolean to indicate whether the concurrency protection is enabled would be cleaner. I put this thought aside for now (see previous comment).

👍

@mgartner (Collaborator) commented Jan 8, 2025

@yuzefovich Did you run any benchmarks with this PR? I can run some quick sysbench microbenchmarks if not.

@yuzefovich (Member, Author) left a comment

Yeah, I included some microbenchmarks in this comment above. I also just reran the benchmarks with the latest version, and the results show an improvement: https://gist.github.com/yuzefovich/c5b074f1804ddd2fc92470eeaddfd622.

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)

@craig craig bot merged commit c3eb6e5 into cockroachdb:master Jan 8, 2025
22 checks passed
@yuzefovich yuzefovich deleted the lazy-disk-spilling branch January 8, 2025 19:47
Labels
o-perf-efficiency Related to performance efficiency
Development

Successfully merging this pull request may close these issues.

sql: disk spiller responsible for 2% allocations in sysbench
3 participants