db: mergingIter.SeekPrefixGE stops early if prefix cannot match #1085
Conversation
Ignore the first commit, which is from #1079.
I'll have to take a closer look at the testing tomorrow, but this seems relatively small and safe. I like it.
Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @sumeerbhola)
merging_iter.go, line 637 at r2 (raw file):
        return nil, nil
    }
}
This is a lot smaller than I expected.
Do we have sufficient test coverage already that a mistake in this optimization would cause a failure? I suspect yes, but just want to verify.
Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @sumeerbhola)
merging_iter.go, line 637 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This is a lot smaller than I expected.
As an alternative, could you have performed this optimization down in sstable.singleLevelIterator? In particular, a call to singleLevelIterator.SeekPrefixGE could save away the prefix as mergingIter does, and return false from either SeekPrefixGE or a subsequent Next when the key no longer has the desired prefix. I haven't thought through whether it is more desirable to do this in mergingIter vs singleLevelIterator.
Force-pushed from 5b352c2 to 4c0a523
This is motivated by the CockroachDB issues discussed in cockroachdb#1070, where range tombstones in L6 cause the iterator to go through all the deleted data. The situation is even worse in that each successive SeekPrefixGE in a batch request (which is in sorted order) will typically have to start all over again, since the file at which the seek finally ended its work is probably later than the file which contains the relevant key.

Note that CockroachDB does not set bounds when using SeekPrefixGE because the assumption is that the underlying Pebble implementation can imply bounds, and we don't want to change this behavior in CockroachDB. Since CockroachDB primarily uses range tombstones when deleting CockroachDB ranges, the SeekPrefixGE calls that were slow were probably on a preceding CockroachDB range, so stopping early, as done in this PR, should fix the issue. If range tombstones were used within the CockroachDB range that is being read using SeekPrefixGE, the seek could land on a deleted key and will need to iterate through all its MVCC versions. Even in that case, the work is bounded by the number of deleted versions of a single MVCC key.

The code stops early in prefix iteration mode, both for SeekPrefixGE and Next. BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the existing problem when with-tombstone=true.
name                                                              old time/op   new time/op   delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16    446ns ± 0%    433ns ± 0%     -2.91%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16     10.3ms ± 0%   0.0ms ± 0%    -99.99%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16    416ns ± 0%    429ns ± 0%     +3.12%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16     10.6ms ± 0%   0.0ms ± 0%    -99.99%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16    414ns ± 0%    437ns ± 0%     +5.56%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16     10.5ms ± 0%   0.0ms ± 0%    -99.99%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16               1.65µs ± 0%   1.75µs ± 0%    +5.75%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                463ns ± 0%    459ns ± 0%     -0.86%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16               1.61µs ± 0%   1.67µs ± 0%    +3.73%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                476ns ± 0%    475ns ± 0%     -0.21%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16               1.62µs ± 0%   1.77µs ± 0%    +9.26%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                513ns ± 0%    525ns ± 0%     +2.34%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16               1.71µs ± 0%   1.84µs ± 0%    +7.65%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                1.10µs ± 0%   1.16µs ± 0%    +5.27%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16              1.80µs ± 0%   1.86µs ± 0%    +2.99%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16               1.34µs ± 0%   1.20µs ± 0%   -10.23%

name                                                              old alloc/op  new alloc/op  delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16    0.00B         0.00B           0.00%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16     300B ± 0%     0B           -100.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16    0.00B         0.00B           0.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16     300B ± 0%     0B           -100.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16    0.00B         0.00B           0.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16     292B ± 0%     0B           -100.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16               0.00B         0.00B           0.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                0.00B         0.00B           0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16               0.00B         0.00B           0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                0.00B         0.00B           0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16               0.00B         0.00B           0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                0.00B         0.00B           0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16               0.00B         0.00B           0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                0.00B         0.00B           0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16              0.00B         0.00B           0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16               0.00B         0.00B           0.00%

Informs cockroachdb#1070
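The early-stop behavior described in the commit message can be sketched as follows. This is a hypothetical, heavily simplified toy (the names split, prefixIter, and the '#'-delimited key format are illustrative assumptions, not Pebble's actual code): in prefix iteration mode the iterator remembers the sought prefix, and both the seek and subsequent Next calls return nil as soon as the current key's prefix no longer matches, instead of scanning onward through deleted data.

```go
package main

import (
	"bytes"
	"fmt"
)

// split returns the length of the user-key prefix, loosely mimicking a
// Comparer.Split function (here: everything before the first '#').
// This key format is an assumption for illustration only.
func split(key []byte) int {
	if i := bytes.IndexByte(key, '#'); i >= 0 {
		return i
	}
	return len(key)
}

// prefixIter is a toy iterator over a sorted slice of keys.
type prefixIter struct {
	keys   [][]byte
	pos    int
	prefix []byte // non-nil while in prefix iteration mode
}

// SeekPrefixGE saves the prefix and positions at the first key >= key,
// returning nil if that key cannot match the prefix.
func (it *prefixIter) SeekPrefixGE(prefix, key []byte) []byte {
	it.prefix = prefix
	for it.pos = 0; it.pos < len(it.keys); it.pos++ {
		if bytes.Compare(it.keys[it.pos], key) >= 0 {
			break
		}
	}
	return it.current()
}

// Next advances, again stopping early once the prefix cannot match.
func (it *prefixIter) Next() []byte {
	it.pos++
	return it.current()
}

func (it *prefixIter) current() []byte {
	if it.pos >= len(it.keys) {
		return nil
	}
	k := it.keys[it.pos]
	// Early stop: in prefix mode, a key with a different prefix means no
	// later key can match either, so return nil without scanning onward.
	if it.prefix != nil && !bytes.Equal(k[:split(k)], it.prefix) {
		return nil
	}
	return k
}

func main() {
	it := &prefixIter{keys: [][]byte{
		[]byte("a#1"), []byte("a#2"), []byte("b#1"), []byte("c#1"),
	}}
	fmt.Printf("%s\n", it.SeekPrefixGE([]byte("a"), []byte("a"))) // a#1
	fmt.Printf("%s\n", it.Next())                                 // a#2
	fmt.Println(it.Next() == nil)                                 // true: prefix "a" exhausted
}
```

In the real merging iterator the same check must be interleaved with merging across levels and skipping range-tombstoned keys, but the core idea is the single prefix comparison above.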
Force-pushed from 4c0a523 to 3045305
TFTR!
Do we have sufficient test coverage already that a mistake in this optimization would cause a failure? I suspect yes, but just want to verify.
I've added a test case to merging_iter_test.go
. And metamorphic testing will exercise this.
Also, I've removed the amortization logic since it was a premature optimization, and added a comment there noting that we could amortize later, taking into account the compares we are already incurring (which is why I removed the logic).
Reviewable status: 0 of 6 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @petermattis)
merging_iter.go, line 637 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
As an alternative, could you have performed this optimization down in sstable.singleLevelIterator? In particular, a call to singleLevelIterator.SeekPrefixGE could save away the prefix as mergingIter does, and return false from either SeekPrefixGE or a subsequent Next when the key no longer has the desired prefix. I haven't thought through whether it is more desirable to do this in mergingIter vs singleLevelIterator.
This is definitely debatable.
Theoretically, one could construct a large memtable with a similar pattern of sets which are range deleted and incur the cost of iterating over 64MB of data for each SeekPrefixGE, though it is unlikely to occur in practice.
In general, I've been making iterator optimizations at the highest layer in the iterator stack as possible, since it reduces the places to touch.
- The next optimizations ended up at the lowest layer because those iterators know whether they are close enough for it to be worthwhile (e.g. same sstable block), and know about bloom filter matching -- this is true for both the SeekPrefixGE and SeekGE/LT with bounds optimizations (the former passes trySeekUsingNext up from the Iterator as a hint only, and the optimization is done at the lowest layers; I should have done the same for the latter by adding a parameter to SetBounds instead of making each InternalIterator do bounds comparisons).
- The noop optimization for SeekGE/SeekLT was feasible to do in Iterator.
- For this change mergingIter was the highest level where it was possible.
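The trySeekUsingNext hint mentioned above can be illustrated with a toy sketch (hypothetical names and signatures; not Pebble's actual interfaces): the caller promises that the seek target is at or after the current position, so the low-level iterator first tries stepping forward a few keys before paying for a full re-seek.

```go
package main

import (
	"fmt"
	"sort"
)

// lowLevelIter is a toy stand-in for the lowest iterator layer, which
// is the one that knows whether the target is "close enough" (e.g. in
// the same block) for Next-ing to beat a full seek.
type lowLevelIter struct {
	keys  []string
	pos   int
	seeks int // full (binary-search) seeks performed, for illustration
}

// SeekGE positions at the first key >= target. When trySeekUsingNext is
// true, the caller guarantees target is >= the current position, so we
// try a few forward steps before falling back to a full seek.
func (it *lowLevelIter) SeekGE(target string, trySeekUsingNext bool) string {
	if trySeekUsingNext {
		for steps := 0; steps < 4 && it.pos < len(it.keys); steps++ {
			if it.keys[it.pos] >= target {
				return it.keys[it.pos]
			}
			it.pos++
		}
	}
	// Fall back to a full seek (binary search).
	it.seeks++
	it.pos = sort.SearchStrings(it.keys, target)
	if it.pos >= len(it.keys) {
		return ""
	}
	return it.keys[it.pos]
}

func main() {
	it := &lowLevelIter{keys: []string{"a", "b", "c", "d", "e", "f"}}
	fmt.Println(it.SeekGE("a", false)) // a (full seek)
	fmt.Println(it.SeekGE("b", true))  // b (satisfied by Next-ing)
	fmt.Println(it.SeekGE("c", true))  // c (satisfied by Next-ing)
	fmt.Println("full seeks:", it.seeks)
}
```

The point of the design discussion above is that only this bottom layer can make the "close enough" judgment cheaply, which is why the hint flows down from Iterator rather than being acted on at the top.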
I'm really glad we didn't need to do anything more invasive than this.
Reviewed 1 of 5 files at r2, 4 of 14 files at r3.
Reviewable status: 2 of 6 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @petermattis)
Reviewable status: 2 of 6 files reviewed, all discussions resolved (waiting on @itsbilal)
merging_iter.go, line 637 at r2 (raw file):
Previously, sumeerbhola wrote…
This is definitely debatable.
Theoretically, one could construct a large memtable with a similar pattern of sets which are range deleted and incur the cost of iterating over 64MB of data for each SeekPrefixGE, though it is unlikely to occur in practice.
In general, I've been making iterator optimizations at the highest layer in the iterator stack as possible, since it reduces the places to touch.
- The next optimizations ended up at the lowest layer because those iterators know whether they are close enough for it to be worthwhile (e.g. same sstable block), and know about bloom filter matching -- this is true for both the SeekPrefixGE and SeekGE/LT with bounds optimizations (the former passes trySeekUsingNext up from the Iterator as a hint only, and the optimization is done at the lowest layers; I should have done the same for the latter by adding a parameter to SetBounds instead of making each InternalIterator do bounds comparisons).
- The noop optimization for SeekGE/SeekLT was feasible to do in Iterator.
- For this change mergingIter was the highest level where it was possible.
Ok, the approach here SGTM. Thanks for laying out your thinking.
TFTRs!
Reviewable status: 2 of 6 files reviewed, all discussions resolved (waiting on @itsbilal)