Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: relative positioning through broad range tombstones is slow #1070

Closed
jbowens opened this issue Feb 25, 2021 · 20 comments
Closed

perf: relative positioning through broad range tombstones is slow #1070

jbowens opened this issue Feb 25, 2021 · 20 comments

Comments

@jbowens
Copy link
Collaborator

jbowens commented Feb 25, 2021

In cockroachlabs/support#846, it looks like high CPU usage was caused by a SeekGE for a key just before a range tombstone's start key. If a SeekGE key is just less than a tombstone's start key, and there are no keys between the seek key the tombstone start key, SeekGE must Next through the entire range tombstone.

We have a couple longer term ideas that would have helped here:

  • perf: split L6 data into separate obsolete, live files #847 perf: split L6 data into separate obsolete, live files - Segmenting obsolete data into separate files would have allowed the SeekGE at a recent sequence number to ignore all the obsolete keys covered by the range tombstone.
  • NewSnapshot key range(s) - I couldn't find where this idea was documented, but it's the idea of specifying key ranges to NewSnapshot so that a snapshot only applies to keys within the range. This would've allowed Pebble to drop the data covered by the range tombstone, assuming the snapshot did not actually care about the data beneath the tombstone. I think this idea is tricky mostly in the interface exposed to CockroachDB, because CRDB should only be able to read snapshotted data when reading from a snapshot with a range.

In the immediate term, it seems like it should be possible to add some sort noninvasive optimization:

  • When we Next into the range tombstone, we're nexting outside the replica's keyspace. I'm not sure if iterator used by mvccPutUsingIter currently has an upper bound set on its iterator, but if not, we could set one. Then whenever we encounter a new tombstone during findNextEntry, we could compare it to the iterator upper bound and return early.
  • When we load a sstable that is wholly covered by a range tombstone (either within that file or above), can we somehow detect that and skip it? @sumeerbhola pointed out that the range tombstone would be a part of the file's metadata key range, which makes it hard to tell if the range tombstone is more recent than every other key contained within the file. This also doesn't help with files that contain live and obsolete data.
  • If we detect that we're nexting many times through a range tombstone, can we seek the iterators lower than the tombstone in the LSM to the tombstone's end key?
@petermattis
Copy link
Collaborator

In cockroachlabs/support#846, it looks like high CPU usage was caused by a SeekGE for a key just before a range tombstone's start key. If a SeekGE key is just less than a tombstone's start key, and there are no keys between the seek key the tombstone start key, SeekGE must Next through the entire range tombstone.

I think this nexting only occurs at the same level the range tombstone is located at. Lower levels will skip right over the range tombstone data.

When we load a sstable that is wholly covered by a range tombstone (either within that file or above), can we somehow detect that and skip it?

I think this is already done for sstables that are covered by range tombstones at a higher level by how mergingIter will seek lower iterators when a range tombstone covers them at a higher level.

If we detect that we're nexting many times through a range tombstone, can we seek the iterators lower than the tombstone in the LSM to the tombstone's end key?

See above. Isn't this already done by mergingIter?

@jbowens
Copy link
Collaborator Author

jbowens commented Feb 25, 2021

I think this is already done for sstables that are covered by range tombstones at a higher level by how mergingIter will seek lower iterators when a range tombstone covers them at a higher level.

Ah, maybe this is specifically an issue for sstables that contain a range tombstone covering their own contents. I think this might be the common case, because we prioritize compaction of these range tombstones which tends to move the tombstone into the same level as the data itself.

In cockroachlabs/support#846, the elision-only compaction right as the CPU became unpegged suggests that at least one tombstone was within the same file as the keys it covered.

I'm going to try to produce a test case that exhibits this behavior and see what I find.

@petermattis
Copy link
Collaborator

Ah, maybe this is specifically an issue for sstables that contain a range tombstone covering their own contents. I think this might be the common case, because we prioritize compaction of these range tombstones which tends to move the tombstone into the same level as the data itself.

Yeah, this sounds like it could be a problem. Not sure how to address it as point operations at the same seqnum as a range tombstone are specified to take precedence (this is needed for ingestion). So we can't easily tell that a range tombstone in an sstable covers all of the point operations in an sstable given the info we currently have. If we kept track of the largest point operation seqnum in FileMetadata we could.

jbowens added a commit to jbowens/pebble that referenced this issue Feb 25, 2021
Amend BenchmarkRangeDelIterate to include benchmarks that iterate
through keys that are deleted by a range tombstone within the same
SSTable.

See cockroachdb#1070.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
BenchmarkRangeDelIterate
BenchmarkRangeDelIterate/entries=10
BenchmarkRangeDelIterate/entries=10/deleted=10
BenchmarkRangeDelIterate/entries=10/deleted=10/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=10/deleted=10/snapshotAndCompact=false-24         	 1868349	       636 ns/op
BenchmarkRangeDelIterate/entries=10/deleted=10/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=10/deleted=10/snapshotAndCompact=true-24          	  403632	      2945 ns/op
BenchmarkRangeDelIterate/entries=10/deleted=9
BenchmarkRangeDelIterate/entries=10/deleted=9/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=10/deleted=9/snapshotAndCompact=false-24          	  647324	      1855 ns/op
BenchmarkRangeDelIterate/entries=10/deleted=9/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=10/deleted=9/snapshotAndCompact=true-24           	  425908	      2793 ns/op
BenchmarkRangeDelIterate/entries=1000
BenchmarkRangeDelIterate/entries=1000/deleted=1000
BenchmarkRangeDelIterate/entries=1000/deleted=1000/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=1000/deleted=1000/snapshotAndCompact=false-24     	 1874809	       642 ns/op
BenchmarkRangeDelIterate/entries=1000/deleted=1000/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=1000/deleted=1000/snapshotAndCompact=true-24      	   10000	    106763 ns/op
BenchmarkRangeDelIterate/entries=1000/deleted=999
BenchmarkRangeDelIterate/entries=1000/deleted=999/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=1000/deleted=999/snapshotAndCompact=false-24      	  613291	      1895 ns/op
BenchmarkRangeDelIterate/entries=1000/deleted=999/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=1000/deleted=999/snapshotAndCompact=true-24       	   10000	    106169 ns/op
BenchmarkRangeDelIterate/entries=100000
BenchmarkRangeDelIterate/entries=100000/deleted=100000
BenchmarkRangeDelIterate/entries=100000/deleted=100000/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=100000/deleted=100000/snapshotAndCompact=false-24 	 1865816	       640 ns/op
BenchmarkRangeDelIterate/entries=100000/deleted=100000/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=100000/deleted=100000/snapshotAndCompact=true-24  	     106	  10928184 ns/op
BenchmarkRangeDelIterate/entries=100000/deleted=99999
BenchmarkRangeDelIterate/entries=100000/deleted=99999/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=100000/deleted=99999/snapshotAndCompact=false-24  	  614889	      1937 ns/op
BenchmarkRangeDelIterate/entries=100000/deleted=99999/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=100000/deleted=99999/snapshotAndCompact=true-24   	     105	  10958571 ns/op
```
jbowens added a commit that referenced this issue Feb 25, 2021
Amend BenchmarkRangeDelIterate to include benchmarks that iterate
through keys that are deleted by a range tombstone within the same
SSTable.

See #1070.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
BenchmarkRangeDelIterate
BenchmarkRangeDelIterate/entries=10
BenchmarkRangeDelIterate/entries=10/deleted=10
BenchmarkRangeDelIterate/entries=10/deleted=10/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=10/deleted=10/snapshotAndCompact=false-24         	 1868349	       636 ns/op
BenchmarkRangeDelIterate/entries=10/deleted=10/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=10/deleted=10/snapshotAndCompact=true-24          	  403632	      2945 ns/op
BenchmarkRangeDelIterate/entries=10/deleted=9
BenchmarkRangeDelIterate/entries=10/deleted=9/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=10/deleted=9/snapshotAndCompact=false-24          	  647324	      1855 ns/op
BenchmarkRangeDelIterate/entries=10/deleted=9/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=10/deleted=9/snapshotAndCompact=true-24           	  425908	      2793 ns/op
BenchmarkRangeDelIterate/entries=1000
BenchmarkRangeDelIterate/entries=1000/deleted=1000
BenchmarkRangeDelIterate/entries=1000/deleted=1000/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=1000/deleted=1000/snapshotAndCompact=false-24     	 1874809	       642 ns/op
BenchmarkRangeDelIterate/entries=1000/deleted=1000/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=1000/deleted=1000/snapshotAndCompact=true-24      	   10000	    106763 ns/op
BenchmarkRangeDelIterate/entries=1000/deleted=999
BenchmarkRangeDelIterate/entries=1000/deleted=999/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=1000/deleted=999/snapshotAndCompact=false-24      	  613291	      1895 ns/op
BenchmarkRangeDelIterate/entries=1000/deleted=999/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=1000/deleted=999/snapshotAndCompact=true-24       	   10000	    106169 ns/op
BenchmarkRangeDelIterate/entries=100000
BenchmarkRangeDelIterate/entries=100000/deleted=100000
BenchmarkRangeDelIterate/entries=100000/deleted=100000/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=100000/deleted=100000/snapshotAndCompact=false-24 	 1865816	       640 ns/op
BenchmarkRangeDelIterate/entries=100000/deleted=100000/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=100000/deleted=100000/snapshotAndCompact=true-24  	     106	  10928184 ns/op
BenchmarkRangeDelIterate/entries=100000/deleted=99999
BenchmarkRangeDelIterate/entries=100000/deleted=99999/snapshotAndCompact=false
BenchmarkRangeDelIterate/entries=100000/deleted=99999/snapshotAndCompact=false-24  	  614889	      1937 ns/op
BenchmarkRangeDelIterate/entries=100000/deleted=99999/snapshotAndCompact=true
BenchmarkRangeDelIterate/entries=100000/deleted=99999/snapshotAndCompact=true-24   	     105	  10958571 ns/op
```
@jbowens
Copy link
Collaborator Author

jbowens commented Feb 26, 2021

Not sure how to address it as point operations at the same seqnum as a range tombstone are specified to take precedence (this is needed for ingestion). So we can't easily tell that a range tombstone in an sstable covers all of the point operations in an sstable given the info we currently have. If we kept track of the largest point operation seqnum in FileMetadata we could.

Yeah, tracking the largest point seqnum, storing it in sstable properties and pulling it into FileMetadata.TableStats seems relatively straightforward, and I think it's noninvasive enough that we could backport it to 20.2.

But I also worry about files that contain both live and obsolete data. Most of a file might be covered by its tombstone, but there could be a few point operations more recent than the tombstone. I think in CockroachDB this would happen if an sstable straddles a range boundary (and one of the ranges was deleted) or a table boundary (and one of the tables was dropped). In both of these scenarios, it's unlikely that the more recent point operations overlap the tombstone's key range.

One idea is to construct an O(# range tombstones) bitmap that indicates whether each tombstone's keyspace covers exclusively keys at lower sequence numbers. If a read's sequence number is >= the tombstone's and the corresponding bit is set, the sstable iterator can skip over the tombstone's key space. We could stick this bitmap in a table property for backwards compatibility. I haven't thought through how exactly this would shake out in code.

Something like #847 seems like a better long-term solution, but the problem is big enough that I think it warrants a backport fix.

@petermattis
Copy link
Collaborator

One idea is to construct an O(# range tombstones) bitmap that indicates whether each tombstone's keyspace covers exclusively keys at lower sequence numbers. If a read's sequence number is >= the tombstone's and the corresponding bit is set, the sstable iterator can skip over the tombstone's key space. We could stick this bitmap in a table property for backwards compatibility. I haven't thought through how exactly this would shake out in code.

How would you index into this bitmap? Range tombstones don't have an index. Would you just use the implicit order of how they occur in the range-del block?

How would you build up this bitmap? I think something would have to be added to rangedel.Fragmenter, but this feels like it would get complex. Perhaps you had another idea in mind that would be simple and bulletproof.

@sumeerbhola
Copy link
Collaborator

I'm not sure if iterator used by mvccPutUsingIter currently has an upper bound set on its iterator, but if not, we could set one. Then whenever we encounter a new tombstone during findNextEntry, we could compare it to the iterator upper bound and return early.

I believe we are not setting bounds in this mvccPut path since it is using SeekPrefixGE. I think the prefix value is currently only used in two places: in Iterator for strictly checking that only keys consistent with that prefix are returned, and down in the sstable iterators for bloom filter checking. We could start using it in mergingIter too, but will need to be careful wrt what state we leave it in when returning nil -- I suspect it will be straightforward when restricted to only optimizing SeekPrefixGE since Next and Prev will not subsequently be called (former because the mergingIter is already done, and latter since it is not permitted with prefix iteration). Adding an explicit upper bound could be an option, but the bound would need to be changed before successive SeekPrefixGE, which will disable the current optimization to use Next for such successive SeekPrefixGE call.

@sumeerbhola
Copy link
Collaborator

btw, do we know how long the longest lived snapshots are in CockroachDB? I am wondering whether in the CockroachDB context we can sidestep this issue by using an Iterator instead of a snapshot, and not worry about the extra space accumulation due to not being able to delete sstables.
We can now clone iterators to ensure that they see identical state, in case we need multiple iterators over the "snapshot".

@petermattis
Copy link
Collaborator

btw, do we know how long the longest lived snapshots are in CockroachDB? I am wondering whether in the CockroachDB context we can sidestep this issue by using an Iterator instead of a snapshot, and not worry about the extra space accumulation due to not being able to delete sstables.

However long it takes to send a replica snapshot across the network. I'd estimate on the order of a minute or so.

We can now clone iterators to ensure that they see identical state, in case we need multiple iterators over the "snapshot".

This is an interesting idea worth exploring. I looked in the past at how hard it would be to get rid of snapshots and the need for multiple concurrent iterators was the stumbling block. The downside to this approach is that it could cause excessive memory usage because the iterator will maintain a reference to the active memtables.

@sumeerbhola
Copy link
Collaborator

The downside to this approach is that it could cause excessive memory usage because the iterator will maintain a reference to the active memtables.

I overlooked the memtables. I think that kills that idea. It is bad feedback loop for slow snapshot sending to cause OOMs, further slowing down the cluster.

@petermattis
Copy link
Collaborator

I overlooked the memtables. I think that kills that idea. It is bad feedback loop for slow snapshot sending to cause OOMs, further slowing down the cluster.

The memtable problem may not be a killer. The iterators returned by Snapshot.NewIter also pin whatever memtables are currently in existence at the time of the call. The thing to figure out here is whether we hold the snapshot open for significantly longer than the individual iterators created on the snapshot.

@jbowens
Copy link
Collaborator Author

jbowens commented Mar 1, 2021

I think something would have to be added to rangedel.Fragmenter, but this feels like it would get complex. Perhaps you had another idea in mind that would be simple and bulletproof.

Oof, yeah, I overlooked the fragmenter. The only other idea that I have is collapsing that bitmap into a single flag that when set indicates all of the sstable's range tombstones are more recent than any point key contained within their key spans.

I think this still requires coordination from the range deletion fragmenter, but not as much. Whenever a key is added to an output sstable and the range deletion fragmenter is non-empty, we would need to check if the key is more recent and within the key range of any of the pending range deletions.

@sumeerbhola
Copy link
Collaborator

collapsing that bitmap into a single flag that when set indicates all of the sstable's range tombstones are more recent than any point key contained within their key spans.

I'm wary of this given that other nearby (CRDB) ranges will have writes continuing to happen after the deleted range has its range tombstone written to the memtable. It seems likely that some of those will find their way into the same sstable. However given our sparing use of range tombstones in CRDB it may be ok to claim for an sstable that: all point keys in the sstable that are within any range tombstone in that sstable are less recent than that tombstone. This would allow us to encode a single bit for the sstable.
If we wanted to do the bitmap with one bit per fragmented tombstone, we could encode it in the range tombstone itself. There are a large number of unused InternalKeyKinds and we could make one for InternalKeyKindRangeDeleteCoversAllInSStable. The read path would not distinguish between both kinds except when mergingIterLevel.rangeDelIter is non-nil, and would make the latter kind appear as a normal InternalKeyKindRangeDelete.

@petermattis
Copy link
Collaborator

If we wanted to do the bitmap with one bit per fragmented tombstone, we could encode it in the range tombstone itself. There are a large number of unused InternalKeyKinds and we could make one for InternalKeyKindRangeDeleteCoversAllInSStable. The read path would not distinguish between both kinds except when mergingIterLevel.rangeDelIter is non-nil, and would make the latter kind appear as a normal InternalKeyKindRangeDelete.

I had a similar thought to this. One challenge here is doing something that is backwards compatible. Keeping the bitmap separate avoids the backwards compatibility issue and we likely have few enough range tombstones in an sstable to make storing the bitmap in a property feasible.

@jbowens
Copy link
Collaborator Author

jbowens commented Mar 1, 2021

I'm wary of this given that other nearby (CRDB) ranges will have writes continuing to happen after the deleted range has its range tombstone written to the memtable. It seems likely that some of those will find their way into the same sstable. However given our sparing use of range tombstones in CRDB it may be ok to claim for an sstable that: all point keys in the sstable that are within any range tombstone in that sstable are less recent than that tombstone. This would allow us to encode a single bit for the sstable.

Sorry, I think it was unclear from my comment, but I think we're saying the same thing. The flag would still be set if there are more recent point tombstones that do not overlap with any range tombstone's span. Whenever we're adding a point to the sstable.Writer, we could check with the fragmenter to see if the key is within and as recent as any of the pending tombstones. If it is, the flag must be cleared.

@petermattis
Copy link
Collaborator

Something to note here is that the problem seen in cockroachlabs/support#846 was exacerbated by unusually high rebalance activity caused by a bug in CRDB. In normal operation, range tombstones tend to be compacted out of existence quickly and not kept alive by snapshots. It would be nice to solve this perf problem, but we need to balance the improvement against the difficulty of doing so (moderate to high) and the potential for some short term instability (anything that touches range tombstones is by definition risky).

@sumeerbhola
Copy link
Collaborator

The memtable problem may not be a killer. The iterators returned by Snapshot.NewIter also pin whatever memtables are currently in existence at the time of the call. The thing to figure out here is whether we hold the snapshot open for significantly longer than the individual iterators created on the snapshot.

I am wary of going down this path since it is possible we have left iterators open for too long that we can improve later, but eliminating the use of a snapshot will prevent that fix. An example is the iterator in OutgoingSnapshot which is kept open for sending all the replicated data for a range, which will grow as we increase the range size, but the sending is also throttled by a rate limiter, so could be slow (and affected by network issues).

It would be nice to solve this perf problem, but we need to balance the improvement against the difficulty of doing so (moderate to high) and the potential for some short term instability (anything that touches range tombstones is by definition risky).

Given that this has recurred, I am wondering whether we could do a simpler workaround, that only involves how we prioritize compactions. When calculating RangeDeletionsBytesEstimate we could also make a note of the highest sequence number across these range tombstones in the file and store it in TableStats. We would not adjust the compensated size for the file by this estimate until after the minimum snapshot sequence number became higher than the file's highest-range-del-seq-num. This would require an extra data-structure for tracking files whose compensated size needs to be adjusted when snapshots are closed, perhaps in versionSet (so we can keep one copy of this data-structure and remove files that are compacted away). The interaction with compensatedSizeAnnotator may need some care -- I guess we could introduce a new bit in addition to TableStats.Valid and disallow caching if either was false.

@jbowens
Copy link
Collaborator Author

jbowens commented Mar 5, 2021

Given that this has recurred, I am wondering whether we could do a simpler workaround, that only involves how we prioritize compactions. When calculating RangeDeletionsBytesEstimate we could also make a note of the highest sequence number across these range tombstones in the file and store it in TableStats. We would not adjust the compensated size for the file by this estimate until after the minimum snapshot sequence number became higher than the file's highest-range-del-seq-num. This would require an extra data-structure for tracking files whose compensated size needs to be adjusted when snapshots are closed, perhaps in versionSet (so we can keep one copy of this data-structure and remove files that are compacted away). The interaction with compensatedSizeAnnotator may need some care -- I guess we could introduce a new bit in addition to TableStats.Valid and disallow caching if either was false.

I think I tried something similar to this in #872 before the B-Tree and the compensatedSizeAnnotator made this more difficult. At the time, I observed that the compaction of these 'pinned' range tombstones is still important for reducing IMPORT write amplification, because it clears broad ranges from higher levels of the LSM. That ensures that ingestions of future overlapping sstables can be ingested into lower levels rather than L0.

@sumeerbhola
Copy link
Collaborator

I think the prefix value is currently only used in two places: in Iterator for strictly checking that only keys consistent with that prefix are returned, and down in the sstable iterators for bloom filter checking. We could start using it in mergingIter too, but will need to be careful wrt what state we leave it in when returning nil -- I suspect it will be straightforward when restricted to only optimizing SeekPrefixGE since Next and Prev will not subsequently be called (former because the mergingIter is already done, and latter since it is not permitted with prefix iteration).

I'm going to attempt this -- it isn't a completely general solution, but should help with the CockroachDB scenario.

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Mar 5, 2021
This is motivated by the CockroachDB issues discussed
in cockroachdb#1070 where
range tombstones in L6 cause the iterator to go through
all the deleted data. The situation is even worse in that
each successive SeekPrefixGE in a batch request (which is
in sorted order) will typically have to start all over again
since the file at which the seek finally ended its work is
probably later than the file which contains the relevant key.
Note that CockroachDB does not set bounds when using
SeekPrefixGE because the assumption is that the underlying
Pebble implementation can imply bounds, and we don't want
to change this behavior in CockroachDB. Since CockroachDB
primarily uses range tombstones when deleting CockroachDB
ranges, the SeekPrefixGE calls that were slow were probably
on a preceding CockroachDB range, hence stopping early,
as done in this PR, should fix the issue. If range tombstones
were used within the CockroachDB range that is being read
using SeekPrefixGE, the seek could land on a deleted key
and will need to iterate through all its MVCC versions.
Even in that case, the work is bounded by the number of
deleted versions of a single MVCC key.

The code stops early in prefix iteration mode, both for
SeekPrefixGE and Next.

BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the
existing problem when with-tombstone=true.

name                                                            old time/op    new time/op    delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     446ns ± 0%     433ns ± 0%    -2.91%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16     10.3ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     416ns ± 0%     429ns ± 0%    +3.12%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16     10.6ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     414ns ± 0%     437ns ± 0%    +5.56%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16     10.5ms ± 0%     0.0ms ± 0%   -99.99%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16               1.65µs ± 0%    1.75µs ± 0%    +5.75%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 463ns ± 0%     459ns ± 0%    -0.86%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16               1.61µs ± 0%    1.67µs ± 0%    +3.73%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 476ns ± 0%     475ns ± 0%    -0.21%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16               1.62µs ± 0%    1.77µs ± 0%    +9.26%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 513ns ± 0%     525ns ± 0%    +2.34%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16               1.71µs ± 0%    1.84µs ± 0%    +7.65%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                1.10µs ± 0%    1.16µs ± 0%    +5.27%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16              1.80µs ± 0%    1.86µs ± 0%    +2.99%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16               1.34µs ± 0%    1.20µs ± 0%   -10.23%

name                                                            old alloc/op   new alloc/op   delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16       292B ± 0%        0B       -100.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16               0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16                0.00B          0.00B          0.00%

Informs cockroachdb#1070
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Mar 8, 2021
This is motivated by the CockroachDB issues discussed
in cockroachdb#1070 where
range tombstones in L6 cause the iterator to go through
all the deleted data. The situation is even worse in that
each successive SeekPrefixGE in a batch request (which is
in sorted order) will typically have to start all over again
since the file at which the seek finally ended its work is
probably later than the file which contains the relevant key.
Note that CockroachDB does not set bounds when using
SeekPrefixGE because the assumption is that the underlying
Pebble implementation can imply bounds, and we don't want
to change this behavior in CockroachDB. Since CockroachDB
primarily uses range tombstones when deleting CockroachDB
ranges, the SeekPrefixGE calls that were slow were probably
on a preceding CockroachDB range, hence stopping early,
as done in this PR, should fix the issue. If range tombstones
were used within the CockroachDB range that is being read
using SeekPrefixGE, the seek could land on a deleted key
and will need to iterate through all its MVCC versions.
Even in that case, the work is bounded by the number of
deleted versions of a single MVCC key.

The code stops early in prefix iteration mode, both for
SeekPrefixGE and Next.

BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the
existing problem when with-tombstone=true.

name                                                            old time/op    new time/op    delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     446ns ± 0%     433ns ± 0%    -2.91%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16     10.3ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     416ns ± 0%     429ns ± 0%    +3.12%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16     10.6ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     414ns ± 0%     437ns ± 0%    +5.56%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16     10.5ms ± 0%     0.0ms ± 0%   -99.99%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16               1.65µs ± 0%    1.75µs ± 0%    +5.75%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 463ns ± 0%     459ns ± 0%    -0.86%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16               1.61µs ± 0%    1.67µs ± 0%    +3.73%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 476ns ± 0%     475ns ± 0%    -0.21%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16               1.62µs ± 0%    1.77µs ± 0%    +9.26%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 513ns ± 0%     525ns ± 0%    +2.34%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16               1.71µs ± 0%    1.84µs ± 0%    +7.65%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                1.10µs ± 0%    1.16µs ± 0%    +5.27%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16              1.80µs ± 0%    1.86µs ± 0%    +2.99%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16               1.34µs ± 0%    1.20µs ± 0%   -10.23%

name                                                            old alloc/op   new alloc/op   delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16       292B ± 0%        0B       -100.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16               0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16                0.00B          0.00B          0.00%

Informs cockroachdb#1070
sumeerbhola added a commit that referenced this issue Mar 8, 2021
This is motivated by the CockroachDB issues discussed
in #1070 where
range tombstones in L6 cause the iterator to go through
all the deleted data. The situation is even worse in that
each successive SeekPrefixGE in a batch request (which is
in sorted order) will typically have to start all over again
since the file at which the seek finally ended its work is
probably later than the file which contains the relevant key.
Note that CockroachDB does not set bounds when using
SeekPrefixGE because the assumption is that the underlying
Pebble implementation can imply bounds, and we don't want
to change this behavior in CockroachDB. Since CockroachDB
primarily uses range tombstones when deleting CockroachDB
ranges, the SeekPrefixGE calls that were slow were probably
on a preceding CockroachDB range, hence stopping early,
as done in this PR, should fix the issue. If range tombstones
were used within the CockroachDB range that is being read
using SeekPrefixGE, the seek could land on a deleted key
and will need to iterate through all its MVCC versions.
Even in that case, the work is bounded by the number of
deleted versions of a single MVCC key.

The code stops early in prefix iteration mode, both for
SeekPrefixGE and Next.

BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the
existing problem when with-tombstone=true.

name                                                            old time/op    new time/op    delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     446ns ± 0%     433ns ± 0%    -2.91%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16     10.3ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     416ns ± 0%     429ns ± 0%    +3.12%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16     10.6ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     414ns ± 0%     437ns ± 0%    +5.56%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16     10.5ms ± 0%     0.0ms ± 0%   -99.99%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16               1.65µs ± 0%    1.75µs ± 0%    +5.75%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 463ns ± 0%     459ns ± 0%    -0.86%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16               1.61µs ± 0%    1.67µs ± 0%    +3.73%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 476ns ± 0%     475ns ± 0%    -0.21%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16               1.62µs ± 0%    1.77µs ± 0%    +9.26%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 513ns ± 0%     525ns ± 0%    +2.34%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16               1.71µs ± 0%    1.84µs ± 0%    +7.65%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                1.10µs ± 0%    1.16µs ± 0%    +5.27%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16              1.80µs ± 0%    1.86µs ± 0%    +2.99%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16               1.34µs ± 0%    1.20µs ± 0%   -10.23%

name                                                            old alloc/op   new alloc/op   delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16       292B ± 0%        0B       -100.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16               0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16                0.00B          0.00B          0.00%

Informs #1070
jbowens pushed a commit to jbowens/pebble that referenced this issue Mar 26, 2021
This is motivated by the CockroachDB issues discussed
in cockroachdb#1070 where
range tombstones in L6 cause the iterator to go through
all the deleted data. The situation is even worse in that
each successive SeekPrefixGE in a batch request (which is
in sorted order) will typically have to start all over again
since the file at which the seek finally ended its work is
probably later than the file which contains the relevant key.
Note that CockroachDB does not set bounds when using
SeekPrefixGE because the assumption is that the underlying
Pebble implementation can imply bounds, and we don't want
to change this behavior in CockroachDB. Since CockroachDB
primarily uses range tombstones when deleting CockroachDB
ranges, the SeekPrefixGE calls that were slow were probably
on a preceding CockroachDB range, hence stopping early,
as done in this PR, should fix the issue. If range tombstones
were used within the CockroachDB range that is being read
using SeekPrefixGE, the seek could land on a deleted key
and will need to iterate through all its MVCC versions.
Even in that case, the work is bounded by the number of
deleted versions of a single MVCC key.

The code stops early in prefix iteration mode, both for
SeekPrefixGE and Next.

BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the
existing problem when with-tombstone=true.

name                                                            old time/op    new time/op    delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     446ns ± 0%     433ns ± 0%    -2.91%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16     10.3ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     416ns ± 0%     429ns ± 0%    +3.12%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16     10.6ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     414ns ± 0%     437ns ± 0%    +5.56%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16     10.5ms ± 0%     0.0ms ± 0%   -99.99%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16               1.65µs ± 0%    1.75µs ± 0%    +5.75%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 463ns ± 0%     459ns ± 0%    -0.86%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16               1.61µs ± 0%    1.67µs ± 0%    +3.73%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 476ns ± 0%     475ns ± 0%    -0.21%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16               1.62µs ± 0%    1.77µs ± 0%    +9.26%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 513ns ± 0%     525ns ± 0%    +2.34%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16               1.71µs ± 0%    1.84µs ± 0%    +7.65%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                1.10µs ± 0%    1.16µs ± 0%    +5.27%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16              1.80µs ± 0%    1.86µs ± 0%    +2.99%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16               1.34µs ± 0%    1.20µs ± 0%   -10.23%

name                                                            old alloc/op   new alloc/op   delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16       292B ± 0%        0B       -100.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16               0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16                0.00B          0.00B          0.00%

Informs cockroachdb#1070
jbowens pushed a commit to jbowens/pebble that referenced this issue Mar 31, 2021
This is motivated by the CockroachDB issues discussed
in cockroachdb#1070 where
range tombstones in L6 cause the iterator to go through
all the deleted data. The situation is even worse in that
each successive SeekPrefixGE in a batch request (which is
in sorted order) will typically have to start all over again
since the file at which the seek finally ended its work is
probably later than the file which contains the relevant key.
Note that CockroachDB does not set bounds when using
SeekPrefixGE because the assumption is that the underlying
Pebble implementation can imply bounds, and we don't want
to change this behavior in CockroachDB. Since CockroachDB
primarily uses range tombstones when deleting CockroachDB
ranges, the SeekPrefixGE calls that were slow were probably
on a preceding CockroachDB range, hence stopping early,
as done in this PR, should fix the issue. If range tombstones
were used within the CockroachDB range that is being read
using SeekPrefixGE, the seek could land on a deleted key
and will need to iterate through all its MVCC versions.
Even in that case, the work is bounded by the number of
deleted versions of a single MVCC key.

The code stops early in prefix iteration mode, both for
SeekPrefixGE and Next.

BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the
existing problem when with-tombstone=true.

name                                                            old time/op    new time/op    delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     446ns ± 0%     433ns ± 0%    -2.91%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16     10.3ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     416ns ± 0%     429ns ± 0%    +3.12%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16     10.6ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     414ns ± 0%     437ns ± 0%    +5.56%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16     10.5ms ± 0%     0.0ms ± 0%   -99.99%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16               1.65µs ± 0%    1.75µs ± 0%    +5.75%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 463ns ± 0%     459ns ± 0%    -0.86%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16               1.61µs ± 0%    1.67µs ± 0%    +3.73%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 476ns ± 0%     475ns ± 0%    -0.21%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16               1.62µs ± 0%    1.77µs ± 0%    +9.26%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 513ns ± 0%     525ns ± 0%    +2.34%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16               1.71µs ± 0%    1.84µs ± 0%    +7.65%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                1.10µs ± 0%    1.16µs ± 0%    +5.27%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16              1.80µs ± 0%    1.86µs ± 0%    +2.99%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16               1.34µs ± 0%    1.20µs ± 0%   -10.23%

name                                                            old alloc/op   new alloc/op   delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16       292B ± 0%        0B       -100.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16               0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16                0.00B          0.00B          0.00%

Informs cockroachdb#1070
jbowens pushed a commit that referenced this issue Mar 31, 2021
This is motivated by the CockroachDB issues discussed
in #1070 where
range tombstones in L6 cause the iterator to go through
all the deleted data. The situation is even worse in that
each successive SeekPrefixGE in a batch request (which is
in sorted order) will typically have to start all over again
since the file at which the seek finally ended its work is
probably later than the file which contains the relevant key.
Note that CockroachDB does not set bounds when using
SeekPrefixGE because the assumption is that the underlying
Pebble implementation can imply bounds, and we don't want
to change this behavior in CockroachDB. Since CockroachDB
primarily uses range tombstones when deleting CockroachDB
ranges, the SeekPrefixGE calls that were slow were probably
on a preceding CockroachDB range, hence stopping early,
as done in this PR, should fix the issue. If range tombstones
were used within the CockroachDB range that is being read
using SeekPrefixGE, the seek could land on a deleted key
and will need to iterate through all its MVCC versions.
Even in that case, the work is bounded by the number of
deleted versions of a single MVCC key.

The code stops early in prefix iteration mode, both for
SeekPrefixGE and Next.

BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the
existing problem when with-tombstone=true.

name                                                            old time/op    new time/op    delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     446ns ± 0%     433ns ± 0%    -2.91%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16     10.3ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     416ns ± 0%     429ns ± 0%    +3.12%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16     10.6ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     414ns ± 0%     437ns ± 0%    +5.56%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16     10.5ms ± 0%     0.0ms ± 0%   -99.99%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16               1.65µs ± 0%    1.75µs ± 0%    +5.75%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 463ns ± 0%     459ns ± 0%    -0.86%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16               1.61µs ± 0%    1.67µs ± 0%    +3.73%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 476ns ± 0%     475ns ± 0%    -0.21%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16               1.62µs ± 0%    1.77µs ± 0%    +9.26%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 513ns ± 0%     525ns ± 0%    +2.34%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16               1.71µs ± 0%    1.84µs ± 0%    +7.65%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                1.10µs ± 0%    1.16µs ± 0%    +5.27%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16              1.80µs ± 0%    1.86µs ± 0%    +2.99%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16               1.34µs ± 0%    1.20µs ± 0%   -10.23%

name                                                            old alloc/op   new alloc/op   delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16       292B ± 0%        0B       -100.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16               0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16                0.00B          0.00B          0.00%

Informs #1070
@nicktrav
Copy link
Contributor

The snapshot predicate feature (#1810) would likely help with this.

@jbowens
Copy link
Collaborator Author

jbowens commented Jun 5, 2023

I'm going to close this out in favor of #2424.

@jbowens jbowens closed this as not planned Won't fix, can't repro, duplicate, stale Jun 5, 2023
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

4 participants