perf: relative positioning through broad range tombstones is slow #1070
Comments
I think this nexting only occurs at the same level the range tombstone is located at. Lower levels will skip right over the range tombstone data.
I think this is already done for sstables that are covered by range tombstones at a higher level.
See above. Isn't this already done?
Ah, maybe this is specifically an issue for sstables that contain a range tombstone covering their own contents. I think this might be the common case, because we prioritize compaction of these range tombstones, which tends to move the tombstone into the same level as the data itself. In cockroachlabs/support#846, the elision-only compaction right as the CPU became unpegged suggests that at least one tombstone was within the same file as the keys it covered. I'm going to try to produce a test case that exhibits this behavior and see what I find.
Yeah, this sounds like it could be a problem. Not sure how to address it, as point operations at the same seqnum as a range tombstone are specified to take precedence (this is needed for ingestion). So we can't easily tell that a range tombstone in an sstable covers all of the point operations in an sstable given the info we currently have. If we kept track of the largest point operation seqnum in the sstable properties, we would have enough information to tell.
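A minimal sketch of the check this idea would enable, assuming the largest point-key seqnum were recorded as an sstable property; the function names and the visibility convention below are illustrative, not Pebble's actual code:

```go
// Because a point key at the same seqnum as a range tombstone takes
// precedence (needed for ingestion), the tombstone only dominates the
// table's points when its seqnum is strictly greater than the largest
// point seqnum recorded for the table.
func tombstoneCoversAllPoints(tombstoneSeqNum, largestPointSeqNum uint64) bool {
	return tombstoneSeqNum > largestPointSeqNum
}

// canSkipCoveredSpan reports whether a read at readSeqNum could skip the
// tombstone's span entirely: the tombstone must dominate every point key
// in the table and must itself be visible to the read.
func canSkipCoveredSpan(tombstoneSeqNum, largestPointSeqNum, readSeqNum uint64) bool {
	return tombstoneCoversAllPoints(tombstoneSeqNum, largestPointSeqNum) &&
		tombstoneSeqNum < readSeqNum
}
```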
Amend BenchmarkRangeDelIterate to include benchmarks that iterate through keys that are deleted by a range tombstone within the same SSTable. See cockroachdb#1070.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
BenchmarkRangeDelIterate/entries=10/deleted=10/snapshotAndCompact=false-24          1868349        636 ns/op
BenchmarkRangeDelIterate/entries=10/deleted=10/snapshotAndCompact=true-24             403632       2945 ns/op
BenchmarkRangeDelIterate/entries=10/deleted=9/snapshotAndCompact=false-24             647324       1855 ns/op
BenchmarkRangeDelIterate/entries=10/deleted=9/snapshotAndCompact=true-24              425908       2793 ns/op
BenchmarkRangeDelIterate/entries=1000/deleted=1000/snapshotAndCompact=false-24       1874809        642 ns/op
BenchmarkRangeDelIterate/entries=1000/deleted=1000/snapshotAndCompact=true-24          10000     106763 ns/op
BenchmarkRangeDelIterate/entries=1000/deleted=999/snapshotAndCompact=false-24         613291       1895 ns/op
BenchmarkRangeDelIterate/entries=1000/deleted=999/snapshotAndCompact=true-24           10000     106169 ns/op
BenchmarkRangeDelIterate/entries=100000/deleted=100000/snapshotAndCompact=false-24   1865816        640 ns/op
BenchmarkRangeDelIterate/entries=100000/deleted=100000/snapshotAndCompact=true-24        106   10928184 ns/op
BenchmarkRangeDelIterate/entries=100000/deleted=99999/snapshotAndCompact=false-24     614889       1937 ns/op
BenchmarkRangeDelIterate/entries=100000/deleted=99999/snapshotAndCompact=true-24         105   10958571 ns/op
```
Yeah, tracking the largest point seqnum and storing it in the sstable properties could work.

But I also worry about files that contain both live and obsolete data. Most of a file might be covered by its tombstone, but there could be a few point operations more recent than the tombstone. I think in CockroachDB this would happen if an sstable straddles a range boundary (and one of the ranges was deleted) or a table boundary (and one of the tables was dropped). In both of these scenarios, it's unlikely that the more recent point operations overlap the tombstone's key range.

One idea is to construct an O(# range tombstones) bitmap that indicates whether each tombstone's keyspace covers exclusively keys at lower sequence numbers. If a read's sequence number is >= the tombstone's and the corresponding bit is set, the sstable iterator can skip over the tombstone's key space. We could stick this bitmap in a table property for backwards compatibility. I haven't thought through how exactly this would shake out in code.

Something like #847 seems like a better long-term solution, but the problem is big enough that I think it warrants a backport fix.
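A rough sketch of how such a bitmap might be built at write time, assuming the writer observes range tombstones and point keys as they are added. The names and types here (tombstoneCoverage and friends) are hypothetical stand-ins, not Pebble's writer or fragmenter API:

```go
package sketch

import "bytes"

// span describes one range tombstone: it deletes keys in [start, end) at
// seqnums below its own seqNum.
type span struct {
	start, end []byte
	seqNum     uint64
}

// tombstoneCoverage tracks, for each range tombstone in the order it was
// added, whether every point key observed inside its span is strictly older
// than the tombstone. The bits could then be persisted as a table property.
type tombstoneCoverage struct {
	spans []span
	bits  []bool
}

func (c *tombstoneCoverage) addTombstone(start, end []byte, seqNum uint64) {
	c.spans = append(c.spans, span{start: start, end: end, seqNum: seqNum})
	c.bits = append(c.bits, true) // optimistic until a newer point key shows up
}

func (c *tombstoneCoverage) addPointKey(key []byte, seqNum uint64) {
	for i, s := range c.spans {
		inSpan := bytes.Compare(key, s.start) >= 0 && bytes.Compare(key, s.end) < 0
		if inSpan && seqNum >= s.seqNum {
			// A point key at least as recent as the tombstone lies inside its
			// span (ties favor the point key), so reads above the tombstone's
			// seqnum cannot blindly skip this span.
			c.bits[i] = false
		}
	}
}
```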
How would you index into this bitmap? Range tombstones don't have an index. Would you just use the implicit order in which they occur in the range-del block? How would you build up this bitmap? I think something would have to be added to the write path to compute it.
I believe we are not setting bounds in this mvccPut path.
btw, do we know how long the longest-lived snapshots are in CockroachDB? I am wondering whether, in the CockroachDB context, we can sidestep this issue by using a long-lived iterator instead of a snapshot.
However long it takes to send a replica snapshot across the network. I'd estimate on the order of a minute or so.
This is an interesting idea worth exploring. I looked in the past at how hard it would be to get rid of snapshots, and the need for multiple concurrent iterators was the stumbling block. The downside to this approach is that it could cause excessive memory usage because the iterator will maintain a reference to the active memtables.
I overlooked the memtables. I think that kills that idea. It is a bad feedback loop for slow snapshot sending to cause OOMs, further slowing down the cluster.
The memtable problem may not be a killer.
Oof, yeah, I overlooked the fragmenter. The only other idea that I have is collapsing that bitmap into a single flag that, when set, indicates all of the sstable's range tombstones are more recent than any point key contained within their key spans. I think this still requires coordination from the range deletion fragmenter, but not as much. Whenever a key is added to an output sstable and the range deletion fragmenter is non-empty, we would need to check if the key is more recent and within the key range of any of the pending range deletions.
I'm wary of this given that other nearby (CRDB) ranges will have writes continuing to happen after the deleted range has its range tombstone written to the memtable. It seems likely that some of those will find their way into the same sstable. However, given our sparing use of range tombstones in CRDB, it may be ok to claim for an sstable that all point keys in the sstable that are within any range tombstone in that sstable are less recent than that tombstone. This would allow us to encode a single bit for the sstable.
I had a similar thought to this. One challenge here is doing something that is backwards compatible. Keeping the bitmap separate avoids the backwards compatibility issue, and we likely have few enough range tombstones in an sstable to make storing the bitmap in a property feasible.
Sorry, I think it was unclear from my comment, but I think we're saying the same thing. The flag would still be set if there are more recent point tombstones that do not overlap with any range tombstone's span. Whenever we're adding a point to the sstable.Writer, we could check with the fragmenter to see if the key is within the span of, and at least as recent as, any of the pending tombstones. If it is, the flag must be cleared.
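A hedged sketch of the read side of this single-flag variant, assuming the flag ends up in a table property; the property name, tombstone type, and iterator shape below are simplified stand-ins rather than Pebble's actual sstable reader:

```go
package sketch

// tableProps is a stand-in for the sstable properties this idea would add.
type tableProps struct {
	// RangeDelsDominatePoints is true when every point key lying inside any
	// of the sstable's range tombstones is older than that tombstone. The
	// writer clears it as soon as the fragmenter check above fails.
	RangeDelsDominatePoints bool
}

type tombstone struct {
	Start, End []byte
	SeqNum     uint64
}

type pointIter interface {
	SeekGE(key []byte) // reposition at the first point key >= key
}

// maybeSkipCoveredSpan jumps the point iterator past a tombstone's span when
// the flag guarantees nothing newer than the tombstone hides inside it and
// the tombstone is visible to this read.
func maybeSkipCoveredSpan(props tableProps, t tombstone, readSeqNum uint64, points pointIter) bool {
	if props.RangeDelsDominatePoints && t.SeqNum < readSeqNum {
		points.SeekGE(t.End) // everything in [Start, End) is deleted and older
		return true
	}
	return false
}
```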
Something to note here is that the problem seen in cockroachlabs/support#846 was exacerbated by unusually high rebalance activity caused by a bug in CRDB. In normal operation, range tombstones tend to be compacted out of existence quickly and not kept alive by snapshots. It would be nice to solve this perf problem, but we need to balance the improvement against the difficulty of doing so (moderate to high) and the potential for some short term instability (anything that touches range tombstones is by definition risky).
I am wary of going down this path: it is possible we currently leave iterators open for too long, which is something we could improve later, but eliminating the use of a snapshot will prevent that fix.
Given that this has recurred, I am wondering whether we could do a simpler workaround that only involves how we prioritize compactions.
I think I tried something similar to this in #872, before the B-Tree changes.
I'm going to attempt this -- it isn't a completely general solution, but should help with the CockroachDB scenario.
This is motivated by the CockroachDB issues discussed in cockroachdb#1070, where range tombstones in L6 cause the iterator to go through all the deleted data. The situation is even worse in that each successive SeekPrefixGE in a batch request (which is in sorted order) will typically have to start all over again, since the file at which the seek finally ended its work is probably later than the file which contains the relevant key.

Note that CockroachDB does not set bounds when using SeekPrefixGE because the assumption is that the underlying Pebble implementation can imply bounds, and we don't want to change this behavior in CockroachDB. Since CockroachDB primarily uses range tombstones when deleting CockroachDB ranges, the SeekPrefixGE calls that were slow were probably on a preceding CockroachDB range, hence stopping early, as done in this PR, should fix the issue. If range tombstones were used within the CockroachDB range that is being read using SeekPrefixGE, the seek could land on a deleted key and will need to iterate through all its MVCC versions. Even in that case, the work is bounded by the number of deleted versions of a single MVCC key.

The code stops early in prefix iteration mode, both for SeekPrefixGE and Next. BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the existing problem when with-tombstone=true.

```
name                                                              old time/op    new time/op    delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16       446ns ± 0%     433ns ± 0%    -2.91%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16       10.3ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16       416ns ± 0%     429ns ± 0%    +3.12%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16       10.6ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16       414ns ± 0%     437ns ± 0%    +5.56%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16       10.5ms ± 0%     0.0ms ± 0%   -99.99%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16                  1.65µs ± 0%    1.75µs ± 0%    +5.75%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                    463ns ± 0%     459ns ± 0%    -0.86%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16                  1.61µs ± 0%    1.67µs ± 0%    +3.73%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                    476ns ± 0%     475ns ± 0%    -0.21%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16                  1.62µs ± 0%    1.77µs ± 0%    +9.26%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                    513ns ± 0%     525ns ± 0%    +2.34%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16                  1.71µs ± 0%    1.84µs ± 0%    +7.65%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                   1.10µs ± 0%    1.16µs ± 0%    +5.27%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16                 1.80µs ± 0%    1.86µs ± 0%    +2.99%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16                  1.34µs ± 0%    1.20µs ± 0%   -10.23%

name                                                              old alloc/op   new alloc/op   delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16        0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16          300B ± 0%        0B        -100.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16        0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16          300B ± 0%        0B        -100.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16        0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16          292B ± 0%        0B        -100.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16                   0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                    0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16                   0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                    0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16                   0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                    0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16                   0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                    0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16                  0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16                   0.00B          0.00B          0.00%
```

Informs cockroachdb#1070
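A simplified sketch of the stop-early behavior in prefix iteration mode that the commit describes; the single-level iterator and prefix handling below are illustrative only, not Pebble's mergingIter or its Comparer.Split machinery:

```go
package sketch

import "bytes"

// simpleIter yields keys in sorted order; ok is false once exhausted.
type simpleIter interface {
	Next() (key []byte, ok bool)
}

// seekPrefixGE scans for the first key carrying the given prefix. Rather than
// nexting through an arbitrary number of (possibly range-deleted) keys beyond
// the prefix, it gives up as soon as it sees a key whose prefix sorts after
// the one being sought -- the essence of stopping early in prefix mode.
func seekPrefixGE(it simpleIter, prefix []byte) ([]byte, bool) {
	for {
		key, ok := it.Next()
		if !ok {
			return nil, false
		}
		p := key
		if len(p) > len(prefix) {
			p = p[:len(prefix)]
		}
		switch bytes.Compare(p, prefix) {
		case 0:
			return key, true // a key within the sought prefix
		case +1:
			return nil, false // past the prefix: stop instead of nexting on
		}
		// p < prefix: keep advancing toward the prefix.
	}
}
```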
The snapshot predicate feature (#1810) would likely help with this.
I'm going to close this out in favor of #2424.
In cockroachlabs/support#846, it looks like high CPU usage was caused by a SeekGE for a key just before a range tombstone's start key. If a SeekGE key is just less than a tombstone's start key, and there are no keys between the seek key and the tombstone start key, SeekGE must Next through the entire range tombstone.

We have a couple longer term ideas that would have helped here:

- Allowing a SeekGE at a recent sequence number to ignore all the obsolete keys covered by the range tombstone.
- Scoping NewSnapshot so that a snapshot only applies to keys within the range. This would've allowed Pebble to drop the data covered by the range tombstone, assuming the snapshot did not actually care about the data beneath the tombstone. I think this idea is tricky mostly in the interface exposed to CockroachDB, because CRDB should only be able to read snapshotted data when reading from a snapshot with a range.

In the immediate term, it seems like it should be possible to add some sort of noninvasive optimization: if we Next into the range tombstone, we're nexting outside the replica's keyspace. I'm not sure if the iterator used by mvccPutUsingIter currently has an upper bound set, but if not, we could set one. Then whenever we encounter a new tombstone during findNextEntry, we could compare it to the iterator upper bound and return early.
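A minimal sketch of that upper-bound early return, assuming the iterator has an upper bound set; findNextEntry's real structure is more involved, and the types below are simplified stand-ins:

```go
package sketch

import "bytes"

type rangeTombstone struct {
	Start, End []byte
}

type iterState struct {
	upperBound []byte // exclusive upper bound, nil if unbounded
	valid      bool
}

// checkTombstoneAgainstBound is the early-return check findNextEntry could
// perform when it encounters a new range tombstone: if the tombstone begins
// at or beyond the iterator's upper bound, every key it would force us to
// skip lies outside the bound anyway, so iteration can stop immediately.
func (i *iterState) checkTombstoneAgainstBound(t rangeTombstone) (stop bool) {
	if i.upperBound != nil && bytes.Compare(t.Start, i.upperBound) >= 0 {
		i.valid = false
		return true
	}
	return false
}
```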