sstable: sstable-local key kinds for overwritten/deleted points #2465

sumeerbhola · 2023-04-19T14:17:28Z

We have various related problems caused by Pebble snapshots:

P1. RANGEDELs that delete points in the same sstable, but the points were not deleted during the compaction because of an open snapshot. This results in expensive iteration. db: efficient skipping of points deleted by RANGEDEL in the same file #2424 discusses a possible solution using a block property collector for the range of seqnums in a block. It involves using a dynamic block property filter, based on the RANGEDELs at the current point iterator position, to skip blocks. A concern there is the code complexity of this dynamic application (we have significant code complexity in the other case of dynamic application -- range key masking).
P2. When iterating over foreign sstables, we need to do (a) point collapsing to expose at most one point per user key, (b) apply RANGEDELs. The current PR for this *: Implement iterators and seqnum substitution for foreign SSTs #2455 involves a pointCollapsingSSTIter that wraps the actual sst iterator. SST iteration is very performance sensitive code and adding extra wrappers is non ideal: there is no clear end to the life of a foreign sst and we'd rather not force a rewrite of that sstable -- one of the goals of disaggregated storage is to scale compute and disk bandwidth resources as a function of the hot (from a write perspective) data and not the whole data. Also, application of RANGEDELs in this case suffers from the same issue as in P1.

The ideal solution for P2 would allow user-facing reads to utilize the existing SST iterators (with slight modifications) and with no loss of efficiency. And for P1 and P2 we would like to skip whole blocks of overwritten/deleted points. And avoiding key comparisons would be very useful in general.

We observe that:

Reads:
- All user-facing reads in CockroachDB use iterators over the DB, hence have a higher read seqnum than all sstables
- Even for the internal-facing reads that use snapshots, the snapshots are shortlived enough that most L5 and L6 sstables will have all seqnums lower than the snapshot seqnum.
Writes:
- we have already done the key comparisons between points that share the same user key and between points and RANGEDELs on the compaction path (in compactionIter). The idea below uses those comparisons, by encoding them in the written sst, so they can be cheaply utilized when reading.

We ignore MERGE in the following discussion since I don't believe it is worth optimizing:

For P2, we should disallow it from foreign ssts.
For regular ssts, we don't need to optimize for it. They can be handled by the upstream iterators in the iterator tree (specifically pebble.Iterator).

We are left with the following point key kinds: SET, DEL, SINGLEDEL, SETWITHDEL. For each of these we introduce a corresponding *_LOCALLY_DELETED kind e.g. SET_LOCALLY_DELETED.

Reads:
The *_LOCALLY_DELETED kinds are only visible to the sst iterators and the sst iterators will convert them to the corresponding externally visible kind e.g. SET_LOCALLY_DELETED to SET, before exposing them (if they need to be exposed). Since all valid kinds have a uint8 value <= 22, we can do the check for *_LOCALLY_DELETED kind and the conversion cheaply by reserving the most significant bit in the InternalKeyKind uint8 for the *_LOCALLY_DELETED kinds. An (internally installed) block property collector collects whether all points in a block are *_LOCALLY_DELETED. There are two kinds of iteration modes over such a sstable:

A. Hide the *_LOCALLY_DELETED keys: this is used when the read seqnum is > all the seqnums in the sst. This is the common case. Forward and reverse iteration simply skips over the *_LOCALLY_DELETED keys. There are no user-key comparisons needed to do such skipping/collapsing. And skipping uses the block property filter to skip whole blocks whenever possible. This is straightforward compared to a dynamic block property filter and there is no need to do key span comparisons for RANGEDEL application. This hiding is correct because of the level invariant: the sequence of seqnums for a user key in a sstable represents a contiguous subsequence of the seqnums for the userkey across the whole LSM, and is more recent than the seqnums in a sstable in a lower level.
B. map *_LOCALLY_DELETED to the externally visible kind: Used by compactions and certain pebble.Iterator reads.

In general A is best-effort, since not all ssts will satisfy the conditions for A. Hence the normal iterator tree continues to have the ability to apply DELs, RANGEDELs etc. (which it anyway needs to do for merging across levels). One case where A is not best effort is iteration over foreign sstables, which by definition happens at a seqnum > that of the foreign sst.

Writes:
The *_LOCALLY_DELETED types are generated during compactions. compactionIter knows which points are locally deleted either because they are overwritten by a newer point, or by a RANGEDEL, but are being preserved because of snapshots.
Some care is needed here.

Simple case for a userkey: One point output per snapshot stripe (that has points for that userkey). Currently, the output point has the seqnum of the highest point in that snapshot stripe and key kind of this highest seqnum and the corresponding value. We change this to:

The point output from the most recent stripe retains it's key kind.
The output points from older stripes transition to the kind *_LOCALLY_DELETED.
If writing to L6 and the most recent stripe output point is a DEL it is changed to a DEL_LOCALLY_DELETED (there are no levels below for which this DEL is useful).

General case: This occurs due to SETWITHDEL, SINGLEDEL. Various transformations can currently happen, that all apply within a snapshot stripe:

SET => SETWITHDEL based on the next older key after the SET being a DEL or SINGLEDEL
SINGLEDEL => DEL based on the next older key after the SINGLEDEL being a DEL or a SETWITHDEL.
SINGLEDEL vanishing: by consuming the next older key that is a SET. The key exposed for that snapshot stripe may then be an older key in that stripe, or nothing may be exposed from that stripe.

Now that since iteration A will ignore the older keys, we need to do these transformations eagerly for the youngest candidate key for the userkey, since we cannot do this at read time. If the next older key is in the same snapshot stripe as the youngest candidate key (NB: if SINGLEDEL as the youngest candidate key vanishes, it passes the mantle of the youngest candidate key to the next key, which may be in the next older stripe), the same logic in compactionIter suffices. The difficulty arises when the next older key to the youngest candidate key is in an older snapshot stripe. We change the logic to apply this transformation even in that case. This should be straightforward since we do step to the next key when the current candidate is a SET (compactionIter.setNext) or a SINGLEDEL (compactionIter.singleDeleteNext), before outputting the current candidate (both call nextInStripe). The resulting key kind is what is output for the youngest key. In the case where this means the SINGLEDEL would vanish, we output SINGLEDEL_LOCALLY_DELETED.

The text was updated successfully, but these errors were encountered:

jbowens · 2023-04-19T16:30:02Z

I like that this idea localizes complexity to the scope of an sstable. Instead of storing the obsolete bit in the key kind, could we store it in the new Pebblev3 table format's value handle? The obsolescence of a snapshot-pinned key is semantically similar to the obsolescence of an old MVCC key. It's also equally beneficial to store an obsolete SET's value out-of-band. If I recall correctly, there's an extra bit or two in the value handle byte that could be used. DELs are value-less today, but to implement #2340 they (or a new tombstone key kind) will need to carry the length of the deleted value provided by the user.

jbowens · 2023-04-25T14:24:05Z

Thinking about this more, I think we should consider focusing our attention on limiting the amount of obsolete data we accumulate (eg, #1810). In Cockroach at least, I expect most of the obsolete data is unnecessary since we only need to snapshot a range at a time. Even if we can efficiently skip over the garbage we accumulate, we're still suffering the unnecessary write bandwidth and CPU cost of building larger tables. If we reduce the frequency of snapshot-pinned data, I think #2463 becomes more attractive.

sumeerbhola · 2023-04-25T18:45:20Z

Thinking about this more, I think we should consider focusing our attention on limiting the amount of obsolete data we accumulate

I don't disagree. What attracts me to the approach here is that it also provides a guarantee of single non-locally-deleted InternalKey per userkey in a sst, and pre-application of same sst rangedels. This guarantee means we can do foreign sst iteration with a minor change to the sst iterators instead of less efficient wrappers.

jbowens · 2023-04-25T20:19:46Z

Makes sense. Doing #2463 or the like would also keep the sst iterators efficient, right? If we can make snapshot-pinned keys exceedingly rare, refusing to allow sstables with locally-deleted keys to be shared becomes much more palatable.

sumeerbhola · 2023-04-27T03:29:50Z

Doing #2463 or the like would also keep the sst iterators efficient, right?

Agreed.
I do have a vague fear of creating a dependency from snapshots to what compactions are allowed, resulting in some level becoming very large. Also, small LSMs (rare for production, except for new stores, but maybe more common for benchmarking) can have L5 or L6 as Lbase, so we could potentially delay compactions out from L0.

sumeerbhola · 2023-05-11T13:01:54Z

I was thinking again about #2463 vs this issue. #2463 does require narrower snapshots as a prerequisite and I still have my "vague fear" from my previous comment.

Another question is if/when we do stateless read replicas, we will have foreign ssts for all levels, for a range. They will come from the same other-LSM, so the L0-L4 will have been local for that other-LSM and we can just use these seqnums. A different scheme would be to use manufactured seqnums for all levels, which we can do when there is the "guarantee of single non-locally-deleted InternalKey per userkey in a sst".

This guarantee seems generally useful to have, since it allows us the flexibility of even sharing L4 ssts for rebalancing e.g. if we have 5GB ranges, then L6, L5, L4 have 4.5GB, 450MB, 45MB (and L4 has likely more than that since the adjusted level multipliers is < 10x). The L5/L6 sharing was partially pulled out of a hat -- it would be nice if we did not have to hard code this behavior because adding additional shared levels had a detrimental effect on compactions.

jbowens · 2023-05-11T21:36:02Z

All user-facing reads in CockroachDB use iterators over the DB, hence have a higher read seqnum than all sstables

nit: I think it's possible to race with an ingest during iterator construction, obtaining a LSM version that includes the file but a visible sequence number that excludes the file:

Ingest acquires sequence number S.
Ingest acquires manifest lock and installs its version that includes the ingested sstable T.
Iterator construction acquires read state containing T.
Iterator reads current visible sequence number, < S.
Ingest finishes applications and publishes S.

This commit defines but leaves unused a new sstable table Pebblev4 that will subsume Pebblev2 and Pebblev3 sstable formats. Future work (cockroachdb#2465, cockroachdb#2340) will require additional sstable table formats. Stabilization of the table format extensions introduced in the Pebblev3 table format is required before that can happen. The new Pebblev4 format will include Pebblev3's extensions. This commit adjusts the code to not respect the Experimental.EnableValueBlocks setting in future format major versions that make use of the Pebblev4 sstable format. There's some subtlety involved in this change, hence the introduction of the Pebblev4 sstable format before it's used.

This commit defines but leaves unused a new sstable table Pebblev4 that will subsume Pebblev2 and Pebblev3 sstable formats. Future work (#2465, #2340) will require additional sstable table formats. Stabilization of the table format extensions introduced in the Pebblev3 table format is required before that can happen. The new Pebblev4 format will include Pebblev3's extensions. This commit adjusts the code to not respect the Experimental.EnableValueBlocks setting in future format major versions that make use of the Pebblev4 sstable format. There's some subtlety involved in this change, hence the introduction of the Pebblev4 sstable format before it's used.

This bit marks keys that are obsolete because they are not the newest seqnum for a user key (in that sstable), or they are masked by a RANGEDEL. Setting the obsolete bit on point keys is advanced usage, so we support 2 modes, both of which must be truthful when setting the obsolete bit, but vary in when they don't set the obsolete bit. - Non-strict: In this mode, the bit does not need to be set for keys that are obsolete. Additionally, any sstable containing MERGE keys can only use this mode. An iterator over such an sstable, when configured to hideObsoletePoints, can expose multiple internal keys per user key, and can expose keys that are deleted by rangedels in the same sstable. This is the mode that non-advanced users should use. Pebble without disaggregated storage will also use this mode and will best-effort set the obsolete bit, to optimize iteration when snapshots have retained many obsolete keys. - Strict: In this mode, every obsolete key must have the obsolete bit set, and no MERGE keys are permitted. An iterator over such an sstable, when configured to hideObsoletePoints satisfies two properties: - S1: will expose at most one internal key per user key, which is the most recent one. - S2: will never expose keys that are deleted by rangedels in the same sstable. This is the mode for two use cases in disaggregated storage (which will exclude parts of the key space that has MERGEs), for levels that contain sstables that can become foreign sstables. - Pebble compaction output to these levels that can become foreign sstables. - CockroachDB ingest operations that can ingest into the levels that can become foreign sstables. Note, these are not sstables corresponding to copied data for CockroachDB range snapshots. This case occurs for operations like index backfills: these trivially satisfy the strictness criteria since they only write one key per userkey. The strictness of the sstable is written to the Properties block. The Writer implementation discovers keys that are obsolete because they are the same userkey as the previous key. This can be cheaply done since we already do user key comparisons in the Writer. For keys obsoleted by RANGEDELs, the Writer relies on the caller. On the read path, the obsolete bit is removed by the blockIter. Since everything reading an sstable uses a blockIter, this prevents any leakage of this bit. Some effort was made to reduce the regression on the iteration path, but TableIterNext has +5.84% regression. Some of the slowdown is clawed back by improvements to Seek (e.g. SeekGE is now faster). old is master: name old time/op new time/op delta BlockIterSeekGE/restart=16-16 474ns ± 1% 450ns ± 1% -5.16% (p=0.000 n=10+10) BlockIterSeekLT/restart=16-16 520ns ± 0% 526ns ± 0% +1.20% (p=0.000 n=10+10) BlockIterNext/restart=16-16 19.3ns ± 1% 21.0ns ± 0% +8.76% (p=0.000 n=10+10) BlockIterPrev/restart=16-16 38.7ns ± 1% 39.9ns ± 0% +3.20% (p=0.000 n=9+9) TableIterSeekGE/restart=16,compression=Snappy-16 1.65µs ± 1% 1.61µs ± 3% -2.24% (p=0.000 n=9+10) TableIterSeekGE/restart=16,compression=ZSTD-16 1.67µs ± 3% 1.58µs ± 3% -5.11% (p=0.000 n=10+10) TableIterSeekLT/restart=16,compression=Snappy-16 1.75µs ± 3% 1.68µs ± 2% -4.14% (p=0.000 n=10+9) TableIterSeekLT/restart=16,compression=ZSTD-16 1.74µs ± 3% 1.69µs ± 3% -2.54% (p=0.001 n=10+10) TableIterNext/restart=16,compression=Snappy-16 23.9ns ± 1% 25.3ns ± 0% +5.84% (p=0.000 n=10+10) TableIterNext/restart=16,compression=ZSTD-16 23.9ns ± 1% 25.3ns ± 0% +5.78% (p=0.000 n=10+10) TableIterPrev/restart=16,compression=Snappy-16 45.2ns ± 1% 46.2ns ± 1% +2.09% (p=0.000 n=10+10) TableIterPrev/restart=16,compression=ZSTD-16 45.3ns ± 0% 46.3ns ± 0% +2.23% (p=0.000 n=8+9) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=false-16 51.7ns ± 1% 55.2ns ± 4% +6.82% (p=0.000 n=10+10) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=true-16 54.9ns ± 1% 56.4ns ± 3% +2.73% (p=0.000 n=10+10) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=false-16 35.0ns ± 1% 34.8ns ± 1% -0.56% (p=0.037 n=10+10) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=true-16 37.8ns ± 0% 38.0ns ± 1% +0.55% (p=0.018 n=9+10) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=false-16 41.5ns ± 2% 42.4ns ± 1% +2.18% (p=0.000 n=10+10) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=true-16 94.7ns ± 4% 97.0ns ± 8% ~ (p=0.133 n=9+10) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=false-16 35.4ns ± 2% 36.5ns ± 1% +2.97% (p=0.000 n=10+8) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=true-16 60.1ns ± 1% 57.8ns ± 0% -3.84% (p=0.000 n=9+9) IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=false-16 135ns ± 1% 136ns ± 1% +0.44% (p=0.009 n=9+10) IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=true-16 139ns ± 0% 139ns ± 0% +0.48% (p=0.000 n=10+8) IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=false-16 34.8ns ± 1% 35.5ns ± 2% +2.12% (p=0.000 n=9+10) IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=true-16 37.6ns ± 0% 38.6ns ± 1% +2.53% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=false-16 215ns ± 1% 216ns ± 0% ~ (p=0.341 n=10+10) IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=true-16 220ns ± 1% 220ns ± 0% ~ (p=0.983 n=10+8) IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=false-16 41.6ns ± 1% 42.6ns ± 2% +2.42% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=true-16 44.6ns ± 1% 45.6ns ± 1% +2.28% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=false-16 2.16µs ± 0% 2.06µs ± 1% -4.27% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=true-16 2.15µs ± 1% 2.07µs ± 0% -3.71% (p=0.000 n=9+10) IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=false-16 94.1ns ± 1% 95.9ns ± 2% +1.94% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=true-16 97.5ns ± 1% 98.2ns ± 1% +0.69% (p=0.023 n=10+10) IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=false-16 2.81µs ± 1% 2.66µs ± 1% -5.29% (p=0.000 n=9+10) IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=true-16 2.82µs ± 1% 2.67µs ± 0% -5.47% (p=0.000 n=8+10) IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=false-16 689ns ± 4% 652ns ± 5% -5.32% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=true-16 694ns ± 2% 657ns ± 1% -5.28% (p=0.000 n=10+8) Looking at mergingIter, the Next regression seems tolerable, and SeekGE is better. name old time/op new time/op delta MergingIterSeekGE/restart=16/count=1-16 1.25µs ± 3% 1.15µs ± 1% -8.51% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=2-16 2.49µs ± 2% 2.28µs ± 2% -8.39% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=3-16 3.82µs ± 3% 3.57µs ± 1% -6.54% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=4-16 5.31µs ± 2% 4.86µs ± 2% -8.39% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=5-16 6.88µs ± 1% 6.36µs ± 2% -7.49% (p=0.000 n=10+10) MergingIterNext/restart=16/count=1-16 46.0ns ± 1% 46.6ns ± 1% +1.13% (p=0.000 n=10+10) MergingIterNext/restart=16/count=2-16 72.8ns ± 1% 73.0ns ± 0% ~ (p=0.363 n=10+10) MergingIterNext/restart=16/count=3-16 93.5ns ± 0% 93.1ns ± 1% ~ (p=0.507 n=10+9) MergingIterNext/restart=16/count=4-16 104ns ± 0% 104ns ± 1% ~ (p=0.078 n=8+10) MergingIterNext/restart=16/count=5-16 121ns ± 1% 121ns ± 1% -0.52% (p=0.008 n=10+10) MergingIterPrev/restart=16/count=1-16 66.6ns ± 1% 67.8ns ± 1% +1.81% (p=0.000 n=10+10) MergingIterPrev/restart=16/count=2-16 93.2ns ± 0% 94.4ns ± 1% +1.24% (p=0.000 n=10+10) MergingIterPrev/restart=16/count=3-16 114ns ± 0% 114ns ± 1% +0.36% (p=0.032 n=9+10) MergingIterPrev/restart=16/count=4-16 122ns ± 1% 123ns ± 0% +0.41% (p=0.014 n=10+9) MergingIterPrev/restart=16/count=5-16 138ns ± 1% 138ns ± 0% +0.52% (p=0.012 n=10+10) MergingIterSeqSeekGEWithBounds/levelCount=5-16 572ns ± 1% 572ns ± 0% ~ (p=0.842 n=10+9) MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16 1.85µs ± 1% 1.76µs ± 1% -4.85% (p=0.000 n=10+9) MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16 443ns ± 0% 444ns ± 1% ~ (p=0.255 n=10+10) MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16 1.86µs ± 1% 1.77µs ± 1% -4.63% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16 486ns ± 1% 482ns ± 1% -0.80% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16 1.93µs ± 1% 1.83µs ± 1% -4.95% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16 570ns ± 0% 567ns ± 2% -0.47% (p=0.020 n=10+10) MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16 2.12µs ± 0% 2.03µs ± 1% -4.38% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16 1.43µs ± 1% 1.39µs ± 1% -2.57% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16 2.28µs ± 1% 2.18µs ± 0% -4.54% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16 1.59µs ± 1% 1.53µs ± 1% -3.71% (p=0.000 n=10+9) Finally, a read benchmark where all except the first key is obsolete shows improvement. BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=false-10 36 32300029 ns/op 2 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=true-10 36 32418979 ns/op 3 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=false-10 82 13357163 ns/op 1 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=true-10 90 13256770 ns/op 1 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=false-10 36 32396367 ns/op 2 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=true-10 26086 46095 ns/op 0 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=false-10 88 13226711 ns/op 1 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=true-10 39171 30618 ns/op 0 B/op 0 allocs/op Informs cockroachdb#2465

sumeerbhola · 2023-05-30T19:15:03Z

Currently there is no support for MERGE keys when writing strict sstables with the obsolete bit, since it is non-trivial, and we can exclude all CockroachDB keys up to SystemMax from disaggregated storage. This exclusion contains the timeseries data in CockroachDB, which is the only use of MERGE (and there is no plan to expand the use of MERGE).

@jbowens had the following idea if we later need to add support for MERGE: we could use add a PREMERGE key kind that holds the pre-merged operands and write both the PREMERGE (live) key and the individual MERGE (obsolete bit set) operands (including for the most recent operand). A sstable iterator would be required to return either only the contained PREMERGEs (if the iterator's seqnum is higher than the sstable's LargestSeqNum) or only the contained MERGEs (if not). PREMERGE would otherwise behave just like MERGE, (or the sstable iterator could transparently edit the kind to MERGE before surfacing).

This bit marks keys that are obsolete because they are not the newest seqnum for a user key (in that sstable), or they are masked by a RANGEDEL. Setting the obsolete bit on point keys is advanced usage, so we support 2 modes, both of which must be truthful when setting the obsolete bit, but vary in when they don't set the obsolete bit. - Non-strict: In this mode, the bit does not need to be set for keys that are obsolete. Additionally, any sstable containing MERGE keys can only use this mode. An iterator over such an sstable, when configured to hideObsoletePoints, can expose multiple internal keys per user key, and can expose keys that are deleted by rangedels in the same sstable. This is the mode that non-advanced users should use. Pebble without disaggregated storage will also use this mode and will best-effort set the obsolete bit, to optimize iteration when snapshots have retained many obsolete keys. - Strict: In this mode, every obsolete key must have the obsolete bit set, and no MERGE keys are permitted. An iterator over such an sstable, when configured to hideObsoletePoints satisfies two properties: - S1: will expose at most one internal key per user key, which is the most recent one. - S2: will never expose keys that are deleted by rangedels in the same sstable. This is the mode for two use cases in disaggregated storage (which will exclude parts of the key space that has MERGEs), for levels that contain sstables that can become foreign sstables. - Pebble compaction output to these levels that can become foreign sstables. - CockroachDB ingest operations that can ingest into the levels that can become foreign sstables. Note, these are not sstables corresponding to copied data for CockroachDB range snapshots. This case occurs for operations like index backfills: these trivially satisfy the strictness criteria since they only write one key per userkey. The strictness of the sstable is written to the Properties block. The Writer implementation discovers keys that are obsolete because they are the same userkey as the previous key. This can be cheaply done since we already do user key comparisons in the Writer. For keys obsoleted by RANGEDELs, the Writer relies on the caller. On the read path, the obsolete bit is removed by the blockIter. Since everything reading an sstable uses a blockIter, this prevents any leakage of this bit. Some effort was made to reduce the regression on the iteration path, but TableIterNext has +5.84% regression. Some of the slowdown is clawed back by improvements to Seek (e.g. SeekGE is now faster). old is master: name old time/op new time/op delta BlockIterSeekGE/restart=16-16 474ns ± 1% 450ns ± 1% -5.16% (p=0.000 n=10+10) BlockIterSeekLT/restart=16-16 520ns ± 0% 526ns ± 0% +1.20% (p=0.000 n=10+10) BlockIterNext/restart=16-16 19.3ns ± 1% 21.0ns ± 0% +8.76% (p=0.000 n=10+10) BlockIterPrev/restart=16-16 38.7ns ± 1% 39.9ns ± 0% +3.20% (p=0.000 n=9+9) TableIterSeekGE/restart=16,compression=Snappy-16 1.65µs ± 1% 1.61µs ± 3% -2.24% (p=0.000 n=9+10) TableIterSeekGE/restart=16,compression=ZSTD-16 1.67µs ± 3% 1.58µs ± 3% -5.11% (p=0.000 n=10+10) TableIterSeekLT/restart=16,compression=Snappy-16 1.75µs ± 3% 1.68µs ± 2% -4.14% (p=0.000 n=10+9) TableIterSeekLT/restart=16,compression=ZSTD-16 1.74µs ± 3% 1.69µs ± 3% -2.54% (p=0.001 n=10+10) TableIterNext/restart=16,compression=Snappy-16 23.9ns ± 1% 25.3ns ± 0% +5.84% (p=0.000 n=10+10) TableIterNext/restart=16,compression=ZSTD-16 23.9ns ± 1% 25.3ns ± 0% +5.78% (p=0.000 n=10+10) TableIterPrev/restart=16,compression=Snappy-16 45.2ns ± 1% 46.2ns ± 1% +2.09% (p=0.000 n=10+10) TableIterPrev/restart=16,compression=ZSTD-16 45.3ns ± 0% 46.3ns ± 0% +2.23% (p=0.000 n=8+9) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=false-16 51.7ns ± 1% 55.2ns ± 4% +6.82% (p=0.000 n=10+10) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=true-16 54.9ns ± 1% 56.4ns ± 3% +2.73% (p=0.000 n=10+10) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=false-16 35.0ns ± 1% 34.8ns ± 1% -0.56% (p=0.037 n=10+10) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=true-16 37.8ns ± 0% 38.0ns ± 1% +0.55% (p=0.018 n=9+10) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=false-16 41.5ns ± 2% 42.4ns ± 1% +2.18% (p=0.000 n=10+10) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=true-16 94.7ns ± 4% 97.0ns ± 8% ~ (p=0.133 n=9+10) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=false-16 35.4ns ± 2% 36.5ns ± 1% +2.97% (p=0.000 n=10+8) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=true-16 60.1ns ± 1% 57.8ns ± 0% -3.84% (p=0.000 n=9+9) IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=false-16 135ns ± 1% 136ns ± 1% +0.44% (p=0.009 n=9+10) IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=true-16 139ns ± 0% 139ns ± 0% +0.48% (p=0.000 n=10+8) IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=false-16 34.8ns ± 1% 35.5ns ± 2% +2.12% (p=0.000 n=9+10) IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=true-16 37.6ns ± 0% 38.6ns ± 1% +2.53% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=false-16 215ns ± 1% 216ns ± 0% ~ (p=0.341 n=10+10) IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=true-16 220ns ± 1% 220ns ± 0% ~ (p=0.983 n=10+8) IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=false-16 41.6ns ± 1% 42.6ns ± 2% +2.42% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=true-16 44.6ns ± 1% 45.6ns ± 1% +2.28% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=false-16 2.16µs ± 0% 2.06µs ± 1% -4.27% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=true-16 2.15µs ± 1% 2.07µs ± 0% -3.71% (p=0.000 n=9+10) IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=false-16 94.1ns ± 1% 95.9ns ± 2% +1.94% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=true-16 97.5ns ± 1% 98.2ns ± 1% +0.69% (p=0.023 n=10+10) IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=false-16 2.81µs ± 1% 2.66µs ± 1% -5.29% (p=0.000 n=9+10) IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=true-16 2.82µs ± 1% 2.67µs ± 0% -5.47% (p=0.000 n=8+10) IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=false-16 689ns ± 4% 652ns ± 5% -5.32% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=true-16 694ns ± 2% 657ns ± 1% -5.28% (p=0.000 n=10+8) Looking at mergingIter, the Next regression seems tolerable, and SeekGE is better. name old time/op new time/op delta MergingIterSeekGE/restart=16/count=1-16 1.25µs ± 3% 1.15µs ± 1% -8.51% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=2-16 2.49µs ± 2% 2.28µs ± 2% -8.39% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=3-16 3.82µs ± 3% 3.57µs ± 1% -6.54% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=4-16 5.31µs ± 2% 4.86µs ± 2% -8.39% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=5-16 6.88µs ± 1% 6.36µs ± 2% -7.49% (p=0.000 n=10+10) MergingIterNext/restart=16/count=1-16 46.0ns ± 1% 46.6ns ± 1% +1.13% (p=0.000 n=10+10) MergingIterNext/restart=16/count=2-16 72.8ns ± 1% 73.0ns ± 0% ~ (p=0.363 n=10+10) MergingIterNext/restart=16/count=3-16 93.5ns ± 0% 93.1ns ± 1% ~ (p=0.507 n=10+9) MergingIterNext/restart=16/count=4-16 104ns ± 0% 104ns ± 1% ~ (p=0.078 n=8+10) MergingIterNext/restart=16/count=5-16 121ns ± 1% 121ns ± 1% -0.52% (p=0.008 n=10+10) MergingIterPrev/restart=16/count=1-16 66.6ns ± 1% 67.8ns ± 1% +1.81% (p=0.000 n=10+10) MergingIterPrev/restart=16/count=2-16 93.2ns ± 0% 94.4ns ± 1% +1.24% (p=0.000 n=10+10) MergingIterPrev/restart=16/count=3-16 114ns ± 0% 114ns ± 1% +0.36% (p=0.032 n=9+10) MergingIterPrev/restart=16/count=4-16 122ns ± 1% 123ns ± 0% +0.41% (p=0.014 n=10+9) MergingIterPrev/restart=16/count=5-16 138ns ± 1% 138ns ± 0% +0.52% (p=0.012 n=10+10) MergingIterSeqSeekGEWithBounds/levelCount=5-16 572ns ± 1% 572ns ± 0% ~ (p=0.842 n=10+9) MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16 1.85µs ± 1% 1.76µs ± 1% -4.85% (p=0.000 n=10+9) MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16 443ns ± 0% 444ns ± 1% ~ (p=0.255 n=10+10) MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16 1.86µs ± 1% 1.77µs ± 1% -4.63% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16 486ns ± 1% 482ns ± 1% -0.80% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16 1.93µs ± 1% 1.83µs ± 1% -4.95% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16 570ns ± 0% 567ns ± 2% -0.47% (p=0.020 n=10+10) MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16 2.12µs ± 0% 2.03µs ± 1% -4.38% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16 1.43µs ± 1% 1.39µs ± 1% -2.57% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16 2.28µs ± 1% 2.18µs ± 0% -4.54% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16 1.59µs ± 1% 1.53µs ± 1% -3.71% (p=0.000 n=10+9) Finally, a read benchmark where all except the first key is obsolete shows improvement. BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=false-10 36 32300029 ns/op 2 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=true-10 36 32418979 ns/op 3 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=false-10 82 13357163 ns/op 1 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=true-10 90 13256770 ns/op 1 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=false-10 36 32396367 ns/op 2 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=true-10 26086 46095 ns/op 0 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=false-10 88 13226711 ns/op 1 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=true-10 39171 30618 ns/op 0 B/op 0 allocs/op Informs cockroachdb#2465

This bit marks keys that are obsolete because they are not the newest seqnum for a user key (in that sstable), or they are masked by a RANGEDEL. Setting the obsolete bit on point keys is advanced usage, so we support 2 modes, both of which must be truthful when setting the obsolete bit, but vary in when they don't set the obsolete bit. - Non-strict: In this mode, the bit does not need to be set for keys that are obsolete. Additionally, any sstable containing MERGE keys can only use this mode. An iterator over such an sstable, when configured to hideObsoletePoints, can expose multiple internal keys per user key, and can expose keys that are deleted by rangedels in the same sstable. This is the mode that non-advanced users should use. Pebble without disaggregated storage will also use this mode and will best-effort set the obsolete bit, to optimize iteration when snapshots have retained many obsolete keys. - Strict: In this mode, every obsolete key must have the obsolete bit set, and no MERGE keys are permitted. An iterator over such an sstable, when configured to hideObsoletePoints satisfies two properties: - S1: will expose at most one internal key per user key, which is the most recent one. - S2: will never expose keys that are deleted by rangedels in the same sstable. This is the mode for two use cases in disaggregated storage (which will exclude parts of the key space that has MERGEs), for levels that contain sstables that can become foreign sstables. - Pebble compaction output to these levels that can become foreign sstables. - CockroachDB ingest operations that can ingest into the levels that can become foreign sstables. Note, these are not sstables corresponding to copied data for CockroachDB range snapshots. This case occurs for operations like index backfills: these trivially satisfy the strictness criteria since they only write one key per userkey. The strictness of the sstable is written to the Properties block. The Writer implementation discovers keys that are obsolete because they are the same userkey as the previous key. This can be cheaply done since we already do user key comparisons in the Writer. For keys obsoleted by RANGEDELs, the Writer relies on the caller. On the read path, the obsolete bit is removed by the blockIter. Since everything reading an sstable uses a blockIter, this prevents any leakage of this bit. Some effort was made to reduce the regression on the iteration path, but TableIterNext has +5.84% regression. Some of the slowdown is clawed back by improvements to Seek (e.g. SeekGE is now faster). old is master: name old time/op new time/op delta BlockIterSeekGE/restart=16-16 474ns ± 1% 450ns ± 1% -5.16% (p=0.000 n=10+10) BlockIterSeekLT/restart=16-16 520ns ± 0% 526ns ± 0% +1.20% (p=0.000 n=10+10) BlockIterNext/restart=16-16 19.3ns ± 1% 21.0ns ± 0% +8.76% (p=0.000 n=10+10) BlockIterPrev/restart=16-16 38.7ns ± 1% 39.9ns ± 0% +3.20% (p=0.000 n=9+9) TableIterSeekGE/restart=16,compression=Snappy-16 1.65µs ± 1% 1.61µs ± 3% -2.24% (p=0.000 n=9+10) TableIterSeekGE/restart=16,compression=ZSTD-16 1.67µs ± 3% 1.58µs ± 3% -5.11% (p=0.000 n=10+10) TableIterSeekLT/restart=16,compression=Snappy-16 1.75µs ± 3% 1.68µs ± 2% -4.14% (p=0.000 n=10+9) TableIterSeekLT/restart=16,compression=ZSTD-16 1.74µs ± 3% 1.69µs ± 3% -2.54% (p=0.001 n=10+10) TableIterNext/restart=16,compression=Snappy-16 23.9ns ± 1% 25.3ns ± 0% +5.84% (p=0.000 n=10+10) TableIterNext/restart=16,compression=ZSTD-16 23.9ns ± 1% 25.3ns ± 0% +5.78% (p=0.000 n=10+10) TableIterPrev/restart=16,compression=Snappy-16 45.2ns ± 1% 46.2ns ± 1% +2.09% (p=0.000 n=10+10) TableIterPrev/restart=16,compression=ZSTD-16 45.3ns ± 0% 46.3ns ± 0% +2.23% (p=0.000 n=8+9) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=false-16 51.7ns ± 1% 55.2ns ± 4% +6.82% (p=0.000 n=10+10) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=true-16 54.9ns ± 1% 56.4ns ± 3% +2.73% (p=0.000 n=10+10) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=false-16 35.0ns ± 1% 34.8ns ± 1% -0.56% (p=0.037 n=10+10) IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=true-16 37.8ns ± 0% 38.0ns ± 1% +0.55% (p=0.018 n=9+10) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=false-16 41.5ns ± 2% 42.4ns ± 1% +2.18% (p=0.000 n=10+10) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=true-16 94.7ns ± 4% 97.0ns ± 8% ~ (p=0.133 n=9+10) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=false-16 35.4ns ± 2% 36.5ns ± 1% +2.97% (p=0.000 n=10+8) IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=true-16 60.1ns ± 1% 57.8ns ± 0% -3.84% (p=0.000 n=9+9) IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=false-16 135ns ± 1% 136ns ± 1% +0.44% (p=0.009 n=9+10) IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=true-16 139ns ± 0% 139ns ± 0% +0.48% (p=0.000 n=10+8) IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=false-16 34.8ns ± 1% 35.5ns ± 2% +2.12% (p=0.000 n=9+10) IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=true-16 37.6ns ± 0% 38.6ns ± 1% +2.53% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=false-16 215ns ± 1% 216ns ± 0% ~ (p=0.341 n=10+10) IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=true-16 220ns ± 1% 220ns ± 0% ~ (p=0.983 n=10+8) IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=false-16 41.6ns ± 1% 42.6ns ± 2% +2.42% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=true-16 44.6ns ± 1% 45.6ns ± 1% +2.28% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=false-16 2.16µs ± 0% 2.06µs ± 1% -4.27% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=true-16 2.15µs ± 1% 2.07µs ± 0% -3.71% (p=0.000 n=9+10) IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=false-16 94.1ns ± 1% 95.9ns ± 2% +1.94% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=true-16 97.5ns ± 1% 98.2ns ± 1% +0.69% (p=0.023 n=10+10) IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=false-16 2.81µs ± 1% 2.66µs ± 1% -5.29% (p=0.000 n=9+10) IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=true-16 2.82µs ± 1% 2.67µs ± 0% -5.47% (p=0.000 n=8+10) IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=false-16 689ns ± 4% 652ns ± 5% -5.32% (p=0.000 n=10+10) IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=true-16 694ns ± 2% 657ns ± 1% -5.28% (p=0.000 n=10+8) Looking at mergingIter, the Next regression seems tolerable, and SeekGE is better. name old time/op new time/op delta MergingIterSeekGE/restart=16/count=1-16 1.25µs ± 3% 1.15µs ± 1% -8.51% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=2-16 2.49µs ± 2% 2.28µs ± 2% -8.39% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=3-16 3.82µs ± 3% 3.57µs ± 1% -6.54% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=4-16 5.31µs ± 2% 4.86µs ± 2% -8.39% (p=0.000 n=10+10) MergingIterSeekGE/restart=16/count=5-16 6.88µs ± 1% 6.36µs ± 2% -7.49% (p=0.000 n=10+10) MergingIterNext/restart=16/count=1-16 46.0ns ± 1% 46.6ns ± 1% +1.13% (p=0.000 n=10+10) MergingIterNext/restart=16/count=2-16 72.8ns ± 1% 73.0ns ± 0% ~ (p=0.363 n=10+10) MergingIterNext/restart=16/count=3-16 93.5ns ± 0% 93.1ns ± 1% ~ (p=0.507 n=10+9) MergingIterNext/restart=16/count=4-16 104ns ± 0% 104ns ± 1% ~ (p=0.078 n=8+10) MergingIterNext/restart=16/count=5-16 121ns ± 1% 121ns ± 1% -0.52% (p=0.008 n=10+10) MergingIterPrev/restart=16/count=1-16 66.6ns ± 1% 67.8ns ± 1% +1.81% (p=0.000 n=10+10) MergingIterPrev/restart=16/count=2-16 93.2ns ± 0% 94.4ns ± 1% +1.24% (p=0.000 n=10+10) MergingIterPrev/restart=16/count=3-16 114ns ± 0% 114ns ± 1% +0.36% (p=0.032 n=9+10) MergingIterPrev/restart=16/count=4-16 122ns ± 1% 123ns ± 0% +0.41% (p=0.014 n=10+9) MergingIterPrev/restart=16/count=5-16 138ns ± 1% 138ns ± 0% +0.52% (p=0.012 n=10+10) MergingIterSeqSeekGEWithBounds/levelCount=5-16 572ns ± 1% 572ns ± 0% ~ (p=0.842 n=10+9) MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16 1.85µs ± 1% 1.76µs ± 1% -4.85% (p=0.000 n=10+9) MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16 443ns ± 0% 444ns ± 1% ~ (p=0.255 n=10+10) MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16 1.86µs ± 1% 1.77µs ± 1% -4.63% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16 486ns ± 1% 482ns ± 1% -0.80% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16 1.93µs ± 1% 1.83µs ± 1% -4.95% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16 570ns ± 0% 567ns ± 2% -0.47% (p=0.020 n=10+10) MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16 2.12µs ± 0% 2.03µs ± 1% -4.38% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16 1.43µs ± 1% 1.39µs ± 1% -2.57% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16 2.28µs ± 1% 2.18µs ± 0% -4.54% (p=0.000 n=10+10) MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16 1.59µs ± 1% 1.53µs ± 1% -3.71% (p=0.000 n=10+9) Finally, a read benchmark where all except the first key is obsolete shows improvement. BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=false-10 36 32300029 ns/op 2 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=true-10 36 32418979 ns/op 3 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=false-10 82 13357163 ns/op 1 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=true-10 90 13256770 ns/op 1 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=false-10 36 32396367 ns/op 2 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=true-10 26086 46095 ns/op 0 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=false-10 88 13226711 ns/op 1 B/op 0 allocs/op BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=true-10 39171 30618 ns/op 0 B/op 0 allocs/op Informs #2465

This is to conform to the recently introduced performance recommendation in pebble.IterOptions. Informs cockroachdb/pebble#2465 Epic: none Release note: None

104560: storage: increase capacity of PointKeyFilters slice r=jbowens a=sumeerbhola This is to conform to the recently introduced performance recommendation in pebble.IterOptions. Informs cockroachdb/pebble#2465 Epic: none Release note: None Co-authored-by: sumeerbhola <[email protected]>

jbowens · 2023-07-19T19:19:02Z

@sumeerbhola Is this good to close out now that this is active on cockroach master?

sumeerbhola · 2023-07-19T19:45:33Z

Yes, I'll close it.
I've created #2756 for the disaggregated storage integration.

blathers-crl bot added A-storage T-storage labels Apr 19, 2023

sumeerbhola changed the title ~~sstable: sstable-local key types for overwritten/deleted points~~ sstable: sstable-local key kinds for overwritten/deleted points Apr 19, 2023

jbowens mentioned this issue May 16, 2023

db: add reserved Pebblev4 sstable format #2528

Merged

jbowens mentioned this issue May 17, 2023

*: Implement iterators and seqnum substitution for foreign SSTs #2455

Merged

sumeerbhola mentioned this issue May 22, 2023

db: add DeleteSized and corresponding compaction heuristics #2537

Merged

sumeerbhola mentioned this issue May 23, 2023

sstable,db: introduce a sstable-internal ObsoleteBit in the key kind #2559

Merged

sumeerbhola mentioned this issue Jun 7, 2023

storage: increase capacity of PointKeyFilters slice cockroachdb/cockroach#104560

Merged

jbowens assigned sumeerbhola Jun 26, 2023

sumeerbhola closed this as completed Jul 19, 2023

sumeerbhola mentioned this issue Dec 21, 2023

db: efficient skipping of points deleted by RANGEDEL in the same file #2424

Closed

sumeerbhola mentioned this issue Apr 29, 2024

Disallow compactions into L5/L6 for files that straddle a snapshot stripe #2463

Closed

jbowens added this to [Deprecated] Storage Jun 4, 2024

jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sstable: sstable-local key kinds for overwritten/deleted points #2465

sstable: sstable-local key kinds for overwritten/deleted points #2465

sumeerbhola commented Apr 19, 2023

jbowens commented Apr 19, 2023

jbowens commented Apr 25, 2023 •

edited

Loading

sumeerbhola commented Apr 25, 2023

jbowens commented Apr 25, 2023

sumeerbhola commented Apr 27, 2023

sumeerbhola commented May 11, 2023

jbowens commented May 11, 2023

sumeerbhola commented May 30, 2023

jbowens commented Jul 19, 2023

sumeerbhola commented Jul 19, 2023

sstable: sstable-local key kinds for overwritten/deleted points #2465

sstable: sstable-local key kinds for overwritten/deleted points #2465

Comments

sumeerbhola commented Apr 19, 2023

jbowens commented Apr 19, 2023

jbowens commented Apr 25, 2023 • edited Loading

sumeerbhola commented Apr 25, 2023

jbowens commented Apr 25, 2023

sumeerbhola commented Apr 27, 2023

sumeerbhola commented May 11, 2023

jbowens commented May 11, 2023

sumeerbhola commented May 30, 2023

jbowens commented Jul 19, 2023

sumeerbhola commented Jul 19, 2023

jbowens commented Apr 25, 2023 •

edited

Loading