sstable,db: introduce a sstable-internal ObsoleteBit in the key kind
This bit marks keys that are obsolete because they are not the newest
seqnum for a user key (in that sstable), or they are masked by a
RANGEDEL.

Setting the obsolete bit on point keys is advanced usage, so we support 2
modes. Both must be truthful when they set the obsolete bit, but they
differ in when they may leave it unset.
- Non-strict: In this mode, the bit does not need to be set for keys that
  are obsolete. Additionally, any sstable containing MERGE keys can only
  use this mode. An iterator over such an sstable, when configured to
  hideObsoletePoints, can expose multiple internal keys per user key, and
  can expose keys that are deleted by rangedels in the same sstable. This
  is the mode that non-advanced users should use. Pebble without
  disaggregated storage will also use this mode and will best-effort set
  the obsolete bit, to optimize iteration when snapshots have retained many
  obsolete keys.

- Strict: In this mode, every obsolete key must have the obsolete bit set,
  and no MERGE keys are permitted. An iterator over such an sstable, when
  configured to hideObsoletePoints, satisfies two properties:
  - S1: will expose at most one internal key per user key, which is the
    most recent one.
  - S2: will never expose keys that are deleted by rangedels in the same
    sstable.
  This is the mode for two use cases in disaggregated storage (which will
  exclude parts of the key space that have MERGEs), for levels that contain
  sstables that can become foreign sstables:
  - Pebble compaction output to these levels that can become foreign
    sstables.
  - CockroachDB ingest operations that can ingest into the levels that can
    become foreign sstables. Note, these are not sstables corresponding to
    copied data for CockroachDB range snapshots. This case occurs for
    operations like index backfills: these trivially satisfy the strictness
    criteria since they only write one key per userkey.

The strictness of the sstable is written to the Properties block.
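
To make the difference between the two modes concrete, here is a minimal
sketch of property S1. The entry type and the visible/satisfiesS1 helpers
are hypothetical names for illustration only; they are not part of Pebble.
The sketch assumes an iterator with hideObsoletePoints simply skips every
key whose obsolete bit is set.

```go
package main

import "fmt"

// entry is a hypothetical, simplified view of one point key in an sstable.
type entry struct {
	userKey  string
	seqNum   uint64
	obsolete bool // whether the writer set the obsolete bit on this key
}

// visible returns the entries exposed when obsolete points are hidden:
// every entry whose obsolete bit is not set.
func visible(entries []entry) []entry {
	var out []entry
	for _, e := range entries {
		if !e.obsolete {
			out = append(out, e)
		}
	}
	return out
}

// satisfiesS1 reports whether at most one entry per user key is exposed.
func satisfiesS1(entries []entry) bool {
	seen := map[string]bool{}
	for _, e := range entries {
		if seen[e.userKey] {
			return false
		}
		seen[e.userKey] = true
	}
	return true
}

func main() {
	// Strict: every obsolete key (a#5) carries the bit.
	strict := []entry{{"a", 9, false}, {"a", 5, true}, {"b", 7, false}}
	// Non-strict: a#5 is obsolete, but the writer did not mark it.
	nonStrict := []entry{{"a", 9, false}, {"a", 5, false}, {"b", 7, false}}

	fmt.Println(satisfiesS1(visible(strict)))    // true
	fmt.Println(satisfiesS1(visible(nonStrict))) // false
}
```

In the non-strict case the reader still has to deal with multiple versions
of a user key itself, which is why the bit is only an optimization there.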

The Writer implementation discovers keys that are obsolete because they
are the same userkey as the previous key. This can be cheaply done since
we already do user key comparisons in the Writer. For keys obsoleted by
RANGEDELs, the Writer relies on the caller.
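
A rough sketch of that split of responsibility follows. sketchWriter is a
hypothetical stand-in, not the real sstable.Writer; only the forceObsolete
parameter name is taken from Writer.AddWithForceObsolete in this change.
The sketch assumes keys arrive sorted by user key with the newest seqnum
first, and it ignores key kinds (e.g. MERGE) entirely.

```go
package main

import "fmt"

// sketchWriter is a hypothetical stand-in for the sstable Writer. It only
// remembers the previous user key that was added.
type sketchWriter struct {
	prevUserKey string
	hasPrev     bool
}

// addWithForceObsolete reports whether the key being added would be marked
// obsolete. The writer discovers "same user key as the previous key" on its
// own; obsolescence caused by a RANGEDEL must be passed in by the caller.
func (w *sketchWriter) addWithForceObsolete(userKey string, forceObsolete bool) bool {
	sameAsPrev := w.hasPrev && w.prevUserKey == userKey
	w.prevUserKey, w.hasPrev = userKey, true
	return sameAsPrev || forceObsolete
}

func main() {
	w := &sketchWriter{}
	fmt.Println(w.addWithForceObsolete("a", false)) // newest version of a
	fmt.Println(w.addWithForceObsolete("a", false)) // older version of a
	fmt.Println(w.addWithForceObsolete("b", true))  // caller: covered by a RANGEDEL
	fmt.Println(w.addWithForceObsolete("c", false)) // newest version of c
}
```

Passing forceObsolete=true for a key the writer would mark anyway is
harmless in this sketch, since the two conditions are simply ORed.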

On the read path, the obsolete bit is removed by the blockIter. Since
everything reading an sstable uses a blockIter, this prevents any leakage
of this bit. Some effort was made to reduce the regression on the
iteration path, but TableIterNext still has a +5.84% regression. Some of
the slowdown is clawed back by improvements to Seek (e.g. SeekGE is now
faster).
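
For illustration, the bit manipulation involved can be sketched as below.
The constant values (23, 64, 191) come from internal/base/internal.go in
this change, and the trailer layout (seqnum << 8 | kind) matches how
trailers are built there. The helper names are hypothetical, and this is
not how blockIter is actually written.

```go
package main

import "fmt"

const (
	kindDeleteSized uint8 = 23  // InternalKeyKindDeleteSized
	obsoleteBit     uint8 = 64  // InternalKeyKindSSTableInternalObsoleteBit
	obsoleteMask    uint8 = 191 // InternalKeyKindSSTableInternalObsoleteMask
)

// makeTrailer packs a sequence number and a key kind into a trailer.
func makeTrailer(seqNum uint64, kind uint8) uint64 {
	return seqNum<<8 | uint64(kind)
}

// markObsolete sets the obsolete bit in the kind byte (write path).
func markObsolete(trailer uint64) uint64 {
	return trailer | uint64(obsoleteBit)
}

// stripObsolete clears the obsolete bit (read path) and reports whether it
// was set, so that callers never observe the bit in the key kind.
func stripObsolete(trailer uint64) (uint64, bool) {
	wasObsolete := trailer&uint64(obsoleteBit) != 0
	return trailer &^ uint64(obsoleteBit), wasObsolete
}

func main() {
	tr := makeTrailer(5, kindDeleteSized)
	marked := markObsolete(tr)
	stripped, wasObsolete := stripObsolete(marked)
	fmt.Println(tr, marked, stripped, wasObsolete)
	// The mask is the 8-bit complement of the bit.
	fmt.Println(^obsoleteBit == obsoleteMask)
}
```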

old is master:

name                                                                              old time/op    new time/op    delta
BlockIterSeekGE/restart=16-16                                                        474ns ± 1%     450ns ± 1%  -5.16%  (p=0.000 n=10+10)
BlockIterSeekLT/restart=16-16                                                        520ns ± 0%     526ns ± 0%  +1.20%  (p=0.000 n=10+10)
BlockIterNext/restart=16-16                                                         19.3ns ± 1%    21.0ns ± 0%  +8.76%  (p=0.000 n=10+10)
BlockIterPrev/restart=16-16                                                         38.7ns ± 1%    39.9ns ± 0%  +3.20%  (p=0.000 n=9+9)
TableIterSeekGE/restart=16,compression=Snappy-16                                    1.65µs ± 1%    1.61µs ± 3%  -2.24%  (p=0.000 n=9+10)
TableIterSeekGE/restart=16,compression=ZSTD-16                                      1.67µs ± 3%    1.58µs ± 3%  -5.11%  (p=0.000 n=10+10)
TableIterSeekLT/restart=16,compression=Snappy-16                                    1.75µs ± 3%    1.68µs ± 2%  -4.14%  (p=0.000 n=10+9)
TableIterSeekLT/restart=16,compression=ZSTD-16                                      1.74µs ± 3%    1.69µs ± 3%  -2.54%  (p=0.001 n=10+10)
TableIterNext/restart=16,compression=Snappy-16                                      23.9ns ± 1%    25.3ns ± 0%  +5.84%  (p=0.000 n=10+10)
TableIterNext/restart=16,compression=ZSTD-16                                        23.9ns ± 1%    25.3ns ± 0%  +5.78%  (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=Snappy-16                                      45.2ns ± 1%    46.2ns ± 1%  +2.09%  (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=ZSTD-16                                        45.3ns ± 0%    46.3ns ± 0%  +2.23%  (p=0.000 n=8+9)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=false-16     51.7ns ± 1%    55.2ns ± 4%  +6.82%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=true-16      54.9ns ± 1%    56.4ns ± 3%  +2.73%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=false-16    35.0ns ± 1%    34.8ns ± 1%  -0.56%  (p=0.037 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=true-16     37.8ns ± 0%    38.0ns ± 1%  +0.55%  (p=0.018 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=false-16     41.5ns ± 2%    42.4ns ± 1%  +2.18%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=true-16      94.7ns ± 4%    97.0ns ± 8%    ~     (p=0.133 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=false-16    35.4ns ± 2%    36.5ns ± 1%  +2.97%  (p=0.000 n=10+8)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=true-16     60.1ns ± 1%    57.8ns ± 0%  -3.84%  (p=0.000 n=9+9)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=false-16                 135ns ± 1%     136ns ± 1%  +0.44%  (p=0.009 n=9+10)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=true-16                  139ns ± 0%     139ns ± 0%  +0.48%  (p=0.000 n=10+8)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=false-16            34.8ns ± 1%    35.5ns ± 2%  +2.12%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=true-16             37.6ns ± 0%    38.6ns ± 1%  +2.53%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=false-16                 215ns ± 1%     216ns ± 0%    ~     (p=0.341 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=true-16                  220ns ± 1%     220ns ± 0%    ~     (p=0.983 n=10+8)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=false-16            41.6ns ± 1%    42.6ns ± 2%  +2.42%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=true-16             44.6ns ± 1%    45.6ns ± 1%  +2.28%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=false-16               2.16µs ± 0%    2.06µs ± 1%  -4.27%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=true-16                2.15µs ± 1%    2.07µs ± 0%  -3.71%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=false-16           94.1ns ± 1%    95.9ns ± 2%  +1.94%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=true-16            97.5ns ± 1%    98.2ns ± 1%  +0.69%  (p=0.023 n=10+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=false-16              2.81µs ± 1%    2.66µs ± 1%  -5.29%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=true-16               2.82µs ± 1%    2.67µs ± 0%  -5.47%  (p=0.000 n=8+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=false-16           689ns ± 4%     652ns ± 5%  -5.32%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=true-16            694ns ± 2%     657ns ± 1%  -5.28%  (p=0.000 n=10+8)

Looking at mergingIter, the Next regression seems tolerable, and SeekGE
is better.

name                                                  old time/op    new time/op    delta
MergingIterSeekGE/restart=16/count=1-16                 1.25µs ± 3%    1.15µs ± 1%  -8.51%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=2-16                 2.49µs ± 2%    2.28µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=3-16                 3.82µs ± 3%    3.57µs ± 1%  -6.54%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=4-16                 5.31µs ± 2%    4.86µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=5-16                 6.88µs ± 1%    6.36µs ± 2%  -7.49%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=1-16                   46.0ns ± 1%    46.6ns ± 1%  +1.13%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=2-16                   72.8ns ± 1%    73.0ns ± 0%    ~     (p=0.363 n=10+10)
MergingIterNext/restart=16/count=3-16                   93.5ns ± 0%    93.1ns ± 1%    ~     (p=0.507 n=10+9)
MergingIterNext/restart=16/count=4-16                    104ns ± 0%     104ns ± 1%    ~     (p=0.078 n=8+10)
MergingIterNext/restart=16/count=5-16                    121ns ± 1%     121ns ± 1%  -0.52%  (p=0.008 n=10+10)
MergingIterPrev/restart=16/count=1-16                   66.6ns ± 1%    67.8ns ± 1%  +1.81%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=2-16                   93.2ns ± 0%    94.4ns ± 1%  +1.24%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=3-16                    114ns ± 0%     114ns ± 1%  +0.36%  (p=0.032 n=9+10)
MergingIterPrev/restart=16/count=4-16                    122ns ± 1%     123ns ± 0%  +0.41%  (p=0.014 n=10+9)
MergingIterPrev/restart=16/count=5-16                    138ns ± 1%     138ns ± 0%  +0.52%  (p=0.012 n=10+10)
MergingIterSeqSeekGEWithBounds/levelCount=5-16           572ns ± 1%     572ns ± 0%    ~     (p=0.842 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16     1.85µs ± 1%    1.76µs ± 1%  -4.85%  (p=0.000 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16       443ns ± 0%     444ns ± 1%    ~     (p=0.255 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16     1.86µs ± 1%    1.77µs ± 1%  -4.63%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16       486ns ± 1%     482ns ± 1%  -0.80%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16     1.93µs ± 1%    1.83µs ± 1%  -4.95%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16       570ns ± 0%     567ns ± 2%  -0.47%  (p=0.020 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16     2.12µs ± 0%    2.03µs ± 1%  -4.38%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16      1.43µs ± 1%    1.39µs ± 1%  -2.57%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16    2.28µs ± 1%    2.18µs ± 0%  -4.54%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16     1.59µs ± 1%    1.53µs ± 1%  -3.71%  (p=0.000 n=10+9)

Finally, a read benchmark where all keys except the first are obsolete
shows improvement when obsolete points are hidden with format (Pebble,v4).

BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=false-10         	      36	  32300029 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=true-10          	      36	  32418979 ns/op	       3 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=false-10       	      82	  13357163 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=true-10        	      90	  13256770 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=false-10         	      36	  32396367 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=true-10          	   26086	     46095 ns/op	       0 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=false-10       	      88	  13226711 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=true-10        	   39171	     30618 ns/op	       0 B/op	       0 allocs/op
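
The size of that improvement can be read off the (Pebble,v4) rows by
dividing the hide-obsolete=false time by the hide-obsolete=true time. The
small program below just does that arithmetic on the ns/op figures above;
speedup is a hypothetical helper name.

```go
package main

import "fmt"

// speedup returns how many times faster the new time is than the old one.
func speedup(oldNs, newNs float64) float64 {
	return oldNs / newNs
}

func main() {
	// ns/op for format=(Pebble,v4), hide-obsolete=false vs. true.
	fmt.Printf("cache-size=1_B:   %.0fx\n", speedup(32396367, 46095))
	fmt.Printf("cache-size=150_M: %.0fx\n", speedup(13226711, 30618))
}
```

With (Pebble,v3) the two settings are within noise of each other, since
that format has no obsolete bit to exploit.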

Informs cockroachdb#2465
sumeerbhola committed Jun 5, 2023
1 parent f98d3df commit 152d19a
Showing 51 changed files with 1,760 additions and 570 deletions.
8 changes: 4 additions & 4 deletions batch.go
@@ -432,8 +432,8 @@ func (b *Batch) refreshMemTableSize() {
case InternalKeyKindRangeKeySet, InternalKeyKindRangeKeyUnset, InternalKeyKindRangeKeyDelete:
b.countRangeKeys++
case InternalKeyKindDeleteSized:
if b.minimumFormatMajorVersion < ExperimentalFormatDeleteSized {
b.minimumFormatMajorVersion = ExperimentalFormatDeleteSized
if b.minimumFormatMajorVersion < ExperimentalFormatDeleteSizedAndObsolete {
b.minimumFormatMajorVersion = ExperimentalFormatDeleteSizedAndObsolete
}
case InternalKeyKindIngestSST:
if b.minimumFormatMajorVersion < FormatFlushableIngest {
@@ -729,8 +729,8 @@ func (b *Batch) DeleteSized(key []byte, deletedValueSize uint32, _ *WriteOptions
// complete key slice, letting the caller encode into the DeferredBatchOp.Key
// slice and then call Finish() on the returned object.
func (b *Batch) DeleteSizedDeferred(keyLen int, deletedValueSize uint32) *DeferredBatchOp {
if b.minimumFormatMajorVersion < ExperimentalFormatDeleteSized {
b.minimumFormatMajorVersion = ExperimentalFormatDeleteSized
if b.minimumFormatMajorVersion < ExperimentalFormatDeleteSizedAndObsolete {
b.minimumFormatMajorVersion = ExperimentalFormatDeleteSizedAndObsolete
}

// Encode the sum of the key length and the value in the value.
9 changes: 8 additions & 1 deletion compaction.go
@@ -1415,6 +1415,10 @@ func (c *compaction) newInputIter(
iterOpts := IterOptions{logger: c.logger}
// TODO(bananabrick): Get rid of the extra manifest.Level parameter and fold it into
// compactionLevel.
//
// TODO(bilal): when we start using strict obsolete sstables for L5 and L6
// in disaggregated storage, and rely on the obsolete bit, we will also need
// to configure the levelIter at these levels to hide the obsolete points.
addItersForLevel := func(level *compactionLevel, l manifest.Level) error {
iters = append(iters, newLevelIter(iterOpts, c.cmp, nil /* split */, newIters,
level.files.Iter(), l, &c.bytesIterated))
@@ -3235,7 +3239,10 @@ func (d *DB) runCompaction(
return nil, pendingOutputs, stats, err
}
}
if err := tw.Add(*key, val); err != nil {
// iter.snapshotPinned is broader than whether the point was covered by
// a RANGEDEL, but it is harmless to pass true when the callee will also
// independently discover that the point is obsolete.
if err := tw.AddWithForceObsolete(*key, val, iter.snapshotPinned); err != nil {
return nil, pendingOutputs, stats, err
}
if iter.snapshotPinned {
26 changes: 24 additions & 2 deletions compaction_iter.go
@@ -206,6 +206,21 @@ type compactionIter struct {
// compaction iterator was only returned because an open snapshot prevents
// its elision. This field only applies to point keys, and not to range
// deletions or range keys.
//
// snapshotPinned is also used to set the forceObsolete value in the call to
// Writer.AddWithForceObsolete. Note that in that call, it is sufficient to
// mark all keys obsoleted by RANGEDELs as forceObsolete=true and that the
// implementation of Writer.AddWithForceObsolete will itself discover other
// causes of obsolescence. We mention this since in the presence of MERGE,
// obsolescence due to multiple keys at the same user key is not fully
// represented by snapshotPinned=true:
//
// For MERGE, it is possible that doing the merge is interrupted even when
// the next point key is in the same stripe. This can happen if the loop in
// mergeNext gets interrupted by sameStripeNonSkippable.
// sameStripeNonSkippable occurs due to RANGEDELs that sort before
// SET/MERGE/DEL with the same seqnum, so the RANGEDEL does not necessarily
// delete the subsequent SET/MERGE/DEL keys.
snapshotPinned bool
// The index of the snapshot for the current key within the snapshots slice.
curSnapshotIdx int
@@ -311,10 +326,10 @@ func (i *compactionIter) Next() (*InternalKey, []byte) {
// respect to `iterKey` and related state:
//
// - `!skip && pos == iterPosNext`: `iterKey` is already at the next key.
// - `!skip && pos == iterPosCur`: We are at the key that has been returned.
// - `!skip && pos == iterPosCurForward`: We are at the key that has been returned.
// To move forward we advance by one key, even if that lands us in the same
// snapshot stripe.
// - `skip && pos == iterPosCur`: We are at the key that has been returned.
// - `skip && pos == iterPosCurForward`: We are at the key that has been returned.
// To move forward we skip skippable entries in the stripe.
if i.pos == iterPosCurForward {
if i.skip {
@@ -383,6 +398,13 @@ func (i *compactionIter) Next() (*InternalKey, []byte) {
} else if cover == keyspan.CoversInvisibly {
// i.iterKey would be deleted by a range deletion if there weren't
// any open snapshots. Mark it as pinned.
//
// NB: there are multiple places in this file where we call
// i.rangeDelFrag.Covers and this is the only one where we are writing
// to i.snapshotPinned. Those other cases occur in mergeNext where the
// caller is deciding whether the value should be merged or not, and the
// key is in the same snapshot stripe. Hence, snapshotPinned is by
// definition false in those cases.
i.snapshotPinned = true
}

2 changes: 1 addition & 1 deletion compaction_iter_test.go
@@ -92,7 +92,7 @@ func TestCompactionIter(t *testing.T) {
if formatVersion < FormatSetWithDelete {
return "testdata/compaction_iter"
}
if formatVersion < ExperimentalFormatDeleteSized {
if formatVersion < ExperimentalFormatDeleteSizedAndObsolete {
return "testdata/compaction_iter_set_with_del"
}
return "testdata/compaction_iter_delete_sized"
4 changes: 2 additions & 2 deletions compaction_test.go
@@ -1607,11 +1607,11 @@ func TestManualCompaction(t *testing.T) {
{
testData: "testdata/manual_compaction_file_boundaries",
minVersion: FormatMostCompatible,
maxVersion: ExperimentalFormatDeleteSized - 1,
maxVersion: ExperimentalFormatDeleteSizedAndObsolete - 1,
},
{
testData: "testdata/manual_compaction_file_boundaries_delsized",
minVersion: ExperimentalFormatDeleteSized,
minVersion: ExperimentalFormatDeleteSizedAndObsolete,
maxVersion: internalFormatNewest,
},
}
1 change: 1 addition & 0 deletions db.go
@@ -1362,6 +1362,7 @@ func (i *Iterator) constructPointIter(
levelsIndex := len(levels)
mlevels = mlevels[:numMergingLevels]
levels = levels[:numLevelIters]
i.opts.snapshotForHideObsoletePoints = buf.dbi.seqNum
addLevelIterForFiles := func(files manifest.LevelIterator, level manifest.Level) {
li := &levels[levelsIndex]

18 changes: 9 additions & 9 deletions external_iterator.go
@@ -209,15 +209,15 @@ func createExternalPointIter(ctx context.Context, it *Iterator) (internalIterato
pointIter internalIterator
err error
)
pointIter, err = r.NewIterWithBlockPropertyFiltersAndContext(
ctx,
it.opts.LowerBound,
it.opts.UpperBound,
nil, /* BlockPropertiesFilterer */
false, /* useFilterBlock */
&it.stats.InternalStats,
sstable.TrivialReaderProvider{Reader: r},
)
// We could set hideObsoletePoints=true, since we are reading at
// InternalKeySeqNumMax, but we don't bother since these sstables should
// not have obsolete points (so the performance optimization is
// unnecessary), and we don't want to bother constructing a
// BlockPropertiesFilterer that includes obsoleteKeyBlockPropertyFilter.
pointIter, err = r.NewIterWithBlockPropertyFiltersAndContextEtc(
ctx, it.opts.LowerBound, it.opts.UpperBound, nil, /* BlockPropertiesFilterer */
false /* hideObsoletePoints */, false, /* useFilterBlock */
&it.stats.InternalStats, sstable.TrivialReaderProvider{Reader: r})
if err != nil {
return nil, err
}
20 changes: 11 additions & 9 deletions format_major_version.go
@@ -151,11 +151,13 @@ const (
// compactions for files marked for compaction are complete.
FormatPrePebblev1MarkedCompacted

// ExperimentalFormatDeleteSized is a format major version that adds support for
// deletion tombstones that encode the size of the value they're expected to
// delete. This format major version is required before the associated key
// kind may be committed through batch applications or ingests.
ExperimentalFormatDeleteSized
// ExperimentalFormatDeleteSizedAndObsolete is a format major version that adds support
// for deletion tombstones that encode the size of the value they're
// expected to delete. This format major version is required before the
// associated key kind may be committed through batch applications or
// ingests. It also adds support for keys that are marked obsolete (see
// sstable/format.go for details).
ExperimentalFormatDeleteSizedAndObsolete

// internalFormatNewest holds the newest format major version, including
// experimental ones excluded from the exported FormatNewest constant until
@@ -182,7 +184,7 @@ func (v FormatMajorVersion) MaxTableFormat() sstable.TableFormat {
return sstable.TableFormatPebblev2
case FormatSSTableValueBlocks, FormatFlushableIngest, FormatPrePebblev1MarkedCompacted:
return sstable.TableFormatPebblev3
case ExperimentalFormatDeleteSized:
case ExperimentalFormatDeleteSizedAndObsolete:
return sstable.TableFormatPebblev4
default:
panic(fmt.Sprintf("pebble: unsupported format major version: %s", v))
@@ -201,7 +203,7 @@ func (v FormatMajorVersion) MinTableFormat() sstable.TableFormat {
case FormatMinTableFormatPebblev1, FormatPrePebblev1Marked,
FormatUnusedPrePebblev1MarkedCompacted, FormatSSTableValueBlocks,
FormatFlushableIngest, FormatPrePebblev1MarkedCompacted,
ExperimentalFormatDeleteSized:
ExperimentalFormatDeleteSizedAndObsolete:
return sstable.TableFormatPebblev1
default:
panic(fmt.Sprintf("pebble: unsupported format major version: %s", v))
@@ -334,8 +336,8 @@ var formatMajorVersionMigrations = map[FormatMajorVersion]func(*DB) error{
}
return d.finalizeFormatVersUpgrade(FormatPrePebblev1MarkedCompacted)
},
ExperimentalFormatDeleteSized: func(d *DB) error {
return d.finalizeFormatVersUpgrade(ExperimentalFormatDeleteSized)
ExperimentalFormatDeleteSizedAndObsolete: func(d *DB) error {
return d.finalizeFormatVersUpgrade(ExperimentalFormatDeleteSizedAndObsolete)
},
}

36 changes: 18 additions & 18 deletions format_major_version_test.go
@@ -62,8 +62,8 @@ func TestRatchetFormat(t *testing.T) {
require.Equal(t, FormatFlushableIngest, d.FormatMajorVersion())
require.NoError(t, d.RatchetFormatMajorVersion(FormatPrePebblev1MarkedCompacted))
require.Equal(t, FormatPrePebblev1MarkedCompacted, d.FormatMajorVersion())
require.NoError(t, d.RatchetFormatMajorVersion(ExperimentalFormatDeleteSized))
require.Equal(t, ExperimentalFormatDeleteSized, d.FormatMajorVersion())
require.NoError(t, d.RatchetFormatMajorVersion(ExperimentalFormatDeleteSizedAndObsolete))
require.Equal(t, ExperimentalFormatDeleteSizedAndObsolete, d.FormatMajorVersion())

require.NoError(t, d.Close())

@@ -212,22 +212,22 @@ func TestFormatMajorVersions_TableFormat(t *testing.T) {
// fixture is intentionally verbose.

m := map[FormatMajorVersion][2]sstable.TableFormat{
FormatDefault: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2},
FormatMostCompatible: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2},
formatVersionedManifestMarker: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2},
FormatVersioned: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2},
FormatSetWithDelete: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2},
FormatBlockPropertyCollector: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1},
FormatSplitUserKeysMarked: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1},
FormatSplitUserKeysMarkedCompacted: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1},
FormatRangeKeys: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev2},
FormatMinTableFormatPebblev1: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2},
FormatPrePebblev1Marked: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2},
FormatUnusedPrePebblev1MarkedCompacted: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2},
FormatSSTableValueBlocks: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3},
FormatFlushableIngest: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3},
FormatPrePebblev1MarkedCompacted: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3},
ExperimentalFormatDeleteSized: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev4},
FormatDefault: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2},
FormatMostCompatible: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2},
formatVersionedManifestMarker: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2},
FormatVersioned: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2},
FormatSetWithDelete: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2},
FormatBlockPropertyCollector: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1},
FormatSplitUserKeysMarked: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1},
FormatSplitUserKeysMarkedCompacted: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1},
FormatRangeKeys: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev2},
FormatMinTableFormatPebblev1: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2},
FormatPrePebblev1Marked: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2},
FormatUnusedPrePebblev1MarkedCompacted: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2},
FormatSSTableValueBlocks: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3},
FormatFlushableIngest: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3},
FormatPrePebblev1MarkedCompacted: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3},
ExperimentalFormatDeleteSizedAndObsolete: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev4},
}

// Valid versions.
4 changes: 2 additions & 2 deletions get_iter.go
@@ -158,7 +158,7 @@ func (g *getIter) Next() (*InternalKey, base.LazyValue) {
if n := len(g.l0); n > 0 {
files := g.l0[n-1].Iter()
g.l0 = g.l0[:n-1]
iterOpts := IterOptions{logger: g.logger}
iterOpts := IterOptions{logger: g.logger, snapshotForHideObsoletePoints: g.snapshot}
g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters,
files, manifest.L0Sublevel(n), internalIterOpts{})
g.levelIter.initRangeDel(&g.rangeDelIter)
@@ -177,7 +177,7 @@ func (g *getIter) Next() (*InternalKey, base.LazyValue) {
continue
}

iterOpts := IterOptions{logger: g.logger}
iterOpts := IterOptions{logger: g.logger, snapshotForHideObsoletePoints: g.snapshot}
g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters,
g.version.Levels[g.level].Iter(), manifest.Level(g.level), internalIterOpts{})
g.levelIter.initRangeDel(&g.rangeDelIter)
22 changes: 19 additions & 3 deletions internal/base/internal.go
@@ -24,6 +24,14 @@ const (
//InternalKeyKindColumnFamilyDeletion InternalKeyKind = 4
//InternalKeyKindColumnFamilyValue InternalKeyKind = 5
//InternalKeyKindColumnFamilyMerge InternalKeyKind = 6

// InternalKeyKindSingleDelete (SINGLEDEL) is a performance optimization
// solely for compactions (to reduce write amp and space amp). Readers other
// than compactions should treat SINGLEDEL as equivalent to a DEL.
// Historically, it was simpler for readers other than compactions to treat
// SINGLEDEL as equivalent to DEL, but as of the introduction of
// InternalKeyKindSSTableInternalObsoleteBit, this is also necessary for
// correctness.
InternalKeyKindSingleDelete InternalKeyKind = 7
//InternalKeyKindColumnFamilySingleDelete InternalKeyKind = 8
//InternalKeyKindBeginPrepareXID InternalKeyKind = 9
@@ -71,7 +79,7 @@ const (
// value indicating the (len(key)+len(value)) of the shadowed entry the
// tombstone is expected to delete. This value is used to inform compaction
// heuristics, but is not required to be accurate for correctness.
InternalKeyKindDeleteSized = 23
InternalKeyKindDeleteSized InternalKeyKind = 23

// This maximum value isn't part of the file format. Future extensions may
// increase this value.
@@ -84,12 +92,17 @@ const (
// seqNum.
InternalKeyKindMax InternalKeyKind = 23

// Internal to the sstable format. Not exposed by any sstable iterator.
// Declared here to prevent definition of valid key kinds that set this bit.
InternalKeyKindSSTableInternalObsoleteBit InternalKeyKind = 64
InternalKeyKindSSTableInternalObsoleteMask InternalKeyKind = 191

// InternalKeyZeroSeqnumMaxTrailer is the largest trailer with a
// zero sequence number.
InternalKeyZeroSeqnumMaxTrailer = uint64(InternalKeyKindInvalid)
InternalKeyZeroSeqnumMaxTrailer = uint64(255)

// A marker for an invalid key.
InternalKeyKindInvalid InternalKeyKind = 255
InternalKeyKindInvalid InternalKeyKind = InternalKeyKindSSTableInternalObsoleteMask

// InternalKeySeqNumBatch is a bit that is set on batch sequence numbers
// which prevents those entries from being excluded from iteration.
@@ -112,6 +125,9 @@ const (
InternalKeyBoundaryRangeKey = (InternalKeySeqNumMax << 8) | uint64(InternalKeyKindRangeKeySet)
)

// Assert InternalKeyKindSSTableInternalObsoleteBit > InternalKeyKindMax
const _ = uint(InternalKeyKindSSTableInternalObsoleteBit - InternalKeyKindMax - 1)

var internalKeyKindNames = []string{
InternalKeyKindDelete: "DEL",
InternalKeyKindSet: "SET",
10 changes: 10 additions & 0 deletions iterator.go
@@ -569,6 +569,9 @@ func (i *Iterator) findNextEntry(limit []byte) {
return

case InternalKeyKindDelete, InternalKeyKindSingleDelete, InternalKeyKindDeleteSized:
// NB: treating InternalKeyKindSingleDelete as equivalent to DEL is not
// only simpler, but is also necessary for correctness due to
// InternalKeyKindSSTableInternalObsoleteBit.
i.nextUserKey()
continue

@@ -632,6 +635,9 @@ func (i *Iterator) nextPointCurrentUserKey() bool {
return false

case InternalKeyKindDelete, InternalKeyKindSingleDelete, InternalKeyKindDeleteSized:
// NB: treating InternalKeyKindSingleDelete as equivalent to DEL is not
// only simpler, but is also necessary for correctness due to
// InternalKeyKindSSTableInternalObsoleteBit.
return false

case InternalKeyKindSet, InternalKeyKindSetWithDelete:
@@ -1095,6 +1101,10 @@ func (i *Iterator) mergeNext(key InternalKey, valueMerger ValueMerger) {
case InternalKeyKindDelete, InternalKeyKindSingleDelete, InternalKeyKindDeleteSized:
// We've hit a deletion tombstone. Return everything up to this
// point.
//
// NB: treating InternalKeyKindSingleDelete as equivalent to DEL is not
// only simpler, but is also necessary for correctness due to
// InternalKeyKindSSTableInternalObsoleteBit.
return

case InternalKeyKindSet, InternalKeyKindSetWithDelete:
9 changes: 9 additions & 0 deletions level_iter.go
@@ -200,6 +200,11 @@ type levelIter struct {
// cache when constructing new table iterators.
internalOpts internalIterOpts

// Scratch space for the obsolete keys filter, when there are no other block
// property filters specified. See the performance note where
// IterOptions.PointKeyFilters is declared.
filtersBuf [1]BlockPropertyFilter

// Disable invariant checks even if they are otherwise enabled. Used by tests
// which construct "impossible" situations (e.g. seeking to a key before the
// lower bound).
@@ -267,8 +272,12 @@ func (l *levelIter) init(
l.upper = opts.UpperBound
l.tableOpts.TableFilter = opts.TableFilter
l.tableOpts.PointKeyFilters = opts.PointKeyFilters
if len(opts.PointKeyFilters) == 0 {
l.tableOpts.PointKeyFilters = l.filtersBuf[:0:1]
}
l.tableOpts.UseL6Filters = opts.UseL6Filters
l.tableOpts.level = l.level
l.tableOpts.snapshotForHideObsoletePoints = opts.snapshotForHideObsoletePoints
l.cmp = cmp
l.split = split
l.iterFile = nil