Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: use strict-obsolete sstables when writing to and reading from shared storage #2756

Open
sumeerbhola opened this issue Jul 19, 2023 · 2 comments · May be fixed by #3624
Open

db: use strict-obsolete sstables when writing to and reading from shared storage #2756

sumeerbhola opened this issue Jul 19, 2023 · 2 comments · May be fixed by #3624

Comments

@sumeerbhola
Copy link
Collaborator

sumeerbhola commented Jul 19, 2023

These were introduced recently, and permit using the vanilla sstable readers when reading foreign ssts, which is more efficient than wrappers. Documented in

pebble/sstable/format.go

Lines 32 to 181 in 456a2a2

// TableFormatPebblev4, in addition to DELSIZED, introduces the use of
// InternalKeyKindSSTableInternalObsoleteBit.
//
// 1. Motivation
//
// We have various related problems caused by Pebble snapshots:
//
// - P1: RANGEDELs that delete points in the same sstable, but the points
// happen to not get deleted during compactions because of an open snapshot.
// This causes very expensive iteration, that has been observed in
// production deployments
//
// - P2: When iterating over a foreign sstable (in disaggregated storage), we
// need to do (a) point collapsing to expose at most one point per user key,
// (b) apply RANGEDELs in the sstable to hide deleted points in the same
// sstable. This per-sstable point collapsing iteration needs to be very
// efficient (ideally as efficient from a CPU perspective as iteration over
// regular sstables) since foreign sstables can be very long-lived -- one of
// the goals of disaggregated storage is to scale compute and disk bandwidth
// resources as a function of the hot (from a write perspective) data and
// not the whole data, so we don't want to have to rewrite foreign sstables
// solely to improve read performance.
//
// The ideal solution for P2 would allow user-facing reads to utilize the
// existing SST iterators (with slight modifications) and with no loss of
// efficiency. And for P1 and P2 we would like to skip whole blocks of
// overwritten/deleted points. Even when we can't skip whole blocks, avoiding
// key comparisons at iteration time to discover what points are deleted is
// very desirable, since keys can be long.
//
// We observe that:
//
// - Reads:
// - All user-facing reads in CockroachDB use iterators over the DB, hence
// have a higher read seqnum than all sstables (there are some rare cases
// that can violate this, but those are not important from a performance
// optimization perspective).
//
// - Certain internal-facing reads in CockroachDB use snapshots, but the
// snapshots are shortlived enough that most L5 and L6 sstables will have
// all seqnums lower than the snapshot seqnum.
//
// - Writes:
// - We already do key comparisons between points when writing the sstable
// to ensure that the sstable invariant (monotonically increasing internal
// keys) is not violated. So we know which points share the same userkey,
// and thereby which points are obsolete because there is a more recent
// point in the same sstable.
//
// - The compactionIter knows which point id deleted by a RANGEDEL even if
// the point does need to be written because of a snapshot.
//
// So this known information can be encoded in the sstable at write time and
// utilized for optimized reading.
//
// 2. Solution
//
// We primarily scope the solution to the following point kinds: SET,
// SETWITHDEL, DEL, DELSIZED, SINGLEDEL. These are the ones marked locally
// obsolete, i.e., obsolete within the sstable, and we can guarantee that at
// most one point will be exposed per user key. MERGE keys create more
// complexity: MERGE followed by MERGE causes multiple keys to not be
// obsolete. Same applies for MERGE followed by SET/SETWITHDEL/DEL*. Note
// that:
//
// - For regular sst iteration, the obsolete marking is a performance
// optimization, and multiple keys for the same userkey can be handled by
// higher layers in the iterator tree (specifically pebble.Iterator).
//
// - For foreign sst iteration, we disallow MERGEs to be written to such
// shared ssts (details below).
//
// The key kinds are marked with an obsolete bit
// (InternalKeyKindSSTableInternalObsoleteBit) when the key-value pair is
// obsolete. This marking is done within blockWriter, based on information
// passed to it by Writer. In turn, Writer uses a combination of key
// comparisons, and information provided by compactionIter to decide whether a
// key-value pair is obsolete. Additionally, a Pebble-internal
// BlockPropertyCollector (obsoleteKeyBlockPropertyCollector) is used to mark
// blocks where all key-value pairs are obsolete. Since the common case is
// non-obsolete blocks, this block property collector uses the empty byte
// slice to represent a non-obsolete block, which consumes no space in
// BlockHandleWithProperties.Props.
//
// At read time, the obsolete bit is only visible to the blockIter, which can
// be optionally configured to hide obsolete points. This hiding is only
// configured for data block iterators for sstables being read by user-facing
// iterators at a seqnum greater than the max seqnum in the sstable.
// Additionally, when this hiding is configured, a Pebble-internal block
// property filter (obsoleteKeyBlockPropertyFilter), is used to skip whole
// blocks that are obsolete.
//
// 2.1 Correctness
//
// Due to the level invariant, the sequence of seqnums for a user key in a
// sstable represents a contiguous subsequence of the seqnums for the userkey
// across the whole LSM, and is more recent than the seqnums in a sstable in a
// lower level. So exposing exactly one point from a sstable for a userkey
// will also mask the points for the userkey in lower levels. If we expose no
// point, because of RANGEDELs, that RANGEDEL will also mask the points in
// lower levels.
//
// Note that we do not need to do anything special at write time for
// SETWITHDEL and SINGLEDEL. This is because these key kinds are treated
// specially only by compactions, which do not hide obsolete points. For
// regular reads, SETWITHDEL behaves the same as SET and SINGLEDEL behaves the
// same as DEL.
//
// 2.2 Strictness and MERGE
//
// Setting the obsolete bit on point keys is advanced usage, so we support two
// modes, both of which must be truthful when setting the obsolete bit, but
// vary in when they don't set the obsolete bit.
//
// - Non-strict: In this mode, the bit does not need to be set for keys that
// are obsolete. Additionally, any sstable containing MERGE keys can only
// use this mode. An iterator over such an sstable, when configured to
// hideObsoletePoints, can expose multiple internal keys per user key, and
// can expose keys that are deleted by rangedels in the same sstable. This
// is the mode that non-advanced users should use. Pebble without
// disaggregated storage will also use this mode and will best-effort set
// the obsolete bit, to optimize iteration when snapshots have retained many
// obsolete keys.
//
// - Strict: In this mode, every obsolete key must have the obsolete bit set,
// and no MERGE keys are permitted. An iterator over such an sstable, when
// configured to hideObsoletePoints satisfies two properties:
// - S1: will expose at most one internal key per user key, which is the
// most recent one.
// - S2: will never expose keys that are deleted by rangedels in the same
// sstable.
//
// This is the mode for two use cases in disaggregated storage (which will
// exclude parts of the key space that has MERGEs), for levels that contain
// sstables that can become foreign sstables:
// - Pebble compaction output to these levels that can become foreign
// sstables.
//
// - CockroachDB ingest operations that can ingest into the levels that can
// become foreign sstables. Note, these are not sstables corresponding to
// copied data for CockroachDB range snapshots. This case occurs for
// operations like index backfills: these trivially satisfy the strictness
// criteria since they only write one key per userkey.
//
// TODO(sumeer): this latter case is not currently supported, since only
// Writer.AddWithForceObsolete calls are permitted for writing strict
// obsolete sstables. This is done to reduce the likelihood of bugs. One
// simple way to lift this limitation would be to disallow adding any
// RANGEDELs when a Pebble-external writer is trying to construct a strict
// obsolete sstable.

@itsbilal

Jira issue: PEBBLE-60

Epic CRDB-40358

@sumeerbhola sumeerbhola changed the title db: use strict-obsolete sstables when writing to shared storage db: use strict-obsolete sstables when writing to and reading from shared storage Jul 19, 2023
itsbilal added a commit to itsbilal/pebble that referenced this issue Jul 28, 2023
Previously we were using pointCollapsingIters to collapse obsolete
points in foreign sstables. This was necessary because of snapshots
that were open at the time those sstables were written, however they
really blew up code complexity and performance. This change
transitions to using vanilla sstable iterators for foreign sstables
with `hideObsoletePoints` set to true. We error out if the
sstable format for a foreign sstable is not high enough to support
the obsolete flag.

Also remove all cases in the pointCollapsingIter that were not
used by ScanInternal (which is the last remaining user of it).

Part of cockroachdb#2756.
@itsbilal
Copy link
Member

Using strict-obsolete SSTs for the write path is a little more complex as Pebble is currently unaware of what part of the keyspace is shareable at write time. Maybe we can thread this through to Pebble as an Option. and we mark sstables that are completely inside the strict-obsolete keyspace as strict obsolete when writing to them. We currently also put timeseries keys in shared storage which will contain merges but that's okay because we will never share them across pebbles.

itsbilal added a commit to itsbilal/pebble that referenced this issue Jul 28, 2023
Previously we were using pointCollapsingIters to collapse obsolete
points in foreign sstables. This was necessary because of snapshots
that were open at the time those sstables were written, however they
really blew up code complexity and performance. This change
transitions to using vanilla sstable iterators for foreign sstables
with `hideObsoletePoints` set to true. We error out if the
sstable format for a foreign sstable is not high enough to support
the obsolete flag.

Also remove all cases in the pointCollapsingIter that were not
used by ScanInternal (which is the last remaining user of it).

Part of cockroachdb#2756.
itsbilal added a commit that referenced this issue Jul 31, 2023
Previously we were using pointCollapsingIters to collapse obsolete
points in foreign sstables. This was necessary because of snapshots
that were open at the time those sstables were written, however they
really blew up code complexity and performance. This change
transitions to using vanilla sstable iterators for foreign sstables
with `hideObsoletePoints` set to true. We error out if the
sstable format for a foreign sstable is not high enough to support
the obsolete flag.

Also remove all cases in the pointCollapsingIter that were not
used by ScanInternal (which is the last remaining user of it).

Part of #2756.
@itsbilal itsbilal self-assigned this Jul 31, 2023
@itsbilal
Copy link
Member

Based on discussion in Storage weekly, the preference was to add compaction guards around the Table keyspace as the "shareable keyspace", which would be passed into Pebble as an Option. Sstable outputs will be split at these boundaries and any sstables created within this range will be strict-obsolete. This issue now tracks the implementation of that.

@sumeerbhola sumeerbhola self-assigned this May 12, 2024
@itsbilal itsbilal removed their assignment May 13, 2024
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue May 15, 2024
…tions

SharedLowerUserKeyPrefix, if specified, is an additional lower bound on
constraint on key prefixes that should be written to shared files. It
applies only when CreateOnShared permits some shared file creation. It
will be used by CockroachDB to exclude keys below TableDataMin from
shared files, for both correctness (they can contain MERGEs for which
the obsolete bit does not work) and performance reasons (low latency
is more important and the data volume is tiny).

WriteSharedWithStrictObsolete, when true, causes shared files to be
written with WriterOptions.IsStrictObsolete set to true. This adds
an extra measure of configuration protection to accidentally sharing
files where the MERGE could become visible (we currently share such
files, but file virtualization hides these MERGEs).

The enforcement of SharedLowerUserKeyPrefix changes how
PreferSharedStorage is computed during flushes and compactions. It
will only be set if the next key to be written by compact.Runner
permits writing to shared storage. compact.Runner optimizes this
computation for when a compaction is fully within the shared or
unshared bounds. Additionally compact.Runner uses the existing
OutputSplitter to decide when a split should happen when transitioning
from unshared to shared.

While here, we do a tiny optimization in OutputSplitter to remove
a key comparison on each iteration key.

Fixes cockroachdb#2756
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue May 15, 2024
…tions

SharedLowerUserKeyPrefix, if specified, is an additional lower bound on
constraint on key prefixes that should be written to shared files. It
applies only when CreateOnShared permits some shared file creation. It
will be used by CockroachDB to exclude keys below TableDataMin from
shared files, for both correctness (they can contain MERGEs for which
the obsolete bit does not work) and performance reasons (low latency
is more important and the data volume is tiny).

WriteSharedWithStrictObsolete, when true, causes shared files to be
written with WriterOptions.IsStrictObsolete set to true. This adds
an extra measure of configuration protection to accidentally sharing
files where the MERGE could become visible (we currently share such
files, but file virtualization hides these MERGEs).

The enforcement of SharedLowerUserKeyPrefix changes how
PreferSharedStorage is computed during flushes and compactions. It
will only be set if the next key to be written by compact.Runner
permits writing to shared storage. compact.Runner optimizes this
computation for when a compaction is fully within the shared or
unshared bounds. Additionally compact.Runner uses the existing
OutputSplitter to decide when a split should happen when transitioning
from unshared to shared.

While here, we do a tiny optimization in OutputSplitter to remove
a key comparison on each iteration key.

Fixes cockroachdb#2756
@jbowens jbowens moved this to Disagg - Core in [Deprecated] Storage Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Disagg - Core
Development

Successfully merging a pull request may close this issue.

2 participants