[CF-557] Changelog updates should wait for transaction commit #137

Open: erikdw wants to merge 16 commits into master from erikdw/cf-557--changelog-update-fix

Conversation

Collaborator

@erikdw erikdw commented Jan 11, 2024

Description

CF-557

Ensure that we don't propagate changelog updates until after a transaction is committed.

Prior to this fix, every DML statement processed (even within a transaction) would immediately lead to a changelog event being written out. That's problematic because client applications that see the changelog event are prompted to read the LDB for the associated row, which means they are racing against the transaction commit.

We avoid this by detecting that a transaction has begun; while in a transaction we buffer the changes and wait for the transaction to end before propagating them to the changelog.
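
A simplified sketch of that buffering behavior, with assumed type and helper names (writer, change, propagate) and rollback handling omitted for brevity; this is not the PR's actual implementation:

package changelog

import "strings"

// change is a stand-in for the real sqlite change type.
type change struct{ Table string }

type writer struct {
	inTransaction bool
	buffered      []change
	propagate     func([]change) // writes changelog entries / invokes callbacks
}

// ApplyDMLStatement buffers DML changes while a transaction is open and only
// propagates them once the transaction commits.
func (w *writer) ApplyDMLStatement(stmt string, changes []change) {
	upper := strings.ToUpper(strings.TrimSpace(stmt))
	switch {
	case strings.HasPrefix(upper, "BEGIN"):
		w.inTransaction = true
	case strings.HasPrefix(upper, "COMMIT"):
		w.inTransaction = false
		w.propagate(w.buffered) // changelog entries are written only now
		w.buffered = nil
	default:
		if w.inTransaction {
			w.buffered = append(w.buffered, changes...) // defer until commit
		} else {
			w.propagate(changes) // non-transactional DML propagates immediately
		}
	}
}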

Details

  1. We rely heavily on the synchronous nature of this code for the buffering we are adding; i.e., we don't need any locking since there is no concurrency.
  2. Add comments about how the callback change propagation works (see ApplyDMLStatement in ldb_callback_writer.go).
    • On a related note, add a comment above ApplyDMLStatement in ldb_writer_with_changelog.go about how that code isn't actually used.
  3. Pin the Dockerfile's base image to avoid upstream libmusl issues that broke the build.
  4. Add 2 fields to changelog entries, to help with debugging if we ever see any issues around the changelog:
    • ledgerSeq: ledger sequence corresponding to the DML statement that led to this changelog entry
    • tx: boolean for whether the changelog entry is from a transaction
  5. Don't propagate changes for internal bookkeeping tables, which don't appear in the changelog -- we used to keep those changes out of the changelog by only propagating changes for tables with names like family___table (see the sketch after this list).
  6. Add a new unit test that ensures we are getting the number of callbacks we expect, and the number of updates in each callback we expect.
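
A small sketch of the name-based filtering mentioned in item 5, assuming LDB table names look like family___table while internal bookkeeping tables do not; the function name is hypothetical:

package changelog

import "strings"

// shouldPropagate reports whether a change should reach the changelog: only
// tables named like family___table qualify, so internal bookkeeping tables
// are skipped before their changes are buffered or written out.
func shouldPropagate(tableName string) bool {
	family, table, ok := strings.Cut(tableName, "___")
	return ok && family != "" && table != ""
}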

Minor fixes

  1. Remove unused NewChangelogEntry function
  2. Fix typo in config knob name max-healty-latency -> max-healthy-latency
  3. Fix comment typos

TODO

  • Figure out whether the gauge metric makes sense for accumulated changes on each reflector. We might just want an ever-increasing counter?
  • Update unit-tests to have tables where REPLACE INTO results in 2 SQLite changes like it does for so many of the real ledger statements we have.
    • i.e., create a 2nd column in the table, make the first one a key, and REPLACE INTO using the same key with a different value (see the sketch after this list).
  • Subsequent PR to propagate the change operation (insert or delete). Code was already written but excised to keep this a bit tighter.
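
A rough sketch of the table shape described in that TODO item, assuming the mattn/go-sqlite3 driver; the table and column names are made up:

package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", ":memory:")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Two columns: "id" is the key, "val" is the value we overwrite.
	mustExec(db, `CREATE TABLE fam___foo (id INTEGER PRIMARY KEY, val TEXT)`)

	// Seed the row.
	mustExec(db, `REPLACE INTO fam___foo (id, val) VALUES (1, 'a')`)

	// Same key, different value: per the discussion in this PR, SQLite treats
	// this as a DELETE of the old row followed by an INSERT, i.e. 2 changes
	// for one statement.
	mustExec(db, `REPLACE INTO fam___foo (id, val) VALUES (1, 'b')`)
}

func mustExec(db *sql.DB, q string) {
	if _, err := db.Exec(q); err != nil {
		log.Fatal(err)
	}
}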

Testing

Testing completed successfully if all of these are checked:

  • Tested locally by running the reflector connected to stage's Ctldb, and running both a tail -F /var/spool/ctlstore/changelog and a simple program that uses the Event Iterator.
  • Test in stage by running the updated reflector on the node where debezium-service runs, since that will mean the QARows checker is using an LDB being updated by the new code; that should surface any fundamental bug with updating the LDB, since the checker compares the local LDB against the SoR.
  • Test in stage by running the updated reflector on all nodes of the core pool
  • Create some charts in Datadog and check that the metrics we are issuing are what we want.
  • Test in stage in centrifuge-destinations pool, so that ctlgate that uses the changelog will be tested.

@erikdw erikdw force-pushed the erikdw/cf-557--changelog-update-fix branch from 3c24db3 to b8c6173 on January 11, 2024 09:22
@erikdw erikdw marked this pull request as ready for review January 16, 2024 09:33
@erikdw erikdw requested a review from a team as a code owner January 16, 2024 09:33
func NewChangelogEntry(seq int64, family string, table string, key []interface{}) *ChangelogEntry {
	return &ChangelogEntry{Seq: seq, Family: family, Table: table, Key: key}
}

func (w *ChangelogWriter) WriteChange(e ChangelogEntry) error {
	structure := struct {
Contributor

I don't really get why we inline this struct definition instead of just putting the json tags on the ChangelogEntry.

Collaborator Author

This is a good question -- I don't get it either.

It reminds me that I'm unsure about why we have 2 separate structs for the changelog entries. Felt like I was having to sprinkle the same fields all over the place to get it through the pipes.

  • writing (this code):
    func (w *ChangelogWriter) WriteChange(e ChangelogEntry) error {
        structure := struct {
            Seq    int64         `json:"seq"`
            Family string        `json:"family"`
            Table  string        `json:"table"`
            Key    []interface{} `json:"key"`
        }{
  • reading:
    // entry represents a single row in the changelog
    // e.g.
    // {"seq":1,"family":"fam","table":"foo","key":[{"name":"id","type":"int","value":1}]}
    type entry struct {
        Seq    int64  `json:"seq"`
        Family string `json:"family"`
        Table  string `json:"table"`
        Key    []Key  `json:"key"`
    }

Contributor

I wonder if that's a consequence of one of these living at the root of this pkg vs. in internal.

Contributor

Feels like there should be just one definition of this data.

Contributor

With the json annotations too.
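
Something like the following sketch of the "one definition with json tags" idea, not a final design; the open question is the Key type, since the writer currently uses []interface{} while the reader uses []Key:

// Single shared definition, carrying the json tags directly.
type ChangelogEntry struct {
	Seq    int64         `json:"seq"`
	Family string        `json:"family"`
	Table  string        `json:"table"`
	Key    []interface{} `json:"key"` // or []Key, if reader and writer can agree
}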

Contributor

FWIW, answers by Copilot:

What would you suspect is the reason why, in this file, the structure in the WriteChange function duplicates the ChangelogEntry? Why would the developer who wrote the code not reuse ChangelogEntry?

The developer might have chosen to define a separate structure within the WriteChange function for a few reasons:

  1. Field Tagging: The inline structure allows the developer to add struct tags (json:"seq", json:"family", etc.) which are used for JSON marshalling. These tags define how the struct fields are encoded to JSON, which is not specified in the ChangelogEntry struct.

  2. Encapsulation: The ChangelogEntry struct might be used in various parts of the codebase, and the developer might not want to tie its implementation too closely to the JSON encoding used in this specific function. By using a separate structure for the JSON encoding, changes to the ChangelogEntry struct will not directly impact the WriteChange function.

  3. Flexibility: If, in the future, the WriteChange function needs to include additional data or transform the data before writing, having a separate structure within the function makes it easier to do so without affecting the ChangelogEntry struct.

In general, this approach provides a level of decoupling between the ChangelogEntry data structure and the specific requirements of the WriteChange function.

@@ -114,7 +114,7 @@ type supervisorCliConfig struct {
type ledgerHealthConfig struct {
	Disable            bool          `conf:"disable" help:"disable ledger latency health attributing (DEPRECATED: use disable-ecs-behavior instead)"`
	DisableECSBehavior bool          `conf:"disable-ecs-behavior" help:"disable ledger latency health attributing"`
	MaxHealthyLatency  time.Duration `conf:"max-healty-latency" help:"Max latency considered healthy"`
Contributor

One of the annoying aspects of this being OSS is that this is a change we can't verify doesn't break anything for external users. Internally, I didn't see any hits for "healty" related to ctlstore, though.

Collaborator Author

@erikdw erikdw Jan 16, 2024

I searched internally too -- it's a good call out that this could break something externally, but my belief is that no one is actually using this system outside of Segment, so I'm not too concerned.

That being said, I could be convinced to add an "unsupported" entry that has the typo'ed configuration name and then panics. i.e., with this change if someone had the typo'ed configuration knob it would just be silently ignored AFAIU.

Or even have logic to have that typo'ed configuration entry configure the "right" one and spit out a loud deprecation warning message. We don't have real releases so it's a bit unclear what the best strategy is.
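
A hypothetical sketch of the "alias the typo'ed knob with a loud warning" option; this is not how the conf package in this repo is actually wired, and normalizeDeprecatedFlags is a made-up helper:

package cliutil

import (
	"log"
	"strings"
)

// normalizeDeprecatedFlags rewrites the misspelled knob to the corrected name
// and logs a deprecation warning, instead of silently ignoring it.
func normalizeDeprecatedFlags(args []string) []string {
	out := make([]string, 0, len(args))
	for _, a := range args {
		if strings.Contains(a, "max-healty-latency") {
			log.Println("WARNING: max-healty-latency is a deprecated misspelling; use max-healthy-latency")
			a = strings.Replace(a, "max-healty-latency", "max-healthy-latency", 1)
		}
		out = append(out, a)
	}
	return out
}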

Contributor

I'm fine just doing a breaking change. It may not be great OSS stewardship, but I have a hard time believing there's even a single other user of ctlstore outside of Segment.

	ChangeBuffer *sqlite.SQLChangeBuffer
	// Accumulated changes across multiple ApplyDMLStatement calls
	transactionChanges []sqlite.SQLiteWatchChange
Contributor

should we consider a cap on the size of a transaction so we don't buffer too much?

Collaborator Author

@erikdw erikdw Jan 16, 2024

Interesting proposal -- issues around that are exactly where my head was at as I was adding the metrics.

This consideration relates to the "support SoR transactions" project. We don't yet know how large SoR transactions can get, especially the ones we care about for that project. Putting some cap here with a loud complaint when we exceed it (log + metric), and then dumping the currently buffered changes to the changelog (i.e., invoking callbacks), seems like fine behavior for now (roughly like the sketch after the list below).

We know that the changes would not exceed 200 right now, based on hardcoded limits elsewhere and the behavior of REPLACE INTO, so perhaps just setting it to something like 500 for now would be good. i.e.,

  • REPLACE INTO is translated by SQLite into a DELETE op then an INSERT op ➡️ 2 changes.
  • max of 100 entries in a ledger transaction:
    // Reject requests that are too large
    if len(requests) > limits.LimitMaxMutateRequestCount {
        return &errs.PayloadTooLargeError{Err: "Number of requests exceeds maximum"}
    }
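
A rough sketch (assumed names, not the PR's actual code) of the cap-and-flush idea: keep buffering inside a transaction, but if the buffer grows past a hard cap, complain loudly and flush early rather than buffering without bound.

package changelog

import "log"

type bufferedChange struct{ Table string } // stand-in for sqlite.SQLiteWatchChange

const maxBufferedChanges = 500 // comfortably above today's ~200 worst case

type txBuffer struct {
	changes []bufferedChange
	flush   func([]bufferedChange) // invoked on commit, or early when the cap is hit
}

func (b *txBuffer) add(c bufferedChange) {
	b.changes = append(b.changes, c)
	if len(b.changes) > maxBufferedChanges {
		// Loud complaint: log here, and bump a metric in the real code.
		log.Printf("transaction exceeded %d buffered changes; flushing early", maxBufferedChanges)
		b.flush(b.changes)
		b.changes = b.changes[:0]
	}
}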

Contributor

that sounds good to me

Contributor

This comment is not relevant to today's transactions, since they are size-limited, but rather to the future when we are supporting SoR transactions. A sane cap on the number of statements within a transaction is smart, but if we do this we need to ensure the limit is well understood externally. Once we say we support transactions, we need to be sure we really do.

janderson-seg previously approved these changes Jan 16, 2024

Contributor

@janderson-seg janderson-seg left a comment

🔥 Nice work Erik!

…ssibly premature optimization of updating passed-in slice; also remove metric for ldb_changes_accumulated in favor of just using the ldb_changes_written one