MDEV-35304: Add Connects_Tried and Primary_Retry_Count to SSS (#3764, base: main)
Conversation
Force-pushed “Add `Connects_Tried` and `Primary_Retry_Count` to SSS” from 87112f2 to 43da219
According to the code, the `--master-retry-count=0` behavior apparently doesn’t match the doc’s expectation:

> A value of 0 means the replica will not stop attempting to reconnect.
> ⸺ https://mariadb.com/kb/en/mariadbd-options/#-master-retry-count

This may be a bug. The code is found as early as 10.5.
```diff
       connect
     */
-    if (++err_count == master_retry_count)
+    if (++(mi->connects_tried) == mi->retry_count)
```
There’s no special case for zero since 👀 2002. With an `==`, it would treat 0 as `ULONG_MAX + 1`.

Fortunately, even if it retries once every millisecond, I don’t see a practical difference between 2^32 attempts (totalling almost 50 days) and “the server is down”.
```c
  if ((*retry_count)++)
  {
    if (*retry_count > master_retry_count)
      return 1;                             // Don't retry forever
```
This can be hard to comprehend. Step by step:

- If `*retry_count` was 0, skip the block
- Else, if `*retry_count` becomes `> master_retry_count`, `return 1`
- Else, proceed with the block

This could mean that for `--master-retry-count=0`, the termination condition always matches, but the flow always bypasses it for the first try. That is, `0` is equivalent to `1`.
I tried writing an MTR test for `--master-retry-count`, but my build doesn’t have `debug_sync`, and apparently none of our buildbots have both it and `log_bin`…??? (#3731)
`try_to_reconnect()` wraps `safe_reconnect()` with logging, but the latter already loops reconnection attempts up to `master_retry_count` times with `mi->connect_retry`-msec sleeps in between. This means `try_to_reconnect()` has been counting the number of disconnects (since it doesn’t have a loop) while `safe_reconnect()` was counting actual attempts (which may be multiple per disconnect). In practice, this outer counter’s only benefit was to cover the edge case `--master-retry-count=0` for the inner loop… by treating it as 1…
“Lightly” refactor `try_to_reconnect()` `messages` to reduce duplication and improve consistency
When the IO thread (re)connects to a primary, no updates are available besides unique errors that cause the failure. These new `Master_info` numbers supplement SHOW REPLICA STATUS’s (most-recent) ‘Connecting’ state with statistics on (re)connect attempts:

* `Connects_Tried`: how many retries have been attempted so far
  This was previously a local variable that only counted re-attempts; it’s now meaningful even after the “Connecting” state concludes.
* `Primary_Retry_Count`: out of how many configured
  Formerly known as the global option `--master-retry-count`, it’s now copied per-replica to pave the way for CHANGE MASTER … TO in MDEV-25674.
* `sql/mysqld.cc`: init `master-retry-count` with `master_retry_count`
* `get_master_version_and_clock()`: de-duplicate label using fall-through
* `io_slave_killed()` & `check_io_slave_killed()`:
  * reüse the result from the level lower
  * add distinguishing docs
* `try_to_reconnect()`: extract `'` from `if`-`else`
* `handle_slave_io()`: Both `while`s have the same condition; looks like the outer `while` can simply be an `if`.
* `connect_to_master()`:
  * assume `mysql_errno()` is not 0 on connection error
  * utilize 0’s falsiness in the loop
  * remove kill check around error reporting – other kill checks don’t even use this result
  * extend docs
These tests dump the whole SSS – results are not guaranteed consistent!
Room for refactoring that I excluded from this PR:

- There are a few instances of `if (!…) … else …`
- Both `register_slave_on_master` and `request_dump` provide failure warnings, yet their caller, `handle_slave_io`, still emits similar warnings on their failure.
- Extraneous EOL whitespaces
I’m done staring at these consistent but incomprehensible test failures.
Description
These new `Master_info` numbers supplement SHOW ~~SLAVE~~ REPLICA STATUS’s (most-recent) ‘Connecting’ state with statistics on (re)connect attempts:

* `Connects_Tried`: how many retries have been attempted so far
* `Primary_Retry_Count`: out of how many configured

Formerly known as the global option `--master-retry-count`, it’s now copied per-replica to pave the way for CHANGE ~~PRIMARY~~ MASTER … TO in MDEV-25674.

This PR also includes additional commits that refactor relevant code. I strongly recommend reviewing commit-by-commit.
What problem is the patch trying to solve?
When the IO thread (re)connects to a primary, no updates are available besides unique errors that cause the failure.
If some output changed that is not visible in a test case, what was it looking like before the change and how is it looking with this patch applied?

From #3764 (comment), apparently:

* `--master-retry-count=0` is treated as `=1` for reconnecting to a previously connected primary, in `try_to_reconnect()`.
* In full, it looks like this happens when the connection drops at a different substep in the setup phase.

Both of these code paths are now gone as part of a preparation commit.
Do you think this patch might introduce side-effects in other parts of the server?

Components that expect a fixed number of SHOW REPLICA STATUS entries in fixed sections might start seeïng uninvited guests. On that note, we have many MTR tests that run `show slave status` directly and in full (i.e., not using `source include/show_slave_status.inc`).

Release Notes
Added two new SHOW REPLICA STATUS entries for statistics of the most-recent ‘Connecting’ state:

* `Connects_Tried`: How many retries were attempted
* `Primary_Retry_Count`: Out of how many configured (i.e., the `--master-retry-count` option)

Of course, they’re also accessible from INFORMATION_SCHEMA.SLAVE_STATUS; e.g.:
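The example query itself was lost in extraction; a sketch along these lines should surface the counters (assuming the I_S table exposes the new fields under the same names as the SHOW REPLICA STATUS entries above):

```sql
-- Hedged sketch: column names assumed to match the new SSS entries.
SELECT Connects_Tried, Primary_Retry_Count
FROM INFORMATION_SCHEMA.SLAVE_STATUS;
```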
Knowledge Base pages that need changing
How can this PR be tested?
I need some help from test specialists.

I’ve drafted the MTR test `rpl.rpl_connects_tried` for `Connects_Tried`’s behavior, but I couldn’t test it locally because my build doesn’t have `debug_sync`. I couldn’t refer to the buildbots either – they’re either successful (no search results on Cross Reference either) or they fail with issues where I have no clue how they relate to my modifications 😶🌫️. (Besides the ones that already fail in `main`, that is 😶.)

Besides checking that `Primary_Retry_Count` matches `--master-retry-count`, we should also test the option itself regarding #3764 (comment). A draft MTR test is at #3731.
If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

A tester can instead replicate (wat) the MTR test by observing the SHOW REPLICA STATUS and/or INFORMATION_SCHEMA.SLAVE_STATUS of a long-`master_connect_retry` replication over time.

Basing the PR against the correct MariaDB version

* The PR is based against the `main` branch.
* This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check