
MDEV-35304: Add Connects_Tried and Primary_Retry_Count to SSS #3764

Open · wants to merge 5 commits into main
Conversation

@ParadoxV5 (Contributor) commented Jan 14, 2025

  • The Jira issue number for this PR is: MDEV-35304

Description

These new Master_info numbers supplement SHOW REPLICA STATUS’s (most-recent) ‘Connecting’ state with statistics on (re)connect attempts:

  • Connects_Tried: how many retries have been attempted so far
    • This was previously a local variable that only counted re-attempts; it’s now meaningful even after the “Connecting” state concludes.
  • Primary_Retry_Count: out of how many configured
    • Formerly known as the global option --master-retry-count, it’s now copied per-replica to pave the way for CHANGE MASTER … TO in MDEV-25674.
    • I’m pre-emptively naming this one ‘Primary’ per MDEV-30189 of MDEV-18777.

This PR also includes additional commits that refactor relevant code.
I strongly recommend reviewing commit-by-commit.

What problem is the patch trying to solve?

When the IO thread (re)connects to a primary, no status updates are available besides the unique errors that cause the failures.

If some output changed that is not visible in a test case, what did it look like before the change, and what does it look like with this patch applied?

From #3764 (comment), apparently:

  • --master-retry-count=0 is treated as =1 for reconnecting to a previously connected primary.
  • There may have been an extra sleep in try_to_reconnect().
    It looks like this happens when the connection drops at a different substep of the setup phase.

Both of these code paths are now gone as part of a preparation commit.

Do you think this patch might introduce side-effects in other parts of the server?

Components that expect a fixed number of SHOW REPLICA STATUS entries in fixed positions might start seeing uninvited guests.
On that note, we have many MTR tests that run SHOW SLAVE STATUS directly and in full (i.e., not using source include/show_slave_status.inc).

Release Notes

Added two new SHOW REPLICA STATUS entries for statistics of the most-recent ‘Connecting’ state:

  • Connects_Tried: How many retries were attempted
  • Primary_Retry_Count: Out of how many configured (i.e., the --master-retry-count option)

Of course, they’re also accessible from INFORMATION_SCHEMA.SLAVE_STATUS; e.g.:

SELECT
  Connection_name,
  Slave_IO_Running,
  CONCAT(Connects_Tried, ' attempt(s) used out of ', Primary_Retry_Count, ' max')
FROM INFORMATION_SCHEMA.SLAVE_STATUS;

Knowledge Base pages that need changing

How can this PR be tested?

I need some help from test specialists.

I’ve drafted the MTR test rpl.rpl_connects_tried for Connects_Tried’s behavior, but I couldn’t test it locally because my build doesn’t have debug_sync.
I couldn’t refer to the buildbots either – they’re either successful (with no search results on Cross Reference either), or they fail with issues that I have no clue how they relate to my modifications 😶‍🌫️. (Besides the ones that already fail in main, that is 😶.)

Besides checking that Primary_Retry_Count matches --master-retry-count, we should also test the option itself regarding #3764 (comment).
A draft MTR test is at #3731.

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

A tester can instead replicate (pun intended) the MTR test by observing the SHOW REPLICA STATUS and/or INFORMATION_SCHEMA.SLAVE_STATUS of a replication setup with a long master_connect_retry over time.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@ParadoxV5 ParadoxV5 requested a review from andrelkin January 14, 2025 20:08
@ParadoxV5 ParadoxV5 changed the title Mdev 35304 MDEV-35304: Add Connects_Tried and Primary_Retry_Count to SSS Jan 14, 2025
@ParadoxV5 ParadoxV5 force-pushed the mdev-35304 branch 6 times, most recently from 87112f2 to 43da219 Compare January 16, 2025 03:51
@ParadoxV5 (Contributor, Author) left a comment:
According to the code, the --master-retry-count=0 behavior apparently doesn’t match the doc’s expectation:

A value of 0 means the replica will not stop attempting to reconnect.
https://mariadb.com/kb/en/mariadbd-options/#-master-retry-count

This may be a bug.
The code is found as early as 10.5.

```diff
-  if (++err_count == master_retry_count)
+  if (++(mi->connects_tried) == mi->retry_count)
```
@ParadoxV5 (Contributor, Author) commented Jan 16, 2025:

There’s no special case for zero since 👀 2002.

With an ==, it would treat 0 as ULONG_MAX + 1.
Fortunately, even if it retries once every millisecond, I don’t see a practical difference between 2^32 attempts (totalling almost 50 days) and “the server is down”.

Comment on lines -4431 to -4434:

```c
if ((*retry_count)++)
{
  if (*retry_count > master_retry_count)
    return 1; // Don't retry forever
```
@ParadoxV5 (Contributor, Author) commented:


This can be hard to comprehend. Step by step:

  1. If *retry_count was 0, skip the block
  2. Else If *retry_count becomes > master_retry_count, return 1
  3. Else, proceed with the block

This could mean that for --master-retry-count=0, the termination condition always matches, but the flow always bypasses it for the first try. That is, 0 is equivalent to 1.

I tried writing an MTR test for --master-retry-count, but my build doesn’t have debug_sync, and apparently none of our buildbots have both it and log_bin…??? (#3731)

@ParadoxV5 (Contributor, Author) commented Jan 16, 2025:

innodb.log_file_size_online and main.log_state fail on main.
P.S.: and apparently unit.conc_connection too.

`try_to_reconnect()` wraps `safe_reconnect()` with logging, but the
latter already loops reconnection attempts up to `master_retry_count`
times with `mi->connect_retry`-msec sleeps in between.
This means `try_to_reconnect()` has been counting the number of
disconnects (since it doesn’t have a loop) while `safe_reconnect()` was
counting actual attempts (which may be multiple per disconnect).
In practice, this outer counter’s only benefit was to cover the edge
case `--master-retry-count=0` for the inner loop… by treating it as 1…
“Lightly” refactor `try_to_reconnect()` `messages`
to reduce duplication and improve consistency
When the IO thread (re)connects to a primary,
no updates are available besides unique errors that cause the failure.
These new `Master_info` numbers supplement SHOW REPLICA STATUS’s (most-
recent) ‘Connecting’ state with statistics on (re)connect attempts:

* `Connects_Tried`: how many retries have been attempted so far
  This was previously a local variable that only counted re-attempts;
  it’s now meaningful even after the “Connecting” state concludes.

* `Primary_Retry_Count`: out of how many configured
  Formerly known as the global option `--master-retry-count`, it’s now
  copied per-replica to pave way for CHANGE MASTER … TO in MDEV-25674.
* `sql/mysqld.cc`: init `master-retry-count` with `master_retry_count`
* `get_master_version_and_clock()` de-duplicate label using fall-through
* `io_slave_killed()` & `check_io_slave_killed()`:
  * reüse the result from the level lower
  * add distinguishing docs
* `try_to_reconnect()`: extract `'` from `if`-`else`
* `handle_slave_io()`: Both `while`s have the same condition;
  looks like the outer `while` can simply be an `if`.
* `connect_to_master()`:
  * assume `mysql_errno()` is not 0 on connection error
  * utilize 0’s falsiness in the loop
  * remove kill check around error reporting –
    other kill checks don’t even use this result
  * extend docs
These tests dump the whole SSS – results are not guaranteed consistent!
@ParadoxV5 (Contributor, Author) left a comment:
Room for refactor that I excluded from this PR:

  • There are a few instances of `if (!…) … else` that could be restructured.
  • Both `register_slave_on_master` and `request_dump` provide failure warnings, yet their caller, `handle_slave_io`, still emits similar warnings on their failure.
  • Extraneous end-of-line whitespace

@ParadoxV5 ParadoxV5 marked this pull request as ready for review January 16, 2025 06:21
@ParadoxV5 (Contributor, Author) commented:

I’m done with staring at these consistent but incomprehensible test failures.
