
MDEV-35304: Add Connects_Tried and Primary_Retry_Count to SSS #3764

Open · wants to merge 5 commits into main
Conversation

@ParadoxV5 (Contributor) commented Jan 14, 2025

  • The Jira issue number for this PR is: MDEV-35304

Description

These new Master_info numbers supplement SHOW REPLICA STATUS’s (most-recent) ‘Connecting’ state with statistics on (re)connect attempts:

  • Connects_Tried: how many retries have been attempted so far
    • This was previously a local variable that only counted re-attempts; it’s now meaningful even after the “Connecting” state concludes.
  • Primary_Retry_Count: out of how many configured
    • Formerly known as the global option --master-retry-count, it’s now copied per-replica to pave the way for CHANGE MASTER … TO in MDEV-25674.
    • I’m pre-emptively naming this one ‘Primary’ per MDEV-30189 of MDEV-18777.

This PR also includes additional commits that refactor relevant code.
I strongly recommend reviewing commit-by-commit.

What problem is the patch trying to solve?

When the IO thread (re)connects to a primary, no status updates are available besides the unique errors that cause the failures.

If some output changed that is not visible in a test case, what did it look like before the change, and what does it look like with this patch applied?

From #3764 (comment), apparently:

  • --master-retry-count=0 is treated as =1 for reconnecting to a previously connected primary.
  • There may have been an extra sleep in try_to_reconnect().
    It looks like this happens when the connection drops at a different substep of the setup phase.

Both of these code paths are now gone as part of a preparation commit.

Do you think this patch might introduce side-effects in other parts of the server?

Components that expect a fixed number of SHOW REPLICA STATUS entries in fixed positions might start seeing uninvited guests.
On that note, we have many MTR tests that run SHOW SLAVE STATUS directly and in full (i.e., not using source include/show_slave_status.inc).

Release Notes

Added two new SHOW REPLICA STATUS entries for statistics of the most-recent ‘Connecting’ state:

  • Connects_Tried: How many retries were attempted
  • Primary_Retry_Count: Out of how many configured (i.e., the --master-retry-count option)

Of course, they’re also accessible from INFORMATION_SCHEMA.SLAVE_STATUS; e.g.:

SELECT
  Connection_name,
  Slave_IO_Running,
  CONCAT(Connects_Tried, ' attempt(s) used out of ', Primary_Retry_Count, ' max')
FROM INFORMATION_SCHEMA.SLAVE_STATUS;

Knowledge Base pages that need changing

How can this PR be tested?

I need some help from test specialists.

I’ve drafted the MTR test rpl.rpl_connects_tried for Connects_Tried’s behavior, but I couldn’t test it locally because my build doesn’t have debug_sync.
I couldn’t refer to the buildbots either – they’re either successful (with no search results on Cross Reference either), or they fail with issues that I have no clue how they relate to my modifications 😶‍🌫️. (Besides the ones that already fail in main, that is 😶.)

Besides checking that Primary_Retry_Count matches --master-retry-count, we should also test the option itself regarding #3764 (comment).
A draft MTR test is at #3731.

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

A tester can instead replicate (pun intended) the MTR test by observing the SHOW REPLICA STATUS and/or INFORMATION_SCHEMA.SLAVE_STATUS of a replication setup with a long master_connect_retry over time.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@ParadoxV5 ParadoxV5 requested a review from andrelkin January 14, 2025 20:08
@ParadoxV5 ParadoxV5 changed the title Mdev 35304 MDEV-35304: Add Connects_Tried and Primary_Retry_Count to SSS Jan 14, 2025
@ParadoxV5 ParadoxV5 force-pushed the mdev-35304 branch 6 times, most recently from 87112f2 to 43da219 Compare January 16, 2025 03:51
@ParadoxV5 (Contributor, Author) left a comment:
According to the code, the --master-retry-count=0 behavior apparently doesn’t match the doc’s expectation:

A value of 0 means the replica will not stop attempting to reconnect.
https://mariadb.com/kb/en/mariadbd-options/#-master-retry-count

This may be a bug.
The code is found as early as 10.5.

```diff
-  if (++err_count == master_retry_count)
+  if (++(mi->connects_tried) == mi->retry_count)
```
@ParadoxV5 (Contributor, Author) commented Jan 16, 2025:

There’s no special case for zero since 👀 2002.

With an ==, it would treat 0 as ULONG_MAX + 1.
Fortunately, even if it retries once every millisecond, I don’t see a practical difference between 2^32 attempts (totalling almost 50 days) and “the server is down”.

Comment on lines -4431 to -4434:

```c
if ((*retry_count)++)
{
  if (*retry_count > master_retry_count)
    return 1; // Don't retry forever
```
@ParadoxV5 (Contributor, Author) commented:


This can be hard to comprehend. Step by step:

  1. If *retry_count was 0, skip the block
  2. Else If *retry_count becomes > master_retry_count, return 1
  3. Else, proceed with the block

This could mean that for --master-retry-count=0, the termination condition always matches, but the flow always bypasses it for the first try. That is, 0 is equivalent to 1.

I tried writing an MTR test for --master-retry-count, but my build doesn’t have debug_sync, and apparently none of our buildbots have both it and log_bin…??? (#3731)

@ParadoxV5 (Contributor, Author) commented Jan 16, 2025:

innodb.log_file_size_online and main.log_state fail on main.
P.S.: and apparently unit.conc_connection too.

`try_to_reconnect()` wraps `safe_reconnect()` with logging, but the
latter already loops reconnection attempts up to `master_retry_count`
times with `mi->connect_retry`-msec sleeps in between.
This means `try_to_reconnect()` has been counting the number of
disconnects (since it doesn’t have a loop) while `safe_reconnect()` was
counting actual attempts (which may be multiple per disconnect).
In practice, this outer counter’s only benefit was to cover the edge
case `--master-retry-count=0` for the inner loop… by treating it as 1…
“Lightly” refactor `try_to_reconnect()` `messages`
to reduce duplication and improve consistency
When the IO thread (re)connects to a primary,
no updates are available besides unique errors that cause the failure.
These new `Master_info` numbers supplement SHOW REPLICA STATUS’s (most-
recent) ‘Connecting’ state with statistics on (re)connect attempts:

* `Connects_Tried`: how many retries have been attempted so far
  This was previously a local variable that only counted re-attempts;
  it’s now meaningful even after the “Connecting” state concludes.

* `Primary_Retry_Count`: out of how many configured
  Formerly known as the global option `--master-retry-count`, it’s now
  copied per-replica to pave way for CHANGE MASTER … TO in MDEV-25674.
* `sql/mysqld.cc`: init `master-retry-count` with `master_retry_count`
* `get_master_version_and_clock()` de-duplicate label using fall-through
* `io_slave_killed()` & `check_io_slave_killed()`:
  * reüse the result from the level lower
  * add distinguishing docs
* `try_to_reconnect()`: extract `'` from `if`-`else`
* `handle_slave_io()`: Both `while`s have the same condition;
  looks like the outer `while` can simply be an `if`.
* `connect_to_master()`:
  * assume `mysql_errno()` is not 0 on connection error
  * utilize 0’s falsiness in the loop
  * remove kill check around error reporting –
    other kill checks don’t even use this result
  * extend docs
These tests dump the whole SSS – results are not guaranteed consistent!
@ParadoxV5 (Contributor, Author) left a comment:
Room for refactor that I excluded from this PR:

  • There are a few instances of `if (!…) … else` that could be restructured.
  • Both `register_slave_on_master` and `request_dump` provide failure warnings, yet their caller, `handle_slave_io`, still emits similar warnings on their failure.
  • Extraneous end-of-line whitespace

@ParadoxV5 ParadoxV5 marked this pull request as ready for review January 16, 2025 06:21
@ParadoxV5 (Contributor, Author) commented:

I’m done with staring at these consistent but incomprehensible test failures.
