Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"max_lag_ms" of mysql_aws_aurora_hostgroups doesn't work #4410

Closed
xykhappy opened this issue Jan 5, 2024 · 3 comments
Closed

"max_lag_ms" of mysql_aws_aurora_hostgroups doesn't work #4410

xykhappy opened this issue Jan 5, 2024 · 3 comments

Comments

@xykhappy
Copy link

xykhappy commented Jan 5, 2024

As described in the document, ProxySQL should be able to change the status of backend servers automatically based on the detected replica lag.

max_lag_ms : replicas that have a replication greater or equal than max_lag_ms milliseconds are automatically disabled from the cluster until their lag returns below the configured threshold.

However it seems like the status change only happens once when mysql_aws_aurora_hostgroups got updated. The status never gets adjusted again regardless of the newly detected lag.

After looking into the codes of ProxySQL, I think there should be a code bug in MySQL_Monitor::evaluate_aws_aurora_results of MySQL_Monitor.cpp as shown below:

At line 6370
unsigned int current_lag_ms = estimate_lag(hse->server_id, lasts_ase, ase_idx, add_lag_ms, min_lag_ms, lag_num_checks);
hse->estimated_lag_ms = current_lag_ms;
if (current_lag_ms > max_latency_ms) {
	enable = false;
}
......
				unsigned int prev_lag_ms = estimate_lag(hse->server_id, lasts_ase, ase_idx, add_lag_ms, min_lag_ms, lag_num_checks);
				if (prev_lag_ms > max_latency_ms) {
					prev_enabled = false;
				}
				if (prev_enabled == enable) {
					// the previous status should be the same
					// do not run any action
					run_action = false;
				}

void MySQL_Monitor::evaluate_aws_aurora_results(unsigned int wHG, unsigned int rHG, AWS_Aurora_status_entry **lasts_ase, unsigned int ase_idx, unsigned int max_latency_ms, unsigned int add_lag_ms, unsigned int min_lag_ms, unsigned int lag_num_checks) {

According to the code pieces, the "ase_idx" used for "estimate_lag" is not changed between the two calls, which results "current_lag_ms" and "prev_lag_ms" the same all the time. So it always falls into "prev_enabled == enable" true and takes no action.

After comparing to previous version, I tried to change the second "ase_idx" to "prev_ase_idx" and tested with the new build. As expected, everything worked well!

Please help check if the conclusion above is correct or not. This is really important to our project.

BTW, the explanation of "max_lag_ms" needs to be corrected as:
max_lag_ms : replicas that have a replication greater or equal than max_lag_ms milliseconds are automatically disabled from the cluster until their lag returns below or equal to the configured threshold.

if (current_lag_ms > max_latency_ms) {

Thanks,
Yuankai

@JavierJF
Copy link
Collaborator

Hello @xykhappy,

first, thanks for the detailed report. This was indeed an issue, and was fixed by this commit as part of this PR. That mentioned PR has been recently merged as part of this bigger PR. So this issue is already fixed in the current development branch, and of course will be part of the next release.

Thank you, Javier.

@xykhappy
Copy link
Author

xykhappy commented Jan 17, 2024

@JavierJF Great to know!
And may I know what's the rough schedule for next release? I have checked the release history and the cadence was like once per 1~2 months. However it hasn't been updated since last August somehow.

@xykhappy
Copy link
Author

xykhappy commented Jan 18, 2024

@JavierJF And do you think if it's safe to cherry-pick this commit to v2.5.5 and build a local binary for production usage?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants