-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
It takes a long time to detect connection lost #205
Comments
Thanks for the detailed analysis (and sorry for the delayed response).
I am not yet sure what is better.
Yeah it takes 1min52s more precisely. My analysis below shows the following # This is the log of the last successful health check. After it
# completes a new ping is scheduled for 30s later or at
# 10:55:03 + 30s = 10:55:33.
# https://github.com/boostorg/redis/blob/e8dd4d69eb3524c9a0efb0c9723790b479a4ba8f/include/boost/redis/detail/health_checker.hpp#L56
10:55:03 writer-op: 45 bytes written.
10:55:03 reader-op: 4 bytes read.
# This block is the first reaction to restarting the Redis server. The
# server closed the connection cleanly by sending and EOF. Once the
# read operation detects it and exits, the write op is also canceled
# since I use wait_for_one in https://github.com/boostorg/redis/blob/e8dd4d69eb3524c9a0efb0c9723790b479a4ba8f/include/boost/redis/detail/connection_base.hpp#L196
10:55:05 reader-op: End of file
10:55:05 reader-op: EOF received. Exiting ...
10:55:05 writer-op: canceled (3). Exiting ...
10:55:05 run-op: The operation completed successfully (reader), The operation completed successfully (writer)
# 30s passes since the last successfull health-check and the next one
# is scheduled. But since there is no connection it sits there waiting
# for a new connection to be stablished. If cancel_if_not_connected
# were set in the health check it would fail immediately here, but
# that can't be used reliably since it uses socket.is_open().
10:55:33 ... # No log entry.
# The check-health operation wakes up here to check if the
# health-check pings are being received, cleans up the response of the
# last successful health-check, that happended at 10:55:03 and goes
# sleep for another 2 * 30s = 60s.
# https://github.com/boostorg/redis/blob/e8dd4d69eb3524c9a0efb0c9723790b479a4ba8f/include/boost/redis/detail/health_checker.hpp#L81
10:55:51 ... # No log entry.
# After 60s the check-health operation wakes up just to see no
# successful response was received, then it exits to the reconnect
# loop to stablish a new connection.
10:56:51 check-timeout-op: Response has no value. Exiting ...
10:56:51 ping_op: error/cancelled (1).
10:56:51 check-health-op: The I/O operation has been aborted because of either a thread exit or an application request (async_ping), Pong timeout. (async_check_timeout).
10:56:51 runner-op: The operation completed successfully (async_run_all), Pong timeout. (async_health_check) The operation completed successfully (async_hello).
10:56:51 Connection lost.
10:56:52 run-all-op: resolve addresses 127.0.0.1:6379
10:56:52 run-all-op: connected to endpoint 127.0.0.1:6379
10:56:52 writer-op: 130 bytes written.
10:56:52 reader-op: 973 bytes read.
10:56:52 hello-op: Success I am not sure yet, but the only place where I still can adjust to |
The design of the health checks looks OK.
When async_run_all completes the health check can no longer succeed and should be canceled.
|
Good point, perhaps wait_for_one_error is the correct token here, I will check. BOOST_ASIO_CORO_YIELD
asio::experimental::make_parallel_group(
[this](auto token) { return runner_->async_run_all(*conn_, logger_, token); },
[this](auto token) { return runner_->health_checker_.async_check_health(*conn_, logger_, token); },
[this](auto token) { return runner_->async_hello(*conn_, logger_, token); }
).async_wait(
asio::experimental::wait_for_all(),
std::move(self)); |
As the user points out in #205 wait_for_all might be the cause of delay in the reconnection. I could not observe any improvement in my local setup but wait_for_one_error look the correct token and all tests pass so I am willing to change it.
As the user points out in #205 wait_for_all might be the cause of delay in the reconnection. I could not observe any improvement in my local setup but wait_for_one_error look the correct token and all tests pass so I am willing to change it.
Testing with wait_for_one_error did not make any difference. Maybe because the sever disconnect is not considered an error?
|
As a simple test I added redis/include/boost/redis/detail/runner.hpp Line 158 in e8dd4d6
Then I get:
The reasoning is that no more reads and writes will be attempted after |
Been out for a few days. Without being a Boost.Redis expert, I'd say @mzimbres diagnostics is mainly correct. But bear in mind that from this log line:
This op is completing successfully. I don't know under which circumstances the other ops may complete, but I'd say it makes sense to cancel the other operations once one of them exits. Instead of The degree of nesting async ops makes testing quite difficult, so I'd suggest exploring the option of replacing some of these by a sans-io state machine, which usually leads to more testable code. |
As the user points out in #205 wait_for_all might be the cause of delay in the reconnection. I could not observe any improvement in my local setup but wait_for_one_error look the correct token and all tests pass so I am willing to change it.
Back form summer vacation ... @VemundH I couldn't reproduce this problem so far, I tried to run Redis manually, on systemd and on docker as you explained in the PR. But I think you are correct on the fact that since EOF is not considered and error the parallel group linked above will have to wait until the health-check returns an error. I have changed this in the PR here. Can you please try again? |
@mzimbres I have tested with the changes in the PR and now the disconnect is detected when EOF is received. Thanks!
|
As the user points out in #205 wait_for_all might be the cause of delay in the reconnection. I could not observe any improvement in my local setup but wait_for_one_error look the correct token and all tests pass so I am willing to change it.
Correct, that is achieved with the PR by making EOF an error and using
The problem with that set of operation is that two of them are long running (
At first I tried to launch all operations in the same group i.e. read, write, handshake, health_check. That reduces the nested ops but bloated the connection class, that is why I split it in two classes. I agree however there is still room for improvements. |
As the user points out in #205 wait_for_all might be the cause of delay in the reconnection. I could not observe any improvement in my local setup but wait_for_one_error look the correct token and all tests pass so I am willing to change it.
As the user points out in #205 wait_for_all might be the cause of delay in the reconnection. I could not observe any improvement in my local setup but wait_for_one_error look the correct token and all tests pass so I am willing to change it.
Small test program based on the examples in this repo:
Using 30 seconds health check interval as indicated in this comment will be default in a future version.
Typically seeing a delay between 3x to 4x health check interval after restarting Redis before connection lost is detected and reconnect attempted.
When restarting Redis the run-op completes almost immediately. This log shows that almost 2 min later connection lost is detected:
Added a small change to logger to get the timestamps:
Setting
cancel_if_not_connected = true
in the request will result inasync_exec
failing with the expected not connected error:The text was updated successfully, but these errors were encountered: