Connect errors should invalidate DNS cache entries #4593

Closed
JavierJF opened this issue Jul 24, 2024 · 1 comment

Comments

@JavierJF
Collaborator

Current Behavior

Currently, the refresh of DNS cache entries is determined solely by the variable monitor_local_dns_cache_refresh_interval. It sets how often cached DNS entries are checked for an expired TTL and placed in the resolver queue for renewal.
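As a rough illustration of the behavior described above, here is a minimal sketch of one periodic refresh pass. The names `CacheEntry`, `collect_expired`, `refresh_pass` and the plain-list resolver queue are hypothetical and chosen only for this example; they are not ProxySQL's actual data structures.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CacheEntry:
    ip: str
    expires_at: float  # absolute time at which the entry's TTL runs out


def collect_expired(cache: Dict[str, CacheEntry], now: float) -> List[str]:
    """Return hostnames whose TTL has expired at time `now`."""
    return [host for host, entry in cache.items() if entry.expires_at <= now]


def refresh_pass(cache: Dict[str, CacheEntry], resolver_queue: List[str], now: float) -> None:
    """One pass of the periodic check (assumed to run once per refresh interval).

    Expired hostnames are appended to the resolver queue; hostnames that are
    already queued are not queued a second time.
    """
    for host in collect_expired(cache, now):
        if host not in resolver_queue:
            resolver_queue.append(host)
```

Note that nothing in this loop reacts to connection failures: an entry is only reconsidered when a pass finds its TTL expired.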

Issue

Because cache entries are refreshed only at these intervals, if a server's IP changes for any reason (e.g. an unplanned failover), all subsequent connection attempts to that server will fail until the entry's TTL expires and a new check (via the refresh interval) is triggered. One protection for these scenarios is to set a refresh interval smaller than the expected delay due to DNS update propagation. This is sufficient to reduce the expected downtime of the instance to that interval.
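To make the failure window concrete, here is a small sketch under a simplified model. It assumes (this is not stated in the issue) that checks run at exact multiples of the refresh interval and that re-resolution itself takes no time. Both function names are hypothetical.

```python
import math


def next_check_after_expiry(expires_at: float, interval: float) -> float:
    """Time of the first periodic check at or after the TTL expiry.

    Assumes checks happen at times 0, interval, 2*interval, ...
    """
    return math.ceil(expires_at / interval) * interval


def stale_window(ip_changed_at: float, expires_at: float, interval: float) -> float:
    """Seconds during which connections keep using the stale cached IP."""
    return max(0.0, next_check_after_expiry(expires_at, interval) - ip_changed_at)
```

For example, with a 60 s interval, an entry expiring at t=130 s and an IP change at t=100 s, the stale entry is only queued for renewal at the t=180 s check, so connections fail for about 80 s in this model.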

Improvement

One way to improve this would be to remove the cache entry for a server whenever a connection error to that backend instance is detected. The invalidation would be immediate and would serve as a generic protection mechanism, reducing downtime to the delay of the DNS update propagation itself. All subsequent connections to that server will perform DNS resolution until the next monitor_local_dns_cache_refresh_interval check updates the cache with a new valid value.

Implementation Details

Whenever a connect error is detected for a backend connection:

  • If DNS was used for the connection attempt (entry was retrieved from DNS cache):
    1. An exclusion list should be checked; if the error is found in it, nothing should be done. This exclusion list shall include errors that do not indicate the server is unreachable, such as Access denied errors.
    2. If the error is not found in this list, the corresponding entry for this server in the DNS cache must be removed.
  • If DNS wasn't used (no entry or disabled), nothing should be done.

This should be enough to make all subsequent connection attempts to the server perform DNS resolution until the next monitor_local_dns_cache_refresh_interval check updates the cache with a new valid value.
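The steps above can be sketched as follows. This is only an illustration of the proposed logic, not the code that was merged: `on_connect_error`, the dict-based cache and the contents of `EXCLUDED_ERRORS` are assumptions made for the example. The only error codes used are MySQL's 1045 (Access denied) and 2003 (can't connect to server).

```python
from typing import Dict

# Illustrative exclusion list (an assumption): errors that show the server
# was actually reached, so the cached IP is not the problem.
# 1045 is MySQL's ER_ACCESS_DENIED_ERROR.
EXCLUDED_ERRORS = {1045}


def on_connect_error(dns_cache: Dict[str, str], hostname: str,
                     errno: int, used_dns_cache: bool) -> bool:
    """Handle a backend connect error; return True if a cache entry was removed."""
    # DNS cache was not used for this attempt (no entry or disabled): do nothing.
    if not used_dns_cache:
        return False
    # Error is in the exclusion list: the server was reachable, keep the entry.
    if errno in EXCLUDED_ERRORS:
        return False
    # Otherwise invalidate immediately, so the next attempt resolves again.
    return dns_cache.pop(hostname, None) is not None
```

With this shape, an Access denied error leaves the entry untouched, while an error such as 2003 removes it, and later attempts fall back to DNS resolution until the periodic refresh repopulates the cache.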

@renecannao
Contributor

Solved in #4662 and #4656
