Configure dogpile.cache to deal with memcached pods failures #904
Build failed (check pipeline): https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/64e175abee114fb3bf273a352d0e3b48
✔️ openstack-meta-content-provider SUCCESS in 2h 51m 58s
@@ -172,8 +172,12 @@ enabled = True
# on contoler we prefer to use memcache when its deployed
{{if .MemcachedTLS}}
backend = dogpile.cache.pymemcache
enable_retry_client = true
retry_attempts = 2
retry_delay = 0
The issue reproduces with server list as well: it takes a long time, even several seconds after the memcached pod has started.
However, I saw no visible change after applying the suggested settings to my configs.
Thanks Amit. You need to patch the keystone config as well, otherwise you'll still have keystone waiting for the memcached pod to come back (see openstack-k8s-operators/keystone-operator#511).
The other thing to consider is that we use memcached to cache the server metadata, which is expensive to compute. Restarting memcached like this effectively clears the cache, so new requests will miss the cache and need to hit the DB directly. That's fine, but it could cause things to time out.
This is mostly mitigated because we use config drive by default, and therefore instance boot has no direct dependency on the metadata API, but it is something to be aware of.
Any tempest tests that try to hit the metadata API for an instance that was created before the minor update will take slightly longer than normal to receive the response because it's not coming from the cache.
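To make the cache-miss concern above concrete, here is a minimal sketch of the cache-aside pattern, assuming a plain dict in place of memcached. The names MetadataCache and build_metadata are hypothetical and are not taken from nova; the point is only that the first request after a restart pays the full computation cost again.

```python
class MetadataCache:
    """Toy cache-aside wrapper; a plain dict stands in for memcached."""

    def __init__(self, compute):
        self._compute = compute   # expensive fallback, e.g. a DB query
        self._store = {}          # stand-in for the memcached pod
        self.misses = 0

    def get(self, instance_id):
        # On a miss, compute the value and populate the cache.
        if instance_id not in self._store:
            self.misses += 1
            self._store[instance_id] = self._compute(instance_id)
        return self._store[instance_id]

    def restart(self):
        """Simulate a memcached pod restart: all cached entries are lost."""
        self._store.clear()


def build_metadata(instance_id):
    # Hypothetical placeholder for the expensive metadata computation.
    return {"instance": instance_id}


cache = MetadataCache(build_metadata)
cache.get("vm-1")   # cold cache: miss, computed from the backing store
cache.get("vm-1")   # warm cache: hit
cache.restart()     # pod restarted during the minor update
cache.get("vm-1")   # miss again: this request pays the full cost
print(cache.misses)
```

In this toy model the miss counter ends at 2: one miss for the cold cache and one for the first request after the restart.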
9616563 to 513919a
Build failed (check pipeline): https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/8c191d1966ed48289fdca1312404556c
✔️ openstack-meta-content-provider SUCCESS in 3h 33m 00s
Whenever one of the memcached pods disappears, whether because of a rolling restart during a minor update or as a result of a failure, APIs can take a long time to detect that the pod went away and keep trying to reconnect. From a quick round of tests we saw downtimes of up to ~150s. By enabling the retry client (enable_retry_client) and limiting the number of retries, the behavior seems much more acceptable. Similarly, when TLS is not in use, we may want to set a lower value for memcache_dead_retry so that clients reconnect to a new pod (same DNS name, different IP) much sooner. Jira: https://issues.redhat.com/browse/OSPRH-11935
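The effect of limiting retries can be illustrated with a small stdlib-only sketch. The helper call_with_retries and the function unreachable_pod are hypothetical; this is not the retry client's actual implementation, and whether the real client counts attempts in exactly this way is an assumption. It only shows that a bounded number of attempts with no delay makes a request fail quickly instead of blocking on a pod that is gone.

```python
import time


def call_with_retries(operation, attempts=2, delay=0.0):
    """Call operation() up to `attempts` times, then re-raise the last error.

    Hypothetical helper: `attempts` is the total number of tries and
    `delay` is the pause in seconds between consecutive tries.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
            # Sleep only if another attempt will follow.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error


calls = []


def unreachable_pod():
    # Stand-in for a cache request against a pod that has gone away.
    calls.append(time.monotonic())
    raise ConnectionError("memcached pod is gone")


try:
    call_with_retries(unreachable_pod, attempts=2, delay=0.0)
    failed_fast = False
except ConnectionError:
    failed_fast = True

print(len(calls), failed_fast)
```

With two attempts and zero delay, the caller gets the error back after exactly two tries and can fall back to the uncached path.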
513919a to c14e365
lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lmiccini, mrkisaolamb
e5a56f9 into openstack-k8s-operators:main
Whenever one of the memcached pods disappears, whether because of a rolling restart during a minor update or as a result of a failure, APIs can take a long time to detect that the pod went away and keep trying to reconnect.
From a quick round of tests we saw downtimes of up to ~150s.
By enabling the retry client (enable_retry_client) and limiting the number of retries, the behavior seems much more acceptable.
Similarly, when TLS is not in use, we may want to set a lower value for memcache_dead_retry so that clients reconnect to a new pod (same DNS name, different IP) much sooner.
Related: OSPRH-11935
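As a rough illustration of why a lower memcache_dead_retry helps, the sketch below models a client that skips a server marked dead until a timer expires. The ServerState class, the fake clock, and the values 300 and 5 are assumptions made for the example; they are not taken from this PR or from the client library's code.

```python
class ServerState:
    """Toy model of a client-side 'dead server' timer."""

    def __init__(self, dead_retry, clock):
        self.dead_retry = dead_retry   # seconds to wait before retrying
        self._clock = clock            # injectable clock for the demo
        self._dead_until = 0.0

    def mark_dead(self):
        # After a failure, skip this server until the timer expires.
        self._dead_until = self._clock() + self.dead_retry

    def is_usable(self):
        return self._clock() >= self._dead_until


now = [0.0]


def clock():
    # Fake clock so the example does not need to sleep.
    return now[0]


slow = ServerState(dead_retry=300, clock=clock)   # illustrative high value
fast = ServerState(dead_retry=5, clock=clock)     # illustrative low value

# The pod disappears at t=0 and both clients mark it dead.
slow.mark_dead()
fast.mark_dead()

# At t=10 a replacement pod is already serving under the same DNS name.
now[0] = 10.0
print(fast.is_usable(), slow.is_usable())
```

In this model the client with the short timer is willing to reconnect at t=10, while the one with the long timer keeps skipping the server even though a new pod is available.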