Configure dogpile.cache to deal with memcached pods failures #904

@lmiccini lmiccini commented Nov 29, 2024

Whenever one of the memcached pods disappears, because of a rolling restart during a minor update or as a result of a failure, APIs can take a long time to detect that the pod went away and keep trying to reconnect.

From a quick round of tests we saw downtimes up to ~150s.

By enabling the retry_client and limiting the number of retries the behavior seems much more acceptable.

Similarly, when TLS is not in use, we may want to set a lower value for memcache_dead_retry so that we reconnect much faster to a new pod (which has the same DNS name but a different IP).
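
For illustration, the rendered `[cache]` section for the two cases might look roughly like the fragment below. This is only a sketch, not the merged template: the `retry_*` values are the ones shown in this PR's diff, while the non-TLS backend name and the `memcache_dead_retry` value of 10 are assumptions picked for the example (oslo.cache defaults `memcache_dead_retry` to 300 seconds).

```ini
# TLS case (pymemcache backend): bounded retries, values taken from this PR's diff
[cache]
enabled = True
backend = dogpile.cache.pymemcache
enable_retry_client = true
retry_attempts = 2
retry_delay = 0

# Non-TLS case, shown commented out as an alternative rendering of the same section.
# The backend name and the value below are assumptions for illustration only.
# [cache]
# enabled = True
# backend = dogpile.cache.memcached
# memcache_dead_retry = 10
```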

Related: OSPRH-11935

@openshift-ci openshift-ci bot requested review from kk7ds and mrkisaolamb November 29, 2024 12:39

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/64e175abee114fb3bf273a352d0e3b48

✔️ openstack-meta-content-provider SUCCESS in 2h 51m 58s
✔️ nova-operator-kuttl SUCCESS in 46m 07s
✔️ nova-operator-tempest-multinode SUCCESS in 2h 36m 05s
nova-operator-tempest-multinode-ceph FAILURE in 1h 25m 26s

@@ -172,8 +172,12 @@ enabled = True
# on contoler we prefer to use memcache when its deployed
{{if .MemcachedTLS}}
backend = dogpile.cache.pymemcache
enable_retry_client = true
retry_attempts = 2
retry_delay = 0

The issue reproduces with server list as well: it takes a long time, even several seconds after the memcached pod has started.
However, I saw no visible change after applying the suggested change to the confs.

Thanks Amit. You need to patch the keystone config as well, otherwise you'll still have keystone waiting for the memcached pod to come back (see openstack-k8s-operators/keystone-operator#511).

The other thing to consider is that we use memcache to cache the server metadata, which is expensive to compute. Restarting memcache like this effectively clears the cache, so new requests will miss the cache and need to hit the DB directly. That's fine, but it could cause things to time out.

This is mostly mitigated because we use config drive by default, and therefore instance boot has no direct dependency on the metadata API, but it is just something to be aware of.

Any tempest tests that try to hit the metadata API for an instance that was created before the minor update will take slightly longer than normal to receive the response, because it's not coming from the cache.

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/8c191d1966ed48289fdca1312404556c

✔️ openstack-meta-content-provider SUCCESS in 3h 33m 00s
nova-operator-kuttl RETRY_LIMIT in 18m 31s
nova-operator-tempest-multinode FAILURE in 22m 04s
✔️ nova-operator-tempest-multinode-ceph SUCCESS in 2h 51m 16s

Whenever one of the memcached pods disappears, because of a rolling
restart during a minor update or as a result of a failure, APIs can
take a long time to detect that the pod went away and keep trying
to reconnect.

From a quick round of tests we saw downtimes up to ~150s.

By enabling the retry_client and limiting the number of retries
the behavior seems much more acceptable.

Similarly, when TLS is not in use, we may want to set a lower
value for memcache_dead_retry so that we reconnect much faster to a
new pod (which has the same DNS name but a different IP).

Jira: https://issues.redhat.com/browse/OSPRH-11935

@mrkisaolamb mrkisaolamb left a comment

lgtm

openshift-ci bot commented Jan 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lmiccini, mrkisaolamb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jan 7, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit e5a56f9 into openstack-k8s-operators:main Jan 7, 2025
7 checks passed