Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(controller): set WatchListPageSize when WatchError #13466

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

fyp711
Copy link
Contributor

@fyp711 fyp711 commented Aug 14, 2024

Fixes #13461

Motivation

If relist options , case options.ResourceVersion != "" && options.ResourceVersion != "0": pager.PageSize = 0. It's will set options.Limit=0, and than you will set ResourceVersion=''. So when do relist, it lists all objects from etcd.

Below code is a part of reflector.go.

go func() {
defer func() {
		if r := recover(); r != nil {
			panicCh <- r
		}
	}()
	// Attempt to gather list in chunks, if supported by listerWatcher, if not, the first
	// list request will return the full response.
	pager := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {
		return r.listerWatcher.List(opts)
	}))
	switch {
	case r.WatchListPageSize != 0:
		pager.PageSize = r.WatchListPageSize
	case r.paginatedResult:
		// We got a paginated result initially. Assume this resource and server honor
		// paging requests (i.e. watch cache is probably disabled) and leave the default
		// pager size set.
	case options.ResourceVersion != "" && options.ResourceVersion != "0":
		// User didn't explicitly request pagination.
		//
		// With ResourceVersion != "", we have a possibility to list from watch cache,
		// but we do that (for ResourceVersion != "0") only if Limit is unset.
		// To avoid thundering herd on etcd (e.g. on master upgrades), we explicitly
		// switch off pagination to force listing from watch cache (if enabled).
		// With the existing semantic of RV (result is at least as fresh as provided RV),
		// this is correct and doesn't lead to going back in time.
		//
		// We also don't turn off pagination for ResourceVersion="0", since watch cache
		// is ignoring Limit in that case anyway, and if watch cache is not enabled
		// we don't introduce regression.
		pager.PageSize = 0
	}

	list, paginatedResult, err = pager.List(context.Background(), options)
	if isExpiredError(err) || isTooLargeResourceVersionError(err) {
		r.setIsLastSyncResourceVersionUnavailable(true)
		// Retry immediately if the resource version used to list is unavailable.
		// The pager already falls back to full list if paginated list calls fail due to an "Expired" error on
		// continuation pages, but the pager might not be enabled, the full list might fail because the
		// resource version it is listing at is expired or the cache may not yet be synced to the provided
		// resource version. So we need to fallback to resourceVersion="" in all to recover and ensure
		// the reflector makes forward progress.
		list, paginatedResult, err = pager.List(context.Background(), metav1.ListOptions{ResourceVersion: r.relistResourceVersion()})
	}
	close(listCh)
		}()

see this part of code below, if ResourceVersion= "" K8s apiserver will always request etcd directly.

func (c *Cacher) GetToList(ctx context.Context, key string, opts storage.ListOptions, listObj runtime.Object) error {
        resourceVersion := opts.ResourceVersion
	pred := opts.Predicate
	pagingEnabled := utilfeature.DefaultFeatureGate.Enabled(features.APIListChunking)
	hasContinuation := pagingEnabled && len(pred.Continue) > 0
	hasLimit := pagingEnabled && pred.Limit > 0 && resourceVersion != "0"
	if resourceVersion == "" || hasContinuation || hasLimit || opts.ResourceVersionMatch == metav1.ResourceVersionMatchExact {
		// If resourceVersion is not specified, serve it from underlying
		// storage (for backward compatibility). If a continuation is
		// requested, serve it from the underlying storage as well.
		// Limits are only sent to storage when resourceVersion is non-zero
		// since the watch cache isn't able to perform continuations, and
		// limits are ignored when resource version is zero
		return c.storage.GetToList(ctx, key, opts, listObj)
	}

Modifications

Add Limit=500 when set options.ResourceVersion = ""

       if options.Limit == 0 {
		options.Limit = 500
	}

Verification

Wait informer watch failed,and then relist called. You can see that options.Limit=0 in debug mode.

@fyp711
Copy link
Contributor Author

fyp711 commented Aug 14, 2024

About pageSize is aligned with the default pageSize of pager in client-go. see https://github.com/kubernetes/client-go/blob/ee1a5aaf793a9ace9c433f5fb26a19058ed5f37c/tools/pager/pager.go#L31

@agilgur5 agilgur5 changed the title fix(controller): Correct limit in controller List API when Relist called fix(controller): set limit when Relist called Aug 14, 2024
Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some stylistic comments below

workflow/common/common.go Outdated Show resolved Hide resolved
@agilgur5
Copy link

Wait informer watch failed,and then relist called. You can see that options.Limit=0 in debug mode.

this makes sense to me, although I'm not sure if this could have other ramifications/consequences. both during relist, which makes sense to be a full list, and during other operations as this code block is not necessarily specific to relist

@agilgur5 agilgur5 added the area/controller Controller issues, panics label Aug 14, 2024
@Joibel
Copy link
Member

Joibel commented Aug 14, 2024

@terrytangyuan - could you look at this. Given how #12133 fixed quite a subtle problem, it'd be good to have your input on this one.

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting discovery. cc @jessesuen to take a look as well

@fyp711
Copy link
Contributor Author

fyp711 commented Aug 15, 2024

Some stylistic comments below

Fixed, Thanks

Comment on lines 918 to 920
if options.ResourceVersion == "0" {
options.ResourceVersion = ""
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@jessesuen jessesuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the thorough explanation as well as the links to the client-go code to support the need for this change. The rationale for this PR makes sense to me.

Here is my summary of the original issue, the history, and this PR (if only for my own edification):

  • The Kubernetes cache informer will call list with resourceVersion=0. resourceVersion=0 tells the K8s API server that it is allowed to use its API server cache. However, it has the consequence of pagination (a.k.a limits/chunking) not working and returns all workflows in the response. This can lead to OOM issues in the controller in systems with a lot of workflows.
  • To solve this, fix: Correct limit in controller List API calls. Fixes #11134 #11343 had blindly set resourceVersion="" in all list calls, in order to ensure that pagination would always work. This would prevent OOM situations in the controller by avoiding non-paginated huge LIST responses.
  • However, as @fyp711 discovered, unconditionally setting resourceVersion="" is not correct, since the informer might also perform a "relist", which is a list call containing an actual numeric version, e.g. resourceVersion=1234. The fact that our controller overrides 1234 with "" is incorrect behavior because ignoring resourceVersion=1234 could cause consistency issues, but also causes the request to reach etcd.
  • The fix here is to achieve both concerns:
    1. We will still ignore the cache informer's request to list with resourceVersion=0, replacing it with resourceVersion="", since we know 0 will cause memory issues
    2. But we will now honor the cache informer's request to list with resourceVersion=1234. List with resourceVersion=1234 does honor page sizes (I verified this), and we avoid potential consistency issues as well as strain on etcd (since resourceVersion=1234 allows the server to use its API server cache).

@jessesuen
Copy link
Member

The subtle differences between the list options that are worth repeating:

  • resourceVersion=&limit=500 - result is directly from etcd. limit is honored
  • resourceVersion=0&limit=500 - result from API server cache. limit is not honored (all objects will be returned)
  • resourceVersion=1234&limit=500 - result from API server cache. limit is honored

@agilgur5
Copy link

agilgur5 commented Aug 16, 2024

Thanks for the great summary Jesse, that's very helpful!

My only remaining question is the effect/ramifications of the limit=0 -> limit=500 change here

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! LGTM. Can we add the link to the PR to the code comments and record the rationale of the changes?

@fyp711
Copy link
Contributor Author

fyp711 commented Aug 16, 2024

Thank you! LGTM. Can we add the link to the PR to the code comments and record the rationale of the changes?

@terrytangyuan fixed, Thanks

}
// The reflector will set the Limit to `0` when `ResourceVersion != "" && ResourceVersion != "0"`, which will fail
// to limit the number of workflow returns. Timeouts and other errors may occur when there are a lots of workflows.
// see https://github.com/kubernetes/client-go/blob/ee1a5aaf793a9ace9c433f5fb26a19058ed5f37c/tools/cache/reflector.go#L286
Copy link

@agilgur5 agilgur5 Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this.
So in https://github.com/kubernetes/client-go/blob/ee1a5aaf793a9ace9c433f5fb26a19058ed5f37c/tools/cache/reflector.go#L289-L294 it says that no limit retrieves from watch cache and is intentionally done to avoid a thundering herd on etcd; so that sounds like adding a limit could have negative consequences and cause reading from etcd?

This behavior is confusing enough that I'm genuinely surprised this isn't documented better in client-go with usage examples

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agilgur5 Thank you for reading, let me share my opinion:
First of all, the purpose of this submission is to make the Limit take effect, in order to solve a series of problems caused by returning a large amount of workflow at once.
Next, the ListAndWatch method will only execute List once each time, and then start continuous Watch, so I think this operation will not cause too much load on etcd.
Summary: The ListWatch is not a frequent List request, Setting Limit = xx can ensure pagination return.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I get that the List portion is not as frequent (it will also happen during informer rebuilds), though part of the purpose of this PR is to reduce load on etcd, and this actually causes a relist from etcd instead of the watch cache

@imliuda
Copy link
Contributor

imliuda commented Aug 16, 2024

2. List with resourceVersion=1234 does honor page sizes (I verified this), and we avoid potential consistency issues as well as strain on etcd (since resourceVersion=1234 allows the server to use its API server cache).

I checked kubernetes source, 1.30 and former versions may not honor this, but, set resourceVersion=1234 is still needed to ensure no events will be missed.

@jessesuen
Copy link
Member

jessesuen commented Aug 16, 2024

  1. List with resourceVersion=1234 does honor page sizes (I verified this), and we avoid potential consistency issues as well as strain on etcd (since resourceVersion=1234 allows the server to use its API server cache).

I checked kubernetes source, 1.30 and former versions may not honor this, but, set resourceVersion=1234 is still needed to ensure no events will be missed.

I just tested with v1.30, v1.29, v1.26. In all cases, when using a resourceVersion > 0 the limit parameter is honored.

@jessesuen
Copy link
Member

effect/ramifications of the limit=0 -> limit=500 change here

This check makes sure that we never perform a list request that wants all objects in the cluster, instead always ensuring that responses are paginated. Since the caller did not specify a limit, we set this to the client-go default of 500.

@imliuda
Copy link
Contributor

imliuda commented Aug 16, 2024

  1. List with resourceVersion=1234 does honor page sizes (I verified this), and we avoid potential consistency issues as well as strain on etcd (since resourceVersion=1234 allows the server to use its API server cache).

I checked kubernetes source, 1.30 and former versions may not honor this, but, set resourceVersion=1234 is still needed to ensure no events will be missed.

I just tested with v1.30, v1.29, v1.26. In all cases, when using a resourceVersion > 0 the limit parameter is honored.

Sorry, maybe I didn't express clearly, when resourceVersion and limit both set, the request will go through etcd, not apiserver cache, if kubernetes is 1.30 or former versions. Following code is from release-1.30.

image

@fyp711
Copy link
Contributor Author

fyp711 commented Aug 17, 2024

@jessesuen @Joibel @agilgur5 @terrytangyuan @imliuda Unfortunately, I did a stress test and found some issues. The current way of setting the Limit is incorrect, which may result in only 500 records of informer data in certain situations, as shown below.
https://github.com/kubernetes/client-go/blob/ee1a5aaf793a9ace9c433f5fb26a19058ed5f37c/tools/pager/pager.go#L93C3-L110C2

The correct way is to set the WatchListPageSize of the reflector. Later I will submit a new fix. Thanks.
https://github.com/kubernetes/client-go/blob/ee1a5aaf793a9ace9c433f5fb26a19058ed5f37c/tools/cache/reflector.go#L99

@agilgur5
Copy link

agilgur5 commented Aug 17, 2024

Thanks for stress testing this!

The current way of setting the Limit is incorrect, which may result in only 500 records of informer data in certain situations, as shown below.
https://github.com/kubernetes/client-go/blob/ee1a5aaf793a9ace9c433f5fb26a19058ed5f37c/tools/pager/pager.go#L93C3-L110C2

Yea this is what I was concerned about, that setting it unconditionally could break cache rebuilds or other times the informer decides to do a full relist, i.e. it can break the expected behavior of an informer. In higher load scenarios, the informer is likely to keep track of more than 500 records

@agilgur5
Copy link

agilgur5 commented Aug 17, 2024

informer rebuilds

cache rebuilds

Apparently the informer framework's resync does not do a full re-list anymore per argoproj/gitops-engine#617 (comment)?

@wojtek-t if you have a minute to check this PR, your expertise could be very valuable here as well! Especially regarding k8s 1.30+ changes/impact🙏

@imliuda
Copy link
Contributor

imliuda commented Aug 18, 2024

if got a errors.IsResourceExpired() error, the pager will set resourceVersion=0, try to do a full list, but it will be updated to 500 by util.CheckLimit(options), then pager.List() only return 500 items.

// len(list) = 500, paginatedResult=false, err=nil
list, paginatedResult, err = pager.List(context.Background(), options)

fyp711 and others added 7 commits August 20, 2024 15:59
Comment on lines +1541 to +1551

func (wfc *WorkflowController) WatchErrorHandler(r *cache.Reflector, err error) {
cache.DefaultWatchErrorHandler(r, err)
if err != io.EOF {
// The reflector will set the Limit to `0` when `ResourceVersion != "" && ResourceVersion != "0"`, which will fail
// to limit the number of workflow returns. Timeouts and other errors may occur when there are a lots of workflows.
// see https://github.com/kubernetes/client-go/blob/ee1a5aaf793a9ace9c433f5fb26a19058ed5f37c/tools/cache/reflector.go#L286
r.WatchListPageSize = common.DefaultPageSize
wfc.WatchListPageSize = r.WatchListPageSize
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an error handler to set the WatchListPageSize of the reflector to ensure that it can meet the Limit restriction when an error occurs in ListAndWatch.
Currently, I have only found this way to set WatchListPageSize. If anyone has a better solution, I am happy to discuss it with you.

@fyp711
Copy link
Contributor Author

fyp711 commented Aug 21, 2024

@wojtek-t Could you help us review this pr ? We do believe that your experience can make this submission better, if you have free time. Thanks !

@fyp711 fyp711 requested a review from agilgur5 August 21, 2024 09:26
@fyp711
Copy link
Contributor Author

fyp711 commented Aug 22, 2024

Hi @agilgur5 , I have updated a new fix solution, what do you think about this fix? Thanks.

@tooptoop4
Copy link
Contributor

@fyp711 do u have a screenshot showing metrics graph of before vs after impact of this change?

@fyp711
Copy link
Contributor Author

fyp711 commented Sep 18, 2024

@fyp711 do u have a screenshot showing metrics graph of before vs after impact of this change?

@tooptoop4 Currently no, This submission is an optimization based on this #11343
Optimize two key points:

  1. Only set to empty when resourceVersion = 0
  2. Set the WatchListPageSize property to ensure pagination

@tooptoop4
Copy link
Contributor

would be good to know how much this helps etcd

@fyp711 fyp711 changed the title fix(controller): set limit when Relist called fix(controller): set WatchListPageSize when WatchError Sep 18, 2024
@fyp711
Copy link
Contributor Author

fyp711 commented Sep 18, 2024

would be good to know how much this helps etcd

I modified the description of the issue, the previous description was not very accurate, the previous discussion should be about another issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

set WatchListPageSize when WatchError
7 participants