xdsclient: fix new watcher to get both old good update and nack error (if exist) from the cache #7851

Merged: 10 commits, Nov 21, 2024


purnesh42H
Contributor

@purnesh42H purnesh42H commented Nov 17, 2024

Addresses: #7819

If a watch is registered for a listener resource that is already present in the cache with an old good update as well as a more recent NACK error, the new watcher should receive both the good update and the error, without a new resource request being sent to the management server.

RELEASE NOTES:

  • xds: fixed an edge-case issue where some clients or servers would not receive errors if another channel or server with the same target was already in use.
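
The behavior described above can be illustrated with a minimal sketch. This is not the code in `authority.go`; the `cache`, `resourceState`, `watcher`, and `recorder` types and the `demo` function are hypothetical names chosen for illustration. It only shows the intended contract: a watcher registered for an already-cached resource is handed the cached good update and then the cached NACK error, and no additional request is counted.

```go
package main

import (
	"errors"
	"fmt"
)

// watcher is a hypothetical stand-in for an xDS resource watcher.
type watcher interface {
	OnUpdate(update string)
	OnError(err error)
}

// resourceState is a hypothetical cache entry: the last good update (if any)
// and the error from the most recent NACKed response (if any).
type resourceState struct {
	update   string
	hasGood  bool
	nackErr  error
	watchers []watcher
}

// cache maps resource names to their cached state and counts how many
// requests would have been sent to the management server.
type cache struct {
	resources map[string]*resourceState
	requests  int
}

func newCache() *cache { return &cache{resources: make(map[string]*resourceState)} }

// watch registers w for name. A request is counted only for the first
// watcher of a resource. If state is already cached, w receives the good
// update first and the NACK error second.
func (c *cache) watch(name string, w watcher) {
	st, ok := c.resources[name]
	if !ok {
		st = &resourceState{}
		c.resources[name] = st
		c.requests++
	}
	st.watchers = append(st.watchers, w)
	if st.hasGood {
		w.OnUpdate(st.update)
	}
	if st.nackErr != nil {
		w.OnError(st.nackErr)
	}
}

// recorder is a test watcher that records every callback in order.
type recorder struct{ events []string }

func (r *recorder) OnUpdate(u string) { r.events = append(r.events, "update:"+u) }
func (r *recorder) OnError(err error) { r.events = append(r.events, "error:"+err.Error()) }

// demo registers one watcher, fills in the cached state directly to stand in
// for a good response followed by a NACKed one, then registers a second
// watcher and returns what that second watcher saw.
func demo() ([]string, int) {
	c := newCache()
	c.watch("listener-a", &recorder{}) // the first watch triggers the only request
	st := c.resources["listener-a"]
	st.update, st.hasGood = "v1", true
	st.nackErr = errors.New("NACK: invalid v2")

	w2 := &recorder{}
	c.watch("listener-a", w2)
	return w2.events, c.requests
}

func main() {
	events, requests := demo()
	fmt.Println(events, requests)
}
```

Delivering the update before the error mirrors the order in which the two pieces of state were produced, so a late watcher ends up in the same position as one that had been registered all along.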


codecov bot commented Nov 17, 2024

Codecov Report

Attention: Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.89%. Comparing base (87f0254) to head (742da1b).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
xds/internal/xdsclient/authority.go 57.14% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7851      +/-   ##
==========================================
+ Coverage   81.74%   81.89%   +0.15%     
==========================================
  Files         375      375              
  Lines       37980    37986       +6     
==========================================
+ Hits        31045    31110      +65     
+ Misses       5622     5581      -41     
+ Partials     1313     1295      -18     
Files with missing lines Coverage Δ
xds/internal/xdsclient/authority.go 76.82% <57.14%> (+1.38%) ⬆️

... and 21 files with indirect coverage changes


@purnesh42H purnesh42H force-pushed the new-watcher-caching-behavior branch 2 times, most recently from 30597fe to d2aae7d Compare November 17, 2024 18:42
@purnesh42H purnesh42H changed the title xds/internal/xdsclient/test: new watcher resource caching behavior xds/internal/xdsclient/test: add test to verify a new watcher gets old good update and nack error from the cache Nov 17, 2024
@purnesh42H purnesh42H changed the title xds/internal/xdsclient/test: add test to verify a new watcher gets old good update and nack error from the cache xdsclient/test/lds_watchers_test: add test to verify a new watcher gets old good update and nack error from the cache Nov 17, 2024
@purnesh42H purnesh42H added this to the 1.69 Release milestone Nov 18, 2024
@purnesh42H purnesh42H changed the title xdsclient/test/lds_watchers_test: add test to verify a new watcher gets old good update and nack error from the cache xds/internal/xdsclient: fix new watcher to get both old good update and nack error (if exist) from the cache Nov 18, 2024
@purnesh42H purnesh42H force-pushed the new-watcher-caching-behavior branch 2 times, most recently from a9ee1a6 to f57c599 Compare November 18, 2024 10:03
@easwars
Contributor

easwars commented Nov 18, 2024

I think this might be worth release noting.

xds/internal/xdsclient/authority.go (outdated, resolved)

// Register another watch for the same resource. This should get the update
// and error from the cache.
lw2 := newListenerWatcherMultiple(2)
Contributor

Why do we need this new listener watcher type? Why can't we handle this case with the existing listenerWatcher type?

Contributor Author

The current listenerWatcher has a channel of size 1, and notifications get replaced. For this fix we need to verify both the good update and the error as 2 separate notifications, so we need a channel with buffer size > 1. But yeah, we don't need a new listenerWatcher struct; we can just modify the current one to have another constructor that accepts the size, and update OnError to not replace if size > 1, which is what I did.
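
The difference between the two channel sizes can be reproduced in isolation. The sketch below does not use the repository's `testutils.Channel`; `event`, `replaceSend`, `drain`, and `deliver` are hypothetical helpers that only mimic "replace the pending value when the channel is full" on a plain Go channel.

```go
package main

import "fmt"

// event is a hypothetical notification: either an update or an error.
type event struct {
	update string
	err    string
}

// replaceSend mimics replace semantics: if the channel is full, the oldest
// pending value is discarded to make room for the new one.
func replaceSend(ch chan event, e event) {
	for {
		select {
		case ch <- e:
			return
		default:
			// Channel full: drop one pending value and retry.
			select {
			case <-ch:
			default:
			}
		}
	}
}

// drain returns everything currently buffered in ch without blocking.
func drain(ch chan event) []event {
	var out []event
	for {
		select {
		case e := <-ch:
			out = append(out, e)
		default:
			return out
		}
	}
}

// deliver pushes a good update followed by a NACK error into a channel of
// the given size and returns what a reader would then observe.
func deliver(size int) []event {
	ch := make(chan event, size)
	replaceSend(ch, event{update: "good-v1"})
	replaceSend(ch, event{err: "NACK"})
	return drain(ch)
}

func main() {
	fmt.Println(len(deliver(1)), len(deliver(2)))
}
```

With a buffer of 1 the good update is overwritten by the error before the test can read it, which is why a test that must observe both notifications needs a larger buffer.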

Contributor Author

Ah, with the way I reused the existing listenerWatcher struct, it was missing a resource update during the race test. I have added the separate struct back to handle a variable-size update channel, and the race went away. I didn't get a chance to fully debug why it was happening. But maybe it's fine to have a separate struct to hold multiple updates?

Contributor

I think the correct way would be to change the listenerWatcher to have multiple channels: one each for update, error and resource not found. That way one callback will not interfere with another callback. But this change would touch a lot of tests.

I wanted to do this change when I was working on some of the refactors recently, but never got around to doing that. I would recommend making that change in a separate PR though. What do you think?
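
A rough sketch of the multi-channel idea, for illustration only: `multiChanWatcher`, its fields, `replace`, and `demo` are hypothetical names, not the repository's test helpers. Each callback writes to its own size-1 channel, so an error can no longer overwrite a pending update.

```go
package main

import (
	"errors"
	"fmt"
)

// multiChanWatcher is a hypothetical test watcher with one buffered channel
// per callback, so that one callback cannot overwrite another's notification.
type multiChanWatcher struct {
	updateCh   chan string
	errorCh    chan error
	notFoundCh chan struct{}
}

func newMultiChanWatcher() *multiChanWatcher {
	return &multiChanWatcher{
		updateCh:   make(chan string, 1),
		errorCh:    make(chan error, 1),
		notFoundCh: make(chan struct{}, 1),
	}
}

// replace keeps only the latest value in a size-1 channel. It assumes a
// single sender per channel, which holds in this sketch.
func replace[T any](ch chan T, v T) {
	select {
	case <-ch: // discard a stale pending value, if any
	default:
	}
	ch <- v
}

func (w *multiChanWatcher) OnUpdate(u string)       { replace(w.updateCh, u) }
func (w *multiChanWatcher) OnError(err error)       { replace(w.errorCh, err) }
func (w *multiChanWatcher) OnResourceDoesNotExist() { replace(w.notFoundCh, struct{}{}) }

// demo delivers an update followed by an error and reads both back.
func demo() (string, error) {
	w := newMultiChanWatcher()
	w.OnUpdate("v1")
	w.OnError(errors.New("NACK"))
	return <-w.updateCh, <-w.errorCh
}

func main() {
	u, err := demo()
	fmt.Println(u, err)
}
```

The trade-off is that ordering between different callback kinds is no longer visible to the test, which may or may not matter depending on what a given test asserts.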

Contributor Author

@purnesh42H purnesh42H Nov 20, 2024

Oh yeah, I think I can send a separate PR for that. Having 3 channels, one for each callback, is a good idea. Should this fix be blocked on that, though?

Contributor

I'd be OK with that if we create an issue for it and add a TODO here to remove this new listener watcher type once that issue is taken care of.

Contributor Author

Filed an issue #7864. It should be simple as well. Added TODO for the new watcher in this PR.

xds/internal/xdsclient/tests/lds_watchers_test.go (outdated, resolved)
xds/internal/xdsclient/tests/lds_watchers_test.go (outdated, resolved)
@easwars easwars assigned purnesh42H and unassigned easwars Nov 18, 2024
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Nov 19, 2024
@purnesh42H
Contributor Author

purnesh42H commented Nov 19, 2024

Actually, looks like there is some race condition in the test but feel free to review. Debugging the race condition.

@purnesh42H purnesh42H assigned purnesh42H and unassigned easwars Nov 19, 2024
@purnesh42H purnesh42H changed the title xds/internal/xdsclient: fix new watcher to get both old good update and nack error (if exist) from the cache xds: fix new watcher to get both old good update and nack error (if exist) from the cache Nov 19, 2024
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Nov 19, 2024
Contributor

@easwars easwars left a comment

We are still missing a test case for the scenario where the first response from the management server is NACKed, and a second watcher is registered for that same resource (and we expect the error callback to be invoked on that watcher). You could enhance the existing TestLDSWatch_NACKError test for this.

xds/internal/xdsclient/authority.go (resolved)
xds/internal/xdsclient/authority.go (outdated, resolved)
xds/internal/xdsclient/authority.go (resolved)
xds/internal/xdsclient/tests/lds_watchers_test.go (outdated, resolved)
xds/internal/xdsclient/tests/lds_watchers_test.go (outdated, resolved)
	t.Fatalf("timeout when waiting for a listener resource from the management server: %v", err)
}
gotErr = u.(listenerUpdateErrTuple).err
if gotErr == nil || !strings.Contains(gotErr.Error(), wantListenerNACKErr) {
Contributor

The error type in the xdsresource package provides a way to get to the underlying error type. See: https://github.com/grpc/grpc-go/blob/master/xds/internal/xdsclient/xdsresource/errors.go#L61

I think we should use that instead of string comparison.

I agree that existing tests are not doing that. But that is not a good enough reason for new code not to do it. In fact, it would be wonderful if existing code could be cleaned up as well.

Contributor Author

@purnesh42H purnesh42H Nov 20, 2024

I think for NACK error of no routing, we don't have any specific error type. So, type will always be unknown. That's why we are verifying the string.

Contributor

What do you think about adding a new type for NACK errors? If you think it makes sense to do that, we should file an issue to track it and eventually get to it. It might be good to get to it (and fix the tests) before making the client public.

Contributor Author

Filed an issue: #7863. I think it should be a simple change, as I described in the issue, to set the NACK error type while decoding. It can still be a separate PR, though, because we will have to update all the tests.
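
To make the typed-error idea in this thread concrete, here is a minimal, self-contained sketch. It does not reproduce the `xdsresource` package: `errorType`, `errorTypeNACKed`, `xdsError`, `newErrorf`, and `errType` are hypothetical names, and the use of `errors.As` is an assumption made for this example rather than a description of how the real package inspects errors. The point is only that a test can compare an error category instead of matching substrings of the message.

```go
package main

import (
	"errors"
	"fmt"
)

// errorType is a hypothetical error category.
type errorType int

const (
	errorTypeUnknown errorType = iota
	errorTypeNACKed            // hypothetical category for NACKed resources
)

// xdsError carries a category alongside a human-readable description.
type xdsError struct {
	typ  errorType
	desc string
}

func (e *xdsError) Error() string { return e.desc }

// newErrorf builds an error tagged with the given category.
func newErrorf(t errorType, format string, args ...any) error {
	return &xdsError{typ: t, desc: fmt.Sprintf(format, args...)}
}

// errType returns the category of err, or errorTypeUnknown if err is nil or
// does not wrap an *xdsError.
func errType(err error) errorType {
	var xe *xdsError
	if errors.As(err, &xe) {
		return xe.typ
	}
	return errorTypeUnknown
}

func main() {
	inner := newErrorf(errorTypeNACKed, "listener %q rejected", "listener-a")
	wrapped := fmt.Errorf("watch failed: %w", inner)

	// A test can assert on the category rather than on the message text.
	fmt.Println(errType(wrapped) == errorTypeNACKed)
}
```

A check of this shape stays valid when the wording of the error message changes, which is the main weakness of the `strings.Contains` assertions discussed above.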


xds/internal/xdsclient/tests/lds_watchers_test.go (outdated, resolved)
@easwars easwars removed their assignment Nov 19, 2024
@purnesh42H
Contributor Author

purnesh42H commented Nov 20, 2024

We are still missing a test case for the scenario where the first response from the management server is NACKed, and a second watcher is registered for that same resource (and we expect the error callback to be invoked on that watcher). You could enhance the existing TestLDSWatch_NACKError test for this.

I had already done that in #7851 (comment), and this is the change: https://github.com/grpc/grpc-go/pull/7851/files#diff-33ea1a548fc69853905a83ab6f29daba065a266e4145ff1db859e74ca8064ad3R939. Did I miss anything?

Comment on lines 939 to 1036
if err != nil {
	t.Fatalf("Timeout when waiting for a listener resource from the management server: %v", err)
}
gotErr := u.(listenerUpdateErrTuple).err
if gotErr == nil || !strings.Contains(gotErr.Error(), wantListenerNACKErr) {
	t.Fatalf("Update received with error: %v, want %q", gotErr, wantListenerNACKErr)
}
Contributor

Should we implement a helper for this?

// verifyListenerError waits for a notification on updateCh and returns a
// non-nil error unless it carries an error containing wantErr.
func verifyListenerError(ctx context.Context, updateCh *testutils.Channel, wantErr string) error {
	u, err := updateCh.Receive(ctx)
	if err != nil {
		return fmt.Errorf("timeout when waiting for a listener update from the management server: %v", err)
	}
	gotErr := u.(listenerUpdateErrTuple).err
	if gotErr == nil || !strings.Contains(gotErr.Error(), wantErr) {
		return fmt.Errorf("update received with error: %v, want %q", gotErr, wantErr)
	}
	return nil
}

Contributor Author

Done. Though we need the same helper for all other resource types as well. Will send a separate PR.

@easwars easwars assigned purnesh42H and unassigned easwars Nov 20, 2024
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Nov 21, 2024
@@ -94,6 +94,8 @@ type listenerWatcherMultiple struct {
updateCh *testutils.Channel
}

// TODO: delete this once `newListenerWatcher` is modified to handle multiple
Contributor

Please link the issue here.

Contributor Author

Done

@easwars easwars assigned purnesh42H and unassigned easwars Nov 21, 2024
@easwars easwars changed the title xds: fix new watcher to get both old good update and nack error (if exist) from the cache xdsclient: fix new watcher to get both old good update and nack error (if exist) from the cache Nov 21, 2024
@purnesh42H purnesh42H merged commit 44a5eb9 into grpc:master Nov 21, 2024
15 checks passed