-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.30 Backports 2024-12-03 #1022
Conversation
e8071c1
to
7aa0126
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to get #1023 in as well
Yup, I will pick up the change after it is merged into main (just to get main commit sha) |
Merged :-) |
Picked it up 👍 |
cce5182
to
5373c55
Compare
[ upstream commit 4879439 ] Include also sni test not having "l7" in it's name, and check the logs for errors as well. Test with embedded cilium-envoy so that Envoy error and warning logs will also be checked for. Signed-off-by: Jarno Rajahalme <[email protected]> Signed-off-by: Tam Mach <[email protected]>
[ upstream commit db5924e ] Add a runtime assert that the policy destruction happens from the main or test thread. The "test thread" case also covers the destruction from the initial thread, which happens for the static members of 'AllowAllEgressPolicyInstanceImpl' when running `cilium-envoy version', for example. Signed-off-by: Jarno Rajahalme <[email protected]> Signed-off-by: Tam Mach <[email protected]>
[ upstream commit f09ed99 ] Remove the member 'initial_policy_' from the Cilium SocketOption class, as that reference was possibly kept after the policy had already been deleted. This made it possible for the policy to be destructed from the worker thread, which can lead to Envoy crash. 'initial_policy_' was stored as we already did the policy lookup and the same policy is needed in other Cilium filters. Have the cilium.network and cilium.tls_wrapper filters do their own policy lookups instead, so that we do not need to keep the reference in SocketOption. Worker threads do the policy lookup from their own thread local maps, which hold a reference to the policy. These thread local references are relinquished from post function during policy update, which will release worker thread's last reference to the policy at the time. The main thread is the last one to update or delete policy, so the policy instance destruction happens from the main thread. For this to work it is imperative to only hold policy references in these thread local maps. Note that even keeping a weak reference accessible to the worker threads outside of the policy maps is risky, as then the worker thread could convert that weak reference to a shared pointer while the reference has already been relinquished from the thread's local policy map. In this situation the other threads could also relinquish their references concurrently to the worker thread holding the shared pointer, which would then become the last reference and destruction would happen from the worker thread again. Signed-off-by: Jarno Rajahalme <[email protected]> Signed-off-by: Tam Mach <[email protected]>
[ upstream commit 7b1dc6c ] Log an error instead of crashing when Cilium NetworkPolicy resources are destructed in a worker thread. Remove stacktrace that is not useful at all with release builds. Prior to enabling use of SDS secrets running NetworkPolicy destructors in a worker thread did not cause visible problems, even though we had taken measures to not do that a long time ago. Now that references to the policy have been removed from the connection metadata (Cilium SocketOption) this should no longer trigger. Nonetheless, crashing Envoy is too drastic as the event may be survivable (as it apparently has been for a long time when running without SDS references in the policy). Enable trace level logging for troubleshooting if this error ever occurs. Signed-off-by: Jarno Rajahalme <[email protected]> Signed-off-by: Tam Mach <[email protected]>
[upstream commit 16e43f5] Signed-off-by: Jarno Rajahalme <[email protected]> Signed-off-by: Tam Mach <[email protected]>
Remove the build-tests-only argument to include all needed targets in the tests archive. Add upstream tcp integration test in envoy-test-deps. Signed-off-by: Jarno Rajahalme <[email protected]>
Upstream tcp_proxy is missing a nullptr check in ASSERT that crashes tests when run with config that has ASSERT enabled. Signed-off-by: Jarno Rajahalme <[email protected]>
Signed-off-by: Jarno Rajahalme <[email protected]>
Update tools/stack_decode.py from upstream Envoy. Comment says it works with gcc debug builds, and indeed I had no luck getting it working with clang debug builds. Did not try gcc. We are currently dependent on clang for crosscompilation for arm64. Signed-off-by: Jarno Rajahalme <[email protected]>
Change error logs to IS_ENVOY_BUG to get a stack trace when a policy is destructed from a worker thread. Signed-off-by: Jarno Rajahalme <[email protected]>
Add patch to Envoy to release thread local slots first in the worker threads, and last in the main thread. This makes sure that destructors of shared resources are called from the main thread instead of a random worker thread. This fixes the issue where Cilium Network Policy Map was destructed from a main thread, possibly interfering with the Secret xDS gRPC protocol. Signed-off-by: Jarno Rajahalme <[email protected]>
5373c55
to
12f7899
Compare
Note: This backport PR consolidates all recent fixes related to SDS, we can push new commit if required.
Upstream testing is done in cilium/cilium#36925
Once this PR is merged, a GitHub action will update the labels of these PRs: