Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send error response if member list cannot be retrieved #9722

Merged
merged 2 commits into from
Mar 26, 2024

Conversation

brandond
Copy link
Member

@brandond brandond commented Mar 11, 2024

Proposed Changes

  • Move error response generation code into util
  • Send error response if member list cannot be retrieved
    Prevents joining nodes from being stuck with bad initial member list if there is a transient failure, or if they try to join themselves

I am assuming there was some reason we generated a dummy list instead of sending an error, but it's done this since the initial embedded etcd code was bulk-added by Darren in #1770 so I don't know why this was supposed to be a good idea.

Types of Changes

bugfix

Verification

  • See linked issue
  • Try joining a node that is behind an external load-balancer that includes the new node in its target poor. For example, using the following haproxy config, where SERVER1 is the existing node, and SERVER2 is the node attempting to join the cluster:
    global
        maxconn 256
        max-spread-checks 0

    defaults
        mode tcp
        timeout connect 10ms
        timeout client  60s
        timeout server  60s

    frontend listener
        bind *:6443
        default_backend servers

    backend servers
        retries 10
        retry-on conn-failure
        server server1 SERVER1:6443 check downinter 5ms inter 1s weight 1
        server server2 SERVER2:6443 check downinter 5ms inter 1s weight 100

This haproxy config will cause it to send traffic to the new server as soon as it is up, but before it is actually ready - which will result in it trying to join itself.

Testing

Linked Issues

User-Facing Change

Further Comments

@brandond brandond requested a review from a team as a code owner March 11, 2024 22:26
Copy link

codecov bot commented Mar 11, 2024

Codecov Report

Attention: Patch coverage is 15.06849% with 62 lines in your changes are missing coverage. Please review.

Project coverage is 45.90%. Comparing base (fe2ca9e) to head (f9e74e9).
Report is 4 commits behind head on master.

Files Patch % Lines
pkg/util/apierrors.go 21.95% 30 Missing and 2 partials ⚠️
pkg/etcd/etcd.go 7.14% 13 Missing ⚠️
pkg/server/router.go 9.09% 10 Missing ⚠️
pkg/server/auth.go 0.00% 4 Missing ⚠️
pkg/daemons/control/tunnel.go 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9722      +/-   ##
==========================================
- Coverage   53.04%   45.90%   -7.14%     
==========================================
  Files         154      155       +1     
  Lines       13599    13588      -11     
==========================================
- Hits         7214     6238     -976     
- Misses       5020     6145    +1125     
+ Partials     1365     1205     -160     
Flag Coverage Δ
e2etests 37.55% <1.36%> (-11.90%) ⬇️
inttests 39.44% <15.06%> (+0.08%) ⬆️
unittests 16.14% <1.81%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Prevents joining nodes from being stuck with bad initial member list if there is a transient failure, or if they try to join themselves

Signed-off-by: Brad Davidson <[email protected]>
@brandond brandond force-pushed the fix-etcd-self-join branch from 4b289a1 to f9e74e9 Compare March 12, 2024 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants