Increase reliability of EKS LB deprovisioning #3617
It looks like the method described in the blog post is also documented directly in the ingress-nginx documentation. The YAML referenced there includes a section which sets up a LoadBalancer Service backed by an NLB. That would mean it's included as part of the ingress-nginx deployment, which isn't what we want, because it would have the same drawbacks as the existing ALB controller-based method. We want to avoid that service definition being deployed with Helm, and instead create it manually using a ...
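A minimal sketch of that approach, assuming the Service is expressed as a Terraform-managed kubernetes_service resource so that Terraform owns its lifecycle; the names, labels, and annotations below are illustrative and would need to match the actual ingress-nginx release:

```hcl
# Sketch only: a Terraform-managed LoadBalancer Service for ingress-nginx,
# created outside the Helm chart. Names, labels, and annotations are assumptions.
resource "kubernetes_service" "ingress_nginx_lb" {
  metadata {
    name      = "ingress-nginx-controller"
    namespace = "ingress-nginx"
    annotations = {
      # Ask the AWS Load Balancer Controller for an internet-facing NLB with IP targets.
      "service.beta.kubernetes.io/aws-load-balancer-type"            = "external"
      "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type" = "ip"
      "service.beta.kubernetes.io/aws-load-balancer-scheme"          = "internet-facing"
    }
  }

  spec {
    type = "LoadBalancer"

    # Must match the labels on the ingress-nginx controller pods.
    selector = {
      "app.kubernetes.io/name"      = "ingress-nginx"
      "app.kubernetes.io/component" = "controller"
    }

    port {
      name        = "http"
      port        = 80
      target_port = "http"
    }
    port {
      name        = "https"
      port        = 443
      target_port = "https"
    }
  }
}
```

The chart's own Service would then need to be disabled (e.g. via something like controller.service.enabled=false in the Helm values) so the two definitions don't conflict.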
We can use ...
Re-evaluating...
If we ever need to debug the nginx controller in EKS, this solution works!
When (or if) we do transition away from the LB Controller, there is a module that simplifies the provisioning of LBs in general.
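One such community module is terraform-aws-modules/alb; a minimal sketch under that assumption (the inputs are placeholders and should be checked against the module version in use):

```hcl
# Sketch only: provision the ingress LB directly in Terraform via a community
# module, so Terraform owns it and can destroy it. All inputs are placeholders.
module "ingress_lb" {
  source = "terraform-aws-modules/alb/aws"

  name               = "ingress-nginx"
  load_balancer_type = "network"
  vpc_id             = var.vpc_id
  subnets            = var.public_subnet_ids
}
```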
A relevant issue on the ALB controller, maddeningly closed as stale. It does, however, describe the expected logs when deletion is working properly.
We should look for finalizers on the service with ... We should probably double-check that we have granted all the necessary permissions to the ALB controller; in particular, can it manage SecurityGroups? We should probably double-check that we don't have deletion protection enabled. We should probably use the new FeatureGate that limits the ALB controller to handling only LoadBalancer-type Services.
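Regarding the FeatureGate: a rough sketch of enabling it through the controller's Helm release might look like the following. The chart value path and the gate name (ServiceTypeLoadBalancerOnly) are assumptions to verify against the aws-load-balancer-controller chart's values.yaml:

```hcl
# Sketch only: restrict the AWS Load Balancer Controller to reconciling
# Service-type LoadBalancers. Value path and gate name are assumptions.
resource "helm_release" "aws_load_balancer_controller" {
  name       = "aws-load-balancer-controller"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"

  set {
    name  = "clusterName"
    value = var.cluster_name
  }

  set {
    name  = "controllerConfig.featureGates.ServiceTypeLoadBalancerOnly"
    value = "true"
  }
}
```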
I can repeat these two steps over and over reliably:
Conclusion: This does not appear to be a problem with the ALB controller app version/deployment, permissions, etc. It only happens during our ... I do see reconciliation events noted in the event history for the LoadBalancer service, matching what I see in the AWS console. However, I still haven't seen anything indicating that the reconciliation happened in the logs on the controller's side when this happens, so I'm going to dig into that next. If I figure out where the logs are in the normal case, then hopefully I can tail them while I do the terraform destroy operation to figure out what's happening.
I was able to get the ALB controller to log when reconcile activity occurs by adding the value ... I am now able to repeat these two steps over and over reliably:
Next step is to do a broader destroy while watching the logs... |
I saw correct behavior when running these two steps:
I just can't seem to reproduce the behavior that we see under CI. Next up... A full destroy...? |
Here are the logs from the ALB controller:
Note in particular the warnings that appear before the gap and then the timeout error:
Is this because the destination of the Target Groups (the ingress-nginx app) is already gone? That seems unlikely, because it would also be gone in the "not destroying everything" case: the helm_release is destroyed either way, and that part should not differ. I will have to compare this to successful reconciliation when I'm not tearing everything down, to verify that this warning only appears in the full-destroy situation.
If I run ...
and the terraform output looks like: ...
Maybe look at the other stuff getting destroyed before the ...
User Story
In order to ensure EKS instances deprovision consistently, we want to ensure all AWS resources created by the broker are managed by Terraform.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
WHEN I deprovision the EKS instance
THEN deprovisioning succeeds cleanly
AND there is no dangling LB instance associated with the EKS cluster in AWS.
Background
The AWS LB Controller dynamically provisions LBs corresponding to Service and Ingress objects in EKS. The EKS broker uses the LB Controller to dynamically provision just a single (predictable/required) LB for ingress to the ingress-nginx controller (which handles all other ingress for the cluster).
Using the ALB controller means that the EKS cluster makes use of an LB resource that Terraform doesn't know about. This can lead to race conditions and failures when `terraform destroy` is used to deprovision the EKS instance but is unable to delete ACM and VPC resources because of the dangling LB. The resolution has been to go into AWS' console to manually delete the dangling ALB and target groups, then reattempt deletion, which is not ideal by any means!

Security Considerations (required)
No concerns... We are not changing the architecture, just handling our provisioning/deprovisioning in a more reliable and manageable way.
Sketch
[Notes or a checklist reflecting our understanding of the selected approach]
Two potential approaches:

...

The `time_sleep.alb_controller_destroy_delay` idea isn't working.
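For reference, a minimal sketch of the time_sleep idea: sequence destruction so the controller outlives the ingress-nginx release long enough to deprovision the LB. Resource names, the duration, and chart values are illustrative, and most chart values are omitted for brevity.

```hcl
# Sketch only: on destroy, ingress-nginx (and its LoadBalancer Service) goes
# first, then we pause, then the controller is removed, giving it time to
# deprovision the LB. Names and durations are illustrative.
resource "helm_release" "aws_load_balancer_controller" {
  name       = "aws-load-balancer-controller"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"
  # (chart values such as clusterName omitted for brevity)
}

# Destroyed after ingress-nginx but before the controller, pausing in between
# so the controller can finish deleting the load balancer it provisioned.
resource "time_sleep" "alb_controller_destroy_delay" {
  destroy_duration = "120s"
  depends_on       = [helm_release.aws_load_balancer_controller]
}

resource "helm_release" "ingress_nginx" {
  name       = "ingress-nginx"
  repository = "https://kubernetes.github.io/ingress-nginx"
  chart      = "ingress-nginx"
  namespace  = "ingress-nginx"

  depends_on = [time_sleep.alb_controller_destroy_delay]
}
```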