Increase reliability of EKS LB deprovisioning #3617

Closed
mogul opened this issue Dec 23, 2021 · 12 comments

@mogul
Contributor

mogul commented Dec 23, 2021

User Story

In order to ensure EKS instances deprovision consistently, we want to ensure all AWS resources created by the broker are managed by Terraform.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN I have brokered an EKS instance
    WHEN I deprovision the EKS instance
    THEN deprovisioning succeeds cleanly
    AND there is no dangling LB instance associated with the EKS cluster in AWS.

Background

The AWS LB Controller dynamically provisions LBs corresponding to Service and Ingress objects in EKS. The EKS broker uses the LB-controller to dynamically provision just a single (predictable/required) LB for ingress to the ingress-nginx controller (which handles all other ingress for the cluster).

Using the ALB controller means the EKS cluster relies on an LB resource that Terraform doesn't know about. This can lead to race conditions and failures when terraform destroy is used to deprovision the EKS instance: the destroy is unable to delete the ACM and VPC resources because of the dangling LB. The workaround so far has been to go into the AWS console, manually delete the dangling LB and target groups, and then reattempt the deletion, which is not ideal by any means!

Security Considerations (required)

No concerns... We are not changing the architecture, just handling our provisioning/deprovisioning in a more reliable and manageable way.

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

Two potential approaches:

  1. Figure out why our time_sleep.alb_controller_destroy_delay idea isn't working.
  2. Set up the needed load-balancer in front of ingress-nginx directly with Terraform, and pull out the AWS LB controller. This way Terraform controls everything and the proper dependency resolution happens to tear everything down cleanly. Amazon published a blog post describing how to set up this ALB-controller-less configuration.
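
For approach 2, here's a rough, untested sketch of what a Terraform-managed NLB might look like (names are illustrative; it assumes ingress-nginx is switched to a NodePort service and that registering the nodes with the target group is handled separately, e.g. via an aws_autoscaling_attachment):

resource "aws_lb" "ingress" {
  name               = "ssb-ingress"
  load_balancer_type = "network"
  internal           = false
  subnets            = module.vpc.public_subnets
}

resource "aws_lb_target_group" "ingress_https" {
  name        = "ssb-ingress-https"
  port        = 30443            # hypothetical NodePort exposed by ingress-nginx
  protocol    = "TCP"
  target_type = "instance"
  vpc_id      = module.vpc.vpc_id
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.ingress.arn
  port              = 443
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.ingress_https.arn
  }
}

Because all of these resources live in Terraform's graph, a destroy would remove the listener, target group, and NLB before the ACM/VPC resources they depend on, which is the whole point of this approach.
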
@mogul
Contributor Author

mogul commented Dec 23, 2021

It looks like the method described in the blog post is also documented directly in the ingress-nginx documentation.

The YAML referenced there includes this section which sets up a LoadBalancer service backed by an NLB:
[Screenshot: the Service of type LoadBalancer (backed by an NLB) from the referenced YAML]

That would mean it's included as part of the ingress-nginx deployment, which isn't what we want, because it would have the same drawbacks as the existing ALB controller-based method.

We want to avoid that service definition being deployed with Helm, and instead create it manually using a kubernetes_service resource. That will enable Terraform to manage the lifecycle.
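
Roughly, that kubernetes_service resource might look like the following (a sketch only; the annotations are the ones suggested for an LB-controller-managed NLB, and the selector assumes the standard ingress-nginx chart labels):

resource "kubernetes_service" "ingress_nginx_lb" {
  metadata {
    name      = "ingress-nginx-controller"
    namespace = "kube-system"
    annotations = {
      "service.beta.kubernetes.io/aws-load-balancer-type"            = "external"
      "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type" = "ip"
      "service.beta.kubernetes.io/aws-load-balancer-scheme"          = "internet-facing"
    }
  }

  spec {
    type = "LoadBalancer"
    selector = {
      "app.kubernetes.io/name"      = "ingress-nginx"
      "app.kubernetes.io/component" = "controller"
    }
    port {
      name        = "http"
      port        = 80
      target_port = "http"
    }
    port {
      name        = "https"
      port        = 443
      target_port = "https"
    }
  }
}

The chart's own controller Service would need to be disabled (or left as ClusterIP) so the two don't collide.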

@mogul
Contributor Author

mogul commented Dec 23, 2021

We can use helm template both with and without the suggested AWS annotations to diff out exactly what should go in the Terraform resource, while still using Helm to manage the installation of the ingress-nginx controller itself (without those annotations).

@nickumia-reisys
Contributor

nickumia-reisys commented Feb 4, 2022

We upgraded the ALB Controller to a version that supports provisioning NLBs. However, Kubernetes Service definitions have to be of type LoadBalancer for the ALB Controller to provision an NLB, and the Solr Helm chart only supports exposure via Ingress.

The AWS Load Balancer Controller manages AWS Elastic Load Balancers for a Kubernetes cluster. The controller provisions the following resources.

  • An AWS Application Load Balancer (ALB) when you create a Kubernetes Ingress.
  • An AWS Network Load Balancer (NLB) when you create a Kubernetes service of type LoadBalancer. In the past, the Kubernetes in-tree load balancer was used for instance targets, but the AWS Load balancer Controller was used for IP targets. With the AWS Load Balancer Controller version 2.3.0 or later, you can create Network Load Balancers using either target type. For more information about NLB target types, see Target type in the User Guide for Network Load Balancers.

As a result, we probably need to manually provision an NLB and then figure out how to connect it to the services created by the Solr Helm chart.
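
If we go that route, one possible (untested, illustrative) shape is for Terraform to own the target group and use a TargetGroupBinding to attach the Solr pods to it, since that CRD exists for bring-your-own-LB setups; it does mean the LB controller has to stick around to reconcile the binding. Service name and namespace below are hypothetical:

resource "aws_lb_target_group" "solr" {
  name        = "ssb-solr"
  port        = 8983              # Solr's default port
  protocol    = "TCP"
  target_type = "ip"
  vpc_id      = module.vpc.vpc_id
}

resource "kubernetes_manifest" "solr_tgb" {
  manifest = {
    apiVersion = "elbv2.k8s.aws/v1beta1"
    kind       = "TargetGroupBinding"
    metadata = {
      name      = "solr"
      namespace = "default"       # wherever the Solr chart deploys its Service
    }
    spec = {
      targetGroupARN = aws_lb_target_group.solr.arn
      serviceRef = {
        name = "solr-svc"         # hypothetical: whatever Service the Solr chart creates
        port = 8983
      }
    }
  }
}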

Re-evaluating..

@nickumia-reisys
Contributor

If we ever need to debug the nginx controller in EKS, this solution works!

@nickumia-reisys
Contributor

When (or if) we do transition away from the LB Controller, there is a module that simplifies the provisioning of LBs in general.

mogul changed the title from "Increase reliability of EKS ALB deprovisioning" to "Increase reliability of EKS LB deprovisioning" on Feb 16, 2022
@mogul
Contributor Author

mogul commented Mar 25, 2022

A relevant issue on the ALB controller, maddeningly closed as stale. It does however describe the expected logs when deletion is working properly.

@mogul
Contributor Author

mogul commented Mar 25, 2022

We should look for finalizers on the service with kubectl edit, since that seems to be how the ALB controller finds out that a reconcile is needed.

We should probably double-check that we have granted all the necessary permissions to the ALB controller; in particular, can it manage SecurityGroups?

We should probably double-check that we don't have deletion protection enabled.

We should probably use the new FeatureGate that limits the ALB controller to handling only LoadBalancer-type Services.
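
For that last item, the feature gate would presumably be set through the controller's Helm values, something like this inside the existing alb_controller helm_release (value name per the upstream aws-load-balancer-controller chart; worth double-checking against the chart version we pin):

# Inside module.aws_load_balancer_controller's helm_release.alb_controller
set {
  name  = "controllerConfig.featureGates.ServiceTypeLoadBalancerOnly"
  value = "true"
}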

@mogul
Contributor Author

mogul commented Mar 25, 2022

I can repeat these two steps over and over reliably:

  1. When I delete the test LoadBalancer service in k8s by hand, the NLB with the corresponding name is immediately removed in the AWS console.
  2. When I re-apply the test LoadBalancer service, a new NLB with the same name immediately appears in the AWS console (with a different DNS suffix).

Conclusion: This does not appear to be a problem with the ALB controller app version/deployment, permissions, etc. It only happens during our terraform destroy, even with a delay between destroying the LoadBalancer Service and destroying the ALB controller.

I do see reconciliation events noted in the event history for the LoadBalancer service, matching what I see in the AWS console. However, I still haven't seen anything in the controller's logs indicating that the reconciliation happened, so I'm going to dig into that next. If I figure out where the logs are in the normal case, then hopefully I can tail them while I run the terraform destroy operation to figure out what's happening.

@mogul
Contributor Author

mogul commented Mar 25, 2022

I was able to get the ALB controller to log when reconcile activity occurs by adding the value logLevel="debug" to the ALB controller Helm release.
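
(For reference, that was roughly one extra set block on the existing helm_release, assuming the chart's logLevel value:)

set {
  name  = "logLevel"
  value = "debug"
}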

I am now able to repeat these two steps over and over reliably:

  1. When I run terraform destroy -target=helm_release.ingress_nginx -auto-approve then I see logs showing that the reconcile process runs, and the LB and TargetGroups are destroyed.
  2. When I run terraform apply -target=helm_release.ingress_nginx -auto-approve then I see logs showing that the reconcile process runs, and a new LB and TargetGroups are created.

Next step is to do a broader destroy while watching the logs...

@mogul
Contributor Author

mogul commented Mar 25, 2022

I saw correct behavior when running these two steps:

  1. terraform destroy -target=module.aws_load_balancer_controller
     • The LoadBalancer Service is removed along with the rest of ingress-nginx.
     • The ALB controller logs that it removed the NLB, and it disappears in the AWS Console.
     • The time_sleep waits another 60s.
     • The rest of the terraform destroy completes.
  2. terraform apply -auto-approve
     • Everything comes up in the correct order.
     • I was able to tail the logs of the ALB Controller in time to see the reconcile logs appear when the ingress-nginx LoadBalancer service was created.

I just can't seem to reproduce the behavior that we see under CI. Next up... A full destroy...?

@mogul
Contributor Author

mogul commented Mar 25, 2022

terraform destroy -auto-approve reproduced the problem!

Here are the logs from the ALB controller:

{"level":"info","ts":1648192971.928619,"logger":"controllers.service","msg":"successfully built model","model":"{\"id\":\"kube-system/ingress-nginx-controller\",\"resources\":{}}"}
{"level":"debug","ts":1648192971.9597893,"logger":"controllers.targetGroupBinding.eventHandlers.endpoints","msg":"enqueue targetGroupBinding for endpoints event","endpoints":{"namespace":"kube-system","name":"ingress-nginx-controller"},"targetGroupBinding":{"namespace":"kube-system","name":"k8s-kubesyst-ingressn-33c838b9b3"}}
{"level":"debug","ts":1648192971.9598272,"logger":"controllers.targetGroupBinding.eventHandlers.endpoints","msg":"enqueue targetGroupBinding for endpoints event","endpoints":{"namespace":"kube-system","name":"ingress-nginx-controller"},"targetGroupBinding":{"namespace":"kube-system","name":"k8s-kubesyst-ingressn-a3e01cb792"}}
{"level":"debug","ts":1648192971.959896,"logger":"controllers.targetGroupBinding","msg":"Reconcile request","name":"k8s-kubesyst-ingressn-33c838b9b3"}
{"level":"debug","ts":1648192971.9599564,"logger":"controllers.targetGroupBinding","msg":"Reconcile request","name":"k8s-kubesyst-ingressn-a3e01cb792"}
{"level":"debug","ts":1648192971.9602473,"logger":"controller-runtime.manager.events","msg":"Warning","object":{"kind":"TargetGroupBinding","namespace":"kube-system","name":"k8s-kubesyst-ingressn-33c838b9b3","uid":"f80b9cc0-fdf8-4413-9b8d-59832cc82cfd","apiVersion":"elbv2.k8s.aws/v1beta1","resourceVersion":"3413909"},"reason":"BackendNotFound","message":"backend not found: Endpoints \"ingress-nginx-controller\" not found"}
{"level":"debug","ts":1648192971.9603293,"logger":"controller-runtime.manager.events","msg":"Warning","object":{"kind":"TargetGroupBinding","namespace":"kube-system","name":"k8s-kubesyst-ingressn-a3e01cb792","uid":"21d60265-0689-4e91-a261-e1492cd671ad","apiVersion":"elbv2.k8s.aws/v1beta1","resourceVersion":"3413911"},"reason":"BackendNotFound","message":"backend not found: Endpoints \"ingress-nginx-controller\" not found"}
{"level":"debug","ts":1648192972.009977,"logger":"controllers.targetGroupBinding.eventHandlers.endpoints","msg":"enqueue targetGroupBinding for endpoints event","endpoints":{"namespace":"kube-system","name":"ingress-nginx-controller"},"targetGroupBinding":{"namespace":"kube-system","name":"k8s-kubesyst-ingressn-33c838b9b3"}}
{"level":"debug","ts":1648192972.010008,"logger":"controllers.targetGroupBinding.eventHandlers.endpoints","msg":"enqueue targetGroupBinding for endpoints event","endpoints":{"namespace":"kube-system","name":"ingress-nginx-controller"},"targetGroupBinding":{"namespace":"kube-system","name":"k8s-kubesyst-ingressn-a3e01cb792"}}
{"level":"debug","ts":1648192972.1748843,"logger":"controllers.targetGroupBinding.eventHandlers.endpoints","msg":"enqueue targetGroupBinding for endpoints event","endpoints":{"namespace":"kube-system","name":"ingress-nginx-controller"},"targetGroupBinding":{"namespace":"kube-system","name":"k8s-kubesyst-ingressn-33c838b9b3"}}
{"level":"debug","ts":1648192972.1749246,"logger":"controllers.targetGroupBinding.eventHandlers.endpoints","msg":"enqueue targetGroupBinding for endpoints event","endpoints":{"namespace":"kube-system","name":"ingress-nginx-controller"},"targetGroupBinding":{"namespace":"kube-system","name":"k8s-kubesyst-ingressn-a3e01cb792"}}

[...some minutes later...]
{"level":"error","ts":1648193348.1022172,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-kubesyst-ingressn-33c838b9b3","namespace":"kube-system","error":"RequestError: send request failed\ncaused by: Post \"https://elasticloadbalancing.us-west-2.amazonaws.com/\": dial tcp: i/o timeout"}
{"level":"debug","ts":1648193348.1023147,"logger":"controllers.targetGroupBinding","msg":"Reconcile request","name":"k8s-kubesyst-ingressn-33c838b9b3"}

Note in particular the warnings that appear before the gap and then the timeout error:

{"level":"debug","ts":1648192971.9602473,"logger":"controller-runtime.manager.events","msg":"Warning","object":{"kind":"TargetGroupBinding","namespace":"kube-system","name":"k8s-kubesyst-ingressn-33c838b9b3","uid":"f80b9cc0-fdf8-4413-9b8d-59832cc82cfd","apiVersion":"elbv2.k8s.aws/v1beta1","resourceVersion":"3413909"},"reason":"BackendNotFound","message":"backend not found: Endpoints \"ingress-nginx-controller\" not found"}
{"level":"debug","ts":1648192971.9603293,"logger":"controller-runtime.manager.events","msg":"Warning","object":{"kind":"TargetGroupBinding","namespace":"kube-system","name":"k8s-kubesyst-ingressn-a3e01cb792","uid":"21d60265-0689-4e91-a261-e1492cd671ad","apiVersion":"elbv2.k8s.aws/v1beta1","resourceVersion":"3413911"},"reason":"BackendNotFound","message":"backend not found: Endpoints \"ingress-nginx-controller\" not found"}

Is this because the destination of the Target Groups (the ingress-nginx app) is already gone? That seems unlikely, because the app would also already be gone in the "not destroying everything" case, when only the helm_release is destroyed, and that case works fine.

I will have to compare this to successful reconciliation when I'm not tearing everything down, to verify that this warning only appears in the full-destroy situation.

And here's what is happening in the Terraform output... Clearly the ingress-nginx release is getting stuck waiting for the reconcile to happen.

local_file.kubeconfig[0]: Destroying... [id=0a6a5981ccf3139b5c1d9d9132e364e555951f22]
local_file.kubeconfig[0]: Destruction complete after 0s
kubernetes_storage_class.ebs-sc: Destroying... [id=ebs-sc]
kubernetes_cluster_role_binding.external_dns: Destroying... [id=external-dns]
kubernetes_config_map.logging: Destroying... [id=aws-observability/aws-logging]
kubernetes_cluster_role_binding.admin: Destroying... [id=admin-203f1a492ec4af92-cluster-role-binding]
helm_release.solr-operator: Destroying... [id=solr]
helm_release.zookeeper-operator: Destroying... [id=zookeeper]
helm_release.external_dns: Destroying... [id=external-dns]
kubernetes_cluster_role_binding.external_dns: Destruction complete after 0s
kubernetes_config_map.logging: Destruction complete after 0s
kubernetes_storage_class.ebs-sc: Destruction complete after 0s
kubernetes_cluster_role_binding.admin: Destruction complete after 0s
kubernetes_namespace.logging: Destroying... [id=aws-observability]
kubernetes_cluster_role.external_dns: Destroying... [id=external-dns]
kubernetes_cluster_role.external_dns: Destruction complete after 0s
kubernetes_service_account.admin: Destroying... [id=kube-system/admin-203f1a492ec4af92]
kubernetes_service_account.admin: Destruction complete after 0s
random_id.name: Destroying... [id=ID8aSS7Er5I]
random_id.name: Destruction complete after 0s
aws_kms_alias.cluster: Destroying... [id=alias/DNSSEC-bmog2]
helm_release.external_dns: Destruction complete after 1s
aws_kms_alias.cluster: Destruction complete after 0s
kubernetes_service_account.external_dns: Destroying... [id=kube-system/external-dns]
kubernetes_service_account.external_dns: Destruction complete after 0s
aws_iam_role_policy_attachment.ebs-usage["system_node_group"]: Destroying... [id=mng-bfda7245737744ec-eks-node-group-20220316204503781100000007-20220316214609258400000007]
aws_iam_role_policy_attachment.pod-logging["system_node_group"]: Destroying... [id=mng-bfda7245737744ec-eks-node-group-20220316204503781100000007-20220316214609258500000008]
aws_iam_role_policy_attachment.ssm-usage["system_node_group"]: Destroying... [id=mng-bfda7245737744ec-eks-node-group-20220316204503781100000007-20220324050712438200000001]
aws_iam_role_policy.external_dns: Destroying... [id=k8s-bfda7245737744ec-external-dns:k8s-bfda7245737744ec-external-dns20220316205552044700000013]
aws_route53_record.cluster-ns: Destroying... [id=Z04971202EMFHZ92VIV2T_bmog2.ssb-dev.data.gov_NS]
aws_ssm_maintenance_window_task.patch-vulnerabilities: Destroying... [id=49636e6f-1154-4f48-acdf-68c90aaa238d]
aws_ssm_maintenance_window_task.patch-vulnerabilities: Destruction complete after 0s
aws_route53_record.nlb: Destroying... [id=Z04462961P7B6JRZCK3VT_bmog2.ssb-dev.data.gov_A]
aws_wafv2_web_acl.waf_acl: Destroying... [id=d0d46265-4f21-4864-9f92-e0d2ec598919]
aws_iam_role_policy.external_dns: Destruction complete after 0s
aws_iam_role_policy_attachment.ebs-usage["system_node_group"]: Destruction complete after 0s
aws_iam_role_policy_attachment.ssm-usage["system_node_group"]: Destruction complete after 0s
helm_release.solr-operator: Destruction complete after 2s
aws_iam_role_policy_attachment.pod-logging["system_node_group"]: Destruction complete after 0s
module.eks.aws_eks_addon.this["vpc-cni"]: Destroying... [id=k8s-bfda7245737744ec:vpc-cni]
aws_route53_record.cluster-ds: Destroying... [id=Z04971202EMFHZ92VIV2T_bmog2.ssb-dev.data.gov_DS]
module.eks.aws_eks_addon.this["kube-proxy"]: Destroying... [id=k8s-bfda7245737744ec:kube-proxy]
helm_release.zookeeper-operator: Destruction complete after 2s
module.eks.aws_iam_openid_connect_provider.oidc_provider[0]: Destroying... [id=arn:aws:iam::645945852371:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/0686A5E1B9CC8F183B6FE6880B12F5ED]
module.eks.aws_eks_addon.this["coredns"]: Destroying... [id=k8s-bfda7245737744ec:coredns]
module.eks.aws_eks_addon.this["aws-ebs-csi-driver"]: Destroying... [id=k8s-bfda7245737744ec:aws-ebs-csi-driver]
module.eks.aws_iam_openid_connect_provider.oidc_provider[0]: Destruction complete after 0s
aws_acm_certificate_validation.cert: Destroying... [id=2022-03-16 20:45:51.473 +0000 UTC]
aws_acm_certificate_validation.cert: Destruction complete after 0s
aws_ssm_maintenance_window_target.owned-instances: Destroying... [id=3194bc64-2611-4610-9437-0056d33d21b3]
aws_wafv2_web_acl.waf_acl: Destruction complete after 1s
aws_iam_role.external_dns: Destroying... [id=k8s-bfda7245737744ec-external-dns]
aws_ssm_maintenance_window_target.owned-instances: Destruction complete after 0s
aws_iam_policy.ebs-usage: Destroying... [id=arn:aws:iam::645945852371:policy/k8s-bfda7245737744ec-ebs-policy20220316204503773900000006]
aws_iam_policy.ebs-usage: Destruction complete after 0s
aws_iam_policy.ssm-access-policy: Destroying... [id=arn:aws:iam::645945852371:policy/k8s-bfda7245737744ec-ssm-policy]
module.eks.aws_eks_addon.this["vpc-cni"]: Destruction complete after 2s
aws_iam_role.external_dns: Destruction complete after 1s
aws_iam_policy.pod-logging: Destroying... [id=arn:aws:iam::645945852371:policy/k8s-bfda7245737744ec-pod-logging]
module.eks.aws_eks_addon.this["kube-proxy"]: Destruction complete after 2s
aws_ssm_maintenance_window.window: Destroying... [id=mw-03678bcccafc95383]
aws_ssm_maintenance_window.window: Destruction complete after 0s
aws_iam_policy.ssm-access-policy: Destruction complete after 0s
aws_iam_policy.pod-logging: Destruction complete after 0s
module.eks.aws_eks_addon.this["coredns"]: Destruction complete after 3s
kubernetes_namespace.logging: Destruction complete after 7s
module.eks.aws_eks_addon.this["aws-ebs-csi-driver"]: Destruction complete after 5s
aws_route53_record.cluster-ns: Still destroying... [id=Z04971202EMFHZ92VIV2T_bmog2.ssb-dev.data.gov_NS, 10s elapsed]
aws_route53_record.nlb: Still destroying... [id=Z04462961P7B6JRZCK3VT_bmog2.ssb-dev.data.gov_A, 10s elapsed]
aws_route53_record.cluster-ds: Still destroying... [id=Z04971202EMFHZ92VIV2T_bmog2.ssb-dev.data.gov_DS, 10s elapsed]
aws_route53_record.cluster-ns: Still destroying... [id=Z04971202EMFHZ92VIV2T_bmog2.ssb-dev.data.gov_NS, 20s elapsed]
aws_route53_record.nlb: Still destroying... [id=Z04462961P7B6JRZCK3VT_bmog2.ssb-dev.data.gov_A, 20s elapsed]
aws_route53_record.cluster-ds: Still destroying... [id=Z04971202EMFHZ92VIV2T_bmog2.ssb-dev.data.gov_DS, 20s elapsed]
aws_route53_record.cluster-ns: Still destroying... [id=Z04971202EMFHZ92VIV2T_bmog2.ssb-dev.data.gov_NS, 30s elapsed]
aws_route53_record.nlb: Still destroying... [id=Z04462961P7B6JRZCK3VT_bmog2.ssb-dev.data.gov_A, 30s elapsed]
aws_route53_record.cluster-ds: Still destroying... [id=Z04971202EMFHZ92VIV2T_bmog2.ssb-dev.data.gov_DS, 30s elapsed]
aws_route53_record.cluster-ns: Still destroying... [id=Z04971202EMFHZ92VIV2T_bmog2.ssb-dev.data.gov_NS, 40s elapsed]
aws_route53_record.nlb: Still destroying... [id=Z04462961P7B6JRZCK3VT_bmog2.ssb-dev.data.gov_A, 40s elapsed]
aws_route53_record.cluster-ds: Still destroying... [id=Z04971202EMFHZ92VIV2T_bmog2.ssb-dev.data.gov_DS, 40s elapsed]
aws_route53_record.cluster-ns: Destruction complete after 42s
aws_route53_record.nlb: Destruction complete after 42s
aws_route53_record.cert_validation: Destroying... [id=Z04462961P7B6JRZCK3VT__a52bb6718a4e1b227ef6ade210d1b100.bmog2.ssb-dev.data.gov._CNAME]
helm_release.ingress_nginx: Destroying... [id=ingress-nginx]
aws_route53_record.cluster-ds: Destruction complete after 46s
aws_route53_hosted_zone_dnssec.cluster: Destroying... [id=Z04462961P7B6JRZCK3VT]
aws_route53_record.cert_validation: Still destroying... [id=Z04462961P7B6JRZCK3VT__a52bb6718a4e1b22...10d1b100.bmog2.ssb-dev.data.gov._CNAME, 10s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 10s elapsed]
aws_route53_hosted_zone_dnssec.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT, 10s elapsed]
aws_route53_record.cert_validation: Still destroying... [id=Z04462961P7B6JRZCK3VT__a52bb6718a4e1b22...10d1b100.bmog2.ssb-dev.data.gov._CNAME, 20s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 20s elapsed]
aws_route53_hosted_zone_dnssec.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT, 20s elapsed]
aws_route53_record.cert_validation: Destruction complete after 30s
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 30s elapsed]
aws_route53_hosted_zone_dnssec.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT, 30s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 40s elapsed]
aws_route53_hosted_zone_dnssec.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT, 40s elapsed]
aws_route53_hosted_zone_dnssec.cluster: Destruction complete after 45s
aws_route53_key_signing_key.cluster: Destroying... [id=Z04462961P7B6JRZCK3VT,bmog2.ssb-dev.data.gov]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 50s elapsed]
aws_route53_key_signing_key.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT,bmog2.ssb-dev.data.gov, 10s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 1m0s elapsed]
aws_route53_key_signing_key.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT,bmog2.ssb-dev.data.gov, 20s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 1m10s elapsed]
aws_route53_key_signing_key.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT,bmog2.ssb-dev.data.gov, 30s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 1m20s elapsed]
aws_route53_key_signing_key.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT,bmog2.ssb-dev.data.gov, 40s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 1m30s elapsed]
aws_route53_key_signing_key.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT,bmog2.ssb-dev.data.gov, 50s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 1m40s elapsed]
aws_route53_key_signing_key.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT,bmog2.ssb-dev.data.gov, 1m0s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 1m50s elapsed]
aws_route53_key_signing_key.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT,bmog2.ssb-dev.data.gov, 1m10s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 2m0s elapsed]
aws_route53_key_signing_key.cluster: Destruction complete after 1m14s
aws_route53_zone.cluster: Destroying... [id=Z04462961P7B6JRZCK3VT]
aws_kms_key.cluster: Destroying... [id=e4eb3d67-f004-4584-8dc8-d473f753ba46]
aws_kms_key.cluster: Destruction complete after 0s
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 2m10s elapsed]
aws_route53_zone.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT, 10s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 2m20s elapsed]
aws_route53_zone.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT, 20s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 2m30s elapsed]
aws_route53_zone.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT, 30s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 2m40s elapsed]
aws_route53_zone.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT, 40s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 2m50s elapsed]
aws_route53_zone.cluster: Still destroying... [id=Z04462961P7B6JRZCK3VT, 50s elapsed]
aws_route53_zone.cluster: Destruction complete after 53s
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 3m0s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 3m10s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 3m20s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 3m30s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 3m40s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 3m50s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 4m0s elapsed]
helm_release.ingress_nginx: Still destroying... [id=ingress-nginx, 4m10s elapsed]

If I run terraform destroy -auto-approve again, then I see more logs from the ALB controller:


{"level":"error","ts":1648193714.5821636,"logger":"controller-runtime.manager.controller.service","msg":"Reconciler error","name":"ingress-nginx-controller","namespace":"kube-system","error":"RequestError: send request failed\ncaused by: Post \"https://ec2.us-west-2.amazonaws.com/\": dial tcp: i/o timeout"}
{"level":"debug","ts":1648193714.582242,"logger":"controller-runtime.manager.events","msg":"Warning","object":{"kind":"Service","namespace":"kube-system","name":"ingress-nginx-controller","uid":"3d79c026-0dbe-40e9-9b1f-b72d03adcb18","apiVersion":"v1","resourceVersion":"3415892"},"reason":"FailedDeployModel","message":"Failed deploy model due to RequestError: send request failed\ncaused by: Post \"https://ec2.us-west-2.amazonaws.com/\": dial tcp: i/o timeout"}
{"level":"info","ts":1648193714.58796,"logger":"controllers.service","msg":"successfully built model","model":"{\"id\":\"kube-system/ingress-nginx-controller\",\"resources\":{}}"}
{"level":"error","ts":1648193720.4847617,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-kubesyst-ingressn-a3e01cb792","namespace":"kube-system","error":"RequestCanceled: request context canceled\ncaused by: context canceled"}
{"level":"debug","ts":1648193720.4848325,"logger":"controllers.targetGroupBinding","msg":"Reconcile request","name":"k8s-kubesyst-ingressn-a3e01cb792"}
{"level":"error","ts":1648193720.4849973,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-kubesyst-ingressn-a3e01cb792","namespace":"kube-system","error":"RequestCanceled: request context canceled\ncaused by: context canceled"}
{"level":"info","ts":1648193720.4851959,"logger":"controller-runtime.manager.controller.ingress","msg":"Shutdown signal received, waiting for all workers to finish"}
{"level":"info","ts":1648193720.4852102,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"Shutdown signal received, waiting for all workers to finish","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding"}
{"level":"info","ts":1648193720.4852204,"logger":"controller-runtime.manager.controller.service","msg":"Shutdown signal received, waiting for all workers to finish"}
{"level":"info","ts":1648193720.485422,"logger":"controller-runtime.webhook","msg":"shutting down webhook server"}
{"level":"error","ts":1648193720.485583,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-kubesyst-ingressn-33c838b9b3","namespace":"kube-system","error":"RequestCanceled: request context canceled\ncaused by: context canceled"}
{"level":"debug","ts":1648193720.4856308,"logger":"controllers.targetGroupBinding","msg":"Reconcile request","name":"k8s-kubesyst-ingressn-33c838b9b3"}
{"level":"info","ts":1648193720.4856992,"logger":"controller-runtime.manager.controller.ingress","msg":"All workers finished"}
{"level":"error","ts":1648193720.48578,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-kubesyst-ingressn-33c838b9b3","namespace":"kube-system","error":"RequestCanceled: request context canceled\ncaused by: context canceled"}
{"level":"info","ts":1648193720.485811,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"All workers finished","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding"}
E0325 07:35:20.739123       1 leaderelection.go:325] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: configmaps "aws-load-balancer-controller-leader" is forbidden: User "system:serviceaccount:kube-system:aws-load-balancer-controller" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
E0325 07:35:22.748222       1 leaderelection.go:325] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: configmaps "aws-load-balancer-controller-leader" is forbidden: User "system:serviceaccount:kube-system:aws-load-balancer-controller" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
E0325 07:35:24.750900       1 leaderelection.go:325] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Unauthorized
E0325 07:35:26.752864       1 leaderelection.go:325] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Unauthorized
E0325 07:35:28.750485       1 leaderelection.go:325] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Unauthorized

and the terraform output looks like:

time_sleep.delay_alb_controller_destroy: Destroying... [id=2022-03-25T07:16:28Z]
aws_acm_certificate.cert: Destroying... [id=arn:aws:acm:us-west-2:645945852371:certificate/be3ae5f1-6b9d-48bd-a80d-ad64c21969f4]
time_sleep.delay_alb_controller_destroy: Still destroying... [id=2022-03-25T07:16:28Z, 10s elapsed]
aws_acm_certificate.cert: Still destroying... [id=arn:aws:acm:us-west-2:645945852371:cert...e/be3ae5f1-6b9d-48bd-a80d-ad64c21969f4, 10s elapsed]
time_sleep.delay_alb_controller_destroy: Still destroying... [id=2022-03-25T07:16:28Z, 20s elapsed]
aws_acm_certificate.cert: Still destroying... [id=arn:aws:acm:us-west-2:645945852371:cert...e/be3ae5f1-6b9d-48bd-a80d-ad64c21969f4, 20s elapsed]
time_sleep.delay_alb_controller_destroy: Still destroying... [id=2022-03-25T07:16:28Z, 30s elapsed]
aws_acm_certificate.cert: Still destroying... [id=arn:aws:acm:us-west-2:645945852371:cert...e/be3ae5f1-6b9d-48bd-a80d-ad64c21969f4, 30s elapsed]
time_sleep.delay_alb_controller_destroy: Still destroying... [id=2022-03-25T07:16:28Z, 40s elapsed]
aws_acm_certificate.cert: Still destroying... [id=arn:aws:acm:us-west-2:645945852371:cert...e/be3ae5f1-6b9d-48bd-a80d-ad64c21969f4, 40s elapsed]
time_sleep.delay_alb_controller_destroy: Still destroying... [id=2022-03-25T07:16:28Z, 50s elapsed]
aws_acm_certificate.cert: Still destroying... [id=arn:aws:acm:us-west-2:645945852371:cert...e/be3ae5f1-6b9d-48bd-a80d-ad64c21969f4, 50s elapsed]
time_sleep.delay_alb_controller_destroy: Still destroying... [id=2022-03-25T07:16:28Z, 1m0s elapsed]
time_sleep.delay_alb_controller_destroy: Destruction complete after 1m0s
module.aws_load_balancer_controller.aws_iam_role_policy_attachment.this: Destroying... [id=k8s-k8s-bfda7245737744ec-aws-load-balancer-controller-20220325071624472200000001]
module.aws_load_balancer_controller.kubernetes_cluster_role_binding.this: Destroying... [id=aws-load-balancer-controller]
module.aws_load_balancer_controller.helm_release.alb_controller: Destroying... [id=aws-load-balancer-controller]
module.aws_load_balancer_controller.kubernetes_cluster_role_binding.this: Destruction complete after 1s
module.aws_load_balancer_controller.kubernetes_cluster_role.this: Destroying... [id=aws-load-balancer-controller]
module.aws_load_balancer_controller.kubernetes_cluster_role.this: Destruction complete after 0s
aws_acm_certificate.cert: Still destroying... [id=arn:aws:acm:us-west-2:645945852371:cert...e/be3ae5f1-6b9d-48bd-a80d-ad64c21969f4, 1m0s elapsed]
module.aws_load_balancer_controller.aws_iam_role_policy_attachment.this: Destruction complete after 1s
module.aws_load_balancer_controller.aws_iam_policy.this: Destroying... [id=arn:aws:iam::645945852371:policy/k8s-k8s-bfda7245737744ec-alb-management]
module.aws_load_balancer_controller.aws_iam_policy.this: Destruction complete after 1s
module.aws_load_balancer_controller.helm_release.alb_controller: Destruction complete after 2s
module.aws_load_balancer_controller.kubernetes_service_account.this: Destroying... [id=kube-system/aws-load-balancer-controller]
module.aws_load_balancer_controller.kubernetes_service_account.this: Destruction complete after 0s
module.aws_load_balancer_controller.aws_iam_role.this: Destroying... [id=k8s-k8s-bfda7245737744ec-aws-load-balancer-controller]
module.aws_load_balancer_controller.aws_iam_role.this: Destruction complete after 2s
null_resource.cluster-functional: Destroying... [id=9212455146380743635]

[...more stuff until the full destroy gets stuck at this, looping...]

module.vpc.aws_subnet.public[1]: Still destroying... [id=subnet-0e19c6a4eb7ca6a28, 4m20s elapsed]
module.vpc.aws_internet_gateway.this[0]: Still destroying... [id=igw-06d4910593c395c68, 4m20s elapsed]
module.vpc.aws_subnet.public[0]: Still destroying... [id=subnet-09951ac016607037f, 4m20s elapsed]
module.vpc.aws_subnet.public[2]: Still destroying... [id=subnet-010ad3ed565d07801, 4m20s elapsed]
aws_acm_certificate.cert: Still destroying... [id=arn:aws:acm:us-west-2:645945852371:cert...e/be3ae5f1-6b9d-48bd-a80d-ad64c21969f4, 6m10s elapsed]

[...until I manually delete the LoadBalancer and TargetGroups, after which everything continues until the destroy is completed successfully!]

Maybe look at the other resources getting destroyed before (or alongside) the ingress-nginx helm_release; one of them might be causing this problem...
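
If the ordering does turn out to be the culprit, one hedged guess at a fix is to make the ingress-nginx release depend explicitly on everything the ALB controller needs to finish its reconcile, so a full destroy tears the LoadBalancer Service down first and only then removes the controller, the nodes/addons, and the network path to the AWS APIs. A sketch, using the module names as they appear above:

resource "helm_release" "ingress_nginx" {
  # ...existing chart/values settings unchanged...

  depends_on = [
    module.aws_load_balancer_controller, # controller must outlive the LoadBalancer Service
    module.eks,                          # nodes, CNI, and CoreDNS must stay up for the reconcile
  ]
}
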
