Increase reliability of EKS LB deprovisioning #3617
It looks like the method described in the blog post is also documented directly in the ingress-nginx documentation. The YAML referenced there includes a section which sets up a LoadBalancer Service backed by an NLB. That would mean it's included as part of the ingress-nginx deployment, which isn't what we want, because it would have the same drawbacks as the existing ALB controller-based method. We want to avoid that service definition being deployed with Helm, and instead create it manually using a ...
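A minimal sketch of that approach, assuming the Service is expressed as a Terraform-managed kubernetes_service resource so that Terraform owns its lifecycle; the names, labels, and annotations below are illustrative and would need to match the actual ingress-nginx release:

```hcl
# Sketch only: a Terraform-managed LoadBalancer Service for ingress-nginx,
# created outside the Helm chart. Names, labels, and annotations are assumptions.
resource "kubernetes_service" "ingress_nginx_lb" {
  metadata {
    name      = "ingress-nginx-controller"
    namespace = "ingress-nginx"
    annotations = {
      # Ask the AWS Load Balancer Controller for an internet-facing NLB with IP targets.
      "service.beta.kubernetes.io/aws-load-balancer-type"            = "external"
      "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type" = "ip"
      "service.beta.kubernetes.io/aws-load-balancer-scheme"          = "internet-facing"
    }
  }

  spec {
    type = "LoadBalancer"

    # Must match the labels on the ingress-nginx controller pods.
    selector = {
      "app.kubernetes.io/name"      = "ingress-nginx"
      "app.kubernetes.io/component" = "controller"
    }

    port {
      name        = "http"
      port        = 80
      target_port = "http"
    }
    port {
      name        = "https"
      port        = 443
      target_port = "https"
    }
  }
}
```

The chart's own Service would then need to be disabled (e.g. via something like controller.service.enabled=false in the Helm values) so the two definitions don't conflict.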
We can use ...
Re-evaluating...
If we ever need to debug the nginx controller in EKS, this solution works!
When (or if) we do transition away from the LB Controller, there is a module that simplifies the provisioning of LBs in general.
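One such community module is terraform-aws-modules/alb; a minimal sketch under that assumption (the inputs are placeholders and should be checked against the module version in use):

```hcl
# Sketch only: provision the ingress LB directly in Terraform via a community
# module, so Terraform owns it and can destroy it. All inputs are placeholders.
module "ingress_lb" {
  source = "terraform-aws-modules/alb/aws"

  name               = "ingress-nginx"
  load_balancer_type = "network"
  vpc_id             = var.vpc_id
  subnets            = var.public_subnet_ids
}
```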
A relevant issue on the ALB controller, maddeningly closed as stale. It does, however, describe the expected logs when deletion is working properly.
We should look for finalizers on the service with ... We should probably double-check that we have granted all the necessary permissions to the ALB controller; in particular, can it manage SecurityGroups? We should probably double-check that we don't have deletion protection enabled. We should probably use the new FeatureGate that limits the ALB controller to handling only LoadBalancer-type Services.
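Regarding the FeatureGate: a rough sketch of enabling it through the controller's Helm release might look like the following. The chart value path and the gate name (ServiceTypeLoadBalancerOnly) are assumptions to verify against the aws-load-balancer-controller chart's values.yaml:

```hcl
# Sketch only: restrict the AWS Load Balancer Controller to reconciling
# Service-type LoadBalancers. Value path and gate name are assumptions.
resource "helm_release" "aws_load_balancer_controller" {
  name       = "aws-load-balancer-controller"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"

  set {
    name  = "clusterName"
    value = var.cluster_name
  }

  set {
    name  = "controllerConfig.featureGates.ServiceTypeLoadBalancerOnly"
    value = "true"
  }
}
```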
I can repeat these two steps over and over reliably:
Conclusion: This does not appear to be a problem with the ALB controller app version/deployment, permissions, etc. It only happens during our ... I do see reconciliation events noted in the event history for the LoadBalancer service, matching what I see in the AWS console. However, I still haven't seen anything indicating that the reconciliation happened in the logs on the controller's side when this happens, so I'm going to dig into that next. If I figure out where the logs are in the normal case, then hopefully I can tail them while I do the terraform destroy operation to figure out what's happening.
I was able to get the ALB controller to log when reconcile activity occurs by adding the value ... I am now able to repeat these two steps over and over reliably:
Next step is to do a broader destroy while watching the logs... |
I saw correct behavior when running these two steps:
I just can't seem to reproduce the behavior that we see under CI. Next up... A full destroy...? |
Here are the logs from the ALB controller:
Note in particular the warnings that appear before the gap and then the timeout error:
Is this because the destination of the Target Groups (the ingress-nginx app) is already gone? That seems unlikely, because it would also be gone in the "not destroying everything" case: the helm_release is destroyed either way, and that part should not differ. I will have to compare this to successful reconciliation when I'm not tearing everything down, to verify that this warning only appears in the full-destroy situation.
If I run ...
and the terraform output looks like: ...
Maybe look at the other stuff getting destroyed before the ...
User Story
In order to ensure EKS instances deprovision consistently, we want to ensure all AWS resources created by the broker are managed by Terraform.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
WHEN I deprovision the EKS instance
THEN deprovisioning succeeds cleanly
AND there is no dangling LB instance associated with the EKS cluster in AWS.
Background
The AWS LB Controller dynamically provisions LBs corresponding to Service and Ingress objects in EKS. The EKS broker uses the LB Controller to dynamically provision just a single (predictable/required) LB for ingress to the ingress-nginx controller (which handles all other ingress for the cluster).
Using the ALB controller means that the EKS cluster makes use of an LB resource that Terraform doesn't know about. This can lead to race conditions and failures when `terraform destroy` is used to deprovision the EKS instance but is unable to delete ACM and VPC resources because of the dangling LB. The resolution has been to go into AWS' console to manually delete the dangling ALB and target groups, then reattempt deletion, which is not ideal by any means!

Security Considerations (required)
No concerns... We are not changing the architecture, just handling our provisioning/deprovisioning in a more reliable and manageable way.
Sketch
[Notes or a checklist reflecting our understanding of the selected approach]
Two potential approaches:

...

The `time_sleep.alb_controller_destroy_delay` idea isn't working.
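For reference, a minimal sketch of the time_sleep idea: sequence destruction so the controller outlives the ingress-nginx release long enough to deprovision the LB. Resource names, the duration, and chart values are illustrative, and most chart values are omitted for brevity.

```hcl
# Sketch only: on destroy, ingress-nginx (and its LoadBalancer Service) goes
# first, then we pause, then the controller is removed, giving it time to
# deprovision the LB. Names and durations are illustrative.
resource "helm_release" "aws_load_balancer_controller" {
  name       = "aws-load-balancer-controller"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"
  # (chart values such as clusterName omitted for brevity)
}

# Destroyed after ingress-nginx but before the controller, pausing in between
# so the controller can finish deleting the load balancer it provisioned.
resource "time_sleep" "alb_controller_destroy_delay" {
  destroy_duration = "120s"
  depends_on       = [helm_release.aws_load_balancer_controller]
}

resource "helm_release" "ingress_nginx" {
  name       = "ingress-nginx"
  repository = "https://kubernetes.github.io/ingress-nginx"
  chart      = "ingress-nginx"
  namespace  = "ingress-nginx"

  depends_on = [time_sleep.alb_controller_destroy_delay]
}
```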