
Kong DP node killed with SIGQUIT during startup with large configuration #11651

Closed
1 task done
shettyh opened this issue Sep 26, 2023 · 12 comments
Labels
core/performance, pending author feedback (Waiting for the issue author to get back to a maintainer with findings, more details, etc.)

Comments

@shettyh
Contributor

shettyh commented Sep 26, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

3.4.0

Current Behavior

During startup, DP nodes crash without any errors in the logs.

CP Logs

2023/09/26 10:42:56 [debug] 1264#0: *1964 [lua] init.lua:283: [cluster_events] polling events from: 1695705160.249
2023/09/26 10:42:56 [debug] 1264#0: *2947 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:42:56 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:42:56 [debug] 1266#0: *2745 [lua] control_plane.lua:114: plugin configuration map key: clustering:cp_plugins_configured:worker_0 configuration: {"prometheus-x":true,"lake-events":true,"acl":true,"zipkin-x":true,"rate-limiter-non-contextual":true,"rate-limiter-contextual":true,"consumer-authentication":true,"key-auth":true,"log-tracer":true,"upstream-jwt":true,"consumer-identifier":true,"basic-auth-x":true,"ip-restriction-x":true,"authz-enforcer":true,"prometheus":true,"jwt-x":true,"impersonation-grant":true}
2023/09/26 10:42:57 [debug] 1263#0: *2955 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:42:57 [debug] 1263#0: *2955 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:42:57 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:01 [debug] 1264#0: *1998 [lua] init.lua:283: [cluster_events] polling events from: 1695705160.249
2023/09/26 10:43:01 [debug] 1263#0: *2940 [lua] control_plane.lua:114: plugin configuration map key: clustering:cp_plugins_configured:worker_1 configuration: {"prometheus-x":true,"lake-events":true,"acl":true,"zipkin-x":true,"rate-limiter-non-contextual":true,"rate-limiter-contextual":true,"consumer-authentication":true,"key-auth":true,"log-tracer":true,"upstream-jwt":true,"consumer-identifier":true,"basic-auth-x":true,"ip-restriction-x":true,"authz-enforcer":true,"prometheus":true,"jwt-x":true,"impersonation-grant":true}
2023/09/26 10:43:02 [debug] 1264#0: *2966 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:02 [debug] 1264#0: *2966 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:43:02 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:04 [debug] 1264#0: *2967 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:04 [debug] 1264#0: *2967 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.25.68.236 - - [26/Sep/2023:10:43:04 +0530] "GET /metrics HTTP/1.1" 200 19667 "-" "Go-http-client/1.1"
2023/09/26 10:43:06 [debug] 1264#0: *2969 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:06 [debug] 1264#0: *2969 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:43:06 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:06 [debug] 1264#0: *1979 [lua] init.lua:283: [cluster_events] polling events from: 1695705160.249
2023/09/26 10:43:07 [debug] 1264#0: *2978 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:07 [debug] 1264#0: *2978 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:43:07 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:07 [debug] 1263#0: *2940 [lua] control_plane.lua:345: [clustering] received ping frame from data plane [id: e7fcc90d-4558-4645-9bb4-dd459980ba9f, host: edge-dp-7899c75c8f-7jw8s, ip: 10.24.47.33, version: 3.4.0]
2023/09/26 10:43:07 [debug] 1263#0: *2733 [lua] callback.lua:114: do_event(): worker-events: handling event; source=dao:crud, event=create, wid=nil
2023/09/26 10:43:07 [debug] 1263#0: *2940 [lua] control_plane.lua:437: [clustering] sent config update to data plane [id: e7fcc90d-4558-4645-9bb4-dd459980ba9f, host: edge-dp-7899c75c8f-7jw8s, ip: 10.24.47.33, version: 3.4.0]
2023/09/26 10:43:07 [debug] 1263#0: *2940 [lua] control_plane.lua:391: [clustering] sent pong frame to data plane [id: e7fcc90d-4558-4645-9bb4-dd459980ba9f, host: edge-dp-7899c75c8f-7jw8s, ip: 10.24.47.33, version: 3.4.0]
2023/09/26 10:43:11 [debug] 1263#0: *2985 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:11 [debug] 1264#0: *1982 [lua] init.lua:283: [cluster_events] polling events from: 1695705160.249
2023/09/26 10:43:11 [debug] 1263#0: *2985 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.25.68.216 - - [26/Sep/2023:10:43:11 +0530] "GET /metrics HTTP/1.1" 200 20095 "-" "vm_promscrape"
2023/09/26 10:43:12 [debug] 1263#0: *2987 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:12 [debug] 1263#0: *2987 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:43:12 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:16 [debug] 1263#0: *2989 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:16 [debug] 1263#0: *2989 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:43:16 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:16 [debug] 1264#0: *1999 [lua] init.lua:283: [cluster_events] polling events from: 1695705160.249
2023/09/26 10:43:17 [debug] 1263#0: *2997 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:17 [debug] 1263#0: *2997 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:43:17 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:21 [debug] 1264#0: *1976 [lua] init.lua:283: [cluster_events] polling events from: 1695705160.249
2023/09/26 10:43:22 [debug] 1263#0: *3005 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:22 [debug] 1263#0: *3005 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:43:22 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:23 [debug] 1263#0: *2940 [lua] control_plane.lua:345: [clustering] received ping frame from data plane [id: e7fcc90d-4558-4645-9bb4-dd459980ba9f, host: edge-dp-7899c75c8f-7jw8s, ip: 10.24.47.33, version: 3.4.0]
2023/09/26 10:43:23 [debug] 1263#0: *2733 [lua] callback.lua:114: do_event(): worker-events: handling event; source=dao:crud, event=update, wid=nil
2023/09/26 10:43:23 [debug] 1263#0: *2940 [lua] control_plane.lua:391: [clustering] sent pong frame to data plane [id: e7fcc90d-4558-4645-9bb4-dd459980ba9f, host: edge-dp-7899c75c8f-7jw8s, ip: 10.24.47.33, version: 3.4.0]
2023/09/26 10:43:26 [debug] 1263#0: *3014 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:26 [debug] 1263#0: *3014 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:43:26 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:26 [debug] 1264#0: *1991 [lua] init.lua:283: [cluster_events] polling events from: 1695705160.249
2023/09/26 10:43:27 [debug] 1263#0: *3018 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:27 [debug] 1263#0: *3018 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:43:27 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:31 [debug] 1264#0: *1968 [lua] init.lua:283: [cluster_events] polling events from: 1695705160.249
2023/09/26 10:43:32 [debug] 1263#0: *3023 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:32 [debug] 1263#0: *3023 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:43:32 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:35 [error] 1263#0: *2940 [lua] control_plane.lua:463: serve_cluster_listener(): [clustering] failed to receive the first 2 bytes: closed [id: e7fcc90d-4558-4645-9bb4-dd459980ba9f, host: edge-dp-7899c75c8f-7jw8s, ip: 10.24.47.33, version: 3.4.0], client: 10.24.47.33, server: kong_cluster_listener, request: "GET /v1/outlet?node_id=e7fcc90d-4558-4645-9bb4-dd459980ba9f&node_hostname=edge-dp-7899c75c8f-7jw8s&node_version=3.4.0 HTTP/1.1", host: "edge-cp.perf1.svc.cluster.local:8005"
10.24.47.33 - - [26/Sep/2023:10:43:35 +0530] "GET /v1/outlet?node_id=e7fcc90d-4558-4645-9bb4-dd459980ba9f&node_hostname=edge-dp-7899c75c8f-7jw8s&node_version=3.4.0 HTTP/1.1" 101 30678182 "-" "-"
2023/09/26 10:43:36 [debug] 1263#0: *3028 [lua] init.lua:23: poll(): worker-events: emulate poll method
2023/09/26 10:43:36 [debug] 1263#0: *3028 [lua] init.lua:23: poll(): worker-events: emulate poll method
10.21.103.111 - - [26/Sep/2023:10:43:36 +0530] "GET /status HTTP/1.1" 200 1555 "-" "kube-probe/1.20"
2023/09/26 10:43:36 [debug] 1264#0: *1963 [lua] init.lua:283: [cluster_events] polling events from:

DP Logs

2023/09/26 10:42:33 [debug] 1260#0: *455 [lua] client.lua:623: init(): [dns-client] ndots = 1
2023/09/26 10:42:33 [debug] 1260#0: *455 [lua] client.lua:625: init(): [dns-client] search = perf1.svc.cluster.local, svc.cluster.local, cluster.local, ap-south-1.compute.internal
2023/09/26 10:42:33 [debug] 1260#0: *455 [lua] client.lua:638: init(): [dns-client] badTtl = 1 s
2023/09/26 10:42:33 [debug] 1260#0: *455 [lua] client.lua:640: init(): [dns-client] emptyTtl = 30 s
2023/09/26 10:42:33 [debug] 1260#0: *455 [lua] upstreams.lua:270: update_balancer_state(): update proxy state timer scheduled
2023/09/26 10:42:33 [debug] 1260#0: *453 [lua] worker.lua:147: communicate(): 3 on (unix:/usr/local/kong/worker_events.sock) is ready
2023/09/26 10:42:33 [debug] 1257#0: *456 [lua] broker.lua:73: broadcast_events(): event published to 4 workers
2023/09/26 10:42:33 [debug] 1259#0: *9 [lua] callback.lua:114: do_event(): worker-events: handling event; source=mlcache, event=mlcache:invalidations:kong_db_cache, wid=3
2023/09/26 10:42:33 [debug] 1257#0: *300 [lua] callback.lua:114: do_event(): worker-events: handling event; source=mlcache, event=mlcache:invalidations:kong_db_cache, wid=3
2023/09/26 10:42:33 [debug] 1258#0: *6 [lua] callback.lua:114: do_event(): worker-events: handling event; source=mlcache, event=mlcache:invalidations:kong_db_cache, wid=3
2023/09/26 10:42:33 [debug] 1260#0: *453 [lua] callback.lua:114: do_event(): worker-events: handling event; source=mlcache, event=mlcache:invalidations:kong_db_cache, wid=3
2023/09/26 10:42:33 [debug] 1257#0: *302 [lua] data_plane.lua:104: send_ping(): [clustering] sent ping frame to control plane [edge-cp.perf1.svc.cluster.local:8005]
2023/09/26 10:42:33 [debug] 1258#0: *5 [lua] super.lua:161: scaling_log(): [timer-ng] load_avg: 0.0069444444444444, runable_jobs_avg: 1, alive_threads_avg: 144
2023/09/26 10:42:33 [debug] 1259#0: *7 [lua] super.lua:161: scaling_log(): [timer-ng] load_avg: 0.0069444444444444, runable_jobs_avg: 1, alive_threads_avg: 144
2023/09/26 10:42:33 [debug] 1257#0: *299 [lua] super.lua:161: scaling_log(): [timer-ng] load_avg: 0.013888888888889, runable_jobs_avg: 2, alive_threads_avg: 144
2023/09/26 10:42:33 [debug] 1260#0: *452 [lua] super.lua:161: scaling_log(): [timer-ng] load_avg: 0.0069444444444444, runable_jobs_avg: 1, alive_threads_avg: 144
2023/09/26 10:42:45 [info] 1257#0: *302 [lua] data_plane.lua:317: communicate(): [clustering] connection to control plane closed [edge-cp.perf1.svc.cluster.local:8005], context: ngx.timer
2023/09/26 10:42:51 [debug] 1257#0: *373 [lua] data_plane.lua:104: send_ping(): [clustering] sent ping frame to control plane [edge-cp.perf1.svc.cluster.local:8005]
2023/09/26 10:43:07 [debug] 1257#0: *373 [lua] data_plane.lua:292: [clustering] received pong frame from control plane [edge-cp.perf1.svc.cluster.local:8005]
2023/09/26 10:43:10 [debug] 1257#0: *373 [lua] data_plane.lua:212: [clustering] received reconfigure frame from control plane with timestamp: 1695705183.294 [edge-cp.perf1.svc.cluster.local:8005]
2023/09/26 10:43:23 [debug] 1257#0: *373 [lua] data_plane.lua:104: send_ping(): [clustering] sent ping frame to control plane [edge-cp.perf1.svc.cluster.local:8005]
2023/09/26 10:43:23 [debug] 1257#0: *373 [lua] data_plane.lua:292: [clustering] received pong frame from control plane [edge-cp.perf1.svc.cluster.local:8005]
2023/09/26 10:43:34 [notice] 1#0: signal 3 (SIGQUIT) received from 1330, shutting down
2023/09/26 10:43:34 [notice] 1258#0: gracefully shutting down
2023/09/26 10:43:34 [notice] 1260#0: gracefully shutting down
2023/09/26 10:43:34 [notice] 1259#0: gracefully shutting down
2023/09/26 10:43:34 [notice] 1257#0: gracefully shutting down
2023/09/26 10:43:34 [notice] 1259#0: exiting
2023/09/26 10:43:34 [notice] 1258#0: exiting
2023/09/26 10:43:34 [notice] 1260#0: exiting
2023/09/26 10:43:34 [notice] 1259#0: exit
2023/09/26 10:43:34 [notice] 1#0: signal 17 (SIGCHLD) received from 1259
2023/09/26 10:43:34 [notice] 1#0: worker process 1259 exited with code 0
2023/09/26 10:43:34 [notice] 1#0: signal 29 (SIGIO) received
2023/09/26 10:43:34 [notice] 1258#0: exit
2023/09/26 10:43:34 [notice] 1#0: signal 17 (SIGCHLD) received from 1258
2023/09/26 10:43:34 [notice] 1#0: worker process 1258 exited with code 0
2023/09/26 10:43:34 [notice] 1#0: signal 29 (SIGIO) received
2023/09/26 10:43:34 [notice] 1260#0: exit
2023/09/26 10:43:34 [notice] 1#0: signal 17 (SIGCHLD) received from 1260
2023/09/26 10:43:34 [notice] 1#0: worker process 1260 exited with code 0
2023/09/26 10:43:34 [notice] 1#0: signal 29 (SIGIO) received
2023/09/26 10:43:35 [notice] 1257#0: exiting
2023/09/26 10:43:35 [notice] 1257#0: exit
2023/09/26 10:43:37 [notice] 1#0: signal 17 (SIGCHLD) received from 1257
2023/09/26 10:43:37 [notice] 1#0: worker process 1257 exited with code 0
2023/09/26 10:43:37 [notice] 1#0: exit

Expected Behavior

The DP node should come up properly.

Steps To Reproduce

  • Start CP nodes (DB with a lot of configuration and consumers, ~2 million credentials)
  • Start a DP node

Anything else?

We are trying to set up CP/DP mode for the first time with an existing Kong deployment that already holds a lot of data (the cluster_max_payload value was increased for this). How is this usually done? Is all the data from CP to DP sent in a single shot? Is there any way to send it in batches?
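
For reference, cluster_max_payload is typically raised via an environment variable in Kubernetes; below is a minimal, illustrative sketch only (the image tag and the 64 MB value are assumptions, not the actual settings; Kong reads kong.conf directives from KONG_-prefixed environment variables):

# Hypothetical Deployment fragment for a Kong node (CP or DP).
containers:
  - name: kong
    image: kong:3.4.0
    env:
      # cluster_max_payload is in bytes and needs to exceed the compressed
      # config payload (the CP access log above shows a ~30 MB frame).
      # It likely has to be raised on both CP and DP nodes.
      - name: KONG_CLUSTER_MAX_PAYLOAD
        value: "67108864"   # 64 MB, illustrative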

@shettyh
Contributor Author

shettyh commented Sep 26, 2023

CPU and memory usage of the CP nodes is very high during startup.

(Screenshots: CP node CPU and memory usage graphs during startup)

@hbagdi
Member

hbagdi commented Sep 26, 2023

2023/09/26 10:43:34 [notice] 1#0: signal 3 (SIGQUIT) received from 1330, shutting down

Something sent the DP SIGQUIT.

@shettyh
Contributor Author

shettyh commented Sep 27, 2023

@hbagdi thanks for the response. Yes, a SIGQUIT is received from somewhere, but I'm not sure from where. In any case, is there a reason why the CP is using such a high amount of memory and CPU?

The official docs for CP/DP don't describe what the deployment model should look like (e.g. whether the DP should have persistent volumes attached), or how the initial DP boot-up happens when a large amount of data is present in the CP.

@chobits
Contributor

chobits commented Sep 27, 2023

@hbagdi thanks for the response. Yes, a SIGQUIT is received from somewhere, but I'm not sure from where. In any case, is there a reason why the CP is using such a high amount of memory and CPU?

It's usually sent by a command such as kong stop or kong quit.

The official docs for CP/DP don't describe what the deployment model should look like (e.g. whether the DP should have persistent volumes attached), or how the initial DP boot-up happens when a large amount of data is present in the CP.

See the hybrid mode overview.

How many entities do you have? If there are too many, it could significantly impact your CPU and memory usage.

@shettyh
Contributor Author

shettyh commented Sep 27, 2023

@chobits Thanks for the response.

See the hybrid mode overview.

I went through this; the doc does not explain how to persist the DP config across restarts (e.g. in Kubernetes environments). It also does not cover why TTL in the schema does not work in plugins, or what we need to consider while developing custom plugins for the CP/DP model.

How many entities do you have? If there are too many, it could significantly impact your CPU and memory usage.

The number of entities is large: we have 1M consumers, 2M basic-auth credentials, etc. The CP always tries to send the whole config in one shot; there is no setting to batch this.

@chobits
Contributor

chobits commented Sep 27, 2023

It also does not cover why TTL in the schema does not work in plugins,

This is a bug in older versions and has been fixed in the latest master branch. See #11464.

The number of entities is large: we have 1M consumers, 2M basic-auth credentials, etc. The CP always tries to send the whole config in one shot; there is no setting to batch this.

Yeah, it's a known performance issue in the current Kong version. We're working on addressing it and plan to make improvements to reduce CPU usage when updating a large number of entities. This may involve using a more efficient protocol for data transmission between CP and DP, such as incremental updates or other methods.

That said, having a million entities is excessive; you might want to consider reducing them from a business standpoint.

@shettyh
Contributor Author

shettyh commented Sep 28, 2023

Thanks for the detailed response. We will wait for the performance fixes.

That said, having a million entities is excessive; you might want to consider reducing them from a business standpoint.

But having millions of users is not abnormal, right? There should be a way to support such use cases, IMO.

@chobits
Contributor

chobits commented Sep 28, 2023

Thanks for the detailed response. We will wait for the performance fixes.

That said, having a million entities is excessive; you might want to consider reducing them from a business standpoint.

But having millions of users is not abnormal, right? There should be a way to support such use cases, IMO.

OK. Keep an eye on updates to the Kong project. :)

@hanshuebner
Contributor

Could it be that the DP does not become healthy and is therefore killed, with SIGQUIT, by its supervisor (i.e. docker-compose, Kubernetes)? Can you increase the timeout that is allotted to the DP processes? Processing the initial configuration (and subsequent updates) will take a lot of time if you have millions of users, and there is currently no way to avoid that in hybrid mode.
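
As an illustration only, the DP pod's probes could be relaxed along these lines so that the supervisor does not kill the container while the initial configuration is still being processed. The port and all timing values below are assumptions about the environment, not recommended numbers:

# Hypothetical probe settings for the Kong DP container.
livenessProbe:
  httpGet:
    path: /status
    port: 8100            # assumed status_listen port; adjust to the deployment
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 12    # allows roughly two extra minutes before a restart
readinessProbe:
  httpGet:
    path: /status
    port: 8100
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 12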

@hanshuebner added the pending author feedback label on Oct 2, 2023
@shettyh
Contributor Author

shettyh commented Oct 3, 2023

Thanks @hanshuebner for the response. Yes, I have tried that; the data plane stays in a pending state even after 2 minutes, and I have also seen the control plane crash due to high node memory usage. Maybe for our use case CP/DP is not ready as of now without batched updates to the DP nodes.

@hanshuebner
Contributor

At this point, it is safe to say that millions of consumers are not supported in hybrid mode. We don't have a specific maximum number of consumers that we can specify. The whole configuration is sent from the control plane to the data plane, and each consumer is an object in that configuration. The resulting configuration file will simply be too large if you have millions of consumers. Note that full configurations are also sent when the configuration is updated, so the excessive memory usage will occur not only during startup, but with each configuration change.
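
To make the size issue concrete, the payload is, roughly speaking, a declarative configuration in which every consumer and credential appears as an inline object. A simplified, illustrative sketch (field values are placeholders):

_format_version: "3.0"
consumers:
  - username: user-000001
    basicauth_credentials:
      - username: user-000001
        password: "<hashed credential>"
  - username: user-000002
    basicauth_credentials:
      - username: user-000002
        password: "<hashed credential>"
# ...repeated for every consumer; with ~1M consumers and ~2M credentials the
# serialized payload quickly reaches tens of megabytes, as the CP access log
# above (~30 MB frame) suggests.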

We're currently discussing both incremental configuration updates in hybrid mode and other ways to improve Kong's authentication system. Unfortunately, we cannot give you a firm date as to when this will become reality.

@hanshuebner changed the title from "Kong DP node crashes without any error in the log" to "Kong DP node killed with SIGQUIT during startup with large configuration" on Oct 5, 2023
@hanshuebner
Contributor

I'm closing this issue even though we could not provide you with a solution at this point. We are aware of the limitation and hope to be able to solve the issue in a future release.
