killQuery needs luck with a TCP-balanced cluster #434

Open
hdhoang opened this issue May 17, 2024 · 0 comments

Hello,

Describe the bug

When killQuery runs, a TCP/k8s load balancer may route its connection to a node that isn't running the to-be-killed query. The KILL statement succeeds anyway, but has no effect.

As a fix, we will update our configuration to list the shards/nodes directly instead, accepting the extra complexity there.

To Reproduce

  1. Declare a cluster with a load balancer address instead of a replica list
  2. Run and cancel a long-running query
  3. KILL runs on a different node than the initial query, and is thus ineffective

Expected behavior

Could you consider selecting the kill targets by initial_query_id instead? It would improve the chance of cutting off resource consumption early.

OTOH, KILL QUERY ON CLUSTER {cluster} would require configuring/passing the "native" cluster name somewhere.
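
To make the two options concrete, here is a minimal Go sketch of how the KILL statement could be built. It is not chproxy's actual code: `buildKillQuery`, `plainIdent` and the cluster name `my_cluster` are hypothetical names introduced for illustration. It assumes that sub-queries on other nodes carry the chproxy-assigned ID in `initial_query_id`, so filtering on that column can match whichever node the balancer picks, and that an optional, separately configured cluster name is available for the `ON CLUSTER` form.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// plainIdent restricts the cluster name to a bare identifier so it can be
// placed in the statement without quoting. Hypothetical helper.
var plainIdent = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// buildKillQuery is a hypothetical helper, not part of chproxy.
// It filters on initial_query_id rather than query_id, and adds
// ON CLUSTER only when a cluster name is supplied.
func buildKillQuery(initialQueryID, cluster string) (string, error) {
	// Escape backslashes and single quotes so the ID stays inside the literal.
	escaped := strings.NewReplacer(`\`, `\\`, `'`, `\'`).Replace(initialQueryID)

	stmt := "KILL QUERY"
	if cluster != "" {
		if !plainIdent.MatchString(cluster) {
			return "", fmt.Errorf("cluster name %q is not a plain identifier", cluster)
		}
		stmt += " ON CLUSTER " + cluster
	}
	return stmt + " WHERE initial_query_id = '" + escaped + "'", nil
}

func main() {
	// "my_cluster" is a placeholder for the "native" ClickHouse cluster name.
	for _, cluster := range []string{"", "my_cluster"} {
		stmt, err := buildKillQuery("17CF3EA10D4DDB62", cluster)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(stmt)
	}
}
```

Without a cluster name, the statement only reaches the node the balancer happens to pick, so it can stop that node's share of the work but not the others'. The `ON CLUSTER` form is the one that would need the extra configuration mentioned above.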

Environment information

For our production clusters, we supply applications with a keepalived-balanced endpoint. In chproxy config:

  scheme: https
  nodes:
  - lb-clickhouse.example:8443

Screenshots

DEBUG: 2024/05/16 18:14:56 proxy.go:84: [ Id: 17CF3EA10D4DDB62; User "u"(1) proxying as "p"(1) to "lb-clickhouse.example:8443"(1); RemoteAddr: "....
DEBUG: 2024/05/16 18:15:36 proxy.go:238: [ Id: 17CF3EA10D4DDB62; User "u"(1) proxying as "p"(1) to "lb-clickhouse.example:8443"(2); RemoteAddr: "..."; LocalAddr: "..."; Duration: 39435136 μs]: remote client closed the connection in 39.433581047s; query: "select ...
DEBUG: 2024/05/16 18:15:36 scope.go:256: killing the query with query_id=17CF3EA10D4DDB62
DEBUG: 2024/05/16 18:15:36 scope.go:296: killed the query with query_id=17CF3EA10D4DDB62; respBody: ""
DEBUG: 2024/05/16 18:15:36 proxy.go:156: [ Id: 17CF3EA10D4DDB62; User "u"(1) proxying as "p"(1) to "lb-clickhouse.example:8443"(2); RemoteAddr: "..."; LocalAddr: "..."; Duration: 39854873 μs]: request failure: non-200 status code 502; query: "select....FORMAT JSONCompact"; Method: POST; URL: "https://lb-clickhouse.example:8443/?max_execution_time=10800&max_memory_usage=42949672960&priority=4&query_id=17CF3EA10D4DDB62&result_overflow_mode=throw&session_timeout=60"

The KILL query ran on node ch3v, while the other nodes wasted time running the query to the end:

SELECT
    hostname,
    is_initial_query,
    type,
    event_time
FROM system.distributed_query_log
WHERE (event_date = '2024-05-16') AND (type > 1) AND (initial_query_id = '17CF3EA10D4DDB62')

┌─hostname─┬─is_initial_query─┬─type────────┬──────────event_time─┐
│ ch4v     │                0 │ QueryFinish │ 2024-05-16 19:43:50 │
│ ch5v     │                1 │ QueryFinish │ 2024-05-16 19:43:51 │
│ ch2v     │                0 │ QueryFinish │ 2024-05-16 19:43:50 │
└──────────┴──────────────────┴─────────────┴─────────────────────┘

Environment information

chproxy v1.19.0, clickhouse 22.8

thank you

(Sorry, I created this issue from a code line. I have updated the description to follow the bug template.)
