Quilkin agent running into issues & not getting reported by the relay as an active control plane #1134

Open
koslib opened this issue Feb 24, 2025 · 0 comments
Labels
kind/bug Something isn't working

koslib commented Feb 24, 2025

The Quilkin agent seems to lose its connection to the relay, or at least the relay stops reporting it as an active control plane.

What happened:

PromQL query used when the issue was identified:

sum(quilkin_active_control_planes{cluster="our-relay-cluster"}) by (control_plane)

In some clusters that run 2 Quilkin agents, the relay reports only one of them for that cluster.

What you expected to happen:

For a cluster with two agents, the relay should report both of them.

How to reproduce it (as minimally and precisely as possible):

No clear steps to reproduce. Hopefully the logs attached below are helpful.

Anything else we need to know?:

Environment:

  • Quilkin version: 0.9.0-dev (commit 432893923c613bf9b7965990dcc70a8ef8d60b49)
  • Execution environment (binary, container, etc): container
  • Log(s):
{"timestamp":"2025-02-12T10:53:08.573989Z","level":"INFO","fields":{"message":"Starting Quilkin","version":"0.9.0-dev","commit":"432893923c613bf9b7965990dcc70a8ef8d60b49"},"target":"quilkin::cli","filename":"src/cli.rs"}
{"timestamp":"2025-02-12T10:53:08.574848Z","level":"INFO","fields":{"message":"Starting admin endpoint","address":"[::]:8000"},"target":"quilkin::components::admin","filename":"src/components/admin.rs"}
{"timestamp":"2025-02-12T10:53:08.576089Z","level":"INFO","fields":{"message":"attempting to connect to `http://our-relay-address/`"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"run"},"spans":[{"name":"run"},{"name":"run"}]}
{"timestamp":"2025-02-12T10:53:13.576703Z","level":"INFO","fields":{"message":"Retrying to connect","attempt":1},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"run"},"spans":[{"name":"run"},{"name":"run"}]}
{"timestamp":"2025-02-12T10:53:13.576767Z","level":"WARN","fields":{"message":"Unable to connect to the XDS server","error":"tonic::transport::Error(Transport, hyper::Error(Connect, Custom { kind: TimedOut, error: Elapsed(()) }))"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"run"},"spans":[{"name":"run"},{"name":"run"}]}
{"timestamp":"2025-02-12T10:53:14.355880Z","level":"INFO","fields":{"message":"attempting to connect to `http://our-relay-address/`"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"run"},"spans":[{"name":"run"},{"name":"run"}]}
{"timestamp":"2025-02-12T10:53:14.568949Z","level":"INFO","fields":{"message":"Connected to management server"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"run"},"spans":[{"name":"run"},{"name":"run"}]}
{"timestamp":"2025-02-19T12:56:26.769053Z","level":"WARN","fields":{"message":"provider task error, retrying","attempt":"1","error":"error returned by apiserver during watch: too old resource version: 660382101 (660387224): Expired"},"target":"quilkin::config::providers","filename":"src/config/providers.rs"}
{"timestamp":"2025-02-20T01:59:27.597708Z","level":"WARN","fields":{"message":"provider task error, retrying","attempt":"2","error":"error returned by apiserver during watch: too old resource version: 661381234 (661383082): Expired"},"target":"quilkin::config::providers","filename":"src/config/providers.rs"}

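As the logs show, the agent ran into issues: it logged `provider task error, retrying` warnings with "too old resource version ... Expired" errors from the apiserver watch. This does not happen regularly; the agent ran for ~2 weeks before running into it.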

A simple restart fixed the issue, and the relay reported the correct number of agents again.
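
For anyone triaging similar agent logs, here is a minimal sketch that pulls only the WARN and ERROR entries out of the JSON log lines. It is not part of Quilkin: `warn_events` is a hypothetical helper, and it assumes the field layout seen in the logs above (`timestamp`, `level`, `fields.message`, `fields.error`).

```python
import json
import sys


def warn_events(lines):
    """Yield (timestamp, message, error) for each WARN or ERROR log line.

    Assumes one JSON object per line, shaped like the agent logs above.
    Lines that are blank or not valid JSON objects are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line, ignore it
        if not isinstance(record, dict):
            continue
        if record.get("level") in ("WARN", "ERROR"):
            fields = record.get("fields") or {}
            yield record.get("timestamp"), fields.get("message"), fields.get("error")


if __name__ == "__main__":
    # Reads log lines from stdin and prints one tab-separated row per warning.
    for timestamp, message, error in warn_events(sys.stdin):
        print(f"{timestamp}\t{message}\t{error}")
```

Piping the agent's container logs into the script gives a quick timeline of when the warnings started relative to the last "Connected to management server" entry.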

koslib added the kind/bug label on Feb 24, 2025