
kfp-persistence has invalid pebble health check #514

Open
orfeas-k opened this issue Jun 12, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@orfeas-k
Contributor

Bug Description

kfp-persistence has a Pebble health check that probes a metrics endpoint for accessibility. However, the charm does not implement a MetricsEndpointProvider, nor does the upstream code appear to expose any metrics. The check was introduced during the sidecar rewrite with baseCharm, which suggests it may stem from a misconception about how we use health checks. It should therefore be removed.
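For context, judging from the log output below, the offending check presumably looks something like this in the Pebble layer (a reconstruction from the check name, URL, period, and threshold visible in the logs, not copied from the charm source):

```yaml
# Hypothetical excerpt of the Pebble layer for the persistenceagent
# container, reconstructed from the failing-check lines in the logs below.
checks:
  persistenceagent-get:
    override: replace
    period: 30s       # failures are logged roughly 30s apart
    threshold: 3      # matches "failure threshold 3 hit" in the logs
    http:
      url: http://localhost:8080/metrics   # nothing serves this endpoint
```

Since nothing listens on localhost:8080/metrics, a check like this can never pass, which matches the repeated failures in the log output.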

To Reproduce

Deploy kfp-persistence and relate it to its required dependencies, as sketched below.
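Something along these lines should reproduce it (a sketch; the channel names and the kfp-api relation are assumptions, adjust to your deployment):

```sh
# Hypothetical reproduction steps; channels and endpoints are illustrative.
juju deploy kfp-persistence --channel latest/edge --trust
juju deploy kfp-api --channel latest/edge --trust
juju integrate kfp-persistence kfp-api
# Then follow the workload container logs and watch the check fail:
kubectl logs -n kubeflow kfp-persistence-0 -c persistenceagent -f
```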

Environment

Juju 3.5, Microk8s 1.28

Relevant Log Output

$ kfl kfp-persistence-0 -c persistenceagent -f
2024-06-12T08:52:35.461Z [pebble] HTTP API server listening on ":38813".
2024-06-12T08:52:35.461Z [pebble] Started daemon.
2024-06-12T08:52:54.189Z [pebble] GET /v1/plan?format=yaml 78.41µs 200
2024-06-12T08:52:54.190Z [pebble] POST /v1/layers 166.969µs 200
2024-06-12T08:53:05.499Z [pebble] GET /v1/notices?timeout=30s 30.000493302s 200
2024-06-12T08:53:35.500Z [pebble] GET /v1/notices?timeout=30s 30.001060881s 200
2024-06-12T08:54:05.501Z [pebble] GET /v1/notices?timeout=30s 30.000893481s 200
2024-06-12T08:54:13.983Z [pebble] POST /v1/files 3.690543ms 200
2024-06-12T08:54:14.005Z [pebble] GET /v1/plan?format=yaml 162.142µs 200
2024-06-12T08:54:14.007Z [pebble] POST /v1/layers 296.708µs 200
2024-06-12T08:54:14.011Z [pebble] POST /v1/services 4.262304ms 202
2024-06-12T08:54:14.014Z [pebble] GET /v1/notices?timeout=30s 8.512968209s 200
2024-06-12T08:54:14.015Z [pebble] Service "persistenceagent" starting: persistence_agent --logtostderr=true --namespace= --ttlSecondsAfterWorkflowFinish=86400 --numWorker=2 --mlPipelineAPIServerName=kfp-api.kubeflow
2024-06-12T08:54:14.096Z [persistenceagent] W0612 08:54:14.096332      15 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-06-12T08:54:15.022Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A14.011973404Z&timeout=30s 1.007109898s 200
2024-06-12T08:54:15.022Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.010184868s 200
2024-06-12T08:54:15.055Z [pebble] GET /v1/services 83.884µs 200
2024-06-12T08:54:17.391Z [pebble] GET /v1/services 49.967µs 200
2024-06-12T08:54:44.011Z [pebble] Check "persistenceagent-get" failure 1 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:54:45.023Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.00090974s 200
2024-06-12T08:55:14.008Z [pebble] Check "persistenceagent-get" failure 2 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:55:15.024Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.000130261s 200
2024-06-12T08:55:44.010Z [pebble] Check "persistenceagent-get" failure 3 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:55:44.010Z [pebble] Check "persistenceagent-get" failure threshold 3 hit, triggering action
2024-06-12T08:55:45.025Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.001000892s 200
2024-06-12T08:56:14.011Z [pebble] Check "persistenceagent-get" failure 4 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:56:15.026Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.000986384s 200
2024-06-12T08:56:16.458Z [persistenceagent] time="2024-06-12T08:56:16Z" level=fatal msg="Error creating ML pipeline API Server client: Failed to initialize pipeline client. Error: Waiting for ml pipeline API server failed after all attempts.: Get \"http://kfp-api.kubeflow:8888/apis/v1beta1/healthz\": dial tcp 10.152.183.187:8888: connect: connection refused: Waiting for ml pipeline API server failed after all attempts.: Get \"http://kfp-api.kubeflow:8888/apis/v1beta1/healthz\": dial tcp 10.152.183.187:8888: connect: connection refused"
2024-06-12T08:56:16.461Z [pebble] Service "persistenceagent" stopped unexpectedly with code 1
2024-06-12T08:56:16.461Z [pebble] Service "persistenceagent" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2024-06-12T08:56:17.002Z [pebble] Service "persistenceagent" starting: persistence_agent --logtostderr=true --namespace= --ttlSecondsAfterWorkflowFinish=86400 --numWorker=2 --mlPipelineAPIServerName=kfp-api.kubeflow
2024-06-12T08:56:17.033Z [persistenceagent] W0612 08:56:17.033566      29 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-06-12T08:56:44.011Z [pebble] Check "persistenceagent-get" failure 5 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:56:45.028Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.000947153s 200
2024-06-12T08:57:14.010Z [pebble] Check "persistenceagent-get" failure 6 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused
2024-06-12T08:57:15.029Z [pebble] GET /v1/notices?after=2024-06-12T08%3A54%3A15.017313558Z&timeout=30s 30.001004338s 200
2024-06-12T08:57:44.011Z [pebble] Check "persistenceagent-get" failure 7 (threshold 3): Get "http://localhost:8080/metrics": dial tcp [::1]:8080: connect: connection refused

Additional Context

No response

@orfeas-k added the bug (Something isn't working) label Jun 12, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5863.

This message was autogenerated
