kfp-v2 bundle tests fail with all charms in waiting #601

Closed
orfeas-k opened this issue Nov 14, 2024 · 9 comments · Fixed by #614
Labels
bug Something isn't working

Comments

@orfeas-k
Contributor

orfeas-k commented Nov 14, 2024

Bug Description

The tests fail intermittently, but very often, during test_upload_pipeline:

tests/integration/test_kfp_functional_v2.py::test_build_and_deploy PASSED
tests/integration/test_kfp_functional_v2.py::test_upload_pipeline ERROR

with the error

httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://10.1.0.106:16443/apis/kubeflow.org/v1/profiles/kubeflow-user-example-com?fieldManager=kfp-operators'

source: https://github.com/canonical/kfp-operators/actions/runs/11821634908/job/32936826850#step:5:125

Investigating with @DnPlas, it looks like the tests fail while the kfp_client fixture is being set up, at the call

  File "/home/runner/work/kfp-operators/kfp-operators/tests/integration/conftest.py", line 58, in apply_profile
    apply_manifests(lightkube_client, PROFILE_FILE_PATH)
  File "/home/runner/work/kfp-operators/kfp-operators/tests/integration/helpers/k8s_resources.py", line 47, in apply_manifests
    lightkube_client.apply(

which prompted us to check whether the Profile CRD is available at that point. Looking at the juju status,
it looks like all charms are in waiting status:

Model     Controller                Cloud/Region        Version  SLA          Timestamp
kubeflow  github-pr-1f872-microk8s  microk8s/localhost  3.4.6    unsupported  15:46:04Z

App                      Version  Status   Scale  Charm                    Channel       Rev  Address         Exposed  Message
argo-controller                   waiting      1  argo-controller          latest/edge   596  10.152.183.56   no       installing agent
envoy                             waiting      1  envoy                    latest/edge   307  10.152.183.219  no       installing agent
istio-ingressgateway              waiting      1  istio-gateway            latest/edge  1287  10.152.183.22   no       installing agent
istio-pilot                       waiting      1  istio-pilot              latest/edge  1235  10.152.183.191  no       installing agent
kfp-api                           waiting      1  kfp-api                                  0  10.152.183.160  no       installing agent
kfp-db                            waiting    0/1  mysql-k8s                8.0/stable    180                  no       installing agent
kfp-metadata-writer               waiting    0/1  kfp-metadata-writer                      0                  no       installing agent
kfp-persistence                   waiting    0/1  kfp-persistence                          0                  no       installing agent
kfp-profile-controller            waiting    0/1  kfp-profile-controller                   0                  no       installing agent
kfp-schedwf                       waiting    0/1  kfp-schedwf                              0                  no       installing agent
kfp-ui                            waiting    0/1  kfp-ui                                   0                  no       installing agent
kfp-viewer                        waiting    0/1  kfp-viewer                               0                  no       installing agent
kfp-viz                           waiting    0/1  kfp-viz                                  0                  no       installing agent
kubeflow-profiles                 waiting    0/1  kubeflow-profiles        latest/edge   456                  no       installing agent
kubeflow-roles                    waiting    0/1  kubeflow-roles           latest/edge   264                  no       installing agent
metacontroller-operator           waiting    0/1  metacontroller-operator  latest/edge   349                  no       installing agent
minio                             waiting    0/1  minio                    latest/edge   376                  no       installing agent
mlmd                              waiting    0/1  mlmd                     latest/edge   243                  no       installing agent

Unit                       Workload     Agent       Address      Ports  Message
argo-controller/0*         blocked      idle        10.1.97.202         [relation:object_storage] Expected data from exactly 1 related applications - got 0.
envoy/0*                   waiting      idle        10.1.97.204         [grpc] Empty or missing data in grpc relation. This may be transient, but if it persists it is likely an error.
istio-ingressgateway/0*    blocked      idle        10.1.97.203         Please add required relation to istio-pilot
istio-pilot/0*             maintenance  executing   10.1.97.205         (install) Deploying Istio control plane with Istio CNI plugin.
kfp-api/0*                 blocked      executing   10.1.97.206         Please add required relation object-storage
kfp-db/0                   waiting      allocating                      installing agent
kfp-metadata-writer/0*     waiting      allocating  10.1.97.209         agent initialising
kfp-persistence/0          waiting      allocating                      installing agent
kfp-profile-controller/0   waiting      allocating                      installing agent
kfp-schedwf/0              waiting      allocating                      installing agent
kfp-ui/0                   waiting      allocating                      installing agent
kfp-viewer/0               waiting      allocating                      installing agent
kfp-viz/0                  waiting      allocating                      installing agent
kubeflow-profiles/0        waiting      allocating                      installing agent
kubeflow-roles/0           waiting      allocating                      installing agent
metacontroller-operator/0  waiting      allocating                      installing agent
minio/0                    waiting      allocating                      installing agent
mlmd/0                     waiting      allocating                      installing agent

This means that either build_and_deploy somehow succeeds although the charms are not active/idle, or the charms go to active and then back to waiting. We also noticed that the kubeflow-profiles pod doesn't exist on the cluster.

During on_pull_request, the tests succeeded.
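
A minimal sketch of one possible guard against the 404 above, assuming the cause is that the Profile CRD is not yet served when apply_profile runs: poll a readiness check before applying the manifests. The names wait_until and check are hypothetical and are not part of this repository or of lightkube; the actual check (for example, a lookup of the CRD through the Kubernetes client) is left to the caller.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll check() until it returns True or `timeout` seconds have elapsed.

    Returns True if check() succeeded in time, False otherwise.
    check() is always called at least once, even with timeout=0.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

This would only hide the symptom: if the charms never reach active, the check times out instead of failing with a 404, which at least makes the failure reason explicit.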

To Reproduce

Rerun the CI from main.

Environment

juju 3.4.6
microk8s 1.29
charmcraft 3.2.1

Relevant Log Output

/home/runner/work/kfp-operators/kfp-operators/tests/integration/test_kfp_functional_v2.py
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-8.3.2, pluggy-1.5.0 -- /home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/bin/python
cachedir: .tox/bundle-integration-v2/.pytest_cache
rootdir: /home/runner/work/kfp-operators/kfp-operators
configfile: pyproject.toml
plugins: anyio-4.4.0, operator-0.35.0, asyncio-0.21.2
asyncio: mode=strict
collecting ... collected 6 items

tests/integration/test_kfp_functional_v2.py::test_build_and_deploy PASSED
tests/integration/test_kfp_functional_v2.py::test_upload_pipeline ERROR
tests/integration/test_kfp_functional_v2.py::test_create_and_monitor_run ERROR
tests/integration/test_kfp_functional_v2.py::test_create_and_monitor_recurring_run ERROR
tests/integration/test_kfp_functional_v2.py::test_apply_sample_viewer FAILED
tests/integration/test_kfp_functional_v2.py::test_viz_server_healthcheck FAILED

==================================== ERRORS ====================================
____________________ ERROR at setup of test_upload_pipeline ____________________
Traceback (most recent call last):
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/runner.py", line 341, in from_call
    result: TResult | None = func()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/runner.py", line 242, in <lambda>
    lambda: runtest_hook(item=item, **kwds), when=when, reraise=reraise
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 122, in _multicall
    teardown.throw(exception)  # type: ignore[union-attr]
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/unraisableexception.py", line 90, in pytest_runtest_setup
    yield from unraisable_exception_runtest_hook()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/unraisableexception.py", line 70, in unraisable_exception_runtest_hook
    yield
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 122, in _multicall
    teardown.throw(exception)  # type: ignore[union-attr]
/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "class": algorithms.TripleDES,
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/logging.py", line 842, in pytest_runtest_setup
    yield from self._runtest_for(item, "setup")
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/logging.py", line 831, in _runtest_for
    yield
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 122, in _multicall
    teardown.throw(exception)  # type: ignore[union-attr]
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/capture.py", line 874, in pytest_runtest_setup
    return (yield)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 122, in _multicall
    teardown.throw(exception)  # type: ignore[union-attr]
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/threadexception.py", line 87, in pytest_runtest_setup
    yield from thread_exception_runtest_hook()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/threadexception.py", line 68, in thread_exception_runtest_hook
    yield
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/runner.py", line 160, in pytest_runtest_setup
    item.session._setupstate.setup(item)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/runner.py", line 514, in setup
    col.setup()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/python.py", line 1630, in setup
    self._request._fillfixtures()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/fixtures.py", line 696, in _fillfixtures
    item.funcargs[argname] = self.getfixturevalue(argname)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/fixtures.py", line 531, in getfixturevalue
    fixturedef = self._get_active_fixturedef(argname)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/fixtures.py", line 616, in _get_active_fixturedef
    fixturedef.execute(request=subrequest)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/fixtures.py", line 1046, in execute
    fixturedef = request._get_active_fixturedef(argname)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/fixtures.py", line 616, in _get_active_fixturedef
    fixturedef.execute(request=subrequest)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/fixtures.py", line 1090, in execute
    result = ihook.pytest_fixture_setup(fixturedef=self, request=request)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 182, in _multicall
    return outcome.get_result()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_result.py", line 100, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/setuponly.py", line 36, in pytest_fixture_setup
    return (yield)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/fixtures.py", line 1139, in pytest_fixture_setup
    result = call_fixture_func(fixturefunc, request, kwargs)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/fixtures.py", line 890, in call_fixture_func
    fixture_result = next(generator)
  File "/home/runner/work/kfp-operators/kfp-operators/tests/integration/conftest.py", line 58, in apply_profile
    apply_manifests(lightkube_client, PROFILE_FILE_PATH)
  File "/home/runner/work/kfp-operators/kfp-operators/tests/integration/helpers/k8s_resources.py", line 47, in apply_manifests
    lightkube_client.apply(
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/lightkube/core/client.py", line 457, in apply
    return self.patch(type(obj), name, obj, namespace=namespace,
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/lightkube/core/client.py", line 325, in patch
    return self._client.request("patch", res=res, name=name, namespace=namespace, obj=obj,
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/lightkube/core/generic_client.py", line 245, in request
    return self.handle_response(method, resp, br)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/lightkube/core/generic_client.py", line 196, in handle_response
    self.raise_for_status(resp)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/lightkube/core/generic_client.py", line 190, in raise_for_status
    raise transform_exception(e)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/lightkube/core/generic_client.py", line 188, in raise_for_status
    resp.raise_for_status()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/httpx/_models.py", line 761, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://10.1.0.106:16443/apis/kubeflow.org/v1/profiles/kubeflow-user-example-com?fieldManager=kfp-operators'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404

Additional Context

No response

@orfeas-k orfeas-k added the bug Something isn't working label Nov 14, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6574.

This message was autogenerated

@orfeas-k
Contributor Author

orfeas-k commented Nov 20, 2024

Grabbing a juju status with sh (sh.juju.status(format="json", no_color=True, model="kubeflow")) right after wait_for_idle confirms that none of the charms are active/idle when the wait_for_idle call returns; most are still in waiting (installing agent) (juju-status.txt).
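
For reference, a sketch of how such a JSON status could be checked programmatically. It assumes the usual layout of `juju status --format=json` (an "applications" map with "application-status", and per-unit "workload-status" and "juju-status" entries); the helper name apps_not_active_idle and the sample data are made up for illustration.

```python
from typing import Dict, List


def apps_not_active_idle(status: Dict) -> List[str]:
    """Return one line per application or unit that is not active/idle."""
    problems = []
    for app_name, app in sorted(status.get("applications", {}).items()):
        app_status = app.get("application-status", {}).get("current", "unknown")
        if app_status != "active":
            problems.append(f"{app_name}: application status is {app_status}")
        for unit_name, unit in sorted(app.get("units", {}).items()):
            workload = unit.get("workload-status", {}).get("current", "unknown")
            agent = unit.get("juju-status", {}).get("current", "unknown")
            if workload != "active" or agent != "idle":
                problems.append(f"{unit_name}: {workload}/{agent}")
    return problems


# Illustrative sample only; not taken from the attached juju-status.txt.
example = {
    "applications": {
        "kfp-api": {
            "application-status": {"current": "waiting"},
            "units": {
                "kfp-api/0": {
                    "workload-status": {"current": "blocked"},
                    "juju-status": {"current": "executing"},
                }
            },
        },
        "minio": {
            "application-status": {"current": "active"},
            "units": {
                "minio/0": {
                    "workload-status": {"current": "active"},
                    "juju-status": {"current": "idle"},
                }
            },
        },
    }
}

for line in apps_not_active_idle(example):
    print(line)
```

In a test, the dict would come from json.loads() on the output of the status command, and a non-empty result right after wait_for_idle would reproduce the observation above.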

This can be seen in this run (from this code and PR). Also notable is that the run fails when we manually call ops_test.model.get_status() right after sh.juju.status: https://github.com/canonical/kfp-operators/actions/runs/11915207048/job/33245318988#step:6:367.

  File "/home/runner/work/kfp-operators/kfp-operators/tests/integration/test_kfp_functional_v2.py", line 122, in test_build_and_deploy

which ends in the code raising here:

    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: no close frame received or sent

and logs the following error

INFO     sh.command:sh.py:579 <Command '/snap/bin/juju status --format=json --no-color --model=kubeflow', pid 263491>: process started
WARNING  juju.client.connection:connection.py:654 RPC: Connection closed, reconnecting
ERROR    juju.client.connection:connection.py:662 RPC: Automatic reconnect failed

This is the same issue we saw in #549.

EDIT: Discussing offline with @DnPlas, the wait_for_idle code never calls the same get_status() we called for debugging purposes. Instead, it gets applications from self.applications
https://github.com/juju/python-libjuju/blob/6c6d70d483937cec4404116c376b926106e05835/juju/model.py#L2854
and then it uses app.get_status()
https://github.com/juju/python-libjuju/blob/6c6d70d483937cec4404116c376b926106e05835/juju/model.py#L2893

EDIT 2: It looks, though, like app.get_status() starts with the same two lines as model.get_status().

@orfeas-k
Contributor Author

orfeas-k commented Nov 20, 2024

Note that this bug has also been noticed intermittently (less frequently) in the kfp-v1 bundle tests: https://github.com/canonical/kfp-operators/actions/runs/11934968609/job/33268744186?pr=604#step:6:23.

@orfeas-k
Contributor Author

Modified the wait_for_idle() call to hardcode the specific applications that we deploy; the result is:

INFO     juju.model:model.py:2972 Waiting for model:
  argo-controller (missing)
  envoy (missing)
  istio-ingressgateway (missing)
  istio-pilot (missing)
  kfp-api (missing)
  kfp-db (missing)
  kfp-metadata-writer (missing)
  kfp-persistence (missing)
  kfp-profile-controller (missing)
  kfp-schedwf (missing)
  kfp-ui (missing)
  kfp-viewer (missing)
  kfp-viz (missing)
  kubeflow-profiles (missing)
  kubeflow-roles (missing)
  metacontroller-operator (missing)
  minio (missing)
  mlmd (missing)

@orfeas-k
Contributor Author

orfeas-k commented Nov 21, 2024

Another attempt was to add await ops_test.track_model("kubeflow", model_name="kubeflow", use_existing=True) to ensure ops_test was connected to the kubeflow model. Although it looks connected, wait_for_idle still couldn't find the charms:

INFO     pytest_operator.plugin:plugin.py:734 Connecting to existing model github-pr-9f925-microk8s:kubeflow on unspecified cloud
INFO     juju.model:model.py:2972 Waiting for model:
  argo-controller (missing)
  envoy (missing)
  istio-ingressgateway (missing)
  istio-pilot (missing)
  kfp-api (missing)
  kfp-db (missing)
  kfp-metadata-writer (missing)
  kfp-persistence (missing)
  kfp-profile-controller (missing)
  kfp-schedwf (missing)
  kfp-ui (missing)
  kfp-viewer (missing)
  kfp-viz (missing)
  kubeflow-profiles (missing)
  kubeflow-roles (missing)
  metacontroller-operator (missing)
  minio (missing)
  mlmd (missing)

@orfeas-k
Contributor Author

This bug could be related to issue #549 (comment), since we noticed a similar disconnection there (especially when using .get_status()). When we discussed it with the pylibjuju maintainers (in this discussion), they suggested modifying the max_frame_size of the connection. Doing it that way did not solve it in the past; what we would need to do is modify it in the connection to the model, which happens here in pytest-operator. That option is not exposed by the package, though.

@dimaqq

dimaqq commented Nov 22, 2024

Upstream: juju/python-libjuju#1204

@orfeas-k
Contributor Author

orfeas-k commented Nov 22, 2024

Workaround

A workaround for the above bug is to replace wait_for_idle with a juju wait-for call made through sh. That would look like:

sh.juju("wait-for", "model", "kubeflow", query="forEach(applications, app => app.status == 'active')", timeout="30m")

and it is implemented in #614. Note that the bug is still present and tracked in upstream issue juju/python-libjuju#1204.
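
For anyone not using sh, a rough equivalent with subprocess might look like the sketch below. The helper names wait_for_model_argv and wait_for_model are hypothetical and this is not the code from #614; it only mirrors the command line that the sh call above should produce.

```python
import subprocess
from typing import List

# Same query as in the sh call above: every application must report "active".
ALL_ACTIVE_QUERY = "forEach(applications, app => app.status == 'active')"


def wait_for_model_argv(model: str, query: str = ALL_ACTIVE_QUERY, timeout: str = "30m") -> List[str]:
    """Build the argv for `juju wait-for model <model> --query=... --timeout=...`."""
    return ["juju", "wait-for", "model", model, f"--query={query}", f"--timeout={timeout}"]


def wait_for_model(model: str, timeout: str = "30m") -> None:
    """Run the command; raises CalledProcessError if juju exits non-zero."""
    subprocess.run(wait_for_model_argv(model, timeout=timeout), check=True)
```

Note that the query as written only looks at application status, so it is not a strict equivalent of waiting for every unit agent to be idle.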

orfeas-k added a commit that referenced this issue Nov 22, 2024
@dimaqq

dimaqq commented Jan 24, 2025

The status severity code received a minor update, and a new wait_for_idle is available behind a feature flag in pypi:juju==3.6.1.0.

I can't guarantee that either of these fixes the issue, but it may be worth trying out.

https://discourse.charmhub.io/t/python-libjuju-3-6-1-0-release-notes/16149
