Command Router service - readiness probe fails for a long duration after the IP of the Infinispan pod has changed #3658

Open
petr-cada opened this issue Oct 2, 2024 · 4 comments
Labels
C&C Command and Control question

Comments

@petr-cada

We install Eclipse Hono into our Kubernetes cluster using a Helm chart:

dependencies:

  - name: hono
    version: 2.6.3
    repository: "https://eclipse.org/packages/charts/"

We also install Infinispan into the same Kubernetes cluster using a Helm chart:

dependencies:

  - name: infinispan
    version: 0.4.1
    repository: "https://charts.openshift.io/"

In our values file, we have the following Command Router configuration (to use Infinispan):

hono:
  commandRouterService:
    hono:
      commandRouter:
        cache:
          remote:
            serverList: dmp-infinispan:11222
            socketTimeout: 5000
            connectTimeout: 5000

Everything works fine until the IP of the Infinispan pod changes (for example, when the Infinispan pod is deleted and then automatically recreated).
When this happens, the Command Router service begins to log 'Closing connection ...' as WARN messages and 'Exception encountered ...' as ERROR messages.

For example:
Closing connection [id: 0x10332e0e, L:/10.1.108.52:32922 ! R:10.1.108.50/10.1.108.50:11222] due to transport error ... (see the attached log file hono-command-router-infinispan.txt)

The issue is that the Command Router keeps logging 'Closing connection ...' WARN messages and 'Exception encountered ...' ERROR messages that reference Infinispan's old IP (10.1.108.50), even though Infinispan is running correctly under its new IP (10.1.108.53). During this time, the Command Router's readiness probe fails. Sometimes the problem resolves itself, for example after 5 minutes (see the messages containing Infinispan's new IP 10.1.108.53 in the log file hono-command-router-infinispan.txt), but at other times it persists even after 30 minutes.
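To measure how long the client keeps targeting the old address, the remote IPs in the log can be tallied. The following is a minimal sketch, assuming the log lines carry the remote endpoint in the `R:<host>/<ip>:<port>` shape shown in the example above. The function name `count_remote_ips` and the regular expression are illustrative and not part of Hono or Infinispan.

```python
import re
from collections import Counter

# Matches the remote endpoint in lines such as
# "... ! R:10.1.108.50/10.1.108.50:11222] due to transport error ..."
# Assumption: the remote part always has the form R:<host-or-ip>/<ipv4>:<port>.
REMOTE_RE = re.compile(r"R:[^/\s\]]*/(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def count_remote_ips(lines):
    """Count how often each remote IPv4 address appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = REMOTE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Closing connection [id: 0x10332e0e, L:/10.1.108.52:32922 ! "
        "R:10.1.108.50/10.1.108.50:11222] due to transport error",
    ]
    print(count_remote_ips(sample))
```

Running this over the whole log file (for instance `count_remote_ips(open("hono-command-router-infinispan.txt"))`) would show whether the old IP dominates and roughly when the new IP first appears.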

It appears that the command router is caching the IP of Infinispan and not attempting to resolve the hostname of Infinispan (in our case, dmp-infinispan) for an extended period of time. Do you have any idea how to fix this problem?
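One way to narrow this down is to check what the hostname currently resolves to and compare that with the IP seen in the log. The sketch below uses only the Python standard library. The helper names `resolve_ipv4` and `is_stale` are hypothetical, and the check only reports what the resolver returns on the machine where it runs, so it cannot by itself prove where the stale address is being held.

```python
import socket


def resolve_ipv4(host, port):
    """Return the sorted IPv4 addresses the OS resolver reports for host."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr is (ip, port) for IPv4.
    return sorted({info[4][0] for info in infos})


def is_stale(host, port, ip_in_use):
    """Return True if ip_in_use is not among the addresses host resolves to."""
    return ip_in_use not in resolve_ipv4(host, port)
```

Called from a pod in the same namespace, for example as `is_stale("dmp-infinispan", 11222, "10.1.108.50")`, a result of `True` would indicate that DNS already points at a different address while the client still uses the old one.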

@sophokles73
Contributor

Am I right in assuming that you are using a single-node Infinispan cluster? If so, you should switch to a multi-node Infinispan cluster to make the setup resilient to crashes. This allows the Command Router to fail over to another Infinispan pod once the one it is currently interacting with is no longer available.

@sophokles73 added the question and C&C Command and Control labels on Oct 13, 2024

@ondrej-charvat

Hi @sophokles73,
I work on the same project as @petr-cada and have taken the issue over from him. You're right that we're using a single-node Infinispan installation. I've tried to set up a cluster of 2 nodes, but it didn't bring any improvement. When I restarted the Infinispan node to which the Command Router was connected, the Command Router kept trying to connect to the old node (the previous pod's IP address). When I restarted the Command Router, it connected to one of the two Infinispan nodes at random. With every further Infinispan pod restart, the same behavior repeated.

@sophokles73
Contributor

> I've tried to set up a cluster of 2 nodes, but it didn't bring any improvement.

This is not really helpful information ;-) Can you share your Infinispan cluster config file? Which version of Infinispan do you use?

@ondrej-charvat

Hi @sophokles73,

We're using Infinispan 15.0.10.Final.

Here is the current config:
infinispan.yml.gz
