Command Router service - readiness probe fails for a long duration after the IP of the Infinispan pod has changed #3658

petr-cada · 2024-10-02T13:15:29Z

We are installing Eclipse Hono to our Kubernetes cluster using a Helm chart:

dependencies:
  - name: hono
    version: 2.6.3
    repository: "https://eclipse.org/packages/charts/"

We are also installing Infinispan to our Kubernetes cluster using a Helm chart:

dependencies:
  - name: infinispan
    version: 0.4.1
    repository: "https://charts.openshift.io/"

In our values we have the following Command Router configuration (to use Infinispan):

hono:
  commandRouterService:
    hono:
      commandRouter:
        cache:
          remote:
            serverList: dmp-infinispan:11222
            socketTimeout: 5000
            connectTimeout: 5000

Everything works fine until the IP of the Infinispan pod changes (for example, when the Infinispan pod is deleted and then automatically recreated).
When this happens, the Command Router service begins to log 'Closing connection ...' as WARN messages and 'Exception encountered ...' as ERROR messages.

e.g.
Closing connection [id: 0x10332e0e, L:/10.1.108.52:32922 ! R:10.1.108.50/10.1.108.50:11222] due to transport error ... (see the log file hono-command-router-infinispan.txt)

The issue is that the Command Router continues to log 'Closing connection' WARN messages and 'Exception encountered ...' ERROR messages with the old IP of Infinispan (10.1.108.50), even though Infinispan is functioning correctly with the new IP (10.1.108.53). During this time, the Command Router's readiness probe is failing. The problem sometimes resolves itself, for example, after 5 minutes (refer to log file hono-command-router-infinispan.txt and find messages with Infinispan's new IP 10.1.108.53), but at other times, the issue persists even after 30 minutes.

It appears that the command router is caching the IP of Infinispan and not attempting to resolve the hostname of Infinispan (in our case, dmp-infinispan) for an extended period of time. Do you have any idea how to fix this problem?
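One way to narrow this down is to check how the JVM itself caches DNS lookups. The following is only a minimal sketch and not part of Hono: the class name `DnsCacheCheck` and the helper `describeTtl` are made up for illustration. It prints the standard `networkaddress.cache.ttl` security property and the addresses a hostname currently resolves to. Note that this only covers JVM-level caching; a client library could also hold on to an already resolved address in its connection pool, and this thread does not establish which of the two applies here.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

public class DnsCacheCheck {

    /** Hypothetical helper: turns a networkaddress.cache.ttl value into a readable description. */
    static String describeTtl(String ttl) {
        if (ttl == null) {
            // Property not set: the JVM uses an implementation-specific default.
            return "unset (JVM default applies)";
        }
        int seconds = Integer.parseInt(ttl.trim());
        if (seconds < 0) {
            return "cache forever";
        }
        if (seconds == 0) {
            return "no caching";
        }
        return "cache for " + seconds + "s";
    }

    public static void main(String[] args) throws UnknownHostException {
        // Pass the service name as the first argument, e.g. dmp-infinispan.
        String host = args.length > 0 ? args[0] : "localhost";

        String ttl = Security.getProperty("networkaddress.cache.ttl");
        System.out.println("networkaddress.cache.ttl: " + describeTtl(ttl));

        // Addresses the JVM currently resolves for the host (possibly served from its cache).
        for (InetAddress address : InetAddress.getAllByName(host)) {
            System.out.println(host + " -> " + address.getHostAddress());
        }
    }
}
```

Running this inside the Command Router pod with `dmp-infinispan` as the argument, before and after the Infinispan pod is recreated, should show whether the JVM picks up the new IP.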


sophokles73 · 2024-10-13T12:47:41Z

Am I right in assuming that you are using a single-node Infinispan cluster? If so, then in order to make the setup resilient to crashes, you should switch to a multi-node Infinispan cluster. This will allow the Command Router to fail over to another Infinispan pod once the one it is currently interacting with is no longer available.

ondrej-charvat · 2024-10-25T12:28:48Z

Hi @sophokles73,
I work on the same project as @petr-cada and have taken the issue over from him. You're right that we're using a single-node Infinispan installation. I've tried to set up a cluster of 2 nodes, but it didn't bring any improvement. When I restarted the Infinispan node to which the Command Router was connected, the Command Router kept trying to connect to the old node (the previous pod's IP address). When I restarted the Command Router, it connected to one of the two Infinispan nodes at random. With every further Infinispan pod restart, the same behavior repeated.
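To reproduce the observation independently of the Command Router, a small probe can be run from a pod in the same namespace. This is a rough sketch using only the JDK: the class `ServerListProbe` and its helpers are hypothetical, it handles a single `host:port` entry only, and the 5000 ms timeout simply mirrors the `connectTimeout` from the values above. It resolves the service name and attempts a plain TCP connection to each resolved address. It does not speak the Hot Rod protocol, so a successful connect only shows that the port accepts connections.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ServerListProbe {

    /** Hypothetical helper: splits a single "host:port" entry into host and port. */
    static String[] splitHostPort(String entry) {
        int idx = entry.lastIndexOf(':');
        if (idx <= 0 || idx == entry.length() - 1) {
            throw new IllegalArgumentException("expected host:port but got: " + entry);
        }
        return new String[] { entry.substring(0, idx), entry.substring(idx + 1) };
    }

    /** Returns true if a TCP connection to the address can be opened within the timeout. */
    static boolean canConnect(InetAddress address, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(address, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Default mirrors the serverList value from this issue.
        String entry = args.length > 0 ? args[0] : "dmp-infinispan:11222";
        String[] hostPort = splitHostPort(entry);
        int port = Integer.parseInt(hostPort[1]);

        // Try every address the name currently resolves to.
        for (InetAddress address : InetAddress.getAllByName(hostPort[0])) {
            boolean reachable = canConnect(address, port, 5000);
            System.out.println(address.getHostAddress() + ":" + port + " reachable=" + reachable);
        }
    }
}
```

If the probe reports the new pod IP as reachable while the Command Router log still shows the old IP, that would suggest the stale address is held inside the Command Router process and not in cluster DNS.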

sophokles73 · 2024-10-28T06:35:00Z

> I've tried to set up a cluster of 2 nodes but it didn't bring any improvement.

This is not really helpful information ;-) Can you share your Infinispan cluster config file? Which version of Infinispan do you use?

ondrej-charvat · 2024-11-08T07:36:20Z

Hi @sophokles73,

we're using Infinispan 15.0.10.Final.

Here is the current config
infinispan.yml.gz

sophokles73 added question C&C Command and Control labels Oct 13, 2024
