When workload asks admin server for /commands/ruok, connection refused because of localhost resolution #170

Open
Thinking-Dragon opened this issue Nov 14, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@Thinking-Dragon

Steps to reproduce

Important note: this happens in a very specific client environment; simply reproducing this deployment on a LXD cloud will not be sufficient to trigger the error. I am pasting the bundle here for reference. However, I was able to pinpoint exactly where and how the charm is failing; specific details can be found further down. You can skip to the Additional context section to see the details.

Side note: the reproduction steps include creating a manual cloud, because it is on a manual cloud that I get this error. However, the cloud type is irrelevant; I would get the same error on a LXD cloud, so you can ignore that part.

  1. Create a manual cloud `juju add-cloud ...` and bootstrap a controller on it
  2. Create a model: `juju add-model kafka`
  3. Add 6 machines to it: `juju add-machine ssh:[...]`
  4. Deploy the following bundle: `juju deploy ./kafka.yaml`
variables:
  customer_key: &customer_key include-base64://./certs/tls_certificate.pem
  customer_crt: &customer_crt include-base64://./certs/ca_certificate.pem
  ca-chain:  &ca_chain  include-base64://./certs/ca_chain.pem
saas:
 loki-logging:
   url: cos:admin/cos.loki-logging
 prometheus-receive-remote-write:
   url: cos:admin/cos.prometheus-receive-remote-write
 grafana-dashboards:
   url: cos:admin/cos.grafana-dashboards
 alertmanager-karma-dashboard:
   url: cos:admin/cos.alertmanager-karma-dashboard
applications:
  kafka:
    channel: 3/stable
    charm: kafka
    num_units: 3
    bindings:
      ? ''
      : alpha
        kafka-client: data 
    base: ubuntu@22.04
    to:
      - "0"
      - "1"
      - "2"
  tls-certificates-operator:
    channel: latest/stable
    charm: tls-certificates-operator
    num_units: 1
    base: ubuntu@22.04
    bindings:
      ? ''
      : alpha
    options:
      generate-self-signed-certificates: False
      certificate:    *customer_key
      ca-certificate: *customer_crt
      ca-chain:       *ca_chain
    to:
      - "0"
  manual-tls-certificates:
    charm: manual-tls-certificates
    channel: latest/stable
    num_units: 1
    to:
      - "0"
  external-ca:
    channel: latest/stable
    charm: tls-certificates-operator
    num_units: 1
    base: ubuntu@22.04
    options:
      generate-self-signed-certificates: False
      certificate:    *customer_key
      ca-certificate: *customer_crt
      ca-chain:       *ca_chain
    to:
      - "3"
  zookeeper:
    channel: 3/stable
    charm: zookeeper
    num_units: 3
    base: ubuntu@22.04
    bindings:
      ? ''
      : alpha
    to:
      - "3"
      - "4"
      - "5"
  data-integrator:
    channel: latest/stable
    charm: data-integrator
    num_units: 1
    base: ubuntu@22.04
    bindings:
      ? ''
      : alpha
    options:
      topic-name: "default"
      extra-user-roles: "admin"
    to:
      - "0"
  grafana-agent:
    channel: latest/stable
    charm: grafana-agent
description: A fast, secure and fault-tolerant Apache Kafka, supported by Apache ZooKeeper
issues: https://github.com/canonical/kafka-bundle/issues/new
name: kafka-bundle
relations:
- - zookeeper:certificates
  - manual-tls-certificates:certificates
  - tls-certificates-operator:certificates
- - kafka:certificates
  - manual-tls-certificates:certificates
  - tls-certificates-operator:certificates
- - kafka:zookeeper
  - zookeeper:zookeeper
- - kafka
  - data-integrator
- - external-ca:certificates
  - kafka:trusted-ca
- - grafana-agent
  - zookeeper
- - grafana-agent
  - kafka

base: ubuntu@22.04

machines:
  "0": {}
  "1": {}
  "2": {}
  "3": {}
  "4": {}
  "5": {}

Expected behavior

Running `juju status` should show all units in the active state after convergence.

Actual behavior

[Screenshot kafka-customer-issue: juju status output showing all ZooKeeper units blocked]

As you can see in the screenshot, all ZooKeeper units are shown as blocked with the message zookeeper service is unreachable or not serving requests (which corresponds to the SERVICE_UNHEALTHY state in the code).

Important note: while Juju is showing the units as blocked, in reality the ZooKeeper service is running and Kafka is able to interact with it. My client's engineering team was even able to connect their application to Kafka and run successful benchmarking tests.

See Additional context section below for a detailed explanation of what is happening and why it is happening.

Versions

Operating system: Ubuntu 22.04.5 LTS

Juju CLI: 3.5.4-genericlinux-amd64

Juju agent: 3.5.4

Charm revision: 149

LXD: 5.0.3

Log output

Juju debug log:

unit-data-integrator-0: 13:16:28 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kafka-1: 13:16:33 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kafka-2: 13:16:33 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kafka-0: 13:16:33 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-7: 13:17:30 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-6: 13:17:30 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-external-ca-0: 13:17:45 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-zookeeper-0: 13:18:41 WARNING unit.zookeeper/0.update-status /var/lib/juju/agents/unit-zookeeper-0/charm/./src/charm.py:459: DeprecationWarning: Using 'uris' in the databag is deprecated, use 'endpoints' instead
unit-zookeeper-0: 13:18:41 WARNING unit.zookeeper/0.update-status   "uris": client.uris,
unit-zookeeper-0: 13:18:45 ERROR unit.zookeeper/0.juju-log zookeeper service is unreachable or not serving requests
unit-zookeeper-0: 13:18:46 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-zookeeper-2: 13:19:19 ERROR unit.zookeeper/2.juju-log zookeeper service is unreachable or not serving requests
unit-zookeeper-2: 13:19:19 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-zookeeper-1: 13:21:05 ERROR unit.zookeeper/1.juju-log zookeeper service is unreachable or not serving requests
unit-zookeeper-1: 13:21:06 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-10: 13:21:17 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-8: 13:21:18 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-9: 13:21:18 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-11: 13:21:23 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-7: 13:21:56 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-6: 13:21:56 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-data-integrator-0: 13:22:20 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kafka-0: 13:22:26 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kafka-1: 13:22:26 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kafka-2: 13:22:26 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-tls-certificates-operator-0: 13:22:26 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-external-ca-0: 13:22:43 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-zookeeper-0: 13:23:34 WARNING unit.zookeeper/0.update-status /var/lib/juju/agents/unit-zookeeper-0/charm/./src/charm.py:459: DeprecationWarning: Using 'uris' in the databag is deprecated, use 'endpoints' instead
unit-zookeeper-0: 13:23:34 WARNING unit.zookeeper/0.update-status   "uris": client.uris,
unit-zookeeper-0: 13:23:38 ERROR unit.zookeeper/0.juju-log zookeeper service is unreachable or not serving requests
unit-zookeeper-0: 13:23:39 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-zookeeper-2: 13:24:44 ERROR unit.zookeeper/2.juju-log zookeeper service is unreachable or not serving requests
unit-zookeeper-2: 13:24:44 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-10: 13:25:36 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-9: 13:25:37 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-8: 13:25:37 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-11: 13:25:39 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-7: 13:26:03 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-grafana-agent-6: 13:26:04 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-zookeeper-1: 13:26:56 ERROR unit.zookeeper/1.juju-log zookeeper service is unreachable or not serving requests
unit-zookeeper-1: 13:26:56 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-external-ca-0: 13:27:12 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-tls-certificates-operator-0: 13:28:03 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-data-integrator-0: 13:28:03 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kafka-1: 13:28:09 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kafka-2: 13:28:09 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kafka-0: 13:28:09 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
controller-2: 13:28:40 INFO juju.worker.logforwarder config change - log forwarding not enabled

Additional context

Issue

My units are in the blocked state with the message zookeeper service is unreachable or not serving requests (the SERVICE_UNHEALTHY status level).

The reason is that when the charm checks whether its workload is healthy, the HTTP request to the admin server at http://localhost:8080/commands/ruok gets Connection refused.

(1) When I curl localhost:8080/commands/ruok, it works fine.

(2) When I run httpx.get("http://localhost:8080/commands/ruok", timeout=10) in an interactive Python shell, I get connection refused.

(3) When I run httpx.get("http://127.0.0.1:8080/commands/ruok", timeout=10) in an interactive Python shell, it works fine.
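
For reference, here is a minimal sketch of that comparison as run from the unit (assuming httpx is installed and the admin server listens on port 8080):

import httpx

# A name-based request goes through name resolution (DNS / nsswitch);
# an IP-based request bypasses name resolution entirely.
for host in ("localhost", "127.0.0.1"):
    try:
        response = httpx.get(f"http://{host}:8080/commands/ruok", timeout=10)
        print(host, "->", response.status_code, response.json())
    except httpx.ConnectError as exc:
        print(host, "-> connection refused:", exc)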

I ran a tcpdump on port 8080 and got this for situation (2):

tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
09:25:54.274781 ens160 Out IP zookpr-nonprd-1.inside.customer.org.49158 > localhost.inside.customer.org.http-alt: Flags [S], seq 1609136223, win 64240, options [mss 1460,sackOK,TS val 1602603686 ecr 0,nop,wscale 7], length 0
09:25:54.274931 ens160 In  IP localhost.inside.customer.org.http-alt > zookpr-nonprd-1.inside.customer.org.49158: Flags [R.], seq 0, ack 1609136224, win 0, length 0

As you can see, localhost:8080 was resolved to localhost.inside.customer.org.http-alt (http-alt is the service name for port 8080), and the SYN was immediately answered with a RST.

When I use 127.0.0.1:8080 instead, the address is not resolved to the *.inside.customer.org domain, so it works.
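
A quick way to confirm the resolution from the affected machine (a sketch; on a healthy host both statements print loopback addresses):

import socket

# Forward resolution of "localhost"; expected to return 127.0.0.1.
print(socket.gethostbyname("localhost"))

# The full set of addresses "localhost" resolves to, including IPv6.
print({info[4][0] for info in socket.getaddrinfo("localhost", None)})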

Solutions

I see two obvious options here to solve the problem:

  1. Make a change to the charm code to replace localhost with 127.0.0.1 when making the request (see the sketch after this list). As far as I am aware, this should not negatively impact other deployments, since localhost is an alias for 127.0.0.1 anyway.

  2. Add an exception to their internal DNS (or possibly an alias in /etc/hosts, 127.0.0.1 localhost.inside.customer.org, on the VMs) to prevent localhost from being resolved to localhost.inside.customer.org. In that case, I think it is better to ask them to choose how they want to make the exception, so that they are aware of it and can manage it properly if they ever change their DNS configuration.
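
For illustration, option 1 would amount to a one-line change in the health check (a sketch; ADMIN_SERVER_PORT is the constant used in the charm code quoted later, assumed here to default to 8080):

import httpx

ADMIN_SERVER_PORT = 8080  # assumed default, matching the URL in this report

# Use the loopback IP directly so the request never touches name resolution.
response = httpx.get(f"http://127.0.0.1:{ADMIN_SERVER_PORT}/commands/ruok", timeout=10)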

@Thinking-Dragon Thinking-Dragon added the bug Something isn't working label Nov 14, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-6030.

This message was autogenerated

@marcoppenheimer
Contributor

Hi @Thinking-Dragon! Thank you for raising this Issue, and especially for such a detailed outline of the problem; that really helps us 👍🏾

We've done some exploration of using the loopback address rather than localhost, and it seems like a safe bet. I'd like to understand a bit more about your set-up, though, if possible.
On a standard Ubuntu deployment, which is the environment we develop on, test, and guarantee, you'll get 127.0.0.1 localhost as the first line in /etc/hosts. What is the reason for it being different here?

@Thinking-Dragon
Author

Hi @marcoppenheimer, thank you for the feedback!

Sorry for the late response, I was at a trade show last week.

So basically, my client had an entry in their DNS for localhost.inside.<customer>.org. I am still uncertain why they made this entry in the first place, but I ended up asking them if they could remove it; they agreed, and the deployment works as expected now.

My personal assessment is that this is not actually a bug in the ZooKeeper charm. In my opinion, localhost should always resolve to 127.0.0.1, and any environment that violates this rule is a corrupted environment that needs to be fixed.

The reason for this opinion is that even if we change the ZooKeeper charm to use 127.0.0.1 instead of localhost, the client may encounter this exact same problem with any number of other applications they deploy in that environment in the future. Thus I consider the environment to be broken, not the charm.

I think a better "solution" from the charm's perspective would be, rather than switching to 127.0.0.1, to add a check in the exception handler for the case where the request to http://localhost:8080/commands/ruok fails with Connection refused: verify whether localhost resolves to 127.0.0.1 and print an error message if it does not.

I haven't tested it, but I'm thinking something like this (imports and a module-level logger added here for completeness):

import logging
import socket

import httpx
from tenacity import retry, retry_if_result, stop_after_attempt, wait_fixed
from typing_extensions import override

logger = logging.getLogger(__name__)

...

@property
@override
@retry(
    wait=wait_fixed(1),
    stop=stop_after_attempt(5),
    retry=retry_if_result(lambda result: result is False),
    retry_error_callback=lambda _: False,
)
def healthy(self) -> bool:
    """Flag to check if the unit service is reachable and serving requests."""
    if not self.alive:
        return False

    try:
        response = httpx.get(f"http://localhost:{ADMIN_SERVER_PORT}/commands/ruok", timeout=10)
        response.raise_for_status()

    except httpx.ConnectError:
        # Connection refused: check whether "localhost" even resolves to
        # the loopback address before blaming the service itself.
        self.verify_localhost_ip_resolution()
        return False

    except httpx.HTTPStatusError:
        return False

    if response.json().get("error", None):
        return False

    return True

def verify_localhost_ip_resolution(self) -> None:
    """Ensure that localhost resolves to 127.0.0.1."""
    try:
        localhost_ip_address = socket.gethostbyname("localhost")
        if localhost_ip_address != "127.0.0.1":
            # Put the charm in an error state that says something like:
            #     "localhost" should always resolve to "127.0.0.1" but
            #     resolves to "{localhost_ip_address}". Please check your
            #     DNS settings.
            logger.error(
                '"localhost" should always resolve to "127.0.0.1" but '
                'resolves to "%s". Please check your DNS settings.',
                localhost_ip_address,
            )
    except socket.gaierror:
        # Put the charm in an error state that says something like:
        #     Could not resolve "localhost".
        logger.error('Could not resolve "localhost".')

Maybe that check would need to live in a property other than healthy(self), or healthy(self) would need to be replaced by something like health_status(self) that returns an enum rather than a boolean, as sketched below.
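
Sketching that idea (entirely hypothetical; the HealthStatus members and the helper function are illustrative, not part of the charm):

from enum import Enum, auto
import socket

class HealthStatus(Enum):
    """Possible outcomes of the unit health check."""
    HEALTHY = auto()
    SERVICE_UNHEALTHY = auto()
    LOCALHOST_MISRESOLVED = auto()

def classify_connect_failure() -> HealthStatus:
    """Distinguish a down service from a misresolved localhost."""
    try:
        if socket.gethostbyname("localhost") != "127.0.0.1":
            return HealthStatus.LOCALHOST_MISRESOLVED
    except socket.gaierror:
        return HealthStatus.LOCALHOST_MISRESOLVED
    return HealthStatus.SERVICE_UNHEALTHY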

What do you think about my approach? Maybe I'm overthinking this and the charm should just be left as is.
