When workload asks admin server for /commands/ruok, connection refused because of localhost resolution #170
Comments
Thank you for reporting us your feedback! The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-6030.
Hi @Thinking-Dragon! Thank you for raising this issue, and especially for such a detailed outline of the problem, that really helps us 👍🏾 We've done some exploring of using the loopback address rather than
Hi @marcoppenheimer, thank you for the feedback! Sorry for the late response, I was at a trade show last week.
So basically my client had an entry in their DNS for
My personal assessment is that this is not actually a bug in the ZooKeeper charm. In my opinion
The reason for this opinion is that even if we change the ZooKeeper charm to use
I think a better "solution" from the charm's perspective for this situation would be if rather than using
I haven't tested it, but I'm thinking something like this:
import socket
...

@property
@override
@retry(
    wait=wait_fixed(1),
    stop=stop_after_attempt(5),
    retry=retry_if_result(lambda result: result is False),
    retry_error_callback=lambda _: False,
)
def healthy(self) -> bool:
    """Flag to check if the unit service is reachable and serving requests."""
    if not self.alive:
        return False

    try:
        response = httpx.get(f"http://localhost:{ADMIN_SERVER_PORT}/commands/ruok", timeout=10)
        response.raise_for_status()
    except httpx.ConnectError:
        # A refused connection may be caused by "localhost" resolving elsewhere.
        self.verify_localhost_ip_resolution()
        return False
    except httpx.HTTPStatusError:
        return False

    if response.json().get("error", None):
        return False

    return True

def verify_localhost_ip_resolution(self):
    """Ensure that localhost resolves to 127.0.0.1."""
    try:
        localhost_ip_address = socket.gethostbyname('localhost')
        if localhost_ip_address != '127.0.0.1':
            # Put the charm in an error state that says something like:
            # "localhost" should always resolve to "127.0.0.1" but resolves to "{localhost_ip_address}".
            # Please check your DNS settings.
            pass
    except socket.gaierror:
        # Put the charm in an error state that says something like:
        # Could not resolve "localhost".
        pass

Maybe that check would need to be in a property other than
What do you think about my approach? Maybe I'm overthinking this and the charm should just be left as is.
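The resolution check proposed above can be tried outside the charm as a small standalone function. This is only a sketch: the function name `localhost_resolution_problem` and its injectable `resolve` parameter are hypothetical and not part of the charm, and how the charm would surface the message (for example as a status) is left open. Note that `socket.gethostbyname` only covers IPv4, so this says nothing about how `localhost` resolves over IPv6.

```python
import socket
from typing import Callable, Optional


def localhost_resolution_problem(
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Optional[str]:
    """Return a description of the problem, or None if localhost resolves to 127.0.0.1.

    `resolve` is injectable (hypothetical design choice) so the check can be
    exercised without touching the machine's real DNS configuration.
    """
    try:
        address = resolve("localhost")
    except socket.gaierror as error:
        return f'Could not resolve "localhost": {error}'

    if address != "127.0.0.1":
        return (
            '"localhost" should resolve to "127.0.0.1" '
            f'but resolves to "{address}". Please check your DNS settings.'
        )
    return None


if __name__ == "__main__":
    # Uses the real resolver of the machine this runs on.
    print(localhost_resolution_problem() or "localhost resolves to 127.0.0.1")
```

Returning a message instead of raising keeps the decision about what to do with it (log it, block the unit, ignore it) with the caller.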
Steps to reproduce
Important note: this happens in a very specific client environment; simply reproducing this deployment in a LXD cloud will not be sufficient to get the error. I am pasting the bundle here for reference. However, I was able to pinpoint exactly where and how the charm is failing. You can skip to the Additional context section to see the details.
Side note: the reproduction steps include creating a manual cloud because it is on a manual cloud that I get this error. However, the cloud type is irrelevant: I would get the same error on a LXD cloud, so you can ignore that part.
1. Create a manual cloud with juju add-cloud ... and bootstrap a controller on it
2. juju add-model kafka
3. Add 6 machines to it: juju add-machine ssh:[...]
4. juju deploy ./kafka.yaml
Expected behavior
Running juju status should show all units in active state after convergence.
Actual behavior
As you can see in the screenshot, all ZooKeeper units are shown as blocked with message zookeeper service is unreachable or not serving requests (which corresponds to the SERVICE_UNHEALTHY state in the code).
Important note: while Juju is showing the units as blocked, in reality the ZooKeeper service is running and Kafka is able to interact with it. My client's engineering team was even able to connect their application to Kafka and run successful benchmarking tests.
See the Additional context section below for a detailed explanation of what is happening and why it is happening.
Versions
Operating system: Ubuntu 22.04.5 LTS
Juju CLI: 3.5.4-genericlinux-amd64
Juju agent: 3.5.4
Charm revision: 149
LXD: 5.0.3
Log output
Juju debug log:
Additional context
Issue
My units are in blocked state with message zookeeper service is unreachable or not serving requests (SERVICE_UNHEALTHY status level).
The reason is that when the charm checks if its workload is healthy, the HTTP request to the admin server for http://localhost:8080/commands/ruok gets Connection refused.
(1) When I curl localhost:8080/commands/ruok it works fine.
(2) When I run httpx.get("http://localhost:8080/commands/ruok", timeout=10) in an interactive Python shell, I get connection refused.
(3) When I run httpx.get("http://127.0.0.1:8080/commands/ruok", timeout=10) in an interactive Python shell, it works fine.
I ran a tcpdump on port 8080 and got this for situation (2):
As you can see, localhost:8080 was resolved to localhost.inside.customer.org.http-alt.
When I use 127.0.0.1:8080 instead, it is not resolved to the *.inside.customer.org domain, so it works.
Solutions
I see two obvious options here to solve the problem:
1. Make a change to the charm code to replace localhost by 127.0.0.1 when doing the request. It should not impact other deployments negatively as far as I am aware, since localhost is an alias for 127.0.0.1 anyways.
2. Add an exception to their internal DNS (or possibly an alias in /etc/hosts for 127.0.0.1 localhost.inside.customer.org on the VMs) to prevent localhost from being resolved to localhost.inside.customer.org. In which case I think it's better if I ask them to choose the way they want to make that exception, so that they are aware of the exception and are able to manage it properly if they ever change their DNS configurations.
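As an illustration of the first option, the health-check URL could be built from a literal loopback address so that no name resolution is involved at all. This is a minimal sketch, not the charm's code: the helper name `ruok_url` is hypothetical, the port 8080 is the one used in the examples above, and it assumes the admin server is listening on the chosen loopback address.

```python
import ipaddress

ADMIN_SERVER_PORT = 8080  # port used in the examples in this issue


def ruok_url(host: str = "127.0.0.1", port: int = ADMIN_SERVER_PORT) -> str:
    """Build the admin server health URL from a literal loopback IP address.

    Hostnames such as "localhost" are rejected on purpose, so the request can
    never depend on DNS or search-domain configuration.
    """
    ip = ipaddress.ip_address(host)  # raises ValueError for anything that is not an IP literal
    if not ip.is_loopback:
        raise ValueError(f"{host} is not a loopback address")
    # IPv6 literals must be wrapped in brackets inside a URL.
    host_part = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"http://{host_part}:{port}/commands/ruok"


if __name__ == "__main__":
    print(ruok_url())
```

The resulting string could then be passed to the existing httpx.get call in place of the localhost URL.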