unexpected EOF resulting in unable to reconnect #300

Open
sevaho opened this issue Apr 21, 2022 · 3 comments

sevaho commented Apr 21, 2022

Problem

We are running an application built in Python to translate HTTP calls to NATS. This application acts as our main gateway. It is dockerized (python:3.10-slim) and runs on EKS.

The following shows how network calls come in:

load balancer -> nginx-ingress -> APP ---- nats ---> resource server

For the past two weeks we have been having problems with the nats client disconnecting from the server and spitting out the following error: unexpected EOF.

As you can see in the graphs there is a spike in data received. I can assure you that this is not a DoS, just more requests than usual.
I have the feeling that this is due to too many requests being made from the app to the resource server, meaning the nats client is unable to handle a lot of requests at the same time. Do you know if this can be the case, and what we can do about it? Is it possible to reconnect immediately? Could the issue be related to fastapi (the web server) and nats sharing the same event loop?

App specs

Connection config:

from typing import Any, List, Union

from pydantic import BaseSettings
from nats.aio.client import (
    DEFAULT_CONNECT_TIMEOUT,
    DEFAULT_DRAIN_TIMEOUT,
    DEFAULT_MAX_FLUSHER_QUEUE_SIZE,
    DEFAULT_MAX_RECONNECT_ATTEMPTS,
)


class ConnectConfig(BaseSettings):
    servers: Union[str, List[str]] = ["nats://127.0.0.1:4222"]
    error_cb: Any = None
    closed_cb: Any = None
    reconnected_cb: Any = None
    disconnected_cb: Any = None
    discovered_server_cb: Any = None
    name: Any = None
    pedantic: Any = False
    allow_reconnect: Any = True
    connect_timeout: Any = DEFAULT_CONNECT_TIMEOUT
    max_reconnect_attempts: Any = DEFAULT_MAX_RECONNECT_ATTEMPTS
    dont_randomize: Any = False
    flusher_queue_size: Any = DEFAULT_MAX_FLUSHER_QUEUE_SIZE
    no_echo: Any = False
    tls: Any = None
    tls_hostname: Any = None
    user: Any = None
    password: Any = None
    token: Any = None
    drain_timeout: Any = DEFAULT_DRAIN_TIMEOUT
    signature_cb: Any = None
    user_jwt_cb: Any = None
    user_credentials: Any = None
    nkeys_seed: Any = None
    verbose: Any = True
    ping_interval: Any = 30
    max_outstanding_pings: Any = 5
    reconnect_time_wait: Any = 5
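
For context, this is roughly how the config is wired into nats.connect from a FastAPI startup hook, which is also why fastapi and nats end up sharing the same event loop. This is a simplified sketch rather than our exact code: the callback bodies, the app/log names and the max_reconnect_attempts=-1 choice are illustrative only.

import logging

import nats
from fastapi import FastAPI

log = logging.getLogger(__name__)
app = FastAPI()
config = ConnectConfig()


async def error_cb(e: Exception) -> None:
    log.error("NATS ERROR: %s", e)


async def disconnected_cb() -> None:
    log.warning("NATS DISCONNECTED")


async def reconnected_cb() -> None:
    log.warning("NATS RECONNECTED")


@app.on_event("startup")
async def connect_nats() -> None:
    # nats.connect runs on the same event loop as fastapi/uvicorn,
    # so the client and the web server share the loop by design
    app.state.nc = await nats.connect(
        servers=config.servers,
        error_cb=error_cb,
        disconnected_cb=disconnected_cb,
        reconnected_cb=reconnected_cb,
        allow_reconnect=config.allow_reconnect,
        max_reconnect_attempts=-1,  # -1 removes the reconnect attempt cap
        reconnect_time_wait=config.reconnect_time_wait,
        ping_interval=config.ping_interval,
        max_outstanding_pings=config.max_outstanding_pings,
        verbose=config.verbose,
    )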

Versions

  • Python: 3.10.4
  • nats.py: 2.1.0
  • fastapi: 0.75.1

Logs

Application

In loki:

2022-04-21 11:45:28 ERROR nats: connection closed
2022-04-21 11:45:28 ERROR nats: connection closed
2022-04-21 11:45:28 WARNING Cleaned up 0 coroutines.
2022-04-21 11:45:28 WARNING Cleanup coroutines that handle NATS messages.
2022-04-21 11:45:28 WARNING [APP_TEARDOWN] <fastapi.applications.FastAPI object at 0x7f21fd6d50c0>
2022-04-21 11:45:19 ERROR nats: connection closed
2022-04-21 11:45:19 ERROR nats: connection closed
2022-04-21 11:45:07 WARNING NATS CLOSED
2022-04-21 11:45:07 ERROR NATS ERROR: [Errno 104] Connection reset by peer
2022-04-21 11:45:07 ERROR NATS ERROR: [Errno 104] Connection reset by peer
2022-04-21 11:45:07 ERROR NATS ERROR: nats: unexpected EOF
2022-04-21 11:45:07 ERROR NATS ERROR: nats: unexpected EOF
2022-04-21 11:45:07 WARNING returning true from eof_received() has no effect when using ssl
2022-04-21 11:45:07 WARNING Executing <Task pending name='starlette.middleware.base.BaseHTTPMiddleware.__call__.<locals>.call_next.<locals>.coro' coro=<BaseHTTPMiddleware.__call__.<locals>.call_next.<locals>.coro() running at /usr/local/lib/python3.10/site-packages/starlette/middleware/base.py:34> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.10/site-packages/nats/aio/client.py:1101> cb=[TaskGroup._spawn.<locals>.task_done() at /usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py:629] created at /usr/local/lib/python3.10/asyncio/tasks.py:337> took 0.605 seconds

In sentry:

(screenshot)

Nats Server

No logs (even in debug mode)

Graphs

individual pod, 5 min (screenshot)
individual pod, 1 h (screenshot)
deployment, 3 pods running in round robin (screenshot)

@ronigober

Hey!
Have you had any progress with this issue?
We experienced similar issues with NATS, and after a lot of research we managed to partially solve it by increasing the TLS handshake timeout in the server configuration.
But we still get this error from time to time.
The main issue here is that it causes the Python client to get stuck instead of just raising an error.
The only visible difference we can see is the log line
WARNING returning true from eof_received() has no effect when using ssl.
Thanks!


tothandras commented Nov 22, 2022

Have you tried to manually reconnect on an error, or is it just stuck? I've had similar problems with TimeoutError when the client is idle for a while, and this is what I did:

import json

import nats.errors

# nc is an existing, connected NATS client; js = nc.jetstream()
try:
    await js.publish(stream, json.dumps(data).encode())
except nats.errors.TimeoutError:
    # recreate the JetStream context and retry once
    js = nc.jetstream()
    await js.publish(stream, json.dumps(data).encode())


ronigober commented Nov 23, 2022

Hey @tothandras :)
The problem here is that we don't get any TimeoutError, or any other exception;
the process just gets stuck.
I can't even tell where, because I can't reproduce the problem.
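
Not a confirmed fix, but one way to at least see where it wedges: put an outer asyncio.wait_for deadline around the NATS awaits, so a stuck client raises and logs instead of hanging the handler silently. A minimal sketch (the subject name, timeouts and logger are placeholders; nc is assumed to be an already-connected client):

import asyncio
import json
import logging

import nats.errors

log = logging.getLogger(__name__)


async def request_with_deadline(nc, data: dict):
    try:
        # nc.request has its own timeout for the responder; the outer
        # asyncio.wait_for guards against the client itself being wedged
        # and never honouring that timeout
        return await asyncio.wait_for(
            nc.request("gateway.request", json.dumps(data).encode(), timeout=5),
            timeout=10,
        )
    except nats.errors.TimeoutError:
        log.error("NATS responder did not answer in time")
        raise
    except asyncio.TimeoutError:
        log.error("NATS client appears stuck (outer deadline hit)")
        raise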
