unexpected EOF resulting in unable to reconnect #300

Open
sevaho opened this issue Apr 21, 2022 · 3 comments

sevaho commented Apr 21, 2022

Problem

We are running an application built in Python to translate HTTP calls to NATS. This application acts as our main gateway. It is dockerized (python:3.10-slim) and runs on EKS.

The following shows how network calls come in:

load balancer -> nginx-ingress -> APP ---- nats ---> resource server

For the past two weeks we have been having problems with the nats client disconnecting from the server and spitting out the following error: unexpected EOF.

As you can see in the graphs there is a spike in data received. I can assure you that this is not a DoS, just more requests than usual.
I have the feeling that this is due to too many requests being made from the app to the resource server, meaning the nats client is unable to handle a lot of requests at the same time. Do you know if this can be the case, and what we can do about it? Is it possible to reconnect immediately? Could the issue be related to fastapi (the web server) and nats sharing the same event loop?

App specs

Connection config:

from typing import Any, List, Union

from pydantic import BaseSettings
from nats.aio.client import (
    DEFAULT_CONNECT_TIMEOUT,
    DEFAULT_DRAIN_TIMEOUT,
    DEFAULT_MAX_FLUSHER_QUEUE_SIZE,
    DEFAULT_MAX_RECONNECT_ATTEMPTS,
)


class ConnectConfig(BaseSettings):
    servers: Union[str, List[str]] = ["nats://127.0.0.1:4222"]
    error_cb: Any = None
    closed_cb: Any = None
    reconnected_cb: Any = None
    disconnected_cb: Any = None
    discovered_server_cb: Any = None
    name: Any = None
    pedantic: Any = False
    allow_reconnect: Any = True
    connect_timeout: Any = DEFAULT_CONNECT_TIMEOUT
    max_reconnect_attempts: Any = DEFAULT_MAX_RECONNECT_ATTEMPTS
    dont_randomize: Any = False
    flusher_queue_size: Any = DEFAULT_MAX_FLUSHER_QUEUE_SIZE
    no_echo: Any = False
    tls: Any = None
    tls_hostname: Any = None
    user: Any = None
    password: Any = None
    token: Any = None
    drain_timeout: Any = DEFAULT_DRAIN_TIMEOUT
    signature_cb: Any = None
    user_jwt_cb: Any = None
    user_credentials: Any = None
    nkeys_seed: Any = None
    verbose: Any = True
    ping_interval: Any = 30
    max_outstanding_pings: Any = 5
    reconnect_time_wait: Any = 5
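
For context, this is roughly how the config is wired into nats.connect from a FastAPI startup hook, which is also why fastapi and nats end up sharing the same event loop. This is a simplified sketch rather than our exact code: the callback bodies, the app/log names and the max_reconnect_attempts=-1 choice are illustrative only.

import logging

import nats
from fastapi import FastAPI

log = logging.getLogger(__name__)
app = FastAPI()
config = ConnectConfig()


async def error_cb(e: Exception) -> None:
    log.error("NATS ERROR: %s", e)


async def disconnected_cb() -> None:
    log.warning("NATS DISCONNECTED")


async def reconnected_cb() -> None:
    log.warning("NATS RECONNECTED")


@app.on_event("startup")
async def connect_nats() -> None:
    # nats.connect runs on the same event loop as fastapi/uvicorn,
    # so the client and the web server share the loop by design
    app.state.nc = await nats.connect(
        servers=config.servers,
        error_cb=error_cb,
        disconnected_cb=disconnected_cb,
        reconnected_cb=reconnected_cb,
        allow_reconnect=config.allow_reconnect,
        max_reconnect_attempts=-1,  # -1 removes the reconnect attempt cap
        reconnect_time_wait=config.reconnect_time_wait,
        ping_interval=config.ping_interval,
        max_outstanding_pings=config.max_outstanding_pings,
        verbose=config.verbose,
    )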

Versions

  • Python: 3.10.4
  • nats.py: 2.1.0
  • fastapi: 0.75.1

Logs

Application

In loki:

2022-04-21 11:45:28 ERROR nats: connection closed
2022-04-21 11:45:28 ERROR nats: connection closed
2022-04-21 11:45:28 WARNING Cleaned up 0 coroutines.
2022-04-21 11:45:28 WARNING Cleanup coroutines that handle NATS messages.
2022-04-21 11:45:28 WARNING [APP_TEARDOWN] <fastapi.applications.FastAPI object at 0x7f21fd6d50c0>
2022-04-21 11:45:19 ERROR nats: connection closed
2022-04-21 11:45:19 ERROR nats: connection closed
2022-04-21 11:45:07 WARNING NATS CLOSED
2022-04-21 11:45:07 ERROR NATS ERROR: [Errno 104] Connection reset by peer
2022-04-21 11:45:07 ERROR NATS ERROR: [Errno 104] Connection reset by peer
2022-04-21 11:45:07 ERROR NATS ERROR: nats: unexpected EOF
2022-04-21 11:45:07 ERROR NATS ERROR: nats: unexpected EOF
2022-04-21 11:45:07 WARNING returning true from eof_received() has no effect when using ssl
2022-04-21 11:45:07 WARNING Executing <Task pending name='starlette.middleware.base.BaseHTTPMiddleware.__call__.<locals>.call_next.<locals>.coro' coro=<BaseHTTPMiddleware.__call__.<locals>.call_next.<locals>.coro() running at /usr/local/lib/python3.10/site-packages/starlette/middleware/base.py:34> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.10/site-packages/nats/aio/client.py:1101> cb=[TaskGroup._spawn.<locals>.task_done() at /usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py:629] created at /usr/local/lib/python3.10/asyncio/tasks.py:337> took 0.605 seconds

In sentry:

(screenshot)

Nats Server

No logs (even in debug mode)

Graphs

individual pod, 5 min (screenshot)
individual pod, 1 h (screenshot)
deployment, 3 pods running in round robin (screenshot)

@ronigober

Hey!
Have you had any progress with this issue?
We experienced similar issues with NATS, and after a lot of research we managed to partially solve it by increasing the TLS handshake timeout in the server configuration.
But we still get this error from time to time.
The main issue here is that it causes the Python client to get stuck instead of just raising an error.
The only visible difference we can see is the log line
WARNING returning true from eof_received() has no effect when using ssl.
Thanks!


tothandras commented Nov 22, 2022

Have you tried to manually reconnect on an error, or is it just stuck? I've had similar problems with TimeoutError when the client is idle for a while, and this is what I did:

import json

import nats.errors

# nc is an existing, connected NATS client; js = nc.jetstream()
try:
    await js.publish(stream, json.dumps(data).encode())
except nats.errors.TimeoutError:
    # recreate the JetStream context and retry once
    js = nc.jetstream()
    await js.publish(stream, json.dumps(data).encode())


ronigober commented Nov 23, 2022

Hey @tothandras :)
The problem here is that we don't get any TimeoutError, or any other exception;
the process just gets stuck.
I can't even tell where, because I can't reproduce the problem.
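
Not a confirmed fix, but one way to at least see where it wedges: put an outer asyncio.wait_for deadline around the NATS awaits, so a stuck client raises and logs instead of hanging the handler silently. A minimal sketch (the subject name, timeouts and logger are placeholders; nc is assumed to be an already-connected client):

import asyncio
import json
import logging

import nats.errors

log = logging.getLogger(__name__)


async def request_with_deadline(nc, data: dict):
    try:
        # nc.request has its own timeout for the responder; the outer
        # asyncio.wait_for guards against the client itself being wedged
        # and never honouring that timeout
        return await asyncio.wait_for(
            nc.request("gateway.request", json.dumps(data).encode(), timeout=5),
            timeout=10,
        )
    except nats.errors.TimeoutError:
        log.error("NATS responder did not answer in time")
        raise
    except asyncio.TimeoutError:
        log.error("NATS client appears stuck (outer deadline hit)")
        raise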
