
Gunicorn worker hangs and closes connections #3314

Open
dantebarba opened this issue Oct 23, 2024 · 4 comments

Comments

@dantebarba

Hi,

I've been dealing with this issue since we moved our application from the Flask development server to a WSGI server (gunicorn), and I'm unable to find a solution to it.

Runtime environment

python==3.7.16
gunicorn==23.0.0
Flask==1.1.2
Docker: yes
Docker image: python:3.7.16-slim
VM: GCP e2-small with ContainerOS

Dockerfile

FROM python:3.7.16-slim

RUN apt-get update && apt-get install -y build-essential libssl-dev libffi-dev

WORKDIR /app

COPY requirements.txt requirements.txt

RUN pip install pip==20.0.1 && pip install -r requirements.txt

ARG VERSION=""
ARG BUILD_TIMESTAMP=""
ARG BUILD_ENVIRONMENT="test"

ENV APP_SETTINGS="config.StagingConfig"
ENV FLASK_APP="create_app.py"
ENV FLASK_ENV="production"
ENV VERSION=$VERSION
ENV BUILD_TIMESTAMP=$BUILD_TIMESTAMP
ENV BUILD_ENVIRONMENT=$BUILD_ENVIRONMENT
ENV LOG_LEVEL="INFO"
ENV EMAIL_USE_SSL="True"
ENV REDIS_URL="redis-node"
# setting max worker timeout to match Cloudflare's max timeout
ENV WORKER_TIMEOUT="100"
ENV PYTHONFAULTHANDLER="1"
ENV GRPC_POLL_STRATEGY="epoll1"
# default is 2048
ENV GUNICORN_BACKLOG="2048"

COPY . .

EXPOSE 80

RUN mkdir -p /app/log && touch /app/log/client_library.log

CMD gunicorn --worker-class=gthread --workers=3 --threads=4 wsgi:app --bind 0.0.0.0:80 --timeout ${WORKER_TIMEOUT} --access-logfile /dev/null --error-logfile - --log-level ${LOG_LEVEL} --limit-request-line 4094 --limit-request-fields 100 --limit-request-field_size 8190 --backlog ${GUNICORN_BACKLOG}

Description

We started experiencing random hangs in the application; we noticed because our uptime monitor would alert us. Downtime usually lasts about 3-5 minutes. Analyzing the logs, we found that these hangs are usually preceded by a request spike.

Our first attempt was to change the worker and thread configuration. We tested various combinations, from 1 worker and 1 thread up to 8 workers and 2 threads; all of them showed similar issues under stress tests. The 1 worker / 1 thread configuration was the fastest to freeze, after only 10 requests.
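A minimal probe along these lines is enough to exercise the gthread worker pool. This is a hypothetical sketch, not the exact script used here; the /health endpoint and the concurrency level are placeholders:

# Fire 50 concurrent GETs at the app and tally the outcomes.
# /health is a placeholder; any cheap route works.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URL = "http://localhost:80/health"

def probe(_):
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            return resp.status
    except Exception as exc:
        return type(exc).__name__

with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(probe, range(50)))

print({r: results.count(r) for r in set(results)})

On a healthy server this prints {200: 50}; when a worker hangs, the probes pile up until the 10-second timeout.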

One thing we noticed was that the application would come back to life after emitting a burst of [DEBUG] Closing connection. log entries.

[screenshot: log output with a burst of [DEBUG] Closing connection. entries]

This issue only happens when deploying to a VM. On my local environment (MacBook Air M1) it does not occur: the application can serve multiple requests and all stress tests were successful.

Here is a stress test sample:

[screenshot: stress test results]

Any thoughts?

@pajod
Contributor

pajod commented Oct 24, 2024

The configuration hints at possibly relevant dependencies (GRPC_POLL_STRATEGY, libffi-dev; and what's up with the EoL Python and pip versions?). It's probably worth bisecting your dependencies to rule out a loaded C module misbehaving.
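A starting point for that bisection (a hypothetical sketch, not from this thread) is to enumerate which compiled extension modules the worker has actually loaded, e.g. from a temporary debug route or a shell inside the container:

# List every loaded module backed by a compiled extension (.so), i.e.
# the C-module candidates to bisect (grpcio, cffi, cryptography, ...).
import sys

for name in sorted(sys.modules):
    path = getattr(sys.modules[name], "__file__", None) or ""
    if path.endswith(".so"):
        print(f"{name}: {path}")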

@dantebarba
Author

dantebarba commented Oct 25, 2024


The GRPC_POLL_STRATEGY setting is due to the following issue with grpcio: grpc/grpc#29044. We use gRPC to connect to GCP services. libffi-dev was added to support the cffi and cryptography packages.

The main issue is that the Docker image runs perfectly fine on all my local environments. My first assumption was some kind of firewall issue with Cloudflare or our load balancer, but that was quickly ruled out: during the stress test, if I log into the VM and run a simple curl localhost, the application does not respond, so nothing is blocking the requests. We also have an external Redis instance running on GCP, but that shouldn't be an issue since the test call doesn't even touch the cache.

This is a sample from my current local machine. The same results were achieved (with lower performance) on an M1 laptop. The hanging issue only occurs on the VM.

[screenshot: local machine stress test sample]

VM memory while non-responsive (I can still log in via SSH without any issues, and even get a shell in the container):

               total        used        free      shared  buff/cache   available
Mem:            1982        1165         317           2         499         673
Swap:              0           0           0
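Since a shell in the container still works while the app hangs, one way to see where the worker is stuck is to ask it for its thread stacks. A minimal sketch, assuming it is added to the app module, that gunicorn runs without --preload, and that SIGUSR2 is otherwise unused (gunicorn workers reset it to the default handler):

# On SIGUSR2, dump the stack of every thread to stderr, which ends up
# in gunicorn's error log (--error-logfile -). Trigger with:
#   kill -USR2 <worker-pid>
import faulthandler
import signal

faulthandler.register(signal.SIGUSR2, all_threads=True)

If all request threads turn out to be parked in the same C call (e.g. inside grpcio), that would support the misbehaving-C-module theory.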

@dantebarba
Author

Update: I switched back to the Flask development server and ran a couple of stress tests; aside from a rate-limit ban, I didn't have any request or performance issues.

Fun fact: since the Flask development server processed as much as 3 times more requests than gunicorn, it made the load balancer's rate limiter kick in.
