Measure and improve DB/API performance #139
I reran baseline 1 after making the change to ignore already closed connections and saw almost no errors up to 80 users. However, the errors that did start at that point were very interesting...
This is just a normal endpoint now failing, saying the connection is closed. I'm suspicious of two things:
This is worth further exploration...
New baseline (after changes from #140) using the development main branch deployment.

Run 1
Config:
Also running on second-generation Cloud Run instances.
Errors on getting a connection, so I don't think the underlying issue is solved completely.

Run 2
Same config except:
Saw:
I started Locust spawning 2 users per second up to 50 users, left that for a few minutes, then ramped up, adding 0.2 users per second up to 75 users.
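That ramp could also be scripted so it's repeatable. Below is a rough sketch using Locust's `LoadTestShape`; the user counts and spawn rates follow the description above, but the stage durations are assumptions.

```python
# Sketch of the two-stage ramp as a Locust LoadTestShape (stage durations are
# guesses; user counts and spawn rates follow the description above).
from locust import LoadTestShape


class TwoStageRamp(LoadTestShape):
    # (stage end time in seconds, target users, spawn rate in users/second)
    stages = [
        (300, 50, 2.0),   # ramp to 50 users at 2 users/s, then hold a few minutes
        (1800, 75, 0.2),  # then creep from 50 up to 75 users at 0.2 users/s
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end_time, users, spawn_rate in self.stages:
            if run_time < end_time:
                return users, spawn_rate
        return None  # stop the test after the last stage
```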
I set up a Locust script that simulates heavy usage to measure and, ideally, improve our API/DB robustness.
I configured Locust to create users at a rate of 0.05 users per second, and to just keep adding users (up to a thousand or so). I stopped the experiment when the backend locked up, or when there were 5% failures.
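For context, a minimal sketch of the kind of locustfile this describes is below. The endpoints come from the notes further down, but the task weights and wait times are assumptions, not the project's actual script.

```python
# locustfile.py -- a minimal sketch, not the project's actual script.
from locust import HttpUser, between, task


class ApiUser(HttpUser):
    wait_time = between(1, 5)  # seconds each simulated user idles between requests

    @task(3)
    def recommend(self):
        self.client.get("/recommend?limit=5")

    @task(1)
    def me(self):
        self.client.get("/auth/me")
```

Run headless against the dev deployment with something like `locust -f locustfile.py --headless --users 1000 --spawn-rate 0.05 --host https://<dev-deployment-url>`; the 0.05 spawn rate matches the ramp described above.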
Baseline 1
I established a baseline measuring performance on Cloud Run (development deployment of PR #138). That limits the database connections each container opens to 20, and the Cloud Run revision limits the concurrency to 25 and the number of containers to 3. The dev database supports 50 connections, so in theory two containers' worth of traffic should be "fine", but somewhere after the third is added we should start seeing "not enough connections" errors.
As with all experiments, we use a first-generation Cloud Run service with 1 CPU and 512 MiB of memory.
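For reference, a sketch of how that per-container cap might look in the SQLAlchemy engine configuration. The parameter values below are illustrative and the connection URL is a placeholder; `pool_size + max_overflow` is what bounds the connections a single container can open, and may be split differently in the repo.

```python
from sqlalchemy import create_engine

# Illustrative values only: pool_size + max_overflow caps the connections a
# single container can open (20 here), so two containers fit inside the dev
# database's 50-connection limit but a third pushes it over.
engine = create_engine(
    "postgresql+psycopg2://user:password@host/dbname",  # placeholder URL
    pool_size=10,       # connections kept open in the pool
    max_overflow=10,    # extra connections allowed under load
    pool_timeout=120,   # seconds to wait for a free connection before TimeoutError
)
```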
Hypothesis: This will hopefully break somewhat similarly to our production environment: it should work OK for low numbers of concurrent requests, then break.
Notes on baseline experiment 1
- We see `sqlalchemy.exc.InterfaceError: (psycopg2.InterfaceError) connection already closed` errors.
- `/recommend?limit=5` isn't too bad: about 150 ms median response time, 390 ms at the 99th percentile.
- `/auth/me` weighs a whopping 65 KiB; trimming that would likely help.

Baseline 2
Next, we will configure the deployment to allow only 10 DB connections for each container, and allow 4 containers.
Hypothesis: Based on the container scaling in baseline 1, I expect we'll get even less throughput, and I expect to see timeouts rather than database connection errors.
Notes on baseline experiment 2
Baseline 3
Next, similar to baseline 2, we will configure the deployment to allow only 10 DB connections for each container and allow 4 containers, but this time we lower the concurrency in Google Cloud Run to 5, so we should see all 4 containers spin up.
Hypothesis: Probably the highest throughput for the baselines. I hope to see timeouts rather than database connection errors.
Notes on baseline experiment 3
- We see `429 Client Error: Too Many Requests` and server 500 errors.
- `sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 5 reached, connection timed out, timeout 120.00` errors.
- `InterfaceError: (psycopg2.InterfaceError) connection already closed` exceptions (in the get_session finally clause).

Improvements
- Handle the `sqlalchemy.exc.InterfaceError: (psycopg2.InterfaceError) connection already closed` errors (see the sketch below).
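A minimal sketch of one way to make the session clean-up tolerant of already-closed connections, assuming a FastAPI-style `get_session` dependency built on a `sessionmaker`. The names and the connection URL here are illustrative, not necessarily the ones in the repo.

```python
# Sketch: ignore "connection already closed" while closing the session, so an
# otherwise-successful request doesn't fail during clean-up.
from contextlib import suppress

from sqlalchemy import create_engine
from sqlalchemy.exc import InterfaceError
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql+psycopg2://user:password@host/dbname")  # placeholder URL
SessionLocal = sessionmaker(bind=engine)


def get_session():
    session = SessionLocal()
    try:
        yield session
    finally:
        # If the server side has already dropped the connection, close() can
        # itself raise InterfaceError ("connection already closed").
        with suppress(InterfaceError):
            session.close()
```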
With these baselines established, we can try to improve the default settings of the production deployment, and we have a good starting point for measuring the effectiveness of particular optimizations.