
Measure and improve DB/API performance #139

Open · hardbyte opened this issue Aug 30, 2022 · 3 comments
@hardbyte (Member)

I set up a locust script that simulates heavy usage so we can measure, and ideally improve, our API/DB robustness.

I configured locust to spawn users at a rate of 0.05 users per second and to just keep adding them (up to a thousand or so). I stopped each experiment when the backend locked up or when the failure rate reached 5%.
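For context, here is a minimal sketch of the kind of locustfile used; the endpoints, task weights, and wait times are assumptions based on the paths mentioned later in this issue, and authentication is omitted:

from locust import HttpUser, between, task


class ApiUser(HttpUser):
    # Simulated users pause 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task(3)
    def recommend(self):
        self.client.get("/recommend?limit=5")

    @task(1)
    def me(self):
        self.client.get("/auth/me")

This gets started with something like locust -f locustfile.py -u 1000 -r 0.05 --host https://<dev-service-url>, matching the spawn rate and user cap described above.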

Baseline 1

I established a baseline by measuring performance on Cloud Run (the development deployment of PR #138). That configuration limits each container to 20 database connections, and the Cloud Run revision limits concurrency to 25 requests per container and caps the service at 3 containers. The dev database supports 50 connections, so in theory two containers' worth of traffic should be "fine", but somewhere after the third container is added we should start seeing errors about running out of database connections.
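For reference, a rough sketch of how that per-container cap could map onto SQLAlchemy's pool settings; the connection URL and the exact pool_size/max_overflow split are assumptions, not the app's actual configuration:

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:pass@/dbname?host=/cloudsql/<instance>",  # hypothetical URL
    pool_size=10,      # persistent connections kept in the pool
    max_overflow=10,   # burst connections on top of the pool (pool_size + max_overflow = 20 total)
    pool_timeout=120,  # how long a request waits for a free connection; the QueuePool errors later in this thread report 120.00s
)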

As with all experiments, we use a first-generation Cloud Run service with 1 CPU and 512 MiB of memory.

Hypothesis: this will hopefully break in a way similar to our production environment; it should work fine for low numbers of concurrent requests, then fall over.

Notes on baseline experiment 1

  • Performance is fine all the way up to 35 users, at about 23 requests per second, before the first server 500 occurs.
  • I did see a few sqlalchemy.exc.InterfaceError: (psycopg2.InterfaceError) connection already closed errors.
  • Cloud SQL System insights shows that many connections are in state "idle in transaction", but we didn't exceed the database's connection limit (a query to check this directly is sketched after this list).
  • It appears that only one container was active, with one idle.
  • /recommend?limit=5 isn't too bad: about 150ms median response time, 390ms at the 99th percentile.
  • /auth/me weighs a whopping 65 KiB; trimming that would likely help.
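The connection-state check mentioned above, as a minimal sketch; the connection URL is hypothetical, and this is just the standard pg_stat_activity view that Cloud SQL System insights also surfaces:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@host/dbname")  # hypothetical URL

with engine.connect() as conn:
    # Count connections by state: active, idle, "idle in transaction", etc.
    rows = conn.execute(
        text("SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC")
    )
    for state, count in rows:
        print(state, count)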


Baseline 2

Next we will configure the deployment to allow only 10 DB connections per container, and allow up to 4 containers.

Hypothesis: based on the container scaling observed in baseline 1, I expect even less throughput, and I expect to see timeouts rather than database connection errors.

Notes on baseline experiment 2

  • Similar performance up to 14 users, at about 10 requests per second, before responses simply stop coming.
  • Similar connection picture in Cloud SQL: many connections idle and "idle in transaction" (reached ~38).


Baseline 3

Next, similar to baseline 2, we configure the deployment to allow only 10 DB connections per container and allow up to 4 containers, but this time we lower the Cloud Run concurrency to 5, so we should see all 4 containers spin up.

Hypothesis: probably the highest throughput of the baselines. I hope to see timeouts rather than database connection errors.

Notes on baseline experiment 3


  • Worse than baseline 1: got to 20 users, at about 10 requests per second, before starting to fail with 429 Client Error: Too Many Requests and server 500 errors.
  • Logs had sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 5 reached, connection timed out, timeout 120.00.
  • Saw a few more InterfaceError: (psycopg2.InterfaceError) connection already closed exceptions (raised in the get_session finally clause).

Improvements

  • We can probably catch and safely ignore the sqlalchemy.exc.InterfaceError: (psycopg2.InterfaceError) connection already closed (see the sketch after this list).
  • We should use the baseline 1 config for now!
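A minimal sketch of what ignoring that error could look like, assuming a generator-style get_session dependency along the lines of the one referenced in the logs; the actual wiring in the app may differ:

from sqlalchemy.exc import InterfaceError
from sqlalchemy.orm import sessionmaker

SessionLocal = sessionmaker()  # bound to the real engine in the app


def get_session():
    session = SessionLocal()
    try:
        yield session
    finally:
        try:
            session.close()
        except InterfaceError:
            # The underlying psycopg2 connection is already gone; nothing left to clean up.
            pass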

With these baselines established, we can try to improve the default settings of the production deployment, and we have a good starting point for measuring the effectiveness of particular optimizations.

@hardbyte (Member, Author)

I reran baseline 1 after making the change to ignore already closed connections and saw almost no errors up to 80 users.

However, the errors that did start appearing at that point were very interesting...

...
File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 160, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/app/api/works.py", line 71, in get_works
    session.execute(
...
sqlalchemy.exc.InterfaceError: (psycopg2.InterfaceError) connection already closed
[SQL: SELECT works.id, works.type, works.title, works.subtitle, works.leading_article, works.info 
FROM works 
WHERE works.type = %(type_1)s AND (EXISTS (SELECT 1 
FROM editions 
WHERE works.id = editions.work_id)) ORDER BY works.id ASC 
 LIMIT %(param_1)s OFFSET %(param_2)s]
[parameters: [{}]]
(Background on this error at: https://sqlalche.me/e/14/rvf5)", "level":"error", "logger":"app.main", "request":"URL('http://pr-138---wriveted-api-development-lg5ntws4da-ts.a.run.app/v1/works')", "request_id":"c338ce74-c453-4a4e-ad70-d1a57a45437f", "request_path":"/v1/works"}

This is just a normal endpoint failing because the connection is closed. I'm suspicious of two things:

  • the interaction between async and SQLAlchemy
  • the pydantic serialization of the response data, which relies on orm_mode and a live session, and which possibly runs after the injected session has already been closed.

This is worth further exploration...
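To make the second suspicion concrete, here is an illustrative sketch of the suspected failure mode and one way to rule it out; this is not the actual app code, and Work, WorkBrief, and get_session are hypothetical stand-ins. With response_model plus orm_mode, FastAPI/pydantic walk the ORM objects after the endpoint returns, so any lazy attribute access at that point can hit a session whose connection the dependency has already closed. Serializing eagerly while the session is still open avoids that:

from fastapi import APIRouter, Depends
from pydantic import BaseModel
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base, sessionmaker

Base = declarative_base()
engine = create_engine("postgresql+psycopg2://user:pass@host/dbname")  # hypothetical URL
SessionLocal = sessionmaker(bind=engine)


class Work(Base):
    __tablename__ = "works"
    id = Column(Integer, primary_key=True)
    title = Column(String)


class WorkBrief(BaseModel):
    id: int
    title: str

    class Config:
        orm_mode = True  # pydantic v1 style, as used at the time of this issue


def get_session():
    session = SessionLocal()
    try:
        yield session
    finally:
        session.close()


router = APIRouter()


@router.get("/works", response_model=list[WorkBrief])
async def get_works(session: Session = Depends(get_session)):
    works = session.execute(select(Work).limit(10)).scalars().all()
    # Convert to pydantic models before returning, i.e. while the session is still open,
    # instead of handing live ORM objects back for FastAPI to serialize later.
    return [WorkBrief.from_orm(w) for w in works]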

@hardbyte (Member, Author)

fastapi/fastapi#3205

@hardbyte (Member, Author)

New baseline (after changes from #140) using the development main branch deployment.

Run 1

Config:

concurrency: 20
max instances: 3
db connections: 10 + 10

Also running on second-generation Cloud Run instances.


Still seeing errors when acquiring a connection, so I don't think the underlying issue is completely solved:
sqlalchemy.exc.TimeoutError: QueuePool limit of size 10 overflow 10 reached, connection timed out, timeout 120.00

Run 2

Same config except:

db connections: 10 + 5


For this run I started locust spawning 2 users per second up to 50 users, left that running for a few minutes, then ramped up again, adding 0.2 users per second up to 75 users.
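That staged ramp could also be scripted instead of driven by hand from the locust UI; a sketch using locust's LoadTestShape, where the hold duration is an assumption:

from locust import LoadTestShape


class StagedRamp(LoadTestShape):
    stages = [
        # (end_time_seconds, target_users, spawn_rate)
        (25, 50, 2),      # ramp to 50 users at 2 users/s (~25 s)
        (325, 50, 2),     # hold at 50 users for ~5 minutes
        (450, 75, 0.2),   # then ramp to 75 users at 0.2 users/s
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end_time, users, rate in self.stages:
            if run_time < end_time:
                return users, rate
        return None  # stop the test once all stages are done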
