test_layer_map performance test failing #10409

Closed
jcsp opened this issue Jan 15, 2025 · 7 comments · Fixed by #10676
Assignees
erikgrinaker
Labels
a/test Area: related to testing c/storage/pageserver Component: storage: pageserver

Comments

@jcsp
Collaborator

jcsp commented Jan 15, 2025

https://neon-github-public-dev.s3.amazonaws.com/reports/main/12791644069/index.html#suites/b8bb9797b578168beaf6275d4a5cca96/853734a845c3913b/

test_runner/performance/test_layer_map.py:40: in test_layer_map
    cur.execute("SELECT count(*) from t")
E   psycopg2.errors.QueryCanceled: canceling statement due to statement timeout

This reproduces locally.

@jcsp jcsp added a/test Area: related to testing c/storage/pageserver Component: storage: pageserver labels Jan 15, 2025
@arssher
Contributor

arssher commented Jan 29, 2025

@VladLazar @problame The test failed because `SELECT count(*) from t` couldn't complete within the 2m statement timeout (presumably; adding logging in #10554 should make this more visible). On my dev box in release mode, `vacuum t` in the test takes about 1m 15s. I don't know the test's expectations; perhaps this is fine and we should just increase statement_timeout?
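
(For illustration, a minimal psycopg2 sketch of the proposed statement_timeout bump; the DSN and the 5-minute value are assumptions, not the test's actual settings:)

```python
import psycopg2

# Illustrative sketch only: raise the per-statement timeout before running the
# slow statements. The DSN and the 5min value are placeholders, not the test's
# actual configuration.
conn = psycopg2.connect("dbname=postgres")  # placeholder DSN
conn.autocommit = True                      # VACUUM can't run inside a transaction block
with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '5min'")
    cur.execute("VACUUM t")
    cur.execute("SELECT count(*) FROM t")
    print(cur.fetchone()[0])
```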

github-merge-queue bot pushed a commit that referenced this issue Jan 29, 2025
## Problem

test_layer_map doesn't log statements and it is not clear how long they
take.

## Summary of changes

Do log them.

ref #10409
@VladLazar
Contributor

> @VladLazar @problame The test failed because `SELECT count(*) from t` couldn't complete within the 2m statement timeout (presumably; adding logging in #10554 should make this more visible). On my dev box in release mode, `vacuum t` in the test takes about 1m 15s. I don't know the test's expectations; perhaps this is fine and we should just increase statement_timeout?

Before extending the timeout, I'd like to understand why/when it became slower. Since it's in the perf tree we can't blame the job runner. Identifying when it started failing and looking at the commits around that time might be interesting.

@arssher
Contributor

arssher commented Feb 3, 2025

History of failures:
https://neonprod.grafana.net/d/fddp4rvg7k2dcf/neon-regression-test-suite-flaky-tests?orgId=1&var-test_name=test_layer_map%5Bgithub-actions-selfhosted%5D&from=now-30d&to=now&timezone=utc&var-restrict&var-max_count=100&var-reference=$__all&var-ignore_reference=refs%2Fpull%2F0000%2Fmerge
(screenshot: failure history graph)

The only directly related commit seems to be:
commit fcbe9fb
Author: Vlad Lazar [email protected]
Date: Tue Feb 20 19:42:54 2024 +0000

test: adjust checkpoint distance in `test_layer_map` (#6842)

But I don't think this is very useful.

winter-loo pushed a commit to winter-loo/neon that referenced this issue Feb 4, 2025
## Problem

test_layer_map doesn't log statements and it is not clear how long they
take.

## Summary of changes

Do log them.

ref neondatabase#10409
@arssher
Contributor

arssher commented Feb 5, 2025

OK, the graph above was wrong; I've asked for it to be fixed. Here is another one.

It looks like the test started failing on Jan 14.

@arssher
Contributor

arssher commented Feb 5, 2025

I bisected, and:

(non-package-mode-py3.12) arseny@neon-hetzner-dev-arseny ~/neon $ git bisect good
ffaa52ff5d3f36becaf70be2b6053c7423deba61 is the first bad commit
commit ffaa52ff5d3f36becaf70be2b6053c7423deba61
Author: Erik Grinaker <[email protected]>
Date:   Tue Jan 14 17:31:59 2025 +0100

    pageserver: reorder upload queue when possible (#10218)

Without it, timings in #10670 look like:

test_layer_map[release-pg16].vacuum t: 21.810 s
test_layer_map[release-pg16].SELECT count(*) from t: 2.311 s
test_layer_map[release-pg16].flush_ep_to_pageserver: 0.111 s
test_layer_map[release-pg16].timeline_checkpoint: 16.061 s

With it:

test_layer_map[release-pg16].vacuum t: 62.777 s
test_layer_map[release-pg16].SELECT count(*) from t: 8.981 s
test_layer_map[release-pg16].flush_ep_to_pageserver: 0.110 s
test_layer_map[release-pg16].timeline_checkpoint: 25.211 s

It is also very variable: in CI I've seen select count(*) take 5s, 90s, and 2m+ (when the test fails).

@erikgrinaker would you like to take over from here?
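
(For reference, a bisection like this can be scripted with `git bisect run`; the sketch below is a hypothetical helper where the pytest invocation and the 30-minute budget are assumptions, not the exact commands used here:)

```python
#!/usr/bin/env python3
# Hypothetical helper for `git bisect run`: mark a commit "good" if the perf
# test passes within a generous wall-clock budget, "bad" otherwise. The pytest
# wrapper path and the 30-minute budget are assumptions.
import subprocess
import sys

cmd = ["./scripts/pytest", "test_runner/performance/test_layer_map.py"]
try:
    result = subprocess.run(cmd, timeout=1800)
except subprocess.TimeoutExpired:
    sys.exit(1)   # hung or far too slow -> bad commit
sys.exit(0 if result.returncode == 0 else 1)
```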

@erikgrinaker
Contributor

Yeah, I'll take this.

@erikgrinaker erikgrinaker assigned erikgrinaker and unassigned arssher Feb 5, 2025
@erikgrinaker
Contributor

erikgrinaker commented Feb 5, 2025

This is just because the test schedules a ton of tiny uploads, and the upload queue is quadratic in the number of operations.

queued=1533 inprogress=2562

In production, we limit the number of inprogress tasks to the remote concurrency limit to bound the quadratic cost, but the LocalFs backend doesn't have a concurrency limit and thus the quadratic queue slows everything down.

I'll submit a workaround.
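
(As a rough illustration of the quadratic behaviour, here is a toy cost model rather than the pageserver's actual scheduler: if each scheduling pass checks every queued operation against every in-progress operation for conflicts, a pass costs roughly queued × inprogress checks, and capping in-progress operations bounds one factor of that product.)

```python
# Toy cost model, not pageserver code: assume each scheduling pass compares
# every queued operation against every in-progress operation for conflicts,
# so one pass costs roughly queued * inprogress checks. Capping the number of
# in-progress operations bounds one factor of that product.
def pass_cost(queued: int, inprogress: int, cap: int | None = None) -> int:
    if cap is not None:
        inprogress = min(inprogress, cap)
    return queued * inprogress

# Counts taken from the log line quoted above.
print(pass_cost(1533, 2562))           # ~3.9M checks per pass, uncapped
print(pass_cost(1533, 2562, cap=100))  # ~153K checks per pass with a cap of 100
```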

github-merge-queue bot pushed a commit that referenced this issue Feb 5, 2025
## Problem

It is not very clear how much time different operations take.

## Summary of changes

Record more timings.

ref #10409
github-merge-queue bot pushed a commit that referenced this issue Feb 6, 2025
## Problem

The local filesystem backend for remote storage doesn't set a concurrency limit. While the backend can't/won't enforce such a limit itself, the limit also bounds the upload queue's concurrency. Some tests create thousands of uploads, which slows down the quadratic scheduling of the upload queue, and there is no point in spawning that many Tokio tasks.

Resolves #10409.

## Summary of changes

Set a concurrency limit of 100 for the LocalFS backend.

Before: `test_layer_map[release-pg17].test_query: 68.338 s`
After: `test_layer_map[release-pg17].test_query: 5.209 s`
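
(The idea behind the fix, sketched with a semaphore; this is a minimal Python illustration with made-up names, not the actual Rust pageserver change:)

```python
import asyncio

# Minimal sketch of the fix's idea, not the actual implementation: bound how
# many uploads may be in flight at once so the upload queue never has to
# schedule around thousands of concurrent operations.
UPLOAD_CONCURRENCY = 100  # mirrors the limit chosen for the LocalFs backend

async def upload(name: str, sem: asyncio.Semaphore) -> None:
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for the real upload I/O

async def main() -> None:
    sem = asyncio.Semaphore(UPLOAD_CONCURRENCY)
    await asyncio.gather(*(upload(f"layer-{i}", sem) for i in range(2000)))

asyncio.run(main())
```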