fix(api): optimise archived list query. Fixes #13295 #13566
Conversation
The current implementation appears to cause Postgres to unmarshal the workflow JSON payload for every record in the table. By adopting a subquery approach, we are able to optimise the query from a runtime of 11495.734 ms to 44.713 ms, effectively reducing the workload required from Postgres (although arguably Postgres should be smart enough to only perform that operation after filtering ;)). The data size is about 417481 rows in argo_archived_workflows and 1794624 rows in argo_archived_workflows_labels. This change is backward compatible and has been tested on our production env (using PostgreSQL).

Related issue: argoproj#13295

Query change example:

previous:

```sql
SELECT name, namespace, UID, phase, startedat, finishedat,
       coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
       coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
       coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::JSON)->'spec'->>'suspend' AS suspend,
       coalesce((workflow::JSON)->'status'->>'message', '') AS message,
       coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
       coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE (("clustername" = 'default' AND "instanceid" = '')
       AND "namespace" = 'argo-map'
       AND EXISTS (SELECT 1 FROM argo_archived_workflows_labels
                   WHERE clustername = argo_archived_workflows.clustername
                     AND UID = argo_archived_workflows.uid
                     AND name = 'workflows.argoproj.io/phase'
                     AND value = 'Succeeded')
       AND EXISTS (SELECT 1 FROM argo_archived_workflows_labels
                   WHERE clustername = argo_archived_workflows.clustername
                     AND UID = argo_archived_workflows.uid
                     AND name = 'workflows.argoproj.io/workflow-template'
                     AND value = 'mapping1-pipeline-template-with-nfs'))
ORDER BY "startedat" DESC
LIMIT 1;
```

now:

```sql
SELECT name, namespace, UID, phase, startedat, finishedat,
       coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
       coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
       coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::JSON)->'spec'->>'suspend' AS suspend,
       coalesce((workflow::JSON)->'status'->>'message', '') AS message,
       coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
       coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE "clustername" = 'default'
  AND UID IN (SELECT UID
              FROM "argo_archived_workflows"
              WHERE (("clustername" = 'default' AND "instanceid" = '')
                     AND "namespace" = 'argo-map'
                     AND EXISTS (SELECT 1 FROM argo_archived_workflows_labels
                                 WHERE clustername = argo_archived_workflows.clustername
                                   AND UID = argo_archived_workflows.uid
                                   AND name = 'workflows.argoproj.io/phase'
                                   AND value = 'Succeeded')
                     AND EXISTS (SELECT 1 FROM argo_archived_workflows_labels
                                 WHERE clustername = argo_archived_workflows.clustername
                                   AND UID = argo_archived_workflows.uid
                                   AND name = 'workflows.argoproj.io/workflow-template'
                                   AND value = 'mapping1-pipeline-template-with-nfs'))
              ORDER BY "startedat" DESC
              LIMIT 1);
```

Signed-off-by: Xiaofan Hu <[email protected]>
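For anyone wanting to reproduce the runtimes quoted above, a minimal sketch using Postgres's `EXPLAIN ANALYZE` (trimmed to a few columns for brevity; the filter values mirror the example above and should be swapped for your own namespace/labels):

```sql
-- Hedged sketch: wrap either the old or new query to compare actual
-- execution times and buffer usage on Postgres.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, namespace, uid, phase, startedat, finishedat
FROM argo_archived_workflows
WHERE clustername = 'default'
  AND uid IN (
      SELECT uid
      FROM argo_archived_workflows
      WHERE clustername = 'default'
        AND instanceid = ''
        AND namespace = 'argo-map'
      ORDER BY startedat DESC
      LIMIT 1
  );
```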
Interesting analysis, a bit different than in the issue.
Also agreed, I would think the DB would know to avoid this 😕 It looks like some E2E tests are failing, and one of them is getting a 500 on this API call -- can you check them?
Force-pushed from 95c4116 to b766093.
Hi @agilgur5, thanks for the feedback. I have pushed another fix for MySQL; it requires a little bit more handling. There is a test failure about (…)
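(An aside, not stated in the thread: one plausible reason MySQL needs extra handling is that it rejects `LIMIT` directly inside an `IN (...)` subquery, so the limited subquery has to be wrapped in a derived table. A hypothetical sketch of that workaround:)

```sql
-- MySQL fails with "This version of MySQL doesn't yet support
-- 'LIMIT & IN/ALL/ANY/SOME subquery'" if LIMIT appears directly in the
-- IN subquery, so wrap it in a derived table. Illustrative only; this
-- may not match the exact handling in the PR.
SELECT name, namespace, uid, phase, startedat, finishedat
FROM argo_archived_workflows
WHERE clustername = 'default'
  AND uid IN (
      SELECT uid FROM (
          SELECT uid
          FROM argo_archived_workflows
          WHERE clustername = 'default'
            AND instanceid = ''
            AND namespace = 'argo-map'
          ORDER BY startedat DESC
          LIMIT 1
      ) AS limited
  );
```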
Force-pushed from b766093 to 057aca6.
Hmm, it seems that a retry helps. :P
Sounds like a test flake then. Although it looks like it was with … This one I believe has flaked before, IIRC.
This mostly looks good to me but would like some folks to double-check cc @jiachengxu @terrytangyuan @Joibel @jessesuen
Also might be worthwhile to push up an image for users on the issue to test and confirm results
While this requires a good number of rows to test, I'm also thinking that we could potentially write a simplified test that checks whether the query returns under a certain amount of time on a test seed. For example, under 100ms.
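A rough sketch of what such a seed could look like (the schema here is assumed from the columns referenced in this PR's queries, not taken from the real migrations, so treat it as illustrative):

```sql
-- Hypothetical Postgres seed for a timing regression test: generate enough
-- synthetic rows that the archived-list query would regress visibly if the
-- JSON payload were unmarshalled for every row.
INSERT INTO argo_archived_workflows
    (uid, clustername, instanceid, name, namespace, phase,
     startedat, finishedat, workflow)
SELECT
    gen_random_uuid()::text,                 -- uid
    'default',                               -- clustername
    '',                                      -- instanceid
    'wf-' || i,                              -- name
    'argo-map',                              -- namespace
    'Succeeded',                             -- phase
    now() - (i || ' seconds')::interval,     -- startedat, spread out over time
    now(),                                   -- finishedat
    '{"metadata":{"labels":{}},"spec":{},"status":{"phase":"Succeeded"}}'
FROM generate_series(1, 100000) AS i;
-- The test would then assert that the list query completes in, say, < 100 ms.
```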
I can test it out. We have a large non-production deployment we can test it in.
@agilgur5 yeah, it sounds quite reasonable and would be helpful in the long run. But unfortunately, I didn't spot a simple way to add a new regression test fast enough, and my recent work schedule is quite tight. Could we add a new issue and link it to this PR for now? I might be able to make the contribution a few weeks later.
The optimized query is introduced in 9856df7. Signed-off-by: Xiaofan Hu <[email protected]>
Force-pushed from 14d07ed to 91bf3a6.
Yea that'd be fine so long as we get some more manual tests of this. I'm loading up my dev env right now to push up an image
Provide me an image to use and I will test it.
Pushed up …
The change looks good.
Force-pushed from 5ec0599 to 66717c8.
Force-pushed from 66717c8 to 367f21c.
Force-pushed from 367f21c to f6db948.
Related to 9856df7. Signed-off-by: Xiaofan Hu <[email protected]>
Force-pushed from f6db948 to 9f39a95.
I can confirm this change reduces query times. Loading time in the UI when filtering on labels was reduced by 94.44%. Query time before: 18 seconds.
Thanks for the testing @ryancurrah!
With that, some of the individual query tests folks made across both MySQL and Postgres, and 2 other approvals, I think this is good to go 🚀
Note that this query seems to be further optimizable per #13295 (comment), so we may want to open a new issue to add further optimizations, but this at least fixes the performance issues and gets us to a much more reasonable baseline
If you need someone to test any more query optimizations feel free to ping me.
Added a follow-up issue for further optimizations to this query in #13601