fix: further optimize archive workflow listing. Fixes #13601 (#13819)
Conversation
Signed-off-by: Mason Malone <[email protected]>
@@ -39,6 +39,7 @@ E2E_PARALLEL ?= 20
 E2E_SUITE_TIMEOUT ?= 15m
 GOTEST ?= go test -v -p 20
 ALL_BUILD_TAGS ?= api,cli,cron,executor,examples,corefunctional,functional,plugins
+BENCHMARK_COUNT ?= 6
This is so benchstat gives confidence intervals. Without it, benchstat shows the message `¹ need >= 6 samples for confidence interval at level 0.95`.
ansiSQLChange(`drop index argo_archived_workflows_i4 on argo_archived_workflows`),
ansiSQLChange(`drop index argo_archived_workflows_i4`),
),
ansiSQLChange(`create index argo_archived_workflows_i4 on argo_archived_workflows (clustername, startedat)`),
How quickly does this run? Quickly enough to include in a patch version, i.e. to cherry-pick into 3.5?
Very quickly. Tested after populating my DB with 100,000 rows, all associated with the same `clustername`:
MySQL:
$ time go run ./hack/db migrate -d 'mysql:password@tcp/argo'
INFO[0000] Migrating database schema clusterName=default dbType=mysql
INFO[0000] applying database change change="drop index argo_archived_workflows_i4 on argo_archived_workflows" changeSchemaVersion=61
INFO[0000] applying database change change="create index argo_archived_workflows_i4 on argo_archived_workflows (clustername, startedat)" changeSchemaVersion=62
2024/10/26 02:24:13 Session ID: 00001
Query: create index argo_archived_workflows_i4 on argo_archived_workflows (clustername, startedat)
Stack:
fmt.(*pp).handleMethods@/usr/local/go/src/fmt/print.go:673
fmt.(*pp).printArg@/usr/local/go/src/fmt/print.go:756
fmt.(*pp).doPrint@/usr/local/go/src/fmt/print.go:1208
fmt.Append@/usr/local/go/src/fmt/print.go:289
log.(*Logger).Print.func1@/usr/local/go/src/log/log.go:261
log.(*Logger).output@/usr/local/go/src/log/log.go:238
log.(*Logger).Print@/usr/local/go/src/log/log.go:260
github.com/argoproj/argo-workflows/v3/persist/sqldb.ansiSQLChange.apply@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/ansi_sql_change.go:11
github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.applyChange.func1@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:301
github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.applyChange@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:290
github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.Exec@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:279
main.NewMigrateCommand.func1@/home/vscode/go/src/github.com/argoproj/argo-workflows/hack/db/main.go:50
github.com/spf13/cobra.(*Command).execute@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:985
github.com/spf13/cobra.(*Command).ExecuteC@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117
github.com/spf13/cobra.(*Command).Execute@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041
main.main@/home/vscode/go/src/github.com/argoproj/argo-workflows/hack/db/main.go:39
runtime.main@/usr/local/go/src/runtime/proc.go:272
runtime.goexit@/usr/local/go/src/runtime/asm_amd64.s:1700
Rows affected: 0
Last insert ID: 0
Error: upper: slow query
Time taken: 0.95033s
Context: context.Background
real 0m1.758s
user 0m1.074s
sys 0m0.294s
PostgreSQL:
$ time go run ./hack/db migrate
INFO[0000] Migrating database schema clusterName=default dbType=postgres
INFO[0000] applying database change change="drop index argo_archived_workflows_i4" changeSchemaVersion=61
INFO[0000] applying database change change="create index argo_archived_workflows_i4 on argo_archived_workflows (clustername, startedat)" changeSchemaVersion=62
real 0m0.868s
user 0m1.117s
sys 0m0.271s
Dope, great work again!
Huge thanks for deep-diving & analyzing this, describing it, and testing it!
Fixes #13601
Motivation
Listing archived workflows can be slow if you have a very large number of workflows (100,000+) or the average workflow size is high (~100KB), even after the optimizations from #13566 and #13779. This PR makes some additional optimizations that speed up the queries by ~90% on MySQL and ~50% on PostgreSQL.
Modifications
The bottleneck for these queries depended on whether you use MySQL or PostgreSQL, each of which required a different fix.

For PostgreSQL, the bottleneck was detoasting overhead, as explained in #13601 (comment). The fix was to use a common table expression to reduce the number of times `workflow` needs to be detoasted, as suggested by @kodieg in #13601 (comment). The new query looks like this:
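A minimal sketch of the CTE shape, assuming illustrative column names, filter values, and page size (the exact query in this PR may differ):

```sql
-- Sketch only: columns, the clustername value, and LIMIT are illustrative.
-- The CTE pages through the cheap metadata columns first, so the large
-- `workflow` column is detoasted only for the rows on the requested page.
WITH page AS (
    SELECT uid
    FROM argo_archived_workflows
    WHERE clustername = 'default'
    ORDER BY startedat DESC
    LIMIT 100
)
SELECT w.uid, w.name, w.namespace, w.phase, w.workflow
FROM argo_archived_workflows w
JOIN page ON page.uid = w.uid
ORDER BY w.startedat DESC;
```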
For MySQL, the bottleneck was the optimizer inexplicably refusing to use the `argo_archived_workflows_i4` index and instead using the primary key, which is much more expensive. As explained by @Danny5487401 in #13563 (comment), two ways of solving that are using `FORCE INDEX` or adding a union index on `(clustername, startedat)`. Using `FORCE INDEX` is slightly hacky, and adding a new index is needlessly wasteful when we already have `argo_archived_workflows_i4`, so I opted to modify that index to cover `(clustername, startedat)`. The new query looks like this:
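A minimal sketch under the same illustrative assumptions; with `argo_archived_workflows_i4` covering `(clustername, startedat)`, the optimizer can satisfy both the filter and the sort from the index without a hint:

```sql
-- Sketch only: columns, the clustername value, and LIMIT are illustrative.
-- The composite index matches both the WHERE clause and the ORDER BY, so
-- the optimizer no longer falls back to scanning the primary key.
SELECT uid, name, namespace, phase, startedat, finishedat
FROM argo_archived_workflows
WHERE clustername = 'default'
ORDER BY startedat DESC
LIMIT 100;

-- The rejected FORCE INDEX alternative would have pinned the index by hand:
--   SELECT ... FROM argo_archived_workflows
--   FORCE INDEX (argo_archived_workflows_i4)
--   WHERE clustername = 'default' ORDER BY startedat DESC LIMIT 100;
```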
Verification

First, I used #13715 to generate 100,000 randomized workflows, with https://gist.github.com/MasonM/52932ff6644c3c0ccea9e847780bfd90 as a template:
time go run ./hack/db fake-archived-workflows --template "@very-large-workflow.yaml" --rows 100000
time go run ./hack/db fake-archived-workflows --template "@very-large-workflow.yaml" --rows 100000 -d 'mysql:password@tcp/argo'
Then, I ran `make BenchmarkWorkflowArchive` once on the `main` branch and once on this branch (with the migration applied), and used benchstat to compare: