k3s database size grows due to slow compact process #10626
-
This indicates that some other server node is compacting and has moved the compact revision forward since this node last checked. Examine the logs on your other nodes for log messages about compaction.
That is not what K3s does. Kine follows the same behavior as the Kubernetes apiserver itself and attempts to compact up to the revision from 5 minutes in the past. Every 5 minutes, it compacts (in chunks of 1000 rows) up to the max revision (the newest row) from 5 minutes ago. The target compact revision is also capped to be at least 1000 revisions back from the current revision. If the compact transaction times out, it is definitely possible for things to spiral. Check the logs on your nodes to see how long compaction is taking. You may want to temporarily shut down all of your cluster members except for one server and allow it to catch up with compaction. Once that is done, add additional nodes and workloads, and monitor compaction to ensure that it is keeping up. If it is not, you should consider adding additional capacity to your SQL server.
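For illustration, here is a rough sketch in Python of the targeting logic described above. It is not kine's actual (Go) implementation; the constant and function names are mine, and the only behavior assumed is what is stated in this reply: compact up to the revision from 5 minutes ago, never closer than 1000 revisions behind the head, in chunks of 1000 rows.

```python
# Rough sketch of the compaction targeting described above -- illustration only,
# not kine's actual code. Names and structure are assumptions.

COMPACT_INTERVAL_SECONDS = 300   # compaction runs every 5 minutes
MIN_RETAINED_REVISIONS = 1000    # never compact closer than 1000 revisions behind head
BATCH_SIZE = 1000                # rows deleted per compaction transaction

def target_compact_revision(current_rev: int, rev_five_minutes_ago: int) -> int:
    """Compact up to the newest revision from ~5 minutes ago,
    capped at (current revision - 1000)."""
    return min(rev_five_minutes_ago, current_rev - MIN_RETAINED_REVISIONS)

def batches_for_pass(last_compacted_rev: int, target_rev: int) -> int:
    """Number of 1000-row chunks a single compaction pass has to work through."""
    pending = max(0, target_rev - last_compacted_rev)
    return -(-pending // BATCH_SIZE)  # ceiling division
```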
Note that we do not support multi-master databases that use an […]
-
Environmental Info:
K3s Version:
k3s version v1.28.11+k3s2 (d076d9a)
go version go1.21.11
Node(s) CPU architecture, OS, and Version:
Linux master01 5.15.0-113-generic #123-Ubuntu SMP Mon Jun 10 08:16:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
6 master nodes across 2 DCs, with a 4-node MariaDB Galera cluster as the datastore backend
Describe the bug:
When k3s is installed in a configuration with a database backend, the database size grows quickly even on a relatively "calm" cluster.
Steps To Reproduce:
/usr/local/bin/k3s server --kube-apiserver-arg oidc-username-claim=name --kube-apiserver-arg oidc-groups-claim=groups --kube-apiserver-arg oidc-client-id=kubernetes --kube-apiserver-arg oidc-issuer-url=https://auth.domain.com/auth/realms/test --disable local-storage --disable coredns --disable traefik --kubelet-arg kube-reserved=cpu=250m,memory=512Mi --kubelet-arg kube-reserved=cpu=250m,memory=512Mi --kubelet-arg system-reserved=cpu=250m,memory=512Mi --tls-san cluster-test.domain.com --datastore-endpoint=mysql://user:password@tcp(database.hostname:3306)/test_db
Expected behavior:
Database size stays proportional to the running workloads.
Actual behavior:
Database size grows to 100+ GB.
Additional context / logs:
I've taken a look at similar reports in other issues and done some research. It looks like the compaction process cannot handle the load the cluster creates while running, even under light load, and is constantly falling behind.
Test cluster, no actual workloads, only infra stuff. Test query:
Result after just 1 day of running:
Out of these, 226 884 rows have "deleted = 0" and 349 493 have "deleted = 1".
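(The query itself isn't captured above; for reference, a query along these lines would produce the same breakdown by the deleted flag, assuming the default kine table name. The connection details and the pymysql dependency are assumptions on my part, not part of the original report.)

```python
# Hypothetical reproduction of the row breakdown above. Connection details and
# the pymysql dependency are assumptions, not taken from the original report.
import pymysql

conn = pymysql.connect(host="database.hostname", user="user",
                       password="password", database="test_db")
try:
    with conn.cursor() as cur:
        # Count rows in the kine table grouped by the "deleted" flag.
        cur.execute("SELECT deleted, COUNT(*) FROM kine GROUP BY deleted")
        for deleted, count in cur.fetchall():
            print(f"deleted = {deleted}: {count} rows")
finally:
    conn.close()
```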
Compact log message:
Jul 31 20:36:11 multi-01 k3s[31503]: time="2024-07-31T20:36:11Z" level=info msg="COMPACT compact revision changed since last iteration: 1143674 => 1145674"
It seems that it is deleting just 2000 records every 5 minutes, so just to catch up it would need (349 493 / 2000) × 5 ≈ 874 minutes, which doesn't seem realistic.
I assume the number of compacted records should be in proportion to the number of "deleted = 1" records.
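As a back-of-the-envelope check of that estimate (purely illustrative; the 2000-rows-per-interval figure comes from the observed compaction progress in the log message above):

```python
# Back-of-the-envelope check of the catch-up estimate above.
rows_pending = 349_493       # rows with deleted = 1 after one day
rows_per_interval = 2_000    # observed compaction progress per 5-minute run
interval_minutes = 5

catch_up = rows_pending / rows_per_interval * interval_minutes
print(f"~{catch_up:.0f} minutes (~{catch_up / 60:.1f} hours) just to catch up")
# -> ~874 minutes (~14.6 hours)
```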
Records breakdown query:
Result:
Gap count query:
Result: 353953
And this is just a test cluster with no application load. On the production cluster, here are the stats from a 20 GB database for the queries above:
Additional query:
Result:
And the k3s service log is full of messages like:
Thus it is clear that the compaction / old-data deletion process is not working properly and should be improved. Even if I were to temporarily shut down all nodes and do a manual kine table cleanup, the database would quickly fill up again.