Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

icingadb process crashes with Error 1452: Cannot add or update a child row: a foreign key constraint fails #577

Closed
patrickesser opened this issue Apr 5, 2023 · 38 comments

Comments

@patrickesser
Copy link

The icingadb process crashes at irregular intervals with the following output:

Error 1452: Cannot add or update a child row: a foreign key constraint fails ("icingadb"."history", CONSTRAINT "fk_history_state_history" FOREIGN KEY ("state_history_id") REFERENCES "state_history" ("id") ON DELETE CASCADE)
can't perform "INSERT INTO "history" ("environment_id", "endpoint_id", "host_id", "service_id", "event_time", "object_type", "event_type", "id", "state_history_id") VALUES (:environment_id,:endpoint_id,:host_id,:service_id,:event_time,:object_type,:event_type,:id,:state_history_id) ON DUPLICATE KEY UPDATE "id" = VALUES("id")"
github.com/icinga/icingadb/internal.CantPerformQuery
github.com/icinga/icingadb/internal/internal.go:30
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1
github.com/icinga/icingadb/pkg/icingadb/db.go:394
github.com/icinga/icingadb/pkg/retry.WithBackoff
github.com/icinga/icingadb/pkg/retry/retry.go:45
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
github.com/icinga/icingadb/pkg/icingadb/db.go:389
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/[email protected]/errgroup/errgroup.go:57
runtime.goexit
runtime/asm_amd64.s:1594
can't retry
github.com/icinga/icingadb/pkg/retry.WithBackoff
github.com/icinga/icingadb/pkg/retry/retry.go:64
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
github.com/icinga/icingadb/pkg/icingadb/db.go:389
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/[email protected]/errgroup/errgroup.go:57
runtime.goexit
runtime/asm_amd64.s:1594

__

I can't find more detailed log or any posibillity to figure out which entry causes the crash.

Your Environment

Icinga DB version: v1.1.0

Build information:
Go version: go1.19.3 (linux, amd64)
Git commit: a0093d1

System information:
Platform: Debian GNU/Linux
Platform version: 11 (bullseye)

@Al2Klimov
Copy link
Member

@patrickesser Are there other kinds of errors/warnings in the log? Which magnitude of irregular intervals are we talking about? Btw., what are your typical state change intervals? And, btw., do the crashes correlate with state changes of particular objects? What's your DB type/version/cluster?

@julianbrost Your family name dominates in IDE annotation of history sync code. Any speculative idea what could happen here? As far as I see we execute bulks outside of transactions. Once committed(!) we inform via onSuccess that the Redis message may pass to the next stage which obviously hits the FK.

@julianbrost
Copy link
Contributor

Any speculative idea what could happen here?

Not really, that code should be written to prevents this, so I'm eager to know what is going wrong here.

Some more questions for @patrickesser:

  1. After this crash happened, did a restart suffice to get things running again? (If there was something wrong with a specific history entry, it should retry that entry and would fail again, so if that didn't happen it's probably our code or the database doing something unexpected here.)
  2. Do you run Icinga DB in a high-availability configuration? (Now I'm really getting into wild speculation: maybe if one connection updates a row, another connection might get a foreign key violation when referencing that specific row at the very time the other update is happening?)

@Al2Klimov
Copy link
Member

Btw. for such cases, shall we re-try (at least a limited amount of times) serialization failures?

@julianbrost
Copy link
Contributor

The database doesn't report a serialization failure here. I wouldn't retry here without any idea what's actually going on.

@Al2Klimov
Copy link
Member

Oh, sorry. I've mixed up GH issues. I meant FK failures of course.

@julianbrost
Copy link
Contributor

Sounds too much like "no idea what's going on, let's try this and hope for the best". If we knew that was some bug/limitation of the database, that could be a workaround, but not really the current state of information.

@NavidSassan
Copy link

we also see the following sql errors on a master-master setup, with a mariadb galera cluster as the database:

Error 1452: Cannot add or update a child row: a foreign key constraint fails ("icingadb"."user_notification_history", CONSTRAINT "fk_user_notification_history_notification_history">
	can't perform "INSERT INTO \"user_notification_history\" (\"id\", \"notification_history_id\", \"user_id\", \"environment_id\") VALUES (:id,:notification_history_id,:user_id,:envir>
	github.com/icinga/icingadb/internal.CantPerformQuery
			github.com/icinga/icingadb/internal/internal.go:30
	github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1
			github.com/icinga/icingadb/pkg/icingadb/db.go:394
	github.com/icinga/icingadb/pkg/retry.WithBackoff
			github.com/icinga/icingadb/pkg/retry/retry.go:45
	github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
			github.com/icinga/icingadb/pkg/icingadb/db.go:389
	golang.org/x/sync/errgroup.(*Group).Go.func1
			golang.org/x/[email protected]/errgroup/errgroup.go:57
	runtime.goexit
			runtime/asm_amd64.s:1594
	can't retry
	github.com/icinga/icingadb/pkg/retry.WithBackoff
			github.com/icinga/icingadb/pkg/retry/retry.go:64
	github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
			github.com/icinga/icingadb/pkg/icingadb/db.go:389
	golang.org/x/sync/errgroup.(*Group).Go.func1
			golang.org/x/[email protected]/errgroup/errgroup.go:57
	runtime.goexit
			runtime/asm_amd64.s:1594
Error 1452: Cannot add or update a child row: a foreign key constraint fails ("icingadb"."user_notification_history", CONSTRAINT "fk_user_notification_history_notification_history" F>
	can't perform "INSERT INTO \"user_notification_history\" (\"notification_history_id\", \"user_id\", \"environment_id\", \"id\") VALUES (:notification_history_id,:user_id,:environment>
	github.com/icinga/icingadb/internal.CantPerformQuery
		  github.com/icinga/icingadb/internal/internal.go:30
	github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1
		  github.com/icinga/icingadb/pkg/icingadb/db.go:394
	github.com/icinga/icingadb/pkg/retry.WithBackoff
		  github.com/icinga/icingadb/pkg/retry/retry.go:45
	github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
		  github.com/icinga/icingadb/pkg/icingadb/db.go:389
	golang.org/x/sync/errgroup.(*Group).Go.func1
		  golang.org/x/[email protected]/errgroup/errgroup.go:57
	runtime.goexit
		  runtime/asm_amd64.s:1594
	can't retry
	github.com/icinga/icingadb/pkg/retry.WithBackoff
		  github.com/icinga/icingadb/pkg/retry/retry.go:64
	github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
		  github.com/icinga/icingadb/pkg/icingadb/db.go:389
	golang.org/x/sync/errgroup.(*Group).Go.func1
		  golang.org/x/[email protected]/errgroup/errgroup.go:57
	runtime.goexit
		  runtime/asm_amd64.s:1594

how can we further troubleshoot this problem?

@julianbrost
Copy link
Contributor

What are the answers to the questions I asked in #577 (comment) in your setup?

  • After this crash happened, did a restart suffice to get things running again?
  • Do you run Icinga DB in a high-availability configuration?

@NavidSassan
Copy link

  1. yes, a restart was enough
  2. yes, there are two instances of redis and icingadb which are sharing one mariadb database. both icinga2 master have the icingadb feature enabled. the environment id is the same on both servers

@log1-c
Copy link

log1-c commented Jul 13, 2023

Jul 11 05:08:59 <hostname> icingadb[2250476]: pq: unexpected message 'E'; expected ReadyForQuery
                                               can't perform "INSERT INTO \"state_history\" (\"previous_soft_state\", \"output\", \"soft_state\", \"state_type\", \"previous_hard_state\", \"check_attempt\",>
                                               github.com/icinga/icingadb/internal.CantPerformQuery
                                                       github.com/icinga/icingadb/internal/internal.go:30
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:394
                                               github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                       github.com/icinga/icingadb/pkg/retry/retry.go:45
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:389
                                               golang.org/x/sync/errgroup.(*Group).Go.func1
                                                       golang.org/x/[email protected]/errgroup/errgroup.go:57
                                               runtime.goexit
                                                       runtime/asm_amd64.s:1594
                                               can't retry
                                               github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                       github.com/icinga/icingadb/pkg/retry/retry.go:64
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:389
                                               golang.org/x/sync/errgroup.(*Group).Go.func1
                                                       golang.org/x/[email protected]/errgroup/errgroup.go:57
                                               runtime.goexit
                                                       runtime/asm_amd64.s:1594
Jul 11 05:08:59 <hostname> systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Jul 11 05:08:59 <hostname> systemd[1]: icingadb.service: Failed with result 'exit-code'.

seeing a pretty similar error message.
setup:
2 icinga masters, both with icingadb installed/enabled
pgsql as database

crash happend on master1, on master2 it is still running.

a restart of the daemon was possible and it is running again.

Jul 13 11:06:09 <hostname> systemd[1]: Starting Icinga DB...
Jul 13 11:06:09 <hostname> icingadb[3569944]: Starting Icinga DB
Jul 13 11:06:09 <hostname> icingadb[3569944]: Connecting to database at 'localhost:5432'
Jul 13 11:06:09 <hostname> systemd[1]: Started Icinga DB.
Jul 13 11:06:09 <hostname> icingadb[3569944]: Connecting to Redis at 'localhost:6380'
Jul 13 11:06:09 <hostname> icingadb[3569944]: Starting history sync
Jul 13 11:06:10 <hostname> icingadb[3569944]: heartbeat: Received Icinga heartbeat
Jul 13 11:06:10 <hostname> icingadb[3569944]: high-availability: Another instance is active

@log1-c
Copy link

log1-c commented Jul 17, 2023

Just a follow up.
This weekend the daemon crashed on both masters. Again with the "INSERT_INTO" error message.
This time the second master was rebooted due to automatic updates at that time and then also failed (same message).

We are running Postgres as a backend for IcingaDB. Postgres is running as a Patroni cluster with 3 nodes. On the Icinga Masters we use pgbouncer and haproxy for the connection to the database.
pgbounder listens for the connections from icinga and haproxy is setup to present on of the servers on localhost to pgbouncer.

After some testing we experienced the following:

If a config deployment (or presumably any action that triggers a reload of the icinga2 service) happens at the time of a switch-over of the leading node, we get the "insert into" error:

Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597b190: icingadb/[email protected]:6432 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597acd0: icingadb/[email protected]:6432 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/[email protected]:6432 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 pooler error: server conn crashed?
Jul 17 15:42:50 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:34438 login attempt: db=icingadb user=icingadb tls=no
Jul 17 15:42:50 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/[email protected]:6432 new connection to server (from 127.0.0.1:43752)
Jul 17 15:42:52 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/[email protected]:6432 closing because: server conn crashed? (age=2s)
Jul 17 15:42:52 ma02 pgbouncer[875]: stats: 0 xacts/s, 1 queries/s, in 556 B/s, out 75 B/s, xact 37939 us, query 9738 us, wait 0 us
Jul 17 15:42:53 ma02 icingadb[205409]: pq: unexpected message 'E'; expected ReadyForQuery#012can't perform "INSERT INTO \"state_history\" (\"state_type\", \"check_source\", \"id\", \"previous_soft_state\", \"max_check_attempts\", \"previous_hard_state\", \"check_attempt\", \"scheduling_source\", \"endpoint_id\", \"host_id\", \"object_type\", \"service_id\", \"event_time\", \"soft_state\", \"hard_state\", \"output\", \"long_output\", \"environment_id\") VALUES (:state_type,:check_source,:id,:previous_soft_state,:max_check_attempts,:previous_hard_state,:check_attempt,:scheduling_source,:endpoint_id,:host_id,:object_type,:service_id,:event_time,:soft_state,:hard_state,:output,:long_output,:environment_id) ON CONFLICT ON CONSTRAINT pk_state_history DO UPDATE SET \"id\" = EXCLUDED.\"id\""#012github.com/icinga/icingadb/internal.CantPerformQuery#012#011github.com/icinga/icingadb/internal/internal.go:30#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:394#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/[email protected]/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/[email protected]/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Failed with result 'exit-code'.

Without a config deploy at switch-over time the error message changes to:

Jul 17 15:48:06 ma02 icingadb[348949]: pq: cannot use serializable mode in a hot standby#012can't start transaction#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:253#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA aborted#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:128#012sync.(*Once).doSlow#012#011sync/once.go:74#012sync.(*Once).Do#012#011sync/once.go:65#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:126#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:218#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA exited with an error#012main.run#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:335#012main.main#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:37#012runtime.main#012#011runtime/proc.go:250#012runtime.goexit#012#011runtime/asm_amd64.s:1594

Looks like the icingadb daemon doesn't like it if the db cluster isn't available for even a short time.

@julianbrost
Copy link
Contributor

@log1-c That's a different issue from what was originally reported here. Please open a new issue for that.

@Al2Klimov
Copy link
Member

We have several options:

  • Delay progress by X between stages such as state_history and history, so that the cluster has enough time to sync the rows
  • Re-try FK errors
  • Drop our FKs
  • Connect with the MySQL variable wsrep_sync_wait=4, catch the error 1193 unknown system variable and if it occurs, connect without it

@julianbrost
Copy link
Contributor

  • Delay progress by X between stages such as state_history and history, so that the cluster has enough time to sync the rows

Would there be some time that is guaranteed to be enough? (I doubt it, we're not doing real-time stuff here.)

  • Drop our FKs

That would break the history cleanup.

What about the idea/suggestion you mentioned at some point: pinning all inserts for the same history event to the same connection? (1. If we can do that, would that work reliably? 2. Can this be implemented feasibly with Go sql/sqlx?)

@Al2Klimov Al2Klimov added the needs-feedback We'll only proceed once we hear from you again label Aug 8, 2023
@Al2Klimov
Copy link
Member

What about the idea/suggestion you mentioned at some point: pinning all inserts for the same history event to the same connection? (1. If we can do that, would that work reliably? 2. Can this be implemented feasibly with Go sql/sqlx?)

It would break re-trials of broken connections if not done via transactions.

@julianbrost
Copy link
Contributor

It would break re-trials of broken connections if not done via transactions.

What would be the problem? Shouldn't (1) get a single database connection (2) send query A (3) send query B referencing row created by A do the job if it's retried from the beginning if the connection fails at any point?

I haven't looked into what code changes would exactly be necessary to support this, but conceptually this sounds like my preferred solution.

The next best solution would probably be wsrep_sync_wait, but that feels more like a quick and dirty fix compared to a proper solution. What would be interesting though: can you set this on the server side as an immediate workaround?

@Al2Klimov
Copy link
Member

Now as you say it:

@NavidSassan @patrickesser Please try wsrep_sync_wait=4 in your MySQL config. It works for me with https://github.com/Al2Klimov/provoke-galera , but -as you may guess- these are laboratory conditions.

While I'm on it: quick Galera test cluster guide for my colleagues

On both Debian 12 nodes:

  • apt install mariadb-server galera-4
  • vim /etc/mysql/mariadb.conf.d/60-galera.cnf
    • uncomment Mandatory settings and bind-address
    • put the first node's IP after gcomm://
    • add wsrep_provider = /usr/lib/galera/libgalera_smm.so

On first node run:

  • galera_new_cluster
  • iptables -A OUTPUT -d SECOND_NODE_IP -m limit --limit 100/sec -j ACCEPT
  • iptables -A OUTPUT -d SECOND_NODE_IP -j DROP

On second node run: systemctl restart mariadb.service

wsrep_sync_wait, but that feels more like a quick and dirty fix compared to a proper solution.

I don't agree. E.g. Eric wants Vitess support to be implemented as (1) are you Vitess? Yes? (2) Tolerate unfiltered SELECTs! Do it now!! My approach would more/less join those ranks. After all we also check the schema version once at startup. The same-connection solution would be dirty, but not quick. I'd prefer not to fiddle DB conns around, but just say the database (or even recommend the users?) what we want: Please do wsrep_sync_wait=4, thank you. (Why isn't an even stricter wsrep_sync_wait the default btw.?) Also, with only one/few DB conn(s) allowed, you'd get a general performance problem.

@Al2Klimov Al2Klimov assigned NavidSassan and unassigned lippserd Aug 17, 2023
@julianbrost
Copy link
Contributor

(Why isn't an even stricter wsrep_sync_wait the default btw.?)

Because it isn't free and comes with its own drawbacks. Excerpt from the documentation:

While the check is taking place, new queries are blocked on the node to allow the server to catch up with all updates made in the cluster up to the point where the check was begun. Once reached, the original query is executed on the node. This can result in higher latency.

So doing the change in the way you suggest, this will incur a latency penalty for all inserts. So we should at least consider only setting it for queries that require it.

E.g. Eric wants Vitess support to be implemented as (1) are you Vitess? Yes? (2) Tolerate unfiltered SELECTs! Do it now!!

I haven't looked into the details of that, so I can't say whether this makes sense. One has to look in the advantages and drawbacks there as well.

@Al2Klimov
Copy link
Member

E.g. Eric wants Vitess support to be implemented as (1) are you Vitess? Yes? (2) Tolerate unfiltered SELECTs! Do it now!!

I haven't looked into the details of that, so I can't say whether this makes sense. One has to look in the advantages and drawbacks there as well.

I mean, I also could do SHOW VARIABLES LIKE if you consider that less dirty, should I?

While the check is taking place, new queries are blocked on the node to allow the server to catch up with all updates made in the cluster up to the point where the check was begun. Once reached, the original query is executed on the node. This can result in higher latency.

So doing the change in the way you suggest, this will incur a latency penalty for all inserts. So we should at least consider only setting it for queries that require it.

Well, we could set that variable only for (history) rows which are referenced by others using a separate sql.DB instance. But a second sql.DB would either circumvent our overall connection limit (significantly?) if not restricted to one connection (per table?). Do you prefer this over setting that variable everywhere?

@julianbrost
Copy link
Contributor

I mean, I also could do SHOW VARIABLES LIKE if you consider that less dirty, should I?

It would be nicer if we didn't have to set that variable at all. Would sending inserts for state_history and history over the connection have downsides for a non-clustered database? I don't see any, so that would be a fix without any special-casing involved.

Well, we could set that variable only for (history) rows which are referenced by others using a separate sql.DB instance. But a second sql.DB would either circumvent our overall connection limit (significantly?) if not restricted to one connection (per table?). Do you prefer this over setting that variable everywhere?

Why a separate sql.DB? What about: get connection from pool, set variable, perform insert, reset variable, return connection to pool?

@Al2Klimov
Copy link
Member

Do we set any of our vars like this? No. (AFAIK Eric neither would support Vitess like this.) I don't like the idea of fiddling around with connections or even max connections (per table) semaphores. Neither the idea that different DB connections work different. Unnecessary to mention that I still prefer my solution. Let's wait for both Eric's opinion as professional and our two users' test results.

Apropos. I guess one can also SET GLOBAL w/o restarting the cluster.

@julianbrost
Copy link
Contributor

Do we set any of our vars like this?

What other variables do we set? Do they have downsides as well like adding latency to many queries (where most don't gain anything from that latency)?

@Al2Klimov
Copy link
Member

ref/IP/46635

@oxzi
Copy link
Member

oxzi commented Feb 21, 2024

Locally, I set up a small Galera cluster to test the error and the solutions mentioned here.

In a nutshell, the setup looked like this:

  • Three MariaDB Nodes, building a Galera cluster
  • HAProxy as a L4 Load Balancer to access the Galera cluster (allows selected traffic distribution)
  • Two Icinga 2 masters with their own Icinga DB and Redis in an HA scenario
  • Multiple dummy hosts to be monitored by Icinga 2

The HAProxy was configured

Due to the fact that Icinga DB utilizes multiple database connections, max_connections defaulting to 16, and HAProxy balances connections randomly, Icinga DB established connections to different MariaDB cluster nodes.

After setting up, the system operated for an estimated hour without any problems.

However, after a restart, the first Icinga node's Icinga DB crashed with a FOREIGN KEY constraint violation directly after starting up.
This error looks very similar to the one in the original post.
Fortunately, the error can be reproduced fairly well from this point onwards.

2024-02-20T14:34:11.256Z    INFO    icingadb        Starting Icinga DB
2024-02-20T14:34:11.256Z    INFO    icingadb        Connecting to database at 'mysql:3306'
2024-02-20T14:34:11.259Z    INFO    icingadb        Connecting to Redis at 'redis-master1:6380'
2024-02-20T14:34:11.260Z    DEBUG   icingadb        Redis schema version is correct
2024-02-20T14:34:11.260Z    DEBUG   icingadb        Checking Icinga 2 and Icinga DB compatibility
2024-02-20T14:34:11.260Z    DEBUG   high-availability       Starting HA     {"instance_id": "701bc13acd8d4fa88296ffee2b705a39"}
2024-02-20T14:34:11.260Z    INFO    icingadb        Starting history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting notification history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting state history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting downtime history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting comment history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting flapping history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting acknowledgement history sync
2024-02-20T14:34:11.448Z    DEBUG   database        Finished executing "INSERT INTO \"history\" (\"environment_id\", \"service_id\", \"event_type\", \"object_type\", \"host_id\", \"id\", \"state_history_id\", \"event_time\", \"endpoint_id\") VALUES (:environment_id,:service_id,:event_type,:object_type,:host_id,:id,:state_history_id,:event_time,:endpoint_id) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")" with 0 rows in 118.491144ms
2024-02-20T14:34:11.448Z    DEBUG   database        Finished executing "INSERT INTO \"state_history\" (\"event_time\", \"state_type\", \"check_attempt\", \"hard_state\", \"long_output\", \"max_check_attempts\", \"scheduling_source\", \"endpoint_id\", \"id\", \"object_type\", \"service_id\", \"environment_id\", \"host_id\", \"soft_state\", \"previous_soft_state\", \"previous_hard_state\", \"output\", \"check_source\") VALUES (:event_time,:state_type,:check_attempt,:hard_state,:long_output,:max_check_attempts,:scheduling_source,:endpoint_id,:id,:object_type,:service_id,:environment_id,:host_id,:soft_state,:previous_soft_state,:previous_hard_state,:output,:check_source) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")" with 5005 rows in 149.674765ms
2024-02-20T14:34:11.451Z    FATAL   icingadb        Error 1452 (23000): Cannot add or update a child row: a foreign key constraint fails ("icingadb"."history", CONSTRAINT "fk_history_state_history" FOREIGN KEY ("state_history_id") REFERENCES "state_history" ("id") ON DELETE CASCADE)
can't perform "INSERT INTO \"history\" (\"environment_id\", \"service_id\", \"event_type\", \"object_type\", \"host_id\", \"id\", \"state_history_id\", \"event_time\", \"endpoint_id\") VALUES (:environment_id,:service_id,:event_type,:object_type,:host_id,:id,:state_history_id,:event_time,:endpoint_id) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")"
github.com/icinga/icingadb/internal.CantPerformQuery
    /go/src/github.com/Icinga/icingadb/internal/internal.go:30
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2.1
    /go/src/github.com/Icinga/icingadb/pkg/icingadb/db.go:391
github.com/icinga/icingadb/pkg/retry.WithBackoff
    /go/src/github.com/Icinga/icingadb/pkg/retry/retry.go:49
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
    /go/src/github.com/Icinga/icingadb/pkg/icingadb/db.go:386
golang.org/x/sync/errgroup.(*Group).Go.func1
    /go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1695
can't retry
github.com/icinga/icingadb/pkg/retry.WithBackoff
    /go/src/github.com/Icinga/icingadb/pkg/retry/retry.go:68
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
    /go/src/github.com/Icinga/icingadb/pkg/icingadb/db.go:386
golang.org/x/sync/errgroup.(*Group).Go.func1
    /go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1695
exit status 1

The MariaDB node receiving the faulty query stored the following error:

MariaDB [(none)]> show engine innodb status\G
[. . .]
------------------------
LATEST FOREIGN KEY ERROR
------------------------
2024-02-20 14:34:11 0x7f023a2fe640 Transaction:
TRANSACTION 22514, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 1
MariaDB thread id 9, OS thread handle 139647542879808, query id 31 192.168.0.8 icingadb Update
INSERT INTO "history" ("environment_id", "service_id", "event_type", "object_type", "host_id", "id", "state_history_id", "event_time", "endpoin
t_id") VALUES (?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,
?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?
,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,
?,?),(?,?,?,?,?,?,?,?,?),(?,
Foreign key constraint fails for table "icingadb"."history":
,
  CONSTRAINT "fk_history_state_history" FOREIGN KEY ("state_history_id") REFERENCES "state_history" ("id") ON DELETE CASCADE in parent table, i
n index idx_history_state tuple:
DATA TUPLE: 2 fields;
 0: len 20; hex 48037240963505d5062d1afa0391f3d382931c01; asc H r@ 5   -          ;;
 1: len 20; hex 2b5047c920f5a76634967ac64be62d59ada90afc; asc +PG    f4 z K -Y    ;;

But in parent table "icingadb"."state_history", in index PRIMARY,
the closest match we can find is record:
PHYSICAL RECORD: n_fields 20; compact format; info bits 0
 0: len 20; hex 480f2d2d601202226dcbf9b07858a210d3dc067d; asc H --`  "m   xX     };;
 1: len 6; hex 000000000000; asc       ;;
 2: len 7; hex 80000000000000; asc        ;;
 3: len 20; hex df8a471cac09c1f48168e45050d98a26632b6253; asc   G      h PP  &c+bS;;
 4: len 20; hex cca97d78c2ded0024d2999a2e42f891bc1f4d953; asc   }x    M)   /     S;;
 5: len 1; hex 01; asc  ;;
 6: len 20; hex a01d2130bdb887a845c411cc21130970a264cbe1; asc   !0    E   !  p d  ;;
 7: SQL NULL;
 8: len 8; hex 0000018dc6a34cc6; asc       L ;;
 9: len 1; hex 01; asc  ;;
 10: len 1; hex 00; asc  ;;
 11: len 1; hex 00; asc  ;;
 12: len 1; hex 63; asc c;;
 13: len 1; hex 63; asc c;;
 14: len 1; hex 01; asc  ;;
 15: len 30; hex 596f752077696c6c2072656d656d6265722c20576174736f6e2c20686f77; asc You will remember, Watson, how; (total 59 bytes);
 16: len 30; hex 416265726e657474792066616d696c79207761732066697273742062726f; asc Abernetty family was first bro; (total 140 bytes);
 17: len 4; hex 00000003; asc     ;;
 18: len 15; hex 6963696e6761322d6d617374657231; asc icinga2-master1;;
 19: len 15; hex 6963696e6761322d6d617374657231; asc icinga2-master1;;
[. . .]

These logs provide a number of insights and allow us to draw conclusions and make assumptions.

  • There are two SQL queries being logged, resulting into violating fk_history_state_history.
  • Both queries are being DEBUG logged as finished at the exact same time.
    This, however, is mostly misleading and the consumed time, 118.491144ms and 149.674765ms respectively, should be considered.
  • The first query inserting into state_history (logged second) created 5005 rows in 149.674765ms.
  • The second query inserting into history (logged first) created zero rows and produces fk_history_state_history violation after 118.491144ms.
  • Thus, the first query actually started before the second.
    I would guess that after the second query failed, the context was canceled, resulting in both queries being defer db.loged.

The history synchronization logic is a pipeline reading from Redis, piping through stageFuncs, and eventually being removed from Redis again:

// The pipeline consists of n+2 stages connected sequentially using n+1 channels of type chan redis.XMessage,
// where n = len(pipeline), i.e. the number of actual sync stages. So the resulting pipeline looks like this:
//
// readFromRedis() Reads from redis and sends the history entries to the next stage
// ↓ ch[0]
// pipeline[0]() First actual sync stage, receives history items from the previous stage, syncs them
// and once completed, sends them off to the next stage.
// ↓ ch[1]
// ... There may be a different number of pipeline stages in between.
// ↓ ch[n-1]
// pipeline[n-1]() Last actual sync stage, once it's done, sends the history item to the final stage.
// ↓ ch[n]
// deleteFromRedis() After all stages have processed a message successfully, this final stage deletes
// the history entry from the Redis stream as it is now persisted in the database.
//
// Each history entry is processed by at most one stage at each time. Each state must forward the entry after
// it has processed it, even if the stage itself does not do anything with this specific entry. It should only
// forward the entry after it has completed its own sync so that later stages can rely on previous stages being
// executed successfully.

"state": {
writeOneEntityStage((*v1.StateHistory)(nil)), // state_history
writeOneEntityStage((*v1.HistoryState)(nil)), // history (depends on state_history)
writeMultiEntityStage(stateHistoryToSlaEntity), // sla_history_state
},

Going down the rabbit hole^W^W call stack, writeOneEntityStagewriteMultiEntityStageDB.UpsertStreamedDB.NamedBulkExec where, eventually, each query is being submitted.

Due to the pipeline architecture, this should not be a problem as the queries are submitted sequentially one after the other.
But this assumption is incorrect if the database is no longer a single server, but rather a cluster which is not always fully synchronized.

Several solutions have already been mentioned in this thread - some more, some less hacky - of which I have tried out some.
Below I am going to list what I have tried and also some further ideas, including what I would not choose.

  1. Configuring wsrep_sync_wait=4 through @Al2Klimov's PR MySQL driver: on connect try setting wsrep_sync_wait=7, swallow error 1193 #665 for every DB connection seems to work fine.
    However, enforcing a synchronization delay for all INSERT and REPLACE queries seems to be an overkill.

  2. As this only affects certain INSERTs, those types could, e.g., implement some attribute or an interface, resulting in wsrep_sync_wait being set accordingly only for the corresponding queries.
    Fortunately, the dependencies are noted as comments in the syncPipelines map.

  3. Purely time-based waiting until the cluster is synchronized is not feasible in my opinion, as it is not possible to estimate when the cluster will be synchronized, especially under high load.

  4. A synchronization stage in the pipeline could be considered, which waits for the previous stage to be completed, then requests a single synchronization and then continues with the next stage.

    However, this would have (at least) two disadvantages:

    • In the worst case, the synchronization does not affect the next used node, becoming useless.
    • This contradicts the pipeline concept and costs additional time and buffer memory.

    I would therefore be reluctant to interrupt or buffer the pipeline.

  5. Go's database/sql - what sqlx builds on - does connection pooling.
    However, it should be possible to get a single distinct connection, which can be passed through the pipeline.
    With a certain degree of refactoring, this should work.

  6. Otherwise, the number of connections could be reduced to just a single one through Icinga DB's max_connections configuration settings.
    However, this comes at the expense of performance and may still fail if the connection is disconnected and then reestablished to another MariaDB node.

  7. A more reasonable version of this idea would be to always promote the same node for all connections, independent of Icinga DB.
    With a load balancer - such as HAProxy - this is possible by distributing the load to the same node at all times.

    I altered my HAProxy setup

    • to balance source, resulting in the same source IP hitting the same backend, and
    • to balance first, resulting in the same backend being used by everyone.

    The latter is essentially equivalent to assigning a virtual IP as @epinter did in Crash after "Resource deadlock avoided" #651.
    But, to put this into context, this degrades the cluster to a glorified primary-(multi-)replica setup.

    BTW, both configurations have mitigated the error and were working for me.

To put things into perspective, both Go's sql library and Galera are clustering connections.
As a result, even a single Icinga DB node might hold connections to different MariaDB nodes and, if SQL transactions are not explicitly used, can cause an out-of-order error.
The devil is in the details, I guess.

  • The points 1. and 2. from the list above are addressing this issue by explicitly telling Galera nodes to synchronize before performing an INSERT.
    This is at the expense of a higher latency.

    While @Al2Klimov's approach includes every INSERT, 2. would only address history and user_notification_history.
    On the downside, 2. would require a bigger change, which itself might introduce new bugs.

    As a non-representative data point, here are my table sizes (rough sizes, because InnoDB):

    MariaDB [icingadb]> SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'icingadb';
    +-----------------+
    | SUM(TABLE_ROWS) |
    +-----------------+
    |          431933 |
    +-----------------+
    1 row in set (0.001 sec)
    
    MariaDB [icingadb]> SELECT TABLE_NAME, TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'icingadb' ORDER BY TABLE_ROWS DESC LIMIT 10;
    +---------------------+------------+
    | TABLE_NAME          | TABLE_ROWS |
    +---------------------+------------+
    | history             |      64533 |
    | servicegroup_member |      54885 |
    | state_history       |      54371 |
    | host_customvar      |      47975 |
    | service_customvar   |      45562 |
    | hostgroup_member    |      38799 |
    | sla_history_state   |      33485 |
    | service             |      30866 |
    | service_state       |      29397 |
    | host                |      10287 |
    +---------------------+------------+
    10 rows in set (0.001 sec)
    
    MariaDB [icingadb]> SELECT TABLE_NAME, TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'icingadb' AND TABLE_NAME IN ('history', 'user_notification_history');
    +---------------------------+------------+
    | TABLE_NAME                | TABLE_ROWS |
    +---------------------------+------------+
    | user_notification_history |          5 |
    | history                   |      64540 |
    +---------------------------+------------+
    2 rows in set (0.001 sec)
    

    Even with history being the most populated table, the issue should only affect roughly 15%, (64540+5)/431933 = 0.1494, of all rows.
    Whether a more selective wsrep_sync_wait is worth the extra effort, completely depends on how expensive the forced synchronization itself is.

  • Points 3. and 4. have already been rejected in the arguments above.

  • I think the 5. point is worth noting and I would further investigate this approach.
    At the moment, I cannot even say if this would work - even after a successfully refactoring - and how this might affect performance.

  • Point 6. can also be ignored.

  • Finally, point 7. is a mitigation, which requires additional software and also might degrade the Galera cluster's performance.

To summarize, I would see if the costs of wsrep_sync_wait can be quantified and evaluate whether distinct connections can be integrated into the pipeline.

@julianbrost
Copy link
Contributor

  1. Go's database/sql - what sqlx builds on - does connection pooling.
    However, it should be possible to get a single distinct connection, which can be passed through the pipeline.
    With a certain degree of refactoring, this should work.

Feel free to give it a try. The tricky part will be the error handling for when that connection fails (the connection you'd retry it on could be to another node, so that might not have seen the previous stage yet).

@oxzi
Copy link
Member

oxzi commented Feb 28, 2024

While reducing the connection pool to a distinct connection seemed like a good idea in the first place, it comes with multiple issues when being implemented. For example, one has to reimplement the reconnection logic and there is even some missing sqlx support, jmoiron/sqlx#780.

Thus, I have eventually dropped this idea and went with another approach: using a distinct DB with only one allowed DB connection for the history sync. Feel free to take a look at #681.

@oxzi
Copy link
Member

oxzi commented Mar 6, 2024

As foreshadowed in my other post, I tried to get some somehow reliable numbers.
Thus, I benchmarked the SQL query execution time from within Icinga DB in different scenarios.

There were three different database scenarios:

  • single-ha-none is a single MariaDB node without any Load Balancer or clustering whatsoever.
  • galera-ha-balance-first is a three node MariaDB Galera cluster with a Load Balancer, always using the same node, acting like a virtual IP.
  • galera-ha-balance-random is the same three node MariaDB Galera cluster with a Load Balancer, but now assigning connections to random database servers.

Then, different Icinga DB flavors were used:

Furthermore, to simulate a real network, some delays were added: delay-none, delay-10ms, and delay-100ms.
The delay was set manually after starting the setup through a command like the following:

for i in $(seq 1 3); do docker exec mysql-n${i} tc qdisc add dev eth0 root netem delay 100ms; done

Thus, it might happen that some of the first connections were not limited.
Note that this applies to both incoming and outgoing traffic, i.e., a 10ms delay leads to 20ms for a ping.

The time was taken by debug logging the execution.

Logging Patch
diff --git a/pkg/icingadb/db.go b/pkg/icingadb/db.go
index 4ff3e0d..e5f451f 100644
--- a/pkg/icingadb/db.go
+++ b/pkg/icingadb/db.go
@@ -386,10 +386,12 @@ func (db *DB) NamedBulkExec(
                                                return retry.WithBackoff(
                                                        ctx,
                                                        func(ctx context.Context) error {
+                                                               t0 := time.Now()
                                                                _, err := db.NamedExecContext(ctx, query, b)
                                                                if err != nil {
                                                                        return internal.CantPerformQuery(err, query)
                                                                }
+                                                               db.logger.Debugf("Executed Query\t%q\t%d\t%v", query, len(b), time.Since(t0))

                                                                counter.Add(uint64(len(b)))

This resulted in different runs, each roughly five minutes long, and named like ${database-scenario}-${icingadb-flavor}-${delay}.
I only tested those combinations that I considered reasonable.

A small statistic was extracted from the logs, showing mean execution time, median execution time, and the standard deviation.

Script

This AWK script requires the gawk interpreter.

# time_normalize a Go time.Duration to a number representing microseconds.
function time_normalize(t) {
  if (match(t, /([0-9]+\.[0-9]+)s$/, a))
    return a[1] * 1000.0
  else if (match(t, /([0-9]+\.[0-9]+)ms$/, a))
    return a[1]
  else if (match(t, /([0-9]+)\.[0-9]+µs$/, a))
    return a[1] / 1000.0

  printf("%s:%d: cannot parse time duration: %s\n", FILENAME, FNR, t) > "/dev/stderr"
  exit 1
}

BEGIN {
  print "| Filename | Mean | Median | SD |"
  print "| -------- | ---- | ------ | -- |"
}

/database\tExecuted Query/ {
  times[times_no++] = time_normalize($NF)
}

ENDFILE {
  asort(times)

  sum = 0
  for (i in times)
    sum += times[i]

  mean = sum / times_no
  median = times[int(times_no/2)]

  variance = 0
  for (i in times)
    variance += (times[i] - mean)^2
  variance = sqrt(variance/times_no)
  sd = sqrt(variance)

  print "| " FILENAME " | " mean "ms | " median "ms | " sd "ms |"

  delete times
  times_no = 0
}
Filename Mean Median SD
single-ha-none-main-delay-none 3.94283ms 3.168657ms 2.09901ms
galera-ha-balance-first-main-delay-none 3.69792ms 2.582401ms 1.95272ms
galera-ha-balance-random-pr665-pr679-delay-none 6.27922ms 3.017348ms 3.1379ms
galera-ha-balance-random-pr681-pr679-delay-none 6.47338ms 3.014271ms 3.57157ms
single-ha-none-main-delay-10ms 25.9425ms 23.304029ms 4.14129ms
galera-ha-balance-first-main-delay-10ms 64.7782ms 53.42998ms 9.38948ms
galera-ha-balance-random-pr665-pr679-delay-10ms 72.4412ms 54.419461ms 9.09667ms
galera-ha-balance-random-pr681-pr679-delay-10ms 48.0431ms 53.711958ms 6.05208ms
single-ha-none-main-delay-100ms 247.345ms 204.465222ms 12.5254ms
galera-ha-balance-first-main-delay-100ms 597.217ms 503.405229ms 24.2195ms
galera-ha-balance-random-pr665-pr679-delay-100ms 685.79ms 644.591792ms 19.3462ms
galera-ha-balance-random-pr681-pr679-delay-100ms 622.714ms 505.723366ms 21.7324ms

A first look shows a significant increase in time for a Galera cluster when a longer delay is set.
When compared to a single MariaDB, this is particularly noticeable: 204.5ms vs 503.4ms/644.6ms/505.7ms in the median.

Within the Galera measurements, the pr665-pr679 stands out with a bigger delay, when a delay between the nodes was configured.
This is not really surprising, as setting wsrep_sync_wait=4 forces a cluster synchronization before executing the query.
The other two Galera options - either to the same node through a LB or with a distinct connection - are roughly similar.

What absolutely should be mentioned is that #681 has errors, at least the log looked suspicious and the CI of the PR still fails.
Furthermore, this adds additional code with more logic, without having a relevant time advantage over a load balancer or a virtual IP.

Thus, even if this is not visible from the data at first glance, I would recommend @Al2Klimov's PR #665 over #681, but change it to make the functionality optionally configurable.
Thus, a DBA can decide whether Icinga DB enforces synchronization - costing some additional time - or whether a load balancer, a virtual IP or the like should be configured.
One should be aware that a Galera Cluster is not always a drop-in replacement for a single database server or a classic replication.

@Al2Klimov
Copy link
Member

a Galera Cluster is not always a drop-in replacement for a single database server

Unfortunately. I try to emulate exactly that with

But if the delay is too bad, we could also re-try FK errors (for 5 minutes) instead.

@oxzi
Copy link
Member

oxzi commented Mar 7, 2024

But if the delay is too bad, we could also re-try FK errors (for 5 minutes) instead.

I am not quite sure if I am in favor of the "retry almost everything after/for n minutes"-idea.
For example, #651 experienced frequent Galera-related deadlock errors.
Retrying everything might stutter the query out of the undesired situation, but would unnecessary slow down everything.

Thus, as I have written before, I would support your wsrep_sync_wait=4 PR, but make it configurable in the config file.

@Al2Klimov
Copy link
Member

Not everything, just specific X for the clear reason Y. I mean, if it's an actual FK error, it will persist and crash the whole thing after 5m. If it's a cluster thing, it'll go away. No configuration needed and no performance decrease. 👍

@julianbrost
Copy link
Contributor

When compared to a single MariaDB, this is particularly noticeable: 204.5ms vs 503.4ms/644.6ms/505.7ms in the median.

Within the Galera measurements, the pr665-pr679 stands out with a bigger delay, when a delay between the nodes was configured. This is not really surprising, as setting wsrep_sync_wait=4 forces a cluster synchronization before executing the query.

One thing that confuses me: the numbers suggest that the impact of wsrep_sync_wait=4 depends on the latency but on the other hand, it's not explainable by an extra round-trip (as that should add 20ms or 200ms with the different latency values).

Also, with reasonable latency (remember, 200ms is in the once around the whole globe order of magnitude), the impact of wsrep_sync_wait=4 seems quite insignificant actually. So I don't see much reason against just setting it.

I don't really understand Galera, I'd have expected a more drastic difference for it to be worth to sacrifice some consistency guarantees.

@oxzi
Copy link
Member

oxzi commented Mar 7, 2024

One thing that confuses me: the numbers suggest that the impact of wsrep_sync_wait=4 depends on the latency but on the other hand, it's not explainable by an extra round-trip (as that should add 20ms or 200ms with the different latency values).

I took another look at the logs, even cleaned out the first 20% of entries - to really make sure that the tc command was already executed - and the last 20% - because I stopped the Galera cluster node by node to avoid conflicts on the next start. However, the fastest query with a 100ms delay on pr665-pr679 took 200.976409ms. Thus, it was directly executed on the database node.

The log line in question was:

icingadb-master1 | 2024-03-05T14:53:31.375Z DEBUG database Executed Query"INSERT INTO "state_history" ("object_type", "id", "event_time", "soft_state", "previous_soft_state", "previous_hard_state", "check_attempt", "long_output", "scheduling_source", "environment_id", "state_type", "host_id", "endpoint_id", "check_source", "hard_state", "max_check_attempts", "service_id", "output") VALUES (:object_type,:id,:event_time,:soft_state,:previous_soft_state,:previous_hard_state,:check_attempt,:long_output,:scheduling_source,:environment_id,:state_type,:host_id,:endpoint_id,:check_source,:hard_state,:max_check_attempts,:service_id,:output) ON DUPLICATE KEY UPDATE "id" = VALUES("id")" 1 200.976409ms

This was just a guess, but this INSERT query ends with ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")" and setting wsrep_sync_wait=4 only covers INSERT and REPLACE, not UPDATE.

Thus, I changed the wsrep_sync_wait value to 6 and even 15.. and the fastest queries still took round about 200ms. The other metrics have also not changed relevantly.

I don't really understand Galera, I'd have expected a more drastic difference for it to be worth to sacrifice some consistency guarantees.

I'll second that.

What kind of magic does it do? I would really like to find some technical documentation about Galera, describing how it works internally.

@slalomsk8er
Copy link

Not everything, just specific X for the clear reason Y. I mean, if it's an actual FK error, it will persist and crash the whole thing after 5m. If it's a cluster thing, it'll go away. No configuration needed and no performance decrease. 👍

Even with wsrep_sync_wait, I've seen icingadb choke on something like ERROR 1047 (08S01): WSREP has not yet prepared node for application use so a retry might by in order for some, if not all errors. I propose usage of the SMTP retry logic, do a couple of them but in increasing intervals.

@oxzi
Copy link
Member

oxzi commented Mar 8, 2024

@slalomsk8er: There is #679 addressing especially this error.

But, out of curiosity, does this only happen during cluster startup times or even sometimes in between?

@slalomsk8er
Copy link

@oxzi thanks for the link - I requested this issue via Linuxfabrik.
I think, ERROR 1047 was independent of the icinga2 cluster startup and caused by network and/or maintenance work on the Galera nodes.
Can't find this particular error in the current logs with journalctl -p 2 -u icingadb.service atm.
I found a timeline of the error in graylog as recorded by the icingacli vpheredb checks and no correlation with icinga2 uptime:
image
image
The bars are very wide and in reality a event looks more like this:
image
image

@yhabteab yhabteab added this to the 1.1.2 milestone Apr 5, 2024
@yhabteab yhabteab removed the needs-feedback We'll only proceed once we hear from you again label Apr 5, 2024
@yhabteab
Copy link
Member

yhabteab commented Apr 5, 2024

Fixed by #711

@yhabteab yhabteab closed this as completed Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants