icingadb process crashes with Error 1452: Cannot add or update a child row: a foreign key constraint fails #577
Comments
@patrickesser Are there other kinds of errors/warnings in the log? What magnitude of irregular intervals are we talking about? Btw., what are your typical state change intervals, and do the crashes correlate with state changes of particular objects? What's your DB type/version/cluster? @julianbrost Your family name dominates the IDE annotations of the history sync code. Any speculative idea what could happen here? As far as I see, we execute bulks outside of transactions. Once committed(!), we inform via onSuccess that the Redis message may pass to the next stage, which obviously hits the FK. |
Not really, that code should be written to prevent this, so I'm eager to know what is going wrong here. Some more questions for @patrickesser:
Btw., for such cases, shall we retry serialization failures (at least a limited number of times)? |
The database doesn't report a serialization failure here. I wouldn't retry here without any idea what's actually going on. |
Oh, sorry. I've mixed up GH issues. I meant FK failures of course. |
Sounds too much like "no idea what's going on, let's try this and hope for the best". If we knew this was some bug/limitation of the database, that could be a workaround, but that's not really the current state of information. |
We also see the following SQL errors on a master-master setup with a MariaDB Galera cluster as the database:
How can we further troubleshoot this problem? |
What are the answers to the questions I asked in #577 (comment) in your setup?
We are seeing a pretty similar error message. The crash happened on master1; on master2 it is still running. A restart of the daemon was possible and it is running again.
Just a follow-up. We are running Postgres as a backend for Icinga DB. Postgres is running as a Patroni cluster with 3 nodes. On the Icinga masters we use
After some testing we experienced the following: if a config deployment (or presumably any action that triggers a reload of the icinga2 service) happens at the time of a switch-over of the leading node, we get the "insert into" error:
Without a config deploy at switch-over time the error message changes to:
Looks like the icingadb daemon doesn't like it if the db cluster isn't available for even a short time. |
@log1-c That's a different issue from what was originally reported here. Please open a new issue for that. |
We have several options:
Would there be some time that is guaranteed to be enough? (I doubt it, we're not doing real-time stuff here.)
That would break the history cleanup. What about the idea/suggestion you mentioned at some point: pinning all inserts for the same history event to the same connection? (1. If we can do that, would that work reliably? 2. Can this be implemented feasibly with Go sql/sqlx?) |
It would break retries of broken connections if not done via transactions. |
What would be the problem? Shouldn't (1) get a single database connection, (2) send query A, (3) send query B referencing the row created by A do the job, as long as it's retried from the beginning if the connection fails at any point? I haven't looked into exactly what code changes would be necessary to support this, but conceptually this sounds like my preferred solution. The next best solution would probably be |
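To make this idea concrete, here is a minimal, hypothetical sketch - not Icinga DB code; insertHistoryPair and the simplified columns are made up - of pinning both dependent inserts to one connection from the pool via database/sql, with the caller retrying the whole pair from the beginning if anything fails:

```go
package sketch

import (
	"context"
	"database/sql"
)

// insertHistoryPair pins query A and query B to the very same pooled
// connection, so B is guaranteed to reference the row A just wrote on the
// node this connection talks to.
func insertHistoryPair(ctx context.Context, db *sql.DB, stateID, historyID []byte) error {
	conn, err := db.Conn(ctx) // (1) get a single connection from the pool
	if err != nil {
		return err
	}
	defer conn.Close()

	// (2) query A: the referenced state_history row (columns simplified).
	if _, err := conn.ExecContext(ctx,
		`INSERT INTO state_history (id) VALUES (?)`, stateID); err != nil {
		return err
	}

	// (3) query B: the history row carrying the foreign key to A.
	_, err = conn.ExecContext(ctx,
		`INSERT INTO history (id, state_history_id) VALUES (?, ?)`, historyID, stateID)
	return err // on any failure, the caller retries from (1) on a fresh connection
}
```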
Now that you say it: @NavidSassan @patrickesser please try wsrep_sync_wait=4 in your MySQL config. It works for me with https://github.com/Al2Klimov/provoke-galera, but - as you may guess - these are laboratory conditions. While I'm on it, a quick Galera test cluster guide for my colleagues. On both Debian 12 nodes:
On first node run:
On the second node run: `systemctl restart mariadb.service`
I don't agree. E.g. Eric wants Vitess support to be implemented as: (1) are you Vitess? Yes? (2) Tolerate unfiltered SELECTs! Do it now!! My approach would more or less join those ranks. After all, we also check the schema version once at startup. The same-connection solution would be dirty, but not quick. I'd prefer not to fiddle DB connections around, but just tell the database (or even recommend to the users?) what we want: please do wsrep_sync_wait=4, thank you. (Why isn't an even stricter wsrep_sync_wait the default, btw.?) Also, with only one/few DB conn(s) allowed, you'd get a general performance problem. |
Because it isn't free and comes with its own drawbacks. Excerpt from the documentation:
So making the change in the way you suggest will incur a latency penalty for all inserts. We should therefore at least consider setting it only for queries that require it.
I haven't looked into the details of that, so I can't say whether this makes sense. One has to look into the advantages and drawbacks there as well. |
I mean, I could also do SHOW VARIABLES LIKE if you consider that less dirty - should I?
Well, we could set that variable only for (history) rows which are referenced by others, using a separate sql.DB instance. But a second sql.DB would circumvent our overall connection limit (significantly?) if not restricted to one connection (per table?). Do you prefer this over setting that variable everywhere? |
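For the client-side variant, a hedged sketch of one possible way to do it: github.com/go-sql-driver/mysql interprets unknown DSN parameters as session system variables and sets them on each new connection, so wsrep_sync_wait=4 could be applied to the whole pool without touching the server configuration. The DSN below (host, credentials, database name) is a placeholder:

```go
package sketch

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

// openIcingaDB opens a pool whose connections all run with wsrep_sync_wait=4;
// the driver issues the corresponding SET statement on every new connection.
func openIcingaDB() (*sql.DB, error) {
	dsn := "icingadb:secret@tcp(galera-lb:3306)/icingadb?wsrep_sync_wait=4"
	return sql.Open("mysql", dsn)
}
```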
It would be nicer if we didn't have to set that variable at all. Would sending inserts for
Why a separate |
Do we set any of our vars like this? No. (AFAIK Eric wouldn't support Vitess like this either.) I don't like the idea of fiddling around with connections or even max-connections (per table) semaphores, nor the idea that different DB connections work differently. Needless to say, I still prefer my solution. Let's wait for both Eric's opinion as a professional and our two users' test results. Apropos, I guess one can also SET GLOBAL without restarting the cluster. |
What other variables do we set? Do they have downsides as well like adding latency to many queries (where most don't gain anything from that latency)? |
ref/IP/46635 |
Locally, I set up a small Galera cluster to test the error and the solutions mentioned here. In a nutshell, the setup looked like this:
The HAProxy was configured
Due to the fact that Icinga DB utilizes multiple database connections,
After setting up, the system operated for an estimated hour without any problems. However, after a restart, the first Icinga node's Icinga DB crashed with a
The MariaDB node receiving the faulty query stored the following error:
These logs provide a number of insights and allow us to draw conclusions and make assumptions.
The history synchronization logic is a pipeline reading from Redis, piping through
icingadb/pkg/icingadb/history/sync.go, lines 51 to 69 (at 6c8b52f)
icingadb/pkg/icingadb/history/sync.go, lines 361 to 365 (at 6c8b52f)
Going down the rabbit hole^W^W call stack,
Due to the pipeline architecture, this should not be a problem, as the queries are submitted sequentially one after the other. Several solutions have already been mentioned in this thread - some more, some less hacky - of which I have tried out some.
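To illustrate the sequential hand-off just described, here is a minimal, hypothetical sketch of the pattern - not the actual sync.go code; historyEvent, stage, and the query placeholders are made up - in which a stage forwards an event to the next stage only after its own INSERT has succeeded:

```go
package sketch

import (
	"context"
	"database/sql"
)

// historyEvent is a stand-in for the messages travelling through the pipeline.
type historyEvent struct {
	ID             []byte
	StateHistoryID []byte
}

// stage consumes events, executes its INSERT for each of them, and only on
// success passes the event on to the next stage via out.
func stage(ctx context.Context, db *sql.DB, query string, in <-chan historyEvent, out chan<- historyEvent) error {
	defer close(out)
	for ev := range in {
		if _, err := db.ExecContext(ctx, query, ev.ID, ev.StateHistoryID); err != nil {
			return err
		}
		select {
		case out <- ev: // the event only reaches the next stage after its insert returned
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```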
To put things into perspective, both Go's
To summarize, I would see if the costs of |
Feel free to give it a try. The tricky part will be the error handling for when that connection fails (the connection you'd retry it on could be to another node, so that might not have seen the previous stage yet). |
While reducing the connection pool to a distinct connection seemed like a good idea at first, it comes with multiple issues when implemented. For example, one has to reimplement the reconnection logic, and there is even some missing sqlx support, jmoiron/sqlx#780. Thus, I eventually dropped this idea and went with another approach: using a distinct DB with only one allowed DB connection for the history sync. Feel free to take a look at #681. |
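A minimal sketch of that approach, assuming plain database/sql and a placeholder DSN (the real change lives in #681): a second, dedicated handle capped at a single open connection carries the history sync, while the main handle keeps its usual limit.

```go
package sketch

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

// openDatabases returns the regular pool plus a dedicated history pool that
// is limited to one open connection, so all dependent history inserts go
// through the same connection (and thus the same cluster node).
func openDatabases(dsn string) (mainDB, historyDB *sql.DB, err error) {
	if mainDB, err = sql.Open("mysql", dsn); err != nil {
		return nil, nil, err
	}
	mainDB.SetMaxOpenConns(16) // regular limit for everything else

	if historyDB, err = sql.Open("mysql", dsn); err != nil {
		return nil, nil, err
	}
	historyDB.SetMaxOpenConns(1) // history sync shares a single connection
	return mainDB, historyDB, nil
}
```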
As foreshadowed in my other post, I tried to get some somewhat reliable numbers. There were three different database scenarios:
Then, different Icinga DB flavors were used:
Furthermore, to simulate a real network, some delays were added:
Thus, it might happen that some of the first connections were not limited. The time was taken by debug-logging the execution.
Logging patch:
diff --git a/pkg/icingadb/db.go b/pkg/icingadb/db.go
index 4ff3e0d..e5f451f 100644
--- a/pkg/icingadb/db.go
+++ b/pkg/icingadb/db.go
@@ -386,10 +386,12 @@ func (db *DB) NamedBulkExec(
return retry.WithBackoff(
ctx,
func(ctx context.Context) error {
+ t0 := time.Now()
_, err := db.NamedExecContext(ctx, query, b)
if err != nil {
return internal.CantPerformQuery(err, query)
}
+ db.logger.Debugf("Executed Query\t%q\t%d\t%v", query, len(b), time.Since(t0))
counter.Add(uint64(len(b)))
This resulted in different runs, each roughly five minutes long, and named like
A small statistic was extracted from the logs, showing mean execution time, median execution time, and the standard deviation.
Script: this AWK script requires the GNU AWK (gawk) implementation, as it uses the three-argument match() and the ENDFILE rule.
# time_normalize converts a Go time.Duration string into a number of milliseconds.
function time_normalize(t) {
    if (match(t, /([0-9]+\.[0-9]+)s$/, a))
        return a[1] * 1000.0
    else if (match(t, /([0-9]+\.[0-9]+)ms$/, a))
        return a[1]
    else if (match(t, /([0-9]+)\.[0-9]+µs$/, a))
        return a[1] / 1000.0
    printf("%s:%d: cannot parse time duration: %s\n", FILENAME, FNR, t) > "/dev/stderr"
    exit 1
}

BEGIN {
    print "| Filename | Mean | Median | SD |"
    print "| -------- | ---- | ------ | -- |"
}

/database\tExecuted Query/ {
    times[times_no++] = time_normalize($NF)
}

ENDFILE {
    asort(times)

    sum = 0
    for (i in times)
        sum += times[i]
    mean = sum / times_no

    median = times[int(times_no / 2)]

    # Population variance and standard deviation.
    variance = 0
    for (i in times)
        variance += (times[i] - mean)^2
    variance = variance / times_no
    sd = sqrt(variance)

    print "| " FILENAME " | " mean "ms | " median "ms | " sd "ms |"

    delete times
    times_no = 0
}
A first look shows a significant increase in time for a Galera cluster when a longer delay is set. Within the Galera measurements, the
What absolutely should be mentioned is that #681 has errors; at least the log looked suspicious and the CI of the PR still fails. Thus, even if this is not visible from the data at first glance, I would recommend @Al2Klimov's PR #665 over #681, but change it to make the functionality optionally configurable. |
Unfortunately. I try to emulate exactly that with
But if the delay is too bad, we could also retry FK errors (for 5 minutes) instead. |
I am not quite sure if I am in favor of the "retry almost everything after/for n minutes" idea. Thus, as I have written before, I would support your |
Not everything, just specific X for the clear reason Y. I mean, if it's an actual FK error, it will persist and crash the whole thing after 5m. If it's a cluster thing, it'll go away. No configuration needed and no performance decrease. 👍 |
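A hedged sketch of that idea - the function names and the fixed one-second backoff are made up, and this is not Icinga DB's retry.WithBackoff - treating MySQL error 1452 as retryable for a bounded time window, so a genuine FK violation still surfaces:

```go
package sketch

import (
	"context"
	"database/sql"
	"errors"
	"time"

	"github.com/go-sql-driver/mysql"
)

// isFKError reports whether err is MySQL error 1452
// ("Cannot add or update a child row: a foreign key constraint fails").
func isFKError(err error) bool {
	var me *mysql.MySQLError
	return errors.As(err, &me) && me.Number == 1452
}

// execWithFKRetry retries a query on FK errors for up to five minutes;
// any other error, or an FK error that persists, is returned to the caller.
func execWithFKRetry(ctx context.Context, db *sql.DB, query string, args ...any) error {
	deadline := time.Now().Add(5 * time.Minute)
	for {
		_, err := db.ExecContext(ctx, query, args...)
		if err == nil || !isFKError(err) || time.Now().After(deadline) {
			return err
		}
		select {
		case <-time.After(time.Second): // simple fixed backoff for the sketch
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```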
One thing that confuses me: the numbers suggest that the impact of
Also, with reasonable latency (remember, 200ms is in the once-around-the-whole-globe order of magnitude), the impact of
I don't really understand Galera; I'd have expected a more drastic difference for it to be worth sacrificing some consistency guarantees. |
I took another look at the logs, and even cleaned out the first 20% of entries - to really make sure that the
The log line in question was:
This was just a guess, but this
Thus, I changed the
I'll second that. What kind of magic does it do? I would really like to find some technical documentation about Galera, describing how it works internally. |
Even with |
@slalomsk8er: There is #679 addressing especially this error. But, out of curiosity, does this only happen during cluster startup times or even sometimes in between? |
@oxzi thanks for the link - I requested this issue via Linuxfabrik. |
Fixed by #711 |
The icingadb process crashes at irregular intervals with the following output:
Error 1452: Cannot add or update a child row: a foreign key constraint fails ("icingadb"."history", CONSTRAINT "fk_history_state_history" FOREIGN KEY ("state_history_id") REFERENCES "state_history" ("id") ON DELETE CASCADE)
can't perform "INSERT INTO "history" ("environment_id", "endpoint_id", "host_id", "service_id", "event_time", "object_type", "event_type", "id", "state_history_id") VALUES (:environment_id,:endpoint_id,:host_id,:service_id,:event_time,:object_type,:event_type,:id,:state_history_id) ON DUPLICATE KEY UPDATE "id" = VALUES("id")"
github.com/icinga/icingadb/internal.CantPerformQuery
github.com/icinga/icingadb/internal/internal.go:30
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1
github.com/icinga/icingadb/pkg/icingadb/db.go:394
github.com/icinga/icingadb/pkg/retry.WithBackoff
github.com/icinga/icingadb/pkg/retry/retry.go:45
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
github.com/icinga/icingadb/pkg/icingadb/db.go:389
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/[email protected]/errgroup/errgroup.go:57
runtime.goexit
runtime/asm_amd64.s:1594
can't retry
github.com/icinga/icingadb/pkg/retry.WithBackoff
github.com/icinga/icingadb/pkg/retry/retry.go:64
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
github.com/icinga/icingadb/pkg/icingadb/db.go:389
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/[email protected]/errgroup/errgroup.go:57
runtime.goexit
runtime/asm_amd64.s:1594
I can't find a more detailed log or any possibility to figure out which entry causes the crash.
Your Environment
Icinga DB version: v1.1.0
Build information:
Go version: go1.19.3 (linux, amd64)
Git commit: a0093d1
System information:
Platform: Debian GNU/Linux
Platform version: 11 (bullseye)