Engine healthcheck: deal with empty uuid file #2926

lmbarros · 2022-12-15T17:53:11Z

In rare cases (believed to be caused by a non-atomic file creation and writing operation in containerd), we end up with an empty file at /mnt/data/docker/containerd/daemon/io.containerd.grpc.v1.introspection/uuid. This causes ctr version (and hence the health check) to fail. See balena-os/balena-engine#322

This PR addresses this issue in two ways:

Before running ctr version, we check if the uuid file exists and is empty. If so, we remove it. (The subsequent execution of ctr version by the healthcheck will create the file again.)
After running ctr version, we check if the uuid file was really created and is not empty.

In both cases, when an empty uuid file is detected, we log the event to help us confirm our hypothesis about the root cause.

Testing: I manually tested a build of this branch on a Pi 3. First, by replacing the uuid file with an empty one. Second, by changing the healthcheck to simulate an hypothetical case in which ctr version creates an empty uuid file.

alexgg · 2022-12-16T09:20:21Z

meta-balena-common/recipes-containers/balena/balena/balena-healthcheck

+    mkdir -p "$BREADCRUMBS_DIR"
+    date > "$BREADCRUMBS_DIR/empty-uuid-before"
+fi
+


@lmbarros would logging something in the journal not work (and persisting on request if needed)? I know this is meant to be temporary but the writing logs to disk without persistent logging enabled makes me uneasy.

@alexgg, I expected this would be controversial. :-)

Here's my reasoning:

We don't know when these events happen -- and in the case of the the second one (ctr version creating an empty uuid file), we don't even know if they ever happen. Asking a user to enable persistent logging after the fact would almost certainly mean that we'd lose data.

I am not logging to persistent storage in any way that will cause those files to get larger over time. I'll, at most, create two files each one a few bytes long (it's just the date of the last event).

I considered appending to these files, but decided keep just the last event time to avoid the growing files. (Side note: we can also get the date of the first event, by stating these files to get their creation date.)

Based on what we have seen with users and in my manual testing, these events are quite rare and don't happen again (or even more rarely do) after we remove the empty uuid file. So, I don't think we'd be adding significantly to storage media wearing.

The code is already logging to the journal. I can remove the creation of the breadcrumb files, but I think it could make it quite harder for us to get to the root cause.

hey @lmbarros

asking a user to enable persistent logging after the fact

We ask the user to enable persistent logging, reproduce, and then provide the logs

I am not logging to persistent storage in any way that will cause those files to get larger over time. I'll, at most, create two files each one a few bytes long (it's just the date of the last event).

I fail to see how this is better that having all the events logged in persistent logs

I don't think we'd be adding significantly to storage media wearing.

Let's not speculate, if we don't write there is no media wear. If we do write, we don't have data to tell.

Could you summarize why file logging is better in any way than enabling persistent logging before the event?

But also, I don't think logging these will help us debug the root cause. We will also need extra debug in containerd to understand when the empty file is being created, and we need those events to be in context with other system events to understand for example whether containerd is killed. So enabling persistent logging to debug the root cause is something we will need anyway.

Hey @alexgg,

We ask the user to enable persistent logging, reproduce, and then provide the logs

The problem is that we don't know how to reproduce. My goal here is to collect data about a rare event that would be otherwise lost. If we know how to reproduce I wouldn't bother with the breadcrumbs.

But also, I don't think logging these will help us debug the root cause. We will also need extra debug in containerd to understand when the empty file is being created, and we need those events to be in context with other system events to understand for example whether containerd is killed. So enabling persistent logging to debug the root cause is something we will need anyway.

I largely, but not completely agree here.

First, all evidence we have so far points that this is a rare event. What do we do after asking a user to enable persistent logging? Maybe we remove the uuid or replace it with an empty one to try to force the issue happening? There's a very high probability the result will be a fully working device and we don't learn anything. (Maybe you have some different procedure in mind that what I said.)

Second, I would be particularly interested in learning that ctr version could complete successfully and still generate an empty uuid file. This is a sanity check: I don't think this can possibly happen, but if it ever does it means that there's something important out of our radar. I am not sure we'd be able to detect this from inside containerd.

Another way to look at it: what's the worst thing that could happen if we leave the breadcrumbs creation in? Even if all our beliefs about how rare this is are completely wrong, we'll make one or two writes to disk every time the health checks run. This is not good, granted, but for years we did much worse than that (when we used to run a container as part of the health check, this causes even more writes to disk).

hi @lmbarros there are several things I don't like:

We are creating these files for all device types, on all fleets. This is just not good practice. Even if I agree the wear is probably not going to be a problem, it's just not a good example to commit.

You are basically saying that persistent logging is not good enough to debug this problem, and I don't see this. It's the mechanism that has been agreed to have in the product, so if it has shortcomings we need to discuss those in arch calls and agree to fix them.

But there might be something that persistent logs are missing that I can't see. I'll open a thread in Zulip to discuss how you view the debugging.

alexgg

lgtm, just a comment to use the os-helpers-logging functions for logging.

alexgg · 2022-12-20T09:50:26Z

meta-balena-common/recipes-containers/balena/balena/balena-healthcheck

+# present and empty. See https://github.com/balena-os/balena-engine/issues/322
+UUID_FILE="/mnt/data/docker/containerd/daemon/io.containerd.grpc.v1.introspection/uuid"
+if [ -f "$UUID_FILE" -a ! -s "$UUID_FILE" ]; then
+    echo "healthcheck: removing empty $UUID_FILE"


@lmbarros for consistency with other scripts, could you log using os-helpers-logging? That way we get a standard format in the logs.

Updated, @alexgg !

alexgg

lgtm

In rare cases (believed to be caused by a non-atomic file creation and writing operation in containerd), we end up with an empty file at `/mnt/data/docker/containerd/daemon/io.containerd.grpc.v1.introspection/uuid`. This causes `ctr version` (and hence the health check) to fail. See balena-os/balena-engine#322 This commit addresses this issue in two ways: 1. Before running `ctr version`, we check if the uuid file exists and is empty. If so, we remove it. (The subsequent execution of `ctr version` by the healthcheck will create the file again.) 2. After running `ctr version`, we check if the uuid file was really created and is not empty. In both cases, when an empty uuid file is detected, we log the event to help us confirm our hypothesis about the root cause. Signed-off-by: Leandro Motta Barros <[email protected]> Change-type: patch

lmbarros self-assigned this Dec 15, 2022

lmbarros force-pushed the lmb/rm-empty-uuid branch 2 times, most recently from 4293061 to f708ab9 Compare December 15, 2022 21:05

lmbarros marked this pull request as ready for review December 15, 2022 21:11

alexgg reviewed Dec 16, 2022

View reviewed changes

lmbarros force-pushed the lmb/rm-empty-uuid branch from f708ab9 to b5c61ca Compare December 16, 2022 20:36

balena-ci enabled auto-merge December 16, 2022 20:39

lmbarros force-pushed the lmb/rm-empty-uuid branch 2 times, most recently from ca24e1e to 57f0336 Compare December 19, 2022 18:25

alexgg requested changes Dec 20, 2022

View reviewed changes

lmbarros force-pushed the lmb/rm-empty-uuid branch 2 times, most recently from 2867e0c to 149fa3a Compare December 20, 2022 13:25

lmbarros requested a review from alexgg December 21, 2022 17:17

alexgg approved these changes Dec 21, 2022

View reviewed changes

lmbarros force-pushed the lmb/rm-empty-uuid branch from 149fa3a to 345d144 Compare December 21, 2022 18:15

balena-ci merged commit 8e67335 into master Dec 21, 2022

balena-ci deleted the lmb/rm-empty-uuid branch December 21, 2022 21:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Engine healthcheck: deal with empty uuid file #2926

Engine healthcheck: deal with empty uuid file #2926

lmbarros commented Dec 15, 2022 •

edited

Loading

alexgg Dec 16, 2022

lmbarros Dec 16, 2022

alexgg Dec 16, 2022

alexgg Dec 16, 2022 •

edited

Loading

lmbarros Dec 16, 2022

lmbarros Dec 16, 2022

alexgg Dec 16, 2022

alexgg left a comment

alexgg Dec 20, 2022

lmbarros Dec 20, 2022

alexgg left a comment

Engine healthcheck: deal with empty uuid file #2926

Engine healthcheck: deal with empty uuid file #2926

Conversation

lmbarros commented Dec 15, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexgg Dec 16, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexgg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexgg left a comment

Choose a reason for hiding this comment

lmbarros commented Dec 15, 2022 •

edited

Loading

alexgg Dec 16, 2022 •

edited

Loading