Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Engine healthcheck: deal with empty uuid file #2926

Merged
merged 1 commit into from
Dec 21, 2022
Merged

Conversation

lmbarros
Copy link
Contributor

@lmbarros lmbarros commented Dec 15, 2022

In rare cases (believed to be caused by a non-atomic file creation and writing operation in containerd), we end up with an empty file at /mnt/data/docker/containerd/daemon/io.containerd.grpc.v1.introspection/uuid. This causes ctr version (and hence the health check) to fail. See balena-os/balena-engine#322

This PR addresses this issue in two ways:

  1. Before running ctr version, we check if the uuid file exists and is empty. If so, we remove it. (The subsequent execution of ctr version by the healthcheck will create the file again.)
  2. After running ctr version, we check if the uuid file was really created and is not empty.

In both cases, when an empty uuid file is detected, we log the event to help us confirm our hypothesis about the root cause.

Testing: I manually tested a build of this branch on a Pi 3. First, by replacing the uuid file with an empty one. Second, by changing the healthcheck to simulate an hypothetical case in which ctr version creates an empty uuid file.

@lmbarros lmbarros self-assigned this Dec 15, 2022
@lmbarros lmbarros force-pushed the lmb/rm-empty-uuid branch 2 times, most recently from 4293061 to f708ab9 Compare December 15, 2022 21:05
@lmbarros lmbarros marked this pull request as ready for review December 15, 2022 21:11
mkdir -p "$BREADCRUMBS_DIR"
date > "$BREADCRUMBS_DIR/empty-uuid-before"
fi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lmbarros would logging something in the journal not work (and persisting on request if needed)? I know this is meant to be temporary but the writing logs to disk without persistent logging enabled makes me uneasy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexgg, I expected this would be controversial. :-)

Here's my reasoning:

  • We don't know when these events happen -- and in the case of the the second one (ctr version creating an empty uuid file), we don't even know if they ever happen. Asking a user to enable persistent logging after the fact would almost certainly mean that we'd lose data.
  • I am not logging to persistent storage in any way that will cause those files to get larger over time. I'll, at most, create two files each one a few bytes long (it's just the date of the last event).
    • I considered appending to these files, but decided keep just the last event time to avoid the growing files. (Side note: we can also get the date of the first event, by stating these files to get their creation date.)
  • Based on what we have seen with users and in my manual testing, these events are quite rare and don't happen again (or even more rarely do) after we remove the empty uuid file. So, I don't think we'd be adding significantly to storage media wearing.

The code is already logging to the journal. I can remove the creation of the breadcrumb files, but I think it could make it quite harder for us to get to the root cause.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @lmbarros

asking a user to enable persistent logging after the fact

We ask the user to enable persistent logging, reproduce, and then provide the logs

I am not logging to persistent storage in any way that will cause those files to get larger over time. I'll, at most, create two files each one a few bytes long (it's just the date of the last event).

I fail to see how this is better that having all the events logged in persistent logs

I don't think we'd be adding significantly to storage media wearing.

Let's not speculate, if we don't write there is no media wear. If we do write, we don't have data to tell.

Could you summarize why file logging is better in any way than enabling persistent logging before the event?

Copy link
Contributor

@alexgg alexgg Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But also, I don't think logging these will help us debug the root cause. We will also need extra debug in containerd to understand when the empty file is being created, and we need those events to be in context with other system events to understand for example whether containerd is killed. So enabling persistent logging to debug the root cause is something we will need anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @alexgg,

We ask the user to enable persistent logging, reproduce, and then provide the logs

The problem is that we don't know how to reproduce. My goal here is to collect data about a rare event that would be otherwise lost. If we know how to reproduce I wouldn't bother with the breadcrumbs.

But also, I don't think logging these will help us debug the root cause. We will also need extra debug in containerd to understand when the empty file is being created, and we need those events to be in context with other system events to understand for example whether containerd is killed. So enabling persistent logging to debug the root cause is something we will need anyway.

I largely, but not completely agree here.

First, all evidence we have so far points that this is a rare event. What do we do after asking a user to enable persistent logging? Maybe we remove the uuid or replace it with an empty one to try to force the issue happening? There's a very high probability the result will be a fully working device and we don't learn anything. (Maybe you have some different procedure in mind that what I said.)

Second, I would be particularly interested in learning that ctr version could complete successfully and still generate an empty uuid file. This is a sanity check: I don't think this can possibly happen, but if it ever does it means that there's something important out of our radar. I am not sure we'd be able to detect this from inside containerd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to look at it: what's the worst thing that could happen if we leave the breadcrumbs creation in? Even if all our beliefs about how rare this is are completely wrong, we'll make one or two writes to disk every time the health checks run. This is not good, granted, but for years we did much worse than that (when we used to run a container as part of the health check, this causes even more writes to disk).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @lmbarros there are several things I don't like:

  • We are creating these files for all device types, on all fleets. This is just not good practice. Even if I agree the wear is probably not going to be a problem, it's just not a good example to commit.
  • You are basically saying that persistent logging is not good enough to debug this problem, and I don't see this. It's the mechanism that has been agreed to have in the product, so if it has shortcomings we need to discuss those in arch calls and agree to fix them.

But there might be something that persistent logs are missing that I can't see. I'll open a thread in Zulip to discuss how you view the debugging.

@balena-ci balena-ci enabled auto-merge December 16, 2022 20:39
@lmbarros lmbarros force-pushed the lmb/rm-empty-uuid branch 2 times, most recently from ca24e1e to 57f0336 Compare December 19, 2022 18:25
Copy link
Contributor

@alexgg alexgg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just a comment to use the os-helpers-logging functions for logging.

# present and empty. See https://github.com/balena-os/balena-engine/issues/322
UUID_FILE="/mnt/data/docker/containerd/daemon/io.containerd.grpc.v1.introspection/uuid"
if [ -f "$UUID_FILE" -a ! -s "$UUID_FILE" ]; then
echo "healthcheck: removing empty $UUID_FILE"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lmbarros for consistency with other scripts, could you log using os-helpers-logging? That way we get a standard format in the logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, @alexgg !

@lmbarros lmbarros force-pushed the lmb/rm-empty-uuid branch 2 times, most recently from 2867e0c to 149fa3a Compare December 20, 2022 13:25
@lmbarros lmbarros requested a review from alexgg December 21, 2022 17:17
Copy link
Contributor

@alexgg alexgg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

In rare cases (believed to be caused by a non-atomic file creation and
writing operation in containerd), we end up with an empty file at
`/mnt/data/docker/containerd/daemon/io.containerd.grpc.v1.introspection/uuid`.
This causes `ctr version` (and hence the health check) to fail. See
balena-os/balena-engine#322

This commit addresses this issue in two ways:

1. Before running `ctr version`, we check if the uuid file exists and is
   empty. If so, we remove it. (The subsequent execution of `ctr
   version` by the healthcheck will create the file again.)
2. After running `ctr version`, we check if the uuid file was really
   created and is not empty.

In both cases, when an empty uuid file is detected, we log the event to
help us confirm our hypothesis about the root cause.

Signed-off-by: Leandro Motta Barros <[email protected]>
Change-type: patch
@balena-ci balena-ci merged commit 8e67335 into master Dec 21, 2022
@balena-ci balena-ci deleted the lmb/rm-empty-uuid branch December 21, 2022 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants