-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Engine healthcheck: deal with empty uuid file #2926
Conversation
4293061
to
f708ab9
Compare
mkdir -p "$BREADCRUMBS_DIR" | ||
date > "$BREADCRUMBS_DIR/empty-uuid-before" | ||
fi | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmbarros would logging something in the journal not work (and persisting on request if needed)? I know this is meant to be temporary but the writing logs to disk without persistent logging enabled makes me uneasy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexgg, I expected this would be controversial. :-)
Here's my reasoning:
- We don't know when these events happen -- and in the case of the the second one (
ctr version
creating an empty uuid file), we don't even know if they ever happen. Asking a user to enable persistent logging after the fact would almost certainly mean that we'd lose data. - I am not logging to persistent storage in any way that will cause those files to get larger over time. I'll, at most, create two files each one a few bytes long (it's just the
date
of the last event).- I considered appending to these files, but decided keep just the last event time to avoid the growing files. (Side note: we can also get the date of the first event, by
stat
ing these files to get their creation date.)
- I considered appending to these files, but decided keep just the last event time to avoid the growing files. (Side note: we can also get the date of the first event, by
- Based on what we have seen with users and in my manual testing, these events are quite rare and don't happen again (or even more rarely do) after we remove the empty uuid file. So, I don't think we'd be adding significantly to storage media wearing.
The code is already logging to the journal. I can remove the creation of the breadcrumb files, but I think it could make it quite harder for us to get to the root cause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @lmbarros
asking a user to enable persistent logging after the fact
We ask the user to enable persistent logging, reproduce, and then provide the logs
I am not logging to persistent storage in any way that will cause those files to get larger over time. I'll, at most, create two files each one a few bytes long (it's just the date of the last event).
I fail to see how this is better that having all the events logged in persistent logs
I don't think we'd be adding significantly to storage media wearing.
Let's not speculate, if we don't write there is no media wear. If we do write, we don't have data to tell.
Could you summarize why file logging is better in any way than enabling persistent logging before the event?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But also, I don't think logging these will help us debug the root cause. We will also need extra debug in containerd to understand when the empty file is being created, and we need those events to be in context with other system events to understand for example whether containerd is killed. So enabling persistent logging to debug the root cause is something we will need anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @alexgg,
We ask the user to enable persistent logging, reproduce, and then provide the logs
The problem is that we don't know how to reproduce. My goal here is to collect data about a rare event that would be otherwise lost. If we know how to reproduce I wouldn't bother with the breadcrumbs.
But also, I don't think logging these will help us debug the root cause. We will also need extra debug in containerd to understand when the empty file is being created, and we need those events to be in context with other system events to understand for example whether containerd is killed. So enabling persistent logging to debug the root cause is something we will need anyway.
I largely, but not completely agree here.
First, all evidence we have so far points that this is a rare event. What do we do after asking a user to enable persistent logging? Maybe we remove the uuid or replace it with an empty one to try to force the issue happening? There's a very high probability the result will be a fully working device and we don't learn anything. (Maybe you have some different procedure in mind that what I said.)
Second, I would be particularly interested in learning that ctr version
could complete successfully and still generate an empty uuid file. This is a sanity check: I don't think this can possibly happen, but if it ever does it means that there's something important out of our radar. I am not sure we'd be able to detect this from inside containerd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way to look at it: what's the worst thing that could happen if we leave the breadcrumbs creation in? Even if all our beliefs about how rare this is are completely wrong, we'll make one or two writes to disk every time the health checks run. This is not good, granted, but for years we did much worse than that (when we used to run a container as part of the health check, this causes even more writes to disk).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @lmbarros there are several things I don't like:
- We are creating these files for all device types, on all fleets. This is just not good practice. Even if I agree the wear is probably not going to be a problem, it's just not a good example to commit.
- You are basically saying that persistent logging is not good enough to debug this problem, and I don't see this. It's the mechanism that has been agreed to have in the product, so if it has shortcomings we need to discuss those in arch calls and agree to fix them.
But there might be something that persistent logs are missing that I can't see. I'll open a thread in Zulip to discuss how you view the debugging.
f708ab9
to
b5c61ca
Compare
ca24e1e
to
57f0336
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, just a comment to use the os-helpers-logging functions for logging.
# present and empty. See https://github.com/balena-os/balena-engine/issues/322 | ||
UUID_FILE="/mnt/data/docker/containerd/daemon/io.containerd.grpc.v1.introspection/uuid" | ||
if [ -f "$UUID_FILE" -a ! -s "$UUID_FILE" ]; then | ||
echo "healthcheck: removing empty $UUID_FILE" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmbarros for consistency with other scripts, could you log using os-helpers-logging
? That way we get a standard format in the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, @alexgg !
2867e0c
to
149fa3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
In rare cases (believed to be caused by a non-atomic file creation and writing operation in containerd), we end up with an empty file at `/mnt/data/docker/containerd/daemon/io.containerd.grpc.v1.introspection/uuid`. This causes `ctr version` (and hence the health check) to fail. See balena-os/balena-engine#322 This commit addresses this issue in two ways: 1. Before running `ctr version`, we check if the uuid file exists and is empty. If so, we remove it. (The subsequent execution of `ctr version` by the healthcheck will create the file again.) 2. After running `ctr version`, we check if the uuid file was really created and is not empty. In both cases, when an empty uuid file is detected, we log the event to help us confirm our hypothesis about the root cause. Signed-off-by: Leandro Motta Barros <[email protected]> Change-type: patch
149fa3a
to
345d144
Compare
In rare cases (believed to be caused by a non-atomic file creation and writing operation in containerd), we end up with an empty file at /mnt/data/docker/containerd/daemon/io.containerd.grpc.v1.introspection/uuid. This causes
ctr version
(and hence the health check) to fail. See balena-os/balena-engine#322This PR addresses this issue in two ways:
ctr version
, we check if the uuid file exists and is empty. If so, we remove it. (The subsequent execution ofctr version
by the healthcheck will create the file again.)ctr version
, we check if the uuid file was really created and is not empty.In both cases, when an empty uuid file is detected, we log the event to help us confirm our hypothesis about the root cause.
Testing: I manually tested a build of this branch on a Pi 3. First, by replacing the uuid file with an empty one. Second, by changing the healthcheck to simulate an hypothetical case in which
ctr version
creates an empty uuid file.