Add script to send mail in case btrfs issues were detected #107

Open
wants to merge 2 commits into
base: master

@ximion ximion commented Mar 19, 2022

Hi!
This PR adds an extremely basic script that runs btrfs device stats --check on all btrfs filesystems every hour and sends an email to a user-defined address (root in most cases) if any issues are found.
This should work much like the mdadm daemon feature that sends mail when one of the RAID members is about to fail.

A feature like this can be very useful for smaller setups where the admin would still like to receive an email in case a disk in a btrfs RAID array fails.
This also is likely the billionth time someone has written such a script, so putting a version in one place where it can be shared and improved seemed like a good idea, and btrfsmaintenance seems to be the perfect place to add such a feature.

Thanks for considering this PR!
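
For readers who want a concrete picture of the approach described above, here is a minimal sketch. It is not the script from this PR: the function names, the MAILTO default of root and the subject line are assumptions made for illustration, and it presumes a working `mail` command on the system.

```shell
#!/bin/sh
# Illustrative sketch only, not the PR's script. Function names, the MAILTO
# default and the subject line are assumptions.

# Print one mountpoint per mounted btrfs filesystem, reading a
# /proc/mounts-style file ($1, default /proc/mounts). Filesystems that are
# mounted several times (e.g. subvolumes) are listed once, keyed by device.
list_btrfs_mounts() {
    awk '$3 == "btrfs" && !seen[$1]++ { print $2 }' "${1:-/proc/mounts}"
}

# Read mountpoints on stdin and run "btrfs device stats --check" on each.
# --check makes btrfs exit non-zero when any error counter is non-zero.
# Prints the stats of affected filesystems and returns 1 if there were any.
check_btrfs_mounts() {
    failed=0
    while IFS= read -r mnt; do
        if ! out=$(btrfs device stats --check "$mnt" 2>&1); then
            printf 'Issues on %s:\n%s\n\n' "$mnt" "$out"
            failed=1
        fi
    done
    return "$failed"
}

# Mail the report only when at least one filesystem reported errors.
mail_btrfs_issues() {
    if ! report=$(list_btrfs_mounts | check_btrfs_mounts); then
        printf '%s\n' "$report" |
            mail -s "btrfs issues detected on $(hostname)" "${MAILTO:-root}"
    fi
}
```

Calling `mail_btrfs_issues` from an hourly timer or cron job would cover every mounted btrfs filesystem without per-mountpoint configuration. Mountpoints containing spaces appear escaped in /proc/mounts and would need extra handling.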

eku commented Mar 19, 2022

I suggest a cron job, cause cron knows how to send mails.

[email protected]
@hourly /sbin/btrfs device stats /data | grep -vE ' 0$'

ximion (Author) commented Mar 19, 2022

I suggest a cron job, cause cron knows how to send mails.

[email protected]
@hourly /sbin/btrfs device stats /data | grep -vE ' 0$'

Doing that would result in:

  • An email without a clear subject on what the issue was
  • An email that possibly doesn't have enough information to get an overview of the issue
  • Spam every hour in case there was a failure event, instead of once every day / per reboot
  • The user having to manually add an entry like this for each btrfs mountpoint, instead of solving this once for all the btrfs filesystems on the machine
  • The user actually having to use cron and configure it (maybe systemd timers are actually preferred)

So, I still see good reasons to have the extra script for this :-)
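
The "once every day / per reboot" behaviour can be sketched with a marker file. This is an illustration under assumptions, not the PR's code: the marker path follows the /run/btrfs-issue-mail-sent name mentioned in this thread, the 24-hour window and the function names are made up, and `find -mmin` is a common extension (GNU, BSD, BusyBox) rather than strict POSIX.

```shell
#!/bin/sh
# Sketch: suppress repeated notifications with a marker file.
# The MARKER default and the 24-hour window are assumptions.
MARKER=${MARKER:-/run/btrfs-issue-mail-sent}
RESEND_AFTER_MIN=${RESEND_AFTER_MIN:-1440}

# Succeeds if a notification should go out now: either no marker exists
# (first failure since boot, because /run is cleared on reboot) or the
# marker is older than the resend window.
should_notify() {
    if [ ! -e "$MARKER" ]; then
        return 0
    fi
    [ -n "$(find "$MARKER" -mmin +"$RESEND_AFTER_MIN" 2>/dev/null)" ]
}

# Record that a notification was sent just now.
mark_notified() {
    touch "$MARKER"
}
```

An hourly job would then wrap its mail step in `if should_notify; then ...; mark_notified; fi`, so a persistent failure produces one mail per day instead of one per hour.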

sten0 (Contributor) commented Mar 19, 2022

I suggest a cron job, cause cron knows how to send mails.

Which cron implementation can do this without an MTA? When I investigated this, I discovered that Fedora now appears to log cron output to syslog (and by now, maybe journald rather than syslog) rather than piping output to the MTA; using journald might also be problematic, because not all systems have adequate persistent journal retention policies. I like the idea of using a file (/run/btrfs-issue-mail-sent), and I wonder if this idea could be extended. @ximion, what do you think about the following approach (pros, cons, etc):

Poll btrfs stats on an hourly basis, and dump them to a file. Limit notification emails similarly to the logic you've proposed, but send a follow-up email if the rate of errors rapidly increases.

The reason I wonder about this approach is the following case: one disk begins to fail rapidly, and the rate of failed reads (or failed writes) is increasing hour by hour. Meanwhile, the firmware lies about SMART data while claiming everything is fine.

It also seems like having a file with regularly updated stats could be used to enable desktop notifications, albeit in another project, since this seems out of scope for btrfsmaintenance. Btrfs dev stats are "updated during filesystem [mount] lifetime" in addition to "from a scrub run" (btrfs-device(8)), which is why I think this approach may have value :-)
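
A rough sketch of the polling idea, assuming the plain-text output of `btrfs device stats` where each line ends in a counter value. The state file path, the growth threshold and the function names are hypothetical, and a real tool would keep one state file per filesystem.

```shell
#!/bin/sh
# Sketch: detect rapidly increasing error counters between polls.
# STATE_FILE and GROWTH_LIMIT are assumptions chosen for illustration.
STATE_FILE=${STATE_FILE:-/run/btrfs-stats-total}
GROWTH_LIMIT=${GROWTH_LIMIT:-10}

# Sum the error counters (last column) of "btrfs device stats" output
# read from stdin. Prints 0 for empty input.
sum_error_counters() {
    awk '{ total += $NF } END { print total + 0 }'
}

# Compare the new total ($1) with the total stored by the previous poll,
# store the new total, and succeed if the growth exceeds GROWTH_LIMIT.
errors_growing_fast() {
    new=$1
    old=0
    if [ -r "$STATE_FILE" ]; then
        read -r old < "$STATE_FILE" || true
    fi
    old=${old:-0}
    printf '%s\n' "$new" > "$STATE_FILE"
    [ $((new - old)) -gt "$GROWTH_LIMIT" ]
}
```

Usage would look like `total=$(btrfs device stats /data | sum_error_counters)` followed by `if errors_growing_fast "$total"; then ...send follow-up mail...; fi`. Because this sketch keeps its state in /run, the baseline is lost on reboot; a persistent location would be needed to compare across boots.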

sten0 (Contributor) commented Mar 19, 2022

ximion (Author) commented Mar 19, 2022

In general I think those are good ideas, and the case of errors rapidly increasing on a disk actually appears to be relatively common - on our systems once a disk is starting to fail, I can pretty much bet on this behavior.
This would need a script that's a lot more complex than the proposal here though, and I have to say that the idea of just writing a btrfs maintenance daemon that's lightweight and running all the time did cross my mind :-D The btrfs commands pretty much all have nice JSON output that such a daemon could parse to perform the appropriate actions, be it sending an email, writing a log message or sending a message to a desktop environment (but for that case, having a feature like that in udisks is likely the better spot).
Major drawback of this is that such a tool would have to be written and maintained in the first place ^^

karlmistelberger (Contributor) commented

Why not use mail instead of sendmail? See the following fragment from unit packagekit-background.service

    # this is when something useful was done
    if [ $PKCON_RETVAL -ne 5 ]; then
        # send email
        if [ -n "$MAILTO" ]; then
            mail -Ssendwait -s "System updates available: $SYSTEM_NAME" $MAILTO < $PKTMP
        else
            # default behavior is to use cron's internal mailing of output from cron-script
            cat $PKTMP
        fi
    fi

This can be very useful for smaller setups where the admin still would
like to receive an email in case a disk in a btrfs RAID array fails.

Partially resolves kdave#88

AuHau commented Mar 27, 2022

Small suggestion: it would be good to have some test path to validate that everything is set up correctly and that I will indeed get the email notification when something goes wrong, similar to the -M test flag that SMART has.

But otherwise, this is very much needed for me so thanks a lot for this PR! Hopefully this will be merged 👍
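
One way such a test path could look, as a sketch only: the `--test` option name and both function names are hypothetical and not part of the PR. The point is that the test message goes through the same delivery function as a real alert, so a successful test says something about the real path.

```shell
#!/bin/sh
# Sketch of a "--test" path. send_report, handle_args and the option name
# are hypothetical.

# Deliver the text on stdin: by mail when MAILTO is set, otherwise to
# stdout (where cron or the journal would pick it up).
send_report() {
    subject=$1
    if [ -n "${MAILTO:-}" ]; then
        mail -s "$subject" "$MAILTO"
    else
        cat
    fi
}

# Handle the command line: "--test" sends a harmless test message through
# the same delivery path as a real alert. Returns 1 for anything else so
# the caller can continue with the normal check.
handle_args() {
    case ${1:-} in
        --test)
            echo "This is a test notification; no btrfs errors were detected." |
                send_report "btrfs notification test"
            ;;
        *)
            return 1
            ;;
    esac
}
```

A script could start with `if handle_args "$@"; then exit 0; fi` and fall through to the regular check otherwise.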

sten0 (Contributor) commented May 8, 2022

In general I think those are good ideas, and the case of errors rapidly increasing on a disk actually appears to be relatively common - on our systems once a disk is starting to fail, I can pretty much bet on this behavior.

Thanks. I imagine it's stuff you've already thought of, of course ;) I'm encouraged to hear that this failure mode is common, because common problems of sufficient severity make work towards a solution pragmatically useful.

This would need a script that's a lot more complex than the proposal here though, and I have to say that the idea of just writing a btrfs maintenance daemon that's lightweight and running all the time did cross my mind :-D The btrfs commands pretty much all have nice JSON output that such a daemon could parse to perform the appropriate actions, be it sending an email, writing a log message or sending a message to a desktop environment (but for that case, having a feature like that in udisks is likely the better spot). Major drawback of this is that such a tool would have to be written and maintained in the first place ^^

Yes, definitely, and there was an upstream thread that indicates a need for it:

Zygo Blaxell proposes an autodefrag daemon here: https://www.spinics.net/lists/linux-btrfs/msg122168.html
Qu Wenruo supports the idea here: https://www.spinics.net/lists/linux-btrfs/msg122170.html

And a user (Ghislain Adnet) requests what this PR solves here: https://www.spinics.net/lists/linux-btrfs/msg110798.html

I find Adnet's request interesting because this is where a future btrfsd could initiate a replace from a hot spare, or rebalance to a higher raid1c$redundancy level, to defend against the rapidly-increasing-errors failure mode (i.e., it's probable that two disks in the volume are from the same batch, and if one is failing, another may soon begin to fail).

sten0 (Contributor) commented May 8, 2022

/\ @ximion

rjlasko commented Jul 25, 2022

Agree that an email-on-error service should be added. ZFS supports this behavior, for any preinstalled mail service, via zed configuration.
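
For comparison, the relevant part of a zed configuration might look like the excerpt below. The values are illustrative, and the file location can differ between distributions.

```shell
# /etc/zfs/zed.d/zed.rc (excerpt; values are illustrative)

# Address that receives event notifications; mail is only sent when set.
ZED_EMAIL_ADDR="root"

# Minimum number of seconds between notifications for a similar event.
ZED_NOTIFY_INTERVAL_SECS=3600
```

The notification interval plays a similar role to the rate limiting discussed earlier in this thread.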

clickwir commented Oct 11, 2022 via email

Co-authored-by: Adam Uhlíř <[email protected]>

ximion (Author) commented Feb 20, 2023

/\ @ximion

Do you know if any progress has been made on the "btrfsd" front?

sten0 (Contributor) commented Apr 18, 2023 via email

ximion (Author) commented Apr 18, 2023

I'm working on a thing (called btrfsd for now because I don't have a better name...) which will basically be a small binary called by a systemd timer to perform actions like btrfsmaintenance does, but likely a bit more basic, and scratch my particular itch about mail sending and syslog-message-writing, because this patch apparently won't be merged anytime soon.
No ETA on this thing yet though, as I am drowning in work a bit and this will be a "when time permits" kind of project.
grub-btrfs looks super cool! Probably does make sense being its own project though (consolidating all tools would ease maintenance a bit, but would also require the maintainers to be familiar with every aspect of the software...)

sten0 (Contributor) commented Apr 27, 2023 via email

ximion (Author) commented Aug 24, 2023

Thank you, much appreciated! Please CC me news.

I actually had some time to work on this, and tiny Btrfsd is born :-)
I am currently testing it on my computer and a server, and if things work out well, I will make the tool available in Debian as well. It is not as extensive as btrfsmaintenance and will probably only ever support stats/scrub/balance, but it has some nice features (like sending mail on errors, and more mails if errors increase, or only running scrub/balance if the system is not running on battery power).
Maybe you'll like it, and others find it useful too :-)
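
As an illustration of the "not running on battery power" condition, here is a sketch of one way to check for mains power from a shell script via sysfs. This is not how Btrfsd does it; the function name and the overridable base directory are assumptions, and supplies that report a type other than "Mains" (for example some USB-C chargers) are not handled.

```shell
#!/bin/sh
# Sketch: detect mains power via /sys/class/power_supply.
# The function name and the optional base-directory argument are assumptions.

# Succeeds if any power supply of type "Mains" reports online=1, or if no
# power supplies are listed at all (typical for servers and desktops).
on_mains_power() {
    base=${1:-/sys/class/power_supply}
    found_any=0
    for dir in "$base"/*; do
        [ -r "$dir/type" ] || continue
        found_any=1
        read -r type < "$dir/type" || true
        if [ "$type" = "Mains" ] && [ -r "$dir/online" ]; then
            read -r online < "$dir/online" || true
            if [ "$online" = 1 ]; then
                return 0
            fi
        fi
    done
    [ "$found_any" -eq 0 ]
}
```

A maintenance job could then guard heavy operations with `if on_mains_power; then btrfs scrub start /data; fi`, where /data is just the example mountpoint used earlier in this thread.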
