Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: adding quiet mode for smartctl command #153

Closed
wants to merge 1 commit into from

Conversation

buroa
Copy link

@buroa buroa commented Aug 23, 2023

Fixes #152

@buroa buroa changed the title fix: remove log error from smartctl command fix: adding quiet mode for smartctl command Aug 23, 2023
@buroa
Copy link
Author

buroa commented Aug 23, 2023

@NiceGuyIT or @SuperQ one for you when you have a moment :)

@NiceGuyIT
Copy link
Member

I'll have time this weekend 👍

@k0ste
Copy link
Contributor

k0ste commented Aug 24, 2023

The exit-code is needed for resultCodeIsOk() function, I think. This useful feature, to determine some kind of errors, and bit's can be ignored on prometheus side, for example

smartctl_device * on (instance, device) group_left()
  (smartctl_device_smartctl_exit_status > 0) != 64

This means ignore exit-code 6. Exit-code 6 is "ata error log have log entries"

This option will exclude metric value for smartctl_device_smartctl_exit_status?

@frittentheke
Copy link

@k0ste those bits are not mutually exclusive I believe (see https://www.smartmontools.org/browser/trunk/smartmontools/smartctl.8.in#lbAI) ... so comparing them to their decimal values does only work if a single bit is set.

@k0ste
Copy link
Contributor

k0ste commented Aug 24, 2023

@k0ste those bits are not mutually exclusive I believe (see https://www.smartmontools.org/browser/trunk/smartmontools/smartctl.8.in#lbAI) ... so comparing them to their decimal values does only work if a single bit is set.

Yes, you can define multiple variants for your environment

@buroa
Copy link
Author

buroa commented Aug 24, 2023

The exit-code is needed for resultCodeIsOk() function, I think. This useful feature, to determine some kind of errors, and bit's can be ignored on prometheus side, for example

smartctl_device * on (instance, device) group_left()
  (smartctl_device_smartctl_exit_status > 0) != 64

This means ignore exit-code 6. Exit-code 6 is "ata error log have log entries"

This option will exclude metric value for smartctl_device_smartctl_exit_status?

What does any of that have to do with this pull request? You are stating promql. Adding --log=error introduced a failure in the smartctl for some devices, which because of it, the return code is a 4 for my case. Previous versions exported all metrics, including temperature. This doesn't export one single thing for that device.

@k0ste
Copy link
Contributor

k0ste commented Aug 24, 2023

What does any of that have to do with this pull request? You are stating promql. Adding --log=error introduced a failure in the smartctl for some devices, which because of it, the return code is a 4 for my case. Previous versions exported all metrics, including temperature. This doesn't export one single thing for that device.

Because it is not clear what problem it solves. If you brought specific conclusions where this command blocks the work, it would be more obvious. Yes, working with hardware is sometimes unpredictable, you can never encounter specific problems in your life 🙂

@buroa
Copy link
Author

buroa commented Aug 24, 2023

What does any of that have to do with this pull request? You are stating promql. Adding --log=error introduced a failure in the smartctl for some devices, which because of it, the return code is a 4 for my case. Previous versions exported all metrics, including temperature. This doesn't export one single thing for that device.

Because it is not clear what problem it solves. If you brought specific conclusions where this command blocks the work, it would be more obvious. Yes, working with hardware is sometimes unpredictable, you can never encounter specific problems in your life 🙂

Did you read the issue #152?

@k0ste
Copy link
Contributor

k0ste commented Aug 24, 2023

What does any of that have to do with this pull request? You are stating promql. Adding --log=error introduced a failure in the smartctl for some devices, which because of it, the return code is a 4 for my case. Previous versions exported all metrics, including temperature. This doesn't export one single thing for that device.

Because it is not clear what problem it solves. If you brought specific conclusions where this command blocks the work, it would be more obvious. Yes, working with hardware is sometimes unpredictable, you can never encounter specific problems in your life 🙂

Did you read the issue #152?

Yes, and only compared with your previous message I can figure out that with current command json output was blocked for your case.

Unclear:

  • What is your Device
  • What exactly problem with your device? Bad firmware?
  • What is smarmontools version?

@buroa
Copy link
Author

buroa commented Aug 24, 2023

What does any of that have to do with this pull request? You are stating promql. Adding --log=error introduced a failure in the smartctl for some devices, which because of it, the return code is a 4 for my case. Previous versions exported all metrics, including temperature. This doesn't export one single thing for that device.

Because it is not clear what problem it solves. If you brought specific conclusions where this command blocks the work, it would be more obvious. Yes, working with hardware is sometimes unpredictable, you can never encounter specific problems in your life 🙂

Did you read the issue #152?

Yes, and only compared with your previous message I can figure out that with current command json output was blocked for your case.

Unclear:

  • What is your Device
  • What exactly problem with your device? Bad firmware?
  • What is smarmontools version?

I'm using v0.5.0 of the prometheus-smartctl-exporter chart @ https://github.com/prometheus-community/helm-charts

My device doesn't have bad firmware, but it probably doesn't support all the things other NVMe devices use. It's just using the default NVMe storage device inside a 2018 Mac mini. It was previously working on v0.4.3 of prometheus-smartctl-exporter.

smartctl --version
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.44-talos] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

smartctl comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to redistribute it under
the terms of the GNU General Public License; either
version 2, or (at your option) any later version.
See https://www.gnu.org for further details.

smartmontools release 7.3 dated 2022-02-28 at 16:33:40 UTC
smartmontools SVN rev 5338 dated 2022-02-28 at 16:34:26
smartmontools build host: x86_64-alpine-linux-musl
smartmontools build with: C++11, GCC 12.2.1 20220924
smartmontools configure arguments: [hidden in reproducible builds]
reproducible build SOURCE_DATE_EPOCH: 1681228881 (2023-04-11 16:01:21)

https://github.com/buroa/k8s-gitops/blob/master/kubernetes/apps/monitoring/exporters/smartctl-exporter/app/helmrelease.yaml

@NiceGuyIT
Copy link
Member

@buroa I agree with @k0ste. Adding --quietmode=errorsonly will remove messages section from the JSON and cause jsonIsOk to always return true.

If I run smartctl --json as a regular user, I get this.

    "messages": [
      {
        "string": "Smartctl open device: /dev/sdd failed: Permission denied",
        "severity": "error"
      }
    ],
    "exit_status": 2

If I add --quietmode=errorsonly, I get the following.

    "exit_status": 2

messages is used in jsonIsOk which is called from readSMARTctl. Adding --quietmode=errorsonly means jsonIsOk will always return true because messages will never exist.

I started a troubleshooting document (#157) and will work with you to identify what #152 introduced to cause a change in your environment.

@NiceGuyIT NiceGuyIT closed this Aug 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exit code 4 from --log=error introduced by #131
4 participants