Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exit code 4 from --log=error introduced by #131 #152

Open
buroa opened this issue Aug 23, 2023 · 3 comments
Open

Exit code 4 from --log=error introduced by #131 #152

buroa opened this issue Aug 23, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@buroa
Copy link

buroa commented Aug 23, 2023

https://github.com/prometheus-community/smartctl_exporter/pull/131/files#diff-c249026c4aeb110469ab01c3170ce12c5a292612584702fd378d42a1c868a686

This introduced a bug where smartctl return code is 4 instead of 0 if any messages were present in the call.

    "messages": [
      {
        "string": "Read 1 entries from Error Information Log failed: NVMe Status 0x109",
        "severity": "error"
      }
    ],
    "exit_status": 4
ts=2023-08-24T15:44:05.879Z caller=readjson.go:69 level=warn msg="S.M.A.R.T. output reading" err="exit status 4" device=/dev/nvme2n1
ts=2023-08-24T15:44:05.879Z caller=readjson.go:127 level=warn msg="Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure" device=/dev/nvme2n1
ts=2023-08-24T15:44:05.879Z caller=readjson.go:155 level=error msg="Read 1 entries from Error Information Log failed: NVMe Status 0x109"

We also need to be using --quietmode=errorsonly in addition with --log=error:

https://linux.die.net/man/8/smartctl

errorsonly - only print: For the '-l error' option, if nonzero, the number of errors recorded in the SMART error log and the power-on time when they occurred; For the '-l selftest' option, errors recorded in the device self-test log; For the '-H' option, SMART "disk failing" status or device Attributes (pre-failure or usage) which failed either now or in the past; For the '-A' option, device Attributes (pre-failure or usage) which failed either now or in the past.

@NiceGuyIT
Copy link
Member

The current exporter uses this command. Can you please provide the output the device in question? Obscure sensitive data.

smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error <device>

This was changed in e884420 on 2022-10-03 which was introduced in v0.9.0. Can you please run this command for the same device? Again, obscure sensitive data.

smartctl --json --xall <device>

@akremer
Copy link

akremer commented Mar 20, 2024

--log=error renders this exporter unusable on MacOS/Darwin because smartctrl always returns the following error on Mac NVMEs:

Read 1 entries from Error Information Log failed: GetLogPage failed: system=0x38, sub=0x0, code=745

Running the earlier-specified smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error <device> works great and produces valid and complete JSON output, eg:

{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": false,
    "svn_revision": "5530",
    "platform_info": "Darwin 23.4.0 arm64",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--json",
      "--info",
      "--health",
      "--attributes",
      "--tolerance=verypermissive",
      "--nocheck=standby",
      "--format=brief",
      "--log=error",
      "/dev/disk0"
    ],
    "messages": [
      {
        "string": "Read 1 entries from Error Information Log failed: GetLogPage failed: system=0x38, sub=0x0, code=745",
        "severity": "error"
      }
    ],
    "exit_status": 4
  },
  "local_time": {
    "time_t": 1710975545,
    "asctime": "Wed Mar 20 18:59:05 2024 EDT"
  },
  "device": {
    "name": "/dev/disk0",
    "info_name": "/dev/disk0",
    "type": "nvme",
    "protocol": "NVMe"
  },
  "model_name": "APPLE SSD AP1024Q",
  "serial_number": "xxx",
  "firmware_version": "373.100.",
  "nvme_pci_vendor": {
    "id": 4203,
    "subsystem_id": 4203
  },
  "nvme_ieee_oui_identifier": 0,
  "nvme_controller_id": 0,
  "nvme_version": {
    "string": "<1.2",
    "value": 0
  },
  "nvme_number_of_namespaces": 3,
  "smart_support": {
    "available": true,
    "enabled": true
  },
  "smart_status": {
    "passed": true,
    "nvme": {
      "value": 0
    }
  },
  "nvme_smart_health_information_log": {
    "critical_warning": 0,
    "temperature": 31,
    "available_spare": 100,
    "available_spare_threshold": 99,
    "percentage_used": 7,
    "data_units_read": 363901154,
    "data_units_written": 274998549,
    "host_reads": 6290253662,
    "host_writes": 3321280137,
    "controller_busy_time": 0,
    "power_cycles": 287,
    "power_on_hours": 2778,
    "unsafe_shutdowns": 20,
    "media_errors": 0,
    "num_err_log_entries": 0
  },
  "temperature": {
    "current": 31
  },
  "power_cycle_count": 287,
  "power_on_time": {
    "hours": 2778
  }
}

However, when calling the exporter's /metrics endpoint, only the following is returned:

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.22.0"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 257952
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 257952
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 8456
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 0
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 1.504888e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 257952
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 2.351104e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.51552e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 1150
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 2.351104e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 3.866624e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 1150
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 9600
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 15600
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 44000
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 48960
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.194304e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 407648
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 327680
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 327680
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 6.179856e+06
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 5
# HELP smartctl_devices Number of devices configured or dynamically discovered
# TYPE smartctl_devices gauge
smartctl_devices 1

Note the complete absence of any metrics. By removing --log=error from the following location

out, err := exec.Command(*smartctlPath, "--json", "--info", "--health", "--attributes", "--tolerance=verypermissive", "--nocheck=standby", "--format=brief", "--log=error", device).Output()
and recompiling, the exporter functions great and returns valid data:

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 7
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.22.1"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 258464
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 258464
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 8458
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 0
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 1.49872e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 258464
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 1.908736e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.859584e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 1126
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 1.908736e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 3.76832e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 1126
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 9600
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 15600
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 54560
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 65280
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.194304e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.183926e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 425984
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 425984
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 6.966288e+06
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 7
# HELP smartctl_device Device info
# TYPE smartctl_device gauge
smartctl_device{ata_additional_product_id="unknown",ata_version="",device="disk0",firmware_version="373.100.",form_factor="",interface="nvme",model_family="unknown",model_name="APPLE SSD AP1024Q",protocol="NVMe",sata_version="",scsi_product="",scsi_revision="",scsi_vendor="",scsi_version="",serial_number="xxx"} 1
# HELP smartctl_device_available_spare Normalized percentage (0 to 100%) of the remaining spare capacity available
# TYPE smartctl_device_available_spare counter
smartctl_device_available_spare{device="disk0"} 100
# HELP smartctl_device_available_spare_threshold When the Available Spare falls below the threshold indicated in this field, an asynchronous event completion may occur. The value is indicated as a normalized percentage (0 to 100%)
# TYPE smartctl_device_available_spare_threshold counter
smartctl_device_available_spare_threshold{device="disk0"} 99
# HELP smartctl_device_block_size Device block size
# TYPE smartctl_device_block_size gauge
smartctl_device_block_size{blocks_type="logical",device="disk0"} 0
smartctl_device_block_size{blocks_type="physical",device="disk0"} 0
# HELP smartctl_device_capacity_blocks Device capacity in blocks
# TYPE smartctl_device_capacity_blocks gauge
smartctl_device_capacity_blocks{device="disk0"} 0
# HELP smartctl_device_capacity_bytes Device capacity in bytes
# TYPE smartctl_device_capacity_bytes gauge
smartctl_device_capacity_bytes{device="disk0"} 0
# HELP smartctl_device_critical_warning This field indicates critical warnings for the state of the controller
# TYPE smartctl_device_critical_warning counter
smartctl_device_critical_warning{device="disk0"} 0
# HELP smartctl_device_media_errors Contains the number of occurrences where the controller detected an unrecovered data integrity error. Errors such as uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are included in this field
# TYPE smartctl_device_media_errors counter
smartctl_device_media_errors{device="disk0"} 0
# HELP smartctl_device_num_err_log_entries Contains the number of Error Information log entries over the life of the controller
# TYPE smartctl_device_num_err_log_entries counter
smartctl_device_num_err_log_entries{device="disk0"} 0
# HELP smartctl_device_percentage_used Device write percentage used
# TYPE smartctl_device_percentage_used counter
smartctl_device_percentage_used{device="disk0"} 7
# HELP smartctl_device_power_cycle_count Device power cycle count
# TYPE smartctl_device_power_cycle_count counter
smartctl_device_power_cycle_count{device="disk0"} 287
# HELP smartctl_device_power_on_seconds Device power on seconds
# TYPE smartctl_device_power_on_seconds counter
smartctl_device_power_on_seconds{device="disk0"} 1.00008e+07
# HELP smartctl_device_smart_status General smart status
# TYPE smartctl_device_smart_status gauge
smartctl_device_smart_status{device="disk0"} 1
# HELP smartctl_device_smartctl_exit_status Exit status of smartctl on device
# TYPE smartctl_device_smartctl_exit_status gauge
smartctl_device_smartctl_exit_status{device="disk0"} 0
# HELP smartctl_device_temperature Device temperature celsius
# TYPE smartctl_device_temperature gauge
smartctl_device_temperature{device="disk0",temperature_type="current"} 31
# HELP smartctl_devices Number of devices configured or dynamically discovered
# TYPE smartctl_devices gauge
smartctl_devices 1
# HELP smartctl_version smartctl version
# TYPE smartctl_version gauge
smartctl_version{build_info="(local build)",json_format_version="1.0",smartctl_version="7.4",svn_revision="5530"} 1

I am not a Go expert, but I believe this is occurring because of this area:

if message.Get("severity").String() == "error" {
level.Error(logger).Log("msg", message.Get("string").String())
return false

...which, by returning false due to the present error message, causes the upstream readData method to return an empty JSON object.

json, ok := readSMARTctl(logger, device)
if ok {
jsonCache.Store(device, JSONCache{JSON: json, LastCollect: time.Now()})
j, found := jsonCache.Load(device)
if !found {
level.Warn(logger).Log("msg", "device not found", "device", device)
}
return j.(JSONCache).JSON
}
return gjson.Result{}

@niektoniekde
Copy link

I can confirm this behaviour with set of old SSDs (5y+) because they don't support error log. Metrics are not collected because exit status value of 4. Running smartctl without '--log=error' or with '--quietmode=errorsonly' returns metrics with exit status value of 0.

Maybe adding commandline option to disable error log argument to smartctl will solve this and related issues?

@NiceGuyIT NiceGuyIT added the bug Something isn't working label Dec 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants