What are the commands to verify health of a drive? #103

FurkanGozukara · 2023-02-28T09:42:47Z

FurkanGozukara
Feb 28, 2023

Hello. I want to verify the health of a drive.

Which commands should I run, and how should I interpret the results?

Thank you

vonericsen · 2023-02-28T17:35:55Z

vonericsen
Feb 28, 2023
Maintainer

This is a great question!
I will post an answer once I have collected the information needed to answer it well. This may take a little time since I'm trying to get a new openSeaChest release out.

2 replies

FurkanGozukara Mar 1, 2023
Author

Awesome. Looking forward to it.

vonericsen Mar 2, 2023
Maintainer

I have a draft answer that is under review by our failure analysis team. They are the experts on drive health, so I'm running it by them to make sure it is accurate. I will share it once they have reviewed it.

vonericsen · 2023-03-13T18:48:09Z

vonericsen
Mar 13, 2023
Maintainer

There are some easy ways to assess drive health, but failure prediction quickly becomes complicated.

For the easy part, openSeaChest can run a “SMART check” and a “short diagnostic self-test” (short DST). Both respond relatively quickly and tell you whether a drive is currently bad.
A SMART check is a very quick snapshot: at the moment the command is issued, does the drive’s SMART/health monitoring think the drive is good or bad?

If the drive’s firmware assesses itself and it appears good, openSeaChest will say that the SMART check “passed”.
If the drive’s firmware assesses itself and finds that a known reason for failure has occurred, openSeaChest will report that the SMART check “failed”. If the attributes and thresholds are available, openSeaChest will attempt to read them and work out which attribute caused the failure so it can report that as the reason. This may not be available on every drive, since threshold reporting has been obsolete since ATA-3.

If you run a short diagnostic self-test (DST), the drive performs a longer assessment of itself. This includes testing mechanical operation and the ability to read and write, among other firmware checks, to evaluate whether the drive is good or bad. A short DST completes within approximately 2 minutes.
Output varies depending on what is happening to the drive, but a healthy drive will report success once it has completed the full test. Failures can be due to bad sectors found while reading, mechanical errors, electrical errors, or handling damage. There are also statuses for the case where the OS (or other software) sends a command that aborts the DST. openSeaChest reports whichever of these situations occurs.

There is a long diagnostic self-test as well. The only difference is that this test requires the drive to read all addressable blocks successfully, and it stops at the first LBA with an error. On SATA drives, long DST can only report progress in increments of 10 (10, 20, 30, 40, and so on). On a large drive, the wait between updates can be so long that the test appears to have stalled, even though it is still running.
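
To get a feel for why a long DST can look stalled, here is a small arithmetic sketch. It assumes the progress steps are percentages of the whole test and that the test runs at a roughly constant rate. The 30-hour total is a made-up figure for illustration, not a measurement from any drive.

```python
def minutes_between_updates(total_minutes: float, step_percent: int = 10) -> float:
    """Time between visible progress changes when progress moves in fixed steps."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in (0, 100]")
    return total_minutes * step_percent / 100


# Hypothetical example: a long DST that needs 30 hours in total.
gap = minutes_between_updates(30 * 60)
print(f"about {gap / 60:.1f} hours between progress updates")
```

With those assumed numbers, the progress value would stay unchanged for about three hours at a time.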

The conveyance self-test for SATA is similar to the short DST but focuses on checking for handling damage. Not all devices support it.
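
The tests described so far could be started from a script along the following lines. This is only a sketch: the tool name `openSeaChest_SMART`, the `-d` device option, and the `--smartCheck`, `--shortDST`, `--longDST`, `--conveyance`, and `--poll` flags are written from memory and should be checked against `--help` on your build. The device handle (for example `/dev/sg2`) is a placeholder, and treating exit code 0 as success is an assumption.

```python
import shlex
import subprocess

# Assumed names: confirm them with `openSeaChest_SMART --help` before use.
TOOL = "openSeaChest_SMART"
TEST_FLAGS = {
    "smart_check": ["--smartCheck"],
    "short_dst": ["--shortDST", "--poll"],    # --poll: wait and show progress
    "long_dst": ["--longDST", "--poll"],
    "conveyance": ["--conveyance", "--poll"],  # not supported on all devices
}


def build_command(device: str, test: str) -> list[str]:
    """Return the argv list for one health test on one device handle."""
    if test not in TEST_FLAGS:
        raise ValueError(f"unknown test: {test!r}")
    return [TOOL, "-d", device, *TEST_FLAGS[test]]


def run_test(device: str, test: str) -> int:
    """Run the test and return the exit code (0 is assumed to mean success)."""
    argv = build_command(device, test)
    print("running:", shlex.join(argv))
    return subprocess.run(argv, check=False).returncode
```

Calling `run_test("/dev/sg2", "smart_check")` would start the quick check. The tool's printed output is still the place to read the actual pass/fail reason.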

SMART software often focuses on the SMART attributes and their values. These can be useful, but they are less straightforward to interpret, and you may need the SMART thresholds to know whether the drive is close to a tripping point.
The SMART attributes discussed in this post apply to SATA drives only. SAS/SCSI and NVMe have different, standardized ways to report useful data. New SATA drives also contain the Device Statistics log, which is effectively a set of standardized SMART attributes that the industry can agree on.

It is important to know that SMART attributes are all vendor-unique and can change between products from the same vendor. That said, many vendors have common customers who want the same attributes on the drives they purchase, so some attributes are “common” between vendors, at least in name. This does not mean you can directly compare the same attribute across vendors just because it appears common.

In an ATA SMART attribute dump, there are a few fields that can be assessed: the current (nominal) value, the worst-ever value, the threshold, and the status flags.
An attribute that sets the “Pre-fail” status flag is monitored as a possible indication of failure. These are generally the only attributes with a threshold, but other attributes can set a threshold to help track when a drive has hit a maximum value for “old-age” tracking (spin-ups, for example).

So, if an attribute is marked as pre-fail, you can compare the current/nominal value or the worst-ever value to the threshold value. If either the current or the worst-ever value is at or below the threshold, that attribute indicates that the drive has reached the point where it can be considered failing.
The current/nominal, worst-ever, and threshold values are reported as percentages: 100% means perfectly healthy, and anything less than 100% means less healthy.
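
The comparison described above can be sketched as follows. The `SmartAttribute` type and its field names are hypothetical, as are the example values. Treating a threshold of 0 as "no threshold set" is an assumption of this sketch, and only pre-fail attributes are evaluated.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    nominal: int    # current normalized value
    worst: int      # worst-ever normalized value
    threshold: int  # vendor threshold; 0 is treated here as "none set"
    prefail: bool   # True if the pre-fail status flag is set


def has_tripped(attr: SmartAttribute) -> bool:
    """True if a pre-fail attribute is at or below its threshold."""
    if not attr.prefail or attr.threshold == 0:
        return False  # not evaluated as a failure indicator in this sketch
    return min(attr.nominal, attr.worst) <= attr.threshold


# Made-up values for illustration only.
healthy = SmartAttribute("example A", nominal=100, worst=98, threshold=10, prefail=True)
tripped = SmartAttribute("example B", nominal=40, worst=9, threshold=10, prefail=True)
print(has_tripped(healthy), has_tripped(tripped))
```

Note that "example B" counts as tripped because its worst-ever value dipped to the threshold, even though its current value has recovered.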

A smartCheck basically asks the drive to compare nominal and threshold values for you and give you a pass/fail result.

In the Device Statistics log, a vendor can indicate a failure by setting the “Monitored condition met” bit when a statistic reaches a threshold. The manufacturer’s threshold itself is not revealed in this log, though. If a device supports the device statistics notifications feature, this can show up when a notification is requested for a given statistic.
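
For anyone decoding the raw log themselves, a check for that bit might look like the sketch below. The bit positions (63 for supported, 62 for valid value, 59 for monitored condition met) are my reading of the ATA Command Set specification and are an assumption here. Verify them against the spec before relying on this.

```python
# Assumed flag positions within a statistic's 64-bit value (verify in the spec).
SUPPORTED = 1 << 63
VALID_VALUE = 1 << 62
MONITORED_CONDITION_MET = 1 << 59


def monitored_condition_met(qword: int) -> bool:
    """True if the statistic is supported, valid, and flagged as at its threshold."""
    required = SUPPORTED | VALID_VALUE | MONITORED_CONDITION_MET
    return (qword & required) == required
```

In practice the tool's device statistics output is the easier place to look. This only illustrates that the flag is a pass/fail marker and does not carry the threshold value.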

These tests can serve as a warranty check for Seagate drives in openSeaChest/SeaChest and SeaTools, as long as the drive is authentically Seagate and within its warranty period.

SMART and device statistics (and similar logs on SCSI/NVMe) can also report a lot of other information about a drive, such as the number of reads, writes, head parking events, and reallocated sectors. Not all of these are used to indicate failure, but they can be useful for understanding how the drive is functioning at the time they are read. Detecting when a drive is “failing” basically comes down to taking snapshots consistently over time and assessing when significant changes are happening.

For example, if a drive ran for a year or two with no reallocated sectors and then 20 (an example number only) suddenly showed up in these logs, this could indicate that something has happened that may not be good. If the drive considers this kind of sudden uptick in reallocated sectors to be bad, it could set a failure in the SMART check output to report that a significant event has occurred and that it thinks it is failing.
Whether 20 reallocated sectors is enough to indicate a failure depends on the SMART algorithms in the drive’s firmware. It may even take several such events before a failure indication shows up. The SMART algorithms are extremely complicated and vary between products for many reasons, including what hardware is used inside the drive and which sensors are available on a given product. They can even vary between air-filled and helium-filled drives.
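
The snapshot idea can be sketched in a few lines. This is not how the drive's own SMART algorithms work. It is only a simple way to notice sudden changes in counters you have logged yourself. The counter name and the per-snapshot limit of 5 are arbitrary placeholders.

```python
def sudden_increases(
    snapshots: list[dict[str, int]], max_step: dict[str, int]
) -> list[tuple[int, str, int]]:
    """Return (snapshot index, counter name, increase) for each jump above its limit."""
    events = []
    for i in range(1, len(snapshots)):
        prev, cur = snapshots[i - 1], snapshots[i]
        for name, limit in max_step.items():
            if name in prev and name in cur:
                delta = cur[name] - prev[name]
                if delta > limit:
                    events.append((i, name, delta))
    return events


# Made-up history: no reallocations for a long time, then 20 at once.
history = [
    {"reallocated_sectors": 0},
    {"reallocated_sectors": 0},
    {"reallocated_sectors": 20},
]
print(sudden_increases(history, {"reallocated_sectors": 5}))
```

A flagged jump is a reason to look closer and run the self-tests, not a verdict that the drive has failed.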

Another example is a drive that got too hot for too long, which can degrade its reliability. If it happens occasionally for only a minute or two, that may not be bad, but an hour or more can be really bad for the long-term reliability of the drive. The device statistics output can be very helpful here: it can report the manufacturer’s critical and warning temperature limits as well as how long the drive has run above them. This lets a customer investigate and address cooling issues in their system to keep the drive operating as long as possible.
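
If you log temperatures yourself at a fixed interval, a sketch like the one below can separate a brief spike from a long over-temperature period. The 60 °C limit and the readings are placeholders. The real warning and critical limits should come from the drive's device statistics output.

```python
def longest_minutes_over(
    temps_c: list[float], limit_c: float, interval_min: float = 1.0
) -> float:
    """Longest continuous time, in minutes, that readings stayed above limit_c."""
    longest = current = 0
    for t in temps_c:
        current = current + 1 if t > limit_c else 0
        longest = max(longest, current)
    return longest * interval_min


# Made-up readings taken once per minute, with a placeholder limit of 60 C.
readings = [40, 61, 62, 40, 61, 61, 61, 40]
print(longest_minutes_over(readings, limit_c=60))
```

A result of a minute or two would match the "occasional spike" case above, while a result of an hour or more would point to a cooling problem worth investigating.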
