What are the commands to verify health of a drive? #103
-
Hi @FurkanGozukara, this is a great question!
-
There are some easy ways to assess drive health, but failure prediction quickly becomes complicated.

For the easy checks, openSeaChest can run a “SMART check” and a “short diagnostic self-test” (short DST). Both respond relatively quickly and tell you whether a drive is currently bad. For the SMART check, the drive’s firmware assesses itself, and if it appears good, openSeaChest reports that the SMART check “passed”. A short DST is a longer self-assessment: the drive tests its mechanical operation and its ability to read and write, among other firmware checks, to evaluate whether it is good or bad. A short DST completes within approximately 2 minutes.

There is a long diagnostic self-test as well. The only difference is that it requires the drive to read all addressable blocks successfully, and it stops on the first LBA with an error. On SATA drives, long DST can only report progress in increments of 10 percent (10, 20, 30, 40, and so on). On a large drive, an update can take so long to appear that the test may look stalled, but it is still running.

The conveyance self-test for SATA is similar to short DST but focuses on checking for handling damage. Not all devices support it.

SMART software often focuses on the SMART attributes and their values. These can be useful, but they are less straightforward to interpret, and you may need the SMART thresholds to know whether the drive is close to a tripping point. It is important to know that SMART attributes are all vendor unique and can change between products from the same vendor. That said, many vendors have common customers who want the same attributes on the drives they purchase, so some attributes are “common” between vendors, at least in name. This does not mean you can directly compare drives from different vendors on an attribute just because it appears common.
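To make the quick checks above concrete, here is a minimal Python sketch that builds the command lines for a SMART check, a short DST, and a long DST. Treat the details as assumptions: the tool name `openSeaChest_SMART` and the flags `--smartCheck`, `--shortDST`, `--longDST`, and `-d` are as I recall them from the tool's help text, so confirm them with `openSeaChest_SMART --help` on your build. The device handle `/dev/sg2` used in the usage note is only a placeholder.

```python
import shutil
import subprocess

# Assumption: tool name and flags are recalled from the openSeaChest_SMART
# help text. Verify with `openSeaChest_SMART --help` before relying on them.
TOOL = "openSeaChest_SMART"
CHECKS = {
    "smart_check": "--smartCheck",  # quick pass/fail reported by the firmware
    "short_dst": "--shortDST",      # short self-test, roughly 2 minutes
    "long_dst": "--longDST",        # reads every addressable block
}


def build_command(device, check):
    """Return the argument list for one health check on `device`."""
    return [TOOL, "-d", device, CHECKS[check]]


def run_check(device, check):
    """Run one check and return the tool's exit code.

    Returns None when the tool is not found on PATH.
    """
    if shutil.which(TOOL) is None:
        return None
    return subprocess.run(build_command(device, check)).returncode
```

For example, `run_check("/dev/sg2", "smart_check")` followed by `run_check("/dev/sg2", "short_dst")` covers the two quick checks. These commands usually need administrator or root privileges, and the tool's own output is what tells you pass or fail.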
In an ATA SMART attribute dump, there are a few fields that can be assessed: the current or nominal value, the worst-ever value, the threshold, and the status flags. If an attribute is marked as pre-fail, you can compare the current/nominal value or the worst-ever value to the threshold value. If either the current or the worst-ever value is at or below the threshold, that attribute indicates that the drive has hit the point where it can be considered failing. A smartCheck basically asks the drive to compare the nominal and threshold values for you and give you a pass/fail result.

With the Device Statistics log, a vendor can indicate a failure with the “Monitored condition met” bit, which shows that a statistic is at a threshold. The manufacturer’s threshold itself is not revealed in this log, though. If a device supports the device-statistics notifications feature, this can show up when a notification is requested for a given statistic.

These tests are what can cover a warranty check for Seagate drives in openSeaChest/SeaChest and SeaTools, as long as the drive is authentically Seagate and within its warranty period.

SMART and device statistics (and similar logs on SCSI/NVMe) can also report a lot of other information about a drive: how many reads, writes, head parking events, reallocated sectors, and so on the device has seen. Not all of these are used to indicate failure, but they can be useful for understanding how the drive is functioning at the time they are read.

Detecting when a drive is “failing” basically comes down to taking multiple snapshots consistently over time to assess when significant changes are happening. For example, if a drive ran for a year or two with no reallocated sectors and then suddenly 20 (this is an example only) showed up in these logs, this could indicate that an event has happened that may not be good.
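The pre-fail comparison described above can be sketched in a few lines of Python. This is only an illustration of the rule (current or worst-ever value at or below the threshold), not how openSeaChest implements smartCheck. The `SmartAttribute` type and its field names are hypothetical; you would fill them in from whatever attribute dump you have.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    # Hypothetical container for one row of an ATA SMART attribute dump.
    number: int
    name: str
    pre_fail: bool  # status flag: attribute is a pre-fail indicator
    current: int    # current/nominal normalized value
    worst: int      # worst-ever normalized value
    threshold: int  # vendor-defined threshold


def is_tripped(attr):
    """True if a pre-fail attribute is at or below its threshold."""
    if not attr.pre_fail:
        return False
    return attr.current <= attr.threshold or attr.worst <= attr.threshold


def failing_attributes(attrs):
    """Return only the attributes that indicate a failing drive."""
    return [a for a in attrs if is_tripped(a)]
```

Attributes that are not flagged pre-fail are skipped here, since the comparison above only applies to pre-fail attributes. Because attributes are vendor unique, the numbers only mean something relative to that same drive's own threshold.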
If the drive thinks this kind of sudden uptick in reallocated sectors is bad, it could set a failure in the smartCheck output to report that a significant event has occurred that indicates it is failing.

Another example is a drive that got too hot for too long, which can degrade its reliability. If it happens occasionally for only a minute or two, that may not be bad, but if the drive stayed too hot for an hour or more, that can be really bad for its long-term reliability. The device statistics output can be very helpful here: it can report the manufacturer’s critical and warning temperature limits as well as how long the drive has run over these temperatures. This lets a customer investigate cooling issues in their system and address them to keep the drive operating as long as possible.
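The snapshot approach can be sketched like this in Python: record a few counters at regular intervals and flag any interval where a counter jumps. Everything here is an assumption for illustration. The dictionary keys (`date`, `reallocated_sectors`) are hypothetical names, and the `jump` value is something you would pick yourself, since the drive's own internal limits are not exposed.

```python
def sudden_increases(snapshots, key, jump):
    """Find intervals where counter `key` rose by at least `jump`.

    `snapshots` is a list of dicts ordered oldest to newest; each dict
    needs a "date" entry plus the counter named by `key`.
    Returns a list of (previous_date, current_date, increase) tuples.
    """
    flagged = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        increase = curr[key] - prev[key]
        if increase >= jump:
            flagged.append((prev["date"], curr["date"], increase))
    return flagged
```

The same function works for any counter that only grows, for example a time-over-temperature value read from device statistics, as long as you record it under a consistent key in every snapshot.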
-
Hello. I want to verify the health of a drive.
What commands should I execute? Also, how should I evaluate the results?
Thank you