3x1T SSD ZFS mirror: two of them faulted, and the remaining one also had errors. How to troubleshoot? #11773
woyaojizhu8 asked this question in Q&A (unanswered)
My laptop is a Dell Precision 7740 with 5 SSDs installed: one is the Windows 10 system drive, one is the Ubuntu 20.04 (my daily OS) system drive, and the remaining three 1 TB SSDs form a mirrored ZFS pool for data storage. After setup I used it for a year without checking the status, because the three SSDs were all new and in good condition when I got them. Last month I ran into some problems, so I ran `sudo zpool status -v` and found that two of the drives were faulted and the remaining one also had many errors.
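For context, a pool of this shape is usually created along these lines. This is a hypothetical reconstruction, not the command actually used: the device names are taken from the `zpool status` output below, and the `ashift` option is an assumption on my part.

```shell
# Hypothetical reconstruction of the pool described above: a single
# three-way mirror named tankmain built from the three NVMe drives.
# The ashift=12 setting is an assumption, not stated in the post.
sudo zpool create -o ashift=12 tankmain mirror \
    /dev/disk/by-id/nvme-PLEXTORPX-1TM9PGN+_P02952500057 \
    /dev/disk/by-id/nvme-PLEXTORPX-1TM9PGN+_P02952500063 \
    /dev/disk/by-id/nvme-PLEXTORPX-1TM9PGN+_P02952500220
```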
```
$ sudo zpool status -v
  pool: tankmain
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 91.2G in 0 days 00:20:39 with 0 errors on Wed Feb 10 17:30:23 2021
config:

        NAME                                      STATE     READ WRITE CKSUM
        tankmain                                  DEGRADED     0     0     0
          mirror-0                                DEGRADED    28     0     0
            nvme-PLEXTORPX-1TM9PGN+_P02952500057  DEGRADED    47     0   220  too many errors
            nvme-PLEXTORPX-1TM9PGN+_P02952500063  FAULTED     32     0     2  too many errors
            nvme-PLEXTORPX-1TM9PGN+_P02952500220  FAULTED     22     0     3  too many errors
```
I could not believe that these SSDs could already be failing. I checked the SMART info and there was no abnormality: Data Units Written was about 70 TB against a rated endurance of 640 TB. I checked my data and found that some of it was corrupted. Then I rebooted my laptop; after the reboot:
```
$ sudo zpool status -v
  pool: tankmain
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Feb 14 15:38:53 2021
        40.5G scanned at 251M/s, 11.1G issued at 68.8M/s, 755G total
        23.1G resilvered, 1.47% done, 0 days 03:04:30 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        tankmain                                  DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            nvme-PLEXTORPX-1TM9PGN+_P02952500057  DEGRADED     0     0     0  too many errors
            nvme-PLEXTORPX-1TM9PGN+_P02952500063  ONLINE       0     0     7  (resilvering)
            nvme-PLEXTORPX-1TM9PGN+_P02952500220  ONLINE       0     0    11  (resilvering)
```
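As a sanity check on the SMART endurance figures mentioned earlier (about 70 TB written against a 640 TB rating), a quick bit of arithmetic shows how little of the rated life had been consumed:

```shell
# Rough SSD wear estimate from the SMART figures quoted in this post.
written_tb=70    # Data Units Written, converted to TB
rated_tbw=640    # manufacturer's rated endurance (TBW)
awk -v w="$written_tb" -v r="$rated_tbw" \
    'BEGIN { printf "endurance used: %.1f%%\n", 100 * w / r }'
```

This prints `endurance used: 10.9%`, so flash wear-out is a very unlikely explanation for errors on all three drives at once.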
After the resilver finished:
```
$ sudo zpool status -v
  pool: tankmain
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: resilvered 83.5G in 0 days 00:10:26 with 0 errors on Sun Feb 14 15:49:19 2021
config:

        NAME                                      STATE     READ WRITE CKSUM
        tankmain                                  DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            nvme-PLEXTORPX-1TM9PGN+_P02952500057  DEGRADED     2     0     9  too many errors
            nvme-PLEXTORPX-1TM9PGN+_P02952500063  ONLINE       0     0    15
            nvme-PLEXTORPX-1TM9PGN+_P02952500220  ONLINE       0     0    19

errors: No known data errors
```
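The two actions suggested in that status message map onto commands like these (the replacement device path is a placeholder, not a real drive from this setup):

```shell
# Reset the error counters once you believe the fault was transient:
sudo zpool clear tankmain

# Or swap out a suspect drive for a new one (placeholder path):
sudo zpool replace tankmain \
    nvme-PLEXTORPX-1TM9PGN+_P02952500057 /dev/disk/by-id/<new-ssd>
```

Note that `zpool clear` only resets the counters; it does not fix whatever caused the errors, so cleared errors that keep coming back point to a persistent hardware or connection problem.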
Then I scrubbed this zpool:
```
$ sudo zpool status -v
  pool: tankmain
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Sun Feb 14 15:56:49 2021
        209G scanned at 1.76G/s, 903M issued at 7.59M/s, 755G total
        849K repaired, 0.12% done, no estimated completion time
config:

        NAME                                      STATE     READ WRITE CKSUM
        tankmain                                  DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            nvme-PLEXTORPX-1TM9PGN+_P02952500057  DEGRADED     3     0     9  too many errors  (repairing)
            nvme-PLEXTORPX-1TM9PGN+_P02952500063  FAULTED     32     0 1.90K  too many errors  (repairing)
            nvme-PLEXTORPX-1TM9PGN+_P02952500220  FAULTED     64     0   419  too many errors  (repairing)
```
After the scrub finished its repairs:
```
$ sudo zpool status -v
  pool: tankmain
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 970K in 0 days 00:29:42 with 213 errors on Sun Feb 14 16:26:31 2021
config:

        NAME                                      STATE     READ WRITE CKSUM
        tankmain                                  DEGRADED     0     0     0
          mirror-0                                DEGRADED   168     0     0
            nvme-PLEXTORPX-1TM9PGN+_P02952500057  DEGRADED   327     0 2.34K  too many errors
            nvme-PLEXTORPX-1TM9PGN+_P02952500063  FAULTED     32     0  690K  too many errors
            nvme-PLEXTORPX-1TM9PGN+_P02952500220  FAULTED     64     0  682K  too many errors
```
Some errors were repaired, but most were not.
Then I rebooted and disabled P02952500057 in the BIOS. When I booted into Ubuntu this time, with only two disks present, data could be read and written normally, and all of it was intact with no corruption. But P02952500063 was still DEGRADED, while P02952500220 was ONLINE. Even after a restart, both start out ONLINE, but after a manual scrub P02952500063 becomes DEGRADED again. Each scrub detects some errors and repairs them all successfully, yet scrubbing again still finds errors, which are again repaired. It is as if only one disk were really running, and each manual scrub re-synchronizes P02952500063 from P02952500220.

I transferred my data to a safe place, destroyed this zpool, and rebuilt a new one; there are no errors now. But I am still worried that this kind of failure will happen again. How can I find the cause? Any suggestions to avoid this happening again?
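One lesson from going a year without checking the pool: a periodic automated check would have caught this much earlier. Below is a minimal sketch of such a check. It assumes the `zpool status` output format shown above (NAME/STATE/READ/WRITE/CKSUM columns); the function name is my own invention, and running it monthly from cron alongside a scheduled scrub is one common arrangement.

```shell
#!/bin/sh
# Minimal ZFS health check: parse `zpool status` output and flag any
# pool/vdev line whose state is not ONLINE or whose READ/WRITE/CKSUM
# counters are nonzero. check_pool is a hypothetical helper name.
check_pool() {
  awk '
    # Device table lines look like: NAME STATE READ WRITE CKSUM [notes...]
    NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
      # "+ 0" coerces counts like "2.34K" to a number for the comparison.
      if ($2 != "ONLINE" || $3 + 0 != 0 || $4 + 0 != 0 || $5 + 0 != 0) {
        print "ALERT:", $1, $2, $3, $4, $5
        bad = 1
      }
    }
    END { exit bad }
  '
}

# Only query a real pool when the zpool command is actually available:
if command -v zpool >/dev/null 2>&1; then
  zpool status tankmain | check_pool || echo "pool tankmain needs attention"
fi
```

Hooking the alert line into mail or a notifier (or simply enabling the ZED, the ZFS event daemon, which ships with OpenZFS) turns silent counter growth into something you hear about the day it starts.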