Errors on your console / email notifications:

Device: /dev/sdb [SAT], 9 Currently unreadable (pending) sectors
Device: /dev/sdb [SAT], 9 Offline uncorrectable sectors

***NOTE: just RMA?

man smartctl
***NOTE: search -t TEST
man hdparm
***NOTE: search --read-sector

***NOTE: RAID vs not

#cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md4 : active raid6 sdb1[9] sdi1[7] sdj1[6] sdh1[5] sdg1[4] sde1[1] sdd1[0] sdc1[3] sdf1[2]
      13671854784 blocks super 1.2 level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]

#mdadm --fail /dev/md4 /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md4
#mdadm --remove /dev/md4 /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md4

#cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md4 : active raid6 sdi1[7] sdj1[6] sdh1[5] sdg1[4] sde1[1] sdd1[0] sdc1[3] sdf1[2]
      13671854784 blocks super 1.2 level 6, 64k chunk, algorithm 2 [9/8] [UUUUUUUU_]

#smartctl -a /dev/sdb
[...]
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Se
Device Model:     WDC WD2000F9YZ-09N20L1

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       14
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       14
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       16
[...]

#smartctl -t long /dev/sdb
smartctl 6.5 2016-05-07 r4318 [i686-linux-4.7.9-100.fc23.i686+PAE] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 240 minutes for test to complete.
Test will complete after Wed Jan 25 05:50:33 2017

Use smartctl -X to abort test.

***NOTE: Wait about 20s for the test to start and report status.

#smartctl -a /dev/sdb | grep -PA10 'LBA_of_first_error|CURRENT_TEST_STATUS|Self-test execution status'
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (21960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
--
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      8471         4542656

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA     MAX_LBA  CURRENT_TEST_STATUS
    1  1132600  3907029167  Not_testing

***NOTE: the selective part above will still show the previous selected range even though the brand new -t long really is starting from the beginning (confirmed). It sometimes doesn't show progress either, so if you want sane counters and progress use:
#smartctl -t select,0-max /dev/sdb

#hdparm --yes-i-know-what-i-am-doing --read-sector 4542656 /dev/sdb

/dev/sdb:
reading sector 4542656: SG_IO: bad/missing sense data, sb[]:  70 00 03 00 00 00 00 0a 00 51 e0 01 11 04 00 00 00 c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
succeeded
ffff 8d84 249c 0300 00c7 4424 083c 0000
0089 4424 0489 3424 e823 8000 00e8 7ea0
ffff 8b00 8944 2440 8d83 b586 ffff 8904
24e8 ca9e ffff 0fb6 8424 9d03 0000 c744
2404 0100 0000 8944 2410 0fb6 8424 9c03
0000 8944 240c 8d83 987b ffff 8944 2408
8b83 f4ef ffff 8b00 8904 24e8 d0a0 ffff
e95c eaff ffe8 26a0 ffff 89c6 8b00 8944
2440 8b44 242c e805 cbff ff8b 83f0 0900
[...]
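
***NOTE: the sector number handed to hdparm is the last column of the first "read failure" row in the self-test log. A minimal sketch for pulling it out of the smartctl output instead of copying it by hand; the name first_error_lba is made up here, and the awk pattern assumes the log layout shown above:

```shell
#!/bin/sh
# first_error_lba (hypothetical helper): read `smartctl -a` output on stdin and
# print the LBA_of_first_error of the first self-test row with a read failure.
# Assumes the log layout shown above, where the LBA is the last column.
first_error_lba() {
    awk '/^# *[0-9]+ / && /read failure/ { print $NF; exit }'
}

# Usage (read-only, assumes smartctl is installed):
#   smartctl -a /dev/sdb | first_error_lba
```

It prints nothing when no row reports a read failure, so an empty result can be read as "nothing to fix in this pass".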
#hdparm --yes-i-know-what-i-am-doing --write-sector 4542656 /dev/sdb

/dev/sdb:
re-writing sector 4542656: succeeded

#hdparm --yes-i-know-what-i-am-doing --read-sector 4542656 /dev/sdb

/dev/sdb:
reading sector 4542656: succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

#hdparm --yes-i-know-what-i-am-doing --write-sector 4542657 /dev/sdb

/dev/sdb:
re-writing sector 4542657: succeeded

#hdparm --yes-i-know-what-i-am-doing --write-sector 4542658 /dev/sdb
***NOTE: 8 sectors because 4K internal, 512B blocks logical

/dev/sdb:
re-writing sector 4542658: succeeded

#hdparm --yes-i-know-what-i-am-doing --read-sector 4542664 /dev/sdb
***NOTE: try +1, errors often come in groups

/dev/sdb:
reading sector 4542664: SG_IO: bad/missing sense data, sb[]:  70 00 03 00 00 00 00 0a 00 51 e0 01 11 04 00 00 00 c8 00 00 00 00 00 00 00 00 00 00 00 00 00 00
succeeded
0000 0000 0000 0000 fd03 0000 b823 6ff4
307f b4f4 f006 e5f4 a897 b0f4 7074 ecf5
906d eef3 b0f5 3bf4 a88a 4af4 e0e4 7ef6

#smartctl -t select,4542656-max /dev/sdb
smartctl 6.5 2016-05-07 r4318 [i686-linux-4.7.9-100.fc23.i686+PAE] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Selective self-test routine immediately in off-line mode".
SPAN         STARTING_LBA           ENDING_LBA
   0              4542656           3907029167
Drive command "Execute SMART Selective self-test routine immediately in off-line mode" successful.
Testing has begun.
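
***NOTE: the group of 8 can be computed rather than counted by hand. A sketch, assuming 4K physical sectors exposed as 512B logical blocks and aligned to LBA 0, as noted for this drive (smartctl -i reports the logical/physical sector sizes for yours); block_range is a hypothetical helper that only prints numbers and touches no disk:

```shell
#!/bin/sh
# block_range (hypothetical helper): print the first and last 512B LBA of the
# 4K physical block that contains the given LBA. Pure arithmetic.
# Assumption: 8 logical sectors per physical sector, aligned to LBA 0.
block_range() {
    lba=$1
    first=$(( lba / 8 * 8 ))        # round down to a multiple of 8
    echo "$first $(( first + 7 ))"
}

block_range 4542656   # prints: 4542656 4542663
```

Every LBA_of_first_error reported in this session is already a multiple of 8, which fits that assumption; the next block to probe with --read-sector then starts at first + 8.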
#smartctl -a /dev/sdb | grep -PA10 'LBA_of_first_error|CURRENT_TEST_STATUS|Self-test execution status'
# 1  Selective offline   Completed: read failure       90%      8471         4542680

***NOTE: repeat procedure

***NOTE: script to speed up procedure, but not fully automated
#cat /usr/local/sbin/drive-fix-sector
#!/usr/bin/perl -w

$usage="Usage: drive-fix-sector <drive> <sector>\n";

# drive is given as a bare name (e.g. sdb), not as a /dev path
$drive=shift or die $usage;
$drive=~/^sd[a-z]$/ or die $usage;
{
    # refuse to touch a drive that still has a partition in an md array
    my $mdout=`cat /proc/mdstat`;
    # braces keep [0-9] a character class rather than an array subscript
    $mdout=~/\b${drive}[0-9]/ and die "I won't let you use this on a drive still in an md array";
}
$drive="/dev/$drive";
-e $drive or die $usage;
$beg=shift or die $usage;

# overwrites 9 sectors: $beg through $beg+8
for my $sec ($beg..$beg+8) {
    print "*** fixing $drive $sec";
    system "hdparm --yes-i-know-what-i-am-doing --write-sector $sec $drive";
}
# then read the next sector to see whether the bad run continues
print "subsequent sector test:\n";
$beg+=9;
system "hdparm --yes-i-know-what-i-am-doing --read-sector $beg $drive | head -5";

#drive-fix-sector sdb 4544000
*** fixing /dev/sdb 4544000
/dev/sdb:
re-writing sector 4544000: succeeded
*** fixing /dev/sdb 4544001
/dev/sdb:
re-writing sector 4544001: succeeded
*** fixing /dev/sdb 4544002
/dev/sdb:
re-writing sector 4544002: succeeded
*** fixing /dev/sdb 4544003
/dev/sdb:
re-writing sector 4544003: succeeded
*** fixing /dev/sdb 4544004
/dev/sdb:
re-writing sector 4544004: succeeded
*** fixing /dev/sdb 4544005
/dev/sdb:
re-writing sector 4544005: succeeded
*** fixing /dev/sdb 4544006
/dev/sdb:
re-writing sector 4544006: succeeded
*** fixing /dev/sdb 4544007
/dev/sdb:
re-writing sector 4544007: succeeded
*** fixing /dev/sdb 4544008
/dev/sdb:
re-writing sector 4544008: succeeded
subsequent sector test:

/dev/sdb:
reading sector 4544009: succeeded
5ac5 cc69 6c2d 0ed6 38e7 ae2b 0653 8d8e
afed d6b6 b381 2dca 5c41 dd50 e5bd eb2e

#smartctl -t select,4544000-max /dev/sdb
[...]
    1  761037360  3907029167  Self_test_in_progress [90% left] (761435592-761501127)
[...]
    1  761037360  3907029167  Self_test_in_progress [90% left] (770741544-770807079)

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed without error       00%      8476         -
# 2  Selective offline   Completed: read failure       90%      8472         761037360
# 3  Selective offline   Completed: read failure       90%      8472         4547144
 SPAN    MIN_LBA     MAX_LBA  CURRENT_TEST_STATUS
    1  761037360  3907029167  Not_testing
    2          0           0  Not_testing
[...]
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       14

#smartctl -t offline /dev/sdb
smartctl 6.5 2016-05-07 r4318 [i686-linux-4.7.9-100.fc23.i686+PAE] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART off-line routine immediately in off-line mode".
Drive command "Execute SMART off-line routine immediately in off-line mode" successful.
Testing has begun.
Please wait 21960 seconds for test to complete.
Test will complete after Tue Jan 24 06:21:15 2017

Use smartctl -X to abort test.
***NOTE: 6.1 hours
***NOTE: doesn't show the test running in the smartctl -a output

#smartctl -a /dev/sdb
[...]
194 Temperature_Celsius     0x0022   124   120   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0

#mdadm --add /dev/md4 /dev/sdb1

#cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md4 : active raid6 sdb1[9] sdi1[7] sdj1[6] sdh1[5] sdg1[4] sde1[1] sdd1[0] sdc1[3] sdf1[2]
      13671854784 blocks super 1.2 level 6, 64k chunk, algorithm 2 [9/8] [UUUUUUUU_]
      [=====>...............]
      recovery = 27.1% (530036224/1953122112) finish=2328.8min speed=10183K/sec

#cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md4 : active raid6 sdb1[9] sdi1[7] sdj1[6] sdh1[5] sdg1[4] sde1[1] sdd1[0] sdc1[3] sdf1[2]
      13671854784 blocks super 1.2 level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]

***NOTE: and no reboots or physical access required!
- does this apply to spinning rust (HDDs) only?
- only to drives with TLER?
- why isn't this automatic?
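
***NOTE: the check made before mdadm --add (attributes 197 and 198 both back to a raw value of 0) can be scripted. A sketch, assuming the attribute table layout shown earlier with the raw value in the last column; pending_counts and all_clear are hypothetical names:

```shell
#!/bin/sh
# pending_counts (hypothetical helper): read `smartctl -A` (or -a) output on
# stdin and print NAME=RAW_VALUE for attributes 197 and 198.
pending_counts() {
    awk '$1 == 197 || $1 == 198 { print $2 "=" $NF }'
}

# all_clear (hypothetical helper): succeed only if both attributes were found
# in the input and both raw values are 0.
all_clear() {
    pending_counts |
        awk -F= '{ n++; if ($2 != 0) bad = 1 } END { exit !(n == 2 && !bad) }'
}

# Usage (read-only, assumes smartctl is installed):
#   smartctl -A /dev/sdb | all_clear && echo "ok to re-add"
```

In this session 198 only dropped to 0 after the -t offline run had finished, so a failing all_clear right after the last sector rewrite does not by itself mean the repair failed.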