
#1 2018-09-02 09:28:36

johnraff
nullglob
From: Nagoya, Japan
Registered: 2015-09-09
Posts: 4,669
Website

Hard disk failure and SMART

[NOTE] This thread has been moved here from the ModZone because there's nothing private about it, and other members might be interested in the topic.
---

If I suddenly stop posting, you'll know why. Yesterday the box started running like frozen treacle, with continuous hard disk activity. Today it was virtually unable to boot, with various read/write and I/O errors, and filesystems being switched to read-only.

I managed to boot a live session off a USB stick, and was able to run a backup of sorts onto a plug-in drive. It took much longer than usual but the disk was at least still readable somehow.

Then I thought of booting into BL Hydrogen which still lives on a different (LVM) partition. Same disk of course, and again much activity at boot, but I was able to get a workable desktop, read my email, and log into BL. That's the system I'm running now. There were a couple of mails from the SMART daemon about more unreadable sectors than before, but sometimes those sectors get patched over. Right now the disk is quiet.

Tomorrow I'll try booting into Helium again, but won't be surprised if it's impossible, or even if Hydrogen is no longer usable. (At least it allowed me to back up some data.)

Anyway, I might have to go and look for a new computer, so don't be alarmed if I don't post for a while.

Last edited by johnraff (2018-09-07 00:37:47)


John
--------------------
( a boring Japan blog , Japan Links, idle twitterings  and GitStuff )
In case you forget, the rules.


#2 2018-09-02 10:34:36

twoion
ほやほや
Registered: 2015-08-10
Posts: 2,227

Re: Hard disk failure and SMART

Yeah, the disk can reallocate sectors from a spare sector pool as long as the number of defective sectors stays below the size of the spare pool.

Good job backing up your data, though just cloning your existing disk to a new one might be difficult depending on the state the disk is in… You might want to grab an SSD this time; the popular Samsung models in particular have gotten pretty cheap over the years.


In the green forest, where the thrush sings…


#3 2018-09-02 18:10:28

hhh
That's easy!
Registered: 2015-09-17
Posts: 6,092
Website

Re: Hard disk failure and SMART

bzzzt BZZZZZZTTTT

Not good.


#4 2018-09-03 02:01:43

johnraff
nullglob
From: Nagoya, Japan
Registered: 2015-09-09
Posts: 4,669
Website

Re: Hard disk failure and SMART

twoion wrote:

Yeah, the disk can reallocate sectors from a spare sector pool as long as the number of defective sectors stays below the size of the spare pool.

That must be what happened, because it booted more-or-less OK today. Took the chance to redo yesterday's backup properly, but noticed that parts of it went very slowly, suggesting that sometimes the disk is still hitting bad zones. What seems to be happening is that the read/write instructions are being repeated endlessly until they eventually go through. Maybe those bad disk sectors are also being reallocated at that time?

Where is this spare sector pool? Can I check its size? My HD does have several hundred GB of unused space, with an LVM partition setup.

twoion wrote:

Good job backing up your data, though just cloning your existing disk to a new one might be difficult depending on the state the disk is in…

That I wouldn't bother with, and anyway I have no disk big enough. I just back up my personal data along with all the config files. It makes the reinstall take longer of course.

twoion wrote:

You might want to grab an SSD this time; the popular Samsung models in particular have gotten pretty cheap over the years.

That might be nice, indeed. Very often the bottleneck to some operation is disk I/O, with the CPU sitting idly waiting.

But my current 4GB of RAM is a bit tight too (especially when running a VM), so I'm going to look for a "new" machine. The current one was second-hand from the start, and the graphics card is also quite old, so an all-round hardware update might be in order if I can find something for an affordable price.



#5 2018-09-03 19:36:08

twoion
ほやほや
Registered: 2015-08-10
Posts: 2,227

Re: Hard disk failure and SMART

johnraff wrote:

Where is this spare sector pool? Can I check its size? My HD does have several hundred GB of unused space, with an LVM partition setup.

It's usually part of the SMART response, albeit indirectly (how it is reported depends strongly on the vendor). Run smartctl -a /dev/yourdisk (or try -x for fun):

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   034    Pre-fail  Always       -       221964840
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       364
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0 <-------------------------- here are the reallocated sectors
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       4307638889
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2012 (2 202 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       364
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       5
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   057   046   045    Old_age   Always       -       43 (Min/Max 28/43)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       459
194 Temperature_Celsius     0x0022   043   054   000    Old_age   Always       -       43 (0 16 0 0 0)
196 Reallocated_Event_Count 0x000f   098   098   030    Pre-fail  Always       -       2029 (39169 0) <-------------- number of reallocation events
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0 <-------------- If this is non-zero, this many sectors are currently unreadable and still waiting to be rewritten or reallocated!
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0 <-------------- Count of sectors found uncorrectable during offline scans
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

It's normal to have some reallocation events because of manufacturing tolerances.

The above numbers are from my old Seagate laptop SSHD. The disk still works completely fine. Don't be thrown by the high raw values for Raw_Read_Error_Rate etc.; on Seagate drives you need some additional magic to extract meaningful data from the values, for example -v 1,hex48.

It's a complex topic.
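As a rough illustration of reading the table above programmatically, here is a minimal Python sketch. It assumes the attribute table has been saved as text (for example from smartctl -A), and the names parse_attributes, warnings and WATCHED are made up for this example, not part of smartmontools. It only looks at the leading integer of RAW_VALUE, which, as noted, is not meaningful for every vendor or attribute.

```python
# Hypothetical helper names; only the table layout comes from smartctl output.
WATCHED = {
    5: "reallocated sectors",
    197: "sectors pending reallocation",
    198: "sectors found uncorrectable in offline scans",
}


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for rows whose RAW_VALUE starts with an integer."""
    raw_col = None
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            raw_col = None  # a blank line ends the attribute table
            continue
        if fields[0] == "ID#":
            # Works for both the -a layout (10 columns) and the -x layout (8 columns).
            raw_col = fields.index("RAW_VALUE")
            continue
        if raw_col is None or not fields[0].isdigit() or len(fields) <= raw_col:
            continue
        raw = fields[raw_col]
        if raw.isdigit():
            attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


def warnings(attrs):
    """List the watched attributes whose raw count is above zero."""
    out = []
    for attr_id, meaning in WATCHED.items():
        if attr_id in attrs and attrs[attr_id][1] > 0:
            name, raw = attrs[attr_id]
            out.append(f"{name} = {raw} ({meaning})")
    return out


# Rows copied from the table above (annotations removed).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   057   046   045    Old_age   Always       -       43 (Min/Max 28/43)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    for message in warnings(parse_attributes(SAMPLE)) or ["nothing flagged"]:
        print(message)
```

For this healthy disk nothing is flagged; a non-zero count in any of the three watched attributes would be listed instead.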



#6 2018-09-03 19:38:01

twoion
ほやほや
Registered: 2015-08-10
Posts: 2,227

Re: Hard disk failure and SMART

To add: after backing up, you could initiate a SMART short or long self-test (smartctl -t short|long /dev/yourdisk). The result will be reported in the smartctl -x output like so:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       931         -
# 2  Short offline       Completed without error       00%       744         -
# 3  Vendor (0x50)       Aborted by host               90%       580         -
# 4  Short offline       Completed without error       00%       580         -
# 5  Extended offline    Completed without error       00%         4         -
# 6  Conveyance offline  Completed without error       00%         3         -
# 7  Short offline       Completed without error       00%         2         -
# 8  Short offline       Completed without error       00%         2         -

(Unfortunately, I don't own a single disk that went bad completely. :-) )
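If you want to check a self-test log like the one above from a script, something along these lines might do. It is only a sketch: it assumes the column layout shown above (fields separated by two or more spaces), and the names ROW, parse_selftest_log and not_clean are invented for this example.

```python
import re

# One log row: "# N  <test>  <status>  NN%  <hours>  <LBA or ->"
ROW = re.compile(
    r"#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)"
)


def parse_selftest_log(text):
    """Return one dict of strings per self-test row; other lines are ignored."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def not_clean(entries):
    """Entries that did not finish with 'Completed without error'."""
    return [e for e in entries if e["status"] != "Completed without error"]


# First rows of the log above.
SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       931         -
# 2  Short offline       Completed without error       00%       744         -
# 3  Vendor (0x50)       Aborted by host               90%       580         -
# 4  Short offline       Completed without error       00%       580         -
"""

if __name__ == "__main__":
    for entry in not_clean(parse_selftest_log(SAMPLE_LOG)):
        print(entry["num"], entry["test"], "->", entry["status"])
```

An aborted test is not a failure in itself, so treat the result as a list of rows worth a look rather than a verdict; a real read failure would also show an address in the LBA_of_first_error column.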

The output also depends strongly on the smartctl version, besides the vendor and device firmware. For reference, here's the output for my current SSD, to show what is possible; you may find useful device statistics further down too:

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.14.67-1-lts] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     INTEL SSDSC2KW512G8
Serial Number:    <CENSORED>
LU WWN Device Id: 5 5cd2e4 14ecb1559
Firmware Version: LHF002C
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep  3 21:39:28 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Write SCT (Get) Feature Control Command failed: scsi error badly formed scsi parameters
Wt Cache Reorder: Unknown (SCT Feature Control command failed)

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  30) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    601
 12 Power_Cycle_Count       -O--CK   100   100   000    -    503
170 Unknown_Attribute       PO--CK   100   100   010    -    0
171 Unknown_Attribute       -O--CK   100   100   000    -    0
172 Unknown_Attribute       -O--CK   100   100   000    -    0
173 Unknown_Attribute       PO--CK   100   100   005    -    12886147072
174 Unknown_Attribute       -O--CK   100   100   000    -    3
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   036   049   000    -    36 (Min/Max 16/49)
192 Power-Off_Retract_Count -O--CK   100   100   000    -    3
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
225 Unknown_SSD_Attribute   -O--CK   100   100   000    -    89719
226 Unknown_SSD_Attribute   -O--CK   100   100   000    -    0
227 Unknown_SSD_Attribute   -O--CK   100   100   000    -    0
228 Power-off_Retract_Count -O--CK   100   100   000    -    0
232 Available_Reservd_Space PO--CK   100   100   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
236 Unknown_Attribute       -O--CK   100   100   000    -    0
241 Total_LBAs_Written      -O--CK   100   100   000    -    89719
242 Total_LBAs_Read         -O--CK   100   100   000    -    34874
249 Unknown_Attribute       -O--CK   100   100   000    -    1902
252 Unknown_Attribute       -O--CK   100   100   000    -    3
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01       GPL,SL  R/O      1  Summary SMART error log
0x02       GPL,SL  R/O      1  Comprehensive SMART error log
0x03       GPL,SL  R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06       GPL,SL  R/O      1  SMART self-test log
0x07       GPL,SL  R/O      1  Extended self-test log
0x09       GPL,SL  R/W      1  Selective self-test log
0x10       GPL,SL  R/O      1  NCQ Command Error log
0x11       GPL,SL  R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xdf       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       436         -
# 2  Short offline       Completed without error       00%       295         -
# 3  Short offline       Completed without error       00%       229         -
# 4  Short offline       Completed without error       00%        58         -
# 5  Short offline       Completed without error       00%        29         -
# 6  Short offline       Completed without error       00%        21         -
# 7  Extended offline    Completed without error       00%         0         -
# 8  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       0 (0x0000)
SCT Support Level:                   0
Device State:                        Active (0)
Current Temperature:                    44 Celsius
Power Cycle Min/Max Temperature:     29/44 Celsius
Lifetime    Min/Max Temperature:     22/58 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/100 Celsius
Min/Max Temperature Limit:            0/100 Celsius
Temperature History Size (Index):    128 (40)

Index    Estimated Time   Temperature Celsius
  41    2018-09-03 19:32    43  ************************
  42    2018-09-03 19:33    43  ************************
  43    2018-09-03 19:34    29  **********
  44    2018-09-03 19:35    29  **********
  45    2018-09-03 19:36    27  ********
  46    2018-09-03 19:37    29  **********
  47    2018-09-03 19:38    29  **********
  48    2018-09-03 19:39    28  *********
  49    2018-09-03 19:40    30  ***********
  50    2018-09-03 19:41    30  ***********
  51    2018-09-03 19:42    30  ***********
  52    2018-09-03 19:43    29  **********
  53    2018-09-03 19:44    30  ***********
  54    2018-09-03 19:45    31  ************
  55    2018-09-03 19:46    31  ************
  56    2018-09-03 19:47    31  ************
  57    2018-09-03 19:48    26  *******
  58    2018-09-03 19:49    23  ****
  59    2018-09-03 19:50    24  *****
  60    2018-09-03 19:51    24  *****
  61    2018-09-03 19:52    28  *********
  62    2018-09-03 19:53    30  ***********
  63    2018-09-03 19:54    30  ***********
  64    2018-09-03 19:55    30  ***********
  65    2018-09-03 19:56    31  ************
  66    2018-09-03 19:57    29  **********
  67    2018-09-03 19:58    30  ***********
  68    2018-09-03 19:59    35  ****************
  69    2018-09-03 20:00    35  ****************
  70    2018-09-03 20:01    38  *******************
  71    2018-09-03 20:02    40  *********************
  72    2018-09-03 20:03    40  *********************
  73    2018-09-03 20:04    41  **********************
  74    2018-09-03 20:05    42  ***********************
  75    2018-09-03 20:06    42  ***********************
  76    2018-09-03 20:07    33  **************
  77    2018-09-03 20:08    39  ********************
  78    2018-09-03 20:09    41  **********************
  79    2018-09-03 20:10    42  ***********************
  80    2018-09-03 20:11    43  ************************
  81    2018-09-03 20:12    44  *************************
  82    2018-09-03 20:13    44  *************************
  83    2018-09-03 20:14    43  ************************
  84    2018-09-03 20:15    44  *************************
  85    2018-09-03 20:16    44  *************************
  86    2018-09-03 20:17    44  *************************
  87    2018-09-03 20:18    43  ************************
  88    2018-09-03 20:19    43  ************************
  89    2018-09-03 20:20    43  ************************
  90    2018-09-03 20:21    44  *************************
  91    2018-09-03 20:22    43  ************************
  92    2018-09-03 20:23    43  ************************
  93    2018-09-03 20:24    42  ***********************
  94    2018-09-03 20:25    40  *********************
  95    2018-09-03 20:26    29  **********
  96    2018-09-03 20:27    34  ***************
  97    2018-09-03 20:28    36  *****************
  98    2018-09-03 20:29    40  *********************
  99    2018-09-03 20:30    40  *********************
 100    2018-09-03 20:31    41  **********************
 101    2018-09-03 20:32    42  ***********************
 102    2018-09-03 20:33    42  ***********************
 103    2018-09-03 20:34    42  ***********************
 104    2018-09-03 20:35    43  ************************
 105    2018-09-03 20:36    42  ***********************
 106    2018-09-03 20:37    43  ************************
 107    2018-09-03 20:38    43  ************************
 108    2018-09-03 20:39    42  ***********************
 109    2018-09-03 20:40    43  ************************
 110    2018-09-03 20:41    43  ************************
 111    2018-09-03 20:42    43  ************************
 112    2018-09-03 20:43    44  *************************
 113    2018-09-03 20:44    42  ***********************
 114    2018-09-03 20:45    42  ***********************
 115    2018-09-03 20:46    43  ************************
 116    2018-09-03 20:47    42  ***********************
 117    2018-09-03 20:48    30  ***********
 118    2018-09-03 20:49    28  *********
 119    2018-09-03 20:50    37  ******************
 120    2018-09-03 20:51    40  *********************
 121    2018-09-03 20:52    39  ********************
 122    2018-09-03 20:53    40  *********************
 ...    ..(  2 skipped).    ..  *********************
 125    2018-09-03 20:56    40  *********************
 126    2018-09-03 20:57    41  **********************
 127    2018-09-03 20:58    40  *********************
   0    2018-09-03 20:59    40  *********************
   1    2018-09-03 21:00    41  **********************
   2    2018-09-03 21:01    42  ***********************
   3    2018-09-03 21:02    42  ***********************
   4    2018-09-03 21:03    43  ************************
   5    2018-09-03 21:04    42  ***********************
 ...    ..(  2 skipped).    ..  ***********************
   8    2018-09-03 21:07    42  ***********************
   9    2018-09-03 21:08    43  ************************
  10    2018-09-03 21:09    42  ***********************
 ...    ..(  4 skipped).    ..  ***********************
  15    2018-09-03 21:14    42  ***********************
  16    2018-09-03 21:15    41  **********************
  17    2018-09-03 21:16    41  **********************
  18    2018-09-03 21:17    28  *********
  19    2018-09-03 21:18    34  ***************
  20    2018-09-03 21:19    37  ******************
  21    2018-09-03 21:20    37  ******************
  22    2018-09-03 21:21    38  *******************
  23    2018-09-03 21:22    40  *********************
  24    2018-09-03 21:23    40  *********************
  25    2018-09-03 21:24    31  ************
  26    2018-09-03 21:25    36  *****************
  27    2018-09-03 21:26    29  **********
  28    2018-09-03 21:27    36  *****************
  29    2018-09-03 21:28    37  ******************
  30    2018-09-03 21:29    38  *******************
  31    2018-09-03 21:30    40  *********************
  32    2018-09-03 21:31    39  ********************
  33    2018-09-03 21:32    39  ********************
  34    2018-09-03 21:33    41  **********************
  35    2018-09-03 21:34    42  ***********************
  36    2018-09-03 21:35    43  ************************
  37    2018-09-03 21:36    43  ************************
  38    2018-09-03 21:37    44  *************************
  39    2018-09-03 21:38    44  *************************
  40    2018-09-03 21:39    43  ************************

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             503  ---  Lifetime Power-On Resets
0x01  0x010  4             601  ---  Power-on Hours
0x01  0x018  6      5879878807  ---  Logical Sectors Written
0x01  0x020  6        55985229  ---  Number of Write Commands
0x01  0x028  6      2285546460  ---  Logical Sectors Read
0x01  0x030  6        45397799  ---  Number of Read Commands
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               3  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              44  ---  Current Temperature
0x05  0x010  1               -  ---  Average Short Term Temperature
0x05  0x018  1               -  ---  Average Long Term Temperature
0x05  0x020  1              40  ---  Highest Temperature
0x05  0x028  1              33  ---  Lowest Temperature
0x05  0x030  1               -  ---  Highest Average Short Term Temperature
0x05  0x038  1               -  ---  Lowest Average Short Term Temperature
0x05  0x040  1               -  ---  Highest Average Long Term Temperature
0x05  0x048  1               -  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              85  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4            1688  ---  Number of Hardware Resets
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               0  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0005  2            0  R_ERR response for non-data FIS
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET


#7 2018-09-04 06:05:39

johnraff
nullglob
From: Nagoya, Japan
Registered: 2015-09-09
Posts: 4,669
Website

Re: Hard disk failure and SMART

@twoion many thanks!

The output (after a short self-test) doesn't look too bad (snippets):

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K1000.C
Device Model:     Hitachi HDS721010CLA332
Serial Number:    <CENSORED>
LU WWN Device Id: 5 000cca 373d32104
Firmware Version: JP4OA3EA
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue Sep  4 14:47:46 2018 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   086   086   005    -    396
196 Reallocated_Event_Count -O--CK   087   087   000    -    413
197 Current_Pending_Sector  -O---K   100   100   000    -    34
198 Offline_Uncorrectable   ---R--   100   100   000    -    0

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     11976         -
# 2  Short offline       Completed without error       00%     11976         -
# 3  Short offline       Completed without error       00%      9087         -
# 4  Short offline       Completed without error       00%      1782         -
# 5  Short offline       Completed without error       00%      1076         -

So it's saying that the disk is now OK!
For about the past year I was getting daily email messages with "3 Currently unreadable (pending) sectors".
Yesterday when the whole thing seemed to have crashed, it went up to 42, and now is reporting 34, which is way up on 3 of course, with 396 reallocated sectors.
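One possible reason the overall result still says PASSED: smartctl only counts an attribute as failed once its normalized VALUE has dropped to THRESH or below, whatever the raw count is. A small Python sketch using the Reallocated_Sector_Ct row from the output above; the function names are made up for illustration, and how a vendor maps raw counts to the normalized value is not something this shows.

```python
def attribute_failed(value, thresh):
    """An attribute counts as failed once the normalized VALUE is <= THRESH."""
    return value <= thresh


def headroom(value, thresh):
    """How many normalized points are left before the threshold is reached."""
    return value - thresh


# Row copied from the smartctl -x output above.
ROW = "  5 Reallocated_Sector_Ct   PO--CK   086   086   005    -    396"

fields = ROW.split()
value, worst, thresh = (int(f) for f in fields[3:6])  # 86, 86, 5

if __name__ == "__main__":
    print(fields[1], "failed:", attribute_failed(value, thresh))
    print("points above threshold:", headroom(value, thresh))
```

So 396 reallocated sectors have only moved the normalized value from 100 to 86, still well above the threshold of 5, which is why the health check has not tripped even though the trend is clearly bad.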

I'm amazed, to be honest, at the ability of the drive to heal itself like this. I'm sure that wouldn't have happened a few years ago.

Anyway, I'll take it as a warning that catastrophic failure could come any time, and go and get a replacement box ASAP. (Not today, though, as a powerful typhoon is passing close by.)



#8 2018-09-04 07:56:12

twoion
ほやほや
Registered: 2015-08-10
Posts: 2,227

Re: Hard disk failure and SMART

Yeah, with there still being pending sectors, replacing the drive soon sounds like a good idea.

Note that only a long self-test delivers reliable results, so if you launch one, the drive's numbers will likely get worse still.

Don't forget to run a long self-test on your new drive after purchasing, and perhaps run a file system benchmark such as sysbench's on it a couple of times. Then check the kernel log/syslog for I/O errors and the SMART log for suspicious items before committing the drive to active use, just to rule out that it's a lemon.
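One way to make "check the SMART log for suspicious items" concrete is to record the raw error counters before and after the burn-in and see whether any of them grew. A minimal Python sketch, assuming you have already collected the raw values into dicts keyed by attribute ID; the function name increased, the default watch list and the numbers in the example are all illustrative, not measurements from a real drive.

```python
def increased(before, after, watch=(5, 187, 196, 197, 198, 199)):
    """Return {id: (old, new)} for watched attributes whose raw value grew.

    The default IDs are Reallocated_Sector_Ct, Reported_Uncorrect,
    Reallocated_Event_Count, Current_Pending_Sector, Offline_Uncorrectable
    and UDMA_CRC_Error_Count. An attribute missing from `before` counts as 0.
    """
    return {
        attr_id: (before.get(attr_id, 0), after[attr_id])
        for attr_id in watch
        if attr_id in after and after[attr_id] > before.get(attr_id, 0)
    }


if __name__ == "__main__":
    # Made-up snapshots of raw values, taken before and after a burn-in run.
    before = {5: 0, 9: 2, 197: 0, 198: 0, 199: 0}
    after = {5: 0, 9: 14, 197: 2, 198: 0, 199: 0}
    for attr_id, (old, new) in increased(before, after).items():
        print(f"attribute {attr_id}: {old} -> {new}")
```

Power_On_Hours (9) is left out of the watch list on purpose, since it always grows. On a brand-new drive, any growth in the watched counters during burn-in would be a reason to consider returning it.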



#9 2018-09-04 19:40:58

hhh
That's easy!
Registered: 2015-09-17
Posts: 6,092
Website

Re: Hard disk failure and SMART

@twoion, what do you recommend to a layman for the disk format? I've always used ext3 or ext4. I've never tried btrfs.


#10 2018-09-04 20:15:25

twoion
ほやほや
Registered: 2015-08-10
Posts: 2,227

Re: Hard disk failure and SMART

hhh wrote:

@Jens, what do you recommend to a layman for the disk format? I've always used ext3 or ext4. I've never tried btrfs.

For /home (=your important private data), I recommend ext4 with the data=journal mount option. Who cares about speed when it's your important stuff? XFSv5 may also be acceptable, but boring is good.

For the rootfs, use ext4 on HDD and xfsv4 or xfsv5 on SSD.

For media collections (images, videos, PDF libraries, ebooks, …) I recommend xfsv4/v5 in any case.

I'd never use btrfs on my /home.

Use LVM for everything except the EFI partition. It also supports sparse volumes and RO and COW snapshots.

Use LVM on LUKS if you want an encrypted system.

Use btrfs only if you need to implement specific system solutions. For example, Docker and LXD also have btrfs backends which enable specific features. It is a solution; do not use a solution if you don't have the problem. Use btrfs only if your system is built around it. E.g. you can use `snapper` to make snapshots of btrfs volumes (but also LVM) to roll back bad upgrades later. GRUB2 can also boot from btrfs snapshots, making them extra useful. btrfs is cool, like ZFS, but you have to be careful about managing and deploying it.

Above all, K-I-S-S, on the server as on the desktop: Rollback via btrfs is nice and all, but is only useful if you're rolling back transactions (=problems the user caused by making the computer do something). This is nice as a first guard: for example, you can have your computer set up to take btrfs snapshots using a timer, and then when you rm -rf your important data one day, you can roll back to the last checkpoint and/or restore an offsite backup. Against data loss because of hardware failure, as always, just back your stuff up, all day, every day. To do everything right, even when using btrfs, you should still have a backup strategy.

(Not an expert.)



Offline

#11 2018-09-04 20:21:35

hhh
That's easy!
Registered: 2015-09-17
Posts: 6,092
Website

Re: Hard disk failure and SMART

Sounded pretty damn expert to me. :)

Online

#12 2018-09-04 20:26:11

hhh
That's easy!
Registered: 2015-09-17
Posts: 6,092
Website

Re: Hard disk failure and SMART

Why XFS for critical media? Arch Wiki says you can manually run a data-corruption tool, I'll guess that's the reason...

https://wiki.archlinux.org/index.php/XF … corruption

Last edited by hhh (2018-09-04 20:27:39)

Online

#13 2018-09-06 10:17:49

earlybird
ほやほや
Registered: 2015-12-16
Posts: 603
Website

Re: Hard disk failure and SMART

hhh wrote:

Why XFS for critical media? Arch Wiki says you can manually run a data-corruption tool, I'll guess that's the reason...

https://wiki.archlinux.org/index.php/XF … corruption

To clarify, XFS can be vastly faster on SSD than ext4 (for big databases on big servers, anyway, so YMMV).

XFS is also faster (in my experience, due to its allocation behaviour) for not-so-small files (let's say, files of 1M or bigger), which usually means media files. (Media, for me, are video/audio/PDF libraries/ebooks which a) I did not create myself, b) I could easily download again should I lose them all, and c) wouldn't matter even if I did lose them all. I still back them up though.)

So, 'media' for me is 'absolutely not critical files'.

Anyway, when making new XFS file systems, there's no reason not to use XFSv5 instead of v4 (which is still the default), see https://wiki.archlinux.org/index.php/XFS#Integrity.
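
For what it's worth, here is a small sketch for checking which format an existing XFS file system uses. It assumes `xfs_info` prints a `crc=` field (v5 means metadata checksums are on, i.e. `crc=1`), and `xfs_format` is just an illustrative helper name. As far as I know, newer xfsprogs releases already enable `crc=1` by default, so it's worth checking what your version does.

```shell
#!/bin/sh
# Illustrative helper: read `xfs_info` output on stdin and report the format.
# v5 file systems have metadata checksums enabled (crc=1).
xfs_format() {
    if grep -q 'crc=1'; then
        echo v5
    else
        echo v4
    fi
}

# Creating a v5 file system explicitly (DESTROYS data on that partition):
#   mkfs.xfs -m crc=1 /dev/sdXN
# Checking a mounted one:
#   xfs_info /mountpoint | xfs_format

# Demo with a single line roughly as xfs_info prints it:
echo '         =                       crc=1        finobt=1' | xfs_format
```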

Offline

#14 2018-09-07 01:37:37

johnraff
nullglob
From: Nagoya, Japan
Registered: 2015-09-09
Posts: 4,669
Website

Re: Hard disk failure and SMART

This disk is no more... it has ceased to be... it is an EX-disk.
(Bangs on counter.) "Wake up HDS721010CLA332!!"

No, it finally snuffed it. I'm posting from a live session, so the rest of the computer still works, but plenty of error messages come up during the boot. Thunar actually displays some of the partitions, but when I tried to open one it went into an endless spin, and the disk light just went, and stayed, on. (Now after yet another reboot the disk seems to be quiet.)

So now I'm scouring the web for the best deal in 2nd hand boxes with 8GB RAM, an SSD + hard disk to bring total space to 1TB+, and an i5 or i7 CPU. It looks like something around 40,000 yen ($350~400), but I was only halfway through checking when the disk finally gave up.

Still have a question, though: smartctl is available in the live system :cool: and is able to read the disk data.

snippet:

user@debian:~$ sudo smartctl -x /dev/sda
=== START OF INFORMATION SECTION ===
<same as above, except:>
Write cache is:   Disabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!  << THIS
Drive failure expected in less than 24 hours. SAVE ALL DATA.

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   016   016   016    NOW  4294932929
  5 Reallocated_Sector_Ct   PO--CK   001   001   005    NOW  1667
196 Reallocated_Event_Count -O--CK   008   008   000    -    1844
197 Current_Pending_Sector  -O---K   091   091   000    -    358

Is there anything that can be done from the live session that might make the disk bootable again, or at least mountable? (I don't know how much point there would be in doing that though, really...)

Take-home message: take those SMART daemon warnings seriously! The drive can patch up bad sectors, but I think I was breaking more and more places by attempting to go on using the system as normal. In particular, booting up a virtual system, which calls for a lot of disk activity, was the straw that broke the camel's back. So when the amber light comes on, back up everything and look for a new disk.
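
The FAIL column in the snippet above can also be worked out mechanically: an attribute counts as failed once its normalized VALUE drops to or below THRESH (a THRESH of 000 means there is no threshold). A minimal sketch, where `failing_attrs` is a made-up helper name and the sample rows are copied from the output above:

```shell
#!/bin/sh
# List SMART attributes whose normalized VALUE is at or below THRESH.
# Reads a smartctl attribute table on stdin; column 4 is VALUE, column 6 THRESH.
# Rows with THRESH 000 are skipped, since they can never trip.
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 { print $2 }'
}

# Sample rows from the dying HDS721010CLA332:
failing_attrs <<'EOF'
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   016   016   016    NOW  4294932929
  5 Reallocated_Sector_Ct   PO--CK   001   001   005    NOW  1667
196 Reallocated_Event_Count -O--CK   008   008   000    -    1844
197 Current_Pending_Sector  -O---K   091   091   000    -    358
EOF
```

For the sample rows this prints Raw_Read_Error_Rate and Reallocated_Sector_Ct, the two marked NOW. On a live system the same filter could be fed with `sudo smartctl -A /dev/sda | failing_attrs`.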



Offline

#15 2018-09-07 03:16:18

hhh
That's easy!
Registered: 2015-09-17
Posts: 6,092
Website

Re: Hard disk failure and SMART

johnraff wrote:

This disk is no more... it has ceased to be... it is an EX-disk.
(Bangs on counter.) "Wake up HDS721010CLA332!!"

Bereft of life, it Rests In Piece. If it wasn't screwed to the laptop, it would be pushing up the daisies.

Online

#16 2018-09-07 04:33:28

ohnonot
...again
Registered: 2015-09-29
Posts: 3,190
Website

Re: Hard disk failure and SMART

i only diagonal-read this, but why do you need a whole new box?

i'm a sucker for recycling, even electronics.
recently i had a closer s.m.a.r.t. look at my hard drives (2 of the big ones, one laptop-sized) and was appalled at how little lifetime was left.
it also emerged that my main drive was the slowest of them all!

went to the store and got the cheapest SSD they had. 40€ for 120GB (WD green iirc).
dd'd my / to it. no further changes, except for adjusting UUIDs in fstab. ext4 as before.
the difference was (still is) amazing. boots in a few seconds. graphical desktop comes up immediately (that's what took the longest before, even though it's just openbox).
i should've done this much earlier.
nowadays SSDs are making computers fast, not CPUs.
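
for anyone wanting to try the same, a rough sketch of what such a clone might look like. the exact commands weren't posted, so this is a guess: it assumes an ext4 root partition, a target partition at least as big, and that everything is run as root from a live session with neither partition mounted. `/dev/sdX1` and `/dev/sdY1` are placeholders, `clone_root` and `run` are made-up helper names, and by default the script only prints what it would do:

```shell
#!/bin/sh
# Dry-run wrapper: prints each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: copy an ext4 root partition onto a new SSD partition.
# $1 = source partition, $2 = target partition (at least as large as source).
clone_root() {
    src=$1
    dst=$2
    run dd if="$src" of="$dst" bs=4M conv=fsync status=progress &&
    run e2fsck -f "$dst" &&          # check the copy before changing it
    run tune2fs -U random "$dst" &&  # dd copies the UUID too; give the copy its own
    run resize2fs "$dst" &&          # grow the file system if the target is bigger
    run blkid "$dst"                 # shows the UUID to put into /etc/fstab
}

clone_root /dev/sdX1 /dev/sdY1
```

giving the copy a fresh UUID only matters if the old drive stays in the box; otherwise two file systems would answer to the same UUID in fstab.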

we will see how long it lasts, being the cheapest and not samsung.
of course i'm making full backups to an external drive.

PS:
i still have the other hard drives, altogether it comes up to ~700GB right inside the box, plus another ~700GB on my server.
i have no idea how this could ever get filled up.
i delete movies after watching.

Last edited by ohnonot (2018-09-07 04:37:54)

Offline

#17 2018-09-08 04:06:39

johnraff
nullglob
From: Nagoya, Japan
Registered: 2015-09-09
Posts: 4,669
Website

Re: Hard disk failure and SMART

^Yes I gave serious thought to just replacing the drive, but the rest of the hardware is getting old too. In particular it would need some more RAM (4GB is barely enough after starting up a VM and opening some browser tabs), the GPU is classified "heritage" or something, and one of the fans doesn't work, so sometimes in summer overheating causes freezes. A 125GB SSD alone would not be enough, not because I keep movies around but because of iso files (some irreplaceable), git repos (debian-installer alone is several GB), virtual machines and the like. A lot of that is downloadable on demand, but having the code locally makes grep, find &co. much faster.

True, there was still some free space on my 1TB disk, but the 500GB /data partition was getting a bit tight. So I'd have to buy much more than 125GB of SSD, or add a hard drive to go with it. Along with the 4GB extra RAM and fixing the fan, a newer machine seemed a better deal, since the motherboard and everything else will be newer too.

I will however take your hint and look for a cheap SSD for the old machine. It would then be a perfectly usable computer - although not for my main workstation. Then to think of a sensible use to put it to...



Offline

#18 2018-09-08 08:46:24

Hyacinth
Member
Registered: 2018-03-26
Posts: 14

Re: Hard disk failure and SMART

The Monty Python references are absolutely hilarious!

I looked online at how much it costs to bring new life to it and found https://www.amazon.co.jp/Samsung-2-5インチ … B0796B3GL6

8000 yen for 250 GB is so cheap! When I bought an SSD you could maybe get a used 32 GB one for that money. But it’s still a fair sum for something you are not likely to use anymore. Maybe you can find a second hand SSD? They don’t die ever, I think. My fiancé is using one from ancient times daily. I don’t think the company that made it exists today, even! That was before Samsung was in the market. Even the one I have in my desktop computer is an Intel one from Q2 2012 that the Intel Windows tool says is in peak condition still.

Glad you made a backup and that Bunsenlabs has such an excellent live session now, and good luck on looking for a new computer!

Offline

#19 2018-09-10 08:44:40

johnraff
nullglob
From: Nagoya, Japan
Registered: 2015-09-09
Posts: 4,669
Website

Re: Hard disk failure and SMART

^Hey thanks!
The machine I picked should arrive today.
39,999yen, HP ProDesk 600, i5-4570 core, 120GB SSD+2TB HD, 8GB RAM, not to mention Windows 10. :rolleyes: I suppose with 2TB there's room to keep that around, though I haven't touched Windows for years. Anyway, we'll see how it all works out...



Offline

#20 2018-09-10 13:04:32

Jimbo_G
Member
From: France
Registered: 2017-05-12
Posts: 76

Re: Hard disk failure and SMART

^ If you haven't used Windows for a few years, Windows 10 might come as a bit of a shock... It would be interesting to see what you think of it though!

Offline

#21 2018-09-11 12:35:34

earlybird
ほやほや
Registered: 2015-12-16
Posts: 603
Website

Re: Hard disk failure and SMART

johnraff wrote:

^Hey thanks!
The machine I picked should arrive today.
39,999yen, HP ProDesk 600, i5-4570 core, 120GB SSD+2TB HD, 8GB RAM, not to mention Windows 10. :rolleyes: I suppose with 2TB there's room to keep that around, though I haven't touched Windows for years. Anyway, we'll see how it all works out...

Now your computer's raw power is twice that of mine!

Offline

#22 2018-09-11 15:41:41

Hyacinth
Member
Registered: 2018-03-26
Posts: 14

Re: Hard disk failure and SMART

Hey that’s the CPU I almost have. Just a little faster! Does a great job for me. Enjoy it!

Offline

#23 2018-09-12 07:49:51

dbvolvox
Member
Registered: 2015-09-29
Posts: 49

Re: Hard disk failure and SMART

johnraff wrote:

^Hey thanks!
The machine I picked should arrive today.
39,999yen, HP ProDesk 600, i5-4570 core, 120GB SSD+2TB HD, 8GB RAM, not to mention Windows 10. :rolleyes: I suppose with 2TB there's room to keep that around, though I haven't touched Windows for years. Anyway, we'll see how it all works out...

Good luck! I was going to do something similar but found I couldn't even turn the machine on without having to accept all the MS T&C so went straight to an install that wiped W10.

Offline

#24 2018-09-12 08:32:54

johnraff
nullglob
From: Nagoya, Japan
Registered: 2015-09-09
Posts: 4,669
Website

Re: Hard disk failure and SMART

Jimbo_G wrote:

^ If you haven't used Windows for a few years, Windows 10 might come as a bit of a shock... It would be interesting to see what you think of it though!

The last Windows I used was W98. Booted up XP a couple of times, that's it, so W10 was... sort of what I expected. The inscrutable error messages after a process has run for 5 min., mysteriously fixed the next time, sudden reboots without warning: some things don't change.

I was thinking of just wiping it right off both drives (SSD and HD) after making an installer, just because, well, I paid for it. Anyway, I tried reinstalling (a somewhat long, convoluted process, with much googling and downloading) and managed to put Windows on the hard disk, leaving the SSD free, without having to set up a Microsoft account. It boots up OK after all that, but it was a tiring day and installing BL to the SSD (with big data on the HD) will have to wait till tomorrow. I hope W10 doesn't do an update down the road and wipe the whole hard drive. I might change my mind and delete it anyway.

BTW does anyone use LVM on drives as small as this 120GB SSD?



Offline

#25 2018-09-12 09:22:33

twoion
ほやほや
Registered: 2015-08-10
Posts: 2,227

Re: Hard disk failure and SMART

johnraff wrote:

BTW does anyone use LVM on drives as small as this 120GB SSD?

LVM is a win on any disk since you can then forget about disk geometry when working with partitions. If there's even the slightest chance that you'd want to grow/move partitions, then use LVM.
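
To make that concrete, a minimal sketch of an LVM layout on a small SSD. The volume group name `vg0`, the volume names and all the sizes are assumptions picked for a 120GB disk, `/dev/sdX2` is a placeholder, and `lvm_layout` and `run` are made-up helper names. By default the script only prints the commands:

```shell
#!/bin/sh
# Dry-run wrapper: prints each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical layout for a 120GB SSD; $1 is the partition to hand to LVM.
lvm_layout() {
    pv=$1
    run pvcreate "$pv" &&
    run vgcreate vg0 "$pv" &&
    run lvcreate -L 20G -n root vg0 &&
    run lvcreate -L 40G -n home vg0 &&
    run mkfs.ext4 /dev/vg0/root &&
    run mkfs.ext4 /dev/vg0/home &&
    # Later, when /home gets tight: grow the LV and its file system in one go.
    run lvextend -r -L +10G vg0/home
}

lvm_layout /dev/sdX2
```

Leaving part of the volume group unallocated at first is what makes the later `lvextend` possible without shuffling partitions around.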



Offline
