Yesterday I wrote about Linux security and the need for monitoring hard drives for failure symptoms. As if this was an omen, today the following message popped up on my screen:
At any given time, my PC runs between 6 and 10 hard drives of varying size and make. In recent years I’ve replaced some old and small 1TB and 2TB drives with larger 3TB and 4TB drives, sometimes replacing two drives with one. I’m also adding more SSDs to improve performance, but my main data storage still consists of mechanical hard drives.
I am using smartmontools to monitor my hard drives’ S.M.A.R.T. data (Self-Monitoring, Analysis, and Reporting Technology). I also installed GSmartControl, a GUI frontend to smartmontools that allows me to run tests and inspect the drives’ SMART attributes. For more information, see for example https://help.ubuntu.com/community/Smartmontools.
Disclaimer: The instructions and suggestions given below have worked for me – your mileage may vary. I’m taking NO responsibility for any data or other loss resulting, directly or consequentially, from your use of the information provided herein.
To install smartmontools with GUI and notifier, open a terminal window and enter:
sudo apt-get install smartmontools smart-notifier gsmartcontrol
You can run smartmontools as a daemon and use the smart-notifier to get alerts (such as the screenshot above). In Linux Mint enter:
gksudo xed /etc/default/smartmontools
and uncomment the following line:
Save the file and exit. You need to reboot your computer to start the smartd daemon.
Reading Hard Drive Attributes
Once you have installed smartmontools and GSmartControl, you can inspect your disk attributes. In Linux Mint go to Menu -> All applications -> System Tools and click “GSmartControl”.
The first thing to do is to turn on SMART. Select the first drive and click + More. Then select “Enable SMART”.
Optionally you can also select “Enable Auto Offline Data Collection”.
Repeat the above step for all drives. Note that some drives may not support “Enable Auto Offline Data Collection” – my SanDisk SSD doesn’t, but the Samsung SSD supports this feature. All modern drives support SMART – in the rare case that a drive does not support SMART, get a replacement.
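If you prefer the terminal over the GUI, the same step can be scripted with smartctl, the command-line tool that ships with smartmontools. The following is only a minimal sketch: it assumes smartctl is on the PATH and that the functions are run as root, and the function names list_drives and enable_smart_everywhere are made up for illustration.

```shell
#!/bin/sh
# Print the device nodes that smartctl detects, one per line.
# "smartctl --scan" prints lines such as
#   /dev/sda -d scsi # /dev/sda, SCSI device
# and the first field is the device node.
list_drives() {
    smartctl --scan | awk '{ print $1 }'
}

# Enable SMART and, where the drive accepts it, automatic offline
# data collection on every detected drive.
enable_smart_everywhere() {
    for dev in $(list_drives); do
        echo "Enabling SMART on $dev"
        smartctl --smart=on "$dev"
        # Some drives reject this option; report it and carry on.
        smartctl --offlineauto=on "$dev" ||
            echo "  $dev: auto offline data collection not enabled (unsupported or command failed)"
    done
}

# Usage (as root, after sourcing this file):
#   enable_smart_everywhere
```

Nothing happens when the file is sourced; the drives are only touched once enable_smart_everywhere is called.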
After you have enabled SMART for all hard drives and SSDs, it’s time to take a look at the drive attributes.
Note: Different drives support different attributes. Also, different manufacturers use different attributes, or their values may differ. For example, Seagate drives report high values in the Raw Read Error Rate and Seek Error Rate fields. This does not mean that there is a problem.
Select the drive you want to inspect and click Device -> View details from the menu.
The first tab shows the identity of the selected drive, including drive model, serial number, firmware version, capacity, etc.
Click on the second tab named Attributes.
Above is an example of a hard drive attributes page. Notice the red font color in the Attributes tab and the pink highlight of “Current Pending Sector Count”. This is a warning!
The following attributes refer to mechanical hard drives. SSDs work differently, so you’ll need to look out for different attributes. Check your SSD manufacturer’s SMART attributes and recommendations.
The attributes above can tell us a lot about the state of the drive. Of particular interest are the following hard drive attributes (see What SMART Stats Tell Us About Hard Drives, incl. comments):
- Raw Read Error Rate – may indicate that a drive is going to fail;
- Reallocated sector count – once you see a non-zero value, you should consider getting a new hard drive;
- Power-on Time – not a failure indicator, but you should start keeping an eye on a drive that has more than ~12,000* hours (500 days) for a desktop PC, and ~25,000* hours for a server;
- Current Pending Sector Count – shows the number of “unstable” sectors that are waiting to be reallocated. Watch out for a non-zero value – a single digit value may still be OK, but if the value increases to a double-digit number it’s best to replace the drive;
- Multi Zone Error Rate – watch for a non-zero value. The higher it is, the worse the condition of the disk. It is best to replace the disk;
- Any other attribute or warning/notification that indicates a (future) failure.
*See note further down under “When to Replace a Disk”.
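The attributes listed above can also be checked from a terminal. Below is a rough sketch that filters the table printed by "smartctl -A" and flags non-zero raw values for the sector and multi-zone attributes. It assumes the attribute names as smartctl typically prints them (Reallocated_Sector_Ct, Current_Pending_Sector, Multi_Zone_Error_Rate, Power_On_Hours) and that the raw value is the tenth column; names and layout can differ between drives, so treat this as a starting point only. The function name check_attributes is hypothetical.

```shell
#!/bin/sh
# Read the output of "smartctl -A" on stdin and report the attributes
# discussed above. Column 2 is the attribute name, column 10 the raw value.
check_attributes() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Multi_Zone_Error_Rate" {
            # Any non-zero raw value deserves attention.
            if ($10 + 0 > 0) printf "WARNING %s raw=%s\n", $2, $10
        }
        $2 == "Power_On_Hours" {
            # Convert hours to whole days (24 hours per day).
            printf "INFO Power_On_Hours %d h (about %d days)\n", $10, $10 / 24
        }
    '
}

# Usage (as root):
#   smartctl -A /dev/sda | check_attributes
```

Raw Read Error Rate is left out on purpose, because (as noted earlier) some manufacturers report high raw values there without anything being wrong.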
When using GSmartControl, you can get a detailed explanation on each attribute by placing the cursor over the attribute.
Based on the above attributes, I decided to run an extended self-test on this drive to see if there were more issues.
I perform extended self tests on all my hard drives twice a year, and short self-tests about every 2-3 months. The disk attributes and the results of the self-tests give me valuable information on the health of the disks.
To run a short self-test, click the Perform Tests tab.
The default is “Short Self-test”. Click “Execute” to start the test.
After the test has completed, click the Self-test Logs tab to see the results.
Because of the “Current Pending Sector Count” warning I got for this drive (see first screenshot above), I decided to do an Extended Self-test. You can select the different tests in the “Test type” pull-down menu. Note that the extended self-test can take many hours, depending on the size and speed of the disk.
Below is the Self-test Logs tab with the results of the tests:
While the short (2 minutes) self-test completed without error/warning, the extended self-test produced an error and aborted after only 10% completion (see “LBA of the first error”). I may be able to retrieve all of the data, but there is a possibility that some data is corrupted or lost. This disk needs immediate replacement!
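The self-tests described above can also be started and checked with smartctl instead of GSmartControl. The sketch below assumes smartctl is on the PATH and run as root; the function names are invented for illustration, and the log filter simply looks for the word "failure" in the status column (for example "Completed: read failure"), which may not catch every possible failure status.

```shell
#!/bin/sh
# Start a self-test on the given device. The drive runs the test in the
# background; smartctl returns right away.
start_short_test() { smartctl --test=short "$1"; }
start_long_test()  { smartctl --test=long "$1"; }

# Read the output of "smartctl -l selftest" on stdin and print only the
# log entries whose status mentions a failure. Log entries start with
# "#" followed by the entry number.
failed_selftests() {
    grep '^# *[0-9]' | grep -i 'failure'
}

# Usage (as root):
#   start_long_test /dev/sda
#   ... wait until the test has finished, which can take many hours ...
#   smartctl -l selftest /dev/sda | failed_selftests
```

If failed_selftests prints nothing, no logged test matched the filter; it is still worth reading the full log now and then.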
When to Replace a Disk
Well, this depends. Here is a list of questions only you can answer:
- Do you regularly back up your data?
- How critical is downtime – can you afford some hours, a day, or a weekend without your computer/disk?
- Do you keep spare disk(s)?
- How fast and easy will it be to replace the disk and restore the data from backup?
- Even with regular backups, data will get lost when the disk fails. How critical is the data to you? Can you afford some (even small) data loss?
- Do you need your computer during business hours? Should repairs be done after business hours?
- Is high availability important to you?
If all your data is important to you, if you can’t or don’t want to have downtime during business hours, and/or if you need high availability, then it’s best to replace old disks or disks that start showing signs of wear as soon as possible.
For high availability, you should consider a RAID configuration (e.g. RAID5) or similar setup where your data is automatically mirrored or stored redundantly (cluster). Consult an expert on that! Software RAID is preferred over hardware RAID, because when a RAID controller fails, you might be SOOL (sh.. out of luck).
I personally use the disk attributes and replace a disk when it:
- is old – above ~13,000* hours uptime for a desktop drive, or ~26,000* for a server;
- shows signs of “fatigue” – any of the critical values are above 0 (zero);
- produces warnings or errors;
- doesn’t pass a self-test.
*Note: The Power-on Time disk replacement recommendations are highly individual and based solely on my personal experience! Some drive models last longer than others. Your computer’s work environment will most likely differ, and so may the operating hours before disk replacement. I power off my desktop PC when not in use. But my media PC/backup server runs 24/7. The latter also operates in a hot environment. Disk temperatures of 35-40C (95-104F) are the norm. Yet, the disks in my server typically outlast the disks in my desktop PC by a factor of 2-3. For another perspective on that, see How long do hard drives actually live for?
Replacing a disk at the first signs of old age allows me to copy the entire disk to a new replacement disk, usually without losing data. I sometimes reuse old disks that are still OK for non-critical applications – in my kids’ gaming rigs, for example.
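To make the rule of thumb above concrete, here is a small sketch that turns it into a shell function. It only encodes the personal thresholds from this post (about 13,000 hours for a desktop drive, about 26,000 for a server, and any critical sector count above zero); the function name replace_reason and its argument order are hypothetical, and warnings or failed self-tests still have to be checked separately.

```shell
#!/bin/sh
# replace_reason HOURS REALLOCATED PENDING [ROLE]
# Prints why a disk should be replaced, or "keep" if none of the
# thresholds from this post are exceeded. ROLE is "desktop" (default)
# or "server".
replace_reason() {
    hours=$1
    reallocated=$2
    pending=$3
    role=${4:-desktop}

    # Personal power-on limits from this post, not a manufacturer spec.
    limit=13000
    if [ "$role" = "server" ]; then
        limit=26000
    fi

    if [ "$reallocated" -gt 0 ] || [ "$pending" -gt 0 ]; then
        echo "critical attribute above zero"
    elif [ "$hours" -gt "$limit" ]; then
        echo "old: $hours h exceeds $limit h"
    else
        echo "keep"
    fi
}

# Usage:
#   replace_reason 14000 0 0 desktop
```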
What Factors Cause Premature Failures
As with any product, some hard disks are better than others. And even the best disks on the planet have quality variations. Aside from these facts of life, there are factors that will influence the life expectancy of a disk:
- Acceleration – for example a computer falling to the floor. This is usually an issue with laptops and mobile computers. Use an SSD!
- Heat – high heat has been reported as a reason for shortened disk life. I live in a hot country and heat is definitely an issue. Air-conditioning helps, so does good disk ventilation (use fans).
- Vibration – make sure there are no moving parts touching the computer. Use rubber feet to isolate the computer from the floor and the environment. Make sure the disks are installed properly using rubber gaskets or rubber rings to suspend the disks. Disks should never be fastened directly onto the metal frame of the chassis! Fans can produce vibration – choose good fans that produce little vibration and noise.
- Dust – often accumulates between the drives and prevents good airflow and cooling. Clean/vacuum.
- Hard drive specifications – different drives have different purposes, and different specs. If you operate a drive significantly outside its specs, it may die on you prematurely. For example, some drives are built for 24/7 operation, others are not. Those drives used for 24/7 may not be a good choice for desktop computers that get booted and shut down every day. Most drives have a limited annual throughput. When buying a drive, match it to your needs/application.
- Power failures – can lead to data loss, but in the worst case they may also damage mechanical drives.
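Regarding the heat factor in the list above: many drives report their temperature as a SMART attribute, so it can be read with smartctl. The sketch below assumes the drive exposes an attribute named Temperature_Celsius with the current temperature at the start of the raw value column; some drives use a different name or do not report a temperature at all. The function name drive_temperature is made up for illustration.

```shell
#!/bin/sh
# Read the output of "smartctl -A" on stdin and print the current drive
# temperature in degrees Celsius. The raw value may look like
# "38 (Min/Max 20/45)"; only the leading number is printed.
drive_temperature() {
    awk '$2 == "Temperature_Celsius" { print $10 + 0; exit }'
}

# Usage (as root):
#   smartctl -A /dev/sda | drive_temperature
```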
smartmontools are just another set of tools to help identify drive issues. They do not replace backups and other means to ensure data redundancy.