Hard Drive Health and Usage

Recently I had to deal with a business-critical Linux server, hosted by a popular hosting company, with a failing drive threatening all of our critical data. The company swapped the drive for a “new” one, and within a few weeks the same problem cropped right back up. As expected, they blamed our software, but it was clear that the kernel was logging low-level device errors. After much research I landed upon smartctl and was able to prove that:

  1. The replacement hard drive they gave us actually had 15,000+ runtime hours on it and was not new as they had claimed
  2. The drive itself was logging low-level errors during read operations and was warning of impending failure

After I confronted them with the data, they replaced the drive a second time, and this time they gave us a new drive with no previous runtime hours on it. Problem solved.

Here we cover this handy tool, which is part of the open source package smartmontools. smartctl is a command line utility that allows you to query a hard drive’s Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) data. It runs on Windows, Linux, and OS X. Shortly we’ll cover how to install and use it, but first a little bit about S.M.A.R.T.

Get S.M.A.R.T.

Modern hard drives (both solid state and standard spinning media) come equipped with onboard firmware, sensors, and flash memory that allow them to detect and record health and usage data on the drive itself. This data persists even if you move the drive to a new computer or install a new operating system on it. A plethora of data points is collected, including running and maximum temperature, power cycle counts, total lifetime runtime hours, seek/read/write errors, and many more. Specific data points vary by manufacturer, but nearly all drives record lifetime run hours and current and maximum temperature.

These drives also keep a log of the last few errors that the drive itself detects when it cannot service a request or has to take contingency steps to deal with bad physical blocks. As drives age, a few errors will inevitably crop up from time to time. Seeing logged errors is not necessarily an indication of a bad drive, but errors that occur frequently and in clusters do indicate that trouble is brewing. Reading this data from your drive can help you determine whether you have just had some bad luck with your filesystem or are facing an impending hardware failure.

To read this data your drive generally needs to be connected via SATA, SCSI, or IDE. Many USB and FireWire enclosures do not pass through the low-level commands these tools need in order to query the drive, although some USB bridges do work with smartctl’s -d sat option. A drive behind an unsupported bridge is still collecting data for you, but you will need to move it to a supported interface to read it.

Installing

Installing smartmontools (which includes smartctl) on a Linux box is best done with your local package manager, such as yum. If you cannot install binaries with a package manager and you have the dev tools installed, you can build it from source. Detailed instructions can be found here.

Installing on Windows is best done with the binary installer in the latest release.

Installing on OS X can be done with MacPorts or by downloading the source and building. Note that if you go the source route you will need the Xcode developer tools installed (this causes GCC to be installed, which you will need in order to build). You can get Xcode for free in the Mac App Store.
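As a rough sketch of the package-manager route, the helpers below print (rather than run) an install command for a few common package managers. The function names detect_manager and install_hint are made up for this illustration, and apt-get is an assumption added alongside the yum and MacPorts options mentioned above; the package is called smartmontools in each of them.

```shell
# Hypothetical helpers: suggest an install command without running it.

# Echo the first package manager found on this machine, if any.
detect_manager() {
    for pm in yum apt-get port; do
        if command -v "$pm" >/dev/null 2>&1; then
            echo "$pm"
            return 0
        fi
    done
    return 1
}

# Echo the install command for the package manager named in $1.
install_hint() {
    case "$1" in
        yum)     echo "sudo yum install smartmontools" ;;
        apt-get) echo "sudo apt-get install smartmontools" ;;
        port)    echo "sudo port install smartmontools" ;;
        *)       echo "no supported package manager found; build from source" >&2
                 return 1 ;;
    esac
}

# Example: install_hint "$(detect_manager)"
```

Printing the command instead of executing it keeps the sketch safe to paste into a terminal; run the printed line yourself once you are happy with it.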

Usage

For our purposes here we will focus on reading the logged data from the drive as well as scheduling in-depth self-tests. smartctl is a command line utility on all three OS types, so open up your favorite terminal or command window and let’s begin.

  1. On OS X and Linux smartctl should be in your path after it is installed. It usually lives in /usr/local/sbin, but this may vary based on your installation method. On Windows it is usually in C:\Program Files (x86)\smartmontools\bin (this is on 64-bit systems; on 32-bit systems it will be the same path without (x86)).
  2. If smartctl is not in your path you can cd into its parent directory and run it using ./smartctl on Linux/OS X and .\smartctl.exe on Windows. If you know it has been installed but can’t find it, try using locate smartctl on Linux or OS X. On Windows Vista and up you can use the search feature in the Start menu.
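For Linux and OS X, the search described in the two steps above could be sketched roughly like this. find_smartctl is a hypothetical helper, and the fallback directories are assumptions about common install locations (/opt/local/sbin being the usual MacPorts prefix); your installation may put the binary somewhere else.

```shell
# Hypothetical helper: print the full path to smartctl, or fail.
# Usage: find_smartctl [dir ...]
# With no arguments it probes a few commonly used directories
# (an assumption; your installation may differ).
find_smartctl() {
    # First choice: whatever the shell finds on PATH.
    if command -v smartctl >/dev/null 2>&1; then
        command -v smartctl
        return 0
    fi
    # Otherwise probe the candidate directories in order.
    [ "$#" -gt 0 ] || set -- /usr/local/sbin /usr/sbin /opt/local/sbin
    for dir in "$@"; do
        if [ -x "$dir/smartctl" ]; then
            echo "$dir/smartctl"
            return 0
        fi
    done
    return 1
}

# Example: SMARTCTL=$(find_smartctl) && "$SMARTCTL" --version
```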

Reading the device log

Open an administrative terminal. If you are on Windows, use the “Run as administrator” option available by right-clicking the icon you use to start cmd. On OS X and Linux you can use sudo or su - instead. If you do not have admin rights on OS X or Linux you may still be able to read the S.M.A.R.T. data, but on Windows you must have administrative rights.

  1. Determine which drive you want to query. For example, in Windows the C: drive is usually physical drive zero, which can be referenced as /dev/sda. See the smartctl.8.html file in the \Program Files\smartmontools\doc folder for details on how to reference drives on Windows. Here is an example command line to read the log data from the C:\ drive on a PC running Windows XP or later:
    smartctl.exe --all /dev/sda
    For Linux and OS X you can enter mount at the command prompt to show a list of drives and their mount points. Use the /dev drive path as the argument to smartctl. For example:
    smartctl --all /dev/disk0s2
  2. You may want to capture the log output directly to a text file so you can read it in your favorite editor. To do that on all operating systems, follow the command with > filename.txt. For example:
    smartctl --all /dev/disk0s2 > main_drive_smart.txt
    Then you can read that file in your favorite editor and also save the data for comparison later.
  3. Near the top of the output you will see the vendor-specific data block. It should contain the Power On Hours entry, so you can see how long your drive has been in use. Also look at the maximum temperature recorded. Drives die much more quickly when they overheat, and a high recorded temperature is an indicator of potential degradation. Chances are that if the drive has recorded that temperature at least once, it has reached it several times in its lifetime.
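If you would rather pull those numbers out than scan for them by eye, something along these lines may help. It is only a sketch: the helper names are hypothetical, it assumes the ten-column attribute table that smartctl -A prints for ATA drives, and the attribute names Power_On_Hours and Temperature_Celsius are common but not guaranteed to exist on every drive. How the maximum temperature is reported also differs between vendors, so it is not extracted here.

```shell
# Hypothetical helpers that read `smartctl -A` (attribute table) output on
# stdin. Assumed column layout:
#   ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

# Print the raw value (10th column) of the attribute named in $1.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Convenience wrappers; these attribute names vary by vendor.
power_on_hours() { smart_raw_value Power_On_Hours; }
current_temp_c() { smart_raw_value Temperature_Celsius; }

# Example (device name is a placeholder):
#   sudo smartctl -A /dev/sda | power_on_hours
```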

Next we want to look at any error logs. The S.M.A.R.T. system usually keeps only the last five error entries, but each entry gets a sequential number, so if you have error counts in the thousands, it’s time to start thinking about drive shopping. Here is an example output from my Windows development PC.

This drive is experiencing a bad block. This by itself is not necessarily anything to worry about: bad blocks usually get remapped, and the drive then records the block in a table so that it is avoided when writing data. Bad block reports can also be generated by power outages during writes and similar events. However, this drive has thousands of errors, and I need to run a test on it to see if something more serious is happening.
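To put a number on “thousands of errors”, the lifetime count can be read from the error log on its own. The sketch below assumes an ATA drive, where smartctl -l error prints a line beginning with “ATA Error Count:” (or “No Errors Logged” when the log is empty). Both helper names are hypothetical, and the 1000 threshold simply echoes the rule of thumb above rather than any vendor-defined limit.

```shell
# Hypothetical helper: read `smartctl -l error` output on stdin and print
# the lifetime ATA error count. Prints 0 when no "ATA Error Count" line is
# present, as in the "No Errors Logged" case.
ata_error_count() {
    awk '/^ATA Error Count:/ { n = $4 } END { print n + 0 }'
}

# Hypothetical check: succeed when the count reaches 1000, a rule-of-thumb
# threshold chosen for this sketch.
errors_look_serious() {
    [ "$(ata_error_count)" -ge 1000 ]
}

# Example (device name is a placeholder):
#   sudo smartctl -l error /dev/sda | ata_error_count
```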

Running Drive Tests

smartctl can schedule various built-in self-tests on the drive. Here we’ll cover the short and long tests. The short test checks sensors, basic head movement and tracking, drive spindle RPM management, and other items that can be checked in a few minutes. The long test covers these as well as a closer inspection of the drive media itself. It can detect sector-specific tracking issues, bad reads, and other problems that can only be found by actually trying to read through each part of the drive. As expected, the long test can take a long time. Note that both test types can be canceled by the host OS for various reasons outside your control. The long test especially should only be run during off-peak hours, as it ties up the drive and slows down reads and writes. After choosing which test to run, do one of the following:

  1. smartctl -t short /dev/sda
  2. smartctl -t long /dev/sda

Replace /dev/sda with your drive’s physical device name. If you schedule a long test it can take several hours before your results are ready.

You can monitor the test progress by issuing smartctl -a /dev/sda (use your physical device name). Near the bottom of the output will be a status update on the test with the percentage remaining. Test results are kept in a log, so you can safely queue a long test at night and check the results in the morning.
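A small sketch for checking on a test without reading the whole report: it assumes the “NN% of test remaining.” line that smartctl prints while an ATA self-test is running, and the “# 1” numbering that smartctl -l selftest uses for the most recent log entry. The helper names are hypothetical.

```shell
# Hypothetical helper: read `smartctl -a` (or -c) output on stdin and print
# e.g. "90%" while a self-test is running; prints nothing otherwise.
selftest_remaining() {
    awk '/% of test remaining/ { print $1 }'
}

# Hypothetical helper: read `smartctl -l selftest` output on stdin and print
# the most recent entry, which is listed first as "# 1".
latest_selftest() {
    awk '/^# *1 / { print }'
}

# Examples (device name is a placeholder):
#   sudo smartctl -a /dev/sda | selftest_remaining
#   sudo smartctl -l selftest /dev/sda | latest_selftest
```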

S.M.A.R.T. data can help you diagnose and decide when dealing with a potentially bad drive. No matter the health of your drives, always back up your data and periodically verify those backups. Power surges, fires, and plain old bad luck can ruin a drive in seconds, and no amount of careful watching of S.M.A.R.T. logs can prevent that.

For further reading see the detailed manual page that is installed with smartmontools. On Linux and OS X this is simply

man smartctl

entered at the command line. For Windows you can search for smartctl in the search bar in the Start menu, and in that list you will see a link to the HTML manual page that ships with smartmontools.

Every drive is guaranteed to fail eventually; it’s always just a matter of time. If you think you have a failing drive, check with your PC vendor or drive manufacturer to see if they offer low-level diagnostic tools that you can run as a second check. For home users, consider using a cloud backup such as Dropbox, Microsoft’s SkyDrive, Backblaze, or CrashPlan, among many others I’ve certainly missed.
