Hard Disk Drive Failures

The system’s worst weakness?

Malo Le Goff
Oct 27, 2022

In our quest to design fault-tolerant, durable and consistent systems, we hear a lot about Hard Disk Drive (HDD) failures. Yet the system design literature rarely explains what an HDD failure really means. This article explains the ins and outs of HDD failures and some ways to mitigate them.

From Denny Müller on Unsplash

I. Definition

An HDD failure occurs when a properly configured computer cannot access the information stored on the drive. As we’ll see in the next part, such a failure can have multiple causes. But first, we’ll take a look at the average lifespan of a disk and the 3 types of failures we can encounter.

For this, we’ll rely on a study done by Backblaze, which had the wonderful idea of tracking the lifespan of 20 000 HDDs. Here is the failure rate over time:

As you can see in the plot above, there are three distinct failure phases (i.e. 3 types of failures):

  • At the beginning of the HDD’s life, manufacturing defects kick in for some disks.
  • Between 18 and 36 months, drive deaths are caused by random failures.
  • Then the HDDs start to wear out and the failure rate goes back up.
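
To get a feel for what these three phases mean for a fleet of disks, here is a minimal sketch that turns a per-phase annual failure rate into a survival probability. The phase boundaries follow the list above, with an assumed end of the wear-out phase at 60 months. The rates are illustrative placeholders chosen for this example, not figures from the study, and `survival_probability` is a hypothetical helper.

```python
# Illustrative annualized failure rates per life phase.
# These are placeholder values, NOT figures taken from the Backblaze study.
PHASES = [
    # (phase name, start month, end month, annualized failure rate)
    ("early failures", 0, 18, 0.05),
    ("random failures", 18, 36, 0.015),
    ("wear-out", 36, 60, 0.12),
]


def survival_probability(months: int) -> float:
    """Probability that a drive is still alive after `months` of service.

    The model only covers the first 60 months; later months add no failures.
    """
    alive = 1.0
    for _name, start, end, annual_rate in PHASES:
        # Number of months of this phase that fall within [0, months].
        elapsed = max(0, min(months, end) - start)
        # Convert the annual rate into an equivalent monthly survival factor.
        monthly_survival = (1.0 - annual_rate) ** (1.0 / 12.0)
        alive *= monthly_survival ** elapsed
    return alive


for m in (18, 36, 60):
    print(f"after {m} months: {survival_probability(m):.1%} of drives still running")
```

With rates shaped like this, most of the losses happen at the two ends of the drive's life, which is why the curve is often described as a bathtub.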

We’ll see more concrete examples of such failure cases in part II.

II. Causes

What can cause an HDD failure? Here is a non-exhaustive list:

  • Human errors: Someone drops the HDD, accidentally deletes some files, …
  • Hardware Failures: A head crash is one example of a hardware issue. It occurs when a read–write head of a hard disk drive makes contact with its rotating platter (normally the head just hovers over the surface of the platter), slashing the surface and permanently damaging its magnetic media. It is most often caused by a sudden, severe motion of the disk, for example the jolt of dropping a laptop to the ground while it is operating, or any other physical shock to the computer.
  • The air filter: Another cause of failure is a faulty air filter. The air filters on today’s drives equalize the atmospheric pressure and moisture between the drive enclosure and the outside environment. If the filter fails to capture a dust particle, the particle can land on the platter and cause a head crash if the head happens to sweep over it.
  • Firmware Corruption: As you might already know, firmware is needed to start the OS and to translate OS operations into actual HDD operations. If the firmware is corrupted (by malware, for instance), the hardware cannot operate properly.
  • Heat: Disk platters expand as the temperature rises and contract when it drops. This can distort the magnetic surface and cause micro cracks, a severe defect that compromises data.
  • Water Damage: Water can cause unwanted surges in the electrical current, which can severely damage your device.

Now that we’ve gone through the main causes, let’s see what can be done to prevent or mitigate them.

III. Mitigation

What are the known solutions to the HDD failure problem? There are several:

  • Data Scrubbing: A background task runs periodically to inspect the HDD for errors and corrects them using checksums or a copy of the original data. RAID controllers, for instance, use it to check every HDD in a RAID array for defective data blocks.

NB: As a reminder, a checksum is a small block of data derived from another block of data and used to verify the integrity of the original block. Cryptographic hash functions are one example of functions that can serve as checksums. A plain checksum only detects errors, but some codes, like the Hamming code, can also correct the data.
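
To make the idea of scrubbing more concrete, here is a minimal sketch of one scrubbing pass over two mirrored copies of the same blocks. It assumes CRC32 checksums (from Python's standard `zlib` module) stored alongside the data. The `scrub` function and the in-memory "disks" are hypothetical simplifications, not how a real RAID controller is implemented.

```python
import zlib


def checksum(block: bytes) -> int:
    """CRC32 checksum of a block (detects corruption, cannot repair it)."""
    return zlib.crc32(block)


def scrub(primary: list[bytes], mirror: list[bytes], checksums: list[int]) -> list[int]:
    """Verify every block of `primary`; repair bad blocks from `mirror`.

    Returns the indices of the blocks that were repaired.
    """
    repaired = []
    for i, expected in enumerate(checksums):
        if checksum(primary[i]) == expected:
            continue  # block is healthy
        if checksum(mirror[i]) == expected:
            primary[i] = mirror[i]  # overwrite the bad block with the good copy
            repaired.append(i)
        else:
            raise IOError(f"block {i} is corrupted on both copies")
    return repaired


# Usage: two mirrored "disks" of three blocks each.
blocks = [b"alpha", b"bravo", b"charlie"]
checksums = [checksum(b) for b in blocks]
disk_a = list(blocks)
disk_b = list(blocks)

disk_a[1] = b"brXvo"  # simulate silent corruption on the first disk
print(scrub(disk_a, disk_b, checksums))  # indices of the repaired blocks
print(disk_a == blocks)  # whether the first disk matches the original data again
```

The checksum only tells us which copy is wrong; the repair itself comes from the redundant copy. That is the difference with an error-correcting code, which carries enough extra bits to fix the block on its own.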

  • Data Backup: Classic, but still the most effective way of mitigating a potential HDD failure.
  • Active Hard Drive Protection: Used only in laptops, to my knowledge. The system usually consists of accelerometers that alert the system when excess acceleration or vibration is detected. The software then tells the hard disk to unload its heads so they cannot come into contact with the platter, potentially preventing a head crash.
  • SMART (Self-Monitoring, Analysis and Reporting Technology): A monitoring system used to predict the failure of an HDD before it happens and notify the user about it. This way, they can replace the HDD before the data gets corrupted. It takes into account increased heat output, increased noise level, problems with reading and writing data, or an increase in the number of damaged disk sectors. Note that it only works for predictable failures caused by the wear of the HDD.
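
The kind of check that SMART-based monitoring performs can be sketched as a simple comparison of drive attributes against warning thresholds. The attribute names, values and thresholds below are hypothetical and chosen for illustration only. Real drives expose vendor-specific attributes and limits, and this sketch does not read them from an actual disk.

```python
# Hypothetical warning thresholds; real drives and vendors define their own.
THRESHOLDS = {
    "reallocated_sectors": 50,   # sectors remapped after being found defective
    "pending_sectors": 10,       # unstable sectors waiting to be remapped
    "read_error_rate": 100,      # read errors reported by the drive
    "temperature_celsius": 55,   # operating temperature
}


def smart_warnings(attributes: dict[str, int]) -> list[str]:
    """Return a warning for every attribute that exceeds its threshold."""
    warnings = []
    for name, limit in THRESHOLDS.items():
        value = attributes.get(name)
        if value is not None and value > limit:
            warnings.append(f"{name}={value} exceeds {limit}")
    return warnings


# Two made-up drives: one healthy, one showing signs of wear.
healthy = {"reallocated_sectors": 0, "pending_sectors": 0,
           "read_error_rate": 3, "temperature_celsius": 38}
wearing_out = {"reallocated_sectors": 212, "pending_sectors": 4,
               "read_error_rate": 40, "temperature_celsius": 61}

for label, drive in (("healthy", healthy), ("wearing out", wearing_out)):
    problems = smart_warnings(drive)
    print(label, "->", problems if problems else "no warning")
```

A check like this can only flag gradual degradation. A sudden failure, such as a head crash after a drop, gives no such early signal.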

Conclusion

So after giving a broad definition of HDD failure (the inaccessibility of data stored on the disk), we’ve seen the possible causes of these failures (heat, head crash, human errors, …) and ways of mitigating them.

Thanks for reading! Please let me know in the comments if you have any questions or remarks!

References

[1] : https://en.wikipedia.org/wiki/Hard_disk_drive_failure

[2] : https://mydatarecoverylab.com/top-7-causes-of-hard-disk-failure/

[3] : https://www.extremetech.com/computing/170748-how-long-do-hard-drives-actually-live-for

Malo Le Goff

Student Engineer | Engineering school : IMT Atlantique | Software Engineering & Data Science & Cybersecurity