In case you're interested in reading the report, it's at http://216.239.37.132/papers/disk_failures.pdf (at least as of 2007-02-20).
The report has quite a few graphs, but a surprisingly unstatistical flavour to it.
I wonder what this means for practical system design. Do people currently build assumptions about hard drive failure patterns into their systems in a way that they should now change? I suppose independence of failures (i.e. that copying data to two drives is safer than storing it on just one) is the main assumption behind e.g. RAID; I wonder whether Google has any new insight there.
You should be able to improve over naive RAID by pairing a relatively high-probability-of-failure drive with a low-probability one. I.e. what you *shouldn't* do is the common practice of putting two new drives in a mirror, since both are in the infant-mortality part of the failure curve. What this data suggests is that you'll have a smaller chance of losing data (via simultaneous failure) if you pair a new drive with an older, "proven" one (but not one so old that it is nearing end of life).
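To put rough numbers on the mirror argument, here's a back-of-the-envelope sketch in Python. The AFR values and the rebuild window are made-up assumptions chosen to illustrate a bathtub-shaped failure curve, not figures from the paper, and failures are treated as independent:

    # Rough comparison of two-drive mirror configurations.
    # The AFR numbers and the rebuild window below are illustrative guesses,
    # not figures from the Google paper; failures are assumed independent.

    REBUILD_DAYS = 3  # assumed time to notice a failure and resilver the mirror

    # Assumed annualized failure rates by drive age (a bathtub-ish curve):
    AFR = {
        "new (infant mortality)": 0.06,
        "proven (1-3 years)":     0.02,
        "old (near end of life)": 0.08,
    }

    def p_fail_in_window(afr, days=REBUILD_DAYS):
        """Chance a drive fails during the rebuild window, treating the
        hazard rate as constant over that short interval."""
        return 1 - (1 - afr) ** (days / 365.0)

    def p_mirror_loss_per_year(afr_a, afr_b):
        """Chance per year of losing the mirror: one drive fails and the
        other fails before the rebuild completes."""
        return afr_a * p_fail_in_window(afr_b) + afr_b * p_fail_in_window(afr_a)

    pairs = [
        ("two new drives",     "new (infant mortality)", "new (infant mortality)"),
        ("new + proven drive", "new (infant mortality)", "proven (1-3 years)"),
        ("two proven drives",  "proven (1-3 years)",     "proven (1-3 years)"),
    ]

    for label, a, b in pairs:
        print(f"{label:18s}: ~{p_mirror_loss_per_year(AFR[a], AFR[b]):.1e} per year")

With these made-up rates, the new + proven pairing comes out roughly three times less likely to lose the mirror in a year than two new drives, which is the argument in miniature.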
the main problem is that the authors didn't look at the data by disk model and manufacturing lot. ideally you should remove drives with known problems from the population.
known problems? yes. there aren't any truly horrible drives out there, but there is the occasional bad batch. a difference of three or more percentage points in AFR between "good" drives and a bad batch is typical.
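for a sense of scale, a trivial sketch of what a three-point AFR gap means at fleet level; the 2% / 5% rates and the 10,000-drive fleet size are assumptions for illustration, not numbers from the paper or from any real population:

    # Illustrative only: what a "three point" AFR gap looks like at fleet scale.
    # The AFR values and fleet size are assumptions, not measured data.
    fleet_size = 10_000
    afr_good, afr_bad_batch = 0.02, 0.05   # 2% vs 5% annualized failure rate

    for label, afr in [("good drives", afr_good), ("bad batch", afr_bad_batch)]:
        print(f"{label:12s}: ~{fleet_size * afr:.0f} expected failures per year")
    # -> roughly 200 vs 500 failures a year across a 10,000-drive population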
disclosure: i reviewed this paper for the FAST program committee.