In case you're interested in reading the report, it's at http://216.239.37.132/papers/disk_failures.pdf (at least as of 2007-02-20).
The report has quite a few graphs, but a surprisingly unstatistical flavour to it.
I wonder what this means for practical system design. Do people currently build assumptions about hard drive failure patterns into their systems in a way that they should now change? I suppose independence of failures (i.e. that copying data to two drives is safer than storing it on just one) is the main assumption behind e.g. RAID; I wonder whether Google has any new insight there.
You should be able to improve over naive RAID by pairing a relatively high-probability-of-failure drive with a low-probability one. I.e. what you *shouldn't* do is the common practice of putting two new drives in a mirror, since both are in the infant-mortality part of the failure curve. What this data suggests is that you'll have a smaller chance of losing data (via simultaneous failure) if you pair a new drive with an older, "proven" one (but not one so old that it is nearing end of life).
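To put rough numbers on the mirror argument, here's a back-of-the-envelope sketch in Python. The AFR values and the rebuild window are made-up assumptions chosen to illustrate a bathtub-shaped failure curve, not figures from the paper, and failures are treated as independent:

    # Rough comparison of two-drive mirror configurations.
    # The AFR numbers and the rebuild window below are illustrative guesses,
    # not figures from the Google paper; failures are assumed independent.

    REBUILD_DAYS = 3  # assumed time to notice a failure and resilver the mirror

    # Assumed annualized failure rates by drive age (a bathtub-ish curve):
    AFR = {
        "new (infant mortality)": 0.06,
        "proven (1-3 years)":     0.02,
        "old (near end of life)": 0.08,
    }

    def p_fail_in_window(afr, days=REBUILD_DAYS):
        """Chance a drive fails during the rebuild window, treating the
        hazard rate as constant over that short interval."""
        return 1 - (1 - afr) ** (days / 365.0)

    def p_mirror_loss_per_year(afr_a, afr_b):
        """Chance per year of losing the mirror: one drive fails and the
        other fails before the rebuild completes."""
        return afr_a * p_fail_in_window(afr_b) + afr_b * p_fail_in_window(afr_a)

    pairs = [
        ("two new drives",     "new (infant mortality)", "new (infant mortality)"),
        ("new + proven drive", "new (infant mortality)", "proven (1-3 years)"),
        ("two proven drives",  "proven (1-3 years)",     "proven (1-3 years)"),
    ]

    for label, a, b in pairs:
        print(f"{label:18s}: ~{p_mirror_loss_per_year(AFR[a], AFR[b]):.1e} per year")

With these made-up rates, the new + proven pairing comes out roughly three times less likely to lose the mirror in a year than two new drives, which is the argument in miniature.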
the main problem is that the authors didn't look at the data by disk model and manufacturing lot. ideally you should remove drives with known problems from the population.
known problems? yes. there aren't any truly horrible drives out there, but there is the occasional bad batch. a difference of three or more percentage points in AFR between "good" drives and a bad batch is typical.
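for a sense of scale, a trivial sketch of what a three-point AFR gap means at fleet level; the 2% / 5% rates and the 10,000-drive fleet size are assumptions for illustration, not numbers from the paper or from any real population:

    # Illustrative only: what a "three point" AFR gap looks like at fleet scale.
    # The AFR values and fleet size are assumptions, not measured data.
    fleet_size = 10_000
    afr_good, afr_bad_batch = 0.02, 0.05   # 2% vs 5% annualized failure rate

    for label, afr in [("good drives", afr_good), ("bad batch", afr_bad_batch)]:
        print(f"{label:12s}: ~{fleet_size * afr:.0f} expected failures per year")
    # -> roughly 200 vs 500 failures a year across a 10,000-drive population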
disclosure: i reviewed this paper for the FAST program committee.