My experience with running GPUs is that overclocking tends to go hand in hand with undervolting, and it has had zero impact on the longevity of the chips themselves. Other components are what end up failing: power supplies, consumables, and hand-made parts such as hand-soldered components.
We had cards in the worst of the worst environments and they ran fine for years on end.
I'd not be so sure, actually, because we have seen other processors in our systems, such as RAID or Ethernet cards, go "insane" after some years. No overheating, no physical stress, nothing; just normal, if somewhat heavy (HPC), workloads.
Reboot the system and the device just disappears, never to be seen again. It generally starts after the ~6-year mark.
Sometimes the device starts to corrupt things silently first, but not always. Either way, those devices too disappear after some time.
If the device has a mean time to failure of, say, 5 years when running close to its thermal limits, then running it at a 20 °C lower operating temperature turns that into 20 years, as mentioned in the sibling comment[1]: by the usual rule of thumb, failure rate roughly doubles for every 10 °C increase, so 20 °C cooler means roughly 4x the lifetime.
Thus expected lifetime quickly becomes long enough that it's effectively not an issue for CPUs and GPUs if you provide sufficient cooling.
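As a quick sketch of that arithmetic (the "failure rate doubles per 10 °C" factor is the usual rule of thumb, and the numbers are illustrative, not measured values for any particular part):

    # Rule-of-thumb MTTF scaling: assuming failure rate doubles per 10 degrees C,
    # lowering the operating temperature by delta_t_c multiplies MTTF by 2^(delta_t_c / 10).
    # Both the baseline MTTF and the doubling interval are illustrative assumptions.
    def scaled_mttf(base_mttf_years, delta_t_c, doubling_interval_c=10.0):
        return base_mttf_years * 2 ** (delta_t_c / doubling_interval_c)

    print(scaled_mttf(5.0, 20.0))  # 5 years near the limit -> 20.0 years when run 20 C cooler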
> Thus expected lifetime quickly becomes long enough that it's effectively not an issue for CPUs and GPUs if you provide sufficient cooling.
Both yes and no.
I still have an old AMD Athlon XP system which runs at 2200 MHz (200x11), completely out of spec for that generation of AMD systems (2200 MHz parts had a 166 MHz bus), and it still performs as it did on day one, since it's not over-volted and is cooled well.
On the other hand, we replace parts that fry seemingly because they feel like it, even though they are nowhere near their thermal limits, since they're kept in a well-cooled data center.
Sometimes, things go bzzt even without extreme heat. It's really interesting: something is working at full throttle with no problems, you update a couple of things, reboot, and the device is gone for good.
The point being that you don't know which component on the board failed. If you look at the GPU chip itself, it might be just fine; perhaps it was only a capacitor that blew.
The GPUs we replace are mounted on a large board which hosts multiple GPUs via the SXM interface. A replacement GPU arrives with only its heat sink, and we change just the GPU itself; the board is never replaced.
Same for the RAID cards: the processor has a couple of failure modes (lost cache or a vanished card), both directly related to the RAID processor itself. Likewise for the Ethernet cards we fry; they lose their MAC addresses. All of this points to in-silicon problems.