My experience with running GPUs is that overclocking tends to go hand in hand with undervolting, and it has had zero impact on the longevity of the chips themselves. Other components are what end up failing: power supplies, consumables, and hand-made parts such as hand-soldered components.
We had cards in the worst of the worst environments and they ran fine for years on end.
I'd not be so sure, actually, because we have seen other processors in our systems, such as RAID or Ethernet cards, go "insane" after some years. No overheating, no physical stress, nothing; just normal, if somewhat heavy (HPC), workloads.
Reboot the system and the device just disappears, never to be seen again. It generally starts after the ~6-year mark.
Sometimes the device starts to corrupt things silently first, but not always. Either way, those devices too disappear after some time.
If the device has a mean time to failure of, say, 5 years when running close to its thermal limits, then running it at a 20 °C lower operating temperature turns that into 20 years, as mentioned in the sibling comment[1]: by the usual rule of thumb, failure rate roughly doubles for every 10 °C increase, so 20 °C cooler means roughly 4x the lifetime.
Thus expected lifetime quickly becomes long enough that it's effectively not an issue for CPUs and GPUs if you provide sufficient cooling.
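As a quick sketch of that arithmetic (the "failure rate doubles per 10 °C" factor is the usual rule of thumb, and the numbers are illustrative, not measured values for any particular part):

    # Rule-of-thumb MTTF scaling: assuming failure rate doubles per 10 degrees C,
    # lowering the operating temperature by delta_t_c multiplies MTTF by 2^(delta_t_c / 10).
    # Both the baseline MTTF and the doubling interval are illustrative assumptions.
    def scaled_mttf(base_mttf_years, delta_t_c, doubling_interval_c=10.0):
        return base_mttf_years * 2 ** (delta_t_c / doubling_interval_c)

    print(scaled_mttf(5.0, 20.0))  # 5 years near the limit -> 20.0 years when run 20 C cooler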
> Thus expected lifetime quickly becomes long enough that it's effectively not an issue for CPUs and GPUs if you provide sufficient cooling.
Both yes and no.
I still have an old AMD Athlon XP system which runs at 2200 MHz (200x11), completely out of spec for that generation of AMD systems (2200 MHz parts had a 166 MHz bus), and it still performs as it did on day one, since it's not over-volted and is cooled well.
On the other hand, we replace parts that fry seemingly because they feel like it, even though they are nowhere near their thermal limits, since they're kept in a well-cooled data center.
Sometimes, things go bzzt even without extreme heat. It's really interesting: something is working at full throttle with no problems, you update a couple of things, reboot, and the device is gone for good.
The point being that you don't know which component on the board failed. If you look at the GPU chip itself, it might be just fine; perhaps it was only a capacitor that blew.
The GPUs we replace are mounted on a large board which hosts multiple GPUs via the SXM interface. A replacement GPU arrives with only its heat sink, and we change just the GPU itself; the board is never replaced.
Same for the RAID cards: the processor has a couple of failure modes (lost cache or a vanished card), both directly related to the RAID processor itself. Likewise for the Ethernet cards we fry; they lose their MAC addresses. All of this points to in-silicon problems.