Somewhat amazed, dang, that this topic is not discussed more widely here or elsewhere. There is a lot of HPC and DS expertise out there that lacks an understanding of ML system architecture (in the sense of the deployed machinery in toto).
Her follow-up post [1] is also recommended for those (like me) who are experienced, but not in ML, and finally had things click because of the OP writeup:
Large Transformer Model Inference Optimization (2023)
https://lilianweng.github.io/posts/2023-01-10-inference-opti...
A very cool cite from that article is LLM.int8(): https://arxiv.org/abs/2208.07339
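For intuition, the core trick in LLM.int8() is vector-wise absmax int8 quantization plus a mixed-precision decomposition that keeps "outlier" feature dimensions in higher precision. Here is a minimal numpy sketch of that idea; the threshold, shapes, and helper names are mine for illustration, not the paper's actual implementation:

```python
import numpy as np

def absmax_quantize(x):
    """Quantize a float vector to int8 by scaling its absolute max to 127."""
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) / scale

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)).astype(np.float32)
X[:, 2] *= 50.0  # simulate one outlier feature dimension

# Mixed-precision decomposition: columns with large-magnitude entries
# stay in float; the well-behaved rest is quantized to int8.
threshold = 6.0  # illustrative cutoff, not the paper's tuned value
outlier_cols = np.max(np.abs(X), axis=0) > threshold
X_outlier = X[:, outlier_cols]   # kept in float precision
X_rest = X[:, ~outlier_cols]

# Vector-wise quantization: each row gets its own absmax scale.
qs, scales = zip(*(absmax_quantize(row) for row in X_rest))
X_rest_hat = np.stack([dequantize(q, s) for q, s in zip(qs, scales)])

err = np.max(np.abs(X_rest - X_rest_hat))
print(err)  # small, since outliers no longer dominate the scales
```

The point of the split is visible in the error: without removing the outlier column, its huge magnitude would set the absmax scale and crush the resolution available to the normal-range values.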