Somewhat amazed, dang, that this topic is not discussed more widely here or elsewhere. There is a lot of HPC and DS expertise out there that lacks an understanding of ML system architecture (in the sense of the deployed machinery in toto).
Her follow-up post [1] is also recommended for those (like me) who are experienced, but not in ML, and finally had things click because of the OP writeup:
Large Transformer Model Inference Optimization (2023)
https://lilianweng.github.io/posts/2023-01-10-inference-opti...
A very cool cite from that article is LLM.int8(): https://arxiv.org/abs/2208.07339
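For intuition, the core trick in LLM.int8() is vector-wise absmax int8 quantization plus a mixed-precision decomposition that keeps "outlier" feature dimensions in higher precision. Here is a minimal numpy sketch of that idea; the threshold, shapes, and helper names are mine for illustration, not the paper's actual implementation:

```python
import numpy as np

def absmax_quantize(x):
    """Quantize a float vector to int8 by scaling its absolute max to 127."""
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) / scale

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)).astype(np.float32)
X[:, 2] *= 50.0  # simulate one outlier feature dimension

# Mixed-precision decomposition: columns with large-magnitude entries
# stay in float; the well-behaved rest is quantized to int8.
threshold = 6.0  # illustrative cutoff, not the paper's tuned value
outlier_cols = np.max(np.abs(X), axis=0) > threshold
X_outlier = X[:, outlier_cols]   # kept in float precision
X_rest = X[:, ~outlier_cols]

# Vector-wise quantization: each row gets its own absmax scale.
qs, scales = zip(*(absmax_quantize(row) for row in X_rest))
X_rest_hat = np.stack([dequantize(q, s) for q, s in zip(qs, scales)])

err = np.max(np.abs(X_rest - X_rest_hat))
print(err)  # small, since outliers no longer dominate the scales
```

The point of the split is visible in the error: without removing the outlier column, its huge magnitude would set the absmax scale and crush the resolution available to the normal-range values.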