The cool thing about this paper is that it offers a (relatively) cheap way to solve this particular problem (running an LLM with a 2048k-token window): rather than training a model from scratch on 2048k-token contexts, which would be prohibitively expensive for mere mortals, the authors take pre-trained models like LLaMA2 and Mistral and extend them to 2048k windows using a novel technique.
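To give a feel for why extending a pre-trained model is cheap, here is a minimal sketch of the general idea behind context-window extension via position interpolation on rotary position embeddings (RoPE). This is an assumption for illustration, not the paper's exact method: the specific rescaling scheme, the `rope_angles` helper, and the window sizes below are all hypothetical.

```python
import math

def rope_angles(position: float, dim: int, base: float = 10000.0, scale: float = 1.0):
    """Rotary-embedding angles for one token position.

    `scale` > 1 compresses positions so a longer sequence stays inside
    the position range the model saw during pre-training.
    """
    return [
        (position / scale) * base ** (-2 * i / dim)
        for i in range(dim // 2)
    ]

# Hypothetical numbers: a model trained on a 4k window, stretched 16x.
trained_window, target_window = 4096, 65536
factor = target_window / trained_window  # 16.0

# With interpolation, the angle at the far end of the extended window
# equals the angle at the corresponding point of the trained window,
# so no new position range needs to be learned from scratch.
a_extended = rope_angles(target_window - 1, dim=64, scale=factor)
a_trained = rope_angles((target_window - 1) / factor, dim=64)
assert all(math.isclose(x, y) for x, y in zip(a_extended, a_trained))
```

The point is that only a light fine-tuning (or search over rescaling factors) is needed on top of an existing model, which is far cheaper than pre-training at full length.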