
Am I missing something, or couldn't something like distillation help here?


The paper says they tried that: https://arxiv.org/abs/2402.14905

Deep link to the relevant snippet in html version: https://ar5iv.labs.arxiv.org/html/2402.14905#S3.SS5

"So far, we trained compact models from scratch using next tokens as hard labels. We explored Knowledge Distillation (KD)... Unfortunately KD increases training time (slowdown of 2.6−3.2×) and exhibits comparable or inferior accuracy to label-based training (details in appendix)."



