"So far, we trained compact models from scratch using next tokens as hard labels. We explored Knowledge Distillation (KD)... Unfortunately KD increases training time (slowdown of 2.6−3.2×) and exhibits comparable or inferior accuracy to label-based training (details in appendix)."