
Am I missing something, or couldn't something like distillation help here?


The paper says they tried that: https://arxiv.org/abs/2402.14905

Deep link to the relevant snippet in html version: https://ar5iv.labs.arxiv.org/html/2402.14905#S3.SS5

"So far, we trained compact models from scratch using next tokens as hard labels. We explored Knowledge Distillation (KD)... Unfortunately KD increases training time (slowdown of 2.6−3.2×) and exhibits comparable or inferior accuracy to label-based training (details in appendix)."



