Study OPD and reproduce it. References: https://github.com/david-xinyuwei/david-share/blob/master/DL-Algorithm-Insights/Multi-Expert-OPD-Distillation/README-CN.md, https://github.com/david-xinyuwei/david-share/tree/master/DL-Algorithm-Insights.
Some takeaways
- The author's discussion of "Why on-policy rather than SFT?"
See https://github.com/david-xinyuwei/david-share/blob/master/DL-Algorithm-Insights/Multi-Expert-OPD-Distillation/README-CN.md, section "vs SFT(Supervised Fine-Tuning)—— Exposure Bias 问题" (the exposure bias problem).
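
To make the on-policy vs. SFT contrast concrete for my own reproduction, here is a toy sketch. It is not the code from the referenced repository. The general idea: SFT trains on fixed reference prefixes, so at inference the model conditions on its own outputs, which it never saw during training (exposure bias). An on-policy setup instead samples the continuation from the student and asks the teacher for feedback on exactly those prefixes. The per-token reverse KL used below is one common choice and is an assumption here, and `toy_student`, `toy_teacher`, and `on_policy_distill_loss` are hypothetical names over a 3-token vocabulary.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for a single token position."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def on_policy_distill_loss(student_logits_fn, teacher_logits_fn, prompt, steps, rng):
    """Roll out `steps` tokens from the STUDENT, and average the per-token
    reverse KL against the teacher on the prefixes the student visited."""
    tokens = list(prompt)
    total = 0.0
    for _ in range(steps):
        s = softmax(student_logits_fn(tokens))
        t = softmax(teacher_logits_fn(tokens))
        total += reverse_kl(s, t)
        # On-policy step: the next token comes from the student, not from a dataset.
        next_tok = rng.choices(range(len(s)), weights=s)[0]
        tokens.append(next_tok)
    return total / steps, tokens


# Hypothetical toy models (assumptions for illustration only).
def toy_teacher(tokens):
    """Prefers the token (last + 1) mod 3."""
    last = tokens[-1]
    return [2.0 if i == (last + 1) % 3 else 0.0 for i in range(3)]


def toy_student(tokens):
    """Untrained student: uniform over the vocabulary."""
    return [0.0, 0.0, 0.0]


if __name__ == "__main__":
    loss, rollout = on_policy_distill_loss(
        toy_student, toy_teacher, prompt=[0], steps=5, rng=random.Random(0)
    )
    print("student rollout:", rollout)
    print("mean per-token reverse KL:", loss)
```

In a real training loop this loss would be differentiated with respect to the student's parameters. The sketch only computes the quantity, to show that the prefixes being scored come from the student's own samples.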