North China University of Technology
Beijing Union University
Beijing Union University
North China University of Technology
2024, Volume E107.D, Issue 7, Pages 890-893
Currently, the most advanced knowledge distillation models use a metric-learning approach based on probability distributions. However, the correlation between supervised probability distributions is typically geometric and implicit, causing inefficiency and an inability to capture structural feature representations across different tasks. To overcome this problem, we propose a knowledge distillation loss that uses the robust sliced Wasserstein distance with geometric median (GMSW) to estimate the differences between teacher and student representations. Owing to the intuitive geometric properties of GMSW, the student model can effectively learn to align its hidden states with those produced by the teacher model, thereby establishing a robust correlation among implicit features. In experiments, our method outperforms state-of-the-art models in both high-resource and low-resource settings.
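To make the GMSW idea concrete, below is a minimal PyTorch sketch of one plausible reading of such a loss: project teacher and student hidden states onto random directions, compute the per-slice 1-D Wasserstein distance by sorting, and aggregate slices with the median (the geometric median of scalars coincides with the ordinary median), which is more robust to outlier projections than the mean used in plain sliced Wasserstein. The function name `gmsw_distill_loss`, the argument names, and the assumption that teacher and student provide the same number of token vectors of equal width are ours for illustration; this is not the authors' released implementation.

```python
import torch

def gmsw_distill_loss(teacher_h: torch.Tensor,
                      student_h: torch.Tensor,
                      n_slices: int = 128,
                      p: int = 2) -> torch.Tensor:
    """Sketch of a robust sliced Wasserstein distillation loss.

    teacher_h, student_h: (n_tokens, hidden_dim) hidden states,
    assumed to have the same shape (a projection layer would be
    needed if the teacher and student widths differ).
    """
    d = teacher_h.size(-1)

    # Random projection directions on the unit sphere: (hidden_dim, n_slices).
    theta = torch.randn(d, n_slices, device=teacher_h.device)
    theta = theta / theta.norm(dim=0, keepdim=True)

    # Project both sets of hidden states: (n_tokens, n_slices).
    t_proj = teacher_h @ theta
    s_proj = student_h @ theta

    # In 1-D, the p-Wasserstein distance between two empirical
    # distributions with equal sample counts is the L_p distance
    # between their sorted samples.
    t_sorted, _ = torch.sort(t_proj, dim=0)
    s_sorted, _ = torch.sort(s_proj, dim=0)
    per_slice = (t_sorted - s_sorted).abs().pow(p).mean(dim=0).pow(1.0 / p)

    # Robust aggregation: median over slices rather than the mean,
    # so a few badly aligned projection directions do not dominate.
    return per_slice.median()
```

In this sketch the only change from the standard sliced Wasserstein loss is the final aggregation step; swapping `per_slice.median()` for `per_slice.mean()` recovers the non-robust variant, which makes the robustness trade-off easy to ablate.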