Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12987)
Abstract
Pre-trained language models (e.g., BERT) have achieved remarkable performance on most natural language understanding tasks. However, it is difficult to deploy these models in online systems because of their huge number of parameters and long inference time. Knowledge distillation is a popular model compression technique that can achieve considerable structural compression with limited performance degradation. However, no knowledge distillation method has been specially designed for compressing Chinese pre-trained language models, and no corresponding distilled model has been publicly released. In this paper, we propose LightBERT, a distilled BERT model designed specifically for Chinese language processing. We perform pre-training distillation under the masked language model objective with whole word masking, a masking strategy adapted to the characteristics of the Chinese language. Furthermore, we adopt a multi-step distillation strategy to compress the model progressively. Experiments on the CLUE benchmark show that LightBERT reduces the size of a RoBERTa model by 62.5% while retaining 94.5% of its teacher's performance.
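The abstract only names the training objective, so the following is a minimal sketch of the kind of pre-training distillation loss it describes: a hard-label masked-LM term plus a temperature-softened KL term against the teacher's vocabulary distribution at the masked positions. The function name, the temperature T, the weight alpha, and the -100 labelling convention for unmasked tokens are illustrative assumptions, not details taken from the paper or any released code.

    # Sketch of a pre-training distillation objective (assumed form, not the
    # authors' implementation): hard-label MLM loss + soft-label KD loss.
    import torch
    import torch.nn.functional as F

    def pretraining_distillation_loss(student_logits, teacher_logits, labels,
                                      T=2.0, alpha=0.5):
        """student_logits, teacher_logits: [batch, seq_len, vocab_size];
        labels: [batch, seq_len], with -100 at positions that were not
        selected by the whole-word masking step."""
        vocab = student_logits.size(-1)

        # Hard-label MLM cross-entropy on the masked positions only.
        mlm_loss = F.cross_entropy(student_logits.view(-1, vocab),
                                   labels.view(-1), ignore_index=-100)

        # Soft-label loss: KL(teacher || student) with temperature T,
        # restricted to the masked positions.
        mask = (labels != -100).float()
        s_log_prob = F.log_softmax(student_logits / T, dim=-1)
        t_prob = F.softmax(teacher_logits.detach() / T, dim=-1)
        kd_per_token = F.kl_div(s_log_prob, t_prob, reduction="none").sum(-1)
        kd_loss = (kd_per_token * mask).sum() / mask.sum().clamp(min=1.0)

        # T**2 rescales the soft-target gradients, as in standard
        # knowledge-distillation formulations.
        return alpha * mlm_loss + (1.0 - alpha) * (T ** 2) * kd_loss

Under the multi-step strategy the abstract mentions, a loss of this form would be applied repeatedly: the RoBERTa teacher is first distilled into an intermediate student, which in turn serves as the teacher for the final, smaller LightBERT. That staging is paraphrased from the abstract; the specific model sizes at each step are not given here.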
Acknowledgement
This work was partially supported by the National Natural Science Foundation of China (61632011, 61876053, 62006062), the Shenzhen Foundational Research Funding (JCYJ20180507183527919), the China Postdoctoral Science Foundation (2020M670912), and the Joint Lab of HITSZ and China Merchants Securities.
Author information
Authors and Affiliations
Joint Lab of HITSZ-CMS, Harbin Institute of Technology (Shenzhen), Shenzhen, China
Yice Zhang, Yihui Li, Peng Xu & Ruifeng Xu
China Merchants Securities Co., Ltd., Shenzhen, China
Jianxin Li, Guozhong Shi & Feiran Hu
Corresponding author
Correspondence to Ruifeng Xu.
Editor information
Editors and Affiliations
Georgia State University, Atlanta, GA, USA
Yi Pan
University of Pittsburgh, Pittsburgh, PA, USA
Zhi-Hong Mao
University of Electronic Science and Technology of China, Chengdu, China
Lei Luo
China Gridcom Co., Ltd., Shenzhen, China
Jing Zeng
Kingdee International Software Group Co., Ltd., Shenzhen, China
Liang-Jie Zhang
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Y. et al. (2022). LightBERT: A Distilled Chinese BERT Model. In: Pan, Y., Mao, Z.H., Luo, L., Zeng, J., Zhang, L.J. (eds) Artificial Intelligence and Mobile Services – AIMS 2021. Lecture Notes in Computer Science, vol 12987. Springer, Cham. https://doi.org/10.1007/978-3-030-96033-9_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96032-2
Online ISBN: 978-3-030-96033-9
eBook Packages: Computer Science, Computer Science (R0)