Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12987)
Abstract
Pre-trained language models (e.g., BERT) have achieved remarkable performance on most natural language understanding tasks. However, it is difficult to deploy these models in online systems because of their huge number of parameters and long inference time. Knowledge distillation is a popular model compression technique that can achieve considerable structural compression with limited performance degradation. However, no knowledge distillation method has been specially designed for compressing Chinese pre-trained language models, and no corresponding distilled model has been publicly released. In this paper, we propose LightBERT, a distilled BERT model designed specifically for Chinese language processing. We perform pre-training distillation under the masked language model objective with whole word masking, a masking strategy adapted to the characteristics of the Chinese language. Furthermore, we adopt a multi-step distillation strategy to compress the model progressively. Experiments on the CLUE benchmark show that LightBERT reduces the size of a RoBERTa model by 62.5% while retaining 94.5% of its teacher's performance.
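The abstract only names the training objective, so the following is a minimal sketch of the kind of pre-training distillation loss it describes: a hard-label masked-LM term plus a temperature-softened KL term against the teacher's vocabulary distribution at the masked positions. The function name, the temperature T, the weight alpha, and the -100 labelling convention for unmasked tokens are illustrative assumptions, not details taken from the paper or any released code.

    # Sketch of a pre-training distillation objective (assumed form, not the
    # authors' implementation): hard-label MLM loss + soft-label KD loss.
    import torch
    import torch.nn.functional as F

    def pretraining_distillation_loss(student_logits, teacher_logits, labels,
                                      T=2.0, alpha=0.5):
        """student_logits, teacher_logits: [batch, seq_len, vocab_size];
        labels: [batch, seq_len], with -100 at positions that were not
        selected by the whole-word masking step."""
        vocab = student_logits.size(-1)

        # Hard-label MLM cross-entropy on the masked positions only.
        mlm_loss = F.cross_entropy(student_logits.view(-1, vocab),
                                   labels.view(-1), ignore_index=-100)

        # Soft-label loss: KL(teacher || student) with temperature T,
        # restricted to the masked positions.
        mask = (labels != -100).float()
        s_log_prob = F.log_softmax(student_logits / T, dim=-1)
        t_prob = F.softmax(teacher_logits.detach() / T, dim=-1)
        kd_per_token = F.kl_div(s_log_prob, t_prob, reduction="none").sum(-1)
        kd_loss = (kd_per_token * mask).sum() / mask.sum().clamp(min=1.0)

        # T**2 rescales the soft-target gradients, as in standard
        # knowledge-distillation formulations.
        return alpha * mlm_loss + (1.0 - alpha) * (T ** 2) * kd_loss

Under the multi-step strategy the abstract mentions, a loss of this form would be applied repeatedly: the RoBERTa teacher is first distilled into an intermediate student, which in turn serves as the teacher for the final, smaller LightBERT. That staging is paraphrased from the abstract; the specific model sizes at each step are not given here.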
Acknowledgement
This work was partially supported by the National Natural Science Foundation of China (61632011, 61876053, 62006062), the Shenzhen Foundational Research Funding (JCYJ20180507183527919), the China Postdoctoral Science Foundation (2020M670912), and the Joint Lab of HITSZ and China Merchants Securities.
Author information
Authors and Affiliations
Joint Lab of HITSZ-CMS, Harbin Institute of Technology (Shenzhen), Shenzhen, China
Yice Zhang, Yihui Li, Peng Xu & Ruifeng Xu
China Merchants Securities Co., Ltd., Shenzhen, China
Jianxin Li, Guozhong Shi & Feiran Hu
Corresponding author
Correspondence to Ruifeng Xu.
Editor information
Editors and Affiliations
Georgia State University, Atlanta, GA, USA
Yi Pan
University of Pittsburgh, Pittsburgh, PA, USA
Zhi-Hong Mao
University of Electronic Science and Technology of China, Chengdu, China
Lei Luo
China Gridcom Co., Ltd., Shenzhen, China
Jing Zeng
Kingdee International Software Group Co., Ltd., Shenzhen, China
Liang-Jie Zhang
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Y. et al. (2022). LightBERT: A Distilled Chinese BERT Model. In: Pan, Y., Mao, Z.H., Luo, L., Zeng, J., Zhang, L.J. (eds) Artificial Intelligence and Mobile Services – AIMS 2021. Lecture Notes in Computer Science, vol 12987. Springer, Cham. https://doi.org/10.1007/978-3-030-96033-9_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96032-2
Online ISBN: 978-3-030-96033-9
eBook Packages: Computer Science, Computer Science (R0)