LightBERT: A Distilled Chinese BERT Model

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12987)

Included in the following conference series: Artificial Intelligence and Mobile Services (AIMS 2021)

  • 490 Accesses

Abstract

Pre-trained language models (e.g. BERT) have achieved remarkable performance on most natural language understanding tasks. However, it is difficult to deploy these models in online systems because of their huge number of parameters and long inference time. Knowledge distillation is a popular model compression technique that can achieve considerable structural compression with limited performance degradation. However, there are currently no knowledge distillation methods specially designed for compressing Chinese pre-trained language models, and no corresponding distilled model has been publicly released. In this paper, we propose LightBERT, a distilled BERT model designed specifically for Chinese language processing. We perform pre-training distillation under the masked language model objective with whole word masking, a masking strategy adapted to the characteristics of the Chinese language. Furthermore, we adopt a multi-step distillation strategy to compress the model progressively. Experiments on the CLUE benchmark show that LightBERT reduces the size of a RoBERTa model by 62.5% while retaining 94.5% of its teacher's performance.
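As a rough illustration of the two ingredients named in the abstract, the sketch below shows (i) whole word masking over a word-segmented Chinese sentence and (ii) a soft-label distillation loss computed on the masked positions, written in PyTorch. The function names (whole_word_mask, soft_mlm_distill_loss) and all hyperparameters are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of pre-training distillation under a masked-LM objective
# with whole word masking. Names and hyperparameters are assumptions made
# for illustration; they are NOT taken from the LightBERT paper or code.
import random
import torch
import torch.nn.functional as F


def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]"):
    """Mask whole (multi-character) Chinese words rather than single characters.

    `words` is a word-segmented sentence, e.g. ["模型", "压缩", "很", "重要"].
    Returns character-level tokens and a boolean mask over them.
    """
    tokens, is_masked = [], []
    for word in words:
        chars = list(word)
        if random.random() < mask_prob:
            # Mask every character of the selected word together.
            tokens.extend([mask_token] * len(chars))
            is_masked.extend([True] * len(chars))
        else:
            tokens.extend(chars)
            is_masked.extend([False] * len(chars))
    return tokens, torch.tensor(is_masked)


def soft_mlm_distill_loss(student_logits, teacher_logits, mask, temperature=1.0):
    """KL divergence between teacher and student MLM distributions at the
    masked positions only (a common form of pre-training distillation loss).

    student_logits, teacher_logits: (seq_len, vocab_size)
    mask: boolean tensor (seq_len,) marking masked positions.
    """
    s = F.log_softmax(student_logits[mask] / temperature, dim=-1)
    t = F.softmax(teacher_logits[mask] / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```

A multi-step (progressive) schedule of the kind the abstract describes would repeat such a distillation step through a chain of students of decreasing size (teacher → intermediate model → final student), in the spirit of teacher-assistant distillation [13].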


References

  1. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. Stat 1050, 21 (2016)

  2. Cui, Y., et al.: Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101 (2019)

  3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)

  4. Hinton, G., Dean, J., Vinyals, O.: Distilling the knowledge in a neural network, pp. 1–9, March 2014

  5. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  6. Jiao, X., et al.: TinyBERT: distilling BERT for natural language understanding. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4163–4174 (2020)

  7. Kovaleva, O., Romanov, A., Rogers, A., Rumshisky, A.: Revealing the dark secrets of BERT. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4365–4374 (2019)

  8. Kullback, S.: Information Theory and Statistics. Courier Corporation (1997)

  9. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. In: International Conference on Learning Representations (2019)

  10. Li, X., Yan, H., Qiu, X., Huang, X.J.: FLAT: Chinese NER using flat-lattice transformer. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6836–6842 (2020)

  11. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  12. Michel, P., Levy, O., Neubig, G.: Are sixteen heads really better than one? Adv. Neural Inf. Process. Syst. 32, 14014–14024 (2019)

  13. Mirzadeh, S.I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., Ghasemzadeh, H.: Improved knowledge distillation via teacher assistant. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 5191–5198 (2020)

  14. Romero, A., Ballas, N., Kahou, S., Chassang, A., Gatta, C., Bengio, Y.: FitNets: hints for thin deep nets. CoRR abs/1412.6550 (2015)

  15. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)

  16. Sun, S., Cheng, Y., Gan, Z., Liu, J.: Patient knowledge distillation for BERT model compression. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4323–4332 (2019)

  17. Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., Zhou, D.: MobileBERT: a compact task-agnostic BERT for resource-limited devices. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2158–2170 (2020)

  18. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  19. Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I.: Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. Association for Computational Linguistics (2019)

  20. Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D.F., Chao, L.S.: Learning deep transformer models for machine translation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1810–1822 (2019)

  21. Wu, N., Green, B., Ben, X., O’Banion, S.: Deep transformer models for time series forecasting: the influenza prevalence case. arXiv preprint arXiv:2001.08317 (2020)

  22. Xu, L., et al.: CLUE: a Chinese language understanding evaluation benchmark. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 4762–4772 (2020)


Acknowledgement

This work was partially supported by the National Natural Science Foundation of China (61632011, 61876053, 62006062), the Shenzhen Foundational Research Funding (JCYJ20180507183527919), the China Postdoctoral Science Foundation (2020M670912), and the Joint Lab of HITSZ and China Merchants Securities.

Author information

Authors and Affiliations

  1. Joint Lab of HITSZ-CMS, Harbin Institute of Technology (Shenzhen), Shenzhen, China

    Yice Zhang, Yihui Li, Peng Xu & Ruifeng Xu

  2. China Merchants Securities Co., Ltd., Shenzhen, China

    Jianxin Li, Guozhong Shi & Feiran Hu


Corresponding author

Correspondence to Ruifeng Xu.

Editor information

Editors and Affiliations

  1. Georgia State University, Atlanta, GA, USA

    Yi Pan

  2. University of Pittsburgh, Pittsburgh, PA, USA

    Zhi-Hong Mao

  3. University of Electronic Science and Technology of China, Chengdu, China

    Lei Luo

  4. China Gridcom Co., Ltd., Shenzhen, China

    Jing Zeng

  5. Kingdee International Software Group Co., Ltd., Shenzhen, China

    Liang-Jie Zhang


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, Y., et al. (2022). LightBERT: A Distilled Chinese BERT Model. In: Pan, Y., Mao, Z.H., Luo, L., Zeng, J., Zhang, L.J. (eds.) Artificial Intelligence and Mobile Services – AIMS 2021. Lecture Notes in Computer Science, vol. 12987. Springer, Cham. https://doi.org/10.1007/978-3-030-96033-9_5
