A survey on multilingual large language models: corpora, alignment, and bias


Abstract

Built on the foundation of Large Language Models (LLMs), Multilingual LLMs (MLLMs) have been developed to address the challenges of multilingual natural language processing, with the goal of transferring knowledge from high-resource to low-resource languages. However, significant limitations and challenges remain, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we aim to provide a comprehensive analysis of MLLMs, delving deeply into these critical issues. First, we present an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Second, we explore the multilingual training corpora of MLLMs and the multilingual datasets for downstream tasks, both crucial for enhancing the cross-lingual capability of MLLMs. Third, we survey state-of-the-art studies of multilingual representations and investigate whether current MLLMs can learn a universal language representation. Fourth, we discuss bias in MLLMs, including its categories, evaluation metrics, and debiasing techniques. Finally, we outline existing challenges and point out promising research directions for MLLMs.
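To make the notion of a bias evaluation metric concrete, the following is a minimal, illustrative Python sketch of a WEAT-style association test (Caliskan et al., 2017), one intrinsic metric commonly applied to (multilingual) word embeddings; the toy embedding table `emb` and the word sets are placeholder assumptions for demonstration, not data or code from the paper.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B, emb):
    # s(w, A, B): mean similarity of word w to attribute set A minus attribute set B.
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    # Effect size of the differential association of target sets X and Y
    # with attribute sets A and B; larger |d| indicates stronger measured bias.
    x = [association(w, A, B, emb) for w in X]
    y = [association(w, A, B, emb) for w in Y]
    return (np.mean(x) - np.mean(y)) / np.std(x + y, ddof=1)

# Hypothetical usage with random vectors standing in for real embeddings.
emb = {w: np.random.rand(300) for w in
       ["doctor", "engineer", "nurse", "teacher", "he", "him", "she", "her"]}
d = weat_effect_size(["doctor", "engineer"], ["nurse", "teacher"],
                     ["he", "him"], ["she", "her"], emb)
print(f"WEAT effect size: {d:.3f}")

In a multilingual setting, the same score can be computed per language over aligned embedding spaces, which is one way cross-lingual bias comparisons of the kind surveyed here can be operationalized.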


Acknowledgements

This work was supported by the National Social Science Foundation of China (No. 24CYY107).

Author information

Authors and Affiliations

  1. School of Information Science and Technology, Beijing Foreign Studies University, Beijing, 100089, China

    Yuemei Xu, Ling Hu, Jiayi Zhao, Zihan Qiu, Kexin Xu, Yuqi Ye & Hanwen Gu


Corresponding author

Correspondence to Yuemei Xu.

Ethics declarations

Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Yuemei XU is an associate professor in the School of Information Science and Technology, Beijing Foreign Studies University, China. She received her PhD degree from the Chinese Academy of Sciences, China in 2014 and her BE degree from Beijing University of Posts and Telecommunications, China in 2009. Her main research interests include multilingual natural language processing and artificial intelligence.

Ling HU received the bachelor’s degree from Beijing University of Posts and Telecommunications, China in 2021. She is currently pursuing a master’s degree at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.

Jiayi ZHAO is majoring in computer science and technology at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.

Zihan QIU is majoring in computer science and technology at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.

Kexin XU received the bachelor’s degree from Southwestern University of Finance and Economics, China in 2024. She is currently pursuing a master’s degree at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.

Yuqi YE is majoring in computer science and technology at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.

Hanwen GU received the BE degree from the School of Information Science at Beijing Language and Culture University, China in 2023. Currently, he is pursuing a master’s degree in the School of Information Science and Technology at Beijing Foreign Studies University, China. His primary research interests encompass natural language processing and artificial intelligence.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Xu, Y., Hu, L., Zhao, J. et al. A survey on multilingual large language models: corpora, alignment, and bias. Front. Comput. Sci. 19, 1911362 (2025). https://doi.org/10.1007/s11704-024-40579-4
