Abstract
Built on the foundation of Large Language Models (LLMs), Multilingual LLMs (MLLMs) have been developed to address the challenges of multilingual natural language processing, aiming to transfer knowledge from high-resource to low-resource languages. However, significant limitations and challenges remain, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we provide a comprehensive analysis of MLLMs, delving deeply into these critical issues. First, we present an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Second, we explore the multilingual training corpora of MLLMs and the multilingual datasets for downstream tasks, both of which are crucial for enhancing the cross-lingual capability of MLLMs. Third, we survey state-of-the-art studies of multilingual representations and investigate whether current MLLMs can learn a universal language representation. Fourth, we discuss bias in MLLMs, including its categories, evaluation metrics, and debiasing techniques. Finally, we discuss existing challenges and point out promising research directions for MLLMs.
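As a concrete illustration of the universal-representation question raised in the abstract, the short sketch below probes whether a multilingual encoder places translation pairs close together in its embedding space. This is a minimal example, not the paper's own experimental setup: the model choice (xlm-roberta-base), the mean-pooling strategy, and the sample sentences are illustrative assumptions only.

```python
# Minimal sketch (not from the paper): checking whether a multilingual encoder
# maps translations of the same sentence to nearby vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)       # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

en = embed("The cat sits on the mat.")
fr = embed("Le chat est assis sur le tapis.")   # French translation
sw = embed("Paka ameketi kwenye mkeka.")        # Swahili translation (lower-resource)

# Higher cosine similarity for translation pairs than for unrelated sentences is
# one coarse signal of a shared cross-lingual representation space.
print(torch.cosine_similarity(en, fr).item())
print(torch.cosine_similarity(en, sw).item())
```

A more careful probe would compare these similarities against unrelated sentence pairs and across many language pairs, which is the kind of analysis the representation studies surveyed in the paper carry out.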
Acknowledgements
This work was supported by the National Social Science Foundation of China (No. 24CYY107).
Author information
Authors and Affiliations
School of Information Science and Technology, Beijing Foreign Studies University, Beijing, 100089, China
Yuemei Xu, Ling Hu, Jiayi Zhao, Zihan Qiu, Kexin Xu, Yuqi Ye & Hanwen Gu
Corresponding author
Correspondence to Yuemei Xu.
Ethics declarations
Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.
Additional information
Yuemei XU is an associate professor in the School of Information Science and Technology, Beijing Foreign Studies University, China. She received her PhD degree from the Chinese Academy of Sciences, China in 2014 and her BE degree from Beijing University of Posts and Telecommunications, China in 2009. Her main research interests include multilingual natural language processing and artificial intelligence.
Ling HU received her bachelor's degree from Beijing University of Posts and Telecommunications, China in 2021. She is currently pursuing a master's degree at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.
Jiayi ZHAO is majoring in computer science and technology at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.
Zihan QIU is majoring in computer science and technology at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.
Kexin XU received her bachelor's degree from Southwestern University of Finance and Economics, China in 2024. She is currently pursuing a master's degree at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.
Yuqi YE is majoring in computer science and technology at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.
Hanwen GU received the BE degree from the School of Information Science at Beijing Language and Culture University, China in 2023. Currently, he is pursuing a master’s degree in the School of Information Science and Technology at Beijing Foreign Studies University, China. His primary research interests encompass natural language processing and artificial intelligence.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Xu, Y., Hu, L., Zhao, J. et al. A survey on multilingual large language models: corpora, alignment, and bias. Front. Comput. Sci. 19, 1911362 (2025). https://doi.org/10.1007/s11704-024-40579-4