Abstract
Built on the foundation of Large Language Models (LLMs), Multilingual LLMs (MLLMs) have been developed to address the challenges of multilingual natural language processing, aiming to transfer knowledge from high-resource to low-resource languages. However, significant limitations and challenges remain, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we provide a comprehensive analysis of MLLMs, delving deeply into these critical issues. First, we present an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Second, we explore the multilingual training corpora of MLLMs and the multilingual datasets for downstream tasks, both of which are crucial for enhancing the cross-lingual capability of MLLMs. Third, we survey state-of-the-art studies of multilingual representations and investigate whether current MLLMs can learn a universal language representation. Fourth, we discuss bias in MLLMs, including its categories, evaluation metrics, and debiasing techniques. Finally, we discuss existing challenges and point out promising research directions for MLLMs.
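As a concrete illustration of the universal-representation question raised in the abstract, the short sketch below probes whether a multilingual encoder places translation pairs close together in its embedding space. This is a minimal example, not the paper's own experimental setup: the model choice (xlm-roberta-base), the mean-pooling strategy, and the sample sentences are illustrative assumptions only.

```python
# Minimal sketch (not from the paper): checking whether a multilingual encoder
# maps translations of the same sentence to nearby vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)       # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

en = embed("The cat sits on the mat.")
fr = embed("Le chat est assis sur le tapis.")   # French translation
sw = embed("Paka ameketi kwenye mkeka.")        # Swahili translation (lower-resource)

# Higher cosine similarity for translation pairs than for unrelated sentences is
# one coarse signal of a shared cross-lingual representation space.
print(torch.cosine_similarity(en, fr).item())
print(torch.cosine_similarity(en, sw).item())
```

A more careful probe would compare these similarities against unrelated sentence pairs and across many language pairs, which is the kind of analysis the representation studies surveyed in the paper carry out.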
Acknowledgements
This work was supported by the National Social Science Foundation of China (No. 24CYY107).
Author information
Authors and Affiliations
School of Information Science and Technology, Beijing Foreign Studies University, Beijing, 100089, China
Yuemei Xu, Ling Hu, Jiayi Zhao, Zihan Qiu, Kexin Xu, Yuqi Ye & Hanwen Gu
Corresponding author
Correspondence to Yuemei Xu.
Ethics declarations
Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.
Additional information
Yuemei XU is an associate professor in the School of Information Science and Technology, Beijing Foreign Studies University, China. She received her PhD degree from the Chinese Academy of Sciences, China in 2014 and her BE degree from Beijing University of Posts and Telecommunications, China in 2009. Her main research interests include multilingual natural language processing and artificial intelligence.
Ling HU received her bachelor's degree from Beijing University of Posts and Telecommunications, China in 2021. She is currently pursuing a master's degree at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.
Jiayi ZHAO is majoring in computer science and technology at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.
Zihan QIU is majoring in computer science and technology at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.
Kexin XU received her bachelor's degree from Southwestern University of Finance and Economics, China in 2024. She is currently pursuing a master's degree at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.
Yuqi YE is majoring in computer science and technology at the School of Information Science and Technology, Beijing Foreign Studies University, China. Her main research interests include multilingual natural language processing and artificial intelligence.
Hanwen GU received the BE degree from the School of Information Science at Beijing Language and Culture University, China in 2023. Currently, he is pursuing a master’s degree in the School of Information Science and Technology at Beijing Foreign Studies University, China. His primary research interests encompass natural language processing and artificial intelligence.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Xu, Y., Hu, L., Zhao, J. et al. A survey on multilingual large language models: corpora, alignment, and bias. Front. Comput. Sci. 19, 1911362 (2025). https://doi.org/10.1007/s11704-024-40579-4