A comprehensive collection of multilingual datasets and large language models, meticulously curated for evaluating and enhancing the performance of large language models across diverse languages and tasks.
zabir-nabil/awesome-multilingual-large-language-models

Dataset | Year | Languages | GitHub | Download |
---|---|---|---|---|
OMGEval : An Open Multilingual Generative Evaluation Benchmark for Large Language Models | 2024 | Chinese (zh) (🇨🇳), Russian (ru) (🇷🇺), French (fr) (🇫🇷), Spanish (es) (🇪🇸), Arabic (ar) (🇸🇦) | Github | Data |
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | 2024 | Chinese (zh) (🇨🇳), English (en) (🇬🇧), German (de) (🇩🇪), Japanese (ja) (🇯🇵), French (fr) (🇫🇷), Korean (ko) (🇰🇷), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), Portuguese (pt) (🇵🇹), Catalan (ca) (🇦🇩) | Github | Data |
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | 2024 | English (en) (🇬🇧), Chinese (zh) (🇨🇳), Japanese (ja) (🇯🇵), French (fr) (🇫🇷), German (de) (🇩🇪) | Github | Data |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | 2023 | English (🇺🇸), Chinese (🇨🇳), Italian (🇮🇹), Portuguese (🇧🇷), Vietnamese (🇻🇳), Thai (🇹🇭), Swahili (🇰🇪), Afrikaans (🇿🇦), Javanese (🇮🇩) | Github | Data |
Language models are multilingual chain-of-thought reasoners | 2023 | Bengali (🇧🇩), Chinese (🇨🇳), French (🇫🇷), German (🇩🇪), Japanese (🇯🇵), Russian (🇷🇺), Spanish (🇪🇸), Swahili (🇰🇪), Telugu (🇮🇳), Thai (🇹🇭) | Github | Data |
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages | 2023 | English [🇬🇧], Russian [🇷🇺], Spanish [🇪🇸], German [🇩🇪], French [🇫🇷], Chinese [🇨🇳], Italian [🇮🇹], Portuguese [🇵🇹], Polish [🇵🇱], Japanese [🇯🇵], Vietnamese [🇻🇳], Dutch [🇳🇱], Arabic [🇸🇦], Turkish [🇹🇷], Czech [🇨🇿], Persian [🇮🇷], Hungarian [🇭🇺], Greek [🇬🇷], Romanian [🇷🇴], Swedish [🇸🇪], Ukrainian [🇺🇦], Finnish [🇫🇮], Korean [🇰🇷], Danish [🇩🇰], Bulgarian [🇧🇬], Norwegian [🇳🇴], Hindi [🇮🇳], Slovak [🇸🇰], Thai [🇹🇭], Lithuanian [🇱🇹], Catalan [🇪🇸], Indonesian [🇮🇩], Bangla [🇧🇩], Estonian [🇪🇪], Slovenian [🇸🇮], Latvian [🇱🇻], Hebrew [🇮🇱], Serbian [🇷🇸], Tamil [🇮🇳], Albanian [🇦🇱], Azerbaijani [🇦🇿] | 🤗 | Data |
Wiki-40B: Multilingual Language Model Dataset | 2020 | English (🇺🇸), German (🇩🇪), French (🇫🇷), Russian (🇷🇺), Spanish (🇪🇸), Italian (🇮🇹), Japanese (🇯🇵), Chinese Simplified (🇨🇳), Chinese Traditional (🇹🇼), Polish (🇵🇱), Ukrainian (🇺🇦), Dutch (🇳🇱), Swedish (🇸🇪), Portuguese (🇵🇹), Serbian (🇷🇸), Hungarian (🇭🇺), Catalan (🇪🇸), Czech (🇨🇿), Finnish (🇫🇮), Arabic (🇸🇦), Korean (🇰🇷), Persian (🇮🇷), Norwegian (🇳🇴), Vietnamese (🇻🇳), Hebrew (🇮🇱), Indonesian (🇮🇩), Romanian (🇷🇴), Turkish (🇹🇷), Bulgarian (🇧🇬), Estonian (🇪🇪), Malay (🇲🇾), Danish (🇩🇰), Slovak (🇸🇰), Croatian (🇭🇷), Greek (🇬🇷), Lithuanian (🇱🇹), Slovenian (🇸🇮), Thai (🇹🇭), Hindi (🇮🇳), Latvian (🇱🇻), Filipino (🇵🇭) | 👁️ | Data |
Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning | 2021 | English (🇺🇸), German (🇩🇪), French (🇫🇷), Russian (🇷🇺), Spanish (🇪🇸), Hindi (🇮🇳), Vietnamese (🇻🇳), Bulgarian (🇧🇬), Chinese (🇨🇳), Dutch (🇳🇱), Italian (🇮🇹), Japanese (🇯🇵), Polish (🇵🇱), Portuguese (🇵🇹), Arabic (🇸🇦), Swahili (🇹🇿), Urdu (🇵🇰) | GitHub | Data |
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset | 2022 | Akan (🇬🇭), Arabic (🇸🇦), Assamese (🇮🇳), Bambara (🇲🇱), Basque (🇪🇸), Bengali (🇧🇩), Catalan (🇪🇸), Chichewa (🇲🇼), chiShona (🇿🇼), Chitumbuka (🇲🇼), English (🇬🇧), Fon (🇧🇯), French (🇫🇷), Gujarati (🇮🇳), Hindi (🇮🇳), Igbo (🇳🇬), Indonesian (🇮🇩), isiXhosa (🇿🇦), isiZulu (🇿🇦), Kannada (🇮🇳), Kikuyu (🇰🇪), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Lingala (🇨🇩), Luganda (🇺🇬), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Northern Sotho (🇿🇦), Odia (🇮🇳), Portuguese (🇵🇹), Punjabi (🇮🇳), Sesotho (🇱🇸), Setswana (🇧🇼), Simplified Chinese (🇨🇳), Spanish (🇪🇸), Swahili (🇰🇪), Tamil (🇮🇳), Telugu (🇮🇳), Traditional Chinese (🇹🇼), Twi (🇬🇭), Urdu (🇵🇰), Vietnamese (🇻🇳), Wolof (🇸🇳), Xitsonga (🇿🇦), Yoruba (🇳🇬), Programming Languages (💻) | GitHub️ | Data |
GEOMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models | 2022 | English (🇺🇸), Chinese (🇨🇳), Hindi (🇮🇳), Persian (🇮🇷), Swahili (🇰🇪) | GitHub | 🔍 |
Title | Year | Languages | Code | Demo |
---|---|---|---|---|
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model | 2024 | Afrikaans [🇿🇦], Amharic [🇪🇹], Arabic [🇸🇦], Azerbaijani [🇦🇿], Belarusian [🇧🇾], Bengali [🇧🇩], Bulgarian [🇧🇬], Catalan [🇪🇸], Cebuano [🇵🇭], Czech [🇨🇿], Welsh [🏴], Danish [🇩🇰], German [🇩🇪], Greek [🇬🇷], English [🇬🇧], Esperanto [🇪🇸], Estonian [🇪🇪], Basque [🇪🇸], Finnish [🇫🇮], Tagalog [🇵🇭], French [🇫🇷], Western Frisian [🇳🇱], Scottish Gaelic [🏴], Irish [🇮🇪], Galician [🇪🇸], Gujarati [🇮🇳], Haitian Creole [🇭🇹], Hausa [🇳🇪], Hebrew [🇮🇱], Hindi [🇮🇳], Hungarian [🇭🇺], Armenian [🇦🇲], Igbo [🇳🇬], Indonesian [🇮🇩], Icelandic [🇮🇸], Italian [🇮🇹], Javanese [🇮🇩], Japanese [🇯🇵], Kannada [🇮🇳], Georgian [🇬🇪], Kazakh [🇰🇿], Khmer [🇰🇭], Kyrgyz [🇰🇬], Korean [🇰🇷], Kurdish [🇹🇷], Lao [🇱🇦], Latvian [🇱🇻], Latin [🇻🇦], Lithuanian [🇱🇹], Luxembourgish [🇱🇺], Malayalam [🇮🇳], Marathi [🇮🇳], Macedonian [🇲🇰], Malagasy [🇲🇬], Maltese [🇲🇹], Mongolian [🇲🇳], Maori [🇳🇿], Malay [🇲🇾], Burmese [🇲🇲], Nepali [🇳🇵], Dutch [🇳🇱], Norwegian [🇳🇴], Northern Sotho [🇿🇦], Chichewa [🇲🇼], Oriya [🇮🇳], Punjabi [🇮🇳], Persian [🇮🇷], Polish [🇵🇱], Portuguese [🇵🇹], Pashto [🇦🇫], Romanian [🇷🇴], Russian [🇷🇺], Sinhala [🇱🇰], Slovak [🇸🇰], Slovenian [🇸🇮], Samoan [🇼🇸], Shona [🇿🇼], Sindhi [🇵🇰], Somali [🇸🇴], Southern Sotho [🇱🇸], Spanish [🇪🇸], Albanian [🇦🇱], Serbian [🇷🇸], Sundanese [🇮🇩], Swahili [🇰🇪], Swedish [🇸🇪], Tamil [🇮🇳], Telugu [🇮🇳], Tajik [🇹🇯], Thai [🇹🇭], Turkish [🇹🇷], Twi [🇬🇭], Ukrainian [🇺🇦], Urdu [🇵🇰], Uzbek [🇺🇿], Vietnamese [🇻🇳], Xhosa [🇿🇦], Yiddish [🇮🇱], Yoruba [🇳🇬], Chinese [🇨🇳], Zulu [🇿🇦] | Source | 🤗 |
LANGBRIDGE: Multilingual Reasoning Without Multilingual Supervision | 2024 | Arabic (ar) (🇸🇦), Bengali (bn) (🇧🇩), Chinese (zh) (🇨🇳), Danish (da) (🇩🇰), Dutch (nl) (🇳🇱), English (en) (🇬🇧), French (fr) (🇫🇷), German (de) (🇩🇪), Hindi (hi) (🇮🇳), Japanese (ja) (🇯🇵), Korean (ko) (🇰🇷), Marathi (mr) (🇮🇳), Punjabi (pa) (🇮🇳), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), Swahili (sw) (🇰🇪), Telugu (te) (🇮🇳), Turkish (tr) (🇹🇷), Urdu (ur) (🇵🇰) | Github | 🤗 |
Orion-14B: Open-source Multilingual Large Language Models | 2024 | English [🇬🇧], Chinese [🇨🇳], Japanese [🇯🇵], Korean [🇰🇷], Spanish [🇪🇸], French [🇫🇷], German [🇩🇪], Arabic [🇸🇦] | Github | 🤗 |
Baichuan 2: Open Large-scale Language Models | 2023 | Arabic (ar) (🇸🇦), Chinese (zh) (🇨🇳), English (en) (🇬🇧), French (fr) (🇫🇷), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), German (de) (🇩🇪), Japanese (ja) (🇯🇵) | Github | 🤗 |
Larger-Scale Transformers for Multilingual Masked Language Modeling | 2021 | Afrikaans (🇿🇦), Albanian (🇦🇱), Amharic (🇪🇹), Arabic (🇸🇦), Armenian (🇦🇲), Assamese (🇮🇳), Azerbaijani (🇦🇿), Basque (🇪🇸), Belarusian (🇧🇾), Bengali (🇧🇩), Bengali Romanize (🇧🇩), Bosnian (🇧🇦), Breton (🏴), Bulgarian (🇧🇬), Burmese (🇲🇲), Burmese zawgyi font (🇲🇲), Catalan (🇪🇸), Chinese (Simplified) (🇨🇳), Chinese (Traditional) (🇹🇼), Croatian (🇭🇷), Czech (🇨🇿), Danish (🇩🇰), Dutch (🇳🇱), English (🇬🇧), Esperanto (🏴), Estonian (🇪🇪), Filipino (🇵🇭), Finnish (🇫🇮), French (🇫🇷), Galician (🇪🇸), Georgian (🇬🇪), German (🇩🇪), Greek (🇬🇷), Gujarati (🇮🇳), Hausa (🇳🇬), Hebrew (🇮🇱), Hindi (🇮🇳), Hindi Romanize (🇮🇳), Hungarian (🇭🇺), Icelandic (🇮🇸), Indonesian (🇮🇩), Irish (🇮🇪), Italian (🇮🇹), Japanese (🇯🇵), Javanese (🇮🇩), Kannada (🇮🇳), Kazakh (🇰🇿), Khmer (🇰🇭), Korean (🇰🇷), Kurdish (Kurmanji) (🇹🇷), Kyrgyz (🇰🇬), Lao (🇱🇦), Latin (🏛️), Latvian (🇱🇻), Lithuanian (🇱🇹), Macedonian (🇲🇰), Malagasy (🇲🇬), Malay (🇲🇾), Malayalam (🇮🇳), Marathi (🇮🇳), Mongolian (🇲🇳), Nepali (🇳🇵), Norwegian (🇳🇴), Oriya (🇮🇳), Oromo (🇪🇹), Pashto (🇦🇫), Persian (🇮🇷), Polish (🇵🇱), Portuguese (🇵🇹), Punjabi (🇮🇳), Romanian (🇷🇴), Russian (🇷🇺), Sanskrit (🇮🇳), Scottish Gaelic (🏴), Serbian (🇷🇸), Sindhi (🇵🇰), Sinhala (🇱🇰), Slovak (🇸🇰), Slovenian (🇸🇮), Somali (🇸🇴), Spanish (🇪🇸), Sundanese (🇮🇩), Swahili (🇰🇪), Swedish (🇸🇪), Tamil (🇮🇳), Tamil Romanize (🇮🇳), Telugu (🇮🇳), Telugu Romanize (🇮🇳), Thai (🇹🇭), Turkish (🇹🇷), Ukrainian (🇺🇦), Urdu (🇵🇰), Urdu Romanize (🇵🇰), Uyghur (🇨🇳), Uzbek (🇺🇿), Vietnamese (🇻🇳), Welsh (🏴), Western Frisian (🇳🇱), Xhosa (🇿🇦), Yiddish (🇮🇱) | Github | 🔍 |
InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities | 2023 | English (🇺🇸), Chinese (🇨🇳) | Github | 🔍 |
PolyLM: An Open Source Polyglot Large Language Model | 2023 | English (EN) [🇬🇧], Chinese (ZH) [🇨🇳], Russian (RU) [🇷🇺], Spanish (ES) [🇪🇸], German (DE) [🇩🇪], French (FR) [🇫🇷], Italian (IT) [🇮🇹], Portuguese (PT) [🇵🇹], Japanese (JA) [🇯🇵], Vietnamese (VI) [🇻🇳], Indonesian (ID) [🇮🇩], Polish (PL) [🇵🇱], Dutch (NL) [🇳🇱], Arabic (AR) [🇦🇪], Turkish (TR) [🇹🇷], Thai (TH) [🇹🇭], Hebrew (HE) [🇮🇱], Korean (KO) [🇰🇷] | Model | 🔍 |
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | 2023 | Akan (🇬🇭), Arabic (🇸🇦), Assamese (🇮🇳), Bambara (🇲🇱), Basque (🇪🇸), Bengali (🇧🇩), Catalan (🇪🇸), Chichewa (🇲🇼), chiShona (🇿🇼), Chitumbuka (🇲🇼), English (🇬🇧), Fon (🇧🇯), French (🇫🇷), Gujarati (🇮🇳), Hindi (🇮🇳), Igbo (🇳🇬), Indonesian (🇮🇩), isiXhosa (🇿🇦), isiZulu (🇿🇦), Kannada (🇮🇳), Kikuyu (🇰🇪), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Lingala (🇨🇩), Luganda (🇺🇬), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Northern Sotho (🇿🇦), Odia (🇮🇳), Portuguese (🇵🇹), Punjabi (🇮🇳), Sesotho (🇱🇸), Setswana (🇧🇼), Simplified Chinese (🇨🇳), Spanish (🇪🇸), Swahili (🇰🇪), Tamil (🇮🇳), Telugu (🇮🇳), Traditional Chinese (🇹🇼), Twi (🇬🇭), Urdu (🇵🇰), Vietnamese (🇻🇳), Wolof (🇸🇳), Xitsonga (🇿🇦), Yoruba (🇳🇬), Programming Languages (💻) | Github | 🤗 |
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages | 2023 | hbs_Latn (🇭🇷), mal_Mlym (🇮🇳), aze_Latn (🇦🇿), guj_Gujr (🇮🇳), ben_Beng (🇮🇳), kan_Knda (🇮🇳), tel_Telu (🇮🇳), mlt_Latn (🇲🇹), fra_Latn (🇫🇷), spa_Latn (🇪🇸), eng_Latn (🇬🇧), fil_Latn (🇵🇭), nob_Latn (🇳🇴), rus_Cyrl (🇷🇺), deu_Latn (🇩🇪), tur_Latn (🇹🇷), pan_Guru (🇮🇳), mar_Deva (🇮🇳), por_Latn (🇵🇹), nld_Latn (🇳🇱), ara_Arab (🇸🇦), zho_Hani (🇨🇳), ita_Latn (🇮🇹), ind_Latn (🇮🇩), ell_Grek (🇬🇷), bul_Cyrl (🇧🇬), swe_Latn (🇸🇪), ces_Latn (🇨🇿), isl_Latn (🇮🇸), pol_Latn (🇵🇱), ron_Latn (🇷🇴), dan_Latn (🇩🇰), hun_Latn (🇭🇺), tgk_Cyrl (🇹🇯), srp_Latn (🇷🇸), fas_Arab (🇮🇷), ceb_Latn (🇵🇭), heb_Hebr (🇮🇱), hrv_Latn (🇭🇷), glg_Latn (🇪🇸), fin_Latn (🇫🇮), slv_Latn (🇸🇮), vie_Latn (🇻🇳), mkd_Cyrl (🇲🇰), slk_Latn (🇸🇰), nor_Latn (🇳🇴), est_Latn (🇪🇪), ltz_Latn (🇱🇺), eus_Latn (🇪🇸), lit_Latn (🇱🇹), kaz_Cyrl (🇰🇿), lav_Latn (🇱🇻), bos_Latn (🇧🇦), epo_Latn (🇺🇸), cat_Latn (🇪🇸), tha_Thai (🇹🇭), ukr_Cyrl (🇺🇦), tgl_Latn (🇵🇭), sin_Sinh (🇱🇰), gle_Latn (🇮🇪), hin_Deva (🇮🇳), kor_Hang (🇰🇷), ory_Orya (🇮🇳), urd_Arab (🇵🇰), swa_Latn (🇰🇪), sqi_Latn (🇦🇱), bel_Cyrl (🇧🇾), afr_Latn (🇿🇦), nno_Latn (🇳🇴), tat_Cyrl (🇷🇺), asm_Beng (🇮🇳), hil_Latn (🇵🇭), nso_Latn (🇿🇦), ibo_Latn (🇳🇬), kin_Latn (🇷🇼), tpi_Latn (🇵🇬), twi_Latn (🇬🇭), kir_Cyrl (🇰🇬), nep_Deva (🇳🇵), azj_Latn (🇦🇿), bcl_Latn (🇵🇭), xho_Latn (🇿🇦), cym_Latn (🏴), gaa_Latn (🇬🇭), ton_Latn (🇹🇴), tah_Latn (🇵🇫), lat_Latn (🇻🇦), srn_Latn (🇸🇷), ewe_Latn (🇬🇭), bem_Latn (🇿🇲), orm_Latn (🇪🇹), haw_Latn (🇺🇸), hmo_Latn (🇵🇬), kat_Geor (🇬🇪), pag_Latn (🇵🇭), loz_Latn (🇿🇲), fry_Latn (🇳🇱), mya_Mymr (🇲🇲), nds_Latn (🇩🇪), run_Latn (🇧🇮), pnb_Arab (🇵🇰), rar_Latn (🇨🇰), fij_Latn (🇫🇯), wls_Latn (🇼🇸), ckb_Arab (🇮🇶), ven_Latn (🇿🇦), zsm_Latn (🇲🇾), chv_Cyrl (🇷🇺), lua_Latn (🇨🇩), que_Latn (🇵🇪), sag_Latn (🇨🇫), guw_Latn (🇬🇼), bre_Latn (🇫🇷), toi_Latn (🇨🇫), pus_Arab (🇦🇫), che_Cyrl (🇷🇺), pis_Latn (🇸🇧), kon_Latn (🇨🇩), oss_Cyrl (🇷🇺), hyw_Armn (🇦🇲), iso_Latn (🇻🇺), nan_Latn (🇹🇼), lub_Latn (🇨🇩), lim_Latn (🇳🇱), tuk_Latn (🇹🇲), tir_Ethi (🇪🇹), tgk_Latn (🇹🇯), yua_Latn (🇲🇽), min_Latn (🇮🇩), lue_Latn (🇨🇩), khm_Khmr (🇰🇭), tum_Latn (🇲🇼), tll_Latn (🇳🇦), ekk_Latn (🇪🇪), lug_Latn (🇺🇬), niu_Latn (🇳🇺), tzo_Latn (🇲🇽), mah_Latn (🇲🇭), tvl_Latn (🇹🇻), jav_Latn (🇮🇩), hau_Latn (🇳🇬), som_Latn (🇸🇴), uzb_Latn (🇺🇿), sot_Latn (🇿🇦), uzb_Cyrl (🇺🇿), cos_Latn (🇫🇷), als_Latn (🇦🇱), amh_Ethi (🇪🇹), sun_Latn (🇮🇩), war_Latn (🇵🇭), div_Thaa (🇲🇻), yor_Latn (🇳🇬), fao_Latn (🇫🇴), uzn_Cyrl (🇺🇿), smo_Latn (🇼🇸), bak_Cyrl (🇷🇺), ilo_Latn (🇵🇭), tso_Latn (🇿🇦), mri_Latn (🇳🇿), hmn_Latn (🇺🇸), nau_Latn (🇳🇷), asm_Beng (🇮🇳), hil_Latn (🇵🇭), nso_Latn (🇿🇦), ibo_Latn (🇳🇬), kin_Latn (🇷🇼), tpi_Latn (🇵🇬), twi_Latn (🇬🇭), kir_Cyrl (🇰🇬), pap_Latn (🇳🇱), aze_Latn (🇦🇿), qvi_Latn (🇵🇪), cak_Latn (🇬🇹), kbp_Latn (🇧🇫), kri_Latn (🇸🇱), mau_Latn (🇲🇽), scn_Latn (🇮🇹), tyv_Cyrl (🇷🇺), ina_Latn (🇧🇪), btx_Latn (🇮🇩), nch_Latn (🇲🇽), ncj_Latn (🇲🇽), pau_Latn (🇵🇼), toj_Latn (🇲🇽), pcm_Latn (🇳🇬), dyu_Latn (🇧🇫), kss_Latn (🇳🇬), afb_Arab (🇸🇦), urh_Latn (🇳🇬), quc_Latn (🇬🇹), new_Deva (🇳🇵), yao_Latn (🇲🇼), ngl_Latn (🇲🇿), nyu_Latn (🇲🇿), kab_Latn (🇩🇿), tuk_Cyrl (🇹🇲), xmf_Geor (🇬🇪), ndc_Latn (🇲🇿), san_Deva (🇮🇳), nba_Latn (🇳🇬), bpy_Beng (🇮🇳), ncx_Latn (🇲🇽), qug_Latn (🇵🇪), rmn_Latn (🇮🇳), cjk_Latn (🇬🇹), arb_Arab (🇸🇦), kea_Latn (🇨🇻), mck_Latn (🇨🇩), arn_Latn (🇨🇱), pdt_Latn (🇩🇪), her_Latn (🇳🇦), tlh_Latn (🇺🇸), suz_Deva (🇮🇳), kat_Geor (🇬🇪), kmr_Cyrl (🇷🇺), gcr_Latn (🇬🇵), jbo_Latn (🇺🇸), tbz_Latn (🇵🇼), bam_Latn (🇲🇱), prk_Latn (🇸🇮), jam_Latn (🇯🇲), twx_Latn (🇹🇼), sme_Latn (🇫🇮), gom_Latn (🇮🇳), bum_Latn (🇨🇲), mgr_Latn (🇲🇼), ahk_Latn (🇵🇰), kur_Arab (🇮🇶), bas_Latn (🇨🇲), bin_Latn (🇳🇬), tsz_Latn (🇲🇽), sid_Latn (🇪🇹), diq_Latn (🇹🇷), srd_Latn (🇮🇹), tcf_Latn (🇲🇽), bzj_Latn (🇮🇳), udm_Cyrl (🇷🇺), cce_Latn (🇨🇲), meu_Latn (🇨🇩), chw_Latn (🇨🇲), cbk_Latn (🇵🇭), ibg_Latn (🇮🇩), bhw_Latn (🇮🇩), ngu_Latn (🇲🇽), nyy_Latn (🇹🇿), szl_Latn (🇵🇱), ish_Latn (🇹🇿), naq_Latn (🇳🇦), toh_Latn (🇳🇿), ttj_Latn (🇰🇪), nse_Latn (🇳🇬), ami_Latn (🇹🇼), alz_Latn (🇸🇩), apc_Arab (🇸🇾), vls_Latn (🇳🇱), mhr_Cyrl (🇷🇺), djk_Latn (🇩🇪), prs_Arab (🇦🇫), san_Latn (🇮🇳), som_Arab (🇸🇴), uig_Latn (🇨🇳), hau_Arab (🇳🇬) | Github | 🔍 |
Few-shot Learning with Multilingual Generative Language Models | 2022 | English (🇺🇸), Russian (🇷🇺), Chinese (🇨🇳), German (🇩🇪), Spanish (🇪🇸), French (🇫🇷), Japanese (🇯🇵), Italian (🇮🇹), Portuguese (🇵🇹), Greek (🇬🇷), Romanian (🇷🇴), Ukrainian (🇺🇦), Hungarian (🇭🇺), Korean (🇰🇷), Polish (🇵🇱), Norwegian (🇳🇴), Dutch (🇳🇱), Finnish (🇫🇮), Danish (🇩🇰), Indonesian (🇮🇩), Croatian (🇭🇷), Turkish (🇹🇷), Arabic (🇸🇦), Vietnamese (🇻🇳), Thai (🇹🇭), Bulgarian (🇧🇬), Persian (🇮🇷), Swedish (🇸🇪), Malay (🇲🇾), Hebrew (🇮🇱), Czech (🇨🇿), Slovak (🇸🇰), Catalan (🇪🇸), Lithuanian (🇱🇹), Slovene (🇸🇮), Hindi (🇮🇳), Estonian (🇪🇪), Latvian (🇱🇻), Tagalog (🇵🇭), Albanian (🇦🇱), Serbian (🇷🇸), Azerbaijani (🇦🇿), Bengali (🇧🇩), Tamil (🇮🇳), Urdu (🇵🇰), Kazakh (🇰🇿), Armenian (🇦🇲), Georgian (🇬🇪), Icelandic (🇮🇸), Belarusian (🇧🇾), Bosnian (🇧🇦), Malayalam (🇮🇳), Macedonian (🇲🇰), Swahili (🇹🇿), Afrikaans (🇿🇦), Telugu (🇮🇳), Arabic Romanized (🇸🇦), Mongolian (🇲🇳), Latin (🇮🇹), Nepali (🇳🇵), Sinhalese (🇱🇰), Marathi (🇮🇳), Kannada (🇮🇳), Somali (🇸🇴), Welsh (🏴), Javanese (🇮🇩), Pashto (🇦🇫), Uzbek (🇺🇿), Gujarati (🇮🇳), Khmer (🇰🇭), Urdu Romanized (🇵🇰), Amharic (🇪🇹), Bengali Romanized (🇧🇩), Punjabi (🇮🇳), Galician (🇪🇸), Hausa (🇳🇬), Sanskrit (🇮🇳), Basque (🇪🇸), Burmese (🇲🇲), Sundanese (🇮🇩), Oriya (🇮🇳), Haitian (🇭🇹), Lao (🇱🇦), Kyrgyz (🇰🇬), Breton (🇫🇷), Irish (🇮🇪), Yoruba (🇳🇬), Esperanto (🌐), Tamil Romanized (🇮🇳), Zulu (🇿🇦), Tigrinya (🇪🇷), Telugu Romanized (🇮🇳), Kurdish (🇹🇷), Oromo (🇪🇹), Xhosa (🇿🇦), Scottish Gaelic (🇬🇧), Igbo (🇳🇬), Assamese (🇮🇳), Ganda (🇺🇬), Wolof (🇸🇳), Western Frisian (🇳🇱), Tswana (🇧🇼), Fula (🇸🇳), Guaraní (🇵🇾), Sindhi (🇵🇰), Lingala (🇨🇩), Bambara (🇲🇱), Inuktitut (🇨🇦), Kongo (🇨🇩), Quechua (🇵🇪), Swati (🇸🇿), Unassigned (🌐) | Github | 🔍 |
Introducing L2M3, A Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource Regions | 2024 | English (🇺🇸), Chinese (🇨🇳), Telugu (🇮🇳), Hindi (🇮🇳), Arabic (🇸🇦), Swahili (🇹🇿), Bengali (🇧🇩) | 🔍 | 🔍 |
Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning | 2022 | Afrikaans (🇿🇦), Amharic (🇪🇹), Hausa (🇳🇬), Igbo (🇳🇬), Malagasy (🇲🇬), Chichewa (🇲🇼), Oromo (🇪🇹), Naija (🇳🇬), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Shona (🇿🇼), Somali (🇸🇴), Sesotho (🇱🇸), Swahili (🇹🇿), isiXhosa (🇿🇦), Yoruba (🇳🇬), isiZulu (🇿🇦), English (🇬🇧), French (🇫🇷), Arabic (🇸🇦), Lingala (🇨🇩), Luganda (🇺🇬), Luo (🇰🇪), Wolof (🇸🇳) | GitHub | 🤗 |
MuRIL: Multilingual Representations for Indian Languages | 2021 | Assamese (🇮🇳), Bengali (🇧🇩), Gujarati (🇮🇳), Hindi (🇮🇳), Kannada (🇮🇳), Kashmiri (🇮🇳), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Oriya (🇮🇳), Punjabi (🇮🇳), Sanskrit (🇮🇳), Sindhi (🇵🇰), Tamil (🇮🇳), Telugu (🇮🇳), Urdu (🇮🇳), English (🇬🇧) | 🔍 | 🔍 |
From English to Foreign Languages: Transferring Pretrained Language Models | 2020 | French (🇫🇷), Russian (🇷🇺), Arabic (🇦🇪), Chinese (🇨🇳), Hindi (🇮🇳), Vietnamese (🇻🇳) | 🔍 | 🔍 |
- PaLI-X: On Scaling up a Multilingual Vision and Language Model (2023)
- PaLI: A Jointly-Scaled Multilingual Language-Image Model (2023)
- Learning to Scale Multilingual Representations for Vision-Language Tasks (2020)
- A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias (2024)
- Towards Building Multilingual Language Model for Medicine (2024)
- What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models (2024)
- All Languages Matter: On the Multilingual Safety of Large Language Models (2024)
- Multilingual Jailbreak Challenges in Large Language Models (2024)
- EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation (2024)
- Chat2VIS: Fine-Tuning Data Visualisations using Multilingual Natural Language Text and Pre-Trained Large Language Models (2024)
- How Linguistically Fair Are Multilingual Pre-Trained Language Models? (2021)
- IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages (2020)
- Are Multilingual Models the Best Choice for Moderately Under-Resourced Languages? A Comprehensive Assessment for Catalan (2021)
- You Reap What You Sow: On the Challenges of Bias Evaluation Under Multilingual Settings (2022)
- How to Adapt Your Pretrained Multilingual Model to 1600 Languages (2021)
- MEGA: Multilingual Evaluation of Generative AI / GitHub (2023)
- XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models (2023)