This repository was archived by the owner on Aug 6, 2025. It is now read-only.

Language-Agnostic SEntence Representations


facebookresearch/LASER

LASER is a library to calculate and use multilingual sentence embeddings.

NEWS

  • 2023/11/30 Released P-xSIM, a dual approach extension to multilingual similarity search (xSIM)
  • 2023/11/16 Released laser_encoders, a pip-installable package supporting LASER-2 and LASER-3 models
  • 2023/06/26 xSIM++ evaluation pipeline and data released
  • 2022/07/06 Updated LASER models with support for over 200 languages are now available
  • 2022/07/06 Multilingual similarity search (xSIM) evaluation pipeline released
  • 2022/05/03 Librivox S2S is available: Speech-to-Speech translations automatically mined in Librivox [9]
  • 2019/11/08 CCMatrix is available: Mining billions of high-quality parallel sentences on the WEB [8]
  • 2019/07/31 Gilles Bodard and Jérémy Rapin provided a Docker environment to use LASER
  • 2019/07/11 WikiMatrix is available: bitext extraction for 1620 language pairs in WikiPedia [7]
  • 2019/03/18 Switched to the BSD license
  • 2019/02/13 The code to perform bitext mining is now available
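xSIM, mentioned above, measures multilingual similarity-search error: each source-language embedding is matched to its nearest target-language embedding, and a pair counts as an error when that nearest neighbour is not the aligned translation. A minimal sketch of that metric, using random NumPy vectors that merely stand in for real encoder output:

```python
import numpy as np

def xsim_error_rate(src: np.ndarray, tgt: np.ndarray) -> float:
    """Fraction of source rows whose nearest target row (by cosine
    similarity) is not the row with the same index."""
    # L2-normalise so a dot product equals cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)
    return float((nearest != np.arange(len(src))).mean())

rng = np.random.default_rng(0)
tgt = rng.normal(size=(100, 1024))          # placeholder target embeddings
src = tgt + 0.1 * rng.normal(size=tgt.shape)  # noisy "translations" of them
print(xsim_error_rate(src, tgt))  # 0.0: aligned pairs dominate random ones
```

With real LASER embeddings of a parallel corpus, the same computation gives the xSIM error rate for that language pair.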

CURRENT VERSION:

  • We now provide updated LASER models which support over 200 languages. Please see here for more details, including how to download the models and perform inference.

In our experience, the sentence encoder also supports code-switching, i.e. a single sentence can contain words in several different languages.

We also have some evidence that the encoder can generalize to other languages which were not seen during training, but which belong to a language family covered by the training languages.

A detailed description of how the multilingual sentence embeddings are trained can be found here, together with an experimental evaluation.

The core sentence embedding package: laser_encoders

We provide a package laser_encoders with minimal dependencies. It supports LASER-2 (a single encoder for the languages listed below) and LASER-3 (147 language-specific encoders described here).

The package can be installed simply with pip install laser_encoders and used as below:

    from laser_encoders import LaserEncoderPipeline

    encoder = LaserEncoderPipeline(lang="eng_Latn")
    embeddings = encoder.encode_sentences(["Hi!", "This is a sentence encoder."])
    print(embeddings.shape)  # (2, 1024)

The laser_encoders readme file provides more examples of its installation and usage.
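The embeddings returned above are plain NumPy arrays, so cross-lingual comparison reduces to vector arithmetic; cosine similarity is the usual choice. A small sketch (the vectors below are placeholders for `encoder.encode_sentences(...)` output, so this snippet runs without downloading any model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 1024-dimensional vectors standing in for LASER embeddings.
emb_en = np.ones(1024)
emb_fr = np.ones(1024)
print(cosine_similarity(emb_en, emb_fr))  # 1.0 for identical vectors
```

Because LASER maps all languages into one joint space, a sentence and its translation should score close to 1.0 under this measure, while unrelated sentences score much lower.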

The full LASER kit

Apart from laser_encoders, we provide support for LASER-1 (the original multilingual encoder) and for the various LASER applications listed below.

Dependencies

  • Python >= 3.7
  • PyTorch 1.0
  • NumPy, tested with 1.15.4
  • Cython, needed by Python wrapper of FastBPE, tested with 0.29.6
  • Faiss, for fast similarity search and bitext mining
  • transliterate 1.10.2 (pip install transliterate)
  • jieba 0.39, Chinese segmenter (pip install jieba)
  • mecab 0.996, Japanese segmenter
  • tokenization from the Moses encoder (installed automatically)
  • FastBPE, fast C++ implementation of byte-pair encoding (installed automatically)
  • Fairseq, sequence modeling toolkit (pip install fairseq==0.12.1)
  • tabulate, pretty-print tabular data (pip install tabulate)
  • pandas, data analysis toolkit (pip install pandas)
  • Sentencepiece, subword tokenization (installed automatically)

Installation

  • install the laser_encoders package, e.g. pip install -e . to install it in editable mode
  • set the environment variable LASER to the root of the installation, e.g. export LASER="${HOME}/projects/laser"
  • download the encoders from Amazon S3, e.g. bash ./nllb/download_models.sh
  • download third-party software with bash ./install_external_tools.sh
  • download the data used in the example tasks (see description for each task)

Applications

We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").

For all tasks, we use exactly the same multilingual encoder, without any task-specific optimization or fine-tuning.
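One of those applications, bitext mining, scores candidate sentence pairs with the margin criterion of [5]: the cosine similarity of a pair is divided by the average similarity to each side's k nearest neighbours, which penalises "hub" sentences that are close to everything. A rough NumPy sketch of that ratio-margin score on toy vectors (function and parameter names here are illustrative, not the toolkit's API):

```python
import numpy as np

def margin_scores(x: np.ndarray, y: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin score matrix between row-embeddings x and y:
    score(i, j) = cos(x_i, y_j) / mean cosine of the k nearest
    neighbours of x_i in y and of y_j in x, averaged."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    sim = x @ y.T  # all pairwise cosine similarities
    # Average similarity of each row (resp. column) to its k nearest neighbours.
    knn_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    knn_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return sim / ((knn_x + knn_y) / 2.0)

rng = np.random.default_rng(0)
y = rng.normal(size=(20, 64))               # placeholder target embeddings
x = y + 0.05 * rng.normal(size=y.shape)     # noisy "translations"
scores = margin_scores(x, y)
# Aligned pairs should be the top-scoring candidates in each row.
print((scores.argmax(axis=1) == np.arange(20)).mean())
```

In practice the k-NN search over millions of sentences is done with Faiss rather than a dense similarity matrix, and pairs are kept only above a margin threshold.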

License

LASER is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.

Supported languages

The original LASER model was trained on the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.

Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.

LASER3

Updated LASER models, referred to as LASER3, supplement the above list with support for 147 languages. The full list of supported languages can be seen here.

References

[1] Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL workshop on Representation Learning for NLP, 2017.

[2] Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.

[3] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, XNLI: Cross-lingual Sentence Understanding through Inference, EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 11 2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin, CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB.

[9] Paul-Ambroise Duquenne, Hongyu Gong and Holger Schwenk, Multimodal and Multilingual Embeddings for Large-Scale Speech Mining, NeurIPS 2021, pages 15748-15761.

[10] Kevin Heffernan, Onur Celebi and Holger Schwenk, Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages.
