This repository was archived by the owner on May 19, 2022. It is now read-only.

A curated list of pretrained sentence and word embedding models


Separius/awesome-sentence-embedding


About This Repo

  • There are a few awesome-lists for word embeddings and sentence embeddings, but all of them are outdated and, more importantly, incomplete.
  • This repo will also be incomplete, but I'll try my best to find and include all the papers with pretrained models.
  • This is not a typical awesome list because it has tables, but I guess that's OK and much better than just a huge list.
  • If you find any mistakes, or find another paper or anything else, please send a pull request and help me keep this list up to date.
  • Enjoy!

General Framework

  • Almost all sentence embeddings work like this:
  • Given some sort of word embeddings and an optional encoder (for example, an LSTM), they obtain contextualized word embeddings.
  • Then they define some sort of pooling (it can be as simple as last pooling, i.e. keeping only the final token's vector).
  • Based on that, they either use the pooled vector directly for a supervised classification task (like InferSent) or generate a target sequence (like Skip-Thought).
  • So, in general, there are many sentence embeddings that you have never heard of: you can simply mean-pool over any word embedding and the result is a sentence embedding!
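
The pooling step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-word embedding table, the zero-vector fallback for unknown words, and the helper names (`embed_tokens`, `mean_pool`, `max_pool`, `last_pool`) are all made up for the example and do not come from any of the listed papers. No encoder is applied, so the "contextualized" vectors here are just the raw word vectors.

```python
from typing import Dict, List, Sequence

Vector = List[float]

# Hypothetical toy embedding table; a real one would be loaded from a
# pretrained model such as those listed in the tables of this README.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "the": [0.0, 1.0, 0.0],
    "cat": [1.0, 0.0, 2.0],
    "sat": [2.0, 2.0, 1.0],
}


def embed_tokens(tokens: Sequence[str], table: Dict[str, Vector], dim: int) -> List[Vector]:
    """Look up each token; unknown tokens get a zero vector (a simplistic assumption)."""
    zero = [0.0] * dim
    return [table.get(token, zero) for token in tokens]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average the word vectors dimension by dimension."""
    if not vectors:
        raise ValueError("cannot pool an empty sentence")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def max_pool(vectors: Sequence[Vector]) -> Vector:
    """Keep the largest value seen in each dimension."""
    if not vectors:
        raise ValueError("cannot pool an empty sentence")
    return [max(column) for column in zip(*vectors)]


def last_pool(vectors: Sequence[Vector]) -> Vector:
    """Keep only the final token's vector (useful after a recurrent encoder)."""
    if not vectors:
        raise ValueError("cannot pool an empty sentence")
    return list(vectors[-1])


tokens = "the cat sat".split()
word_vectors = embed_tokens(tokens, TOY_EMBEDDINGS, dim=3)
sentence_mean = mean_pool(word_vectors)
print(sentence_mean)
```

Swapping `mean_pool` for `max_pool` or `last_pool` changes only the pooling choice; the lookup step stays the same.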

Word Embeddings

  • Note: don't worry about the language of the training code. You can almost always (except for the subword models) just load the pretrained embedding table in the framework of your choice and ignore the training code.
| date | paper | citation count | training code | pretrained models |
|---|---|---|---|---|
| - | WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models | N/A | - | RusVectōrēs |
| 2013/01 | Efficient Estimation of Word Representations in Vector Space | 999+ | C | Word2Vec |
| 2014/12 | Word Representations via Gaussian Embedding | 221 | Cython | - |
| 2014/?? | A Probabilistic Model for Learning Multi-Prototype Word Embeddings | 127 | DMTK | - |
| 2014/?? | Dependency-Based Word Embeddings | 719 | C++ | word2vecf |
| 2014/?? | GloVe: Global Vectors for Word Representation | 999+ | C | GloVe |
| 2015/06 | Sparse Overcomplete Word Vector Representations | 129 | C++ | - |
| 2015/06 | From Paraphrase Database to Compositional Paraphrase Model and Back | 3 | Theano | PARAGRAM |
| 2015/06 | Non-distributional Word Vector Representations | 68 | Python | WordFeat |
| 2015/?? | Joint Learning of Character and Word Embeddings | 195 | C | - |
| 2015/?? | SensEmbed: Learning Sense Embeddings for Word and Relational Similarity | 249 | - | SensEmbed |
| 2015/?? | Topical Word Embeddings | 292 | Cython | |
| 2016/02 | Swivel: Improving Embeddings by Noticing What's Missing | 61 | TF | - |
| 2016/03 | Counter-fitting Word Vectors to Linguistic Constraints | 232 | Python | counter-fitting (broken) |
| 2016/05 | Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec | 91 | Chainer | - |
| 2016/06 | Siamese CBOW: Optimizing Word Embeddings for Sentence Representations | 166 | Theano | Siamese CBOW |
| 2016/06 | Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations | 58 | Go | lexvec |
| 2016/07 | Enriching Word Vectors with Subword Information | 999+ | C++ | fastText |
| 2016/08 | Morphological Priors for Probabilistic Neural Word Embeddings | 34 | Theano | - |
| 2016/11 | A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks | 359 | C++ | charNgram2vec |
| 2016/12 | ConceptNet 5.5: An Open Multilingual Graph of General Knowledge | 604 | Python | Numberbatch |
| 2016/?? | Learning Word Meta-Embeddings | 58 | - | Meta-Emb (broken) |
| 2017/02 | Offline bilingual word vectors, orthogonal transformations and the inverted softmax | 336 | Python | - |
| 2017/04 | Multimodal Word Distributions | 57 | TF | word2gm |
| 2017/05 | Poincaré Embeddings for Learning Hierarchical Representations | 413 | Pytorch | - |
| 2017/06 | Context encoders as a simple but powerful extension of word2vec | 13 | Python | - |
| 2017/06 | Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints | 99 | TF | Attract-Repel |
| 2017/08 | Learning Chinese Word Representations From Glyphs Of Characters | 44 | C | - |
| 2017/08 | Making Sense of Word Embeddings | 92 | Python | sensegram |
| 2017/09 | Hash Embeddings for Efficient Word Representations | 25 | Keras | - |
| 2017/10 | BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages | 91 | Gensim | BPEmb |
| 2017/11 | SPINE: SParse Interpretable Neural Embeddings | 48 | Pytorch | SPINE |
| 2017/?? | AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP | 161 | Gensim | AraVec |
| 2017/?? | Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics | 25 | C | - |
| 2017/?? | Dict2vec : Learning Word Embeddings using Lexical Dictionaries | 49 | C++ | Dict2vec |
| 2017/?? | Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components | 63 | C | - |
| 2018/04 | Representation Tradeoffs for Hyperbolic Embeddings | 120 | Pytorch | h-MDS |
| 2018/04 | Dynamic Meta-Embeddings for Improved Sentence Representations | 60 | Pytorch | DME/CDME |
| 2018/05 | Analogical Reasoning on Chinese Morphological and Semantic Relations | 128 | - | ChineseWordVectors |
| 2018/06 | Probabilistic FastText for Multi-Sense Word Embeddings | 39 | C++ | Probabilistic FastText |
| 2018/09 | Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks | 3 | TF | SynGCN |
| 2018/09 | FRAGE: Frequency-Agnostic Word Representation | 64 | Pytorch | - |
| 2018/12 | Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia | 17 | Cython | Wikipedia2Vec |
| 2018/?? | Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings | 106 | - | ChineseEmbedding |
| 2018/?? | cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information | 45 | C++ | - |
| 2019/02 | VCWE: Visual Character-Enhanced Word Embeddings | 5 | Pytorch | VCWE |
| 2019/05 | Learning Cross-lingual Embeddings from Twitter via Distant Supervision | 2 | Text | - |
| 2019/08 | An Unsupervised Character-Aware Neural Approach to Word and Context Representation Learning | 5 | TF | - |
| 2019/08 | ViCo: Word Embeddings from Visual Co-occurrences | 7 | Pytorch | ViCo |
| 2019/11 | Spherical Text Embedding | 25 | C | - |
| 2019/?? | Unsupervised word embeddings capture latent knowledge from materials science literature | 150 | Gensim | - |
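
As a rough illustration of using a pretrained embedding table without touching the training code, here is a minimal sketch of a loader for the plain-text layout that many of these releases ship (one word per line followed by its space-separated values, with an optional `<vocab_size> <dim>` header as in word2vec text files). The function name `load_text_embeddings` and the sample data are hypothetical, and the exact file format is an assumption that should be checked against each model's documentation. Subword models such as fastText need their own n-gram logic to build vectors for unseen words, which this sketch does not cover.

```python
import io
from typing import Dict, List, TextIO


def load_text_embeddings(handle: TextIO) -> Dict[str, List[float]]:
    """Parse 'word v1 v2 ...' lines into a {word: vector} dictionary."""
    table: Dict[str, List[float]] = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        # Assumption: a first line made of exactly two integers is a
        # word2vec-style "<vocab_size> <dim>" header, so it is skipped.
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table


# Tiny in-memory stand-in for a real embedding file (made-up values).
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
table = load_text_embeddings(sample)
print(table["king"])
```

With a real file, the same function could be called as `load_text_embeddings(open(path, encoding="utf-8"))`, where `path` points at whichever text-format vectors were downloaded.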

OOV Handling

Contextualized Word Embeddings

  • Note: all the unofficial implementations can load the official pretrained models.
| date | paper | citation count | code | pretrained models |
|---|---|---|---|---|
| - | Language Models are Unsupervised Multitask Learners | N/A | TF; Pytorch, TF2.0; Keras | GPT-2(117M,124M,345M,355M,774M,1558M) |
| 2017/08 | Learned in Translation: Contextualized Word Vectors | 524 | Pytorch; Keras | CoVe |
| 2018/01 | Universal Language Model Fine-tuning for Text Classification | 167 | Pytorch | ULMFit(English,Zoo) |
| 2018/02 | Deep contextualized word representations | 999+ | Pytorch; TF | ELMO(AllenNLP,TF-Hub) |
| 2018/04 | Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling | 26 | Pytorch | LD-Net |
| 2018/07 | Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation | 120 | Pytorch | ELMo |
| 2018/08 | Direct Output Connection for a High-Rank Language Model | 24 | Pytorch | DOC |
| 2018/10 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | 999+ | TF; Keras; Pytorch, TF2.0; MXNet; PaddlePaddle; TF; Keras | BERT(BERT,ERNIE,KoBERT) |
| 2018/?? | Contextual String Embeddings for Sequence Labeling | 486 | Pytorch | Flair |
| 2018/?? | Improving Language Understanding by Generative Pre-Training | 999+ | TF; Keras; Pytorch, TF2.0 | GPT |
| 2019/01 | Multi-Task Deep Neural Networks for Natural Language Understanding | 364 | Pytorch | MT-DNN |
| 2019/01 | BioBERT: pre-trained biomedical language representation model for biomedical text mining | 634 | TF | BioBERT |
| 2019/01 | Cross-lingual Language Model Pretraining | 639 | Pytorch; Pytorch, TF2.0 | XLM |
| 2019/01 | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 754 | TF; Pytorch; Pytorch, TF2.0 | Transformer-XL |
| 2019/02 | Efficient Contextual Representation Learning Without Softmax Layer | 2 | Pytorch | - |
| 2019/03 | SciBERT: Pretrained Contextualized Embeddings for Scientific Text | 124 | Pytorch, TF | SciBERT |
| 2019/04 | Publicly Available Clinical BERT Embeddings | 229 | Text | clinicalBERT |
| 2019/04 | ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission | 84 | Pytorch | ClinicalBERT |
| 2019/05 | ERNIE: Enhanced Language Representation with Informative Entities | 210 | Pytorch | ERNIE |
| 2019/05 | Unified Language Model Pre-training for Natural Language Understanding and Generation | 278 | Pytorch | UniLMv1(unilm1-large-cased,unilm1-base-cased) |
| 2019/05 | HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization | 81 | - | |
| 2019/06 | Pre-Training with Whole Word Masking for Chinese BERT | 98 | Pytorch, TF | BERT-wwm |
| 2019/06 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | 999+ | TF; Pytorch, TF2.0 | XLNet |
| 2019/07 | ERNIE 2.0: A Continual Pre-training Framework for Language Understanding | 107 | PaddlePaddle | ERNIE 2.0 |
| 2019/07 | SpanBERT: Improving Pre-training by Representing and Predicting Spans | 282 | Pytorch | SpanBERT |
| 2019/07 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | 999+ | Pytorch; Pytorch, TF2.0 | RoBERTa |
| 2019/09 | Subword ELMo | 1 | Pytorch | - |
| 2019/09 | Knowledge Enhanced Contextual Word Representations | 115 | - | |
| 2019/09 | TinyBERT: Distilling BERT for Natural Language Understanding | 129 | - | |
| 2019/09 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | 136 | Pytorch | Megatron-LM(BERT-345M,GPT-2-345M) |
| 2019/09 | MultiFiT: Efficient Multi-lingual Language Model Fine-tuning | 29 | Pytorch | - |
| 2019/09 | Extreme Language Model Compression with Optimal Subwords and Shared Projections | 32 | - | |
| 2019/09 | MULE: Multimodal Universal Language Embedding | 5 | - | |
| 2019/09 | Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks | 51 | - | |
| 2019/09 | K-BERT: Enabling Language Representation with Knowledge Graph | 59 | - | |
| 2019/09 | UNITER: Learning UNiversal Image-TExt Representations | 60 | - | |
| 2019/09 | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | 803 | TF | - |
| 2019/10 | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | 349 | Pytorch | BART(bart.base,bart.large,bart.large.mnli,bart.large.cnn,bart.large.xsum) |
| 2019/10 | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | 481 | Pytorch, TF2.0 | DistilBERT |
| 2019/10 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | 696 | TF | T5 |
| 2019/11 | CamemBERT: a Tasty French Language Model | 102 | - | CamemBERT |
| 2019/11 | ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations | 15 | Pytorch | - |
| 2019/11 | Unsupervised Cross-lingual Representation Learning at Scale | 319 | Pytorch | XLM-R (XLM-RoBERTa)(xlmr.large,xlmr.base) |
| 2020/01 | ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training | 35 | Pytorch | ProphetNet(ProphetNet-large-16GB,ProphetNet-large-160GB) |
| 2020/02 | CodeBERT: A Pre-Trained Model for Programming and Natural Languages | 25 | Pytorch | CodeBERT |
| 2020/02 | UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training | 33 | Pytorch | - |
| 2020/03 | ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators | 203 | TF | ELECTRA(ELECTRA-Small,ELECTRA-Base,ELECTRA-Large) |
| 2020/04 | MPNet: Masked and Permuted Pre-training for Language Understanding | 5 | Pytorch | MPNet |
| 2020/05 | ParsBERT: Transformer-based Model for Persian Language Understanding | 1 | Pytorch | ParsBERT |
| 2020/05 | Language Models are Few-Shot Learners | 382 | - | - |
| 2020/07 | InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training | 12 | Pytorch | - |

Pooling Methods

Encoders

| date | paper | citation count | code | model_name |
|---|---|---|---|---|
| - | Incremental Domain Adaptation for Neural Machine Translation in Low-Resource Settings | N/A | Python | AraSIF |
| 2014/05 | Distributed Representations of Sentences and Documents | 999+ | Pytorch; Python | Doc2Vec |
| 2014/11 | Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models | 849 | Theano; Pytorch | VSE |
| 2015/06 | Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books | 795 | Theano; TF; Pytorch, Torch | SkipThought |
| 2015/11 | Order-Embeddings of Images and Language | 354 | Theano | order-embedding |
| 2015/11 | Towards Universal Paraphrastic Sentence Embeddings | 411 | Theano | ParagramPhrase |
| 2015/?? | From Word Embeddings to Document Distances | 999+ | C, Python | Word Mover's Distance |
| 2016/02 | Learning Distributed Representations of Sentences from Unlabelled Data | 363 | Python | FastSent |
| 2016/07 | Charagram: Embedding Words and Sentences via Character n-grams | 144 | Theano | Charagram |
| 2016/11 | Learning Generic Sentence Representations Using Convolutional Neural Networks | 76 | Theano | ConvSent |
| 2017/03 | Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features | 319 | C++ | Sent2Vec |
| 2017/04 | Learning to Generate Reviews and Discovering Sentiment | 293 | TF; Pytorch; Pytorch | Sentiment Neuron |
| 2017/05 | Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings | 60 | Theano | GRAN |
| 2017/05 | Supervised Learning of Universal Sentence Representations from Natural Language Inference Data | 999+ | Pytorch | InferSent |
| 2017/07 | VSE++: Improving Visual-Semantic Embeddings with Hard Negatives | 132 | Pytorch | VSE++ |
| 2017/08 | Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm | 357 | Keras; Pytorch | DeepMoji |
| 2017/09 | StarSpace: Embed All The Things! | 129 | C++ | StarSpace |
| 2017/10 | DisSent: Learning Sentence Representations from Explicit Discourse Relations | 47 | Pytorch | DisSent |
| 2017/11 | Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations | 128 | Theano | para-nmt |
| 2017/11 | Dual-Path Convolutional Image-Text Embedding with Instance Loss | 44 | Matlab | Image-Text-Embedding |
| 2018/03 | An efficient framework for learning sentence representations | 183 | TF | Quick-Thought |
| 2018/03 | Universal Sentence Encoder | 564 | TF-Hub | USE |
| 2018/04 | End-Task Oriented Textual Entailment via Deep Explorations of Inter-Sentence Interactions | 14 | Theano | DEISTE |
| 2018/04 | Learning general purpose distributed sentence representations via large scale multi-task learning | 198 | Pytorch | GenSen |
| 2018/06 | Embedding Text in Hyperbolic Spaces | 50 | TF | HyperText |
| 2018/07 | Representation Learning with Contrastive Predictive Coding | 736 | Keras | CPC |
| 2018/08 | Context Mover’s Distance & Barycenters: Optimal transport of contexts for building representations | 8 | Python | CMD |
| 2018/09 | Learning Universal Sentence Representations with Mean-Max Attention Autoencoder | 14 | TF | Mean-MaxAAE |
| 2018/10 | Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model | 35 | TF-Hub | USE-xling |
| 2018/10 | Improving Sentence Representations with Consensus Maximisation | 4 | - | Multi-view |
| 2018/10 | BioSentVec: creating sentence embeddings for biomedical texts | 70 | Python | BioSentVec |
| 2018/11 | Word Mover's Embedding: From Word2Vec to Document Embedding | 47 | C, Python | WordMoversEmbeddings |
| 2018/11 | A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks | 76 | Pytorch | HMTL |
| 2018/12 | Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond | 238 | Pytorch | LASER |
| 2018/?? | Convolutional Neural Network for Universal Sentence Embeddings | 6 | Theano | CSE |
| 2019/01 | No Training Required: Exploring Random Encoders for Sentence Classification | 54 | Pytorch | randsent |
| 2019/02 | CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model | 4 | Pytorch | CMOW |
| 2019/07 | GLOSS: Generative Latent Optimization of Sentence Representations | 1 | - | GLOSS |
| 2019/07 | Multilingual Universal Sentence Encoder | 52 | TF-Hub | MultilingualUSE |
| 2019/08 | Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks | 261 | Pytorch | Sentence-BERT |
| 2020/02 | SBERT-WK: A Sentence Embedding Method By Dissecting BERT-based Word Models | 11 | Pytorch | SBERT-WK |
| 2020/06 | DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations | 4 | Pytorch | DeCLUTR |
| 2020/07 | Language-agnostic BERT Sentence Embedding | 5 | TF-Hub | LaBSE |
| 2020/11 | On the Sentence Embeddings from Pre-trained Language Models | 0 | TF | BERT-flow |

Evaluation

Misc

Vector Mapping

Articles

