bnosac/doc2vec
Distributed Representations of Sentences and Documents
This repository contains an R package for building Paragraph Vector models, also known as doc2vec models. You can train the distributed memory ('PV-DM') and the distributed bag of words ('PV-DBOW') models. Next to that, the package also allows you to build a top2vec model, which clusters documents based on these embeddings.
- doc2vec is based on the paper Distributed Representations of Sentences and Documents by Mikolov et al., while top2vec is based on the paper Distributed Representations of Topics by Angelov
- The doc2vec part is an Rcpp wrapper around https://github.com/hiyijian/doc2vec
- The package allows you to
  - train paragraph embeddings (also known as document embeddings) on character data or on data in a text file
  - use the embeddings to find similar documents, paragraphs, sentences or words
  - cluster document embeddings using top2vec
- Note: for getting word vectors in R, look at the word2vec package (https://github.com/bnosac/word2vec); for Starspace embeddings, look at the ruimtehol package (https://github.com/bnosac/ruimtehol)
- For regular users, install the package from your local CRAN mirror:

```r
install.packages("doc2vec")
```

- For installing the development version of this package:

```r
remotes::install_github("bnosac/doc2vec")
```

Look at the documentation of the functions:

```r
help(package = "doc2vec")
```
- Take some data and standardise it a bit.
- Make sure it has columns doc_id and text
- Make sure that each text has less than 1000 words (words are considered to be separated by a single space)
- Make sure that each text does not contain newline symbols
```r
library(doc2vec)
library(tokenizers.bpe)
library(udpipe)
data(belgium_parliament, package = "tokenizers.bpe")
## Keep the Dutch texts and give each document an identifier
x <- subset(belgium_parliament, language %in% "dutch")
x <- data.frame(doc_id = sprintf("doc_%s", 1:nrow(x)),
                text   = x$text,
                stringsAsFactors = FALSE)
## Standardise: lowercase, keep only alphabetical characters, collapse whitespace
x$text   <- tolower(x$text)
x$text   <- gsub("[^[:alpha:]]", " ", x$text)
x$text   <- gsub("[[:space:]]+", " ", x$text)
x$text   <- trimws(x$text)
## Keep only non-empty documents with less than 1000 words
x$nwords <- txt_count(x$text, pattern = " ")
x        <- subset(x, nwords < 1000 & nchar(text) > 0)
```
- Build the model
```r
## Low-dimensional model using PV-DM, low number of iterations, for speed and display purposes
model <- paragraph2vec(x = x, type = "PV-DM", dim = 5, iter = 3, min_count = 5, lr = 0.05, threads = 1)
str(model)
```
```
## List of 3
##  $ model  :<externalptr>
##  $ data   :List of 4
##   ..$ file        : chr "C:\\Users\\Jan\\AppData\\Local\\Temp\\Rtmpk9Npjg\\textspace_1c446bffa0e.txt"
##   ..$ n           : num 170469
##   ..$ n_vocabulary: num 3867
##   ..$ n_docs      : num 1000
##  $ control:List of 9
##   ..$ min_count: int 5
##   ..$ dim      : int 5
##   ..$ window   : int 5
##   ..$ iter     : int 3
##   ..$ lr       : num 0.05
##   ..$ skipgram : logi FALSE
##   ..$ hs       : int 0
##   ..$ negative : int 5
##   ..$ sample   : num 0.001
##  - attr(*, "class")= chr "paragraph2vec_trained"
```
```r
## More realistic model
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20, min_count = 5, lr = 0.05, threads = 4)
```
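To reuse a trained model later on, it can be written to disk and read back in. A minimal sketch, assuming the package's `write.paragraph2vec` / `read.paragraph2vec` helpers (analogous to `write.word2vec` in the word2vec package); the file name is only an example:

```r
## Save the model to disk and load it back in
write.paragraph2vec(model, file = "paragraph2vec.bin")
model <- read.paragraph2vec(file = "paragraph2vec.bin")
```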
- Get the embedding of the documents or words and get the vocabulary
```r
embedding <- as.matrix(model, which = "words")
embedding <- as.matrix(model, which = "docs")
vocab     <- summary(model, which = "docs")
vocab     <- summary(model, which = "words")
```
- Get the embedding of specific documents / words or sentences.
```r
sentences <- list(
  sent1 = c("geld", "diabetes"),
  sent2 = c("frankrijk", "koning", "proximus"))
embedding <- predict(model, newdata = sentences, type = "embedding")
embedding <- predict(model, newdata = c("geld", "koning"), type = "embedding", which = "words")
embedding <- predict(model, newdata = c("doc_1", "doc_10", "doc_3"), type = "embedding", which = "docs")
ncol(embedding)
```
```
## [1] 100
```

```r
embedding[, 1:4]
```

```
##               [,1]        [,2]       [,3]        [,4]
## doc_1   0.05721277 -0.10298843  0.1089350 -0.03075439
## doc_10  0.09553983  0.05211980 -0.0513489 -0.11847925
## doc_3   0.08008177 -0.03324692  0.1563442  0.06585038
```
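Similarity between two such embeddings is typically measured with cosine similarity. A minimal base R sketch, using the `embedding` matrix of the documents retrieved above:

```r
## Cosine similarity between the embeddings of doc_1 and doc_10
a <- embedding["doc_1", ]
b <- embedding["doc_10", ]
sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
```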
- Get similar documents or words when providing sentences, documents or words
```r
nn <- predict(model, newdata = c("proximus", "koning"), type = "nearest", which = "word2word", top_n = 5)
nn
```
```
## [[1]]
##      term1              term2 similarity rank
## 1 proximus telefoontoestellen  0.5357178    1
## 2 proximus            belfius  0.5169221    2
## 3 proximus                ceo  0.4839031    3
## 4 proximus            klanten  0.4819543    4
## 5 proximus               taal  0.4590944    5
##
## [[2]]
##    term1          term2 similarity rank
## 1 koning     ministerie  0.5615162    1
## 2 koning verplaatsingen  0.5484987    2
## 3 koning        familie  0.4911003    3
## 4 koning       grondwet  0.4871097    4
## 5 koning       gedragen  0.4694150    5
```
```r
nn <- predict(model, newdata = c("proximus", "koning"), type = "nearest", which = "word2doc", top_n = 5)
nn
```
```
## [[1]]
##      term1   term2 similarity rank
## 1 proximus doc_105  0.6684639    1
## 2 proximus doc_863  0.5917463    2
## 3 proximus doc_186  0.5233522    3
## 4 proximus doc_620  0.4919243    4
## 5 proximus doc_862  0.4619178    5
##
## [[2]]
##    term1   term2 similarity rank
## 1 koning  doc_44  0.6686417    1
## 2 koning  doc_45  0.5616031    2
## 3 koning doc_583  0.5379452    3
## 4 koning doc_943  0.4855201    4
## 5 koning doc_797  0.4573555    5
```
```r
nn <- predict(model, newdata = c("doc_198", "doc_285"), type = "nearest", which = "doc2doc", top_n = 5)
nn
```
```
## [[1]]
##     term1   term2 similarity rank
## 1 doc_198 doc_343  0.5522854    1
## 2 doc_198 doc_899  0.4902798    2
## 3 doc_198 doc_983  0.4847047    3
## 4 doc_198 doc_642  0.4829021    4
## 5 doc_198 doc_336  0.4674844    5
##
## [[2]]
##     term1   term2 similarity rank
## 1 doc_285 doc_319  0.5318567    1
## 2 doc_285 doc_286  0.5100293    2
## 3 doc_285 doc_113  0.5056069    3
## 4 doc_285 doc_526  0.4840761    4
## 5 doc_285 doc_488  0.4805686    5
```
```r
sentences <- list(
  sent1 = c("geld", "frankrijk"),
  sent2 = c("proximus", "onderhandelen"))
nn <- predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 5)
nn
```
```
## $sent1
##   term1   term2 similarity rank
## 1 sent1 doc_742  0.4830917    1
## 2 sent1 doc_151  0.4340138    2
## 3 sent1 doc_825  0.4263285    3
## 4 sent1 doc_740  0.4059283    4
## 5 sent1 doc_776  0.4024554    5
##
## $sent2
##   term1   term2 similarity rank
## 1 sent2 doc_105  0.5497447    1
## 2 sent2 doc_863  0.5061581    2
## 3 sent2 doc_862  0.4973840    3
## 4 sent2 doc_620  0.4793786    4
## 5 sent2 doc_186  0.4755909    5
```
```r
## Use all words of each document as a sentence and find the most similar documents
sentences <- strsplit(setNames(x$text, x$doc_id), split = " ")
nn <- predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 5)
```
Top2vec clusters documents semantically and finds the most semantically relevant terms for each topic.
```r
library(doc2vec)
library(word2vec)
library(uwot)
library(dbscan)
data(be_parliament_2020, package = "doc2vec")
x      <- data.frame(doc_id = be_parliament_2020$doc_id,
                     text   = be_parliament_2020$text_nl,
                     stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)
x      <- subset(x, txt_count_words(text) < 1000)
## Train the paragraph embeddings, next cluster these with top2vec
d2v    <- paragraph2vec(x, type = "PV-DBOW", dim = 50, lr = 0.05, iter = 10,
                        window = 15, hs = TRUE, negative = 0,
                        sample = 0.00001, min_count = 5, threads = 1)
model  <- top2vec(d2v,
                  control.dbscan = list(minPts = 50),
                  control.umap   = list(n_neighbors = 15L, n_components = 3),
                  umap = tumap, trace = TRUE)
info   <- summary(model, top_n = 7)
info$topwords
```
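Under the hood, top2vec reduces the document embeddings with UMAP and clusters the reduced space with HDBSCAN, which is what the `control.umap` and `control.dbscan` arguments tune. A rough sketch of those two steps done by hand with the settings from above (the exact internals of `top2vec` may differ slightly):

```r
## Reduce the 50-dimensional document embeddings to 3 dimensions, next cluster with HDBSCAN
emb      <- as.matrix(d2v, which = "docs")
reduced  <- uwot::tumap(emb, n_neighbors = 15L, n_components = 3)
clusters <- dbscan::hdbscan(reduced, minPts = 50)
table(clusters$cluster)  ## cluster 0 collects the outlier documents
```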
The package has some hard limits, namely:
- Each document should contain less than 1000 words
- Each word has a maximum length of 100 letters
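You can check your data against these limits with plain base R before training; a small sketch (these lines are an illustration, not part of the package API):

```r
## Drop documents which violate the hard limits
tokens <- strsplit(x$text, split = " ")
ok     <- sapply(tokens, length) < 1000 &
          sapply(tokens, function(words) all(nchar(words) <= 100))
x      <- x[ok, ]
```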
Need support in text mining? Contact BNOSAC: http://www.bnosac.be