Word vectors - doc2vec - text clustering

Lampros Mouselimis

2023-12-04

This vignette discusses the new functionality added in the textTinyR package (version 1.1.0). I'll explain some of the functions by using the data and pre-processing steps of this blog-post.
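
As a convenience (not part of the original blog-post), the R packages used throughout this vignette can be set up as follows; textTinyR, reticulate and ClusterR are available on CRAN, whereas the fastTextR package used further down came from a repository that has since been archived (see the UPDATE note below),

# minimal setup sketch (CRAN packages only)
install.packages(c("textTinyR", "reticulate", "ClusterR"))

# the Python 'nltk' module has to be available to reticulate, e.g.
# reticulate::py_install("nltk")

# the 'fastTextR' package used below comes from a now archived repository; the newer
# 'fastText' R package is the recommended replacement (see the UPDATE note further down)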


The following code chunks assume that the nltk-corpus is already downloaded and the reticulate package is installed,


NLTK = reticulate::import("nltk.corpus")
text_reuters = NLTK$reuters
nltk = reticulate::import("nltk")

# if the 'reuters' data is not already available then it can be downloaded from within R
nltk$download('reuters')


documents = text_reuters$fileids()
str(documents)

# List of categories
categories = text_reuters$categories()
str(categories)

# Documents in a category
category_docs = text_reuters$fileids("acq")
str(category_docs)

one_doc = text_reuters$raw("test/14843")
one_doc


The collection originally consisted of 21,578 documents, but a subset and split is traditionally used. The most common split is Mod-Apte, which only considers categories that have at least one document in the training set and the test set. The Mod-Apte split has 90 categories with a training set of 7769 documents and a test set of 3019 documents.


documents = text_reuters$fileids()

# document ids for train - test
train_docs_id = documents[as.vector(sapply(documents, function(i) substr(i, 1, 5) == "train"))]
test_docs_id = documents[as.vector(sapply(documents, function(i) substr(i, 1, 4) == "test"))]

train_docs = lapply(1:length(train_docs_id), function(x) text_reuters$raw(train_docs_id[x]))
test_docs = lapply(1:length(test_docs_id), function(x) text_reuters$raw(test_docs_id[x]))

str(train_docs)
str(test_docs)

# train - test labels  [ some categories might have more than one label (overlapping) ]
train_labels = as.vector(sapply(train_docs_id, function(x) text_reuters$categories(x)))
test_labels = as.vector(sapply(test_docs_id, function(x) text_reuters$categories(x)))
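
Because a document can belong to more than one category, the label objects above are lists. As an optional check (not included in the original post), the distribution of the 90 categories can be inspected by tabulating the unlisted training labels,

# optional inspection of the (possibly overlapping) category labels
lbl_freq = sort(table(unlist(train_labels)), decreasing = TRUE)

head(lbl_freq, 10)      # a few categories (such as 'earn' and 'acq') dominate the corpus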


textTinyR - fastTextR - doc2vec - kmeans - cluster_medoids


First, I’ll perform the following pre-processing steps:


concat = c(unlist(train_docs), unlist(test_docs))
length(concat)

clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T,
                                                   to_lower = T,
                                                   remove_punctuation_vector = F,
                                                   remove_numbers = F, trim_token = T,
                                                   split_string = T,
                                                   split_separator = "\r\n\t.,;:()?!//",
                                                   remove_stopwords = T, language = "english",
                                                   min_num_char = 3, max_num_char = 100,
                                                   stemmer = "porter2_stemmer",
                                                   threads = 4, verbose = T)

unq = unique(unlist(clust_vec$token, recursive = F))
length(unq)

# I'll also build the term matrix, as I'll need the global-term-weights
utl = textTinyR::sparse_term_matrix$new(vector_data = concat, file_data = NULL,
                                        document_term_matrix = TRUE)

tm = utl$Term_Matrix(sort_terms = FALSE, to_lower = T,
                     remove_punctuation_vector = F,
                     remove_numbers = F, trim_token = T,
                     split_string = T, stemmer = "porter2_stemmer",
                     split_separator = "\r\n\t.,;:()?!//",
                     remove_stopwords = T, language = "english",
                     min_num_char = 3, max_num_char = 100,
                     print_every_rows = 100000, normalize = NULL,
                     tf_idf = F, threads = 6, verbose = T)

gl_term_w = utl$global_term_weights()
str(gl_term_w)


For simplicity, I’ll use the Reuters data as input to the fastTextR::skipgram_cbow function. The data has to be pre-processed first and then saved to a file,


save_dat = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T,
                                                  to_lower = T,
                                                  remove_punctuation_vector = F,
                                                  remove_numbers = F, trim_token = T,
                                                  split_string = T,
                                                  split_separator = "\r\n\t.,;:()?!//",
                                                  remove_stopwords = T, language = "english",
                                                  min_num_char = 3, max_num_char = 100,
                                                  stemmer = "porter2_stemmer",
                                                  path_2folder = "/path_to_your_folder/",
                                                  threads = 1,    # whenever I save data to file I set the number of threads to 1
                                                  verbose = T)


UPDATE 11-04-2019: There is an updated version of the fastText R package which includes all the features of the ported fasttext library. Therefore, the old fastTextR repository is archived. See also the corresponding blog-post.


Then, I’ll load the previously saved data and I’ll use fastTextR to build the word-vectors,


PATH_INPUT = "/path_to_your_folder/output_token_single_file.txt"
PATH_OUT = "/path_to_your_folder/rt_fst_model"

vecs = fastTextR::skipgram_cbow(input_path = PATH_INPUT, output_path = PATH_OUT,
                                method = "skipgram", lr = 0.075, lrUpdateRate = 100,
                                dim = 300, ws = 5, epoch = 5, minCount = 1, neg = 5,
                                wordNgrams = 2, loss = "ns", bucket = 2e+06,
                                minn = 0, maxn = 0, thread = 6, t = 1e-04, verbose = 2)
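
Once training is finished, the word-vectors are written to the output path as a '.vec' text file. Assuming the file follows the standard fastText text format (a header line holding the number of words and the embedding dimension, followed by one word and its vector per line), it can be quickly sanity-checked before moving on; this is only an optional sketch,

# optional sanity check of the saved word-vectors
# ('.vec' is assumed to be in the standard fastText text format)
first_line = readLines("/path_to_your_folder/rt_fst_model.vec", n = 1)

first_line      # expected to be of the form "<number_of_words> <dimension>", here the dimension should be 300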


Before using one of the three methods, it would be better to reduce the initial dimensions of the word-vectors (the rows of the matrix). So, I’ll keep only the word-vectors whose terms appear in the Reuters data set - clust_vec$token (although it’s not applicable in this case, if the resulting word-vectors were based on external data - say, the Wikipedia data - then their dimensions would be far larger and many of the terms would be redundant for the Reuters data set, which would increase the computation time considerably when invoking one of the doc2vec methods),


init = textTinyR::Doc2Vec$new(token_list = clust_vec$token,
                              word_vector_FILE = "path_to_your_folder/rt_fst_model.vec",
                              print_every_rows = 5000,
                              verbose = TRUE,
                              copy_data = FALSE)        # use of external pointer

pre-processing of input data starts ...
File is successfully opened
total.number.lines.processed.input: 25000
creation of index starts ...
intersection of tokens and wordvec character strings starts ...
modification of indices starts ...
final processing of data starts ...
File is successfully opened
total.number.lines.processed.output: 25000


In case copy_data = TRUE, the pre-processed data can be observed before invoking one of the ‘doc2vec’ methods,

# res_wv = init$pre_processed_wv()
#
# str(res_wv)


Then, I can use one of the three methods (sum_sqrt, min_max_norm, idf) to obtain the transformed vectors. These methods are based on the following blog-posts, see especially:


doc2_sum = init$doc2vec_methods(method = "sum_sqrt", threads = 6)
doc2_norm = init$doc2vec_methods(method = "min_max_norm", threads = 6)
doc2_idf = init$doc2vec_methods(method = "idf", global_term_weights = gl_term_w, threads = 6)

rows_cols = 1:5

doc2_sum[rows_cols, rows_cols]
doc2_norm[rows_cols, rows_cols]
doc2_idf[rows_cols, rows_cols]

> dim(doc2_sum)
[1] 10788   300
> dim(doc2_norm)
[1] 10788   300
> dim(doc2_idf)
[1] 10788   300
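
To give an intuition of what such a transformation does, the following is a rough sketch of a sum_sqrt-style aggregation for a single document: the word-vectors of the document's tokens are summed column-wise and the sum is scaled by its L2 norm. This is only an illustration of the idea, not the exact textTinyR implementation, which may differ in details,

# rough illustration of a 'sum_sqrt'-style aggregation for a single document
# ('wv' is a matrix holding one word-vector per row, one row per token of the document)
sum_sqrt_doc = function(wv) {

  col_sums = colSums(wv)                        # column-wise sum of the token vectors

  col_sums / sqrt(sum(col_sums^2))              # scale by the L2 norm of the summed vector
}

# toy example: 7 tokens, 300-dimensional word-vectors
set.seed(1)
toy_doc = matrix(rnorm(7 * 300), nrow = 7)

str(sum_sqrt_doc(toy_doc))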


For illustration, I’ll use the resulting word-vectors of the sum_sqrt method. The approach described can be used as an alternative to Latent semantic indexing (LSI) or topic-modeling in order to discover categories in text data (documents).


First, someone can search for the optimal number of clusters using the Optimal_Clusters_KMeans function of the ClusterR package,


scal_dat = ClusterR::center_scale(doc2_sum)        # center and scale the data

opt_cl = ClusterR::Optimal_Clusters_KMeans(scal_dat, max_clusters = 15,
                                           criterion = "distortion_fK",
                                           fK_threshold = 0.85,
                                           num_init = 3, max_iters = 50,
                                           initializer = "kmeans++", tol = 1e-04,
                                           plot_clusters = TRUE, verbose = T,
                                           tol_optimal_init = 0.3, seed = 1)


Based on the output of the Optimal_Clusters_KMeans function, I’ll pick 5 as the optimal number of clusters in order to perform k-means clustering,


num_clust = 5

km = ClusterR::KMeans_rcpp(scal_dat, clusters = num_clust, num_init = 3, max_iters = 50,
                           initializer = "kmeans++", fuzzy = T, verbose = F,
                           CENTROIDS = NULL, tol = 1e-04, tol_optimal_init = 0.3, seed = 2)

table(km$clusters)

   1    2    3    4    5 
 713 2439 2393 2607 2636 


As a follow-up, someone can also perform cluster-medoids clustering using the pearson-correlation metric, which resembles the cosine distance (the latter is frequently used for text clustering),


kmed = ClusterR::Cluster_Medoids(scal_dat, clusters = num_clust,
                                 distance_metric = "pearson_correlation",
                                 minkowski_p = 1, threads = 6, swap_phase = TRUE,
                                 fuzzy = FALSE, verbose = F, seed = 1)

table(kmed$clusters)

   1    2    3    4    5 
2396 2293 2680  875 2544 
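
The reason the pearson-correlation metric resembles the cosine distance is that the Pearson correlation of two vectors equals the cosine similarity of the same vectors after mean-centering, so for (approximately) centered data the two distances behave very similarly. A small numeric check (not part of the original analysis),

# Pearson correlation equals the cosine similarity of the mean-centered vectors
cos_sim = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

set.seed(3)
x = rnorm(300)
y = rnorm(300)

cor(x, y)                                       # Pearson correlation
cos_sim(x - mean(x), y - mean(y))               # identical value after centering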


Finally, the word-frequencies of the documents can be obtained using the cluster_frequency function, which groups the tokens (words) of the documents based on the cluster in which each document appears,


freq_clust = textTinyR::cluster_frequency(tokenized_list_text = clust_vec$token,
                                          cluster_vector = km$clusters, verbose = T)

Time difference of 0.1762383 secs


> freq_clust

$`3`
          WORDS COUNTS
   1:       mln   8701
   2:       000   6741
   3:       cts   6260
   4:       net   5949
   5:      loss   4628
  ---                 
6417:     vira>      1
6418:     gain>      1
6419:      pwj>      1
6420:  drummond      1
6421:  parisian      1

$`1`
          WORDS COUNTS
   1:       cts   1303
   2:    record    696
   3:     april    669
   4:       &lt    652
   5:  dividend    554
  ---                 
1833:      hvt>      1
1834:     bang>      1
1835:    replac      1
1836:     stbk>      1
1837:      bic>      1

$`4`
           WORDS COUNTS
    1:       mln   6137
    2:       pct   5084
    3:      dlrs   4024
    4:      year   3397
    5:   billion   3390
   ---                 
10968:     heijn      1
10969:   "behind      1
10970:      myo>      1
10971:    "favor      1
10972:   wonder>      1

$`5`
                   WORDS COUNTS
    1:               &lt   4244
    2:             share   3748
    3:              dlrs   3274
    4:           compani   3184
    5:               mln   2659
   ---                         
13059:         often-fat      1
13060:  computerknowledg      1
13061:        fibrinolyt      1
13062:            hercul      1
13063:            ceroni      1

$`2`
              WORDS COUNTS
    1:        trade   3077
    2:         bank   2578
    3:       market   2535
    4:          pct   2416
    5:         rate   2308
   ---                    
13702:         "mfn      1
13703:          uk>      1
13704:     honolulu      1
13705:         arap      1
13706:  infinitesim      1


freq_clust_kmed = textTinyR::cluster_frequency(tokenized_list_text = clust_vec$token,
                                               cluster_vector = kmed$clusters, verbose = T)

Time difference of 0.1685851 secs
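
Since cluster_frequency returns one WORDS / COUNTS table per cluster (as shown above), the per-cluster results can easily be reduced to, say, the 10 most frequent terms of each cluster for a side-by-side comparison of the k-means and cluster-medoids partitions; a small optional sketch, assuming the tables are sorted by COUNTS in decreasing order (as they appear to be in the output above),

# keep only the 10 most frequent terms of each cluster
top10_km   = lapply(freq_clust, function(x) head(x, 10))
top10_kmed = lapply(freq_clust_kmed, function(x) head(x, 10))

top10_km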


This is one of the ways that the transformed word-vectors can be used, and it is based solely on tokens (words) and word frequencies. However, a more advanced approach would be to cluster documents based on word n-grams and take advantage of graphs, as explained here, in order to plot the nodes, edges and text.




