A library for topic modeling based on the algorithm: Generative Text Compression with Agglomerative Clustering Summarization (GTCACS)
andrealenzi11/gen-text-compr-aggl-clust-sum
Use the package manager pip to install gtcacs:
pip3 install gtcacs
Tested Python version:
python3.8
Tested dependencies:
numpy==1.19.5
scikit-learn==0.24.1
scipy==1.6.1
tensorflow==2.4.1
tqdm==4.58.0
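To see how a local environment compares with the tested versions, a small check along these lines may help. It is only a sketch: the pins are copied from the list above, `compare_with_tested` is a hypothetical helper that is not part of gtcacs, and it relies on `importlib.metadata` from the standard library (available from Python 3.8).

```python
from importlib import metadata

# Pins copied from the "Tested dependencies" list above.
TESTED_PINS = {
    "numpy": "1.19.5",
    "scikit-learn": "0.24.1",
    "scipy": "1.6.1",
    "tensorflow": "2.4.1",
    "tqdm": "4.58.0",
}


def compare_with_tested(pins=TESTED_PINS):
    """Return {package: (installed_or_None, tested)} for every mismatch.

    Hypothetical helper for illustration; it is not provided by gtcacs.
    """
    mismatches = {}
    for name, tested in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != tested:
            mismatches[name] = (installed, tested)
    return mismatches


if __name__ == "__main__":
    for name, (installed, tested) in compare_with_tested().items():
        print(f"{name}: installed={installed}, tested={tested}")
```

A mismatch does not necessarily mean the library will fail; it only indicates that the combination differs from the one listed as tested.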
from sklearn.datasets import fetch_20newsgroups
from gtcacs.topic_modeling import GTCACS

# load dataset (download_if_missing=False expects a locally cached copy
# and raises an error if the data is not available)
corpus, labels = fetch_20newsgroups(subset='all', return_X_y=True, download_if_missing=False)

# set stop words
eng_stopwords = {
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
    "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he',
    'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's",
    'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what',
    'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is',
    'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having',
    'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
    'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about',
    'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
    'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
    'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',
    'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some',
    'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
    's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',
    'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn',
    "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
    "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't",
    'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn',
    "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn',
    "wouldn't",
}

# instantiate the GTCACS object
gtcacs_obj = GTCACS(
    num_topics=20,                 # number of topics
    max_num_words=50,              # maximum number of terms to consider
    max_df=0.95,                   # maximum document frequency
    min_df=15,                     # minimum document frequency
    stopwords=eng_stopwords,       # stopwords set
    ngram_range=(1, 2),            # n-gram range
    max_features=None,             # maximum number of terms to consider (max vocabulary size)
    lowercase=True,                # whether to convert text to lowercase
    num_epoches=5,                 # number of epochs
    batch_size=128,                # number of documents in a batch
    gen_learning_rate=0.005,       # learning rate for optimizing the generative part
    discr_learning_rate=0.005,     # learning rate for optimizing the discriminative part
    random_seed_size=100,          # dimension of generator input layer
    generator_hidden_dim=512,      # dimension of generator hidden layer
    document_dim=None,             # dimension of generator output layer and discriminator's input/output layer
    latent_space_dim=64,           # dimension of discriminator latent space
    discriminator_hidden_dim=256   # dimension of discriminator hidden layer
)

# computation on corpus (dimensionality reduction, clustering, summarization)
gtcacs_obj.extract_topics(corpus=corpus)

# get the extracted clusters of words
topics = gtcacs_obj.get_topics_words()
for i, topic in enumerate(topics):
    print(">>> TOPIC", i + 1, topic)

# get the topics distribution scores for each document
corpus_transf = gtcacs_obj.get_topics_distribution_scores()
print(corpus_transf)
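A common next step is to assign each document to its highest-scoring topic. The sketch below shows one way to do that, under an assumption this README does not confirm: that `get_topics_distribution_scores()` returns one row per document with one numeric score per topic. `dominant_topics` is a hypothetical helper, not part of gtcacs, and the scores used here are synthetic stand-ins rather than real library output.

```python
def dominant_topics(scores):
    """Return, for each document, the index of its highest-scoring topic.

    ASSUMPTION: `scores` is a sequence of rows, one row per document,
    each row holding one numeric score per topic. The actual return
    type of get_topics_distribution_scores() is not documented here.
    """
    return [max(range(len(row)), key=lambda k: row[k]) for row in scores]


# Synthetic stand-in for the library output: 3 documents, 4 topics.
example_scores = [
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]

print(dominant_topics(example_scores))  # [1, 0, 3]
```

If the assumption holds, passing `corpus_transf` in place of `example_scores` would yield one topic index per document, which can then be compared against `labels` from the dataset.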