abhilash1910/ClusterTransformer

Topic clustering library built on Transformer embeddings and cosine similarity metrics. Compatible with all BERT base transformers from HuggingFace.

ClusterTransformer

A Topic Clustering Library made with Transformer Embeddings 🤖

This is a topic clustering library built on transformer embeddings and the cosine similarity between them. Topics are clustered either by k-means or agglomeratively, depending on the use case, and the embeddings are obtained by running the input through any of the Transformers available in HuggingFace. The library can be found here.
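To make the role of cosine similarity concrete, here is a minimal standard-library sketch. The helper names `l2_normalize` and `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the library; real transformer embeddings have hundreds of dimensions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    # For unit vectors the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a_unit, b_unit))

# Toy 3-dimensional "embeddings" (hypothetical values).
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]  # same direction as emb_a
emb_c = [0.0, 0.0, 5.0]  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(round(sim_ab, 6), round(sim_ac, 6))
```

Vectors that point in the same direction score close to 1 regardless of their length, which is why cosine similarity is a common choice for comparing sentence embeddings.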

Dependencies

Pytorch

Transformers

Usability

Installation is carried out using the pip command as follows:

    pip install ClusterTransformer==0.1

For using inside the Jupyter Notebook or Python IDE:

    import ClusterTransformer.ClusterTransformer as ct

The 'ClusterTransformer_test.py' file contains an example of using the Library in this context.

Usability Overview

The steps to operate this library are as follows:

Initialise the class: ClusterTransformer()
Provide the input list of sentences: In this case, the Quora similar questions dataframe has been used for experimental purposes.
Declare hyperparameters:

batch_size: Batch size for running model inference.
max_seq_length: Maximum sequence length for the transformer; longer inputs are truncated.
convert_to_numpy: If enabled, returns the embeddings as a numpy array; otherwise keeps them as a torch.Tensor.
normalize_embeddings: If set to True, enables normalization of the embeddings.
neighborhood_min_size: Used by the neighborhood_detection method; determines the minimum number of entries in each cluster.
cutoff_threshold: Used by the neighborhood_detection method; determines the cutoff cosine similarity score for clustering the embeddings.
kmeans_max_iter: Hyperparameter for the kmeans_detection method; the number of iterations for convergence.
kmeans_random_state: Hyperparameter for the kmeans_detection method; the random initial state.
kmeans_no_cluster: Hyperparameter for the kmeans_detection method; the number of clusters.
model_name: Transformer model name; any transformer from the HuggingFace pretrained library.
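As a rough illustration of what batch_size and max_seq_length control, the sketch below splits a list of sentences into batches and truncates a token sequence. The helpers `iter_batches` and `truncate` are hypothetical, and whitespace splitting stands in for a real tokenizer; this is not how the library itself is implemented.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def truncate(tokens, max_seq_length):
    """Keep only the first `max_seq_length` tokens of a sequence."""
    return tokens[:max_seq_length]

# Seven placeholder sentences split into batches of three.
sentences = [f"question {i}" for i in range(7)]
batches = list(iter_batches(sentences, batch_size=3))
batch_sizes = [len(batch) for batch in batches]

# Whitespace tokens are a stand-in for real subword tokens.
tokens = "how do transformers embed a sentence".split()
truncated = truncate(tokens, max_seq_length=4)

print(batch_sizes, truncated)
```

A larger batch size means fewer forward passes but more memory per pass, while a smaller maximum sequence length speeds up inference at the cost of discarding the tail of long sentences.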

Call the methods:

ClusterTransformer.model_inference: Creates the embeddings by running inference through any Transformer model (BERT, Albert, Roberta, Distilbert etc.). Returns a torch.Tensor containing the embeddings.
ClusterTransformer.neighborhood_detection: Performs agglomerative clustering on the embeddings created by the model_inference method. Returns a dictionary.
ClusterTransformer.kmeans_detection: Performs K-means clustering on the embeddings created by the model_inference method. Returns a dictionary.
ClusterTransformer.convert_to_df: Converts the dictionary from the neighborhood_detection/kmeans_detection methods into a dataframe.
ClusterTransformer.plot_cluster: Used for simple plotting of the clusters for each text topic.
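The following is one possible way a cutoff_threshold and a neighborhood_min_size could interact in threshold-based grouping. It is a simplified greedy sketch under stated assumptions, not the library's actual neighborhood_detection code: the function `detect_neighborhoods`, the returned dictionary shape (cluster id mapped to a list of texts), and the toy 2-dimensional embeddings are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def detect_neighborhoods(texts, embeddings, cutoff_threshold, neighborhood_min_size):
    """Greedy grouping: every text seeds a candidate group of all texts whose
    similarity to it meets the cutoff; larger groups claim their members first."""
    n = len(texts)
    candidates = []
    for i in range(n):
        members = [j for j in range(n)
                   if cosine(embeddings[i], embeddings[j]) >= cutoff_threshold]
        if len(members) >= neighborhood_min_size:
            candidates.append(members)
    candidates.sort(key=len, reverse=True)

    assigned = set()
    clusters = {}
    for members in candidates:
        free = [j for j in members if j not in assigned]
        # Keep a group only if enough unassigned members remain.
        if len(free) >= neighborhood_min_size:
            clusters[len(clusters)] = [texts[j] for j in free]
            assigned.update(free)
    return clusters

texts = ["a1", "a2", "a3", "b1", "b2", "lone"]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1],
              [0.0, 1.0], [0.05, 0.99], [-1.0, -1.0]]
clusters = detect_neighborhoods(texts, embeddings,
                                cutoff_threshold=0.95, neighborhood_min_size=2)
print(clusters)
```

Raising cutoff_threshold yields tighter but fewer clusters, and texts that never reach neighborhood_min_size neighbours (like "lone" here) are left unclustered.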

Code Sample

The code below shows all the steps required to create the clusters. The 'compute_topics' method has the following steps:

Instantiate the object of the ClusterTransformer
Specify the transformer name from pretrained transformers
Specify the hyperparameters
Get the embeddings from 'model_inference' method
For agglomerative neighborhood detection use 'neighborhood_detection' method
For kmeans detection, use the 'kmeans_detection' method
For converting the dictionary to a dataframe use the 'convert_to_df' method
For optional plotting of the clusters w.r.t corpus samples, use the 'plot_cluster' method

    %%time
    import ClusterTransformer.ClusterTransformer as cluster_transformer

    def compute_topics(transformer_name):
        # Instantiate the object
        ct = cluster_transformer.ClusterTransformer()
        # Transformer model for inference
        model_name = transformer_name

        # Hyperparameters for model inference
        batch_size = 500
        max_seq_length = 64
        convert_to_numpy = False
        normalize_embeddings = False

        # Hyperparameters for agglomerative clustering
        neighborhood_min_size = 3
        cutoff_threshold = 0.95

        # Hyperparameters for K-means clustering
        kmeans_max_iter = 100
        kmeans_random_state = 42
        kmeans_no_clusters = 8

        # Sub input data list (merged_set is the full list of input sentences, defined beforehand)
        sub_merged_sent = merged_set[:200]

        # Transformer embeddings
        embeddings = ct.model_inference(sub_merged_sent, batch_size, model_name, max_seq_length, normalize_embeddings, convert_to_numpy)

        # Hierarchical agglomerative detection
        output_dict = ct.neighborhood_detection(sub_merged_sent, embeddings, cutoff_threshold, neighborhood_min_size)

        # K-means detection
        output_kmeans_dict = ct.kmeans_detection(sub_merged_sent, embeddings, kmeans_no_clusters, kmeans_max_iter, kmeans_random_state)

        # Agglomerative clustering results as a dataframe
        neighborhood_detection_df = ct.convert_to_df(output_dict)

        # K-means clustering results as a dataframe
        kmeans_df = ct.convert_to_df(output_kmeans_dict)

        return neighborhood_detection_df, kmeans_df

Calling the driver code:

    %%time
    import matplotlib.pyplot as plt
    import numpy as np

    n_df, k_df = compute_topics('bert-large-uncased')

    # Count the texts assigned to each cluster
    kg_df = k_df.groupby('Cluster').agg({'Text': 'count'}).reset_index()
    ng_df = n_df.groupby('Cluster').agg({'Text': 'count'}).reset_index()

    # Plotting
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    rng = np.random.RandomState(0)
    # Random marker sizes, used only for visual variety
    s = 1000 * rng.rand(len(kg_df['Text']))
    s1 = 1000 * rng.rand(len(ng_df['Text']))

    ax1.scatter(kg_df['Cluster'], kg_df['Text'], s=s, c=kg_df['Cluster'], alpha=0.3)
    ax1.set_title('Kmeans clustering')
    ax1.set_xlabel('No of clusters')
    ax1.set_ylabel('No of topics')

    ax2.scatter(ng_df['Cluster'], ng_df['Text'], s=s1, c=ng_df['Cluster'], alpha=0.3)
    ax2.set_title('Agglomerative clustering')
    ax2.set_xlabel('No of clusters')
    ax2.set_ylabel('No of topics')

    plt.show()
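The groupby step in the driver code counts how many texts landed in each cluster. The standard-library sketch below shows the same idea on a small hand-written dictionary. The dictionary shape (cluster id mapped to a list of texts) and the example sentences are assumptions made for illustration; the library's real output format may differ.

```python
from collections import Counter

# Hypothetical clustering output: cluster id -> texts in that cluster.
example_clusters = {
    0: ["how do I learn python", "best way to learn python", "python tutorials"],
    1: ["what is machine learning", "machine learning basics"],
}

# Flatten into (Cluster, Text) rows, the two columns used by the driver code.
rows = [(cluster_id, text)
        for cluster_id, cluster_texts in example_clusters.items()
        for text in cluster_texts]

# Equivalent in spirit to df.groupby('Cluster').agg({'Text': 'count'}).
cluster_counts = Counter(cluster_id for cluster_id, _ in rows)

for cluster_id in sorted(cluster_counts):
    print(cluster_id, cluster_counts[cluster_id])
```

These per-cluster counts are what the scatter plots place on the vertical axis.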

Samples

Colab-Demo

Kaggle Notebook

Quantum Stat Repository

Images

Cluster Images (Created with Facebook BART)

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT
