bsameera/Metis_NLP_Unsupervised
The goal of this project is to perform topic modeling on Bumble app reviews using K-Means and NMF (Non-Negative Matrix Factorization).

Bumble is a dating application. Profiles of potential matches are displayed to users, who can "swipe left" to reject a candidate or "swipe right" to indicate interest. In heterosexual matches, only female users can make the first contact with matched male users, while in same-sex matches either person can send a message first. The app is a product of Bumble Inc. Users can sign up using their phone number or Facebook profile, and can search for romantic matches or, in "BFF mode", friends. Bumble Bizz facilitates business communications. Bumble was founded by Whitney Wolfe Herd shortly after she left Tinder, a dating app she says she co-founded, due to growing tensions with other company executives. Wolfe Herd has described Bumble as a "feminist dating app". As of January 2021, with a monthly user base of 42 million, Bumble is the second-most popular dating app in the U.S. after Tinder. According to a June 2016 survey, 46.2% of its users are female. According to Forbes, by 2017 the company was valued at more than $1 billion, and the company reports having over 55 million users in 150 countries as of 2019. [Source: Wikipedia]
The data for this project was obtained from Kaggle: https://www.kaggle.com/datasets/shivkumarganesh/bumble-dating-app-google-play-store-review/code. The review text was transformed into a matrix that represents the weight of each word. This matrix was used to train a K-Means model with 5 clusters and an NMF (Non-Negative Matrix Factorization) model with 5 topics. The number of clusters was chosen using the elbow method. Topics were named according to the most heavily used words in each cluster, and the percentage of reviews falling in each cluster was calculated. Sentiment scores were calculated using VaderSentiment, which also identifies the best and worst reviews. Gaussian Naïve Bayes and Multinomial Naïve Bayes classifiers were trained as well.
The data consists of 110,031 entries and 10 columns. The features include the review text, ratings (1 to 5), date and time, etc. The feature of interest was "content", which holds user reviews in English and non-English languages (in both Latin and non-Latin scripts). After separating the English and non-English reviews, 89,472 English reviews remained. The "content" column was cleaned of null values, numbers and punctuation, and lemmatized using NLTK's WordNetLemmatizer. Custom stop words were also added to the stop-word set.
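The cleaning steps above might look roughly like the sketch below. It uses only the standard library so it runs without downloads; the `clean_review` function, the `CUSTOM_STOP_WORDS` set and the optional `lemmatize` hook are hypothetical names for illustration, and the project's real stop-word list is not given in this README.

```python
import re
import string

# Hypothetical custom stop words; the project's actual list is not documented here.
CUSTOM_STOP_WORDS = {"app", "bumble"}

# Map every punctuation character to a space so words stay separated.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text, stop_words=CUSTOM_STOP_WORDS, lemmatize=None):
    """Lowercase a review, drop numbers, punctuation and stop words.

    `lemmatize` is an optional callable applied to each remaining token,
    e.g. nltk.stem.WordNetLemmatizer().lemmatize (needs the WordNet corpus).
    """
    if not isinstance(text, str):  # null / NaN entries become empty strings
        return ""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = text.translate(_PUNCT_TABLE)   # remove punctuation
    tokens = [t for t in text.split() if t not in stop_words]
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]
    return " ".join(tokens)


print(clean_review("Bumble charged me 3 times!!!"))
print(clean_review(None))
```

With pandas, this could be applied as `df["content"].apply(clean_review)`; passing `lemmatize=WordNetLemmatizer().lemmatize` from `nltk.stem` would add the lemmatization step once the WordNet data has been downloaded.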
- K-Means – 5 clusters
- Non-Negative Matrix Factorization (NMF) – 5 topics
- Naïve Bayes
- Five different topics were discovered:
  - Bad Reviews For Paid Subscriptions
  - Profile Match
  - Good Reviews About The App
  - Good Reviews About People On The App
  - Easy To Use
- Naïve Bayes
  - Gaussian NB Score – 0.503
  - Multinomial NB Score – 0.855
  - Target is sentiment (Positive or Negative)
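The Naïve Bayes comparison can be sketched as below. This is an illustration under assumptions, not the project's code: the toy texts and labels are invented, TF-IDF features are assumed, and the README does not say how the Positive/Negative target was derived or which metric the reported scores refer to (here `score` is plain accuracy).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Invented examples; 1 = Positive, 0 = Negative.
texts = [
    "love it great matches", "easy to use and fun", "met wonderful people",
    "nice design works well", "great experience overall", "really good and simple",
    "waste of money", "subscription is a scam", "too many fake profiles",
    "keeps crashing terrible", "no matches at all", "charged twice awful support",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split first, then fit the vectorizer on the training texts only.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# MultinomialNB accepts the sparse non-negative matrix directly.
mnb = MultinomialNB().fit(X_train, y_train)
mnb_score = mnb.score(X_test, y_test)

# GaussianNB needs a dense array and models each feature as a Gaussian,
# which is generally a poor fit for sparse word weights.
gnb = GaussianNB().fit(X_train.toarray(), y_train)
gnb_score = gnb.score(X_test.toarray(), y_test)

print(f"Multinomial NB accuracy: {mnb_score:.3f}")
print(f"Gaussian NB accuracy:    {gnb_score:.3f}")
```

The Gaussian assumption is one plausible reason for the gap between the two reported scores, since word-weight features are mostly zeros rather than bell-shaped.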
Pandas – Clean, Explore and Feature Engineering
Scikit-Learn – Build different Classification models and perform cross validation, variable selection and regularization
Matplotlib / Seaborn – Visualizing data exploration, modeling and results
Python 3.8.5 – to run all of the above
NLTK – Natural Language Toolkit, for working with human language data