Indian Reddit channel r/India Flair Classification
devil-cyber/Reddit-Flair-Detection
The directory is a Flask web application set up for hosting on Pivotal servers. The files and folders are described below:
- app.py -- The file used to start the Flask server.
- requirements.txt -- Contains all Python dependencies of the project.
- Procfile -- Needed to set up Pivotal.
- templates -- Folder containing HTML/CSS files.
- Models -- Folder containing the saved model.
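As a rough illustration of how a file like `app.py` might tie these pieces together, the sketch below exposes a single endpoint that accepts a post URL and returns a flair. It is not the repository's actual code: the `/predict` route, the `url` form field and the `predict_flair` stub are hypothetical names, and reading the port from a `PORT` environment variable is an assumption about the hosting platform.

```python
import os

from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_flair(post_url: str) -> str:
    """Hypothetical stub standing in for the saved model in Models/."""
    return "Unknown"


@app.route("/predict", methods=["POST"])  # hypothetical route name
def predict():
    # Read the post URL from the submitted form data.
    post_url = request.form.get("url", "")
    if not post_url:
        return jsonify(error="missing 'url' form field"), 400
    return jsonify(url=post_url, flair=predict_flair(post_url))


if __name__ == "__main__":
    # Assumption: the host supplies the port through the PORT variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 5000)))
```

In a real deployment the stub would load the saved model once at start-up and call it inside `predict_flair`.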
- Open the `Terminal`.
- Clone the repository by entering `git clone https://github.com/devil-cyber/Reddit-Flair-Detection`.
- Ensure that `Python3` and `pip` are installed on the system.
- Create a `virtualenv` by executing the following command: `virtualenv -p python3 env`.
- Activate the `env` virtual environment by executing the following command: `source env/bin/activate`.
- Enter the cloned repository directory and execute `pip install -r requirements.txt`.
- Enter the `python` shell and `import nltk`. Execute `nltk.download('stopwords')` and exit the shell.
- Now, execute the following command: `python manage.py runserver`. It will point to `localhost` with the port.
- Open the `IP Address` in a web browser and use the application.
The following dependencies can be found in requirements.txt:
After going through the available literature on text processing and on machine learning algorithms suitable for text classification, I based my approach on [2], which describes machine learning models such as Naive Bayes, Linear SVM and Logistic Regression for text classification, with code snippets. In addition, I tried other models such as Random Forest. The test accuracies obtained in the various scenarios can be found in the next section.
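A minimal sketch of how such a comparison can be set up with scikit-learn is shown here. It is not the project's training script: the toy posts, the two flair names and the hyperparameters are invented for illustration, and the 70/30 split mirrors the ratio described in the approach.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy posts; the real dataset and its flairs are not reproduced here.
TEXTS = [
    "election results announced in the state assembly",
    "parliament passes the new finance bill",
    "minister addresses election rally in the capital",
    "opposition party questions the government bill",
    "state election commission announces poll dates",
    "government tables budget bill in parliament",
    "party leaders debate the election manifesto",
    "parliament session ends after bill debate",
    "cricket team wins the test match series",
    "batsman scores century in the final match",
    "football league announces new season fixtures",
    "bowler takes five wickets in cricket match",
    "team captain praises players after the win",
    "hockey team qualifies for the final match",
    "cricket board announces squad for the series",
    "players train ahead of the football final",
]
LABELS = ["Politics"] * 8 + ["Sports"] * 8  # placeholder flair names


def make_classifiers(seed=42):
    """Return fresh instances of the four classifiers named in the text."""
    return {
        "Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }


def compare_classifiers(texts, labels, test_size=0.3, seed=42):
    """Train each classifier on a 70/30 split and return its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    scores = {}
    for name, clf in make_classifiers(seed).items():
        # Word counts, then TF-IDF weighting, then the classifier.
        model = Pipeline([
            ("vect", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", clf),
        ])
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores


if __name__ == "__main__":
    for name, accuracy in compare_classifiers(TEXTS, LABELS).items():
        print(f"{name}: {accuracy:.4f}")
```

Wrapping the vectorizer and the classifier in one `Pipeline` keeps the vocabulary fitted on the training split only, so the test accuracy is not inflated by words seen only in test posts.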
The approach taken for the task is as follows:
- Collect 1800 India subreddit posts for each of the 15 flairs using the `praw` module [1].
- The data includes title, comments, body, url, author, score, id, time-created and number of comments.
- For comments, only top-level comments are considered in the dataset; no sub-comments are present.
- The title, comments and body are cleaned by removing bad symbols and stopwords using `nltk`.
- Five types of features are considered for the given task:
  a) Title
  b) Comments
  c) Urls
  d) Body
  e) Combining Title, Comments, Body and Urls as one feature.
- The dataset is split into 70% train and 30% test data using `train_test_split` of `scikit-learn`.
- The dataset is then converted into a `Vector` and `TF-IDF` form.
- Then, the following ML algorithms (using `scikit-learn` libraries) are applied on the dataset:
  a) Naive Bayes
  b) Linear Support Vector Machine
  c) Logistic Regression
  d) Random Forest
- Training and testing on the dataset showed that the Linear Support Vector Machine gave the best testing accuracy of 77.97% when trained on the combination of Title + Comments + Body + Url features.
- The best model is saved and is used to predict the flair from the URL of the post.
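The cleaning and feature-combination steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the regular expressions are one common choice for "bad symbols", the short `STOPWORDS` set is a stand-in for the `nltk` English stopword list, and the dictionary keys are hypothetical field names.

```python
import re

# Assumption: a tiny stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "for", "this"}

# Assumption: which characters count as "bad symbols" is a guess.
REPLACE_BY_SPACE = re.compile(r"[/(){}\[\]|@,;]")
BAD_SYMBOLS = re.compile(r"[^0-9a-z #+_]")


def clean_text(text):
    """Lowercase, strip unwanted symbols and drop stopwords."""
    text = text.lower()
    text = REPLACE_BY_SPACE.sub(" ", text)
    text = BAD_SYMBOLS.sub("", text)
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def combine_features(post):
    """Join cleaned title, comments and body with the raw URL into one string."""
    parts = [clean_text(post.get(key, "")) for key in ("title", "comments", "body")]
    parts.append(post.get("url", ""))  # the URL is left uncleaned
    return " ".join(part for part in parts if part)


if __name__ == "__main__":
    post = {
        "title": "The Budget!",
        "comments": "",
        "body": "Tax on fuel",
        "url": "https://example.com/a",
    }
    print(combine_features(post))  # budget tax fuel https://example.com/a
```

The combined string can then be passed to a vectorizer in the same way as any single feature.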
Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.6792452830 |
Linear SVM | 0.8113207547 |
Logistic Regression | 0.8231132075 |
Random Forest | 0.8042452830 |
MLP | 0.8042452830 |

Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.5636792452 |
Linear SVM | 0.8278301886 |
Logistic Regression | 0.8066037735 |
Random Forest | 0.8207547169 |
MLP | 0.7971698113 |

Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.5754716981 |
Linear SVM | 0.7523584905 |
Logistic Regression | 0.7523584905 |
Random Forest | 0.6886792452 |
MLP | 0.7523584905 |

Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.4622641509 |
Linear SVM | 0.4056603773 |
Logistic Regression | 0.4716981132 |
Random Forest | 0.4646226415 |
MLP | 0.4599056603 |

Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.5589622641 |
Linear SVM | 0.8325471698 |
Logistic Regression | 0.8254716981 |
Random Forest | 0.8089622641 |
MLP | 0.8372641509 |

The features independently showed a test accuracy near 82%, with the `URL` feature giving the worst accuracies during training.