devil-cyber/Reddit-Flair-DetectionPublic

Indian Reddit channel r/India Flair Classification

reditflair.herokuapp.com/

Folders and files

DataPreprocessing
static/css
templates
.gitignore
Logisticstitle.pickle
Procfile
README.md
app.py
manifest.yml
requirements.txt
runtime.txt

Directory Structure

The directory is a Flask web application set up for hosting on Pivotal servers. The files and folders are described below:

app.py -- The file used to start the Flask server.
requirements.txt -- Contains all Python dependencies of the project.
Procfile -- Needed to set up Pivotal.
templates -- Folder containing HTML/CSS files.
Models -- Folder containing the saved model.

Project Execution

Open the Terminal.
Clone the repository by entering git clone https://github.com/devil-cyber/Reddit-Flair-Detection
Ensure that Python3 and pip are installed on the system.
Create a virtualenv by executing the following command: virtualenv -p python3 env.
Activate the env virtual environment by executing the following command: source env/bin/activate.
Enter the cloned repository directory and execute pip install -r requirements.txt.
Enter the python shell and import nltk. Execute nltk.download('stopwords') and exit the shell.
Now, start the Flask server by executing the following command: python app.py. It will print the localhost address and port it is listening on.
Open that address in a web browser and use the application.

Dependencies

The dependencies are listed in requirements.txt.

Approach

After going through the literature available on text processing and on machine learning algorithms suitable for text classification, I based my approach on [2], which describes various machine learning models such as Naive-Bayes, Linear SVM and Logistic Regression for text classification, with code snippets. Along with these, I tried other models such as Random Forest. The test accuracies obtained in the various scenarios can be found in the Results section.

The approach taken for the task is as follows:

Collect 1800 India subreddit data for each of the 15 flairs using the praw module [1].
The data includes title, comments, body, url, author, score, id, time-created and number of comments.
For comments, only top-level comments are included in the dataset; no sub-comments are present.
The title, comments and body are cleaned by removing bad symbols and stopwords using nltk.
Five types of features are considered for the given task:

a) Title
b) Comments
c) Urls
d) Body
e) Combining Title, Comments, Body and Urls as one feature.

The dataset is split into 70% train and 30% test data using train_test_split of scikit-learn.
The dataset is then converted into a Vector and TF-IDF form.
Then, the following ML algorithms (using scikit-learn libraries) are applied on the dataset:

a) Naive-Bayes
b) Linear Support Vector Machine
c) Logistic Regression
d) Random Forest

Training and testing on the dataset showed that the Linear Support Vector Machine gave the best testing accuracy of 77.97% when trained on the combination of the Title + Comments + Body + Url features.
The best model is saved and is used for prediction of the flair from the URL of the post.
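
The steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the repository's actual code: the toy posts, the two flair labels, the small STOPWORDS set (standing in for the nltk stopword list) and the helper names clean_text, make_models and evaluate are all hypothetical and exist only for illustration.

```python
import pickle
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Assumption: a tiny stand-in for the nltk English stopword list.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "for", "on"}


def clean_text(text):
    """Lowercase, replace non-alphanumeric symbols with spaces, drop stopwords."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(word for word in text.split() if word not in STOPWORDS)


# Assumption: hypothetical toy titles; the real data is collected with praw.
POLITICS = [
    "Election results announced in the state",
    "Parliament passes the new bill",
    "Minister speaks on the budget vote",
    "Opposition party protests election rules",
    "Government announces policy for voters",
    "Chief minister wins the assembly election",
    "Parliament debates the budget bill",
    "Party leaders meet before the vote",
    "New policy announced by the government",
    "Voters queue for the state election",
]
SPORTS = [
    "India wins the cricket match",
    "Captain scores a century in the final",
    "Team announced for the cricket tour",
    "Hockey team wins the final match",
    "Bowler takes five wickets in the match",
    "Cricket board names the new captain",
    "Football league match ends in a draw",
    "Team scores late to win the final",
    "Batsman scores a century on tour",
    "Hockey captain leads team to the final",
]
texts = [clean_text(t) for t in POLITICS + SPORTS]
labels = ["Politics"] * len(POLITICS) + ["Sports"] * len(SPORTS)


def make_models():
    """The four classifiers named in the approach."""
    return {
        "Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    }


def evaluate(texts, labels):
    """Split 70/30, fit count-vector + TF-IDF pipelines, return models and test accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=42, stratify=labels
    )
    fitted, scores = {}, {}
    for name, clf in make_models().items():
        pipe = Pipeline(
            [("vect", CountVectorizer()), ("tfidf", TfidfTransformer()), ("clf", clf)]
        )
        pipe.fit(X_train, y_train)
        fitted[name] = pipe
        scores[name] = pipe.score(X_test, y_test)
    return fitted, scores


fitted, scores = evaluate(texts, labels)
best_name = max(scores, key=scores.get)

# Save the best pipeline and load it back, as would be done before serving predictions.
restored = pickle.loads(pickle.dumps(fitted[best_name]))
prediction = restored.predict([clean_text("Who won the cricket match today?")])[0]
print(scores, best_name, prediction)
```

With the setup described in Project Execution, the stopword set would instead come from nltk.corpus.stopwords.words('english'). Accuracies on a toy set this small say nothing about the figures reported in the Results section.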

Results

Title as Feature

Machine Learning Algorithm	Test Accuracy
Naive Bayes	0.6792452830
Linear SVM	0.8113207547
Logistic Regression	0.8231132075
Random Forest	0.8042452830
MLP	0.8042452830

Body as Feature

Machine Learning Algorithm	Test Accuracy
Naive Bayes	0.5636792452
Linear SVM	0.8278301886
Logistic Regression	0.8066037735
Random Forest	0.8207547169
MLP	0.7971698113

URL as Feature

Machine Learning Algorithm	Test Accuracy
Naive Bayes	0.5754716981
Linear SVM	0.7523584905
Logistic Regression	0.7523584905
Random Forest	0.6886792452
MLP	0.7523584905

Comments as Feature

Machine Learning Algorithm	Test Accuracy
Naive Bayes	0.4622641509
Linear SVM	0.4056603773
Logistic Regression	0.4716981132
Random Forest	0.4646226415
MLP	0.4599056603

Title + Comments + URL + Body as Feature

Machine Learning Algorithm	Test Accuracy
Naive Bayes	0.5589622641
Linear SVM	0.8325471698
Logistic Regression	0.8254716981
Random Forest	0.8089622641
MLP	0.8372641509

Intuition behind Combined Feature

Taken independently, the features showed test accuracies near 82%, with the URL feature giving the worst accuracies during training.
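
One way the combined feature could be built is sketched below. It is only an illustration: the function name combine_features and the dictionary layout of a post are assumptions, chosen to mirror the title, comments, body and url fields listed in the Approach section.

```python
def combine_features(post):
    """Join title, comments, body and URL into a single text feature.

    Assumption: `post` is a dict with optional keys "title", "comments"
    (a list of top-level comment strings), "body" and "url".
    """
    comments = " ".join(post.get("comments", []))
    parts = [post.get("title", ""), comments, post.get("body", ""), post.get("url", "")]
    # Skip empty fields so that missing bodies or comments add no stray spaces.
    return " ".join(part.strip() for part in parts if part and part.strip())


# Hypothetical post used only to show the shape of the output.
post = {
    "title": "Budget 2020 highlights",
    "comments": ["Good summary", "Thanks"],
    "body": "",
    "url": "https://example.com/budget",
}
combined = combine_features(post)
print(combined)
```

The resulting string can then be cleaned and vectorized in the same way as any single feature.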
