Indian Reddit channel r/India Flair Classification

devil-cyber/Reddit-Flair-Detection

The directory is a Flask web application set up for hosting on Pivotal servers. The files and folders are described below:

  1. app.py -- The file used to start the Flask server.
  2. requirements.txt -- Contains all Python dependencies of the project.
  3. Procfile -- Needed to set up Pivotal.
  4. templates -- Folder containing the HTML/CSS files.
  5. Models -- Folder containing the saved model.
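
To make the role of app.py concrete, here is a minimal sketch of a Flask entry point that accepts a post URL and returns a flair. It is not the repository's actual code: the route, the form field name `url`, the `predict_flair` helper and the flair names are all assumptions made for illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_flair(post_url: str) -> str:
    """Hypothetical stand-in for loading the saved model and classifying a post."""
    # Placeholder rule and placeholder flair names, used only for this sketch.
    return "Politics" if "politic" in post_url.lower() else "AskIndia"


@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        # The form field name "url" is an assumption.
        post_url = request.form.get("url", "")
        return jsonify({"url": post_url, "flair": predict_flair(post_url)})
    # A real application would render an HTML page from templates/ here.
    return "Submit a Reddit post URL to get its predicted flair."


# To serve locally, call app.run(); it is left out so the sketch can be imported safely.
```

The route can be exercised without starting a server through `app.test_client()`, for example `app.test_client().post("/", data={"url": "..."})`.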

Project Execution

  1. Open the Terminal.
  2. Clone the repository by entering git clone https://github.com/devil-cyber/Reddit-Flair-Detection
  3. Ensure that Python3 and pip are installed on the system.
  4. Create a virtualenv by executing the following command: virtualenv -p python3 env.
  5. Activate the env virtual environment by executing the following command: source env/bin/activate.
  6. Enter the cloned repository directory and execute pip install -r requirements.txt.
  7. Enter the python shell and import nltk. Execute nltk.download('stopwords') and exit the shell.
  8. Now, execute the following command: python app.py. The server will start on localhost and print the port it is listening on.
  9. Open the displayed address in a web browser and use the application.

Dependencies

The following dependencies can be found in requirements.txt:

  1. praw
  2. scikit-learn
  3. nltk
  4. Flask
  5. bs4
  6. pandas
  7. numpy

Approach

After reviewing the available literature on text processing and on machine learning algorithms suited to text classification, I based my approach on [2], which describes models such as Naive Bayes, Linear SVM and Logistic Regression for text classification, with code snippets. I also tried other models, such as Random Forest. The test accuracies obtained in the various scenarios are listed in the Results section.
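
As a rough illustration of comparing these four model families with scikit-learn, the sketch below trains each one on a tiny made-up corpus and reports its test accuracy. The texts, the two flair names and all hyperparameters are assumptions chosen for the example, not the project's data or settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus with placeholder flairs (not the project's data).
politics = [
    "parliament passes new budget bill",
    "election commission announces poll dates",
    "minister addresses parliament on budget",
    "opposition party protests election results",
    "government announces new policy bill",
    "chief minister wins state election",
    "parliament debates policy on elections",
    "party leader criticises government budget",
    "election results spark government debate",
    "new bill tabled by minister in parliament",
]
sports = [
    "cricket team wins test match",
    "captain scores century in cricket final",
    "football league match ends in draw",
    "hockey team qualifies for final",
    "bowler takes five wickets in match",
    "team announces squad for cricket tour",
    "striker scores goal in league match",
    "batsman hits century against rival team",
    "coach praises team after match win",
    "cricket board schedules new tournament match",
]
texts = politics + sports
labels = ["Politics"] * len(politics) + ["Sports"] * len(sports)

# 70% train / 30% test, stratified so both flairs appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, clf in classifiers.items():
    # Word counts -> TF-IDF weights -> classifier.
    model = Pipeline(
        [("vect", CountVectorizer()), ("tfidf", TfidfTransformer()), ("clf", clf)]
    )
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)  # mean accuracy on the test split

for name, acc in results.items():
    print(f"{name}: {acc:.4f}")
```

Wrapping the vectorizer and the classifier in one Pipeline keeps the vocabulary fitted on the training split only, so the test split does not leak into the features.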

The approach taken for the task is as follows:

  1. Collect 1800 data entries from the India subreddit for each of the 15 flairs using the praw module [1].
  2. The data includes title, comments, body, url, author, score, id, time-created and number of comments.
  3. For comments, only top-level comments are included in the dataset; sub-comments are not.
  4. The title, comments and body are cleaned by removing bad symbols and stopwords using nltk.
  5. Five types of features are considered for the given task:
     a) Title
     b) Comments
     c) Urls
     d) Body
     e) Title, Comments, Body and Urls combined as one feature.
  6. The dataset is split into 70% train and 30% test data using the train_test_split function of scikit-learn.
  7. The dataset is then converted into Vector and TF-IDF form.
  8. Then, the following ML algorithms (using scikit-learn libraries) are applied to the dataset:
     a) Naive Bayes
     b) Linear Support Vector Machine
     c) Logistic Regression
     d) Random Forest
  9. Training and testing on the dataset showed that the Linear Support Vector Machine achieved the best testing accuracy of 77.97% when trained on the combined Title + Comments + Body + Url feature.
  10. The best model is saved and used to predict the flair from the URL of a post.
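
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the repository's code: the stop-word set is a small stand-in for the nltk stopwords corpus, the posts are generated placeholders, TfidfVectorizer is used as a one-step shorthand for the Vector and TF-IDF conversion, and the model is pickled in memory rather than written to the Models folder.

```python
import pickle
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Small stand-in stop-word set; the project uses the nltk stopwords corpus.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "on", "to", "and", "for"}
BAD_SYMBOLS = re.compile(r"[^0-9a-z ]")


def clean_text(text: str) -> str:
    """Lowercase, replace bad symbols with spaces, and drop stop words."""
    text = BAD_SYMBOLS.sub(" ", text.lower())
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def combine(post: dict) -> str:
    """Join cleaned title, comments and body with the URL into one feature."""
    parts = [
        clean_text(post["title"]),
        clean_text(post["comments"]),
        clean_text(post["body"]),
        post["url"],
    ]
    return " ".join(parts)


def make_posts(flair: str, topic: str, count: int = 10) -> list:
    """Generate placeholder posts (hypothetical data, for illustration only)."""
    return [
        {
            "title": f"The {topic} update {i}!",
            "comments": f"What is the latest on {topic}?",
            "body": f"A short note on {topic}.",
            "url": f"https://example.com/{topic}/{i}",
            "flair": flair,
        }
        for i in range(count)
    ]


posts = make_posts("Politics", "election") + make_posts("Sports", "cricket")
X = [combine(post) for post in posts]
y = [post["flair"] for post in posts]

# 70% train / 30% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# TF-IDF features feeding a linear SVM.
model = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("test accuracy:", accuracy)

# Save the fitted model and load it back before predicting.
restored = pickle.loads(pickle.dumps(model))
new_post = make_posts("unknown", "cricket", count=1)[0]
print(restored.predict([combine(new_post)]))
```

Pickling the whole Pipeline, rather than the classifier alone, keeps the fitted vocabulary together with the model, so a new post can be classified from raw text.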

Results

Title as Feature

| Machine Learning Algorithm | Test Accuracy |
| --- | --- |
| Naive Bayes | 0.6792452830 |
| Linear SVM | 0.8113207547 |
| Logistic Regression | 0.8231132075 |
| Random Forest | 0.8042452830 |
| MLP | 0.8042452830 |

Body as Feature

| Machine Learning Algorithm | Test Accuracy |
| --- | --- |
| Naive Bayes | 0.5636792452 |
| Linear SVM | 0.8278301886 |
| Logistic Regression | 0.8066037735 |
| Random Forest | 0.8207547169 |
| MLP | 0.7971698113 |

URL as Feature

| Machine Learning Algorithm | Test Accuracy |
| --- | --- |
| Naive Bayes | 0.5754716981 |
| Linear SVM | 0.7523584905 |
| Logistic Regression | 0.7523584905 |
| Random Forest | 0.6886792452 |
| MLP | 0.7523584905 |

Comments as Feature

| Machine Learning Algorithm | Test Accuracy |
| --- | --- |
| Naive Bayes | 0.4622641509 |
| Linear SVM | 0.4056603773 |
| Logistic Regression | 0.4716981132 |
| Random Forest | 0.4646226415 |
| MLP | 0.4599056603 |

Title + Comments + URL + Body as Feature

| Machine Learning Algorithm | Test Accuracy |
| --- | --- |
| Naive Bayes | 0.5589622641 |
| Linear SVM | 0.8325471698 |
| Logistic Regression | 0.8254716981 |
| Random Forest | 0.8089622641 |
| MLP | 0.8372641509 |
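
For reference, test accuracy is the fraction of test posts whose predicted flair matches the true flair. The sketch below computes it by hand on placeholder labels. The test-set size is not stated here; the figure of 424 used at the end is only an assumption, inferred from the fact that the reported values are consistent with ratios over 424.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Placeholder labels (not project data): 3 of 4 predictions are correct.
y_true = ["Politics", "Sports", "Food", "Politics"]
y_pred = ["Politics", "Sports", "Politics", "Politics"]
print(accuracy(y_true, y_pred))  # 0.75

# Assumed test-set size of 424: 344 correct predictions would give the
# Linear SVM figure reported for the Title feature.
print(round(344 / 424, 10))  # 0.8113207547
```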

Intuition behind Combined Feature

Independently, the features showed test accuracies close to 82%, with the URL feature giving the worst accuracies during training.

