VETURISRIRAM/YELP_REVIEWS_SENTIMENT_ANALYSIS_FASTTEXT_AUTOTUNE
This project aims to classify the Kaggle Yelp reviews into three classes.
- Positive (If the stars are above 3).
- Neutral (If the stars are equal to 3).
- Negative (If the stars are below 3).
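
The labeling rule above can be sketched as a small function. This is a minimal illustration; the function name `stars_to_label` is hypothetical and not necessarily what the repository's scripts use.

```python
def stars_to_label(stars: float) -> str:
    """Map a Yelp star rating to one of the three sentiment classes."""
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return "neutral"  # exactly 3 stars
```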
FastText is a library for efficient learning of word representations and sentence classification. It is written in C++ and supports multiprocessing during training. FastText allows you to train supervised and unsupervised representations of words and sentences. These representations (embeddings) can be used for numerous applications: data compression, features for additional models, candidate selection, or initializers for transfer learning.
The data used in this project, from the initial Kaggle dataset to the intermediate FastText files, can be downloaded from here.
In this repository, I have kept the `./data/` directory empty. You can place the downloaded folder (extracted) in `./data/` and follow the instructions below.
Set up the project. I used the latest FastText from GitHub.
I wanted to explore the `AutoTune` feature of FastText, which enables automatic hyperparameter tuning. With `AutoTune`, the model is trained with the best hyperparameters found during the search. As I understand it, this is somewhat similar to Sklearn's `GridSearchCV` module.
The script below reads the data, creates the labels, and does some minor text preprocessing using `multiprocessing`.
python preprocess_data.py
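
As a rough idea of what parallel text cleaning can look like, here is a minimal sketch using the standard library's `multiprocessing.Pool`. The cleaning steps (lowercasing, dropping punctuation, collapsing whitespace) and the names `clean_text` and `clean_many` are assumptions for illustration; the actual steps in `preprocess_data.py` may differ.

```python
import re
from multiprocessing import Pool


def clean_text(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_many(texts, processes=2):
    """Clean a list of reviews in parallel worker processes."""
    with Pool(processes=processes) as pool:
        # chunksize reduces inter-process overhead on large inputs
        return pool.map(clean_text, texts, chunksize=1000)


if __name__ == "__main__":
    print(clean_many(["The food was GREAT!!!", "Not   bad."]))
```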
After the preprocessing is done, train, validation, and test files need to be created for the FastText model.
The format required by FastText looks like `__label__positive The restaurant was great.`
Notice the `__label__` prefix. It is how FastText understands that `positive` is the label for the text `The restaurant was great.`
The label and the text can be separated by either a space or a tab.
The file extension does not matter. It could be any of the TXT/TSV/CSV or other extensions which can hold textual data.
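
A minimal sketch of producing lines in this format follows. The helper names `to_fasttext_line` and `write_fasttext_file` are hypothetical; `create_files.py` may build its files differently.

```python
def to_fasttext_line(label: str, text: str) -> str:
    """Format one example as '__label__<label> <text>' on a single line."""
    # FastText reads one example per line, so embedded newlines are collapsed.
    single_line = " ".join(text.split())
    return f"__label__{label} {single_line}"


def write_fasttext_file(path: str, examples) -> None:
    """Write (label, text) pairs to a FastText input file."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in examples:
            handle.write(to_fasttext_line(label, text) + "\n")
```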
The below command creates input files for FastText as described above.
python create_files.py
Now that we have the input files, we can go ahead and start training our classifier.
Model Training and Testing are fairly simple in FastText.
I am using the `AutoTune` functionality to tune the hyperparameters of my model. It is enabled by passing the validation file to the `autotuneValidationFile` argument when you start the training.
The bin model is trained with the best hyperparameters and saved in the `./models/` directory.
python modelling.py
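
For orientation, a training call with autotuning might look like the sketch below. It uses the `fasttext` Python package's `train_supervised` with `autotuneValidationFile` and `autotuneDuration`; the wrapper name `train_with_autotune`, the file paths, and the time budget are assumptions, not necessarily what `modelling.py` does.

```python
def train_with_autotune(train_path, valid_path, model_path, duration_seconds=300):
    """Train a supervised FastText model, letting autotune search hyperparameters."""
    # Imported inside the function so this sketch can be loaded without the package.
    import fasttext

    model = fasttext.train_supervised(
        input=train_path,
        autotuneValidationFile=valid_path,   # enables the hyperparameter search
        autotuneDuration=duration_seconds,   # search time budget in seconds
    )
    model.save_model(model_path)             # writes the .bin model
    return model


# Hypothetical usage; the paths are placeholders:
# train_with_autotune("./data/train.txt", "./data/val.txt", "./models/model.bin")
```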
I chose recall as my evaluation metric in order to fit more reviews in the correct buckets or classes.
Evaluation results on the test set:
Recall@1: 0.863
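
To make the metric concrete, here is a small pure-Python sketch of recall at k: the number of true labels that appear among the top k predictions, divided by the total number of true labels. The function name and the toy data are illustrative only; the score reported above comes from the model's own evaluation, not from this code.

```python
def recall_at_k(true_labels, predicted_labels, k=1):
    """Fraction of true labels found among the top-k predictions.

    true_labels: list of lists of gold labels, one list per example.
    predicted_labels: list of ranked prediction lists, one per example.
    """
    hits = 0
    total = 0
    for truths, preds in zip(true_labels, predicted_labels):
        top_k = set(preds[:k])
        hits += sum(1 for label in truths if label in top_k)
        total += len(truths)
    return hits / total if total else 0.0


# Toy example: three of four reviews are predicted correctly.
gold = [["positive"], ["negative"], ["neutral"], ["positive"]]
pred = [["positive"], ["negative"], ["positive"], ["positive"]]
print(recall_at_k(gold, pred, k=1))  # 0.75
```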
You can also get predictions for your own test review by running the file below.
python test_model.py
I gave some test inputs to the model and got the predictions as follows.
print(model.predict("the food was really great"))
print(model.predict("the restaurant was horrible"))
print(model.predict("the salon was okay. Not bad!"))
Output/Predictions:
(('__label__positive',), array([0.99909163]))
(('__label__negative',), array([1.00000417]))
(('__label__neutral',), array([0.99479502]))
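
Each prediction is a pair of a labels tuple and a probabilities array. A small helper can turn it into a plain label and score, as sketched below. The name `parse_prediction` is hypothetical. The clamp to the range [0, 1] is an assumption on my part: the second score above is slightly over 1, which looks like a numerical artifact rather than a meaningful value.

```python
def parse_prediction(prediction, prefix="__label__"):
    """Convert a (labels, probabilities) pair into (label, probability)."""
    labels, probabilities = prediction
    label = labels[0]
    if label.startswith(prefix):
        label = label[len(prefix):]  # drop the FastText label marker
    # Assumption: treat scores marginally outside [0, 1] as rounding noise.
    probability = min(1.0, max(0.0, float(probabilities[0])))
    return label, probability
```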
Thanks to the authors of these articles!
- FastText: Under the Hood (Medium Article).
- Python for NLP: Working with Facebook FastText Library (StackAbuse Article).
- FastText Official Documentation
About
Sentiment Analysis of Kaggle Yelp Reviews using FastText.