Design a Decision Tree classifier and a Logistic Regression classifier and compare their performance in sentiment analysis on Twitter tweet data.


tadiusfrank2001/Sentiment_Analyst_Machine_Learning_Project

Overview

The goal of this project was to perform sentiment analysis on a Twitter dataset using two machine learning models, Decision Tree and Logistic Regression, and to evaluate which one classified sentiment more accurately.

Sentiment Classification

Sentiment classification is a type of text classification problem in which the focus is not the topic of the text but the sentiment or opinion lexicon that indicates whether an opinion is positive, negative, or neutral. To reduce complexity and improve classification accuracy, I filtered out neutral lexicons, or “stopwords,” using Python’s nltk library, as they are less informative. This approach allows us to treat sentiment classification as a binary classification problem, ignoring the neutral class and classifying text as either positive or negative.
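The stopword filtering described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `load_stopwords` and `remove_stopwords` are hypothetical, and the small fallback set is an assumption used only when NLTK or its stopword corpus is not installed.

```python
def load_stopwords():
    """Return NLTK's English stopword list, or a tiny fallback set.

    The fallback set is an illustrative assumption, not part of NLTK.
    """
    try:
        from nltk.corpus import stopwords
        # Raises LookupError if the corpus has not been downloaded
        # (it can be fetched once with nltk.download("stopwords")).
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        return {"a", "an", "the", "is", "are", "i", "this", "that", "to", "of", "and"}


def remove_stopwords(text, stop_words):
    """Lowercase the text, split on whitespace, and drop stopword tokens."""
    tokens = text.lower().split()
    return " ".join(token for token in tokens if token not in stop_words)


if __name__ == "__main__":
    print(remove_stopwords("The game is great", load_stopwords()))
```

Passing the stopword set as an argument keeps the filter easy to test and makes it simple to compare results with and without stopword removal.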

After vectorizing the text and counting the frequency of sentiment lexicons, I fitted a Logistic Regression model on the preprocessed data with a 70-30 train-test split. I then evaluated whether removing neutral lexicons impacts the results and implemented a Decision Tree classifier to compare performance.
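A rough sketch of that comparison with scikit-learn is shown below. It assumes word counts from `CountVectorizer` as features and accuracy as the metric; the function name `compare_models`, the toy tweets, and the hyperparameters are made up for illustration and are not taken from the repository.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data: 1 = positive, 0 = negative.
TEXTS = [
    "love this game", "great update today", "awesome match tonight",
    "best stream ever", "really happy with it",
    "hate this patch", "terrible lag again", "worst launch ever",
    "so disappointed today", "awful servers tonight",
]
LABELS = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]


def compare_models(texts, labels, seed=0):
    """Fit both classifiers on a 70-30 split and return test accuracies."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=seed, stratify=labels
    )
    # Fit the vocabulary on the training portion only, then reuse it.
    vectorizer = CountVectorizer()
    x_train_vec = vectorizer.fit_transform(x_train)
    x_test_vec = vectorizer.transform(x_test)

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(x_train_vec, y_train)
        scores[name] = accuracy_score(y_test, model.predict(x_test_vec))
    return scores


if __name__ == "__main__":
    print(compare_models(TEXTS, LABELS))
```

Fitting the vectorizer on the training split alone is a design choice in this sketch: it keeps test-set vocabulary from leaking into the features. Running the same function on text with and without stopword removal gives a direct way to check whether filtering changes accuracy.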

Challenges and Adjustments

Initially, the project code was structured to handle only non-NLP datasets consisting of numerical data. To address this, I extended the code to handle text data by reading the CSV file, removing stop words, and filtering out characters such as emojis and "@" mentions. Additionally, the dataset was labeled in the context of a specific scenario (e.g., a video game Twitter thread), where logically negative actions might be labeled as positive.
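One possible way to do the mention and emoji filtering is sketched below. The function name `clean_tweet` is hypothetical, and stripping every non-ASCII character is a deliberately blunt assumption: it removes emojis but also accented letters, so the project's real filter may differ.

```python
import re

MENTION_RE = re.compile(r"@\w+")           # "@" followed by a username
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")  # emojis and other non-ASCII runs


def clean_tweet(text):
    """Remove @mentions and non-ASCII characters, then collapse whitespace."""
    text = MENTION_RE.sub("", text)
    text = NON_ASCII_RE.sub("", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_tweet("@user I love this game 😀!!"))
```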

The dataset consisted of over 1.6 million entries, which posed challenges in terms of runtime and computational resources. To manage this, I trained and tested the models on physical university lab machines with greater GPU and CPU capacity.

Technologies Used

Python: Programming language used.

scikit-learn: For implementing the Decision Tree and Logistic Regression models.

Pandas: For data manipulation and analysis.

NumPy: For numerical computations.

NLTK: For natural language processing tasks.

Project Structure

experiment.py: Cleans and preprocesses the Twitter dataset, including removing stop words and extra punctuation.

classify.py: Contains the Logistic Regression and Decision Tree classifiers used to train and evaluate the models.

data/: Directory containing the Twitter tweet data, split into multiple 25MB files for easier management.
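Because the data is split across several files, it has to be recombined before training. The sketch below shows one way to do that with pandas; the function name `load_split_csvs`, the `*.csv` pattern, and the pass-through keyword arguments are assumptions, since the actual file names and column layout are not stated here.

```python
from pathlib import Path

import pandas as pd


def load_split_csvs(data_dir, pattern="*.csv", **read_csv_kwargs):
    """Read every matching CSV in data_dir and concatenate them in name order."""
    paths = sorted(Path(data_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {data_dir}")
    frames = [pd.read_csv(path, **read_csv_kwargs) for path in paths]
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

If the files have no header row, `header=None` and `names=[...]` can be passed through to `pd.read_csv` via the keyword arguments.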
