DPigeon/NLP-Language-Classifier


A Naive Bayes classification for NLP to determine the most likely language of a tweet.


Folders and files

input
.gitignore
README.md
classifier.py
corpus_testing.py
corpus_training.py
input_parser.py
language.py
n_gram.py
naive_bayes.py
output_parser.py
tweet.py


NLP Language Classifier

https://github.com/DPigeon/NLP-Language-Classifier

A Naive Bayes classification for NLP to determine the most likely language of a tweet

First, install Miniconda with Python 3.7 from

https://docs.conda.io/en/latest/miniconda.html

The project also requires NumPy, which you can install with

conda install numpy

Run

To run the program, first create an output folder in the root of the project. Then edit the input.txt file in the input folder. The input file has the following format:

vocabulary size_of_ngram smoothing_value training_file testing_file

where vocabulary is one of:

0: Fold the corpus to lowercase and use only the 26 letters of the alphabet [a-z]

1: Distinguish upper and lower case and use only the 26 letters of the alphabet in both cases [a-z, A-Z]

2: Distinguish upper and lower case and use all characters accepted by the built-in isalpha() method
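As a rough illustration of the three vocabulary options, the check below decides whether a single character belongs to the chosen vocabulary. The names normalize and in_vocabulary are hypothetical helpers for this sketch and are not taken from the repository's code.

```python
import string

LOWERCASE = set(string.ascii_lowercase)  # [a-z]
BOTH_CASES = set(string.ascii_letters)   # [a-z, A-Z]


def normalize(text, vocabulary):
    """Fold the text to lowercase for vocabulary 0; leave it unchanged otherwise."""
    return text.lower() if vocabulary == 0 else text


def in_vocabulary(ch, vocabulary):
    """Return True if the character ch belongs to the selected vocabulary."""
    if vocabulary == 0:
        return ch in LOWERCASE
    if vocabulary == 1:
        return ch in BOTH_CASES
    if vocabulary == 2:
        return ch.isalpha()
    raise ValueError("vocabulary must be 0, 1 or 2")


# An accented letter is only accepted by vocabulary 2.
print(in_vocabulary("a", 0), in_vocabulary("A", 1), in_vocabulary("á", 2))
```

With vocabulary 0 the text would be passed through normalize first, so uppercase letters are folded rather than discarded.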

size_of_ngram is one of:

1: character unigrams

2: character bigrams

3: character trigrams
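The three n-gram sizes can be sketched with a sliding window over the characters of a tweet. The function char_ngrams is a hypothetical helper, and skipping every n-gram that contains a character outside the vocabulary is an assumption of this sketch; the README does not say how such characters are handled.

```python
from typing import Callable, List


def char_ngrams(text: str, n: int, keep: Callable[[str], bool] = str.isalpha) -> List[str]:
    """Return the character n-grams of text, in order.

    Assumption: an n-gram is dropped if any of its characters fails keep().
    """
    if n not in (1, 2, 3):
        raise ValueError("n must be 1, 2 or 3")
    grams = []
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if all(keep(ch) for ch in gram):
            grams.append(gram)
    return grams


print(char_ngrams("hi you", 2))  # ['hi', 'yo', 'ou']
```

The keep argument defaults to str.isalpha, which corresponds to vocabulary option 2; a stricter predicate can be passed for the other options.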

smoothing_value is a real number in the range [0, 1].
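The README does not spell out how the smoothing value enters the model. A common choice for Naive Bayes over n-gram counts is additive smoothing, and the sketch below assumes that choice. The function name, its parameters, and the use of base-10 logarithms are all assumptions of this example.

```python
import math


def smoothed_log_prob(count, total, num_possible, smoothing):
    """Log10 probability of an n-gram under additive smoothing (assumed scheme).

    count:        occurrences of the n-gram in one language's training data
    total:        occurrences of all n-grams in that language's training data
    num_possible: number of distinct n-grams the vocabulary allows
    smoothing:    value in [0, 1] added to every count
    """
    numerator = count + smoothing
    denominator = total + smoothing * num_possible
    if numerator == 0:
        # An unseen n-gram with no smoothing has probability zero.
        return float("-inf")
    return math.log10(numerator / denominator)


# A unigram seen 3 times out of 10, 26 possible unigrams, smoothing 0.5:
# (3 + 0.5) / (10 + 0.5 * 26) = 3.5 / 23
print(smoothed_log_prob(3, 10, 26, 0.5))
```

Under this scheme, the score of a tweet for one language would be the log prior of that language plus the sum of these log probabilities over the tweet's n-grams.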

Output Files

The trace file will give an output as follows:

tweet_id  most_likely_class  score_most_likely_class  correct_class  correct_wrong_label

where correct_wrong_label is the label correct or wrong, depending on whether the most likely class matches the correct class.
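One line of the trace file could be produced as follows. The helper trace_line is hypothetical, and the literal label strings "correct" and "wrong", the two-space separator, and the unformatted score are assumptions based on the format line above.

```python
def trace_line(tweet_id, most_likely_class, score, correct_class):
    """Build one trace line: id, predicted class, score, true class, label."""
    label = "correct" if most_likely_class == correct_class else "wrong"
    fields = [str(tweet_id), most_likely_class, str(score), correct_class, label]
    return "  ".join(fields)


print(trace_line("123", "es", -42.5, "es"))  # 123  es  -42.5  es  correct
```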

The evaluation file will give an output as follows:

accuracy
eu_precision  ca_precision  gl_precision  es_precision  en_precision  pt_precision
eu_recall  ca_recall  gl_recall  es_recall  en_recall  pt_recall
eu_f1_measure  ca_f1_measure  gl_f1_measure  es_f1_measure  en_f1_measure  pt_f1_measure
macro_f1  weighted_average_f1
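The metrics named in the evaluation file can be computed from pairs of predicted and correct classes using the standard definitions. The sketch below is one way to do that; the function evaluate and its input format are hypothetical, and returning 0.0 for an undefined precision, recall or F1 is an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple

LANGUAGES = ["eu", "ca", "gl", "es", "en", "pt"]


def evaluate(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Compute the evaluation metrics from (predicted, correct) class pairs."""
    true_pos, predicted_n, actual_n = Counter(), Counter(), Counter()
    for predicted, actual in pairs:
        predicted_n[predicted] += 1
        actual_n[actual] += 1
        if predicted == actual:
            true_pos[actual] += 1

    total = len(pairs)
    metrics = {"accuracy": sum(true_pos.values()) / total if total else 0.0}
    f1_sum = 0.0
    weighted_sum = 0.0
    for lang in LANGUAGES:
        precision = true_pos[lang] / predicted_n[lang] if predicted_n[lang] else 0.0
        recall = true_pos[lang] / actual_n[lang] if actual_n[lang] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[lang + "_precision"] = precision
        metrics[lang + "_recall"] = recall
        metrics[lang + "_f1_measure"] = f1
        f1_sum += f1
        weighted_sum += f1 * actual_n[lang]  # weight by class frequency

    metrics["macro_f1"] = f1_sum / len(LANGUAGES)
    metrics["weighted_average_f1"] = weighted_sum / total if total else 0.0
    return metrics


results = evaluate([("es", "es"), ("es", "pt"), ("pt", "pt"), ("en", "en")])
print(results["accuracy"], results["es_precision"], results["pt_recall"])
```

Macro F1 averages the per-language F1 values equally, while the weighted average F1 weights each language by the number of test tweets that truly belong to it.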

