Newsgroup Text classification with Machine Learning

#python #machinelearning

Text can be classified automatically. As with anything in machine learning, it needs data. So what data are we going to use?

The Data

Let's say our source of data is the 20 newsgroups data set, which sklearn loads with the fetch_20newsgroups function.
This data set contains the text of nearly 20,000 newsgroup posts partitioned across 20 different newsgroups.

The dataset is quite old, but that doesn't matter. You can find the original homepage here: 20 news groups dataset

The data set is available by default through the Python machine learning module sklearn, which downloads it the first time you load it.

To simplify, we'll only take 2 newsgroups: "rec.motorcycles" and "rec.sport.hockey".

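#!/usr/bin/python3
from sklearn.datasets import fetch_20newsgroups
# Load every post (train and test subsets) from the two chosen newsgroups
news = fetch_20newsgroups(subset="all", categories=['rec.sport.hockey', 'rec.motorcycles'])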

Test the Algorithm

Before using the classifier, you want to know how well it works. That is done by splitting the data set into a training set and a test set.

#!/usr/bin/python3
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)
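By default, train_test_split shuffles the data and holds out 25% of it for testing, so the split (and therefore the score) changes from run to run. Here is a minimal sketch, assuming you want a reproducible split. The toy texts and labels are made up for illustration, and random_state=42 is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for news.data and news.target (made up for illustration).
texts = ["bike a", "bike b", "bike c", "bike d",
         "puck a", "puck b", "puck c", "puck d"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# random_state fixes the shuffle; stratify keeps the class balance in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(x_train), len(x_test))  # 6 2
print(sorted(y_test))             # [0, 1]
```

With a fixed random_state, the accuracy you measure later stays the same between runs, which makes it easier to compare changes.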

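The data we're dealing with is text, but the classifier needs numeric vectors. Use the TfidfVectorizer to convert it: fit it on the training text only, then apply the same transformation to the test text. This gives us two matrices of TF-IDF features: x_train and x_test.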

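#!/usr/bin/python3
from sklearn.feature_extraction.text import TfidfVectorizer
transfer = TfidfVectorizer()
# Learn the vocabulary and weights from the training text only
x_train = transfer.fit_transform(x_train)
# Reuse the same vocabulary and weights for the test text
x_test = transfer.transform(x_test)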
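To make the vectorizing step less of a black box, here is a rough pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings, as documented by sklearn: raw term counts times a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document. The helper names (tokenize, fit_idf, tfidf_vector) and the toy sentences are hypothetical, and the tokenizer is a simplification of sklearn's.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep words of two or more characters (simplified).
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_idf(docs):
    # Learn the vocabulary and smoothed IDF weights from the training documents only.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(doc, idf):
    # Weight term counts by IDF, drop words not seen in training, then L2-normalize.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

train_docs = ["the bike is fast", "the puck is on the ice"]
idf = fit_idf(train_docs)

vec = tfidf_vector("the fast goalie", idf)
print(vec)  # "goalie" was never seen in training, so it is ignored
```

In this sketch "fast" gets a higher weight than "the", because "the" appears in every training document and so says less about the topic.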

There is no need to change y_train and y_test, as those are the output labels (class 0 or class 1).

Create the classifier object (here a multinomial naive Bayes estimator) and train it with the data.

#!/usr/bin/python3
from sklearn.naive_bayes import MultinomialNB
estimator = MultinomialNB()
# Train on the TF-IDF features and their labels
estimator.fit(x_train, y_train)
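For intuition about what fit and predict do here, below is a simplified pure-Python sketch of multinomial naive Bayes with add-one (Laplace) smoothing. It works on raw word counts rather than TF-IDF weights. The function names train_nb and predict_nb and the toy sentences are hypothetical; sklearn's MultinomialNB remains the implementation this tutorial uses.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # Count documents per class and word occurrences per class.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(doc, model):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior plus the sum of log likelihoods with add-one smoothing.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    # Return the class with the highest log score.
    return max(scores, key=scores.get)

docs = ["ride the bike", "bike engine oil", "hockey team won", "goal on the ice"]
labels = [0, 0, 1, 1]
model = train_nb(docs, labels)

print(predict_nb("the hockey team", model))  # 1
print(predict_nb("new bike engine", model))  # 0
```

The smoothing matters: without the added 1, a single word that never occurred in a class would give that class a probability of zero.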

Then you can make predictions and see how well it classifies the test data:

#!/usr/bin/python3
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
score = estimator.score(x_test, y_test)
print("score：\n", score)

Run the program and you'll see the accuracy. The exact value changes from run to run because the train/test split is random:

score：
 0.9939879759519038
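A single accuracy number hides which class the mistakes fall in. As a hedged sketch, sklearn.metrics can break the result down per class. The label lists below are made up for illustration; in the program you would pass y_test and y_predict instead.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration; use y_test and y_predict in the real program.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# Rows are the true classes, columns are the predicted classes.
matrix = confusion_matrix(y_true, y_pred)
print(matrix)

accuracy = accuracy_score(y_true, y_pred)
print(accuracy)
```

Here one class-0 item was predicted as class 1, which shows up in the top-right cell of the matrix.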

Make your own predictions

You can make predictions with new texts:

Enter some text: i like to drive motor cycle on the highway
y_predict:
 [0]
Enter some text: i like to play hockey game
y_predict:
 [1]

To do so, add these lines:

#!/usr/bin/python3
sentence = input("Enter some text: ")
sentence_x = transfer.transform([sentence])
y_predict = estimator.predict(sentence_x)
print("y_predict:\n", y_predict)

The program

The program below does it all:

#!/usr/bin/python3
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

def nb_news():
    news = fetch_20newsgroups(subset="all", categories=['rec.sport.hockey', 'rec.motorcycles'])
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    score = estimator.score(x_test, y_test)
    print("score：\n", score)

nb_news()
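As a possible variation, not part of the original program, sklearn's make_pipeline can bundle the vectorizer and the classifier into one object, so raw text goes in and a label comes out without a separate transform call. The sketch below trains on a tiny made-up corpus with string labels so that it runs without downloading anything. With the real data you would fit the pipeline on the raw training texts returned by train_test_split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for the newsgroup posts.
texts = [
    "I ride my motorcycle on the highway",
    "the bike engine needs new oil",
    "the hockey team won the game",
    "he scored a goal on the ice",
]
labels = ["motorcycles", "motorcycles", "hockey", "hockey"]

# The pipeline vectorizes the text and then feeds it to the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

prediction = model.predict(["i like to play hockey game"])
print(prediction[0])
```

Keeping both steps in one object also removes the risk of forgetting to transform new text with the fitted vectorizer before predicting.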