Building a Naive Bayes classifier using Python with drawings.
We will translate each part of the Gauss Naive Bayes into Python code and explain the logic behind its methods.
The Complete Code can be found at the bottom of this page or in nb_tutorial.py.
The Overview will be just that, an overview and a soft introduction to Naive Bayes. Stay with us! Preparing Data is where the excitement begins.
We will use Naive Bayes and the Gaussian Distribution (Normal Distribution) to build a classifier that predicts flower species based on petal and sepal features.
We will be working with the iris data set, a collection of 4-dimensional features that define 3 different flower species.
The Iris data set is a classic and is widely used when explaining classification models. It has 150 instances: 4 independent variables (features) and 1 dependent variable (class) that takes one of 3 values.
The first 4 columns are the independent variables (features).
The 5th column is the dependent variable (class).
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
- class:
  - Iris Setosa
  - Iris Versicolour
  - Iris Virginica
sepal length | sepal width | petal length | petal width | class |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
6.3 | 2.8 | 5.1 | 1.5 | Iris-virginica |
6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
Naive Bayes, more technically referred to as the Posterior Probability, updates the prior belief of an event given new information. The result is the probability of the class occurring given the new data.
The classification model can handle binary and multiclass problems.
When predicting a class, the model calculates the posterior probability for every class and selects the largest posterior probability as the predicted class. This value is referred to as the Maximum A Posteriori (MAP). A tiny worked example follows the definitions below.
Posterior Probability:
- This is the updated belief given the new data, and the objective probability of each class, derived from the Naive Bayes technique.
Class Prior Probability:
- This is the Prior Belief; the probability of the class before updating the belief.
Likelihood:
- Likelihood is calculated by taking the product of all Normal Probability Density Functions (we assume the features are independent, ergo the "naivete"). The Normal PDF is calculated using the Gaussian Distribution; hence the name Gauss Naive Bayes.
- We will use the Normal PDF to calculate the Normal Probability value for each feature given the class.
- Likelihood is the product of all Normal PDF values.
- There's an important distinction to keep in mind between Likelihood and Probability.
- A Normal Probability value is a density, calculated for each feature given the class. Unlike a true probability, a density can exceed 1; we'll see exactly that when we run normal_pdf() later in the tutorial.
- Likelihood is the product of all Normal Probability values.
- The set of possible features is open-ended, limited only by our imagination.
- Since there could always be more features, the product of all Normal Probability values is not a probability but the Likelihood.
Predictor Prior Probability:
- Predictor Prior Probability is another way of saying Marginal Probability.
- It is the total probability of observing the new data across all classes.
- The Naive Bayes classifier doesn't strictly need it, because we're only looking for the best prediction, not the exact probability.
- The predicted class does not change either way; we calculate it here to show that this is the case.
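To make these four pieces concrete, here's a minimal sketch with made-up numbers: a toy two-class problem, not the iris data, and the likelihood values are hypothetical.

```python
# A toy two-class example with made-up numbers (not the iris data).
# Class priors: what we believe before seeing the new data.
prior = {'A': 0.6, 'B': 0.4}

# Likelihoods: hypothetical P(new data | class) values,
# e.g. products of Normal PDF values.
likelihood = {'A': 0.02, 'B': 0.09}

# Joint probability = prior * likelihood, per class.
joint = {c: prior[c] * likelihood[c] for c in prior}   # {'A': 0.012, 'B': 0.036}

# Marginal (predictor prior) probability = sum of all joints.
marginal = sum(joint.values())                         # 0.048

# Posterior = joint / marginal; the posteriors now sum to 1.
posterior = {c: joint[c] / marginal for c in joint}    # {'A': 0.25, 'B': 0.75}

# MAP: pick the class with the largest posterior -> 'B'.
print max(posterior, key=posterior.get)
```

Note that 'B' would win even without dividing by the marginal, which is why the classifier doesn't strictly need that step.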
See the Normal Distribution (Wikipedia) definition.
The Normal Distribution will determine the Normal Probability value for each new feature given the class. The product of those Normal Probability values is the likelihood of the class occurring given the new features.
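Written out, the Normal PDF, which we'll implement later as normal_pdf(), is:

```
N(x; µ, σ) = 1 / (σ * √(2π)) * e^( -(x - µ)² / (2σ²) )
```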
Building the Naive Bayes Classifier.
Here, we'll create the structure and the methods to read and prepare data for modeling.
Every function is created from scratch. However, instead of downloading the data, we use an API call to fetch it.
$ pip install requests
In some sections you'll see "Click to expand". Click it to view the Python code.
Import the necessary libraries and create the GaussNB class. This will be the foundation for the rest of the code.
Click to expand GaussNB Skeleton.
```python
# -*- coding: utf-8 -*-
from collections import defaultdict
from math import pi
from math import e
import requests
import random
import csv
import re


class GaussNB:

    def __init__(self):
        pass


def main():
    print "Here we will handle class methods."

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Here we will handle class methods.
```
Read in the raw data and convert each numeric string into a float.
Click to expand load_csv().
```python
class GaussNB:

    def __init__(self):
        pass

    def load_csv(self, data, header=False):
        """
        :param data: raw comma separated file
        :param header: remove header if it exists
        :return: Load and convert each string of data into a float
        """
        lines = csv.reader(data.splitlines())
        dataset = list(lines)
        if header:
            # remove header
            dataset = dataset[1:]
        for i in range(len(dataset)):
            dataset[i] = [float(x) if re.search('\d', x) else x for x in dataset[i]]
        return dataset


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    print data[:3]  # first 3 rows

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
[[4.9, 3.0, 1.4, 0.2, 'Iris-setosa'], [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'], [4.6, 3.1, 1.5, 0.2, 'Iris-setosa']]
```
Split the data into a `train_set` and a `test_set`. The weight determines how much of the data goes into the `train_set`.
Click to expand split_data().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def split_data(self, data, weight):
        """
        :param data: original data set
        :param weight: percentage of rows that'll be used for training
        :return:
        Randomly selects rows for training according to the weight and uses the rest of the rows for testing.
        """
        train_size = int(len(data) * weight)
        train_set = []
        for i in range(train_size):
            index = random.randrange(len(data))
            train_set.append(data[index])
            data.pop(index)
        return [train_set, data]


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
```
Group the data according to class by mapping each class to individual instances.
Take this table
sepal length | sepal width | petal length | petal width | class |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
6.3 | 2.8 | 5.1 | 1.5 | Iris-virginica |
6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
and turn it into this map
```python
{
    'Iris-virginica': [
        [6.3, 2.8, 5.1, 1.5],
    ],
    'Iris-setosa': [
        [5.1, 3.5, 1.4, 0.2],
        [4.9, 3.0, 1.4, 0.2],
    ],
    'Iris-versicolor': [
        [7.0, 3.2, 4.7, 1.4],
        [6.4, 3.2, 4.5, 1.5],
    ]
}
```
Click to expand group_by_class().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def group_by_class(self, data, target):
        """
        :param data: Training set. Lists of events (rows) in a list
        :param target: Index for the target column. Usually the last index in the list
        :return: Mapping of each target class to a list of its features
        """
        target_map = defaultdict(list)
        for index in range(len(data)):
            features = data[index]
            if not features:
                continue
            x = features[target]
            target_map[x].append(features[:-1])  # designating the last column as the class column
        return dict(target_map)


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
```
Prepare the data for modeling. Calculate the descriptive statistics that will later be used in the model.
Calculate the mean of `[5.9, 3.0, 5.1, 1.8]`.
Click to expand mean().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def mean(self, numbers):
        """
        :param numbers: list of numbers
        :return: the arithmetic mean
        """
        result = sum(numbers) / float(len(numbers))
        return result


def main():
    nb = GaussNB()
    print "Mean: %s" % nb.mean([5.9, 3.0, 5.1, 1.8])

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Mean: 3.95
```
Calculate the standard deviation of `[5.9, 3.0, 5.1, 1.8]`.
Click to expand stdev().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def stdev(self, numbers):
        """
        :param numbers: list of numbers
        :return: Calculate the standard deviation for a list of numbers.
        """
        avg = self.mean(numbers)
        squared_diff_list = []
        for num in numbers:
            squared_diff = (num - avg) ** 2
            squared_diff_list.append(squared_diff)
        squared_diff_sum = sum(squared_diff_list)
        sample_n = float(len(numbers) - 1)
        var = squared_diff_sum / sample_n
        return var ** .5


def main():
    nb = GaussNB()
    print "Standard Deviation: %s" % nb.stdev([5.9, 3.0, 5.1, 1.8])

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Standard Deviation: 1.88414436814
```
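As a quick check of that output: the mean of [5.9, 3.0, 5.1, 1.8] is 3.95 (from the previous step), and stdev() divides by n - 1, the sample standard deviation:

```
((5.9 - 3.95)² + (3.0 - 3.95)² + (5.1 - 3.95)² + (1.8 - 3.95)²) / (4 - 1)
  = (3.8025 + 0.9025 + 1.3225 + 4.6225) / 3
  = 3.55
√3.55 ≈ 1.8841
```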
Return the (mean, standard deviation) combination for each feature of the `train_set`. The mean and the standard deviation will be used when calculating the Normal Probability values for each feature of the `test_set`.
Click to expand summarize().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def summarize(self, test_set):
        """
        :param test_set: lists of features
        :return:
        Use zip to line up each feature into a single column across multiple lists.
        yield the mean and the stdev for each feature.
        """
        for feature in zip(*test_set):
            yield {
                'stdev': self.stdev(feature),
                'mean': self.mean(feature)
            }


def main():
    nb = GaussNB()
    data = [
        [5.9, 3.0, 5.1, 1.8],
        [5.1, 3.5, 1.4, 0.2]
    ]
    print "Feature Summary: %s" % [i for i in nb.summarize(data)]

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Feature Summary:
[{'mean': 5.5, 'stdev': 0.5656854249492386},   # sepal length
 {'mean': 3.25, 'stdev': 0.3535533905932738},  # sepal width
 {'mean': 3.25, 'stdev': 2.6162950903902256},  # petal length
 {'mean': 1.0, 'stdev': 1.1313708498984762}]   # petal width
```
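The `zip(*test_set)` trick transposes rows into per-feature columns, which is what lets summarize() compute one (mean, stdev) pair per feature. A quick illustration with the same two rows:

```python
rows = [[5.9, 3.0, 5.1, 1.8],
        [5.1, 3.5, 1.4, 0.2]]
# Each tuple holds one feature's values across all rows.
print zip(*rows)  # [(5.9, 5.1), (3.0, 3.5), (5.1, 1.4), (1.8, 0.2)]
```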
Building the class methods for calculating Bayes Theorem:
(Diagrams: features and class; Bayes tree diagram, using Iris-setosa as an example.)
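In the terms defined in the Overview, the formula we're about to assemble, one method at a time, is:

```
                      P(class) * P(features | class)      prior_prob * likelihood
P(class | features) = ------------------------------  =  -------------------------
                              P(features)                      marginal_prob
```

prior_prob() computes the prior, normal_pdf() supplies the per-feature values whose product is the likelihood, marginal_pdf() computes the denominator, and posterior_probabilities() puts the pieces together.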
Prior Probability is what we know about each class before considering the new data: the probability of each class occurring. For example, if 30 of 100 training rows are Iris-setosa, then P(Iris-setosa) = 0.3.
Click to expand prior_prob().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def prior_prob(self, group, target, data):
        """
        :param group: mapping of each class to its list of feature rows
        :param target: the target class
        :param data: the full data set
        :return: The probability of each target class
        """
        total = float(len(data))
        result = len(group[target]) / total
        return result


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    for target_class in ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']:
        prior_prob = nb.prior_prob(group, target_class, data)
        print 'P(%s): %s' % (target_class, prior_prob)

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
P(Iris-virginica): 0.38
P(Iris-setosa): 0.3
P(Iris-versicolor): 0.32
```
This is where we learn from the train set, by calculating the mean and the standard deviation.
Using the grouped classes, calculate the (mean, standard deviation) combination for each feature of each class.
The calculations will later use the (mean, standard deviation) of each feature to calculate class likelihoods.
Click to expand train().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def train(self, train_list, target):
        """
        :param train_list: list of rows to train on
        :param target: target class
        :return:
        For each target:
            1. yield prior_prob: the probability of each class. P(class) eg P(Iris-virginica)
            2. yield summary: list of {'mean': 0.0, 'stdev': 0.0}
        """
        group = self.group_by_class(train_list, target)
        self.summaries = {}
        for target, features in group.iteritems():
            self.summaries[target] = {
                'prior_prob': self.prior_prob(group, target, train_list),
                'summary': [i for i in self.summarize(features)],
            }
        return self.summaries


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    print nb.train(train_list, -1)

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
{'Iris-setosa': {'prior_prob': 0.3,
                 'summary': [{'mean': 4.980000000000001, 'stdev': 0.34680810554104063},     # sepal length
                             {'mean': 3.406666666666667, 'stdev': 0.3016430104397023},      # sepal width
                             {'mean': 1.496666666666667, 'stdev': 0.20254132542705236},     # petal length
                             {'mean': 0.24333333333333343, 'stdev': 0.12228664272317624}]},  # petal width
 'Iris-versicolor': {'prior_prob': 0.31,
                     'summary': [{'mean': 5.96774193548387, 'stdev': 0.4430102307127106},
                                 {'mean': 2.7903225806451615, 'stdev': 0.28560443356698495},
                                 {'mean': 4.303225806451613, 'stdev': 0.41990782398659987},
                                 {'mean': 1.3451612903225807, 'stdev': 0.17289439874755796}]},
 'Iris-virginica': {'prior_prob': 0.39,
                    'summary': [{'mean': 6.679487179487178, 'stdev': 0.585877428882027},
                                {'mean': 3.002564102564103, 'stdev': 0.34602036712733625},
                                {'mean': 5.643589743589742, 'stdev': 0.5215336048086158},
                                {'mean': 2.0487179487179477, 'stdev': 0.2927831916298213}]}}
```
Likelihood is calculated by taking the product of all Normal Probability values.
For each feature given the class, we calculate the Normal Probability using the Normal Distribution. Keep in mind these are density values, not true probabilities, so a single one can exceed 1, as the output below shows.
Click to expand normal_pdf().
```python
from math import e, pi


class GaussNB:
    # ... methods from the previous steps ...

    def normal_pdf(self, x, mean, stdev):
        """
        :param x: a variable
        :param mean: µ - the expected value or average from M samples
        :param stdev: σ - standard deviation
        :return: Gaussian (Normal) Density function.
        N(x; µ, σ) = (1 / (σ√(2π))) * e^(-(x-µ)² / (2σ²))
        """
        variance = stdev ** 2
        exp_squared_diff = (x - mean) ** 2
        exp_power = -exp_squared_diff / (2 * variance)
        exponent = e ** exp_power
        denominator = ((2 * pi) ** .5) * stdev
        normal_prob = exponent / denominator
        return normal_prob


def main():
    nb = GaussNB()
    normal_prob = nb.normal_pdf(5, 4.98, 0.35)
    print normal_prob

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
1.13797564994
```
Joint Probability is calculated by taking the product of the Prior Probability and the Likelihood.
For each class:
- Calculate the Prior Probability.
- Use the Normal Distribution to calculate the Normal Probability of each feature, e.g. N(x; µ, σ), and take the product of those values to get the Likelihood.
- Take the product of the Prior Probability and the Likelihood.
- Return one Joint Probability value for each class given the new data.
Click to expand joint_probabilities().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def joint_probabilities(self, test_row):
        """
        :param test_row: single list of features to test; new data
        :return:
        Use the normal_pdf(self, x, mean, stdev) to calculate the Normal Probability for each feature
        Take the product of all Normal Probabilities and the Prior Probability.
        """
        joint_probs = {}
        for target, features in self.summaries.iteritems():
            total_features = len(features['summary'])
            likelihood = 1
            for index in range(total_features):
                feature = test_row[index]
                mean = features['summary'][index]['mean']
                stdev = features['summary'][index]['stdev']
                normal_prob = self.normal_pdf(feature, mean, stdev)
                likelihood *= normal_prob
            prior_prob = features['prior_prob']
            joint_probs[target] = prior_prob * likelihood
        return joint_probs


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    print nb.joint_probabilities([5.0, 4.98, 0.35, 4.0])

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
{'Iris-virginica': 7.880001356130214e-38,
 'Iris-setosa': 9.616469451152855e-230,
 'Iris-versicolor': 6.125801208117717e-68}
```
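An aside: the joint probabilities above are astronomically small (note the e-230), because we multiply several densities together. The tutorial's code works with raw products, which is fine here, but a common guard against floating-point underflow is to sum logs instead. A minimal sketch, with hypothetical per-feature values that are not from the tutorial's code:

```python
from math import log

# Hypothetical normal probabilities for one class's four features.
normal_probs = [1.1379, 0.0003, 0.0000021, 0.45]
prior_prob = 0.3

# Instead of prior * p1 * p2 * ..., sum the logs; the class with the
# largest log-joint is the same class that wins on raw products.
log_joint = log(prior_prob) + sum(log(p) for p in normal_probs)
print log_joint
```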
Calculate the total sum of all joint probabilities.
The Marginal Probability is the sum of all joint probabilities. It is a single value, the same across all classes. We can think of it as the total probability of the new data occurring under all classes.
Reminder: we're looking to predict the class by choosing the Maximum A Posteriori (MAP) probability. The prediction doesn't care about the exact posterior probability of each class, and dividing every class's joint probability by the same constant adds computation without changing which class is largest.
For the purposes of sticking to the true Bayes Theorem, we use it here.
Click to expand marginal_pdf().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def marginal_pdf(self, joint_probabilities):
        """
        :param joint_probabilities: mapping of each class to its joint probability
        :return: Marginal Probability Density Function (Predictor Prior Probability)
        Joint Probability = prior * likelihood
        Marginal Probability is the sum of all joint probabilities for all classes.
        marginal_pdf =
          [P(setosa) * P(sepal length | setosa) * P(sepal width | setosa) * P(petal length | setosa) * P(petal width | setosa)]
        + [P(versicolour) * P(sepal length | versicolour) * P(sepal width | versicolour) * P(petal length | versicolour) * P(petal width | versicolour)]
        + [P(virginica) * P(sepal length | virginica) * P(sepal width | virginica) * P(petal length | virginica) * P(petal width | virginica)]
        """
        marginal_prob = sum(joint_probabilities.values())
        return marginal_prob


def main():
    nb = GaussNB()
    joint_probs = {
        'Iris-setosa': 1.2904413965468937,
        'Iris-versicolor': 5.414630046086964e-14,
        'Iris-virginica': 7.087518912297627e-30
    }
    marginal_prob = nb.marginal_pdf(joint_probs)
    print 'Marginal Probability: %s' % marginal_prob

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Marginal Probability: 1.29044139655
```
The Posterior Probability is the probability of a class occurring given the new data, and is calculated for each class.
This is where all of the preceding class methods tie together to compute the Gauss Naive Bayes formula, with the goal of selecting the MAP.
Click to expand posterior_probabilities().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def posterior_probabilities(self, test_row):
        """
        :param test_row: single list of features to test; new data
        :return:
        For each feature (x) in the test_row:
            1. Calculate the Normal Probability using the Normal PDF N(x; µ, σ). eg = P(feature | class)
            2. Calculate the Likelihood by taking the product of all Normal Probabilities
            3. Multiply the Likelihood by the prior to calculate the Joint Probability.

        E.g.
        prior_prob: P(setosa)
        likelihood: P(sepal length | setosa) * P(sepal width | setosa) * P(petal length | setosa) * P(petal width | setosa)
        joint_prob: prior_prob * likelihood
        marginal_prob: predictor prior probability
        posterior_prob = joint_prob / marginal_prob
        returning a dictionary mapping of each class to its posterior probability
        """
        posterior_probs = {}
        joint_probabilities = self.joint_probabilities(test_row)
        marginal_prob = self.marginal_pdf(joint_probabilities)
        for target, joint_prob in joint_probabilities.iteritems():
            posterior_probs[target] = joint_prob / marginal_prob
        return posterior_probs


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    posterior_probs = nb.posterior_probabilities([6.3, 2.8, 5.1, 1.5])
    print "Posterior Probabilities: %s" % posterior_probs

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
Posterior Probabilities: {'Iris-virginica': 0.32379024365947745,
                          'Iris-setosa': 2.5693999408505845e-158,
                          'Iris-versicolor': 0.6762097563405226}
```
Testing the model and predicting a class given the new data.
The `get_map()` method calls the `posterior_probabilities()` method on a single `test_row` (e.g. `[6.3, 2.8, 5.1, 1.5]`).
For each `test_row` we calculate 3 posterior probabilities, one for each class. The goal is to select the MAP, the Maximum A Posteriori probability.
The `get_map()` method simply chooses the largest posterior probability and returns the associated class for the given `test_row`.
Click to expand get_map().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def get_map(self, test_row):
        """
        :param test_row: single list of features to test
        :return: Return the target class with the largest/best posterior probability
        """
        posterior_probs = self.posterior_probabilities(test_row)
        map_prob = max(posterior_probs, key=posterior_probs.get)
        return map_prob


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    prediction = nb.get_map([6.3, 2.8, 5.1, 1.5])  # true class: 'Iris-virginica'
    print 'According to the test row the best prediction is: %s' % prediction

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
According to the test row the best prediction is: Iris-versicolor
```
This method will return a prediction for each test_row.
Example input, a list of lists:

```python
[
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
]
```

To test this method, we'll use the sample data from the table above.
Click to expand predict().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def predict(self, test_set):
        """
        :param test_set: list of features to test on
        :return:
        Predict the likeliest target for each row of the test_set.
        Return a list of predicted targets.
        """
        map_probs = []
        for row in test_set:
            map_prob = self.get_map(row)
            map_probs.append(map_prob)
        return map_probs


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    test = {
        'Iris-virginica': [
            [6.3, 2.8, 5.1, 1.5],
        ],
        'Iris-setosa': [
            [5.1, 3.5, 1.4, 0.2],
            [4.9, 3.0, 1.4, 0.2],
        ],
        'Iris-versicolor': [
            [7.0, 3.2, 4.7, 1.4],
            [6.4, 3.2, 4.5, 1.5],
        ]
    }
    for target, features in test.iteritems():
        predicted = nb.predict(features)
        print 'predicted target: %s | true target: %s' % (predicted, target)

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
predicted target: ['Iris-versicolor'] | true target: Iris-virginica
predicted target: ['Iris-setosa', 'Iris-setosa'] | true target: Iris-setosa              # both test rows were predicted to be setosa
predicted target: ['Iris-versicolor', 'Iris-versicolor'] | true target: Iris-versicolor  # both test rows were predicted to be versicolor
```
Accuracy tests the performance of the model by taking the total number of correct predictions and dividing it by the total number of predictions. This is critical in understanding the veracity of the model.
Click to expand accuracy().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def accuracy(self, test_set, predicted):
        """
        :param test_set: list of test_data
        :param predicted: list of predicted classes
        :return: Calculate the average performance of the classifier.
        """
        correct = 0
        actual = [item[-1] for item in test_set]
        for x, y in zip(actual, predicted):
            if x == y:
                correct += 1
        return correct / float(len(test_set))


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    predicted = nb.predict(test_list)
    accuracy = nb.accuracy(test_list, predicted)
    print 'Accuracy: %.3f' % accuracy

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
Accuracy: 0.960
```
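If you happen to have scikit-learn installed (it is not a dependency of this tutorial), you can sanity-check the from-scratch classifier against sklearn's GaussianNB. A rough sketch, assuming it is dropped into main() after split_data() so that train_list and test_list are in scope:

```python
# Optional sanity check against scikit-learn; not part of this tutorial's code.
from sklearn.naive_bayes import GaussianNB

# Separate features from the class label in the last column.
X_train = [row[:-1] for row in train_list]
y_train = [row[-1] for row in train_list]
X_test = [row[:-1] for row in test_list]
y_test = [row[-1] for row in test_list]

clf = GaussianNB()
clf.fit(X_train, y_train)
print 'sklearn accuracy: %.3f' % clf.score(X_test, y_test)
```

The two accuracies should land in the same neighborhood on the same split, since both implement Gaussian Naive Bayes.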
The Naive Bayes classification model makes some strong assumptions: all of the features are assumed to be independent when calculating the likelihood (hence "Naive"), and the likelihood is calculated using the Gaussian (Normal) Distribution, with every feature assumed to be normally distributed (hence "Gauss").
Overlooking Gauss NB's strong assumptions, the classifier is fast and accurate. It does not require a lot of data to perform well and is highly scalable.
You can find the Complete Code below.
The initial build of the Gauss Naive Bayes classifier can run on four classic data sets; you can find that code in gauss_nb.py.
- Oleh Dubno (github: odubno) - code and images.
- Danny Argov (github: datargov) - wording and logic of text.
See the list of contributors who participated in this project.
Tip of the hat to the authors who made this tutorial possible.
Author | Resource |
---|---|
Dr. Jason Brownlee | How To Implement Naive Bayes From Scratch in Python |
Chris Albon | Naive Bayes Classifier From Scratch |
Sunil Ray | 6 Easy Steps to Learn Naive Bayes Algorithm |
Rahul Saxena | How The Naive Bayes Classifier Works In Machine Learning |
Data Source | UCI Machine Learning |
C. Randy Gallistel | Bayes for Beginners: Probability and Likelihood |
Project for Columbia University Probability and Statistics course - Prof. Banu Baydil
With the massive popularity of Bayes' Theorem, and the default use of Gaussian/Normal distributions for common data sets, we were keen to understand, first, whether the assumption of a Normal distribution fits differing data sets, and second, how to apply the Normal distribution and Bayes' Theorem in repeatable code. Initially, we were intrigued by the idea of pulling public data from sites such as OkCupid or Facebook to produce interesting "predictions" of human behavior, but it dawned on us that we should first master the fundamentals: knowing which statistical distribution to use for a given data set, and understanding how Bayes' Theorem works in practice. So we decided to build the rungs of the ladder first, figuratively speaking.
The code is also available in nb_tutorial.py.
Click to expand nb_tutorial.py
```python
# -*- coding: utf-8 -*-
from collections import defaultdict
from math import pi
from math import e
import requests
import random
import csv
import re


class GaussNB:

    def __init__(self):
        pass

    def load_csv(self, data, header=False):
        """
        :param data: raw comma separated file
        :param header: remove header if it exists
        :return: Load and convert each string of data into a float
        """
        lines = csv.reader(data.splitlines())
        dataset = list(lines)
        if header:
            # remove header
            dataset = dataset[1:]
        for i in range(len(dataset)):
            dataset[i] = [float(x) if re.search('\d', x) else x for x in dataset[i]]
        return dataset

    def split_data(self, data, weight):
        """
        :param data: original data set
        :param weight: percentage of rows that'll be used for training
        :return:
        Randomly selects rows for training according to the weight and uses the rest of the rows for testing.
        """
        train_size = int(len(data) * weight)
        train_set = []
        for i in range(train_size):
            index = random.randrange(len(data))
            train_set.append(data[index])
            data.pop(index)
        return [train_set, data]

    def group_by_class(self, data, target):
        """
        :param data: Training set. Lists of events (rows) in a list
        :param target: Index for the target column. Usually the last index in the list
        :return: Mapping of each target class to a list of its features
        """
        target_map = defaultdict(list)
        for index in range(len(data)):
            features = data[index]
            if not features:
                continue
            x = features[target]
            target_map[x].append(features[:-1])  # designating the last column as the class column
        return dict(target_map)

    def mean(self, numbers):
        """
        :param numbers: list of numbers
        :return: the arithmetic mean
        """
        result = sum(numbers) / float(len(numbers))
        return result

    def stdev(self, numbers):
        """
        :param numbers: list of numbers
        :return: Calculate the standard deviation for a list of numbers.
        """
        avg = self.mean(numbers)
        squared_diff_list = []
        for num in numbers:
            squared_diff = (num - avg) ** 2
            squared_diff_list.append(squared_diff)
        squared_diff_sum = sum(squared_diff_list)
        sample_n = float(len(numbers) - 1)
        var = squared_diff_sum / sample_n
        return var ** .5

    def summarize(self, test_set):
        """
        :param test_set: lists of features
        :return:
        Use zip to line up each feature into a single column across multiple lists.
        yield the mean and the stdev for each feature.
        """
        for feature in zip(*test_set):
            yield {
                'stdev': self.stdev(feature),
                'mean': self.mean(feature)
            }

    def prior_prob(self, group, target, data):
        """
        :param group: mapping of each class to its list of feature rows
        :param target: the target class
        :param data: the full data set
        :return: The probability of each target class
        """
        total = float(len(data))
        result = len(group[target]) / total
        return result

    def train(self, train_list, target):
        """
        :param train_list: list of rows to train on
        :param target: target class
        :return:
        For each target:
            1. yield prior_prob: the probability of each class. P(class) eg P(Iris-virginica)
            2. yield summary: list of {'mean': 0.0, 'stdev': 0.0}
        """
        group = self.group_by_class(train_list, target)
        self.summaries = {}
        for target, features in group.iteritems():
            self.summaries[target] = {
                'prior_prob': self.prior_prob(group, target, train_list),
                'summary': [i for i in self.summarize(features)],
            }
        return self.summaries

    def normal_pdf(self, x, mean, stdev):
        """
        :param x: a variable
        :param mean: µ - the expected value or average from M samples
        :param stdev: σ - standard deviation
        :return: Gaussian (Normal) Density function.
        N(x; µ, σ) = (1 / (σ√(2π))) * e^(-(x-µ)² / (2σ²))
        """
        variance = stdev ** 2
        exp_squared_diff = (x - mean) ** 2
        exp_power = -exp_squared_diff / (2 * variance)
        exponent = e ** exp_power
        denominator = ((2 * pi) ** .5) * stdev
        normal_prob = exponent / denominator
        return normal_prob

    def marginal_pdf(self, joint_probabilities):
        """
        :param joint_probabilities: mapping of each class to its joint probability
        :return: Marginal Probability Density Function (Predictor Prior Probability)
        Joint Probability = prior * likelihood
        Marginal Probability is the sum of all joint probabilities for all classes.
        marginal_pdf =
          [P(setosa) * P(sepal length | setosa) * P(sepal width | setosa) * P(petal length | setosa) * P(petal width | setosa)]
        + [P(versicolour) * P(sepal length | versicolour) * P(sepal width | versicolour) * P(petal length | versicolour) * P(petal width | versicolour)]
        + [P(virginica) * P(sepal length | virginica) * P(sepal width | virginica) * P(petal length | virginica) * P(petal width | virginica)]
        """
        marginal_prob = sum(joint_probabilities.values())
        return marginal_prob

    def joint_probabilities(self, test_row):
        """
        :param test_row: single list of features to test; new data
        :return:
        Use the normal_pdf(self, x, mean, stdev) to calculate the Normal Probability for each feature
        Take the product of all Normal Probabilities and the Prior Probability.
        """
        joint_probs = {}
        for target, features in self.summaries.iteritems():
            total_features = len(features['summary'])
            likelihood = 1
            for index in range(total_features):
                feature = test_row[index]
                mean = features['summary'][index]['mean']
                stdev = features['summary'][index]['stdev']
                normal_prob = self.normal_pdf(feature, mean, stdev)
                likelihood *= normal_prob
            prior_prob = features['prior_prob']
            joint_probs[target] = prior_prob * likelihood
        return joint_probs

    def posterior_probabilities(self, test_row):
        """
        :param test_row: single list of features to test; new data
        :return:
        For each feature (x) in the test_row:
            1. Calculate the Normal Probability using the Normal PDF N(x; µ, σ). eg = P(feature | class)
            2. Calculate the Likelihood by taking the product of all Normal Probabilities
            3. Multiply the Likelihood by the prior to calculate the Joint Probability.

        E.g.
        prior_prob: P(setosa)
        likelihood: P(sepal length | setosa) * P(sepal width | setosa) * P(petal length | setosa) * P(petal width | setosa)
        joint_prob: prior_prob * likelihood
        marginal_prob: predictor prior probability
        posterior_prob = joint_prob / marginal_prob
        returning a dictionary mapping of each class to its posterior probability
        """
        posterior_probs = {}
        joint_probabilities = self.joint_probabilities(test_row)
        marginal_prob = self.marginal_pdf(joint_probabilities)
        for target, joint_prob in joint_probabilities.iteritems():
            posterior_probs[target] = joint_prob / marginal_prob
        return posterior_probs

    def get_map(self, test_row):
        """
        :param test_row: single list of features to test; new data
        :return: Return the target class with the largest/best posterior probability
        """
        posterior_probs = self.posterior_probabilities(test_row)
        map_prob = max(posterior_probs, key=posterior_probs.get)
        return map_prob

    def predict(self, test_set):
        """
        :param test_set: list of features to test on
        :return:
        Predict the likeliest target for each row of the test_set.
        Return a list of predicted targets.
        """
        map_probs = []
        for row in test_set:
            map_prob = self.get_map(row)
            map_probs.append(map_prob)
        return map_probs

    def accuracy(self, test_set, predicted):
        """
        :param test_set: list of test_data
        :param predicted: list of predicted classes
        :return: Calculate the average performance of the classifier.
        """
        correct = 0
        actual = [item[-1] for item in test_set]
        for x, y in zip(actual, predicted):
            if x == y:
                correct += 1
        return correct / float(len(test_set))


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    predicted = nb.predict(test_list)
    accuracy = nb.accuracy(test_list, predicted)
    print 'Accuracy: %.3f' % accuracy

if __name__ == '__main__':
    main()
```