Building a Naive Bayes classifier using Python with drawings.
We will translate each part of the Gauss Naive Bayes into Python code and explain the logic behind its methods.
The Complete Code can be found at the bottom of this page or in nb_tutorial.py.
The Overview will be just that, an overview and a soft introduction to Naive Bayes. Stay with us! Preparing Data is where the excitement begins.
We will use Naive Bayes and the Gaussian Distribution (Normal Distribution) to build a classifier that predicts flower species based on petal and sepal features.
We will be working with the iris data set, a collection of 4-dimensional features that define 3 different flower species.
The Iris data set is a classic and is widely used when explaining classification models. It has 150 instances: 4 independent variables (features) and 1 dependent variable (class) that takes one of 3 values.
The first 4 columns are the independent variables (features).
The 5th column is the dependent variable (class).
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
- class:
  - Iris Setosa
  - Iris Versicolour
  - Iris Virginica
sepal length | sepal width | petal length | petal width | class |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
6.3 | 2.8 | 5.1 | 1.5 | Iris-virginica |
6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
Naive Bayes, more technically referred to as the Posterior Probability, updates the prior belief of an event given new information. The result is the probability of the class occurring given the new data.
The classification model can handle binary and multiclass problems.
When predicting a class, the model calculates the posterior probability for every class and selects the largest posterior probability as the predicted class. This value is referred to as the Maximum A Posteriori (MAP). A tiny worked example follows the definitions below.
Posterior Probability:
- This is the updated belief given the new data, and the objective probability of each class, derived from the Naive Bayes technique.
Class Prior Probability:
- This is the Prior Belief; the probability of the class before updating the belief.
Likelihood:
- Likelihood is calculated by taking the product of all Normal Probability Density Functions (we assume the features are independent, ergo the "naivete"). The Normal PDF is calculated using the Gaussian Distribution; hence the name Gauss Naive Bayes.
- We will use the Normal PDF to calculate the Normal Probability value for each feature given the class.
- Likelihood is the product of all Normal PDF values.
- There's an important distinction to keep in mind between Likelihood and Probability.
- A Normal Probability value is a density, calculated for each feature given the class. Unlike a true probability, a density can exceed 1; we'll see exactly that when we run normal_pdf() later in the tutorial.
- Likelihood is the product of all Normal Probability values.
- The set of possible features is open-ended, limited only by our imagination.
- Since there could always be more features, the product of all Normal Probability values is not a probability but the Likelihood.
Predictor Prior Probability:
- Predictor Prior Probability is another way of saying Marginal Probability.
- It is the total probability of observing the new data across all classes.
- The Naive Bayes classifier doesn't strictly need it, because we're only looking for the best prediction, not the exact probability.
- The predicted class does not change either way; we calculate it here to show that this is the case.
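To make these four pieces concrete, here's a minimal sketch with made-up numbers: a toy two-class problem, not the iris data, and the likelihood values are hypothetical.

```python
# A toy two-class example with made-up numbers (not the iris data).
# Class priors: what we believe before seeing the new data.
prior = {'A': 0.6, 'B': 0.4}

# Likelihoods: hypothetical P(new data | class) values,
# e.g. products of Normal PDF values.
likelihood = {'A': 0.02, 'B': 0.09}

# Joint probability = prior * likelihood, per class.
joint = {c: prior[c] * likelihood[c] for c in prior}   # {'A': 0.012, 'B': 0.036}

# Marginal (predictor prior) probability = sum of all joints.
marginal = sum(joint.values())                         # 0.048

# Posterior = joint / marginal; the posteriors now sum to 1.
posterior = {c: joint[c] / marginal for c in joint}    # {'A': 0.25, 'B': 0.75}

# MAP: pick the class with the largest posterior -> 'B'.
print max(posterior, key=posterior.get)
```

Note that 'B' would win even without dividing by the marginal, which is why the classifier doesn't strictly need that step.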
See the Normal Distribution (Wikipedia) definition.
The Normal Distribution will determine the Normal Probability value for each new feature given the class. The product of those Normal Probability values is the likelihood of the class occurring given the new features.
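Written out, the Normal PDF, which we'll implement later as normal_pdf(), is:

```
N(x; µ, σ) = 1 / (σ * √(2π)) * e^( -(x - µ)² / (2σ²) )
```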
Building the Naive Bayes Classifier.
Here, we'll create the structure and the methods to read and prepare data for modeling.
Every function is created from scratch. However, instead of downloading the data, we use an API call to fetch it.
$ pip install requests
In some sections you'll see "Click to expand". Click it to view the Python code.
Import the necessary libraries and create the GaussNB class. This will be the foundation for the rest of the code.
Click to expand GaussNB Skeleton.
```python
# -*- coding: utf-8 -*-
from collections import defaultdict
from math import pi
from math import e
import requests
import random
import csv
import re


class GaussNB:

    def __init__(self):
        pass


def main():
    print "Here we will handle class methods."

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Here we will handle class methods.
```
Read in the raw data and convert each numeric string into a float.
Click to expand load_csv().
```python
class GaussNB:

    def __init__(self):
        pass

    def load_csv(self, data, header=False):
        """
        :param data: raw comma separated file
        :param header: remove header if it exists
        :return: Load and convert each string of data into a float
        """
        lines = csv.reader(data.splitlines())
        dataset = list(lines)
        if header:
            # remove header
            dataset = dataset[1:]
        for i in range(len(dataset)):
            dataset[i] = [float(x) if re.search('\d', x) else x for x in dataset[i]]
        return dataset


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    print data[:3]  # first 3 rows

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
[[4.9, 3.0, 1.4, 0.2, 'Iris-setosa'], [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'], [4.6, 3.1, 1.5, 0.2, 'Iris-setosa']]
```
Split the data into a `train_set` and a `test_set`. The weight determines how much of the data goes into the `train_set`.
Click to expand split_data().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def split_data(self, data, weight):
        """
        :param data: original data set
        :param weight: percentage of rows that'll be used for training
        :return:
        Randomly selects rows for training according to the weight and uses the rest of the rows for testing.
        """
        train_size = int(len(data) * weight)
        train_set = []
        for i in range(train_size):
            index = random.randrange(len(data))
            train_set.append(data[index])
            data.pop(index)
        return [train_set, data]


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
```
Group the data according to class by mapping each class to individual instances.
Take this table
sepal length | sepal width | petal length | petal width | class |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
6.3 | 2.8 | 5.1 | 1.5 | Iris-virginica |
6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
and turn it into this map
```python
{
    'Iris-virginica': [
        [6.3, 2.8, 5.1, 1.5],
    ],
    'Iris-setosa': [
        [5.1, 3.5, 1.4, 0.2],
        [4.9, 3.0, 1.4, 0.2],
    ],
    'Iris-versicolor': [
        [7.0, 3.2, 4.7, 1.4],
        [6.4, 3.2, 4.5, 1.5],
    ]
}
```
Click to expand group_by_class().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def group_by_class(self, data, target):
        """
        :param data: Training set. Lists of events (rows) in a list
        :param target: Index for the target column. Usually the last index in the list
        :return: Mapping of each target class to a list of its features
        """
        target_map = defaultdict(list)
        for index in range(len(data)):
            features = data[index]
            if not features:
                continue
            x = features[target]
            target_map[x].append(features[:-1])  # designating the last column as the class column
        return dict(target_map)


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
```
Prepare the data for modeling. Calculate the descriptive statistics that will later be used in the model.
Calculate the mean of `[5.9, 3.0, 5.1, 1.8]`.
Click to expand mean().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def mean(self, numbers):
        """
        :param numbers: list of numbers
        :return: the arithmetic mean
        """
        result = sum(numbers) / float(len(numbers))
        return result


def main():
    nb = GaussNB()
    print "Mean: %s" % nb.mean([5.9, 3.0, 5.1, 1.8])

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Mean: 3.95
```
Calculate the standard deviation of `[5.9, 3.0, 5.1, 1.8]`.
Click to expand stdev().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def stdev(self, numbers):
        """
        :param numbers: list of numbers
        :return: Calculate the standard deviation for a list of numbers.
        """
        avg = self.mean(numbers)
        squared_diff_list = []
        for num in numbers:
            squared_diff = (num - avg) ** 2
            squared_diff_list.append(squared_diff)
        squared_diff_sum = sum(squared_diff_list)
        sample_n = float(len(numbers) - 1)
        var = squared_diff_sum / sample_n
        return var ** .5


def main():
    nb = GaussNB()
    print "Standard Deviation: %s" % nb.stdev([5.9, 3.0, 5.1, 1.8])

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Standard Deviation: 1.88414436814
```
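As a quick check of that output: the mean of [5.9, 3.0, 5.1, 1.8] is 3.95 (from the previous step), and stdev() divides by n - 1, the sample standard deviation:

```
((5.9 - 3.95)² + (3.0 - 3.95)² + (5.1 - 3.95)² + (1.8 - 3.95)²) / (4 - 1)
  = (3.8025 + 0.9025 + 1.3225 + 4.6225) / 3
  = 3.55
√3.55 ≈ 1.8841
```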
Return the (mean, standard deviation) combination for each feature of the `train_set`. The mean and the standard deviation will be used when calculating the Normal Probability values for each feature of the `test_set`.
Click to expand summarize().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def summarize(self, test_set):
        """
        :param test_set: lists of features
        :return:
        Use zip to line up each feature into a single column across multiple lists.
        yield the mean and the stdev for each feature.
        """
        for feature in zip(*test_set):
            yield {
                'stdev': self.stdev(feature),
                'mean': self.mean(feature)
            }


def main():
    nb = GaussNB()
    data = [
        [5.9, 3.0, 5.1, 1.8],
        [5.1, 3.5, 1.4, 0.2]
    ]
    print "Feature Summary: %s" % [i for i in nb.summarize(data)]

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Feature Summary:
[{'mean': 5.5, 'stdev': 0.5656854249492386},   # sepal length
 {'mean': 3.25, 'stdev': 0.3535533905932738},  # sepal width
 {'mean': 3.25, 'stdev': 2.6162950903902256},  # petal length
 {'mean': 1.0, 'stdev': 1.1313708498984762}]   # petal width
```
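The `zip(*test_set)` trick transposes rows into per-feature columns, which is what lets summarize() compute one (mean, stdev) pair per feature. A quick illustration with the same two rows:

```python
rows = [[5.9, 3.0, 5.1, 1.8],
        [5.1, 3.5, 1.4, 0.2]]
# Each tuple holds one feature's values across all rows.
print zip(*rows)  # [(5.9, 5.1), (3.0, 3.5), (5.1, 1.4), (1.8, 0.2)]
```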
Building the class methods for calculating Bayes Theorem:
(Diagrams: features and class; Bayes tree diagram, using Iris-setosa as an example.)
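In the terms defined in the Overview, the formula we're about to assemble, one method at a time, is:

```
                      P(class) * P(features | class)      prior_prob * likelihood
P(class | features) = ------------------------------  =  -------------------------
                              P(features)                      marginal_prob
```

prior_prob() computes the prior, normal_pdf() supplies the per-feature values whose product is the likelihood, marginal_pdf() computes the denominator, and posterior_probabilities() puts the pieces together.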
Prior Probability is what we know about each class before considering the new data: the probability of each class occurring. For example, if 30 of 100 training rows are Iris-setosa, then P(Iris-setosa) = 0.3.
Click to expand prior_prob().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def prior_prob(self, group, target, data):
        """
        :param group: mapping of each class to its list of feature rows
        :param target: the target class
        :param data: the full data set
        :return: The probability of each target class
        """
        total = float(len(data))
        result = len(group[target]) / total
        return result


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    for target_class in ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']:
        prior_prob = nb.prior_prob(group, target_class, data)
        print 'P(%s): %s' % (target_class, prior_prob)

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
P(Iris-virginica): 0.38
P(Iris-setosa): 0.3
P(Iris-versicolor): 0.32
```
This is where we learn from the train set, by calculating the mean and the standard deviation.
Using the grouped classes, calculate the (mean, standard deviation) combination for each feature of each class.
The calculations will later use the (mean, standard deviation) of each feature to calculate class likelihoods.
Click to expand train().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def train(self, train_list, target):
        """
        :param train_list: list of rows to train on
        :param target: target class
        :return:
        For each target:
            1. yield prior_prob: the probability of each class. P(class) eg P(Iris-virginica)
            2. yield summary: list of {'mean': 0.0, 'stdev': 0.0}
        """
        group = self.group_by_class(train_list, target)
        self.summaries = {}
        for target, features in group.iteritems():
            self.summaries[target] = {
                'prior_prob': self.prior_prob(group, target, train_list),
                'summary': [i for i in self.summarize(features)],
            }
        return self.summaries


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    print nb.train(train_list, -1)

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
{'Iris-setosa': {'prior_prob': 0.3,
                 'summary': [{'mean': 4.980000000000001, 'stdev': 0.34680810554104063},     # sepal length
                             {'mean': 3.406666666666667, 'stdev': 0.3016430104397023},      # sepal width
                             {'mean': 1.496666666666667, 'stdev': 0.20254132542705236},     # petal length
                             {'mean': 0.24333333333333343, 'stdev': 0.12228664272317624}]},  # petal width
 'Iris-versicolor': {'prior_prob': 0.31,
                     'summary': [{'mean': 5.96774193548387, 'stdev': 0.4430102307127106},
                                 {'mean': 2.7903225806451615, 'stdev': 0.28560443356698495},
                                 {'mean': 4.303225806451613, 'stdev': 0.41990782398659987},
                                 {'mean': 1.3451612903225807, 'stdev': 0.17289439874755796}]},
 'Iris-virginica': {'prior_prob': 0.39,
                    'summary': [{'mean': 6.679487179487178, 'stdev': 0.585877428882027},
                                {'mean': 3.002564102564103, 'stdev': 0.34602036712733625},
                                {'mean': 5.643589743589742, 'stdev': 0.5215336048086158},
                                {'mean': 2.0487179487179477, 'stdev': 0.2927831916298213}]}}
```
Likelihood is calculated by taking the product of all Normal Probability values.
For each feature given the class, we calculate the Normal Probability using the Normal Distribution. Keep in mind these are density values, not true probabilities, so a single one can exceed 1, as the output below shows.
Click to expand normal_pdf().
```python
from math import e, pi


class GaussNB:
    # ... methods from the previous steps ...

    def normal_pdf(self, x, mean, stdev):
        """
        :param x: a variable
        :param mean: µ - the expected value or average from M samples
        :param stdev: σ - standard deviation
        :return: Gaussian (Normal) Density function.
        N(x; µ, σ) = (1 / (σ√(2π))) * e^(-(x-µ)² / (2σ²))
        """
        variance = stdev ** 2
        exp_squared_diff = (x - mean) ** 2
        exp_power = -exp_squared_diff / (2 * variance)
        exponent = e ** exp_power
        denominator = ((2 * pi) ** .5) * stdev
        normal_prob = exponent / denominator
        return normal_prob


def main():
    nb = GaussNB()
    normal_prob = nb.normal_pdf(5, 4.98, 0.35)
    print normal_prob

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
1.13797564994
```
Joint Probability is calculated by taking the product of the Prior Probability and the Likelihood.
For each class:
- Calculate the Prior Probability.
- Use the Normal Distribution to calculate the Normal Probability of each feature, e.g. N(x; µ, σ), and take the product of those values to get the Likelihood.
- Take the product of the Prior Probability and the Likelihood.
- Return one Joint Probability value for each class given the new data.
Click to expand joint_probabilities().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def joint_probabilities(self, test_row):
        """
        :param test_row: single list of features to test; new data
        :return:
        Use the normal_pdf(self, x, mean, stdev) to calculate the Normal Probability for each feature
        Take the product of all Normal Probabilities and the Prior Probability.
        """
        joint_probs = {}
        for target, features in self.summaries.iteritems():
            total_features = len(features['summary'])
            likelihood = 1
            for index in range(total_features):
                feature = test_row[index]
                mean = features['summary'][index]['mean']
                stdev = features['summary'][index]['stdev']
                normal_prob = self.normal_pdf(feature, mean, stdev)
                likelihood *= normal_prob
            prior_prob = features['prior_prob']
            joint_probs[target] = prior_prob * likelihood
        return joint_probs


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    print nb.joint_probabilities([5.0, 4.98, 0.35, 4.0])

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
{'Iris-virginica': 7.880001356130214e-38,
 'Iris-setosa': 9.616469451152855e-230,
 'Iris-versicolor': 6.125801208117717e-68}
```
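An aside: the joint probabilities above are astronomically small (note the e-230), because we multiply several densities together. The tutorial's code works with raw products, which is fine here, but a common guard against floating-point underflow is to sum logs instead. A minimal sketch, with hypothetical per-feature values that are not from the tutorial's code:

```python
from math import log

# Hypothetical normal probabilities for one class's four features.
normal_probs = [1.1379, 0.0003, 0.0000021, 0.45]
prior_prob = 0.3

# Instead of prior * p1 * p2 * ..., sum the logs; the class with the
# largest log-joint is the same class that wins on raw products.
log_joint = log(prior_prob) + sum(log(p) for p in normal_probs)
print log_joint
```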
Calculate the total sum of all joint probabilities.
The Marginal Probability is the sum of all joint probabilities. It is a single value, the same across all classes. We can think of it as the total probability of the new data occurring under all classes.
Reminder: we're looking to predict the class by choosing the Maximum A Posteriori (MAP) probability. The prediction doesn't care about the exact posterior probability of each class, and dividing every class's joint probability by the same constant adds computation without changing which class is largest.
For the purposes of sticking to the true Bayes Theorem, we use it here.
Click to expand marginal_pdf().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def marginal_pdf(self, joint_probabilities):
        """
        :param joint_probabilities: mapping of each class to its joint probability
        :return: Marginal Probability Density Function (Predictor Prior Probability)
        Joint Probability = prior * likelihood
        Marginal Probability is the sum of all joint probabilities for all classes.
        marginal_pdf =
          [P(setosa) * P(sepal length | setosa) * P(sepal width | setosa) * P(petal length | setosa) * P(petal width | setosa)]
        + [P(versicolour) * P(sepal length | versicolour) * P(sepal width | versicolour) * P(petal length | versicolour) * P(petal width | versicolour)]
        + [P(virginica) * P(sepal length | virginica) * P(sepal width | virginica) * P(petal length | virginica) * P(petal width | virginica)]
        """
        marginal_prob = sum(joint_probabilities.values())
        return marginal_prob


def main():
    nb = GaussNB()
    joint_probs = {
        'Iris-setosa': 1.2904413965468937,
        'Iris-versicolor': 5.414630046086964e-14,
        'Iris-virginica': 7.087518912297627e-30
    }
    marginal_prob = nb.marginal_pdf(joint_probs)
    print 'Marginal Probability: %s' % marginal_prob

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Marginal Probability: 1.29044139655
```
The Posterior Probability is the probability of a class occurring given the new data, and is calculated for each class.
This is where all of the preceding class methods tie together to compute the Gauss Naive Bayes formula, with the goal of selecting the MAP.
Click to expand posterior_probabilities().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def posterior_probabilities(self, test_row):
        """
        :param test_row: single list of features to test; new data
        :return:
        For each feature (x) in the test_row:
            1. Calculate the Normal Probability using the Normal PDF N(x; µ, σ). eg = P(feature | class)
            2. Calculate the Likelihood by taking the product of all Normal Probabilities
            3. Multiply the Likelihood by the prior to calculate the Joint Probability.

        E.g.
        prior_prob: P(setosa)
        likelihood: P(sepal length | setosa) * P(sepal width | setosa) * P(petal length | setosa) * P(petal width | setosa)
        joint_prob: prior_prob * likelihood
        marginal_prob: predictor prior probability
        posterior_prob = joint_prob / marginal_prob
        returning a dictionary mapping of each class to its posterior probability
        """
        posterior_probs = {}
        joint_probabilities = self.joint_probabilities(test_row)
        marginal_prob = self.marginal_pdf(joint_probabilities)
        for target, joint_prob in joint_probabilities.iteritems():
            posterior_probs[target] = joint_prob / marginal_prob
        return posterior_probs


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    posterior_probs = nb.posterior_probabilities([6.3, 2.8, 5.1, 1.5])
    print "Posterior Probabilities: %s" % posterior_probs

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
Posterior Probabilities: {'Iris-virginica': 0.32379024365947745,
                          'Iris-setosa': 2.5693999408505845e-158,
                          'Iris-versicolor': 0.6762097563405226}
```
Testing the model and predicting a class given the new data.
The `get_map()` method calls the `posterior_probabilities()` method on a single `test_row` (e.g. `[6.3, 2.8, 5.1, 1.5]`).
For each `test_row` we calculate 3 posterior probabilities, one for each class. The goal is to select the MAP, the Maximum A Posteriori probability.
The `get_map()` method simply chooses the largest posterior probability and returns the associated class for the given `test_row`.
Click to expand get_map().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def get_map(self, test_row):
        """
        :param test_row: single list of features to test
        :return: Return the target class with the largest/best posterior probability
        """
        posterior_probs = self.posterior_probabilities(test_row)
        map_prob = max(posterior_probs, key=posterior_probs.get)
        return map_prob


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    prediction = nb.get_map([6.3, 2.8, 5.1, 1.5])  # true class: 'Iris-virginica'
    print 'According to the test row the best prediction is: %s' % prediction

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
According to the test row the best prediction is: Iris-versicolor
```
This method will return a prediction for each test_row.
Example input, a list of lists:

```python
[
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
]
```

To test this method, we'll use the sample data from the table above.
Click to expand predict().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def predict(self, test_set):
        """
        :param test_set: list of features to test on
        :return:
        Predict the likeliest target for each row of the test_set.
        Return a list of predicted targets.
        """
        map_probs = []
        for row in test_set:
            map_prob = self.get_map(row)
            map_probs.append(map_prob)
        return map_probs


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    test = {
        'Iris-virginica': [
            [6.3, 2.8, 5.1, 1.5],
        ],
        'Iris-setosa': [
            [5.1, 3.5, 1.4, 0.2],
            [4.9, 3.0, 1.4, 0.2],
        ],
        'Iris-versicolor': [
            [7.0, 3.2, 4.7, 1.4],
            [6.4, 3.2, 4.5, 1.5],
        ]
    }
    for target, features in test.iteritems():
        predicted = nb.predict(features)
        print 'predicted target: %s | true target: %s' % (predicted, target)

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
predicted target: ['Iris-versicolor'] | true target: Iris-virginica
predicted target: ['Iris-setosa', 'Iris-setosa'] | true target: Iris-setosa              # both test rows were predicted to be setosa
predicted target: ['Iris-versicolor', 'Iris-versicolor'] | true target: Iris-versicolor  # both test rows were predicted to be versicolor
```
Accuracy tests the performance of the model by taking the total number of correct predictions and dividing it by the total number of predictions. This is critical in understanding the veracity of the model.
Click to expand accuracy().
```python
class GaussNB:
    # ... methods from the previous steps ...

    def accuracy(self, test_set, predicted):
        """
        :param test_set: list of test_data
        :param predicted: list of predicted classes
        :return: Calculate the average performance of the classifier.
        """
        correct = 0
        actual = [item[-1] for item in test_set]
        for x, y in zip(actual, predicted):
            if x == y:
                correct += 1
        return correct / float(len(test_set))


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    predicted = nb.predict(test_list)
    accuracy = nb.accuracy(test_list, predicted)
    print 'Accuracy: %.3f' % accuracy

if __name__ == '__main__':
    main()
```
```
$ python nb_tutorial.py
Using 100 rows for training and 50 rows for testing
Grouped into 3 classes: ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
Accuracy: 0.960
```
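If you happen to have scikit-learn installed (it is not a dependency of this tutorial), you can sanity-check the from-scratch classifier against sklearn's GaussianNB. A rough sketch, assuming it is dropped into main() after split_data() so that train_list and test_list are in scope:

```python
# Optional sanity check against scikit-learn; not part of this tutorial's code.
from sklearn.naive_bayes import GaussianNB

# Separate features from the class label in the last column.
X_train = [row[:-1] for row in train_list]
y_train = [row[-1] for row in train_list]
X_test = [row[:-1] for row in test_list]
y_test = [row[-1] for row in test_list]

clf = GaussianNB()
clf.fit(X_train, y_train)
print 'sklearn accuracy: %.3f' % clf.score(X_test, y_test)
```

The two accuracies should land in the same neighborhood on the same split, since both implement Gaussian Naive Bayes.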
The Naive Bayes classification model makes some strong assumptions: all of the features are assumed to be independent when calculating the likelihood (hence "Naive"), and the likelihood is calculated using the Gaussian (Normal) Distribution, with every feature assumed to be normally distributed (hence "Gauss").
Overlooking Gauss NB's strong assumptions, the classifier is fast and accurate. It does not require a lot of data to perform well and is highly scalable.
You can find the Complete Code below.
The initial build of the Gauss Naive Bayes classifier can run on four classic data sets; you can find that code in gauss_nb.py.
- Oleh Dubno (github: odubno) - code and images.
- Danny Argov (github: datargov) - wording and logic of text.
See the list of contributors who participated in this project.
Tip of the hat to the authors who made this tutorial possible.
Author | Resource |
---|---|
Dr. Jason Brownlee | How To Implement Naive Bayes From Scratch in Python |
Chris Albon | Naive Bayes Classifier From Scratch |
Sunil Ray | 6 Easy Steps to Learn Naive Bayes Algorithm |
Rahul Saxena | How The Naive Bayes Classifier Works In Machine Learning |
Data Source | UCI Machine Learning |
C. Randy Gallistel | Bayes for Beginners: Probability and Likelihood |
Project for Columbia University Probability and Statistics course - Prof. Banu Baydil
With the massive popularity of Bayes' Theorem, and the default use of Gaussian/Normal distributions for common data sets, we were keen to understand, first, whether the assumption of a Normal distribution fits differing data sets, and second, how to apply the Normal distribution and Bayes' Theorem in repeatable code. Initially, we were intrigued by the idea of pulling public data from sites such as OkCupid or Facebook to produce interesting "predictions" of human behavior, but it dawned on us that we should first master the fundamentals: knowing which statistical distribution to use for a given data set, and understanding how Bayes' Theorem works in practice. So we decided to build the rungs of the ladder first, figuratively speaking.
The code is also available in nb_tutorial.py.
Click to expand nb_tutorial.py
```python
# -*- coding: utf-8 -*-
from collections import defaultdict
from math import pi
from math import e
import requests
import random
import csv
import re


class GaussNB:

    def __init__(self):
        pass

    def load_csv(self, data, header=False):
        """
        :param data: raw comma separated file
        :param header: remove header if it exists
        :return: Load and convert each string of data into a float
        """
        lines = csv.reader(data.splitlines())
        dataset = list(lines)
        if header:
            # remove header
            dataset = dataset[1:]
        for i in range(len(dataset)):
            dataset[i] = [float(x) if re.search('\d', x) else x for x in dataset[i]]
        return dataset

    def split_data(self, data, weight):
        """
        :param data: original data set
        :param weight: percentage of rows that'll be used for training
        :return:
        Randomly selects rows for training according to the weight and uses the rest of the rows for testing.
        """
        train_size = int(len(data) * weight)
        train_set = []
        for i in range(train_size):
            index = random.randrange(len(data))
            train_set.append(data[index])
            data.pop(index)
        return [train_set, data]

    def group_by_class(self, data, target):
        """
        :param data: Training set. Lists of events (rows) in a list
        :param target: Index for the target column. Usually the last index in the list
        :return: Mapping of each target class to a list of its features
        """
        target_map = defaultdict(list)
        for index in range(len(data)):
            features = data[index]
            if not features:
                continue
            x = features[target]
            target_map[x].append(features[:-1])  # designating the last column as the class column
        return dict(target_map)

    def mean(self, numbers):
        """
        :param numbers: list of numbers
        :return: the arithmetic mean
        """
        result = sum(numbers) / float(len(numbers))
        return result

    def stdev(self, numbers):
        """
        :param numbers: list of numbers
        :return: Calculate the standard deviation for a list of numbers.
        """
        avg = self.mean(numbers)
        squared_diff_list = []
        for num in numbers:
            squared_diff = (num - avg) ** 2
            squared_diff_list.append(squared_diff)
        squared_diff_sum = sum(squared_diff_list)
        sample_n = float(len(numbers) - 1)
        var = squared_diff_sum / sample_n
        return var ** .5

    def summarize(self, test_set):
        """
        :param test_set: lists of features
        :return:
        Use zip to line up each feature into a single column across multiple lists.
        yield the mean and the stdev for each feature.
        """
        for feature in zip(*test_set):
            yield {
                'stdev': self.stdev(feature),
                'mean': self.mean(feature)
            }

    def prior_prob(self, group, target, data):
        """
        :param group: mapping of each class to its list of feature rows
        :param target: the target class
        :param data: the full data set
        :return: The probability of each target class
        """
        total = float(len(data))
        result = len(group[target]) / total
        return result

    def train(self, train_list, target):
        """
        :param train_list: list of rows to train on
        :param target: target class
        :return:
        For each target:
            1. yield prior_prob: the probability of each class. P(class) eg P(Iris-virginica)
            2. yield summary: list of {'mean': 0.0, 'stdev': 0.0}
        """
        group = self.group_by_class(train_list, target)
        self.summaries = {}
        for target, features in group.iteritems():
            self.summaries[target] = {
                'prior_prob': self.prior_prob(group, target, train_list),
                'summary': [i for i in self.summarize(features)],
            }
        return self.summaries

    def normal_pdf(self, x, mean, stdev):
        """
        :param x: a variable
        :param mean: µ - the expected value or average from M samples
        :param stdev: σ - standard deviation
        :return: Gaussian (Normal) Density function.
        N(x; µ, σ) = (1 / (σ√(2π))) * e^(-(x-µ)² / (2σ²))
        """
        variance = stdev ** 2
        exp_squared_diff = (x - mean) ** 2
        exp_power = -exp_squared_diff / (2 * variance)
        exponent = e ** exp_power
        denominator = ((2 * pi) ** .5) * stdev
        normal_prob = exponent / denominator
        return normal_prob

    def marginal_pdf(self, joint_probabilities):
        """
        :param joint_probabilities: mapping of each class to its joint probability
        :return: Marginal Probability Density Function (Predictor Prior Probability)
        Joint Probability = prior * likelihood
        Marginal Probability is the sum of all joint probabilities for all classes.
        marginal_pdf =
          [P(setosa) * P(sepal length | setosa) * P(sepal width | setosa) * P(petal length | setosa) * P(petal width | setosa)]
        + [P(versicolour) * P(sepal length | versicolour) * P(sepal width | versicolour) * P(petal length | versicolour) * P(petal width | versicolour)]
        + [P(virginica) * P(sepal length | virginica) * P(sepal width | virginica) * P(petal length | virginica) * P(petal width | virginica)]
        """
        marginal_prob = sum(joint_probabilities.values())
        return marginal_prob

    def joint_probabilities(self, test_row):
        """
        :param test_row: single list of features to test; new data
        :return:
        Use the normal_pdf(self, x, mean, stdev) to calculate the Normal Probability for each feature
        Take the product of all Normal Probabilities and the Prior Probability.
        """
        joint_probs = {}
        for target, features in self.summaries.iteritems():
            total_features = len(features['summary'])
            likelihood = 1
            for index in range(total_features):
                feature = test_row[index]
                mean = features['summary'][index]['mean']
                stdev = features['summary'][index]['stdev']
                normal_prob = self.normal_pdf(feature, mean, stdev)
                likelihood *= normal_prob
            prior_prob = features['prior_prob']
            joint_probs[target] = prior_prob * likelihood
        return joint_probs

    def posterior_probabilities(self, test_row):
        """
        :param test_row: single list of features to test; new data
        :return:
        For each feature (x) in the test_row:
            1. Calculate the Normal Probability using the Normal PDF N(x; µ, σ). eg = P(feature | class)
            2. Calculate the Likelihood by taking the product of all Normal Probabilities
            3. Multiply the Likelihood by the prior to calculate the Joint Probability.

        E.g.
        prior_prob: P(setosa)
        likelihood: P(sepal length | setosa) * P(sepal width | setosa) * P(petal length | setosa) * P(petal width | setosa)
        joint_prob: prior_prob * likelihood
        marginal_prob: predictor prior probability
        posterior_prob = joint_prob / marginal_prob
        returning a dictionary mapping of each class to its posterior probability
        """
        posterior_probs = {}
        joint_probabilities = self.joint_probabilities(test_row)
        marginal_prob = self.marginal_pdf(joint_probabilities)
        for target, joint_prob in joint_probabilities.iteritems():
            posterior_probs[target] = joint_prob / marginal_prob
        return posterior_probs

    def get_map(self, test_row):
        """
        :param test_row: single list of features to test; new data
        :return: Return the target class with the largest/best posterior probability
        """
        posterior_probs = self.posterior_probabilities(test_row)
        map_prob = max(posterior_probs, key=posterior_probs.get)
        return map_prob

    def predict(self, test_set):
        """
        :param test_set: list of features to test on
        :return:
        Predict the likeliest target for each row of the test_set.
        Return a list of predicted targets.
        """
        map_probs = []
        for row in test_set:
            map_prob = self.get_map(row)
            map_probs.append(map_prob)
        return map_probs

    def accuracy(self, test_set, predicted):
        """
        :param test_set: list of test_data
        :param predicted: list of predicted classes
        :return: Calculate the average performance of the classifier.
        """
        correct = 0
        actual = [item[-1] for item in test_set]
        for x, y in zip(actual, predicted):
            if x == y:
                correct += 1
        return correct / float(len(test_set))


def main():
    nb = GaussNB()
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    data = requests.get(url).content
    data = nb.load_csv(data, header=True)
    train_list, test_list = nb.split_data(data, weight=.67)
    print "Using %s rows for training and %s rows for testing" % (len(train_list), len(test_list))
    group = nb.group_by_class(data, -1)  # designating the last column as the class column
    print "Grouped into %s classes: %s" % (len(group.keys()), group.keys())
    nb.train(train_list, -1)
    predicted = nb.predict(test_list)
    accuracy = nb.accuracy(test_list, predicted)
    print 'Accuracy: %.3f' % accuracy

if __name__ == '__main__':
    main()
```