KNN algorithm implemented in Python

Question 1

This is the first time I have tried to write some code in Python. I think it gives proper answers, but probably some "vectorization" is needed.

    import numpy as np
    import math
    import operator

    data = np.genfromtxt("KNNdata.csv", delimiter = ',', skip_header = 1)
    data = data[:,2:]
    np.random.shuffle(data)
    X = data[:, range(5)]
    Y = data[:, 5]

    def distance(instance1, instance2):
        dist = 0.0
        for i in range(len(instance1)):
            dist += pow((instance1[i] - instance2[i]), 2)
        return math.sqrt(dist)

    # Calculating distances between all data, return sorted k-elements list (whole element and output)
    def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
        distances = []
        for i in range(len(trainingSetX)):
            dist = distance(testInstance, trainingSetX[i])
            distances.append((trainingSetX[i], dist, trainingSetY[i]))
        distances.sort(key=operator.itemgetter(1))
        neighbour = []
        for elem in range(k):
            neighbour.append((distances[elem][0], distances[elem][2]))
        return neighbour

    # return answer
    def getResponse(neighbors):
        classVotes = {}
        for x in range(len(neighbors)):
            response = int(neighbors[x][-1])
            if response in classVotes:
                classVotes[response] += 1
            else:
                classVotes[response] = 1
        sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse = True)
        return sortedVotes[0][0]

    # return accuracy, your predictions and actual values
    def getAccuracy(testSetY, predictions):
        correct = 0
        for x in range(len(predictions)):
            if testSetY[x] == predictions[x]:
                correct += 1
        return (correct / (len(predictions))) * 100.0

    def start():
        trainingSetX = X[:2000]
        trainingSetY = Y[:2000]
        testSetX = X[2000:]
        testSetY = Y[2000:]
        # generate predictions
        predictions = []
        k = 4
        for x in range(len(testSetX)):
            neighbors = getNeighbors(trainingSetX, trainingSetY, testSetX[x], k)
            result = getResponse(neighbors)
            predictions.append(result)
        accuracy = getAccuracy(testSetY, predictions)
        print('Accuracy: ' + str(accuracy))

    start()

Answer

First, a style nitpick: Python has an official style guide, PEP8, which recommends using lower_case_with_underscores for variable and function names instead of camelCase.

Second, the comments you have above your functions should become docstrings. These appear, for example, when calling help(your_function) in an interactive session. Just put a string as the first line below the function header, like so:

    def f(a, b):
        """Returns the sum of `a` and `b`"""
        return a + b

It is recommended to always use triple double-quotes (i.e. """).
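As a small illustration of the docstring point, here is a sketch with a hypothetical function euclidean_distance (the name and the snake_case spelling are mine, not from the question). The docstring is stored in the function's __doc__ attribute, which is where help() reads it from.

```python
import math


def euclidean_distance(instance1, instance2):
    """Return the Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(instance1, instance2)))


# The docstring lives on the function object; help() displays this text.
print(euclidean_distance.__doc__)
print(euclidean_distance([0, 0], [3, 4]))
```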

Now I am going to focus on the distance calculation.

First, you can greatly simplify your getNeighbors function using list comprehensions:

    def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
        distances = sorted((distance(testInstance, x), x, y)
                           for x, y in zip(trainingSetX, trainingSetY))
        return [(d[1], d[2]) for d in distances[:k]]

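Here I used the fact that tuples already sort naturally, by first comparing the first index, then (if they are equal) the second, and so on. So I put the distance as the first index of the tuple and you don't need the key function any longer. sorted can take a generator expression and sort it directly. We can also iterate over multiple iterables at the same time using zip. One caveat: if two distances are exactly equal, the comparison falls through to the numpy arrays in the second position, which raises a ValueError; passing key=operator.itemgetter(0) avoids that.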
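To make the tuple ordering concrete, here is a toy sketch with made-up distances and string labels (not the question's data). It feeds a generator expression built with zip straight into sorted, and shows that ties on the first element are broken by the second.

```python
# Hypothetical toy data: four distances and their class labels.
distances = [2.5, 0.5, 2.5, 1.0]
labels = ["b", "a", "a", "c"]

# sorted() consumes the generator directly; tuples compare element by element.
pairs = sorted((d, label) for d, label in zip(distances, labels))
print(pairs)  # [(0.5, 'a'), (1.0, 'c'), (2.5, 'a'), (2.5, 'b')]
```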

Since your variables are all numpy arrays, you could also do it more vectorized. For this I would first re-define the distance function to use numpy functions:

    def distance(x, y):
        return np.sqrt(((x - y)**2).sum())

And then put the distances into a numpy array as well. Picking the k nearest rows then becomes easy with np.argsort and array indexing:

    def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
        distances = np.array([distance(testInstance, x) for x in trainingSetX])
        # Indices of the k smallest distances
        nearest = np.argsort(distances)[:k]
        # One row per neighbor: the features followed by the label
        return np.column_stack((trainingSetX[nearest], trainingSetY[nearest]))

This can probably be modified even further by trying to make the distance call vectorized as well.
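One possible way to vectorize the distance call itself is to lean on numpy broadcasting, so that the test instance is subtracted from every training row in a single expression. This is only a sketch under my own assumptions: the snake_case names, the toy data, and the choice to return features and labels as two separate arrays are not from the question.

```python
import numpy as np


def get_neighbors(training_x, training_y, test_instance, k):
    """Return (features, labels) of the k training rows closest to test_instance."""
    # Broadcasting subtracts test_instance from every row at once.
    distances = np.sqrt(((training_x - test_instance) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    return training_x[nearest], training_y[nearest]


# Toy data, made up for illustration.
train_x = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]])
train_y = np.array([0.0, 1.0, 1.0, 0.0])
neighbor_x, neighbor_y = get_neighbors(train_x, train_y, np.array([0.0, 0.0]), 2)
print(neighbor_y)
```

Since only the ordering matters for picking neighbors, the np.sqrt call could be dropped without changing the result.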

Your function getResponse can be simplified using the collections.Counter class, which implements exactly what you do here with classVotes:

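    from collections import Counter
    from operator import itemgetter

    def getResponse(neighbors):
        classVotes = Counter(int(neighbor[-1]) for neighbor in neighbors)
        # items() rather than the Python 2-only iteritems()
        return max(classVotes.items(), key=itemgetter(1))[0]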
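For a feel of what Counter does, here is a tiny sketch on made-up votes. It also shows most_common, which could stand in for the max call if you prefer.

```python
from collections import Counter

# Hypothetical class labels of five neighbors.
votes = Counter([1, 0, 1, 1, 0])
print(votes)                 # Counter({1: 3, 0: 2})
print(votes.most_common(1))  # [(1, 3)]

# The winning class is the first element of the most common (label, count) pair.
winner = votes.most_common(1)[0][0]
print(winner)  # 1
```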

Your function getAccuracy can be slightly simplified using a generator expression and sum:

    def getAccuracy(testSetY, predictions):
        correct = sum(y == p for y, p in zip(testSetY, predictions))
        # len(testSetY) also works when predictions is a generator
        return correct * 100.0 / len(testSetY)
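If the predictions are collected in a list or array (not a generator), the same calculation can also be written with numpy. This is a sketch with a hypothetical get_accuracy name and made-up values: comparing two arrays yields a boolean array, and its mean is the fraction of matches.

```python
import numpy as np


def get_accuracy(test_y, predictions):
    """Return the percentage of predictions that match test_y."""
    # Elementwise comparison gives booleans; their mean is the match rate.
    return np.mean(np.asarray(test_y) == np.asarray(predictions)) * 100.0


print(get_accuracy([1.0, 0.0, 1.0, 1.0], [1, 0, 0, 1]))  # 75.0
```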

And lastly, in your start function, you can directly iterate over the elements of testSetX, make it a generator expression, and use the fact that print can take multiple arguments:

    def start():
        trainingSetX = X[:2000]
        trainingSetY = Y[:2000]
        testSetX = X[2000:]
        testSetY = Y[2000:]
        # generate predictions
        k = 4
        predictions = (getResponse(getNeighbors(trainingSetX, trainingSetY, x, k))
                       for x in testSetX)
        accuracy = getAccuracy(testSetY, predictions)
        print('Accuracy:', accuracy)

Graipher · Accepted Answer · 2017-02-07 12:58:00Z
