\$\begingroup\$

This is the first time I have tried to write some code in Python. I think it gives proper answers, but probably some "vectorization" is needed.

    import numpy as np
    import math
    import operator

    data = np.genfromtxt("KNNdata.csv", delimiter = ',', skip_header = 1)
    data = data[:,2:]
    np.random.shuffle(data)
    X = data[:, range(5)]
    Y = data[:, 5]

    def distance(instance1, instance2):
        dist = 0.0
        for i in range(len(instance1)):
            dist += pow((instance1[i] - instance2[i]), 2)
        return math.sqrt(dist)

    # Calculating distances between all data, return sorted  k-elements list (whole element and output)
    def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
        distances = []
        for i in range(len(trainingSetX)):
            dist = distance(testInstance, trainingSetX[i])
            distances.append((trainingSetX[i], dist, trainingSetY[i]))
        distances.sort(key=operator.itemgetter(1))
        neighbour = []
        for elem in range(k):
            neighbour.append((distances[elem][0], distances[elem][2]))
        return neighbour

    #return answer
    def getResponse(neighbors):
        classVotes = {}
        for x in range(len(neighbors)):
            response = int(neighbors[x][-1])
            if response in classVotes:
                classVotes[response] += 1
            else:
                classVotes[response] = 1
        sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse = True)
        return sortedVotes[0][0]

    #return accuracy, your predicitons and actual values
    def getAccuracy(testSetY, predictions):
        correct = 0
        for x in range(len(predictions)):
            if testSetY[x] == predictions[x]:
                correct += 1
        return (correct / (len(predictions))) * 100.0

    def start():
        trainingSetX = X[:2000]
        trainingSetY = Y[:2000]
        testSetX = X[2000:]
        testSetY = Y[2000:]
        # generate predictions
        predictions = []
        k = 4
        for x in range(len(testSetX)):
            neighbors = getNeighbors(trainingSetX, trainingSetY, testSetX[x], k)
            result = getResponse(neighbors)
            predictions.append(result)
        accuracy = getAccuracy(testSetY, predictions)
        print('Accuracy: ' + str(accuracy))

    start()
asked Feb 6, 2017 at 17:37 by Newbie
\$\endgroup\$

1 Answer

\$\begingroup\$

First, a style nitpick: Python has an official style guide, PEP8, which recommends using `lower_case_with_underscores` for variable and function names instead of `camelCase`.

Second, the comments you have above your functions should become docstrings. These appear, for example, when calling `help(your_function)` in an interactive session. Just put a string as the first line below the function header, like so:

    def f(a, b):
        """Returns the sum of `a` and `b`"""
        return a + b

It is recommended to always use triple double-quotes (i.e. `"""`).
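As a small illustration of the point above, the docstring is stored on the function object itself, which is where `help` looks it up. The function name `add` is only a placeholder for this sketch.

```python
def add(a, b):
    """Return the sum of `a` and `b`."""
    return a + b


# The docstring lives in the __doc__ attribute of the function object.
# Calling help(add) in an interactive session displays the same text.
print(add.__doc__)
```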


Now I am going to focus on the distance calculation.

First, you can greatly simplify your `getNeighbors` function using list comprehensions:

    def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
        distances = sorted((distance(testInstance, x), x, y)
                           for x, y in zip(trainingSetX, trainingSetY))
        return [(d[1], d[2]) for d in distances[:k]]

Here I used the fact that tuples already sort naturally, by comparing the first element and then (if those are equal) the second, and so on. So I put the distance first in the tuple and you no longer need the key function. `sorted` can take a generator expression and sort it directly. We can also iterate over multiple iterables at the same time using `zip`. One caveat: if two distances are exactly equal, the comparison falls through to the NumPy rows, which raises a `ValueError`; keeping `key=operator.itemgetter(0)` avoids that.
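To see the tuple ordering and `zip` in isolation, here is a minimal sketch with made-up points and labels (plain Python tuples, not the data from the question). The helper `squared_distance` is hypothetical and only stands in for the `distance` function.

```python
# Made-up sample data, purely for illustration.
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
labels = ["a", "b", "c"]
query = (0.0, 0.0)


def squared_distance(p, q):
    """Sum of squared coordinate differences between p and q."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


# Each tuple starts with the distance, so sorted() orders by distance first.
ranked = sorted((squared_distance(query, p), p, label)
                for p, label in zip(points, labels))

# Keep the two closest points and drop the distance again.
nearest_two = [(p, label) for _, p, label in ranked[:2]]
print(nearest_two)  # [((0.0, 0.0), 'a'), ((1.0, 1.0), 'c')]
```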

Since your variables are all numpy arrays, you could also do it more vectorized. For this I would first re-define the distance function to use numpy functions:

    def distance(x, y):
        return np.sqrt(((x - y)**2).sum())

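And then put the distances into a numpy array as well. Returning only the feature and label columns then becomes easier with array slicing. Note that `ndarray.sort()` would sort each row independently, so the rows are reordered with `argsort` on the distances instead:

    def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
        distances = np.array([distance(testInstance, x) for x in trainingSetX])
        # One row per training point: distance, features, label
        rows = np.column_stack((distances, trainingSetX, trainingSetY))
        # Order the rows by distance, then drop the distance column
        return rows[distances.argsort()][:k, 1:]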

This can probably be modified even further by trying to make the `distance` call vectorized as well.
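One way that further step could look is sketched below. It relies on NumPy broadcasting to compute all distances in a single expression, with no Python-level loop. The snake_case names are my own, and this version returns the neighbor rows and labels as two separate arrays, which differs from the return value of the functions above.

```python
import numpy as np


def get_neighbors(training_x, training_y, test_instance, k):
    """Return the k nearest training rows and their labels."""
    # Broadcasting subtracts test_instance from every training row at once.
    differences = training_x - test_instance
    distances = np.sqrt((differences ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first.
    nearest = np.argsort(distances)[:k]
    return training_x[nearest], training_y[nearest]


# Tiny made-up data set, only to show the call.
training_x = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
training_y = np.array([0, 0, 1, 1])
neighbors_x, neighbors_y = get_neighbors(
    training_x, training_y, np.array([5.2, 5.0]), k=2)
print(neighbors_y)
```

If only the k nearest points are needed and their order does not matter, `np.argpartition` can avoid the full sort.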


Your function `getResponse` can be simplified using the `collections.Counter` class, which implements exactly what you do here with the `classVotes` dictionary:

    from collections import Counter
    from operator import itemgetter

    def getResponse(neighbors):
        classVotes = Counter(int(neighbor[-1]) for neighbor in neighbors)
        return max(classVotes.items(), key=itemgetter(1))[0]
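`Counter` also has a `most_common` method, which could stand in for the `max` call. A minimal sketch with made-up labels:

```python
from collections import Counter

# Made-up neighbor labels, purely for illustration.
votes = Counter([1, 0, 1, 1, 0])

# most_common(1) returns a list holding the single (label, count) pair
# with the highest count.
label, count = votes.most_common(1)[0]
print(label, count)  # 1 3
```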

Your function `getAccuracy` can be slightly simplified using a generator expression and `sum`:

    def getAccuracy(testSetY, predictions):
        correct = sum(y == p for y, p in zip(testSetY, predictions))
        return correct * 100.0 / len(predictions)
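Since the labels come from a NumPy array anyway, the comparison could also be done element-wise in one step. This is a sketch, assuming both arguments can be converted to arrays of equal length; the snake_case name is my own.

```python
import numpy as np


def get_accuracy(test_y, predictions):
    """Percentage of predictions that match the true labels."""
    # The element-wise comparison gives a boolean array; its mean is the
    # fraction of matching entries.
    matches = np.asarray(test_y) == np.asarray(predictions)
    return 100.0 * np.mean(matches)


print(get_accuracy([1, 0, 1, 1], [1, 1, 1, 0]))  # 50.0
```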

And lastly, in your `start` function, you can iterate directly over the elements of `testSetX`, build the predictions with a list comprehension (a list rather than a generator expression, because `getAccuracy` calls `len` on it), and use the fact that `print` can take multiple arguments:

    def start():
        trainingSetX = X[:2000]
        trainingSetY = Y[:2000]
        testSetX = X[2000:]
        testSetY = Y[2000:]
        # generate predictions
        k = 4
        predictions = [getResponse(getNeighbors(trainingSetX, trainingSetY, x, k))
                       for x in testSetX]
        accuracy = getAccuracy(testSetY, predictions)
        print('Accuracy:', accuracy)
answered Feb 7, 2017 at 11:42 by Graipher
\$\endgroup\$
