Here is my personal implementation of the k-means clustering algorithm.
```python
from scipy.spatial import distance
import numpy as np
import random

# (x, y) coordinates of a point
X = 0
Y = 1

def get_first(k, points):
    return points[0:k]

def cost(cetroids, clusters):
    cost = 0
    for i in range(len(centroids)):
        centroid = centroids[i]
        cluster = clusters[i]
        for point in cluster:
            cost += (distance.euclidean(centroid, point))**2
    return cost

def compute_centroids(clusters):
    centroids = []
    for cluster in clusters:
        centroids.append(np.mean(cluster, axis=0))
    return centroids

def kmeans(k, centroids, points, method, iter):
    clusters = [[] for i in range(k)]
    for i in range(len(points)):
        point = points[i]
        belongs_to_cluster = closest_centroid(point, centroids)
        clusters[belongs_to_cluster].append(point)
    new_centroids = compute_centroids(clusters)
    if not equals(centroids, new_centroids):
        print "Iteration " + str(iter) + ". Cost [k=" + str(k) + ", " + method + "] = " + str(cost(new_centroids, clusters))
        clusters = kmeans(k, new_centroids, points, method, iter+1)
    return clusters

def closest_centroid(point, centroids):
    min_distance = float('inf')
    belongs_to_cluster = None
    for j in range(len(centroids)):
        centroid = centroids[j]
        dist = distance.euclidean(point, centroid)
        if dist < min_distance:
            min_distance = dist
            belongs_to_cluster = j
    return belongs_to_cluster

def contains(point1, points):
    for point2 in points:
        if point1[X] == point2[X] and point1[Y] == point2[Y]:
            return True
    return False

def equals(points1, points2):
    if len(points1) != len(points2):
        return False
    for i in range(len(points1)):
        point1 = points1[i]
        point2 = points2[i]
        if point1[X] != point2[X] or point1[Y] != point2[Y]:
            return False
    return True

if __name__ == "__main__":
    data = [[-19.0748,     -8.536      ],
            [ 22.0108,    -10.9737     ],
            [ 12.6597,     19.2601     ],
            [ 11.26884087, 19.90132146 ],
            [ 15.44640731, 21.13121676 ],
            [-20.03865146, -8.820872829],
            [-19.65417726, -8.211477352],
            [-15.97295894, -9.648002534],
            [-18.74359696, -5.383551586],
            [-19.453215,   -8.146120006],
            [-16.43074088, -7.524968005],
            [-19.75512437, -8.533215751],
            [-19.56237082, -8.798668569],
            [-19.47135573, -8.057217004],
            [-18.60946986, -4.475888949],
            [-21.59368337, -10.38712463],
            [-15.39158057, -3.8336522  ],
            [-40.0,        40.0        ]]

    k = 4

    # k-means picking the first k points as centroids
    centroids = get_first(k, data)
    clusters = kmeans(k, centroids, data, "first", 1)
```

I understood and followed the theory of the algorithm, but as you can see, when running the code, the cost on each iteration of the algorithm increases. I am new to Python and I think the problem lies in some misunderstanding on my side regarding list manipulation. Any ideas?
- Welcome to Code Review! I hope you get some good answers. – Phrancis, May 13, 2016 at 20:38
- Also, I am open to suggestions regarding the quality of my code. – Daniyal Shahrokhian, May 13, 2016 at 21:47
1 Answer
N.b. I'm pretty sure you can find optimised replacements for many of these functions in either the NumPy or SciPy library.
Edit: I don't immediately see a reason for a slowdown except the addition of new clusters. A bigger sample set that shows the behaviour would be nice, together with some timings and probably running a profiler on it.
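For the profiling part, here's a generic sketch using the standard `cProfile`/`pstats` modules; `work` is just a stand-in for the actual `kmeans(...)` call you'd want to measure:

```python
import cProfile
import io
import pstats

def work():
    # stand-in for the real kmeans(...) call being profiled
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# print the five most expensive entries by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
report = out.getvalue()
print(report)
```

The report shows per-function call counts and cumulative times, which is usually enough to tell whether the time goes into distance computations or list bookkeeping.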
In general you should likely use a NumPy array or matrix instead of lists of lists, as they are optimised for storing numeric data(!). That would likely also eliminate the need for some of these functions or considerably reduce their length. Also store all data in the same containers - that way you don't have to create new functions like `equals` and `contains` yourself to compare between different representations.
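As a sketch of what that buys you (the variable names here are illustrative), with all points in one NumPy array both helpers collapse to one-liners:

```python
import numpy as np

points = np.array([[-19.0748, -8.536],
                   [ 22.0108, -10.9737],
                   [ 12.6597,  19.2601]])
other = points.copy()

# equals(points1, points2) becomes a single call
same = np.array_equal(points, other)

# contains(point1, points) becomes a vectorised row comparison:
# broadcast the point against every row, then check for a full match
point = np.array([22.0108, -10.9737])
found = bool((points == point).all(axis=1).any())

print(same, found)  # True True
```

Both work for any number of dimensions, which also sidesteps the hard-coded `X`/`Y` indexing.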
- The `X`/`Y` definitions at the start are more of a WTF for me. I'd get rid of that by writing code that's either independent of the number of dimensions, or just use `0`/`1` - IMO it's not magic enough to warrant separate constants.
- `get_first` isn't necessary - the pattern is obvious enough to not require encapsulation in a function. The zero is also optional.
- Typo in the `cost` signature, should be `centroids`.
- The pattern `for i in range(len(...)):` appears a couple of times and should be replaced with proper iteration over some helper generator. E.g. in `cost` the parallel iteration over both `centroids` and `clusters` should be done by `zip` (`itertools.izip` in Python 2.7 if used in the same way). It can also be simplified by using another SciPy function, `cdist`.
- If you still need the index, use `enumerate`.
- You can also often get away with using the squared Euclidean distance if the exact value isn't relevant, just the relation between different distances.
- The pattern to initialise a list of empty lists should probably use `_` instead of `i` to make it obvious that the variable serves no further purpose.
- `print` should be used as a function to make it more uniform. Also, use one of the various formatting options (`%` or `format`) to get rid of the additional `str` calls.
- `compute_centroids` can be simplified with a list comprehension.
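To illustrate the `cdist` point with made-up numbers (in Python 3, `itertools.izip` is simply the built-in `zip`): one `cdist` call with the `'sqeuclidean'` metric replaces the whole inner loop of `cost`.

```python
from scipy.spatial import distance

centroid = [0.0, 0.0]
cluster = [[1.0, 0.0], [0.0, 2.0]]

# sum of squared distances from the centroid to every point in the cluster
sq = distance.cdist([centroid], cluster, 'sqeuclidean').sum()

# the explicit loop it replaces
manual = sum(distance.euclidean(centroid, p) ** 2 for p in cluster)

print(sq, manual)  # 5.0 5.0
```

Besides being shorter, `cdist` computes all pairwise distances in one vectorised call instead of one Python-level call per point.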
Looks like this now:
```python
from scipy.spatial import distance
import numpy as np
import random
from itertools import izip

def cost(centroids, clusters):
    return sum(distance.cdist([centroid], cluster, 'sqeuclidean').sum()
               for centroid, cluster in izip(centroids, clusters))

def compute_centroids(clusters):
    return [np.mean(cluster, axis=0) for cluster in clusters]

def kmeans(k, centroids, points, method):
    clusters = [[] for _ in range(k)]
    for point in points:
        clusters[closest_centroid(point, centroids)].append(point)
    new_centroids = compute_centroids(clusters)
    if not equals(centroids, new_centroids):
        print("cost [k={}, {}] = {}".format(k, method, cost(new_centroids, clusters)))
        clusters = kmeans(k, new_centroids, points, method)
    return clusters

def closest_centroid(point, centroids):
    min_distance = float('inf')
    belongs_to_cluster = None
    for j, centroid in enumerate(centroids):
        dist = distance.sqeuclidean(point, centroid)
        if dist < min_distance:
            min_distance = dist
            belongs_to_cluster = j
    return belongs_to_cluster

def contains(point1, points):
    for point2 in points:
        if point1[0] == point2[0] and point1[1] == point2[1]:
            # if all(x == y for x, y in izip(point1, point2)):
            return True
    return False

def equals(points1, points2):
    if len(points1) != len(points2):
        return False
    for point1, point2 in izip(points1, points2):
        if point1[0] != point2[0] or point1[1] != point2[1]:
            # if any(x != y for x, y in izip(point1, point2)):
            return False
    return True

if __name__ == "__main__":
    data = [[-19.0748,     -8.536      ],
            [ 22.0108,    -10.9737     ],
            [ 12.6597,     19.2601     ],
            [ 11.26884087, 19.90132146 ],
            [ 15.44640731, 21.13121676 ],
            [-20.03865146, -8.820872829],
            [-19.65417726, -8.211477352],
            [-15.97295894, -9.648002534],
            [-18.74359696, -5.383551586],
            [-19.453215,   -8.146120006],
            [-16.43074088, -7.524968005],
            [-19.75512437, -8.533215751],
            [-19.56237082, -8.798668569],
            [-19.47135573, -8.057217004],
            [-18.60946986, -4.475888949],
            [-21.59368337, -10.38712463],
            [-15.39158057, -3.8336522  ],
            [-40.0,        40.0        ]]

    k = 4

    # k-means picking the first k points as centroids
    centroids = data[:k]
    clusters = kmeans(k, centroids, data, "first")
```

- This is... just prodigious. Sharp, neat and complete answer. I will revise the whole thing tomorrow morning. Thanks a lot @ferada. – Daniyal Shahrokhian, May 14, 2016 at 3:21