Here is my personal implementation of the k-means clustering algorithm.
```python
from scipy.spatial import distance
import numpy as np
import random

# (x, y) coordinates of a point
X = 0
Y = 1

def get_first(k, points):
    return points[0:k]

def cost(cetroids, clusters):
    cost = 0
    for i in range(len(centroids)):
        centroid = centroids[i]
        cluster = clusters[i]
        for point in cluster:
            cost += (distance.euclidean(centroid, point))**2
    return cost

def compute_centroids(clusters):
    centroids = []
    for cluster in clusters:
        centroids.append(np.mean(cluster, axis=0))
    return centroids

def kmeans(k, centroids, points, method, iter):
    clusters = [[] for i in range(k)]
    for i in range(len(points)):
        point = points[i]
        belongs_to_cluster = closest_centroid(point, centroids)
        clusters[belongs_to_cluster].append(point)
    new_centroids = compute_centroids(clusters)
    if not equals(centroids, new_centroids):
        print "Iteration " + str(iter) + ". Cost [k=" + str(k) + ", " + method + "] = " + str(cost(new_centroids, clusters))
        clusters = kmeans(k, new_centroids, points, method, iter+1)
    return clusters

def closest_centroid(point, centroids):
    min_distance = float('inf')
    belongs_to_cluster = None
    for j in range(len(centroids)):
        centroid = centroids[j]
        dist = distance.euclidean(point, centroid)
        if dist < min_distance:
            min_distance = dist
            belongs_to_cluster = j
    return belongs_to_cluster

def contains(point1, points):
    for point2 in points:
        if point1[X] == point2[X] and point1[Y] == point2[Y]:
            return True
    return False

def equals(points1, points2):
    if len(points1) != len(points2):
        return False
    for i in range(len(points1)):
        point1 = points1[i]
        point2 = points2[i]
        if point1[X] != point2[X] or point1[Y] != point2[Y]:
            return False
    return True

if __name__ == "__main__":
    data = [[-19.0748,     -8.536      ],
            [ 22.0108,    -10.9737     ],
            [ 12.6597,     19.2601     ],
            [ 11.26884087, 19.90132146 ],
            [ 15.44640731, 21.13121676 ],
            [-20.03865146, -8.820872829],
            [-19.65417726, -8.211477352],
            [-15.97295894, -9.648002534],
            [-18.74359696, -5.383551586],
            [-19.453215,   -8.146120006],
            [-16.43074088, -7.524968005],
            [-19.75512437, -8.533215751],
            [-19.56237082, -8.798668569],
            [-19.47135573, -8.057217004],
            [-18.60946986, -4.475888949],
            [-21.59368337, -10.38712463],
            [-15.39158057, -3.8336522  ],
            [-40.0,        40.0        ]]

    k = 4

    # k-means picking the first k points as centroids
    centroids = get_first(k, data)
    clusters = kmeans(k, centroids, data, "first", 1)
```

I understood and followed the theory of the algorithm, but as you can see, when running the code, the cost on each iteration of the algorithm increases. I am new to Python and I think the problem lies in some misunderstanding on my side regarding list manipulation. Any ideas?
- Welcome to Code Review! I hope you get some good answers. – Phrancis, May 13, 2016 at 20:38
- Also, I am open to suggestions regarding the quality of my code. – Daniyal Shahrokhian, May 13, 2016 at 21:47
1 Answer
N.b. I'm pretty sure you can find optimised replacements for many of these functions in either the NumPy or SciPy library.
Edit: I don't immediately see a reason for a slowdown except the addition of new clusters. A bigger sample set that shows the behaviour would be nice, together with some timings and probably running a profiler on it.
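For the profiling part, here's a generic sketch using the standard `cProfile`/`pstats` modules; `work` is just a stand-in for the actual `kmeans(...)` call you'd want to measure:

```python
import cProfile
import io
import pstats

def work():
    # stand-in for the real kmeans(...) call being profiled
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# print the five most expensive entries by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
report = out.getvalue()
print(report)
```

The report shows per-function call counts and cumulative times, which is usually enough to tell whether the time goes into distance computations or list bookkeeping.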
In general you should likely use a NumPy array or matrix instead of lists of lists, as they are optimised for storing numeric data(!). That would likely also eliminate the need for some of these functions or considerably reduce their length. Also store all data in the same containers - that way you don't have to create new functions like `equals` and `contains` yourself to compare between different representations.
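As a sketch of what that buys you (the variable names here are illustrative), with all points in one NumPy array both helpers collapse to one-liners:

```python
import numpy as np

points = np.array([[-19.0748, -8.536],
                   [ 22.0108, -10.9737],
                   [ 12.6597,  19.2601]])
other = points.copy()

# equals(points1, points2) becomes a single call
same = np.array_equal(points, other)

# contains(point1, points) becomes a vectorised row comparison:
# broadcast the point against every row, then check for a full match
point = np.array([22.0108, -10.9737])
found = bool((points == point).all(axis=1).any())

print(same, found)  # True True
```

Both work for any number of dimensions, which also sidesteps the hard-coded `X`/`Y` indexing.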
- The `X`/`Y` definitions at the start are more of a WTF for me. I'd get rid of that by writing code that's either independent of the number of dimensions, or just use `0`/`1` - IMO it's not magic enough to warrant separate constants.
- `get_first` isn't necessary - the pattern is obvious enough to not require encapsulation in a function. The zero is also optional.
- Typo in the `cost` signature, should be `centroids`.
- The pattern `for i in range(len(...)):` appears a couple of times and should be replaced with proper iteration over some helper generator. E.g. in `cost` the parallel iteration over both `centroids` and `clusters` should be done by `zip` (`itertools.izip` in Python 2.7 if used in the same way). It can also be simplified by using another SciPy function, `cdist`.
- If you still need the index, use `enumerate`.
- You can also often get away with using the squared Euclidean distance if the exact value isn't relevant, just the relation between different distances.
- The pattern to initialise a list of empty lists should probably use `_` instead of `i` to make it obvious that the variable serves no further purpose.
- `print` should be used as a function to make it more uniform. Also, use one of the various formatting options (`%` or `format`) to get rid of the additional `str` calls.
- `compute_centroids` can be simplified with a list comprehension.
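To illustrate the `cdist` point with made-up numbers (in Python 3, `itertools.izip` is simply the built-in `zip`): one `cdist` call with the `'sqeuclidean'` metric replaces the whole inner loop of `cost`.

```python
from scipy.spatial import distance

centroid = [0.0, 0.0]
cluster = [[1.0, 0.0], [0.0, 2.0]]

# sum of squared distances from the centroid to every point in the cluster
sq = distance.cdist([centroid], cluster, 'sqeuclidean').sum()

# the explicit loop it replaces
manual = sum(distance.euclidean(centroid, p) ** 2 for p in cluster)

print(sq, manual)  # 5.0 5.0
```

Besides being shorter, `cdist` computes all pairwise distances in one vectorised call instead of one Python-level call per point.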
Looks like this now:
```python
from scipy.spatial import distance
import numpy as np
import random
from itertools import izip

def cost(centroids, clusters):
    return sum(distance.cdist([centroid], cluster, 'sqeuclidean').sum()
               for centroid, cluster in izip(centroids, clusters))

def compute_centroids(clusters):
    return [np.mean(cluster, axis=0) for cluster in clusters]

def kmeans(k, centroids, points, method):
    clusters = [[] for _ in range(k)]
    for point in points:
        clusters[closest_centroid(point, centroids)].append(point)
    new_centroids = compute_centroids(clusters)
    if not equals(centroids, new_centroids):
        print("cost [k={}, {}] = {}".format(k, method, cost(new_centroids, clusters)))
        clusters = kmeans(k, new_centroids, points, method)
    return clusters

def closest_centroid(point, centroids):
    min_distance = float('inf')
    belongs_to_cluster = None
    for j, centroid in enumerate(centroids):
        dist = distance.sqeuclidean(point, centroid)
        if dist < min_distance:
            min_distance = dist
            belongs_to_cluster = j
    return belongs_to_cluster

def contains(point1, points):
    for point2 in points:
        if point1[0] == point2[0] and point1[1] == point2[1]:
            # if all(x == y for x, y in izip(point1, point2)):
            return True
    return False

def equals(points1, points2):
    if len(points1) != len(points2):
        return False
    for point1, point2 in izip(points1, points2):
        if point1[0] != point2[0] or point1[1] != point2[1]:
            # if any(x != y for x, y in izip(point1, point2)):
            return False
    return True

if __name__ == "__main__":
    data = [[-19.0748,     -8.536      ],
            [ 22.0108,    -10.9737     ],
            [ 12.6597,     19.2601     ],
            [ 11.26884087, 19.90132146 ],
            [ 15.44640731, 21.13121676 ],
            [-20.03865146, -8.820872829],
            [-19.65417726, -8.211477352],
            [-15.97295894, -9.648002534],
            [-18.74359696, -5.383551586],
            [-19.453215,   -8.146120006],
            [-16.43074088, -7.524968005],
            [-19.75512437, -8.533215751],
            [-19.56237082, -8.798668569],
            [-19.47135573, -8.057217004],
            [-18.60946986, -4.475888949],
            [-21.59368337, -10.38712463],
            [-15.39158057, -3.8336522  ],
            [-40.0,        40.0        ]]

    k = 4

    # k-means picking the first k points as centroids
    centroids = data[:k]
    clusters = kmeans(k, centroids, data, "first")
```

- This is... just prodigious. Sharp, neat and complete answer. I will revise the whole thing tomorrow morning. Thanks a lot @ferada. – Daniyal Shahrokhian, May 14, 2016 at 3:21