5
\$\begingroup\$

Here is my implementation of the k-means algorithm in python. I would love to get any feedback on how it could be improved or any logical errors that you may see. I've left off a lot of the boilerplate code like the command line argument parsing, error handling for data read in from CSV file, etc. and just added the meat of the algorithm.

import osimport numpy as np# kmeans clustering algorithm# data = set of data points# k = number of clusters# c = initial list of centroids (if provided)#def kmeans(data, k, c):    centroids = []    centroids = randomize_centroids(data, centroids, k)      old_centroids = [[] for i in range(k)]     iterations = 0    while not (has_converged(centroids, old_centroids, iterations)):        iterations += 1        clusters = [[] for i in range(k)]        # assign data points to clusters        clusters = euclidean_dist(data, centroids, clusters)        # recalculate centroids        index = 0        for cluster in clusters:            old_centroids[index] = centroids[index]            centroids[index] = np.mean(cluster, axis=0).tolist()            index += 1    print("The total number of data instances is: " + str(len(data)))    print("The total number of iterations necessary is: " + str(iterations))    print("The means of each cluster are: " + str(centroids))    print("The clusters are as follows:")    for cluster in clusters:        print("Cluster with a size of " + str(len(cluster)) + " starts here:")        print(np.array(cluster).tolist())        print("Cluster ends here.")    return# Calculates euclidean distance between# a data point and all the available cluster# centroids.      def euclidean_dist(data, centroids, clusters):    for instance in data:          # Find which centroid is the closest        # to the given data point.        mu_index = min([(i[0], np.linalg.norm(instance-centroids[i[0]])) \                            for i in enumerate(centroids)], key=lambda t:t[1])[0]        try:            clusters[mu_index].append(instance)        except KeyError:            clusters[mu_index] = [instance]    # If any cluster is empty then assign one point    # from data set randomly so as to not have empty    # clusters and 0 means.            for cluster in clusters:        if not cluster:            cluster.append(data[np.random.randint(0, len(data), size=1)].flatten().tolist())    return clusters# randomize initial centroidsdef randomize_centroids(data, centroids, k):    for cluster in range(0, k):        centroids.append(data[np.random.randint(0, len(data), size=1)].flatten().tolist())    return centroids# check if clusters have converged    def has_converged(centroids, old_centroids, iterations):    MAX_ITERATIONS = 1000    if iterations > MAX_ITERATIONS:        return True    return old_centroids == centroids
askedFeb 9, 2015 at 14:34
alacy's user avatar
\$\endgroup\$
3
  • \$\begingroup\$Seethis question and its answer.\$\endgroup\$CommentedFeb 12, 2015 at 16:21
  • \$\begingroup\$It would be nice if you could accept the answer. Moreover, where isc used in your code?\$\endgroup\$CommentedJan 31, 2016 at 13:43
  • \$\begingroup\$what is the data type fordata ? Numpy array or list of lists/tuple of tuples?\$\endgroup\$CommentedApr 28, 2016 at 17:40

2 Answers2

4
\$\begingroup\$

Avoid comments that are identical to the code:

# check if clusters have converged  <-- Remove thisdef has_converged(centroids, old_centroids, iterations):

MAX_ITERATIONS = 1000

Constants should be put at the top of the file to be easier to tweak.


Separate calculations from IO, your functionkmeans should return the values and then another function (maybepretty_format_k_means) should create a human readable message.


# k = number of clusters# c = initial list of centroids (if provided)

Multi-character variable names are allowed, rename your variables (and function arguments) to more meaningful names, and then you can delete the comments.

answeredFeb 9, 2015 at 16:53
Caridorc's user avatar
\$\endgroup\$
1
  • 1
    \$\begingroup\$I will agree with Caridorc, but renamingk ink-means will not be such a good idea, Ithink, +1 though!\$\endgroup\$CommentedJan 29, 2016 at 23:05
2
\$\begingroup\$

To expand Caridorc's answer, I would change that:

# c = initial list of centroids (if provided)def kmeans(data, k, c):

to:

# c = initial list of centroids (if provided)def kmeans(data, k, c=None):

sincec may not be provided. So I used theNone keyword.

However, note that in your code, you do not usec!

answeredJan 31, 2016 at 13:51
gsamaras's user avatar
\$\endgroup\$

You mustlog in to answer this question.

Protected question. To answer this question, you need to have at least 10 reputation on this site (not counting theassociation bonus). The reputation requirement helps protect this question from spam and non-answer activity.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.