K-means clustering algorithm in python

Question 1

Here is my implementation of the k-means algorithm in python. I would love to get any feedback on how it could be improved or any logical errors that you may see. I've left off a lot of the boilerplate code like the command line argument parsing, error handling for data read in from CSV file, etc. and just added the meat of the algorithm.

import osimport numpy as np# kmeans clustering algorithm# data = set of data points# k = number of clusters# c = initial list of centroids (if provided)#def kmeans(data, k, c):    centroids = []    centroids = randomize_centroids(data, centroids, k)      old_centroids = [[] for i in range(k)]     iterations = 0    while not (has_converged(centroids, old_centroids, iterations)):        iterations += 1        clusters = [[] for i in range(k)]        # assign data points to clusters        clusters = euclidean_dist(data, centroids, clusters)        # recalculate centroids        index = 0        for cluster in clusters:            old_centroids[index] = centroids[index]            centroids[index] = np.mean(cluster, axis=0).tolist()            index += 1    print("The total number of data instances is: " + str(len(data)))    print("The total number of iterations necessary is: " + str(iterations))    print("The means of each cluster are: " + str(centroids))    print("The clusters are as follows:")    for cluster in clusters:        print("Cluster with a size of " + str(len(cluster)) + " starts here:")        print(np.array(cluster).tolist())        print("Cluster ends here.")    return# Calculates euclidean distance between# a data point and all the available cluster# centroids.      def euclidean_dist(data, centroids, clusters):    for instance in data:          # Find which centroid is the closest        # to the given data point.        mu_index = min([(i[0], np.linalg.norm(instance-centroids[i[0]])) \                            for i in enumerate(centroids)], key=lambda t:t[1])[0]        try:            clusters[mu_index].append(instance)        except KeyError:            clusters[mu_index] = [instance]    # If any cluster is empty then assign one point    # from data set randomly so as to not have empty    # clusters and 0 means.            for cluster in clusters:        if not cluster:            cluster.append(data[np.random.randint(0, len(data), size=1)].flatten().tolist())    return clusters# randomize initial centroidsdef randomize_centroids(data, centroids, k):    for cluster in range(0, k):        centroids.append(data[np.random.randint(0, len(data), size=1)].flatten().tolist())    return centroids# check if clusters have converged    def has_converged(centroids, old_centroids, iterations):    MAX_ITERATIONS = 1000    if iterations > MAX_ITERATIONS:        return True    return old_centroids == centroids

Question 2

Seethis question and its answer.

Question 3

It would be nice if you could accept the answer. Moreover, where isc used in your code?

Question 4

what is the data type fordata ? Numpy array or list of lists/tuple of tuples?

Question 5

Avoid comments that are identical to the code:

# check if clusters have converged  <-- Remove thisdef has_converged(centroids, old_centroids, iterations):

MAX_ITERATIONS = 1000

Constants should be put at the top of the file to be easier to tweak.

Separate calculations from IO, your functionkmeans should return the values and then another function (maybepretty_format_k_means) should create a human readable message.

# k = number of clusters# c = initial list of centroids (if provided)

Multi-character variable names are allowed, rename your variables (and function arguments) to more meaningful names, and then you can delete the comments.

Question 6

I will agree with Caridorc, but renamingk ink-means will not be such a good idea, Ithink, +1 though!

Question 7

To expand Caridorc's answer, I would change that:

# c = initial list of centroids (if provided)def kmeans(data, k, c):

to:

# c = initial list of centroids (if provided)def kmeans(data, k, c=None):

sincec may not be provided. So I used theNone keyword.

However, note that in your code, you do not usec!

Caridorc 28.2k7 gold badges55 silver badges138 bronze badges · Accepted Answer · 2015-02-09 16:53:21Z

Avoid comments that are identical to the code:

# check if clusters have converged  <-- Remove thisdef has_converged(centroids, old_centroids, iterations):

MAX_ITERATIONS = 1000

Constants should be put at the top of the file to be easier to tweak.

Separate calculations from IO, your functionkmeans should return the values and then another function (maybepretty_format_k_means) should create a human readable message.

# k = number of clusters# c = initial list of centroids (if provided)

Multi-character variable names are allowed, rename your variables (and function arguments) to more meaningful names, and then you can delete the comments.

I will agree with Caridorc, but renamingk ink-means will not be such a good idea, Ithink, +1 though!

Movatterモバイル変換

Stack Exchange Network

K-means clustering algorithm in python

2 Answers2

You mustlog in to answer this question.

Linked

Related

Hot Network Questions

Subscribe to RSS