Here is my implementation of the k-means algorithm in Python. I would love feedback on how it could be improved and on any logical errors you may see. I've left out most of the boilerplate, such as command-line argument parsing and error handling for data read in from the CSV file, and included just the meat of the algorithm.
```python
import os

import numpy as np


# kmeans clustering algorithm
# data = set of data points
# k = number of clusters
# c = initial list of centroids (if provided)
#
def kmeans(data, k, c):
    centroids = []
    centroids = randomize_centroids(data, centroids, k)

    old_centroids = [[] for i in range(k)]

    iterations = 0
    while not (has_converged(centroids, old_centroids, iterations)):
        iterations += 1

        clusters = [[] for i in range(k)]

        # assign data points to clusters
        clusters = euclidean_dist(data, centroids, clusters)

        # recalculate centroids
        index = 0
        for cluster in clusters:
            old_centroids[index] = centroids[index]
            centroids[index] = np.mean(cluster, axis=0).tolist()
            index += 1

    print("The total number of data instances is: " + str(len(data)))
    print("The total number of iterations necessary is: " + str(iterations))
    print("The means of each cluster are: " + str(centroids))
    print("The clusters are as follows:")
    for cluster in clusters:
        print("Cluster with a size of " + str(len(cluster)) + " starts here:")
        print(np.array(cluster).tolist())
        print("Cluster ends here.")

    return


# Calculates euclidean distance between
# a data point and all the available cluster
# centroids.
def euclidean_dist(data, centroids, clusters):
    for instance in data:
        # Find which centroid is the closest
        # to the given data point.
        mu_index = min([(i[0], np.linalg.norm(instance - centroids[i[0]]))
                        for i in enumerate(centroids)],
                       key=lambda t: t[1])[0]
        try:
            clusters[mu_index].append(instance)
        except KeyError:
            clusters[mu_index] = [instance]

    # If any cluster is empty then assign one point
    # from the data set randomly so as to not have empty
    # clusters and 0 means.
    for cluster in clusters:
        if not cluster:
            cluster.append(data[np.random.randint(0, len(data), size=1)].flatten().tolist())

    return clusters


# randomize initial centroids
def randomize_centroids(data, centroids, k):
    for cluster in range(0, k):
        centroids.append(data[np.random.randint(0, len(data), size=1)].flatten().tolist())
    return centroids


# check if clusters have converged
def has_converged(centroids, old_centroids, iterations):
    MAX_ITERATIONS = 1000
    if iterations > MAX_ITERATIONS:
        return True
    return old_centroids == centroids
```

- See this question and its answer. – Gareth Rees, Feb 12, 2015 at 16:21
- It would be nice if you could accept the answer. Moreover, where is `c` used in your code? – gsamaras, Jan 31, 2016 at 13:43
- What is the data type for `data`? NumPy array or list of lists/tuple of tuples? – webmaker, Apr 28, 2016 at 17:40
2 Answers
Avoid comments that are identical to the code:
```python
# check if clusters have converged    <-- Remove this
def has_converged(centroids, old_centroids, iterations):
    MAX_ITERATIONS = 1000
```

Constants should be put at the top of the file to be easier to tweak.
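As a sketch, hoisting the constant to module scope (and dropping the redundant comment) might look like this, reusing the question's `has_converged`:

```python
# Module-level constant: visible and tweakable without digging
# into the function body.
MAX_ITERATIONS = 1000

def has_converged(centroids, old_centroids, iterations):
    if iterations > MAX_ITERATIONS:
        return True
    return old_centroids == centroids

print(has_converged([[1.0]], [[1.0]], 5))  # True: centroids stopped moving
print(has_converged([[1.0]], [[2.0]], 5))  # False: still moving
```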
Separate calculations from I/O: your function `kmeans` should return the values, and then another function (maybe `pretty_format_k_means`) should create a human-readable message.
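A minimal sketch of that split; `pretty_format_k_means` is the name the answer suggests, and the exact fields returned are an assumption on my part:

```python
# Pure formatting: builds the message but does not print it.
def pretty_format_k_means(clusters, centroids, iterations):
    lines = ["The total number of iterations necessary is: %d" % iterations,
             "The means of each cluster are: %s" % centroids]
    for cluster in clusters:
        lines.append("Cluster with a size of %d: %s" % (len(cluster), cluster))
    return "\n".join(lines)

# kmeans would then end with
#     return clusters, centroids, iterations
# and the caller decides whether (and where) to print:
print(pretty_format_k_means([[[0.0], [1.0]]], [[0.5]], 3))
```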
```python
# k = number of clusters
# c = initial list of centroids (if provided)
```

Multi-character variable names are allowed; rename your variables (and function arguments) to more meaningful names, and then you can delete the comments.
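For illustration only (the names below are mine, not the poster's), self-documenting parameters make those comments unnecessary:

```python
def kmeans(data, num_clusters, initial_centroids=None):
    """Cluster `data` into `num_clusters` groups, optionally seeded
    with `initial_centroids`."""
    # ... algorithm body unchanged from the question ...
    return []  # placeholder so this sketch is runnable

# The call site now documents itself:
kmeans([[1.0], [2.0], [3.0]], num_clusters=2)
```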
- I will agree with Caridorc, but renaming `k` in k-means will not be such a good idea, I think. +1 though! – gsamaras, Jan 29, 2016 at 23:05
To expand on Caridorc's answer, I would change this:
```python
# c = initial list of centroids (if provided)
def kmeans(data, k, c):
```

to:

```python
# c = initial list of centroids (if provided)
def kmeans(data, k, c=None):
```

since `c` may not be provided; that is why I used the `None` default.
However, note that in your code, you do not usec!
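One sketch of how `c` could actually be honoured: fall back to random initialisation only when no centroids are supplied. `init_centroids` is a hypothetical helper name of mine; `randomize_centroids` is copied from the question:

```python
import numpy as np

# Copied from the question.
def randomize_centroids(data, centroids, k):
    for cluster in range(0, k):
        centroids.append(data[np.random.randint(0, len(data), size=1)].flatten().tolist())
    return centroids

def init_centroids(data, k, c=None):
    # Use the caller-supplied centroids when given, otherwise randomize.
    if c is not None:
        return c
    return randomize_centroids(data, [], k)

data = np.array([[0.0], [1.0], [2.0], [3.0]])
print(init_centroids(data, 2, c=[[0.0], [3.0]]))  # caller-provided centroids
print(init_centroids(data, 2))                    # two random data points
```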