Clustering using k-medoids

Question 1

This is the program function code for clustering using k-medoids

def kMedoids(D, k, tmax=100):    # determine dimensions of distance matrix D    m, n = D.shape    # randomly initialize an array of k medoid indices    M = np.sort(np.random.choice(n, k)    # create a copy of the array of medoid indices    Mnew = np.copy(M)    # initialize a dictionary to represent clusters    C = {}    for t in range(tmax):    # determine clusters, i.e. arrays of data indices        J = np.argmin(D[:,M], axis=1)        for kappa in range(k):            C[kappa] = np.where(J==kappa)[0]        # update cluster medoids        for kappa in range(k):            J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)            j = np.argmin(J)            Mnew[kappa] = C[kappa][j]            np.sort(Mnew)            # check for convergence            if np.array_equal(M, Mnew):                break                M = np.copy(Mnew)            else:                # final update of cluster memberships                J = np.argmin(D[:,M], axis=1)                for kappa in range(k):                    C[kappa] = np.where(J==kappa)[0]                    # return results            return M, C

and the I will call The function KMedoids with this program, I think my program run slowly in line D = Pairwise_distances(arraydata, metric='euclidean')

D = pairwise_distances(arraydata,metric='euclidean')# split into 2 clustersM, C = kMedoids(D, 2)print('medoids:')for point_idx in M:    print(arraydata[point_idx] )print('')# array for get labeltemp = []indeks = []print('clustering result:')for label in C:    for point_idx in C[label]:        print('label {0}:　{1}'.format(label, arraydata[point_idx]))        temp.append(label)        indeks.append(point_idx)

This is the result from this program

clustering result:label 0:　[0.00000000e+00 0.00000000e+00 1.00000000e+00 1.00000000e+001.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+001.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+001.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+001.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+001.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+001.00000000e+00 0.00000000e+00 1.00000000e+00 1.00000000e+00

Why my result of my program is slow for large data and almost have a result "Memory Error"? I hope someone can help me to review this code to improve its performance to get the result and process large amounts of data.

Question 2

This is still on-topic since the error is due to slow performance (on large sets of data).

Question 3

return M, C looks misindented. Please doublecheck your indentation.

Question 4

This is exactly the implementation found inNumPy and SciPy Recipes for Data Science on k-Medoids Clustering but with some indentation mistakes (probably due to the copy & paste process). To understand the indentation being used in the algorithm is might be worth to read the answers on thefor...else syntax:

https://stackoverflow.com/questions/9979970/why-does-python-use-else-after-for-and-while-loops#9980752

A working script with the correct indentation and some improvement is found athttps://github.com/letiantian/kmedoids/blob/master/kmedoids.py :

import numpy as npimport randomdef kMedoids(D, k, tmax=100):   # determine dimensions of distance matrix D   m, n = D.shape   if k > n:       raise Exception('too many medoids')   # find a set of valid initial cluster medoid indices since we   # can't seed different clusters with two points at the same location   valid_medoid_inds = set(range(n))   invalid_medoid_inds = set([])   rs,cs = np.where(D==0)   # the rows, cols must be shuffled because we will keep the first duplicate below   index_shuf = list(range(len(rs)))   np.random.shuffle(index_shuf)   rs = rs[index_shuf]   cs = cs[index_shuf]   for r,c in zip(rs,cs):       # if there are two points with a distance of 0...       # keep the first one for cluster init       if r < c and r not in invalid_medoid_inds:           invalid_medoid_inds.add(c)   valid_medoid_inds = list(valid_medoid_inds - invalid_medoid_inds)   if k > len(valid_medoid_inds):       raise Exception('too many medoids (after removing {} duplicate points)'.format(        len(invalid_medoid_inds)))   # randomly initialize an array of k medoid indices   M = np.array(valid_medoid_inds)   np.random.shuffle(M)   M = np.sort(M[:k])   # create a copy of the array of medoid indices   Mnew = np.copy(M)   # initialize a dictionary to represent clusters   C = {}   for t in xrange(tmax):       # determine clusters, i. e. arrays of data indices       J = np.argmin(D[:,M], axis=1)       for kappa in range(k):           C[kappa] = np.where(J==kappa)[0]       # update cluster medoids       for kappa in range(k):           J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)           j = np.argmin(J)           Mnew[kappa] = C[kappa][j]       np.sort(Mnew)       # check for convergence       if np.array_equal(M, Mnew):           break       M = np.copy(Mnew)   else:       # final update of cluster memberships       J = np.argmin(D[:,M], axis=1)       for kappa in range(k):           C[kappa] = np.where(J==kappa)[0]   # return results   return M, C

Note that theelse as well as thereturn are at the same indentation as thefor-loop.

However, I recommend to change the second inner loop

# update cluster medoidsfor kappa in range(k):       J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)       j = np.argmin(J)       Mnew[kappa] = C[kappa][j]

to

# update cluster medoidsfor kappa in range(k):       J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)       # Fix for the low-idx bias by J.Nagele (2019):       shuffled_idx = np.arange(len(J))       np.random.shuffle(shuffled_idx)        j = shuffled_idx[np.argmin(J[shuffled_idx])]        Mnew[kappa] = C[kappa][j]

as otherwise the resulting medoids depend on the order of the input. That is because the argmin command returns always the lowest index in case multiple potential medoids have equal distances to the cluster members.

Question 5

It might be a better answer if you put the part about the indentation first, remember since this is code review the answer should be mostly about the users code.

Question 6

Please quote the relevant parts of your links so readers don't have to visit other sites and so that your answer will still make sense even if the link(s) no longer work.

Question 7

OK I will change it accordingly

Question 8

I hope now it got better!

Jojo 1212 bronze badges · Answer 1 · 2020-06-10 13:24:26Z

This is exactly the implementation found inNumPy and SciPy Recipes for Data Science on k-Medoids Clustering but with some indentation mistakes (probably due to the copy & paste process). To understand the indentation being used in the algorithm is might be worth to read the answers on thefor...else syntax:

https://stackoverflow.com/questions/9979970/why-does-python-use-else-after-for-and-while-loops#9980752

A working script with the correct indentation and some improvement is found athttps://github.com/letiantian/kmedoids/blob/master/kmedoids.py :

import numpy as npimport randomdef kMedoids(D, k, tmax=100):   # determine dimensions of distance matrix D   m, n = D.shape   if k > n:       raise Exception('too many medoids')   # find a set of valid initial cluster medoid indices since we   # can't seed different clusters with two points at the same location   valid_medoid_inds = set(range(n))   invalid_medoid_inds = set([])   rs,cs = np.where(D==0)   # the rows, cols must be shuffled because we will keep the first duplicate below   index_shuf = list(range(len(rs)))   np.random.shuffle(index_shuf)   rs = rs[index_shuf]   cs = cs[index_shuf]   for r,c in zip(rs,cs):       # if there are two points with a distance of 0...       # keep the first one for cluster init       if r < c and r not in invalid_medoid_inds:           invalid_medoid_inds.add(c)   valid_medoid_inds = list(valid_medoid_inds - invalid_medoid_inds)   if k > len(valid_medoid_inds):       raise Exception('too many medoids (after removing {} duplicate points)'.format(        len(invalid_medoid_inds)))   # randomly initialize an array of k medoid indices   M = np.array(valid_medoid_inds)   np.random.shuffle(M)   M = np.sort(M[:k])   # create a copy of the array of medoid indices   Mnew = np.copy(M)   # initialize a dictionary to represent clusters   C = {}   for t in xrange(tmax):       # determine clusters, i. e. arrays of data indices       J = np.argmin(D[:,M], axis=1)       for kappa in range(k):           C[kappa] = np.where(J==kappa)[0]       # update cluster medoids       for kappa in range(k):           J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)           j = np.argmin(J)           Mnew[kappa] = C[kappa][j]       np.sort(Mnew)       # check for convergence       if np.array_equal(M, Mnew):           break       M = np.copy(Mnew)   else:       # final update of cluster memberships       J = np.argmin(D[:,M], axis=1)       for kappa in range(k):           C[kappa] = np.where(J==kappa)[0]   # return results   return M, C

Note that theelse as well as thereturn are at the same indentation as thefor-loop.

However, I recommend to change the second inner loop

# update cluster medoidsfor kappa in range(k):       J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)       j = np.argmin(J)       Mnew[kappa] = C[kappa][j]

to

# update cluster medoidsfor kappa in range(k):       J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)       # Fix for the low-idx bias by J.Nagele (2019):       shuffled_idx = np.arange(len(J))       np.random.shuffle(shuffled_idx)        j = shuffled_idx[np.argmin(J[shuffled_idx])]        Mnew[kappa] = C[kappa][j]

as otherwise the resulting medoids depend on the order of the input. That is because the argmin command returns always the lowest index in case multiple potential medoids have equal distances to the cluster members.

It might be a better answer if you put the part about the indentation first, remember since this is code review the answer should be mostly about the users code.
Please quote the relevant parts of your links so readers don't have to visit other sites and so that your answer will still make sense even if the link(s) no longer work.

Movatterモバイル変換

Stack Exchange Network

Clustering using k-medoids

1 Answer1

You mustlog in to answer this question.

Related

Hot Network Questions

Subscribe to RSS