K-Means clustering in Python2

Question 1

I've implemented theK-Means clustering algorithm in Python2, and I wanted to know what remarks you guys could make regarding my code. I've included a small test set with 2D-vectors and 2 classes, but it works with higher dimensions and more classes. Here, it should sort all the elements starting with the same letters in the same classes (exceptak, with is quite in between).

What I'm interested in:

Did I do somethingunpythonic?
Are there unclear or unnecessary parts?
What changes could drastically improve the code's performance?

from __future__ import print_functionfrom operator import add, subfrom _collections import defaultdictimport random as rdimport mathclass Element:    def __init__(self, idf, data, cls='ukn'):        self.idf = idf        self.data = data        self.cls = cls    def __str__(self):        return '%s: %s -> %s' % (str(self.idf), str(self.data), str(self.cls))    def eucl_dist_to(self, other):        diff = map(sub, self.data, other.data)        norm2 = math.sqrt(sum([math.pow(coord, 2) for coord in diff]))        return norm2def main():    test_data_tuple = [('aa', 1, 7), ('ab', 1, 8), ('ac', 1, 9), ('ba', 2, 1), ('bb', 2, 3), ('ad', 2, 8), ('ae', 2, 9),                   ('bc', 3, 2), ('bd', 3, 4), ('af', 3, 7), ('be', 4, 2), ('bf', 4, 3), ('ag', 4, 8),                   ('bg', 5, 1), ('bh', 5, 3), ('ah', 5, 6), ('bi', 6, 2), ('bj', 6, 4), ('ai', 6, 6), ('aj', 6, 7),                   ('bk', 7, 0), ('bl', 7, 2), ('bm', 7, 3), ('ak', 7, 5)]    test_data_elt = list()    for t in test_data_tuple:        test_data_elt.append(Element(t[0], tuple(t[1:])))    # test_data_elt.sort(key=lambda e: e.idf)    results = k_means(test_data_elt, 2, 5)def k_means(elts, nb_classes, nb_steps):    k = nb_classes    #init    centroids = list()    cls_content = defaultdict(list)    for cls_nb in range(k):        cls_name = 'class_'+str(cls_nb)        centroids.append(Element(cls_name, rd.choice(elts).data, cls='centroid'))        cls_content[cls_name] = list()    for itr in range(nb_steps):        # assign        cls_content.clear()        for elt in elts:            min_dist = float('inf')            for c in centroids:                dist = c.eucl_dist_to(elt)                if dist < min_dist:                    min_dist = dist                    best_class = c.idf            cls_content[best_class].append(elt)            elt.cls = best_class        # adjust        for c in centroids:            elts_in = list(cls_content[c.idf])            if len(elts_in) == 0:                sum_elts_data = 0            else:                sum_elts_data = list(elts_in[0].data)                for elt in elts_in[1:]:                    sum_elts_data = map(add, sum_elts_data, elt.data)                sum_elts_data[:] = [e / float(len(elts_in)) for e in sum_elts_data]                c.data = tuple(sum_elts_data)        print('\nIteration %d' % itr)        for c in centroids:            cls_name = c.idf            print('\n'+str(c))            for e in cls_content[cls_name]:                print(e)    return cls_contentif __name__ == '__main__':    main()

Question 2

Did you consider usingscipy.cluster.vq.kmeans?

Question 3

Well, the point was to code the thing myself, I sure assumed that really good implementations already existed, I just wanted to have a go on my own :) Although I didn't know this one, thanks for having linked it!

Question 4

That's fine, I just wanted to check. (Take a look atthis answer for some implementation ideas.)

Question 5

something unpythonic?

Consider using plural identifiers:test_data_tuples,test_data_elts. Later, for example, you use a perfectly lovelyfor elt in elts: idiom.

In Element__init__, I wouldn't mind seeing cls='unknown' spelled out, but I amreally missing a docstring that explains what the heckidf stands for. Also, as near as I can tell,data really is acoord, e.g.(x, y).

Kudos onk = nb_classes. I feel you did a good job of making the public API very clear, and then within the bodyk is conventional.

I can't say I'm a fan ofimport random as rd. It needlessly obscured the fairly obvious meaning ofrd.choice(elts).data.

Thefor itr ... loop exhibits admirable clarity.

Testingif len(elts_in) == 0:, plus themap, seems like it's maybe a bit clunky. I'm just not understanding why empty is a special case, perhaps we could usesum andlambda, or a helper function? It seems like we could always find something sensible to assign toc.data, even in the empty case. Or maybe what I'm really looking for is anif that rapidly moves on to acontinue.

Theprint('\n'+str(c)) is not very pythonic. I'd preferprint('\n%s' % c). Ok, that's the last of the tiny nits I will pick.

Overall, it looks like fairly solid code.

changes could drastically improve the code's performance?

Sorry, don't see anything obvious that would help.

J_H 43.3k3 gold badges38 silver badges158 bronze badges · Answer 1 · 2017-08-26 00:48:50Z

something unpythonic?

Consider using plural identifiers:test_data_tuples,test_data_elts. Later, for example, you use a perfectly lovelyfor elt in elts: idiom.

In Element__init__, I wouldn't mind seeing cls='unknown' spelled out, but I amreally missing a docstring that explains what the heckidf stands for. Also, as near as I can tell,data really is acoord, e.g.(x, y).

Kudos onk = nb_classes. I feel you did a good job of making the public API very clear, and then within the bodyk is conventional.

I can't say I'm a fan ofimport random as rd. It needlessly obscured the fairly obvious meaning ofrd.choice(elts).data.

Thefor itr ... loop exhibits admirable clarity.

Testingif len(elts_in) == 0:, plus themap, seems like it's maybe a bit clunky. I'm just not understanding why empty is a special case, perhaps we could usesum andlambda, or a helper function? It seems like we could always find something sensible to assign toc.data, even in the empty case. Or maybe what I'm really looking for is anif that rapidly moves on to acontinue.

Theprint('\n'+str(c)) is not very pythonic. I'd preferprint('\n%s' % c). Ok, that's the last of the tiny nits I will pick.

Overall, it looks like fairly solid code.

changes could drastically improve the code's performance?

Sorry, don't see anything obvious that would help.

Movatterモバイル変換

Stack Exchange Network

K-Means clustering in Python2

1 Answer1

You mustlog in to answer this question.

Linked

Related

Hot Network Questions

Subscribe to RSS