6
\$\begingroup\$

I've implemented theK-Means clustering algorithm in Python2, and I wanted to know what remarks you guys could make regarding my code. I've included a small test set with 2D-vectors and 2 classes, but it works with higher dimensions and more classes. Here, it should sort all the elements starting with the same letters in the same classes (exceptak, with is quite in between).

What I'm interested in:

  • Did I do somethingunpythonic?
  • Are there unclear or unnecessary parts?
  • What changes could drastically improve the code's performance?

from __future__ import print_functionfrom operator import add, subfrom _collections import defaultdictimport random as rdimport mathclass Element:    def __init__(self, idf, data, cls='ukn'):        self.idf = idf        self.data = data        self.cls = cls    def __str__(self):        return '%s: %s -> %s' % (str(self.idf), str(self.data), str(self.cls))    def eucl_dist_to(self, other):        diff = map(sub, self.data, other.data)        norm2 = math.sqrt(sum([math.pow(coord, 2) for coord in diff]))        return norm2def main():    test_data_tuple = [('aa', 1, 7), ('ab', 1, 8), ('ac', 1, 9), ('ba', 2, 1), ('bb', 2, 3), ('ad', 2, 8), ('ae', 2, 9),                   ('bc', 3, 2), ('bd', 3, 4), ('af', 3, 7), ('be', 4, 2), ('bf', 4, 3), ('ag', 4, 8),                   ('bg', 5, 1), ('bh', 5, 3), ('ah', 5, 6), ('bi', 6, 2), ('bj', 6, 4), ('ai', 6, 6), ('aj', 6, 7),                   ('bk', 7, 0), ('bl', 7, 2), ('bm', 7, 3), ('ak', 7, 5)]    test_data_elt = list()    for t in test_data_tuple:        test_data_elt.append(Element(t[0], tuple(t[1:])))    # test_data_elt.sort(key=lambda e: e.idf)    results = k_means(test_data_elt, 2, 5)def k_means(elts, nb_classes, nb_steps):    k = nb_classes    #init    centroids = list()    cls_content = defaultdict(list)    for cls_nb in range(k):        cls_name = 'class_'+str(cls_nb)        centroids.append(Element(cls_name, rd.choice(elts).data, cls='centroid'))        cls_content[cls_name] = list()    for itr in range(nb_steps):        # assign        cls_content.clear()        for elt in elts:            min_dist = float('inf')            for c in centroids:                dist = c.eucl_dist_to(elt)                if dist < min_dist:                    min_dist = dist                    best_class = c.idf            cls_content[best_class].append(elt)            elt.cls = best_class        # adjust        for c in centroids:            elts_in = list(cls_content[c.idf])            if len(elts_in) == 0:                sum_elts_data = 0            else:                sum_elts_data = list(elts_in[0].data)                for elt in elts_in[1:]:                    sum_elts_data = map(add, sum_elts_data, elt.data)                sum_elts_data[:] = [e / float(len(elts_in)) for e in sum_elts_data]                c.data = tuple(sum_elts_data)        print('\nIteration %d' % itr)        for c in centroids:            cls_name = c.idf            print('\n'+str(c))            for e in cls_content[cls_name]:                print(e)    return cls_contentif __name__ == '__main__':    main()
askedJan 6, 2017 at 17:05
BusyAnt's user avatar
\$\endgroup\$
3
  • \$\begingroup\$Did you consider usingscipy.cluster.vq.kmeans?\$\endgroup\$CommentedJan 9, 2017 at 19:41
  • \$\begingroup\$Well, the point was to code the thing myself, I sure assumed that really good implementations already existed, I just wanted to have a go on my own :) Although I didn't know this one, thanks for having linked it!\$\endgroup\$CommentedJan 10, 2017 at 11:30
  • 1
    \$\begingroup\$That's fine, I just wanted to check. (Take a look atthis answer for some implementation ideas.)\$\endgroup\$CommentedJan 14, 2017 at 19:20

1 Answer1

3
\$\begingroup\$

something unpythonic?

Consider using plural identifiers:test_data_tuples,test_data_elts. Later, for example, you use a perfectly lovelyfor elt in elts: idiom.

In Element__init__, I wouldn't mind seeing cls='unknown' spelled out, but I amreally missing a docstring that explains what the heckidf stands for. Also, as near as I can tell,data really is acoord, e.g.(x, y).

Kudos onk = nb_classes. I feel you did a good job of making the public API very clear, and then within the bodyk is conventional.

I can't say I'm a fan ofimport random as rd. It needlessly obscured the fairly obvious meaning ofrd.choice(elts).data.

Thefor itr ... loop exhibits admirable clarity.

Testingif len(elts_in) == 0:, plus themap, seems like it's maybe a bit clunky. I'm just not understanding why empty is a special case, perhaps we could usesum andlambda, or a helper function? It seems like we could always find something sensible to assign toc.data, even in the empty case. Or maybe what I'm really looking for is anif that rapidly moves on to acontinue.

Theprint('\n'+str(c)) is not very pythonic. I'd preferprint('\n%s' % c). Ok, that's the last of the tiny nits I will pick.

Overall, it looks like fairly solid code.

changes could drastically improve the code's performance?

Sorry, don't see anything obvious that would help.

answeredAug 26, 2017 at 0:48
J_H's user avatar
\$\endgroup\$

You mustlog in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.