K-means clustering in Python

Question 1

The following code uses scikit-learn to carry out K-means clustering with $K = 4$ on a wine-marketing example from the book Data Smart. That book uses Excel, but I wanted to learn Python (including NumPy and SciPy), so I implemented the example in that language. The K-means clustering itself is done by the scikit-learn package; for now I'm mainly interested in getting the data into my program and getting the answer out.

I'm new to Python so any advice on style or ways to write my code in a more idiomatic way would be appreciated.

The CSV files needed (in the same directory as the program code) can be produced by downloading "Chapter 2" from the book link above and saving the first and second sheets of the resulting Excel file as CSV.

```
# -*- coding: utf-8 -*-
"""
A program to carry out Kmeans clustering where K=4
on data relating to wine marketing from book
"Data Smart: Using Data Science to Transform Information into Insight"

Requires csv input file OfferInfo.csv with headings
'Campaign', 'Varietal', 'Minimum Qty (kg)', 'Discount (%)', 'Origin', 'Past Peak'
and input file Transactions.csv with headings
'Customer Last Name', 'Offer #'
"""

#make more similar to Python 3
from __future__ import print_function, division, absolute_import, unicode_literals

#other stuff we need to import
import csv
import numpy as np
from sklearn.cluster import KMeans

#beginning of main program

#read in OfferInfo.csv
csvf = open('OfferInfo.csv','rU')
rows = csv.reader(csvf)
offer_sheet = [row for row in rows]
csvf.close()

#read in Transactions.csv
csvf = open('Transactions.csv','rU')
rows = csv.reader(csvf)
transaction_sheet = [row for row in rows]
csvf.close()

#first row of each spreadsheet is column headings, so we remove them
offer_sheet_data = offer_sheet[1:]
transaction_sheet_data = transaction_sheet[1:]

K=4 #four clusters
num_deals = len(offer_sheet_data) #assume listed offers are distinct

#find the sorted list of customer last names
customer_names = []
for row in transaction_sheet_data:
    customer_names.append(row[0])
customer_names = list(set(customer_names))
customer_names.sort()
num_customers = len(customer_names)

#create a num_deals x num_customers matrix of which customer took which deal
deal_customer_matrix = np.zeros((num_deals,num_customers))
for row in transaction_sheet_data:
    cust_number = customer_names.index(row[0])
    deal_number = int(row[1])
    deal_customer_matrix[deal_number-1,cust_number] = 1
customer_deal_matrix = deal_customer_matrix.transpose()

#initialize and carry out clustering
km = KMeans(n_clusters = K)
km.fit(customer_deal_matrix)

#find center of clusters
centers = km.cluster_centers_
centers[centers<0] = 0 #the minimization function may find very small negative numbers, we threshold them to 0
centers = centers.round(2)
print('\n--------Centers of the four different clusters--------')
print('Deal\t Cent1\t Cent2\t Cent3\t Cent4')
for i in range(num_deals):
    print(i+1,'\t',centers[0,i],'\t',centers[1,i],'\t',centers[2,i],'\t',centers[3,i])

#find which cluster each customer is in
prediction = km.predict(customer_deal_matrix)
print('\n--------Which cluster each customer is in--------')
print('{:<15}\t{}'.format('Customer','Cluster'))
for i in range(len(prediction)):
    print('{:<15}\t{}'.format(customer_names[i],prediction[i]+1))

#determine which deals are most often in each cluster
deal_cluster_matrix = np.zeros((num_deals,K),dtype=np.int)

print('\n-----How many of each deal involve a customer in each cluster-----')
print('Deal\t Clust1\t Clust2\t Clust3\t Clust4')
for i in range(deal_number):
    for j in range(cust_number):
        if deal_customer_matrix[i,j] == 1:
            deal_cluster_matrix[i,prediction[j]] += 1

for i in range(deal_number):
    print(i+1,'\t',end='')
    for j in range(K):
        print(deal_cluster_matrix[i,j],'\t',end='')
    print()
print()

print('The total distance of the solution found is',sum((km.transform(customer_deal_matrix)).min(axis=1)))
```

Answer 1

One obvious improvement would be to break the code up a bit more - identify standalone pieces of functionality and put them into functions, e.g.:

```
def read_data(filename):
    csvf = open(filename,'rU')
    rows = csv.reader(csvf)
    data = [row for row in rows]
    csvf.close()
    return data

offer_sheet = read_data('OfferInfo.csv')
transaction_sheet = read_data('Transactions.csv')
```

This reduces duplication and, therefore, the scope for errors. It allows easier development, as you can create and test each function separately before connecting it all together. It also makes it easier to improve the functionality, in this case by adopting the `with` context manager:

```
def read_data(filename):
    with open(filename, 'rU') as csvf:
        return [row for row in csv.reader(csvf)]
```

You make that change in only one place and everywhere that calls it benefits.
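If the script is later run on Python 3, note that the `'rU'` mode is no longer accepted there (the `U` flag was removed in Python 3.11), and the `csv` module documentation recommends opening files with `newline=''`. A minimal sketch of the same helper under that assumption; the `encoding` parameter and its default are my own additions, not something from the question:

```python
import csv


def read_data(filename, encoding='utf-8'):
    """Return every row of a CSV file as a list of lists of strings."""
    # newline='' lets the csv module handle line endings itself,
    # which replaces the universal-newline 'U' flag used on Python 2.
    with open(filename, newline='', encoding=encoding) as csvf:
        return list(csv.reader(csvf))
```

Because the change lives inside `read_data`, the callers that load `OfferInfo.csv` and `Transactions.csv` stay exactly as they were.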

I would also have as little code as possible at the top level. Instead, move it inside an enclosing function, and only call that function if we're running the file directly:

```
def analyse(offer_file, transaction_file):
    offer_sheet = read_data(offer_file)
    transaction_sheet = read_data(transaction_file)
    ...

if __name__ == "__main__":
    analyse('OfferInfo.csv', 'Transactions.csv')
```

This makes it easier to `import` the code you develop elsewhere without running the test/demo code.
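As a sketch of how another piece could be pulled out and tested on its own, the customer-name step from the question might become a small function. The name `index_customers` and the sample rows are hypothetical, and the dictionary lookup is my substitution for the repeated `customer_names.index(...)` calls, not something the question used:

```python
def index_customers(transaction_rows):
    """Map each distinct customer name to its position in sorted order."""
    # The first column of each row holds the customer's last name.
    names = sorted({row[0] for row in transaction_rows})
    return {name: position for position, name in enumerate(names)}


if __name__ == "__main__":
    # Made-up rows in the shape of Transactions.csv without its header.
    sample = [['Smith', '2'], ['Adams', '18'], ['Smith', '24']]
    print(index_customers(sample))
```

A dictionary gives a constant-time lookup per transaction, whereas `list.index` scans the list each time; for a data set this small the difference is mostly about readability.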

Answer 2

Because @jonrsharpe has already covered most of the functional improvements, I'll focus on style improvements.

Firstly, take a look at PEP 8, the official Python style guide. It holds a lot of useful guidance on how Python code should look and feel.

Now onto my comments:

Beware of trivial comments. Comments should explain the more intricate lines of code and the reasons behind them. Comments like `#beginning of main program` offer little to no value in terms of comprehension.
Conventional comments in Python have a space after the `#`, and the first word (unless it's an identifier) is capitalized:
```
# This is a better-styled comment.
```
Put spaces after commas. Whenever you have a comma-separated list in code (bulk declarations, parameters, etc.), always put a single space after each comma.
Use `str.join` to make printing easier. Here is one of your print statements:
```
print(i+1,'\t',centers[0,i],'\t',centers[1,i],'\t',centers[2,i],'\t',centers[3,i])
```
`join` helps remove the repeated hard-coded values:
```
print('\t'.join(str(val) for val in [i+1] + [centers[j, i] for j in range(4)]))
```
The above may seem more complex than the original version, and visually it may be. However, it's more flexible and Pythonic.
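A sketch of that flexibility, assuming `centers` is a plain nested list indexed as `centers[cluster][deal]` rather than the NumPy array from the question; the helper name `format_center_row` and the numbers are made up for illustration. Iterating over `centers` itself instead of `range(4)` means the row keeps working if K changes:

```python
def format_center_row(deal_index, centers):
    """Build one tab-separated output row for the deal at deal_index."""
    # 1-based deal number first, then that deal's value in every cluster.
    values = [deal_index + 1] + [cluster[deal_index] for cluster in centers]
    return '\t'.join(str(value) for value in values)


# Hypothetical centers for K = 3 clusters and 2 deals.
centers = [[0.1, 0.0], [0.5, 0.25], [0.0, 1.0]]
for i in range(2):
    print(format_center_row(i, centers))
```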

Using `print()`. Most of the time, instead of using a bare `print()` to print a blank line, you can add a newline to another print statement:

```
for i in range(deal_number):
    print(i+1,'\t',end='')
    for j in range(K):
        print(deal_cluster_matrix[i,j],'\t',end='')
    print()
print()
print('The total distance of the solution found is' ... )
```

becomes:

```
for i in range(deal_number):
    print('\n', i+1, '\t', end='')
    for j in range(K):
        print(deal_cluster_matrix[i,j], '\t', end='')
print('\nThe total distance of the solution found is' ... )
```
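One thing to watch with either version: `print` inserts its separator (a space by default) between arguments, so a call such as `print('\n', i+1, '\t', end='')` starts each row with a space. A sketch that sidesteps this by building the whole table as one string and printing it once; `format_table` and the counts are hypothetical, and plain nested lists stand in for the NumPy matrix:

```python
def format_table(matrix):
    """Return the matrix as text: one numbered, tab-separated line per row."""
    lines = []
    for number, row in enumerate(matrix, start=1):
        cells = [str(number)] + [str(value) for value in row]
        lines.append('\t'.join(cells))
    return '\n'.join(lines)


# Made-up counts for 2 deals and 4 clusters.
deal_cluster_counts = [[3, 0, 1, 6], [0, 4, 0, 2]]
print(format_table(deal_cluster_counts))
```

Returning a string rather than printing inside the loops also makes the formatting easy to check in isolation.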

Accepted answer by jonrsharpe, posted 2014-05-29 20:37:09Z (the answer above that begins "One obvious improvement").