8
\$\begingroup\$

The following code uses scikit-learn to carry out K-means clustering where \$K = 4\$, on an example related to wine marketing from the book Data Smart. That book uses Excel, but I wanted to learn Python (including NumPy and SciPy), so I implemented this example in that language. (Of course, the K-means clustering itself is done by the scikit-learn package; for now I'm interested in just getting the data into my program and getting the answer out.)

I'm new to Python so any advice on style or ways to write my code in a more idiomatic way would be appreciated.

The CSV files needed (placed in the same directory as the program code) can be produced by downloading "Chapter 2" from the book link above and saving the first and second sheets of the resulting Excel file as CSV.

    # -*- coding: utf-8 -*-
    """
    A program to carry out Kmeans clustering where K=4
    on data relating to wine marketing from book
    "Data Smart: Using Data Science to Transform Information into Insight"

    Requires csv input file OfferInfo.csv with headings
    'Campaign', 'Varietal', 'Minimum Qty (kg)', 'Discount (%)', 'Origin', 'Past Peak'
    and input file Transactions.csv with headings
    'Customer Last Name', 'Offer #'
    """

    #make more similar to Python 3
    from __future__ import print_function, division, absolute_import, unicode_literals

    #other stuff we need to import
    import csv
    import numpy as np
    from sklearn.cluster import KMeans

    #beginning of main program

    #read in OfferInfo.csv
    csvf = open('OfferInfo.csv','rU')
    rows = csv.reader(csvf)
    offer_sheet = [row for row in rows]
    csvf.close()

    #read in Transactions.csv
    csvf = open('Transactions.csv','rU')
    rows = csv.reader(csvf)
    transaction_sheet = [row for row in rows]
    csvf.close()

    #first row of each spreadsheet is column headings, so we remove them
    offer_sheet_data = offer_sheet[1:]
    transaction_sheet_data = transaction_sheet[1:]

    K=4 #four clusters
    num_deals = len(offer_sheet_data) #assume listed offers are distinct

    #find the sorted list of customer last names
    customer_names = []
    for row in transaction_sheet_data:
        customer_names.append(row[0])
    customer_names = list(set(customer_names))
    customer_names.sort()
    num_customers = len(customer_names)

    #create a num_deals x num_customers matrix of which customer took which deal
    deal_customer_matrix = np.zeros((num_deals,num_customers))
    for row in transaction_sheet_data:
        cust_number = customer_names.index(row[0])
        deal_number = int(row[1])
        deal_customer_matrix[deal_number-1,cust_number] = 1
    customer_deal_matrix = deal_customer_matrix.transpose()

    #initialize and carry out clustering
    km = KMeans(n_clusters = K)
    km.fit(customer_deal_matrix)

    #find center of clusters
    centers = km.cluster_centers_
    centers[centers<0] = 0 #the minimization function may find very small negative numbers, we threshold them to 0
    centers = centers.round(2)
    print('\n--------Centers of the four different clusters--------')
    print('Deal\t Cent1\t Cent2\t Cent3\t Cent4')
    for i in range(num_deals):
        print(i+1,'\t',centers[0,i],'\t',centers[1,i],'\t',centers[2,i],'\t',centers[3,i])

    #find which cluster each customer is in
    prediction = km.predict(customer_deal_matrix)
    print('\n--------Which cluster each customer is in--------')
    print('{:<15}\t{}'.format('Customer','Cluster'))
    for i in range(len(prediction)):
        print('{:<15}\t{}'.format(customer_names[i],prediction[i]+1))

    #determine which deals are most often in each cluster
    deal_cluster_matrix = np.zeros((num_deals,K),dtype=np.int)
    print('\n-----How many of each deal involve a customer in each cluster-----')
    print('Deal\t Clust1\t Clust2\t Clust3\t Clust4')
    for i in range(deal_number):
        for j in range(cust_number):
            if deal_customer_matrix[i,j] == 1:
                deal_cluster_matrix[i,prediction[j]] += 1

    for i in range(deal_number):
        print(i+1,'\t',end='')
        for j in range(K):
            print(deal_cluster_matrix[i,j],'\t',end='')
        print()
    print()

    print('The total distance of the solution found is',sum((km.transform(customer_deal_matrix)).min(axis=1)))

asked May 29, 2014 at 18:25 by Martin Leslie
\$\endgroup\$

2 Answers

7
\$\begingroup\$

One obvious improvement would be to break the code up a bit more - identify standalone pieces of functionality and put them into functions, e.g.:

    def read_data(filename):
        csvf = open(filename,'rU')
        rows = csv.reader(csvf)
        data = [row for row in rows]
        csvf.close()
        return data

    offer_sheet = read_data('OfferInfo.csv')
    transaction_sheet = read_data('Transactions.csv')

This reduces duplication and, therefore, the opportunities for errors. It makes development easier, as you can create and test each function separately before connecting them all together. It also makes it easier to improve the functionality, in this case by adopting the `with` context manager:

    def read_data(filename):
        with open(filename, 'rU') as csvf:
            return [row for row in csv.reader(csvf)]

You make that change in only one place and everywhere that calls it benefits.
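To illustrate how a single function absorbs later changes, here is a minimal sketch of the same helper adapted for Python 3. It is an assumption on my part that you would want this, since the code above targets Python 2 with `'rU'` mode. In Python 3 the `csv` documentation recommends opening the file with `newline=''`. The `skip_header` parameter is hypothetical and not part of the code above.

```python
import csv


def read_data(filename, skip_header=False):
    """Return the rows of a CSV file as a list of lists of strings.

    skip_header is a hypothetical convenience flag, not in the original code.
    """
    # newline='' lets the csv module handle line endings itself (Python 3).
    with open(filename, newline='') as csvf:
        rows = list(csv.reader(csvf))
    # Optionally drop the first row, which holds the column headings.
    return rows[1:] if skip_header else rows
```

Callers such as `read_data('OfferInfo.csv', skip_header=True)` would then no longer need the separate `offer_sheet[1:]` slicing step.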


I would also have as little code as possible at the top level. Instead, move it inside an enclosing function, and only call that function if we're running the file directly:

    def analyse(offer_file, transaction_file):
        offer_sheet = read_data(offer_file)
        transaction_sheet = read_data(transaction_file)
        ...

    if __name__ == "__main__":
        analyse('OfferInfo.csv', 'Transactions.csv')

This makes it easier to `import` the code you develop elsewhere without running the test/demo code.
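The same "standalone piece of functionality" advice could apply to the matrix-building step. The following is a minimal sketch, not the original author's code: `build_customer_deal_matrix` is a hypothetical name, and it uses plain lists and a dictionary lookup in place of NumPy and `list.index`, an assumption made only to keep the example self-contained. The sample rows are made up.

```python
def build_customer_deal_matrix(transactions, num_deals):
    """Return (customer_names, matrix).

    matrix[c][d] is 1 if customer c took deal number d + 1, else 0.
    transactions: rows of [customer_last_name, offer_number], header removed.
    """
    customer_names = sorted({row[0] for row in transactions})
    # Map each name to its row index once, instead of searching the list
    # for every transaction.
    index_of = {name: i for i, name in enumerate(customer_names)}
    matrix = [[0] * num_deals for _ in customer_names]
    for row in transactions:
        # Offer numbers in the file start at 1, list indices at 0.
        matrix[index_of[row[0]]][int(row[1]) - 1] = 1
    return customer_names, matrix


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [["Smith", "2"], ["Jones", "1"], ["Smith", "3"]]
    names, matrix = build_customer_deal_matrix(sample, num_deals=3)
    print(names)
    print(matrix)
```

The nested list can be converted with `np.array` before fitting, and because the function is importable it can be checked on a few hand-written rows without the CSV files being present.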

answered May 29, 2014 at 20:37 by jonrsharpe
\$\endgroup\$
3
\$\begingroup\$

Because @jonrsharpe has covered most of the functional improvements, I'll focus on style.

First, take a look at PEP8, the official Python style guide. It holds a lot of useful information about how Python code should look and feel.

Now onto my comments:

  1. Beware of simple comments. Comments help explain the more intricate lines of code and the reasons behind them. Comments like `#beginning of main program` offer little to no value in terms of comprehension.

    Conventional comments in Python have a space after the `#`, and the first word (unless it's an identifier) is capitalized:

    # This is a better-styled comment.
  2. Put spaces after commas. Whenever you have a comma-separated list in code (bulk declarations, parameters, etc.), always put a single space after each comma.

  3. Use `str.join` to make printing easier. Here is one of your print statements:

    print(i+1,'\t',centers[0,i],'\t',centers[1,i],'\t',centers[2,i],'\t',centers[3,i])

    `join` helps remove the repeated use of hard-coded values:

    print('\t'.join(str(val) for val in [i+1] + [centers[j, i] for j in range(4)]))

    The above may seem more complex than the original version. Visually, it may be. However, it's more flexible and Pythonic.

  4. Using `print()`. Most of the time, instead of using a bare `print()` to print a blank line, you can add a newline to another print statement:

    for i in range(deal_number):
        print(i+1,'\t',end='')
        for j in range(K):
            print(deal_cluster_matrix[i,j],'\t',end='')
        print()
    print()
    print('The total distance of the solution found is' ... )

    becomes:

    for i in range(deal_number):
        print('\n', i+1, '\t', end='')
        for j in range(K):
            print(deal_cluster_matrix[i,j], '\t', end='')
    print('\nThe total distance of the solution found is' ... )

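Points 3 and 4 can be combined: build each row with `join`, then join the rows with newlines so that no bare `print()` calls are needed. This is only a sketch under stated assumptions: `format_row` is a hypothetical helper, and the `centers` values are made up rather than taken from the real data.

```python
def format_row(label, values, sep="\t"):
    """Join a row label and its values into one delimited line."""
    return sep.join(str(item) for item in [label, *values])


# Made-up cluster centers, indexed as centers[cluster][deal].
centers = [[0.0, 0.5], [1.0, 0.25]]
num_deals = 2

# One line per deal: the deal number followed by its value in each cluster.
lines = [format_row(i + 1, (c[i] for c in centers)) for i in range(num_deals)]
print("\n".join(lines))
```

Because the number of clusters is never written out, the same code keeps working if `K` changes.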
answered May 30, 2014 at 13:47 by BeetDemGuise
\$\endgroup\$
