6
\$\begingroup\$

I want to optimize my Apriori algorithm for speed:

from itertools import combinationsimport pandas as pdimport numpy as nptrans=pd.read_table('output.txt', header=None,index_col=0)def apriori(trans, support=0.01, minlen=1):    ts=pd.get_dummies(trans.unstack().dropna()).groupby(level=1).sum()    collen, rowlen  =ts.shape    #-------------Max leng (not used)    #tssum=ts.sum(axis=1)    #maxlen=int(tssum.loc[tssum.idxmax()])    pattern = []    for cnum in range(minlen, rowlen+1):        for cols in combinations(ts, cnum):            patsup = ts[list(cols)].all(axis=1).sum()            patsup=float(patsup)/collen            pattern.append([",".join(cols), patsup])    sdf = pd.DataFrame(pattern, columns=["Pattern", "Support"])    results=sdf[sdf.Support >= support]    return results

When you input a dataframe of transactions:

>>> trans        1  2    3    40                     11      a  b    c  NaN666     a  d    e  NaN10101   b  c    d  NaN1010    a  b    c    d414147  b  c  NaN  NaN10101   a  b    d  NaN1242    d  e  NaN  NaN101     a  b    c  NaN411     c  d    e  NaN444     a  b    c  NaN[10 rows x 4 columns]

It will yield:

Ap=apriori(trans)print Ap>>>    Pattern  Support0        a      0.61        b      0.72        c      0.73        d      0.64        e      0.35      a,b      0.56      a,c      0.47      a,d      0.38      a,e      0.19      b,c      0.610     b,d      0.312     c,d      0.313     c,e      0.114     d,e      0.315   a,b,c      0.416   a,b,d      0.218   a,c,d      0.120   a,d,e      0.121   b,c,d      0.224   c,d,e      0.1

I want to know if this can be optimized further so that it can run faster on large datasets. I also want to know if there a way to use purely Pandas without combinations from itertools.

200_success's user avatar
200_success
146k22 gold badges191 silver badges481 bronze badges
askedDec 26, 2013 at 7:02
user3084006's user avatar
\$\endgroup\$
2
  • \$\begingroup\$You may want to check out pyFIM for an alternative approach:borgelt.net/pyfim.html\$\endgroup\$CommentedApr 17, 2014 at 19:42
  • \$\begingroup\$how your output.txt dataset looks\$\endgroup\$CommentedFeb 22, 2017 at 4:49

1 Answer1

6
\$\begingroup\$

First off, this is just a part of the Apriori algorithm. Here, you're just finding frequent itemsets. Apriori continues to find association rules in those itemsets.

Also, usingcombinations() like this is not optimal. For example, if we know that the combinationAB does not enjoy reasonable support, we do not need to consider any combination that containsAB anymore (ABC,ABD, etc. will all be infrequent as well). Your algorithm does not take this into consideration.

This means that your algorithm will check all the possible combinations (2n, where n is the number of possible items), while in fact we can prune the search tree as above and reduce this complexity drastically (depending on the density of the dataset).

answeredNov 27, 2015 at 15:20
\$\endgroup\$

You mustlog in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.