48

On a concrete problem, say I have a DataFrame DF

     word  tag count0    a     S    301    the   S    202    a     T    603    an    T    54    the   T    10

I want to find,for every "word", the "tag" that has the most "count". So the return would be something like

     word  tag count1    the   S    202    a     T    603    an    T    5

I don't care about the count column or if the order/Index is original or messed up. Returning a dictionary {'the' : 'S', ...} is just fine.

I hope I can do

DF.groupby(['word']).agg(lambda x: x['tag'][ x['count'].argmax() ] )

but it doesn't work. I can't access column information.

More abstractly,what does thefunction in agg(function) see as its argument?

btw, is .agg() the same as .aggregate() ?

Many thanks.

Brad Solomon's user avatar
Brad Solomon
41.3k39 gold badges167 silver badges261 bronze badges
askedMar 10, 2013 at 13:16
jf328's user avatar

2 Answers2

71

agg is the same asaggregate. It's callable is passed the columns (Series objects) of theDataFrame, one at a time.


You could useidxmax to collect the index labels of the rows with the maximumcount:

idx = df.groupby('word')['count'].idxmax()print(idx)

yields

worda       2an      3the     1Name: count

and then useloc to select those rows in theword andtag columns:

print(df.loc[idx, ['word', 'tag']])

yields

  word tag2    a   T3   an   T1  the   S

Note thatidxmax returns indexlabels.df.loc can be used to select rowsby label. But if the index is not unique -- that is, if there are rows with duplicate index labels -- thendf.loc will selectall rows with the labels listed inidx. So be careful thatdf.index.is_unique isTrue if you useidxmax withdf.loc


Alternative, you could useapply.apply's callable is passed a sub-DataFrame which gives you access to all the columns:

import pandas as pddf = pd.DataFrame({'word':'a the a an the'.split(),                   'tag': list('SSTTT'),                   'count': [30, 20, 60, 5, 10]})print(df.groupby('word').apply(lambda subf: subf['tag'][subf['count'].idxmax()]))

yields

worda       Tan      Tthe     S

Usingidxmax andloc is typically faster thanapply, especially for large DataFrames. Using IPython's %timeit:

N = 10000df = pd.DataFrame({'word':'a the a an the'.split()*N,                   'tag': list('SSTTT')*N,                   'count': [30, 20, 60, 5, 10]*N})def using_apply(df):    return (df.groupby('word').apply(lambda subf: subf['tag'][subf['count'].idxmax()]))def using_idxmax_loc(df):    idx = df.groupby('word')['count'].idxmax()    return df.loc[idx, ['word', 'tag']]In [22]: %timeit using_apply(df)100 loops, best of 3: 7.68 ms per loopIn [23]: %timeit using_idxmax_loc(df)100 loops, best of 3: 5.43 ms per loop

If you want a dictionary mapping words to tags, then you could useset_indexandto_dict like this:

In [36]: df2 = df.loc[idx, ['word', 'tag']].set_index('word')In [37]: df2Out[37]:      tagword    a      Tan     Tthe    SIn [38]: df2.to_dict()['tag']Out[38]: {'a': 'T', 'an': 'T', 'the': 'S'}
answeredMar 10, 2013 at 13:24
unutbu's user avatar
Sign up to request clarification or add additional context in comments.

1 Comment

@bananafish: The syntax has gotten simpler: You can usedf.groupby('word')['count'].idxmax() now.
18

Here's a simple way to figure out what is being passed (the unutbu) solution then 'applies'!

In [33]: def f(x):....:     print type(x)....:     print x....:     In [34]: df.groupby('word').apply(f)<class 'pandas.core.frame.DataFrame'>  word tag  count0    a   S     302    a   T     60<class 'pandas.core.frame.DataFrame'>  word tag  count0    a   S     302    a   T     60<class 'pandas.core.frame.DataFrame'>  word tag  count3   an   T      5<class 'pandas.core.frame.DataFrame'>  word tag  count1  the   S     204  the   T     10

your function just operates (in this case) on a sub-section of the frame with the grouped variable all having the same value (in this cas 'word'), if you are passing a function, then you have to deal with the aggregation of potentially non-string columns; standard functions, like 'sum' do this for you

Automatically does NOT aggregate on the string columns

In [41]: df.groupby('word').sum()Out[41]:       countword       a        90an        5the      30

You ARE aggregating on all columns

In [42]: df.groupby('word').apply(lambda x: x.sum())Out[42]:         word tag countword                  a         aa  ST    90an        an   T     5the   thethe  ST    30

You can do pretty much anything within the function

In [43]: df.groupby('word').apply(lambda x: x['count'].sum())Out[43]: worda       90an       5the     30
answeredMar 10, 2013 at 13:47
Jeff's user avatar

1 Comment

what should this function return ?

Your Answer

Sign up orlog in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

By clicking “Post Your Answer”, you agree to ourterms of service and acknowledge you have read ourprivacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.