python pandas, DF.groupby().agg(), column reference in agg()

Question 1

On a concrete problem, say I have a DataFrame DF

     word  tag count0    a     S    301    the   S    202    a     T    603    an    T    54    the   T    10

I want to find,for every "word", the "tag" that has the most "count". So the return would be something like

     word  tag count1    the   S    202    a     T    603    an    T    5

I don't care about the count column or if the order/Index is original or messed up. Returning a dictionary {'the' : 'S', ...} is just fine.

I hope I can do

DF.groupby(['word']).agg(lambda x: x['tag'][ x['count'].argmax() ] )

but it doesn't work. I can't access column information.

More abstractly,what does thefunction in agg(function) see as its argument?

btw, is .agg() the same as .aggregate() ?

Many thanks.

Question 2

agg is the same asaggregate. It's callable is passed the columns (Series objects) of theDataFrame, one at a time.

You could useidxmax to collect the index labels of the rows with the maximumcount:

idx = df.groupby('word')['count'].idxmax()print(idx)

yields

worda       2an      3the     1Name: count

and then useloc to select those rows in theword andtag columns:

print(df.loc[idx, ['word', 'tag']])

yields

  word tag2    a   T3   an   T1  the   S

Note thatidxmax returns indexlabels.df.loc can be used to select rowsby label. But if the index is not unique -- that is, if there are rows with duplicate index labels -- thendf.loc will selectall rows with the labels listed inidx. So be careful thatdf.index.is_unique isTrue if you useidxmax withdf.loc

Alternative, you could useapply.apply's callable is passed a sub-DataFrame which gives you access to all the columns:

import pandas as pddf = pd.DataFrame({'word':'a the a an the'.split(),                   'tag': list('SSTTT'),                   'count': [30, 20, 60, 5, 10]})print(df.groupby('word').apply(lambda subf: subf['tag'][subf['count'].idxmax()]))

yields

worda       Tan      Tthe     S

Usingidxmax andloc is typically faster thanapply, especially for large DataFrames. Using IPython's %timeit:

N = 10000df = pd.DataFrame({'word':'a the a an the'.split()*N,                   'tag': list('SSTTT')*N,                   'count': [30, 20, 60, 5, 10]*N})def using_apply(df):    return (df.groupby('word').apply(lambda subf: subf['tag'][subf['count'].idxmax()]))def using_idxmax_loc(df):    idx = df.groupby('word')['count'].idxmax()    return df.loc[idx, ['word', 'tag']]In [22]: %timeit using_apply(df)100 loops, best of 3: 7.68 ms per loopIn [23]: %timeit using_idxmax_loc(df)100 loops, best of 3: 5.43 ms per loop

If you want a dictionary mapping words to tags, then you could useset_indexandto_dict like this:

In [36]: df2 = df.loc[idx, ['word', 'tag']].set_index('word')In [37]: df2Out[37]:      tagword    a      Tan     Tthe    SIn [38]: df2.to_dict()['tag']Out[38]: {'a': 'T', 'an': 'T', 'the': 'S'}

Question 3

@bananafish: The syntax has gotten simpler: You can usedf.groupby('word')['count'].idxmax() now.

Question 4

Here's a simple way to figure out what is being passed (the unutbu) solution then 'applies'!

In [33]: def f(x):....:     print type(x)....:     print x....:     In [34]: df.groupby('word').apply(f)<class 'pandas.core.frame.DataFrame'>  word tag  count0    a   S     302    a   T     60<class 'pandas.core.frame.DataFrame'>  word tag  count0    a   S     302    a   T     60<class 'pandas.core.frame.DataFrame'>  word tag  count3   an   T      5<class 'pandas.core.frame.DataFrame'>  word tag  count1  the   S     204  the   T     10

your function just operates (in this case) on a sub-section of the frame with the grouped variable all having the same value (in this cas 'word'), if you are passing a function, then you have to deal with the aggregation of potentially non-string columns; standard functions, like 'sum' do this for you

Automatically does NOT aggregate on the string columns

In [41]: df.groupby('word').sum()Out[41]:       countword       a        90an        5the      30

You ARE aggregating on all columns

In [42]: df.groupby('word').apply(lambda x: x.sum())Out[42]:         word tag countword                  a         aa  ST    90an        an   T     5the   thethe  ST    30

You can do pretty much anything within the function

In [43]: df.groupby('word').apply(lambda x: x['count'].sum())Out[43]: worda       90an       5the     30

Question 5

what should this function return ?

unutbu 886k197 gold badges1.9k silver badges1.7k bronze badges · Accepted Answer · 2015-08-24 12:39:17Z

agg is the same asaggregate. It's callable is passed the columns (Series objects) of theDataFrame, one at a time.

You could useidxmax to collect the index labels of the rows with the maximumcount:

idx = df.groupby('word')['count'].idxmax()print(idx)

yields

worda       2an      3the     1Name: count

and then useloc to select those rows in theword andtag columns:

print(df.loc[idx, ['word', 'tag']])

yields

  word tag2    a   T3   an   T1  the   S

Note thatidxmax returns indexlabels.df.loc can be used to select rowsby label. But if the index is not unique -- that is, if there are rows with duplicate index labels -- thendf.loc will selectall rows with the labels listed inidx. So be careful thatdf.index.is_unique isTrue if you useidxmax withdf.loc

Alternative, you could useapply.apply's callable is passed a sub-DataFrame which gives you access to all the columns:

import pandas as pddf = pd.DataFrame({'word':'a the a an the'.split(),                   'tag': list('SSTTT'),                   'count': [30, 20, 60, 5, 10]})print(df.groupby('word').apply(lambda subf: subf['tag'][subf['count'].idxmax()]))

yields

worda       Tan      Tthe     S

Usingidxmax andloc is typically faster thanapply, especially for large DataFrames. Using IPython's %timeit:

N = 10000df = pd.DataFrame({'word':'a the a an the'.split()*N,                   'tag': list('SSTTT')*N,                   'count': [30, 20, 60, 5, 10]*N})def using_apply(df):    return (df.groupby('word').apply(lambda subf: subf['tag'][subf['count'].idxmax()]))def using_idxmax_loc(df):    idx = df.groupby('word')['count'].idxmax()    return df.loc[idx, ['word', 'tag']]In [22]: %timeit using_apply(df)100 loops, best of 3: 7.68 ms per loopIn [23]: %timeit using_idxmax_loc(df)100 loops, best of 3: 5.43 ms per loop

If you want a dictionary mapping words to tags, then you could useset_indexandto_dict like this:

In [36]: df2 = df.loc[idx, ['word', 'tag']].set_index('word')In [37]: df2Out[37]:      tagword    a      Tan     Tthe    SIn [38]: df2.to_dict()['tag']Out[38]: {'a': 'T', 'an': 'T', 'the': 'S'}

@bananafish: The syntax has gotten simpler: You can usedf.groupby('word')['count'].idxmax() now.

Movatterモバイル変換

Collectives™ on Stack Overflow

python pandas, DF.groupby().agg(), column reference in agg()

2 Answers2

1 Comment

1 Comment

Your Answer

Sign up orlog in

Post as a guest

Linked

Related

Hot Network Questions

Subscribe to RSS