Pandas sort by group aggregate and column

Question 1

Given the following dataframe

    In [31]: rand = np.random.RandomState(1)
             df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2,
                                'B': rand.randn(6),
                                'C': rand.rand(6) > .5})

    In [32]: df
    Out[32]:
         A         B      C
    0  foo  1.624345  False
    1  bar -0.611756   True
    2  baz -0.528172  False
    3  foo -1.072969   True
    4  bar  0.865408  False
    5  baz -2.301539   True

I would like to sort it in groups (A) by the aggregated sum of B, and then by the value in C (not aggregated). So basically get the order of the A groups with

    In [28]: df.groupby('A').sum().sort('B')
    Out[28]:
                B  C
    A
    baz -2.829710  1
    bar  0.253651  1
    foo  0.551377  1

And then by True/False, so that it ultimately looks like this:

    In [30]: df.ix[[5, 2, 1, 4, 3, 0]]
    Out[30]:
         A         B      C
    5  baz -2.301539   True
    2  baz -0.528172  False
    1  bar -0.611756   True
    4  bar  0.865408  False
    3  foo -1.072969   True
    0  foo  1.624345  False

How can this be done?

Question 2

Groupby A:

In [0]: grp = df.groupby('A')

Within each group, sum over B and broadcast the values using transform. Then sort by B:

    In [1]: grp[['B']].transform(sum).sort('B')
    Out[1]:
              B
    2 -2.829710
    5 -2.829710
    1  0.253651
    4  0.253651
    0  0.551377
    3  0.551377

Index the original df by passing the index from above. This will re-order the A values by the aggregate sum of the B values:

    In [2]: sort1 = df.ix[grp[['B']].transform(sum).sort('B').index]

    In [3]: sort1
    Out[3]:
         A         B      C
    2  baz -0.528172  False
    5  baz -2.301539   True
    1  bar -0.611756   True
    4  bar  0.865408  False
    0  foo  1.624345  False
    3  foo -1.072969   True

Finally, sort the 'C' values within groups of 'A' using the sort=False option to preserve the A sort order from step 1:

    In [4]: f = lambda x: x.sort('C', ascending=False)

    In [5]: sort2 = sort1.groupby('A', sort=False).apply(f)

    In [6]: sort2
    Out[6]:
             A         B      C
    A
    baz 5  baz -2.301539   True
        2  baz -0.528172  False
    bar 1  bar -0.611756   True
        4  bar  0.865408  False
    foo 3  foo -1.072969   True
        0  foo  1.624345  False

Clean up the df index by using reset_index with drop=True:

    In [7]: sort2.reset_index(0, drop=True)
    Out[7]:
         A         B      C
    5  baz -2.301539   True
    2  baz -0.528172  False
    1  bar -0.611756   True
    4  bar  0.865408  False
    3  foo -1.072969   True
    0  foo  1.624345  False

Question 3

Also, I assumed that groupby's sort=False flag would return an arbitrary, not necessarily sorted order (I guess I was associating them with python dictionaries for some reason). But this answer implies that the flag is guaranteed to preserve the original order of the dataframe rows?

Question 4

I'm 99% sure it preserves the order of the groups as they first appear. I don't have any code to back this up, but some quick testing confirms this intuition.
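A quick check along these lines is sketched below. It is only an illustration, not the commenter's own test: the B values are copied from the question's output rather than regenerated, and the variable names sorted_keys and appearance_keys are made up for this example.

```python
import pandas as pd

# Values copied from the question's printed dataframe (rounded to 6 places).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz', 'foo', 'bar', 'baz'],
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
})

# Default (sort=True): group keys come back in sorted order.
sorted_keys = list(df.groupby('A')['B'].sum().index)

# sort=False: group keys come back in order of first appearance.
appearance_keys = list(df.groupby('A', sort=False)['B'].sum().index)

print(sorted_keys)
print(appearance_keys)
```

With this data, 'foo' appears first in the frame, so it leads the sort=False result even though it sorts last alphabetically.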

Question 5

Thanks @Zelazny7 for this answer. It is exactly what I want. However, it seems in the latest pandas package, to achieve the same Out[7], inplace=True should be added to the arguments in Input[7].

Question 6

Adding more information: sort() is now deprecated. It is advisable to use DataFrame.sort_values().
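For readers on a recent pandas, the same steps might look roughly like the sketch below. It is an assumption-laden adaptation, not the answerer's code: sort() is swapped for sort_values(), .ix for .loc, and the groupby(...).apply(f) step for an explicit loop over the groups followed by pd.concat. The names group_sums, order, pieces and result are invented for this example, and the data is copied from the question's output.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# Step 1: broadcast each group's sum of B back onto its rows, then sort.
# mergesort is a stable algorithm, so rows with equal sums keep their order.
group_sums = df.groupby('A')['B'].transform('sum')
order = group_sums.sort_values(kind='mergesort').index

# Step 2: reorder the original rows by that index.
sort1 = df.loc[order]

# Step 3: within each group, sort C so that True comes before False.
# sort=False keeps the groups in the order established in step 2.
pieces = [g.sort_values('C', ascending=False, kind='mergesort')
          for _, g in sort1.groupby('A', sort=False)]
result = pd.concat(pieces)

print(result)
```

Because each group is sorted separately and then concatenated, the original row labels survive and no reset_index call is needed at the end.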

Question 7

Here's a more concise approach...

    df['a_bsum'] = df.groupby('A')['B'].transform(sum)
    df.sort(['a_bsum','C'], ascending=[True, False]).drop('a_bsum', axis=1)

The first line adds a column to the data frame with the groupwise sum. The second line performs the sort and then removes the extra column.

Result:

         A         B      C
    5  baz -2.301539   True
    2  baz -0.528172  False
    1  bar -0.611756   True
    4  bar  0.865408  False
    3  foo -1.072969   True
    0  foo  1.624345  False

NOTE: sort is deprecated, use sort_values instead.
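A possible sort_values version of the two lines is sketched here. Using assign() to attach the helper column is my own substitution, chosen so that the original df is never modified; the data is copied from the question's output and the name result is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

result = (
    # Attach the groupwise sum of B as a temporary column on a copy.
    df.assign(a_bsum=df.groupby('A')['B'].transform('sum'))
      # Ascending by group sum, then True before False within each group.
      .sort_values(['a_bsum', 'C'], ascending=[True, False])
      # Remove the helper column from the returned frame.
      .drop(columns='a_bsum')
)

print(result)
```

Since every step returns a new frame, df itself never gains the a_bsum column and nothing has to be cleaned up afterwards.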

Question 8

With sort_values, the last operation does not drop the column from df itself. That happens because the default is inplace=False, so the result is returned instead of being applied to df. Specifying inplace=True will also do the job. An alternative is to call df.drop('a_bsum', axis=1, inplace=True) afterwards.

Question 9

Alternatively, assigning the result back to the variable df will work as well: df = df.sort_values(['a_bsum','C'], ascending=[True, False]).drop('a_bsum', axis=1).

Question 10

One way to do this is to insert a dummy column with the sums in order to sort:

    In [10]: sum_B_over_A = df.groupby('A').sum().B

    In [11]: sum_B_over_A
    Out[11]:
    A
    bar    0.253652
    baz   -2.829711
    foo    0.551376
    Name: B

    In [12]: df['sum_B_over_A'] = df.A.apply(sum_B_over_A.get_value)

    In [13]: df
    Out[13]:
         A         B      C  sum_B_over_A
    0  foo  1.624345  False      0.551376
    1  bar -0.611756   True      0.253652
    2  baz -0.528172  False     -2.829711
    3  foo -1.072969   True      0.551376
    4  bar  0.865408  False      0.253652
    5  baz -2.301539   True     -2.829711

    In [14]: df.sort(['sum_B_over_A', 'A', 'B'])
    Out[14]:
         A         B      C  sum_B_over_A
    5  baz -2.301539   True     -2.829711
    2  baz -0.528172  False     -2.829711
    1  bar -0.611756   True      0.253652
    4  bar  0.865408  False      0.253652
    3  foo -1.072969   True      0.551376
    0  foo  1.624345  False      0.551376

and maybe you would drop the dummy column:

    In [15]: df.sort(['sum_B_over_A', 'A', 'B']).drop('sum_B_over_A', axis=1)
    Out[15]:
         A         B      C
    5  baz -2.301539   True
    2  baz -0.528172  False
    1  bar -0.611756   True
    4  bar  0.865408  False
    3  foo -1.072969   True
    0  foo  1.624345  False

Question 11

I'm sure I've seen some clever way to do this here (essentially allowing a key to sort), but I can't seem to find it.

Question 12

Glad to know there's a better way to do df.A.map(dict(zip(sum_B_over_A.index, sum_B_over_A))) :) (should be get_value, no?). Also didn't know about column-wise drops, thanks a lot. (Though I kinda prefer the version w/out the dummy column for some reason.)
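As far as I know, Series.get_value was removed in later pandas releases, and Series.map accepts a Series directly, so the dict(zip(...)) detour is not needed either. The sketch below shows that lookup under those assumptions. It sorts on C within each group (as the question asks) rather than on A and B, and the names with_sums and result are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# One sum of B per group, indexed by the values of A.
sum_B_over_A = df.groupby('A')['B'].sum()

# map() looks up each row's A value in the index of sum_B_over_A.
with_sums = df.assign(sum_B_over_A=df['A'].map(sum_B_over_A))

# Sort by the group sum, then True before False, and drop the helper column.
result = (with_sums
          .sort_values(['sum_B_over_A', 'C'], ascending=[True, False])
          .drop(columns='sum_B_over_A'))

print(result)
```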

Question 13

@BirdJaguarIV whoops typo :). Yes, it does seem silly using a dummy (tbh I could've been more clever with my apply [12] to do it in one, and it may well be more efficient, but I decided I wouldn't like to be the person reading it...). Like I say, I think there is a clever way to do this kind of complex sort :s

Question 14

You didn't sort by column C.

Question 15

@MarkByers you can append 'C' to the list of columns to sort by, so it's: df.sort(['sum_B_over_A', 'A', 'B', 'C'])... I should really add a link to the sort docs.

Question 16

The question is difficult to understand. However, you can group by A, sum B, and then sort the sums in descending order. The order of the A groups then follows from the summed B values. You can then filter the original dataframe by each A value, in that order, to build a new, reordered dataframe.

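    import numpy as np
    import pandas as pd

    rand = np.random.RandomState(1)
    df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2,
                       'B': rand.randn(6),
                       'C': rand.rand(6) > .5})
    # Sum B per group and sort the sums from largest to smallest
    grouped = df.groupby('A')['B'].sum().sort_values(ascending=False)
    print(grouped)
    # The index holds the A values in sorted-sum order
    print(grouped.index.get_level_values(0))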

Output:

    A
    foo    0.551377
    bar    0.253651
    baz   -2.829710
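The filtering step is described but not shown, so the following is only one guess at what it could look like. It keeps this answer's descending order of sums (the question itself asks for ascending), puts True before False within each group, and uses invented names (order, result) with data copied from the question's output.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# A values ordered by their summed B, largest sum first.
grouped = df.groupby('A')['B'].sum().sort_values(ascending=False)
order = list(grouped.index)

# Filter the frame once per A value, in that order, sorting C inside
# each slice, then stack the slices back together.
result = pd.concat(
    [df[df['A'] == key].sort_values('C', ascending=False) for key in order]
)

print(order)
print(result)
```

Passing ascending=True to the first sort_values call would give the baz, bar, foo order requested in the question instead.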

Zelazny7 · Accepted Answer · 2013-02-18 22:11:48Z

