62

Given the following dataframe

    In [31]: rand = np.random.RandomState(1)
             df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2,
                                'B': rand.randn(6),
                                'C': rand.rand(6) > .5})

    In [32]: df
    Out[32]:
         A         B      C
    0  foo  1.624345  False
    1  bar -0.611756   True
    2  baz -0.528172  False
    3  foo -1.072969   True
    4  bar  0.865408  False
    5  baz -2.301539   True

I would like to sort it in groups (A) by the aggregated sum of B, and then by the value in C (not aggregated). So basically get the order of the A groups with

    In [28]: df.groupby('A').sum().sort('B')
    Out[28]:
                B  C
    A
    baz -2.829710  1
    bar  0.253651  1
    foo  0.551377  1

And then by True/False, so that it ultimately looks like this:

    In [30]: df.ix[[5, 2, 1, 4, 3, 0]]
    Out[30]:
         A         B      C
    5  baz -2.301539   True
    2  baz -0.528172  False
    1  bar -0.611756   True
    4  bar  0.865408  False
    3  foo -1.072969   True
    0  foo  1.624345  False

How can this be done?

asked Feb 18, 2013 at 16:55 by beardc

4 Answers

64

Groupby A:

In [0]: grp = df.groupby('A')

Within each group, sum over B and broadcast the values using transform. Then sort by B:

    In [1]: grp[['B']].transform(sum).sort('B')
    Out[1]:
              B
    2 -2.829710
    5 -2.829710
    1  0.253651
    4  0.253651
    0  0.551377
    3  0.551377

Index the original df by passing the index from above. This will re-order the A values by the aggregate sum of the B values:

    In [2]: sort1 = df.ix[grp[['B']].transform(sum).sort('B').index]

    In [3]: sort1
    Out[3]:
         A         B      C
    2  baz -0.528172  False
    5  baz -2.301539   True
    1  bar -0.611756   True
    4  bar  0.865408  False
    0  foo  1.624345  False
    3  foo -1.072969   True

Finally, sort the 'C' values within groups of 'A' using the sort=False option to preserve the A sort order from step 1:

    In [4]: f = lambda x: x.sort('C', ascending=False)

    In [5]: sort2 = sort1.groupby('A', sort=False).apply(f)

    In [6]: sort2
    Out[6]:
             A         B      C
    A
    baz 5  baz -2.301539   True
        2  baz -0.528172  False
    bar 1  bar -0.611756   True
        4  bar  0.865408  False
    foo 3  foo -1.072969   True
        0  foo  1.624345  False

Clean up the df index by using reset_index with drop=True:

    In [7]: sort2.reset_index(0, drop=True)
    Out[7]:
         A         B      C
    5  baz -2.301539   True
    2  baz -0.528172  False
    1  bar -0.611756   True
    4  bar  0.865408  False
    3  foo -1.072969   True
    0  foo  1.624345  False
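For readers on a recent pandas release, where `DataFrame.sort` and `.ix` are no longer available, the steps above can be sketched roughly as follows. This is an adaptation, not the original answer's code: it swaps in `sort_values` and `.loc`, replaces the `apply` step with a plain loop over the groups, and hard-codes the values printed in the question instead of regenerating them. The names `group_sum`, `sort1` and `sort2` are only illustrative.

```python
import pandas as pd

# Values copied from the question's printed dataframe (assumption: used
# literally instead of re-creating them with RandomState).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# Step 1: broadcast each group's sum of B back onto its rows.
group_sum = df.groupby('A')['B'].transform('sum')

# Step 2: reorder the rows by that group sum. A stable sort (mergesort)
# keeps rows of the same group in their original relative order.
sort1 = df.loc[group_sum.sort_values(kind='mergesort').index]

# Step 3: sort C (True first) inside each group. sort=False makes groupby
# yield the groups in the order they first appear in sort1.
sort2 = pd.concat(
    [g.sort_values('C', ascending=False, kind='mergesort')
     for _, g in sort1.groupby('A', sort=False)]
)

print(sort2)
```

Because the groups are concatenated directly, the result keeps the original row labels and no `reset_index` step is needed.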
answered Feb 18, 2013 at 22:11 by Zelazny7

4 Comments

Also, I assumed that groupby's sort=False flag would return an arbitrary, not necessarily sorted order (I guess I was associating them with Python dictionaries for some reason). But this answer implies that the flag is guaranteed to preserve the original order of the dataframe rows?
I'm 99% sure it preserves the order of the groups as they first appear. I don't have any code to back this up, but some quick testing confirms this intuition.
Thanks @Zelazny7 for this answer. It is exactly what I want. However, it seems that in the latest pandas package, to achieve the same Out[7], inplace=True should be added to the arguments in In [7].
Adding more information: sort() is now deprecated. It is advisable to use DataFrame.sort_values() instead.
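The ordering discussed in these comments can be checked with a small experiment. The sketch below uses a made-up frame (`demo` and its values are hypothetical, not from the question) and compares the order of group keys with and without `sort=False`.

```python
import pandas as pd

# Hypothetical data: group keys deliberately not in alphabetical order.
demo = pd.DataFrame({
    'A': ['baz', 'baz', 'bar', 'bar', 'foo', 'foo'],
    'B': range(6),
})

# Default behaviour: group keys come out sorted.
sorted_keys = [key for key, _ in demo.groupby('A')]

# sort=False: group keys come out in order of first appearance.
first_seen_keys = [key for key, _ in demo.groupby('A', sort=False)]

# Rows inside a group keep their original relative order either way.
baz_rows = list(demo.groupby('A', sort=False).get_group('baz').index)

print(sorted_keys)
print(first_seen_keys)
print(baz_rows)
```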
30

Here's a more concise approach...

    df['a_bsum'] = df.groupby('A')['B'].transform(sum)
    df.sort(['a_bsum','C'], ascending=[True, False]).drop('a_bsum', axis=1)

The first line adds a column to the data frame with the groupwise sum. The second line performs the sort and then removes the extra column.

Result:

         A         B      C
    5  baz -2.301539   True
    2  baz -0.528172  False
    1  bar -0.611756   True
    4  bar  0.865408  False
    3  foo -1.072969   True
    0  foo  1.624345  False

NOTE: sort is deprecated, use sort_values instead
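Following the note above, a version using `sort_values` might look like the sketch below. It is one possible rewrite rather than the answer's own code: `assign` is used so the helper column never lands on `df` itself, and the data is hard-coded from the values printed in the question.

```python
import pandas as pd

# Values copied from the question's printed dataframe (assumption).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

result = (
    # Temporary column holding each row's group-wise sum of B.
    df.assign(a_bsum=df.groupby('A')['B'].transform('sum'))
      # Group sum ascending, then C descending (True before False).
      .sort_values(['a_bsum', 'C'], ascending=[True, False])
      # Remove the helper column from the result.
      .drop(columns='a_bsum')
)

print(result)
```

Since `assign` returns a new frame, `df` is left unchanged and only `result` carries the new order.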

aorcsik
answered May 14, 2013 at 14:03 by Mark Byers

2 Comments

With sort_values, the last operation does not drop the column from df, because the default is inplace=False. Specifying inplace=True will also do the job. An alternative is to call df.drop('a_bsum', axis=1, inplace=True) afterwards.
Alternatively, assigning the result back to the variable df works as well: df = df.sort_values(['a_bsum','C'], ascending=[True, False]).drop('a_bsum', axis=1).
9

One way to do this is to insert a dummy column with the sums in order to sort:

    In [10]: sum_B_over_A = df.groupby('A').sum().B

    In [11]: sum_B_over_A
    Out[11]:
    A
    bar    0.253652
    baz   -2.829711
    foo    0.551376
    Name: B

    In [12]: df['sum_B_over_A'] = df.A.apply(sum_B_over_A.get_value)

    In [13]: df
    Out[13]:
         A         B      C  sum_B_over_A
    0  foo  1.624345  False      0.551376
    1  bar -0.611756   True      0.253652
    2  baz -0.528172  False     -2.829711
    3  foo -1.072969   True      0.551376
    4  bar  0.865408  False      0.253652
    5  baz -2.301539   True     -2.829711

    In [14]: df.sort(['sum_B_over_A', 'A', 'B'])
    Out[14]:
         A         B      C  sum_B_over_A
    5  baz -2.301539   True     -2.829711
    2  baz -0.528172  False     -2.829711
    1  bar -0.611756   True      0.253652
    4  bar  0.865408  False      0.253652
    3  foo -1.072969   True      0.551376
    0  foo  1.624345  False      0.551376

and maybe you would drop the dummy column:

    In [15]: df.sort(['sum_B_over_A', 'A', 'B']).drop('sum_B_over_A', axis=1)
    Out[15]:
         A         B      C
    5  baz -2.301539   True
    2  baz -0.528172  False
    1  bar -0.611756   True
    4  bar  0.865408  False
    3  foo -1.072969   True
    0  foo  1.624345  False
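On recent pandas releases, where `Series.get_value` and `DataFrame.sort` are gone, the dummy-column idea can be sketched with `Series.map` instead. This is an adaptation under a few assumptions: the data is hard-coded from the question, `C` is sorted explicitly (the session above does not sort by it), and `A` is kept as a tie-breaker in case two groups share the same sum.

```python
import pandas as pd

# Values copied from the question's printed dataframe (assumption).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# One sum per group, indexed by the group key.
sum_B_over_A = df.groupby('A')['B'].sum()

# map() looks up each row's A value in the index of sum_B_over_A.
with_sums = df.assign(sum_B_over_A=df['A'].map(sum_B_over_A))

ordered = (
    with_sums
    # Group sum first, A to keep equal-sum groups apart, then C descending.
    .sort_values(['sum_B_over_A', 'A', 'C'], ascending=[True, True, False])
    .drop(columns='sum_B_over_A')
)

print(ordered)
```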
answered Feb 18, 2013 at 18:06 by Andy Hayden

5 Comments

I'm sure I've seen some clever way to do this here (essentially allowing a key to sort), but I can't seem to find it.
Glad to know there's a better way to do df.A.map(dict(zip(sum_B_over_A.index, sum_B_over_A))) :) (should be get_value, no?). Also didn't know about column-wise drops, thanks a lot. (though I kinda prefer the version w/out the dummy column for some reason)
@BirdJaguarIV whoops typo :). Yes, it does seem silly using a dummy (tbh I could've been more clever with my apply [12] to do it in one, and it may well be more efficient, but I decided I wouldn't like to be the person reading it...). Like I say, I think there is a clever way to do this kind of complex sort :s
You didn't sort by column C.
@MarkByers you can append 'C' to the list of columns to sort by, so it's: df.sort(['sum_B_over_A', 'A', 'B', 'C'])... I should really add a link to the sort docs.
0

The question is difficult to understand. However, you can group by A, sum B, and then sort the sums in descending order, so that the order of the A values depends on B. You can then use that order of A values to filter the original dataframe and build a new, reordered one.

    import numpy as np
    import pandas as pd

    rand = np.random.RandomState(1)
    df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2,
                       'B': rand.randn(6),
                       'C': rand.rand(6) > .5})
    # Sum B per group, largest sum first
    grouped = df.groupby('A')['B'].sum().sort_values(ascending=False)
    print(grouped)
    # The group labels in sorted order
    print(grouped.index.get_level_values(0))

Output:

    A
    foo    0.551377
    bar    0.253651
    baz   -2.829710
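The answer stops at the ordered group labels. One way the remaining step could be filled in is sketched below, assuming the ascending order asked for in the question rather than the descending order printed above. The ordered-categorical helper column `A_order` is a hypothetical name, and the data is hard-coded from the question's printed values.

```python
import pandas as pd

# Values copied from the question's printed dataframe (assumption).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# Group labels ordered by their sum of B, smallest sum first.
order = df.groupby('A')['B'].sum().sort_values().index

# An ordered categorical sorts by the given category order, not alphabetically.
keyed = df.assign(A_order=pd.Categorical(df['A'], categories=order, ordered=True))

result = (
    keyed.sort_values(['A_order', 'C'], ascending=[True, False])
         .drop(columns='A_order')
)

print(list(order))
print(result)
```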
answered Jul 12, 2021 at 14:58 by ListenSoftware Louise Ai Agent
