BUG: groupby.agg should always agg #57706


Open

rhshadrach wants to merge 11 commits into pandas-dev:main from rhshadrach:gb_agg_must_agg

Conversation

rhshadrach (Member) commented Mar 2, 2024 (edited)

Built on top of #57671; the diff should get better once that's merged. Still plan on splitting part of this up as a precursor (and perhaps multiple).

For the closed issues above, tests here still likely need to be added.

The goal here is to make groupby.agg more consistently handle UDFs. Currently:

  • We sometimes raise if a UDF returns a NumPy ndarray
  • We sometimes treat non-scalars as transforms
  • We sometimes fail (non-purposefully) on non-scalars
  • We sometimes pass the entire group to the UDF rather than column-by-column

My opinion is that we should treat all UDFs as reducers, regardless of what they return. Some alternatives:

  1. If we detect something as being a non-scalar, try treating it as a transform
  2. Raise on anything detected as being a non-scalar

For 1, we will sometimes guess wrong, and transforming isn't something we should be doing in a method called agg anyway. For 2, we are restricting what I think are valid use cases for aggregation, e.g. gb.agg(np.array) or gb.agg(list).
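
For concreteness, a minimal sketch (mine, not from the PR) of the kind of calls meant in the last sentence, assuming pandas and NumPy are importable; exact behavior on current pandas varies by code path as described above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
gb = df.groupby("key")["val"]

# Under the proposal, each group's values become one nested element of an
# object-dtype result (e.g. "a" -> [1, 2]); on current pandas these calls may
# instead be treated as transforms or raise, depending on the path taken.
result_list = gb.agg(list)
result_array = gb.agg(np.array)
```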

In implementing this, I ran into two issues:

  • _aggregate_frame fails if non-scalars are returned by the UDF, and also passes all of the selected columns as a DataFrame to the UDF (a short sketch follows this list). This is called when there is a single grouping and args or kwargs are provided, or when there is a single grouping and passing the UDF each column individually fails with a ValueError("No objects to concatenate"). This does not seem possible to fix, would be hard to deprecate (could add a new argument or use a future option?), and is bad enough behavior that it seems to me we should just rip off the band-aid here for 3.0.
  • Resampler.apply is an alias for Resampler.agg, and we do not want to impact Resampler.apply with these changes. For this, I kept the old paths through groupby specifically for resampler and plan to properly deprecate the current method and implement apply (by calling groupby's apply) as part of 3.x development. (Ref: BUG: resample apply is actually aggregate #38463)
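
A hedged sketch of the path difference described in the first bullet (the helper name is mine, and the exact triggering conditions depend on the pandas version):

```python
import pandas as pd

def record_input_type(x, extra=None):
    # Return the kind of object pandas handed to the UDF.
    return type(x).__name__

df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
gb = df.groupby("key")[["x", "y"]]

# Without extra arguments, the UDF is applied column by column (it sees Series).
per_column = gb.agg(record_input_type)

# With args/kwargs and a single grouping, the _aggregate_frame path can hand
# the UDF the whole selected DataFrame instead.
per_frame = gb.agg(record_input_type, extra=0)
```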

rhshadrach added the Enhancement, Groupby, API Design, Needs Discussion, and Apply labels Mar 2, 2024
rhshadrach added this to the 3.0 milestone Mar 2, 2024
rhshadrach (Member, Author) commented

I think this is ready for review. Assuming the direction this is moving us is good, we still need to decide if we are okay with this being a breaking change in 3.0 (my preference) or if it should be deprecated. If we do go the deprecation route, it will be noisy (many cases where results will be the same, but we can't tell, so we need to warn). The only way I see a deprecation working is if we add an option, e.g. future_groupby_agg, so that users can opt in to the new implementation.

cc @jorisvandenbossche @MarcoGorelli @Dr-Irv @mroeschke for any thoughts.

mroeschke (Member) commented

I could generally be OK making this a "breaking bug change" for 3.0. Just 2 points:

  1. Is DataFrame.agg already strict like this?
  2. For "Raise on anything detected as being a non-scalar", I would be open to still allowing .agg to store a non-scalar, i.e. a nested value, as a result element and return dtype=object. But these values should never expand the dimensions of the result when using agg (a sketch of this follows the list).
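
A minimal sketch (mine, not mroeschke's) of what point 2 describes, using a SeriesGroupBy; the UDF's non-scalar return becomes a single element of an object-dtype result rather than expanding the result's shape:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "a", "b", "b"])

# Under the behavior described in point 2, each group's values are collected
# into one nested element (e.g. out["a"] == [1, 2]) and the result is
# object-dtype with one row per group; the nested values never expand the
# result's dimensions. Whether current pandas already does this depends on
# the code path taken, which is what this PR addresses.
out = s.groupby(level=0).agg(lambda grp: grp.tolist())
```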

Dr-Irv (Contributor) commented

I'm not going to review the whole code change (it's beyond what I understand about how this all works), but I think the example I wrote here should be in the tests:
#33242 (comment)

rhshadrach (Member, Author) commented Mar 27, 2024 (edited)

@mroeschke

  1. Is DataFrame.agg already strict like this?

Great question - unfortunately the answer is no. We use DataFrame.apply under the hood. Perhaps if we are going to do this with groupby, we should also do it across the board. That would make me lean more toward introducing something like a future.agg option.
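
A quick hedged illustration of the point above; because DataFrame.agg defers to DataFrame.apply for UDFs, a non-reducing UDF is not rejected:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# A UDF that returns its input unchanged. A strict "agg must aggregate" rule
# would reject this; since DataFrame.agg falls back to DataFrame.apply, it
# typically comes back transform-like instead of raising.
out = df.agg(lambda col: col)
```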

2. For "Raise on anything detected as being a non-scalar", I would be open to still allowing .agg to store a non-scalar, i.e. a nested value, as a result element and return dtype=object. But these values should never expand the dimensions of the result when using agg

Agreed. I believe that's the case here for a UDF, but not strings (e.g. cumsum). This is #44845 - but I'd like to focus on UDFs here and work on string arguments separately.

@Dr-Irv

I'm not going to review the whole code change (it's beyond what I understand about how this all works), but I think the example I wrote here should be in the tests:

Certainly - I added this as test_unused_kwargs. In that test, we use np.sum(np.sum(data)). This way works both on column and frame inputs, and would raise if we pass a frame instead of column-by-column. Is this sufficient?

Dr-Irv marked this pull request as ready for review March 27, 2024 13:33
Dr-Irv (Contributor) commented

Certainly - I added this as test_unused_kwargs. In that test, we use np.sum(np.sum(data)). This way works both on column and frame inputs, and would raise if we pass a frame instead of column-by-column. Is this sufficient?

I'm not sure. In the example I created, there were 2 functions, one with 1 argument and the other with 2 arguments, and what was being passed to those 2 functions differed because of the number of arguments. I don't see how the test you created confirms that Series are passed independent of the function declaration.

So maybe you should have a test like this that addresses the particular issue in #33242:

```python
import pandas as pd


def twoargs(x, y):
    # The UDF should always receive a column (Series), regardless of how many
    # arguments it declares (the issue in #33242).
    assert isinstance(x, pd.Series)
    return x.sum()


def test_two_args(self):
    df = pd.DataFrame({
        "a": [1, 2, 3, 4, 5, 6],
        "b": [1, 1, 0, 1, 1, 0],
        "c": ["x", "x", "x", "z", "z", "z"],
        "d": ["s", "s", "s", "d", "d", "d"],
    })
    df.groupby("c")[["a", "b"]].agg(twoargs, 0)
```

rhshadrach (Member, Author) commented

I don't see how the test you created confirms that Series are passed independent of the function declaration.

We don't ever inspect the UDF to see what arguments it can take - our logic branches on whether additional arguments are passed in the call to agg: .agg(func, 0) vs .agg(func) previously resulted in two different paths, one which passed the DataFrame and one which passed each Series.

Still, no opposition to an additional test here. Will add.

Labels: API Design, Apply, Enhancement, Groupby, Needs Discussion
Milestone: 3.0
Participants: @rhshadrach, @mroeschke, @Dr-Irv
