BUG: groupby.agg should always agg #57706


Open

rhshadrach wants to merge 11 commits into pandas-dev:main from rhshadrach:gb_agg_must_agg

Conversation

rhshadrach (Member) commented Mar 2, 2024 (edited)

Built on top of #57671; the diff should get better once that's merged. Still plan on splitting part of this up as a precursor (and perhaps multiple).

For the closed issues above, tests here still likely need to be added.

The goal here is to make groupby.agg more consistently handle UDFs. Currently:

  • We sometimes raise if a UDF returns a NumPy ndarray
  • We sometimes treat non-scalars as transforms
  • We sometimes fail (non-purposefully) on non-scalars
  • We sometimes pass the entire group to the UDF rather than column-by-column

My opinion is that we should treat all UDFs as reducers, regardless of what they return. Some alternatives:

  1. If we detect something as being a non-scalar, try treating it as a transform
  2. Raise on anything detected as being a non-scalar

For 1, we will sometimes guess wrong, and transforming isn't something we should be doing in a method called agg anyway. For 2, we are restricting what I think are valid use cases for aggregation, e.g. gb.agg(np.array) or gb.agg(list).
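
For concreteness, a minimal sketch (mine, not from the PR) of the kind of calls meant in the last sentence, assuming pandas and NumPy are importable; exact behavior on current pandas varies by code path as described above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
gb = df.groupby("key")["val"]

# Under the proposal, each group's values become one nested element of an
# object-dtype result (e.g. "a" -> [1, 2]); on current pandas these calls may
# instead be treated as transforms or raise, depending on the path taken.
result_list = gb.agg(list)
result_array = gb.agg(np.array)
```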

In implementing this, I ran into two issues:

  • _aggregate_frame fails if non-scalars are returned by the UDF, and also passes all of the selected columns as a DataFrame to the UDF (a short sketch follows this list). This is called when there is a single grouping and args or kwargs are provided, or when there is a single grouping and passing the UDF each column individually fails with a ValueError("No objects to concatenate"). This does not seem possible to fix, would be hard to deprecate (could add a new argument or use a future option?), and is bad enough behavior that it seems to me we should just rip off the band-aid here for 3.0.
  • Resampler.apply is an alias for Resampler.agg, and we do not want to impact Resampler.apply with these changes. For this, I kept the old paths through groupby specifically for resampler and plan to properly deprecate the current method and implement apply (by calling groupby's apply) as part of 3.x development. (Ref: BUG: resample apply is actually aggregate #38463)
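
A hedged sketch of the path difference described in the first bullet (the helper name is mine, and the exact triggering conditions depend on the pandas version):

```python
import pandas as pd

def record_input_type(x, extra=None):
    # Return the kind of object pandas handed to the UDF.
    return type(x).__name__

df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
gb = df.groupby("key")[["x", "y"]]

# Without extra arguments, the UDF is applied column by column (it sees Series).
per_column = gb.agg(record_input_type)

# With args/kwargs and a single grouping, the _aggregate_frame path can hand
# the UDF the whole selected DataFrame instead.
per_frame = gb.agg(record_input_type, extra=0)
```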

rhshadrach added the Enhancement, Groupby, API Design, Needs Discussion, and Apply labels Mar 2, 2024
rhshadrach added this to the 3.0 milestone Mar 2, 2024
rhshadrach (Member, Author) commented

I think this is ready for review. Assuming the direction this is moving us is good, we still need to decide if we are okay with this being a breaking change in 3.0 (my preference) or if it should be deprecated. If we do go the deprecation route, it will be noisy (many cases where results will be the same, but we can't tell, so we need to warn). The only way I see a deprecation working is if we add an option, e.g. future_groupby_agg, so that users can opt in to the new implementation.

cc @jorisvandenbossche @MarcoGorelli @Dr-Irv @mroeschke for any thoughts.

mroeschke (Member) commented

I could generally be OK making this a "breaking bug change" for 3.0. Just 2 points:

  1. Is DataFrame.agg already strict like this?
  2. For "Raise on anything detected as being a non-scalar", I would be open to still allowing .agg to store a non-scalar, i.e. a nested value, as a result element and return dtype=object. But these values should never expand the dimensions of the result when using agg (a sketch of this follows the list).
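
A minimal sketch (mine, not mroeschke's) of what point 2 describes, using a SeriesGroupBy; the UDF's non-scalar return becomes a single element of an object-dtype result rather than expanding the result's shape:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "a", "b", "b"])

# Under the behavior described in point 2, each group's values are collected
# into one nested element (e.g. out["a"] == [1, 2]) and the result is
# object-dtype with one row per group; the nested values never expand the
# result's dimensions. Whether current pandas already does this depends on
# the code path taken, which is what this PR addresses.
out = s.groupby(level=0).agg(lambda grp: grp.tolist())
```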

Dr-Irv (Contributor) commented

I'm not going to review the whole code change (it's beyond what I understand about how this all works), but I think the example I wrote here should be in the tests:
#33242 (comment)

rhshadrach (Member, Author) commented Mar 27, 2024 (edited)

@mroeschke

  1. Is DataFrame.agg already strict like this?

Great question - unfortunately the answer is no. We use DataFrame.apply under the hood. Perhaps if we are going to do this with groupby, we should also do it across the board. That would make me lean more toward introducing something like a future.agg option.
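
A quick hedged illustration of the point above; because DataFrame.agg defers to DataFrame.apply for UDFs, a non-reducing UDF is not rejected:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# A UDF that returns its input unchanged. A strict "agg must aggregate" rule
# would reject this; since DataFrame.agg falls back to DataFrame.apply, it
# typically comes back transform-like instead of raising.
out = df.agg(lambda col: col)
```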

2. For "Raise on anything detected as being a non-scalar", I would be open to still allowing .agg to store a non-scalar, i.e. a nested value, as a result element and return dtype=object. But these values should never expand the dimensions of the result when using agg

Agreed. I believe that's the case here for a UDF, but not strings (e.g. cumsum). This is #44845 - but I'd like to focus on UDFs here and work on string arguments separately.

@Dr-Irv

I'm not going to review the whole code change (it's beyond what I understand about how this all works), but I think the example I wrote here should be in the tests:

Certainly - I added this as test_unused_kwargs. In that test, we use np.sum(np.sum(data)). This way works both on column and frame inputs, and would raise if we pass a frame instead of column-by-column. Is this sufficient?

Dr-Irv marked this pull request as ready for review March 27, 2024 13:33
Dr-Irv (Contributor) commented

Certainly - I added this as test_unused_kwargs. In that test, we use np.sum(np.sum(data)). This way works both on column and frame inputs, and would raise if we pass a frame instead of column-by-column. Is this sufficient?

I'm not sure. In the example I created, there were 2 functions, one with 1 argument and the other with 2 arguments, and what was being passed to those 2 functions differed because of the number of arguments. I don't see how the test you created confirms that Series are passed independent of the function declaration.

So maybe you should have a test like this that addresses the particular issue in #33242:

```python
import pandas as pd


def twoargs(x, y):
    # The UDF should always receive a column (Series), regardless of how many
    # arguments it declares (the issue in #33242).
    assert isinstance(x, pd.Series)
    return x.sum()


def test_two_args(self):
    df = pd.DataFrame({
        "a": [1, 2, 3, 4, 5, 6],
        "b": [1, 1, 0, 1, 1, 0],
        "c": ["x", "x", "x", "z", "z", "z"],
        "d": ["s", "s", "s", "d", "d", "d"],
    })
    df.groupby("c")[["a", "b"]].agg(twoargs, 0)
```

rhshadrach (Member, Author) commented

I don't see how the test you created confirms that Series are passed independent of the function declaration.

We don't ever inspect the UDF to see what arguments it can take - our logic branches on whether additional arguments are passed in the call to agg: .agg(func, 0) vs .agg(func) previously resulted in two different paths, one which passed the DataFrame and one which passed each Series.

Still, no opposition to an additional test here. Will add.

Labels: API Design, Apply, Enhancement, Groupby, Needs Discussion
Milestone: 3.0
Participants: @rhshadrach, @mroeschke, @Dr-Irv
