BUG: groupby.agg with UDF changing pyarrow dtypes #59601
Conversation
…unt for missing pyarrow dtypes
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
…group_by_agg_pyarrow_bool_numpy_same_type
result = gb.agg(lambda x: {"number": 1})
arr = pa.array([{"number": 1}, {"number": 1}, {"number": 1}])
expected = DataFrame(
    {"B": ArrowExtensionArray(arr)},
    index=Index(["c1", "c2", "c3"], name="A"),
)
When the column starts as a PyArrow dtype and returns dictionaries, it seems questionable to me whether we should return the corresponding PyArrow dtype. The other option is a NumPy array of object dtype. But both seem like reasonable results and I imagine the PyArrow is likely to be more convenient for the user who is using PyArrow dtypes.
maybe_convert_objects has a convert_to_nullable keyword. If you pass that as True you'll get back a numpy-nullable array, which you can then convert to pyarrow. Not sure if that is actually better than what you're doing here.
npvalues = lib.maybe_convert_objects(result, try_float=False)
if isinstance(obj._values, ArrowExtensionArray):
    from pandas.core.dtypes.common import is_string_dtype
can this go at the top?
if not isinstance(obj._values, np.ndarray):
    # When obj.dtype is a string, any object can be cast. Only do so if the
    # UDF returned strings or NA values.
    if not is_string_dtype(obj.dtype) or is_string_dtype(
i suspect what you really want here is lib.is_string_array?
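The difference between the two checks is that `is_string_dtype` inspects a dtype while `lib.is_string_array` inspects the actual values, which matters for object-dtype results from a UDF. A minimal sketch of the suggested helper:

```python
import numpy as np
from pandas._libs import lib

# An object-dtype result holding strings and a missing value, as a UDF
# might produce.
vals = np.array(["a", "b", None], dtype=object)

# With skipna=True, NA values are ignored and only the non-missing
# elements must be strings.
print(lib.is_string_array(vals, skipna=True))   # True

# Mixed types fail the check.
mixed = np.array(["a", 1], dtype=object)
print(lib.is_string_array(mixed, skipna=True))  # False
```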
The correct solution here and to several related issues is an EA._construct_with_inference method that behaves similarly to maybe_convert_objects but preserves dtype backend. xref #56430. (Felt the need to plug that, but doesn't need to be a blocker for this in the interim)
Continuation of #58129

Root cause: agg_series always forces the output dtype to be the same as the input dtype, but depending on the lambda, the output dtype can be different.

Fix: the result cannot simply go through maybe_convert_objects, as maybe_convert_objects does not check for NA and forces the dtype to float if NA is present (NA is not float in pyarrow).