BUG: groupby.agg with UDF changing pyarrow dtypes #59601


Open
rhshadrach wants to merge 45 commits into pandas-dev:main from rhshadrach:fix/group_by_agg_pyarrow_bool_numpy_same_type

Conversation

rhshadrach (Member) commented Aug 25, 2024 (edited)

Continuation of #58129

Root cause:

  • agg_series always forces the output dtype to match the input dtype, but depending on the lambda, the output dtype can differ

Fix:

  • replace all NA with nan
  • convert `results` to the respective pyarrow extension array, using pyarrow library methods
  • pyarrow library methods are used instead of maybe_convert_objects, as maybe_convert_objects does not check for NA and forces the dtype to float if NA is present (NA is not float in pyarrow)

Kei added 30 commits April 1, 2024 19:04
rhshadrach marked this pull request as draft August 25, 2024 13:02
rhshadrach added the Groupby, Arrow (pyarrow functionality), pyarrow dtype retention (op with pyarrow dtype -> expect pyarrow result), and Bug labels and removed the Arrow (pyarrow functionality) label Aug 25, 2024
github-actions (GitHub Actions, Contributor) commented:

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

rhshadrach changed the title from "Fix/group by agg pyarrow bool numpy same type" to "BUG: groupby.agg with UDF changing pyarrow dtypes" Oct 6, 2024
rhshadrach marked this pull request as ready for review March 22, 2025 16:15
Comment on lines +1899 to +1905
result = gb.agg(lambda x: {"number": 1})

arr = pa.array([{"number": 1}, {"number": 1}, {"number": 1}])
expected = DataFrame(
    {"B": ArrowExtensionArray(arr)},
    index=Index(["c1", "c2", "c3"], name="A"),
)
rhshadrach (Member, Author) commented:

When the column starts with a PyArrow dtype and the UDF returns dictionaries, it seems questionable to me whether we should return the corresponding PyArrow dtype. The other option is a NumPy array of object dtype. Both seem like reasonable results, and I imagine the PyArrow result is likely to be more convenient for a user who is working with PyArrow dtypes.

rhshadrach added this to the 3.0 milestone Mar 23, 2025

github-actions (GitHub Actions, Contributor) commented:

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Reviewers: mroeschke (awaiting requested review)
Assignees: No one assigned
Labels: Bug, Groupby, pyarrow dtype retention (op with pyarrow dtype -> expect pyarrow result), Stale
Projects: None yet
Milestone: 3.0

Development

Successfully merging this pull request may close these issues:

BUG: Groupby-aggregate on a boolean column returns a different datatype with pyarrow than with numpy

2 participants: rhshadrach, undermyumbrella1
