BUG: groupby.agg with UDF changing pyarrow dtypes #59601
Conversation
…unt for missing pyarrow dtypes
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
…group_by_agg_pyarrow_bool_numpy_same_type
result = gb.agg(lambda x: {"number": 1})
arr = pa.array([{"number": 1}, {"number": 1}, {"number": 1}])
expected = DataFrame(
    {"B": ArrowExtensionArray(arr)},
    index=Index(["c1", "c2", "c3"], name="A"),
)
When the column starts as a PyArrow dtype and returns dictionaries, it seems questionable to me whether we should return the corresponding PyArrow dtype. The other option is a NumPy array of object dtype. But both seem like reasonable results and I imagine the PyArrow is likely to be more convenient for the user who is using PyArrow dtypes.
maybe_convert_objects has a convert_to_nullable keyword. If you pass that as True you'll get back a numpy-nullable array, which you can then convert to pyarrow. Not sure if that is actually better than what you're doing here.
npvalues = lib.maybe_convert_objects(result, try_float=False)
if isinstance(obj._values, ArrowExtensionArray):
    from pandas.core.dtypes.common import is_string_dtype
can this go at the top?
if not isinstance(obj._values, np.ndarray):
    # When obj.dtype is a string, any object can be cast. Only do so if the
    # UDF returned strings or NA values.
    if not is_string_dtype(obj.dtype) or is_string_dtype(
i suspect what you really want here is lib.is_string_array?
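The difference between the two checks is that `is_string_dtype` inspects a dtype while `lib.is_string_array` inspects the actual values, which matters for object-dtype results from a UDF. A minimal sketch of the suggested helper:

```python
import numpy as np
from pandas._libs import lib

# An object-dtype result holding strings and a missing value, as a UDF
# might produce.
vals = np.array(["a", "b", None], dtype=object)

# With skipna=True, NA values are ignored and only the non-missing
# elements must be strings.
print(lib.is_string_array(vals, skipna=True))   # True

# Mixed types fail the check.
mixed = np.array(["a", 1], dtype=object)
print(lib.is_string_array(mixed, skipna=True))  # False
```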
The correct solution here and to several related issues is an EA._construct_with_inference method that behaves similarly to maybe_convert_objects but preserves dtype backend. xref #56430. (Felt the need to plug that, but doesn't need to be a blocker for this in the interim)
Continuation of #58129

Root cause: agg_series always forces the output dtype to be the same as the input dtype, but depending on the lambda, the output dtype can be different.

Fix: the result cannot simply go through maybe_convert_objects, as maybe_convert_objects does not check for NA and forces the dtype to float if NA is present (NA is not float in pyarrow).