Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork18.5k
API (string dtype): implement hierarchy (NA > NaN, pyarrow > python) for consistent comparisons between different string dtypes#61138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
…for consistent comparisons between different string dtypes
expected = pd.array([None, None, None], dtype=expected_dtype) | ||
tm.assert_extension_array_equal(result, expected) | ||
# # with list | ||
# other = [None, None, "c"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Did you want to implement testing this in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Yes, this was already implemented, just need to add this case back to the test. The original "array" test was actually testing with a list. I updated the test to now actually use an array (parametrized with all the different dtypes, to get all combinations of dtypes in both operands), and added a separate test with just the list.
9a0c382
to4ebd93b
Compareresult = getattr(a, op_name)(pd.NA) | ||
expected = pd.array([None, None, None], dtype=expected_dtype) | ||
tm.assert_extension_array_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
For this case of comparing with NA, we already have a dedicated test just above, so removing it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Needs a whatsnew?
Uh oh!
There was an error while loading.Please reload this page.
…ng-dtype-comparison-methods-priority
@jorisvandenbossche - I've merged main and pushed a commit here. If you have any objections, I can pull it off.
For the last one, previously ArrowExtensionArray vs Nan-Python was giving back NumPy bool. This was the only case where ArrowExtensionArray was not resulting in ArrowExtensionArray.
object dtype looks correct to me here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
lgtm,@mroeschke can you have a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
@rhshadrach thanks for updating this!
object dtype looks correct to me here.
Hmm, not entirely sure anymore what I meant with that object dtype was not yet covered. I thought maybe the case where the object dtype does not contain just strings, but also that seems to work fine
in determining the result dtype when there are different string dtypes compared. Some examples: | ||
- When ``pd.StringDtype("pyarrow", na_value=pd.NA)`` is compared against any other string dtype, the result will always be ``boolean[pyarrow]``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think this is not correct, it returns just the nullableboolean
dtype? (i.e.pd.BooleanDtype()
) Whereboolean[pyarrow]
is an alias forpd.ArrowDtype(pa.boolean())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Hmm, I see that it is actually the behaviour with this PR as well, but I thought I would have "fixed" that while making things consistent
jorisvandenbosscheMay 14, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
And I also see that I coded explicitly myself this expected dtype in the tests ...
rhshadrachMay 18, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Given the ordering
object < (python, NaN) < (pyarrow, NaN) < (python, NA) < (pyarrow, NA)
when we compare(pyarrow, NA)
with anything we want the result to be as if we compared(pyarrow, NA)
with itself, which should result inboolean[pyarrow]
.
One case related to object dtype that is still failing is comparing with an object series that has mixed types: In [3]:ser1=pd.Series(["a",None,"b"],dtype=pd.StringDtype("pyarrow",na_value=np.nan))In [4]:ser2=pd.Series(["a",None,2],dtype=object)In [5]:ser1==ser2...File~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:517,inArrowExtensionArray._box_pa_array(cls,value,pa_type,copy)514pa_array=pa.array(value,type=pa_type,from_pandas=True)515except (pa.ArrowInvalid,pa.ArrowTypeError):516# GH50430: let pyarrow infer type, then cast-->517pa_array=pa.array(value,from_pandas=True)519ifpa_typeisNoneandpa.types.is_duration(pa_array.type):520# Workaround https://github.com/apache/arrow/issues/37291521frompandas.core.tools.timedeltasimportto_timedelta...ArrowTypeError:Expectedbytes,gota'int'objectIn [6]:ser1=pd.Series(["a",None,"b"],dtype=pd.StringDtype("python",na_value=np.nan))In [7]:ser1==ser2Out[7]:0True1False2False So with just object dtype, such a comparison works. And it also works with the python-backed string dtype. But fails with the pyarrow-backed string dtype, because in this case the comparison defers to the ArrowExtensionArray implementation, which tries to convert the other side to a pyarrow array, which is not supported for mixed types. While we generally (although in many cases definitely not best practice) mixed-types object dtype in pandas. (but let's consider this for a separate issue/PR) |
6177e22
intopandas-dev:mainUh oh!
There was an error while loading.Please reload this page.
Thank@jorisvandenbossche and@rhshadrach |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free tosuggest an improvement. |
Uh oh!
There was an error while loading.Please reload this page.
Closes#60639
This does not yet handle the case of comparison to object dtype.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.