This is a rebase and update of several previous PRs:#21977 and#16022, which were both reverted due to issues downstream. I've kept a bunch of smaller commits showing the updates to comments to try and make it easier for review.

I've tested with Astropy and there are still two failing tests over there, which I don't quite understand yet, so I'm hoping a review might be able to point out some hints for what I'm missing. It looks like we do enterMaskedColumn.__array_wrap__, but with acontext=None, so that wrap isn't undoing anything because it doesn't know which function it came from... I'm not sure if this means the update needs to be handled in numpy or astropy.

astropy/units/tests/test_quantity_ufuncs.py::TestQuantityMathFuncs::test_power_array_array3astropy/table/tests/test_column.py::TestColumn::test_numpy_boolean_ufuncs[MaskedColumn]

cc@rcomer,@seberg,@mhvk since you've all had some great comments and help on the previous PRs!

Summary

This enables any ufunc numpy operations that are called on a MaskedArray to use the masked version of that function automatically without needing to resort to np.ma.func() calls directly.

Example

importnumpyasnpa=np.ma.array(np.arange(10), [0,1]*5)x=a<10# x.data == [True,  True,  True,  True,  True, False, False, False, False, False]y=np.ma.less(a,10)# y.data == [True,  True,  True,  True,  True,  True, False,  True, False, True]

Even though data values ina are masked, they are still acted upon and evaluated. Changing that underlying data. However, calling the masked less version doesn't do anything with the data in the array. Note that this is a big change but to a somewhat ambiguous case of what happens to masked values under function evaluation.

The real reason I began looking into this was to not evaluate the function in the masked values because I'd already premasked the condition that would throw warnings at me. See#4959 for evaluating log10 at locations less than 0 that were under the mask, so still threw warnings.

importnumpyasnpA=arange(-2,5)/2Am=numpy.ma.masked_less_equal(A,0)np.log10(Am)# Previously threw a warning, now works as expectednp.ma.log10(Am)# No warnings before or after

Implementation

I haven't added any documentation yet because I wanted to check and see if this was even the right way of approaching this idea. I struggled to figure out how to pass theout= object pointers around in a consistent manner, so I am definitely interested in suggestions on improving that. I had to update quite a few seemingly random areas of the Masked codebase to make that work everywhere. There are a few tests that I had to update because values under the mask were being compared, and those have now changed with this implementation applying the masked version of ufuncs everywhere.

This is a pretty major (underlying) API change, so I expect there to be quite a bit of discussion on the best approaches here.

Linked Issues

Fixes#4959
Closes#15200

greglucasand others added19 commits

January 2, 2023 17:15

ENH: Adding __array_ufunc__ capability to MaskedArrays.

ab732d0

This enables any ufunc numpy operations that are called on aMaskedArray to use the masked version of that function automaticallywithout needing to resort to np.ma.func() calls.

DOC: Adding improvement note for MaskedArray ufunc

63df460

nomask in nomask out

a63e97a

BUG: fix ma.minimum.reduce with axis keyword

eca1e3c

Fixes the problem reported atnumpy#21977 (comment)The reduce method here effectively calls itself with an unmaskedMaskedArray (mask=nomask) and then expects either a MaskedArray ora scalar.  This change ensures that an ordinary ndarray isconverted to a MaskedArray, following the pattern already used inmean and var in this module.

TST: add a test for ma.minimum.reduce with axis keyword

997d27d

Adapted from the problem reported atnumpy#21977 (comment)

MNT: Remove __add__ and other unnecessary overrides

bd091a6

ENH: Add sign ufunc to masked arrays

b5b9ad3

ENH: Remove masked ufunc power restriction

d0ac064

Now we are calling np.power() in std() which goes through the ufuncmachinery, so we don't want to pass any additional unsafe castingkwargs that aren't allowed within the masked implementation.

ENH: Add masked invert ufunc

dd74620

MNT: Remove some restrictions on masked array ufunc implementations

c638cdc

FIX: Rearrange handling of ndarray-like ufuncs

b35c309

Move the np-ufunc check to the top of the routine so we immediatelygo to super() when necessary. Before we were returning NotImplementedif an arg wasn't able to be handled.Update the arg instance check to defer for everything but anotherclass that has implemented __array_ufunc__

FIX: Pass kwargs through ufunc reduce

de22beb

MNT: Add power ufunc to masked arrays

9564f27

FIX: add back getmaskarray to __array_wrap__ of masked arrays

a7ba76f

FIX: view should return the type(self) rather than MaskedArray

fa6c56f

This allows for subclasses to be handled correctly

FIX: Revert sum and trace overrides from masked arrays

cbfd86f

ENH: Add rint and update round method

4a58583

FIX: Change back partition/argpartition

2acf530

FIX: Remove compress updates

e6e80e6

github-actionsbot added the 01 - Enhancement label

Jan 3, 2023

MAINT: Remove unused delegate_binop code

366dfc3

This is handled in the C code now within the ufunc machinery.

greglucas force-pushed themasked-ufunc-2 branch from1bf272f to366dfc3Compare

January 3, 2023 00:26

charris added the component: numpy.mamasked arrays label

Jan 3, 2023

Copy link

Contributor

mhvk commentedJan 3, 2023

@greglucas - theastropy.units error I think we would need to work around - as is, we don't really treatMaskedArray correctly and it shows here; we will just have to deal with it...
But theastropy.table error is more perhaps more serious:MaskedColumn is a subclass ofMaskedArray and it assumes it does not have to worry about__array_ufunc__. While we could likely define it and make things work again (indeed, that might be good), we may just be the canary in the coalmine: any otherMaskedArray subclasses may well break too.

Copy link

Contributor

mhvk commentedJan 3, 2023

Actually, I quickly tried to implement__array_ufunc__ myself onMaskedColumn, and it exposed a bug in the implementation here even fornp.ma.MaskedArray:

In [2]: np.signbit(np.ma.MaskedArray([1., 2.], mask=[True, False]))---------------------------------------------------------------------------TypeError                                 Traceback (most recent call last)Cell In[2], line 1----> 1 np.signbit(np.ma.MaskedArray([1., 2.], mask=[True, False]))TypeError: operand type(s) all returned NotImplemented from __array_ufunc__(<ufunc 'signbit'>, '__call__', masked_array(data=[--, 2.0],             mask=[ True, False],       fill_value=1e+20)): 'MaskedArray'

The reason is that if the function is not defined innp.ma, you just hand off tosuper(), i.e.,np.ndarray.__array_ufunc__. But that one returnsNotImplemented because it does not know how to deal withMaskedArray. I think you'd have to extract the data and copy the mask over. Though perhaps easier is to definenp.ma.signbit!

Copy link

Contributor

mhvk commentedJan 3, 2023

p.s. Might want to check test completeness by checking all functions innp.testing.overrides.get_overridable_numpy_ufuncs() are covered...

seberg reviewed

Jan 4, 2023

View reviewed changes

numpy/ma/core.py

		if ma_ufunc is np_ufunc:
		# We didn't have a Masked version of the ufunc, so we need to
		# call the ndarray version to prevent infinite recursion.
		return super().__array_ufunc__(np_ufunc, method, inputs, *kwargs)

Copy link

Member

sebergJan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

You will need to unwrap the data arrays here and then call things again. On the plus side, it may actually make sense to always just do that.
Also, you should not usesuper() here, but rather call the original ufunc I think, because after unwrapping it could be that you need to dispatch to a different argument.

There is the small change that the ufuncs behave a bit different from thenp.ma.ufunc version (and maybe even better?): seegh-22347.
So there might even be a point of just ignoring whatnp.ma.func does.

After the fallout from the "simple"gh-22046, I am a bit hesitant about changing these subtleties, we probably need to test very carefully and I am wondering if we should push it of for a "2.0" (which I hope to bring up now for in a year). In that case, a possible approach could be to hide these definitions behind an environment variable (feature flag) maybe.

Copy link

Contributor

mhvk commentedJan 4, 2023

Maybe worth pulling@seberg's comment in the main thread: a question is whether the regular numpy functions should just call thenp.ma ones, including the, to me, rather strange use of domains (seegh-22347), or rather do a simpler wrapping where masks of input arguments simply or-d together, without worry about the values of the inputs. Arguing against this of course is that one then is basically moving to a new class (which I did for astropy; seehttps://docs.astropy.org/en/latest/utils/masked/index.html and in particular the discussion of differences withnp.ma.MaskedArray).

But probably it is easier to just add the few missing ufunc tonp.ma and work with what you have!

Copy link

ContributorAuthor

greglucas commentedJan 5, 2023

p.s. Might want to check test completeness by checking all functions in np.testing.overrides.get_overridable_numpy_ufuncs() are covered...

But probably it is easier to just add the few missing ufunc to np.ma and work with what you have!

Thanks for pointing me to that function! Somewhat surprisingly, there are actually 30 ufuncs that have no implementation innp.ma currently! Most of them are easy, but I don't think there is any wrapper class that handles multi-output functions yet (ex:https://numpy.org/doc/stable/reference/generated/numpy.frexp.html)

I went digging through some of the old mailing list archives that I wasn't aware of and came across this really nice discussion:
https://mail.python.org/archives/list/numpy-discussion@python.org/thread/OSGGDTJX6XZK3LXLNQYGZP766FD3YFQQ/#QNVWET3O7KXFGAB7NVZNWUGHE766TGNL
and if I'm parsing it all correctly it seems like people are basically on the same page and working out the final details is all that is left. I had also forgotten that@ahaldane had pointed me at their implementation of a new class in one of the previous PRs:
https://github.com/ahaldane/ndarray_ducktypes/blob/master/ndarray_ducktypes/MaskedArray.py

My goal for this was to have a bit of a stepping stone before going all-in on a new class. However, I am now possibly seeing that this might just be playing a game of whack-a-mole that I'm sure will change some behavior and upset someone, so maybe we should just do a hard-breaking-change all at once instead of introduce these small papercuts. (i.e. work on the better implementation for a 2.0 release rather than adding these now)

I think one of the things missing from many of the discussions is that many downstream libraries make use ofnp.sum(),np.stack(), etc... without thinking about what instance is being passed into them (various arrays from numpy, pandas, dask, masked), so I do think it is important to try and start dispatching thenp.func tonp.ma.func specific methods for the Masked class since that seems to be the odd one out at the moment...

My desires from a new class:

No auto-masking of "bad" computation locations, let the user do that apriori if they want

I haven't seen anyone argue for this yet, and this looks like it goes way back in git blame.

No warnings when doing a computation of a bad value under a mask

It would be annoying to get divide by zero warnings on your computations when you've already pre-masked the data (see my example at the top of the PR for the motivation here!)

np.func gets forwarded tonp.ma.func whenever possible (implement__array_ufunc__ and__array_function__)

It is quite surprising right now to donp.stack(ma, ma) and get a Masked instance back, but with an incorrect mask! (it would be better to just get a plain ndarray IMO)

Proper subclass wrapping/heirarchy preservation

Should fall out of (3) if implemented properly with NotImplemented being returned.

I'll try and take a better look at@ahaldane's implementation in nd_duck_array and@mhvk's implementation in Astropy. I think right now we are missing having a clear path between where we are now and where we want to go, so something a little more concrete with actions of what is needed might be nice. I'll try and give it some thought.

Copy link

Contributor

mhvk commentedJan 5, 2023

Thanks, that's a great summary! There is only one item I'd add: think carefully how to deal with underlyingdata other thanndarray. For instance, part of the reason I made a separate implementation in astropy was that the currentMaskedArray simply cannot be used to createMaskedQuantity -- withMaskedArray holding aQuantity attributes like the unit become hidden (and I think@ahaldane's version did not make that easy either). I found it worked much better by havingQuantity on top, effectively holding a masked array as data (in our case, with corresponding class hierarchyMaskedQuantity(Quantity, MaskedNDArray)). But for something likedask, it might be better to have both the data and the mask be dask arrays, while for Xarray, that would not make sense.

Overall, it certainly is not crazy to move in small steps! I think a bigger question here is whether when wrapping regular numpyufunc, one wants to just get thenp.ma.<ufunc> equivalent, or something that moves closer to a final goal (i.e., as you note, no masking of bad elements but also no warnings for masked ones [note that the latter may be quite tricky...]).