BUG: quantile should error when weights are all zeros #28595


Open
Tontonio3 wants to merge 19 commits into numpy:main from Tontonio3:quantile-warn

Conversation

Tontonio3 (Contributor):

Fixes issue #28589

np.quantile now raises errors if:

  • All weights are zero
  • At least one weight is np.nan
  • At least one weight is np.inf
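
For context, a minimal sketch of the intended behaviour (illustrative only; the exact error messages follow the wording used in the diff below and may still change):

import numpy as np

a = np.arange(5, dtype=float)

# Weights currently require method="inverted_cdf".
# All-zero weights: normalising the weighted CDF would divide by zero,
# so with this PR a ValueError is raised instead of a RuntimeWarning.
np.quantile(a, 0.5, weights=np.zeros(5), method="inverted_cdf")

# A NaN or inf weight likewise raises a ValueError rather than
# silently producing a meaningless result.
np.quantile(a, 0.5, weights=[1, np.nan, 1, 1, 1], method="inverted_cdf")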

@tylerjereddy (Contributor) left a comment (edited):


Looks like a bunch of folks tried to fix this at the same time. This PR deals with more cases than the others, but handling NaN/Inf is a bit messier maybe. The other two PRs recently opened are:

I provided initial reviews to each since they are from first-time contributors, but we should probably consolidate one way or another.

raise ValueError("At least one weight must be non-zero")
if weights.dtype != object:
if np.any(np.isinf(weights)):
raise ValueError("Weights must be non-infinite")
Contributor:

nit: non-infinite -> finite, probably

with pytest.raises(ValueError) as ex:
    a = np.quantile(arr, q, weights=wgt, method=m)
assert "Weights must be non-infinite" in str(ex)
wgt[i] = 1
Contributor:

Probably sufficient to just check a single index rather than all of them. For the message check, I think we usually prefer using the match builtin to pytest.raises these days.
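
A minimal sketch of that suggestion (arr, wgt, q and m stand in for the existing test fixtures; the message text is the one currently in the PR, which the nit above suggests renaming to "finite"):

import numpy as np
import pytest

arr = np.arange(4, dtype=float)
wgt = np.ones(4)
wgt[0] = np.inf  # checking a single bad index is enough to exercise the error path

# match= verifies the error message via a regex, replacing the manual str(ex) assert
with pytest.raises(ValueError, match="Weights must be non-infinite"):
    np.quantile(arr, 0.5, weights=wgt, method="inverted_cdf")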

wgt[i] = np.inf
with pytest.raises(ValueError) as ex:
    a = np.quantile(arr, q, weights=wgt, method=m)
assert "Weights must be non-infinite" in str(ex)
Contributor:

could parametrize the test to check one weights array with a few NaNs and another with just 1, but probably don't need these two exhaustive loops

wgt[i] = np.nan
with pytest.raises(ValueError) as ex:
    a = np.quantile(arr, q, weights=wgt, method=m)
assert "At least one weight is nan" in str(ex)
Contributor:

Could also probably just parametrize over np.nan and np.inf + the expected message above for concision (and same comment about not needing the exhaustive loops).

Tontonio3 (Author):

Thanks, will do

Tontonio3 (Author) commented Mar 28, 2025 (edited):
but handling NaN/Inf is a bit messier maybe

I'll try to make it better, but np.quantile accepting dtype=object arrays makes it difficult, because np.isinf and np.isnan do not work with dtype=object arrays. That makes it hard to handle elegantly; I should probably open a pull request about this tbh.
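
For reference, a minimal reproduction of the limitation being described (current NumPy behaviour, not code from this PR):

import numpy as np

w = np.array([1.0, 2.0, np.nan], dtype=object)

# Elementwise ufuncs like np.isnan/np.isinf have no object-dtype loop,
# so this raises TypeError instead of returning a boolean mask.
try:
    np.isnan(w)
except TypeError as exc:
    print(exc)

# One workaround is to cast to float first; the cast itself raises if an
# element cannot be converted (e.g. None or a string).
print(np.isnan(w.astype(np.float64)))  # [False False  True]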

Also, when I started working on the issue the other requests weren't open yet lol

Tontonio3 (Author):

Is there anything I need to do to fix the tests? I'm kinda confused as to why they are failing

@mhvk (Contributor) left a comment:

Some comments inline, but also a more philosophical one here. Python usually avoids "Look Before You Leap" (LBYL) in favour of "Easier to Ask Forgiveness than Permission" (EAFP), and I worry a bit when seeing many checks for possible failure modes, which make the code more complex and also make things slower for the common case where the user puts in sensible numbers.

For quantile, there is already a check against negative weights, but those are quite important, because they would not lead to a failure, but still give a wrong result.

The case is a little different for inf and all-zero - those give a RuntimeWarning, so users could know something is amiss. But nan gives no warning (indeed, NaNs are allowed in the values...).

Anyway, maybe overall protecting the user is worth it, but we should at least try to make the performance penalty as small as possible, hence my suggestions inline.

But looking closer, there may be another option that would be faster: the weights get transferred down via various functions to _quantile. There, one sees:

# We use the weights to calculate the empirical cumulative
# distribution function cdf
cdf = weights.cumsum(axis=0, dtype=np.float64)
cdf /= cdf[-1, ...]  # normalization to 1

So, at this point we have two benefits: any object array has already been turned into float, and we only have to check the last element to see whether there are any problems, rather than scan the whole array. So, one could add in between these two lines,

if not np.all(np.isfinite(cdf[-1, ...])):
    raise ValueError(...)

For the check for all-zero, one can include it a little below, inside another branch:

        if np.any(cdf[0, ...] == 0):
            # might as well guard against all-zero here.
            if np.any(cdf[-1, ...] == 0):
                raise ValueError(...)
            cdf[cdf == 0] = -1

The above should be better for performance, though I can see the argument for just having nicely validated data to start with.
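
Putting those two snippets in context, a minimal sketch of how the check could sit inside _quantile (the surrounding lines are paraphrased from the existing function; the error message is a placeholder):

# inside _quantile, once the weights have been broadcast against the data
cdf = weights.cumsum(axis=0, dtype=np.float64)

# cumsum has already coerced object arrays to float64, and a NaN/inf weight
# propagates into the last cumulative sum, so checking cdf[-1, ...] covers
# the whole array without a separate full scan.
if not np.all(np.isfinite(cdf[-1, ...])):
    raise ValueError("Weights must be finite.")

cdf /= cdf[-1, ...]  # normalization to 1
# (the all-zero guard would live in the later `cdf[0, ...] == 0` branch,
#  as in the second snippet above)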

@@ -4535,8 +4535,26 @@ def quantile(a,
        if axis is not None:
            axis = _nx.normalize_axis_tuple(axis, a.ndim, argname="axis")
        weights = _weights_are_valid(weights=weights, a=a, axis=axis)
        if weights.dtype != object:
Contributor:

A general comment: I think these checks should happen inside _weights_are_valid - this will ensure they are used for percentile as well.

@@ -4142,7 +4142,25 @@ def test_closest_observation(self):
        assert_equal(4, np.quantile(arr[0:9], q, method=m))
        assert_equal(5, np.quantile(arr, q, method=m))

    @pytest.mark.parametrize(["err_msg", "weight"],
Contributor:

I'd parametrize over np.quantile and np.percentile as well - they should have the same errors.
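
A minimal sketch of that parametrization (the test name and q values are illustrative, not the names used in the PR; the dtype axis reflects the suggestion in the comment just below, and the message is the one from the PR's diff):

import numpy as np
import pytest

@pytest.mark.parametrize("func, q", [(np.quantile, 0.5), (np.percentile, 50)])
@pytest.mark.parametrize("dtype", [float, object])
def test_all_zero_weights_raise(func, q, dtype):
    arr = [1, 2, 3, 4]
    wgt = np.array([0, 0, 0, 0], dtype=dtype)
    # Both functions funnel through the same weights validation,
    # so the error and message should be identical.
    with pytest.raises(ValueError, match="At least one weight must be non-zero"):
        func(arr, q, weights=wgt, method="inverted_cdf")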

Contributor:

Also, if you pass in a list rather than an array, you could parametrize over dtype=float and dtype=object, to make this a little more readable.

Tontonio3 (Author):

Done

@@ -4535,8 +4535,26 @@ def quantile(a,
        if axis is not None:
            axis = _nx.normalize_axis_tuple(axis, a.ndim, argname="axis")
        weights = _weights_are_valid(weights=weights, a=a, axis=axis)
        if weights.dtype != object:
            if np.any(np.isinf(weights)):
Contributor:

Another general comment: as written, the common case has to go through a lot of checks. I think it would be better to optimize for the common case, and not worry too much about distinguishing failure cases. E.g., you can do just one evaluation with:

if not np.all(np.isfinite(weights)):
    raise ValueError("weights must be finite.")

        # Since np.isinf and np.isnan do not work on dtype=object arrays,
        # and dtype=object arrays with np.nan in them break the <, > and == operators,
        # this specific handling had to be done (can be improved)
        elif weights.dtype == object:
Contributor:

Note that this loop can still give unexpected errors, because you are here counting on object arrays to be turned into their values as scalars. E.g.,

np.isnan(np.array([1., None, np.inf])[1])
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs
# could not be safely coerced to any supported types according to the casting
# rule ''safe''

This will be an uninformative error!

I think we have two choices: just not check for object dtype, or convert to float before checking (and then pass on the error if that conversion fails).
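
A minimal sketch of the second option (the helper name and message are illustrative, not part of the PR):

import numpy as np

def _checked_weights(weights):
    # Casting object arrays to float64 up front lets np.isfinite work; if an
    # element cannot be converted (e.g. None or a string), the astype call
    # itself raises a reasonably informative error that we simply pass on.
    if weights.dtype == object:
        weights = weights.astype(np.float64)
    if not np.all(np.isfinite(weights)):
        raise ValueError("Weights must be finite.")
    return weights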

        if np.any(weights < 0):
            raise ValueError("Weights must be non-negative.")
        elif np.all(weights == 0):
Contributor:

Here again we could ensure the common case remains fast by doing:

if np.any(weights <= 0):
    raise ValueError("weights must be non-negative and cannot be all zero.")
    # or, more explicit error messages,
    if np.all(weights == 0):
        raise ValueError("At least one weight must be non-zero.")
    else:
        raise ValueError("Weights must be non-negative.")

Contributor:

Trying to keep this inline:

The issue with this is that some of the weights might be 0, but none of them are negative. So it would raise an error even though it shouldn't

You're right, I was too sloppy in writing this; the else should be elif np.any(weights < 0) so that the case of some weights being 0 falls through (slowly, but better than making all cases slow!).

p.s. Given this, I'd probably swap the order, i.e.,

if np.any(weights <= 0):
    # Do these checks guarded by the above `if` to avoid slowing down the common case.
    if np.any(weights < 0):
        raise ValueError("Weights must be non-negative.")
    elif np.all(weights == 0):
        raise ValueError("At least one weight must be non-zero.")

Member:

@Tontonio3 I don't see how you responded to this suggestion. Please make sure all reviewer feedback is addressed before requesting re-review.

Tontonio3 (Author):

@ngoldbaum You're right, I forgot to implement this

Tontonio3 (Author):

A general comment: I think these checks should happen inside _weights_are_valid - this will ensure they are used for percentile as well.

The issue is that _weights_are_valid is used in np.average, which does accept negative weights

Tontonio3 (Author) commented Apr 2, 2025 (edited):

if np.any(weights <= 0):
    raise ValueError("weights must be non-negative and cannot be all zero.")
    # or, more explicit error messages,
    if np.all(weights == 0):
        raise ValueError("At least one weight must be non-zero.")
    else:
        raise ValueError("Weights must be non-negative.")

The issue with this is that some of the weights might be 0, but none of them are negative. So it would raise an error even though it shouldn't

Tontonio3 (Author) commented Apr 2, 2025 (edited):

Another general comment: as written, the common case has to go through a lot of checks.

I'd love to do that; the issue is that everything breaks with dtype=object arrays. Which is extremely frustrating; it'd be easier to just not allow dtype=object arrays.

mhvk (Contributor) commented Apr 2, 2025:

A general comment: I think these checks should happen inside _weights_are_valid - this will ensure they are used for percentile as well.

The issue is that _weights_are_valid is used in np.average, which does accept negative weights

Duh, I thought I had checked that, but I now see I was wrong. Sorry about that!

mhvk (Contributor) commented Apr 2, 2025:

Another general comment: as written, the common case has to go through a lot of checks.

I'd love to do that; the issue is that everything breaks with dtype=object arrays. Which is extremely frustrating; it'd be easier to just not allow dtype=object arrays.

Yes, indeed, this is partially why I suggested to do the check much further down, when the object arrays are turned into float.

Tontonio3 (Author):

@mhvk I've implemented your suggestions. Although now, if you have a np.nan within an object array it will give this:
RuntimeWarning: invalid value encountered in greater

The "Improved testing" commit is there to preserve the old error handling.

Tontonio3 (Author):

@seberg @ngoldbaum is there anything else that needs to be done to close this issue?

ngoldbaum (Member):

The linter still needs to be fixed.

ngoldbaum added this to the 2.3.0 release milestone on Apr 25, 2025
@ngoldbaum (Member) left a comment:

Overall looks good to me. Some minor comments inline.

        if np.any(weights <= 0):
            if np.any(weights < 0):
                raise ValueError("Weights must be non-negative.")
            elif np.all(weights == 0):
Member:

can't you just say else here?

Tontonio3 (Author):

I can't just say else here, as the first if will be true if only one of the weights is 0, which is valid

Member:

you can move the == 0 to the sum to simplify this, though.

IMO the checks may be clearer as not all(weights >= 0) to reject NaNs right away, but it doesn't matter much as long as the == 0 case is handled below.

Also, why not use isfinite?
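
For reference, a minimal sketch of why the inverted comparison catches NaN (illustrative, not code from the PR):

import numpy as np

w = np.array([1.0, np.nan, 2.0])

# NaN compares False with everything, so a check based on np.any(w < 0)
# lets it slip through...
print(np.any(w < 0))       # False
# ...while inverting the test makes NaN fail the "all good" condition:
print(not np.all(w >= 0))  # True
# np.isfinite would additionally reject inf in the same pass:
print(not np.all(np.isfinite(w) & (w >= 0)))  # True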

q = 0.5
arr = [1, 2, 3, 4]
wgts = np.array(weight, dtype=dty)
with pytest.raises(err):
Member:

might as well use match here too

charris (Member):

@ngoldbaum Still look good to you?

Reviewers

@seberg left review comments

@mhvk left review comments

@tylerjereddy left review comments

@ngoldbaum approved these changes

Assignees
No one assigned
Projects
Status: Awaiting a code review
Milestone
2.4.0 release

6 participants
@Tontonio3 @mhvk @ngoldbaum @charris @seberg @tylerjereddy
