ENH: add `quantile` function with weights support #494
base: main
Conversation
lucyleeow commented Oct 21, 2025

> quantile function - method="linear", no weights

I am assuming all delegate packages do what numpy does in terms of broadcasting? If that is the case, we should probably follow, unless we think scipy has better rules?
> quantile function with weights support

```python
    raise ValueError(msg)
else:
    if method not in {"inverted_cdf", "averaged_inverted_cdf"}:
        msg = f"`method` '{method}' not supported with weights."
```
only method x/ y support weights?
```python
methods = {"linear", "inverted_cdf", "averaged_inverted_cdf"}
if method not in methods:
    msg = f"`method` must be one of {methods}"
```
Sort methods to get a deterministic output?
I think it's deterministic already. But do you mean declaring `methods` in sorted order? Like this: `methods = {"averaged_inverted_cdf", "inverted_cdf", "linear"}`
I thought this was not deterministic - but maybe this was in older python versions or between different OS, or maybe I just misremembered? In any case - sorry about the noise.
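For what it's worth, both observations can be reconciled: within one interpreter run, set iteration order is stable, but across runs hash randomization (`PYTHONHASHSEED`) can reorder a set of strings, so the rendered message may differ between processes. A small sketch (mine, not from the PR) of the sorted-message idea:

```python
# Sets of strings can iterate in a different order in different interpreter
# runs (hash randomization), so sorting gives a reproducible error message.
methods = {"linear", "inverted_cdf", "averaged_inverted_cdf"}
msg = f"`method` must be one of {sorted(methods)}"
print(msg)
```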
```python
    raise ValueError(msg)
nan_policies = {"propagate", "omit"}
if nan_policy not in nan_policies:
    msg = f"`nan_policy` must be one of {nan_policies}"
```
ditto
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
lucascolley commented Oct 25, 2025
@cakedev0 would you be able to fill out the PR description a little more to justify API decisions, primarily comparing how the implementation compares to what is available in NumPy and SciPy respectively? Is there a link to a PR which would use this in sklearn, or are we not at that stage yet?
cakedev0 commented Oct 25, 2025
Added some comparisons with numpy/scipy in the PR description.
We are not at this stage yet, but I would be happy to open an "experiment PR" where I replace most calls to
lucascolley commented Oct 25, 2025
Thanks, yeah, that sounds like a good idea.
cakedev0 commented Oct 26, 2025
Updates from working on the "experiment PR" (you can see the diff here: cakedev0/scikit-learn#3):
cakedev0 commented Oct 27, 2025
(putting this in draft mode while I'm working on supporting nan_policy="omit" for the non-weighted case)
lucyleeow commented Oct 28, 2025
FYI @cakedev0, there is currently a discussion about nan API in numpy: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/P4EEL5P2PNDASWLI24CPHOW2KG2LZ3YY/
Can you expand?
cakedev0 commented Oct 28, 2025 (edited)
Sorry, not the (but I guess it's fine not to support all the methods in the array API version)
mdhaber commented Nov 4, 2025 (edited)
Perhaps I can't be entirely objective, but I think there's also an objective argument against a split between
It sounds like the pain points are the broadcasting behavior and weights. Let's see if I can address those.
NumPy is self-inconsistent on this one. NEP 5 specifies what generalized universal functions should look like. I don't know the history - maybe
Now, this is not to say that either of the NumPy precedents should be the final answer for the array API or
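To illustrate what gufunc-style (NEP 5) semantics would look like for quantile, here is a sketch of mine using `np.vectorize` (not code from either library): the probability broadcasts element-wise against the loop dimensions of the data, whereas NumPy's actual `np.quantile` prepends `p`'s shape to the output (a Cartesian product).

```python
import numpy as np

# Gufunc-style quantile with signature "(n),()->()": x has one core
# dimension; each output element pairs one slice of x with one p.
gu_quantile = np.vectorize(
    lambda xi, pi: np.quantile(xi, pi), signature="(n),()->()"
)

x = np.arange(200.0).reshape(4, 5, 10)
p = np.full((4, 5), 0.5)    # one probability per (4, 5) slice
out = gu_quantile(x, p)     # loop dims broadcast element-wise: shape (4, 5)

# NumPy's np.quantile instead prepends p.shape to the result:
out_np = np.quantile(x, np.array([0.25, 0.5, 0.75]), axis=-1)  # shape (3, 4, 5)
```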
If I can get some feedback about scipy/scipy#22644 - which is about
lorentzenchr commented Nov 5, 2025
After getting weights into numpy quantile/percentile, I was too exhausted to continue pushing it elsewhere. At least now, there is one reference, numpy. I would love to see it adopted elsewhere. Quantiles are just very important in many fields.
mdhaber commented Nov 5, 2025
I drafted it for
The same code works for all H&F methods. I was a little concerned about non-integer weights before, but I don't think there's a real problem. For integer weights, it does the same thing as repeated samples, so we can document that these are frequency weights. It's just that for most methods, the interpolation actually *does* depend on the total amount of data / total weight.

Details:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(328983495438219)
p = np.linspace(0, 1, 300)
res = np.quantile([0, 1], p, method='linear')
res2 = np.quantile([0, 0, 1, 1], p, method='linear')
res3 = np.quantile([0, 0, 0, 1, 1, 1], p, method='linear')
plt.plot(p, res, p, res2, p, res3)
```

(plot omitted: the three linear-interpolation curves differ as the sample is duplicated)
mdhaber commented Nov 6, 2025 (edited)
PR is up at scipy/scipy#23941. Here's a wrapper that hamstrings

```python
import numpy as np
from scipy import stats

def quantile_cartesian_product(x, p, method='linear', axis=0):
    p_shape = p.shape
    p = np.ravel(p)
    x = np.moveaxis(x, axis, -1)
    res = stats.quantile(x, p, method=method, axis=-1, keepdims=True)
    if len(p_shape):
        res = np.moveaxis(res, -1, 0)
        res = np.reshape(res, p_shape + res.shape[1:])
    else:
        res = np.squeeze(res, axis=-1)
    return res

# test
rng = np.random.default_rng(234892398458345982)
x = rng.random((6, 7, 8))
for axis in [-1, 0, 1]:
    for p_shape in [(), (1,), (2,), (2, 3)]:
        p = rng.random(p_shape)
        kwargs = dict(axis=axis, method='weibull')
        res = quantile_cartesian_product(x, p, **kwargs)
        ref = np.quantile(x, p, **kwargs)
        np.testing.assert_allclose(res, ref)
```
lucascolley commented Dec 1, 2025
where does this stand now that scipy/scipy#23941 has merged?
cakedev0 commented Dec 8, 2025
Personally, I don't really agree with some decisions that were made in
Also, I'm more busy lately and I don't want to spend time porting the quantile function from scipy here: it's really big now and relies on a lot of internal scipy helpers, so it would be a lot of work.
Being very new to the ecosystem, I'm not sure my opinion is well informed; still, here are my disagreements with the choices that were made in scipy:
About the broadcasting behavior: I agree that the behavior of
So for me, it's not clear which option is the best. IMO, here are the pros and cons of the scipy approach:
About the weights: personally, I don't understand the interest of frequency weights when you have a valid interpretation only for integers. Those should be called
For methods
lucascolley commented Dec 8, 2025
I don't have a horse in the race w.r.t. the nicest user interface, so I will leave that discussion to those already involved. What I will say is that from an ecosystem perspective, now that SciPy has committed to the public API, I would in principle be happy with a situation where we:
The downside of having two functions with the same name providing different semantics is real. But if the consensus from the technical discussion is that there are real tradeoffs such that each set of semantics has situations in which it is preferable, that might be acceptable.
mdhaber commented Dec 8, 2025 (edited)
Naturalness is subjective, so we valued consistency with the rest of
Of course they are. But like every other
Now, I don't necessarily like the existence of
Every other independent two-sample statistic in SciPy (e.g.
If you mean "they don't broadcast the arrays exactly like
After that, the only difference is that the length of the output core dimension of
About a dozen functions in SciPy accept frequency weights.
cakedev0 commented Dec 9, 2025 (edited)
About NEP 5: can you be more precise about how scipy's quantile respects it and numpy's quantile doesn't? I've read it a couple of times, and honestly I don't fully understand. One of the main examples is
So to your question:
The answer is not as obvious as you make it sound. If the Cartesian product is valid for
Edit:
Both exist. So maybe it's fine for both versions of quantile to exist in the array-API space? 😄
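For readers following along, the Cartesian-product behavior under discussion is what NumPy does today: every probability is evaluated against every reduced slice, and `p`'s shape is prepended to the output. A small demo of mine:

```python
import numpy as np

x = np.ones((4, 5))
p = np.array([0.25, 0.5, 0.75])

# axis=-1 reduces the length-5 dimension; p.shape is prepended to the
# result, so every p is paired with every row of x (Cartesian product).
out = np.quantile(x, p, axis=-1)
print(out.shape)  # (3, 4)
```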
cakedev0 commented Dec 9, 2025
To me, this shows that quantile is different from all other functions in
Also, if the default of
cakedev0 commented Dec 9, 2025 (edited)
I'm not objective because I invested time in some code that will be completely thrown away if scipy.stats choices are adopted here. I argued for my choices in detail, or at least clarified that other choices come with trade-offs too. Now, I'll let people who have no horse in the race decide, and:
mdhaber commented Dec 9, 2025 (edited)
"which is now
Which is different from what NEP 5 says, that
```python
import numpy as np

N = 10
a = np.ones((3, 5, N))
b = np.ones((5, N))
np.inner(a, b).shape  # (3, 5, 5)
```

I'd suggest that
Which states:
I say "correct" because presumably the discussions that led to the inclusion of
The Cartesian product is OG NumPy behavior. NEP 5 broadcasting superseded it. Let's not go backwards.
cakedev0 commented Dec 9, 2025
... Well that is very convincing. Immense thanks for the detailed explanation, I definitely learned a lot there (and sorry for the time you had to invest; I hope I'll be able to make good use of the understanding I gained here in future contributions).
Indeed no. Sorry, I assumed that based on the very similar names, my bad!
cakedev0 commented Dec 9, 2025
I still think that this behavior isn't right:
But this is more of a detail, and I do understand why it's like this. I'll try to see if I can propose something I would find better, but I'm not sure I'll be able to. Maybe no
You said you don't like
mdhaber commented Dec 10, 2025 (edited)
You're welcome to argue against the existence of
OK, let's think through the implications of that signature with
But

```python
from scipy.stats import expectile
import numpy as np

expectile(np.arange(10), np.asarray([0.5]))  # np.float64(4.5)
expectile(np.arange(10), 0.5)                # np.float64(4.5)
```

The documented expectation is that
So really, we have this situation where we want to support
So you want to get the best of both worlds - efficiency benefits of
Resolves #340
Specs:
- methods: `"linear"`, `"inverted_cdf"`, `"averaged_inverted_cdf"`
- methods supported with weights: `"inverted_cdf"`, `"averaged_inverted_cdf"`

Comparison with NumPy:

Those specs match NumPy's `quantile` (for `nan_policy="propagate"`) and `nanquantile` (for `nan_policy="omit"`). Differences are:

- a single function with `nan_policy="propagate"|"omit"` (inspired by scipy) instead of two functions like numpy (`quantile` for `nan_policy="propagate"` and `nanquantile`). Why? Less code/doc duplication, and clearer behavior IMO.
- the `"averaged_inverted_cdf"` method (only the `"inverted_cdf"` method is supported by numpy): we need this in scikit-learn.
- `nan_policy="omit"`: there are a few calls to `nanpercentile` in scikit-learn. Would be easy to implement, following what has been done in scipy.
- `xpx.partition` relies on sorting when not delegating. This could be worked
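The `quantile`/`nanquantile` split referenced above can be seen directly in NumPy (a small sketch of mine):

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])

# np.quantile propagates NaNs (the nan_policy="propagate" analogue) ...
print(np.quantile(x, 0.5))     # nan

# ... while np.nanquantile ignores them (the nan_policy="omit" analogue).
print(np.nanquantile(x, 0.5))  # 2.0
```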
Main difference is the broadcasting/reduction behavior. We aligned on numpy for this one.
Also, SciPy doesn't support weights.
Implementation for the non-weighted case is more or less copy-pasted from SciPy, except that I've only kept support for 3 methods.
Future use of `xpx.quantile` in scikit-learn:

Those specs should be enough to replace `sklearn.utils.stats._weighted_percentile` and to replace most uses of `np.quantile`/`percentile` when rewriting numpy-based functions to support the Array API standard.
sklearn.utils.stats._weighted_percentileand to replace most uses ofnp.quantile/percentilewhen rewriting numpy-based functions to support Array API standard.Note that the implementation for the weighted case differs a bit from sklearn's
_weighted_percentile: it's mostly the same approach (sortweightsbased onaand compute cumulative sum), but they differ in the way edge cases are handled (null weights mostly). I believe my implementation to be easier to read and to get right, and equivalent in terms of performance (dominated by argsort for big inputs).Still TODO:
- `_weighted_percentile`: to see if there is something I missed in my tests
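For readers unfamiliar with the approach described above ("sort `weights` based on `a` and compute cumulative sum"), here is a minimal sketch of mine for the `inverted_cdf` case; the function name is hypothetical and edge cases (zero weights, NaNs) are deliberately ignored:

```python
import numpy as np

def weighted_quantile_inverted_cdf(a, q, weights):
    # Hypothetical sketch: sort the weights by the sample values, build the
    # weighted empirical CDF via a cumulative sum, then invert it at q.
    a = np.asarray(a, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(a)                  # dominant cost for large inputs
    a_sorted = a[order]
    cdf = np.cumsum(w[order]) / np.sum(w)  # weighted empirical CDF
    # "inverted_cdf": smallest sample whose CDF value reaches q
    return a_sorted[np.searchsorted(cdf, q, side='left')]

print(weighted_quantile_inverted_cdf([1.0, 2.0, 3.0], 0.5, [1, 2, 3]))  # 2.0
```

With integer weights this matches the repeated-samples interpretation discussed earlier in the thread.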