Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Refactor weighted percentile functions to avoid redundant sorting#30945

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation

arjun-sundar3363
Copy link
Contributor

REF: Integrate symmetrization in _weighted_percentile to avoid double sorting

Description

This pull request refactors the computation of weighted percentiles by integrating symmetrization directly into the_weighted_percentile function. With this change, we avoid sorting the input array twice when computing the averaged weighted percentile. The following changes have been made:

  • Added asymmetrize parameter to_weighted_percentile that, when enabled, computes the averaged weighted percentile using both positive and negative arrays.
  • Updated_averaged_weighted_percentile to leverage the new symmetrization functionality.
  • Preserved the original functionality and all existing comments.
  • Ensured that the code complies with the scikit-learn contributing guidelines and passes all relevant tests.

This refactor improves efficiency without altering the external API or behavior.

Please review and let me know if any adjustments are required.

@github-actionsGitHub Actions
Copy link

github-actionsbot commentedMar 5, 2025
edited
Loading

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit:f45abdb. Link to the linter CI:here

@arjun-sundar3363arjun-sundar3363 marked this pull request as draftMarch 5, 2025 10:06
@betatim
Copy link
Member

betatim commentedMar 5, 2025
edited
Loading

Thanks for opening this PR! It is still a draft, so I've not looked at the diff too closely yet. I wanted to raise one thing already though: how did you decide to do this work? The reason I am asking is that I think the new code is quite a bit more complex than what it replaced, so having some benchmarks or motivation for why making this change is worth it would be good to have.

There is a TODO comment in the code that suggests making this change, but I think we need to remember/figure out again why that comment was put there. IMHO the comment itself isn't a good enough motivation

virchan reacted with thumbs up emoji

@arjun-sundar3363
Copy link
ContributorAuthor

Hi@betatim

Thanks for your feedback! I'm new to open source and eager to contribute. I found this TODO and thought it would make a difference, but I see your point about complexity. I’d love any advice—not just on this PR but on open source in general. Looking forward to your thoughts!

ogrisel reacted with thumbs up emoji

@ogrisel
Copy link
Member

As the author of the TODO, I confirm that this refactoring is needed; otherwise we won't be able to generalize the use of this function throughout the code base whenever it's needed to fix sample_weight related problems (see#16298) without introducing significant performance regressions.

@ogrisel
Copy link
Member

Note that there is also a concurrent PR that refactors this function for a different purpose:

Not sure which will be ready to merge first, but both are valuable and once one is merged, the other will need to be adapted accordingly.

@arjun-sundar3363arjun-sundar3363 marked this pull request as ready for reviewApril 4, 2025 09:48
@ogrisel
Copy link
Member

ogrisel commentedApr 4, 2025
edited
Loading

@Nujra40 there are a bunch of failing tests. Would you mind trying to fix the code to get them to pass before we start the review? Unless you need help to solve some of them, if so please ask in the comments:

FAILED tests/test_common.py::test_estimators[KBinsDiscretizer(quantile_method='averaged_inverted_cdf')-check_sample_weights_shape] - IndexError: index -1 is out of bounds for axis 0 with size 0FAILED tests/test_common.py::test_estimators[KBinsDiscretizer(quantile_method='averaged_inverted_cdf')-check_sample_weights_not_overwritten] - IndexError: index -1 is out of bounds for axis 0 with size 0FAILED tests/test_common.py::test_estimators[KBinsDiscretizer(quantile_method='averaged_inverted_cdf',subsample=None)-check_sample_weight_equivalence_on_dense_data] - AssertionError: FAILED preprocessing/tests/test_discretization.py::test_fit_transform[quantile-averaged_inverted_cdf-expected5-sample_weight5] - AssertionError: FAILED preprocessing/tests/test_discretization.py::test_fit_transform[quantile-averaged_inverted_cdf-expected6-sample_weight6] - AssertionError: FAILED preprocessing/tests/test_discretization.py::test_fit_transform[quantile-averaged_inverted_cdf-expected7-sample_weight7] - AssertionError: FAILED preprocessing/tests/test_discretization.py::test_fit_transform_n_bins_array[quantile-averaged_inverted_cdf-expected4-sample_weight4] - AssertionError: FAILED preprocessing/tests/test_discretization.py::test_fit_transform_n_bins_array[quantile-averaged_inverted_cdf-expected5-sample_weight5] - AssertionError: FAILED preprocessing/tests/test_discretization.py::test_fit_transform_n_bins_array[quantile-averaged_inverted_cdf-expected6-sample_weight6] - AssertionError: FAILED preprocessing/tests/test_discretization.py::test_kbinsdiscretizer_effect_sample_weight - AssertionError: FAILED utils/tests/test_stats.py::test_averaged_weighted_median - AssertionErrorFAILED utils/tests/test_stats.py::test_averaged_weighted_percentile - AssertionErrorFAILED utils/tests/test_stats.py::test_averaged_and_weighted_percentile - AssertionError

See the continuous integration logs ("failing checks") for details.

Furthermore,@lucyleeow's#29431 PR has just been merged inmain so this PR's branch needs to be updated by merging the currentmain branch into it and solve the conflicts before proceeding further. Please let us know if you need help.

@arjun-sundar3363
Copy link
ContributorAuthor

Hi@ogrisel I will look into it and get back to you if required.

@arjun-sundar3363arjun-sundar3363force-pushed therefactor/weighted-percentile-symmetrization branch from7b943ae to00d3ef9CompareApril 5, 2025 17:08
@arjun-sundar3363
Copy link
ContributorAuthor

Hi@ogrisel Need some help to figure out whats wrong. Is the tests not complete for my added parts of the code?

@lucyleeow
Copy link
Member

@@ -41,13 +43,17 @@ def _weighted_percentile(array, sample_weight, percentile_rank=50, xp=None):
xp : array_namespace, default=None
The standard-compatible namespace for `array`. Default: infer.

symmetrize : bool, default=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Nit about this name - could we make it more consistent with how it's named elsewhere? I know this function is private and we ultimately would like to use the scipy quantile so we don't have to maintain our own, but it would still be nice if it was clearer what this method was equivalent to.

Ithink this quantile method is called "averaged_inverted_cdf" in numpy/scipy. And we were calling it_averaged_weighted_percentile. I can understand "symmetrize" term but what about "average" or "average_inverted" etc ?

return result[0] if n_dim == 1 else result


# TODO: refactor to do the symmetrisation inside _weighted_percentile to avoid
# sorting the input array twice.
def _averaged_weighted_percentile(array, sample_weight, percentile_rank=50, xp=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Why not just replace instances of_averaged_weighted_percentile with_weighted_percentile(symmetrize=True) ?

@lucyleeow
Copy link
Member

@arjun-sundar3363 are you still interested in working on this? Thanks

Copy link
Member

@lucyleeowlucyleeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Not sure we need to delete all the comments here?

)
sorted_weights_neg[sorted_nan_mask_neg] = 0

weight_cdf_neg = xp.cumulative_sum(sorted_weights_neg.T, axis=1)
Copy link
Member

@lucyleeowlucyleeowJul 11, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We may not need to calculate the negative/reverse array cumulative sum again. We can probably rely on symmetry e.g., the value ofreverse_cum_sum[i] would be the same as the total sum minus the forward cumulative sum up until that point i.e.reverse_cum_sum[i] = total_sum - forward_cum_sum[n-1-i]

We can probably also work out the index of100-percentile_rank as well.

ifreverse_cum_sum[i_rev] = (100-percentile_rank)*total_sum ; then
total_sum - forward_cum_sum[n-1-i_rev] = (100-percentile_rank)*total_sum ; rearrange
forward_cum_sum[n-1-i_rev] = percentile_rank*total_sum

theni_fwd = n-1-i_rev soi_rev = n-1-i_fwd.

Since we are wanting the left of the bracket (i_fwd,i_fwd+1), the left bracket in reverse isn-2-i_fwd

@lucyleeow
Copy link
Member

closing in favour of#31775

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@lucyleeowlucyleeowlucyleeow left review comments

Assignees
No one assigned
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

4 participants
@arjun-sundar3363@betatim@ogrisel@lucyleeow

[8]ページ先頭

©2009-2025 Movatter.jp