Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

[MRG] EHN refactoring of the ratio argument.#413

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
glemaitre merged 24 commits intoscikit-learn-contrib:masterfromglemaitre:is/411
May 8, 2018

Conversation

glemaitre
Copy link
Member

@glemaitreglemaitre commentedMar 20, 2018
edited
Loading

Reference Issue

closes#411
closes#406

What does this implement/fix? Explain your changes.

TODO

  • Implement ratio as a float for binary class + test
  • remove the dict for cleaning-method + test
  • deprecate the previous behaviour.
  • enable to pass a list for cleaning method + test
  • change 'auto' from 'all' to 'not minority'
  • add check_sampling_target
  • withfloat, check that the ratio will not over-sample or under-sample when it should not.
  • add documentation check_sampling_target
  • add an alias to ratio called sampling_target
  • deprecate check_ratio for check_sampling_target
  • add documentation for sampling_target
  • update all docstring
  • update the example ratio
  • update API documentation (add check_sampling_target)
  • update docstring of the datasets
  • check coverage report
  • entry in what's new
  • Rename everything tosampling_strategy

Any other comments?

@codecov
Copy link

codecovbot commentedMar 20, 2018
edited
Loading

Codecov Report

Merging#413 intomaster willdecrease coverage by0.06%.
The diff coverage is99.43%.

Impacted file tree graph

@@            Coverage Diff             @@##           master     #413      +/-   ##==========================================- Coverage   98.77%   98.71%   -0.07%==========================================  Files          68       70       +2       Lines        4014     4188     +174     ==========================================+ Hits         3965     4134     +169- Misses         49       54       +5
Impacted FilesCoverage Δ
imblearn/ensemble/tests/test_balance_cascade.py100% <100%> (ø)⬆️
imblearn/ensemble/tests/test_easy_ensemble.py100% <100%> (ø)⬆️
...rn/under_sampling/prototype_generation/__init__.py100% <100%> (ø)⬆️
imblearn/tests/test_common.py95.45% <100%> (ø)⬆️
imblearn/over_sampling/adasyn.py98.57% <100%> (+0.06%)⬆️
...ampling/prototype_selection/tests/test_nearmiss.py100% <100%> (ø)⬆️
..._sampling/prototype_selection/tests/test_allknn.py100% <100%> (ø)⬆️
imblearn/combine/tests/test_smote_tomek.py100% <100%> (ø)⬆️
...prototype_selection/neighbourhood_cleaning_rule.py100% <100%> (ø)⬆️
...sampling/prototype_generation/cluster_centroids.py100% <100%> (ø)⬆️
... and53 more

Continue to review full report at Codecov.

Legend -Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing data
Powered byCodecov. Last update24f4973...09c5aaa. Read thecomment docs.

@pep8speaks
Copy link

pep8speaks commentedMar 27, 2018
edited
Loading

Hello@glemaitre! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 27, 2018 at 15:54 Hours UTC

@glemaitre
Copy link
MemberAuthor

glemaitre commentedMar 27, 2018
edited
Loading

@massich@chkoar This is ready to be reviewed. It is a big one.

@glemaitreglemaitre changed the title[WIP] EHN refactoring of the ratio argument.[MRG] EHN refactoring of the ratio argument.Mar 27, 2018
@glemaitre
Copy link
MemberAuthor

behaviour.
The parameter ``sampling_target`` control which sample of the link will be
removed. For instance, the default (i.e., ``sampling_target='auto'``) will
remove the sample from the majority class. Both samples from the majority and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I would changeremove the sample from forremove samples from. + I don't really understand whatcontrol which sample of the link will be removed means.link confuses me

@@ -0,0 +1,241 @@
"""
======================================================================
Usage of the ``sampling_target`` parameter for the different algorithm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Howto use thesampling_target parameter (depending on the sampling strategy)

Copy link

@jorisvandenbosschejorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Big diff so didn't yet look at everything, but:

  • given how many times you repeat the explanation in the docstring, might be worth looking at a way how to share this to avoid repetition

  • I am not fully sure about "sampling_target" as keyword name. For the string options, this is an appropriate name, but for the float not really. Possible (although longer) alternatives:sampling_strategy,sampling_protocol

minority class after resampling and the number of samples in the
majority class, respectively.

.. warning::

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

if you indent this two spaced, then it is included in the list (which is better I think)

sampling_target : float, str, dict or callable, (default='auto')
Sampling information to resample the data set.

- When ``float``, it correspond to the ratio :math:`\\alpha_{os}`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

correspond -> corresponds


``'minority'``: resample only the minority class;

``'majority'``: resample only the majority class;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Since this is a RandomOversampler, does 'majority' make any sense?


``'auto'``: equivalent to ``'not minority'``.

- When ``list``, the list contains the targeted classes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is not clear to me what it does.

- If ``dict``, the keys correspond to the targeted classes. The values
correspond to the desired number of samples.
- If callable, function taking ``y`` and returns a ``dict``. The keys
sampling_target : float, str, dict, callable, (default='auto')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

identation

plot_pie(y)

###############################################################################
# Using ``sampling_target`` in resampling algorithm

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

algorithm -> algorithms


print('Information of the iris data set after making it'
' imbalanced using a callable: \n sampling_target={} \n y: {}'
.format(sampling_target, Counter(y)))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

sampling_target is from the previous example

binary_mask = np.bitwise_or(y == 0, y == 2)
binary_y = y[binary_mask]
binary_X = X[binary_mask]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

can you show the counter of the data? So you can afterwards compare the number after resampling

#
# ``sampling_target`` can be given as a string which specify the class targeted
# by the resampling. With under- and over-sampling, the number of samples will
# be equalized.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

emphasize you are no longer using the binary data


fig, ax = plt.subplots()
ax.pie(sizes, explode=explode, labels=labels, shadow=True,
autopct='%1.1f%%')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

would it be possible to add both absolute number and percentages?

@glemaitre
Copy link
MemberAuthor

Thanks@jorisvandenbossche
I don't like protocol but strategy seems generic enough.

Regarding the repetition, do you have something in mind? If it comes back to inject the proper docstring, I am not sure how to do that. If you think about a glossary, it could be cool but the issue is that we will have a docstring which will be generic for all over-, under-, cleaning-samplers.

What I mean is:

class MySampler(...):"""....    sampling_target: float, str, dict, callable         Sampling strategy ....         - If float, represent the balancing ratio, check `glossary <float_ratio>` for more details""""---Glossary:sampling_target as float : for over-sampling ...; for under-sampling

So somehow the user needs to know what he is using and select the proper explanation which was something that I wanted to avoid from the previous things that we had.

@chkoar
Copy link
Member

Lack of time here to review this PR but I have two comments to make.

given how many times you repeat the explanation in the docstring, might be worth looking at a way how to share this to avoid repetition

I had the same idea in#241

I am not fully sure about "sampling_target" as keyword name. For the string options, this is an appropriate name, but for the float not really. Possible (although longer) alternatives: sampling_strategy, sampling_protocol

I believe that words like strategy and protocol are very nice, even on their own.

@glemaitre
Copy link
MemberAuthor

I believe that words like strategy and protocol are very nice, even on their own.

I think that pre-adding sampling is not harming. I can imagine the case of a meta-estimator using an estimator from scikit-learn which use thestrategy keyword` and then we are doomed.

I had the same idea in#241

I still have the concern that it makes it a bit more difficult to contribute at first but at the end we ensure documentation quality. So I am incline to admit that I was wrong :)

@glemaitre
Copy link
MemberAuthor

Ok sosampling_strategy kicked in and the docstring are factorize using the base class.

@massich@jorisvandenbossche@chkoar if you have any other remarks regarding the API, it would be nice.

Regarding the examples, I want to make them better in a next PR.

@chkoar
Copy link
Member

chkoar commentedApr 2, 2018
edited
Loading

I had the same idea in#241

I still have the concern that it makes it a bit more difficult to contribute at first but at the end we ensure documentation quality. So I am incline to admit that I was wrong :)

I still have the concern that it makes it a bit more difficult to contribute at first but at the end we ensure documentation quality. So I am incline to admit that I was wrong :)

Sorry@glemaitre. My bad. I was never referred to the class docstrings. Maybe I haven't said that explicitly. I actually said that for thefit and the_sample methods for the derived classes. As I am seeing the class docstrings you committed I understood why you had concerns. :D

@glemaitre
Copy link
MemberAuthor

glemaitre commentedApr 2, 2018 via email

Ac‎tually good point. We can do that in another PR when subsitution class will be merged. 

@glemaitreglemaitre merged commit71ff0f6 intoscikit-learn-contrib:masterMay 8, 2018
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@jorisvandenbosschejorisvandenbosschejorisvandenbossche left review comments

@massichmassichmassich left review comments

Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

[RFC] Refactorratio Improve the ratio explanation in the docstring of the samplers
5 participants
@glemaitre@pep8speaks@chkoar@jorisvandenbossche@massich

[8]ページ先頭

©2009-2025 Movatter.jp