scikit-learn-contrib/imbalanced-learn

[MRG] EHN refactoring of the ratio argument. #413

Merged

glemaitre merged 24 commits into scikit-learn-contrib:master from glemaitre:is/411

May 8, 2018

Conversation

glemaitre (Member) commented Mar 20, 2018 (edited)

Reference Issue

closes #411
closes #406

What does this implement/fix? Explain your changes.

TODO

Implement ratio as a float for binary class + test
remove the dict for cleaning-method + test
deprecate the previous behaviour.
allow passing a list for cleaning methods + test
change 'auto' from 'all' to 'not minority'
add check_sampling_target
with float, check that the ratio will not over-sample or under-sample when it should not.
add documentation check_sampling_target
add an alias to ratio called sampling_target
deprecate check_ratio for check_sampling_target
add documentation for sampling_target
update all docstring
update the example ratio
update API documentation (add check_sampling_target)
update docstring of the datasets
check coverage report
entry in what's new
Rename everything to sampling_strategy
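To make the float item of the list above concrete, here is a minimal sketch of how a float could be turned into a per-class sample count for a binary problem. It is not the code from this PR: the helper name `target_count_from_float` and its `kind` argument are hypothetical, and the under-sampling formula (minority count divided by the majority count after resampling) is an assumption chosen to mirror the over-sampling ratio described in the docstrings under review. The two `ValueError` branches illustrate the "will not over-sample or under-sample when it should not" check.

```python
from collections import Counter


def target_count_from_float(y, ratio, kind):
    """Illustrative only: map a float ratio to the resampled size of one class.

    Assumes a binary target. Returns {class_label: n_samples_after_resampling}.
    """
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("a float ratio only makes sense for binary targets")
    if ratio <= 0:
        raise ValueError("the ratio must be strictly positive")
    (majority, n_majority), (minority, n_minority) = counts.most_common(2)

    if kind == "over-sampling":
        # Assumed definition: ratio = N_minority_after_resampling / N_majority
        n_target = int(n_majority * ratio)
        if n_target < n_minority:
            # An over-sampler must never have to remove samples.
            raise ValueError("this ratio would remove minority samples")
        return {minority: n_target}
    if kind == "under-sampling":
        # Assumed definition: ratio = N_minority / N_majority_after_resampling
        n_target = int(n_minority / ratio)
        if n_target > n_majority:
            # An under-sampler must never have to generate samples.
            raise ValueError("this ratio would add majority samples")
        return {majority: n_target}
    raise ValueError("kind must be 'over-sampling' or 'under-sampling'")


y = [0] * 90 + [1] * 10
print(target_count_from_float(y, 0.5, "over-sampling"))   # {1: 45}
print(target_count_from_float(y, 0.5, "under-sampling"))  # {0: 20}
```

With 90 majority and 10 minority samples, a ratio of 0.05 is rejected in both modes, since it would shrink the minority class when over-sampling and grow the majority class when under-sampling.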

Any other comments?

EHN add ratio as a float and refactor tests

166f183

codecov (bot) commented Mar 20, 2018 (edited)

Codecov Report

Merging #413 into master will decrease coverage by 0.06%.
The diff coverage is 99.43%.

@@            Coverage Diff             @@
##           master     #413      +/-   ##
==========================================
- Coverage   98.77%   98.71%   -0.07%
==========================================
  Files          68       70       +2
  Lines        4014     4188     +174
==========================================
+ Hits         3965     4134     +169
- Misses         49       54       +5

Impacted Files	Coverage Δ
imblearn/ensemble/tests/test_balance_cascade.py	`100% <100%> (ø)`	⬆️
imblearn/ensemble/tests/test_easy_ensemble.py	`100% <100%> (ø)`	⬆️
...rn/under_sampling/prototype_generation/__init__.py	`100% <100%> (ø)`	⬆️
imblearn/tests/test_common.py	`95.45% <100%> (ø)`	⬆️
imblearn/over_sampling/adasyn.py	`98.57% <100%> (+0.06%)`	⬆️
...ampling/prototype_selection/tests/test_nearmiss.py	`100% <100%> (ø)`	⬆️
..._sampling/prototype_selection/tests/test_allknn.py	`100% <100%> (ø)`	⬆️
imblearn/combine/tests/test_smote_tomek.py	`100% <100%> (ø)`	⬆️
...prototype_selection/neighbourhood_cleaning_rule.py	`100% <100%> (ø)`	⬆️
...sampling/prototype_generation/cluster_centroids.py	`100% <100%> (ø)`	⬆️
... and 53 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24f4973...09c5aaa.

glemaitre added 12 commits

March 20, 2018 22:46

FIX add new way of sampling

452926e

Check deprecation warning

dff36e7

TST fix test

8ef9416

FIX depreacte ratio and ratio_

f3fef5c

FIX rename the different functions

c12805c

TST udpate test and deprecating

311b269

FIX use check_sampling_target internally instead of check_ratio

596a475

DOC add check_sampling_target into the API

f5a1254

DOC update all docstring

00fc0bb

DOC update ratio example

fcf8b29

DOC remove ratio occurences

b11e154

EXA add example sampling target

1fe87df

pep8speaks commented Mar 27, 2018 (edited)

Hello @glemaitre! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 27, 2018 at 15:54 Hours UTC

glemaitre added 3 commits

March 27, 2018 17:35

FIX test remove ratio from the test

8157bd7

EXA fix underline

78f91a0

DOC add whatsnew entry

b5c91c4

glemaitre (Member, Author) commented Mar 27, 2018 (edited)

@massich @chkoar This is ready to be reviewed. It is a big one.

glemaitre changed the title from "[WIP] EHN refactoring of the ratio argument." to "[MRG] EHN refactoring of the ratio argument."

Mar 27, 2018

DOC udpate whats new

db3f550

glemaitre (Member, Author) commented Mar 27, 2018

@jorisvandenbossche I would like to have your feedback as well.

You can check the documentation of those three classes which is representative of the full PR:

massich reviewed Mar 27, 2018

doc/under_sampling.rst Outdated

		behaviour.
		The parameter ``sampling_target`` control which sample of the link will be
		removed. For instance, the default (i.e., ``sampling_target='auto'``) will
		remove the sample from the majority class. Both samples from the majority and

massich (Contributor) commented:

I would change "remove the sample from" to "remove samples from". Also, I don't really understand what "control which sample of the link will be removed" means; "link" confuses me.

massich reviewed Mar 27, 2018

examples/plot_sampling_target_usage.py Outdated

		@@ -0,0 +1,241 @@
		"""
		======================================================================
		Usage of the ``sampling_target`` parameter for the different algorithm

massich (Contributor) commented:

How to use the sampling_target parameter (depending on the sampling strategy)

jorisvandenbossche reviewed Mar 27, 2018

jorisvandenbossche left a comment:

Big diff so didn't yet look at everything, but:

- given how many times you repeat the explanation in the docstring, might be worth looking at a way how to share this to avoid repetition
- I am not fully sure about "sampling_target" as keyword name. For the string options, this is an appropriate name, but for the float not really. Possible (although longer) alternatives: sampling_strategy, sampling_protocol

imblearn/over_sampling/random_over_sampler.py Outdated

		minority class after resampling and the number of samples in the
		majority class, respectively.

		.. warning::

jorisvandenbossche commented Mar 27, 2018

If you indent this by two spaces, it is included in the list (which is better, I think).

imblearn/over_sampling/random_over_sampler.py Outdated

		sampling_target : float, str, dict or callable, (default='auto')
		Sampling information to resample the data set.

		- When ``float``, it correspond to the ratio :math:`\\alpha_{os}`

jorisvandenbossche commented Mar 27, 2018

correspond -> corresponds

imblearn/over_sampling/random_over_sampler.py Outdated


		``'minority'``: resample only the minority class;

		``'majority'``: resample only the majority class;

jorisvandenbossche commented Mar 27, 2018

Since this is a RandomOversampler, does 'majority' make any sense?

imblearn/under_sampling/prototype_selection/edited_nearest_neighbours.py Outdated


		``'auto'``: equivalent to ``'not minority'``.

		- When ``list``, the list contains the targeted classes.

jorisvandenbossche commented Mar 27, 2018

This is not clear to me what it does.
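One possible reading of the list option for cleaning methods is that the list simply names the classes the sampler is allowed to remove samples from. The sketch below illustrates that reading only. `targeted_classes` is a hypothetical helper, not the function added in this PR, and the set of string aliases is an assumption, apart from `'auto'` being equivalent to `'not minority'`, which comes from the docstring quoted above.

```python
from collections import Counter


def targeted_classes(y, sampling_strategy="auto"):
    """Illustrative only: resolve a string or list into the classes to clean."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)

    if isinstance(sampling_strategy, (list, tuple)):
        # A list names the targeted classes explicitly.
        unknown = set(sampling_strategy) - set(counts)
        if unknown:
            raise ValueError("classes not present in y: {}".format(sorted(unknown)))
        return sorted(sampling_strategy)

    aliases = {
        "majority": [majority],
        "not minority": [c for c in counts if c != minority],
        "not majority": [c for c in counts if c != majority],
        "all": list(counts),
    }
    # For cleaning methods, the docstring states 'auto' == 'not minority'.
    aliases["auto"] = aliases["not minority"]
    if sampling_strategy not in aliases:
        raise ValueError("unknown sampling strategy: {!r}".format(sampling_strategy))
    return sorted(aliases[sampling_strategy])


y = [0] * 50 + [1] * 30 + [2] * 20
print(targeted_classes(y))          # [0, 1]
print(targeted_classes(y, [0, 2]))  # [0, 2]
```

Under this reading, passing `[0, 2]` would leave class 1 untouched while samples from classes 0 and 2 remain candidates for removal.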

imblearn/under_sampling/prototype_selection/nearmiss.py Outdated

		- If ``dict``, the keys correspond to the targeted classes. The values
		correspond to the desired number of samples.
		- If callable, function taking ``y`` and returns a ``dict``. The keys
		sampling_target : float, str, dict, callable, (default='auto')

jorisvandenbossche commented Mar 27, 2018

indentation

examples/plot_sampling_target_usage.py Outdated

		plot_pie(y)

		###############################################################################
		# Using ``sampling_target`` in resampling algorithm

jorisvandenbossche commented Mar 27, 2018

algorithm -> algorithms

examples/plot_sampling_target_usage.py Outdated


		print('Information of the iris data set after making it'
		' imbalanced using a callable: \n sampling_target={} \n y: {}'
		.format(sampling_target, Counter(y)))

jorisvandenbossche commented Mar 27, 2018

sampling_target is from the previous example

examples/plot_sampling_target_usage.py

		binary_mask = np.bitwise_or(y == 0, y == 2)
		binary_y = y[binary_mask]
		binary_X = X[binary_mask]

jorisvandenbossche commented Mar 27, 2018

Can you show the counter of the data, so that you can afterwards compare the numbers after resampling?

examples/plot_sampling_target_usage.py Outdated

		#
		# ``sampling_target`` can be given as a string which specify the class targeted
		# by the resampling. With under- and over-sampling, the number of samples will
		# be equalized.

jorisvandenbossche commented Mar 27, 2018

emphasize you are no longer using the binary data

examples/plot_sampling_target_usage.py Outdated


		fig, ax = plt.subplots()
		ax.pie(sizes, explode=explode, labels=labels, shadow=True,
		autopct='%1.1f%%')

jorisvandenbossche commented Mar 27, 2018

would it be possible to add both absolute number and percentages?
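One way to show both values is to pass a callable instead of a format string as `autopct`, since matplotlib's `pie` accepts a function that receives the percentage of each wedge. The sketch below only builds that formatter; the name `make_autopct` is hypothetical and this is not the code that ended up in the example.

```python
def make_autopct(sizes):
    """Build a formatter showing the percentage and the absolute count."""
    total = sum(sizes)

    def autopct(pct):
        # Recover the absolute count from the percentage handed over by pie().
        absolute = int(round(pct * total / 100.0))
        return "{:.1f}%\n({:d})".format(pct, absolute)

    return autopct


sizes = [50, 30, 20]
formatter = make_autopct(sizes)
print(formatter(30.0))
```

It would then be used as `ax.pie(sizes, labels=labels, autopct=make_autopct(sizes))` in place of `autopct='%1.1f%%'`.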

glemaitre (Member, Author) commented Mar 28, 2018

Thanks @jorisvandenbossche
I don't like protocol but strategy seems generic enough.

Regarding the repetition, do you have something in mind? If it comes down to injecting the proper docstring, I am not sure how to do that. If you are thinking about a glossary, it could be cool, but the issue is that we would have a docstring that is generic for all over-, under-, and cleaning-samplers.

What I mean is:

class MySampler(...):
    """
    ....
    sampling_target: float, str, dict, callable
        Sampling strategy ....

        - If float, represent the balancing ratio, check
          `glossary <float_ratio>` for more details
    """

---
Glossary:

sampling_target as float : for over-sampling ...; for under-sampling

So the user would somehow need to know what they are using and select the proper explanation, which is something I wanted to avoid given what we had before.

glemaitre mentioned this pull request

Mar 28, 2018

[WIP] ENH: Class Senstive Scaling #416

Open

chkoar (Member) commented Mar 28, 2018

Lack of time here to review this PR but I have two comments to make.

> given how many times you repeat the explanation in the docstring, might be worth looking at a way how to share this to avoid repetition

I had the same idea in #241

> I am not fully sure about "sampling_target" as keyword name. For the string options, this is an appropriate name, but for the float not really. Possible (although longer) alternatives: sampling_strategy, sampling_protocol

I believe that words like strategy and protocol are very nice, even on their own.

glemaitre (Member, Author) commented Mar 28, 2018

> I believe that words like strategy and protocol are very nice, even on their own.

I think that prefixing with sampling does no harm. I can imagine the case of a meta-estimator using an estimator from scikit-learn which uses the `strategy` keyword, and then we are doomed.

> I had the same idea in #241

I still have the concern that it makes it a bit more difficult to contribute at first, but in the end we ensure documentation quality. So I am inclined to admit that I was wrong :)

add docstring substitution

199f320

glemaitre added 6 commits

March 29, 2018 13:02

iter

89e27d9

DOC factorize docstring

7ebfb68

joris comments

e5a4dd3

TST add tests for injection in docstring

4ba215d

go back to old type class for python 2

0bf8f85

Rename and PEP8

09c5aaa

glemaitre (Member, Author) commented Mar 30, 2018

OK, so sampling_strategy kicked in and the docstrings are factorized using the base class.
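For readers unfamiliar with the idea, docstring factorization can be sketched as a small decorator that fills a placeholder in each class docstring from a single shared description. The names `substitute_docstring`, `_SAMPLING_STRATEGY_DOC` and `MySampler` are hypothetical and the shared text is abbreviated; this is not claimed to be the mechanism merged in this PR.

```python
# Shared parameter description, written once (abbreviated for illustration).
_SAMPLING_STRATEGY_DOC = """sampling_strategy : float, str, dict or callable, (default='auto')
        Sampling information to resample the data set."""


def substitute_docstring(**params):
    """Return a decorator that formats the decorated object's docstring."""

    def decorator(obj):
        # __doc__ is None when Python runs with -OO, so guard against it.
        if obj.__doc__ is not None:
            obj.__doc__ = obj.__doc__.format(**params)
        return obj

    return decorator


@substitute_docstring(sampling_strategy=_SAMPLING_STRATEGY_DOC)
class MySampler(object):
    """Toy sampler used to demonstrate docstring substitution.

    Parameters
    ----------
    {sampling_strategy}
    """


print(MySampler.__doc__)
```

A design caveat of this approach is that `str.format` treats every brace as a placeholder, so docstrings containing literal `{` or `}` would need them doubled.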

@massich @jorisvandenbossche @chkoar if you have any other remarks regarding the API, it would be nice.

Regarding the examples, I want to make them better in a next PR.

chkoar (Member) commented Apr 2, 2018 (edited)

> I had the same idea in #241

> I still have the concern that it makes it a bit more difficult to contribute at first but at the end we ensure documentation quality. So I am incline to admit that I was wrong :)

Sorry @glemaitre. My bad. I was never referring to the class docstrings. Maybe I hadn't said that explicitly. I actually meant the fit and the _sample methods of the derived classes. Seeing the class docstrings you committed, I understand why you had concerns. :D

glemaitre (Member, Author) commented Apr 2, 2018 via email

Actually, good point. We can do that in another PR once the substitution class is merged.

glemaitre merged commit 71ff0f6 into scikit-learn-contrib:master

May 8, 2018

5 participants