[WIP] ENH: Resample additional arrays apart from X and y #463

Open

glemaitre wants to merge 16 commits into scikit-learn-contrib:master from glemaitre:is/sample_weight_sampling

Conversation

@glemaitre (Member) commented Aug 27, 2018 (edited)
Implements the last point of #462 and should be merged after it.
Partially addresses #460.

@glemaitre changed the title from "EHN: resample additional arrays apart from X and y" to "[WIP] EHN: resample additional arrays apart from X and y" on Aug 27, 2018
@pep8speaks commented Aug 27, 2018 (edited)

Hello @glemaitre! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 29, 2018 at 09:59 Hours UTC

@glemaitre (Member, Author):

Sampling extra arrays would be easy in the case of prototype selection:
one can just pick the weights of the selected samples.

However, what would be a good and meaningful default when new sample weights need to be created?

Right now, I created an ``*arrays`` sequence, but we might be interested in limiting it to ``sample_weight``, since creating new instances only makes sense if we know what we are creating. Up-sampling arrays that we don't know anything about could be weird.
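A minimal sketch of the prototype-selection case described above (hypothetical variable names, not the PR's actual implementation): resampling an extra array such as ``sample_weight`` reduces to indexing it with the indices of the samples the sampler kept.

```python
import numpy as np

# Hypothetical sketch: for prototype selection (under-sampling), extra
# sample-aligned arrays are resampled by indexing them with the indices
# of the samples the sampler decided to keep.
rng = np.random.RandomState(0)
X = rng.randn(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])
sample_weight = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])

# indices kept by the sampler (placeholder values)
selected_idx = np.array([0, 3, 4, 5])

X_res = X[selected_idx]
y_res = y[selected_idx]
sample_weight_res = sample_weight[selected_idx]
```

This covers selection-based samplers; the open question in the comment above is what weight to assign to newly *generated* samples (e.g. SMOTE), where no such index exists.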

@glemaitre (Member, Author):

Oops, I forgot to ping @jnothman in my last comment.

codecov bot commented Aug 29, 2018 (edited)

Codecov Report

Merging #463 into master will decrease coverage by 0.35%.
The diff coverage is 94.97%.


@@            Coverage Diff             @@
##           master     #463      +/-   ##
==========================================
- Coverage   98.92%   98.57%   -0.36%
==========================================
  Files          85       75      -10
  Lines        5324     4633     -691
==========================================
- Hits         5267     4567     -700
- Misses         57       66       +9
Impacted Files | Coverage Δ
imblearn/pipeline.py | 97.07% <ø> (+2.1%) ⬆️
imblearn/utils/estimator_checks.py | 96.62% <100%> (-0.46%) ⬇️
imblearn/ensemble/_balance_cascade.py | 100% <100%> (ø) ⬆️
...ling/_prototype_selection/_random_under_sampler.py | 100% <100%> (ø) ⬆️
...nder_sampling/_prototype_selection/_tomek_links.py | 100% <100%> (ø) ⬆️
...rototype_selection/_condensed_nearest_neighbour.py | 100% <100%> (ø) ⬆️
imblearn/over_sampling/_random_over_sampler.py | 100% <100%> (ø) ⬆️
imblearn/combine/_smote_tomek.py | 100% <100%> (ø) ⬆️
...rototype_selection/_neighbourhood_cleaning_rule.py | 100% <100%> (ø) ⬆️
imblearn/combine/_smote_enn.py | 100% <100%> (ø) ⬆️
... and 73 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ffdde80...7a8fad0.

@jnothman (Member) left a comment:

Btw, I don't think returning non-X,y will work with the current handling of ``Pipeline.fit``'s kwargs. We really need sample prop routing to handle that case. Currently, the handling would be unambiguous if one of:

  • the resampler is the last step, in which case we return any additional sample props like weights;
  • the resampler is the second-last step, and there is no fit param called ``last_step_name__sample_weight``, in which case we pass all sample props into the last step's ``fit``, I think.

Otherwise, it's unclear where to pass the returned ``sample_weight``, given that ``**fit_params`` intends to prescribe this at the time ``Pipeline.fit`` is called.
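The routing problem can be shown with a toy sketch (pure NumPy, hypothetical stand-in for a real resampler, not imblearn code): weights supplied up front via ``**fit_params`` are sized for the original data, while the steps after a resampler see a different number of samples.

```python
import numpy as np

# **fit_params fixes sample_weight when Pipeline.fit is called, but a
# resampler changes n_samples, so a pre-supplied weight vector no
# longer aligns with what reaches the final estimator.
y = np.array([0, 0, 0, 1])
w_at_fit_time = np.ones(len(y))  # 4 weights supplied via **fit_params

# a naive over-sampler: duplicate minority samples until classes balance
minority_idx = np.flatnonzero(y == 1)
n_extra = (y == 0).sum() - len(minority_idx)
resampled_idx = np.concatenate([np.arange(len(y)),
                                np.repeat(minority_idx, n_extra)])
y_res = y[resampled_idx]  # 6 samples after resampling

# len(w_at_fit_time) == 4 but len(y_res) == 6: the fit-time weights
# cannot be handed to the last step as-is.
```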

The corresponding label of `X_resampled`.

sample_weight_resampled : ndarray, shape (n_samples_new,)
@jnothman (Member):

I would rather have a dict of non-X,y returned. (Optionally? In scikit-learn I would rather this be mandatory so we don't need to handle both cases.)

``sample_weight`` was not ``None``.

idx_resampled : ndarray, shape (n_samples_new,)
Indices of the selected features. This output is optional and only
@jnothman (Member):

Do you mean the selected samples?

@glemaitre (Member, Author):

yes

Resampled sample weights. This output is returned only if
``sample_weight`` was not ``None``.

idx_resampled : ndarray, shape (n_samples_new,)
@jnothman (Member):

Could you explain why this should be returned from ``fit_resample``, rather than stored as an attribute?

@glemaitre (Member, Author):

I think it was part of the original design (before it was in scikit-learn). But actually it would be better to keep it as an attribute with the single ``fit_resample``.

@glemaitre (Member, Author):

> Otherwise, it's unclear where to pass the returned sample_weight given that **fit_params intends to prescribe this at the time Pipeline.fit is called.

If I understand correctly, and from what I could see, ``Pipeline`` does not support ``sample_weight`` right now. But in the meantime, do you recommend adding a ``fit_resample(X, y, **sample_props)`` signature and returning a dict ``sample_props``?
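The proposed signature can be sketched as follows (a hypothetical toy for a selection-based sampler; ``selected_idx`` is a placeholder, not the PR's actual code): extra sample-aligned arrays come in as keyword sample props and go out as a dict.

```python
import numpy as np

# Hypothetical sketch of fit_resample(X, y, **sample_props) returning
# the resampled sample props as a dict, for a selection-based sampler.
def fit_resample(X, y, **sample_props):
    selected_idx = np.array([0, 2])  # placeholder sample selection
    X_res, y_res = X[selected_idx], y[selected_idx]
    props_res = {name: np.asarray(arr)[selected_idx]
                 for name, arr in sample_props.items()}
    return X_res, y_res, props_res

X = np.arange(8).reshape(4, 2)
y = np.array([0, 0, 1, 1])
X_res, y_res, props = fit_resample(
    X, y, sample_weight=np.array([0.1, 0.2, 0.3, 0.4]))
```

Returning a dict keeps the X,y part of the contract fixed while letting any number of sample-aligned arrays travel with the data, which is the "mandatory dict" variant jnothman suggests above.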

@jnothman (Member):

You're right, ``Pipeline`` does not really support ``sample_weight`` now... I think supporting returning it from ``Pipeline.fit_resample`` makes sense.

@glemaitre force-pushed the master branch 2 times, most recently from bbf2b12 to 513203c on September 7, 2018 13:26
@chkoar changed the title from "[WIP] EHN: resample additional arrays apart from X and y" to "[WIP] ENH: Resample additional arrays apart from X and y" on Jun 12, 2019
@glemaitre force-pushed the master branch 2 times, most recently from f1bc189 to 8f87307 on June 28, 2019 12:32
@chkoar (Member):

@glemaitre #460 is closed but #457 is still open and probably relevant. Could we close this PR in favor of a new one in the future? It is two years old.

Reviewers: @jnothman left review comments

4 participants: @glemaitre, @pep8speaks, @jnothman, @chkoar
