ENH make Random*Sampler accept dask array and dataframe #777
base: master
Conversation
pep8speaks commented Nov 5, 2020 (edited)
Hello @glemaitre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-11-08 19:22:54 UTC
This pull request introduces 2 alerts when merging 7aae9d9 into edd7522 - view on LGTM.com. New alerts:
TomAugspurger left a comment
Looks good overall. I think the comment about using `dask.compute(...)` rather than `x.compute(), y.compute()` is the most important.
Other than that, I tried to share some of the difficulties I've run into with Dask-ML, but things look nice overall.
imblearn/dask/_support.py Outdated
```python
_REGISTERED_DASK_CONTAINER = []
try:
    from dask import array, dataframe
```
People can have just `dask[array]` installed (not dataframe), so it's possible to have the `array` import succeed but the `dataframe` import fail. So if you want to support that case, those would need to be in separate try / except blocks.
Maybe you instead want `from dask import is_dask_collection`? That's a bit broader, though (it also covers anything implementing dask's collection interface, like `dask.Bag`, `xarray.DataArray`, and `xarray.Dataset`).
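A minimal sketch of what the split imports could look like, reusing the PR's `_REGISTERED_DASK_CONTAINER` list and an `is_dask_container` helper like the one referenced later in this thread; the exact wiring is an assumption, not the PR's final code:

```python
# Hypothetical sketch: split the optional imports so that a "dask[array]"-only
# installation still registers the array container types.
_REGISTERED_DASK_CONTAINER = []

try:
    from dask import array
    _REGISTERED_DASK_CONTAINER.append(array.Array)
except ImportError:
    pass

try:
    from dask import dataframe
    _REGISTERED_DASK_CONTAINER.extend([dataframe.Series, dataframe.DataFrame])
except ImportError:
    pass


def is_dask_container(obj):
    """Return True if ``obj`` is one of the registered dask containers."""
    return isinstance(obj, tuple(_REGISTERED_DASK_CONTAINER))
```

The alternative mentioned above, `dask.is_dask_collection`, avoids the registration list entirely at the cost of also matching any other dask-collection implementer.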
That seems to be what I wanted :)
```python
if hasattr(y, "unique"):
    labels = np.asarray(y.unique())
else:
    labels = np.unique(y).compute()
```
I've struggled with this check in dask-ml. Depending on where it's called, it's potentially very expensive (you might be loading a ton of data just to check if it's multi-label, and then loading it again to do the training).
Whenever possible, it's helpful to provide an option to skip this check by having the user specify it when creating the estimator, or in a keyword to `fit` (dunno if that applies here).
I thought about it. Do you think that having a context manager outside would make sense:

```python
with set_config(avoid_check=True):
    # some imblearn/scikit-learn/dask code
```

Though, we might get into trouble with issues related to scikit-learn/scikit-learn#18736.
It might just be easier to have an optional class parameter that applies only for dask arrays.
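For illustration, a minimal sketch of that "optional class parameter" idea. The class and the `validate_y` flag are invented names for this sketch, not part of imbalanced-learn's API:

```python
import numpy as np


class RandomUnderSamplerSketch:
    """Toy sampler illustrating an opt-out flag for the expensive label check."""

    def __init__(self, sampling_strategy="auto", validate_y=True):
        self.sampling_strategy = sampling_strategy
        # Users passing large dask collections could set validate_y=False to
        # skip the check and avoid materializing y just for validation.
        self.validate_y = validate_y

    def fit_resample(self, X, y):
        if self.validate_y:
            # Potentially expensive for dask collections: this materializes y.
            labels = np.unique(np.asarray(y))
            if labels.size < 2:
                raise ValueError("y needs at least two classes for resampling")
        # ... the actual resampling logic would follow here
        return X, y
```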
```python
force_all_finite=False,
if is_dask_container(y) and hasattr(y, "to_dask_array"):
    y = y.to_dask_array()
    y.compute_chunk_sizes()
```
In Dask-ML we (@stsievert I think? maybe me?) prefer to have the user do this: https://github.com/dask/dask-ml/blob/7e11ce1505a485104e02d49a3620c8264e63e12e/dask_ml/utils.py#L166-L173. If you're just fitting the one estimator then this is probably equivalent. If you're doing something like a `cross_val_score`, then I think this would end up loading data multiple times just to compute the chunk sizes.
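Roughly the pattern the linked Dask-ML helper encourages, as I read it: make unknown chunk sizes the caller's problem so the cost is paid once, outside the estimator. `check_chunks_known` is an invented name used only for this sketch:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"label": [0, 1, 0, 1, 1]}), npartitions=2)

# Option 1: the user converts with known lengths up front (one pass over the data),
# instead of the estimator calling compute_chunk_sizes() internally.
y = ddf["label"].to_dask_array(lengths=True)


# Option 2: the estimator only checks and raises, never computes.
def check_chunks_known(arr):
    if np.isnan(arr.chunks[0]).any():
        raise ValueError(
            "Unknown chunk sizes; call `compute_chunk_sizes()` or convert with "
            "`to_dask_array(lengths=True)` before fitting."
        )
    return arr
```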
This is something that I was unsure of here. If I recall, the issue was that I could not call `ravel` on the Series, and this was the easiest way to always have an array and convert back to a Series reusing the metadata.
However, if we assume that the checks are too expensive to be done in a distributed setting, we don't need to call the check below and we can directly pass the Series and handle it during the resampling.
So we have fewer safeguards, but at least it is more performant, which is something you probably want in a distributed setting.
imblearn/utils/_validation.py Outdated
```python
if is_dask_container(unique):
    unique, counts = unique.compute(), counts.compute()
```
As written, this will fully execute the task graph of `y` twice: once to compute `unique` and once to compute `counts`.
```diff
 if is_dask_container(unique):
-    unique, counts = unique.compute(), counts.compute()
+    unique, counts = dask.compute(unique, counts)
```
You'll need to `import dask`.
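A small illustration (not the PR's code) of why the single `dask.compute` call is preferable: both results come from one traversal of the shared task graph, so `y` is only read once:

```python
import dask
import dask.array as da

y = da.random.randint(0, 3, size=1_000_000, chunks=100_000)

unique, counts = da.unique(y, return_counts=True)

# One traversal of the graph for both outputs...
unique, counts = dask.compute(unique, counts)

# ...whereas unique.compute() followed by counts.compute() would walk the
# graph (and load the data) twice.
```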
This pull request introduces 5 alerts when merging d4aabf8 into edd7522 - view on LGTM.com. New alerts:
This pull request introduces 5 alerts when merging 58acdf2 into edd7522 - view on LGTM.com. New alerts:
This pull request introduces 2 alerts when merging e54c772 into edd7522 - view on LGTM.com. New alerts:
This pull request introduces 2 alerts when merging 167fc2a into edd7522 - view on LGTM.com. New alerts:
codecov bot commented Nov 8, 2020 (edited)
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #777      +/-   ##
==========================================
+ Coverage   96.55%   98.18%   +1.63%
==========================================
  Files          82       94      +12
  Lines        5140     5900     +760
  Branches        0      515     +515
==========================================
+ Hits         4963     5793     +830
+ Misses        177      100      -77
- Partials        0        7       +7
```

☔ View full report in Codecov by Sentry.
This pull request introduces 2 alerts when merging 20b44c6 into edd7522 - view on LGTM.com. New alerts:
This pull request introduces 2 alerts when merging a6e975b into edd7522 - view on LGTM.com. New alerts:
This pull request introduces 2 alerts when merging 456c3eb into edd7522 - view on LGTM.com. New alerts:
ridhachahed commented Nov 29, 2021
@glemaitre Why didn't you merge this branch with master? Everything seems alright, isn't it?
It is one year old, so I don't recall the details. It was only a POC to see what issues we would have to deal with. It would need further work to go ahead.
ridhachahed commented Nov 29, 2021
I think it would be a pity if it doesn't go in because of this comment. We can't really do much about it except avoid calling this method. Happy to help if there is anything else that needs to be done :)
POC to see if we can make at least the `RandomUnderSampler` and `RandomOverSampler` accept some dask array and dataframe.

Note:
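For context, a hedged sketch of the usage this POC is aiming for; since the PR was never merged, the exact behaviour (for example whether the resampled outputs stay lazy dask collections) is not confirmed by the thread:

```python
import dask.array as da
from imblearn.under_sampling import RandomUnderSampler

# Build imbalanced dask inputs without materializing them in memory.
X = da.random.random((10_000, 4), chunks=1_000)
y = da.random.randint(0, 2, size=10_000, chunks=1_000)

sampler = RandomUnderSampler(random_state=0)
# The goal of the POC: fit_resample accepting dask arrays/dataframes directly.
X_res, y_res = sampler.fit_resample(X, y)
```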