
ENH make Random*Sampler accept dask array and dataframe #777


Open

glemaitre wants to merge 32 commits into scikit-learn-contrib:master from glemaitre:dask_base_tools

Conversation

@glemaitre (Member) commented Nov 5, 2020 (edited)

POC to see if we can make at least RandomUnderSampler and RandomOverSampler accept dask arrays and dataframes.
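
For illustration, the kind of call this POC is aiming to support; a sketch assuming dask[dataframe] is installed, with made-up column names (whether the resampled output stays a dask collection is part of what the POC explores):

import dask.dataframe as dd
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

# A small, imbalanced in-memory frame wrapped as a dask dataframe.
pdf = pd.DataFrame({"feature": range(100), "target": [0] * 90 + [1] * 10})
ddf = dd.from_pandas(pdf, npartitions=4)

X, y = ddf[["feature"]], ddf["target"]
sampler = RandomUnderSampler(random_state=0)
X_res, y_res = sampler.fit_resample(X, y)  # the goal: accept dask input directly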

Note:

@pep8speaks commented Nov 5, 2020 (edited)

Hello @glemaitre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 22:64:W504 line break after binary operator
Line 28:71:W504 line break after binary operator

Line 9:1:F401 'numpy as np' imported but unused
Line 59:33:W504 line break after binary operator
Line 95:33:W504 line break after binary operator

Line 5:1:E402 module level import not at top of file
Line 7:1:E402 module level import not at top of file
Line 8:1:E402 module level import not at top of file

Line 622:44:W504 line break after binary operator

Line 11:1:F401 'sklearn.utils._testing.assert_array_equal' imported but unused

Line 46:3:E121 continuation line under-indented for hanging indent

Line 65:21:W503 line break before binary operator

Comment last updated at 2020-11-08 19:22:54 UTC

@glemaitre changed the title from "ENH make RandomUnderSampler accept dask array and dataframe" to "ENH make Random*Sampler accept dask array and dataframe" on Nov 5, 2020
@lgtm-com (bot) commented Nov 5, 2020

This pull request introduces 2 alerts when merging 7aae9d9 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Unused local variable

@TomAugspurger left a comment

Looks good overall. I think the comment about dask.compute(...) rather than x.compute(), y.compute() is the most important.

Other than that, I tried to share some of the difficulties I've run into with Dask-ML, but things look nice overall.

_REGISTERED_DASK_CONTAINER = []

try:
    from dask import array, dataframe


People can have just dask[array] installed (not dataframe), so it's possible to have the array import succeed but the dataframe import fail. So if you want to support that case, those would need to be in separate try / except blocks.

Maybe you instead want from dask import is_dask_collection? That's a bit broader though (it also covers anything implementing dask's collection interface like dask.Bag, xarray.DataArray and xarray.Dataset).
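
For illustration, a sketch of the separate try/except registration described above; it reuses the _REGISTERED_DASK_CONTAINER registry and the is_dask_container helper name that appear in this PR, while the exact classes registered are an assumption:

_REGISTERED_DASK_CONTAINER = []

try:
    from dask import array
    _REGISTERED_DASK_CONTAINER.append(array.Array)
except ImportError:
    pass

try:
    from dask import dataframe
    _REGISTERED_DASK_CONTAINER.extend([dataframe.Series, dataframe.DataFrame])
except ImportError:
    # dask[dataframe] is an optional extra; a dask[array]-only install lands here
    pass


def is_dask_container(obj):
    # True if obj is one of the dask containers that were importable
    return isinstance(obj, tuple(_REGISTERED_DASK_CONTAINER))

The broader alternative would be dask.is_dask_collection(obj), with the caveat noted above that it also matches dask.Bag and xarray objects.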

glemaitre (Member, Author)

That seems to be what I wanted :)

Comment on lines +12 to +15
if hasattr(y, "unique"):
    labels = np.asarray(y.unique())
else:
    labels = np.unique(y).compute()


I've struggled with this check in dask-ml. Depending on where it's called, it's potentially very expensive (you might be loading a ton of data just to check if it's multi-label, and then loading it again to do the training).

Whenever possible, it's helpful to provide an option to skip this check by having the user specify it when creating the estimator, or in a keyword to fit (dunno if that applies here).

glemaitre (Member, Author)

I thought about it. Do you think that having a context manager outside would make sense:

with set_config(avoid_check=True):
    # some imblearn/scikit-learn/dask code

Though, we might get into trouble with issues related to scikit-learn/scikit-learn#18736.

It might just be easier to have an optional class parameter that applies only for dask arrays.
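
For illustration, a minimal sketch of what such an opt-out could look like; the class name SamplerSketch and the validate parameter are hypothetical and not part of this PR:

import numpy as np


class SamplerSketch:
    # Illustrative only: where an opt-out for expensive target checks could live.

    def __init__(self, validate=True):
        self.validate = validate

    def fit_resample(self, X, y):
        if self.validate:
            # cheap for in-memory arrays, a full pass over the data for dask input
            classes = np.unique(y)
        else:
            # skip the check entirely, e.g. for dask-backed targets
            classes = None
        # ... actual resampling would happen here ...
        return X, y, classes


X, y = np.arange(10).reshape(5, 2), np.array([0, 0, 1, 1, 1])
SamplerSketch().fit_resample(X, y)                 # runs the check
SamplerSketch(validate=False).fit_resample(X, y)   # opts out for dask inputs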

force_all_finite=False,
if is_dask_container(y) and hasattr(y, "to_dask_array"):
    y = y.to_dask_array()
    y.compute_chunk_sizes()


In Dask-ML we (@stsievert I think? maybe me?) prefer to have the user do this: https://github.com/dask/dask-ml/blob/7e11ce1505a485104e02d49a3620c8264e63e12e/dask_ml/utils.py#L166-L173. If you're just fitting the one estimator then this is probably equivalent. If you're doing something like a cross_val_score, then I think this would end up loading data multiple times just to compute the chunk sizes.
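
For reference, the user-side pattern linked above looks roughly like this; a sketch assuming dask[dataframe] is installed, with made-up column names:

import dask.dataframe as dd
import pandas as pd

# A small in-memory frame turned into a dask collection for illustration.
pdf = pd.DataFrame({"a": range(10), "label": [0] * 8 + [1] * 2})
ddf = dd.from_pandas(pdf, npartitions=2)

# lengths=True computes the chunk sizes once, up front, on the user side.
X = ddf[["a"]].to_dask_array(lengths=True)
y = ddf["label"].to_dask_array(lengths=True)
# X and y now have known chunk sizes, so an estimator does not need to call
# compute_chunk_sizes() again on every fit.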

glemaitre (Member, Author)

This is something that I was unsure of here. If I recall, the issue was that I could not call ravel on the Series, so this was the easiest way to always have an array and then convert back to a Series reusing the metadata.

However, if we assume that the checks are too expensive to be done in a distributed setting, we don't need to call the check below and we can directly pass the Series and handle it during the resampling.

So we have fewer safeguards, but at least it is more performant, which is something you probably want in a distributed setting.

Comment on lines 129 to 130
if is_dask_container(unique):
    unique, counts = unique.compute(), counts.compute()


As written, this will fully execute the task graph of y twice: once to compute unique and once to compute counts.

Suggested change

- if is_dask_container(unique):
-     unique, counts = unique.compute(), counts.compute()
+ if is_dask_container(unique):
+     unique, counts = dask.compute(unique, counts)

You'll need to import dask.

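To make the difference concrete, a small standalone sketch (assuming dask[array] is installed) that materializes both aggregates in a single pass with dask.compute:

import dask
import dask.array as da
import numpy as np

y = da.from_array(np.array([0, 0, 0, 1, 1, 2]), chunks=2)
unique, counts = da.unique(y, return_counts=True)

# Two separate .compute() calls would walk the task graph of y twice:
#   unique.compute(); counts.compute()
# A single dask.compute shares the common part of the graph and
# evaluates both results in one pass:
unique, counts = dask.compute(unique, counts)
print(unique, counts)  # [0 1 2] [3 2 1]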
@lgtm-com (bot) commented Nov 7, 2020

This pull request introduces 5 alerts when merging d4aabf8 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure
  • 2 for Unused local variable
  • 1 for Unused import

@lgtm-com (bot) commented Nov 7, 2020

This pull request introduces 5 alerts when merging 58acdf2 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure
  • 2 for Unused local variable
  • 1 for Unused import

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging e54c772 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging 167fc2a into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@codecov (bot) commented Nov 8, 2020 (edited)

Codecov Report

Attention: Patch coverage is 92.91785% with 25 lines in your changes missing coverage. Please review.

Project coverage is 98.18%. Comparing base (2a0376e) to head (456c3eb).
Report is 184 commits behind head on master.

Files with missing lines            Patch %    Lines
imblearn/dask/utils.py              71.42%     8 Missing and 4 partials ⚠️
imblearn/utils/wrapper.py           84.21%     6 Missing ⚠️
imblearn/utils/_validation.py       95.00%     2 Missing and 2 partials ⚠️
imblearn/base.py                    88.88%     3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #777      +/-   ##
==========================================
+ Coverage   96.55%   98.18%   +1.63%
==========================================
  Files          82       94      +12
  Lines        5140     5900     +760
  Branches        0      515     +515
==========================================
+ Hits         4963     5793     +830
+ Misses        177      100      -77
- Partials        0        7       +7

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging 20b44c6 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging a6e975b into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging 456c3eb into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@ridhachahed commented

@glemaitre Why didn't you merge this branch into master? Everything seems alright, doesn't it?

@glemaitre (Member, Author) commented

It is one year old, so I don't recall the details. It was only a POC to see what the issues would be in dealing with dask-ml and dask. I think that one of the issues I had was about validation: #777 (comment)

It would need further work to go ahead.

@ridhachahed commented

I think it would be a pity if it doesn't go in because of this comment. We can't really do much about it except avoid calling this method. Happy to help if there is anything else that needs to be done :)

Reviewers

@TomAugspurger left review comments

Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone

4 participants
@glemaitre @pep8speaks @ridhachahed @TomAugspurger
