ENH make Random*Sampler accept dask array and dataframe #777
base: master
Conversation
pep8speaks commented Nov 5, 2020 (edited)
Hello @glemaitre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-11-08 19:22:54 UTC
This pull request introduces 2 alerts when merging 7aae9d9 into edd7522 - view on LGTM.com. New alerts:
TomAugspurger left a comment
Looks good overall. I think the comment about using `dask.compute(...)` rather than `x.compute(), y.compute()` is the most important.
Other than that, I tried to share some of the difficulties I've run into with Dask-ML, but things look nice overall.
imblearn/dask/_support.py Outdated
```python
_REGISTERED_DASK_CONTAINER = []
try:
    from dask import array, dataframe
```
People can have just `dask[array]` installed (not dataframe), so it's possible to have the `array` import succeed but the `dataframe` import fail. So if you want to support that case, those would need to be in separate try / except blocks.
Maybe you instead want `from dask import is_dask_collection`? That's a bit broader, though (it also covers anything implementing dask's collection interface, like `dask.Bag`, `xarray.DataArray`, and `xarray.Dataset`).
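A minimal sketch of what the split imports could look like, reusing the PR's `_REGISTERED_DASK_CONTAINER` list and an `is_dask_container` helper like the one referenced later in this thread; the exact wiring is an assumption, not the PR's final code:

```python
# Hypothetical sketch: split the optional imports so that a "dask[array]"-only
# installation still registers the array container types.
_REGISTERED_DASK_CONTAINER = []

try:
    from dask import array
    _REGISTERED_DASK_CONTAINER.append(array.Array)
except ImportError:
    pass

try:
    from dask import dataframe
    _REGISTERED_DASK_CONTAINER.extend([dataframe.Series, dataframe.DataFrame])
except ImportError:
    pass


def is_dask_container(obj):
    """Return True if ``obj`` is one of the registered dask containers."""
    return isinstance(obj, tuple(_REGISTERED_DASK_CONTAINER))
```

The alternative mentioned above, `dask.is_dask_collection`, avoids the registration list entirely at the cost of also matching any other dask-collection implementer.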
That seems to be what I wanted :)
```python
if hasattr(y, "unique"):
    labels = np.asarray(y.unique())
else:
    labels = np.unique(y).compute()
```
I've struggled with this check in dask-ml. Depending on where it's called, it's potentially very expensive (you might be loading a ton of data just to check if it's multi-label, and then loading it again to do the training).
Whenever possible, it's helpful to provide an option to skip this check by having the user specify it when creating the estimator, or in a keyword to `fit` (dunno if that applies here).
I thought about it. Do you think that having a context manager outside would make sense:

```python
with set_config(avoid_check=True):
    # some imblearn/scikit-learn/dask code
```

Though, we might get into trouble with issues related to scikit-learn/scikit-learn#18736.
It might just be easier to have an optional class parameter that applies only for dask arrays.
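For illustration, a minimal sketch of that "optional class parameter" idea. The class and the `validate_y` flag are invented names for this sketch, not part of imbalanced-learn's API:

```python
import numpy as np


class RandomUnderSamplerSketch:
    """Toy sampler illustrating an opt-out flag for the expensive label check."""

    def __init__(self, sampling_strategy="auto", validate_y=True):
        self.sampling_strategy = sampling_strategy
        # Users passing large dask collections could set validate_y=False to
        # skip the check and avoid materializing y just for validation.
        self.validate_y = validate_y

    def fit_resample(self, X, y):
        if self.validate_y:
            # Potentially expensive for dask collections: this materializes y.
            labels = np.unique(np.asarray(y))
            if labels.size < 2:
                raise ValueError("y needs at least two classes for resampling")
        # ... the actual resampling logic would follow here
        return X, y
```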
```python
force_all_finite=False,
if is_dask_container(y) and hasattr(y, "to_dask_array"):
    y = y.to_dask_array()
    y.compute_chunk_sizes()
```
In Dask-ML we (@stsievert I think? maybe me?) prefer to have the user do this: https://github.com/dask/dask-ml/blob/7e11ce1505a485104e02d49a3620c8264e63e12e/dask_ml/utils.py#L166-L173. If you're just fitting the one estimator then this is probably equivalent. If you're doing something like a `cross_val_score`, then I think this would end up loading data multiple times just to compute the chunk sizes.
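Roughly the pattern the linked Dask-ML helper encourages, as I read it: make unknown chunk sizes the caller's problem so the cost is paid once, outside the estimator. `check_chunks_known` is an invented name used only for this sketch:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"label": [0, 1, 0, 1, 1]}), npartitions=2)

# Option 1: the user converts with known lengths up front (one pass over the data),
# instead of the estimator calling compute_chunk_sizes() internally.
y = ddf["label"].to_dask_array(lengths=True)


# Option 2: the estimator only checks and raises, never computes.
def check_chunks_known(arr):
    if np.isnan(arr.chunks[0]).any():
        raise ValueError(
            "Unknown chunk sizes; call `compute_chunk_sizes()` or convert with "
            "`to_dask_array(lengths=True)` before fitting."
        )
    return arr
```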
This is something that I was unsure of here. If I recall, the issue was that I could not call `ravel` on the Series, and this was the easiest way to always have an array and convert back to a Series reusing the metadata.
However, if we assume that the checks are too expensive to be done in a distributed setting, we don't need to call the check below and we can directly pass the Series and handle it during the resampling.
So we have fewer safeguards, but at least it is more performant, which is something you probably want in a distributed setting.
imblearn/utils/_validation.py Outdated
```python
if is_dask_container(unique):
    unique, counts = unique.compute(), counts.compute()
```
As written, this will fully execute the task graph of `y` twice: once to compute `unique` and once to compute `counts`.
```diff
 if is_dask_container(unique):
-    unique, counts = unique.compute(), counts.compute()
+    unique, counts = dask.compute(unique, counts)
```
You'll need to `import dask`.
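A small illustration (not the PR's code) of why the single `dask.compute` call is preferable: both results come from one traversal of the shared task graph, so `y` is only read once:

```python
import dask
import dask.array as da

y = da.random.randint(0, 3, size=1_000_000, chunks=100_000)

unique, counts = da.unique(y, return_counts=True)

# One traversal of the graph for both outputs...
unique, counts = dask.compute(unique, counts)

# ...whereas unique.compute() followed by counts.compute() would walk the
# graph (and load the data) twice.
```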
This pull request introduces 5 alerts when merging d4aabf8 into edd7522 - view on LGTM.com. New alerts:
This pull request introduces 5 alerts when merging 58acdf2 into edd7522 - view on LGTM.com. New alerts:
This pull request introduces 2 alerts when merging e54c772 into edd7522 - view on LGTM.com. New alerts:
This pull request introduces 2 alerts when merging 167fc2a into edd7522 - view on LGTM.com. New alerts:
codecov bot commented Nov 8, 2020 (edited)
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #777      +/-   ##
==========================================
+ Coverage   96.55%   98.18%   +1.63%
==========================================
  Files          82       94      +12
  Lines        5140     5900     +760
  Branches        0      515     +515
==========================================
+ Hits         4963     5793     +830
+ Misses        177      100      -77
- Partials        0        7       +7
```

☔ View full report in Codecov by Sentry.
This pull request introduces 2 alerts when merging 20b44c6 into edd7522 - view on LGTM.com. New alerts:
This pull request introduces 2 alerts when merging a6e975b into edd7522 - view on LGTM.com. New alerts:
This pull request introduces 2 alerts when merging 456c3eb into edd7522 - view on LGTM.com. New alerts:
ridhachahed commented Nov 29, 2021
@glemaitre Why didn't you merge this branch with master? Everything seems alright, isn't it?
It is one year old, so I don't recall the details. It was only a POC to see what issues we would have to deal with. It would need further work to go ahead.
ridhachahed commented Nov 29, 2021
I think it would be a pity if it doesn't go in because of this comment. We can't really do much about it except avoid calling this method. Happy to help if there is anything else that needs to be done :)
POC to see if we can make at least the `RandomUnderSampler` and `RandomOverSampler` accept some dask array and dataframe.

Note:
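For context, a hedged sketch of the usage this POC is aiming for; since the PR was never merged, the exact behaviour (for example whether the resampled outputs stay lazy dask collections) is not confirmed by the thread:

```python
import dask.array as da
from imblearn.under_sampling import RandomUnderSampler

# Build imbalanced dask inputs without materializing them in memory.
X = da.random.random((10_000, 4), chunks=1_000)
y = da.random.randint(0, 2, size=10_000, chunks=1_000)

sampler = RandomUnderSampler(random_state=0)
# The goal of the POC: fit_resample accepting dask arrays/dataframes directly.
X_res, y_res = sampler.fit_resample(X, y)
```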