scikit-learn-contrib/imbalanced-learnPublic

NotificationsYou must be signed in to change notification settings
Fork1.3k
Star7k

[WIP] ENH: SPIDER Sampling Algorithm#603

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Open

MattEding wants to merge24 commits intoscikit-learn-contrib:master

base:master

Choose a base branch

fromMattEding:spider

Open

[WIP] ENH: SPIDER Sampling Algorithm#603

MattEding wants to merge24 commits intoscikit-learn-contrib:masterfromMattEding:spider

Conversation

Copy link

Contributor

MattEding commentedSep 19, 2019•
edited
Loading

As per requested by the pinnedNew Methods, I have implemented the Selective Pre-processing of Imbalanced Data (SPIDER) sampling algorithm.

I have developed unit tests based on drawing out a sample dataset and working out by hand what the expected results would be as it is deterministic. See the following notebook for diagramshere.

Currently, the implementation for dense and sparse return the same data points but in different orders (np.lexsort will show they are indeed the same). Consequently this fails the existing unit test that compares dense vs sparse outputs. I would rather not have to require sorting results to ensure that test passes due to the overhead for sorting large datasets. Maybe I can just have a parameter that defaultssort=True to give the option to bypass this issue.

The only other unit test I saw fail with PyTest is awith raises(MESSAGE), but I will fix that later since I am not sure if other developers will want to keep my newsample_type = 'preprocess-strategy'. I had chosen do this because SPIDER does both cleaning and oversampling and I did not feel that inheriting from either was appropriate--oversamplers allowsampling_strategy as afloat which does not make sense for this algorithm, and cleaners only really allow for under-sampling sampling strategies which is also inappropriate.

Benchmark results using cross-validation with the mean of 5 folds comparing None, NCR, SMOTE, and the 3 SPIDER variants arehere in the PKL, CSV, and PNG folders. I used the Zenodo datasets (excluding set 26 due to it being a large dataset for my local computer to work with in a time effective manner). I set alln_jobs=-1 on my 4 core macbook if you want to infer thetime values. Additionally when a scorer was undefined, I left the value as 0 rather than changing it to NaN.

TODO:

~~Resolve the two failing tests I addressed above~~
~~Online reference explaining the algorithm~~
Possibly re-categorize algorithm
Maybe change self._{xyz} to be variables to be explicitly passed into methods to avoid possible multiprocessing mutation complications

MattEding added6 commits

September 15, 2019 22:28

SPIDER class runs, need to develop tests to make sure it is working a…

54e1377

…s intended behind scenes; benchmark code

unit tests for weak, resample, strong pass; need to fix sparse tests

df48951

pep8; unit testing spider

774be31

fix sparse spider test, remove jupyter notebooks

3f21d37

Merge branch 'master' ofhttps://github.com/scikit-learn-contrib/imba…

58f4d78

…lanced-learn into spider

docstring SPIDER; spider sample strategy list

dbb0564

Copy link

pep8speaks commentedSep 19, 2019•
edited
Loading

Hello@MattEding! Thanks for updating this PR. We checked the lines you've touched forPEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-22 00:59:05 UTC

MattEding changed the title~~[MRG] ENH: SPIDER Sampling Algorithm~~[WIP] ENH: SPIDER Sampling Algorithm

Sep 19, 2019

pep8

e82fe2c

Copy link

Member

chkoar commentedSep 19, 2019

Thanks Matt! That would be a nice addition!

Copy link

ContributorAuthor

MattEding commentedSep 20, 2019

Rough draft for online documentation illustrations of how the SPIDER algorithm works for arbitrary X and y values. The generated plots work as planned; just need to add written text discussing how to interpret graphs. Also needs to be moved from .ipynb to .py that can be understood by sphinx.
https://github.com/MattEding/SPIDER-Algorithm/blob/master/SPIDER-Formulation.ipynb

MattEding added2 commits

September 30, 2019 14:23

spider dense & sparse now resample in same order; add spider formulat…

c0fe84e

…ion illustration to docs

pep8; remove old commented-out code in spider amplify

4ec89cf

Copy link

codecovbot commentedSep 30, 2019•
edited
Loading

Codecov Report

Merging#603 intomaster willdecrease coverage by0.02%.
The diff coverage is97.92%.

@@            Coverage Diff             @@##           master     #603      +/-   ##==========================================- Coverage   98.46%   98.43%   -0.03%==========================================  Files          82       86       +4       Lines        4886     5062     +176     ==========================================+ Hits         4811     4983     +172- Misses         75       79       +4

Impacted Files	Coverage Δ
imblearn/utils/_validation.py	`100% <100%> (ø)`	⬆️
imblearn/combine/_preprocess/base.py	`100% <100%> (ø)`
imblearn/combine/_preprocess/__init__.py	`100% <100%> (ø)`
imblearn/utils/tests/test_validation.py	`100% <100%> (ø)`	⬆️
imblearn/combine/tests/test_spider.py	`100% <100%> (ø)`
imblearn/combine/__init__.py	`100% <100%> (ø)`	⬆️
imblearn/combine/_preprocess/_spider.py	`96.39% <96.39%> (ø)`
... and1 more

Continue to review full report at Codecov.

Legend -Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing data
Powered byCodecov. Last update3839df1...53dfc4e. Read thecomment docs.

Copy link

ContributorAuthor

MattEding commentedSep 30, 2019

With Travis CIDISTRIB="ubuntu", it appears that this build is using Tensorflow 2.0 as the unit tests are failing withAttributeError: module 'tensorflow' has no attribute 'placeholder'.

glemaitre force-pushed themaster branch from65132db to68123d0Compare

November 8, 2019 22:54

Copy link

Member

glemaitre commentedNov 17, 2019

I would suggest moving this implementation intosmote_variants. The idea behind this move is to benchmark the smote variants on a common benchmark on a large number of datasets and include in imbalanced-learn only the versions that show an advantage. You can see the discussion and contribute to it:https://github.com/gykovacs/smote_variants/issues/14

@MattEding would this strategy would be fine with you?

Copy link

Member

chkoar commentedNov 17, 2019

I believe thatSPIDER would be great addition toimblearn package. I need to find some time to thoroughly review it.@MattEding thanks for the patience.

Copy link

ContributorAuthor

MattEding commentedNov 17, 2019

@glemaitre I understand the motivation for not wanting to include it until it has a more proven track record. My only reservation about trying to include it insmote_variants is that SPIDER is unrelated to SMOTE. One of its motivations was to avoid creating synthetic samples:

"Random introduction of synthetic examples by SMOTE may be questionable or difficult to justify in some domains, where it is important to preserve a link between original data and a constructed classifier (e.g., to justify suggested decisions)."
Stefanowski and Wilk, 2008.Selective Pre-processing of Imbalanced Data for Improving Classification Performance.

As a side note, you may want to revisitNew Methods inviting people to contribute new sampling techniques by specifying that contributions may need to go through other repositories first.

Copy link

Member

glemaitre commentedNov 17, 2019

My only reservation about trying to include it in smote_variants is that SPIDER is unrelated to SMOTE.

This is a really could point, I misread when reviewing the different PR :).

My proposal was a bit too quick.@chkoar has some concerns also regarding my proposal as well. We probably need to open an issue to have a public discussion about it.

So as a summary, let's go-ahead for the inclusion of this method. I should have a bit of time to review it in the coming days.@MattEding could you solve the conflict?

Merge branch 'master' into spider

504d601

Copy link

lgtm-combot commentedNov 17, 2019

This pull requestintroduces 4 alerts when merging504d601 into9b31677 -view on LGTM.com

new alerts:

2 for Unused local variable
1 for Unused import
1 for Syntax error

fixed syntax error; removed unused & variables; _safe_indexing; n_job…

2517bd9

…s docstring substitution; formatted code syntax to be more congruent with rest of codebase

Copy link

lgtm-combot commentedNov 19, 2019

This pull requestintroduces 3 alerts when merging2517bd9 intof356284 -view on LGTM.com

new alerts:

2 for Unused local variable
1 for Unused import

remove unused import & variables

a9cd91c

Copy link

lgtm-combot commentedNov 19, 2019

This pull requestfixes 3 alerts when merginga9cd91c intof356284 -view on LGTM.com

fixed alerts:

2 for Mismatch in multiple assignment
1 for Wrong name for an argument in a call

MattEding added4 commits

November 19, 2019 08:44

fixed relative import; __all__ quote cosmetics

b8c8a8e

update unit tests for preproces-sampling; fix cleaning err msg to hav…

5888ff7

…e "do not"; add cleaning test for minority error

Merge branch 'master' ofhttps://github.com/scikit-learn-contrib/imba…

54fce0d

…lanced-learn into SPIDER

fix test_docstring

f354ef0

fix docstring format errors; add unit tests

a8d21e0

Copy link

lgtm-combot commentedNov 20, 2019

This pull requestintroduces 1 alert when merginga8d21e0 into9a1191e -view on LGTM.com

new alerts:

1 for Wrong name for an argument in a class instantiation

MattEding added7 commits

November 20, 2019 09:42

fix renamed param in example plotting

149ea38

Merge branch 'master' ofhttps://github.com/scikit-learn-contrib/imba…

ba8347a

…lanced-learn into spider

add additional_neighbors validation tests; refactor try-except to sat…

1155fa3

…isfy codecov

fix 80 char line

3dd9ae9

fix visual indent; remove unused import

610f3c6

import from _testing to remove future warning

583af30

fix assertion typo

53dfc4e

Copy link

ContributorAuthor

MattEding commentedNov 22, 2019

I am unfamiliar with how codecov determines bad lines of code coverage. With try-except blocks being normal control flow in Python--in fact it's consideredgood practice I am curious as to why these are being flagged as poor coverage. For example, the catching of anUnboundLocalError is a lot cleaner than saylocals().get('discard_indices', []). Also I am presuming that the red flags in the helper functions are due to multiple return statements. Is this the case?

Copy link

nschlaffke commentedJun 4, 2020

How do the things look like? It would be nice to have this algorithm

Copy link

ContributorAuthor

MattEding commentedJun 6, 2020

@nschlaffke
It's been quite some time since I've worked on SPIDER, but I believe that I have left it as a fully working algorithm--it passes the (very small) toy test data set that I worked out manually. See examples/combine/plot_illustration_spider.py and the unit tests for details.

The things that needed to be fleshed out would be how it integrates with the rest of imblearn, and whether to lower the Code Coverage threshold rejecting try-except clauses or rewrite in different style. For example, I had created a new base class for this since it didn't feel like it would fit the role of over/under sampling. This may not be what the maintainers would desire or may want it to be refactored.