Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

[WIP] ENH: SPIDER Sampling Algorithm#603

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
MattEding wants to merge24 commits intoscikit-learn-contrib:master
base:master
Choose a base branch
Loading
fromMattEding:spider

Conversation

MattEding
Copy link
Contributor

@MattEdingMattEding commentedSep 19, 2019
edited
Loading

As per requested by the pinnedNew Methods, I have implemented the Selective Pre-processing of Imbalanced Data (SPIDER) sampling algorithm.

I have developed unit tests based on drawing out a sample dataset and working out by hand what the expected results would be as it is deterministic. See the following notebook for diagramshere.

Currently, the implementation for dense and sparse return the same data points but in different orders (np.lexsort will show they are indeed the same). Consequently this fails the existing unit test that compares dense vs sparse outputs. I would rather not have to require sorting results to ensure that test passes due to the overhead for sorting large datasets. Maybe I can just have a parameter that defaultssort=True to give the option to bypass this issue.

The only other unit test I saw fail with PyTest is awith raises(MESSAGE), but I will fix that later since I am not sure if other developers will want to keep my newsample_type = 'preprocess-strategy'. I had chosen do this because SPIDER does both cleaning and oversampling and I did not feel that inheriting from either was appropriate--oversamplers allowsampling_strategy as afloat which does not make sense for this algorithm, and cleaners only really allow for under-sampling sampling strategies which is also inappropriate.

Benchmark results using cross-validation with the mean of 5 folds comparing None, NCR, SMOTE, and the 3 SPIDER variants arehere in the PKL, CSV, and PNG folders. I used the Zenodo datasets (excluding set 26 due to it being a large dataset for my local computer to work with in a time effective manner). I set alln_jobs=-1 on my 4 core macbook if you want to infer thetime values. Additionally when a scorer was undefined, I left the value as 0 rather than changing it to NaN.

TODO:

  • Resolve the two failing tests I addressed above
  • Online reference explaining the algorithm
  • Possibly re-categorize algorithm
  • Maybe change self._{xyz} to be variables to be explicitly passed into methods to avoid possible multiprocessing mutation complications

@pep8speaks
Copy link

pep8speaks commentedSep 19, 2019
edited
Loading

Hello@MattEding! Thanks for updating this PR. We checked the lines you've touched forPEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-22 00:59:05 UTC

@MattEdingMattEding changed the title[MRG] ENH: SPIDER Sampling Algorithm[WIP] ENH: SPIDER Sampling AlgorithmSep 19, 2019
@chkoar
Copy link
Member

Thanks Matt! That would be a nice addition!

@MattEding
Copy link
ContributorAuthor

Rough draft for online documentation illustrations of how the SPIDER algorithm works for arbitrary X and y values. The generated plots work as planned; just need to add written text discussing how to interpret graphs. Also needs to be moved from .ipynb to .py that can be understood by sphinx.
https://github.com/MattEding/SPIDER-Algorithm/blob/master/SPIDER-Formulation.ipynb

@codecov
Copy link

codecovbot commentedSep 30, 2019
edited
Loading

Codecov Report

Merging#603 intomaster willdecrease coverage by0.02%.
The diff coverage is97.92%.

Impacted file tree graph

@@            Coverage Diff             @@##           master     #603      +/-   ##==========================================- Coverage   98.46%   98.43%   -0.03%==========================================  Files          82       86       +4       Lines        4886     5062     +176     ==========================================+ Hits         4811     4983     +172- Misses         75       79       +4
Impacted FilesCoverage Δ
imblearn/utils/_validation.py100% <100%> (ø)⬆️
imblearn/combine/_preprocess/base.py100% <100%> (ø)
imblearn/combine/_preprocess/__init__.py100% <100%> (ø)
imblearn/utils/tests/test_validation.py100% <100%> (ø)⬆️
imblearn/combine/tests/test_spider.py100% <100%> (ø)
imblearn/combine/__init__.py100% <100%> (ø)⬆️
imblearn/combine/_preprocess/_spider.py96.39% <96.39%> (ø)
... and1 more

Continue to review full report at Codecov.

Legend -Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing data
Powered byCodecov. Last update3839df1...53dfc4e. Read thecomment docs.

@MattEding
Copy link
ContributorAuthor

With Travis CIDISTRIB="ubuntu", it appears that this build is using Tensorflow 2.0 as the unit tests are failing withAttributeError: module 'tensorflow' has no attribute 'placeholder'.

@glemaitre
Copy link
Member

I would suggest moving this implementation intosmote_variants. The idea behind this move is to benchmark the smote variants on a common benchmark on a large number of datasets and include in imbalanced-learn only the versions that show an advantage. You can see the discussion and contribute to it:https://github.com/gykovacs/smote_variants/issues/14

@MattEding would this strategy would be fine with you?

@chkoar
Copy link
Member

I believe thatSPIDER would be great addition toimblearn package. I need to find some time to thoroughly review it.@MattEding thanks for the patience.

@MattEding
Copy link
ContributorAuthor

@glemaitre I understand the motivation for not wanting to include it until it has a more proven track record. My only reservation about trying to include it insmote_variants is that SPIDER is unrelated to SMOTE. One of its motivations was to avoid creating synthetic samples:

"Random introduction of synthetic examples by SMOTE may be questionable or difficult to justify in some domains, where it is important to preserve a link between original data and a constructed classifier (e.g., to justify suggested decisions)."
Stefanowski and Wilk, 2008.Selective Pre-processing of Imbalanced Data for Improving Classification Performance.

As a side note, you may want to revisitNew Methods inviting people to contribute new sampling techniques by specifying that contributions may need to go through other repositories first.

@glemaitre
Copy link
Member

My only reservation about trying to include it in smote_variants is that SPIDER is unrelated to SMOTE.

This is a really could point, I misread when reviewing the different PR :).

My proposal was a bit too quick.@chkoar has some concerns also regarding my proposal as well. We probably need to open an issue to have a public discussion about it.

So as a summary, let's go-ahead for the inclusion of this method. I should have a bit of time to review it in the coming days.@MattEding could you solve the conflict?

chkoar reacted with thumbs up emoji

@lgtm-com
Copy link

This pull requestintroduces 4 alerts when merging504d601 into9b31677 -view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 1 for Unused import
  • 1 for Syntax error

…s docstring substitution; formatted code syntax to be more congruent with rest of codebase
@lgtm-com
Copy link

This pull requestintroduces 3 alerts when merging2517bd9 intof356284 -view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 1 for Unused import

@lgtm-com
Copy link

This pull requestfixes 3 alerts when merginga9cd91c intof356284 -view on LGTM.com

fixed alerts:

  • 2 for Mismatch in multiple assignment
  • 1 for Wrong name for an argument in a call

@lgtm-com
Copy link

This pull requestintroduces 1 alert when merginga8d21e0 into9a1191e -view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

@MattEding
Copy link
ContributorAuthor

I am unfamiliar with how codecov determines bad lines of code coverage. With try-except blocks being normal control flow in Python--in fact it's consideredgood practice I am curious as to why these are being flagged as poor coverage. For example, the catching of anUnboundLocalError is a lot cleaner than saylocals().get('discard_indices', []). Also I am presuming that the red flags in the helper functions are due to multiple return statements. Is this the case?

@nschlaffke
Copy link

How do the things look like? It would be nice to have this algorithm

@MattEding
Copy link
ContributorAuthor

@nschlaffke
It's been quite some time since I've worked on SPIDER, but I believe that I have left it as a fully working algorithm--it passes the (very small) toy test data set that I worked out manually. See examples/combine/plot_illustration_spider.py and the unit tests for details.

The things that needed to be fleshed out would be how it integrates with the rest of imblearn, and whether to lower the Code Coverage threshold rejecting try-except clauses or rewrite in different style. For example, I had created a new base class for this since it didn't feel like it would fit the role of over/under sampling. This may not be what the maintainers would desire or may want it to be refactored.

nschlaffke reacted with thumbs up emoji

@glemaitre
Copy link
Member

I will try to have a look soon. I had to make some underground work to ensure the compatibility with scikit-learn 0.23 and I should be done with it.

nschlaffke reacted with heart emoji

@glemaitre
Copy link
Member

I will check when I get time. We can put a milestone for 0.9.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers
No reviews
Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
0.9
Development

Successfully merging this pull request may close these issues.

5 participants
@MattEding@pep8speaks@chkoar@glemaitre@nschlaffke

[8]ページ先頭

©2009-2025 Movatter.jp