[WIP] ENH: Hellinger distance tree split criterion for imbalanced data classification #437
base: master
Conversation
pep8speaks commented Jul 11, 2018 (edited)
Hello @EvgeniDubov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: there are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻
Comment last updated at 2019-12-23 09:29:09 UTC
codecov bot commented Jul 11, 2018 (edited)
Codecov Report
@@ Coverage Diff @@ ## master #437 +/- ##
Coverage: 98.83% → 98.83%
Files: 86 → 86
Lines: 5317 → 5317
Hits: 5255 → 5255
Misses: 62 → 62
Continue to review the full report at Codecov.
Cool, I am looking forward to this contribution. Could you make a quick todo list in the first summary?
From what I see:
…eniDubov/imbalanced-learn into hellinger_distance_criterion
# Conflicts:
#	imblearn/tree_split/setup.py
I've pushed the code for Cython build support, and all the automatic checks failed. Please let me know whether there is a way for me to configure these tools, or whether they are administered by the maintainers only.
@EvgeniDubov I am getting some time to look at this PR. I was looking at the literature and the original paper. I did not find a clear statement on how to compute the distance in a multiclass problem, which is actually supported by the trees in scikit-learn. @EvgeniDubov @DrEhrfurchtgebietend do you have a reference for multiclass?
Gitman-code commented Aug 30, 2018
I have not done much multi-class classification. I do not even know how it is implemented with the traditional split criteria. Is it possible to set this up to only work for binary classification? Can we release without solving this?
@glemaitre indeed, sklearn's 'gini' and 'entropy' support multiclass, but hellinger requires some modification to support it.
I can contribute it as a separate Cython implementation, preferably in a separate PR.
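For context, the binary criterion under discussion can be sketched in pure Python as follows (the function name and signature are illustrative, not the PR's Cython API):

```python
import numpy as np

def hellinger_split_value(n_pos_left, n_neg_left, n_pos, n_neg):
    """Binary Hellinger distance for a candidate split (Cieslak & Chawla).

    Scores the split by the Hellinger distance between the two
    class-conditional branch distributions P(branch | positive) and
    P(branch | negative); larger values mean a better split.
    """
    tpr_l = n_pos_left / n_pos   # fraction of positives sent left
    fpr_l = n_neg_left / n_neg   # fraction of negatives sent left
    tpr_r = 1.0 - tpr_l          # fraction of positives sent right
    fpr_r = 1.0 - fpr_l          # fraction of negatives sent right
    return np.sqrt((np.sqrt(tpr_l) - np.sqrt(fpr_l)) ** 2
                   + (np.sqrt(tpr_r) - np.sqrt(fpr_r)) ** 2)
```

A perfect separation yields sqrt(2), and a split that sends both classes left in equal proportion yields 0. Because only per-class fractions appear, the score does not depend on the class prior, which is the motivation for using it on imbalanced data; extending this two-distribution formulation to more than two classes is exactly the open question above.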
Oh nice, I did not look at this paper, only the 2009 and 2011 ones. It seems that I missed it.
I think that we should have the multi-class criterion directly. The reason is that we don't have a mechanism for raising an error if the criterion is used for a multiclass problem. However, it seems quite feasible to implement Algorithm 1 in the paper that you attached. Regarding the Cython file, could you take all the Cython setup from glemaitre@27fffea and just paste your criterion file and tests at the right location (+ documentation)? I prefer to stay closer to that Cython setup (the major change is about the package naming).
Force-pushed from bbf2b12 to 513203c
Gitman-code commented Sep 10, 2018
I am curious whether we need to do something specific for how feature importance will be calculated after this change is done. There are two questions here. First, does the standard method of summing the improvement in the criterion really generalize to all criteria? I think the answer is yes, but if so, this might not really be the definition we want. In an imbalanced case we would in theory have imbalanced features (i.e. nearly all the same value) which, if important, would be used high in the tree but not frequently. This would result in a low weight under the current definition. Would a definition based on the average gain per use, instead of the total gain across all uses, be better? To limit discussion here I put this into a SO post.
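To make the alternative concrete, here is a rough sketch (the helper name is ours, not sklearn's) of an average-gain-per-use importance computed from a fitted sklearn tree, in contrast to the default total-gain `feature_importances_`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def average_gain_importance(tree_clf):
    """Average weighted impurity decrease per use of each feature."""
    t = tree_clf.tree_
    gain = np.zeros(tree_clf.n_features_in_)
    uses = np.zeros(tree_clf.n_features_in_)
    n = t.weighted_n_node_samples
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node, no split here
            continue
        f = t.feature[node]
        # weighted impurity decrease achieved by this split
        gain[f] += (n[node] * t.impurity[node]
                    - n[left] * t.impurity[left]
                    - n[right] * t.impurity[right])
        uses[f] += 1
    avg = np.divide(gain, uses, out=np.zeros_like(gain), where=uses > 0)
    return avg / avg.sum() if avg.sum() > 0 else avg

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(average_gain_importance(clf))
```

This divides each feature's accumulated gain by its number of uses, so a feature used rarely but high in the tree is not penalized for its low use count, which is the scenario described above.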
glemaitre commented Sep 11, 2018 (edited)
There are three points to consider:
Gitman-code commented Sep 11, 2018
Thanks for the feedback @glemaitre. So you agree that different split criteria could be used to calculate the feature importance in general? It intuitively makes sense that if a criterion is used to build the tree, it should also be used to define importance. The weighting of the importance by the number of samples at the node is what got me thinking down this path: Hellinger distance is designed to be less sensitive to the number of samples, but I think that is only a factor in finding the split.
The permutation feature importance is a great method. I see that there are discussions to move it to sklearn. The purpose of thinking about feature importance in this way is to make sure one does not eliminate features which are unimportant in general but crucial in a few rare outlier cases. When doing feature selection it is easy to justify dropping such features when looking at aggregate metrics like RMSE, since changes to only a few predictions will only alter it by a tiny amount. Permutation feature importance would not be sensitive to this either, or at least it will only be as sensitive as the evaluation metric is to such outliers. Do you know of any standard metric for identifying features of this type? Sorry, this has gotten a little off topic.
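As an illustration of the permutation approach mentioned above, using the `permutation_importance` helper that scikit-learn later shipped (the dataset and scorer choices here are arbitrary, not from this PR):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: roughly 9:1 class ratio
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Scoring with a metric sensitive to the minority class (average
# precision) makes the importances reflect rare-case performance,
# addressing the outlier-sensitivity concern raised above.
result = permutation_importance(clf, X_te, y_te,
                                scoring="average_precision",
                                n_repeats=10, random_state=0)
print(result.importances_mean)
```

Choosing the scoring metric is the lever here: an aggregate metric hides the rare cases, while a minority-focused one surfaces them.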
…eniDubov/imbalanced-learn into hellinger_distance_criterion
# Conflicts:
#	doc/over_sampling.rst
#	doc/whats_new/v0.0.4.rst
- added Hellinger pyd file to MANIFEST
- updated cython version requirements in hellinger cython code
…e_criterion
# Conflicts:
#	.gitignore
#	.travis.yml
#	appveyor.yml
#	imblearn/tensorflow/_generator.py
#	imblearn/tensorflow/tests/test_generator.py
#	imblearn/utils/_validation.py
@glemaitre @chkoar I've synced with master and got lint, travis, and appveyor issues, none of which were caused by my contribution.
giladwa1 commented Feb 17, 2020
@glemaitre @chkoar My DS team is using the Hellinger distance split criterion from @EvgeniDubov's private repo. We would appreciate it being part of scikit-learn-contrib. We're willing to help move this PR forward in any way possible.
@giladwa1 I am not familiar with the Hellinger distance yet, but if people are willing to help to get this merged, I am OK even if it works only for the binary case.
Speaking a bit more with the devs of scikit-learn, I think that it could be integrated into scikit-learn directly. It would only be for the binary case, and we should have good tests and a nice example showing its benefit. The issue in imbalanced-learn is that we would be required to code in Cython, which then puts a lot of burden on the wheel generation, which personally I would like to avoid if possible. This is somehow a cost which is a bit hidden.
That is very true.
giladwa1 commented Feb 20, 2020
@glemaitre @chkoar Thanks for the quick reply. I will continue the discussion in the scikit-learn PR scikit-learn/scikit-learn#16478
@glemaitre since this PR was transferred, could we close this one?
Sandy4321 commented Jan 8, 2023
Please clarify: was this added to the main code or not?
Reference Issue
[sklearn] Feature Request: Hellinger split criterion for classification trees #9947
What does this implement/fix? Explain your changes.
Hellinger distance as a tree split criterion: a Cython implementation compatible with sklearn tree-based classification models.
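As a rough illustration of the benefit (plain Python, not this PR's Cython code): Gini gain shrinks as the class prior becomes skewed, while the Hellinger split value depends only on per-class branch fractions, so the same conditional split scores the same under any imbalance:

```python
import math

def hellinger(tpr_l, fpr_l):
    """Binary Hellinger split value from per-class branch fractions."""
    return math.sqrt((math.sqrt(tpr_l) - math.sqrt(fpr_l)) ** 2
                     + (math.sqrt(1 - tpr_l) - math.sqrt(1 - fpr_l)) ** 2)

def gini_gain(n_pos_l, n_neg_l, n_pos, n_neg):
    """Weighted Gini impurity decrease for a binary split."""
    def gini(p, n):
        tot = p + n
        return 0.0 if tot == 0 else 1 - (p / tot) ** 2 - (n / tot) ** 2
    n_pos_r, n_neg_r = n_pos - n_pos_l, n_neg - n_neg_l
    total = n_pos + n_neg
    left, right = n_pos_l + n_neg_l, n_pos_r + n_neg_r
    return (gini(n_pos, n_neg)
            - left / total * gini(n_pos_l, n_neg_l)
            - right / total * gini(n_pos_r, n_neg_r))

# Same conditional split (80% of positives left, 20% of negatives left)
balanced = gini_gain(80, 20, 100, 100)  # 1:1 class prior
skewed = gini_gain(8, 198, 10, 990)     # 1:99 prior, same fractions
print(balanced, skewed)                 # Gini gain shrinks under skew
print(hellinger(0.8, 0.2))              # Hellinger: same in both cases
```

The Gini gain for the skewed prior is orders of magnitude smaller than for the balanced one, while the Hellinger value is identical for both, which is the skew-insensitivity property the criterion targets.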
Any other comments?
This is my first submission; sorry in advance for the many possible things I've missed.
Looking forward to your feedback.