[WIP] ENH: Hellinger distance tree split criterion for imbalanced data classification #437
base: master
Conversation
pep8speaks commented Jul 11, 2018 (edited)
Hello @EvgeniDubov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: there are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻
Comment last updated at 2019-12-23 09:29:09 UTC
codecov bot commented Jul 11, 2018 (edited)
Codecov Report
@@ Coverage Diff @@ ## master #437 +/- ##
Coverage: 98.83% → 98.83%
Files: 86 → 86
Lines: 5317 → 5317
Hits: 5255 → 5255
Misses: 62 → 62
Continue to review the full report at Codecov.
Cool, I am looking forward to this contribution. Could you make a quick todo list in the first summary?
From what I see:
…eniDubov/imbalanced-learn into hellinger_distance_criterion
# Conflicts:
#	imblearn/tree_split/setup.py
I've pushed the code for Cython build support, and all the automatic checks failed. Please let me know whether there is a way for me to configure these tools, or whether they are administered by the maintainers only.
@EvgeniDubov I am getting some time to look at this PR. I was looking at the literature and the original paper. I did not find a clear statement on how to compute the distance in a multiclass problem, which is actually supported by the trees in scikit-learn. @EvgeniDubov @DrEhrfurchtgebietend do you have a reference for multiclass?
Gitman-code commented Aug 30, 2018
I have not done much multi-class classification. I do not even know how it is implemented with the traditional split criteria. Is it possible to set this up to only work for binary classification? Can we release without solving this?
@glemaitre indeed, sklearn's 'gini' and 'entropy' support multiclass, but hellinger requires some modification to support it.
I can contribute it as a separate Cython implementation, preferably in a separate PR.
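For context, the binary criterion under discussion can be sketched in pure Python as follows (the function name and signature are illustrative, not the PR's Cython API):

```python
import numpy as np

def hellinger_split_value(n_pos_left, n_neg_left, n_pos, n_neg):
    """Binary Hellinger distance for a candidate split (Cieslak & Chawla).

    Scores the split by the Hellinger distance between the two
    class-conditional branch distributions P(branch | positive) and
    P(branch | negative); larger values mean a better split.
    """
    tpr_l = n_pos_left / n_pos   # fraction of positives sent left
    fpr_l = n_neg_left / n_neg   # fraction of negatives sent left
    tpr_r = 1.0 - tpr_l          # fraction of positives sent right
    fpr_r = 1.0 - fpr_l          # fraction of negatives sent right
    return np.sqrt((np.sqrt(tpr_l) - np.sqrt(fpr_l)) ** 2
                   + (np.sqrt(tpr_r) - np.sqrt(fpr_r)) ** 2)
```

A perfect separation yields sqrt(2), and a split that sends both classes left in equal proportion yields 0. Because only per-class fractions appear, the score does not depend on the class prior, which is the motivation for using it on imbalanced data; extending this two-distribution formulation to more than two classes is exactly the open question above.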
Oh nice, I did not look at this paper, only the 2009 and 2011 ones. It seems that I missed it.
I think that we should have the multi-class criterion directly. The reason is that we don't have a mechanism for raising an error if the criterion is used for a multiclass problem. However, it seems quite feasible to implement Algorithm 1 in the paper that you attached. Regarding the Cython file, could you take all the Cython setup from glemaitre@27fffea and just paste your criterion file and tests at the right location (+ documentation)? I prefer to stay closer to that Cython setup (the major change is about the package naming).
Force-pushed from bbf2b12 to 513203c
Gitman-code commented Sep 10, 2018
I am curious whether we need to do something specific for how feature importance will be calculated after this change is done. There are two questions here. First, does the standard method of summing the improvement in the criterion really generalize to all criteria? I think the answer is yes, but if so, this might not really be the definition we want. In an imbalanced case we would in theory have imbalanced features (i.e. nearly all the same value) which, if important, would be used high in the tree but not frequently. This would result in a low weight under the current definition. Would a definition based on the average gain per use, instead of the total gain across all uses, be better? To limit discussion here I put this into a SO post.
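To make the alternative concrete, here is a rough sketch (the helper name is ours, not sklearn's) of an average-gain-per-use importance computed from a fitted sklearn tree, in contrast to the default total-gain `feature_importances_`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def average_gain_importance(tree_clf):
    """Average weighted impurity decrease per use of each feature."""
    t = tree_clf.tree_
    gain = np.zeros(tree_clf.n_features_in_)
    uses = np.zeros(tree_clf.n_features_in_)
    n = t.weighted_n_node_samples
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node, no split here
            continue
        f = t.feature[node]
        # weighted impurity decrease achieved by this split
        gain[f] += (n[node] * t.impurity[node]
                    - n[left] * t.impurity[left]
                    - n[right] * t.impurity[right])
        uses[f] += 1
    avg = np.divide(gain, uses, out=np.zeros_like(gain), where=uses > 0)
    return avg / avg.sum() if avg.sum() > 0 else avg

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(average_gain_importance(clf))
```

This divides each feature's accumulated gain by its number of uses, so a feature used rarely but high in the tree is not penalized for its low use count, which is the scenario described above.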
glemaitre commented Sep 11, 2018 (edited)
There are three points to consider:
Gitman-code commented Sep 11, 2018
Thanks for the feedback @glemaitre. So you agree that different split criteria could be used to calculate the feature importance in general? It intuitively makes sense that if a criterion is used to build the tree, it should also be used to define importance. The weighting of the importance by the number of samples at the node is what got me thinking down this path: Hellinger distance is designed to be less sensitive to the number of samples, but I think that is only a factor in finding the split.
The permutation feature importance is a great method. I see that there are discussions to move it to sklearn. The purpose of thinking about feature importance in this way is to make sure one does not eliminate features which are unimportant in general but crucial in a few rare outlier cases. When doing feature selection it is easy to justify dropping such features when looking at aggregate metrics like RMSE, since changes to only a few predictions will only alter it by a tiny amount. Permutation feature importance would not be sensitive to this either, or at least it will only be as sensitive as the evaluation metric is to such outliers. Do you know of any standard metric for identifying features of this type? Sorry, this has gotten a little off topic.
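As an illustration of the permutation approach mentioned above, using the `permutation_importance` helper that scikit-learn later shipped (the dataset and scorer choices here are arbitrary, not from this PR):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: roughly 9:1 class ratio
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Scoring with a metric sensitive to the minority class (average
# precision) makes the importances reflect rare-case performance,
# addressing the outlier-sensitivity concern raised above.
result = permutation_importance(clf, X_te, y_te,
                                scoring="average_precision",
                                n_repeats=10, random_state=0)
print(result.importances_mean)
```

Choosing the scoring metric is the lever here: an aggregate metric hides the rare cases, while a minority-focused one surfaces them.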
…eniDubov/imbalanced-learn into hellinger_distance_criterion
# Conflicts:
#	doc/over_sampling.rst
#	doc/whats_new/v0.0.4.rst
- added Hellinger pyd file to MANIFEST
- updated cython version requirements in hellinger cython code
…e_criterion
# Conflicts:
#	.gitignore
#	.travis.yml
#	appveyor.yml
#	imblearn/tensorflow/_generator.py
#	imblearn/tensorflow/tests/test_generator.py
#	imblearn/utils/_validation.py
@glemaitre @chkoar I've synced with master and got lint, travis, and appveyor issues, none of which were caused by my contribution.
giladwa1 commented Feb 17, 2020
@glemaitre @chkoar My DS team is using the Hellinger distance split criterion from @EvgeniDubov's private repo. We would appreciate it being part of scikit-learn-contrib. We're willing to help move this PR forward in any way possible.
@giladwa1 I am not familiar with the Hellinger distance yet, but if people are willing to help to get this merged, I am OK even if it works only for the binary case.
Speaking a bit more with the devs of scikit-learn, I think that it could be integrated into scikit-learn directly. It would only be for the binary case, and we should have good tests and a nice example showing its benefit. The issue in imbalanced-learn is that we would be required to code in Cython, which then puts a lot of burden on the wheel generation, which personally I would like to avoid if possible. This is somehow a cost which is a bit hidden.
That is very true.
giladwa1 commented Feb 20, 2020
@glemaitre @chkoar Thanks for the quick reply. I will continue the discussion in the scikit-learn PR scikit-learn/scikit-learn#16478
@glemaitre since this PR was transferred, could we close this one?
Sandy4321 commented Jan 8, 2023
Please clarify: was this added to the main code or not?
Reference Issue
[sklearn] Feature Request: Hellinger split criterion for classification trees #9947
What does this implement/fix? Explain your changes.
Hellinger distance as a tree split criterion: a Cython implementation compatible with sklearn tree-based classification models.
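As a rough illustration of the benefit (plain Python, not this PR's Cython code): Gini gain shrinks as the class prior becomes skewed, while the Hellinger split value depends only on per-class branch fractions, so the same conditional split scores the same under any imbalance:

```python
import math

def hellinger(tpr_l, fpr_l):
    """Binary Hellinger split value from per-class branch fractions."""
    return math.sqrt((math.sqrt(tpr_l) - math.sqrt(fpr_l)) ** 2
                     + (math.sqrt(1 - tpr_l) - math.sqrt(1 - fpr_l)) ** 2)

def gini_gain(n_pos_l, n_neg_l, n_pos, n_neg):
    """Weighted Gini impurity decrease for a binary split."""
    def gini(p, n):
        tot = p + n
        return 0.0 if tot == 0 else 1 - (p / tot) ** 2 - (n / tot) ** 2
    n_pos_r, n_neg_r = n_pos - n_pos_l, n_neg - n_neg_l
    total = n_pos + n_neg
    left, right = n_pos_l + n_neg_l, n_pos_r + n_neg_r
    return (gini(n_pos, n_neg)
            - left / total * gini(n_pos_l, n_neg_l)
            - right / total * gini(n_pos_r, n_neg_r))

# Same conditional split (80% of positives left, 20% of negatives left)
balanced = gini_gain(80, 20, 100, 100)  # 1:1 class prior
skewed = gini_gain(8, 198, 10, 990)     # 1:99 prior, same fractions
print(balanced, skewed)                 # Gini gain shrinks under skew
print(hellinger(0.8, 0.2))              # Hellinger: same in both cases
```

The Gini gain for the skewed prior is orders of magnitude smaller than for the balanced one, while the Hellinger value is identical for both, which is the skew-insensitivity property the criterion targets.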
Any other comments?
This is my first submission; sorry in advance for the many possible things I've missed.
Looking forward to your feedback.