DOC improve documentation of nearmiss #1028


Open

solegalli wants to merge 3 commits into scikit-learn-contrib:master from solegalli:update_nearmiss

Conversation

solegalli (Contributor):

Related to #853.

Separating into smaller PRs as requested.

I had some trouble understanding the documentation for this class, so I hope that my re-wording captures the message intended in the original version. More comments in the code.
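For context, the snippet below assumes the imbalanced three-class toy dataset built earlier in the user guide; the exact `make_classification` parameters here are my assumption of that setup, not part of this PR:

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
...                            n_redundant=0, n_repeated=0, n_classes=3,
...                            n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
...                            class_sep=0.8, random_state=0)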


>>> from imblearn.under_sampling import NearMiss
>>> nm1 = NearMiss(version=1)
>>> X_resampled_nm1, y_resampled = nm1.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 64), (2, 64)]

As later stated in the next section, :class:`NearMiss` heuristic rules are
solegalli (Contributor, Author):

I think that describing parameters without having described the logic breaks the flow and doesn't really help the user understand what the parameters do.

I suggest removing this paragraph altogether and expanding on the meaning of the parameters (if necessary) in the docstrings.

is the largest.
**NearMiss-3** is a 2-step algorithm:

First, for each negative sample, that is, for each observation of the minority class,
solegalli (Contributor, Author):

I refer back to "target class" and "minority" because "positive" and "negative" are a bit confusing, and at this stage the reader may have forgotten which is which.

previous toy example. It can be seen that the decision functions obtained in
each case are different.

When under-sampling a specific class, NearMiss-1 can be altered by the presence
solegalli (Contributor, Author):

This is the paragraph that I don't understand.

Are we saying that NearMiss-1 is sensitive to noise because the samples from the majority class closest to the minority class are noise?

That is how I understood it and how I rephrased it. Correct me if I am wrong.

But besides that, is it always the case that observations from the majority class that are close to the minority class are noise? I get that they will be the hardest to classify, but that does not necessarily make them noise. Noise, as I understand it, is some sort of random variation; or, in this case, it would be samples from the majority class that are not representative of the majority, in other words, a sort of outlier. But none of this is in the text. The text as it stands right now just says that if they are close, they are noise, and I don't think that is correct.

Thoughts?

glemaitre (Member):

A noisy observation would be something outside of the sample distribution. So I agree that it does not have anything to do with the decision boundary. Those samples are just harder to classify if the data overlap, but they are not noise.
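For concreteness, here is a minimal sketch of the NearMiss-1 rule under discussion (the helper name and the `n_keep`/`k` parameters are my own; this is not imblearn's implementation): majority samples are ranked by their average distance to their k nearest minority neighbours, and the closest ones are kept.

>>> import numpy as np
>>> from sklearn.neighbors import NearestNeighbors
>>> def nearmiss1_indices(X_maj, X_min, n_keep, k=3):
...     # distance from every majority sample to its k nearest minority samples
...     nn = NearestNeighbors(n_neighbors=k).fit(X_min)
...     dist, _ = nn.kneighbors(X_maj)
...     # keep the majority samples with the smallest average distance
...     return np.argsort(dist.mean(axis=1))[:n_keep]

Under this rule, a stray majority point sitting inside the minority cloud has a tiny average distance and is always retained, which is the sensitivity to atypical samples being debated here.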


NearMiss-3 is probably the version least sensitive to noise, owing to the first sample
selection step.

solegalli (Contributor, Author), Jul 11, 2023 (edited):

More thoughts on NearMiss:

This is just my opinion; I would be happy to be challenged on it.

To me, it feels as if the authors used an almost arbitrary logic to select the samples: those with the smallest average distance, those with the largest, and then something else again in version 3. In their article, they don't really describe how they came to these rules, so the choice seems arbitrary to me. Given this, I am not sure it makes sense to discuss anything further in this user guide. In fact, I would remove all the discussion about noise.

In addition, this method was designed and tested on text data, something similar to bags of words / tokens, which is a very particular format, and I am not aware that it has been tested or used anywhere else. So overall I am not a great fan of this method; but that aside, should we not add a disclaimer somewhere saying that it was designed for text, advising caution when applying it to more traditional datasets?
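For reference, a sketch of the two-step NearMiss-3 rule referred to above, as I read it from the guide (the helper name and the `m`/`k` parameters are assumptions, not imblearn's API):

>>> import numpy as np
>>> from sklearn.neighbors import NearestNeighbors
>>> def nearmiss3_indices(X_maj, X_min, n_keep, m=3, k=3):
...     # step 1: shortlist the m nearest majority neighbours of each minority sample
...     nn_maj = NearestNeighbors(n_neighbors=m).fit(X_maj)
...     _, idx = nn_maj.kneighbors(X_min)
...     shortlist = np.unique(idx)
...     # step 2: among the shortlist, keep the samples whose average distance
...     # to their k nearest minority neighbours is the largest
...     nn_min = NearestNeighbors(n_neighbors=k).fit(X_min)
...     dist, _ = nn_min.kneighbors(X_maj[shortlist])
...     return shortlist[np.argsort(dist.mean(axis=1))[::-1][:n_keep]]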

glemaitre (Member):

Actually we don't even implement the right things here :)
#980

solegalli (Contributor, Author):

@glemaitre I am done with the initial changes, but I think it would be worth discussing this a bit. Thank you!

Reviewers: @glemaitre (left review comments)
Assignees: no one assigned
Labels: none yet
Projects: none yet
Milestone: no milestone
Participants: @solegalli, @glemaitre
