
my RIF estimator, i need ideassssss #31515

Unanswered
GiulioSurya asked this question in Q&A

Hello everyone,

As part of my Master's thesis, I am developing a new estimator based on Isolation Forest that operates on residuals. Without delving into the theoretical background, which isn't relevant here, I'm currently facing a technical issue.

My repository is available at:
(Rif estimator)

The repository includes two modules:

  • RIF
  • _residual_gen

The estimator is implemented within the scikit-learn ecosystem and therefore inherits its methods. In particular, here is what happens:

When I call the fit method on the RIF estimator, it internally invokes fit_transform from _residual_gen, which is responsible for computing the residuals and using them to fit the Isolation Forest.
These residuals are computed with a Random Forest model. To avoid data leakage, they are calculated either from out-of-bag (OOB) predictions or via k-fold cross-validation. (There’s also a “vanilla” version without leakage control, but that’s not relevant for this issue.)
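For context, here is a minimal sketch of how leakage-free residuals can be obtained with scikit-learn; the function name and parameters are illustrative, not the exact ones used in the repository:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict


def leakage_free_residuals(X, y, method="oob", n_estimators=100, cv=5, random_state=0):
    """Residuals computed without data leakage, via OOB or k-fold predictions."""
    if method == "oob":
        rf = RandomForestRegressor(
            n_estimators=n_estimators, oob_score=True, random_state=random_state
        )
        rf.fit(X, y)
        preds = rf.oob_prediction_  # each sample predicted only by trees that did not see it
    elif method == "kfold":
        rf = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
        preds = cross_val_predict(rf, X, y, cv=cv)  # each fold predicted by a model not trained on it
        rf.fit(X, y)  # refit on all data so the model can be reused at predict time
    else:
        raise ValueError(f"unknown method: {method!r}")
    return np.asarray(y) - preds, rf
```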

Once computed, the residuals are cached. Why?
Because when RIF.predict(X) is called:

  • If the input X is the same as the one used in RIF.fit(X), the cached residuals are reused.
  • If the input X is different, the previously fitted Random Forest is used to compute new residuals, and anomalies are detected on these (see the sketch after this list).
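To make the caching behaviour concrete, here is a simplified, self-contained sketch. This is not the actual RIF code: the attribute names, the id(X)-based check, and the assumption that predict receives the target explicitly are all illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestRegressor


class RIFSketch:
    """Simplified illustration of fit-time caching and predict-time branching."""

    def fit(self, X, y):
        self.rf_ = RandomForestRegressor(oob_score=True, random_state=0).fit(X, y)
        # Leakage-free residuals on the training data (OOB variant).
        self.train_residuals_ = (np.asarray(y) - self.rf_.oob_prediction_).reshape(-1, 1)
        self.iforest_ = IsolationForest(random_state=0).fit(self.train_residuals_)
        self._train_key = id(X)  # the current identity-based check
        return self

    def predict(self, X, y=None):
        if id(X) == self._train_key:
            # Same object as passed to fit: reuse the cached residuals.
            residuals = self.train_residuals_
        else:
            # New data: residuals from the already fitted Random Forest.
            residuals = (np.asarray(y) - self.rf_.predict(X)).reshape(-1, 1)
        return self.iforest_.predict(residuals)  # +1 = inlier, -1 = anomaly
```

In the real estimator the residual computation is delegated to _residual_gen; the only point of the sketch is the branch inside predict.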

Currently, this distinction between training and prediction data is handled using id(X), which only checks whether the two calls received the very same Python object in memory. I also tried hashing the dataset content, but both approaches feel fragile and not robust in practice.
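For reference, the content-hash variant I experimented with looks roughly like this (the function name is made up; the real code may differ):

```python
import hashlib
import numpy as np


def dataset_fingerprint(X):
    """Content-based fingerprint of a dataset, as an alternative to id(X)."""
    arr = np.ascontiguousarray(np.asarray(X))
    # Include dtype and shape so arrays with identical bytes but different
    # layouts do not collide.
    meta = f"{arr.dtype}|{arr.shape}".encode()
    return hashlib.sha256(meta + arr.tobytes()).hexdigest()
```

The idea is to store the fingerprint of the training data at fit time and have predict recompute it on its input and compare the two.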

I’m looking for a better solution: either one that improves the logic for comparing the two datasets, or a different approach that achieves the same goal more reliably.

Any help or suggestions would be greatly appreciated.

Best regards,
Giulio
