Isolation forest

From Wikipedia, the free encyclopedia

Algorithm for anomaly detection

Isolation Forest is an algorithm for data anomaly detection using binary trees. It was developed by Fei Tony Liu in 2008.^[1] It has linear time complexity and low memory use, which makes it well suited to high-volume data.^[2]^[3] It is based on the assumption that, because anomalies are few and different from other data, they can be isolated using few partitions. Like decision tree algorithms, it does not perform density estimation. Unlike decision tree algorithms, it uses only path length to output an anomaly score, and does not use leaf node statistics of class distribution or target value.

Isolation Forest is fast because it splits the data space by randomly selecting an attribute and then a split point. The anomaly score is inversely related to path length: because anomalies are few and different, they need fewer splits to be isolated.
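The relationship between random splits, path length and anomaly score can be illustrated with a minimal sketch. It is not the reference implementation: it isolates a point directly within the full dataset rather than building trees on subsamples, and the helper names (`path_length`, `mean_path_length`, `average_bst_depth`, `anomaly_score`) and the toy data are assumptions made for illustration. The normalisation follows the form commonly given for the original algorithm, s = 2^(-E[h]/c(n)).

```python
import math
import random


def path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Randomly select an attribute, then a split point between its min and max.
    attr = rng.randrange(len(point))
    lo = min(row[attr] for row in data)
    hi = max(row[attr] for row in data)
    if lo == hi:
        # Simplification: stop if the chosen attribute cannot be split further.
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains `point`.
    if point[attr] < split:
        side = [row for row in data if row[attr] < split]
    else:
        side = [row for row in data if row[attr] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    """Average the path length over many independent random partitionings."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


def average_bst_depth(n):
    """c(n): expected path length of an unsuccessful binary search tree lookup."""
    if n < 2:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(point, data, n_trees=200, seed=0):
    """Shorter average paths give scores closer to 1 (assumes len(data) >= 2)."""
    expected = mean_path_length(point, data, n_trees, seed)
    return 2.0 ** (-expected / average_bst_depth(len(data)))


# Toy data (hypothetical): a Gaussian cluster plus one far-away point.
gen = random.Random(42)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(64)]
inlier, outlier = (0.0, 0.0), (8.0, 8.0)
data += [inlier, outlier]

inlier_score = anomaly_score(inlier, data)
outlier_score = anomaly_score(outlier, data)
print(f"inlier score:  {inlier_score:.2f}")
print(f"outlier score: {outlier_score:.2f}")
```

In this sketch the far-away point is usually separated after only a few splits, so its average path is shorter and its score is higher than that of the point at the centre of the cluster.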

Parameter	Small Datasets	Large Datasets	High-Dimensional Data	Imbalanced Datasets
Number of Trees `n_estimators`	Use fewer trees to save on computation.^[13]	More trees improve performance, but are costly.^[14]	Requires more trees to capture complexity.^[15]	Adjust based on dataset size.^[15]
Subsample Size `max_samples`	Smaller subsample reduces cost.^[13]	Larger subsample increases accuracy.^[14]	Dimensionality reduction can optimize subsample size.^[15]	Smaller subsample for efficiency.^[15]
Contamination Factor `contamination`	Tune based on domain knowledge.^[13]	Cross-validation for tuning.^[14]	Careful tuning to avoid misclassification.^[15]	Lower contamination helps avoid bias.^[15]
Maximum Features `max_features`	Use all features unless limited by computation.^[13]	Logarithmic or √n scaling for large datasets.^[14]	Select most informative features.^[15]	Balance feature selection to avoid overfitting.^[15]
Tree Depth `max_depth`	Moderate depth to prevent overfitting.^[13]	Shallower depth to save on computation.^[14]	Deeper trees to capture data complexity.^[15]	Adjust to balance overfitting.^[15]
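
As a rough illustration of how the parameters in the table map onto code, the following sketch configures scikit-learn's `IsolationForest` on synthetic data. The dataset, the chosen values and the variable names are assumptions for demonstration only, not tuning recommendations. The scikit-learn constructor does not appear to expose a `max_depth` argument, so the sketch leaves tree depth to the library.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (hypothetical): 200 inliers around the origin, 4 distant points.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0]])
X = np.vstack([inliers, outliers])

model = IsolationForest(
    n_estimators=100,     # number of trees
    max_samples="auto",   # subsample size per tree
    contamination=0.02,   # assumed fraction of anomalies in the data
    max_features=1.0,     # fraction of features drawn for each tree
    random_state=0,       # fixed seed for repeatable results
)

labels = model.fit_predict(X)    # -1 marks predicted anomalies, 1 marks inliers
scores = model.score_samples(X)  # lower values indicate more anomalous points

print("points flagged as anomalies:", int((labels == -1).sum()))
```

Here `contamination` is set close to the known share of injected outliers; on real data that share is rarely known, which is why the table suggests domain knowledge or cross-validation.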

History

Isolation trees

Anomaly detection

Anomaly score

Application of isolation forest for credit card fraud detection (anomaly)

Dataset and preprocessing

Model training and hyperparameter tuning

Results and evaluation

Visualizing results

Key details

2. Precision-Recall Curve

Strengths of isolation forest

Challenges

Future directions

Conclusion

Properties

Parameter selection

SCiForest

Steps in SCiForest implementation

1. Subspace selection

2. Isolation tree construction

3. Anomaly scoring

4. Thresholding

SCiForest implementation flowchart

Extended isolation forest

Improvements in extended isolation forest

Open source implementations

Python implementation with Scikit-learn

See also

References