![]() | A major contributor to this article appears to have aclose connection with its subject. It may require cleanup to comply with Wikipedia's content policies, particularlyneutral point of view. Please discuss further on thetalk page.(October 2023) (Learn how and when to remove this message) |
Isolation Forest is an algorithm for dataanomaly detection usingbinary trees. It was developed by Fei Tony Liu in 2008.[1] It has a linear time complexity and a low memory use, which works well for high-volume data.[2][3] It is based on the assumption that because anomalies are few and different from other data, they can be isolated using few partitions. Like decision tree algorithms, it does not perform density estimation. Unlike decision tree algorithms, it uses only path length to output an anomaly score, and does not use leaf node statistics of class distribution or target value.
Isolation Forest is fast because it splits the data space, randomly selecting an attribute and split point. The anomaly score is inversely associated with the path-length because anomalies need fewer splits to be isolated, because they are few and different.
The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting andZhi-Hua Zhou in 2008.[2] In 2012 the same authors showed that iForest has linear time complexity, a small memory requirement, and is applicable to high-dimensional data.[3] In 2010, an extension of the algorithm, SCiforest, was published to address clustered and axis-paralleled anomalies.[4]
The premise of the Isolation Forest algorithm is that anomalous data points are easier to separate from the rest of the sample. In order to isolate a data point, the algorithm recursively generates partitions on the sample by randomly selecting an attribute and then randomly selecting a split value between the minimum and maximum values allowed for that attribute.
An example of random partitioning in a 2D dataset ofnormally distributed points is shown in the first figure for a non-anomalous point and in the second one for a point that is more likely to be an anomaly. It is apparent from the pictures how anomalies require fewer random partitions to be isolated, compared to normal points.
Recursive partitioning can be represented by a tree structure namedIsolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root. For example, the path length of point in the first figure is greater than the path length of in the second figure.
Let be a set of d-dimensional points and. An Isolation Tree (iTree) is defined as a data structure with the following properties:
In order to build an iTree, the algorithm recursively divides by randomly selecting an attribute and a split value, until either
When the iTree is fully grown, each point in is isolated at one of the external nodes. Intuitively, the anomalous points are those (easier to isolate, hence) with the smallerpath length in the tree, where the path length of point is defined as the number of edges traverses from the root node to get to an external node.
A probabilistic explanation of iTree is provided in the original iForest paper.[2]
Anomaly detection with Isolation Forest is done as follows:[4]
The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that ofBinary Search Trees (BST): a termination to an external node of the iTree corresponds to an unsuccessful search in the BST.[4] Therefore, the estimation of average for external node terminations is the same as that of the unsuccessful searches in BST, that is[5]
where is the test set size, is the sample set size and is the harmonic number, which can be estimated by, where is theEuler-Mascheroni constant.
Above, is the average given, so we can use it to normalize to get an estimate of the anomaly score for a given instance x:
where is the average value of from a collection of iTrees. For any data point:
The Isolation Forest algorithm has shown its effectiveness in spotting anomalies in data sets like uncovering credit card fraud instances among transactions, by European cardholders with an unbalanced dataset where it can distinguish fraudulent activities from legitimate ones by identifying rare patterns that show notable differences.[6]
In this research projects dataset, there are 284807 transactions recorded in total out of which only 492 are identified as fraudulent (0.172%). Due to this imbalance between authentic and fraudulent transactions detection of fraud becomes quite demanding; hence specialized metrics such as the Area Under the Precision Recall Curve (AUPRC) are essential for accurate evaluation rather, than relying solely on traditional accuracy measures.[6]
The dataset consists of PCA transformed features (from V1, to V28) well as the Time (time elapsed since the initial transaction) and Amount (transaction value). We processed the dataset using the steps:
Scaling : The Time and Amount features by utilizing StandardScaler to standardize their input range.[7]
Imputation: Missing data in the dataset were filled in by using the average of the corresponding columns, with SimpleImputer.[7]
Feature Selection : To enhance the model's effectiveness and accuracy in predictions and analysis tasks was to choose features with the kurtosis value for further examination because these particular features tend to have the most significant outliers that could potentially signal irregularities or anomalies within the data set used for modeling purposes. For training purposes specifically, a selection of the 10 features were identified and prioritized as key components, in refining the model's capabilities and enhancing its overall performance.[8]
The Isolation Forest model was specifically trained on transactions (Class=0) focusing on recognizing common behavioral patterns in data analysis tasks. The algorithm separates out instances by measuring the distance needed to isolate them within a collection of randomly divided trees.[6]
Hyperparameter Tuning:
A grid search was performed over the following hyperparameters
Contamination: Expected percentage of anomalies in the dataset, tested at values 0.01, 0.02, and 0.05[8]
Max Features: Number of features to sample for each tree, tested at values 5, 8, and 10.[8]
The best configuration was found with:
The model was evaluated on a separate test set using accuracy, precision, recall, and the Area Under the Precision-Recall Curve (AUPRC). Below are the key results:
While the accuracy seems impressive at glance it mainly showcases the prevalence of regular transactions within the dataset. Precision and recall emphasize the challenges in detecting fraud because of the significant imbalance present. In assessing both precision and recall the AUPRC offers an evaluation by considering the balance between precision and recall.[6]
1. Scatter Plot of Detected Anomalies
Observation: The plot shows that many fraudulent transactions (red points) are located on the edges or far from the central cluster of normal transactions (blue points). However, some red points overlap with the blue cluster, indicating potential false positives or challenging cases for the model.
Key Details:
Observation:
The Isolation Forest algorithm provides a robust solution for anomaly detection, particularly in domains like fraud detection where anomalies are rare and challenging to identify. However, its reliance on hyperparameters and sensitivity to imbalanced data necessitate careful tuning and complementary techniques for optimal results.[6][8]
The performance of the Isolation Forest algorithm is highly dependent on the selection of its parameters. Properly tuning these parameters can significantly enhance the algorithm's ability to accurately identify anomalies. Understanding the role and impact of each parameter is crucial for optimizing the model's performance.[11]
The Isolation Forest algorithm involves several key parameters that influence its behavior and effectiveness. These parameters control various aspects of the tree construction process, the size of the sub-samples, and the thresholds for identifying anomalies.[11] Selecting appropriate parameters is the key to the performance of the Isolation Forest algorithm. Each of the parameters influences anomaly detection differently. Key parameters include:
Number of Trees : This parameter determines the number of trees in the Isolation Forest. A higher number of trees improves anomaly detection accuracy but increases computational costs. The optimal number balances resource availability with performance needs. For example, a smaller dataset might require fewer trees to save on computation, while larger datasets benefit from additional trees to capture more complexity.[2]
Subsample Size : The subsample size dictates the number of data points used to construct each tree. Smaller subsample sizes reduce computational complexity but may capture less variability in the data. For instance, a subsample size of 256 is commonly used, but the optimal value depends on dataset characteristics.[2]
Contamination Factor : This parameter estimates the proportion of outliers in the dataset. Higher contamination values flag more data points as anomalies, which can lead to false positives. Tuning this parameter carefully based on domain knowledge or cross-validation is critical to avoid bias or misclassification.[3]
Maximum Features: This parameter specifies the number of random features to consider for each split in the tree. Limiting the number of features increases randomness, making the model more robust. However, in high-dimensional datasets, selecting only the most informative features prevents overfitting and improves generalization.[2][3]
Tree Depth : Tree depth determines the maximum number of splits for a tree. Deeper trees better capture data complexity but risk overfitting, especially in small datasets. Shallow trees, on the other hand, improve computational efficiency.[3]
The table below summarizes parameter selection strategies based on dataset characteristics.
Parameter | Small Datasets | Large Datasets | High-Dimensional Data | Imbalanced Datasets |
---|---|---|---|---|
Number of Treesn_estimators | Use fewer trees to save on computation.[13] | More trees improve performance, but are costly.[14] | Requires more trees to capture complexity.[15] | Adjust based on dataset size.[15] |
Subsample Sizemax_samples | Smaller subsample reduces cost.[13] | Larger subsample increases accuracy.[14] | Dimensionality reduction can optimize subsample size.[15] | Smaller subsample for efficiency.[15] |
Contamination Factorcontamination | Tune based on domain knowledge.[13] | Cross-validation for tuning.[14] | Careful tuning to avoid misclassification.[15] | Lower contamination helps avoid bias.[15] |
Maximum Featuresmax_features | Use all features unless limited by computation.[13] | Logarithmic or √n scaling for large datasets.[14] | Select most informative features.[15] | Balance feature selection to avoid overfitting.[15] |
Tree Depthmax_depth | Moderate depth to prevent overfitting.[13] | Shallower depth to save on computation.[14] | Deeper trees to capture data complexity.[15] | Adjust to balance overfitting.[15] |
Benefits of Proper Parameter Tuning:
Improved Accuracy: Fine-tuning parameters helps the algorithm better distinguish between normal data and anomalies, reducing false positives and negatives.[11]
Computational Efficiency: Selecting appropriate values for parameters like the number of trees and sub-sample size makes the algorithm more efficient without sacrificing accuracy.[11]
Generalization: Limiting tree depth and using bootstrap sampling helps the model generalize better to new data, reducing overfitting.[11]
SCiForest (Isolation Forest with Split-selection Criterion) is an extension of the original Isolation Forest algorithm, specifically designed to target clustered anomalies. It introduces a split-selection criterion and uses random hyper-planes that are non-axis-parallel to the original attributes. SCiForest does not require an optimal hyper-plane at every node; instead, it generates multiple random hyper-planes, and through sufficient trials, a good-enough hyper-plane is selected. This approach makes the resulting model highly effective due to the aggregate power of the ensemble learner.[4]
The implementation of SciForest involves four primary steps, each tailored to improve anomaly detection by isolating clustered anomalies more effectively than standard Isolation Forest methods.
Using techniques like KMeans or hierarchical clustering, SciForest organizes features into clusters to identify meaningful subsets. By sampling random subspaces, SciForest emphasizes meaningful feature groups, reducing noise and improving focus. This reduces the impact of irrelevant or noisy dimensions.[4]
Within each selected subspace, isolation trees are constructed. These trees isolate points through random recursive splitting:
Anomalous points, being sparse or distinct, are isolated more quickly (shorter path lengths) compared to normal points.[2]
For each data point, the isolation depth () is calculated for all trees across all subspaces. The anomaly score for a data point is defined as:
Where:
Points with lower average path lengths () are more likely to be anomalies.[3]
The final anomaly scores are compared against a predefined threshold to classify data points. If, the point is classified as anomalous; otherwise, it is normal. The anomaly score threshold, θ, can be tailored to specific applications to control the proportion of identified anomalies.[6]
These steps collectively enable SciForest to adapt to varied data distributions while maintaining efficiency in anomaly detection.
This flowchart visually represents the step-by-step process of SCiForest implementation, from inputting high-dimensional datasets to detecting anomalies. Each step is highlighted with its key functionality, providing a clear overview of the methodology.
Extended Isolation Forest (Extended IF or EIF) is another extension of the original Isolation Forest algorithm. Extended IF uses rotated trees in different planes, similarly to SCiForest and random values are selected to split the data, such as a random slope or intercept.
The standard Isolation Forest requires two pieces of information, those being 1) a random feature or coordinate, and 2) a random value for the feature from the range of available values in the data. The Extended IF also requires only two pieces of information, this time being 1) a random slope for the branch cut, and 2) a random intercept for the branch cut which is chosen from the range of available values of the training data. This makes the Extended IF simpler than usingrotation trees.[16]
The figure depicts a score map of a regular Isolation Forest in comparison to an Extended Isolation Forest for a sinusoidal-shaped data-set. This image allows us to clearly observe the improvement made by the Extended Isolation Forest in evaluating the scores much more accurately when compared to the shape of the data. While the regular isolation forest fails in capturing the sinusoid shape of the data and properly evaluating the anomaly scores. The regular Isolation Forest shapes the anomaly scores into a rectangular shape and simply assumes that any region nearby the sinusoid data point is not to be anomalous. In comparison, the EIF is more accurate in evaluating anomaly scores with more detail and unlike its predecessor, the EIF is able to detect anomalies that are close to the sinusoid shape of the data but are still anomalous. The original EIF publication includes also this comparison with a single-blob-shaped data-set and a two-blob-shaped data-set, also comparing the EIF results to isolation forest using rotation trees.[16]
The Extended Isolation Forest enhances the traditional Isolation Forest algorithm by addressing some of its limitations, particularly in handling high-dimensional data and improving anomaly detection accuracy. Key improvements in EIF include:
Enhanced Splitting Mechanism: Unlike traditional Isolation Forest, which uses random axis-aligned splits, EIF uses hyperplanes for splitting data. This approach allows for more flexible and accurate partitioning of the data space, which is especially useful in high-dimensional datasets.
Improved Anomaly Scoring: EIF refines the anomaly scoring process by considering the distance of data points from the hyperplane used in splitting. This provides a more granular and precise anomaly score, leading to better differentiation between normal and anomalous points.
Handling of High-Dimensional Data: The use of hyperplanes also improves EIF's performance in high-dimensional spaces. Traditional Isolation Forest can suffer from the curse of dimensionality in such scenarios, but EIF mitigates this issue by creating more meaningful and informative partitions in the data space.[17]
Original implementation by Fei Tony Liu isIsolation Forest inR.
Other implementations (in alphabetical order):
Other variations of Isolation Forest algorithm implementations:
The isolation forest algorithm is commonly used bydata scientists through the version made available in thescikit-learn library. The snippet below depicts a brief implementation of an isolation forest, with direct explanations with comments.
importpandasaspdfromsklearn.ensembleimportIsolationForest# Consider 'data.csv' is a file containing samples as rows and features as column, and a column labeled 'Class' with a binary classification of your samples.df=pd.read_csv("data.csv")X=df.drop(columns=["Class"])y=df["Class"]# Determine how many samples will be outliers based on the classificationoutlier_fraction=len(df[df["Class"]==1])/float(len(df[df["Class"]==0]))# Create and fit model, parameters can be optimizedmodel=IsolationForest(n_estimators=100,contamination=outlier_fraction,random_state=42)model.fit(df)
In this snippet we can observe the simplicity of a standard implementation of the algorithm. The only requirement data that the user needs to adjust is the outlier fraction in which the user determines a percentage of the samples to be classifier as outliers. This can be commonly done by selection a group among the positive and negative samples according to a giving classification. Most of the other steps are pretty standard to anydecision tree based technique made available through scikit-learn, in which the user simply needs to split the target variable from the features and fit the model after it is defined with a giving number of estimators (or trees).
This snippet is a shortened adapted version of an implementation explored by GeeksforGeeks, which can be accessed for further explorations.[20]
{{cite book}}
:|journal=
ignored (help)