2.7. Novelty and Outlier Detection
Many applications require being able to decide whether a new observation belongs to the same distribution as existing observations (it is an inlier), or should be considered as different (it is an outlier). Often, this ability is used to clean real data sets. Two important distinctions must be made:
- outlier detection:
The training data contains outliers, which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.
- novelty detection:
The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.
Outlier detection and novelty detection are both used for anomaly detection, where one is interested in detecting abnormal or unusual observations. Outlier detection is then also known as unsupervised anomaly detection and novelty detection as semi-supervised anomaly detection. In the context of outlier detection, the outliers/anomalies cannot form a dense cluster as available estimators assume that the outliers/anomalies are located in low density regions. On the contrary, in the context of novelty detection, novelties/anomalies can form a dense cluster as long as they are in a low density region of the training data, considered as normal in this context.
The scikit-learn project provides a set of machine learning tools that can be used for both novelty and outlier detection. This strategy is implemented with objects learning in an unsupervised way from the data:
estimator.fit(X_train)
New observations can then be sorted as inliers or outliers with a predict method:
estimator.predict(X_test)
Inliers are labeled 1, while outliers are labeled -1. The predict method makes use of a threshold on the raw scoring function computed by the estimator. This scoring function is accessible through the score_samples method, while the threshold can be controlled by the contamination parameter.
The decision_function method is also defined from the scoring function, in such a way that negative values are outliers and non-negative ones are inliers:
estimator.decision_function(X_test)
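As a minimal sketch of the calls above, the snippet below uses ensemble.IsolationForest as the estimator. The synthetic data, the variable names and contamination=0.1 are illustrative assumptions, not values taken from this guide.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))            # illustrative training data
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])    # one central point, one distant point

# contamination controls where the threshold on the raw score is placed
estimator = IsolationForest(contamination=0.1, random_state=0)
estimator.fit(X_train)

labels = estimator.predict(X_test)             # 1 for inliers, -1 for outliers
raw_scores = estimator.score_samples(X_test)   # raw scoring function (lower = more abnormal)
shifted = estimator.decision_function(X_test)  # negative values are outliers
```

For this estimator, decision_function is score_samples shifted by the fitted offset_ attribute, which is why its sign agrees with the predicted label.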
Note that neighbors.LocalOutlierFactor does not support predict, decision_function and score_samples methods by default but only a fit_predict method, as this estimator was originally meant to be applied for outlier detection. The scores of abnormality of the training samples are accessible through the negative_outlier_factor_ attribute.
If you really want to use neighbors.LocalOutlierFactor for novelty detection, i.e. predict labels or compute the score of abnormality of new unseen data, you can instantiate the estimator with the novelty parameter set to True before fitting the estimator. In this case, fit_predict is not available.
Warning
Novelty detection with Local Outlier Factor
When novelty is set to True be aware that you must only use predict, decision_function and score_samples on new unseen data and not on the training samples as this would lead to wrong results. I.e., the result of predict will not be the same as fit_predict. The scores of abnormality of the training samples are always accessible through the negative_outlier_factor_ attribute.
The behavior of neighbors.LocalOutlierFactor is summarized in the following table.
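| Method | Outlier detection | Novelty detection |
|---|---|---|
| fit_predict | OK | Not available |
| predict | Not available | Use only on new data |
| decision_function | Not available | Use only on new data |
| score_samples | Use negative_outlier_factor_ | Use only on new data |
| negative_outlier_factor_ | OK | OK |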
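The two modes can be checked directly, as in the sketch below. The random data and variable names are illustrative assumptions; hasattr is used only to show which methods each mode exposes.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 2))   # illustrative data
X_new = rng.normal(size=(10, 2))

# Outlier detection (default, novelty=False): label the training data itself
lof_outlier = LocalOutlierFactor(n_neighbors=20)
train_labels = lof_outlier.fit_predict(X_train)
train_scores = lof_outlier.negative_outlier_factor_
print(hasattr(lof_outlier, "predict"))       # predict is not exposed in this mode

# Novelty detection (novelty=True): fit first, then score unseen data only
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
new_labels = lof_novelty.predict(X_new)
print(hasattr(lof_novelty, "fit_predict"))   # fit_predict is not exposed in this mode
```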
2.7.1. Overview of outlier detection methods
A comparison of the outlier detection algorithms in scikit-learn. Local Outlier Factor (LOF) does not show a decision boundary in black as it has no predict method to be applied on new data when it is used for outlier detection.

ensemble.IsolationForest and neighbors.LocalOutlierFactor perform reasonably well on the data sets considered here. The svm.OneClassSVM is known to be sensitive to outliers and thus does not perform very well for outlier detection. That being said, outlier detection in high dimension, or without any assumptions on the distribution of the inlying data, is very challenging. svm.OneClassSVM may still be used for outlier detection but requires fine-tuning of its hyperparameter nu to handle outliers and prevent overfitting. linear_model.SGDOneClassSVM provides an implementation of a linear One-Class SVM with a linear complexity in the number of samples. This implementation is used here with a kernel approximation technique to obtain results similar to svm.OneClassSVM, which uses a Gaussian kernel by default. Finally, covariance.EllipticEnvelope assumes the data is Gaussian and learns an ellipse. For more details on the different estimators, refer to the example Comparing anomaly detection algorithms for outlier detection on toy datasets and the sections hereunder.
Examples
See Comparing anomaly detection algorithms for outlier detection on toy datasets for a comparison of the svm.OneClassSVM, the ensemble.IsolationForest, the neighbors.LocalOutlierFactor and covariance.EllipticEnvelope.
See Evaluation of outlier detection estimators for an example showing how to evaluate outlier detection estimators, the neighbors.LocalOutlierFactor and the ensemble.IsolationForest, using ROC curves from metrics.RocCurveDisplay.
2.7.2. Novelty Detection
Consider a data set of \(n\) observations from the same distribution described by \(p\) features. Consider now that we add one more observation to that data set. Is the new observation so different from the others that we can doubt it is regular (i.e. does it come from the same distribution)? Or on the contrary, is it so similar to the others that we cannot distinguish it from the original observations? This is the question addressed by the novelty detection tools and methods.
In general, the goal is to learn a rough, close frontier delimiting the contour of the distribution of the initial observations, plotted in the embedding \(p\)-dimensional space. Then, if further observations lie within the frontier-delimited subspace, they are considered as coming from the same population as the initial observations. Otherwise, if they lie outside the frontier, we can say that they are abnormal with a given confidence in our assessment.
The One-Class SVM was introduced by Schölkopf et al. for that purpose and is implemented in the Support Vector Machines module in the svm.OneClassSVM object. It requires the choice of a kernel and a scalar parameter to define a frontier. The RBF kernel is usually chosen, although there exists no exact formula or algorithm to set its bandwidth parameter. This is the default in the scikit-learn implementation. The nu parameter, also known as the margin of the One-Class SVM, corresponds to the probability of finding a new, but regular, observation outside the frontier.
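A minimal sketch of this workflow follows. The synthetic training data and the values gamma=2.0 and nu=0.05 are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(200, 2)              # clean, "regular" observations
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])     # a typical point and a distant one

# kernel="rbf" is the default; gamma is the RBF bandwidth parameter
clf = OneClassSVM(kernel="rbf", gamma=2.0, nu=0.05)
clf.fit(X_train)

new_labels = clf.predict(X_new)                # 1 = inside the frontier, -1 = outside
train_outside = np.mean(clf.predict(X_train) == -1)  # share of training points outside
```

With a small nu, only a small share of the training observations should fall outside the learned frontier, while an observation far from the training data is expected to be labeled -1.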
References
Schölkopf, Bernhard, et al. “Estimating the support of a high-dimensional distribution.” Neural Computation 13.7 (2001): 1443-1471.
Examples
See One-class SVM with non-linear kernel (RBF) for visualizing the frontier learned around some data by a svm.OneClassSVM object.

2.7.2.1. Scaling up the One-Class SVM
An online linear version of the One-Class SVM is implemented in linear_model.SGDOneClassSVM. This implementation scales linearly with the number of samples and can be used with a kernel approximation to approximate the solution of a kernelized svm.OneClassSVM whose complexity is at best quadratic in the number of samples. See section Online One-Class SVM for more details.
Examples
See One-Class SVM versus One-Class SVM using Stochastic Gradient Descent for an illustration of the approximation of a kernelized One-Class SVM with the linear_model.SGDOneClassSVM combined with kernel approximation.
2.7.3. Outlier Detection
Outlier detection is similar to novelty detection in the sense that the goal is to separate a core of regular observations from some polluting ones, called outliers. Yet, in the case of outlier detection, we don’t have a clean data set representing the population of regular observations that can be used to train any tool.
2.7.3.1. Fitting an elliptic envelope
One common way of performing outlier detection is to assume that the regular data come from a known distribution (e.g. data are Gaussian distributed). From this assumption, we generally try to define the “shape” of the data, and can define outlying observations as observations which stand far enough from the fit shape.
scikit-learn provides an object, covariance.EllipticEnvelope, that fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points, ignoring points outside the central mode.
For instance, assuming that the inlier data are Gaussian distributed, it will estimate the inlier location and covariance in a robust way (i.e. without being influenced by outliers). The Mahalanobis distances obtained from this estimate are used to derive a measure of outlyingness. This strategy is illustrated below.
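The sketch below fits an elliptic envelope to a Gaussian sample polluted with uniformly drawn points. The data, the 10% contamination level and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
inliers = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=180)
outliers = rng.uniform(low=-8, high=8, size=(20, 2))
X = np.vstack([inliers, outliers])             # 10% contamination by construction

env = EllipticEnvelope(contamination=0.1, random_state=0).fit(X)

labels = env.predict(X)                        # 1 = inlier, -1 = outlier
sq_dist = env.mahalanobis(X)                   # squared Mahalanobis distances
flagged = np.mean(labels == -1)                # share of training points flagged
```

The flagged observations are those with the largest Mahalanobis distances under the robust location and covariance estimates, and the share flagged on the training data follows the contamination setting.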

Examples
See Robust covariance estimation and Mahalanobis distances relevance for an illustration of the difference between using a standard (covariance.EmpiricalCovariance) or a robust estimate (covariance.MinCovDet) of location and covariance to assess the degree of outlyingness of an observation.
See Outlier detection on a real data set for an example of robust covariance estimation on a real data set.
References
Rousseeuw, P.J., Van Driessen, K. “A fast algorithm for the minimum covariance determinant estimator” Technometrics 41(3), 212 (1999)
2.7.3.2. Isolation Forest
One efficient way of performing outlier detection in high-dimensional datasets is to use random forests. The ensemble.IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.
This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
The implementation of ensemble.IsolationForest is based on an ensemble of tree.ExtraTreeRegressor. Following the original Isolation Forest paper, the maximum depth of each tree is set to \(\lceil \log_2(n) \rceil\) where \(n\) is the number of samples used to build the tree (see (Liu et al., 2008) for more details).
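To make the link between isolation and score concrete, the sketch below fits a forest on a tiny data set in which one point, [-20, 50], lies far from the rest. The data set, n_estimators=100 and random_state=0 are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[-1, -1], [-2, -1], [-3, -2], [0, 0], [-20, 50], [3, 5]])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

scores = forest.score_samples(X)     # lower score = shorter average path = more abnormal
most_isolated = X[np.argmin(scores)]
print(most_isolated)
```

The distant point is usually separated from the others after very few random splits, so it receives the lowest score.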
This algorithm is illustrated below.

The ensemble.IsolationForest supports warm_start=True which allows you to add more trees to an already fitted model:
>>> from sklearn.ensemble import IsolationForest
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [0, 0], [-20, 50], [3, 5]])
>>> clf = IsolationForest(n_estimators=10, warm_start=True)
>>> clf.fit(X)  # fit 10 trees
>>> clf.set_params(n_estimators=20)  # add 10 more trees
>>> clf.fit(X)  # fit the added trees
Examples
See IsolationForest example for an illustration of the use of IsolationForest.
See Comparing anomaly detection algorithms for outlier detection on toy datasets for a comparison of ensemble.IsolationForest with neighbors.LocalOutlierFactor, svm.OneClassSVM (tuned to perform like an outlier detection method), linear_model.SGDOneClassSVM, and a covariance-based outlier detection with covariance.EllipticEnvelope.
References
Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation forest.” Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on.
2.7.3.3. Local Outlier Factor
Another efficient way to perform outlier detection on moderately high dimensional datasets is to use the Local Outlier Factor (LOF) algorithm.
The neighbors.LocalOutlierFactor (LOF) algorithm computes a score (called local outlier factor) reflecting the degree of abnormality of the observations. It measures the local density deviation of a given data point with respect to its neighbors. The idea is to detect the samples that have a substantially lower density than their neighbors.
In practice the local density is obtained from the k-nearest neighbors. The LOF score of an observation is equal to the ratio of the average local density of its k-nearest neighbors, and its own local density: a normal instance is expected to have a local density similar to that of its neighbors, while abnormal data are expected to have much smaller local density.
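The ratio described above can be reproduced by hand, as in the sketch below. It assumes the local density is estimated as the inverse of the mean reachability distance, following Breunig et al.; that detail, the synthetic data and the variable names are not spelled out in this section and should be read as assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])  # last row is a planted outlier
k = 20

# k nearest neighbors of every training point (the point itself is excluded)
dist, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors()

k_dist = dist[:, -1]                          # distance to the k-th neighbor
reach_dist = np.maximum(dist, k_dist[idx])    # reachability distance to each neighbor
lrd = 1.0 / reach_dist.mean(axis=1)           # local reachability density
lof_manual = lrd[idx].mean(axis=1) / lrd      # neighbors' average density / own density

# The estimator stores the opposite of the LOF for the training samples
lof_sklearn = -LocalOutlierFactor(n_neighbors=k).fit(X).negative_outlier_factor_
```

Values close to 1 indicate a density similar to that of the neighbors, while the planted outlier gets a clearly larger factor.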
The number k of neighbors considered (alias parameter n_neighbors) is typically chosen 1) greater than the minimum number of objects a cluster has to contain, so that other objects can be local outliers relative to this cluster, and 2) smaller than the maximum number of close by objects that can potentially be local outliers. In practice, such information is generally not available, and taking n_neighbors=20 appears to work well in general. When the proportion of outliers is high (i.e. greater than 10 %, as in the example below), n_neighbors should be greater (n_neighbors=35 in the example below).
The strength of the LOF algorithm is that it takes both local and global properties of datasets into consideration: it can perform well even in datasets where abnormal samples have different underlying densities. The question is not how isolated the sample is, but how isolated it is with respect to the surrounding neighborhood.
When applying LOF for outlier detection, there are no predict, decision_function and score_samples methods but only a fit_predict method. The scores of abnormality of the training samples are accessible through the negative_outlier_factor_ attribute. Note that predict, decision_function and score_samples can be used on new unseen data when LOF is applied for novelty detection, i.e. when the novelty parameter is set to True, but the result of predict may differ from that of fit_predict. See Novelty detection with Local Outlier Factor.
This strategy is illustrated below.

Examples
See Outlier detection with Local Outlier Factor (LOF) for an illustration of the use of neighbors.LocalOutlierFactor.
See Comparing anomaly detection algorithms for outlier detection on toy datasets for a comparison with other anomaly detection methods.
References
Breunig, Kriegel, Ng, and Sander (2000) LOF: identifying density-based local outliers. Proc. ACM SIGMOD
2.7.4. Novelty detection with Local Outlier Factor
To use neighbors.LocalOutlierFactor for novelty detection, i.e. predict labels or compute the score of abnormality of new unseen data, you need to instantiate the estimator with the novelty parameter set to True before fitting the estimator:
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(novelty=True)
lof.fit(X_train)
Note that fit_predict is not available in this case to avoid inconsistencies.
Warning
Novelty detection with Local Outlier Factor
When novelty is set to True be aware that you must only use predict, decision_function and score_samples on new unseen data and not on the training samples as this would lead to wrong results. I.e., the result of predict will not be the same as fit_predict. The scores of abnormality of the training samples are always accessible through the negative_outlier_factor_ attribute.
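A short sketch of this usage pattern is given below. The training data is assumed to be free of outliers, and the two new points, the random seed and n_neighbors=20 are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))            # assumed free of outliers
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])     # unseen data only

lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train)

new_labels = lof.predict(X_new)                # 1 = inlier, -1 = novelty
new_scores = lof.score_samples(X_new)          # opposite of the LOF; lower = more abnormal
train_scores = lof.negative_outlier_factor_    # scores of the training samples
```

Scores for the training samples are read from negative_outlier_factor_ rather than by calling score_samples on X_train, in line with the warning above.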
Novelty detection with neighbors.LocalOutlierFactor is illustrated below (see Novelty detection with Local Outlier Factor (LOF)).

