Glossary of Common Terms and API Elements#

This glossary hopes to definitively represent the tacit and explicit conventions applied in Scikit-learn and its API, while providing a reference for users and contributors. It aims to describe the concepts and either detail their corresponding API or link to other relevant parts of the documentation which do so. By linking to glossary entries from the API Reference and User Guide, we may minimize redundancy and inconsistency.

We begin by listing general concepts (and any that didn’t fit elsewhere), but more specific sets of related terms are listed below: Class APIs and Estimator Types, Target Types, Methods, Parameters, Attributes, Data and sample properties.

General Concepts#

1d#
1d array#

One-dimensional array. A NumPy array whose .shape has length 1. A vector.

2d#
2d array#

Two-dimensional array. A NumPy array whose .shape has length 2. Often represents a matrix.
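As a small illustration of the two entries above (a sketch using only NumPy; the variable names are chosen for this example), the length of .shape is what separates a 1d array from a 2d array, and reshape can turn a vector into a single-column matrix:

```python
import numpy as np

vector = np.array([1.0, 2.0, 3.0])            # 1d array: shape (3,)
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2d array: shape (2, 2)

vector_ndim = len(vector.shape)   # length of the shape tuple of the vector
matrix_ndim = len(matrix.shape)   # length of the shape tuple of the matrix

# Reshape the vector into a 2d array with one column.
column = vector.reshape(-1, 1)
```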

API#

Refers to both the specific interfaces for estimators implemented in Scikit-learn and the generalized conventions across types of estimators as described in this glossary and overviewed in the contributor documentation.

The specific interfaces that constitute Scikit-learn’s public API are largely documented in API Reference. However, we less formally consider anything as public API if none of the identifiers required to access it begins with _. We generally try to maintain backwards compatibility for all objects in the public API.

Private API, including functions, modules and methods whose names begin with _, is not assured to be stable.

array-like#

The most common data format for input to Scikit-learn estimators and functions, array-like is any type of object for which numpy.asarray will produce an array of appropriate shape (usually 1 or 2-dimensional) of appropriate dtype (usually numeric).

This includes:

  • a numpy array

  • a list of numbers

  • a list of length-k lists of numbers for some fixed length k

  • a pandas.DataFrame with all columns numeric

  • a numeric pandas.Series

It excludes:

Note that output from scikit-learn estimators and functions (e.g. predictions) should generally be arrays or sparse matrices, or lists thereof (as in multi-output tree.DecisionTreeClassifier’s predict_proba). An estimator where predict() returns a list or a pandas.Series is not valid.
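The array-like notion can be checked directly with numpy.asarray. The following is a minimal sketch with made-up data: a flat list of numbers becomes a 1d array, and a list of equal-length lists becomes a 2d array.

```python
import numpy as np

# A list of numbers is array-like and converts to a 1d array.
as_1d = np.asarray([1, 2, 3])

# A list of length-k lists (here k = 2) converts to a 2d array.
as_2d = np.asarray([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Passing an existing array through asarray does not copy it.
same_object = np.asarray(as_2d) is as_2d
```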

attribute#
attributes#

We mostly use attribute to refer to how model information is stored on an estimator during fitting. Any public attribute stored on an estimator instance is required to begin with an alphabetic character and end in a single underscore if it is set in fit or partial_fit. These are what is documented under an estimator’s Attributes documentation. The information stored in attributes is usually either: sufficient statistics used for prediction or transformation; transductive outputs such as labels_ or embedding_; or diagnostic data, such as feature_importances_. Common attributes are listed below.

A public attribute may have the same name as a constructor parameter, with a _ appended. This is used to store a validated or estimated version of the user’s input. For example, decomposition.PCA is constructed with an n_components parameter. From this, together with other parameters and the data, PCA estimates the attribute n_components_.
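The PCA case can be sketched as follows, assuming random toy data of shape (20, 5) and the default solver. The constructor parameter keeps the value the user passed, while the fitted attribute holds the value actually used:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data, made up for this example: 20 samples, 5 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 5))

pca = PCA(n_components=None)  # None asks PCA to keep all components
pca.fit(X)

requested = pca.n_components    # the unmodified constructor parameter
estimated = pca.n_components_   # the value estimated during fit
```

Here the parameter stays None, and the fitted attribute is the number of components PCA settled on, which for this data is min(n_samples, n_features).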

Further private attributes used in prediction/transformation/etc. may also be set when fitting. These begin with a single underscore and are not assured to be stable for public access.

A public attribute on an estimator instance that does not end in an underscore should be the stored, unmodified value of an __init__ parameter of the same name. Because of this equivalence, these are documented under an estimator’s Parameters documentation.

backwards compatibility#

We generally try to maintain backward compatibility (i.e. interfaces and behaviors may be extended but not changed or removed) from release to release but this comes with some exceptions:

Public API only

The behavior of objects accessed through private identifiers (those beginning _) may be changed arbitrarily between versions.

As documented

We will generally assume that the users have adhered to the documented parameter types and ranges. If the documentation asks for a list and the user gives a tuple, we do not assure consistent behavior from version to version.

Deprecation

Behaviors may change following a deprecation period (usually two releases long). Warnings are issued using Python’s warnings module.

Keyword arguments

We may sometimes assume that all optional parameters (other than X and y to fit and similar methods) are passed as keyword arguments only and may be positionally reordered.

Bug fixes and enhancements

Bug fixes and, less often, enhancements may change the behavior of estimators, including the predictions of an estimator trained on the same data and random_state. When this happens, we attempt to note it clearly in the changelog.

Serialization

We make no assurances that pickling an estimator in one version will allow it to be unpickled to an equivalent model in the subsequent version. (For estimators in the sklearn package, we issue a warning when this unpickling is attempted, even if it may happen to work.) See Security & Maintainability Limitations.

utils.estimator_checks.check_estimator

We provide limited backwards compatibility assurances for the estimator checks: we may add extra requirements on estimators tested with this function, usually when these were informally assumed but not formally tested.

Despite this informal contract with our users, the software is provided as is, as stated in the license. When a release inadvertently introduces changes that are not backward compatible, these are known as software regressions.

callable#

A function, class or an object which implements the __call__ method; anything for which callable() returns True.
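A short sketch of this definition, using a hypothetical Doubler class invented for the example: functions, classes and instances that define __call__ all pass the built-in callable() check, while a plain number does not.

```python
class Doubler:
    """Hypothetical helper whose instances can be called like a function."""

    def __call__(self, x):
        return 2 * x


function_is_callable = callable(len)          # built-in function
class_is_callable = callable(Doubler)         # calling a class constructs an instance
instance_is_callable = callable(Doubler())    # instance implements __call__
number_is_callable = callable(3)              # ints are not callable
doubled = Doubler()(21)
```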

categorical feature#

A categorical or nominal feature is one that has a finite set of discrete values across the population of data. These are commonly represented as columns of integers or strings. Strings will be rejected by most scikit-learn estimators, and integers will be treated as ordinal or count-valued. For the use with most estimators, categorical variables should be one-hot encoded. Notable exceptions include tree-based models such as random forests and gradient boosting models that often work better and faster with integer-coded categorical variables. OrdinalEncoder helps encoding string-valued categorical features as ordinal integers, and OneHotEncoder can be used to one-hot encode categorical features. See also Encoding categorical features and the categorical-encoding package for tools related to encoding categorical features.
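The two encoders mentioned above can be compared on a single string-valued column. This is a minimal sketch with made-up colour values; categories are sorted alphabetically by default, so "blue", "green" and "red" map to 0, 1 and 2.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One categorical feature with three distinct string values (toy data).
X = np.array([["red"], ["green"], ["red"], ["blue"]], dtype=object)

# Integer codes, one column per input feature.
ordinal = OrdinalEncoder().fit_transform(X)

# One binary column per category; the default output is sparse,
# so it is converted to a dense array here for inspection.
one_hot = OneHotEncoder().fit_transform(X).toarray()
```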

clone#
cloned#

To copy an estimator instance and create a new one with identical parameters, but without any fitted attributes, using clone.

When fit is called, a meta-estimator usually clones a wrapped estimator instance before fitting the cloned instance. (Exceptions, for legacy reasons, include Pipeline and FeatureUnion.)

If the estimator’s random_state parameter is an integer (or if the estimator doesn’t have a random_state parameter), an exact clone is returned: the clone and the original estimator will give the exact same results. Otherwise, a statistical clone is returned: the clone might yield different results from the original estimator. More details can be found in Controlling randomness.
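A minimal sketch of cloning, assuming a tiny made-up dataset: the clone carries over the constructor parameters but none of the fitted attributes of the original.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Toy data, made up for this example.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

original = LogisticRegression(C=0.5).fit(X, y)
copy = clone(original)

same_params = copy.get_params() == original.get_params()
original_is_fitted = hasattr(original, "coef_")
copy_is_fitted = hasattr(copy, "coef_")
```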

common tests#

This refers to the tests run on almost every estimator class in Scikit-learn to check they comply with basic API conventions. They are available for external use through utils.estimator_checks.check_estimator or utils.estimator_checks.parametrize_with_checks, with most of the implementation in sklearn/utils/estimator_checks.py.

Note: Some exceptions to the common testing regime are currently hard-coded into the library, but we hope to replace this by marking exceptional behaviours on the estimator using semantic estimator tags.

cross-fitting#
cross fitting#

A resampling method that iteratively partitions data into mutually exclusive subsets to fit two stages. During the first stage, the mutually exclusive subsets enable predictions or transformations to be computed on data not seen during training. The computed data is then used in the second stage. The objective is to avoid having any overfitting in the first stage introduce bias into the input data distribution of the second stage. For examples of its use, see: TargetEncoder, StackingClassifier, StackingRegressor and CalibratedClassifierCV.

cross-validation#
cross validation#

A resampling method that iteratively partitions data into mutually exclusive ‘train’ and ‘test’ subsets so model performance can be evaluated on unseen data. This conserves data as it avoids the need to hold out a ‘validation’ dataset and accounts for variability as multiple rounds of cross validation are generally performed. See User Guide for more details.

deprecation#

We use deprecation to slowly violate our backwards compatibility assurances, usually to:

  • change the default value of a parameter; or

  • remove a parameter, attribute, method, class, etc.

We will ordinarily issue a warning when a deprecated element is used, although there may be limitations to this. For instance, we will raise a warning when someone sets a parameter that has been deprecated, but may not when they access that parameter’s attribute on the estimator instance.

See the Contributors’ Guide.

dimensionality#

May be used to refer to the number of features (i.e. n_features), or columns in a 2d feature matrix. Dimensions are, however, also used to refer to the length of a NumPy array’s shape, distinguishing a 1d array from a 2d matrix.

docstring#

The embedded documentation for a module, class, function, etc., usually in code as a string at the beginning of the object’s definition, and accessible as the object’s __doc__ attribute.

We try to adhere to PEP257, and follow NumpyDoc conventions.

double underscore#
double underscore notation#

When specifying parameter names for nested estimators, __ may be used to separate between parent and child in some contexts. The most common use is when setting parameters through a meta-estimator with set_params and hence in specifying a search grid in parameter search. See parameter. It is also used in pipeline.Pipeline.fit for passing sample properties to the fit methods of estimators in the pipeline.
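The notation can be sketched with a two-step pipeline. The step names "scale" and "clf" are arbitrary labels chosen for this example; the part before __ names the step and the part after names that step's parameter.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# <step name>__<parameter name> reaches into the nested estimator.
pipe.set_params(clf__C=10.0)
nested_C = pipe.named_steps["clf"].C

# The same keys are used when describing a search grid for parameter search.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
```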

dtype#
data type#

NumPy arrays assume a homogeneous data type throughout, available in the .dtype attribute of an array (or sparse matrix). We generally assume simple data types for scikit-learn data: float or integer. We may support object or string data types for arrays before encoding or vectorizing. Our estimators do not work with struct arrays, for instance.

Our documentation can sometimes give information about the dtype precision, e.g. np.int32, np.int64, etc. When the precision is provided, it refers to the NumPy dtype. If an arbitrary precision is used, the documentation will refer to dtype integer or floating. Note that in this case, the precision can be platform dependent. The numeric dtype refers to accepting both integer and floating.

When it comes to choosing between 64-bit dtype (i.e. np.float64 and np.int64) and 32-bit dtype (i.e. np.float32 and np.int32), it boils down to a trade-off between efficiency and precision. The 64-bit types offer more accurate results due to their lower floating-point error, but demand more computational resources, resulting in slower operations and increased memory usage. In contrast, 32-bit types promise enhanced operation speed and reduced memory consumption, but introduce a larger floating-point error. The efficiency improvements are dependent on lower level optimization such as vectorization, single instruction, multiple data (SIMD) execution, or cache optimization but crucially on the compatibility of the algorithm in use.

Specifically, the choice of precision should account for whether the employed algorithm can effectively leverage np.float32. Some algorithms, especially certain minimization methods, are exclusively coded for np.float64, meaning that even if np.float32 is passed, it triggers an automatic conversion back to np.float64. This not only negates the intended computational savings but also introduces additional overhead, making operations with np.float32 unexpectedly slower and more memory-intensive due to this extra conversion step.
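The memory and precision sides of this trade-off can be inspected with NumPy alone. This sketch uses an arbitrary array of ones; it says nothing about how any particular estimator treats np.float32 input.

```python
import numpy as np

X64 = np.ones((1000, 10), dtype=np.float64)
X32 = X64.astype(np.float32)   # explicit down-cast creates a new array

# Memory footprint of the underlying buffers, in bytes.
bytes64 = X64.nbytes
bytes32 = X32.nbytes

# Machine epsilon: the gap between 1.0 and the next representable float.
eps64 = np.finfo(np.float64).eps
eps32 = np.finfo(np.float32).eps
```

The 32-bit copy takes half the memory, at the cost of a much larger machine epsilon.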

duck typing#

We try to apply duck typing to determine how to handle some input values (e.g. checking whether a given estimator is a classifier). That is, we avoid using isinstance where possible, and rely on the presence or absence of attributes to determine an object’s behaviour. Some nuance is required when following this approach:

  • For some estimators, an attribute may only be available once it is fitted. For instance, we cannot a priori determine if predict_proba is available in a grid search where the grid includes alternating between a probabilistic and a non-probabilistic predictor in the final step of the pipeline. In the following, we can only determine if clf is probabilistic after fitting it on some data:

    >>> from sklearn.model_selection import GridSearchCV
    >>> from sklearn.linear_model import SGDClassifier
    >>> clf = GridSearchCV(SGDClassifier(),
    ...                    param_grid={'loss': ['log_loss', 'hinge']})

    This means that we can only check for duck-typed attributes after fitting, and that we must be careful to make meta-estimators only present attributes according to the state of the underlying estimator after fitting.

  • Checking if an attribute is present (using hasattr) is in general just as expensive as getting the attribute (getattr or dot notation). In some cases, getting the attribute may indeed be expensive (e.g. for some implementations of feature_importances_, which may suggest this is an API design flaw). So code which does hasattr followed by getattr should be avoided; getattr within a try-except block is preferred.

  • For determining some aspects of an estimator’s expectations or support for some feature, we use estimator tags instead of duck typing.
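The preference for attribute access inside a try-except block can be sketched as follows. The helper name predict_proba_or_none is hypothetical and not part of scikit-learn; it accesses the attribute once rather than calling hasattr and then getattr.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC


def predict_proba_or_none(estimator, X):
    """Hypothetical helper: return class probabilities if the estimator
    exposes predict_proba, otherwise None."""
    try:
        # Single attribute access; an AttributeError means "not supported".
        predict_proba = estimator.predict_proba
    except AttributeError:
        return None
    return predict_proba(X)


# Toy data, made up for this example.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

proba = predict_proba_or_none(LogisticRegression().fit(X, y), X)
no_proba = predict_proba_or_none(LinearSVC(), X)  # LinearSVC has no predict_proba
```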

early stopping#

This consists in stopping an iterative optimization method before the convergence of the training loss, to avoid over-fitting. This is generally done by monitoring the generalization score on a validation set. When available, it is activated through the parameter early_stopping or by setting a positive n_iter_no_change.

estimator instance#

We sometimes use this terminology to distinguish an estimator class from a constructed instance. For example, in the following, cls is an estimator class, while est1 and est2 are instances:

    from sklearn.ensemble import RandomForestClassifier

    cls = RandomForestClassifier
    est1 = cls()
    est2 = RandomForestClassifier()

examples#

We try to give examples of basic usage for most functions and classes in the API:

  • as doctests in their docstrings (i.e. within the sklearn/ library code itself).

  • as examples in the example gallery rendered (using sphinx-gallery) from scripts in the examples/ directory, exemplifying key features or parameters of the estimator/function. These should also be referenced from the User Guide.

  • sometimes in the User Guide (built from doc/) alongside a technical description of the estimator.

experimental#

An experimental tool is already usable but its public API, such as default parameter values or fitted attributes, is still subject to change in future versions without the usual deprecation warning policy.

evaluation metric#
evaluation metrics#

Evaluation metrics give a measure of how well a model performs. We may use this term specifically to refer to the functions in metrics (disregarding pairwise), as distinct from the score method and the scoring API used in cross validation. See Metrics and scoring: quantifying the quality of predictions.

These functions usually accept a ground truth (or the raw data where the metric evaluates clustering without a ground truth) and a prediction, be it the output of predict (y_pred), of predict_proba (y_proba), or of an arbitrary score function including decision_function (y_score). Functions are usually named to end with _score if a greater score indicates a better model, and _loss if a lesser score indicates a better model. This diversity of interface motivates the scoring API.
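The _score and _loss naming convention can be seen on a pair of complementary metrics. This is a sketch with made-up labels in which one of four predictions is wrong:

```python
from sklearn.metrics import accuracy_score, zero_one_loss

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]   # the third prediction is wrong

# *_score: greater is better (fraction of correct predictions).
accuracy = accuracy_score(y_true, y_pred)

# *_loss: lesser is better (fraction of incorrect predictions).
error = zero_one_loss(y_true, y_pred)
```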

Note that some estimators can calculate metrics that are not included in metrics and are estimator-specific, notably model likelihoods.

estimator tags#

Estimator tags describe certain capabilities of an estimator. This would enable some runtime behaviors based on estimator inspection, but it also allows each estimator to be tested for appropriate invariances while being excepted from other common tests.

Some aspects of estimator tags are currently determined through the duck typing of methods like predict_proba and through some special attributes on estimator objects:

For more detailed info, see Estimator Tags.

feature#
features#
feature vector#

In the abstract, a feature is a function (in its mathematical sense) mapping a sampled object to a numeric or categorical quantity. “Feature” is also commonly used to refer to these quantities, being the individual elements of a vector representing a sample. In a data matrix, features are represented as columns: each column contains the result of applying a feature function to a set of samples.

Elsewhere features are known as attributes, predictors, regressors, or independent variables.

Nearly all estimators in scikit-learn assume that features are numeric, finite and not missing, even when they have semantically distinct domains and distributions (categorical, ordinal, count-valued, real-valued, interval). See also categorical feature and missing values.

n_features indicates the number of features in a dataset.

fitting#

Calling fit (or fit_transform, fit_predict, etc.) on an estimator.

fitted#

The state of an estimator after fitting.

There is no conventional procedure for checking if an estimator is fitted. However, an estimator that is not fitted:

function#

We provide ad hoc function interfaces for many algorithms, while estimator classes provide a more consistent interface.

In particular, Scikit-learn may provide a function interface that fits a model to some data and returns the learnt model parameters, as in linear_model.enet_path. For transductive models, this also returns the embedding or cluster labels, as in manifold.spectral_embedding or cluster.dbscan. Many preprocessing transformers also provide a function interface, akin to calling fit_transform, as in preprocessing.maxabs_scale. Users should be careful to avoid data leakage when making use of these fit_transform-equivalent functions.

We do not have a strict policy about when to or when not to provide function forms of estimators, but maintainers should consider consistency with existing interfaces, and whether providing a function would lead users astray from best practices (as regards data leakage, etc.).

gallery#

See examples.

hyperparameter#
hyper-parameter#

See parameter.

impute#
imputation#

Most machine learning algorithms require that their inputs have no missing values, and will not work if this requirement is violated. Algorithms that attempt to fill in (or impute) missing values are referred to as imputation algorithms.

indexable#

An array-like, sparse matrix, pandas DataFrame or sequence (usually a list).

induction#
inductive#

Inductive (contrasted with transductive) machine learning builds a model of some data that can then be applied to new instances. Most estimators in Scikit-learn are inductive, having predict and/or transform methods.

joblib#

A Python library (https://joblib.readthedocs.io) used in Scikit-learn to facilitate simple parallelism and caching. Joblib is oriented towards efficiently working with numpy arrays, such as through use of memory mapping. See Parallelism for more information.

label indicator matrix#
multilabel indicator matrix#
multilabel indicator matrices#

The format used to represent multilabel data, where each row of a 2d array or sparse matrix corresponds to a sample, each column corresponds to a class, and each element is 1 if the sample is labeled with the class and 0 if not.
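One way to obtain such a matrix is preprocessing.MultiLabelBinarizer. This is a minimal sketch with made-up genre labels; the third sample carries no label at all, so its row is all zeros.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample has a (possibly empty) collection of labels.
labels = [["sci-fi", "thriller"], ["comedy"], []]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)   # rows: samples, columns: classes

classes = list(mlb.classes_)    # column order, sorted alphabetically
```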

leakage#
data leakage#

A problem in cross validation where generalization performance can be over-estimated since knowledge of the test data was inadvertently included in training a model. This is a risk, for instance, when applying a transformer to the entirety of a dataset rather than each training portion in a cross validation split.

We aim to provide interfaces (such as pipeline and model_selection) that shield the user from data leakage.
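A sketch of how a pipeline avoids this kind of leakage, assuming a synthetic dataset from make_classification: because the scaler is a step of the pipeline, cross_val_score refits it on the training portion of every split instead of on the whole dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, generated only for this example.
X, y = make_classification(n_samples=100, random_state=0)

# The scaler's statistics are learnt from each training fold only,
# then applied to the corresponding test fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```

Scaling X once up front and then cross-validating the classifier alone would let test-fold statistics influence training, which is the risk described above.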

memmapping#
memory map#
memory mapping#

A memory efficiency strategy that keeps data on disk rather than copying it into main memory. Memory maps can be created for arrays that can be read, written, or both, using numpy.memmap. When using joblib to parallelize operations in Scikit-learn, it may automatically memmap large arrays to reduce memory duplication overhead in multiprocessing.
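A minimal sketch of numpy.memmap, assuming a temporary file whose path and name are arbitrary: the array is written through one map and read back through a second, read-only map of the same file.

```python
import os
import tempfile

import numpy as np

# Backing file in a fresh temporary directory (path chosen for this example).
path = os.path.join(tempfile.mkdtemp(), "data.dat")

# mode="w+" creates the file and allows reading and writing.
writable = np.memmap(path, dtype=np.float64, mode="w+", shape=(100, 4))
writable[:] = 1.0
writable.flush()   # push pending changes to disk

# mode="r" maps the same file read-only; data is paged in on access.
readonly = np.memmap(path, dtype=np.float64, mode="r", shape=(100, 4))
total = float(readonly.sum())
```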

missing values#

Most Scikit-learn estimators do not work with missing values. When they do (e.g. in impute.SimpleImputer), NaN is the preferred representation of missing values in float arrays. If the array has integer dtype, NaN cannot be represented. For this reason, we support specifying another missing_values value when imputation or learning can be performed in integer space. Unlabeled data is a special case of missing values in the target.
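Both representations can be sketched with impute.SimpleImputer. The data is made up, and the choice of -1 as the integer marker is an assumption for this example, not a convention of the library.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Float data: NaN marks the missing entry (the default missing_values).
X_float = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])
imputed_float = SimpleImputer(strategy="mean").fit_transform(X_float)

# Integer data cannot hold NaN, so another marker (-1 here) is declared.
X_int = np.array([[1, -1], [3, 4], [5, 4]])
imputed_int = SimpleImputer(
    missing_values=-1, strategy="most_frequent"
).fit_transform(X_int)
```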

n_features#

The number of features.

n_outputs#

The number of outputs in the target.

n_samples#

The number of samples.

n_targets#

Synonym for n_outputs.

narrative docs#
narrative documentation#

An alias for User Guide, i.e. documentation written in doc/modules/. Unlike the API reference provided through docstrings, the User Guide aims to:

  • group tools provided by Scikit-learn together thematically or in terms of usage;

  • motivate why someone would use each particular tool, often through comparison;

  • provide both intuitive and technical descriptions of tools;

  • provide or link to examples of using key features of a tool.

np#

A shorthand for Numpy due to the conventional import statement:

    import numpy as np

online learning#

Where a model is iteratively updated by receiving each batch of ground truth targets soon after making predictions on the corresponding batch of data. Intrinsically, the model must be usable for prediction after each batch. See partial_fit.

out-of-core#

An efficiency strategy where not all the data is stored in main memory at once, usually by performing learning on batches of data. See partial_fit.
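Batch-wise learning with partial_fit can be sketched as below. The synthetic dataset and the batch size of 50 are assumptions for illustration; in a real out-of-core setting each batch would be loaded from disk rather than sliced from an in-memory array.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a dataset too large to hold in memory.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])   # every class must be declared up front
batch_size = 50

for start in range(0, X.shape[0], batch_size):
    stop = start + batch_size
    # Each call updates the existing model with one more batch.
    clf.partial_fit(X[start:stop], y[start:stop], classes=all_classes)

predictions = clf.predict(X)
```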

outputs#

Individual scalar/categorical variables per sample in the target. For example, in multilabel classification each possible label corresponds to a binary output. Also called responses, tasks or targets. See multiclass multioutput and continuous multioutput.

pair#

A tuple of length two.

parameter#
parameters#
param#
params#

We mostly use parameter to refer to the aspects of an estimator that can be specified in its construction. For example, max_depth and random_state are parameters of RandomForestClassifier. Parameters to an estimator’s constructor are stored unmodified as attributes on the estimator instance, and conventionally start with an alphabetic character and end with an alphanumeric character. Each estimator’s constructor parameters are described in the estimator’s docstring.

We do not use parameters in the statistical sense, where parameters are values that specify a model and can be estimated from data. What we call parameters might be what statisticians call hyperparameters to the model: aspects for configuring model structure that are often not directly learnt from data. However, our parameters are also used to prescribe modeling operations that do not affect the learnt model, such as n_jobs for controlling parallelism.

When talking about the parameters of a meta-estimator, we may also be including the parameters of the estimators wrapped by the meta-estimator. Ordinarily, these nested parameters are denoted by using a double underscore (__) to separate between the estimator-as-parameter and its parameter. Thus clf = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=3)) has a deep parameter estimator__max_depth with value 3, which is accessible with clf.estimator.max_depth or clf.get_params()['estimator__max_depth'].

The list of parameters and their current values can be retrieved from an estimator instance using its get_params method.

Between construction and fitting, parameters may be modified using set_params. To enable this, parameters are not ordinarily validated or altered when the estimator is constructed, or when each parameter is set. Parameter validation is performed when fit is called.
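The deferred validation described above can be sketched with an intentionally invalid value. The data is made up, and the example relies only on the invalid value being rejected with a ValueError at fit time, not on the exact error message.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, made up for this example.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# A negative C is invalid, but the constructor stores it unchecked.
est = LogisticRegression(C=-1.0)
stored_C = est.get_params()["C"]

try:
    est.fit(X, y)           # validation happens here
    failed_at_fit = False
except ValueError:
    failed_at_fit = True

# The parameter can be corrected with set_params before fitting again.
est.set_params(C=1.0)
est.fit(X, y)
current_C = est.get_params()["C"]
```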

Common parameters are listed below.

pairwise metric#
pairwise metrics#

In its broad sense, a pairwise metric defines a function for measuring similarity or dissimilarity between two samples (with each ordinarily represented as a feature vector). We particularly provide implementations of distance metrics (as well as improper metrics like Cosine Distance) through metrics.pairwise_distances, and of kernel functions (a constrained class of similarity functions) in metrics.pairwise.pairwise_kernels. These can compute pairwise distance matrices that are symmetric and hence store data redundantly.

See also precomputed and metric.

Note that for most distance metrics, we rely on implementations from scipy.spatial.distance, but may reimplement for efficiency in our context. The metrics.DistanceMetric interface is used to implement distance metrics for integration with efficient neighbors search.

pd#

A shorthand for Pandas due to the conventional import statement:

    import pandas as pd

precomputed#

Where algorithms rely on pairwise metrics, and can be computed from pairwise metrics alone, we often allow the user to specify that the X provided is already in the pairwise (dis)similarity space, rather than in a feature space. That is, when passed to fit, it is a square, symmetric matrix, with each vector indicating (dis)similarity to every sample, and when passed to prediction/transformation methods, each row corresponds to a testing sample and each column to a training sample.

Use of precomputed X is usually indicated by setting a metric, affinity or kernel parameter to the string ‘precomputed’. If this is the case, then the estimator should set the pairwise estimator tag as True.
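The shapes involved can be sketched with a nearest-neighbour classifier on made-up one-dimensional data: a square train-by-train distance matrix goes to fit, and a test-by-train matrix goes to predict.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsClassifier

# Toy data, made up for this example.
X_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.2], [10.8]])

# Square, symmetric matrix of shape (n_train, n_train) for fitting.
D_train = pairwise_distances(X_train)
knn = KNeighborsClassifier(n_neighbors=1, metric="precomputed")
knn.fit(D_train, y_train)

# Rows are test samples, columns are training samples: (n_test, n_train).
D_test = pairwise_distances(X_test, X_train)
predictions = knn.predict(D_test)
```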

rectangular#

Data that can be represented as a matrix with samples on the first axis and a fixed, finite set of features on the second is called rectangular.

This term excludes samples with non-vectorial structures, such as text, an image of arbitrary size, a time series of arbitrary length, a set of vectors, etc. The purpose of a vectorizer is to produce rectangular forms of such data.

sample#
samples#

We usually use this term as a noun to indicate a single feature vector. Elsewhere a sample is called an instance, data point, or observation. n_samples indicates the number of samples in a dataset, being the number of rows in a data array X. Note that this definition is standard in machine learning and deviates from statistics where it means a set of individuals or objects collected or selected.

sample property#
sample properties#

A sample property is data for each sample (e.g. an array of length n_samples) passed to an estimator method or a similar function, alongside but distinct from the features (X) and target (y). The most prominent example is sample_weight; see others at Data and sample properties.

As of version 0.19 we do not have a consistent approach to handling sample properties and their routing in meta-estimators, though a fit_params parameter is often used.

scikit-learn-contrib#

A venue for publishing Scikit-learn-compatible libraries that are broadly authorized by the core developers and the contrib community, but not maintained by the core developer team. See https://scikit-learn-contrib.github.io.

scikit-learn enhancement proposals#
SLEP#
SLEPs#

Changes to the API principles and changes to dependencies or supported versions happen via a SLEP and follow the decision-making process outlined in Scikit-learn governance and decision-making. For all votes, a proposal must have been made public and discussed before the vote. Such a proposal must be a consolidated document, in the form of a “Scikit-Learn Enhancement Proposal” (SLEP), rather than a long discussion on an issue. A SLEP must be submitted as a pull-request to enhancement proposals using the SLEP template.

semi-supervised#
semi-supervised learning#
semisupervised#

Learning where the expected prediction (label or ground truth) is only available for some samples provided as training data when fitting the model. We conventionally apply the label -1 to unlabeled samples in semi-supervised classification.
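The -1 convention can be sketched with semi_supervised.LabelPropagation on made-up data forming two well-separated groups. One point in each group is labeled; the other is marked -1 and receives a label during fitting.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Two tight, well-separated groups of points (toy data).
X = np.array([[0.0], [0.1], [5.0], [5.1]])

# -1 marks the samples whose label is unknown.
y = np.array([0, -1, 1, -1])

model = LabelPropagation()   # default RBF kernel
model.fit(X, y)

n_unlabeled = int((y == -1).sum())
inferred = model.transduction_   # labels assigned to every training sample
```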

sparse matrix#
sparse graph#

A representation of two-dimensional numeric data that is more memory efficient than the corresponding dense numpy array where almost all elements are zero. We use the scipy.sparse framework, which provides several underlying sparse data representations, or formats. Some formats are more efficient than others for particular tasks, and when a particular format provides especial benefit, we try to document this fact in Scikit-learn parameter descriptions.

Some sparse matrix formats (notably CSR, CSC, COO and LIL) distinguish between implicit and explicit zeros. Explicit zeros are stored (i.e. they consume memory in a data array) in the data structure, while implicit zeros correspond to every element not otherwise defined in explicit storage.
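The difference between explicit and implicit zeros can be sketched with a small CSR matrix built from made-up coordinates. One of the three stored values is a zero, so it occupies storage even though it equals the implicit zeros elsewhere.

```python
import numpy as np
from scipy import sparse

# Three stored entries; the second one is an explicit zero.
data = np.array([1.0, 0.0, 2.0])
rows = np.array([0, 0, 1])
cols = np.array([0, 1, 2])
m = sparse.csr_matrix((data, (rows, cols)), shape=(2, 3))

stored_before = m.nnz              # counts stored entries, explicit zeros included
nonzero_values = m.count_nonzero() # counts entries whose value is not 0
dense = m.toarray()                # both kinds of zero become 0.0

m.eliminate_zeros()                # drop explicit zeros from storage, in place
stored_after = m.nnz
```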

Two semantics for sparse matrices are used in Scikit-learn:

matrix semantics

The sparse matrix is interpreted as an array with implicit and explicit zeros being interpreted as the number 0. This is the interpretation most often adopted, e.g. when sparse matrices are used for feature matrices or multilabel indicator matrices.

graph semantics

As with scipy.sparse.csgraph, explicit zeros are interpreted as the number 0, but implicit zeros indicate a masked or absent value, such as the absence of an edge between two vertices of a graph, where an explicit value indicates an edge’s weight. This interpretation is adopted to represent connectivity in clustering, in representations of nearest neighborhoods (e.g. neighbors.kneighbors_graph), and for precomputed distance representation where only distances in the neighborhood of each point are required.

When working with sparse matrices, we assume that it is sparse for a good reason, and avoid writing code that densifies a user-provided sparse matrix, instead maintaining sparsity or raising an error if not possible (i.e. if an estimator does not / cannot support sparse matrices).

stateless#

An estimator is stateless if it does not store any information that is obtained during fit. This information can be either parameters learned during fit or statistics computed from the training data. An estimator is stateless if it has no attributes apart from ones set in __init__. Calling fit for these estimators will only validate the public attributes passed in __init__.

supervised#
supervised learning#

Learning where the expected prediction (label or ground truth) is available for each sample when fitting the model, provided as y. This is the approach taken in a classifier or regressor among other estimators.

target#
targets#

The dependent variable in supervised (and semisupervised) learning, passed as y to an estimator’s fit method. Also known as dependent variable, outcome variable, response variable, ground truth or label. Scikit-learn works with targets that have minimal structure: a class from a finite set, a finite real-valued number, multiple classes, or multiple numbers. See Target Types.

transduction#
transductive#

A transductive (contrasted with inductive) machine learning method is designed to model a specific dataset, but not to apply that model to unseen data. Examples include manifold.TSNE, cluster.AgglomerativeClustering and neighbors.LocalOutlierFactor.

unlabeled#
unlabeled data#

Samples with an unknown ground truth when fitting; equivalently, missing values in the target. See also semisupervised and unsupervised learning.

unsupervised#
unsupervised learning#

Learning where the expected prediction (label or ground truth) is not available for each sample when fitting the model, as in clusterers and outlier detectors. Unsupervised estimators ignore any y passed to fit.

Class APIs and Estimator Types#

classifier#
classifiers#

A supervised (or semi-supervised) predictor with a finite set of discrete possible output values.

A classifier supports modeling some of binary, multiclass, multilabel, or multiclass multioutput targets. Within scikit-learn, all classifiers support multi-class classification, defaulting to using a one-vs-rest strategy over the binary classification problem.

Classifiers must store a classes_ attribute after fitting, and inherit from base.ClassifierMixin, which sets their corresponding estimator tags correctly.

A classifier can be distinguished from other estimators with is_classifier.
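As a minimal sketch of the conventions above, the snippet below checks an estimator's type with is_classifier and inspects classes_ after fitting. The toy data and the choice of LogisticRegression and LinearRegression are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.base import is_classifier, is_regressor
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (illustrative only): one feature, two string class labels.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(["ham", "spam", "ham", "spam"])

clf = LogisticRegression().fit(X, y)

# is_classifier / is_regressor rely on the estimator type set by the mixins.
clf_flags = (is_classifier(clf), is_regressor(clf))
reg_flags = (is_classifier(LinearRegression()), is_regressor(LinearRegression()))

# classes_ is stored after fitting and holds the sorted class labels.
fitted_classes = clf.classes_.tolist()
```

Here clf_flags is (True, False), reg_flags is (False, True), and fitted_classes lists the labels in lexicographic order.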

A classifier must implement:

It may also be appropriate to implement decision_function, predict_proba and predict_log_proba.

clusterer#
clusterers#

An unsupervised predictor with a finite set of discrete output values.

A clusterer usually stores labels_ after fitting, and must do so if it is transductive.

A clusterer must implement:

density estimator#

An unsupervised estimation of input probability density function. Commonly used techniques are:

estimator#
estimators#

An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:

The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods.

Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator.

The core functionality of some estimators may also be available as a function.

feature extractor#
feature extractors#

A transformer which takes input where each sample is not represented as an array-like object of fixed length, and produces an array-like object of features for each sample (and thus a 2-dimensional array-like for a set of samples). In other words, it (lossily) maps a non-rectangular data representation into rectangular data.

Feature extractors must implement at least:

meta-estimator#
meta-estimators#
metaestimator#
metaestimators#

An estimator which takes another estimator as a parameter. Examples include pipeline.Pipeline, model_selection.GridSearchCV, feature_selection.SelectFromModel and ensemble.BaggingClassifier.

In a meta-estimator’s fit method, any contained estimators should be cloned before they are fit.

An exception to this is that an estimator may explicitly document that it accepts a pre-fitted estimator (e.g. using prefit=True in feature_selection.SelectFromModel). One known issue with this is that the pre-fitted estimator will lose its model if the meta-estimator is cloned. A meta-estimator should have fit called before prediction, even if all contained estimators are pre-fitted.
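The cloning behavior described above can be sketched with base.clone. The example below is illustrative only (the toy data and the choice of LogisticRegression are assumptions); it shows that a clone keeps the constructor parameters but drops the fitted model, which is why a pre-fitted estimator loses its state when it is cloned.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Fit a base estimator on toy data (illustrative only).
base = LogisticRegression(C=0.5).fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# clone() builds a new, unfitted estimator with the same parameters.
copy = clone(base)

same_params = copy.get_params() == base.get_params()
base_is_fitted = hasattr(base, "coef_")
copy_is_fitted = hasattr(copy, "coef_")
```

After running this, same_params and base_is_fitted are True, while copy_is_fitted is False.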

In cases where a meta-estimator’s primary behaviors (e.g. predict or transform implementation) are functions of prediction/transformation methods of the provided base estimator (or multiple base estimators), a meta-estimator should provide at least the standard methods provided by the base estimator. It may not be possible to identify which methods are provided by the underlying estimator until the meta-estimator has been fitted (see also duck typing), for which utils.metaestimators.available_if may help. It should also provide (or modify) the estimator tags and classes_ attribute provided by the base estimator.

Meta-estimators should be careful to validate data as minimally as possible before passing it to an underlying estimator. This saves computation time, and may, for instance, allow the underlying estimator to easily work with data that is not rectangular.

outlier detector#
outlier detectors#

An unsupervised binary predictor which models the distinction between core and outlying samples.

Outlier detectors must implement:

Inductive outlier detectors may also implement decision_function to give a normalized inlier score where outliers have score below 0. score_samples may provide an unnormalized score per sample.

predictor#
predictors#

An estimator supporting predict and/or fit_predict. This encompasses classifier, regressor, outlier detector and clusterer.

In statistics, “predictors” refers to features.

regressor#
regressors#

A supervised (or semi-supervised) predictor with continuous output values.

Regressors inherit from base.RegressorMixin, which sets their estimator tags correctly.

A regressor can be distinguished from other estimators with is_regressor.

A regressor must implement:

transformer#
transformers#

An estimator supporting transform and/or fit_transform. A purely transductive transformer, such as manifold.TSNE, may not implement transform.

vectorizer#
vectorizers#

See feature extractor.

There are further APIs specifically related to a small family of estimators, such as:

cross-validation splitter#
CV splitter#
cross-validation generator#

A non-estimator family of classes used to split a dataset into a sequence of train and test portions (see Cross-validation: evaluating estimator performance), by providing split and get_n_splits methods. Note that unlike estimators, these do not have fit methods and do not provide set_params or get_params. Parameter validation may be performed in __init__.

cross-validation estimator#

An estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters (see the User Guide). Some examples of cross-validation estimators are ElasticNetCV and LogisticRegressionCV. Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements. An exception is the RidgeCV class, which can instead perform efficient Leave-One-Out (LOO) CV. By default, all these estimators, apart from RidgeCV with an LOO-CV, will be refitted on the full training dataset after finding the best combination of hyper-parameters.

scorer#

A non-estimator callable object which evaluates an estimator on given test data, returning a number. Unlike evaluation metrics, a greater returned number must correspond with a better score. See The scoring parameter: defining model evaluation rules.
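To illustrate the "greater is better" convention, the sketch below compares a scorer with the underlying evaluation metric. The toy data and the use of LinearRegression are assumptions for illustration; the scorer name "neg_mean_squared_error" is one of the strings accepted by metrics.get_scorer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

# Toy regression data (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.1, 1.9, 3.2])
model = LinearRegression().fit(X, y)

# A scorer is called as scorer(estimator, X, y) and returns a number.
scorer = get_scorer("neg_mean_squared_error")
score = scorer(model, X, y)

# The evaluation metric is called as metric(y_true, y_pred); lower is better.
mse = mean_squared_error(y, model.predict(X))
```

The scorer returns the negated error, so score equals -mse and a larger score still means a better fit.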

Further examples:

Metadata Routing#

consumer#

An object which consumes metadata. This object is usually an estimator, a scorer, or a CV splitter. Consuming metadata means using it in calculations, e.g. using sample_weight to calculate a certain type of score. Being a consumer doesn’t mean that the object always receives a certain metadata, rather it means it can use it if it is provided.

metadata#

Data which is related to the given X and y data, but is not directly a part of the data, e.g. sample_weight or groups, and is passed along to different objects and methods, e.g. to a scorer or a CV splitter.

router#

An object which routes metadata to consumers. This object is usually a meta-estimator, e.g. Pipeline or GridSearchCV. Some routers can also be a consumer. This happens for example when a meta-estimator uses the given groups, and it also passes it along to some of its sub-objects, such as a CV splitter.

Please refer to Metadata Routing User Guide for more information.

Target Types#

binary#

A classification problem consisting of two classes. A binary target may be represented as for a multiclass problem but with only two labels. A binary decision function is represented as a 1d array.

Semantically, one class is often considered the “positive” class. Unless otherwise specified (e.g. using pos_label in evaluation metrics), we consider the class label with the greater value (numerically or lexicographically) as the positive class: of labels [0, 1], 1 is the positive class; of [1, 2], 2 is the positive class; of [‘no’, ‘yes’], ‘yes’ is the positive class; of [‘no’, ‘YES’], ‘no’ is the positive class. This affects the output of decision_function, for instance.

Note that a dataset sampled from a multiclass y or a continuous y may appear to be binary.

type_of_target will return ‘binary’ for binary input, or a similar array with only a single class present.

continuous#

A regression problem where each sample’s target is a finite floating point number represented as a 1-dimensional array of floats (or sometimes ints).

type_of_target will return ‘continuous’ for continuous input, but if the data is all integers, it will be identified as ‘multiclass’.

continuous multioutput#
continuous multi-output#
multioutput continuous#
multi-output continuous#

A regression problem where each sample’s target consists of n_outputs outputs, each one a finite floating point number, for a fixed int n_outputs > 1 in a particular dataset.

Continuous multioutput targets are represented as multiple continuous targets, horizontally stacked into an array of shape (n_samples, n_outputs).

type_of_target will return ‘continuous-multioutput’ for continuous multioutput input, but if the data is all integers, it will be identified as ‘multiclass-multioutput’.

multiclass#
multi-class#

A classification problem consisting of more than two classes. A multiclass target may be represented as a 1-dimensional array of strings or integers. A 2d column vector of integers (i.e. a single output in multioutput terms) is also accepted.

We do not officially support other orderable, hashable objects as class labels, even if estimators may happen to work when given classification targets of such type.

For semi-supervised classification, unlabeled samples should have the special label -1 in y.

Within scikit-learn, all estimators supporting binary classification also support multiclass classification, using One-vs-Rest by default.

A preprocessing.LabelEncoder helps to canonicalize multiclass targets as integers.

type_of_target will return ‘multiclass’ for multiclass input. The user may also want to handle ‘binary’ input identically to ‘multiclass’.

multiclass multioutput#
multi-class multi-output#
multioutput multiclass#
multi-output multi-class#

A classification problem where each sample’s target consists of n_outputs outputs, each a class label, for a fixed int n_outputs > 1 in a particular dataset. Each output has a fixed set of available classes, and each sample is labeled with a class for each output. An output may be binary or multiclass, and in the case where all outputs are binary, the target is multilabel.

Multiclass multioutput targets are represented as multiple multiclass targets, horizontally stacked into an array of shape (n_samples, n_outputs).

Note: For simplicity, we may not always support string class labels for multiclass multioutput, and integer class labels should be used.

multioutput provides estimators which estimate multi-output problems using multiple single-output estimators. This may not fully account for dependencies among the different outputs, which methods natively handling the multioutput case (e.g. decision trees, nearest neighbors, neural networks) may do better.

type_of_target will return ‘multiclass-multioutput’ for multiclass multioutput input.

multilabel#
multi-label#

A multiclass multioutput target where each output is binary. This may be represented as a 2d (dense) array or sparse matrix of integers, such that each column is a separate binary target, where positive labels are indicated with 1 and negative labels are usually -1 or 0. Sparse multilabel targets are not supported everywhere that dense multilabel targets are supported.

Semantically, a multilabel target can be thought of as a set of labels for each sample. While not used internally, preprocessing.MultiLabelBinarizer is provided as a utility to convert from a list of sets representation to a 2d array or sparse matrix. One-hot encoding a multiclass target with preprocessing.LabelBinarizer turns it into a multilabel problem.
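As a small illustration of the list-of-sets view, the sketch below converts label sets into the 2d indicator array with preprocessing.MultiLabelBinarizer. The genre labels are made up for the example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample carries a set of labels; the last sample has none.
label_sets = [{"sci-fi", "thriller"}, {"comedy"}, set()]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)  # 2d indicator array, one column per label

classes = mlb.classes_.tolist()    # column order (sorted labels)
indicator = Y.tolist()
```

With these inputs, classes is ['comedy', 'sci-fi', 'thriller'] and indicator is [[0, 1, 1], [1, 0, 0], [0, 0, 0]].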

type_of_target will return ‘multilabel-indicator’ for multilabel input, whether sparse or dense.

multioutput#
multi-output#

A target where each sample has multiple classification/regression labels. See multiclass multioutput and continuous multioutput. We do not currently support modelling mixed classification and regression targets.
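The target types in this section can be checked with utils.multiclass.type_of_target. The sketch below runs it on a handful of made-up targets; the dictionary keys are only descriptive names chosen for this example.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Made-up targets, one per target type discussed in this section.
targets = {
    "two classes": [0, 1, 1, 0],
    "three string classes": ["a", "b", "c"],
    "floats": [0.5, 1.2, 3.3],
    "integer-valued floats": [1.0, 2.0, 3.0],
    "binary columns": np.array([[0, 1], [1, 1]]),
    "multiclass columns": np.array([[0, 2], [1, 1]]),
    "float columns": np.array([[0.5, 1.0], [1.5, 2.5]]),
}

detected = {name: type_of_target(y) for name, y in targets.items()}
```

Note that the integer-valued floats are reported as 'multiclass' rather than 'continuous', matching the caveat in the continuous entry.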

Methods#

decision_function#

In a fitted classifier or outlier detector, predicts a “soft” score for each sample in relation to each class, rather than the “hard” categorical prediction produced by predict. Its input is usually only some observed data, X.

If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.

Output conventions:

binary classification

A 1-dimensional array, where values strictly greater than zero indicate the positive class (i.e. the last class in classes_).

multiclass classification

A 2-dimensional array, where the row-wise arg-maximum is the predicted class. Columns are ordered according to classes_.

multilabel classification

Scikit-learn is inconsistent in its representation of multilabel decision functions. It may be represented one of two ways:

  • List of 2d arrays, each array of shape: (n_samples, 2), like in multiclass multioutput. List is of length n_labels.

  • Single 2d array of shape (n_samples, n_labels), with each ‘column’ in the array corresponding to the individual binary classification decisions. This is identical to the multiclass classification format, though its semantics differ: it should be interpreted, like in the binary case, by thresholding at 0.

multioutput classification

A list of 2d arrays, corresponding to each multiclass decision function.

outlier detection

A 1-dimensional array, where a value greater than or equal to zero indicates an inlier.
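The binary and multiclass conventions above can be sketched as follows. LogisticRegression and the toy data are assumptions chosen for illustration; other classifiers exposing decision_function are expected to follow the same shapes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative only): one feature, six samples.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_binary = np.array([0, 0, 0, 1, 1, 1])
y_multi = np.array([0, 0, 1, 1, 2, 2])

# Binary: 1d scores; values > 0 select the last class in classes_.
binary_clf = LogisticRegression().fit(X, y_binary)
binary_scores = binary_clf.decision_function(X)
binary_pred = binary_clf.classes_[(binary_scores > 0).astype(int)]

# Multiclass: 2d scores; the row-wise argmax indexes into classes_.
multi_clf = LogisticRegression().fit(X, y_multi)
multi_scores = multi_clf.decision_function(X)
multi_pred = multi_clf.classes_[multi_scores.argmax(axis=1)]
```

In both cases the labels recovered from the scores agree with the output of predict.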

fit#

The fit method is provided on every estimator. It usually takes some samples X, targets y if the model is supervised, and potentially other sample properties such as sample_weight. It should:

  • clear any prior attributes stored on the estimator, unless warm_start is used;

  • validate and interpret any parameters, ideally raising an error if invalid;

  • validate the input data;

  • estimate and store model attributes from the estimated parameters and provided data; and

  • return the now fitted estimator to facilitate method chaining.

Target Types describes possible formats for y.
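The responsibilities of fit listed above can be sketched with a toy estimator. MeanRegressor and its shrinkage parameter are hypothetical names invented for this example and are not part of scikit-learn; the validation helpers come from sklearn.utils.validation.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class MeanRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical regressor that predicts a shrunken training mean."""

    def __init__(self, shrinkage=1.0):
        self.shrinkage = shrinkage  # stored as given; validated in fit

    def fit(self, X, y):
        # Validate and interpret parameters.
        if not 0.0 <= self.shrinkage <= 1.0:
            raise ValueError("shrinkage must be in [0, 1]")
        # Validate the input data.
        X, y = check_X_y(X, y)
        # Estimate and store model attributes (overwriting any prior ones).
        self.n_features_in_ = X.shape[1]
        self.mean_ = self.shrinkage * float(np.mean(y))
        # Return the fitted estimator to allow method chaining.
        return self

    def predict(self, X):
        check_is_fitted(self)  # raises NotFittedError before fit
        X = check_array(X)
        return np.full(X.shape[0], self.mean_)


X = [[0.0], [1.0], [2.0]]
y = [1.0, 2.0, 3.0]

est = MeanRegressor(shrinkage=0.5)
returned = est.fit(X, y)
chained = MeanRegressor().fit(X, y).predict(X)

try:
    MeanRegressor().predict(X)
    raised_before_fit = False
except NotFittedError:
    raised_before_fit = True
```

Because fit returns the estimator itself, returned is est, and the chained call works in a single expression.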

fit_predict#

Used especially for unsupervised, transductive estimators, this fits the model and returns the predictions (similar to predict) on the training data. In clusterers, these predictions are also stored in the labels_ attribute, and the output of .fit_predict(X) is usually equivalent to .fit(X).predict(X). The parameters to fit_predict are the same as those to fit.

fit_transform#

A method on transformers which fits the estimator and returns the transformed training data. It takes parameters as in fit and its output should have the same shape as calling .fit(X, ...).transform(X). There are nonetheless rare cases where .fit_transform(X, ...) and .fit(X, ...).transform(X) do not return the same value, wherein training data needs to be handled differently (due to model blending in stacked ensembles, for instance; such cases should be clearly documented). Transductive transformers may also provide fit_transform but not transform.

One reason to implement fit_transform is that performing fit and transform separately would be less efficient than together. base.TransformerMixin provides a default implementation, providing a consistent interface across transformers where fit_transform is or is not specialized.

In inductive learning, where the goal is to learn a generalized model that can be applied to new data, users should be careful not to apply fit_transform to the entirety of a dataset (i.e. training and test data together) before further modelling, as this results in data leakage.

get_feature_names_out#

Primarily for feature extractors, but also used for other transformers to provide string names for each column in the output of the estimator’s transform method. It outputs an array of strings and may take an array-like of strings as input, corresponding to the names of input columns from which output column names can be generated. If input_features is not passed in, then the feature_names_in_ attribute will be used. If the feature_names_in_ attribute is not defined, then the input names are named [x0, x1, ..., x(n_features_in_ - 1)].

get_n_splits#

On a CV splitter (not an estimator), returns the number of elements one would get if iterating through the return value of split given the same parameters. Takes the same parameters as split.

get_params#

Gets all parameters, and their values, that can be set using set_params. A parameter deep can be used, when set to False, to only return those parameters not including __, i.e. not due to indirection via contained estimators.

Most estimators adopt the definition from base.BaseEstimator, which simply adopts the parameters defined for __init__. pipeline.Pipeline, among others, reimplements get_params to declare the estimators named in its steps parameters as themselves being parameters.
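The effect of deep, and the nested name__parameter convention, can be sketched with a small pipeline. The step names "scale" and "clf" are arbitrary choices for this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(C=0.5))])

# deep=False: only the pipeline's own constructor parameters.
shallow = pipe.get_params(deep=False)

# deep=True: also the named steps and their parameters, joined with "__".
deep = pipe.get_params(deep=True)
c_before = deep["clf__C"]

# The same nested keys are accepted by set_params.
pipe.set_params(clf__C=2.0)
c_after = pipe.get_params()["clf__C"]
```

The shallow dictionary contains keys such as "steps" but no key with "__", whereas the deep one exposes "clf" and "clf__C".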

partial_fit#

Facilitates fitting an estimator in an online fashion. Unlike fit, repeatedly calling partial_fit does not clear the model, but updates it with the data provided. The portion of data provided to partial_fit may be called a mini-batch. Each mini-batch must be of consistent shape, etc. In iterative estimators, partial_fit often only performs a single iteration.

partial_fit may also be used for out-of-core learning, although usually limited to the case where learning can be performed online, i.e. the model is usable after each partial_fit and there is no separate processing needed to finalize the model. cluster.Birch introduces the convention that calling partial_fit(X) will produce a model that is not finalized, but the model can be finalized by calling partial_fit() i.e. without passing a further mini-batch.

Generally, estimator parameters should not be modified between calls to partial_fit, although partial_fit should validate them as well as the new mini-batch of data. In contrast, warm_start is used to repeatedly fit the same estimator with the same data but varying parameters.

Like fit, partial_fit should return the estimator object.

To clear the model, a new estimator should be constructed, for instance with base.clone.

Note: Using partial_fit after fit results in undefined behavior.
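A minimal sketch of mini-batch training with partial_fit, assuming linear_model.SGDClassifier and synthetic data generated for the example. For classifiers, the full set of classes is passed so that every mini-batch is interpreted consistently, even if a batch happens to miss a class.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic, linearly separable data (illustrative only).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])

# Feed the data in mini-batches of consistent shape; the model is updated,
# not reset, on each call.
for start in range(0, len(X), 50):
    batch = slice(start, start + 50)
    returned = clf.partial_fit(X[batch], y[batch], classes=all_classes)

n_coefficients = clf.coef_.shape
```

Each call returns the estimator itself, and the model is usable for prediction after any of the calls.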

predict#

Makes a prediction for each sample, usually only taking X as input (but see under regressor output conventions below). In a classifier or regressor, this prediction is in the same target space used in fitting (e.g. one of {‘red’, ‘amber’, ‘green’} if the y in fitting consisted of these strings). Despite this, even when y passed to fit is a list or other array-like, the output of predict should always be an array or sparse matrix. In a clusterer or outlier detector the prediction is an integer.

If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.

Output conventions:

classifier

An array of shape (n_samples,) or (n_samples, n_outputs). Multilabel data may be represented as a sparse matrix if a sparse matrix was used in fitting. Each element should be one of the values in the classifier’s classes_ attribute.

clusterer

An array of shape (n_samples,) where each value is from 0 to n_clusters - 1 if the corresponding sample is clustered, and -1 if the sample is not clustered, as in cluster.dbscan.

outlier detector

An array of shape (n_samples,) where each value is -1 for an outlier and 1 otherwise.

regressor

A numeric array of shape (n_samples,), usually float64. Some regressors have extra options in their predict method, allowing them to return standard deviation (return_std=True) or covariance (return_cov=True) relative to the predicted value. In this case, the return value is a tuple of arrays corresponding to (prediction mean, std, cov) as required.

predict_log_proba#

The natural logarithm of the output of predict_proba, provided to facilitate numerical stability.

predict_proba#

A method in classifiers and clusterers that can return probability estimates for each class/cluster. Its input is usually only some observed data, X.

If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.

Output conventions are like those for decision_function except in the binary classification case, where one column is output for each class (while decision_function outputs a 1d array). For binary and multiclass predictions, each row should add to 1.
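The binary-case difference between predict_proba and decision_function can be sketched as below, again assuming LogisticRegression and toy data for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)          # one column per class, even when binary
log_proba = clf.predict_log_proba(X)  # natural log of the probabilities
scores = clf.decision_function(X)     # 1d in the binary case

row_sums = proba.sum(axis=1)
```

Each row of proba sums to 1, and exponentiating log_proba recovers proba up to floating point error.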

Like other methods, predict_proba should only be present when the estimator can make probabilistic predictions (see duck typing). This means that the presence of the method may depend on estimator parameters (e.g. in linear_model.SGDClassifier) or training data (e.g. in model_selection.GridSearchCV) and may only appear after fitting.

score#

A method on an estimator, usually a predictor, which evaluates its predictions on a given dataset, and returns a single numerical score. A greater return value should indicate better predictions; accuracy is used for classifiers and R^2 for regressors by default.

If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.

Some estimators implement a custom, estimator-specific score function, often the likelihood of the data under the model.

score_samples#

A method that returns a score for each given sample. The exact definition of score varies from one class to another. In the case of density estimation, it can be the log density model on the data, and in the case of outlier detection, it can be the opposite of the outlier factor of the data.

If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.

set_params#

Available in any estimator, takes keyword arguments corresponding to keys in get_params. Each is provided a new value to assign such that calling get_params after set_params will reflect the changed parameters. Most estimators use the implementation in base.BaseEstimator, which handles nested parameters and otherwise sets the parameter as an attribute on the estimator. The method is overridden in pipeline.Pipeline and related estimators.

split#

On a CV splitter (not an estimator), this method accepts parameters (X, y, groups), where all may be optional, and returns an iterator over (train_idx, test_idx) pairs. Each of {train,test}_idx is a 1d integer array, with values from 0 to X.shape[0] - 1 of any length, such that no values appear in both some train_idx and its corresponding test_idx.
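As an illustration of split and get_n_splits, the sketch below uses model_selection.KFold without shuffling on a tiny made-up array.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # six samples, two features

cv = KFold(n_splits=3)
n_splits = cv.get_n_splits(X)

# split yields (train_idx, test_idx) pairs of 1d integer index arrays.
folds = [(train.tolist(), test.tolist()) for train, test in cv.split(X)]

# Within each pair, no index appears in both train and test.
disjoint = all(set(train).isdisjoint(test) for train, test in folds)
```

With six samples and three unshuffled folds, the first pair is ([2, 3, 4, 5], [0, 1]), and the number of pairs matches get_n_splits.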

transform#

In a transformer, transforms the input, usually only X, into some transformed space (conventionally notated as Xt). Output is an array or sparse matrix of length n_samples and with the number of columns fixed after fitting.

If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.

Parameters#

These common parameter names, specifically used in estimator construction (see concept parameter), sometimes also appear as parameters of functions or non-estimator constructors.

class_weight#

Used to specify sample weights when fitting classifiers as a function of the target class. Where sample_weight is also supported and given, it is multiplied by the class_weight contribution. Similarly, where class_weight is used in a multioutput (including multilabel) task, the weights are multiplied across outputs (i.e. columns of y).

By default, all samples have equal weight such that classes are effectively weighted by their prevalence in the training data. This could be achieved explicitly with class_weight={label1: 1, label2: 1, ...} for all class labels.

More generally, class_weight is specified as a dict mapping class labels to weights ({class_label: weight}), such that each sample of the named class is given that weight.

class_weight='balanced' can be used to give all classes equal weight by giving each sample a weight inversely related to its class’s prevalence in the training data: n_samples / (n_classes * np.bincount(y)). Class weights will be used differently depending on the algorithm: for linear models (such as linear SVM or logistic regression), the class weights will alter the loss function by weighting the loss of each sample by its class weight. For tree-based algorithms, the class weights will be used for reweighting the splitting criterion. Note however that this rebalancing does not take the weight of samples in each class into account.
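The 'balanced' formula above can be checked numerically against utils.class_weight.compute_class_weight. The imbalanced toy target is made up for the example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy target: six samples of class 0, two of class 1.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y)

# Formula from the text: n_samples / (n_classes * np.bincount(y)).
manual = len(y) / (len(classes) * np.bincount(y))

# Same weights as computed by scikit-learn's helper.
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)
```

Here class 0 receives weight 8 / (2 * 6) and class 1 receives 8 / (2 * 2) = 2.0, so the rarer class is weighted more heavily.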

For multioutput classification, a list of dicts is used to specify weights for each output. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1: 1}, {2: 5}, {3: 1}, {4: 1}].

The class_weight parameter is validated and interpreted with utils.class_weight.compute_class_weight.

cv#

Determines a cross validation splitting strategy, as used in cross-validation based routines. cv is also available in estimators such as multioutput.ClassifierChain or calibration.CalibratedClassifierCV which use the predictions of one estimator as training data for another, to not overfit the training supervision.

Possible inputs for cv are usually:

  • An integer, specifying the number of folds in K-fold cross validation. K-fold will be stratified over classes if the estimator is a classifier (determined by base.is_classifier) and the targets may represent a binary or multiclass (but not multioutput) classification problem (determined by utils.multiclass.type_of_target).

  • A cross-validation splitter instance. Refer to the User Guide for splitters available within Scikit-learn.

  • An iterable yielding train/test splits.

With some exceptions (especially where not using cross validation at all is an option), the default is 5-fold.

cv values are validated and interpreted with model_selection.check_cv.

kernel#

Specifies the kernel function to be used by Kernel Method algorithms. For example, the estimators svm.SVC and gaussian_process.GaussianProcessClassifier both have a kernel parameter that takes the name of the kernel to use as string or a callable kernel function used to compute the kernel matrix. For more reference, see the Kernel Approximation and the Gaussian Processes user guides.

max_iter#

For estimators involving iterative optimization, this determines the maximum number of iterations to be performed in fit. If max_iter iterations are run without convergence, an exceptions.ConvergenceWarning should be raised. Note that the interpretation of “a single iteration” is inconsistent across estimators: some, but not all, use it to mean a single epoch (i.e. a pass over every sample in the data).

memory#

Some estimators make use of joblib.Memory to store partial solutions during fitting. Thus when fit is called again, those partial solutions have been memoized and can be reused.

A memory parameter can be specified as a string with a path to a directory, or a joblib.Memory instance (or an object with a similar interface, i.e. a cache method) can be used.

memory values are validated and interpreted with utils.validation.check_memory.

metric#

As a parameter, this is the scheme for determining the distance between two data points. See metrics.pairwise_distances. In practice, for some algorithms, an improper distance metric (one that does not obey the triangle inequality, such as Cosine Distance) may be used.

Note: Hierarchical clustering uses affinity with this meaning.

We also use metric to refer to evaluation metrics, but avoid using this sense as a parameter name.

n_components#

The number of features which a transformer should transform the input into. See components_ for the special case of affine projection.

n_iter_no_change#

Number of iterations with no improvement to wait before stopping the iterative procedure. This is also known as a patience parameter. It is typically used with early stopping to avoid stopping too early.

n_jobs#

This parameter is used to specify how many concurrent processes or threads should be used for routines that are parallelized with joblib.

n_jobs is an integer, specifying the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. For example with n_jobs=-2, all CPUs but one are used.
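The arithmetic for negative values can be written out as a short helper. The function effective_workers below is a hypothetical name used only to restate the rule in code; it is not a scikit-learn or joblib API, and clamping the result to at least one worker is an assumption added so the helper never returns zero or a negative count.

```python
def effective_workers(n_jobs: int, n_cpus: int) -> int:
    """Hypothetical helper restating the n_jobs rule from the text."""
    if n_jobs == 0:
        raise ValueError("n_jobs=0 has no meaning")
    if n_jobs < 0:
        # -1 means all CPUs, -2 means all but one, and so on.
        # Assumption: never go below a single worker.
        return max(n_cpus + 1 + n_jobs, 1)
    # Positive values are taken as the maximum number of workers.
    return n_jobs


all_cpus = effective_workers(-1, n_cpus=8)
all_but_one = effective_workers(-2, n_cpus=8)
single = effective_workers(1, n_cpus=8)
```

On a machine with 8 CPUs this gives 8 workers for n_jobs=-1, 7 for n_jobs=-2, and 1 for n_jobs=1.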

n_jobs is None by default, which means unset; it will generally be interpreted as n_jobs=1, unless the current joblib.Parallel backend context specifies otherwise.

Note that even if n_jobs=1, low-level parallelism (via Numpy and OpenMP) might be used in some configurations.

For more details on the use of joblib and its interactions with scikit-learn, please refer to our parallelism notes.

pos_label#

Value with which positive labels must be encoded in binary classification problems in which the positive class is not assumed. This value is typically required to compute asymmetric evaluation metrics such as precision and recall.

random_state#

Whenever randomization is part of a Scikit-learn algorithm, a random_state parameter may be provided to control the random number generator used. Note that the mere presence of random_state doesn’t mean that randomization is always used, as it may be dependent on another parameter, e.g. shuffle, being set.

The passed value will have an effect on the reproducibility of the results returned by the function (fit, split, or any other function like k_means). random_state’s value may be:

None (default)

Use the global random state instance from numpy.random. Calling the function multiple times will reuse the same instance, and will produce different results.

An integer

Use a new random number generator seeded by the given integer. Using an int will produce the same results across different calls. However, it may be worthwhile checking that your results are stable across a number of distinct random seeds. Popular integer random seeds are 0 and 42. Integer values must be in the range [0, 2**32 - 1].

A numpy.random.RandomState instance

Use the provided random state, only affecting other users of that same random state instance. Calling the function multiple times will reuse the same instance, and will produce different results.

utils.check_random_state is used internally to validate the input random_state and return a RandomState instance.
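The three kinds of random_state values can be sketched with utils.check_random_state directly. The seeds used here are arbitrary.

```python
import numpy as np
from sklearn.utils import check_random_state

# An integer: a new generator is seeded each time, so draws repeat.
rng_a = check_random_state(42)
rng_b = check_random_state(42)
same_draws = np.array_equal(rng_a.rand(3), rng_b.rand(3))

# A RandomState instance: returned as-is, so its state advances across uses.
shared = np.random.RandomState(0)
passthrough = check_random_state(shared) is shared
first = check_random_state(shared).rand(3)
second = check_random_state(shared).rand(3)
draws_differ = not np.array_equal(first, second)

# None: the global numpy.random state is used.
global_rng = check_random_state(None)
```

This mirrors the text: an integer seed reproduces the same results on every call, while a shared instance or None produces different results on successive calls.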

For more details on how to control the randomness of scikit-learn objects and avoid common pitfalls, you may refer to Controlling randomness.

scoring#

Depending on the object, can specify:

  • the score function to be maximized (usually by cross validation),

  • the multiple score functions to be reported,

  • the score function to be used to check early stopping, or

  • for visualization related objects, the score function to output or plot

The score function can be a string accepted by metrics.get_scorer or a callable scorer, not to be confused with an evaluation metric, as the latter have a more diverse API. scoring may also be set to None, in which case the estimator’s score method is used. See The scoring parameter: defining model evaluation rules in the User Guide.

Where multiple metrics can be evaluated, scoring may be given either as a list of unique strings, a dictionary with names as keys and callables as values or a callable that returns a dictionary. Note that this does not specify which score function is to be maximized, and another parameter such as refit may be used for this purpose.

The scoring parameter is validated and interpreted using metrics.check_scoring.

verbose#

Logging is not handled very consistently in Scikit-learn at present, but when it is provided as an option, the verbose parameter is usually available to choose no logging (set to False). Any True value should enable some logging, but larger integers (e.g. above 10) may be needed for full verbosity. Verbose logs are usually printed to Standard Output. Estimators should not produce any output on Standard Output with the default verbose setting.

warm_start#

When fitting an estimator repeatedly on the same dataset, but formultiple parameter values (such as to find the value maximizingperformance as ingrid search), it may be possibleto reuse aspects of the model learned from the previous parameter value,saving time. Whenwarm_start is true, the existingfittedmodelattributes are used to initialize the new modelin a subsequent call tofit.

Note that this is only applicable for some models and some parameters, and even some orders of parameter values. In general, there is an interaction between warm_start and the parameter controlling the number of iterations of the estimator.

For estimators imported from ensemble, warm_start will interact with n_estimators or max_iter. For these models, the number of iterations, reported via len(estimators_) or n_iter_, corresponds to the total number of estimators/iterations learnt since the initialization of the model. Thus, if a model was already initialized with N estimators, and fit is called with n_estimators or max_iter set to M, the model will train M - N new estimators.
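
A small sketch of this ensemble behavior, assuming a RandomForestClassifier on synthetic data: the forest is first fitted with N = 5 trees, then n_estimators is raised to M = 8 and fit is called again, so that only M - N = 3 new trees should be trained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=0)
forest.fit(X, y)
first_five = list(forest.estimators_)  # keep references to the initial trees

# Raise the target size and fit again: existing trees are kept.
forest.set_params(n_estimators=8)
forest.fit(X, y)

n_total = len(forest.estimators_)
# The first five trees are the very same objects as before the second fit.
reused = all(a is b for a, b in zip(first_five, forest.estimators_))
```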

Other models, usually using gradient-based solvers, have a different behavior. They all expose a max_iter parameter. The reported n_iter_ corresponds to the number of iterations done during the last call to fit and will be at most max_iter. Thus, we do not consider the state of the estimator since the initialization.

partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.

There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.

Attributes#

See concept attribute.

classes_#

A list of class labels known to the classifier, mapping each label to a numerical index used in the model representation or output. For instance, the array output from predict_proba has columns aligned with classes_. For multi-output classifiers, classes_ should be a list of lists, with one class listing for each output. For each output, the classes should be sorted (numerically, or lexicographically for strings).

classes_ and the mapping to indices are often managed with preprocessing.LabelEncoder.
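
To illustrate the alignment between classes_ and the columns of predict_proba, here is a sketch on a toy dataset with string labels. The data and the use of LogisticRegression are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y = np.array(["cat", "cat", "bird", "bird", "dog", "dog"])

clf = LogisticRegression().fit(X, y)

# String labels are stored in lexicographic order.
class_labels = list(clf.classes_)

# Column j of predict_proba holds the probability of classes_[j],
# so indexing classes_ with the argmax recovers the predicted label.
proba = clf.predict_proba(X)
pred_from_proba = clf.classes_[proba.argmax(axis=1)]
```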

components_#

An affine transformation matrix of shape (n_components, n_features) used in many linear transformers where n_components is the number of output features and n_features is the number of input features.

See also coef_ which is a similar attribute for linear predictors.
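
As one concrete case, the sketch below uses PCA (without whitening) and checks that its transform amounts to centering the data and multiplying by the transpose of components_. The centering step via mean_ is specific to PCA and is not assumed to hold for every transformer that exposes components_.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)

# One row per output feature, one column per input feature.
components_shape = pca.components_.shape

# For PCA without whitening: center the data, then project.
manual = (X - pca.mean_) @ pca.components_.T
matches = np.allclose(manual, pca.transform(X))
```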

coef_#

The weight/coefficient matrix of a generalized linear model predictor, of shape (n_features,) for binary classification and single-output regression, (n_classes, n_features) for multiclass classification and (n_targets, n_features) for multi-output regression. Note this does not include the intercept (or bias) term, which is stored in intercept_.

When available, feature_importances_ is not usually provided as well, but can be calculated as the norm of each feature’s entry in coef_.

See also components_ which is a similar attribute for linear transformers.
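
The shapes described above can be checked with a short sketch. It covers single-output regression, multi-output regression, and multiclass classification, using LinearRegression and LogisticRegression on illustrative datasets chosen here as assumptions.

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Single-output regression: coef_ has shape (n_features,).
X1, y1 = make_regression(n_samples=50, n_features=4, random_state=0)
single_shape = LinearRegression().fit(X1, y1).coef_.shape

# Multi-output regression: coef_ has shape (n_targets, n_features).
X2, y2 = make_regression(n_samples=50, n_features=4, n_targets=2, random_state=0)
multi_shape = LinearRegression().fit(X2, y2).coef_.shape

# Multiclass classification: coef_ has shape (n_classes, n_features).
X3, y3 = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X3, y3)
multiclass_shape = clf.coef_.shape

# The intercept is stored separately, one value per class here.
intercept_shape = clf.intercept_.shape
```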

embedding_#

An embedding of the training data in manifold learning estimators, with shape (n_samples, n_components), identical to the output of fit_transform. See also labels_.
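
A minimal sketch, assuming Isomap on a synthetic S-curve, showing that the array returned by fit_transform and the stored embedding_ attribute hold the same values.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

X, _ = make_s_curve(n_samples=100, random_state=0)

iso = Isomap(n_neighbors=10, n_components=2)
Xt = iso.fit_transform(X)

# The embedding of the training data is kept on the fitted estimator.
same_values = np.array_equal(Xt, iso.embedding_)
embedding_shape = iso.embedding_.shape
```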

n_iter_#

The number of iterations actually performed when fitting an iterative estimator that may stop upon convergence. See also max_iter.

feature_importances_#

A vector of shape (n_features,) available in some predictors to provide a relative measure of the importance of each feature in the predictions of the model.
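
For instance, tree-based ensembles expose this attribute. The sketch below assumes a RandomForestClassifier on synthetic data; the normalization of the importances to sum to one is a property of scikit-learn's tree-based models and is not implied for every estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# One non-negative importance value per input feature.
importances = forest.feature_importances_
# Feature indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]
```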

labels_#

A vector containing a cluster label for each sample of the training data in clusterers, identical to the output of fit_predict. See also embedding_.
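
A short sketch, assuming KMeans on synthetic blobs, showing that fit_predict returns the same labels that are stored in labels_ after fitting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
returned_labels = km.fit_predict(X)

# The fitted clusterer keeps one label per training sample.
same_labels = np.array_equal(returned_labels, km.labels_)
```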

Data and sample properties#

See concept sample property.

groups#

Used in cross-validation routines to identify samples that are correlated. Each value is an identifier such that, in a supporting CV splitter, samples from some groups value may not appear in both a training set and its corresponding test set. See Cross-validation iterators for grouped data.
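
As an illustration, the sketch below uses GroupKFold on a toy dataset with three hypothetical group identifiers and records, for each split, which groups appear on both sides. The expectation is that no group is ever shared between a training set and its test set.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["a", "a", "b", "b", "c", "c"])  # illustrative identifiers

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=groups):
    # Groups present in both the training and the test indices of this split.
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(shared)
```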

sample_weight#

A weight for each data point. Intuitively, if all weights are integers, using them in an estimator or a scorer is like duplicating each data point as many times as the weight value. Weights can also be specified as floats, and can have the same effect as above, as many estimators and scorers are scale invariant. For example, weights [1, 2, 3] would be equivalent to weights [0.1, 0.2, 0.3] as they differ by a constant factor of 10. Note however that several estimators are not invariant to the scale of weights.
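
The two properties above can be sketched with an ordinary least squares model, chosen here as an assumption because it is deterministic and scale invariant with respect to weights. The snippet compares fitting with integer weights, fitting on explicitly repeated rows, and fitting with the same weights divided by 10.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(6, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=6)
weights = np.array([1, 2, 3, 1, 2, 3])

weighted = LinearRegression().fit(X, y, sample_weight=weights)

# Repeat each row as many times as its integer weight.
X_rep = np.repeat(X, weights, axis=0)
y_rep = np.repeat(y, weights)
repeated = LinearRegression().fit(X_rep, y_rep)

# Rescale all weights by a constant factor.
rescaled = LinearRegression().fit(X, y, sample_weight=0.1 * weights)

same_as_repeated = np.allclose(weighted.coef_, repeated.coef_) and np.allclose(
    weighted.intercept_, repeated.intercept_
)
same_when_rescaled = np.allclose(weighted.coef_, rescaled.coef_)
```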

sample_weight can be either an argument of the estimator’s fit method for model training or a parameter of a scorer for model evaluation. These callables are said to consume the sample weights while other components of scikit-learn can route the weights to the underlying estimators or scorers (see Metadata Routing).

Weighting samples can be useful in several contexts. For instance, if the training data is not uniformly sampled from the target population, the sampling bias can be corrected by weighting the training data points based on the inverse probability of their selection for training (e.g. inverse propensity weighting).

Some model hyper-parameters are expressed in terms of a discrete number of data points in a region of the feature space. When fitting with sample weights, a count of data points is often automatically converted to a sum of their weights, but this is not always the case. Please refer to the model docstring for details.

In classification, weights can also be specified for all samples belonging to a given target class with the class_weight estimator parameter. If both sample_weight and class_weight are provided, the final weight assigned to a sample is the product of the two.
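
A sketch of the product rule, assuming a DecisionTreeClassifier with a fixed random_state on synthetic data: fitting with both class_weight and sample_weight is compared against fitting with a single sample_weight equal to the product of the two. The class weights and the per-sample weights are arbitrary illustrative values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
rng = np.random.RandomState(0)
sample_weight = rng.uniform(0.5, 2.0, size=y.shape[0])
class_weight = {0: 1.0, 1: 3.0}

# Both kinds of weights are given to the estimator.
tree_a = DecisionTreeClassifier(
    max_depth=3, class_weight=class_weight, random_state=0
).fit(X, y, sample_weight=sample_weight)

# The class weights are folded into the sample weights by hand.
per_sample_class_weight = np.where(y == 1, 3.0, 1.0)
tree_b = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    X, y, sample_weight=sample_weight * per_sample_class_weight
)

same_probabilities = np.allclose(tree_a.predict_proba(X), tree_b.predict_proba(X))
```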

At the time of writing (version 1.8), not all scikit-learn estimators correctly implement the weight-repetition equivalence property. The #16298 meta issue tracks ongoing work to detect and fix remaining discrepancies.

Furthermore, some estimators have a stochastic fit method. For instance, cluster.KMeans depends on a random initialization, bagging models randomly resample from the training data, etc. In this case, the sample weight-repetition equivalence property described above does not hold exactly. However, it should hold at least in expectation over the randomness of the fitting procedure.

X#

Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix (see rectangular). When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Xt#

Shorthand for “transformed X”.

y#
Y#

Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction. The notation may be uppercase to denote that it is a matrix, representing multi-output targets, for instance; but usually we use y and sometimes do so even when multiple outputs are assumed.