Frequently Asked Questions#

Here we try to give some answers to questions that regularly pop up on the mailing list.

About the project#

What is the project name (a lot of people get it wrong)?#

scikit-learn, but not scikit or SciKit nor sci-kit learn. Also not scikits.learn or scikits-learn, which were previously used.

How do you pronounce the project name?#

sy-kit learn. sci stands for science!

Why scikit?#

There are multiple scikits, which are scientific toolboxes built around SciPy. Apart from scikit-learn, another popular one is scikit-image.

Do you support PyPy?#

Due to limited maintainer resources and the small number of users, using scikit-learn with PyPy (an alternative Python implementation with a built-in just-in-time compiler) is not officially supported.

How can I obtain permission to use the images in scikit-learn for my work?#

The images contained in the scikit-learn repository and the images generated within the scikit-learn documentation can be used via the BSD 3-Clause License for your work. Citations of scikit-learn are highly encouraged and appreciated. See citing scikit-learn.

Implementation decisions#

Why is there no support for deep or reinforcement learning? Will there be such support in the future?#

Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning additionally requiring GPUs for efficient computing. However, neither of these fits within the design constraints of scikit-learn. As a result, deep learning and reinforcement learning are currently out of scope for what scikit-learn seeks to achieve.

You can find more information about the addition of GPU support at Will you add GPU support?.

Note that scikit-learn currently implements a simple multilayer perceptron in sklearn.neural_network. We will only accept bug fixes for this module. If you want to implement more complex deep learning models, please turn to popular deep learning frameworks such as tensorflow, keras, and pytorch.

Will you add graphical models or sequence prediction to scikit-learn?#

Not in the foreseeable future. scikit-learn tries to provide a unified API for the basic tasks in machine learning, with pipelines and meta-algorithms like grid search to tie everything together. The concepts, APIs, algorithms and expertise required for structured learning are different from what scikit-learn has to offer. If we started doing arbitrary structured learning, we’d need to redesign the whole package and the project would likely collapse under its own weight.

There are two projects with an API similar to scikit-learn that do structured prediction:

  • pystruct handles general structured learning (focuses on SSVMs on arbitrary graph structures with approximate inference; defines the notion of sample as an instance of the graph structure).

  • seqlearn handles sequences only (focuses on exact inference; has HMMs, but mostly for the sake of completeness; treats a feature vector as a sample and uses an offset encoding for the dependencies between feature vectors).

Why did you remove HMMs from scikit-learn?#

See Will you add graphical models or sequence prediction to scikit-learn?.

Will you add GPU support?#

Adding GPU support by default would introduce heavy hardware-specific software dependencies and existing algorithms would need to be reimplemented. This would make it both harder for the average user to install scikit-learn and harder for the developers to maintain the code.

However, since 2023, a limited but growing list of scikit-learn estimators can already run on GPUs if the input data is provided as a PyTorch or CuPy array and if scikit-learn has been configured to accept such inputs as explained in Array API support (experimental). This Array API support allows scikit-learn to run on GPUs without introducing heavy and hardware-specific software dependencies to the main package.

Most estimators that rely on NumPy for their computationally intensive operations can be considered for Array API support and therefore GPU support.

However, not all scikit-learn estimators are amenable to efficiently running on GPUs via the Array API for fundamental algorithmic reasons. For instance, tree-based models currently implemented with Cython in scikit-learn are fundamentally not array-based algorithms. Other algorithms such as k-means or k-nearest neighbors rely on array-based algorithms but are also implemented in Cython. Cython is used to manually interleave consecutive array operations to avoid introducing performance-killing memory access to large intermediate arrays: this low-level algorithmic rewrite is called “kernel fusion” and cannot be expressed via the Array API for the foreseeable future.

Adding efficient GPU support to estimators that cannot be efficiently implemented with the Array API would require designing and adopting a more flexible extension system for scikit-learn. This possibility is being considered in the following GitHub issue (under discussion):

Why do categorical variables need preprocessing in scikit-learn, compared to other tools?#

Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices of a single numeric dtype. These do not explicitly represent categorical variables at present. Thus, unlike R’s data.frames or pandas.DataFrame, we require explicit conversion of categorical features to numeric values, as discussed in Encoding categorical features. See also Column Transformer with Mixed Types for an example of working with heterogeneous (e.g. categorical and numeric) data.
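As a minimal sketch of the explicit conversion described above, the snippet below one-hot encodes a single categorical column with OneHotEncoder. The color values are made up for illustration, and other encoders may suit a given model better.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (illustrative values only).
X = np.array([["red"], ["green"], ["red"], ["blue"]], dtype=object)

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; densify it for display.
X_encoded = encoder.fit_transform(X).toarray()

# Categories are stored in sorted order, one array per input column.
categories = encoder.categories_[0].tolist()
print(categories)
print(X_encoded)
```

Each output column corresponds to one entry of `categories`, so every row contains exactly one non-zero value.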

Note that recently, HistGradientBoostingClassifier and HistGradientBoostingRegressor gained native support for categorical features through the option categorical_features="from_dtype". This option relies on inferring which columns of the data are categorical based on the pandas.CategoricalDtype and polars.datatypes.Categorical dtypes.

Does scikit-learn work natively with various types of dataframes?#

Scikit-learn has limited support for pandas.DataFrame and polars.DataFrame. Scikit-learn estimators can accept both these dataframe types as input, and scikit-learn transformers can output dataframes using the set_output API. For more details, refer to Introducing the set_output API.

However, the internal computations in scikit-learn estimators rely on numerical operations that are more efficiently performed on homogeneous data structures such as NumPy arrays or SciPy sparse matrices. As a result, most scikit-learn estimators will internally convert dataframe inputs into these homogeneous data structures. Similarly, dataframe outputs are generated from these homogeneous data structures.

Also note that ColumnTransformer makes it convenient to handle heterogeneous pandas dataframes by mapping homogeneous subsets of dataframe columns selected by name or dtype to dedicated scikit-learn transformers. Therefore ColumnTransformer is often used in the first step of scikit-learn pipelines when dealing with heterogeneous dataframes (see Pipeline: chaining estimators for more details).

See also Column Transformer with Mixed Types for an example of working with heterogeneous (e.g. categorical and numeric) data.
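A minimal sketch of this pattern follows. It routes a numeric column and a categorical column of a toy dataframe to different transformers. The column names, values and step names ("num", "cat") are hypothetical, and pandas is assumed to be installed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Heterogeneous toy dataframe: one numeric and one categorical column.
df = pd.DataFrame(
    {
        "age": [25.0, 32.0, 47.0, 51.0],
        "city": ["Paris", "Oslo", "Paris", "Rome"],
    }
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),  # scale the numeric column
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # encode categories
    ],
    sparse_threshold=0,  # always return a dense array, for easier inspection
)

# One scaled column plus one column per distinct city.
X_prepared = preprocessor.fit_transform(df)
print(X_prepared.shape)
```

In practice such a preprocessor would typically be the first step of a Pipeline whose last step is a classifier or regressor.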

Do you plan to implement transform for target y in a pipeline?#

Currently transform only works for features X in a pipeline. There’s a long-standing discussion about not being able to transform y in a pipeline. Follow the discussion in GitHub issue #4143. Meanwhile, you can check out TransformedTargetRegressor, pipegraph, and imbalanced-learn. Note that scikit-learn solved for the case where y has an invertible transformation applied before training and inverted after prediction. scikit-learn intends to solve for use cases where y should be transformed at training time and not at test time, for resampling and similar uses, as in imbalanced-learn. In general, these use cases can be solved with a custom meta estimator rather than a Pipeline.
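For the invertible case mentioned above, a minimal sketch with TransformedTargetRegressor could look as follows. The synthetic data is constructed so that log(y) is exactly linear in X. This is an assumption made only to keep the example easy to check.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data where log(y) = 0.5 * x holds exactly.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.exp(0.5 * X.ravel())

# y is log-transformed before fitting, and predictions are mapped back
# with the inverse function.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X, y)

# The prediction is on the original scale of y.
prediction = model.predict(np.array([[11.0]]))
print(prediction)
```

The wrapped regressor only ever sees the transformed target, while callers of `predict` get values on the original scale.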

Why are there so many different estimators for linear models?#

Usually, there is one classifier and one regressor per model type, e.g. GradientBoostingClassifier and GradientBoostingRegressor. Both have similar options and both have the parameter loss, which is especially useful in the regression case as it enables the estimation of the conditional mean as well as conditional quantiles.

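For linear models, there are many estimator classes which are very close to each other. Let us have a look at

  • LinearRegression, no penalty

  • Ridge, L2 penalty

  • Lasso, L1 penalty (sparse models)

  • ElasticNet, L1 + L2 penalty (less sparse models)

  • SGDRegressor with a squared error loss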

Maintainer perspective: In principle, they all do the same thing and differ only by the penalty they impose. This, however, has a large impact on the way the underlying optimization problem is solved. In the end, this amounts to the usage of different methods and tricks from linear algebra. A special case is SGDRegressor, which comprises all 4 previous models and differs by its optimization procedure. A further side effect is that the different estimators favor different data layouts (X C-contiguous or F-contiguous, sparse csr or csc). This complexity of the seemingly simple linear models is the reason for having different estimator classes for different penalties.

User perspective: First, the current design is inspired by the scientific literature where linear regression models with different regularization/penalty were given different names, e.g. ridge regression. Having different model classes with corresponding names makes it easier for users to find those regression models. Secondly, if all the 5 above mentioned linear models were unified into a single class, there would be parameters with a lot of options like the solver parameter. On top of that, there would be a lot of exclusive interactions between different parameters. For example, the possible options of the parameters solver, precompute and selection would depend on the chosen values of the penalty parameters alpha and l1_ratio.
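To make the user perspective concrete, the sketch below fits four of these linear model classes on the same synthetic data through the shared fit interface. The data and the alpha values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Synthetic regression problem where only the first feature is informative.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=50)

# One class per penalty; each exposes only the parameters relevant to it.
models = {
    "LinearRegression": LinearRegression(),           # no penalty
    "Ridge": Ridge(alpha=1.0),                        # L2 penalty
    "Lasso": Lasso(alpha=0.1),                        # L1 penalty
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),  # L1 + L2 penalty
}

# All estimators share the same fit interface and expose coef_.
coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}
for name, coef in coefs.items():
    print(name, np.round(coef, 2))
```

The L1-penalized models typically shrink the coefficients of uninformative features to exactly zero, while the unpenalized and L2-penalized ones usually do not.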

Contributing#

How can I contribute to scikit-learn?#

See Contributing. Before wanting to add a new algorithm, which is usually a major and lengthy undertaking, it is recommended to start with known issues. Please do not contact the contributors of scikit-learn directly regarding contributing to scikit-learn.

Why is my pull request not getting any attention?#

The scikit-learn review process takes a significant amount of time, and contributors should not be discouraged by a lack of activity or review on their pull request. We care a lot about getting things right the first time, as maintenance and later change comes at a high cost. We rarely release any “experimental” code, so all of our contributions will be subject to high use immediately and should be of the highest quality possible initially.

Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the reviewers and core developers are working on scikit-learn on their own time. If a review of your pull request comes slowly, it is likely because the reviewers are busy. We ask for your understanding and request that you not close your pull request or discontinue your work solely because of this reason.

What does the “spam” label for issues or pull requests mean?#

The “spam” label is an indication for reviewers that the issue or pull request may not have received sufficient effort or preparation from the author for a productive review. The maintainers are using this label as a way to deal with the increase of low value PRs and issues.

If an issue or PR was labeled as spam and simultaneously closed, the decision is final. A common reason for this happening is when people open a PR for an issue that is still under discussion. Please wait for the discussion to converge before opening a PR.

If your issue or PR was labeled as spam and not closed, the following steps can increase the chances of the label being removed:

  • follow the contribution guidelines and use the provided issue and pull request templates

  • improve the formatting and grammar of the text of the title and description of the issue/PR

  • improve the diff to remove noise and unrelated changes

  • improve the issue or pull request title to be more descriptive

  • self-review your code, especially if you used AI tools to generate it

  • refrain from opening PRs that paraphrase existing code or documentation without actually improving the correctness, clarity or educational value of the existing code or documentation.

What are the inclusion criteria for new algorithms?#

We only consider well-established algorithms for inclusion. A rule of thumb is at least 3 years since publication, 200+ citations, and wide use and usefulness. A technique that provides a clear-cut improvement (e.g. an enhanced data structure or a more efficient approximation technique) on a widely-used method will also be considered for inclusion.

From the algorithms or techniques that meet the above criteria, only those which fit well within the current API of scikit-learn, that is a fit, predict/transform interface and ordinarily having input/output that is a numpy array or sparse matrix, are accepted.

The contributor should support the importance of the proposed addition with research papers and/or implementations in other similar packages, demonstrate its usefulness via common use-cases/applications and corroborate performance improvements, if any, with benchmarks and/or plots. It is expected that the proposed algorithm should outperform the methods that are already implemented in scikit-learn at least in some areas.

Please do not propose algorithms you (your best friend, colleague or boss) created. scikit-learn is not a good venue for advertising your own work.

Inclusion of a new algorithm speeding up an existing model is easier if:

  • it does not introduce new hyper-parameters (as it makes the library more future-proof),

  • it is easy to document clearly when the contribution improves the speed and when it does not, for instance, “when n_features >> n_samples”,

  • benchmarks clearly show a speed up.

Also, note that your implementation need not be in scikit-learn to be used together with scikit-learn tools. You can implement your favorite algorithm in a scikit-learn compatible way, upload it to GitHub and let us know. We will be happy to list it under Related Projects. If you already have a package on GitHub following the scikit-learn API, you may also be interested to look at scikit-learn-contrib.
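As a rough sketch of what "scikit-learn compatible" can mean in practice, the toy classifier below implements fit and predict and is then used with cross_val_score. The class name MajorityClassifier is hypothetical. A real contribution would also need input validation and would have to pass the project's estimator checks.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import cross_val_score


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Toy estimator (hypothetical): always predicts the most frequent class."""

    def fit(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        self.classes_ = classes
        self.majority_ = classes[np.argmax(counts)]
        return self  # fit returns self by convention

    def predict(self, X):
        return np.full(len(X), self.majority_)


# Toy data: 6 samples of class 0 and 4 samples of class 1.
X = np.zeros((10, 2))
y = np.array([0] * 6 + [1] * 4)

# The custom estimator plugs into scikit-learn's model selection tools.
scores = cross_val_score(MajorityClassifier(), X, y, cv=2)
print(scores)
```

Because the class follows the fit/predict convention and inherits from BaseEstimator, tools such as cross-validation and grid search can clone and evaluate it like a built-in estimator.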

Why are you so selective on what algorithms you include in scikit-learn?#

Code comes with maintenance cost, and we need to balance the amount of code we have with the size of the team (and add to this the fact that complexity scales non-linearly with the number of features). The package relies on core developers using their free time to fix bugs, maintain code and review contributions. Any algorithm that is added needs future attention by the developers, at which point the original author might long have lost interest. See also What are the inclusion criteria for new algorithms?. For a great read about long-term maintenance issues in open-source software, look at the Executive Summary of Roads and Bridges.

Using scikit-learn#

How do I get started with scikit-learn?#

If you are new to scikit-learn, or looking to strengthen your understanding, we highly recommend the scikit-learn MOOC (Massive Open Online Course).

See our External Resources, Videos and Talks page for more details.

What’s the best way to get help on scikit-learn usage?#

  • General machine learning questions: use Cross Validated with the [machine-learning] tag.

  • scikit-learn usage questions: use Stack Overflow with the [scikit-learn] and [python] tags. You can alternatively use the mailing list.

Please make sure to include a minimal reproduction code snippet (ideally shorter than 10 lines) that highlights your problem on a toy dataset (for instance from sklearn.datasets or randomly generated with functions of numpy.random with a fixed random seed). Please remove any line of code that is not necessary to reproduce your problem.

The problem should be reproducible by simply copy-pasting your code snippet in a Python shell with scikit-learn installed. Do not forget to include the import statements. More guidance to write good reproduction code snippets can be found at: https://stackoverflow.com/help/mcve.
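For illustration, a reproduction snippet in the spirit of the guidance above might look like the following. The choice of estimator and dataset generator is arbitrary. What matters is that the snippet is short, includes its imports, and uses a fixed seed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy dataset generated with a fixed seed so that others get the same data.
X, y = make_classification(n_samples=50, n_features=4, random_state=0)

# The smallest amount of code that triggers the behavior being asked about.
clf = LogisticRegression().fit(X, y)
score = clf.score(X, y)
print(score)
```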

If your problem raises an exception that you do not understand (even after googling it), please make sure to include the full traceback that you obtain when running the reproduction script.

For bug reports or feature requests, please make use of the issue tracker on GitHub.

Warning

Please do not email any authors directly to ask for assistance, report bugs,or for any other issue related to scikit-learn.

How should I save, export or deploy estimators for production?#

See Model persistence.

How can I create a bunch object?#

Bunch objects are sometimes used as an output for functions and methods. They extend dictionaries by enabling values to be accessed by key, bunch["value_key"], or by an attribute, bunch.value_key.

They should not be used as an input. Therefore you almost never need to create a Bunch object, unless you are extending scikit-learn’s API.
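As a small illustration of the two access styles, the sketch below inspects the Bunch returned by load_iris, one of the dataset loaders bundled with scikit-learn. It only reads from the Bunch and does not create one.

```python
from sklearn.datasets import load_iris

# load_iris returns a Bunch when called with its default arguments.
iris = load_iris()

# Key access and attribute access return the very same object.
same_object = iris["data"] is iris.data
print(same_object)

# A Bunch still behaves like a dictionary, e.g. it can list its keys.
print(sorted(iris.keys()))
```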

How can I load my own datasets into a format usable by scikit-learn?#

Generally, scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that are convertible to numeric arrays such as pandas.DataFrame are also acceptable.

For more information on loading your data files into these usable data structures, please refer to loading external datasets.

How do I deal with string data (or trees, graphs…)?#

scikit-learn estimators assume you’ll feed them real-valued feature vectors.This assumption is hard-coded in pretty much all of the library.However, you can feed non-numerical inputs to estimators in several ways.

If you have text documents, you can use term frequency features; see Text feature extraction for the built-in text vectorizers. For more general feature extraction from any kind of data, see Loading features from dicts and Feature hashing.
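As a minimal sketch of term frequency features, the snippet below applies CountVectorizer, one of the built-in text vectorizers, to two made-up sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny documents (illustrative only).
documents = ["the cat sat", "the dog sat"]

# Learn the vocabulary and count how often each term occurs per document.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)  # sparse matrix of term counts

# Columns follow the alphabetical order of the learned vocabulary.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(counts.toarray())
```

The resulting matrix has one row per document and one column per term, which is the real-valued representation that estimators expect.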

Another common case is when you have non-numerical data and a custom distance (or similarity) metric on these data. Examples include strings with edit distance (a.k.a. Levenshtein distance), for instance, DNA or RNA sequences. These can be encoded as numbers, but doing so is painful and error-prone. Working with distance metrics on arbitrary data can be done in two ways.

Firstly, many estimators take precomputed distance/similarity matrices, so if the dataset is not too large, you can compute distances for all pairs of inputs. If the dataset is large, you can use feature vectors with only one “feature”, which is an index into a separate data structure, and supply a custom metric function that looks up the actual data in this data structure. For instance, to use dbscan with Levenshtein distances:

>>> import numpy as np
>>> from leven import levenshtein
>>> from sklearn.cluster import dbscan
>>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
>>> def lev_metric(x, y):
...     i, j = int(x[0]), int(y[0])  # extract indices
...     return levenshtein(data[i], data[j])
...
>>> X = np.arange(len(data)).reshape(-1, 1)
>>> X
array([[0],
       [1],
       [2]])
>>> # We need to specify algorithm='brute' as the default assumes
>>> # a continuous feature space.
>>> dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute')
(array([0, 1]), array([ 0,  0, -1]))

Note that the example above uses the third-party edit distance package leven. Similar tricks can be used, with some care, for tree kernels, graph kernels, etc.
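For small datasets, the precomputed-matrix route mentioned earlier can be sketched as follows. To stay self-contained, this version uses a plain-Python levenshtein helper written for this example instead of the leven package. It is slow and only meant for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]

# Compute the full symmetric matrix of pairwise distances.
n = len(data)
distances = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        distances[i, j] = distances[j, i] = levenshtein(data[i], data[j])

# metric="precomputed" tells DBSCAN to treat the input as a distance matrix.
labels = DBSCAN(eps=5, min_samples=2, metric="precomputed").fit(distances).labels_
print(labels)
```

The matrix has one entry per pair of samples, so its memory use grows quadratically with the dataset size. This is why the index-based custom metric shown above is preferable for large datasets.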

Why do I sometimes get a crash/freeze with n_jobs > 1 under OSX or Linux?#

Several scikit-learn tools such as GridSearchCV and cross_val_score rely internally on Python’s multiprocessing module to parallelize execution onto several Python processes by passing n_jobs > 1 as an argument.

The problem is that Python multiprocessing does a fork system call without following it with an exec system call for performance reasons. Many libraries like (some versions of) Accelerate or vecLib under OSX, (some versions of) MKL, the OpenMP runtime of GCC, NVIDIA’s CUDA (and probably many others), manage their own internal thread pool. Upon a call to fork, the thread pool state in the child process is corrupted: the thread pool believes it has many threads while only the main thread state has been forked. It is possible to change the libraries to make them detect when a fork happens and reinitialize the thread pool in that case: we did that for OpenBLAS (merged upstream in main since 0.2.10) and we contributed a patch to GCC’s OpenMP runtime (not yet reviewed).

But in the end the real culprit is Python’s multiprocessing that does fork without exec to reduce the overhead of starting and using new Python processes for parallel computing. Unfortunately this is a violation of the POSIX standard and therefore some software vendors like Apple refuse to consider the lack of fork-safety in Accelerate and vecLib as a bug.

In Python 3.4+ it is now possible to configure multiprocessing to use the "forkserver" or "spawn" start methods (instead of the default "fork") to manage the process pools. To work around this issue when using scikit-learn, you can set the JOBLIB_START_METHOD environment variable to "forkserver". However, the user should be aware that using the "forkserver" method prevents joblib.Parallel from calling functions defined interactively in a shell session.

If you have custom code that uses multiprocessing directly instead of using it via joblib, you can enable the "forkserver" mode globally for your program. Insert the following instructions in your main script:

import multiprocessing

# other imports, custom code, load data, define model...

if __name__ == "__main__":
    multiprocessing.set_start_method("forkserver")

    # call scikit-learn utils with n_jobs > 1 here

You can find more details on the new start methods in the multiprocessing documentation.

Why does my job use more cores than specified with n_jobs?#

This is because n_jobs only controls the number of jobs for routines that are parallelized with joblib, but parallel code can come from other sources:

  • some routines may be parallelized with OpenMP (for code written in C orCython),

  • scikit-learn relies a lot on numpy, which in turn may rely on numericallibraries like MKL, OpenBLAS or BLIS which can provide parallelimplementations.

For more details, please refer to our notes on parallelism.
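One possible way to cap these other sources of parallelism is the threadpoolctl package, which is installed as a scikit-learn dependency. It is not mentioned in the answer above, so treat the sketch below as an assumption about a suitable tool rather than as the project's recommendation.

```python
import numpy as np
from threadpoolctl import threadpool_limits

# A matrix product large enough to be dispatched to the BLAS library.
rng = np.random.RandomState(0)
A = rng.normal(size=(200, 200))

# Inside this block, native thread pools detected by threadpoolctl
# (BLAS and OpenMP) are limited to a single thread.
with threadpool_limits(limits=1):
    product = A @ A

print(product.shape)
```

Outside the with block, the libraries go back to their previous thread limits.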

How do I set a random_state for an entire execution?#

Please refer to Controlling randomness.
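The referenced section is the authoritative guidance. As a small illustration of one common pattern, the sketch below passes the same integer to the random_state parameter of a function on two separate calls and checks that the results match. The seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same integer seed yields the same shuffling on every call.
split_a = train_test_split(X, y, test_size=0.3, random_state=42)
split_b = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare the four returned arrays pairwise.
identical = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(identical)
```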
