Utilities for Developers#
Scikit-learn contains a number of utilities to help with development. These are located in sklearn.utils, and include tools in a number of categories. All the following functions and classes are in the module sklearn.utils.
Warning
These utilities are meant to be used internally within the scikit-learn package. They are not guaranteed to be stable between versions of scikit-learn. Backports, in particular, will be removed as the scikit-learn dependencies evolve.
Validation Tools#
These are tools used to check and validate input. When you write a function which accepts arrays, matrices, or sparse matrices as arguments, the following should be used when applicable.
assert_all_finite: Throw an error if array contains NaNs or Infs.
as_float_array: convert input to an array of floats. If a sparse matrix is passed, a sparse matrix will be returned.
check_array: check that input is a 2D array, raise error on sparse matrices. Allowed sparse matrix formats can be given optionally, as well as allowing 1D or N-dimensional arrays. Calls assert_all_finite by default.
check_X_y: check that X and y have consistent length, calls check_array on X, and column_or_1d on y. For multilabel classification or multitarget regression, specify multi_output=True, in which case check_array will be called on y.
indexable: check that all input arrays have consistent length and can be sliced or indexed using safe_index. This is used to validate input for cross-validation.
validation.check_memory: check that input is joblib.Memory-like, which means that it can be converted into a joblib.Memory instance (typically a str denoting the cache directory) or has the same interface.
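The following is a minimal sketch of how a few of these validators behave on small inputs. The helper describe_check and the sample data are hypothetical and exist only for illustration; exact error messages vary between scikit-learn versions, so only the exception type is recorded.

```python
import numpy as np
from scipy import sparse
from sklearn.utils import as_float_array, check_array, check_X_y


def describe_check(X, **kwargs):
    """Hypothetical helper: return the validated input, or the name of the
    exception that check_array raised."""
    try:
        return check_array(X, **kwargs)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__


X = [[1, 2], [3, 4], [5, 6]]  # list of lists, converted to a 2D ndarray
y = [0, 1, 0]

X_checked = check_array(X)           # 2D ndarray
X_float = as_float_array(X_checked)  # same values, floating-point dtype
X_valid, y_valid = check_X_y(X, y)   # validates X and y together

# NaN is rejected by default, and sparse input must be allowed explicitly.
nan_result = describe_check([[1.0, np.nan]])
sparse_result = describe_check(sparse.csr_matrix(X))
sparse_ok = describe_check(sparse.csr_matrix(X), accept_sparse="csr")
```

In this sketch nan_result and sparse_result hold exception names, while sparse_ok holds the validated sparse matrix because the CSR format was explicitly allowed.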
If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in unit tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function. The function check_random_state, below, can then be used to create a random number generator object.
check_random_state: create a np.random.RandomState object from a parameter random_state.
If random_state is None or np.random, then a randomly-initialized RandomState object is returned.
If random_state is an integer, then it is used to seed a new RandomState object.
If random_state is a RandomState object, then it is passed through.
For example:
>>> from sklearn.utils import check_random_state
>>> random_state = 0
>>> random_state = check_random_state(random_state)
>>> random_state.rand(4)
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318])
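As a further sketch, a function that needs randomness can accept a random_state argument and resolve it with check_random_state instead of calling numpy.random directly. The function noisy_copy and its scale parameter are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.utils import check_random_state


def noisy_copy(X, scale=0.1, random_state=None):
    """Hypothetical helper: return X plus Gaussian noise."""
    # Accepts None, an int seed, or an existing RandomState instance.
    rng = check_random_state(random_state)
    X = np.asarray(X, dtype=float)
    return X + scale * rng.normal(size=X.shape)


X = np.zeros((2, 3))
a = noisy_copy(X, random_state=42)
b = noisy_copy(X, random_state=42)  # same seed, same noise

# An existing RandomState object is passed through unchanged.
rng = np.random.RandomState(0)
same_object = check_random_state(rng) is rng
```

Passing the same integer seed twice reproduces the same noise, which is what makes unit tests repeatable.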
When developing your own scikit-learn compatible estimator, the following helpers are available.
validation.check_is_fitted: check that the estimator has been fitted before calling transform, predict, or similar methods. This helper makes it possible to raise a standardized error message across estimators.
validation.has_fit_parameter: check that a given parameter is supported in the fit method of a given estimator.
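The sketch below shows one way these two helpers might be used in a small custom estimator. The class MeanCenterer is hypothetical; it assumes a scikit-learn version in which check_is_fitted can be called without listing attribute names, and it relies on the convention that fitted attributes end with an underscore.

```python
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_array, check_is_fitted, has_fit_parameter


class MeanCenterer(BaseEstimator):
    """Hypothetical transformer that subtracts the per-column mean."""

    def fit(self, X, y=None):
        X = check_array(X)
        self.mean_ = X.mean(axis=0)  # trailing underscore marks a fitted attribute
        return self

    def transform(self, X):
        check_is_fitted(self)  # raises NotFittedError if fit was not called
        return check_array(X) - self.mean_


centerer = MeanCenterer()
try:
    centerer.transform([[1.0, 2.0]])
    raised = False
except NotFittedError:
    raised = True

data = [[1.0, 2.0], [3.0, 4.0]]
centered = centerer.fit(data).transform(data)

# has_fit_parameter inspects the signature of the estimator's fit method.
accepts_y = has_fit_parameter(centerer, "y")
accepts_weights = has_fit_parameter(centerer, "sample_weight")
```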
Efficient Linear Algebra & Array Operations#
extmath.randomized_range_finder: construct an orthonormal matrix whose range approximates the range of the input. This is used in extmath.randomized_svd, below.
extmath.randomized_svd: compute the k-truncated randomized SVD. This algorithm finds an approximate truncated singular value decomposition, using randomization to speed up the computations. It is particularly fast on large matrices from which you wish to extract only a small number of components.
arrayfuncs.cholesky_delete: (used in lars_path) Remove an item from a Cholesky factorization.
arrayfuncs.min_pos: (used in sklearn.linear_model.least_angle) Find the minimum of the positive values within an array.
extmath.fast_logdet: efficiently compute the log of the determinant of a matrix.
extmath.density: efficiently compute the density of a sparse vector.
extmath.safe_sparse_dot: dot product which will correctly handle scipy.sparse inputs. If the inputs are dense, it is equivalent to numpy.dot.
extmath.weighted_mode: an extension of scipy.stats.mode which allows each item to have a real-valued weight.
resample: Resample arrays or sparse matrices in a consistent way. Used in shuffle, below.
shuffle: Shuffle arrays or sparse matrices in a consistent way. Used in k_means.
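A minimal sketch of a few of these routines on small synthetic data follows. The matrix sizes, seeds, and variable names are arbitrary choices for illustration. The input to randomized_svd is built to have exactly rank 5, so in this particular case the truncated result can be compared directly with numpy.linalg.svd.

```python
import numpy as np
from scipy import sparse
from sklearn.utils import shuffle
from sklearn.utils.extmath import fast_logdet, randomized_svd, safe_sparse_dot

rng = np.random.RandomState(0)

# Build an exactly rank-5 matrix so a 5-component truncation loses nothing.
M = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 50))
U, s, Vt = randomized_svd(M, n_components=5, random_state=0)
s_reference = np.linalg.svd(M, compute_uv=False)[:5]

# safe_sparse_dot accepts sparse or dense operands.
A = sparse.random(20, 30, density=0.1, format="csr", random_state=0)
B = rng.normal(size=(30, 4))
product = safe_sparse_dot(A, B)

# For a diagonal matrix, the log-determinant is the sum of log(diagonal).
logdet = fast_logdet(np.diag([1.0, 2.0, 3.0]))

# shuffle applies the same permutation to every array passed in.
X = np.arange(10).reshape(5, 2)
y = X[:, 0].copy()
X_shuffled, y_shuffled = shuffle(X, y, random_state=0)
```

Because shuffle permutes X and y together, each row of X_shuffled still lines up with its original label in y_shuffled.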
Efficient Random Sampling#
random.sample_without_replacement: implements efficient algorithms for sampling n_samples integers from a population of size n_population without replacement.
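As a small sketch, drawing 10 distinct indices from a population of 1000 could look like this. The population size, sample size, and seed are arbitrary values chosen for illustration.

```python
from sklearn.utils.random import sample_without_replacement

# Arguments are n_population, then n_samples; the seed makes the draw repeatable.
indices = sample_without_replacement(1000, 10, random_state=0)

# Every index is distinct and lies in [0, n_population).
n_unique = len(set(indices.tolist()))
```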
Efficient Routines for Sparse Matrices#
The sklearn.utils.sparsefuncs Cython module hosts compiled extensions to efficiently process scipy.sparse data.
sparsefuncs.mean_variance_axis: compute the means and variances along a specified axis of a CSR matrix. Used for normalizing the tolerance stopping criterion in KMeans.
sparsefuncs_fast.inplace_csr_row_normalize_l1 and sparsefuncs_fast.inplace_csr_row_normalize_l2: can be used to normalize individual sparse samples to unit L1 or L2 norm as done in Normalizer.
sparsefuncs.inplace_csr_column_scale: can be used to multiply the columns of a CSR matrix by a constant scale (one scale per column). Used for scaling features to unit standard deviation in StandardScaler.
sort_graph_by_row_values: can be used to sort a CSR sparse matrix such that each row is stored with increasing values. This is useful to improve efficiency when using precomputed sparse distance matrices in estimators relying on nearest neighbors graph.
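The sketch below applies two of these routines to a tiny CSR matrix and keeps a dense copy for comparison. The matrix values and the scale vector are arbitrary illustration data.

```python
import numpy as np
from scipy import sparse
from sklearn.utils.sparsefuncs import inplace_csr_column_scale, mean_variance_axis

X_dense = np.array([[1.0, 0.0, 2.0],
                    [0.0, 3.0, 0.0],
                    [4.0, 0.0, 6.0]])
X = sparse.csr_matrix(X_dense)

# Per-column means and variances, computed without densifying X.
means, variances = mean_variance_axis(X, axis=0)

# Multiply each column by its own factor; X is modified in place.
scale = np.array([2.0, 0.5, 1.0])
inplace_csr_column_scale(X, scale)
```

In this example the results agree with X_dense.mean(axis=0), X_dense.var(axis=0), and X_dense * scale respectively.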
Graph Routines#
graph.single_source_shortest_path_length: (not currently used in scikit-learn) Return the shortest path from a single source to all connected nodes on a graph. Code is adapted from networkx. If this is ever needed again, it would be far faster to use a single iteration of Dijkstra’s algorithm from graph_shortest_path.
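To illustrate the Dijkstra-based alternative mentioned above, here is a sketch that uses scipy.sparse.csgraph.dijkstra rather than any scikit-learn helper. That substitution, the small adjacency matrix, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

# Adjacency matrix: nodes 0-1-2 form a chain, node 3 is isolated.
graph = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0]])

# unweighted=True counts hops instead of summing edge weights.
distances = dijkstra(graph, directed=False, indices=0, unweighted=True)

# Keep only the nodes reachable from the source (finite distance).
reachable = {int(node): int(d) for node, d in enumerate(distances) if np.isfinite(d)}
```

Unreachable nodes receive an infinite distance, so they are filtered out when building the dictionary of path lengths.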
Testing Functions#
discovery.all_estimators: returns a list of all estimators in scikit-learn to test for consistent behavior and interfaces.
discovery.all_displays: returns a list of all displays (related to plotting API) in scikit-learn to test for consistent behavior and interfaces.
discovery.all_functions: returns a list of all functions in scikit-learn to test for consistent behavior and interfaces.
Multiclass and multilabel utility function#
multiclass.is_multilabel: Helper function to check if the task is a multi-label classification one.
multiclass.unique_labels: Helper function to extract an ordered array of unique labels from different formats of target.
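As a minimal sketch with made-up targets, the two helpers can be exercised like this:

```python
import numpy as np
from sklearn.utils.multiclass import is_multilabel, unique_labels

# Labels from several targets are merged into one sorted array.
labels = unique_labels([3, 5, 5, 7], [1, 3])

# A binary indicator matrix with more than one column is multilabel.
indicator = np.array([[1, 0, 1], [0, 1, 0]])
indicator_is_multilabel = is_multilabel(indicator)

# A flat list of class labels is not.
flat_is_multilabel = is_multilabel([0, 1, 0, 1])
```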
Helper Functions#
gen_even_slices: generator to create n-packs of slices going up to n. Used in dict_learning and k_means.
gen_batches: generator to create slices containing batch size elements from 0 to n.
safe_mask: Helper function to convert a mask to the format expected by the numpy array or scipy sparse matrix on which to use it (sparse matrices support integer indices only while numpy arrays support both boolean masks and integer indices).
safe_sqr: Helper function for unified squaring (**2) of array-likes, matrices and sparse matrices.
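The following sketch exercises these helpers on small inputs. The sizes and the sparse column vector are arbitrary illustration data.

```python
import numpy as np
from scipy import sparse
from sklearn.utils import gen_batches, gen_even_slices, safe_mask, safe_sqr

# Slices of at most 3 elements covering the range 0..7.
batches = list(gen_batches(7, 3))

# Three slices of near-equal size covering the range 0..10.
even = list(gen_even_slices(10, 3))

X = sparse.csr_matrix(np.array([[1.0], [2.0], [3.0], [4.0], [5.0]]))

# Convert a boolean mask into a form that can index the sparse matrix.
mask = safe_mask(X, np.array([False, True, True, False, True]))
selected = X[mask].toarray().ravel()

# Element-wise square that also works on sparse input.
squared = safe_sqr(X).toarray().ravel()
```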
Hash Functions#
murmurhash3_32 provides a python wrapper for the MurmurHash3_x86_32 C++ non cryptographic hash function. This hash function is suitable for implementing lookup tables, Bloom filters, Count Min Sketch, feature hashing and implicitly defined sparse random projections:
>>> from sklearn.utils import murmurhash3_32
>>> murmurhash3_32("some feature", seed=0) == -384616559
True
>>> murmurhash3_32("some feature", seed=0, positive=True) == 3910350737
True
The sklearn.utils.murmurhash module can also be “cimported” from other Cython modules so as to benefit from the high performance of MurmurHash while skipping the overhead of the Python interpreter.
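To make the feature hashing use case concrete, here is a minimal sketch that maps tokens to a fixed number of buckets. The function hash_features, the bucket count, and the tokens are hypothetical; this is not how scikit-learn's own feature hashing is implemented.

```python
from sklearn.utils import murmurhash3_32


def hash_features(tokens, n_features=16, seed=0):
    """Hypothetical helper: count tokens into a fixed-size vector."""
    vector = [0] * n_features
    for token in tokens:
        # positive=True gives an unsigned hash, so the modulo is a valid index.
        index = murmurhash3_32(token, seed=seed, positive=True) % n_features
        vector[index] += 1
    return vector


vector = hash_features(["red", "green", "red", "blue"])
```

The same token always lands in the same bucket for a given seed, so no vocabulary has to be stored; distinct tokens may collide, which is the usual trade-off of the hashing trick.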
Warnings and Exceptions#
deprecated: Decorator to mark a function or class as deprecated.
ConvergenceWarning: Custom warning to catch convergence problems. Used in sklearn.covariance.graphical_lasso.
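A small sketch of both utilities follows. The functions old_solver and new_solver are hypothetical. The warning category raised by deprecated has differed between scikit-learn versions, so the example records whichever categories are emitted instead of assuming one.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.utils import deprecated


@deprecated("Use new_solver instead.")
def old_solver(x):
    """Hypothetical legacy function kept only for backwards compatibility."""
    return x * 2


def new_solver(x, max_iter=1):
    """Hypothetical solver that warns when it stops before converging."""
    if max_iter < 5:
        warnings.warn("Stopped before convergence.", ConvergenceWarning)
    return x * 2


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeats
    old_result = old_solver(3)
    new_result = new_solver(3)

categories = [w.category for w in caught]
```

Callers who want to treat non-convergence as an error can combine this with warnings.simplefilter("error", ConvergenceWarning).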