Statistical learning theory

From Wikipedia, the free encyclopedia

Framework for machine learning

This article is about statistical learning in machine learning. For its use in psychology, see Statistical learning in language acquisition.

See also: Computational learning theory

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis.^[1]^[2]^[3] It deals with the statistical inference problem of finding a predictive function based on data, and it has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics.

Introduction

The goals of learning are understanding and prediction. Learning falls into many categories, including supervised learning, unsupervised learning, online learning, and reinforcement learning. From the perspective of statistical learning theory, supervised learning is best understood.^[4] Supervised learning involves learning from a training set of data. Every point in the training set is an input–output pair, where the input maps to an output. The learning problem consists of inferring the function that maps between the input and the output, such that the learned function can be used to predict the output from future input.

Depending on the type of output, supervised learning problems are either problems of regression or problems of classification. If the output takes a continuous range of values, it is a regression problem. Using Ohm's law as an example, a regression could be performed with voltage as input and current as output. The regression would find the functional relationship between voltage and current to be governed by the resistance $R$, such that $V=IR$ (equivalently, $I=V/R$). Classification problems are those for which the output is an element of a discrete set of labels. Classification is very common in machine learning applications. In facial recognition, for instance, a picture of a person's face would be the input, and the output label would be that person's name. The input would be represented by a large multidimensional vector whose elements represent pixels in the picture.
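The Ohm's law regression can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the measurements are made up, the resistor is taken to be exactly 2 ohms with no noise, and the helper name `fit_through_origin` is hypothetical. With voltage as input and current as output, a linear model through the origin has slope $1/R$.

```python
# Hypothetical, noiseless measurements across a 2-ohm resistor (illustrative data).
voltages = [1.0, 2.0, 3.0, 4.0]   # input, in volts
currents = [0.5, 1.0, 1.5, 2.0]   # output, in amperes

def fit_through_origin(xs, ys):
    """Least-squares slope for the model y = slope * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

slope = fit_through_origin(voltages, currents)  # estimate of 1/R
resistance = 1.0 / slope                        # recovered R, in ohms
print(slope, resistance)
```

With real, noisy measurements the recovered resistance would only approximate the true value.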

After a function has been learned from the training set, it is validated on a test set, that is, on data that did not appear in the training set.
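A minimal sketch of this train-then-test procedure follows. The data values, the fixed split point, and the helper names `fit_slope` and `mean_squared_error` are all assumptions chosen for illustration; in practice the split is usually randomized.

```python
def fit_slope(pairs):
    """Least-squares slope for the model y = slope * x (line through the origin)."""
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)

def mean_squared_error(slope, pairs):
    """Average squared prediction error of y = slope * x on the given pairs."""
    return sum((y - slope * x) ** 2 for x, y in pairs) / len(pairs)

# Hypothetical noisy (input, output) measurements, made up for illustration.
data = [(1.0, 0.9), (2.0, 2.1), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1), (6.0, 5.8)]

# Hold out the last two points; they play no part in fitting.
train, test = data[:4], data[4:]

slope = fit_slope(train)                        # learned from the training set only
train_error = mean_squared_error(slope, train)
test_error = mean_squared_error(slope, test)    # error on unseen data
print(slope, train_error, test_error)
```

The test error is the quantity of interest, since it estimates how the learned function behaves on future input.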

Formal description

Take $X$ to be the vector space of all possible inputs, and $Y$ to be the vector space of all possible outputs. Statistical learning theory takes the perspective that there is some unknown probability distribution over the product space $Z=X\times Y$, i.e. there exists some unknown $p(z)=p(\mathbf {x} ,y)$. The training set is made up of $n$ samples from this probability distribution, and is notated $S=\{(\mathbf {x} _{1},y_{1}),\dots ,(\mathbf {x} _{n},y_{n})\}=\{\mathbf {z} _{1},\dots ,\mathbf {z} _{n}\}$. Every $\mathbf {x} _{i}$ is an input vector from the training data, and $y_{i}$ is the output that corresponds to it.

In this formalism, the inference problem consists of finding a function $f:X\to Y$ such that $f(\mathbf {x} )\sim y$. Let ${\mathcal {H}}$ be a space of functions $f:X\to Y$ called the hypothesis space. The hypothesis space is the space of functions the algorithm will search through. Let $V(f(\mathbf {x} ),y)$ be the loss function, a metric for the difference between the predicted value $f(\mathbf {x} )$ and the actual value $y$. The expected risk is defined to be $I[f]=\int _{X\times Y}V(f(\mathbf {x} ),y)\,p(\mathbf {x} ,y)\,d\mathbf {x} \,dy$. The target function, the best possible function $f$ that can be chosen, is given by the $f$ that satisfies $f=\mathop {\operatorname {argmin} } _{h\in {\mathcal {H}}}I[h]$.

Because the probability distribution $p(\mathbf {x} ,y)$ is unknown, a proxy measure for the expected risk must be used. This measure is based on the training set, a sample from this unknown probability distribution. It is called the empirical risk: $I_{S}[f]={\frac {1}{n}}\sum _{i=1}^{n}V(f(\mathbf {x} _{i}),y_{i})$. A learning algorithm that chooses the function $f_{S}$ that minimizes the empirical risk is called empirical risk minimization.
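Empirical risk minimization can be sketched over a small, finite hypothesis space. Everything concrete here is an assumption made for illustration: the candidate slopes, the three-point sample, the square loss, and the function names `empirical_risk` and `erm`. Real hypothesis spaces are typically infinite and are searched by optimization rather than enumeration.

```python
def square_loss(prediction, y):
    """V(f(x), y) = (y - f(x))^2."""
    return (y - prediction) ** 2

def empirical_risk(f, sample, loss):
    """I_S[f]: average loss of f over the n training pairs."""
    return sum(loss(f(x), y) for x, y in sample) / len(sample)

def erm(hypotheses, sample, loss):
    """Return the hypothesis with the smallest empirical risk."""
    return min(hypotheses, key=lambda f: empirical_risk(f, sample, loss))

# Hypothetical hypothesis space: lines through the origin with a few candidate slopes.
slopes = [0.0, 0.5, 1.0, 1.5, 2.0]
hypotheses = [lambda x, w=w: w * x for w in slopes]

# Hypothetical training sample S of (x, y) pairs.
sample = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]

f_S = erm(hypotheses, sample, square_loss)
print(f_S(2.0), empirical_risk(f_S, sample, square_loss))
```

On this sample the slope-1 line is selected, because every other candidate incurs a larger average square loss.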

Loss functions

The choice of loss function is a determining factor in the function $f_{S}$ that will be chosen by the learning algorithm. The loss function also affects the convergence rate of an algorithm. It is important for the loss function to be convex.^[5]

Different loss functions are used depending on whether the problem is one of regression or one of classification.

Regression

The most common loss function for regression is the square loss function (also known as the L2-norm). This familiar loss function is used in ordinary least squares regression. The form is: $V(f(\mathbf {x} ),y)=(y-f(\mathbf {x} ))^{2}$.

The absolute value loss (also known as the L1-norm) is also sometimes used: $V(f(\mathbf {x} ),y)=|y-f(\mathbf {x} )|$.
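The practical difference between the two losses can be seen on a small example. The sketch below restricts attention to constant predictors, for which the square loss is minimized by the sample mean and the absolute value loss by the sample median. The data, including the deliberate outlier, and the helper name `average_loss` are assumptions made for illustration.

```python
from statistics import mean, median

def square_loss(prediction, y):
    """V(f(x), y) = (y - f(x))^2."""
    return (y - prediction) ** 2

def absolute_loss(prediction, y):
    """V(f(x), y) = |y - f(x)|."""
    return abs(y - prediction)

def average_loss(loss, constant, ys):
    """Empirical risk of the constant predictor f(x) = constant."""
    return sum(loss(constant, y) for y in ys) / len(ys)

# Hypothetical outputs; the last value is a deliberate outlier.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

best_for_square = mean(ys)      # pulled strongly toward the outlier
best_for_absolute = median(ys)  # largely unaffected by the outlier
print(best_for_square, best_for_absolute)
```

Because the square loss penalizes large residuals quadratically, a single outlier moves its minimizer much further than it moves the minimizer of the absolute value loss.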

Classification

Main article: Statistical classification

In some sense the 0-1 indicator function is the most natural loss function for classification. It takes the value 0 if the predicted output is the same as the actual output, and it takes the value 1 if the predicted output is different from the actual output. For binary classification with $Y=\{-1,1\}$, this is: $V(f(\mathbf {x} ),y)=\theta (-yf(\mathbf {x} ))$, where $\theta$ is the Heaviside step function.
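A short sketch of the 0-1 loss for labels in $\{-1,1\}$ follows. The convention $\theta(0)=1$ is an assumption made here (conventions for the step function at zero differ), and the threshold classifier and sample are hypothetical.

```python
def heaviside(t):
    """Step function; theta(0) = 1 is an assumed convention."""
    return 1 if t >= 0 else 0

def zero_one_loss(prediction, y):
    """V(f(x), y) = theta(-y * f(x)) for labels y in {-1, 1}."""
    return heaviside(-y * prediction)

def error_rate(f, sample):
    """Empirical risk under the 0-1 loss: the fraction of misclassified points."""
    return sum(zero_one_loss(f(x), y) for x, y in sample) / len(sample)

def classify(x):
    """Hypothetical classifier: predict +1 when the input exceeds 2, else -1."""
    return 1 if x > 2 else -1

# Hypothetical labelled sample; the third point is misclassified by `classify`.
sample = [(1.0, -1), (3.0, 1), (2.5, -1), (0.5, -1)]
print(error_rate(classify, sample))
```

When the prediction and the label share a sign, the product $yf(\mathbf{x})$ is positive and the loss is 0; otherwise the loss is 1.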

Regularization

This image represents an example of overfitting in machine learning. The red dots represent training set data. The green line represents the true functional relationship, while the blue line shows the learned function, which has been overfitted to the training set data.

In machine learning problems, a major problem that arises is that of overfitting. Because learning is a prediction problem, the goal is not to find a function that most closely fits the (previously observed) data, but to find one that will most accurately predict output from future input. Empirical risk minimization runs this risk of overfitting: finding a function that matches the data exactly but does not predict future output well.

Overfitting is symptomatic of unstable solutions; a small perturbation in the training set data would cause a large variation in the learned function. It can be shown that if the stability for the solution can be guaranteed, generalization and consistency are guaranteed as well.^[6]^[7] Regularization can solve the overfitting problem and give the problem stability.

Regularization can be accomplished by restricting the hypothesis space ${\mathcal {H}}$. A common example would be restricting ${\mathcal {H}}$ to linear functions: this can be seen as a reduction to the standard problem of linear regression. ${\mathcal {H}}$ could also be restricted to polynomials of degree $p$, exponentials, or bounded functions on L1. Restriction of the hypothesis space avoids overfitting because the form of the potential functions is limited, so the algorithm cannot choose a function that gives empirical risk arbitrarily close to zero.

One example of regularization is Tikhonov regularization. This consists of minimizing ${\frac {1}{n}}\sum _{i=1}^{n}V(f(\mathbf {x} _{i}),y_{i})+\gamma \left\|f\right\|_{\mathcal {H}}^{2}$, where $\gamma$ is a fixed and positive parameter, the regularization parameter. Tikhonov regularization ensures existence, uniqueness, and stability of the solution.^[8]
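As a hedged illustration, consider the simplest special case: one-dimensional inputs, hypotheses of the form $f(x)=wx$, the square loss, and the assumption $\|f\|_{\mathcal {H}}=|w|$. Setting the derivative of the regularized objective with respect to $w$ to zero gives $w=\sum _{i}x_{i}y_{i}/(\sum _{i}x_{i}^{2}+n\gamma )$. The function names and data below are hypothetical.

```python
def regularized_risk(w, xs, ys, gamma):
    """Empirical square-loss risk of f(x) = w * x plus the penalty gamma * w**2."""
    n = len(xs)
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / n + gamma * w ** 2

def tikhonov_slope(xs, ys, gamma):
    """Closed-form minimizer of regularized_risk over w (assumes a nonzero denominator)."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * gamma)

# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for gamma in (0.0, 1.0, 10.0):
    # Larger gamma shrinks the learned slope toward zero.
    print(gamma, tikhonov_slope(xs, ys, gamma))
```

With $\gamma =0$ the unregularized least-squares slope is recovered; increasing $\gamma$ trades some fit on the training data for a smaller-norm, more stable solution.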

Bounding empirical risk

Consider a binary classifier $f:{\mathcal {X}}\to \{0,1\}$, and write ${\hat {R}}(f)$ for its empirical risk and $R(f)$ for its true (expected) risk. We can apply Hoeffding's inequality to bound the probability that the empirical risk deviates from the true risk by at least $\epsilon$; the bound decays like the tail of a sub-Gaussian distribution: $\mathbb {P} (|{\hat {R}}(f)-R(f)|\geq \epsilon )\leq 2e^{-2n\epsilon ^{2}}$. But generally, when we do empirical risk minimization, we are not given a classifier; we must choose it. Therefore, a more useful result is to bound the probability of the supremum of the difference over the whole class ${\mathcal {F}}$: $\mathbb {P} {\bigg (}\sup _{f\in {\mathcal {F}}}|{\hat {R}}(f)-R(f)|\geq \epsilon {\bigg )}\leq 2S({\mathcal {F}},n)e^{-n\epsilon ^{2}/8}\approx n^{d}e^{-n\epsilon ^{2}/8}$, where $S({\mathcal {F}},n)$ is the shattering number, $n$ is the number of samples in the dataset, and $d$ denotes the VC dimension of ${\mathcal {F}}$ (the shattering number grows at most polynomially in $n$, with degree $d$). The exponential term comes from Hoeffding's inequality, but there is an extra cost for taking the supremum over the whole class, which is the shattering number.
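The single-classifier bound can be turned into a sample-size estimate: requiring $2e^{-2n\epsilon ^{2}}\leq \delta$ and solving for $n$ gives $n\geq \ln(2/\delta )/(2\epsilon ^{2})$. The sketch below evaluates both quantities. The function names and the chosen values of $\epsilon$ and $\delta$ are assumptions for illustration, and the calculation applies only to a classifier fixed before seeing the data, not to the uniform bound over a class.

```python
import math

def hoeffding_bound(n, eps):
    """Upper bound 2 * exp(-2 * n * eps^2) on P(|R_hat(f) - R(f)| >= eps) for a fixed f."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)

def samples_needed(eps, delta):
    """Smallest n for which the Hoeffding bound is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Illustrative target: deviation below 0.1 with probability at least 95%.
eps, delta = 0.1, 0.05
n = samples_needed(eps, delta)
print(n, hoeffding_bound(n, eps))
```

Halving $\epsilon$ roughly quadruples the required sample size, since $n$ scales with $1/\epsilon ^{2}$.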

References

[1] Vapnik, Vladimir N. (1995). The Nature of Statistical Learning Theory. New York: Springer. ISBN 978-1-475-72440-0.
[2] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. New York, NY: Springer. ISBN 978-0-387-84857-0.
[3] Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2012). Foundations of Machine Learning. US, Massachusetts: MIT Press. ISBN 9780262018258.
[4] Tomaso Poggio, Lorenzo Rosasco, et al. Statistical Learning Theory and Applications, 2012, Class 1.
[5] Rosasco, Lorenzo; De Vito, Ernesto; Caponnetto, Andrea; Piana, Michele; Verri, Alessandro (2004-05-01). "Are Loss Functions All the Same?". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104. hdl:11380/4590. ISSN 0899-7667. PMID 15070510.
[6] Vapnik, V.N. and Chervonenkis, A.Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications. Vol 16, pp 264-280.
[7] Mukherjee, S., Niyogi, P., Poggio, T., and Rifkin, R. 2006. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics. Vol 25, pp 161-193.
[8] Tomaso Poggio, Lorenzo Rosasco, et al. Statistical Learning Theory and Applications, 2012, Class 2.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Statistical_learning_theory&oldid=1296238628"
