
In machine learning, supervised learning (SL) is a paradigm in which an algorithm learns to map input data to a specific output based on example input-output pairs. This process involves training a statistical model using labeled data, meaning each piece of input data is provided with the correct output. For instance, if you want a model to identify cats in images, supervised learning would involve feeding it many images of cats (inputs) that are explicitly labeled "cat" (outputs).
The goal of supervised learning is for the trained model to accurately predict the output for new, unseen data.[1] This requires the algorithm to effectively generalize from the training examples, a quality measured by its generalization error. Supervised learning is commonly used for tasks like classification (predicting a category, e.g., spam or not spam) and regression (predicting a continuous value, e.g., house prices).
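As a minimal sketch of learning from labeled input-output pairs, the following uses a 1-nearest-neighbor rule, chosen here only because it is short. The feature values and labels are made up for illustration.

```python
# Labeled training pairs: (feature vector, label). The numbers are illustrative only.
train = [
    ((1.0, 1.2), "cat"),
    ((0.9, 1.0), "cat"),
    ((3.1, 2.9), "dog"),
    ((3.0, 3.3), "dog"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def predict(x, examples):
    """Return the label of the training example closest to x (1-nearest neighbor)."""
    _, label = min(examples, key=lambda pair: squared_distance(pair[0], x))
    return label

# Predict the label of an input that does not appear in the training set.
prediction = predict((1.1, 0.9), train)
```

Here `predict` plays the role of the trained model: it is built from labeled examples and then applied to an unseen input.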
To solve a given problem of supervised learning, the following steps must be performed:
A wide range of supervised learning algorithms are available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).
There are four major issues to consider in supervised learning:
A first issue is the tradeoff between bias and variance.[2] Imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for a particular input $x$ if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for $x$. A learning algorithm has high variance for a particular input $x$ if it predicts different output values when trained on different training sets. The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm.[3] Generally, there is a tradeoff between bias and variance. A learning algorithm with low bias must be "flexible" so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).
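The thought experiment with several training sets can be simulated directly. The sketch below is illustrative only: the true function, noise level, sample size, and the two learners (a constant predictor and a 1-nearest-neighbor regressor) are all assumptions chosen to make the contrast visible.

```python
import random
import statistics

def true_function(x):
    # Assumed "true" function for the simulation.
    return x * x

def sample_training_set(rng, n=20, noise_sd=0.3):
    """Draw n noisy (x, y) pairs with x uniform on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_function(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def fit_constant(data):
    """Inflexible learner: always predicts the mean training output."""
    mean_y = statistics.fmean(y for _, y in data)
    return lambda x: mean_y

def fit_nearest_neighbor(data):
    """Flexible learner: predicts the output of the closest training input."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def bias_and_variance(fit, x0, trials=500, seed=0):
    """Train on many independent training sets and summarize predictions at x0."""
    rng = random.Random(seed)
    preds = [fit(sample_training_set(rng))(x0) for _ in range(trials)]
    bias = statistics.fmean(preds) - true_function(x0)
    variance = statistics.pvariance(preds)
    return bias, variance

x0 = 0.9
bias_const, var_const = bias_and_variance(fit_constant, x0)
bias_nn, var_nn = bias_and_variance(fit_nearest_neighbor, x0)
```

Under these assumptions the constant predictor is systematically wrong at `x0` (high bias) but barely changes between training sets (low variance), while the nearest-neighbor predictor shows the opposite pattern.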
The second issue is the amount of training data available relative to the complexity of the "true" function (classifier or regression function). If the true function is simple, then an "inflexible" learning algorithm with high bias and low variance will be able to learn it from a small amount of data. But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function can only be learned from a large amount of training data paired with a "flexible" learning algorithm with low bias and high variance.
A third issue is the dimensionality of the input space. If the input feature vectors have large dimensions, learning the function can be difficult even if the true function only depends on a small number of those features. This is because the many "extra" dimensions can confuse the learning algorithm and cause it to have high variance. Hence, input data of large dimensions typically requires tuning the classifier to have low variance and high bias. In practice, if the engineer can manually remove irrelevant features from the input data, it will likely improve the accuracy of the learned function. In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones. This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.
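One simple form of feature selection is a correlation filter: rank each feature by the absolute correlation between its values and the target, and keep the top few. The sketch below is one illustrative instance of the idea, not a method the text prescribes; the data are made up so that only the first feature is relevant.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def select_features(rows, targets, k):
    """Return the indices of the k features most correlated with the target."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, targets)), j))
    ranked = sorted(scores, reverse=True)
    return sorted(j for _, j in ranked[:k])

# Illustrative data: the target is 2 * feature 0; features 1 and 2 are irrelevant.
rows = [(1, 1, 2), (2, -1, 0), (3, 1, 1), (4, -1, 1), (5, 1, 0), (6, -1, 2)]
targets = [2, 4, 6, 8, 10, 12]
selected = select_features(rows, targets, k=1)
```

A filter like this only detects linear, one-feature-at-a-time relationships, so it can miss features that matter only in combination.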
A fourth issue is the degree of noise in the desired output values (the supervisory target variables). If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples. Attempting to fit the data too carefully leads to overfitting. You can overfit even when there are no measurement errors (stochastic noise) if the function you are trying to learn is too complex for your learning model. In such a situation, the part of the target function that cannot be modeled "corrupts" your training data; this phenomenon has been called deterministic noise. When either type of noise is present, it is better to go with a higher bias, lower variance estimator.
In practice, there are several approaches to alleviate noise in the output values, such as early stopping to prevent overfitting, as well as detecting and removing the noisy training examples prior to training the supervised learning algorithm. Several algorithms identify noisy training examples, and removing the suspected noisy examples prior to training has decreased generalization error with statistical significance.[4][5]
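Early stopping is commonly implemented by monitoring the loss on held-out validation data and halting once it stops improving. The sketch below assumes the per-epoch validation losses are already available as a list; the function name and the `patience` parameter are illustrative choices, not part of any particular library.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the epoch with the best validation loss, stopping the scan
    once `patience` consecutive epochs fail to improve on it."""
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stalled; stop training here
    return best_epoch

# Illustrative validation losses: they improve, then stall at epochs 3 and 4.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.6]
stop_at = early_stopping_epoch(losses, patience=2)
```

With a small `patience` the scan stops at the first stall and never sees the later improvement at the last epoch, which shows the tradeoff this parameter controls.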
Other factors to consider when choosing and applying a learning algorithm include the following:
When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross-validation). Tuning the performance of a learning algorithm can be very time-consuming. Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.
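Comparing learning algorithms by cross-validation can be sketched as follows. The fold assignment (example `i` goes to fold `i % k`), the squared-error measure, and the two candidate learners are assumptions made for illustration.

```python
def k_fold_splits(data, k):
    """Yield (training, held_out) pairs; example i is held out in fold i % k."""
    for fold in range(k):
        held_out = [ex for i, ex in enumerate(data) if i % k == fold]
        training = [ex for i, ex in enumerate(data) if i % k != fold]
        yield training, held_out

def cross_validated_error(fit, data, k=5):
    """Average held-out mean squared error of the learner `fit` over k folds."""
    fold_errors = []
    for training, held_out in k_fold_splits(data, k):
        g = fit(training)
        errors = [(y - g(x)) ** 2 for x, y in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)

def fit_constant(training):
    """Candidate 1: always predict the mean training output."""
    mean_y = sum(y for _, y in training) / len(training)
    return lambda x: mean_y

def fit_nearest_neighbor(training):
    """Candidate 2: predict the output of the closest training input."""
    return lambda x: min(training, key=lambda p: abs(p[0] - x))[1]

# Illustrative noise-free data from y = 2x.
data = [(float(x), 2.0 * x) for x in range(10)]
cv_constant = cross_validated_error(fit_constant, data)
cv_nearest = cross_validated_error(fit_nearest_neighbor, data)
best = min([("constant", cv_constant), ("nearest neighbor", cv_nearest)],
           key=lambda t: t[1])[0]
```

Each learner is always evaluated on examples it was not trained on, so the comparison reflects generalization rather than fit to the training data.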
The most widely used learning algorithms are:
Given a set of $N$ training examples of the form $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ such that $x_i$ is the feature vector of the $i$-th example and $y_i$ is its label (i.e., class), a learning algorithm seeks a function $g: X \to Y$, where $X$ is the input space and $Y$ is the output space. The function $g$ is an element of some space of possible functions $G$, usually called the hypothesis space. It is sometimes convenient to represent $g$ using a scoring function $f: X \times Y \to \mathbb{R}$ such that $g$ is defined as returning the $y$ value that gives the highest score: $g(x) = \arg\max_{y} f(x, y)$. Let $F$ denote the space of scoring functions.
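The construction of a predictor from a scoring function, returning the output value with the highest score, can be sketched in a few lines. The scoring function below (negative squared distance to a per-class prototype) and the prototype values are hypothetical and serve only as an example.

```python
def make_classifier(score, labels):
    """Build g(x) = argmax over y in labels of score(x, y)."""
    def g(x):
        return max(labels, key=lambda y: score(x, y))
    return g

# Hypothetical prototypes for a one-dimensional feature,
# e.g. the number of suspicious words in a message.
prototypes = {"spam": 8.0, "not spam": 2.0}

def score(x, y):
    """Higher score when x is closer to the prototype of class y."""
    return -(x - prototypes[y]) ** 2

g = make_classifier(score, labels=list(prototypes))
```

For example, `g(7.0)` scores "spam" at -1.0 and "not spam" at -25.0, so it returns "spam".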
Although $G$ and $F$ can be any space of functions, many learning algorithms are probabilistic models where $g$ takes the form of a conditional probability model $g(x) = \arg\max_{y} P(y \mid x)$, or $f$ takes the form of a joint probability model $f(x, y) = P(x, y)$. For example, naive Bayes and linear discriminant analysis are joint probability models, whereas logistic regression is a conditional probability model.
There are two basic approaches to choosing $f$ or $g$: empirical risk minimization and structural risk minimization.[6] Empirical risk minimization seeks the function that best fits the training data. Structural risk minimization includes a penalty function that controls the bias/variance tradeoff.
In both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs, $(x_i, y_i)$. In order to measure how well a function fits the training data, a loss function $L: Y \times Y \to \mathbb{R}^{\geq 0}$ is defined. For training example $(x_i, y_i)$, the loss of predicting the value $\hat{y}$ is $L(y_i, \hat{y})$.
The risk $R(g)$ of function $g$ is defined as the expected loss of $g$. This can be estimated from the training data as
$$R_{emp}(g) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, g(x_i)).$$
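The empirical estimate of the risk is the average loss over the training examples. A minimal sketch, with two common loss functions and a made-up data set:

```python
def squared_loss(y, y_hat):
    """Loss often used for regression."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """Loss often used for classification: 1 for a mistake, 0 otherwise."""
    return 0.0 if y == y_hat else 1.0

def empirical_risk(g, examples, loss):
    """Average loss of predictor g over a list of (x, y) training examples."""
    return sum(loss(y, g(x)) for x, y in examples) / len(examples)

# Illustrative data and candidate function g(x) = 2x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]
risk = empirical_risk(lambda x: 2.0 * x, examples, squared_loss)
```

Here the candidate fits the first two examples exactly and misses the third by 1, so the empirical risk under squared loss is 1/3.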
In empirical risk minimization, the supervised learning algorithm seeks the function $g$ that minimizes the empirical risk $R_{emp}(g)$. Hence, a supervised learning algorithm can be constructed by applying an optimization algorithm to find $g$.
When $g$ is a conditional probability distribution $P(y \mid x)$ and the loss function is the negative log likelihood, $L(y, \hat{y}) = -\log P(y \mid x)$, then empirical risk minimization is equivalent to maximum likelihood estimation.
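The equivalence follows in a few steps. In the sketch below, $P_g(y \mid x)$ denotes the conditional distribution represented by a candidate $g$ (notation introduced here for clarity), and the training pairs are assumed independent:

```latex
\begin{aligned}
\hat{g}
  &= \arg\min_{g} \frac{1}{N} \sum_{i=1}^{N} -\log P_g(y_i \mid x_i)
     && \text{(empirical risk with negative log likelihood loss)} \\
  &= \arg\max_{g} \sum_{i=1}^{N} \log P_g(y_i \mid x_i)
     && \text{(drop the positive factor } 1/N \text{ and flip the sign)} \\
  &= \arg\max_{g} \prod_{i=1}^{N} P_g(y_i \mid x_i)
     && \text{(the logarithm is monotonically increasing)}
\end{aligned}
```

The last expression is the conditional likelihood of the training labels, so its maximizer is the maximum likelihood estimate.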
When $G$ contains many candidate functions or the training set is not sufficiently large, empirical risk minimization leads to high variance and poor generalization. The learning algorithm is able to memorize the training examples without generalizing well (overfitting).
Structural risk minimization seeks to prevent overfitting by incorporating a regularization penalty into the optimization. The regularization penalty can be viewed as implementing a form of Occam's razor that prefers simpler functions over more complex ones.
A wide variety of penalties have been employed that correspond to different definitions of complexity. For example, consider the case where the function $g$ is a linear function of the form
$$g(x) = \sum_{j=1}^{d} \beta_j x_j.$$
A popular regularization penalty is $\sum_{j} \beta_j^2$, which is the squared Euclidean norm of the weights, also known as the $L_2$ norm. Other norms include the $L_1$ norm, $\sum_{j} |\beta_j|$, and the $L_0$ "norm", which is the number of non-zero $\beta_j$s. The penalty will be denoted by $C(g)$.
The supervised learning optimization problem is to find the function $g$ that minimizes
$$J(g) = R_{emp}(g) + \lambda C(g).$$
The parameter $\lambda$ controls the bias-variance tradeoff. When $\lambda = 0$, this gives empirical risk minimization with low bias and high variance. When $\lambda$ is large, the learning algorithm will have high bias and low variance. The value of $\lambda$ can be chosen empirically via cross-validation.
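The effect of the regularization parameter can be seen in the simplest case: a one-feature linear function $g(x) = \beta x$ with squared loss and the squared-weight penalty $\beta^2$. Setting the derivative of $\frac{1}{N}\sum_i (y_i - \beta x_i)^2 + \lambda \beta^2$ to zero gives $\beta = \sum_i x_i y_i / (\sum_i x_i^2 + N\lambda)$. The sketch below evaluates this closed form on made-up data.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of (1/N) * sum((y - b*x)^2) + lam * b^2 over the slope b."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + n * lam)

# Illustrative data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# lam = 0 recovers ordinary least squares; larger lam shrinks the slope toward 0.
slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 100.0)}
```

With no penalty the slope fits the data exactly; as the penalty grows, the estimate is pulled toward zero regardless of the data, which is the higher-bias, lower-variance end of the tradeoff.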
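The complexity penalty has a Bayesian interpretation as the negative log prior probability of $g$, $-\log P(g)$, in which case $J(g)$ corresponds to the negative log posterior probability of $g$ (up to scaling and an additive constant, when the loss is a negative log likelihood).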
The training methods described above are discriminative training methods, because they seek to find a function $g$ that discriminates well between the different output values (see discriminative model). For the special case where $f(x, y) = P(x, y)$ is a joint probability distribution and the loss function is the negative log likelihood $-\sum_{i} \log P(x_i, y_i)$, a risk minimization algorithm is said to perform generative training, because $f$ can be regarded as a generative model that explains how the data were generated. Generative training algorithms are often simpler and more computationally efficient than discriminative training algorithms. In some cases, the solution can be computed in closed form, as in naive Bayes and linear discriminant analysis.
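A closed-form generative fit can be illustrated with a one-dimensional Gaussian model that shares a single variance across classes, in the spirit of linear discriminant analysis. This is a simplified sketch, not a full implementation of either named method; the data are made up, and the variance is assumed to be non-zero.

```python
import math
from collections import defaultdict

def fit_generative(examples):
    """Fit P(x, y) = P(y) * Normal(x; mean_y, variance) by maximum likelihood.
    All parameters come from counts and averages; no iterative optimization."""
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append(x)
    n = len(examples)
    priors = {y: len(v) / n for y, v in by_class.items()}
    means = {y: sum(v) / len(v) for y, v in by_class.items()}
    # Pooled (shared) variance; assumed to be greater than zero.
    variance = sum((x - means[y]) ** 2 for x, y in examples) / n

    def log_joint(x, y):
        """Scoring function f(x, y) = log P(x, y)."""
        log_density = (-0.5 * math.log(2 * math.pi * variance)
                       - (x - means[y]) ** 2 / (2 * variance))
        return math.log(priors[y]) + log_density

    def g(x):
        """Predict the class with the highest joint log probability."""
        return max(priors, key=lambda y: log_joint(x, y))

    return g, log_joint

# Illustrative one-dimensional data with two classes.
examples = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (6.0, "b"), (7.0, "b"), (8.0, "b")]
g, log_joint = fit_generative(examples)
```

The fitted `log_joint` is a scoring function of the joint-probability kind, and `g` is the classifier obtained from it by taking the highest-scoring class.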

There are several ways in which the standard supervised learning problem can be generalized: