Loss functions for classification

From Wikipedia, the free encyclopedia
Concept in machine learning
[Figure: Bayes consistent loss functions: zero-one loss (gray), Savage loss (green), logistic loss (orange), exponential loss (purple), Tangent loss (brown), square loss (blue).]

In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to).[1] Given $\mathcal{X}$ as the space of all possible inputs (usually $\mathcal{X} \subset \mathbb{R}^{d}$), and $\mathcal{Y} = \{-1, 1\}$ as the set of labels (possible outputs), a typical goal of classification algorithms is to find a function $f : \mathcal{X} \to \mathcal{Y}$ which best predicts a label $y$ for a given input $\vec{x}$.[2] However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same $\vec{x}$ to generate different $y$.[3] As a result, the goal of the learning problem is to minimize the expected loss (also known as the risk), defined as

$$I[f] = \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}), y)\, p(\vec{x}, y)\, d\vec{x}\, dy$$

where $V(f(\vec{x}), y)$ is a given loss function, and $p(\vec{x}, y)$ is the probability density function of the process that generated the data, which can equivalently be written as

$$p(\vec{x}, y) = p(y \mid \vec{x})\, p(\vec{x}).$$

Within classification, several commonly used loss functions are written solely in terms of the product of the true label $y$ and the predicted label $f(\vec{x})$. Therefore, they can be defined as functions of only one variable $\upsilon = y f(\vec{x})$, so that $V(f(\vec{x}), y) = \phi(y f(\vec{x})) = \phi(\upsilon)$ with a suitably chosen function $\phi : \mathbb{R} \to \mathbb{R}$. These are called margin-based loss functions. Choosing a margin-based loss function amounts to choosing $\phi$. Selection of a loss function within this framework impacts the optimal $f^{*}_{\phi}$ which minimizes the expected risk; see empirical risk minimization.
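For illustration, the following minimal Python sketch (using NumPy; the function names are illustrative, not from any particular library) evaluates a few common margin-based losses $\phi(\upsilon)$ at sample margins $\upsilon = y f(\vec{x})$:

```python
import numpy as np

# Margin-based losses phi(v), with v = y * f(x); names are illustrative.
def zero_one(v):
    # Heaviside form H(-v); the value at v = 0 is a convention.
    return np.where(v <= 0.0, 1.0, 0.0)

def hinge(v):
    return np.maximum(0.0, 1.0 - v)

def logistic(v):
    # normalized by 1/log(2), as in Table-I below
    return np.log1p(np.exp(-v)) / np.log(2)

def exponential(v):
    return np.exp(-v)

v = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # example margins
for phi in (zero_one, hinge, logistic, exponential):
    print(phi.__name__, phi(v))
```

Each of the hinge, logistic, and exponential losses is a convex upper bound on the zero-one loss, which is what makes them usable as surrogates in the sense discussed below.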

In the case of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,

$$\begin{aligned} I[f] &= \int_{\mathcal{X} \times \mathcal{Y}} V(f(\vec{x}), y)\, p(\vec{x}, y)\, d\vec{x}\, dy \\ &= \int_{\mathcal{X}} \int_{\mathcal{Y}} \phi(y f(\vec{x}))\, p(y \mid \vec{x})\, p(\vec{x})\, dy\, d\vec{x} \\ &= \int_{\mathcal{X}} \left[ \phi(f(\vec{x}))\, p(1 \mid \vec{x}) + \phi(-f(\vec{x}))\, p(-1 \mid \vec{x}) \right] p(\vec{x})\, d\vec{x} \\ &= \int_{\mathcal{X}} \left[ \phi(f(\vec{x}))\, p(1 \mid \vec{x}) + \phi(-f(\vec{x}))\, (1 - p(1 \mid \vec{x})) \right] p(\vec{x})\, d\vec{x} \end{aligned}$$

The second equality follows from the properties described above. The third equality follows from the fact that 1 and −1 are the only possible values for $y$, and the fourth because $p(-1 \mid \vec{x}) = 1 - p(1 \mid \vec{x})$. The term within brackets, $\left[ \phi(f(\vec{x}))\, p(1 \mid \vec{x}) + \phi(-f(\vec{x}))\, (1 - p(1 \mid \vec{x})) \right]$, is known as the conditional risk.

One can solve for the minimizer of $I[f]$ by taking the functional derivative of the last equality with respect to $f$ and setting the derivative equal to 0. This will result in the following equation

$$\frac{\partial \phi(f)}{\partial f}\, \eta + \frac{\partial \phi(-f)}{\partial f}\, (1 - \eta) = 0 \qquad (1)$$

where $\eta = p(y = 1 \mid \vec{x})$, which is also equivalent to setting the derivative of the conditional risk equal to zero.

Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for false positives and false negatives) would be the 0–1 loss function (0–1 indicator function), which takes the value 0 if the predicted classification equals the true class, and 1 if it does not. This selection is modeled by

$$V(f(\vec{x}), y) = H(-y f(\vec{x}))$$

where $H$ indicates the Heaviside step function. However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an NP-hard combinatorial optimization problem.[4] As a result, it is better to substitute loss function surrogates which are tractable for commonly used learning algorithms, as they have convenient properties such as being convex and smooth. In addition to their computational tractability, one can show that the solutions to the learning problem using these loss surrogates allow for the recovery of the actual solution to the original classification problem.[5] Some of these surrogates are described below.

In practice, the probability distribution $p(\vec{x}, y)$ is unknown. Consequently, utilizing a training set of $n$ independently and identically distributed sample points

$$S = \{ (\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n) \}$$

drawn from the data sample space, one seeks to minimize the empirical risk

$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(\vec{x}_i), y_i)$$

as a proxy for the expected risk.[3] (See statistical learning theory for a more detailed description.)
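As a concrete illustration (a sketch, not from the article; the synthetic data and the choice of a linear scorer with logistic $\phi$ are assumptions made here), the empirical risk is just an average over the training set:

```python
import numpy as np

# Empirical risk I_S[f] = (1/n) sum_i phi(y_i * f(x_i)) for a linear scorer
# f(x) = w . x; the synthetic data and logistic phi are illustrative choices.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # n = 100 sample points in R^2
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # labels in {-1, +1}
w = np.array([0.5, -0.2])                      # a candidate parameter vector

margins = y * (X @ w)                          # v_i = y_i * f(x_i)
phi = np.log1p(np.exp(-margins)) / np.log(2)   # logistic loss from Table-I
print(phi.mean())                              # I_S[f]
```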

Bayes consistency


Utilizing Bayes' theorem, it can be shown that the optimal $f^{*}_{0/1}$, i.e., the one that minimizes the expected risk associated with the zero-one loss, implements the Bayes optimal decision rule for a binary classification problem and is of the form

$$f_{0/1}^{*}(\vec{x}) = \begin{cases} 1 & \text{if } p(1 \mid \vec{x}) > p(-1 \mid \vec{x}) \\ 0 & \text{if } p(1 \mid \vec{x}) = p(-1 \mid \vec{x}) \\ -1 & \text{if } p(1 \mid \vec{x}) < p(-1 \mid \vec{x}) \end{cases}$$

A loss function is said to be classification-calibrated or Bayes consistent if its optimal $f^{*}_{\phi}$ is such that $f_{0/1}^{*}(\vec{x}) = \operatorname{sgn}(f^{*}_{\phi}(\vec{x}))$ and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function $f^{*}_{\phi}$ by directly minimizing the expected risk, without having to explicitly model the probability density functions.

For a convex margin loss $\phi(\upsilon)$, it can be shown that $\phi(\upsilon)$ is Bayes consistent if and only if it is differentiable at 0 and $\phi'(0) < 0$.[6][1] Yet, this result does not exclude the existence of non-convex Bayes consistent loss functions. A more general result states that Bayes consistent loss functions can be generated using the following formulation[7]

$$\phi(v) = C[f^{-1}(v)] + (1 - f^{-1}(v))\, C'[f^{-1}(v)] \qquad (2)$$

where $f(\eta)$, $0 \leq \eta \leq 1$, is any invertible function such that $f^{-1}(-v) = 1 - f^{-1}(v)$ and $C(\eta)$ is any differentiable strictly concave function such that $C(\eta) = C(1 - \eta)$. Table-I shows the generated Bayes consistent loss functions for some example choices of $C(\eta)$ and $f^{-1}(v)$. Note that the Savage and Tangent losses are not convex. Such non-convex loss functions have been shown to be useful in dealing with outliers in classification.[7][8] For all loss functions generated from (2), the posterior probability $p(y = 1 \mid \vec{x})$ can be found using the invertible link function as $p(y = 1 \mid \vec{x}) = \eta = f^{-1}(v)$. Loss functions for which the posterior probability can be recovered in this way using the invertible link are called proper loss functions.

Table-I

| Loss name | $\phi(v)$ | $C(\eta)$ | $f^{-1}(v)$ | $f(\eta)$ |
|---|---|---|---|---|
| Exponential | $e^{-v}$ | $2\sqrt{\eta(1-\eta)}$ | $\frac{e^{2v}}{1+e^{2v}}$ | $\frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right)$ |
| Logistic | $\frac{1}{\log(2)}\log(1+e^{-v})$ | $\frac{1}{\log(2)}[-\eta\log(\eta)-(1-\eta)\log(1-\eta)]$ | $\frac{e^{v}}{1+e^{v}}$ | $\log\left(\frac{\eta}{1-\eta}\right)$ |
| Square | $(1-v)^{2}$ | $4\eta(1-\eta)$ | $\frac{1}{2}(v+1)$ | $2\eta-1$ |
| Savage | $\frac{1}{(1+e^{v})^{2}}$ | $\eta(1-\eta)$ | $\frac{e^{v}}{1+e^{v}}$ | $\log\left(\frac{\eta}{1-\eta}\right)$ |
| Tangent | $(2\arctan(v)-1)^{2}$ | $4\eta(1-\eta)$ | $\arctan(v)+\frac{1}{2}$ | $\tan\left(\eta-\frac{1}{2}\right)$ |


The sole minimizer of the expected risk, $f^{*}_{\phi}$, associated with the above generated loss functions can be directly found from equation (1) and shown to be equal to the corresponding $f(\eta)$. This holds even for the non-convex loss functions, which means that gradient-descent-based algorithms such as gradient boosting can be used to construct the minimizer.
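For illustration, formula (2) can be evaluated numerically. The sketch below (an illustrative example, not from the article) plugs the logistic row of Table-I into (2) and checks that it reproduces the closed form $\frac{1}{\log(2)}\log(1+e^{-v})$:

```python
import numpy as np

log2 = np.log(2)

def f_inv(v):                 # logistic link f^{-1}(v) = e^v / (1 + e^v)
    return 1.0 / (1.0 + np.exp(-v))

def C(eta):                   # concave, symmetric C(eta) from the logistic row
    return (-eta * np.log(eta) - (1 - eta) * np.log(1 - eta)) / log2

def C_prime(eta):             # its derivative
    return np.log((1 - eta) / eta) / log2

v = np.linspace(-3.0, 3.0, 13)
eta = f_inv(v)
phi = C(eta) + (1 - eta) * C_prime(eta)        # formula (2)
closed_form = np.log1p(np.exp(-v)) / log2      # logistic loss from Table-I
print(np.allclose(phi, closed_form))           # True
```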

Proper loss functions, loss margin and regularization

[Figure: (red) the standard logistic loss ($\gamma = 1$, $\mu = 2$) and (blue) an increased-margin logistic loss ($\gamma = 0.2$).]

For proper loss functions, the loss margin can be defined as $\mu_{\phi} = -\frac{\phi'(0)}{\phi''(0)}$ and shown to be directly related to the regularization properties of the classifier.[9] Specifically, a loss function with a larger margin increases regularization and produces better estimates of the posterior probability. For example, the loss margin can be increased for the logistic loss by introducing a $\gamma$ parameter and writing the logistic loss as $\frac{1}{\gamma}\log(1 + e^{-\gamma v})$, where smaller $0 < \gamma < 1$ increases the margin of the loss. It is shown that this is directly equivalent to decreasing the learning rate in gradient boosting, $F_m(x) = F_{m-1}(x) + \gamma h_m(x)$, where decreasing $\gamma$ improves the regularization of the boosted classifier. The theory makes it clear that when a learning rate of $\gamma$ is used, the correct formula for retrieving the posterior probability is now $\eta = f^{-1}(\gamma F(x))$.

In conclusion, by choosing a loss function with a larger margin (smaller $\gamma$) we increase regularization and improve our estimates of the posterior probability, which in turn improves the ROC curve of the final classifier.
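As a numerical check (a sketch under the stated definitions, not from the article), the loss margin $\mu_{\phi} = -\phi'(0)/\phi''(0)$ of the $\gamma$-parametrized logistic loss can be estimated with finite differences; it comes out as $2/\gamma$, so $\gamma = 1$ gives $\mu = 2$ as in the figure above:

```python
import numpy as np

def logistic_gamma(v, gamma):
    # gamma-parametrized logistic loss (1/gamma) * log(1 + exp(-gamma * v))
    return np.log1p(np.exp(-gamma * v)) / gamma

def loss_margin(phi, h=1e-4):
    # mu_phi = -phi'(0) / phi''(0), via central finite differences at 0
    d1 = (phi(h) - phi(-h)) / (2 * h)
    d2 = (phi(h) - 2 * phi(0.0) + phi(-h)) / h ** 2
    return -d1 / d2

for gamma in (1.0, 0.5, 0.2):
    print(gamma, loss_margin(lambda v: logistic_gamma(v, gamma)))  # mu = 2/gamma
```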

Square loss


While more commonly used in regression, the square loss function can be re-written as a function $\phi(y f(\vec{x}))$ and utilized for classification. It can be generated using (2) and Table-I as follows

$$\begin{aligned}\phi(v) &= C[f^{-1}(v)] + (1 - f^{-1}(v))\, C'[f^{-1}(v)] \\ &= 4\left(\tfrac{1}{2}(v+1)\right)\left(1 - \tfrac{1}{2}(v+1)\right) + \left(1 - \tfrac{1}{2}(v+1)\right)\left(4 - 8\left(\tfrac{1}{2}(v+1)\right)\right) \\ &= (1-v)^{2}.\end{aligned}$$

The square loss function is both convex and smooth. However, the square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regard to sample complexity) than for the logistic loss or hinge loss functions.[1] In addition, functions which yield high values of $f(\vec{x})$ for some $x \in \mathcal{X}$ will perform poorly with the square loss function, since high values of $y f(\vec{x})$ will be penalized severely, regardless of whether the signs of $y$ and $f(\vec{x})$ match.

A benefit of the square loss function is that its structure lends itself to easy cross-validation of regularization parameters. Specifically, for Tikhonov regularization, one can solve for the regularization parameter using leave-one-out cross-validation in the same time as it would take to solve a single problem.[10]

The minimizer of $I[f]$ for the square loss function can be directly found from equation (1) as

$$f^{*}_{\text{Square}} = 2\eta - 1 = 2p(1 \mid \vec{x}) - 1.$$
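This closed form can be checked numerically. The sketch below (illustrative; it uses SciPy's scalar minimizer) minimizes the conditional risk $\eta\,\phi(f) + (1-\eta)\,\phi(-f)$ for the square loss and compares against $2\eta - 1$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

phi = lambda v: (1.0 - v) ** 2  # square loss

for eta in (0.1, 0.5, 0.9):
    # conditional risk at a fixed x, as a function of the score f
    risk = lambda f: eta * phi(f) + (1 - eta) * phi(-f)
    f_star = minimize_scalar(risk).x
    print(eta, f_star, 2 * eta - 1)  # numeric and closed-form minimizers agree
```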

Logistic loss


The logistic loss function can be generated using (2) and Table-I as follows

$$\begin{aligned}\phi(v) &= C[f^{-1}(v)] + \left(1 - f^{-1}(v)\right) C'[f^{-1}(v)] \\ &= \frac{1}{\log(2)}\left[\frac{-e^{v}}{1+e^{v}}\log\frac{e^{v}}{1+e^{v}} - \left(1 - \frac{e^{v}}{1+e^{v}}\right)\log\left(1 - \frac{e^{v}}{1+e^{v}}\right)\right] + \left(1 - \frac{e^{v}}{1+e^{v}}\right)\left[\frac{-1}{\log(2)}\log\left(\frac{\frac{e^{v}}{1+e^{v}}}{1 - \frac{e^{v}}{1+e^{v}}}\right)\right] \\ &= \frac{1}{\log(2)}\log(1 + e^{-v}).\end{aligned}$$

The logistic loss is convex and grows linearly for negative values, which makes it less sensitive to outliers. The logistic loss is used in the LogitBoost algorithm.

The minimizer of $I[f]$ for the logistic loss function can be directly found from equation (1) as

$$f^{*}_{\text{Logistic}} = \log\left(\frac{\eta}{1-\eta}\right) = \log\left(\frac{p(1 \mid \vec{x})}{1 - p(1 \mid \vec{x})}\right).$$

This function is undefined when $p(1 \mid \vec{x}) = 1$ or $p(1 \mid \vec{x}) = 0$ (tending toward ∞ and −∞ respectively), but predicts a smooth curve which grows when $p(1 \mid \vec{x})$ increases and equals 0 when $p(1 \mid \vec{x}) = 0.5$.[3]

It is easy to check that the logistic loss and the binary cross-entropy loss (log loss) are in fact the same (up to a multiplicative constant $\frac{1}{\log(2)}$). The cross-entropy loss is closely related to the Kullback–Leibler divergence between the empirical distribution and the predicted distribution. The cross-entropy loss is ubiquitous in modern deep neural networks.
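This equivalence is easy to verify numerically. In the sketch below (illustrative), the margin form $\log(1+e^{-yf})$ (in nats, i.e., without the $\frac{1}{\log(2)}$ constant) is compared with the cross-entropy of the sigmoid probability $p = 1/(1+e^{-f})$:

```python
import numpy as np

f = np.linspace(-4.0, 4.0, 9)    # raw scores f(x)
p = 1.0 / (1.0 + np.exp(-f))     # sigmoid link: predicted P(y = +1 | x)

for y in (+1, -1):
    margin_form = np.log1p(np.exp(-y * f))             # logistic loss in nats
    cross_entropy = -np.log(p) if y == 1 else -np.log(1 - p)
    print(y, np.allclose(margin_form, cross_entropy))  # True for both labels
```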

Exponential loss


The exponential loss function can be generated using (2) and Table-I as follows.

$$\phi(v) = C[f^{-1}(v)] + (1 - f^{-1}(v))\, C'[f^{-1}(v)] = 2\sqrt{\left(\frac{e^{2v}}{1+e^{2v}}\right)\left(1 - \frac{e^{2v}}{1+e^{2v}}\right)} + \left(1 - \frac{e^{2v}}{1+e^{2v}}\right)\left(\frac{1 - \frac{2e^{2v}}{1+e^{2v}}}{\sqrt{\frac{e^{2v}}{1+e^{2v}}\left(1 - \frac{e^{2v}}{1+e^{2v}}\right)}}\right) = e^{-v}$$

The exponential loss is convex and grows exponentially for negative values, which makes it more sensitive to outliers. An exponentially weighted 0–1 loss is used in the AdaBoost algorithm, implicitly giving rise to the exponential loss.

The minimizer of $I[f]$ for the exponential loss function can be directly found from equation (1) as

$$f^{*}_{\text{Exp}} = \frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right) = \frac{1}{2}\log\left(\frac{p(1 \mid \vec{x})}{1 - p(1 \mid \vec{x})}\right).$$
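As with the square loss, the minimizer can be confirmed numerically (an illustrative sketch using SciPy):

```python
import numpy as np
from scipy.optimize import minimize_scalar

phi = lambda v: np.exp(-v)  # exponential loss

for eta in (0.2, 0.5, 0.8):
    # conditional risk eta * phi(f) + (1 - eta) * phi(-f)
    risk = lambda f: eta * phi(f) + (1 - eta) * phi(-f)
    f_star = minimize_scalar(risk).x
    print(eta, f_star, 0.5 * np.log(eta / (1 - eta)))  # the two values agree
```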

Savage loss


The Savage loss[7] can be generated using (2) and Table-I as follows

$$\phi(v) = C[f^{-1}(v)] + (1 - f^{-1}(v))\, C'[f^{-1}(v)] = \left(\frac{e^{v}}{1+e^{v}}\right)\left(1 - \frac{e^{v}}{1+e^{v}}\right) + \left(1 - \frac{e^{v}}{1+e^{v}}\right)\left(1 - \frac{2e^{v}}{1+e^{v}}\right) = \frac{1}{(1+e^{v})^{2}}.$$

The Savage loss is quasi-convex and is bounded for large negative values, which makes it less sensitive to outliers. The Savage loss has been used in gradient boosting and the SavageBoost algorithm.

The minimizer of $I[f]$ for the Savage loss function can be directly found from equation (1) as

$$f^{*}_{\text{Savage}} = \log\left(\frac{\eta}{1-\eta}\right) = \log\left(\frac{p(1 \mid \vec{x})}{1 - p(1 \mid \vec{x})}\right).$$
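The boundedness for negative margins can be seen directly by evaluating the loss (an illustrative comparison against the exponential loss):

```python
import numpy as np

v = np.array([-10.0, -5.0, -1.0, 0.0, 1.0, 5.0])  # margins, including outliers
savage = 1.0 / (1.0 + np.exp(v)) ** 2             # bounded above by 1
exponential = np.exp(-v)                          # explodes as v -> -inf
print(savage)
print(exponential)
```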

Tangent loss


The Tangent loss[11] can be generated using (2) and Table-I as follows

$$\begin{aligned}\phi(v) &= C[f^{-1}(v)] + \left(1 - f^{-1}(v)\right) C'[f^{-1}(v)] \\ &= 4\left(\arctan(v) + \tfrac{1}{2}\right)\left(1 - \left(\arctan(v) + \tfrac{1}{2}\right)\right) + \left(1 - \left(\arctan(v) + \tfrac{1}{2}\right)\right)\left(4 - 8\left(\arctan(v) + \tfrac{1}{2}\right)\right) \\ &= \left(2\arctan(v) - 1\right)^{2}.\end{aligned}$$

The Tangent loss is quasi-convex and is bounded for large negative values, which makes it less sensitive to outliers. The Tangent loss also assigns a bounded penalty to data points that have been classified "too correctly", which can help prevent over-training on the data set. The Tangent loss has been used in gradient boosting, the TangentBoost algorithm, and Alternating Decision Forests.[12]

The minimizer of $I[f]$ for the Tangent loss function can be directly found from equation (1) as

$$f^{*}_{\text{Tangent}} = \tan\left(\eta - \frac{1}{2}\right) = \tan\left(p(1 \mid \vec{x}) - \frac{1}{2}\right).$$
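Both properties, boundedness for badly misclassified points and a bounded penalty for points classified "too correctly", can be read off numerically (an illustrative evaluation):

```python
import numpy as np

v = np.array([-50.0, -5.0, 0.0, np.tan(0.5), 5.0, 50.0])
tangent = (2.0 * np.arctan(v) - 1.0) ** 2
# Bounded on both sides: tends to (pi + 1)^2 as v -> -inf and (pi - 1)^2 as
# v -> +inf, with minimum 0 at v = tan(1/2) ~ 0.546, a finite "correct enough"
# margin beyond which the penalty rises again.
print(tangent)
```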

Hinge loss

Main article: Hinge loss

The hinge loss function is defined with $\phi(\upsilon) = \max(0, 1 - \upsilon) = [1 - \upsilon]_{+}$, where $[a]_{+} = \max(0, a)$ is the positive part function.

$$V(f(\vec{x}), y) = \max(0, 1 - y f(\vec{x})) = [1 - y f(\vec{x})]_{+}.$$

The hinge loss provides a relatively tight, convex upper bound on the 0–1 indicator function. Specifically, the hinge loss equals the 0–1 indicator function when $\operatorname{sgn}(f(\vec{x})) = y$ and $|y f(\vec{x})| \geq 1$. In addition, the empirical risk minimization of this loss is equivalent to the classical formulation for support vector machines (SVMs). Correctly classified points lying outside the margin boundaries of the support vectors are not penalized, whereas points within the margin boundaries or on the wrong side of the hyperplane are penalized linearly in proportion to their distance from the correct boundary.[4]

While the hinge loss function is both convex and continuous, it is not smooth (not differentiable) at $y f(\vec{x}) = 1$. Consequently, the hinge loss function cannot be used with gradient descent methods or stochastic gradient descent methods which rely on differentiability over the entire domain. However, the hinge loss does have a subgradient at $y f(\vec{x}) = 1$, which allows for the utilization of subgradient descent methods.[4] SVMs utilizing the hinge loss function can also be solved using quadratic programming.
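A minimal subgradient descent sketch for the (unregularized) empirical hinge risk of a linear scorer follows; the synthetic data, step size, and iteration count are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                 # inputs
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)    # linearly separable labels

w = np.zeros(2)                               # linear scorer f(x) = w . x
lr = 0.1
for _ in range(100):
    margins = y * (X @ w)
    active = margins < 1                      # points where the hinge is "on"
    # a subgradient of (1/n) * sum_i max(0, 1 - y_i * w.x_i)
    subgrad = -(y[active, None] * X[active]).sum(axis=0) / len(X)
    w -= lr * subgrad

print(w, (y * (X @ w) > 0).mean())            # learned weights, training accuracy
```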

The minimizer of $I[f]$ for the hinge loss function is

$$f^{*}_{\text{Hinge}}(\vec{x}) = \begin{cases} 1 & \text{if } p(1 \mid \vec{x}) > p(-1 \mid \vec{x}) \\ -1 & \text{if } p(1 \mid \vec{x}) < p(-1 \mid \vec{x}) \end{cases}$$

when $p(1 \mid \vec{x}) \neq 0.5$, which matches that of the 0–1 indicator function. This conclusion makes the hinge loss quite attractive, as bounds can be placed on the difference between the expected risk of the hinge loss and that of the 0–1 loss.[1] The hinge loss cannot be derived from (2) since $f^{*}_{\text{Hinge}}$ is not invertible.

Generalized smooth hinge loss


The generalized smooth hinge loss function with parameter $\alpha$ is defined as

$$f_{\alpha}^{*}(z) = \begin{cases} \frac{\alpha}{\alpha+1} - z & \text{if } z \leq 0 \\ \frac{1}{\alpha+1} z^{\alpha+1} - z + \frac{\alpha}{\alpha+1} & \text{if } 0 < z < 1 \\ 0 & \text{if } z \geq 1, \end{cases}$$

where

$$z = y f(\vec{x}).$$

It is monotonically decreasing and reaches 0 when $z = 1$.
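A direct implementation of the piecewise definition (an illustrative sketch) shows the continuity at the breakpoints and the zero plateau for $z \geq 1$:

```python
import numpy as np

def smooth_hinge(z, alpha):
    # generalized smooth hinge with parameter alpha; z = y * f(x)
    z = np.asarray(z, dtype=float)
    zc = np.clip(z, 0.0, 1.0)  # keep z**(alpha + 1) well-defined off (0, 1)
    middle = zc ** (alpha + 1) / (alpha + 1) - zc + alpha / (alpha + 1)
    return np.where(z <= 0, alpha / (alpha + 1) - z,
                    np.where(z >= 1, 0.0, middle))

z = np.linspace(-1.0, 2.0, 13)
print(smooth_hinge(z, alpha=2.0))  # monotonically decreasing, 0 for z >= 1
```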


References

1. Rosasco, L.; De Vito, E.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.
2. Shen, Yi (2005). Loss Functions for Binary Classification and Class Probability Estimation (PDF). University of Pennsylvania. Retrieved 6 December 2014.
3. Rosasco, Lorenzo; Poggio, Tomaso (2014). A Regularization Tour of Machine Learning. MIT-9.520 Lecture Notes. Manuscript.
4. Piyush, Rai (13 September 2011). Support Vector Machines (Contd.), Classification Loss Functions and Regularizers (PDF). Utah CS5350/6350: Machine Learning. Retrieved 4 May 2021.
5. Ramanan, Deva (27 February 2008). Lecture 14 (PDF). UCI ICS273A: Machine Learning. Retrieved 6 December 2014.
6. Bartlett, Peter L.; Jordan, Michael I.; McAuliffe, Jon D. (2006). "Convexity, Classification, and Risk Bounds". Journal of the American Statistical Association. 101 (473): 138–156. doi:10.1198/016214505000000907.
7. Masnadi-Shirazi, Hamed; Vasconcelos, Nuno (2008). "On the Design of Loss Functions for Classification: Theory, Robustness to Outliers, and SavageBoost" (PDF). Proceedings of the 21st International Conference on Neural Information Processing Systems (NIPS'08). Curran Associates. pp. 1049–1056. ISBN 9781605609492.
8. Leistner, C.; Saffari, A.; Roth, P. M.; Bischof, H. (September 2009). "On Robustness of On-line Boosting - A Competitive Study". 2009 IEEE 12th International Conference on Computer Vision Workshops. pp. 1362–1369. doi:10.1109/ICCVW.2009.5457451. ISBN 978-1-4244-4442-7.
9. Vasconcelos, Nuno; Masnadi-Shirazi, Hamed (2015). "A View of Margin Losses as Regularizers of Probability Estimates". Journal of Machine Learning Research. 16 (85): 2751–2795.
10. Rifkin, Ryan M.; Lippert, Ross A. (1 May 2007). Notes on Regularized Least Squares (PDF). MIT Computer Science and Artificial Intelligence Laboratory.
11. Masnadi-Shirazi, H.; Mahadevan, V.; Vasconcelos, N. (June 2010). "On the Design of Robust Classifiers for Computer Vision". 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 779–786. doi:10.1109/CVPR.2010.5540136. ISBN 978-1-4244-6984-0.
12. Schulter, S.; Wohlhart, P.; Leistner, C.; Saffari, A.; Roth, P. M.; Bischof, H. (June 2013). "Alternating Decision Forests". 2013 IEEE Conference on Computer Vision and Pattern Recognition. pp. 508–515. doi:10.1109/CVPR.2013.72. ISBN 978-0-7695-4989-7.