Loss function

From Wikipedia, the free encyclopedia

Mathematical relation assigning a probability event to a cost

In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function)^[1] is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy.

In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th century.^[2] In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s.^[3] In optimal control, the loss is the penalty for failing to achieve a desired value. In financial risk management, the function is mapped to a monetary loss.

Comparison of common loss functions (MAE, SMAE, Huber loss, and Log-Cosh loss) used for regression

Examples

Regret

Main article: Regret (decision theory)

Leonard J. Savage argued that when using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret, i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been made had the underlying circumstances been known and the decision that was in fact taken before they were known.
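The regret idea can be illustrated with a small sketch. The loss table, the action and state names, and the helper functions below are hypothetical and chosen only for illustration: regret is computed as each action's loss minus the smallest loss achievable in the same state, and the minimax criterion is then applied to the regrets.

```python
# Hypothetical loss table: outer keys are actions, inner keys are states of nature.
losses = {
    "umbrella":    {"rain": 1.0, "sun": 2.0},
    "no_umbrella": {"rain": 10.0, "sun": 0.0},
}


def regret_table(losses):
    """Regret = loss of the action minus the best achievable loss in that state."""
    states = list(next(iter(losses.values())))
    best = {s: min(row[s] for row in losses.values()) for s in states}
    return {a: {s: row[s] - best[s] for s in states} for a, row in losses.items()}


def minimax_regret(losses):
    """Pick the action whose worst-case regret is smallest."""
    regrets = regret_table(losses)
    return min(regrets, key=lambda a: max(regrets[a].values()))


regrets = regret_table(losses)
choice = minimax_regret(losses)
print(regrets)
print(choice)
```

With these made-up numbers, carrying the umbrella has a worst-case regret of 2, against 9 for leaving it at home, so the minimax-regret choice is the umbrella.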

Quadratic loss function

The use of a quadratic loss function is common, for example when using least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t, then a quadratic loss function is

$$\lambda(x) = C(t-x)^{2}$$

for some constant C; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1. This is also known as the squared error loss (SEL).^[1]

Many common statistics, including t-tests, regression models, design of experiments, and much else, use least squares methods applied using linear regression theory, which is based on the quadratic loss function.

The quadratic loss function is also used in linear-quadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a quadratic form in the deviations of the variables of interest from their desired values; this approach is tractable because it results in linear first-order conditions. In the context of stochastic control, the expected value of the quadratic form is used. Because errors are squared, the quadratic loss gives large errors, and hence outliers, disproportionate weight relative to typical observations, so alternatives like the Huber, Log-Cosh and SMAE losses are used when the data has many large outliers.

Effect of using different loss functions, when the data has outliers
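A minimal sketch of how the quadratic loss compares with two of the robust alternatives named above. The Huber threshold `delta` and its 1/2 scaling follow one common convention and are an assumption here, as are the sample residual values; SMAE is omitted because the text does not define it.

```python
import math


def squared_loss(r):
    """Quadratic loss with C = 1."""
    return r * r


def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it (assumed convention)."""
    a = abs(r)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def log_cosh_loss(r):
    """Roughly r**2 / 2 near zero and |r| - log(2) for large |r|."""
    return math.log(math.cosh(r))


# Hypothetical residuals: a typical one and an outlier.
for r in (0.5, 8.0):
    print(r, squared_loss(r), huber_loss(r), log_cosh_loss(r))
```

For the outlying residual the quadratic loss grows with the square of the error, while the Huber and log-cosh losses grow only about linearly, which is why they are less sensitive to a few extreme points.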

0-1 loss function

In statistics and decision theory, a frequently used loss function is the 0-1 loss function

$$L(\hat{y}, y) = \left[\hat{y} \neq y\right]$$

using Iverson bracket notation, i.e. it evaluates to 1 when $\hat{y} \neq y$, and 0 otherwise.
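As a short illustration, the 0-1 loss averaged over a set of predictions gives the misclassification rate. The function names and the example labels below are hypothetical.

```python
def zero_one_loss(y_hat, y):
    """Iverson bracket [y_hat != y]: 1 for a wrong prediction, 0 otherwise."""
    return int(y_hat != y)


def misclassification_rate(predictions, labels):
    """Average 0-1 loss over paired predictions and true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    total = sum(zero_one_loss(p, t) for p, t in zip(predictions, labels))
    return total / len(labels)


predictions = ["cat", "dog", "dog", "cat"]
labels = ["cat", "dog", "cat", "cat"]
rate = misclassification_rate(predictions, labels)
print(rate)
```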

Constructing loss and objective functions

Expected loss

Statistics

Both frequentist and Bayesian statistical theory involve making a decision based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms.

Frequentist expected loss

We first define the expected loss in the frequentist context. It is obtained by taking the expected value with respect to the probability distribution, P_θ, of the observed data, X. This is also referred to as the risk function^[11]^[12]^[13]^[14] of the decision rule δ and the parameter θ. Here the decision rule depends on the outcome of X. The risk function is given by:

$$R(\theta, \delta) = \operatorname{E}_{\theta} L\big(\theta, \delta(X)\big) = \int_{X} L\big(\theta, \delta(x)\big)\,\mathrm{d}P_{\theta}(x).$$

Here, θ is a fixed but possibly unknown state of nature, X is a vector of observations stochastically drawn from a population, $\operatorname{E}_{\theta}$ is the expectation over all population values of X, dP_θ is a probability measure over the event space of X (parametrized by θ) and the integral is evaluated over the entire support of X.
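The risk integral can be approximated by simulation. The sketch below assumes, purely for illustration, normally distributed observations with known standard deviation, a squared error loss, and two candidate decision rules (the sample mean and the sample median); the function name `estimate_risk` and all parameter values are hypothetical.

```python
import random
import statistics


def estimate_risk(estimator, theta, n, sigma=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of R(theta, delta) under squared error loss.

    Repeatedly draws X ~ Normal(theta, sigma) of size n, applies the
    decision rule `estimator`, and averages the resulting losses.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        total += (theta - estimator(sample)) ** 2
    return total / trials


risk_mean = estimate_risk(statistics.fmean, theta=2.0, n=10)
risk_median = estimate_risk(statistics.median, theta=2.0, n=10)
print(risk_mean, risk_median)
```

Under these assumptions the exact risk of the sample mean is sigma²/n = 0.1, so the first estimate should be close to that value, and the sample median should show a somewhat larger risk.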

Bayes Risk

In a Bayesian approach, the expectation is calculated using the prior distribution π^* of the parameter θ:

$$\rho(\pi^{*}, a) = \int_{\Theta} \int_{\mathbf{X}} L(\theta, a(\mathbf{x}))\,\mathrm{d}P(\mathbf{x} \vert \theta)\,\mathrm{d}\pi^{*}(\theta) = \int_{\mathbf{X}} \int_{\Theta} L(\theta, a(\mathbf{x}))\,\mathrm{d}\pi^{*}(\theta \vert \mathbf{x})\,\mathrm{d}M(\mathbf{x})$$

where M(x) is the marginal distribution of the data, whose density m(x) is known as the predictive likelihood, in which θ has been "integrated out"; π^*(θ | x) is the posterior distribution; and the order of integration has been changed. One should then choose the action a^* which minimises this expected loss, which is referred to as the Bayes risk. In the latter equation, the inner integral over θ is known as the posterior risk, and minimising it with respect to the decision a for each x also minimises the overall Bayes risk. This optimal decision, a^*, is known as the Bayes (decision) rule: it minimises the average loss over all possible states of nature θ and over all possible (probability-weighted) data outcomes. One advantage of the Bayesian approach is that one need only choose the optimal action under the actual observed data to obtain a uniformly optimal one, whereas choosing the actual frequentist optimal decision rule, as a function of all possible observations, is a much more difficult problem. Of equal importance, though, the Bayes rule reflects consideration of loss outcomes under different states of nature, θ.
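For a discrete problem the integrals become sums, and the Bayes rule can be computed directly. The sketch below uses a hypothetical two-state, two-action, two-outcome problem; the prior, likelihood and loss values are invented for illustration.

```python
# Hypothetical prior pi*(theta), likelihood P(x | theta) and loss L(theta, a).
prior = {"healthy": 0.9, "sick": 0.1}
likelihood = {
    "healthy": {"negative": 0.95, "positive": 0.05},
    "sick":    {"negative": 0.20, "positive": 0.80},
}
loss = {
    "healthy": {"treat": 1.0, "wait": 0.0},
    "sick":    {"treat": 0.0, "wait": 10.0},
}
actions = ("treat", "wait")
outcomes = ("negative", "positive")


def marginal(x):
    """m(x): probability of observing x with theta summed out."""
    return sum(prior[t] * likelihood[t][x] for t in prior)


def posterior(x):
    """pi*(theta | x) by Bayes' theorem."""
    m = marginal(x)
    return {t: prior[t] * likelihood[t][x] / m for t in prior}


def posterior_risk(action, x):
    """Expected loss of `action` under the posterior given x."""
    post = posterior(x)
    return sum(post[t] * loss[t][action] for t in post)


def bayes_action(x):
    """Action minimising the posterior risk for the observed x."""
    return min(actions, key=lambda a: posterior_risk(a, x))


def bayes_risk():
    """Posterior risk of the Bayes action, averaged over m(x)."""
    return sum(marginal(x) * posterior_risk(bayes_action(x), x) for x in outcomes)


print(bayes_action("positive"), bayes_action("negative"), bayes_risk())
```

Only the posterior risk for the outcome actually observed is needed to pick the action; the sum over all outcomes is required only to report the overall Bayes risk.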

Examples in statistics

For a scalar parameter θ, a decision function whose output $\hat{\theta}$ is an estimate of θ, and a quadratic loss function (squared error loss) $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^{2},$ the risk function becomes the mean squared error of the estimate, $R(\theta, \hat{\theta}) = \operatorname{E}_{\theta}\left[(\theta - \hat{\theta})^{2}\right].$ An estimator found by minimizing the mean squared error estimates the posterior distribution's mean.
In density estimation, the unknown parameter is the probability density itself. The loss function is typically chosen to be a norm in an appropriate function space. For example, for the L² norm, $L(f, \hat{f}) = \|f - \hat{f}\|_{2}^{2},$ the risk function becomes the mean integrated squared error $R(f, \hat{f}) = \operatorname{E}\left(\|f - \hat{f}\|_{2}^{2}\right).$

Economic choice under uncertainty

In economics, decision-making under uncertainty is often modelled using the von Neumann–Morgenstern utility function of the uncertain variable of interest, such as end-of-period wealth. Since the value of this variable is uncertain, so is the value of the utility function; it is the expected value of utility that is maximized.
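A small sketch of expected-utility maximization over end-of-period wealth. The two lotteries, the linear and logarithmic utility functions, and the helper names are assumptions made for illustration only.

```python
import math


def expected_utility(lottery, utility):
    """lottery is a list of (probability, wealth) pairs."""
    return sum(p * utility(w) for p, w in lottery)


# Hypothetical choices: a sure 100, or a gamble with expected wealth 105.
lotteries = {
    "safe":  [(1.0, 100.0)],
    "risky": [(0.5, 50.0), (0.5, 160.0)],
}


def linear_utility(w):
    """Utility equal to wealth: only expected wealth matters."""
    return w


def log_utility(w):
    """Concave utility: spread in wealth is penalised."""
    return math.log(w)


def preferred(utility):
    """Name of the lottery with the highest expected utility."""
    return max(lotteries, key=lambda name: expected_utility(lotteries[name], utility))


print(preferred(linear_utility), preferred(log_utility))
```

With the linear utility the gamble is preferred because its expected wealth is higher, while the concave logarithmic utility prefers the sure amount.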

Decision rules

A decision rule makes a choice using an optimality criterion. Some commonly used criteria are:

Minimax: Choose the decision rule with the lowest worst loss — that is, minimize the worst-case (maximum possible) loss: ${\underset {\delta }{\operatorname {arg\,min} }}\ \max _{\theta \in \Theta }\ R(\theta ,\delta ).$
Invariance: Choose the decision rule which satisfies an invariance requirement.
Choose the decision rule with the lowest average loss (i.e. minimize the expected value of the loss function): ${\underset {\delta }{\operatorname {arg\,min} }}\operatorname {E} _{\theta \in \Theta }[R(\theta ,\delta )]={\underset {\delta }{\operatorname {arg\,min} }}\ \int _{\theta \in \Theta }R(\theta ,\delta )\,p(\theta )\,d\theta .$
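The minimax and average-loss criteria above can select different rules from the same risk function. A minimal sketch with a hypothetical risk table R(θ, δ) and a hypothetical weighting p(θ):

```python
# Hypothetical risks: outer keys are decision rules, inner keys are states theta.
risk = {
    "rule_a": {"theta1": 1.0, "theta2": 6.0},
    "rule_b": {"theta1": 4.0, "theta2": 4.0},
}
# Hypothetical weights p(theta) used for the average-loss criterion.
weights = {"theta1": 0.8, "theta2": 0.2}


def minimax_rule(risk):
    """Rule with the smallest worst-case risk over theta."""
    return min(risk, key=lambda d: max(risk[d].values()))


def average_risk_rule(risk, weights):
    """Rule with the smallest p(theta)-weighted average risk."""
    return min(risk, key=lambda d: sum(weights[t] * r for t, r in risk[d].items()))


print(minimax_rule(risk), average_risk_rule(risk, weights))
```

Here rule_b wins under minimax (worst case 4 against 6), while rule_a wins on average (2.0 against 4.0) because the state in which it performs badly is given little weight.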

Selecting a loss function

Sound statistical practice requires selecting an estimator consistent with the actual acceptable variation experienced in the context of a particular applied problem. Thus, in the applied use of loss functions, selecting which statistical method to use to model an applied problem depends on knowing the losses that will be experienced from being wrong under the problem's particular circumstances.^[15]

A common example involves estimating "location". Under typical statistical assumptions, the mean or average is the statistic for estimating location that minimizes the expected loss experienced under the squared-error loss function, while the median is the estimator that minimizes expected loss experienced under the absolute-difference loss function. Still different estimators would be optimal under other, less common circumstances.
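The link between the loss function and the resulting location estimate can be checked numerically on a sample. The data values and the grid search below are illustrative assumptions; the empirical total loss over the sample stands in for the expected loss.

```python
import statistics

# Hypothetical sample with one large value.
data = [1.0, 2.0, 3.0, 4.0, 100.0]


def total_loss(c, loss):
    """Sum of loss(x - c) over the sample for a candidate location c."""
    return sum(loss(x - c) for x in data)


def squared(e):
    return e * e


# Candidate locations 0.0, 0.1, ..., 100.0.
grid = [i / 10 for i in range(0, 1001)]

best_squared = min(grid, key=lambda c: total_loss(c, squared))
best_absolute = min(grid, key=lambda c: total_loss(c, abs))

print(best_squared, statistics.fmean(data))
print(best_absolute, statistics.median(data))
```

The squared-error minimizer coincides with the sample mean and is pulled towards the large value, while the absolute-error minimizer coincides with the sample median and is not.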

In economics, when an agent is risk neutral, the objective function is simply expressed as the expected value of a monetary quantity, such as profit, income, or end-of-period wealth. For risk-averse or risk-loving agents, loss is measured as the negative of a utility function, and the objective function to be optimized is the expected value of utility.

Other measures of cost are possible, for example mortality or morbidity in the field of public health or safety engineering.

For most optimization algorithms, it is desirable to have a loss function that is globally continuous and differentiable.

Two very commonly used loss functions are the squared loss, $L(a)=a^{2}$, and the absolute loss, $L(a)=|a|$. However, the absolute loss has the disadvantage that it is not differentiable at $a=0$. The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of $a$'s (as in $\sum_{i=1}^{n} L(a_{i})$), the final sum tends to be the result of a few particularly large a-values, rather than an expression of the average a-value.

The choice of a loss function is not arbitrary. It is very restrictive and sometimes the loss function may be characterized by its desirable properties.^[16] Among the choice principles are, for example, the requirement of completeness of the class of symmetric statistics in the case of i.i.d. observations, the principle of complete information, and some others.

W. Edwards Deming and Nassim Nicholas Taleb argue that empirical reality, not nice mathematical properties, should be the sole basis for selecting loss functions, and real losses often are not mathematically nice and are not differentiable, continuous, symmetric, etc. For example, a person who arrives before a plane gate closure can still make the plane, but a person who arrives after cannot, a discontinuity and asymmetry which makes arriving slightly late much more costly than arriving slightly early. In drug dosing, the cost of too little drug may be lack of efficacy, while the cost of too much may be tolerable toxicity, another example of asymmetry. Traffic, pipes, beams, ecologies, climates, etc. may tolerate increased load or stress with little noticeable change up to a point, then become backed up or break catastrophically. These situations, Deming and Taleb argue, are common in real-life problems, perhaps more common than classical smooth, continuous, symmetric, differentiable cases.^[17]
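The gate-closure example can be sketched as a discontinuous, asymmetric loss. All numbers below (the cost of a missed flight, the per-minute waiting cost, and the delay distribution) are hypothetical and only serve to show how such a loss shifts the best decision away from the average case.

```python
def gate_loss(minutes_after_closure, missed_flight_cost=500.0, waiting_cost_per_min=1.0):
    """Hypothetical loss: waiting is cheap, arriving after closure is very costly."""
    if minutes_after_closure > 0:  # arrived after the gate closed
        return missed_flight_cost
    return waiting_cost_per_min * (-minutes_after_closure)


# Hypothetical travel delays in minutes, with their probabilities.
delays = {0: 0.6, 10: 0.3, 30: 0.1}


def expected_loss(buffer_minutes):
    """Expected loss when planning to arrive `buffer_minutes` before closure."""
    return sum(p * gate_loss(d - buffer_minutes) for d, p in delays.items())


# Candidate buffers of 0, 5, ..., 60 minutes.
best_buffer = min(range(0, 61, 5), key=expected_loss)
print(best_buffer, expected_loss(best_buffer))
```

Although the average delay in this made-up distribution is only 6 minutes, the jump in loss at the closure time makes a 30-minute buffer the best choice among the candidates.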

References

[1] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2001). The Elements of Statistical Learning. Springer. p. 18. ISBN 0-387-95284-5.
[2] Wald, A. (1950). Statistical Decision Functions. Wiley.
[3] Cramér, H. (1930). On the mathematical theory of risk. Centraltryckeriet.
[4] Frisch, Ragnar (1969). "From utopian theory to practical applications: the case of econometrics". The Nobel Prize–Prize Lecture. Retrieved 15 February 2021.
[5] Tangian, Andranik; Gruber, Josef (1997). Constructing Scalar-Valued Objective Functions. Proceedings of the Third International Conference on Econometric Decision Models: Constructing Scalar-Valued Objective Functions, University of Hagen, held in Katholische Akademie Schwerte September 5–8, 1995. Lecture Notes in Economics and Mathematical Systems. Vol. 453. Berlin: Springer. doi:10.1007/978-3-642-48773-6. ISBN 978-3-540-63061-6.
[6] Tangian, Andranik; Gruber, Josef (2002). Constructing and Applying Objective Functions. Proceedings of the Fourth International Conference on Econometric Decision Models Constructing and Applying Objective Functions, University of Hagen, held in Haus Nordhelle, August 28–31, 2000. Lecture Notes in Economics and Mathematical Systems. Vol. 510. Berlin: Springer. doi:10.1007/978-3-642-56038-5. ISBN 978-3-540-42669-1.
[7] Tangian, Andranik (2002). "Constructing a quasi-concave quadratic objective function from interviewing a decision maker". European Journal of Operational Research. 141 (3): 608–640. doi:10.1016/S0377-2217(01)00185-0. S2CID 39623350.
[8] Tangian, Andranik (2004). "A model for ordinally constructing additive objective functions". European Journal of Operational Research. 159 (2): 476–512. doi:10.1016/S0377-2217(03)00413-2. S2CID 31019036.
[9] Tangian, Andranik (2004). "Redistribution of university budgets with respect to the status quo". European Journal of Operational Research. 157 (2): 409–428. doi:10.1016/S0377-2217(03)00271-6.
[10] Tangian, Andranik (2008). "Multi-criteria optimization of regional employment policy: A simulation analysis for Germany". Review of Urban and Regional Development. 20 (2): 103–122. doi:10.1111/j.1467-940X.2008.00144.x.
[11] Nikulin, M.S. (2001) [1994], "Risk of a statistical procedure", Encyclopedia of Mathematics, EMS Press
[12] Berger, James O. (1985). Statistical decision theory and Bayesian Analysis (2nd ed.). New York: Springer-Verlag. Bibcode:1985sdtb.book.....B. ISBN 978-0-387-96098-2. MR 0804611.
[13] DeGroot, Morris (2004) [1970]. Optimal Statistical Decisions. Wiley Classics Library. ISBN 978-0-471-68029-1. MR 2288194.
[14] Robert, Christian P. (2007). The Bayesian Choice. Springer Texts in Statistics (2nd ed.). New York: Springer. doi:10.1007/0-387-71599-1. ISBN 978-0-387-95231-4. MR 1835885.
[15] Pfanzagl, J. (1994). Parametric Statistical Theory. Berlin: Walter de Gruyter. ISBN 978-3-11-013863-4.
[16] Detailed information on mathematical principles of the loss function choice is given in Chapter 2 of the book Klebanov, B.; Rachev, Svetlozar T.; Fabozzi, Frank J. (2009). Robust and Non-Robust Models in Statistics. New York: Nova Scientific Publishers, Inc. (and references there).
[17] Deming, W. Edwards (2000). Out of the Crisis. The MIT Press. ISBN 9780262541152.
