Errors and residuals

From Wikipedia, the free encyclopedia

Statistics concept

In statistics and optimization, errors and residuals are two closely related and easily confused measures of the deviation of an observed value of an element of a statistical sample from its "true value" (not necessarily observable). The error of an observation is the deviation of the observed value from the true value of a quantity of interest (for example, a population mean). The residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean). The distinction is most important in regression analysis, where the concepts are sometimes called the regression errors and regression residuals and where they lead to the concept of studentized residuals. In econometrics, "errors" are also called disturbances.^[1]^[2]^[3]

Introduction

Suppose there is a series of observations from a univariate distribution and we want to estimate the mean of that distribution (the so-called location model). In this case, the errors are the deviations of the observations from the population mean, while the residuals are the deviations of the observations from the sample mean.

A statistical error (or disturbance) is the amount by which an observation differs from its expected value, the latter being based on the whole population from which the statistical unit was chosen randomly. For example, if the mean height in a population of 21-year-old men is 1.75 meters, and one randomly chosen man is 1.80 meters tall, then the "error" is 0.05 meters; if the randomly chosen man is 1.70 meters tall, then the "error" is −0.05 meters. The expected value, being the mean of the entire population, is typically unobservable, and hence the statistical error cannot be observed either.

A residual (or fitting deviation), on the other hand, is an observable estimate of the unobservable statistical error. Consider the previous example with men's heights and suppose we have a random sample of n people. The sample mean could serve as a good estimator of the population mean. Then we have:

The difference between the height of each man in the sample and the unobservable population mean is a statistical error, whereas
The difference between the height of each man in the sample and the observable sample mean is a residual.

Note that, because of the definition of the sample mean, the sum of the residuals within a random sample is necessarily zero, and thus the residuals are necessarily not independent. The statistical errors, on the other hand, are independent, and their sum within the random sample is almost surely not zero.
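The contrast between errors and residuals in the height example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population mean of 1.75 m comes from the example above, while the standard deviation of 0.07 m, the sample size of 8, and the random seed are arbitrary choices that do not appear in the text.

```python
import random

random.seed(0)

# The population mean is taken from the height example; the standard
# deviation (0.07 m) and sample size (8) are illustrative assumptions.
POPULATION_MEAN = 1.75
POPULATION_SD = 0.07

heights = [random.gauss(POPULATION_MEAN, POPULATION_SD) for _ in range(8)]
sample_mean = sum(heights) / len(heights)

# Errors need the population mean, which is known here only because
# the data are simulated.
errors = [h - POPULATION_MEAN for h in heights]

# Residuals need only the sample mean, which is always computable.
residuals = [h - sample_mean for h in heights]

print(f"sum of errors:    {sum(errors):+.6f}")
print(f"sum of residuals: {sum(residuals):+.6f}")
```

The residuals sum to zero up to floating-point rounding, whereas the sum of the errors equals n times the gap between the sample mean and the population mean and is therefore generally nonzero.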

One can standardize statistical errors (especially of a normal distribution) in a z-score (or "standard score"), and standardize residuals in a t-statistic, or more generally studentized residuals.

In univariate distributions

If we assume a normally distributed population with mean μ and standard deviation σ, and choose individuals independently, then we have

$X_{1},\dots ,X_{n}\sim N\left(\mu ,\sigma ^{2}\right)$

and the sample mean

${\overline {X}}={X_{1}+\cdots +X_{n} \over n}$

is a random variable distributed such that:

${\overline {X}}\sim N\left(\mu ,{\frac {\sigma ^{2}}{n}}\right).$

The statistical errors are then

$e_{i}=X_{i}-\mu ,$

with expected values of zero,^[4] whereas the residuals are

$r_{i}=X_{i}-{\overline {X}}.$

The sum of squares of the statistical errors, divided by σ², has a chi-squared distribution with n degrees of freedom:

${\frac {1}{\sigma ^{2}}}\sum _{i=1}^{n}e_{i}^{2}\sim \chi _{n}^{2}.$

However, this quantity is not observable as the population mean is unknown. The sum of squares of the residuals, on the other hand, is observable. The quotient of that sum by σ² has a chi-squared distribution with only n − 1 degrees of freedom:

${\frac {1}{\sigma ^{2}}}\sum _{i=1}^{n}r_{i}^{2}\sim \chi _{n-1}^{2}.$

This difference between n and n − 1 degrees of freedom results in Bessel's correction for the estimation of sample variance of a population with unknown mean and unknown variance. No correction is necessary if the population mean is known.
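The loss of one degree of freedom can be checked numerically. The following Python sketch is a simulation under assumed values (μ = 0, σ = 2, n = 5, 20,000 repetitions, fixed seed), none of which come from the text. It relies on the fact that a chi-squared variable has mean equal to its degrees of freedom, so the two averages should land near n and n − 1 respectively.

```python
import random
import statistics

random.seed(1)

# Illustrative assumptions: these parameter values are not from the text.
MU, SIGMA, N, TRIALS = 0.0, 2.0, 5, 20000

scaled_error_ss = []      # (1/sigma^2) * sum of squared errors
scaled_residual_ss = []   # (1/sigma^2) * sum of squared residuals

for _ in range(TRIALS):
    xs = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = sum(xs) / N
    scaled_error_ss.append(sum((x - MU) ** 2 for x in xs) / SIGMA**2)
    scaled_residual_ss.append(sum((x - x_bar) ** 2 for x in xs) / SIGMA**2)

mean_error_ss = statistics.fmean(scaled_error_ss)
mean_residual_ss = statistics.fmean(scaled_residual_ss)

print(f"average scaled sum of squared errors:    {mean_error_ss:.3f} (n = {N})")
print(f"average scaled sum of squared residuals: {mean_residual_ss:.3f} (n - 1 = {N - 1})")
```

Dividing the sum of squared residuals by n − 1 rather than n therefore gives an estimate of σ² that is correct on average, which is what Bessel's correction amounts to.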

Remark

It is remarkable that the sum of squares of the residuals and the sample mean can be shown to be independent of each other, using, e.g., Basu's theorem. That fact, and the normal and chi-squared distributions given above, form the basis of calculations involving the t-statistic:

$T={\frac {{\overline {X}}_{n}-\mu _{0}}{S_{n}/{\sqrt {n}}}},$

where ${\overline {X}}_{n}-\mu _{0}$ represents the error of the sample mean relative to the hypothesized mean $\mu _{0}$, $S_{n}$ is the sample standard deviation for a sample of size n (used because σ is unknown), and the denominator term $S_{n}/{\sqrt {n}}$ estimates the standard deviation of the sample mean, according to:^[5]

$\operatorname {Var} \left({\overline {X}}_{n}\right)={\frac {\sigma ^{2}}{n}}$

The probability distributions of the numerator and the denominator separately depend on the value of the unobservable population standard deviation σ, but σ appears in both the numerator and the denominator and cancels. That is fortunate because it means that even though we do not know σ, we know the probability distribution of this quotient: it has a Student's t-distribution with n − 1 degrees of freedom. We can therefore use this quotient to find a confidence interval for μ. This t-statistic can be interpreted as "the number of standard errors away from the regression line."^[6]
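A minimal Python sketch of the t-statistic and the resulting confidence interval follows. The ten height values and the hypothesized mean are hypothetical and chosen only for illustration; the critical value 2.262 is the tabulated two-sided 95% point of Student's t-distribution with 9 degrees of freedom, hard-coded here so that the example needs only the standard library.

```python
import math
import statistics

# Hypothetical sample of n = 10 heights in metres (illustrative values only).
sample = [1.72, 1.81, 1.69, 1.77, 1.75, 1.83, 1.70, 1.78, 1.74, 1.79]
MU_0 = 1.75          # hypothesized population mean (assumption)
T_CRIT_9DF = 2.262   # tabulated two-sided 95% critical value, 9 degrees of freedom

n = len(sample)
x_bar = statistics.fmean(sample)
s_n = statistics.stdev(sample)        # divides by n - 1 (Bessel's correction)
standard_error = s_n / math.sqrt(n)   # estimates sigma / sqrt(n)

# sigma never appears: the statistic is built entirely from observable quantities.
t_stat = (x_bar - MU_0) / standard_error

ci = (x_bar - T_CRIT_9DF * standard_error,
      x_bar + T_CRIT_9DF * standard_error)

print(f"t = {t_stat:.3f}")
print(f"95% confidence interval for mu: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The hypothesized mean lies inside the interval exactly when the absolute value of the t-statistic does not exceed the critical value, which is the usual correspondence between the interval and the two-sided test.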

Regressions

In regression analysis, the distinction between errors and residuals is subtle and important, and leads to the concept of studentized residuals. Given an unobservable function that relates the independent variable to the dependent variable (say, a line), the deviations of the dependent variable observations from this function are the unobservable errors. If one runs a regression on some data, then the deviations of the dependent variable observations from the fitted function are the residuals. If the linear model is applicable, a scatterplot of residuals plotted against the independent variable should be random about zero with no trend to the residuals.^[5] If the data exhibit a trend, the regression model is likely incorrect; for example, the true function may be a quadratic or higher order polynomial. If the residuals are random, with no trend, but "fan out", they exhibit a phenomenon called heteroscedasticity. If the spread of the residuals is constant, so that they do not fan out, they exhibit homoscedasticity.
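As a rough illustration of the trend diagnostic, the Python sketch below fits a straight line to noise-free quadratic data and inspects the sign of each residual. The data, the helper name `fit_line`, and the absence of noise are all artificial assumptions chosen to make the pattern obvious.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


# Hypothetical data: the true function is quadratic, but a line is fitted.
xs = [float(x) for x in range(-5, 6)]
ys = [x**2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# A crude text "plot": the sign of each residual against x.
for x, r in zip(xs, residuals):
    print(f"x = {x:+.0f}  residual = {r:+7.2f}  {'+' if r > 0 else '-'}")
```

The residuals are positive at both ends of the range and negative in the middle, a systematic pattern rather than random scatter about zero, which signals that the linear model is misspecified.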

However, a terminological difference arises in the expression mean squared error (MSE). The mean squared error of a regression is a number computed from the sum of squares of the computed residuals, and not of the unobservable errors. If that sum of squares is divided by n, the number of observations, the result is the mean of the squared residuals. Since this is a biased estimate of the variance of the unobserved errors, the bias is removed by dividing the sum of the squared residuals by df = n − p − 1 instead of n, where df is the number of degrees of freedom: n minus the number p of parameters being estimated (excluding the intercept), minus 1 for the intercept. This forms an unbiased estimate of the variance of the unobserved errors, and is called the mean squared error.^[7]
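The bias of the plain mean of squared residuals can be seen in a small simulation. The Python sketch below assumes a simple regression with one predictor (p = 1), six fixed inputs, and normally distributed errors with σ = 3; these values, the seed, and the helper name `residual_sum_of_squares` are illustrative assumptions, not part of the text.

```python
import random
import statistics

random.seed(2)

# Illustrative assumptions: true line and error standard deviation.
TRUE_INTERCEPT, TRUE_SLOPE, SIGMA = 1.0, 2.0, 3.0
XS = [float(i) for i in range(6)]   # n = 6 fixed inputs
N, P = len(XS), 1                   # one predictor, plus an intercept


def residual_sum_of_squares(xs, ys):
    """Fit y = a + b*x by least squares; return the sum of squared residuals."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    a = y_bar - b * x_bar
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


naive, corrected = [], []
for _ in range(20000):
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0.0, SIGMA) for x in XS]
    sse = residual_sum_of_squares(XS, ys)
    naive.append(sse / N)                 # mean of squared residuals (biased)
    corrected.append(sse / (N - P - 1))   # divided by the degrees of freedom

mean_naive = statistics.fmean(naive)
mean_corrected = statistics.fmean(corrected)

print(f"true error variance:        {SIGMA**2:.2f}")
print(f"average of SSE / n:         {mean_naive:.2f}")
print(f"average of SSE / (n-p-1):   {mean_corrected:.2f}")
```

On average the division by n underestimates the error variance by the factor (n − p − 1)/n, while the division by n − p − 1 recovers it.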

The mean square of the error can also be calculated when analyzing the variance of a linear regression using a technique like that used in ANOVA (they are the same because ANOVA is a type of regression): the sum of squares of the residuals (also known as the sum of squares of the error) is divided by the degrees of freedom, n − p − 1, where p is the number of parameters estimated in the model (one for each variable in the regression equation, not including the intercept). One can then also calculate the mean square of the model by dividing the sum of squares of the model by its degrees of freedom, which is just the number of parameters. Then the F value can be calculated by dividing the mean square of the model by the mean square of the error, and significance can then be determined, which is the reason for computing the mean squares in the first place.^[8]
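The ANOVA-style calculation can be sketched as follows for a simple regression with one predictor (p = 1). The six data points are hypothetical, and the sketch stops at the F value; looking up its significance would require the F distribution with p and n − p − 1 degrees of freedom, which is not implemented here.

```python
# Hypothetical data for a simple regression with one predictor (p = 1).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
n, p = len(xs), 1

# Ordinary least squares fit of y = a + b*x.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar
fitted = [a + b * x for x in xs]

# Partition of the total sum of squares.
ss_model = sum((f - y_bar) ** 2 for f in fitted)
ss_error = sum((y - f) ** 2 for y, f in zip(ys, fitted))
ss_total = sum((y - y_bar) ** 2 for y in ys)

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_model = ss_model / p
ms_error = ss_error / (n - p - 1)

f_value = ms_model / ms_error
print(f"SS model = {ss_model:.4f}, SS error = {ss_error:.4f}, SS total = {ss_total:.4f}")
print(f"F = {f_value:.2f} on {p} and {n - p - 1} degrees of freedom")
```

With an intercept in the model, the model and error sums of squares add up to the total sum of squares, which is a useful check on the arithmetic.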

However, because of the behavior of the process of regression, the distributions of residuals at different data points (of the input variable) may vary even if the errors themselves are identically distributed. Concretely, in a linear regression where the errors are identically distributed, the variability of residuals of inputs in the middle of the domain will be higher than the variability of residuals at the ends of the domain:^[9] linear regressions fit endpoints better than the middle. This is also reflected in the influence functions of various data points on the regression coefficients: endpoints have more influence.

Thus to compare residuals at different inputs, one needs to adjust the residuals by the expected variability of residuals, which is called studentizing. This is particularly important in the case of detecting outliers, where the case in question is somehow different from the others in a dataset. For example, a large residual may be expected in the middle of the domain, but considered an outlier at the end of the domain.
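One common way to carry out this adjustment is internal studentization, sketched below in Python for a simple linear regression. The text does not give a formula, so the sketch assumes the standard results that the variance of the i-th residual is σ²(1 − h_i) and that, for one predictor plus an intercept, the leverage is h_i = 1/n + (x_i − x̄)²/S_xx. The data values are hypothetical.

```python
import math

# Hypothetical data: seven equally spaced inputs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8]
n, p = len(xs), 1

# Ordinary least squares fit of y = a + b*x.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Estimate of the error standard deviation, using n - p - 1 degrees of freedom.
sigma_hat = math.sqrt(sum(r**2 for r in residuals) / (n - p - 1))

# Leverage of each point (assumed standard formula for simple regression).
leverages = [1.0 / n + (x - x_bar) ** 2 / s_xx for x in xs]

# Internally studentized residuals: each residual divided by its own
# estimated standard deviation, sigma_hat * sqrt(1 - h_i).
studentized = [r / (sigma_hat * math.sqrt(1.0 - h))
               for r, h in zip(residuals, leverages)]

for x, r, h, t in zip(xs, residuals, leverages, studentized):
    print(f"x = {x:.0f}  residual = {r:+.3f}  leverage = {h:.3f}  studentized = {t:+.3f}")
```

The leverages are largest at the two ends of the input range, so the residuals there have the smallest expected variability and are scaled up the most, which is consistent with a residual of a given size being more suspicious at the end of the domain than in the middle.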

Other uses of the word "error" in statistics

References

^[1] Kennedy, P. (2008). A Guide to Econometrics. Wiley. p. 576. ISBN 978-1-4051-8257-7. Retrieved 2022-05-13.
^[2] Wooldridge, J.M. (2019). Introductory Econometrics: A Modern Approach. Cengage Learning. p. 57. ISBN 978-1-337-67133-0. Retrieved 2022-05-13.
^[3] Das, P. (2019). Econometrics in Theory and Practice: Analysis of Cross Section, Time Series and Panel Data with Stata 15.1. Springer Singapore. p. 7. ISBN 978-981-329-019-8. Retrieved 2022-05-13.
^[4] Wetherill, G. Barrie (1981). Intermediate statistical methods. London: Chapman and Hall. ISBN 0-412-16440-X. OCLC 7779780.
^[5] Frederik Michel Dekking; Cornelis Kraaikamp; Hendrik Paul Lopuhaä; Ludolf Erwin Meester (2005-06-15). A modern introduction to probability and statistics: understanding why and how. London: Springer London. ISBN 978-1-85233-896-1. OCLC 262680588.
^[6] Peter Bruce; Andrew Bruce (2017-05-10). Practical statistics for data scientists: 50 essential concepts (First ed.). Sebastopol, CA: O'Reilly Media Inc. ISBN 978-1-4919-5296-2. OCLC 987251007.
^[7] Steel, Robert G. D.; Torrie, James H. (1960). Principles and Procedures of Statistics, with Special Reference to Biological Sciences. McGraw-Hill. p. 288.
^[8] Zelterman, Daniel (2010). Applied linear models with SAS (Online-Ausg. ed.). Cambridge: Cambridge University Press. ISBN 9780521761598.
^[9] "7.3: Types of Outliers in Linear Regression". Statistics LibreTexts. 2013-11-21. Retrieved 2019-11-22.

External links

Media related to Errors and residuals at Wikimedia Commons
