Studentized residual

From Wikipedia, the free encyclopedia

Kind of ratio

For broader coverage of this topic, see Studentization.

In statistics, a studentized residual is the dimensionless ratio resulting from the division of a residual by an estimate of its standard deviation, both expressed in the same units. It is a form of Student's t-statistic, with the estimate of error varying between points.

This is an important technique in the detection of outliers. It is among several techniques named in honor of William Sealy Gosset, who wrote under the pseudonym "Student" (e.g., Student's distribution). Dividing a statistic by a sample standard deviation is called studentizing, in analogy with standardizing and normalizing.

Motivation

The key reason for studentizing is that, in regression analysis of a multivariate distribution, the variances of the residuals at different input variable values may differ, even if the variances of the errors at these different input variable values are equal. The issue is the difference between errors and residuals in statistics, particularly the behavior of residuals in regressions.

Consider the simple linear regression model

$$Y=\alpha _{0}+\alpha _{1}X+\varepsilon .$$

Given a random sample $(X_{i}, Y_{i})$, $i = 1, \ldots, n$, each pair $(X_{i}, Y_{i})$ satisfies

$$Y_{i}=\alpha _{0}+\alpha _{1}X_{i}+\varepsilon _{i},$$

where the errors $\varepsilon _{i}$ are independent and all have the same variance $\sigma ^{2}$. The residuals are not the true errors, but estimates, based on the observable data. When the method of least squares is used to estimate $\alpha _{0}$ and $\alpha _{1}$, the residuals ${\widehat {\varepsilon \,}}_{i}$, unlike the errors $\varepsilon _{i}$, cannot be independent since they satisfy the two constraints

$$\sum _{i=1}^{n}{\widehat {\varepsilon \,}}_{i}=0$$

and

$$\sum _{i=1}^{n}{\widehat {\varepsilon \,}}_{i}x_{i}=0.$$

(Here $\varepsilon _{i}$ is the $i$th error, ${\widehat {\varepsilon \,}}_{i}$ is the $i$th residual, and $x_{i}$ is the observed value of $X_{i}$.)
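
The two constraints can be checked numerically. The following is a minimal sketch in plain Python, assuming a small made-up data set; the variable names (`sum_resid`, `sum_resid_x`, and so on) are illustrative and not part of any library.

```python
# Illustrative data, made up for this sketch
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
n = len(x)

# Closed-form least-squares estimates for Y = a0 + a1 * X
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
a1 = sxy / sxx
a0 = y_bar - a1 * x_bar

residuals = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]

# Both sums vanish up to floating-point rounding
sum_resid = sum(residuals)
sum_resid_x = sum(r * xi for r, xi in zip(residuals, x))
print(sum_resid, sum_resid_x)
```

Because the n residuals satisfy two linear equations, only n − 2 of them can vary freely, which is why they cannot be independent.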

The residuals, unlike the errors, do not all have the same variance: the variance decreases as the corresponding x-value gets farther from the average x-value. This is not a feature of the data itself, but of the regression better fitting values at the ends of the domain. It is also reflected in the influence functions of various data points on the regression coefficients: endpoints have more influence. This can also be seen because the residuals at endpoints depend greatly on the slope of a fitted line, while the residuals at the middle are relatively insensitive to the slope. The fact that the variances of the residuals differ, even though the variances of the true errors are all equal to each other, is the principal reason for the need for studentization.

It is not simply a matter of the population parameters (mean and standard deviation) being unknown; rather, regressions yield different residual distributions at different data points, unlike point estimators of univariate distributions, which share a common distribution for residuals.

Background

For this simple model, the design matrix is

$$X=\left[{\begin{matrix}1&x_{1}\\\vdots &\vdots \\1&x_{n}\end{matrix}}\right]$$

and the hat matrix $H$ is the matrix of the orthogonal projection onto the column space of the design matrix:

$$H=X(X^{T}X)^{-1}X^{T}.$$

The leverage $h_{ii}$ is the $i$th diagonal entry in the hat matrix. The variance of the $i$th residual is

$$\operatorname {var} ({\widehat {\varepsilon \,}}_{i})=\sigma ^{2}(1-h_{ii}).$$

In case the design matrix $X$ has only two columns (as in the example above), this is equal to

$$\operatorname {var} ({\widehat {\varepsilon \,}}_{i})=\sigma ^{2}\left(1-{\frac {1}{n}}-{\frac {(x_{i}-{\bar {x}})^{2}}{\sum _{j=1}^{n}(x_{j}-{\bar {x}})^{2}}}\right).$$

In the case of an arithmetic mean, the design matrix $X$ has only one column (a vector of ones), and this is simply

$$\operatorname {var} ({\widehat {\varepsilon \,}}_{i})=\sigma ^{2}\left(1-{\frac {1}{n}}\right).$$
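
As an illustration, the leverages for the two-column design matrix can be computed directly from the definition of the hat matrix and compared with the closed form above. This is a minimal sketch in plain Python, assuming made-up x-values with one point far from the mean; the explicit 2×2 inverse of $X^{T}X$ is written out by hand, and all variable names are illustrative.

```python
# Illustrative x-values; the last one lies far from the mean
x = [1.0, 2.0, 3.0, 4.0, 10.0]
n = len(x)

# Entries of X^T X for the design matrix with rows [1, x_i]
s1 = sum(x)
s2 = sum(xi * xi for xi in x)
det = n * s2 - s1 * s1

# h_ii = [1, x_i] (X^T X)^{-1} [1, x_i]^T, using the explicit 2x2 inverse
leverage = [(s2 - 2.0 * xi * s1 + n * xi * xi) / det for xi in x]

# Closed form from the text: 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
x_bar = s1 / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
closed_form = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# var(residual_i) / sigma^2 = 1 - h_ii
variance_factor = [1.0 - h for h in leverage]
for xi, h, v in zip(x, leverage, variance_factor):
    print(f"x={xi:5.1f}  leverage={h:.4f}  var/sigma^2={v:.4f}")
```

In this sketch the point farthest from the mean has the largest leverage and therefore the smallest residual variance, and the leverages sum to 2, the number of columns of the design matrix.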

Calculation

Given the definitions above, the studentized residual is then

$$t_{i}={{\widehat {\varepsilon \,}}_{i} \over {\widehat {\sigma }}{\sqrt {1-h_{ii}}}}$$

where $h_{ii}$ is the leverage, and ${\widehat {\sigma }}$ is an appropriate estimate of $\sigma$ (see below).

In the case of a mean, this is equal to

$$t_{i}={{\widehat {\varepsilon \,}}_{i} \over {\widehat {\sigma }}{\sqrt {(n-1)/n}}}$$

Internal and external studentization

The usual estimate of $\sigma ^{2}$, which leads to the internally studentized residual, is

$${\widehat {\sigma }}^{2}={1 \over n-m}\sum _{j=1}^{n}{\widehat {\varepsilon \,}}_{j}^{\,2},$$

where $m$ is the number of parameters in the model (2 in our example).

But if the $i$th case is suspected of being improbably large, then it would also not be normally distributed. Hence it is prudent to exclude the $i$th observation from the process of estimating the variance when one is considering whether the $i$th case may be an outlier, and instead use the estimate that leads to the externally studentized residual, which is

$${\widehat {\sigma }}_{(i)}^{2}={1 \over n-m-1}\sum _{\begin{smallmatrix}j=1\\j\neq i\end{smallmatrix}}^{n}{\widehat {\varepsilon \,}}_{j}^{\,2},$$

based on all the residuals except the suspect $i$th residual. It should be emphasized that the residuals ${\widehat {\varepsilon \,}}_{j}$ ($j\neq i$) used here for a suspect case $i$ are computed from the fit with the $i$th case excluded.

If the estimate ${\widehat {\sigma }}^{2}$, which includes the $i$th case, is used, the result is called the internally studentized residual, $t_{i}$ (also known as the standardized residual^[1]). If the estimate ${\widehat {\sigma }}_{(i)}^{2}$, which excludes the $i$th case, is used instead, the result is called the externally studentized residual, $t_{i(i)}$.
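
The two variants can be compared on a small example. The following is a minimal sketch in plain Python for a straight-line fit (m = 2), assuming made-up data with one planted outlier. It assumes that the numerator and the leverage come from the full fit, while the variance estimate for case i comes from a refit with that case deleted. The helper names `fit_line` and `studentized` are hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = a0 + a1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - a1 * x_bar, a1

def studentized(x, y):
    """Return (internal, external) studentized residuals for a line fit."""
    n, m = len(x), 2
    a0, a1 = fit_line(x, y)
    resid = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Internal: one variance estimate from all n residuals
    sigma2 = sum(r * r for r in resid) / (n - m)
    internal = [r / math.sqrt(sigma2 * (1.0 - hi)) for r, hi in zip(resid, h)]

    # External: refit without case i and estimate the variance from that fit
    external = []
    for i in range(n):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        b0, b1 = fit_line(xs, ys)
        sse = sum((yj - (b0 + b1 * xj)) ** 2 for xj, yj in zip(xs, ys))
        sigma2_i = sse / (n - m - 1)
        external.append(resid[i] / math.sqrt(sigma2_i * (1.0 - h[i])))
    return internal, external

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 2.3, 2.9, 4.2, 9.0, 6.1, 6.8, 8.2]  # fifth point is a planted outlier
internal, external = studentized(x, y)
for i, (ti, te) in enumerate(zip(internal, external), start=1):
    print(f"case {i}: internal={ti:7.3f}  external={te:7.3f}")
```

The refit loop is written for clarity rather than speed. Under the assumptions above, the same external values follow from the internal ones through the identity $t_{i(i)}=t_{i}{\sqrt {(n-m-1)/(n-m-t_{i}^{2})}}$, so the externally studentized residual is larger in magnitude than the internal one whenever $t_{i}^{2}>1$.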

Distribution

"Tau distribution" redirects here; not to be confused with Tau coefficient.

If the errors are independent and normally distributed with expected value 0 and variance $\sigma ^{2}$, then the probability distribution of the $i$th externally studentized residual $t_{i(i)}$ is a Student's t-distribution with $n-m-1$ degrees of freedom, and can range from $-\infty$ to $+\infty$.

On the other hand, the internally studentized residuals are in the range $0\,\pm \,{\sqrt {\nu }}$, where $\nu =n-m$ is the number of residual degrees of freedom. If $t_{i}$ represents the internally studentized residual, and again assuming that the errors are independent identically distributed Gaussian variables, then:^[2]

$$t_{i}\sim {\sqrt {\nu }}{t \over {\sqrt {t^{2}+\nu -1}}}$$

where $t$ is a random variable distributed as Student's t-distribution with $\nu -1$ degrees of freedom. In fact, this implies that $t_{i}^{2}/\nu$ follows the beta distribution $B(1/2,(\nu -1)/2)$. The distribution above is sometimes referred to as the tau distribution;^[2] it was first derived by Thompson in 1935.^[3]
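
The step from the Student's t representation to the beta distribution can be sketched as follows. It relies on two standard facts that are not spelled out in the text: the square of a $t_{\nu -1}$ variable follows an $F(1,\nu -1)$ distribution, and a suitably scaled F ratio follows a beta distribution. The symbols $F$, $d_{1}$ and $d_{2}$ are introduced here only for illustration.

```latex
\begin{aligned}
\frac{t_i^{2}}{\nu} &\overset{d}{=} \frac{t^{2}}{t^{2}+\nu-1},
  \qquad t \sim t_{\nu-1} \;\Longrightarrow\; t^{2} \sim F(1,\nu-1), \\
F \sim F(d_1,d_2) &\;\Longrightarrow\;
  \frac{d_1 F}{d_1 F + d_2} \sim B\!\left(\tfrac{d_1}{2},\tfrac{d_2}{2}\right), \\
d_1 = 1,\ d_2 = \nu-1 &\;\Longrightarrow\;
  \frac{t_i^{2}}{\nu} \sim B\!\left(\tfrac{1}{2},\tfrac{\nu-1}{2}\right).
\end{aligned}
```

Since a beta variable lies between 0 and 1, this also recovers the bound $|t_{i}|\leq {\sqrt {\nu }}$ on the internally studentized residuals.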

When $\nu =3$, the internally studentized residuals are uniformly distributed between $-{\sqrt {3}}$ and $+{\sqrt {3}}$. If there is only one residual degree of freedom, the above formula for the distribution of internally studentized residuals does not apply. In this case, the $t_{i}$ are all either +1 or −1, with 50% chance for each.
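
The case of a single residual degree of freedom can be illustrated by fitting a straight line (m = 2) to three points. This is a minimal sketch in plain Python, assuming made-up data; it only shows that every internally studentized residual has magnitude 1, and does not address the 50% probabilities.

```python
import math

# Three illustrative points and a two-parameter line: nu = n - m = 1
x = [0.0, 1.0, 3.0]
y = [1.0, 0.5, 2.0]
n, m = len(x), 2

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a0 = y_bar - a1 * x_bar

resid = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]
h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
sigma_hat = math.sqrt(sum(r * r for r in resid) / (n - m))

# Internally studentized residuals: each one is +1 or -1 up to rounding
t = [r / (sigma_hat * math.sqrt(1.0 - hi)) for r, hi in zip(resid, h)]
print(t)
```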

The standard deviation of the distribution of internally studentized residuals is always 1, but this does not imply that the standard deviation of all the $t_{i}$ of a particular experiment is 1. For instance, the internally studentized residuals when fitting a straight line going through (0, 0) to the points (1, 4), (2, −1), (2, −1) are ${\sqrt {2}},\ -{\sqrt {5}}/5,\ -{\sqrt {5}}/5$, and the standard deviation of these is not 1.
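
The numbers in this example can be reproduced with a few lines of plain Python. The sketch below assumes the one-parameter model $Y=\beta X+\varepsilon$ (so m = 1 and the design matrix is the single column of x-values); the symbol $\beta$ and the variable names are illustrative.

```python
import math
import statistics

x = [1.0, 2.0, 2.0]
y = [4.0, -1.0, -1.0]
n, m = len(x), 1  # one parameter: the slope of a line through (0, 0)

sxx = sum(xi * xi for xi in x)
slope = sum(xi * yi for xi, yi in zip(x, y)) / sxx
resid = [yi - slope * xi for xi, yi in zip(x, y)]

# With a single-column design matrix, the leverage is x_i^2 / sum_j x_j^2
h = [xi * xi / sxx for xi in x]
sigma_hat = math.sqrt(sum(r * r for r in resid) / (n - m))
t = [r / (sigma_hat * math.sqrt(1.0 - hi)) for r, hi in zip(resid, h)]

print(t)                    # approximately [1.4142, -0.4472, -0.4472]
print(statistics.stdev(t))  # sample standard deviation of the three values
```

The fitted slope is 0 here, so the residuals are simply 4, −1, −1, and the first studentized residual sits exactly at the bound ${\sqrt {\nu }}={\sqrt {2}}$.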

Note that any pair of studentized residuals $t_{i}$ and $t_{j}$ (where $i\neq j$) are not i.i.d. They have the same distribution, but are not independent, because the residuals are constrained to sum to 0 and to be orthogonal to the columns of the design matrix.

Software implementations

Many programs and statistics packages, such as R and Python, include implementations of the studentized residual.

| Language/Program | Function | Notes |
| --- | --- | --- |
| R | `rstandard(model, ...)` | Internally studentized; see the R documentation |
| R | `rstudent(model, ...)` | Externally studentized; see the R documentation |

References

[1] "Regression Deletion Diagnostics". R documentation.
[2] Pope, Allen J. (1976). "The statistics of residuals and the detection of outliers". U.S. Dept. of Commerce, National Oceanic and Atmospheric Administration, National Ocean Survey, Geodetic Research and Development Laboratory. 136 pages. Eq. (6).
[3] Thompson, William R. (1935). "On a Criterion for the Rejection of Observations and the Distribution of the Ratio of Deviation to Sample Standard Deviation". The Annals of Mathematical Statistics. 6 (4): 214–219. doi:10.1214/aoms/1177732567.
