In statistics, a studentized residual is the dimensionless ratio resulting from the division of a residual by an estimate of its standard deviation, both expressed in the same units. It is a form of a Student's t-statistic, with the estimate of error varying between points.
This is an important technique in the detection of outliers. It is among several named in honor of William Sealy Gosset, who wrote under the pseudonym "Student" (e.g., Student's distribution). Dividing a statistic by a sample standard deviation is called studentizing, in analogy with standardizing and normalizing.
The key reason for studentizing is that, in regression analysis of a multivariate distribution, the variances of the residuals at different input variable values may differ, even if the variances of the errors at these different input variable values are equal. The issue is the difference between errors and residuals in statistics, particularly the behavior of residuals in regressions.
Consider the simple linear regression model

$$ Y = \alpha_0 + \alpha_1 X + \varepsilon. $$

Given a random sample $(X_i, Y_i)$, $i = 1, \ldots, n$, each pair $(X_i, Y_i)$ satisfies

$$ Y_i = \alpha_0 + \alpha_1 X_i + \varepsilon_i, $$

where the errors $\varepsilon_i$ are independent and all have the same variance $\sigma^2$. The residuals are not the true errors, but estimates, based on the observable data. When the method of least squares is used to estimate $\alpha_0$ and $\alpha_1$, then the residuals $\widehat{\varepsilon}$, unlike the errors $\varepsilon$, cannot be independent, since they satisfy the two constraints

$$ \sum_{i=1}^n \widehat{\varepsilon}_i = 0 $$

and

$$ \sum_{i=1}^n \widehat{\varepsilon}_i x_i = 0. $$

(Here $\varepsilon_i$ is the $i$th error, and $\widehat{\varepsilon}_i$ is the $i$th residual.)
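These two constraints can be checked numerically. A minimal Python sketch, with invented data values, fits the model by the closed-form least-squares estimates and verifies both sums vanish (up to floating-point rounding):

```python
# Least-squares fit of y = a0 + a1*x, then a check of the two constraints
# the residuals must satisfy: sum(e_i) = 0 and sum(e_i * x_i) = 0.
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # invented illustration data
y = [2.1, 2.9, 4.2, 4.8, 6.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least-squares estimates for simple linear regression.
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
a0 = y_bar - a1 * x_bar

residuals = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]

print(abs(sum(residuals)) < 1e-9)                              # True
print(abs(sum(r * xi for r, xi in zip(residuals, x))) < 1e-9)  # True
```

Because of these two linear constraints, only $n - 2$ of the $n$ residuals can vary freely, which is where the residual degrees of freedom used later come from.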
The residuals, unlike the errors, do not all have the same variance: the variance decreases as the corresponding x-value gets farther from the average x-value. This is not a feature of the data itself, but of the regression better fitting values at the ends of the domain. It is also reflected in the influence functions of various data points on the regression coefficients: endpoints have more influence. This can also be seen because the residuals at endpoints depend greatly on the slope of a fitted line, while the residuals at the middle are relatively insensitive to the slope. The fact that the variances of the residuals differ, even though the variances of the true errors are all equal to each other, is the principal reason for the need for studentization.
It is not simply a matter of the population parameters (mean and standard deviation) being unknown – it is that regressions yield different residual distributions at different data points, unlike point estimators of univariate distributions, which share a common distribution for residuals.
For this simple model, the design matrix is

$$ X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, $$

and the hat matrix H is the matrix of the orthogonal projection onto the column space of the design matrix:

$$ H = X (X^\mathsf{T} X)^{-1} X^\mathsf{T}. $$

The leverage $h_{ii}$ is the $i$th diagonal entry in the hat matrix. The variance of the $i$th residual is

$$ \operatorname{var}(\widehat{\varepsilon}_i) = \sigma^2 (1 - h_{ii}). $$

In case the design matrix X has only two columns (as in the example above), this is equal to

$$ \operatorname{var}(\widehat{\varepsilon}_i) = \sigma^2 \left( 1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2} \right). $$

In the case of an arithmetic mean, the design matrix X has only one column (a vector of ones), and this is simply:

$$ \operatorname{var}(\widehat{\varepsilon}_i) = \sigma^2 \left( 1 - \frac{1}{n} \right). $$
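For the two-column design matrix, the leverages can be computed from the closed form above without building H explicitly. A small Python illustration with invented x-values, which also checks two standard properties (the leverages sum to the number of columns of the design matrix, and the endpoints carry the most leverage):

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # invented illustration data
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Leverage of each point: h_ii = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# The leverages sum to 2 (the number of design-matrix columns),
# and for these symmetric x-values the endpoints tie for the maximum.
print(abs(sum(leverage) - 2.0) < 1e-9)                 # True
print(max(leverage) == leverage[0] == leverage[-1])    # True
```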
Given the definitions above, the studentized residual is then

$$ t_i = \frac{\widehat{\varepsilon}_i}{\widehat{\sigma} \sqrt{1 - h_{ii}}}, $$

where $h_{ii}$ is the leverage and $\widehat{\sigma}$ is an appropriate estimate of $\sigma$ (see below).

In the case of a mean, this is equal to:

$$ t_i = \frac{\widehat{\varepsilon}_i}{\widehat{\sigma} \sqrt{(n-1)/n}}. $$
The usual estimate of $\sigma^2$ is

$$ \widehat{\sigma}^2 = \frac{1}{n - m} \sum_{j=1}^n \widehat{\varepsilon}_j^{\,2}, $$

where m is the number of parameters in the model (2 in our example).
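Putting the pieces together, the internally studentized residuals can be computed in a few lines. A minimal Python sketch with invented data, which also checks the bound $t_i^2 \le n - m$ discussed further below:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # invented illustration data
y = [2.0, 1.0, 4.0, 3.0, 5.0]
n, m = len(x), 2                # m = 2 fitted parameters (intercept, slope)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a0 = y_bar - a1 * x_bar

resid = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]
lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# sigma-hat^2: residual sum of squares over n - m residual degrees of freedom.
sigma2 = sum(r * r for r in resid) / (n - m)

# Internally studentized residuals: t_i = e_i / (sigma-hat * sqrt(1 - h_ii)).
t_int = [r / math.sqrt(sigma2 * (1 - h)) for r, h in zip(resid, lev)]

print(all(t * t <= n - m + 1e-9 for t in t_int))   # True
```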
But if the $i$th case is suspected of being improbably large, then it would also not be normally distributed. Hence it is prudent to exclude the $i$th observation from the process of estimating the variance when one is considering whether the $i$th case may be an outlier, and instead use the externally studentized estimate, which is

$$ \widehat{\sigma}_{(i)}^2 = \frac{1}{n - m - 1} \sum_{\substack{j = 1 \\ j \neq i}}^n \widehat{\varepsilon}_{j(i)}^{\,2}, $$

based on all the residuals except the suspect $i$th residual. Here the subscript $(i)$ is to emphasize that the residuals $\widehat{\varepsilon}_{j(i)}$ for suspect $i$ are computed with the $i$th case excluded.

If the estimate $\widehat{\sigma}^2$ includes the $i$th case, then the result

$$ t_i = \frac{\widehat{\varepsilon}_i}{\widehat{\sigma} \sqrt{1 - h_{ii}}} $$

is called the internally studentized residual (also known as the standardized residual[1]). If the estimate $\widehat{\sigma}_{(i)}^2$ is used instead, excluding the $i$th case, then it is called the externally studentized residual,

$$ t_{(i)} = \frac{\widehat{\varepsilon}_i}{\widehat{\sigma}_{(i)} \sqrt{1 - h_{ii}}}. $$
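The leave-one-out estimate $\widehat{\sigma}_{(i)}$ can be computed by literally refitting the model without the $i$th case. A Python sketch with invented data; as a cross-check it also verifies the standard identity relating the two versions, $t_{(i)} = t_i \sqrt{(\nu - 1)/(\nu - t_i^2)}$ with $\nu = n - m$ (a known result, not stated elsewhere in this article):

```python
import math

def fit(xs, ys):
    """Closed-form least-squares intercept and slope for y = a0 + a1*x."""
    k = len(xs)
    xb, yb = sum(xs) / k, sum(ys) / k
    a1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(xs, ys)) \
         / sum((xi - xb) ** 2 for xi in xs)
    return yb - a1 * xb, a1

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # invented illustration data
y = [1.0, 3.0, 2.0, 5.0, 4.0, 9.0]
n, m = len(x), 2
nu = n - m

a0, a1 = fit(x, y)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
resid = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]
lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Internally studentized residuals (variance estimated from all n cases).
sigma2 = sum(r * r for r in resid) / nu
t_int = [r / math.sqrt(sigma2 * (1 - h)) for r, h in zip(resid, lev)]

# Externally studentized residuals: refit with the i-th case deleted,
# estimate sigma from that fit, then studentize the full-fit residual.
t_ext = []
for i in range(n):
    xs = x[:i] + x[i + 1:]
    ys = y[:i] + y[i + 1:]
    b0, b1 = fit(xs, ys)
    s2_i = sum((yj - (b0 + b1 * xj)) ** 2
               for xj, yj in zip(xs, ys)) / (n - m - 1)
    t_ext.append(resid[i] / math.sqrt(s2_i * (1 - lev[i])))
```

Refitting n times is the transparent approach; statistical packages normally use the equivalent algebraic shortcut instead of n separate regressions.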
If the errors are independent and normally distributed with expected value 0 and variance $\sigma^2$, then the probability distribution of the $i$th externally studentized residual $t_{(i)}$ is a Student's t-distribution with $n - m - 1$ degrees of freedom, and can range from $-\infty$ to $+\infty$.
On the other hand, the internally studentized residuals are in the range $\pm\sqrt{\nu}$, where $\nu = n - m$ is the number of residual degrees of freedom. If $t_i$ represents the internally studentized residual, and again assuming that the errors are independent identically distributed Gaussian variables, then:[2]

$$ t_i = t \sqrt{\frac{\nu}{t^2 + \nu - 1}}, $$

where t is a random variable distributed as Student's t-distribution with $\nu - 1$ degrees of freedom. In fact, this implies that $t_i^2/\nu$ follows the beta distribution $B(1/2, (\nu - 1)/2)$. The distribution above is sometimes referred to as the tau distribution;[2] it was first derived by Thompson in 1935.[3]
When $\nu = 3$, the internally studentized residuals are uniformly distributed between $-\sqrt{3}$ and $+\sqrt{3}$. If there is only one residual degree of freedom, the above formula for the distribution of internally studentized residuals doesn't apply. In this case, the $t_i$ are all either +1 or −1, with 50% chance for each.
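The one-degree-of-freedom case is easy to verify directly: fitting a mean to any two distinct points gives $\nu = 1$, and the two internally studentized residuals come out as exactly $\pm 1$. A short Python check (the data values are arbitrary):

```python
import math

y = [3.0, 8.0]            # any two distinct values
n, m = len(y), 1          # fitting just the mean: one parameter, nu = 1
y_bar = sum(y) / n
resid = [yi - y_bar for yi in y]
lev = [1.0 / n] * n       # leverage of each point when fitting a mean
sigma2 = sum(r * r for r in resid) / (n - m)

t_int = [r / math.sqrt(sigma2 * (1 - h)) for r, h in zip(resid, lev)]
print(t_int)   # [-1.0, 1.0]
```

The magnitudes are exactly 1 regardless of the chosen values, because the single squared residual pair determines $\widehat{\sigma}^2$ completely.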
The standard deviation of the distribution of internally studentized residuals is always 1, but this does not imply that the standard deviation of all the $t_i$ of a particular experiment is 1. For instance, the internally studentized residuals when fitting a straight line going through (0, 0) to the points (1, 4), (2, −1), (2, −1) are $\sqrt{2},\ -\sqrt{5}/5,\ -\sqrt{5}/5$, and the standard deviation of these is not 1.
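This worked example can be reproduced directly. A Python sketch fitting the no-intercept line $y = \beta x$ (so $m = 1$, and the leverage is $h_{ii} = x_i^2 / \sum_j x_j^2$):

```python
import math

# Fit y = b*x through the origin to the three points from the example.
x = [1.0, 2.0, 2.0]
y = [4.0, -1.0, -1.0]
n, m = len(x), 1          # one fitted parameter (the slope)

sxx = sum(xi * xi for xi in x)
b = sum(xi * yi for xi, yi in zip(x, y)) / sxx   # here b = 0
resid = [yi - b * xi for xi, yi in zip(x, y)]

# For the no-intercept model the leverage is h_ii = x_i^2 / sum_j x_j^2.
lev = [xi * xi / sxx for xi in x]
sigma2 = sum(r * r for r in resid) / (n - m)

t_int = [r / math.sqrt(sigma2 * (1 - h)) for r, h in zip(resid, lev)]
# t_int is [sqrt(2), -1/sqrt(5), -1/sqrt(5)], roughly [1.414, -0.447, -0.447]
```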
Note that any pair of studentized residuals $t_i$ and $t_j$ (where $i \neq j$) are not i.i.d. They have the same distribution, but are not independent, due to the constraints on the residuals: they must sum to 0 and be orthogonal to the design matrix.
Many programs and statistics packages, such as R, Python, etc., include implementations of studentized residuals.
| Language/Program | Function | Notes |
|---|---|---|
| R | rstandard(model, ...) | internally studentized |
| R | rstudent(model, ...) | externally studentized |