
Regression validation

From Wikipedia, the free encyclopedia

Statistics concept


In statistics, regression validation is the process of deciding whether the numerical results quantifying hypothesized relationships between variables, obtained from regression analysis, are acceptable as descriptions of the data. The validation process can involve analyzing the goodness of fit of the regression, analyzing whether the regression residuals are random, and checking whether the model's predictive performance deteriorates substantially when applied to data that were not used in model estimation.

Goodness of fit

Main article: Goodness of fit

One measure of goodness of fit is the coefficient of determination, often denoted R². In ordinary least squares with an intercept, it ranges between 0 and 1. However, an R² close to 1 does not guarantee that the model fits the data well. For example, if the functional form of the model does not match the data, R² can be high despite a poor model fit. Anscombe's quartet consists of four example data sets with similarly high R² values, but in some of them the data clearly do not fit the regression line. Instead, the data sets include outliers, high-leverage points, or non-linearities.
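The point about functional form can be illustrated with a minimal sketch. The data below are hypothetical (a purely quadratic relationship), and the helper names `fit_line` and `r_squared` are illustrative only; the sketch fits a straight line by ordinary least squares and reports R².

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(xs, ys, a, b):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: the true relationship is quadratic, not linear.
xs = list(range(11))
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, a, b)
print(f"intercept={a:.2f}, slope={b:.2f}, R^2={r2:.3f}")
```

For these data R² comes out above 0.9 even though a straight line is the wrong functional form, which is why a high R² alone should not be read as evidence of a well-specified model.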

One problem with R² as a measure of model validity is that it never decreases when more variables are added to the model, and it strictly increases except in the unlikely event that the additional variables are exactly uncorrelated, in the data sample being used, with the part of the dependent variable left unexplained by the existing variables. This problem can be avoided by doing an F-test of the statistical significance of the increase in R², or by instead using the adjusted R².
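As a rough sketch of both remedies, the functions below use the standard textbook formulas for adjusted R² and for the F statistic of an increase in R² between nested least-squares models with an intercept. The function names and the numbers fed to them are hypothetical and chosen only for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p regressors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def f_statistic_r2_increase(r2_small, r2_large, n, p_large, q):
    """F statistic for adding q regressors to reach a model with p_large regressors.

    Under the null hypothesis that the added regressors have zero coefficients,
    it is compared with an F distribution on (q, n - p_large - 1) degrees of freedom.
    """
    numerator = (r2_large - r2_small) / q
    denominator = (1.0 - r2_large) / (n - p_large - 1)
    return numerator / denominator


# Hypothetical example: two extra regressors raise R^2 from 0.600 to 0.605.
n = 50
r2_small, p_small = 0.600, 3
r2_large, p_large = 0.605, 5

adj_small = adjusted_r2(r2_small, n, p_small)
adj_large = adjusted_r2(r2_large, n, p_large)
f_stat = f_statistic_r2_increase(r2_small, r2_large, n, p_large, q=p_large - p_small)

print(f"adjusted R^2: {adj_small:.4f} -> {adj_large:.4f}")
print(f"F statistic for the increase: {f_stat:.3f}")
```

In this made-up case the plain R² rises while the adjusted R² falls, and the F statistic is below 1, so the added variables would not be judged significant at conventional levels.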

Analysis of residuals

Main article: Residual analysis

The residuals from a fitted model are the differences between the responses observed at each combination of values of the explanatory variables and the corresponding predictions of the response computed using the regression function. Mathematically, the residual for the i-th observation in the data set is written

e_i = y_i - f(x_i; \hat{\beta}),

with y_i denoting the i-th response in the data set and x_i the vector of explanatory variables, each set at the corresponding values found in the i-th observation in the data set.

Figure: An illustrative plot of a fit to data (green curve in top panel, data in red) plus a plot of residuals (red points in bottom panel). The dashed curve in the bottom panel is a straight-line fit to the residuals. If the functional form is correct, there should be little or no trend in the residuals, as seen here.

If the model fit to the data were correct, the residuals would approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship. Therefore, if the residuals appear to behave randomly, it suggests that the model fits the data well. On the other hand, if non-random structure is evident in the residuals, it is a clear sign that the model fits the data poorly. The next section details the types of plots to use to test different aspects of a model and gives the correct interpretations of different results that could be observed for each type of plot.
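A minimal sketch of this idea computes the residuals e_i = y_i - f(x_i; β̂) and counts how often consecutive residuals change sign when ordered by the explanatory variable. The sign-change count is only one crude, illustrative check and is not prescribed by the text; the data, the function names, and the straight-line model are hypothetical.

```python
def residuals(f, beta_hat, xs, ys):
    """e_i = y_i - f(x_i; beta_hat) for each observation."""
    return [y - f(x, beta_hat) for x, y in zip(xs, ys)]


def sign_changes(values):
    """Count how often consecutive nonzero values switch sign."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def line(x, beta):
    """Straight-line regression function f(x; beta) = a + b*x."""
    a, b = beta
    return a + b * x


# Hypothetical data with a quadratic trend, ordered by x.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 4, 9, 16, 25, 36]

# Least-squares estimates (intercept, slope) of a straight line for these data.
beta_hat = (-28 / 3, 7.0)

res = residuals(line, beta_hat, xs, ys)
print([round(e, 3) for e in res])
print("sign changes:", sign_changes(res))
```

Here the residuals are positive at both ends and negative in the middle, giving only two sign changes across six observations. A systematic pattern of this kind is the non-random structure that suggests the functional form is wrong.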

Graphical analysis of residuals


Quantitative analysis of residuals

Main article: Regression diagnostic

Numerical methods also play an important role in model validation. For example, the lack-of-fit test for assessing the correctness of the functional part of the model can aid in interpreting a borderline residual plot. One common situation when numerical validation methods take precedence over graphical methods is when the number of parameters being estimated is relatively close to the size of the data set. In this situation residual plots are often difficult to interpret due to constraints on the residuals imposed by the estimation of the unknown parameters. One area in which this typically happens is in optimization applications using designed experiments. Logistic regression with binary data is another area in which graphical residual analysis can be difficult.

Serial correlation of the residuals can indicate model misspecification, and can be checked for with the Durbin–Watson statistic. The problem of heteroskedasticity can be checked for in any of several ways.
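The Durbin–Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. It lies between 0 and 4: values near 2 are consistent with no first-order serial correlation, values well below 2 point to positive correlation, and values well above 2 point to negative correlation. The sketch below assumes the residuals are already in time (or sequence) order; both residual series are hypothetical.

```python
def durbin_watson(residuals):
    """d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    num = sum((cur - prev) ** 2 for prev, cur in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residual series, ordered in time.
slowly_drifting = [1.0, 0.8, 0.5, 0.1, -0.3, -0.6, -0.9, -0.6]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

d_drift = durbin_watson(slowly_drifting)
d_alt = durbin_watson(alternating)

print(f"slowly drifting residuals: d = {d_drift:.3f}")  # well below 2
print(f"alternating residuals:     d = {d_alt:.3f}")    # well above 2
```

Whether a given value counts as significant depends on tabulated bounds that vary with the sample size and the number of regressors, which this sketch does not attempt to reproduce.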

Out-of-sample evaluation

Main article: Cross-validation

Cross-validation is the process of assessing how the results of a statistical analysis will generalize to an independent data set. If the model has been estimated over some, but not all, of the available data, then the model using the estimated parameters can be used to predict the held-back data. If, for example, the out-of-sample mean squared error, also known as the mean squared prediction error, is substantially higher than the in-sample mean squared error, this is a sign of deficiency in the model.
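The comparison described above can be sketched with a single hold-out split. The example is deliberately simple: the data are hypothetical, the last three observations are held back (so the test is one of extrapolation rather than a random split), and the helper names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mean_squared_error(xs, ys, a, b):
    """Average squared difference between observed and predicted responses."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: quadratic relationship, modelled (wrongly) as a line.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 2 for x in xs]

# Estimate on the first six observations; hold back the last three.
train_x, train_y = xs[:6], ys[:6]
test_x, test_y = xs[6:], ys[6:]

a, b = fit_line(train_x, train_y)
mse_in = mean_squared_error(train_x, train_y, a, b)
mse_out = mean_squared_error(test_x, test_y, a, b)

print(f"in-sample MSE:     {mse_in:.2f}")
print(f"out-of-sample MSE: {mse_out:.2f}")
```

The out-of-sample error here is far larger than the in-sample error, which is the kind of gap that signals a deficient model. In practice the split would usually be random or repeated over several folds rather than taken from one end of the data.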

A development in medical statistics is the use of out-of-sample cross-validation techniques in meta-analysis. It forms the basis of the validation statistic, Vn, which is used to test the statistical validity of meta-analysis summary estimates. Essentially it measures a type of normalized prediction error, and its distribution is a linear combination of χ² variables, each with one degree of freedom.[1]

References
