Ordinary least squares

Method for estimating the unknown parameters in a linear regression model
Okun's law in macroeconomics states that in an economy the GDP growth should depend linearly on the changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the input dataset and the output of the (linear) function of the independent variable. Some sources consider OLS to be linear regression.[1]

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface—the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.

The OLS estimator is consistent for the level-one fixed effects when the regressors are exogenous and there is no perfect collinearity (the rank condition), and it is consistent for the variance estimate of the residuals when the regressors have finite fourth moments.[2] By the Gauss–Markov theorem it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed with zero mean, OLS is the maximum likelihood estimator and outperforms any non-linear unbiased estimator.

Linear model

Main article: Linear regression model

Suppose the data consists of n observations {x_i, y_i}, i = 1, …, n. Each observation i includes a scalar response y_i and a column vector x_i of p regressors, i.e., x_i = [x_{i1}, x_{i2}, …, x_{ip}]^T. In a linear regression model, the response variable, y_i, is a linear function of the regressors:

y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,

or in vector form,

y_i = \mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta} + \varepsilon_i,

where x_i, as introduced previously, is a column vector of the i-th observation of all the explanatory variables; β is a p×1 vector of unknown parameters; and the scalar ε_i represents unobserved random variables (errors) of the i-th observation. ε_i accounts for the influences upon the response y_i from sources other than the explanatory variables x_i. This model can also be written in matrix notation as

\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon},

where y and ε are n×1 vectors of the response variables and the errors of the n observations, and X is an n×p matrix of regressors, also sometimes called the design matrix, whose row i is x_i^T and contains the i-th observations on all the explanatory variables.

Typically, a constant term is included in the set of regressors X, say, by taking x_{i1} = 1 for all i = 1, …, n. The coefficient β_1 corresponding to this regressor is called the intercept. Without the intercept, the fitted line is forced to cross the origin when x_i = 0.

Regressors do not have to be independent of one another for estimation to be consistent: for example, they may be non-linearly dependent. Short of perfect multicollinearity, parameter estimates may still be consistent; however, as multicollinearity rises, the standard errors of such estimates increase, reducing their precision. When there is perfect multicollinearity, it is no longer possible to obtain unique estimates for the coefficients of the related regressors; estimation of these parameters cannot converge (and thus cannot be consistent).

As a concrete example where regressors are non-linearly dependent yet estimation may still be consistent, suppose we suspect the response depends linearly on both a value and its square; we would then include one regressor whose value is the square of another regressor. In that case, the model is quadratic in the second regressor, but it is nonetheless still considered a linear model because the model is still linear in the parameters (β).
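As an illustration of this point, the following minimal sketch (Python with NumPy; the data are simulated for the example and are not from the article) fits a model that is quadratic in a regressor but linear in the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the response depends on x and on x squared.
n = 50
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=n)

# Design matrix with a constant, x, and x**2: quadratic in x,
# but still linear in the parameters beta.
X = np.column_stack([np.ones(n), x, x**2])

# OLS fit; lstsq solves the least-squares problem directly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # should be close to [1, 2, -3]
```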

Matrix/vector formulation


Consider an overdetermined system

\sum_{j=1}^{p} x_{ij} \beta_j = y_i , \qquad (i = 1, 2, \dots, n),

of n linear equations in p unknown coefficients, β_1, β_2, …, β_p, with n > p. This can be written in matrix form as

\mathbf{X} \boldsymbol{\beta} = \mathbf{y},

where

\mathbf{X} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}, \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

(Note: for a linear model as above, not all elements in X contain information on the data points. The first column is populated with ones, X_{i1} = 1; only the other columns contain actual data. So here p is equal to the number of regressors plus one.)

Such a system usually has no exact solution, so the goal is instead to find the coefficients β which fit the equations "best", in the sense of solving the quadratic minimization problem

\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\, S(\boldsymbol{\beta}),

where the objective function S is given by

S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left| y_i - \sum_{j=1}^{p} X_{ij} \beta_j \right|^2 = \left\| \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right\|^2 .

A justification for choosing this criterion is given in Properties below. This minimization problem has a unique solution, provided that the p columns of the matrix X are linearly independent, given by solving the so-called normal equations:

\left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right) \hat{\boldsymbol{\beta}} = \mathbf{X}^{\mathsf{T}} \mathbf{y} .

The matrix X^T X is known as the normal matrix or Gram matrix, and the matrix X^T y is known as the moment matrix of regressand by regressors.[3] Finally, β̂ is the coefficient vector of the least-squares hyperplane, expressed as

\hat{\boldsymbol{\beta}} = \left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{y} ,

or

\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathsf{T}} \boldsymbol{\varepsilon} .
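A short sketch (Python/NumPy, using simulated data rather than anything from the article) solves the normal equations directly and confirms that the result agrees with a standard least-squares routine; in practice one would usually prefer np.linalg.lstsq or a QR factorization over forming X^T X explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated overdetermined system: n = 100 equations, p = 3 unknowns.
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations (X^T X) beta_hat = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer from the numerically safer least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))  # True
print(beta_normal)
```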

Estimation


Suppose b is a "candidate" value for the parameter vector β. The quantity y_i − x_i^T b, called the residual for the i-th observation, measures the vertical distance between the data point (x_i, y_i) and the hyperplane y = x^T b, and thus assesses the degree of fit between the actual data and the model. The sum of squared residuals (SSR) (also called the error sum of squares (ESS) or residual sum of squares (RSS))[4] is a measure of the overall model fit:

S(b) = \sum_{i=1}^{n} \left( y_i - x_i^{\mathsf{T}} b \right)^2 = (y - Xb)^{\mathsf{T}} (y - Xb),

where T denotes the matrix transpose, and the rows of X, denoting the values of all the independent variables associated with a particular value of the dependent variable, are X_i = x_i^T. The value of b which minimizes this sum is called the OLS estimator for β. The function S(b) is quadratic in b with positive-definite Hessian, and therefore this function possesses a unique global minimum at b = β̂, which can be given by the explicit formula[5]

\hat{\beta} = \operatorname{arg\,min}_{b \in \mathbb{R}^p} S(b) = \left( X^{\mathsf{T}} X \right)^{-1} X^{\mathsf{T}} y .

The product N = X^T X is a Gram matrix, and its inverse, Q = N^{-1}, is the cofactor matrix of β,[6][7][8] closely related to its covariance matrix, C_β. The matrix (X^T X)^{-1} X^T = Q X^T is called the Moore–Penrose pseudoinverse matrix of X. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse).

Prediction


After we have estimated β, the fitted values (or predicted values) from the regression will be

\hat{y} = X \hat{\beta} = P y,

where P = X (X^T X)^{-1} X^T is the projection matrix onto the space V spanned by the columns of X. This matrix P is also sometimes called the hat matrix because it "puts a hat" onto the variable y. Another matrix, closely related to P, is the annihilator matrix M = I_n − P; this is a projection matrix onto the space orthogonal to V. Both matrices P and M are symmetric and idempotent (meaning that P² = P and M² = M), and relate to the data matrix X via the identities PX = X and MX = 0.[9] The matrix M creates the residuals from the regression:

\hat{\varepsilon} = y - \hat{y} = y - X \hat{\beta} = M y = M (X \beta + \varepsilon) = (M X) \beta + M \varepsilon = M \varepsilon .

The variances of the predicted values, s²_{ŷ_i}, are found on the main diagonal of the variance–covariance matrix of predicted values:

C_{\hat{y}} = s^2 P,

where P is the projection matrix and s² is the sample variance.[10] The full matrix is very large; its diagonal elements can be calculated individually as

s_{\hat{y}_i}^2 = s^2 X_i \left( X^{\mathsf{T}} X \right)^{-1} X_i^{\mathsf{T}},

where X_i is the i-th row of the matrix X.
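The following sketch (Python/NumPy, simulated data) builds the projection matrix P and the annihilator M explicitly and checks the identities quoted above; explicit n×n matrices like these are only practical for small n:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T          # hat (projection) matrix
M = np.eye(n) - P              # annihilator matrix

y_hat = P @ y                  # fitted values
resid = M @ y                  # residuals

# Check the stated properties: symmetry, idempotence, PX = X, MX = 0.
print(np.allclose(P, P.T), np.allclose(P @ P, P))
print(np.allclose(M @ X, 0.0))
print(np.allclose(y_hat + resid, y))
```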

Sample statistics


Using these residuals we can estimate the error variance σ² using the reduced chi-squared statistic:

s^2 = \frac{\hat{\varepsilon}^{\mathsf{T}} \hat{\varepsilon}}{n - p} = \frac{(M y)^{\mathsf{T}} M y}{n - p} = \frac{y^{\mathsf{T}} M^{\mathsf{T}} M y}{n - p} = \frac{y^{\mathsf{T}} M y}{n - p} = \frac{S(\hat{\beta})}{n - p}, \qquad \hat{\sigma}^2 = \frac{n - p}{n} \, s^2 .

The denominator, n − p, is the statistical degrees of freedom. The first quantity, s², is the OLS estimate for σ², whereas the second, σ̂², is the MLE estimate for σ². The two estimators are quite similar in large samples; the first estimator is always unbiased, while the second estimator is biased but has a smaller mean squared error. In practice s² is used more often, since it is more convenient for hypothesis testing. The square root of s² is called the regression standard error,[11] standard error of the regression,[12][13] or standard error of the equation.[9]

It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto X. The coefficient of determination R² is defined as the ratio of the "explained" variance to the "total" variance of the dependent variable y, in the cases where the total sum of squares decomposes into the regression (explained) sum of squares plus the sum of squares of residuals:[14]

R^2 = \frac{\sum (\hat{y}_i - \overline{y})^2}{\sum (y_i - \overline{y})^2} = \frac{y^{\mathsf{T}} P^{\mathsf{T}} L P y}{y^{\mathsf{T}} L y} = 1 - \frac{y^{\mathsf{T}} M y}{y^{\mathsf{T}} L y} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}

where TSS is the total sum of squares for the dependent variable, L = I_n − (1/n) J_n, and J_n is an n×n matrix of ones. (L is a centering matrix, equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for R² to be meaningful, the matrix X of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, R² will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
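A sketch of these sample statistics (Python/NumPy, simulated data; the variable names s2, sigma2_mle and r2 are local to this example):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

s2 = resid @ resid / (n - p)          # unbiased estimate of sigma^2
sigma2_mle = resid @ resid / n        # MLE estimate (biased, smaller MSE)

rss = resid @ resid                   # residual sum of squares
tss = np.sum((y - y.mean())**2)       # total sum of squares (model has an intercept)
r2 = 1.0 - rss / tss

print(s2, sigma2_mle, r2)
```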

Simple linear regression model

Main article: Simple linear regression

If the data matrix X contains only two variables, a constant and a scalar regressor x_i, then this is called the "simple regression model". This case is often considered in beginner statistics classes, as it provides much simpler formulas, even suitable for manual calculation. The parameters are commonly denoted as (α, β):

y_i = \alpha + \beta x_i + \varepsilon_i .

The least squares estimates in this case are given by the simple formulas

\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta} \, \bar{x} .
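A minimal sketch of these closed-form estimates (Python/NumPy, made-up data), cross-checked against the general matrix formula:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=30)
y = 2.0 + 0.7 * x + rng.normal(scale=0.5, size=30)

# Closed-form simple regression estimates.
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
alpha_hat = y.mean() - beta_hat * x.mean()

print(alpha_hat, beta_hat)   # should be close to (2.0, 0.7)

# The same numbers from the general matrix formula, as a cross-check.
X = np.column_stack([np.ones_like(x), x])
print(np.linalg.lstsq(X, y, rcond=None)[0])
```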

Alternative derivations


In the previous section the least squares estimator β̂ was obtained as the value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same: β̂ = (X^T X)^{-1} X^T y; the only difference is in how we interpret this result.

Projection

OLS estimation can be viewed as a projection onto the linear space spanned by the regressors. (Here each of X_1 and X_2 refers to a column of the data matrix.)

For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations Xβ ≈ y, where β is the unknown. Assuming the system cannot be solved exactly (the number of equations n being much larger than the number of unknowns p), we are looking for a solution that provides the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies

\hat{\beta} = \operatorname{arg\,min}_{\beta} \, \lVert \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \rVert^2 ,

where ‖·‖ is the standard L² norm in the n-dimensional Euclidean space R^n. The predicted quantity Xβ is just a certain linear combination of the vectors of regressors. Thus, the residual vector y − Xβ will have the smallest length when y is projected orthogonally onto the linear subspace spanned by the columns of X. The OLS estimator β̂ in this case can be interpreted as the coefficients of the vector decomposition of ŷ = Py along the basis of X.

In other words, the gradient equations at the minimum can be written as:

(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^{\mathsf{T}} \mathbf{X} = 0 .

A geometrical interpretation of these equations is that the vector of residuals, y − Xβ̂, is orthogonal to the column space of X, since the dot product (y − Xβ̂)·Xv is equal to zero for any conformal vector v. This means that y − Xβ̂ is the shortest of all possible vectors y − Xβ, that is, the variance of the residuals is the minimum possible. This is illustrated in the figure above.
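A quick numerical check of this orthogonality property (Python/NumPy, simulated data): the residual vector should have a zero dot product with every column of X, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# (y - X beta_hat)^T X should be (numerically) the zero vector.
print(resid @ X)                              # entries of order 1e-12 or smaller
print(np.allclose(resid @ X, 0.0, atol=1e-8))
```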

Introducing γ̂ and a matrix K, with the assumption that the matrix [X K] is non-singular and K^T X = 0 (cf. orthogonal projections), the residual vector should satisfy the following equation:

\hat{\mathbf{r}} := \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{K} \hat{\boldsymbol{\gamma}} .

The equation and solution of linear least squares are thus described as follows:

\begin{aligned}
\mathbf{y} &= \begin{bmatrix} \mathbf{X} & \mathbf{K} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix}, \\
\Rightarrow \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} &= \begin{bmatrix} \mathbf{X} & \mathbf{K} \end{bmatrix}^{-1} \mathbf{y} = \begin{bmatrix} \left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathsf{T}} \\ \left( \mathbf{K}^{\mathsf{T}} \mathbf{K} \right)^{-1} \mathbf{K}^{\mathsf{T}} \end{bmatrix} \mathbf{y} .
\end{aligned}

Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through the combination of any two points in the dataset.[15] Although this way of calculation is more computationally expensive, it provides a better intuition on OLS.

Maximum likelihood


The OLS estimator is identical to the maximum likelihood estimator (MLE) under the normality assumption for the error terms.[16] This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by Yule and Pearson. From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the Cramér–Rao bound for variance) if the normality assumption is satisfied.[17]

Generalized method of moments


In the i.i.d. case the OLS estimator can also be viewed as a GMM estimator arising from the moment conditions

\mathrm{E} \big[ \, x_i \left( y_i - x_i^{\mathsf{T}} \beta \right) \, \big] = 0 .

These moment conditions state that the regressors should be uncorrelated with the errors. Since x_i is a p-vector, the number of moment conditions is equal to the dimension of the parameter vector β, and thus the system is exactly identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix.

Note that the original strict exogeneity assumption E[ε_i | x_i] = 0 implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function f, the moment condition E[f(x_i)·ε_i] = 0 will hold. However, it can be shown using the Gauss–Markov theorem that the optimal choice of function f is to take f(x) = x, which results in the moment equation posted above.

Assumptions

See also: Linear regression § Assumptions

There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of the data at hand and on the inference task which has to be performed.

One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (random design) the regressors x_i are random and sampled together with the y_i's from some population, as in an observational study. This approach allows for more natural study of the asymptotic properties of the estimators. In the other interpretation (fixed design), the regressors X are treated as known constants set by a design, and y is sampled conditionally on the values of X as in an experiment. For practical purposes, this distinction is often unimportant, since estimation and inference is carried out while conditioning on X. All results stated in this article are within the random design framework.

The classical model focuses on the "finite sample" estimation and inference, meaning that the number of observations n is fixed. This contrasts with the other approaches, which study the asymptotic behavior of OLS, in which the behavior at a large number of samples is studied. To prove finite sample unbiasedness of the OLS estimator, we require the following assumptions.

Example of a cubic polynomial regression, which is a type of linear regression. Although polynomial regression fits a curve model to the data, as a statistical estimation problem it is linear, in the sense that the conditional expectation function E[y | x] is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.
  • Exogeneity. The regressors do not covary with the error term: E[ε_i x_i] = 0. This requires, for example, that there are no omitted variables that covary with observed variables and affect the response variable. An alternative (but stronger) statement that is often required when explaining linear regression in mathematical statistics is that the predictor variables x can be treated as fixed values, rather than random variables. This stronger form means, for example, that the predictor variables are assumed to be error-free, that is, not contaminated with measurement error. Although this assumption is not realistic in many settings, dropping it leads to more complex errors-in-variables models, instrumental variable models and the like.
  • Linearity, or correct specification. This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables. Note that this assumption is much less restrictive than it may at first seem. Because the predictor variables are treated as fixed values (see above), linearity is really only a restriction on the parameters. The predictor variables themselves can be arbitrarily transformed, and in fact multiple copies of the same underlying predictor variable can be added, each one transformed differently. This technique is used, for example, in polynomial regression, which uses linear regression to fit the response variable as an arbitrary polynomial function (up to a given degree) of a predictor variable. With this much flexibility, models such as polynomial regression often have "too much power", in that they tend to overfit the data. As a result, some kind of regularization must typically be used to prevent unreasonable solutions coming out of the estimation process. Common examples are ridge regression and lasso regression. Bayesian linear regression can also be used, which by its nature is more or less immune to the problem of overfitting. (In fact, ridge regression and lasso regression can both be viewed as special cases of Bayesian linear regression, with particular types of prior distributions placed on the regression coefficients.)
  • Constant variance or homoscedasticity. This means that the variance of the errors does not depend on the values of the predictor variables: E[ε_i² | x_i] = σ². Thus the variability of the responses for given fixed values of the predictors is the same regardless of how large or small the responses are. This is often not the case, as a variable whose mean is large will typically have a greater variance than one whose mean is small. For example, a person whose income is predicted to be $100,000 may easily have an actual income of $80,000 or $120,000—i.e., a standard deviation of around $20,000—while another person with a predicted income of $10,000 is unlikely to have the same $20,000 standard deviation, since that would imply their actual income could vary anywhere between −$10,000 and $30,000. (In fact, as this shows, in many cases—often the same cases where the assumption of normally distributed errors fails—the variance or standard deviation should be predicted to be proportional to the mean, rather than constant.) The absence of homoscedasticity is called heteroscedasticity. In order to check this assumption, a plot of residuals versus predicted values (or the values of each individual predictor) can be examined for a "fanning effect" (i.e., increasing or decreasing vertical spread as one moves left to right on the plot). A plot of the absolute or squared residuals versus the predicted values (or each predictor) can also be examined for a trend or curvature. Formal tests can also be used; see Heteroscedasticity. The presence of heteroscedasticity will result in an overall "average" estimate of variance being used instead of one that takes into account the true variance structure. This leads to less precise (but in the case of ordinary least squares, not biased) parameter estimates and biased standard errors, resulting in misleading tests and interval estimates. The mean squared error for the model will also be wrong. Various estimation techniques including weighted least squares and the use of heteroscedasticity-consistent standard errors can handle heteroscedasticity in a quite general way. Bayesian linear regression techniques can also be used when the variance is assumed to be a function of the mean. It is also possible in some cases to fix the problem by applying a transformation to the response variable (e.g., fitting the logarithm of the response variable using a linear regression model, which implies that the response variable itself has a log-normal distribution rather than a normal distribution).
    Visualization of heteroscedasticity in a scatter plot against 100 random fitted values using Matlab.
To check for violations of the assumptions of linearity, constant variance, and independence of errors within a linear regression model, the residuals are typically plotted against the predicted values (or each of the individual predictors). An apparently random scatter of points about the horizontal midline at 0 is ideal, but cannot rule out certain kinds of violations, such as autocorrelation in the errors or their correlation with one or more covariates.
  • Uncorrelatedness of errors. This assumes that the errors of the response variables are uncorrelated with each other: E[ε_i ε_j | x_i, x_j] = 0 for i ≠ j. Some methods such as generalized least squares are capable of handling correlated errors, although they typically require significantly more data unless some sort of regularization is used to bias the model towards assuming uncorrelated errors. Bayesian linear regression is a general way of handling this issue. Full statistical independence is a stronger condition than mere lack of correlation and is often not needed, although it implies mean-independence.
  • Lack of perfect multicollinearity in the predictors. For standard least squares estimation methods, the design matrix X must have full column rank p:[18] Pr[rank(X) = p] = 1. If this assumption is violated, perfect multicollinearity exists in the predictor variables, meaning a linear relationship exists between two or more predictor variables. Multicollinearity can be caused by accidentally duplicating a variable in the data, using a linear transformation of a variable along with the original (e.g., the same temperature measurements expressed in Fahrenheit and Celsius), or including a linear combination of multiple variables in the model, such as their mean. It can also happen if there is too little data available compared to the number of parameters to be estimated (e.g., fewer data points than regression coefficients). Near violations of this assumption, where predictors are highly but not perfectly correlated, can reduce the precision of parameter estimates (see Variance inflation factor). In the case of perfect multicollinearity, the parameter vector β will be non-identifiable—it has no unique solution. In such a case, only some of the parameters can be identified (i.e., their values can only be estimated within some linear subspace of the full parameter space R^p). See partial least squares regression. Methods for fitting linear models with multicollinearity have been developed,[19][20][21][22] some of which require additional assumptions such as "effect sparsity"—that a large fraction of the effects are exactly zero. Note that the more computationally expensive iterated algorithms for parameter estimation, such as those used in generalized linear models, do not suffer from this problem.

Violations of these assumptions can result in biased estimates of β, biased standard errors, untrustworthy confidence intervals and significance tests. Beyond these assumptions, several other statistical properties of the data strongly influence the performance of different estimation methods:

  • The statistical relationship between the error terms and the regressors plays an important role in determining whether an estimation procedure has desirable sampling properties such as being unbiased and consistent.
  • The arrangement, or probability distribution, of the predictor variables x has a major influence on the precision of estimates of β. Sampling and design of experiments are highly developed subfields of statistics that provide guidance for collecting data in such a way as to achieve a precise estimate of β.

Properties


Finite sample properties


First of all, under the strict exogeneity assumption the OLS estimators β̂ and s² are unbiased, meaning that their expected values coincide with the true values of the parameters:[23]

\operatorname{E}[\, \hat{\beta} \mid X \,] = \beta, \qquad \operatorname{E}[\, s^2 \mid X \,] = \sigma^2 .

If the strict exogeneity does not hold (as is the case with many time series models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.

The variance-covariance matrix (or simply covariance matrix) of β̂ is equal to[24]

\operatorname{Var}[\, \hat{\beta} \mid X \,] = \sigma^2 \left( X^{\mathsf{T}} X \right)^{-1} = \sigma^2 Q .

In particular, the standard error of each coefficient β̂_j is equal to the square root of the j-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity σ² with its estimate s². Thus,

\widehat{\operatorname{s.e.}}(\hat{\beta}_j) = \sqrt{ s^2 \left( X^{\mathsf{T}} X \right)^{-1}_{jj} } .
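A sketch computing these standard errors from the formula above (Python/NumPy, simulated data); a statistical package would report the same numbers:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                  # unbiased variance estimate

cov_beta = s2 * np.linalg.inv(X.T @ X)        # estimated Var[beta_hat | X]
std_err = np.sqrt(np.diag(cov_beta))          # standard error of each coefficient

print(beta_hat)
print(std_err)
```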

It can also easily be shown that the estimator β̂ is uncorrelated with the residuals from the model:[24]

\operatorname{Cov}[\, \hat{\beta}, \hat{\varepsilon} \mid X \,] = 0 .

The Gauss–Markov theorem states that under the spherical errors assumption (that is, the errors should be uncorrelated and homoscedastic) the estimator β̂ is efficient in the class of linear unbiased estimators. This is called the best linear unbiased estimator (BLUE). Efficiency should be understood in the following sense: if we were to find some other estimator β̃ which would be linear in y and unbiased, then[24]

\operatorname{Var}[\, \tilde{\beta} \mid X \,] - \operatorname{Var}[\, \hat{\beta} \mid X \,] \geq 0

in the sense that this is a nonnegative-definite matrix. This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms ε, other, non-linear estimators may provide better results than OLS.

Assuming normality


The properties listed so far are all valid regardless of the underlying distribution of the error terms. However, if one is willing to assume that the normality assumption holds (that is, that ε ~ N(0, σ²I_n)), then additional properties of the OLS estimators can be stated.

The estimator β̂ is normally distributed, with mean and variance as given before:[25]

\hat{\beta} \ \sim \ \mathcal{N} \big( \beta, \ \sigma^2 (X^{\mathsf{T}} X)^{-1} \big) .

This estimator reaches the Cramér–Rao bound for the model, and thus is optimal in the class of all unbiased estimators.[17] Note that unlike the Gauss–Markov theorem, this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.

The estimator s² will be proportional to the chi-squared distribution:[26]

s^2 \ \sim \ \frac{\sigma^2}{n - p} \cdot \chi^2_{n - p}

The variance of this estimator is equal to 2σ⁴/(n − p), which does not attain the Cramér–Rao bound of 2σ⁴/n. However, it was shown that there are no unbiased estimators of σ² with variance smaller than that of the estimator s².[27] If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the mean squared error) estimator in this class will be σ̃² = SSR / (n − p + 2), which even beats the Cramér–Rao bound in the case when there is only one regressor (p = 1).[28]

Moreover, the estimators β̂ and s² are independent,[29] a fact which comes in useful when constructing the t- and F-tests for the regression.

Influential observations

Main article: Influential observation
See also: Leverage (statistics)

As was mentioned before, the estimator β̂ is linear in y, meaning that it represents a linear combination of the dependent variables y_i. The weights in this linear combination are functions of the regressors X, and generally are unequal. The observations with high weights are called influential because they have a more pronounced effect on the value of the estimator.

To analyze which observations are influential we remove a specific j-th observation and consider how much the estimated quantities are going to change (similarly to the jackknife method). It can be shown that the change in the OLS estimator for β will be equal to[30]

\hat{\beta}^{(j)} - \hat{\beta} = - \frac{1}{1 - h_j} \left( X^{\mathsf{T}} X \right)^{-1} x_j^{\mathsf{T}} \hat{\varepsilon}_j \, ,

where h_j = x_j^T (X^T X)^{-1} x_j is the j-th diagonal element of the hat matrix P, and x_j is the vector of regressors corresponding to the j-th observation. Similarly, the change in the predicted value for the j-th observation resulting from omitting that observation from the dataset will be equal to[30]

\hat{y}_j^{(j)} - \hat{y}_j = x_j^{\mathsf{T}} \hat{\beta}^{(j)} - x_j^{\mathsf{T}} \hat{\beta} = - \frac{h_j}{1 - h_j} \, \hat{\varepsilon}_j

From the properties of the hat matrix, 0 ≤ h_j ≤ 1, and they sum up to p, so that on average h_j ≈ p/n. These quantities h_j are called the leverages, and observations with high h_j are called leverage points.[31] Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
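A sketch of the leverages (Python/NumPy, simulated data), together with a leave-one-out check of the prediction-change formula above:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 1.5, -2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Leverages: diagonal of the hat matrix, h_j = x_j^T (X^T X)^{-1} x_j.
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
print(h.sum())                       # equals p

# Leave out observation j and compare with the closed-form prediction change.
j = int(np.argmax(h))                # the highest-leverage point
mask = np.arange(n) != j
beta_loo, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)

change_direct = X[j] @ beta_loo - X[j] @ beta_hat
change_formula = -h[j] / (1.0 - h[j]) * resid[j]
print(np.isclose(change_direct, change_formula))   # True
```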

Partitioned regression


Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes the form

y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon ,

where X_1 and X_2 have dimensions n×p_1 and n×p_2, and β_1, β_2 are p_1×1 and p_2×1 vectors, with p_1 + p_2 = p.

The Frisch–Waugh–Lovell theorem states that in this regression the residuals ε̂ and the OLS estimate β̂_2 will be numerically identical to the residuals and the OLS estimate for β_2 in the following regression:[32]

M_1 y = M_1 X_2 \beta_2 + \eta \, ,

where M_1 is the annihilator matrix for the regressors X_1.

The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor and then running the regression for the de-meaned variables but without the constant term.
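A numerical check of the Frisch–Waugh–Lovell theorem (Python/NumPy, simulated data): regressing M_1 y on M_1 X_2 reproduces the coefficients on X_2 from the full regression.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p1, p2 = 200, 2, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])       # includes a constant
X2 = rng.normal(size=(n, p2))
y = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([-2.0, 3.0]) + rng.normal(size=n)

# Full regression on [X1 X2]; the last p2 coefficients belong to X2.
X = np.hstack([X1, X2])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Annihilator for X1, then the partitioned regression of the theorem.
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
beta2_fwl, *_ = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)

print(np.allclose(beta_full[p1:], beta2_fwl))   # True
```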

Large sample properties


The least squares estimators are point estimates of the linear regression model parameters β. However, generally we also want to know how close those estimates might be to the true values of the parameters. In other words, we want to construct interval estimates.

Since we have not made any assumption about the distribution of the error term ε_i, it is impossible to infer the distribution of the estimators β̂ and σ̂². Nevertheless, we can apply the central limit theorem to derive their asymptotic properties as the sample size n goes to infinity. While the sample size is necessarily finite, it is customary to assume that n is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit.

We can show that under the model assumptions, the least squares estimator for β is consistent (that is, β̂ converges in probability to β) and asymptotically normal:

(\hat{\beta} - \beta) \ \xrightarrow{d} \ \mathcal{N} \big( 0, \; \sigma^2 Q_{xx}^{-1} \big) ,

where Q_{xx} = X^{\mathsf{T}} X.

Inference

Main articles: Confidence interval and Prediction interval

Using this asymptotic distribution, approximate two-sided confidence intervals for the j-th component of the vector β can be constructed as

\beta_j \in \Big[ \ \hat{\beta}_j \pm q^{\mathcal{N}(0,1)}_{1 - \frac{\alpha}{2}} \sqrt{ \hat{\sigma}^2 \left[ Q_{xx}^{-1} \right]_{jj} } \ \Big]   at the 1 − α confidence level,

where q denotes the quantile function of the standard normal distribution, and [·]_{jj} is the j-th diagonal element of a matrix.

Similarly, the least squares estimator for σ² is also consistent and asymptotically normal (provided that the fourth moment of ε_i exists) with limiting distribution

(\hat{\sigma}^2 - \sigma^2) \ \xrightarrow{d} \ \mathcal{N} \left( 0, \; \operatorname{E} \left[ \varepsilon_i^4 \right] - \sigma^4 \right) .

These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example, consider the problem of prediction. Suppose x_0 is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The mean response is the quantity y_0 = x_0^T β, whereas the predicted response is ŷ_0 = x_0^T β̂. Clearly the predicted response is a random variable; its distribution can be derived from that of β̂:

\left( \hat{y}_0 - y_0 \right) \ \xrightarrow{d} \ \mathcal{N} \left( 0, \; \sigma^2 x_0^{\mathsf{T}} Q_{xx}^{-1} x_0 \right) ,

which allows confidence intervals for the mean response y_0 to be constructed:

y_0 \in \left[ \ x_0^{\mathsf{T}} \hat{\beta} \pm q^{\mathcal{N}(0,1)}_{1 - \frac{\alpha}{2}} \sqrt{ \hat{\sigma}^2 x_0^{\mathsf{T}} Q_{xx}^{-1} x_0 } \ \right]   at the 1 − α confidence level.
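A sketch constructing these approximate intervals at the 95% level (Python/NumPy with SciPy for the normal quantile; simulated data, and following the article's convention of using σ̂² and Q_xx = X^T X):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
n, p = 300, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -0.5, 2.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n                # MLE variance estimate, as in the text
Qxx_inv = np.linalg.inv(X.T @ X)

alpha = 0.05
q = norm.ppf(1 - alpha / 2)                   # standard normal quantile

# Approximate CIs for each coefficient beta_j.
half_width = q * np.sqrt(sigma2_hat * np.diag(Qxx_inv))
print(np.column_stack([beta_hat - half_width, beta_hat + half_width]))

# Approximate CI for the mean response at a new point x0.
x0 = np.array([1.0, 0.2, -0.3])
y0_hat = x0 @ beta_hat
hw0 = q * np.sqrt(sigma2_hat * x0 @ Qxx_inv @ x0)
print(y0_hat - hw0, y0_hat + hw0)
```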

Hypothesis testing

Main article: Hypothesis testing

Two hypothesis tests are particularly widely used. First, one wants to know if the estimated regression equation is any better than simply predicting that all values of the response variable equal its sample mean (if not, it is said to have no explanatory power). The null hypothesis of no explanatory value of the estimated regression is tested using an F-test. If the calculated F-value is found to be large enough to exceed its critical value for the pre-chosen level of significance, the null hypothesis is rejected and the alternative hypothesis, that the regression has explanatory power, is accepted. Otherwise, the null hypothesis of no explanatory power is accepted.

Second, for each explanatory variable of interest, one wants to know whether its estimated coefficient differs significantly from zero—that is, whether this particular explanatory variable in fact has explanatory power in predicting the response variable. Here the null hypothesis is that the true coefficient is zero. This hypothesis is tested by computing the coefficient's t-statistic, as the ratio of the coefficient estimate to its standard error. If the t-statistic is larger than a predetermined value, the null hypothesis is rejected and the variable is found to have explanatory power, with its coefficient significantly different from zero. Otherwise, the null hypothesis of a zero value of the true coefficient is accepted.

In addition, the Chow test is used to test whether two subsamples both have the same underlying true coefficient values. The sum of squared residuals of regressions on each of the subsets and on the combined data set are compared by computing an F-statistic; if this exceeds a critical value, the null hypothesis of no difference between the two subsets is rejected; otherwise, it is accepted.
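A sketch of both tests (Python/NumPy with SciPy for the reference distributions; simulated data). The overall F-statistic is computed here from R², which is equivalent to comparing restricted and unrestricted sums of squared residuals when the model contains an intercept; the per-coefficient t-statistics use the estimated standard errors:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

# t-test for each coefficient: H0 is beta_j = 0.
t_stat = beta_hat / se
t_pval = 2 * stats.t.sf(np.abs(t_stat), df=n - p)

# Overall F-test: all slope coefficients (excluding the intercept) are zero.
r2 = 1 - resid @ resid / np.sum((y - y.mean())**2)
f_stat = (r2 / (p - 1)) / ((1 - r2) / (n - p))
f_pval = stats.f.sf(f_stat, p - 1, n - p)

print(t_stat, t_pval)
print(f_stat, f_pval)
```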

Violations of assumptions


Time series model


In a time series model, we require the stochastic process {x_i, y_i} to be stationary and ergodic; if {x_i, y_i} is nonstationary, OLS results are often biased unless {x_i, y_i} is co-integrating.[33]

We still require the regressors to be strictly exogenous: E[x_i ε_i] = 0 for all i = 1, …, n. If they are only predetermined, OLS is biased in finite samples.

Finally, the assumptions on the variance take the form of requiring that {x_i ε_i} is a martingale difference sequence, with a finite matrix of second moments Q_{xxε²} = E[ε_i² x_i x_i^T].

Constrained estimation

Main article: Ridge regression

Suppose it is known that the coefficients in the regression satisfy a system of linear equations

A\colon \quad Q^{\mathsf{T}} \beta = c ,

where Q is a p×q matrix of full rank, and c is a q×1 vector of known constants, where q < p. In this case least squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint A. The constrained least squares (CLS) estimator can be given by an explicit formula:[34]

\hat{\beta}^c = \hat{\beta} - \left( X^{\mathsf{T}} X \right)^{-1} Q \Big( Q^{\mathsf{T}} \left( X^{\mathsf{T}} X \right)^{-1} Q \Big)^{-1} \left( Q^{\mathsf{T}} \hat{\beta} - c \right) .

This expression for the constrained estimator is valid as long as the matrix X^T X is invertible. It was assumed from the beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, β will not be identifiable. However, it may happen that adding the restriction A makes β identifiable, in which case one would like to find the formula for the estimator. The estimator is equal to[35]

\hat{\beta}^c = R \left( R^{\mathsf{T}} X^{\mathsf{T}} X R \right)^{-1} R^{\mathsf{T}} X^{\mathsf{T}} y + \Big( I_p - R \left( R^{\mathsf{T}} X^{\mathsf{T}} X R \right)^{-1} R^{\mathsf{T}} X^{\mathsf{T}} X \Big) Q \left( Q^{\mathsf{T}} Q \right)^{-1} c ,

where R is a p×(p − q) matrix such that the matrix [Q R] is non-singular, and R^T Q = 0. Such a matrix can always be found, although generally it is not unique. The second formula coincides with the first when X^T X is invertible.[35]
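A sketch of the first CLS formula (Python/NumPy, simulated data, with a made-up constraint β_2 + β_3 = 1), including a check that the constraint holds exactly for the constrained estimate:

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Constraint A: beta_2 + beta_3 = 1, written as Q^T beta = c.
Q = np.array([[0.0], [1.0], [1.0], [0.0]])   # p x q, here q = 1
c = np.array([1.0])                          # q x 1

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                 # unconstrained OLS

# Constrained least squares estimator (first formula).
middle = np.linalg.inv(Q.T @ XtX_inv @ Q)
beta_cls = beta_hat - XtX_inv @ Q @ middle @ (Q.T @ beta_hat - c)

print(Q.T @ beta_cls)        # equals c: the constraint is satisfied exactly
print(beta_hat, beta_cls)
```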

Example with real data

See also: Simple linear regression § Example, and Linear least squares § Example

The following data set gives average heights and weights for American women aged 30–39 (source: The World Almanac and Book of Facts, 1975).

Height (m)   1.47   1.50   1.52   1.55   1.57
Weight (kg)  52.21  53.12  54.48  55.84  57.20
Height (m)   1.60   1.63   1.65   1.68   1.70
Weight (kg)  58.57  59.93  61.29  63.11  64.47
Height (m)   1.73   1.75   1.78   1.80   1.83
Weight (kg)  66.28  68.10  69.92  72.19  74.46

Scatterplot of the data; the relationship is slightly curved but close to linear.

When only one dependent variable is being modeled, a scatterplot will suggest the form and strength of the relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and other aspects of the data that may complicate the interpretation of a fitted regression model. The scatterplot suggests that the relationship is strong and can be approximated as a quadratic function. OLS can handle non-linear relationships by introducing the regressor HEIGHT². The regression model then becomes a multiple linear model:

w_i = \beta_1 + \beta_2 h_i + \beta_3 h_i^2 + \varepsilon_i .
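A sketch reproducing this fit from the data table above (Python/NumPy); the estimates and standard errors should come out close to the values reported in the output table below:

```python
import numpy as np

# Heights (m) and weights (kg) from the table above.
height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

# Design matrix for the quadratic model w = b1 + b2*h + b3*h^2 + error.
X = np.column_stack([np.ones_like(height), height, height**2])
beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)

resid = weight - X @ beta_hat
n, p = X.shape
s2 = resid @ resid / (n - p)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

print(beta_hat)   # roughly [128.81, -143.16, 61.96]
print(se)         # roughly [16.3, 19.8, 6.0]
```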
Fitted regression

The output from most popular statistical packages will look similar to this:

Method                Least squares
Dependent variable    WEIGHT
Observations          15

Parameter   Value       Std error   t-statistic   p-value
β1           128.8128   16.3083       7.8986      0.0000
β2          −143.1620   19.8332      −7.2183      0.0000
β3            61.9603    6.0084      10.3122      0.0000

R²                    0.9989      S.E. of regression     0.2516
Adjusted R²           0.9987      Model sum-of-sq.       692.61
Log-likelihood        1.0890      Residual sum-of-sq.    0.7595
Durbin–Watson stat.   2.1013      Total sum-of-sq.       693.37
Akaike criterion      0.2548      F-statistic            5471.2
Schwarz criterion     0.3964      p-value (F-stat)       0.0000

In this table:

  • The Value column gives the least squares estimates of the parameters β_j.
  • The Std error column shows standard errors of each coefficient estimate: σ̂_j = (σ̂² [Q_xx^{-1}]_{jj})^{1/2}.
  • The t-statistic and p-value columns test whether any of the coefficients might be equal to zero. The t-statistic is calculated simply as t = β̂_j / σ̂_j. If the errors ε follow a normal distribution, t follows a Student-t distribution. Under weaker conditions, t is asymptotically normal. Large values of t indicate that the null hypothesis can be rejected and that the corresponding coefficient is not zero. The second column, p-value, expresses the results of the hypothesis test as a significance level. Conventionally, p-values smaller than 0.05 are taken as evidence that the population coefficient is nonzero.
  • R-squared is the coefficient of determination indicating goodness-of-fit of the regression. This statistic will be equal to one if fit is perfect, and to zero when regressors X have no explanatory power whatsoever. This is a biased estimate of the population R-squared, and will never decrease if additional regressors are added, even if they are irrelevant.
  • Adjusted R-squared is a slightly modified version of R², designed to penalize for the excess number of regressors which do not add to the explanatory power of the regression. This statistic is always smaller than R², can decrease as new regressors are added, and can even be negative for poorly fitting models:
\overline{R}^{\,2} = 1 - \frac{n-1}{n-p} (1 - R^2)
  • Log-likelihood is calculated under the assumption that the errors follow a normal distribution. Even though the assumption is not very reasonable, this statistic may still find its use in conducting LR tests.
  • Durbin–Watson statistic tests whether there is any evidence of serial correlation between the residuals. As a rule of thumb, a value smaller than 2 is evidence of positive serial correlation.
  • Akaike information criterion and Schwarz criterion are both used for model selection. Generally, when comparing two alternative models, smaller values of one of these criteria indicate a better model.[36]
  • Standard error of regression is an estimate of σ, the standard error of the error term.
  • Total sum of squares, model sum of squares, and residual sum of squares tell us how much of the initial variation in the sample was explained by the regression.
  • F-statistic tests the hypothesis that all coefficients (except the intercept) are equal to zero. This statistic has an F(p−1, n−p) distribution under the null hypothesis and normality assumption, and its p-value indicates the probability of observing a value of the statistic at least this large if the null hypothesis were true. Note that when errors are not normal this statistic becomes invalid, and other tests such as the Wald test or LR test should be used.
Residuals plot

Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the assumed form of the model. These are some of the common diagnostic plots:

  • Residuals against the explanatory variables in the model. A non-linear relation between these variables suggests that the linearity of the conditional mean function may not hold. Different levels of variability in the residuals for different levels of the explanatory variables suggest possible heteroscedasticity.
  • Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
  • Residuals against the fitted values, ŷ.
  • Residuals against the preceding residual. This plot may identify serial correlations in the residuals.

An important consideration when carrying out statistical inference using regression models is how the data were sampled. In this example, the data are averages rather than measurements on individual women. The fit of the model is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy based only on her height.

Sensitivity to rounding

Main article: Errors-in-variables models
See also: Quantization error model

This example also demonstrates that coefficients determined by these calculations are sensitive to how the data is prepared. The heights were originally given rounded to the nearest inch and have been converted and rounded to the nearest centimetre. Since the conversion factor is one inch to 2.54 cm this is not an exact conversion. The original inches can be recovered by Round(x/0.0254) and then re-converted to metric without rounding. If this is done the results become:

                                        Const      Height      Height²
Converted to metric with rounding       128.8128   −143.162     61.96033
Converted to metric without rounding    119.0205   −131.5076    58.5046

Residuals to a quadratic fit for correctly and incorrectly converted data.

Using either of these equations to predict the weight of a 5' 6" (1.6764 m) woman gives similar values: 62.94 kg with rounding vs. 62.98 kg without rounding. Thus a seemingly small variation in the data has a real effect on the coefficients but a small effect on the results of the equation.

While this may look innocuous in the middle of the data range it could become significant at the extremes or in the case where the fitted model is used to project outside the data range (extrapolation).

This highlights a common error: this example is an abuse of OLS, which inherently requires that the errors in the independent variable (in this case height) are zero or at least negligible. The initial rounding to the nearest inch plus any actual measurement errors constitute a finite and non-negligible error. As a result, the fitted parameters are not the best estimates they are presumed to be. Though not totally spurious, the error in the estimation will depend upon the relative size of the x and y errors.

Another example with less real data


Problem statement


We can use the least squares method to find the equation of a two-body orbit in polar base co-ordinates. The equation typically used is r(θ) = p / (1 − e cos θ), where r(θ) is the distance of the object from one of the bodies. In the equation the parameters p and e are used to determine the path of the orbit. We have measured the following data.

θ (in degrees)   43       45       52       93       108      116
r(θ)             4.7126   4.5542   4.0419   2.2187   1.8910   1.7599

We need to find the least-squares approximation of e and p for the given data.

Solution


First we need to represent e and p in a linear form. So we are going to rewrite the equation r(θ) as 1/r(θ) = 1/p − (e/p) cos(θ).

Furthermore, one could fit for apsides by expanding cos(θ) with an extra parameter as cos(θ − θ_0) = cos(θ)cos(θ_0) + sin(θ)sin(θ_0), which is linear in both cos(θ) and in the extra basis function sin(θ).

We use the original two-parameter form to represent our observational data as:

A^{\mathsf{T}} A \binom{x}{y} = A^{\mathsf{T}} b ,

where:

x = 1/p; y = e/p; A contains the coefficients of 1/p in the first column, which are all 1, and the coefficients of e/p in the second column, given by −cos(θ); and b = 1/r(θ), such that:

A = \begin{bmatrix} 1 & -0.731354 \\ 1 & -0.707107 \\ 1 & -0.615661 \\ 1 & \ \ 0.052336 \\ 1 & \ \ 0.309017 \\ 1 & \ \ 0.438371 \end{bmatrix}, \qquad b = \begin{bmatrix} 0.21220 \\ 0.21958 \\ 0.24741 \\ 0.45071 \\ 0.52883 \\ 0.56820 \end{bmatrix}.

On solving we get (x, y) = (0.43478, 0.30435),

so p = 1/x = 2.3000 and e = p·y = 0.70001.
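A sketch of this computation (Python/NumPy), starting from the measured angles and radii rather than from the pre-built matrix A:

```python
import numpy as np

theta_deg = np.array([43, 45, 52, 93, 108, 116])
r = np.array([4.7126, 4.5542, 4.0419, 2.2187, 1.8910, 1.7599])

# Linearised model: 1/r = x - y*cos(theta), with x = 1/p and y = e/p.
A = np.column_stack([np.ones_like(r), -np.cos(np.radians(theta_deg))])
b = 1.0 / r

# Solve the normal equations A^T A [x, y]^T = A^T b.
x, y = np.linalg.solve(A.T @ A, A.T @ b)

p = 1.0 / x
e = p * y
print(x, y)   # roughly 0.43478 and 0.30435
print(p, e)   # roughly 2.3000 and 0.70001
```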


References

  1. ^"The Origins of Ordinary Least Squares Assumptions".Feature Column. 2022-03-01. Retrieved2024-05-16.
  2. ^"What is a complete list of the usual assumptions for linear regression?".Cross Validated. Retrieved2022-09-28.
  3. ^Goldberger, Arthur S. (1964)."Classical Linear Regression".Econometric Theory. New York: John Wiley & Sons. pp. 158.ISBN 0-471-31101-4.{{cite book}}:ISBN / Date incompatibility (help)
  4. ^Hayashi, Fumio (2000).Econometrics. Princeton University Press. p. 15.ISBN 9780691010182.
  5. ^Hayashi (2000, page 18).
  6. ^Ghilani, Charles D.; Wolf, Paul R. (12 June 2006).Adjustment Computations: Spatial Data Analysis. John Wiley & Sons.ISBN 9780471697282.
  7. ^Hofmann-Wellenhof, Bernhard; Lichtenegger, Herbert; Wasle, Elmar (20 November 2007).GNSS – Global Navigation Satellite Systems: GPS, GLONASS, Galileo, and more. Springer.ISBN 9783211730171.
  8. ^Xu, Guochang (5 October 2007).GPS: Theory, Algorithms and Applications. Springer.ISBN 9783540727156.
  9. ^abHayashi (2000, page 19)
  10. ^Hoaglin, David C.; Welsch, Roy E. (1978)."The Hat Matrix in Regression and ANOVA".The American Statistician.32 (1):17–22.doi:10.1080/00031305.1978.10479237.hdl:1721.1/1920.ISSN 0003-1305.
  11. ^Julian Faraway (2000),Practical Regression and Anova using R
  12. ^Kenney, J.; Keeping, E. S. (1963).Mathematics of Statistics. van Nostrand. p. 187.
  13. ^Zwillinger, Daniel (1995).Standard Mathematical Tables and Formulae. Chapman&Hall/CRC. p. 626.ISBN 0-8493-2479-3.
  14. ^Hayashi (2000, page 20)
  15. ^Akbarzadeh, Vahab (7 May 2014)."Line Estimation".
  16. ^Hayashi (2000, page 49)
  17. ^abHayashi (2000, page 52)
  18. ^Hayashi (2000, page 10)
  19. ^Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso".Journal of the Royal Statistical Society, Series B.58 (1):267–288.doi:10.1111/j.2517-6161.1996.tb02080.x.JSTOR 2346178.
  20. ^Efron, Bradley; Hastie, Trevor; Johnstone, Iain; Tibshirani, Robert (2004). "Least Angle Regression".The Annals of Statistics.32 (2):407–451.arXiv:math/0406456.doi:10.1214/009053604000000067.JSTOR 3448465.S2CID 204004121.
  21. ^Hawkins, Douglas M. (1973). "On the Investigation of Alternative Regressions by Principal Component Analysis".Journal of the Royal Statistical Society, Series C.22 (3):275–286.doi:10.2307/2346776.JSTOR 2346776.
  22. ^Jolliffe, Ian T. (1982). "A Note on the Use of Principal Components in Regression".Journal of the Royal Statistical Society, Series C.31 (3):300–303.doi:10.2307/2348005.JSTOR 2348005.
  23. ^Hayashi (2000, pages 27, 30)
  24. ^abcHayashi (2000, page 27)
  25. ^Amemiya, Takeshi (1985).Advanced Econometrics. Harvard University Press. p. 13.ISBN 9780674005600.
  26. ^Amemiya (1985, page 14)
  27. ^Rao, C. R. (1973).Linear Statistical Inference and its Applications (Second ed.). New York: J. Wiley & Sons. p. 319.ISBN 0-471-70823-2.
  28. ^Amemiya (1985, page 20)
  29. ^Amemiya (1985, page 27)
  30. ^abDavidson, Russell;MacKinnon, James G. (1993).Estimation and Inference in Econometrics. New York: Oxford University Press. p. 33.ISBN 0-19-506011-3.
  31. ^Davidson & MacKinnon (1993, page 36)
  32. ^Davidson & MacKinnon (1993, page 20)
  33. ^"Memento on EViews Output"(PDF). Retrieved28 December 2020.
  34. ^Amemiya (1985, page 21)
  35. ^abAmemiya (1985, page 22)
  36. ^Burnham, Kenneth P.; Anderson, David R. (2002).Model Selection and Multi-Model Inference (2nd ed.). Springer.ISBN 0-387-95364-7.
