Principal component regression

From Wikipedia, the free encyclopedia

In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). PCR is a form of reduced rank regression.[1] More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.

In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors. One typically uses only a subset of all the principal components for regression, making PCR a kind of regularized procedure and also a type of shrinkage estimator.

Often the principal components with higher variances (the ones based on eigenvectors corresponding to the higher eigenvalues of the sample variance-covariance matrix of the explanatory variables) are selected as regressors. However, for the purpose of predicting the outcome, the principal components with low variances may also be important, in some cases even more important.[2]

One major use of PCR lies in overcoming the multicollinearity problem which arises when two or more of the explanatory variables are close to being collinear.[3] PCR can aptly deal with such situations by excluding some of the low-variance principal components in the regression step. In addition, by usually regressing on only a subset of all the principal components, PCR can result in dimension reduction through substantially lowering the effective number of parameters characterizing the underlying model. This can be particularly useful in settings with high-dimensional covariates. Also, through appropriate selection of the principal components to be used for regression, PCR can lead to efficient prediction of the outcome based on the assumed model.

The principle

The PCR method may be broadly divided into three major steps (a brief code sketch follows the list):

1. Perform PCA on the observed data matrix for the explanatory variables to obtain the principal components, and then (usually) select a subset, based on some appropriate criteria, of the principal components so obtained for further use.
2. Now regress the observed vector of outcomes on the selected principal components as covariates, using ordinary least squares regression (linear regression) to get a vector of estimated regression coefficients (with dimension equal to the number of selected principal components).
3. Now transform this vector back to the scale of the actual covariates, using the selected PCA loadings (the eigenvectors corresponding to the selected principal components) to get the final PCR estimator (with dimension equal to the total number of covariates) for estimating the regression coefficients characterizing the original model.
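A minimal sketch of these three steps using NumPy and scikit-learn; the data matrix X and outcome Y are assumed centered (PCA re-centers X internally in any case), and the choice k = 3 is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pcr_fit(X, Y, k=3):
    pca = PCA(n_components=k)
    W_k = pca.fit_transform(X)                               # step 1: first k principal components
    ols = LinearRegression(fit_intercept=False).fit(W_k, Y)  # step 2: OLS on the components
    beta_k = pca.components_.T @ ols.coef_                   # step 3: back to the covariate scale
    return beta_k
```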

Details of the method

Data representation: Let $\mathbf{Y}_{n\times 1} = (y_1,\ldots,y_n)^T$ denote the vector of observed outcomes and $\mathbf{X}_{n\times p} = (\mathbf{x}_1,\ldots,\mathbf{x}_n)^T$ denote the corresponding data matrix of observed covariates, where $n$ and $p$ denote the size of the observed sample and the number of covariates respectively, with $n \geq p$. Each of the $n$ rows of $\mathbf{X}$ denotes one set of observations for the $p$-dimensional covariate and the respective entry of $\mathbf{Y}$ denotes the corresponding observed outcome.

Data pre-processing: Assume that $\mathbf{Y}$ and each of the $p$ columns of $\mathbf{X}$ have already been centered so that all of them have zero empirical means. This centering step is crucial (at least for the columns of $\mathbf{X}$) since PCR involves the use of PCA on $\mathbf{X}$ and PCA is sensitive to centering of the data.

Underlying model: Following centering, the standard Gauss–Markov linear regression model for $\mathbf{Y}$ on $\mathbf{X}$ can be represented as $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\boldsymbol{\beta} \in \mathbb{R}^p$ denotes the unknown parameter vector of regression coefficients and $\boldsymbol{\varepsilon}$ denotes the vector of random errors with $\operatorname{E}(\boldsymbol{\varepsilon}) = \mathbf{0}$ and $\operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2 I_{n\times n}$ for some unknown variance parameter $\sigma^2 > 0$.

Objective: The primary goal is to obtain an efficient estimator $\widehat{\boldsymbol{\beta}}$ for the parameter $\boldsymbol{\beta}$, based on the data. One frequently used approach for this is ordinary least squares regression which, assuming $\mathbf{X}$ has full column rank, gives the unbiased estimator $\widehat{\boldsymbol{\beta}}_{\mathrm{ols}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ of $\boldsymbol{\beta}$. PCR is another technique that may be used for the same purpose of estimating $\boldsymbol{\beta}$.

PCA step: PCR starts by performing a PCA on the centered data matrix $\mathbf{X}$. For this, let $\mathbf{X} = U \Delta V^T$ denote the singular value decomposition of $\mathbf{X}$, where $\Delta_{p\times p} = \operatorname{diag}[\delta_1,\ldots,\delta_p]$ with $\delta_1 \geq \cdots \geq \delta_p \geq 0$ denoting the non-negative singular values of $\mathbf{X}$, while the columns of $U_{n\times p} = [\mathbf{u}_1,\ldots,\mathbf{u}_p]$ and $V_{p\times p} = [\mathbf{v}_1,\ldots,\mathbf{v}_p]$ are both orthonormal sets of vectors denoting the left and right singular vectors of $\mathbf{X}$ respectively.

The principal components: $V \Lambda V^T$ gives a spectral decomposition of $\mathbf{X}^T\mathbf{X}$, where $\Lambda_{p\times p} = \operatorname{diag}[\lambda_1,\ldots,\lambda_p] = \operatorname{diag}[\delta_1^2,\ldots,\delta_p^2] = \Delta^2$ with $\lambda_1 \geq \cdots \geq \lambda_p \geq 0$ denoting the non-negative eigenvalues (also known as the principal values) of $\mathbf{X}^T\mathbf{X}$, while the columns of $V$ denote the corresponding orthonormal set of eigenvectors. Then $\mathbf{X}\mathbf{v}_j$ and $\mathbf{v}_j$ respectively denote the $j$th principal component and the $j$th principal component direction (or PCA loading) corresponding to the $j$th largest principal value $\lambda_j$, for each $j \in \{1,\ldots,p\}$.

Derived covariates: For any $k \in \{1,\ldots,p\}$, let $V_k$ denote the $p\times k$ matrix with orthonormal columns consisting of the first $k$ columns of $V$. Let $W_k = \mathbf{X} V_k = [\mathbf{X}\mathbf{v}_1,\ldots,\mathbf{X}\mathbf{v}_k]$ denote the $n\times k$ matrix having the first $k$ principal components as its columns. $W_k$ may be viewed as the data matrix obtained by using the transformed covariates $\mathbf{x}_i^k = V_k^T \mathbf{x}_i \in \mathbb{R}^k$ instead of the original covariates $\mathbf{x}_i \in \mathbb{R}^p$, for all $1 \leq i \leq n$.

The PCR estimator: Let $\widehat{\gamma}_k = (W_k^T W_k)^{-1} W_k^T \mathbf{Y} \in \mathbb{R}^k$ denote the vector of estimated regression coefficients obtained by ordinary least squares regression of the response vector $\mathbf{Y}$ on the data matrix $W_k$. Then, for any $k \in \{1,\ldots,p\}$, the final PCR estimator of $\boldsymbol{\beta}$ based on using the first $k$ principal components is given by $\widehat{\boldsymbol{\beta}}_k = V_k \widehat{\gamma}_k \in \mathbb{R}^p$.
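Since the columns of $W_k$ are $\mathbf{X}\mathbf{v}_j = \delta_j \mathbf{u}_j$, we have $W_k^T W_k = \operatorname{diag}(\lambda_1,\ldots,\lambda_k)$, so the OLS step reduces to elementwise division by the leading eigenvalues. A NumPy sketch of the estimator along these lines, assuming centered inputs (the function name is illustrative and is reused in later snippets):

```python
import numpy as np

def pcr_estimator(X, Y, k):
    """PCR estimator beta_k for centered X (n x p) and centered Y (n,)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T
    V_k = Vt[:k].T                                     # p x k loading matrix
    W_k = X @ V_k                                      # n x k principal components
    gamma_k = (W_k.T @ Y) / d[:k] ** 2                 # (W_k^T W_k)^{-1} W_k^T Y, since W_k^T W_k = diag(lambda_j)
    return V_k @ gamma_k                               # beta_k = V_k gamma_k in R^p
```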

Fundamental characteristics and applications of the PCR estimator

Two basic properties

The fitting process for obtaining the PCR estimator involves regressing the response vector on the derived data matrix $W_k$, which has orthogonal columns for any $k \in \{1,\ldots,p\}$ since the principal components are mutually orthogonal. Thus in the regression step, performing a multiple linear regression jointly on the $k$ selected principal components as covariates is equivalent to carrying out $k$ independent simple linear regressions (or univariate regressions) separately on each of the $k$ selected principal components as a covariate.

When all the principal components are selected for regression so that $k = p$, the PCR estimator is equivalent to the ordinary least squares estimator. Thus, $\widehat{\boldsymbol{\beta}}_p = \widehat{\boldsymbol{\beta}}_{\mathrm{ols}}$. This is easily seen from the fact that $W_p = \mathbf{X} V_p = \mathbf{X} V$ and also by observing that $V$ is an orthogonal matrix.
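A quick numerical check of this equivalence on simulated data, reusing the pcr_estimator sketch above (the simulated model is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                                   # center the columns of X
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)
Y -= Y.mean()                                         # center the outcome

beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
beta_p = pcr_estimator(X, Y, k=p)                     # PCR using all p principal components
print(np.allclose(beta_ols, beta_p))                  # True: beta_p equals beta_ols
```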

Variance reduction

For any $k \in \{1,\ldots,p\}$, the variance of $\widehat{\boldsymbol{\beta}}_k$ is given by

$$\operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) = \sigma^2\, V_k (W_k^T W_k)^{-1} V_k^T = \sigma^2\, V_k \operatorname{diag}\left(\lambda_1^{-1},\ldots,\lambda_k^{-1}\right) V_k^T = \sigma^2 \sum_{j=1}^{k} \frac{\mathbf{v}_j \mathbf{v}_j^T}{\lambda_j}.$$

In particular:

$$\operatorname{Var}(\widehat{\boldsymbol{\beta}}_p) = \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) = \sigma^2 \sum_{j=1}^{p} \frac{\mathbf{v}_j \mathbf{v}_j^T}{\lambda_j}.$$

Hence for all $k \in \{1,\ldots,p-1\}$ we have:

$$\operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) = \sigma^2 \sum_{j=k+1}^{p} \frac{\mathbf{v}_j \mathbf{v}_j^T}{\lambda_j}.$$

Thus, for all $k \in \{1,\ldots,p\}$ we have:

$$\operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) \succeq 0,$$

where $A \succeq 0$ indicates that a square symmetric matrix $A$ is non-negative definite. Consequently, any given linear form of the PCR estimator has a lower variance compared to that of the same linear form of the ordinary least squares estimator.

Addressing multicollinearity

Under multicollinearity, two or more of the covariates are highly correlated, so that one can be linearly predicted from the others with a non-trivial degree of accuracy. Consequently, the columns of the data matrix $\mathbf{X}$ that correspond to the observations for these covariates tend to become linearly dependent and therefore, $\mathbf{X}$ tends to become rank deficient, losing its full column rank structure. More quantitatively, one or more of the smaller eigenvalues of $\mathbf{X}^T\mathbf{X}$ get very close to or become exactly equal to $0$ under such situations. The variance expressions above indicate that these small eigenvalues have the maximum inflation effect on the variance of the least squares estimator, thereby destabilizing the estimator significantly when they are close to $0$. This issue can be effectively addressed by using a PCR estimator obtained by excluding the principal components corresponding to these small eigenvalues.
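As an illustration, the trace of the variance above is $\sigma^2\sum_{j=1}^{k}\lambda_j^{-1}$, so a single near-zero eigenvalue can dominate the OLS variance while leaving the truncated PCR variance essentially unaffected. A small simulated example (constants and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 1e-3 * rng.standard_normal(n)               # nearly collinear with x1
X = np.column_stack([x1, x2, rng.standard_normal(n)])
X -= X.mean(axis=0)

lam = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]      # eigenvalues of X^T X, descending
sigma2 = 1.0                                          # assume unit error variance for illustration
print(sigma2 * np.sum(1.0 / lam))                     # trace of Var(beta_ols): huge
print(sigma2 * np.sum(1.0 / lam[:2]))                 # trace of Var(beta_2): modest, smallest eigenvalue dropped
```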

Dimension reduction

PCR may also be used for performing dimension reduction. To see this, let $L_k$ denote any $p\times k$ matrix having orthonormal columns, for any $k \in \{1,\ldots,p\}$. Suppose now that we want to approximate each of the covariate observations $\mathbf{x}_i$ through the rank-$k$ linear transformation $L_k \mathbf{z}_i$ for some $\mathbf{z}_i \in \mathbb{R}^k$ ($1 \leq i \leq n$).

Then, it can be shown that

$$\sum_{i=1}^{n} \left\| \mathbf{x}_i - L_k \mathbf{z}_i \right\|^2$$

is minimized at $L_k = V_k$, the matrix with the first $k$ principal component directions as columns, and $\mathbf{z}_i = \mathbf{x}_i^k = V_k^T \mathbf{x}_i$, the corresponding $k$-dimensional derived covariates. Thus the $k$-dimensional principal components provide the best linear approximation of rank $k$ to the observed data matrix $\mathbf{X}$.

The corresponding reconstruction error is given by:

$$\sum_{i=1}^{n} \left\| \mathbf{x}_i - V_k \mathbf{x}_i^k \right\|^2 = \begin{cases} \sum_{j=k+1}^{p} \lambda_j & 1 \leqslant k < p \\ 0 & k = p \end{cases}$$

Thus any potential dimension reduction may be achieved by choosing $k$, the number of principal components to be used, through appropriate thresholding on the cumulative sum of the eigenvalues of $\mathbf{X}^T\mathbf{X}$. Since the smaller eigenvalues do not contribute significantly to the cumulative sum, the corresponding principal components may continue to be dropped as long as the desired threshold limit is not exceeded. The same criterion may also be used for addressing the multicollinearity issue, whereby the principal components corresponding to the smaller eigenvalues may be ignored as long as the threshold limit is maintained.
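A sketch of such a thresholding rule; the 99.5% cutoff is an arbitrary illustration, not a recommended default:

```python
import numpy as np

def choose_k(X, threshold=0.995):
    """Smallest k whose leading eigenvalues reach the given share of the total."""
    lam = np.linalg.svd(X, compute_uv=False) ** 2      # eigenvalues of X^T X, descending
    explained = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(explained, threshold) + 1)
```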

Regularization effect

Since the PCR estimator typically uses only a subset of all the principal components for regression, it can be viewed as a kind of regularized procedure. More specifically, for any $1 \leqslant k < p$, the PCR estimator $\widehat{\boldsymbol{\beta}}_k$ denotes the regularized solution to the following constrained minimization problem:

$$\min_{\boldsymbol{\beta}_* \in \mathbb{R}^p} \left\| \mathbf{Y} - \mathbf{X}\boldsymbol{\beta}_* \right\|^2 \quad \text{subject to} \quad \boldsymbol{\beta}_* \perp \{\mathbf{v}_{k+1},\ldots,\mathbf{v}_p\}.$$

The constraint may be equivalently written as:

$$V_{(p-k)}^T \boldsymbol{\beta}_* = \mathbf{0},$$

where:

$$V_{(p-k)} = \left[\mathbf{v}_{k+1},\ldots,\mathbf{v}_p\right]_{p\times(p-k)}.$$

Thus, when only a proper subset of all the principal components is selected for regression, the PCR estimator so obtained is based on a hard form of regularization that constrains the resulting solution to the column space of the selected principal component directions, and consequently restricts it to be orthogonal to the excluded directions.
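A quick check that the PCR solution indeed satisfies this hard constraint, reusing the pcr_estimator sketch from earlier (data and k are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 80, 6, 3
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
Y = X @ rng.standard_normal(p) + rng.standard_normal(n); Y -= Y.mean()

_, _, Vt = np.linalg.svd(X, full_matrices=False)
beta_k = pcr_estimator(X, Y, k)
print(np.allclose(Vt[k:] @ beta_k, 0.0))              # True: V_(p-k)^T beta_k = 0
```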

Optimality of PCR among a class of regularized estimators

Given the constrained minimization problem as defined above, consider the following generalized version of it:

$$\min_{\boldsymbol{\beta}_* \in \mathbb{R}^p} \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}_*\|^2 \quad \text{subject to} \quad L_{(p-k)}^T \boldsymbol{\beta}_* = \mathbf{0},$$

where $L_{(p-k)}$ denotes any full column rank matrix of order $p\times(p-k)$ with $1 \leqslant k < p$.

Let $\widehat{\boldsymbol{\beta}}_L$ denote the corresponding solution. Thus

$$\widehat{\boldsymbol{\beta}}_L = \arg\min_{\boldsymbol{\beta}_* \in \mathbb{R}^p} \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}_*\|^2 \quad \text{subject to} \quad L_{(p-k)}^T \boldsymbol{\beta}_* = \mathbf{0}.$$

Then the optimal choice of the restriction matrix $L_{(p-k)}$, for which the corresponding estimator $\widehat{\boldsymbol{\beta}}_L$ achieves the minimum prediction error, is given by:[4]

$$L_{(p-k)}^* = V_{(p-k)} \Lambda_{(p-k)}^{1/2},$$

where

$$\Lambda_{(p-k)}^{1/2} = \operatorname{diag}\left(\lambda_{k+1}^{1/2},\ldots,\lambda_p^{1/2}\right).$$

Quite clearly, the resulting optimal estimator $\widehat{\boldsymbol{\beta}}_{L^*}$ is then simply given by the PCR estimator $\widehat{\boldsymbol{\beta}}_k$ based on the first $k$ principal components.

Efficiency

Since the ordinary least squares estimator is unbiased for $\boldsymbol{\beta}$, we have

$$\operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) = \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}),$$

where MSE denotes the mean squared error. Now, if for some $k \in \{1,\ldots,p\}$ we additionally have $V_{(p-k)}^T \boldsymbol{\beta} = \mathbf{0}$, then the corresponding $\widehat{\boldsymbol{\beta}}_k$ is also unbiased for $\boldsymbol{\beta}$ and therefore

$$\operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) = \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_k).$$

We have already seen that

$$\forall j \in \{1,\ldots,p\}: \quad \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_j) \succeq 0,$$

which then implies:

$$\operatorname{MSE}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_k) \succeq 0$$

for that particular $k$. Thus in that case, the corresponding $\widehat{\boldsymbol{\beta}}_k$ would be a more efficient estimator of $\boldsymbol{\beta}$ compared to $\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}$, based on using the mean squared error as the performance criterion. In addition, any given linear form of the corresponding $\widehat{\boldsymbol{\beta}}_k$ would also have a lower mean squared error compared to that of the same linear form of $\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}$.

Now suppose that for a given $k \in \{1,\ldots,p\}$, $V_{(p-k)}^T \boldsymbol{\beta} \neq \mathbf{0}$. Then the corresponding $\widehat{\boldsymbol{\beta}}_k$ is biased for $\boldsymbol{\beta}$. However, since

$$\forall k \in \{1,\ldots,p\}: \quad \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) \succeq 0,$$

it is still possible that $\operatorname{MSE}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_k) \succeq 0$, especially if $k$ is such that the excluded principal components correspond to the smaller eigenvalues, thereby resulting in lower bias.

In order to ensure efficient estimation and prediction performance of PCR as an estimator of $\boldsymbol{\beta}$, Park (1981)[4] proposes the following guideline for selecting the principal components to be used for regression: drop the $j$th principal component if and only if $\lambda_j < p\sigma^2 / \boldsymbol{\beta}^T\boldsymbol{\beta}$. Practical implementation of this guideline of course requires estimates for the unknown model parameters $\sigma^2$ and $\boldsymbol{\beta}$. In general, they may be estimated using the unrestricted least squares estimates obtained from the original full model. Park (1981) however provides a slightly modified set of estimates that may be better suited for this purpose.[4]
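A sketch of Park's rule with plug-in estimates from the full OLS fit; note that it uses the standard unrestricted estimates of $\sigma^2$ and $\boldsymbol{\beta}$ rather than Park's modified versions:

```python
import numpy as np

def park_selected_components(X, Y):
    """Indices j with lambda_j >= p * sigma2_hat / (beta_hat' beta_hat)."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # unrestricted OLS estimate of beta
    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)               # usual unbiased estimate of sigma^2
    lam = np.linalg.svd(X, compute_uv=False) ** 2      # eigenvalues of X^T X
    cutoff = p * sigma2_hat / (beta_hat @ beta_hat)
    return np.flatnonzero(lam >= cutoff)               # keep component j iff lambda_j >= cutoff
```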

Unlike the criterion based on the cumulative sum of the eigenvalues of $\mathbf{X}^T\mathbf{X}$, which is probably more suited for addressing the multicollinearity problem and for performing dimension reduction, the above criterion actually attempts to improve the prediction and estimation efficiency of the PCR estimator by involving both the outcome as well as the covariates in the process of selecting the principal components to be used in the regression step. Alternative approaches with similar goals include selection of the principal components based on cross-validation or the Mallows's Cp criterion. Often, the principal components are also selected based on their degree of association with the outcome.

Shrinkage effect of PCR

In general, PCR is essentially a shrinkage estimator that usually retains the high-variance principal components (corresponding to the higher eigenvalues of $\mathbf{X}^T\mathbf{X}$) as covariates in the model and discards the remaining low-variance components (corresponding to the lower eigenvalues of $\mathbf{X}^T\mathbf{X}$). Thus it exerts a discrete shrinkage effect on the low-variance components, nullifying their contribution completely in the original model. In contrast, the ridge regression estimator exerts a smooth shrinkage effect through the regularization parameter (or the tuning parameter) inherently involved in its construction. While it does not completely discard any of the components, it exerts a shrinkage effect over all of them in a continuous manner, so that the extent of shrinkage is higher for the low-variance components and lower for the high-variance components. Frank and Friedman (1993)[5] conclude that for the purpose of prediction itself, the ridge estimator, owing to its smooth shrinkage effect, is perhaps a better choice compared to the PCR estimator having a discrete shrinkage effect.
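Written in the principal-component basis, PCR applies a hard 0/1 "filter factor" to each component, whereas ridge regression with tuning parameter $\alpha$ applies the smooth factor $\lambda_j/(\lambda_j+\alpha)$. A small helper contrasting the two (the parameters are illustrative):

```python
import numpy as np

def filter_factors(singular_values, k, alpha):
    """Per-component shrinkage factors: PCR keeps or drops, ridge shrinks smoothly."""
    lam = np.asarray(singular_values, dtype=float) ** 2
    pcr_factors = (np.arange(lam.size) < k).astype(float)   # 1 for the first k components, else 0
    ridge_factors = lam / (lam + alpha)                      # closer to 0 for low-variance components
    return pcr_factors, ridge_factors
```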

In addition, the principal components are obtained from the eigendecomposition of $\mathbf{X}^T\mathbf{X}$ (equivalently, the singular value decomposition of $\mathbf{X}$), which involves the observations for the explanatory variables only. Therefore, the resulting PCR estimator obtained from using these principal components as covariates need not necessarily have satisfactory predictive performance for the outcome. A somewhat similar estimator that tries to address this issue through its very construction is the partial least squares (PLS) estimator. Similar to PCR, PLS also uses derived covariates of lower dimensions. However, unlike PCR, the derived covariates for PLS are obtained based on using both the outcome as well as the covariates. While PCR seeks the high-variance directions in the space of the covariates, PLS seeks the directions in the covariate space that are most useful for the prediction of the outcome.

In 2006, a variant of the classical PCR known as supervised PCR was proposed.[6] In a spirit similar to that of PLS, it attempts to obtain derived covariates of lower dimensions based on a criterion that involves both the outcome as well as the covariates. The method starts by performing a set of $p$ simple linear regressions (or univariate regressions) wherein the outcome vector is regressed separately on each of the $p$ covariates taken one at a time. Then, for some $m \in \{1,\ldots,p\}$, the first $m$ covariates that turn out to be the most correlated with the outcome (based on the degree of significance of the corresponding estimated regression coefficients) are selected for further use. A conventional PCR, as described earlier, is then performed, but now it is based on only the $n\times m$ data matrix corresponding to the observations for the selected covariates. The number of covariates used, $m \in \{1,\ldots,p\}$, and the subsequent number of principal components used, $k \in \{1,\ldots,m\}$, are usually selected by cross-validation.
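A sketch of supervised PCR along these lines; the screening statistic below (absolute univariate coefficient standardized by the column norm) is a simple stand-in for the significance-based ranking, and m and k would normally be chosen by cross-validation rather than fixed:

```python
import numpy as np

def supervised_pcr(X, Y, m, k):
    """Screen the m most outcome-associated covariates, then run ordinary PCR."""
    scores = np.abs(X.T @ Y) / np.linalg.norm(X, axis=0)   # univariate association of Y with each centered column
    selected = np.argsort(scores)[::-1][:m]                # the m most associated covariates
    beta_sub = pcr_estimator(X[:, selected], Y, k)         # conventional PCR on the n x m submatrix
    beta = np.zeros(X.shape[1])
    beta[selected] = beta_sub                              # embed back into the full covariate space
    return beta
```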

Generalization to kernel settings

The classical PCR method as described above is based on classical PCA and considers a linear regression model for predicting the outcome based on the covariates. However, it can be easily generalized to a kernel machine setting whereby the regression function need not necessarily be linear in the covariates, but instead it can belong to the Reproducing Kernel Hilbert Space associated with any arbitrary (possibly non-linear), symmetric positive-definite kernel. The linear regression model turns out to be a special case of this setting when the kernel function is chosen to be the linear kernel.

In general, under the kernel machine setting, the vector of covariates is first mapped into a high-dimensional (potentially infinite-dimensional) feature space characterized by the kernel function chosen. The mapping so obtained is known as the feature map and each of its coordinates, also known as the feature elements, corresponds to one feature (may be linear or non-linear) of the covariates. The regression function is then assumed to be a linear combination of these feature elements. Thus, the underlying regression model in the kernel machine setting is essentially a linear regression model with the understanding that instead of the original set of covariates, the predictors are now given by the vector (potentially infinite-dimensional) of feature elements obtained by transforming the actual covariates using the feature map.

However, the kernel trick actually enables us to operate in the feature space without ever explicitly computing the feature map. It turns out that it suffices to compute the pairwise inner products among the feature maps for the observed covariate vectors, and these inner products are simply given by the values of the kernel function evaluated at the corresponding pairs of covariate vectors. The pairwise inner products so obtained may therefore be represented in the form of an $n\times n$ symmetric non-negative definite matrix, also known as the kernel matrix.

PCR in the kernel machine setting can now be implemented by first appropriately centering this kernel matrix ($K$, say) with respect to the feature space and then performing a kernel PCA on the centered kernel matrix ($K'$, say), whereby an eigendecomposition of $K'$ is obtained. Kernel PCR then proceeds by (usually) selecting a subset of all the eigenvectors so obtained and then performing a standard linear regression of the outcome vector on these selected eigenvectors. The eigenvectors to be used for regression are usually selected using cross-validation. The estimated regression coefficients (having the same dimension as the number of selected eigenvectors) along with the corresponding selected eigenvectors are then used for predicting the outcome for a future observation. In machine learning, this technique is also known as spectral regression.
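A sketch of kernel PCR with an RBF kernel; the kernel, its bandwidth, and k are illustrative choices, and the function only returns in-sample fitted values (prediction at new points, which requires the dual coefficients, is omitted):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def kernel_pcr_fit(X, Y, k, gamma=1.0):
    """Regress Y on the leading eigenvectors of the centered kernel matrix."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma)                               # kernel matrix K
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                         # K', centered with respect to the feature space
    _, E = np.linalg.eigh(Kc)                              # eigenvectors, eigenvalues in ascending order
    E_k = E[:, ::-1][:, :k]                                # leading k eigenvectors
    alpha, *_ = np.linalg.lstsq(E_k, Y, rcond=None)        # linear regression of Y on the eigenvectors
    return E_k @ alpha                                     # fitted outcomes for the training sample
```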

Clearly, kernel PCR has a discrete shrinkage effect on the eigenvectors of $K'$, quite similar to the discrete shrinkage effect of classical PCR on the principal components, as discussed earlier. However, the feature map associated with the chosen kernel could potentially be infinite-dimensional, and hence the corresponding principal components and principal component directions could be infinite-dimensional as well. Therefore, these quantities are often practically intractable under the kernel machine setting. Kernel PCR essentially works around this problem by considering an equivalent dual formulation based on using the spectral decomposition of the associated kernel matrix. Under the linear regression model (which corresponds to choosing the kernel function as the linear kernel), this amounts to considering a spectral decomposition of the corresponding $n\times n$ kernel matrix $\mathbf{X}\mathbf{X}^T$ and then regressing the outcome vector on a selected subset of the eigenvectors of $\mathbf{X}\mathbf{X}^T$ so obtained. It can be easily shown that this is the same as regressing the outcome vector on the corresponding principal components (which are finite-dimensional in this case), as defined in the context of the classical PCR. Thus, for the linear kernel, the kernel PCR based on a dual formulation is exactly equivalent to the classical PCR based on a primal formulation. However, for arbitrary (and possibly non-linear) kernels, this primal formulation may become intractable owing to the infinite dimensionality of the associated feature map. Thus classical PCR becomes practically infeasible in that case, but kernel PCR based on the dual formulation still remains valid and computationally scalable.
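A numerical check of this primal–dual equivalence for the linear kernel, reusing the pcr_estimator sketch from earlier: regressing the outcome on the leading eigenvectors of $\mathbf{X}\mathbf{X}^T$ reproduces the classical PCR fitted values.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 60, 8, 4
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
Y = X @ rng.standard_normal(p) + rng.standard_normal(n); Y -= Y.mean()

fitted_primal = X @ pcr_estimator(X, Y, k)                # classical PCR fitted values

_, E = np.linalg.eigh(X @ X.T)                            # eigenvectors of the linear kernel matrix
E_k = E[:, ::-1][:, :k]                                   # leading k eigenvectors
alpha, *_ = np.linalg.lstsq(E_k, Y, rcond=None)
fitted_dual = E_k @ alpha                                 # dual (kernel) PCR fitted values

print(np.allclose(fitted_primal, fitted_dual))            # True: both project Y onto span(u_1, ..., u_k)
```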

References

1. Schmidli, Heinz (13 March 2013). Reduced Rank Regression: With Applications to Quantitative Structure-Activity Relationships. Springer. ISBN 978-3-642-50015-2.
2. Jolliffe, Ian T. (1982). "A Note on the Use of Principal Components in Regression". Journal of the Royal Statistical Society, Series C. 31 (3): 300–303. doi:10.2307/2348005. JSTOR 2348005.
3. Dodge, Y. (2003). The Oxford Dictionary of Statistical Terms. OUP. ISBN 0-19-920613-9.
4. Park, Sung H. (1981). "Collinearity and Optimal Restrictions on Regression Parameters for Estimating Responses". Technometrics. 23 (3): 289–295. doi:10.2307/1267793. JSTOR 1267793.
5. Frank, Ildiko E.; Friedman, Jerome H. (1993). "A Statistical View of Some Chemometrics Regression Tools". Technometrics. 35 (2): 109–135. doi:10.1080/00401706.1993.10485033.
6. Bair, Eric; Hastie, Trevor; Paul, Debashis; Tibshirani, Robert (2006). "Prediction by Supervised Principal Components". Journal of the American Statistical Association. 101 (473): 119–137. CiteSeerX 10.1.1.516.2313. doi:10.1198/016214505000000628.
