Bayesian multivariate linear regression

From Wikipedia, the free encyclopedia

In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random variable. A more general treatment of this approach can be found in the article MMSE estimator.

Details


Consider a regression problem where the dependent variable to be predicted is not a single real-valued scalar but an m-length vector of correlated real numbers. As in the standard regression setup, there are n observations, where each observation i consists of k−1 explanatory variables, grouped into a vector $\mathbf{x}_i$ of length k (where a dummy variable with a value of 1 has been added to allow for an intercept coefficient). This can be viewed as a set of m related regression problems for each observation i:

$$\begin{aligned} y_{i,1} &= \mathbf{x}_i^\mathsf{T}\boldsymbol{\beta}_1 + \epsilon_{i,1} \\ &\;\;\vdots \\ y_{i,m} &= \mathbf{x}_i^\mathsf{T}\boldsymbol{\beta}_m + \epsilon_{i,m} \end{aligned}$$

where the set of errors $\{\epsilon_{i,1},\ldots,\epsilon_{i,m}\}$ are all correlated. Equivalently, it can be viewed as a single regression problem where the outcome is a row vector $\mathbf{y}_i^\mathsf{T}$ and the regression coefficient vectors are stacked next to each other, as follows:

$$\mathbf{y}_i^\mathsf{T} = \mathbf{x}_i^\mathsf{T}\mathbf{B} + \boldsymbol{\epsilon}_i^\mathsf{T}.$$

The coefficient matrix B is a $k \times m$ matrix where the coefficient vectors $\boldsymbol{\beta}_1,\ldots,\boldsymbol{\beta}_m$ for each regression problem are stacked horizontally:

$$\mathbf{B} = \begin{bmatrix} \boldsymbol{\beta}_1 & \cdots & \boldsymbol{\beta}_m \end{bmatrix} = \begin{bmatrix} \beta_{1,1} & \cdots & \beta_{1,m} \\ \vdots & \ddots & \vdots \\ \beta_{k,1} & \cdots & \beta_{k,m} \end{bmatrix}.$$

The noise vector $\boldsymbol{\epsilon}_i$ for each observation i is jointly normal, so that the outcomes for a given observation are correlated:

$$\boldsymbol{\epsilon}_i \sim N(0, \boldsymbol{\Sigma}_\epsilon).$$

We can write the entire regression problem in matrix form as

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E},$$

where Y and E are $n \times m$ matrices. The design matrix X is an $n \times k$ matrix with the observations stacked vertically, as in the standard linear regression setup:

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^\mathsf{T} \\ \mathbf{x}_2^\mathsf{T} \\ \vdots \\ \mathbf{x}_n^\mathsf{T} \end{bmatrix} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,k} \\ x_{2,1} & \cdots & x_{2,k} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,k} \end{bmatrix}.$$

The classical, frequentist linear least squares solution is to simply estimate the matrix of regression coefficients $\hat{\mathbf{B}}$ using the Moore–Penrose pseudoinverse:

$$\hat{\mathbf{B}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{Y}.$$
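As a concrete illustration (not part of the original article), the following NumPy sketch simulates data from the model Y = XB + E above and computes this least-squares estimate. All dimensions, hyperparameters, and variable names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, k, m = 200, 3, 2                      # observations, predictors (incl. intercept), outcomes

# Design matrix: a leading column of ones plays the role of the dummy intercept variable.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

B_true = rng.normal(size=(k, m))         # true k x m coefficient matrix
Sigma_eps = np.array([[1.0, 0.6],        # m x m error covariance (correlated outcomes)
                      [0.6, 2.0]])
E = rng.multivariate_normal(np.zeros(m), Sigma_eps, size=n)   # n x m error matrix
Y = X @ B_true + E                       # n x m response matrix

# Least-squares estimate B_hat = (X^T X)^{-1} X^T Y, computed via a linear solve
# rather than an explicit matrix inverse for numerical stability.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(B_hat)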

To obtain the Bayesian solution, we need to specify the conditional likelihood and then find the appropriate conjugate prior. As with the univariate case of linear Bayesian regression, we will find that we can specify a natural conditional conjugate prior (which is scale dependent).

Let us write the conditional likelihood as[1]

$$\rho(\mathbf{E} \mid \boldsymbol{\Sigma}_\epsilon) \propto |\boldsymbol{\Sigma}_\epsilon|^{-n/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\mathbf{E}^\mathsf{T}\mathbf{E}\,\boldsymbol{\Sigma}_\epsilon^{-1}\right)\right);$$

writing the error $\mathbf{E}$ in terms of $\mathbf{Y}$, $\mathbf{X}$, and $\mathbf{B}$ yields

$$\rho(\mathbf{Y} \mid \mathbf{X}, \mathbf{B}, \boldsymbol{\Sigma}_\epsilon) \propto |\boldsymbol{\Sigma}_\epsilon|^{-n/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{Y}-\mathbf{X}\mathbf{B})^\mathsf{T}(\mathbf{Y}-\mathbf{X}\mathbf{B})\,\boldsymbol{\Sigma}_\epsilon^{-1}\right)\right).$$

We seek a natural conjugate prior, that is, a joint density $\rho(\mathbf{B}, \boldsymbol{\Sigma}_\epsilon)$ of the same functional form as the likelihood. Since the likelihood is quadratic in $\mathbf{B}$, we re-write the likelihood so it is normal in $(\mathbf{B} - \hat{\mathbf{B}})$, the deviation from the classical sample estimate.

Using the same technique as with Bayesian linear regression, we decompose the exponential term using a matrix form of the sum-of-squares technique. Here, however, we will also need results from matrix differential calculus (the Kronecker product and vectorization transformations).

First, we apply the sum-of-squares decomposition to obtain a new expression for the likelihood:

$$\rho(\mathbf{Y} \mid \mathbf{X}, \mathbf{B}, \boldsymbol{\Sigma}_\epsilon) \propto |\boldsymbol{\Sigma}_\epsilon|^{-(n-k)/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\mathbf{S}^\mathsf{T}\mathbf{S}\,\boldsymbol{\Sigma}_\epsilon^{-1}\right)\right) |\boldsymbol{\Sigma}_\epsilon|^{-k/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B}-\hat{\mathbf{B}})^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol{\Sigma}_\epsilon^{-1}\right)\right),$$

where $\mathbf{S} = \mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}$.
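This decomposition can be checked numerically: the exponent of the original likelihood equals the sum of the two trace terms above, because the cross terms vanish at the least-squares estimate. A minimal sketch with arbitrary illustrative matrices (an assumption for demonstration only, not the article's data):

import numpy as np

rng = np.random.default_rng(1)
n, k, m = 50, 3, 2
X = rng.normal(size=(n, k))
Y = rng.normal(size=(n, m))
B = rng.normal(size=(k, m))                          # an arbitrary coefficient matrix
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.3],
                                    [0.3, 1.5]]))    # stands in for Sigma_eps^{-1}

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)            # least-squares estimate
S = Y - X @ B_hat                                    # residuals at the least-squares fit

lhs = np.trace((Y - X @ B).T @ (Y - X @ B) @ Sigma_inv)
rhs = (np.trace(S.T @ S @ Sigma_inv)
       + np.trace((B - B_hat).T @ X.T @ X @ (B - B_hat) @ Sigma_inv))
print(np.isclose(lhs, rhs))                          # True: the two exponents agree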

We would like to develop a conditional form for the priors:

$$\rho(\mathbf{B}, \boldsymbol{\Sigma}_\epsilon) = \rho(\boldsymbol{\Sigma}_\epsilon)\,\rho(\mathbf{B} \mid \boldsymbol{\Sigma}_\epsilon),$$

where $\rho(\boldsymbol{\Sigma}_\epsilon)$ is an inverse-Wishart distribution and $\rho(\mathbf{B} \mid \boldsymbol{\Sigma}_\epsilon)$ is some form of normal distribution in the matrix $\mathbf{B}$. This is accomplished using the vectorization transformation, which converts the likelihood from a function of the matrices $\mathbf{B}, \hat{\mathbf{B}}$ to a function of the vectors $\boldsymbol{\beta} = \operatorname{vec}(\mathbf{B})$, $\hat{\boldsymbol{\beta}} = \operatorname{vec}(\hat{\mathbf{B}})$.

Write

$$\operatorname{tr}\left((\mathbf{B}-\hat{\mathbf{B}})^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol{\Sigma}_\epsilon^{-1}\right) = \operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}})^\mathsf{T}\operatorname{vec}\left(\mathbf{X}^\mathsf{T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol{\Sigma}_\epsilon^{-1}\right).$$

Let

$$\operatorname{vec}\left(\mathbf{X}^\mathsf{T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol{\Sigma}_\epsilon^{-1}\right) = (\boldsymbol{\Sigma}_\epsilon^{-1} \otimes \mathbf{X}^\mathsf{T}\mathbf{X})\operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}}),$$

where $\mathbf{A} \otimes \mathbf{B}$ denotes the Kronecker product of matrices $\mathbf{A}$ and $\mathbf{B}$, a generalization of the outer product which multiplies an $m \times n$ matrix by a $p \times q$ matrix to generate an $mp \times nq$ matrix, consisting of every combination of products of elements from the two matrices.
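This is the standard vectorization identity $\operatorname{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^\mathsf{T} \otimes \mathbf{A})\operatorname{vec}(\mathbf{B})$, applied with $\mathbf{A} = \mathbf{X}^\mathsf{T}\mathbf{X}$ and $\mathbf{C} = \boldsymbol{\Sigma}_\epsilon^{-1}$ (which is symmetric). A small numerical check is sketched below; the matrices are arbitrary illustrative stand-ins, and $\operatorname{vec}$ stacks columns, i.e. column-major (order='F') flattening in NumPy:

import numpy as np

rng = np.random.default_rng(2)
k, m = 3, 2
XtX = rng.normal(size=(k, k))
XtX = XtX @ XtX.T                                    # symmetric k x k stand-in for X^T X
D = rng.normal(size=(k, m))                          # stand-in for B - B_hat
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.4],
                                    [0.4, 2.0]]))    # symmetric m x m stand-in for Sigma_eps^{-1}

vec = lambda M: M.flatten(order="F")                 # column-stacking vectorization

lhs = vec(XtX @ D @ Sigma_inv)
rhs = np.kron(Sigma_inv, XtX) @ vec(D)
print(np.allclose(lhs, rhs))                         # True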

Then

$$\begin{aligned} &\operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}})^\mathsf{T}(\boldsymbol{\Sigma}_\epsilon^{-1} \otimes \mathbf{X}^\mathsf{T}\mathbf{X})\operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}}) \\ ={}& (\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})^\mathsf{T}(\boldsymbol{\Sigma}_\epsilon^{-1} \otimes \mathbf{X}^\mathsf{T}\mathbf{X})(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}), \end{aligned}$$

which will lead to a likelihood which is normal in $(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})$.

With the likelihood in a more tractable form, we can now find a natural (conditional) conjugate prior.

Conjugate prior distribution


The natural conjugate prior using the vectorized variable $\boldsymbol{\beta}$ is of the form:[1]

$$\rho(\boldsymbol{\beta}, \boldsymbol{\Sigma}_\epsilon) = \rho(\boldsymbol{\Sigma}_\epsilon)\,\rho(\boldsymbol{\beta} \mid \boldsymbol{\Sigma}_\epsilon),$$

where

$$\rho(\boldsymbol{\Sigma}_\epsilon) \sim \mathcal{W}^{-1}(\mathbf{V}_0, \nu_0)$$

and

$$\rho(\boldsymbol{\beta} \mid \boldsymbol{\Sigma}_\epsilon) \sim N(\boldsymbol{\beta}_0, \boldsymbol{\Sigma}_\epsilon \otimes \boldsymbol{\Lambda}_0^{-1}).$$
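For intuition, $\boldsymbol{\beta} \mid \boldsymbol{\Sigma}_\epsilon \sim N(\boldsymbol{\beta}_0, \boldsymbol{\Sigma}_\epsilon \otimes \boldsymbol{\Lambda}_0^{-1})$ is equivalent to giving the matrix $\mathbf{B}$ a matrix normal distribution $\mathcal{MN}_{k,m}(\mathbf{B}_0, \boldsymbol{\Lambda}_0^{-1}, \boldsymbol{\Sigma}_\epsilon)$, matching the form that reappears in the posterior below. A minimal SciPy sketch of drawing from this prior, with purely illustrative hyperparameter values that are not taken from the article:

import numpy as np
from scipy.stats import invwishart, matrix_normal

rng = np.random.default_rng(3)
k, m = 3, 2

# Illustrative hyperparameter values (assumptions, not from the article).
B0 = np.zeros((k, m))          # prior mean of B (vec(B0) = beta_0)
Lambda0 = 0.1 * np.eye(k)      # prior precision on the rows of B
V0 = np.eye(m)                 # inverse-Wishart scale matrix
nu0 = m + 2                    # inverse-Wishart degrees of freedom

# One joint draw from rho(B, Sigma_eps) = rho(Sigma_eps) rho(B | Sigma_eps).
Sigma_eps = invwishart(df=nu0, scale=V0).rvs(random_state=rng)
B = matrix_normal(mean=B0, rowcov=np.linalg.inv(Lambda0),
                  colcov=Sigma_eps).rvs(random_state=rng)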

Posterior distribution


Using the above prior and likelihood, the posterior distribution can be expressed as:[1]

$$\begin{aligned} \rho(\boldsymbol{\beta}, \boldsymbol{\Sigma}_\epsilon \mid \mathbf{Y}, \mathbf{X}) \propto{} & |\boldsymbol{\Sigma}_\epsilon|^{-(\nu_0+m+1)/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}(\mathbf{V}_0\boldsymbol{\Sigma}_\epsilon^{-1})\right) \\ & \times |\boldsymbol{\Sigma}_\epsilon|^{-k/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B}-\mathbf{B}_0)^\mathsf{T}\boldsymbol{\Lambda}_0(\mathbf{B}-\mathbf{B}_0)\boldsymbol{\Sigma}_\epsilon^{-1}\right)\right) \\ & \times |\boldsymbol{\Sigma}_\epsilon|^{-n/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{Y}-\mathbf{X}\mathbf{B})^\mathsf{T}(\mathbf{Y}-\mathbf{X}\mathbf{B})\boldsymbol{\Sigma}_\epsilon^{-1}\right)\right), \end{aligned}$$

where $\operatorname{vec}(\mathbf{B}_0) = \boldsymbol{\beta}_0$. The terms involving $\mathbf{B}$ can be grouped (with $\boldsymbol{\Lambda}_0 = \mathbf{U}^\mathsf{T}\mathbf{U}$) using

$$\begin{aligned} &(\mathbf{B}-\mathbf{B}_0)^\mathsf{T}\boldsymbol{\Lambda}_0(\mathbf{B}-\mathbf{B}_0) + (\mathbf{Y}-\mathbf{X}\mathbf{B})^\mathsf{T}(\mathbf{Y}-\mathbf{X}\mathbf{B}) \\ ={}& \left(\begin{bmatrix}\mathbf{Y}\\\mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\\mathbf{U}\end{bmatrix}\mathbf{B}\right)^\mathsf{T} \left(\begin{bmatrix}\mathbf{Y}\\\mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\\mathbf{U}\end{bmatrix}\mathbf{B}\right) \\ ={}& \left(\begin{bmatrix}\mathbf{Y}\\\mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\\mathbf{U}\end{bmatrix}\mathbf{B}_n\right)^\mathsf{T} \left(\begin{bmatrix}\mathbf{Y}\\\mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\\mathbf{U}\end{bmatrix}\mathbf{B}_n\right) + (\mathbf{B}-\mathbf{B}_n)^\mathsf{T}(\mathbf{X}^\mathsf{T}\mathbf{X}+\boldsymbol{\Lambda}_0)(\mathbf{B}-\mathbf{B}_n) \\ ={}& (\mathbf{Y}-\mathbf{X}\mathbf{B}_n)^\mathsf{T}(\mathbf{Y}-\mathbf{X}\mathbf{B}_n) + (\mathbf{B}_0-\mathbf{B}_n)^\mathsf{T}\boldsymbol{\Lambda}_0(\mathbf{B}_0-\mathbf{B}_n) + (\mathbf{B}-\mathbf{B}_n)^\mathsf{T}(\mathbf{X}^\mathsf{T}\mathbf{X}+\boldsymbol{\Lambda}_0)(\mathbf{B}-\mathbf{B}_n), \end{aligned}$$

with

$$\mathbf{B}_n = (\mathbf{X}^\mathsf{T}\mathbf{X}+\boldsymbol{\Lambda}_0)^{-1}(\mathbf{X}^\mathsf{T}\mathbf{X}\hat{\mathbf{B}}+\boldsymbol{\Lambda}_0\mathbf{B}_0) = (\mathbf{X}^\mathsf{T}\mathbf{X}+\boldsymbol{\Lambda}_0)^{-1}(\mathbf{X}^\mathsf{T}\mathbf{Y}+\boldsymbol{\Lambda}_0\mathbf{B}_0).$$

This now allows us to write the posterior in a more useful form:

$$\begin{aligned} \rho(\boldsymbol{\beta}, \boldsymbol{\Sigma}_\epsilon \mid \mathbf{Y}, \mathbf{X}) \propto{} & |\boldsymbol{\Sigma}_\epsilon|^{-(\nu_0+m+n+1)/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\left(\mathbf{V}_0 + (\mathbf{Y}-\mathbf{X}\mathbf{B}_n)^\mathsf{T}(\mathbf{Y}-\mathbf{X}\mathbf{B}_n) + (\mathbf{B}_n-\mathbf{B}_0)^\mathsf{T}\boldsymbol{\Lambda}_0(\mathbf{B}_n-\mathbf{B}_0)\right)\boldsymbol{\Sigma}_\epsilon^{-1}\right)\right) \\ & \times |\boldsymbol{\Sigma}_\epsilon|^{-k/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B}-\mathbf{B}_n)^\mathsf{T}(\mathbf{X}^\mathsf{T}\mathbf{X}+\boldsymbol{\Lambda}_0)(\mathbf{B}-\mathbf{B}_n)\boldsymbol{\Sigma}_\epsilon^{-1}\right)\right). \end{aligned}$$

This takes the form of an inverse-Wishart distribution times a matrix normal distribution:

$$\rho(\boldsymbol{\Sigma}_\epsilon \mid \mathbf{Y}, \mathbf{X}) \sim \mathcal{W}^{-1}(\mathbf{V}_n, \nu_n)$$

and

$$\rho(\mathbf{B} \mid \mathbf{Y}, \mathbf{X}, \boldsymbol{\Sigma}_\epsilon) \sim \mathcal{MN}_{k,m}(\mathbf{B}_n, \boldsymbol{\Lambda}_n^{-1}, \boldsymbol{\Sigma}_\epsilon).$$

The parameters of this posterior are given by:

$$\mathbf{V}_n = \mathbf{V}_0 + (\mathbf{Y}-\mathbf{X}\mathbf{B}_n)^\mathsf{T}(\mathbf{Y}-\mathbf{X}\mathbf{B}_n) + (\mathbf{B}_n-\mathbf{B}_0)^\mathsf{T}\boldsymbol{\Lambda}_0(\mathbf{B}_n-\mathbf{B}_0)$$

$$\nu_n = \nu_0 + n$$

$$\mathbf{B}_n = (\mathbf{X}^\mathsf{T}\mathbf{X}+\boldsymbol{\Lambda}_0)^{-1}(\mathbf{X}^\mathsf{T}\mathbf{Y}+\boldsymbol{\Lambda}_0\mathbf{B}_0)$$

$$\boldsymbol{\Lambda}_n = \mathbf{X}^\mathsf{T}\mathbf{X}+\boldsymbol{\Lambda}_0$$
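Putting these updates together, here is a minimal sketch (an illustration, not a canonical implementation) that computes the posterior parameters and draws joint samples of $(\boldsymbol{\Sigma}_\epsilon, \mathbf{B})$ from the inverse-Wishart and matrix normal factors above, using SciPy's invwishart and matrix_normal distributions. The function names and any prior hyperparameter values passed to it are assumptions.

import numpy as np
from scipy.stats import invwishart, matrix_normal

def posterior_params(X, Y, B0, Lambda0, V0, nu0):
    """Return (V_n, nu_n, B_n, Lambda_n) for the conjugate multivariate regression model."""
    n = X.shape[0]
    Lambda_n = X.T @ X + Lambda0
    B_n = np.linalg.solve(Lambda_n, X.T @ Y + Lambda0 @ B0)
    resid = Y - X @ B_n
    V_n = V0 + resid.T @ resid + (B_n - B0).T @ Lambda0 @ (B_n - B0)
    nu_n = nu0 + n
    return V_n, nu_n, B_n, Lambda_n

def sample_posterior(V_n, nu_n, B_n, Lambda_n, n_draws=1000, seed=0):
    """Draw pairs: Sigma_eps ~ IW(V_n, nu_n), then B | Sigma_eps ~ MN(B_n, Lambda_n^{-1}, Sigma_eps)."""
    rng = np.random.default_rng(seed)
    rowcov = np.linalg.inv(Lambda_n)
    draws = []
    for _ in range(n_draws):
        Sigma_eps = invwishart(df=nu_n, scale=V_n).rvs(random_state=rng)
        B = matrix_normal(mean=B_n, rowcov=rowcov, colcov=Sigma_eps).rvs(random_state=rng)
        draws.append((Sigma_eps, B))
    return draws

With a weak illustrative prior such as Lambda0 = 0.1*np.eye(k), V0 = np.eye(m), nu0 = m + 2, and B0 = np.zeros((k, m)), the posterior mean B_n shrinks only slightly away from the least-squares estimate computed earlier.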


References

  1. Peter E. Rossi, Greg M. Allenby, Rob McCulloch. Bayesian Statistics and Marketing. John Wiley & Sons, 2012, p. 32.