Probit model

From Wikipedia, the free encyclopedia
Statistical regression where the dependent variable can take only two values

In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit.[1] The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a form of binary classification.

A probit model is a popular specification for a binary response model. As such it treats the same set of problems as does logistic regression using similar techniques. When viewed in the generalized linear model framework, the probit model employs a probit link function.[2] It is most often estimated using the maximum likelihood procedure,[3] such an estimation being called a probit regression.

Conceptual framework


Suppose a response variable Y is binary, that is, it can have only two possible outcomes, which we will denote as 1 and 0. For example, Y may represent presence/absence of a certain condition, success/failure of some device, answer yes/no on a survey, etc. We also have a vector of regressors X, which are assumed to influence the outcome Y. Specifically, we assume that the model takes the form

$$P(Y=1\mid X)=\Phi(X^{\operatorname{T}}\beta),$$

where P is the probability and $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution. The parameters $\beta$ are typically estimated by maximum likelihood.

It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable

$$Y^{\ast}=X^{\operatorname{T}}\beta+\varepsilon,$$

where $\varepsilon \sim N(0,1)$. Then Y can be viewed as an indicator for whether this latent variable is positive:

$$Y={\begin{cases}1&Y^{\ast}>0\\0&{\text{otherwise}}\end{cases}}={\begin{cases}1&X^{\operatorname{T}}\beta+\varepsilon>0\\0&{\text{otherwise}}\end{cases}}$$

The use of the standard normal distribution causes no loss of generality compared with the use of a normal distribution with an arbitrary mean and standard deviation, because adding a fixed amount to the mean can be compensated by subtracting the same amount from the intercept, and multiplying the standard deviation by a fixed amount can be compensated by multiplying the weights by the same amount.

To see that the two models are equivalent, note that

$${\begin{aligned}P(Y=1\mid X)&=P(Y^{\ast}>0)\\&=P(X^{\operatorname{T}}\beta+\varepsilon>0)\\&=P(\varepsilon>-X^{\operatorname{T}}\beta)\\&=P(\varepsilon<X^{\operatorname{T}}\beta)&&{\text{by symmetry of the normal distribution}}\\&=\Phi(X^{\operatorname{T}}\beta)\end{aligned}}$$

Model estimation


Maximum likelihood estimation


Suppose the data set $\{y_{i},x_{i}\}_{i=1}^{n}$ contains n independent statistical units corresponding to the model above.

For a single observation, conditional on its vector of inputs, we have:

$$P(y_{i}=1\mid x_{i})=\Phi(x_{i}^{\operatorname{T}}\beta)$$
$$P(y_{i}=0\mid x_{i})=1-\Phi(x_{i}^{\operatorname{T}}\beta)$$

where $x_{i}$ is a $K\times 1$ vector of inputs, and $\beta$ is a $K\times 1$ vector of coefficients.

The likelihood of a single observation $(y_{i},x_{i})$ is then

$${\mathcal{L}}(\beta;y_{i},x_{i})=\Phi(x_{i}^{\operatorname{T}}\beta)^{y_{i}}\left[1-\Phi(x_{i}^{\operatorname{T}}\beta)\right]^{1-y_{i}}$$

In fact, if $y_{i}=1$, then ${\mathcal{L}}(\beta;y_{i},x_{i})=\Phi(x_{i}^{\operatorname{T}}\beta)$, and if $y_{i}=0$, then ${\mathcal{L}}(\beta;y_{i},x_{i})=1-\Phi(x_{i}^{\operatorname{T}}\beta)$.

Since the observations are independent and identically distributed, the likelihood of the entire sample, or the joint likelihood, equals the product of the likelihoods of the single observations:

$${\mathcal{L}}(\beta;Y,X)=\prod_{i=1}^{n}\Phi(x_{i}^{\operatorname{T}}\beta)^{y_{i}}\left[1-\Phi(x_{i}^{\operatorname{T}}\beta)\right]^{1-y_{i}}$$

The joint log-likelihood function is thus

$$\ln{\mathcal{L}}(\beta;Y,X)=\sum_{i=1}^{n}{\Bigl(}y_{i}\ln\Phi(x_{i}^{\operatorname{T}}\beta)+(1-y_{i})\ln{\bigl(}1-\Phi(x_{i}^{\operatorname{T}}\beta){\bigr)}{\Bigr)}$$

The estimator $\hat{\beta}$ which maximizes this function will be consistent, asymptotically normal and efficient provided that $\operatorname{E}[XX^{\operatorname{T}}]$ exists and is not singular. It can be shown that this log-likelihood function is globally concave in $\beta$, and therefore standard numerical algorithms for optimization will converge rapidly to the unique maximum.
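
As an illustration, the following is a minimal sketch in R that maximizes this log-likelihood numerically on simulated data; the data-generating step and all names are illustrative, and in practice the built-in glm with a probit link performs the same fit.

    set.seed(1)
    n <- 1000
    x <- cbind(1, rnorm(n))                         # design matrix with intercept
    beta_true <- c(0.5, -1.2)
    y <- as.numeric(x %*% beta_true + rnorm(n) > 0) # latent-variable formulation

    # negative joint log-likelihood of the probit model
    negloglik <- function(beta, y, x) {
      p <- pnorm(x %*% beta)
      -sum(y * log(p) + (1 - y) * log(1 - p))
    }

    fit <- optim(c(0, 0), negloglik, y = y, x = x, method = "BFGS")
    fit$par  # ML estimates; compare glm(y ~ x - 1, family = binomial(link = "probit"))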

The asymptotic distribution of $\hat{\beta}$ is given by

$${\sqrt{n}}({\hat{\beta}}-\beta)\ {\xrightarrow{d}}\ {\mathcal{N}}(0,\,\Omega^{-1}),$$

where

$$\Omega=\operatorname{E}\!\left[{\frac{\varphi^{2}(X^{\operatorname{T}}\beta)}{\Phi(X^{\operatorname{T}}\beta)\bigl(1-\Phi(X^{\operatorname{T}}\beta)\bigr)}}\,XX^{\operatorname{T}}\right],\qquad {\hat{\Omega}}={\frac{1}{n}}\sum_{i=1}^{n}{\frac{\varphi^{2}(x_{i}^{\operatorname{T}}{\hat{\beta}})}{\Phi(x_{i}^{\operatorname{T}}{\hat{\beta}})\bigl(1-\Phi(x_{i}^{\operatorname{T}}{\hat{\beta}})\bigr)}}\,x_{i}x_{i}^{\operatorname{T}},$$ [citation needed]

and $\varphi=\Phi'$ is the probability density function (PDF) of the standard normal distribution.
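
A sketch of how the plug-in estimate $\hat{\Omega}$ yields standard errors in R, assuming a design matrix X and an ML estimate beta_hat are already available (both names, and the function name, are illustrative):

    probit_se <- function(X, beta_hat) {
      n  <- nrow(X)
      xb <- drop(X %*% beta_hat)
      w  <- dnorm(xb)^2 / (pnorm(xb) * (1 - pnorm(xb)))  # phi^2 / (Phi * (1 - Phi))
      Omega_hat <- crossprod(X, w * X) / n               # the plug-in Omega-hat above
      sqrt(diag(solve(Omega_hat)) / n)                   # approximate standard errors
    }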

Semi-parametric and non-parametric maximum likelihood methods for probit-type and other related models are also available.[4]

Berkson's minimum chi-square method

Main article: Minimum chi-square estimation

This method can be applied only when there are many observations of the response variable $y_{i}$ having the same value of the vector of regressors $x_{i}$ (such a situation may be referred to as "many observations per cell"). More specifically, the model can be formulated as follows.

Suppose that among the n observations $\{y_{i},x_{i}\}_{i=1}^{n}$ there are only T distinct values of the regressors, which can be denoted $\{x_{(1)},\ldots,x_{(T)}\}$. Let $n_{t}$ be the number of observations with $x_{i}=x_{(t)}$, and $r_{t}$ the number of such observations with $y_{i}=1$. We assume that there are indeed "many" observations per cell: for each $t$, $\lim_{n\to\infty}n_{t}/n=c_{t}>0$.

Denote

$${\hat{p}}_{t}=r_{t}/n_{t}$$
$${\hat{\sigma}}_{t}^{2}={\frac{1}{n_{t}}}\,{\frac{{\hat{p}}_{t}(1-{\hat{p}}_{t})}{\varphi^{2}{\bigl(}\Phi^{-1}({\hat{p}}_{t}){\bigr)}}}$$

Then Berkson's minimum chi-square estimator is a generalized least squares estimator in a regression of $\Phi^{-1}({\hat{p}}_{t})$ on $x_{(t)}$ with weights ${\hat{\sigma}}_{t}^{-2}$:

$${\hat{\beta}}=\left(\sum_{t=1}^{T}{\hat{\sigma}}_{t}^{-2}x_{(t)}x_{(t)}^{\operatorname{T}}\right)^{-1}\sum_{t=1}^{T}{\hat{\sigma}}_{t}^{-2}x_{(t)}\Phi^{-1}({\hat{p}}_{t})$$

It can be shown that this estimator is consistent (as n → ∞ and T fixed), asymptotically normal and efficient.[citation needed] Its advantage is the presence of a closed-form formula for the estimator. However, it is only meaningful to carry out this analysis when individual observations are not available, only their aggregated counts $r_{t}$, $n_{t}$, and $x_{(t)}$ (for example, in the analysis of voting behavior).
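
A minimal sketch of this estimator in R, assuming the aggregated counts r and n and the T distinct regressor rows (as a matrix X) are given; the function name is illustrative, and cells with $r_{t}$ equal to 0 or $n_{t}$ must be excluded beforehand:

    berkson_probit <- function(r, n, X) {
      p_hat <- r / n                                   # cell-level empirical probabilities
      z     <- qnorm(p_hat)                            # Phi^{-1}(p_hat); needs 0 < p_hat < 1
      w     <- n * dnorm(z)^2 / (p_hat * (1 - p_hat))  # weights 1 / sigma_t^2
      solve(crossprod(X, w * X), crossprod(X, w * z))  # weighted least squares
    }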

Albert and Chib Gibbs sampling method


Gibbs sampling of a probit model is possible with the introduction of normally distributed latent variables z, which are observed as 1 if positive and 0 otherwise. This approach was introduced in Albert and Chib (1993),[5] which demonstrated how Gibbs sampling could be applied to binary and polychotomous response models within a Bayesian framework. Under a multivariate normal prior distribution over the weights, the model can be described as

$${\begin{aligned}{\boldsymbol{\beta}}&\sim{\mathcal{N}}(\mathbf{b}_{0},\mathbf{B}_{0})\\z_{i}\mid\mathbf{x}_{i},{\boldsymbol{\beta}}&\sim{\mathcal{N}}(\mathbf{x}_{i}^{\operatorname{T}}{\boldsymbol{\beta}},1)\\y_{i}&={\begin{cases}1&{\text{if }}z_{i}>0\\0&{\text{otherwise}}\end{cases}}\end{aligned}}$$

From this, Albert and Chib (1993)[5] derive the following full conditional distributions in the Gibbs sampling algorithm:

$${\begin{aligned}\mathbf{B}&=(\mathbf{B}_{0}^{-1}+\mathbf{X}^{\operatorname{T}}\mathbf{X})^{-1}\\{\boldsymbol{\beta}}\mid\mathbf{z}&\sim{\mathcal{N}}{\bigl(}\mathbf{B}(\mathbf{B}_{0}^{-1}\mathbf{b}_{0}+\mathbf{X}^{\operatorname{T}}\mathbf{z}),\,\mathbf{B}{\bigr)}\\z_{i}\mid y_{i}=0,\mathbf{x}_{i},{\boldsymbol{\beta}}&\sim{\mathcal{N}}(\mathbf{x}_{i}^{\operatorname{T}}{\boldsymbol{\beta}},1)\,[z_{i}\leq 0]\\z_{i}\mid y_{i}=1,\mathbf{x}_{i},{\boldsymbol{\beta}}&\sim{\mathcal{N}}(\mathbf{x}_{i}^{\operatorname{T}}{\boldsymbol{\beta}},1)\,[z_{i}>0]\end{aligned}}$$

The result for ${\boldsymbol{\beta}}$ is given in the article on Bayesian linear regression, although specified with different notation, while the conditional posterior distributions of the latent variables follow a truncated normal distribution within the given ranges. The notation $[z_{i}\leq 0]$ is the Iverson bracket, sometimes written ${\mathcal{I}}(z_{i}\leq 0)$ or similar. Thus, knowledge of the observed outcomes serves to restrict the support of the latent variables.

Sampling of the weights ${\boldsymbol{\beta}}$ given the latent vector $\mathbf{z}$ from the multinormal distribution is standard. For sampling the latent variables from the truncated normal posterior distributions, one can take advantage of the inverse-CDF method, implemented in the following vectorized R function:

    zbinprobit <- function(y, X, beta, n) {
      meanv <- X %*% beta
      u  <- runif(n)                          # uniform(0,1) random variates
      cd <- pnorm(-meanv)                     # cumulative normal CDF
      # inverse-CDF draw: pu = u*cd when y = 0, and cd + u*(1 - cd) when y = 1
      pu <- (u * cd) * (1 - 2 * y) + (u + cd) * y
      cpui <- qnorm(pu)                       # inverse normal CDF
      z <- meanv + cpui                       # latent vector
      return(z)
    }
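
Around this function, the full sampler is then a short loop. The following is a hedged sketch of such a loop; the names and the Cholesky-based multinormal draw are illustrative, not taken from Albert and Chib's original code.

    gibbs_probit <- function(y, X, b0, B0, iters = 2000) {
      n <- nrow(X); K <- ncol(X)
      B0inv <- solve(B0)
      B <- solve(B0inv + crossprod(X))       # posterior covariance, constant across iterations
      L <- t(chol(B))                        # Cholesky factor for multinormal draws
      beta <- rep(0, K)
      draws <- matrix(NA_real_, iters, K)
      for (s in 1:iters) {
        z <- zbinprobit(y, X, beta, n)       # latent draws from truncated normals
        m <- B %*% (B0inv %*% b0 + crossprod(X, z))
        beta <- drop(m + L %*% rnorm(K))     # beta | z ~ N(m, B)
        draws[s, ] <- beta
      }
      draws                                  # posterior draws; burn-in left to the user
    }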

Model evaluation


The suitability of an estimated binary model can be evaluated by counting, among the observations whose true value is 1 and among those whose true value is 0, how many the model classifies correctly, where an estimated probability above 1/2 is treated as a prediction of 1 and one below 1/2 as a prediction of 0. See Logistic regression § Model for details.
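
For instance, a minimal sketch of this evaluation in R (the simulated data and names are illustrative); the diagonal of the resulting table counts the correct classifications:

    set.seed(3)
    n <- 1000
    x <- rnorm(n)
    y <- as.numeric(0.3 + 0.9 * x + rnorm(n) > 0)  # simulated binary outcomes
    fit    <- glm(y ~ x, family = binomial(link = "probit"))
    p_hat  <- predict(fit, type = "response")      # estimated probabilities
    y_pred <- as.numeric(p_hat > 0.5)              # threshold at 1/2
    table(actual = y, predicted = y_pred)          # correct counts on the diagonal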

Performance under misspecification


Consider the latent variable model formulation of the probit model. When the variance of $\varepsilon$ conditional on $x$ is not constant but depends on $x$, a heteroskedasticity issue arises. For example, suppose $y^{*}=\beta_{0}+\beta_{1}x_{1}+\varepsilon$ and $\varepsilon\mid x\sim N(0,x_{1}^{2})$, where $x_{1}$ is a continuous positive explanatory variable. Under heteroskedasticity, the probit estimator for $\beta$ is usually inconsistent, and most of the tests about the coefficients are invalid. More importantly, the estimator for $P(y=1\mid x)$ becomes inconsistent, too. To deal with this problem, the original model needs to be transformed to be homoskedastic. For instance, in the same example, $1[\beta_{0}+\beta_{1}x_{1}+\varepsilon>0]$ can be rewritten as $1[\beta_{0}/x_{1}+\beta_{1}+\varepsilon/x_{1}>0]$, where $\varepsilon/x_{1}\mid x\sim N(0,1)$. Therefore, $P(y=1\mid x)=\Phi(\beta_{1}+\beta_{0}/x_{1})$, and running probit on $(1,1/x_{1})$ generates a consistent estimator of the conditional probability $P(y=1\mid x)$.
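
A simulated illustration of this transformation in R (all values and names are illustrative): the naive probit on $x_{1}$ is misspecified, while probit on $(1,1/x_{1})$ recovers $\beta_{1}$ as the intercept and $\beta_{0}$ as the slope on $1/x_{1}$.

    set.seed(42)
    n  <- 5000
    x1 <- runif(n, 0.5, 3)                                  # continuous positive regressor
    y  <- as.numeric(1 - 0.7 * x1 + rnorm(n, sd = x1) > 0)  # beta0 = 1, beta1 = -0.7, Var(eps|x) = x1^2
    naive <- glm(y ~ x1,      family = binomial(link = "probit"))  # inconsistent here
    fixed <- glm(y ~ I(1/x1), family = binomial(link = "probit"))  # probit on (1, 1/x1)
    coef(fixed)  # intercept estimates beta1 = -0.7; slope on 1/x1 estimates beta0 = 1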

When the assumption that $\varepsilon$ is normally distributed fails to hold, a functional form misspecification issue arises: if the model is still estimated as a probit model, the estimators of the coefficients $\beta$ are inconsistent. For instance, if $\varepsilon$ follows a logistic distribution in the true model but the model is estimated by probit, the estimates will generally be smaller than the true values. However, the inconsistency of the coefficient estimates is practically irrelevant, because the estimates of the partial effects, $\partial P(y=1\mid x)/\partial x_{i'}$, will be close to the estimates given by the true logit model.[6]
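
The following hedged sketch in R illustrates this point on simulated data with logistic errors: the probit coefficient estimates are scaled down relative to the true logit values, yet the average partial effects from the two fits nearly coincide.

    set.seed(7)
    n <- 10000
    x <- rnorm(n)
    y <- as.numeric(1 + 2 * x + rlogis(n) > 0)     # true model has logistic errors
    probit_fit <- glm(y ~ x, family = binomial(link = "probit"))
    logit_fit  <- glm(y ~ x, family = binomial(link = "logit"))
    coef(probit_fit)                               # scaled down relative to c(1, 2)
    mean(dnorm(predict(probit_fit)) * coef(probit_fit)["x"])  # average partial effect, probit
    mean(dlogis(predict(logit_fit)) * coef(logit_fit)["x"])   # average partial effect, logit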

To avoid the issue of distribution misspecification, one may adopt a general distributional assumption for the error term, such that many different types of distribution can be included in the model. The cost is heavier computation and lower accuracy as the number of parameters increases.[7] In most practical cases where the distributional form is misspecified, the estimators of the coefficients are inconsistent, but estimators of the conditional probability and the partial effects remain very good.[citation needed]

One can also take semi-parametric or non-parametric approaches, e.g., via local-likelihood or nonparametric quasi-likelihood methods, which avoid assumptions on a parametric form for the index function and are robust to the choice of the link function (e.g., probit or logit).[4]

History


The probit model is usually credited to Chester Bliss, who coined the term "probit" in 1934,[8] and to John Gaddum (1933), who systematized earlier work.[9] However, the basic model dates to the Weber–Fechner law by Gustav Fechner, published in Fechner (1860), and was repeatedly rediscovered until the 1930s; see Finney (1971, Chapter 3.6) and Aitchison & Brown (1957, Chapter 1.2).[9]

A fast method for computing maximum likelihood estimates for the probit model was proposed by Ronald Fisher as an appendix to Bliss's work in 1935.[10]


References

  1. ^ Oxford English Dictionary, 3rd ed., s.v. probit (article dated June 2007), citing Bliss, C. I. (1934). "The Method of Probits". Science. 79 (2037): 38–39. Bibcode:1934Sci....79...38B. doi:10.1126/science.79.2037.38. PMID 17813446: "These arbitrary probability units have been called 'probits'."
  2. ^ Agresti, Alan (2015). Foundations of Linear and Generalized Linear Models. New York: Wiley. pp. 183–186. ISBN 978-1-118-73003-4.
  3. ^ Aldrich, John H.; Nelson, Forrest D.; Adler, E. Scott (1984). Linear Probability, Logit, and Probit Models. Sage. pp. 48–65. ISBN 0-8039-2133-0.
  4. ^ a b Park, Byeong U.; Simar, Léopold; Zelenyuk, Valentin (2017). "Nonparametric estimation of dynamic discrete choice models for time series data" (PDF). Computational Statistics & Data Analysis. 108: 97–120. doi:10.1016/j.csda.2016.10.024.
  5. ^ a b Albert, J.; Chib, S. (1993). "Bayesian Analysis of Binary and Polychotomous Response Data". Journal of the American Statistical Association. 88 (422): 669–679.
  6. ^ Greene, W. H. (2003). Econometric Analysis. Upper Saddle River, NJ: Prentice Hall.
  7. ^ For more details, refer to: Cappé, O.; Moulines, E.; Rydén, T. (2005). Inference in Hidden Markov Models. New York: Springer-Verlag, Chapter 2.
  8. ^ Bliss, C. I. (1934). "The Method of Probits". Science. 79 (2037): 38–39. Bibcode:1934Sci....79...38B. doi:10.1126/science.79.2037.38. PMID 17813446.
  9. ^ a b Cramer 2002, p. 7.
  10. ^ Fisher, R. A. (1935). "The Case of Zero Survivors in Probit Assays". Annals of Applied Biology. 22: 164–165. doi:10.1111/j.1744-7348.1935.tb07713.x. Archived from the original on 2014-04-30.
