Statistical model

From Wikipedia, the free encyclopedia

Type of mathematical model

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process.^[1] When referring specifically to probabilities, the corresponding term is probabilistic model. All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference. A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).^[2]

Introduction

Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.

The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6) coming up is 1/6. From that assumption, we can calculate the probability of both dice coming up 5: 1/6 × 1/6 = 1/36. More generally, we can calculate the probability of any event: e.g. (1 and 2) or (3 and 3) or (5 and 6). The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is 1/8 (because the dice are weighted). From that assumption, we can calculate the probability of both dice coming up 5: 1/8 × 1/8 = 1/64. We cannot, however, calculate the probability of events that distinguish among the other faces, such as (1 and 2), because the probabilities of the other faces are unknown.

The first statistical assumption constitutes a statistical model, because with the assumption alone we can calculate the probability of any event. The alternative statistical assumption does not constitute a statistical model, because with the assumption alone we cannot calculate the probability of every event. In the example above, with the first assumption, calculating the probability of an event is easy. With some other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of years of computation). For an assumption to constitute a statistical model, such difficulty is acceptable: doing the calculation does not need to be practicable, just theoretically possible.
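
The first assumption can be made concrete with a short sketch. The code below is only an illustration, not part of the formal treatment: it assumes the two dice are independent and treats an event such as "(1 and 2)" as the ordered outcome (first die 1, second die 2). The names `fair_dice_model` and `probability` are hypothetical helpers introduced here.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)


def fair_dice_model():
    """Joint distribution of two independent fair dice.

    Returns a dict mapping each ordered outcome (a, b) to its probability.
    Independence of the dice is an assumption of this sketch.
    """
    p = Fraction(1, 6)
    return {(a, b): p * p for a, b in product(FACES, repeat=2)}


def probability(model, event):
    """Probability of an event, given as a predicate on outcomes (a, b)."""
    return sum((pr for outcome, pr in model.items() if event(outcome)), Fraction(0))


model = fair_dice_model()

# Both dice come up 5.
both_five = probability(model, lambda o: o == (5, 5))

# The event (1 and 2) or (3 and 3) or (5 and 6), read as ordered pairs.
mixed = probability(model, lambda o: o in {(1, 2), (3, 3), (5, 6)})

print(both_five)  # 1/36
print(mixed)      # 1/12
```

Under the alternative assumption, only the probability of the face 5 is given, so a table like the one returned by `fair_dice_model` cannot be filled in; that is the sense in which the alternative assumption fails to be a statistical model.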

Formal definition

In mathematical terms, a statistical model is a pair $(S,{\mathcal {P}})$, where $S$ is the set of possible observations, i.e. the sample space, and ${\mathcal {P}}$ is a set of probability distributions on $S$.^[3] The set ${\mathcal {P}}$ represents all of the models that are considered possible. This set is typically parameterized: ${\mathcal {P}}=\{F_{\theta }:\theta \in \Theta \}$. The set $\Theta$ defines the parameters of the model. If a parameterization is such that distinct parameter values give rise to distinct distributions, i.e. $F_{\theta _{1}}=F_{\theta _{2}}\Rightarrow \theta _{1}=\theta _{2}$ (in other words, the mapping $\theta \mapsto F_{\theta }$ is injective), it is said to be identifiable.^[3]
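
Identifiability can be illustrated with a small sketch. The two parameterizations below are hypothetical examples chosen for this illustration: one maps $\theta = (\mu, \sigma)$ to a Gaussian with that mean and standard deviation, the other maps $\theta = (a, b)$ to a Gaussian with mean $a + b$ and unit variance, so that different parameter values can give the same distribution. Comparing densities on a finite grid is only a numerical check, not a proof.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))


def f_identifiable(theta):
    """theta = (mu, sigma) -> density of N(mu, sigma^2)."""
    mu, sigma = theta
    return lambda x: normal_pdf(x, mu, sigma)


def f_redundant(theta):
    """theta = (a, b) -> density of N(a + b, 1); only the sum a + b matters."""
    a, b = theta
    return lambda x: normal_pdf(x, a + b, 1.0)


GRID = [i / 10 for i in range(-50, 51)]


def same_distribution(f, g, tol=1e-12):
    """Numerically compare two densities on a fixed grid of points."""
    return all(abs(f(x) - g(x)) <= tol for x in GRID)


# Distinct parameters, same distribution: the mapping is not injective.
print(same_distribution(f_redundant((0.0, 1.0)), f_redundant((1.0, 0.0))))        # True

# Distinct parameters, distinct distributions.
print(same_distribution(f_identifiable((0.0, 1.0)), f_identifiable((1.0, 1.0))))  # False
```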

In some cases, the model can be more complex.

In Bayesian statistics, the model is extended by adding a probability distribution over the parameter space $\Theta$.
A statistical model can sometimes distinguish two sets of probability distributions. The first set ${\mathcal {Q}}=\{F_{\theta }:\theta \in \Theta \}$ is the set of models considered for inference. The second set ${\mathcal {P}}=\{F_{\lambda }:\lambda \in \Lambda \}$ is the set of models that could have generated the data, which is much larger than ${\mathcal {Q}}$. Such statistical models are key in checking that a given procedure is robust, i.e. that it does not produce catastrophic errors when its assumptions about the data are incorrect.
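
As a rough sketch of the Bayesian extension, the code below places a probability distribution (a prior) over a small, finite parameter set and updates it with data. The coin-tossing setup, the grid of parameter values, the uniform prior, and the observed counts are all assumptions made for this illustration.

```python
from fractions import Fraction

# Hypothetical finite parameter space: Theta = {1/10, 2/10, ..., 9/10},
# where theta is the probability of heads for a single coin toss.
thetas = [Fraction(i, 10) for i in range(1, 10)]

# The added ingredient in the Bayesian setting: a distribution over Theta.
prior = {t: Fraction(1, len(thetas)) for t in thetas}


def posterior(prior, heads, tails):
    """Update the prior with observed counts using Bayes' rule."""
    unnormalized = {t: p * t**heads * (1 - t) ** tails for t, p in prior.items()}
    total = sum(unnormalized.values())
    return {t: w / total for t, w in unnormalized.items()}


post = posterior(prior, heads=7, tails=3)
most_probable = max(post, key=post.get)
print(most_probable)  # 7/10
```

With a uniform prior, the most probable parameter value after the update is the one that maximizes the likelihood on the grid.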

An example

Suppose that we have a population of children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in a linear regression model, like this: height_i = b₀ + b₁age_i + ε_i, where b₀ is the intercept, b₁ is a parameter that age is multiplied by to obtain a prediction of height, ε_i is the error term, and i identifies the child. This implies that height is predicted by age, with some error.

An admissible model must be consistent with all the data points. Thus, a straight line (height_i = b₀ + b₁age_i) cannot be admissible for a model of the data, unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, ε_i, must be included in the equation, so that the model is consistent with all the data points. To do statistical inference, we would first need to assume some probability distributions for the ε_i. For instance, we might assume that the ε_i distributions are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: b₀, b₁, and the variance of the Gaussian distribution. We can formally specify the model in the form $(S,{\mathcal {P}})$ as follows. The sample space, $S$, of our model comprises the set of all possible pairs (age, height). Each possible value of $\theta$ = (b₀, b₁, σ²) determines a distribution on $S$; denote that distribution by $F_{\theta }$. If $\Theta$ is the set of all possible values of $\theta$, then ${\mathcal {P}}=\{F_{\theta }:\theta \in \Theta \}$. (The parameterization is identifiable, and this is easy to check.)

In this example, the model is determined by (1) specifying $S$ and (2) making some assumptions relevant to ${\mathcal {P}}$. There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify ${\mathcal {P}}$, as they are required to do.
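
The height example can be simulated and fitted in a few lines. This is a minimal sketch under stated assumptions: the parameter values (`B0`, `B1`, `SIGMA`), the age range of 2 to 16 years, and the sample size are invented for illustration and do not come from real data. With i.i.d. zero-mean Gaussian errors, the least-squares estimates of b₀ and b₁ coincide with the maximum-likelihood estimates, and the mean squared residual is the maximum-likelihood estimate of σ².

```python
import random

# Hypothetical "true" parameter theta = (b0, b1, sigma^2); heights in meters.
B0, B1, SIGMA = 0.75, 0.06, 0.05


def simulate(n, b0, b1, sigma, seed=0):
    """Draw n (age, height) pairs from height_i = b0 + b1 * age_i + eps_i."""
    rng = random.Random(seed)
    ages = [rng.uniform(2.0, 16.0) for _ in range(n)]               # ages uniform
    heights = [b0 + b1 * a + rng.gauss(0.0, sigma) for a in ages]   # Gaussian errors
    return ages, heights


def fit(ages, heights):
    """Least-squares estimates of (b0, b1) and the MLE of the error variance."""
    n = len(ages)
    mean_x = sum(ages) / n
    mean_y = sum(heights) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, heights))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    residuals = [y - (b0 + b1 * x) for x, y in zip(ages, heights)]
    sigma2 = sum(r * r for r in residuals) / n
    return b0, b1, sigma2


ages, heights = simulate(5000, B0, B1, SIGMA)
b0_hat, b1_hat, sigma2_hat = fit(ages, heights)
print(b0_hat, b1_hat, sigma2_hat)  # estimates of the three parameters
```

The three returned numbers correspond to the three parameters of the model; with a large simulated sample they should land close to the values used to generate the data.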

General remarks

A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic. Statistical models are often used even when the data-generating process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process). Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses. Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".^[4]

There are three purposes for a statistical model, according to Konishi & Kitagawa:^[5]

Predictions
Extraction of information
Description of stochastic structures

Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer: prediction, estimation, description.^[6]

Dimension of a model

Suppose that we have a statistical model $(S,{\mathcal {P}})$ with ${\mathcal {P}}=\{F_{\theta }:\theta \in \Theta \}$. In notation, we write that $\Theta \subseteq \mathbb {R} ^{k}$ where k is a positive integer ($\mathbb {R}$ denotes the real numbers; other sets can be used, in principle). Here, k is called the dimension of the model. The model is said to be parametric if $\Theta$ has finite dimension.^[citation needed] As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that

$${\mathcal {P}}=\left\{F_{\mu ,\sigma }(x)\equiv {\frac {1}{{\sqrt {2\pi }}\sigma }}\exp \left(-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}\right):\mu \in \mathbb {R} ,\sigma >0\right\}.$$

In this example, the dimension, k, equals 2. As another example, suppose that the data consists of points (x, y) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note the set of all possible lines has dimension 2, even though geometrically, a line has dimension 1.)
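
The univariate Gaussian family can be written as a function of a two-component parameter. The sketch below is illustrative only: the particular value of `theta` is hypothetical, and the midpoint-rule integration is a simple numerical check that each member of the family is a probability density.

```python
import math


def gaussian_density(x, mu, sigma):
    """F_{mu, sigma}(x): density of the univariate Gaussian distribution."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)


def integrate(f, lo, hi, steps=20000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


# One point of the parameter set Theta = R x (0, inf); the value is hypothetical.
theta = (1.5, 0.7)  # (mu, sigma)
mu, sigma = theta

# The dimension k of the model is the number of real components of theta.
k = len(theta)

# Integrate over mu +/- 10 sigma, which captures essentially all of the mass.
total = integrate(lambda x: gaussian_density(x, mu, sigma), mu - 10 * sigma, mu + 10 * sigma)

print(k)      # 2
print(total)  # very close to 1
```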

Although formally $\theta \in \Theta$ is a single parameter that has dimension k, it is sometimes regarded as comprising k separate parameters. For example, with the univariate Gaussian distribution, $\theta$ is formally a single parameter with dimension 2, but it is often regarded as comprising 2 separate parameters: the mean and the standard deviation. A statistical model is nonparametric if the parameter set $\Theta$ is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if k is the dimension of $\Theta$ and n is the number of samples, both semiparametric and nonparametric models have $k\rightarrow \infty$ as $n\rightarrow \infty$. If $k/n\rightarrow 0$ as $n\rightarrow \infty$, then the model is semiparametric; otherwise, the model is nonparametric.

Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".^[7]

Nested models

Not to be confused with Multilevel models.

Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model

y = b₀ + b₁x + b₂x² + ε, ε ~ 𝒩(0, σ²)

has, nested within it, the linear model

y = b₀ + b₁x + ε, ε ~ 𝒩(0, σ²)

obtained by constraining the parameter b₂ to equal 0.

In both those examples, the first model has a higher dimension than the second model (for the first example, the zero-mean model has dimension 1). Such is often, but not always, the case. As an example where they have the same dimension, the set of positive-mean Gaussian distributions is nested within the set of all Gaussian distributions; they both have dimension 2.
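
The relationship between the quadratic model and the linear model nested within it can be sketched numerically. The code below is an illustration under stated assumptions: the data are simulated from a linear model with invented coefficients, and `solve` and `fit_polynomial` are hypothetical helper functions written for this example. Because the linear model is the quadratic model with b₂ constrained to 0, the least-squares fit of the quadratic model cannot have a larger residual sum of squares than the fit of the linear model.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares fit of y = b0 + b1 x + ... ; returns (coefficients, RSS)."""
    k = degree + 1
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum((x**i) * y for x, y in zip(xs, ys)) for i in range(k)]
    coef = solve(xtx, xty)
    rss = sum(
        (y - sum(c * x**i for i, c in enumerate(coef))) ** 2 for x, y in zip(xs, ys)
    )
    return coef, rss


# Simulated data from the linear model (hypothetical coefficients and noise level).
rng = random.Random(1)
xs = [rng.uniform(-2.0, 2.0) for _ in range(200)]
ys = [1.0 + 0.5 * x + rng.gauss(0.0, 0.3) for x in xs]

coef_lin, rss_lin = fit_polynomial(xs, ys, 1)    # nested (linear) model
coef_quad, rss_quad = fit_polynomial(xs, ys, 2)  # larger (quadratic) model

print(coef_lin, rss_lin)
print(coef_quad, rss_quad)
print(rss_quad <= rss_lin)  # the larger model fits at least as well
```

A better fit of the larger model on the same data is therefore expected and is not, on its own, evidence that the extra parameter is needed.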

Comparing models

Notes

[1] Cox 2006, p. 178
[2] Adèr 2008, p. 280
[3] McCullagh 2002
[4] Cox 2006, p. 197
[5] Konishi & Kitagawa 2008, §1.1
[6] Friendly & Meyer 2016, §11.6
[7] Cox 2006, p. 2
[8] Le Cam, Lucien (1964). "Sufficiency and Approximate Sufficiency". Annals of Mathematical Statistics. 35 (4). Institute of Mathematical Statistics: 1429. doi:10.1214/aoms/1177700372.

References

Adèr, H. J. (2008), "Modelling", in Adèr, H. J.; Mellenbergh, G. J. (eds.), Advising on Research Methods: A consultant's companion, Huizen, The Netherlands: Johannes van Kessel Publishing, pp. 271–304.
Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference (2nd ed.), Springer-Verlag.
Cox, D. R. (2006), Principles of Statistical Inference, Cambridge University Press, doi:10.1017/CBO9780511813559.
Friendly, M.; Meyer, D. (2016), Discrete Data Analysis with R, Chapman & Hall.
Konishi, S.; Kitagawa, G. (2008), Information Criteria and Statistical Modeling, Springer.
McCullagh, P. (2002), "What is a statistical model?", Annals of Statistics, 30 (5): 1225–1310, doi:10.1214/aos/1035844977.

Davison, A. C. (2008), Statistical Models, Cambridge University Press.
Drton, M.; Sullivant, S. (2007), "Algebraic statistical models", Statistica Sinica, 17: 1273–1297.
Freedman, D. A. (2009), Statistical Models, Cambridge University Press.
Helland, I. S. (2010), Steps Towards a Unified Basis for Scientific Models and Methods, World Scientific.
Kroese, D. P.; Chan, J. C. C. (2014), Statistical Modeling and Computation, Springer.
Shmueli, G. (2010), "To explain or to predict?", Statistical Science, 25 (3): 289–310, arXiv:1101.0891, doi:10.1214/10-STS330, S2CID 15900983.
