Sample mean and covariance

From Wikipedia, the free encyclopedia
Statistics computed from a sample of data

The sample mean (sample average) or empirical mean (empirical average), and the sample covariance or empirical covariance, are statistics computed from a sample of data on one or more random variables.

The sample mean is the average value (or mean value) of a sample of numbers taken from a larger population of numbers, where "population" indicates not number of people but the entirety of relevant data, whether collected or not. A sample of 40 companies' sales from the Fortune 500 might be used for convenience instead of looking at the population, all 500 companies' sales. The sample mean is used as an estimator for the population mean, the average value in the entire population, where the estimate is more likely to be close to the population mean if the sample is large and representative. The reliability of the sample mean is estimated using the standard error, which in turn is calculated using the variance of the sample. If the sample is random, the standard error falls with the size of the sample, and the distribution of the sample mean approaches the normal distribution as the sample size increases.

The term "sample mean" can also refer to a vector of average values when the statistician is looking at the values of several variables in the sample, e.g. the sales, profits, and employees of a sample of Fortune 500 companies. In this case, there is not just a sample variance for each variable but a sample variance-covariance matrix (or simply covariance matrix) showing also the relationship between each pair of variables. This would be a 3×3 matrix when 3 variables are being considered. The sample covariance is useful in judging the reliability of the sample means as estimators, and is also useful as an estimate of the population covariance matrix.

Due to their ease of calculation and other desirable characteristics, the sample mean and sample covariance are widely used in statistics to represent the location and dispersion of the distribution of values in the sample, and to estimate the values for the population.

Definition of the sample mean

Further information: Arithmetic mean

The sample mean is the average of the values of a variable in a sample, that is, the sum of those values divided by the number of values. Using mathematical notation, if a sample of $N$ observations on variable $X$ is taken from the population, the sample mean is:

$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i .$$

Under this definition, if the sample (1, 4, 1) is taken from the population (1, 1, 3, 4, 0, 2, 1, 0), then the sample mean is $\bar{x} = (1 + 4 + 1)/3 = 2$, as compared to the population mean of $\mu = (1 + 1 + 3 + 4 + 0 + 2 + 1 + 0)/8 = 12/8 = 1.5$. Even if a sample is random, it is rarely perfectly representative, and other samples would have other sample means even if the samples were all from the same population. The sample (2, 1, 0), for example, would have a sample mean of 1.
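The arithmetic of this example can be checked in a few lines of Python (the numbers are those of the example above; nothing else is assumed):

```python
# Worked example: the sample (1, 4, 1) drawn from the
# population (1, 1, 3, 4, 0, 2, 1, 0).
population = [1, 1, 3, 4, 0, 2, 1, 0]
sample = [1, 4, 1]

sample_mean = sum(sample) / len(sample)              # (1 + 4 + 1) / 3 = 2.0
population_mean = sum(population) / len(population)  # 12 / 8 = 1.5
```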

If the statistician is interested in $K$ variables rather than one, each observation having a value for each of those $K$ variables, the overall sample mean consists of $K$ sample means for the individual variables. Let $x_{ij}$ be the $i$th independently drawn observation ($i = 1, \ldots, N$) on the $j$th random variable ($j = 1, \ldots, K$). These observations can be arranged into $N$ column vectors, each with $K$ entries, with the $K \times 1$ column vector giving the $i$th observations of all variables denoted $\mathbf{x}_i$ ($i = 1, \ldots, N$).

The sample mean vector $\bar{\mathbf{x}}$ is a column vector whose $j$th element $\bar{x}_j$ is the average value of the $N$ observations of the $j$th variable:

$$\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}, \quad j = 1, \ldots, K.$$

Thus, the sample mean vector contains the average of the observations for each variable, and is written

$$\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_j \\ \vdots \\ \bar{x}_K \end{bmatrix}$$
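As a minimal NumPy sketch (the data are illustrative, not from the article), the sample mean vector is obtained by averaging each variable over the $N$ observations:

```python
import numpy as np

# N = 4 observations on K = 3 variables, stored as the columns x_i of a
# K x N matrix, matching the column-vector convention above.
X = np.array([[1.0, 2.0, 3.0, 4.0],   # variable 1
              [2.0, 4.0, 6.0, 8.0],   # variable 2
              [0.0, 1.0, 0.0, 1.0]])  # variable 3

# Average over the N observations: one entry per variable,
# i.e. the K-element sample mean vector.
x_bar = X.mean(axis=1)
```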

Definition of sample covariance

See also: Sample variance

The sample covariance matrix is a $K$-by-$K$ matrix $\mathbf{Q} = [q_{jk}]$ with entries

$$q_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} \left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right),$$

where $q_{jk}$ is an estimate of the covariance between the $j$th variable and the $k$th variable of the population underlying the data. In terms of the observation vectors, the sample covariance is

$$\mathbf{Q} = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{\mathrm{T}}.$$

Alternatively, arranging the observation vectors as the columns of a matrix, so that

$$\mathbf{F} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \dots & \mathbf{x}_N \end{bmatrix},$$

which is a matrix of $K$ rows and $N$ columns. Here, the sample covariance matrix can be computed as

$$\mathbf{Q} = \frac{1}{N-1} \left(\mathbf{F} - \bar{\mathbf{x}}\,\mathbf{1}_N^{\mathrm{T}}\right)\left(\mathbf{F} - \bar{\mathbf{x}}\,\mathbf{1}_N^{\mathrm{T}}\right)^{\mathrm{T}},$$

where $\mathbf{1}_N$ is an $N \times 1$ vector of ones. If the observations are arranged as rows instead of columns, so that $\bar{\mathbf{x}}$ is now a $1 \times K$ row vector and $\mathbf{M} = \mathbf{F}^{\mathrm{T}}$ is an $N \times K$ matrix whose column $j$ is the vector of $N$ observations on variable $j$, then applying transposes in the appropriate places yields

$$\mathbf{Q} = \frac{1}{N-1} \left(\mathbf{M} - \mathbf{1}_N \bar{\mathbf{x}}\right)^{\mathrm{T}} \left(\mathbf{M} - \mathbf{1}_N \bar{\mathbf{x}}\right).$$
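Both matrix forms give the same $\mathbf{Q}$. A minimal NumPy sketch (with illustrative random data) computes the row-arranged form and checks it against `np.cov`, which uses the same $N-1$ denominator when `ddof=1`:

```python
import numpy as np

# N = 6 observations (rows of M) on K = 2 variables; illustrative data.
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 2))
N = M.shape[0]

x_bar = M.mean(axis=0)   # 1 x K row vector of sample means
D = M - x_bar            # M - 1_N x_bar, via broadcasting over rows
Q = D.T @ D / (N - 1)    # K x K sample covariance matrix

# np.cov expects variables in rows, hence the transpose.
Q_ref = np.cov(M.T, ddof=1)
```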

Like covariance matrices for random vectors, sample covariance matrices are positive semi-definite. To see this, note that for any matrix $\mathbf{A}$ the matrix $\mathbf{A}^{\mathrm{T}}\mathbf{A}$ is positive semi-definite. Furthermore, the sample covariance matrix is positive definite if and only if the rank of the vectors $\mathbf{x}_i - \bar{\mathbf{x}}$ is $K$.

Unbiasedness


The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random vector $\mathbf{X}$, a row vector whose $j$th element ($j = 1, \ldots, K$) is one of the random variables.[1] The sample covariance matrix has $N - 1$ in the denominator rather than $N$ due to a variant of Bessel's correction: in short, the sample covariance relies on the difference between each observation and the sample mean, but the sample mean is slightly correlated with each observation, since it is defined in terms of all observations. If the population mean $\operatorname{E}(\mathbf{X})$ is known, the analogous unbiased estimate

$$q_{jk} = \frac{1}{N} \sum_{i=1}^{N} \left(x_{ij} - \operatorname{E}(X_j)\right)\left(x_{ik} - \operatorname{E}(X_k)\right),$$

using the population mean, has $N$ in the denominator. This is an example of why in probability and statistics it is essential to distinguish between random variables (upper-case letters) and realizations of the random variables (lower-case letters).

The maximum likelihood estimate of the covariance

$$q_{jk} = \frac{1}{N} \sum_{i=1}^{N} \left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)$$

for the Gaussian distribution case has $N$ in the denominator as well. The ratio of $1/N$ to $1/(N - 1)$ approaches 1 for large $N$, so the maximum likelihood estimate approximately equals the unbiased estimate when the sample is large.
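The relationship between the two denominators is easy to see numerically; in NumPy the `ddof` argument selects between them (the data below are illustrative):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative sample
n = x.size

var_unbiased = np.var(x, ddof=1)  # denominator n - 1 (Bessel-corrected)
var_mle = np.var(x, ddof=0)       # denominator n (Gaussian MLE)

# The two estimates differ by exactly the factor (n - 1) / n,
# which tends to 1 as the sample grows.
ratio = var_mle / var_unbiased
```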

Distribution of the sample mean

Main article: Standard error of the mean

For each random variable, the sample mean is a good estimator of the population mean, where a "good" estimator is defined as being efficient and unbiased. Of course the estimator will likely not be the true value of the population mean, since different samples drawn from the same distribution will give different sample means and hence different estimates of the true mean. Thus the sample mean is a random variable, not a constant, and consequently has its own distribution.

Denoting by $\mu$ the population mean and by $\sigma^2$ the population variance, for a random sample of $n$ independent observations drawn from the population, the expected value of the sample mean is

$$\operatorname{E}(\bar{x}) = \mu$$

and the variance of the sample mean is

$$\operatorname{var}(\bar{x}) = \frac{\sigma^2}{n}.$$

If the samples are not independent, but correlated, then special care has to be taken in order to avoid the problem of pseudoreplication.

If the population is normally distributed, then the sample mean is normally distributed as follows:

$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

If the population is not normally distributed, the sample mean is nonetheless approximately normally distributed if $n$ is large and $\sigma^2/n < +\infty$. This is a consequence of the central limit theorem.
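A quick Monte Carlo sketch (with assumed parameters $\sigma = 2$ and $n = 25$, chosen only for illustration) shows the variance of the sample mean landing close to $\sigma^2/n$:

```python
import numpy as np

# Monte Carlo check of var(x_bar) = sigma^2 / n.
rng = np.random.default_rng(42)
sigma, n, trials = 2.0, 25, 200_000

# Each row is one sample of size n; take its mean, then examine the
# spread of those sample means across all trials.
means = rng.normal(loc=0.0, scale=sigma, size=(trials, n)).mean(axis=1)

empirical_var = means.var()
theoretical_var = sigma**2 / n  # = 0.16
```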

Weighted samples

Main article: Weighted mean

In a weighted sample, each vector $\mathbf{x}_i$ (each set of single observations on each of the $K$ random variables) is assigned a weight $w_i \geq 0$. Without loss of generality, assume that the weights are normalized:

$$\sum_{i=1}^{N} w_i = 1.$$

(If they are not, divide the weights by their sum.) Then the weighted mean vector $\bar{\mathbf{x}}$ is given by

$$\bar{\mathbf{x}} = \sum_{i=1}^{N} w_i \mathbf{x}_i ,$$

and the elements $q_{jk}$ of the weighted covariance matrix $\mathbf{Q}$ are[2]

$$q_{jk} = \frac{1}{1 - \sum_{i=1}^{N} w_i^2} \sum_{i=1}^{N} w_i \left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right).$$

If all weights are the same, $w_i = 1/N$, then $1 - \sum_i w_i^2 = (N-1)/N$, and the weighted mean and covariance reduce to the sample mean and the unbiased sample covariance given above.
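The weighted formulas can be sketched in a few lines of NumPy (the data and weights are illustrative; the weights already sum to 1):

```python
import numpy as np

x = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])           # N = 4 observations (rows), K = 2 variables
w = np.array([0.1, 0.2, 0.3, 0.4])   # normalized weights: sum to 1

x_bar = w @ x                        # weighted mean vector
D = x - x_bar                        # deviations from the weighted mean
# Weighted covariance with the 1 / (1 - sum w_i^2) correction factor.
Q = (w[:, None] * D).T @ D / (1.0 - np.sum(w**2))
```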

Criticism


The sample mean and sample covariance are not robust statistics, meaning that they are sensitive to outliers. As robustness is often a desired trait, particularly in real-world applications, robust alternatives may prove desirable, notably quantile-based statistics such as the sample median for location[3] and the interquartile range (IQR) for dispersion. Other alternatives include trimming and Winsorising, as in the trimmed mean and the Winsorized mean.


References

  1. ^ Richard Arnold Johnson; Dean W. Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall. ISBN 978-0-13-187715-3. Retrieved 10 August 2012.
  2. ^ Mark Galassi, Jim Davies, James Theiler, Brian Gough, Gerard Jungman, Michael Booth, and Fabrice Rossi. GNU Scientific Library – Reference Manual, Version 2.6, 2021. Section "Statistics: Weighted Samples".
  3. ^ The World Question Center 2006: The Sample Mean, Bart Kosko. Archived 2019-07-12 at the Wayback Machine.
Retrieved from "https://en.wikipedia.org/w/index.php?title=Sample_mean_and_covariance&oldid=1312676784"