Sample maximum and minimum

In statistics, the sample maximum and sample minimum, also called the largest observation and smallest observation, are the values of the greatest and least elements of a sample.^[1] They are basic summary statistics, used in descriptive statistics such as the five-number summary and Bowley's seven-figure summary and the associated box plot.

Box plots of the Michelson–Morley experiment, showing sample maxima and minima

The minimum and the maximum value are the first and last order statistics (often denoted $X_{(1)}$ and $X_{(n)}$ respectively, for a sample size of $n$).

If the sample has outliers, they necessarily include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low. However, the sample maximum and minimum need not be outliers, if they are not unusually far from other observations.

Robustness

The sample maximum and minimum are the least robust statistics: they are maximally sensitive to outliers.

This can be either an advantage or a drawback: if extreme values are real (not measurement errors), and of real consequence, as in applications of extreme value theory such as building dikes or financial loss, then outliers (as reflected in sample extrema) are important. On the other hand, if outliers have little or no impact on actual outcomes, then using non-robust statistics such as the sample extrema simply clouds the statistics, and robust alternatives should be used, such as other quantiles: the 10th and 90th percentiles (first and last decile) are more robust alternatives.
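The contrast can be illustrated with a small sketch in Python. The data and the helper name `decile_bounds` are made up for illustration, and the "inclusive" interpolation rule of `statistics.quantiles` is only one of several percentile conventions. Replacing the largest observation with a gross error moves the sample maximum arbitrarily far, while the first and last deciles of this sample do not change.

```python
import statistics

def decile_bounds(sample):
    """Return the first and last deciles (10th and 90th percentiles)."""
    # quantiles(..., n=10) returns the nine cut points between deciles.
    cuts = statistics.quantiles(sample, n=10, method="inclusive")
    return cuts[0], cuts[-1]

# Hypothetical data: 1, 2, ..., 20, then the same sample with its
# largest value replaced by a gross measurement error.
clean = list(range(1, 21))
contaminated = clean[:-1] + [2000]

print("maximum:", max(clean), "->", max(contaminated))
print("deciles:", decile_bounds(clean), "->", decile_bounds(contaminated))
```

Here only one of twenty observations is contaminated; with more than about 10% of the sample contaminated on one side, the corresponding decile would be affected as well.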

Derived statistics

In addition to being a component of every statistic that uses all elements of the sample, the sample extrema are important parts of the range, a measure of dispersion, and mid-range, a measure of location. They also realize the maximum absolute deviation: one of them is the furthest point from any given point, particularly a measure of center such as the median or mean.
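A minimal Python sketch of these derived statistics follows. The function names and the sample are illustrative assumptions, not part of any standard library. The second function relies on the fact stated above: the point furthest from any chosen center is always one of the two extremes, so only they need to be examined.

```python
import statistics

def extrema_summary(sample):
    """Range (dispersion) and mid-range (location) from the two extremes."""
    lo, hi = min(sample), max(sample)
    return {"range": hi - lo, "mid_range": (lo + hi) / 2}

def max_abs_deviation(sample, center):
    """Largest absolute deviation from `center`, using only the extremes."""
    return max(abs(min(sample) - center), abs(max(sample) - center))

# Hypothetical sample for illustration.
sample = [2, 4, 4, 5, 7, 9, 15]
summary = extrema_summary(sample)
median = statistics.median(sample)

print(summary)
print(max_abs_deviation(sample, median))
```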

Applications

Smooth maximum

For a sample set, the maximum function is non-smooth and thus non-differentiable. For optimization problems that occur in statistics it often needs to be approximated by a smooth function that is close to the maximum of the set.

A smooth maximum, for example,

$g(x_{1},x_{2},\dots ,x_{n})=\log(\exp(x_{1})+\exp(x_{2})+\cdots +\exp(x_{n}))$

is a good approximation of the sample maximum.
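The function $g$ above is commonly called LogSumExp. It never falls below the sample maximum and exceeds it by at most $\log n$. The sketch below is one possible implementation, not a reference one: subtracting the maximum before exponentiating is a standard trick to avoid overflow, and the names `logsumexp`, `smooth_max` and the `sharpness` parameter are assumptions introduced here. Scaling the inputs by a larger `sharpness` tightens the approximation.

```python
import math

def logsumexp(xs):
    """log(exp(x1) + ... + exp(xn)), computed without overflow."""
    m = max(xs)
    # exp(x - m) <= 1 for every x, so the sum cannot overflow.
    return m + math.log(sum(math.exp(x - m) for x in xs))

def smooth_max(xs, sharpness=1.0):
    """Scaled LogSumExp; approaches max(xs) as `sharpness` grows."""
    return logsumexp([sharpness * x for x in xs]) / sharpness

# Hypothetical values for illustration.
xs = [1.0, 2.0, 10.0]
print(max(xs), logsumexp(xs), smooth_max(xs, sharpness=50.0))
```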

Summary statistics

The sample maximum and minimum are basic summary statistics, showing the most extreme observations, and are used in the five-number summary and a version of the seven-number summary and the associated box plot.

Prediction interval

Further information: Prediction interval § Non-parametric

The sample maximum and minimum provide a non-parametric prediction interval: in a sample from a population, or more generally an exchangeable sequence of random variables, each observation is equally likely to be the maximum or minimum.

Thus if one has a sample $\{X_{1},\dots ,X_{n}\}$ and one picks another observation $X_{n+1}$, then this has $1/(n+1)$ probability of being the largest value seen so far, $1/(n+1)$ probability of being the smallest value seen so far, and thus the other $(n-1)/(n+1)$ of the time, $X_{n+1}$ falls between the sample maximum and sample minimum of $\{X_{1},\dots ,X_{n}\}$. Thus, denoting the sample maximum and minimum by $M$ and $m$, this yields an $(n-1)/(n+1)$ prediction interval of $[m,M]$.

For example, if $n=19$, then $[m,M]$ gives an 18/20 = 90% prediction interval – 90% of the time, the 20th observation falls between the smallest and largest observation seen heretofore. Likewise, $n=39$ gives a 95% prediction interval, and $n=199$ gives a 99% prediction interval.
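The arithmetic above can be checked with a short Python sketch. The exact coverage is computed from the formula $(n-1)/(n+1)$. The simulation is only an illustrative check under the assumption of independent draws from a continuous uniform distribution (any continuous distribution would serve, since the argument is non-parametric). The function names, trial count and seed are arbitrary choices.

```python
import random
from fractions import Fraction

def coverage(n):
    """Exact coverage of [sample min, sample max] for the next observation."""
    return Fraction(n - 1, n + 1)

def simulate_coverage(n, trials=20000, seed=0):
    """Fraction of trials in which a new draw lands inside [min, max]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        new_observation = rng.random()
        if min(sample) <= new_observation <= max(sample):
            hits += 1
    return hits / trials

for n in (19, 39, 199):
    print(n, coverage(n), float(coverage(n)))
print("simulated, n = 19:", simulate_coverage(19))
```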

Estimation

Due to their sensitivity to outliers, the sample extrema cannot reliably be used as estimators unless data is clean – robust alternatives include the first and last deciles.

However, with clean data or in theoretical settings, they can sometimes prove very good estimators, particularly for platykurtic distributions, where for small data sets the mid-range is the most efficient estimator.

They are inefficient estimators of location for mesokurtic distributions, such as the normal distribution, and leptokurtic distributions, however.
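A small Monte Carlo sketch in Python can make this contrast concrete. It compares the sampling variance of the mid-range with that of the sample mean for a uniform distribution (which is platykurtic) and a standard normal distribution (mesokurtic). The sample size of 10, the trial count, the seed and the function names are all arbitrary assumptions, and the simulation illustrates the claim rather than proving it.

```python
import random
import statistics

def mid_range(sample):
    """Midpoint of the sample minimum and maximum."""
    return (min(sample) + max(sample)) / 2

def estimator_variances(draw, n=10, trials=5000, seed=0):
    """Simulated variances of (mid-range, mean) over repeated samples."""
    rng = random.Random(seed)
    mids, means = [], []
    for _ in range(trials):
        sample = [draw(rng) for _ in range(n)]
        mids.append(mid_range(sample))
        means.append(statistics.fmean(sample))
    return statistics.variance(mids), statistics.variance(means)

# Both distributions are centered at 0, so both estimators target 0.
uniform_mid, uniform_mean = estimator_variances(lambda rng: rng.uniform(-1, 1))
normal_mid, normal_mean = estimator_variances(lambda rng: rng.gauss(0, 1))

print("uniform: mid-range", uniform_mid, "mean", uniform_mean)
print("normal:  mid-range", normal_mid, "mean", normal_mean)
```

A smaller variance means a more efficient estimator: in this setup the mid-range should come out ahead for the uniform samples and behind for the normal ones.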

Uniform distribution

Further information: German tank problem

For sampling without replacement from a uniform distribution with one or two unknown endpoints (so $1,2,\dots ,N$ with $N$ unknown, or $M,M+1,\dots ,N$ with both $M$ and $N$ unknown), the sample maximum, or respectively the sample maximum and sample minimum, are sufficient and complete statistics for the unknown endpoints; thus an unbiased estimator derived from these will be the UMVU estimator.

If only the top endpoint is unknown, the sample maximum is a biased estimator for the population maximum, but the unbiased estimator ${\frac {k+1}{k}}m-1$ (where $m$ is the sample maximum and $k$ is the sample size) is the UMVU estimator; see German tank problem for details.
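As a sketch, the estimator ${\frac {k+1}{k}}m-1$ can be written in a few lines of Python, and its unbiasedness can be verified exactly for a small population by averaging over every possible sample drawn without replacement. The function names and the choice of $N=10$, $k=3$ are illustrative assumptions. Exact rational arithmetic is used so that no rounding enters the check.

```python
from fractions import Fraction
from itertools import combinations

def umvu_population_max(sample):
    """Estimate N from a sample of 1..N drawn without replacement."""
    k = len(sample)
    m = max(sample)
    return Fraction(k + 1, k) * m - 1

def average_over_all_samples(statistic, N, k):
    """Exact expectation of `statistic` over all k-subsets of 1..N."""
    subsets = list(combinations(range(1, N + 1), k))
    return sum(Fraction(statistic(s)) for s in subsets) / len(subsets)

N, k = 10, 3
print("E[sample maximum] =", average_over_all_samples(max, N, k))
print("E[estimator]      =", average_over_all_samples(umvu_population_max, N, k))
```

The first expectation falls short of $N$, showing the bias of the raw sample maximum, while the second equals $N$ exactly.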

If both endpoints are unknown, then the sample range is a biased estimator for the population range, but correcting it in the same way as the maximum above yields the UMVU estimator.

If both endpoints are unknown, then the mid-range is an unbiased (and hence UMVU) estimator of the midpoint of the interval (here equivalently the population median, average, or mid-range).

The reason the sample extrema are sufficient statistics is that the conditional distribution of the non-extreme samples is just the distribution for the uniform interval between the sample maximum and minimum – once the endpoints are fixed, the values of the interior points add no additional information.