Hodges–Lehmann estimator

From Wikipedia, the free encyclopedia

Robust and nonparametric estimator of a population's location parameter

In statistics, the Hodges–Lehmann estimator is a robust and nonparametric estimator of a population's location parameter. For populations that are symmetric about one median, such as the Gaussian or normal distribution or the Student t-distribution, the Hodges–Lehmann estimator is a consistent and median-unbiased estimate of the population median. For non-symmetric populations, the Hodges–Lehmann estimator estimates the "pseudo-median", which is closely related to the population median.

The Hodges–Lehmann estimator was proposed originally for estimating the location parameter of one-dimensional populations, but it has been used for many more purposes. It has been used to estimate the differences between the members of two populations. It has been generalized from univariate populations to multivariate populations, which produce samples of vectors.

It is based on the Wilcoxon signed-rank statistic. In statistical theory, it was an early example of a rank-based estimator, an important class of estimators both in nonparametric statistics and in robust statistics. The Hodges–Lehmann estimator was proposed in 1963 independently by Pranab Kumar Sen and by Joseph Hodges and Erich Lehmann, and so it is also called the "Hodges–Lehmann–Sen estimator".^[1]

Definition

In the simplest case, the "Hodges–Lehmann" statistic estimates the location parameter for a univariate population.^[2]^[3] Its computation can be described quickly. For a dataset with n measurements $z_{1},\dots ,z_{n}$, form the set of all pairs $(z_{i},z_{j})$ such that $i\leq j$ (that is, specifically including the self-pairs with $i=j$; many secondary sources incorrectly omit this detail). This set has n(n + 1)/2 elements. For each such pair, the mean is computed; finally, the median of these n(n + 1)/2 averages is defined to be the Hodges–Lehmann estimator of location.
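The one-sample computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `hodges_lehmann` and the sample data are chosen here for the example, and the brute-force enumeration of all n(n + 1)/2 pairs is only practical for small samples.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(sample):
    """One-sample Hodges-Lehmann estimate of location (illustrative name)."""
    values = list(sample)
    if not values:
        raise ValueError("sample must contain at least one measurement")
    # combinations_with_replacement yields every pair (z_i, z_j) with i <= j,
    # self-pairs included, so there are n(n + 1)/2 pairs in total.
    pairwise_means = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    # The estimate is the median of those pairwise means.
    return median(pairwise_means)


data = [1.0, 5.0, 2.0, 100.0]  # made-up data with one gross outlier
print(hodges_lehmann(data))    # 4.25
```

For the four made-up measurements there are 10 pairwise means; their median, 4.25, is barely affected by the outlying value 100, whereas the sample mean of the same data is 27.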

The two-sample Hodges–Lehmann statistic is an estimate of a location-shift type difference between two populations. For two sets of data with m and n observations, the set of pairs formed from them is their Cartesian product, which contains m × n pairs of points (one from each set); each such pair defines one difference of values. The Hodges–Lehmann statistic is the median of the m × n differences.^[4]
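The two-sample statistic can be sketched in the same brute-force way. The function name `hodges_lehmann_shift` is hypothetical, and the sign convention (second sample minus first) is an assumption made for this example, since the text above does not fix the direction of the difference.

```python
from itertools import product
from statistics import median


def hodges_lehmann_shift(xs, ys):
    """Two-sample Hodges-Lehmann estimate of the shift from xs to ys."""
    xs, ys = list(xs), list(ys)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    # The Cartesian product gives all m * n pairs, one point from each sample.
    # Assumed convention: each difference is taken as y - x.
    differences = [y - x for x, y in product(xs, ys)]
    return median(differences)


first = [1, 2, 3]    # made-up sample of m = 3 observations
second = [4, 6, 8]   # made-up sample of n = 3 observations
print(hodges_lehmann_shift(first, second))  # 4
```

Swapping the two arguments flips the sign of the result, so whichever convention is chosen should be stated alongside the estimate.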

Estimating the population median of a symmetric population

In the general case the Hodges–Lehmann statistic estimates the population's pseudo-median,^[5] a location parameter that is closely related to the median. The difference between the median and pseudo-median is relatively small, and so this distinction is neglected in elementary discussions. Like the spatial median,^[6] the pseudo-median is well defined for all distributions of random variables having dimension two or greater; for one-dimensional distributions, a pseudo-median always exists but need not be unique. Like the median, the pseudo-median is defined even for heavy-tailed distributions that lack any (finite) mean.^[7]
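For orientation, the pseudo-median of a univariate distribution is commonly defined as a median of the average of two independent draws from that distribution. The symbols $F$, $Z_1$, $Z_2$ and $\theta_{\text{pseudo}}$ below are notation introduced for this sketch and do not appear in the cited sources in this form.

```latex
\theta_{\text{pseudo}}
  = \operatorname{median}\!\left( \frac{Z_1 + Z_2}{2} \right),
\qquad Z_1, Z_2 \ \text{independent, each with distribution } F .
```

The Hodges–Lehmann statistic is the sample counterpart of this quantity: it replaces the distribution of $(Z_1 + Z_2)/2$ by the observed pairwise means $(z_i + z_j)/2$ with $i \leq j$. When $F$ is symmetric about a point, that point is both a median and a pseudo-median.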

For a population that is symmetric, the Hodges–Lehmann statistic also estimates the population's median. It is a robust statistic that has a breakdown point of 0.29, which means that the statistic remains bounded even if nearly 30 percent of the data have been contaminated. This robustness is an important advantage over the sample mean, which has a zero breakdown point: it changes in proportion to any single observation and so is liable to be misled by even one outlier. The sample median is even more robust, having a breakdown point of 0.50.^[8] The Hodges–Lehmann estimator is also much better than the sample mean when estimating mixtures of normal distributions.^[9]
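The contrast in breakdown behaviour can be illustrated on a small made-up sample. The sketch below replaces k of 10 observations with a gross outlier and compares the sample mean, the Hodges–Lehmann statistic and the sample median; the helper names, the data and the outlier value 1e6 are all assumptions chosen for the example.

```python
from itertools import combinations_with_replacement
from statistics import mean, median


def hodges_lehmann(sample):
    """Median of the pairwise means (z_i + z_j) / 2 over all i <= j."""
    return median((a + b) / 2 for a, b in combinations_with_replacement(sample, 2))


def contaminate(sample, k, outlier=1e6):
    """Return a copy of sample whose first k observations are gross outliers."""
    return [outlier] * k + list(sample[k:])


clean = [9.8, 10.1, 9.9, 10.2, 10.0, 10.3, 9.7, 10.1, 9.9, 10.0]  # made-up data

results = {}
for k in (0, 1, 2):
    data = contaminate(clean, k)
    # Tuple order: (sample mean, Hodges-Lehmann statistic, sample median).
    results[k] = (mean(data), hodges_lehmann(data), median(data))
    print(k, results[k])
```

With a single contaminated value the sample mean is already dragged far outside the range of the clean data, while with two of ten values contaminated (20 percent, below the 0.29 breakdown point) the Hodges–Lehmann statistic and the median both stay inside that range.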

For symmetric distributions, the Hodges–Lehmann statistic sometimes has greater efficiency at estimating the center of symmetry (population median) than does the sample median. For the normal distribution, the Hodges–Lehmann statistic is nearly as efficient as the sample mean. For the Cauchy distribution (Student t-distribution with one degree of freedom), the Hodges–Lehmann statistic is infinitely more efficient than the sample mean, which is not a consistent estimator of the median,^[8] but it is not more efficient than the sample median in that instance.
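A small Monte Carlo experiment gives a rough feel for the Cauchy case. The sketch below is illustrative only: the sample size, the number of replications, the seed and all function names are arbitrary choices made here, and the printed values are simulation estimates that change with those choices.

```python
import math
import random
from itertools import combinations_with_replacement
from statistics import mean, median


def hodges_lehmann(sample):
    """Median of the pairwise means (z_i + z_j) / 2 over all i <= j."""
    return median((a + b) / 2 for a, b in combinations_with_replacement(sample, 2))


def standard_cauchy(rng):
    """Draw from the standard Cauchy distribution by inverting its CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def mean_squared_errors(n=25, reps=200, seed=0):
    """Simulated mean squared error of three estimators of the center, 0."""
    rng = random.Random(seed)
    estimators = {"mean": mean, "median": median, "hodges_lehmann": hodges_lehmann}
    totals = {name: 0.0 for name in estimators}
    for _ in range(reps):
        sample = [standard_cauchy(rng) for _ in range(n)]
        for name, estimator in estimators.items():
            # The true center of the standard Cauchy distribution is 0.
            totals[name] += estimator(sample) ** 2
    return {name: total / reps for name, total in totals.items()}


mse = mean_squared_errors()
for name, value in mse.items():
    print(f"{name:>15}: {value:.3f}")
```

In runs of this kind the error of the sample mean is dominated by a few extreme draws and does not shrink as n grows, whereas the errors of the median and of the Hodges–Lehmann statistic stay small.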

The one-sample Hodges–Lehmann statistic need not estimate any population mean, which for many distributions does not exist. The two-sample Hodges–Lehmann estimator need not estimate the difference of two means or the difference of two (pseudo-)medians; rather, it estimates the median of the distribution of the difference between pairs of random variables drawn respectively from the two populations.^[4]

In general statistics

The Hodges–Lehmann univariate statistics have several generalizations in multivariate statistics:^[10]

Multivariate ranks and signs^[11]
Spatial sign tests and spatial medians^[6]
Spatial signed-rank tests^[12]
Comparisons of tests and estimates^[13]
Several-sample location problems^[14]

Notes

1. Lehmann (2006, pp. 176 and 200–201)
2. Dodge, Y. (2003) The Oxford Dictionary of Statistical Terms, OUP. ISBN 0-19-850994-4. Entry for "Hodges-Lehmann one-sample estimator"
3. Hodges & Lehmann (1963)
4. Everitt (2002) Entry for "Hodges-Lehmann estimator"
5. Hettmansperger & McKean (1998, pp. 2–4)
6. Oja (2010, p. 71)
7. Hettmansperger & McKean (1998, pp. 2–4 and 355–356)
8. Myles Hollander, Douglas A. Wolfe. Nonparametric statistical methods. 2nd ed. John Wiley.
9. Jureckova, Sen. Robust Statistical Procedures.
10. Oja (2010, pp. 2–3)
11. Oja (2010, p. 34)
12. Oja (2010, pp. 83–94)
13. Oja (2010, pp. 98–102)
14. Oja (2010, pp. 160, 162, and 167–169)

References

Everitt, B. S. (2002) The Cambridge Dictionary of Statistics, CUP. ISBN 0-521-81099-X
Hettmansperger, T. P.; McKean, J. W. (1998). Robust nonparametric statistical methods. Kendall's Library of Statistics. Vol. 5 (First ed., rather than Taylor and Francis (2010) second ed.). London; New York: Edward Arnold; John Wiley and Sons, Inc. pp. xiv+467. ISBN 0-340-54937-8. MR 1604954.
Hodges, J. L.; Lehmann, E. L. (1963). "Estimation of location based on ranks". Annals of Mathematical Statistics. 34 (2): 598–611. doi:10.1214/aoms/1177704172. JSTOR 2238406. MR 0152070. Zbl 0203.21105. PE euclid.aoms/1177704172.
Lehmann, Erich L. (2006). Nonparametrics: Statistical methods based on ranks. With the special assistance of H. J. M. D'Abrera (Reprinting of 1988 revision of 1975 Holden-Day ed.). New York: Springer. pp. xvi+463. ISBN 978-0-387-35212-1. MR 0395032.
Oja, Hannu (2010). Multivariate nonparametric methods with R: An approach based on spatial signs and ranks. Lecture Notes in Statistics. Vol. 199. New York: Springer. pp. xiv+232. doi:10.1007/978-1-4419-0468-3. ISBN 978-1-4419-0467-6. MR 2598854.
Sen, Pranab Kumar (December 1963). "On the estimation of relative potency in dilution(-direct) assays by distribution-free methods". Biometrics. 19 (4): 532–552. doi:10.2307/2527532. JSTOR 2527532. Zbl 0119.15604.
