Empirical distribution function

From Wikipedia, the free encyclopedia
Distribution function associated with the empirical measure of a sample
See also: Frequency distribution
Figure: The green curve, which asymptotically approaches heights of 0 and 1 without reaching them, is the true cumulative distribution function of the standard normal distribution. The grey hash marks represent the observations in a particular sample drawn from that distribution, and the horizontal steps of the blue step function (including the leftmost point in each step but not including the rightmost point) form the empirical distribution function of that sample.

In statistics, an empirical distribution function (also known as an empirical cumulative distribution function, eCDF) is the distribution function associated with the empirical measure of a sample.[1] This cumulative distribution function is a step function that jumps up by 1/n at each of the n data points. Its value at any specified value of the measured variable is the fraction of observations of the measured variable that are less than or equal to the specified value.

The empirical distribution function is an estimate of the cumulative distribution function that generated the points in the sample. It converges with probability 1 to that underlying distribution, according to the Glivenko–Cantelli theorem. A number of results exist to quantify the rate of convergence of the empirical distribution function to the underlying cumulative distribution function.

Definition


Let (X1, …, Xn) be independent, identically distributed real random variables with the common cumulative distribution function F(t). Then the empirical distribution function is defined as[2]

{\displaystyle {\widehat {F}}_{n}(t)={\frac {{\text{number of elements in the sample}}\leq t}{n}}={\frac {1}{n}}\sum _{i=1}^{n}\mathbf {1} _{X_{i}\leq t},}

where {\displaystyle \mathbf {1} _{A}} is the indicator of event A. For a fixed t, the indicator {\displaystyle \mathbf {1} _{X_{i}\leq t}} is a Bernoulli random variable with parameter p = F(t); hence {\displaystyle n{\widehat {F}}_{n}(t)} is a binomial random variable with mean nF(t) and variance nF(t)(1 − F(t)). This implies that {\displaystyle {\widehat {F}}_{n}(t)} is an unbiased estimator for F(t).

However, in some textbooks, the definition is given as

{\displaystyle {\widehat {F}}_{n}(t)={\frac {1}{n+1}}\sum _{i=1}^{n}\mathbf {1} _{X_{i}\leq t}.}[3][4]
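Both definitions can be computed directly from a sample. The following is a minimal sketch in Python with NumPy (ours, not from the cited sources); the sample values and the helper names ecdf and ecdf_alt are illustrative:

    import numpy as np

    def ecdf(sample, t):
        # Fraction of observations less than or equal to t (the first definition).
        sample = np.asarray(sample)
        return np.mean(sample <= t)

    def ecdf_alt(sample, t):
        # Alternative textbook definition with denominator n + 1.
        sample = np.asarray(sample)
        return np.sum(sample <= t) / (sample.size + 1)

    x = [3.1, 1.4, 2.7, 4.0, 2.2]
    print(ecdf(x, 2.7))      # 3 of 5 observations are <= 2.7, so 0.6
    print(ecdf_alt(x, 2.7))  # 3 / (5 + 1) = 0.5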

Asymptotic properties


Since the ratio (n + 1)/n approaches 1 as n goes to infinity, the asymptotic properties of the two definitions given above are the same.

By the strong law of large numbers, the estimator {\displaystyle \scriptstyle {\widehat {F}}_{n}(t)} converges to F(t) as n → ∞ almost surely, for every value of t:[2]

{\displaystyle {\widehat {F}}_{n}(t)\ {\xrightarrow {\text{a.s.}}}\ F(t);}

thus the estimator {\displaystyle \scriptstyle {\widehat {F}}_{n}(t)} is consistent. This expression asserts the pointwise convergence of the empirical distribution function to the true cumulative distribution function. There is a stronger result, called the Glivenko–Cantelli theorem, which states that the convergence in fact happens uniformly over t:[5]

{\displaystyle \|{\widehat {F}}_{n}-F\|_{\infty }\equiv \sup _{t\in \mathbb {R} }{\big |}{\widehat {F}}_{n}(t)-F(t){\big |}\ \longrightarrow \ 0.}

The sup-norm in this expression is called the Kolmogorov–Smirnov statistic for testing the goodness-of-fit between the empirical distribution {\displaystyle \scriptstyle {\widehat {F}}_{n}(t)} and the assumed true cumulative distribution function F. Other norm functions may reasonably be used here instead of the sup-norm. For example, the L2-norm gives rise to the Cramér–von Mises statistic.
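For a continuous hypothesized F, the supremum is attained at the jump points of the step function, which gives a direct way to compute the Kolmogorov–Smirnov statistic. A short sketch (ours; the standard normal null hypothesis, sample size and seed are arbitrary):

    import numpy as np
    from scipy.stats import kstest, norm

    rng = np.random.default_rng(0)
    x = np.sort(rng.normal(size=200))
    n = x.size
    cdf = norm.cdf(x)  # hypothesized CDF at the order statistics
    # sup_t |F_n(t) - F(t)| is attained just before or at a jump:
    d_n = max(np.max(np.arange(1, n + 1) / n - cdf),
              np.max(cdf - np.arange(0, n) / n))
    print(d_n)
    print(kstest(x, norm.cdf).statistic)  # agrees with the direct computation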

The asymptotic distribution can be further characterized in several different ways. First, the central limit theorem states that pointwise, {\displaystyle \scriptstyle {\widehat {F}}_{n}(t)} has an asymptotically normal distribution with the standard {\displaystyle {\sqrt {n}}} rate of convergence:[2]

{\displaystyle {\sqrt {n}}{\big (}{\widehat {F}}_{n}(t)-F(t){\big )}\ \ {\xrightarrow {d}}\ \ {\mathcal {N}}{\Big (}0,F(t){\big (}1-F(t){\big )}{\Big )}.}
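This pointwise normal approximation is easy to check by Monte Carlo simulation; below is a short sketch (ours; the point t, the sample size and the replication count are arbitrary):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    t, n, reps = 0.5, 500, 10_000
    f_t = norm.cdf(t)  # true F(t) for a standard normal sample
    samples = rng.normal(size=(reps, n))
    z = np.sqrt(n) * ((samples <= t).mean(axis=1) - f_t)
    # The empirical variance of z should be close to F(t)(1 - F(t)):
    print(z.var(), f_t * (1 - f_t))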

This result is extended by Donsker's theorem, which asserts that the empirical process {\displaystyle \scriptstyle {\sqrt {n}}({\widehat {F}}_{n}-F)}, viewed as a function indexed by {\displaystyle \scriptstyle t\in \mathbb {R} }, converges in distribution in the Skorokhod space {\displaystyle \scriptstyle D[-\infty ,+\infty ]} to the mean-zero Gaussian process {\displaystyle \scriptstyle G_{F}=B\circ F}, where B is the standard Brownian bridge.[5] The covariance structure of this Gaussian process is

{\displaystyle \operatorname {E} [\,G_{F}(t_{1})G_{F}(t_{2})\,]=F(t_{1}\wedge t_{2})-F(t_{1})F(t_{2}).}
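The same kind of simulation verifies this covariance identity; in the sketch below (ours; t1 and t2 are arbitrary), the sample covariance of the empirical process at two points is compared with F(t1 ∧ t2) − F(t1)F(t2):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    t1, t2, n, reps = -0.5, 1.0, 500, 20_000
    samples = rng.normal(size=(reps, n))
    g1 = np.sqrt(n) * ((samples <= t1).mean(axis=1) - norm.cdf(t1))
    g2 = np.sqrt(n) * ((samples <= t2).mean(axis=1) - norm.cdf(t2))
    print(np.mean(g1 * g2))  # empirical covariance
    print(norm.cdf(min(t1, t2)) - norm.cdf(t1) * norm.cdf(t2))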

The uniform rate of convergence in Donsker's theorem can be quantified by the result known as the Hungarian embedding:[6]

{\displaystyle \limsup _{n\to \infty }{\frac {\sqrt {n}}{\ln ^{2}n}}{\big \|}{\sqrt {n}}({\widehat {F}}_{n}-F)-G_{F,n}{\big \|}_{\infty }<\infty ,\quad {\text{a.s.}}}

Alternatively, the rate of convergence of {\displaystyle \scriptstyle {\sqrt {n}}({\widehat {F}}_{n}-F)} can also be quantified in terms of the asymptotic behavior of the sup-norm of this expression. A number of results exist in this vein; for example, the Dvoretzky–Kiefer–Wolfowitz inequality provides a bound on the tail probabilities of {\displaystyle \scriptstyle {\sqrt {n}}\|{\widehat {F}}_{n}-F\|_{\infty }}:[6]

{\displaystyle \Pr \!{\Big (}{\sqrt {n}}\|{\widehat {F}}_{n}-F\|_{\infty }>z{\Big )}\leq 2e^{-2z^{2}}.}
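Because the inequality is non-asymptotic, it can be checked at any fixed sample size. A simulation sketch (ours; z, n and the replication count are arbitrary):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    z, n, reps = 1.0, 200, 5_000
    i = np.arange(1, n + 1)
    exceed = 0
    for _ in range(reps):
        x = np.sort(rng.normal(size=n))
        cdf = norm.cdf(x)
        d_n = max(np.max(i / n - cdf), np.max(cdf - (i - 1) / n))
        exceed += np.sqrt(n) * d_n > z
    # The empirical tail probability should not exceed the DKW bound:
    print(exceed / reps, 2 * np.exp(-2 * z**2))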

In fact, Kolmogorov has shown that if the cumulative distribution function F is continuous, then the expression {\displaystyle \scriptstyle {\sqrt {n}}\|{\widehat {F}}_{n}-F\|_{\infty }} converges in distribution to {\displaystyle \scriptstyle \|B\|_{\infty }}, which has the Kolmogorov distribution that does not depend on the form of F.

Another result, which follows from the law of the iterated logarithm, is that[6]

{\displaystyle \limsup _{n\to \infty }{\frac {{\sqrt {n}}\|{\widehat {F}}_{n}-F\|_{\infty }}{\sqrt {2\ln \ln n}}}\leq {\frac {1}{2}},\quad {\text{a.s.}}}

and

{\displaystyle \liminf _{n\to \infty }{\sqrt {2n\ln \ln n}}\,\|{\widehat {F}}_{n}-F\|_{\infty }={\frac {\pi }{2}},\quad {\text{a.s.}}}

Confidence intervals

Figure: Empirical CDF, CDF and confidence interval plots for various sample sizes of the normal distribution
Figure: Empirical CDF, CDF and confidence interval plots for various sample sizes of the Cauchy distribution
Figure: Empirical CDF, CDF and confidence interval plots for various sample sizes of the triangle distribution

By the Dvoretzky–Kiefer–Wolfowitz inequality, the interval that contains the true CDF, {\displaystyle F(x)}, with probability {\displaystyle 1-\alpha } is specified as

{\displaystyle F_{n}(x)-\varepsilon \leq F(x)\leq F_{n}(x)+\varepsilon \;{\text{ where }}\varepsilon ={\sqrt {\frac {\ln {\frac {2}{\alpha }}}{2n}}}.}
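A minimal sketch (ours; the 95% level and the standard normal sample are arbitrary) constructing this band at the order statistics:

    import numpy as np

    rng = np.random.default_rng(4)
    x = np.sort(rng.normal(size=100))
    n, alpha = x.size, 0.05
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))  # DKW half-width
    f_hat = np.arange(1, n + 1) / n             # eCDF evaluated at the sorted sample
    lower = np.clip(f_hat - eps, 0.0, 1.0)
    upper = np.clip(f_hat + eps, 0.0, 1.0)
    # (lower, upper) is a (1 - alpha) confidence band for the true CDF F.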

Using the above bounds, one can plot the empirical CDF, the true CDF and the confidence intervals for different distributions with any of the statistical implementations listed below.

Statistical implementation


A non-exhaustive list of software implementations of the empirical distribution function includes the following (a short usage sketch follows the list):

  • In R, the ecdf function computes an empirical cumulative distribution function, with several methods for plotting, printing and computing with such an "ecdf" object.
  • In MATLAB, via the empirical cumulative distribution function (cdf) plot.
  • In JMP from SAS, the CDF plot creates a plot of the empirical cumulative distribution function.
  • In Minitab, by creating an empirical CDF.
  • In MathWave, by fitting a probability distribution to the data.
  • In Dataplot, by plotting an empirical CDF.
  • In SciPy, using the scipy.stats.ecdf function.
  • In Statsmodels, using statsmodels.distributions.empirical_distribution.ECDF.
  • In Matplotlib, using the matplotlib.pyplot.ecdf function (new in version 3.8.0).[7]
  • In Seaborn, using the seaborn.ecdfplot function.
  • In Plotly, using the plotly.express.ecdf function.
  • In Excel, by plotting an empirical CDF.
  • In ArviZ, using the az.plot_ecdf function.

References

  1. ^ Dekking, Michel (2005). A Modern Introduction to Probability and Statistics: Understanding Why and How. London: Springer. p. 219. ISBN 978-1-85233-896-1. OCLC 262680588.
  2. ^ van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press. p. 265. ISBN 0-521-78450-6.
  3. ^ Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer. p. 36, Definition 2.4. ISBN 978-1-4471-3675-0.
  4. ^ Madsen, H.O.; Krenk, S.; Lind, S.C. (2006). Methods of Structural Safety. Dover Publications. pp. 148–149. ISBN 0-486-44597-6.
  5. ^ van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press. p. 266. ISBN 0-521-78450-6.
  6. ^ van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press. p. 268. ISBN 0-521-78450-6.
  7. ^ "What's New in Matplotlib 3.8.0 (Sept 13, 2023) — Matplotlib 3.8.3 documentation".
