Q–Q plot

From Wikipedia, the free encyclopedia

Comparison of two distributions

Not to be confused with P–P plot.

A normal Q–Q plot of randomly generated, independent standard exponential data (X ~ Exp(1)). This Q–Q plot compares a sample of data on the vertical axis to a statistical population on the horizontal axis. The points follow a strongly nonlinear pattern, suggesting that the data are not distributed as a standard normal (X ~ N(0,1)). The offset between the line and the points suggests that the mean of the data is not 0. The median of the points can be determined to be near 0.7.

A normal Q–Q plot comparing randomly generated, independent standard normal data on the vertical axis to a standard normal population on the horizontal axis. The linearity of the points suggests that the data are normally distributed.

A Q–Q plot of a sample of data versus a Weibull distribution. The deciles of the distributions are shown in red. Three outliers are evident at the high end of the range. Otherwise, the data fit the Weibull(1,2) model well.

A Q–Q plot comparing the distributions of standardized daily maximum temperatures at 25 stations in the US state of Ohio in March and in July. The curved pattern suggests that the central quantiles are more closely spaced in July than in March, and that the July distribution is skewed to the left compared to the March distribution. The data cover the period 1893–2001.

In statistics, a Q–Q plot (quantile–quantile plot) is a probability plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other.^[1] A point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate). This defines a parametric curve where the parameter is the index of the quantile interval.

If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the identity line y = x. If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line, but not necessarily on the line y = x. Q–Q plots can also be used as a graphical means of estimating parameters in a location-scale family of distributions.

A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions. Q–Q plots can be used to compare collections of data, or theoretical distributions. The use of Q–Q plots to compare two samples of data can be viewed as a non-parametric approach to comparing their underlying distributions. A Q–Q plot is generally more diagnostic than comparing the samples' histograms, but is less widely known. Q–Q plots are commonly used to compare a data set to a theoretical model.^[2]^[3] This can provide an assessment of goodness of fit that is graphical, rather than reducing to a numerical summary statistic. Since Q–Q plots compare distributions, there is no need for the values to be observed as pairs, as in a scatter plot, or even for the numbers of values in the two groups being compared to be equal.

The term "probability plot" sometimes refers specifically to a Q–Q plot, sometimes to a more general class of plots, and sometimes to the less commonly used P–P plot. The probability plot correlation coefficient (PPCC) is a quantity derived from the idea of Q–Q plots; it measures the agreement of a fitted distribution with observed data and is sometimes used as a means of fitting a distribution to data.

Definition and construction

Q–Q plot for first opening/final closing dates of Washington State Route 20, versus a normal distribution.^[4] Outliers are visible in the upper right corner.

A Q–Q plot is a plot of the quantiles of two distributions against each other, or a plot based on estimates of the quantiles. The pattern of points in the plot is used to compare the two distributions.

The main step in constructing a Q–Q plot is calculating or estimating the quantiles to be plotted. If one or both of the axes in a Q–Q plot is based on a theoretical distribution with a continuous cumulative distribution function (CDF), all quantiles are uniquely defined and can be obtained by inverting the CDF. If a theoretical probability distribution with a discontinuous CDF is one of the two distributions being compared, some of the quantiles may not be defined, so an interpolated quantile may be plotted. If the Q–Q plot is based on data, there are multiple quantile estimators in use. Rules for forming Q–Q plots when quantiles must be estimated or interpolated are called plotting positions.

A simple case is where one has two data sets of the same size. In that case, to make the Q–Q plot, one orders each set in increasing order, then pairs off and plots the corresponding values. A more complicated construction is the case where two data sets of different sizes are being compared. To construct the Q–Q plot in this case, it is necessary to use an interpolated quantile estimate so that quantiles corresponding to the same underlying probability can be constructed.
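The two constructions just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `quantile` and `qq_points` are hypothetical, the probabilities (k − 0.5) / n are one plotting-position choice among many, and linear interpolation between neighbouring order statistics is one of several quantile estimators in use.

```python
def quantile(sorted_xs, p):
    """Interpolated sample quantile at probability p (assumed rule).

    The k-th smallest value is placed at probability (k - 0.5) / n and
    values in between are linearly interpolated; outside that range the
    smallest or largest observation is returned.
    """
    n = len(sorted_xs)
    pos = min(max(p * n - 0.5, 0.0), n - 1.0)  # fractional 0-based index
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_xs[lo] * (1.0 - frac) + sorted_xs[hi] * frac


def qq_points(xs, ys):
    """Return the (x, y) pairs of a two-sample Q-Q plot."""
    xs, ys = sorted(xs), sorted(ys)
    if len(xs) == len(ys):
        # Equal sizes: pair off the ordered values directly.
        return list(zip(xs, ys))
    # Unequal sizes: evaluate both samples at the probabilities
    # attached to the smaller sample, interpolating where needed.
    n = min(len(xs), len(ys))
    probs = [(k - 0.5) / n for k in range(1, n + 1)]
    return [(quantile(xs, p), quantile(ys, p)) for p in probs]


equal = qq_points([3, 1, 2], [30, 10, 20])
unequal = qq_points([1, 2, 3, 4, 5], [0, 10])
```

With this rule the smaller sample contributes its own ordered values (up to rounding), and only the larger sample is interpolated.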

More abstractly,^[5] given two cumulative probability distribution functions F and G, with associated quantile functions F⁻¹ and G⁻¹ (the inverse function of the CDF is the quantile function), the Q–Q plot draws the q-th quantile of F against the q-th quantile of G for a range of values of q. Thus, the Q–Q plot is a parametric curve indexed over [0,1] with values in the real plane R².
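As a small illustration of this parametric view, the sketch below traces the curve (F⁻¹(q), G⁻¹(q)) for two normal distributions using `statistics.NormalDist` from the Python standard library. The particular choices F = N(0, 1), G = N(2, 3²) and the grid of q values are assumptions made for the example only.

```python
from statistics import NormalDist

# Assumed example distributions: F is standard normal, G is a shifted
# and rescaled normal with mean 2 and standard deviation 3.
F = NormalDist(mu=0.0, sigma=1.0)
G = NormalDist(mu=2.0, sigma=3.0)

# Parameter values q strictly inside (0, 1); the normal quantile
# function is unbounded at the endpoints 0 and 1.
qs = [i / 20 for i in range(1, 20)]

# The Q-Q curve: q-th quantile of F on x, q-th quantile of G on y.
curve = [(F.inv_cdf(q), G.inv_cdf(q)) for q in qs]

# G is a linear transformation of F, so the curve is a straight line,
# here y = 2 + 3x rather than the identity line y = x.
max_gap = max(abs(y - (2.0 + 3.0 * x)) for x, y in curve)
```

In this case `max_gap` is zero up to floating-point rounding, which matches the statement that linearly related distributions give a straight Q–Q plot.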

Typically for an analysis of normality, the vertical axis shows the values of the variable of interest, say x with CDF F(x), and the horizontal axis represents N⁻¹(F(x)), where N⁻¹(.) represents the inverse cumulative normal distribution function.

Interpretation

The points plotted in a Q–Q plot are always non-decreasing when viewed from left to right. If the two distributions being compared are identical, the Q–Q plot follows the 45° line y = x. If the two distributions agree after linearly transforming the values in one of the distributions, then the Q–Q plot follows some line, but not necessarily the line y = x. If the general trend of the Q–Q plot is flatter than the line y = x, the distribution plotted on the horizontal axis is more dispersed than the distribution plotted on the vertical axis. Conversely, if the general trend of the Q–Q plot is steeper than the line y = x, the distribution plotted on the vertical axis is more dispersed than the distribution plotted on the horizontal axis. Q–Q plots are often arced, or S-shaped, indicating that one of the distributions is more skewed than the other, or that one of the distributions has heavier tails than the other.

Although a Q–Q plot is based on quantiles, in a standard Q–Q plot it is not possible to determine which point in the Q–Q plot determines a given quantile. For example, it is not possible to determine the median of either of the two distributions being compared by inspecting the Q–Q plot. Some Q–Q plots indicate the deciles to make determinations such as this possible.

The intercept and slope of a linear regression between the quantiles give a measure of the relative location and relative scale of the samples. If the median of the distribution plotted on the horizontal axis is 0, the intercept of a regression line is a measure of location, and the slope is a measure of scale. The distance between medians is another measure of relative location reflected in a Q–Q plot. The "probability plot correlation coefficient" (PPCC) is the correlation coefficient between the paired sample quantiles. The closer the correlation coefficient is to one, the closer the distributions are to being shifted, scaled versions of each other. For distributions with a single shape parameter, the probability plot correlation coefficient plot provides a method for estimating the shape parameter – one simply computes the correlation coefficient for different values of the shape parameter, and uses the one with the best fit, just as if one were comparing distributions of different types.
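The regression summary described above can be sketched as follows. The function name `qq_line_fit` is hypothetical, ordinary least squares is assumed for the line, and the example data are constructed (not observed) so that the vertical-axis quantiles are exactly 5 + 2 times standard normal quantiles.

```python
from math import sqrt
from statistics import NormalDist, fmean


def qq_line_fit(xs, ys):
    """Least-squares intercept, slope and correlation of paired quantiles."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ppcc = sxy / sqrt(sxx * syy)  # correlation between the paired quantiles
    return intercept, slope, ppcc


# Horizontal axis: standard normal quantiles (median 0).
n = 50
z = [NormalDist().inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]

# Vertical axis: constructed quantiles with location 5 and scale 2.
sample_quantiles = [5.0 + 2.0 * v for v in z]

intercept, slope, ppcc = qq_line_fit(z, sample_quantiles)
```

Because the horizontal-axis distribution has median 0, the fitted intercept recovers the location (5) and the slope recovers the scale (2), while the correlation is 1 up to rounding. Real data would give a correlation below 1.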

Another common use of Q–Q plots is to compare the distribution of a sample to a theoretical distribution, such as the standard normal distribution N(0,1), as in a normal probability plot. As in the case when comparing two samples of data, one orders the data (formally, computes the order statistics), then plots them against certain quantiles of the theoretical distribution.^[3]
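A minimal sketch of this sample-versus-theory construction, using only the Python standard library, might look like the following. The function name `normal_qq_points` and the choice of probabilities (k − 0.5) / n are assumptions for illustration; other choices of probabilities are equally legitimate.

```python
from statistics import NormalDist


def normal_qq_points(data):
    """Pair ordered data with standard normal quantiles.

    Returns (theoretical quantile, observed value) pairs, with the
    theoretical quantiles on the horizontal axis. The probabilities
    (k - 0.5) / n are an assumed choice.
    """
    ordered = sorted(data)  # the order statistics of the sample
    n = len(ordered)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return list(zip(theoretical, ordered))


points = normal_qq_points([2.0, -1.0, 0.5])
```

Feeding `points` to any scatter-plot routine gives the normal probability plot; roughly linear points suggest the sample is consistent with a normal distribution.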

Plotting positions

The choice of quantiles from a theoretical distribution can depend upon context and purpose. One choice, given a sample of size n, is k / n for k = 1, …, n, as these are the quantiles that the sampling distribution realizes. The last of these, n / n, corresponds to the 100th percentile – the maximum value of the theoretical distribution, which is sometimes infinite. Other choices are the use of (k − 0.5) / n, or instead to space the n points such that there is an equal distance between all of them and also between the two outermost points and the edges of the $[0,1]$ interval, using k / (n + 1).^[6]
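The three choices mentioned in this paragraph can be compared numerically with a short sketch. The function name `plotting_positions` and the rule labels `"k/n"`, `"hazen"` and `"weibull"` are hypothetical names picked for this example.

```python
def plotting_positions(n, rule="hazen"):
    """Probabilities attached to the n ordered observations.

    rule="k/n"     -> k / n          (last position is exactly 1)
    rule="hazen"   -> (k - 0.5) / n
    rule="weibull" -> k / (n + 1)    (equal gaps, including to 0 and 1)
    """
    ks = range(1, n + 1)
    if rule == "k/n":
        return [k / n for k in ks]
    if rule == "hazen":
        return [(k - 0.5) / n for k in ks]
    if rule == "weibull":
        return [k / (n + 1) for k in ks]
    raise ValueError(f"unknown rule: {rule!r}")


for rule in ("k/n", "hazen", "weibull"):
    print(rule, plotting_positions(4, rule))
```

For n = 4 the last position under k / n is 1, whose quantile is infinite for an unbounded distribution such as the normal, whereas the other two rules keep every position strictly inside (0, 1).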

Many other choices have been suggested, both formal and heuristic, based on theory or simulations relevant in context. The following subsections discuss some of these. A narrower question is choosing a maximum (estimation of a population maximum), known as the German tank problem, for which similar "sample maximum, plus a gap" solutions exist, most simply m + m/n − 1, where m is the sample maximum and n is the sample size. A more formal application of this uniformization of spacing occurs in maximum spacing estimation of parameters.

Expected value of the order statistic for a uniform distribution

The k / (n + 1) approach equals that of plotting the points according to the probability that the last of (n + 1) randomly drawn values will not exceed the k-th smallest of the first n randomly drawn values.^[7]^[8]
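The connection to the heading of this subsection can be made explicit with a short derivation. The notation U_(k) for the k-th smallest of n independent Uniform(0, 1) draws and X_(k) for the k-th smallest of the first n draws from a continuous distribution is introduced here for illustration.

```latex
% Expected value of the k-th uniform order statistic
U_{(k)} \sim \mathrm{Beta}(k,\; n + 1 - k)
\quad\Longrightarrow\quad
\operatorname{E}\bigl[U_{(k)}\bigr]
  = \frac{k}{k + (n + 1 - k)}
  = \frac{k}{n + 1}.

% Rank argument for the probability statement: with no ties, the rank of
% X_{n+1} among all n + 1 draws is equally likely to be 1, 2, ..., n + 1,
% and X_{n+1} \le X_{(k)} holds exactly when that rank is at most k.
\Pr\bigl(X_{n+1} \le X_{(k)}\bigr) = \frac{k}{n + 1}.
```

Both routes give the same value, which is why k / (n + 1) can be read either as an expected uniform order statistic or as a non-exceedance probability.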

Expected value of the order statistic for a standard normal distribution

In using a normal probability plot, the quantiles one uses are the rankits, the quantile of the expected value of the order statistic of a standard normal distribution.

More generally, the Shapiro–Wilk test uses the expected values of the order statistics of the given distribution; the resulting plot and line yield the generalized least squares estimate for location and scale (from the intercept and slope of the fitted line).^[9] Although this is not too important for the normal distribution (the location and scale are estimated by the mean and standard deviation, respectively), it can be useful for many other distributions.

However, this requires calculating the expected values of the order statistic, which may be difficult if the distribution is not normal.

Median of the order statistics

Alternatively, one may use estimates of the median of the order statistics, which one can compute based on estimates of the median of the order statistics of a uniform distribution and the quantile function of the distribution; this was suggested by Filliben (1975).^[9]

This can be easily generated for any distribution for which the quantile function can be computed, but conversely the resulting estimates of location and scale are no longer precisely the least squares estimates, though these only differ significantly for small n.
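For readers who want to see the uniform medians themselves, they can be computed numerically rather than estimated. The sketch below is not the approach taken in the cited sources: it assumes the standard fact that the k-th smallest of n uniform draws is at most u exactly when at least k of the draws fall at or below u, and it finds the median by bisection. The function names are hypothetical.

```python
from math import comb
from statistics import NormalDist


def uniform_order_stat_cdf(u, k, n):
    """P(U_(k) <= u): at least k of n uniform draws are at or below u."""
    return sum(comb(n, j) * u**j * (1.0 - u) ** (n - j) for j in range(k, n + 1))


def uniform_order_stat_median(k, n, tol=1e-12):
    """Median of the k-th uniform order statistic, found by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if uniform_order_stat_cdf(mid, k, n) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Medians of the normal order statistics for n = 5: apply the quantile
# function of the target distribution to the uniform medians.
n = 5
uniform_medians = [uniform_order_stat_median(k, n) for k in range(1, n + 1)]
normal_medians = [NormalDist().inv_cdf(u) for u in uniform_medians]
```

The last step works for any distribution whose quantile function is available, since a quantile function is monotone and therefore maps medians to medians.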

Heuristics

Several different formulas have been used or proposed as affine symmetrical plotting positions. Such formulas have the form (k − a) / (n + 1 − 2a) for some value of a in the range from 0 to 1, which gives a range between k / (n + 1) and (k − 1) / (n − 1).

Expressions include:

k / (n + 1)
(k − 0.3) / (n + 0.4).^[10]
(k − 0.3175) / (n + 0.365).^[11]^{[note 1]}
(k − 0.326) / (n + 0.348).^[12]
(k − ⅓) / (n + ⅓).^{[note 2]}
(k − 0.375) / (n + 0.25).^{[note 3]}
(k − 0.4) / (n + 0.2).^{[citation needed]}
(k − 0.44) / (n + 0.12).^{[note 4]}
(k − 0.5) / n.^[14]
(k − 0.567) / (n − 0.134).^{[citation needed]}
(k − 1) / (n − 1).^{[note 5]}

For large sample size, n, there is little difference between these various expressions.
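Most of the expressions listed above are instances of the single family (k − a) / (n + 1 − 2a), which a short sketch can make concrete. The function name `affine_position` is hypothetical; the values of a are read off from the listed formulas.

```python
def affine_position(k, n, a):
    """Plotting position (k - a) / (n + 1 - 2a) for 0 <= a <= 1."""
    return (k - a) / (n + 1 - 2 * a)


n = 10
for a in (0.0, 0.3, 0.375, 0.5, 1.0):
    positions = [affine_position(k, n, a) for k in range(1, n + 1)]
    # Symmetry: the k-th and (n + 1 - k)-th positions sum to 1.
    assert all(abs(p + q - 1.0) < 1e-12 for p, q in zip(positions, reversed(positions)))
    print(a, [round(p, 3) for p in positions])
```

Setting a = 0 gives k / (n + 1), a = 0.3 gives (k − 0.3) / (n + 0.4), a = 0.375 gives (k − 0.375) / (n + 0.25), a = 0.5 gives (k − 0.5) / n, and a = 1 gives (k − 1) / (n − 1). The symmetry check in the loop is what "symmetrical" refers to: positions mirror each other around 1/2.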

Filliben's estimate

The order statistic medians are the medians of the order statistics of the distribution. These can be expressed in terms of the quantile function and the order statistic medians for the continuous uniform distribution by: $N(i)=G(U(i))$ where U(i) are the uniform order statistic medians and G is the quantile function for the desired distribution. The quantile function is the inverse of the cumulative distribution function (probability that X is less than or equal to some value). That is, given a probability, we want the corresponding quantile of the cumulative distribution function.

James J. Filliben uses the following estimates for the uniform order statistic medians:^[15] $m(i)={\begin{cases}1-0.5^{1/n}&i=1\\[2ex]{\dfrac {i-0.3175}{n+0.365}}&i=2,3,\ldots ,n-1\\[2ex]0.5^{1/n}&i=n.\end{cases}}$ The reason for this estimate is that the order statistic medians do not have a simple form.
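The piecewise estimate above translates directly into code. In the sketch below the function names are hypothetical, and the standard normal is assumed as the desired distribution, so that G is `NormalDist().inv_cdf` from the Python standard library.

```python
from statistics import NormalDist


def filliben_uniform_medians(n):
    """Filliben's estimates m(1), ..., m(n) of the uniform order statistic medians."""
    # Interior points: (i - 0.3175) / (n + 0.365).
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    # End points use the separate expressions from the formula above.
    m[-1] = 0.5 ** (1.0 / n)   # i = n
    m[0] = 1.0 - m[-1]         # i = 1, equal to 1 - 0.5 ** (1 / n)
    return m


def filliben_normal_medians(n):
    """N(i) = G(m(i)) with G the standard normal quantile function (assumed)."""
    G = NormalDist().inv_cdf
    return [G(u) for u in filliben_uniform_medians(n)]


uniform_m = filliben_uniform_medians(5)
normal_m = filliben_normal_medians(5)
```

Plotting the ordered sample against `normal_m` gives a normal probability plot based on these estimated medians. For a different target distribution, only `G` would need to change.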

Software

The R programming language comes with functions to make Q–Q plots, namely qqnorm and qqplot from the stats package. The fastqq package implements faster plotting for large numbers of data points.

Notes

^ Note that this also uses a different expression for the first and last points. [1] cites the original work by Filliben (1975). This expression is an estimate of the medians of U_(k).
^ A simple (and easy to remember) formula for plotting positions; used in the BMDP statistical package.
^ This is Blom (1958)'s earlier approximation and is the expression used in MINITAB.
^ This plotting position was used by Irving I. Gringorten^[13] to plot points in tests for the Gumbel distribution.
^ Used by Filliben (1975), these plotting points are equal to the modes of U_(k).

References

Citations

^ Wilk, M.B.; Gnanadesikan, R. (1968), "Probability plotting methods for the analysis of data", Biometrika, 55 (1), Biometrika Trust: 1–17, doi:10.1093/biomet/55.1.1, JSTOR 2334448, PMID 5661047.
^ Gnanadesikan (1977), p. 199.
^ Thode (2002), Section 2.2.2, Quantile-Quantile Plots, p. 21.
^ "SR 20 – North Cascades Highway – Opening and Closing History". North Cascades Passes. Washington State Department of Transportation. October 2009. Retrieved 8 February 2009.
^ Gibbons & Chakraborti (2003), p. 144.
^ Weibull, Waloddi (1939), "The Statistical Theory of the Strength of Materials", IVA Handlingar, Royal Swedish Academy of Engineering Sciences (151).
^ Madsen, H.O.; et al. (1986), Methods of Structural Safety.
^ Makkonen, L. (2008), "Bringing closure to the plotting position controversy", Communications in Statistics – Theory and Methods, 37 (3): 460–467, doi:10.1080/03610920701653094, S2CID 122822135.
^ Testing for Normality, by Henry C. Thode, CRC Press, 2002, ISBN 978-0-8247-9613-6, p. 31.
^ Benard, A.; Bos-Levenbach, E. C. (September 1953). "The plotting of observations on probability paper". Statistica Neerlandica (in Dutch). 7: 163–173. doi:10.1111/j.1467-9574.1953.tb00821.x.
^ "1.3.3.21. Normal Probability Plot". itl.nist.gov. Retrieved 16 February 2022.
^ Distribution free plotting position, Yu & Huang.
^ Gringorten, Irving I. (1963). "A plotting rule for extreme probability paper". Journal of Geophysical Research. 68 (3): 813–814. Bibcode:1963JGR....68..813G. doi:10.1029/JZ068i003p00813. ISSN 2156-2202.
^ Hazen, Allen (1914), "Storage to be provided in the impounding reservoirs for municipal water supply", Transactions of the American Society of Civil Engineers (77): 1547–1550.
^ Filliben (1975).

Sources

This article incorporates public domain material from the National Institute of Standards and Technology.
Blom, G. (1958), Statistical estimates and transformed beta variables, New York: John Wiley and Sons
Chambers, John; Cleveland, William; Kleiner, Beat; Tukey, Paul (1983), Graphical methods for data analysis, Wadsworth
Cleveland, W.S. (1994), The Elements of Graphing Data, Hobart Press, ISBN 0-9634884-1-4
Filliben, J. J. (February 1975), "The Probability Plot Correlation Coefficient Test for Normality", Technometrics, 17 (1), American Society for Quality: 111–117, doi:10.2307/1268008, JSTOR 1268008.
Gibbons, Jean Dickinson; Chakraborti, Subhabrata (2003), Nonparametric statistical inference (4th ed.), CRC Press, ISBN 978-0-8247-4052-8
Gnanadesikan, R. (1977). Methods for Statistical Analysis of Multivariate Observations. Wiley. ISBN 0-471-30845-5.
Thode, Henry C. (2002), Testing for normality, New York: Marcel Dekker, ISBN 0-8247-9613-6

External links

Probability plot
Manuel Gimond, The empirical QQ plot (and derived Tukey mean-difference plot)
