In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the multivariate frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. The term contingency table was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation",[1] part of the Drapers' Company Research Memoirs Biometric Series I, published in 1904.
A crucial problem of multivariate statistics is finding the (direct-)dependence structure underlying the variables contained in high-dimensional contingency tables. If some of the conditional independences are revealed, then even the storage of the data can be done more efficiently (see Lauritzen (2002)). To do this, one can use information theory concepts, which gain the information only from the distribution of probability, which can be expressed easily from the contingency table by the relative frequencies.
A pivot table is a way to create contingency tables using spreadsheet software.
Suppose there are two variables, sex (male or female) and handedness (right- or left-handed). Further suppose that 100 individuals are randomly sampled from a very large population as part of a study of sex differences in handedness. A contingency table can be created to display the numbers of individuals who are male right-handed and left-handed, female right-handed and left-handed. Such a contingency table is shown below.
| Sex | Right-handed | Left-handed | Total |
|---|---|---|---|
| Male | 43 | 9 | 52 |
| Female | 44 | 4 | 48 |
| Total | 87 | 13 | 100 |
The numbers of the males, females, and right- and left-handed individuals are called marginal totals. The grand total (the total number of individuals represented in the contingency table) is the number in the bottom right corner.
The table allows users to see at a glance that the proportion of men who are right-handed is about the same as the proportion of women who are right-handed, although the proportions are not identical. The strength of the association can be measured by the odds ratio, and the population odds ratio estimated by the sample odds ratio. The significance of the difference between the two proportions can be assessed with a variety of statistical tests including Pearson's chi-squared test, the G-test, Fisher's exact test, Boschloo's test, and Barnard's test, provided the entries in the table represent individuals randomly sampled from the population about which conclusions are to be drawn. If the proportions of individuals in the different columns vary significantly between rows (or vice versa), it is said that there is a contingency between the two variables. In other words, the two variables are not independent. If there is no contingency, it is said that the two variables are independent.
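For the table above, Pearson's chi-squared statistic can be computed directly from the cell counts; a minimal pure-Python sketch (no statistics library assumed):

```python
# Pearson's chi-squared test of independence on the handedness example.
# Expected count for each cell: row_total * column_total / grand_total.

table = [[43, 9],   # male: right-handed, left-handed
         [44, 4]]   # female: right-handed, left-handed

row_totals = [sum(row) for row in table]        # [52, 48]
col_totals = [sum(col) for col in zip(*table)]  # [87, 13]
n = sum(row_totals)                             # 100

chi2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2) for j in range(2)
)

print(round(chi2, 3))  # ≈ 1.777
```

Comparing χ² ≈ 1.78 against a chi-squared distribution with one degree of freedom gives a p-value well above 0.05, consistent with the weak association visible in the table.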
The example above is the simplest kind of contingency table, a table in which each variable has only two levels; this is called a 2 × 2 contingency table. In principle, any number of rows and columns may be used. There may also be more than two variables, but higher-order contingency tables are difficult to represent visually. The relation between ordinal variables, or between ordinal and categorical variables, may also be represented in contingency tables, although such a practice is rare. For more on the use of a contingency table for the relation between two ordinal variables, see Goodman and Kruskal's gamma.
The degree of association between the two variables can be assessed by a number of coefficients. The following subsections describe a few of them. For a more complete discussion of their uses, see the main articles linked under each subsection heading.
The simplest measure of association for a 2 × 2 contingency table is the odds ratio. Given two events, A and B, the odds ratio is defined as the ratio of the odds of A in the presence of B and the odds of A in the absence of B, or equivalently (due to symmetry), the ratio of the odds of B in the presence of A and the odds of B in the absence of A. Two events are independent if and only if the odds ratio is 1; if the odds ratio is greater than 1, the events are positively associated; if the odds ratio is less than 1, the events are negatively associated.
The odds ratio has a simple expression in terms of probabilities; given the joint probability distribution

$$p_{ij} = P(X = i, Y = j), \qquad i, j \in \{1, 2\},$$

the odds ratio is

$$OR = \frac{p_{11}\,p_{22}}{p_{12}\,p_{21}}.$$
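Using the handedness table above as sample data, the sample odds ratio follows the same cross-product form; a small illustrative sketch:

```python
# Sample odds ratio for the 2 x 2 handedness table:
# OR = (n11 * n22) / (n12 * n21), the cross-product ratio of the cell counts.
n11, n12 = 43, 9   # male: right-handed, left-handed
n21, n22 = 44, 4   # female: right-handed, left-handed

odds_ratio = (n11 * n22) / (n12 * n21)
print(round(odds_ratio, 3))  # ≈ 0.434
```

A value below 1 indicates a (weak) negative association between being male and being right-handed in this sample.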
A simple measure, applicable only to the case of 2 × 2 contingency tables, is the phi coefficient (φ), defined by

$$\phi = \pm\sqrt{\frac{\chi^2}{N}},$$

where χ² is computed as in Pearson's chi-squared test, and N is the grand total of observations. φ varies from 0 (corresponding to no association between the variables) to 1 or −1 (complete association or complete inverse association), provided it is based on frequency data represented in 2 × 2 tables. Then its sign equals the sign of the product of the main diagonal elements of the table minus the product of the off-diagonal elements. φ takes on the minimum value −1.0 or the maximum value of +1.0 if and only if every marginal proportion is equal to 0.5 (and two diagonal cells are empty).[2]
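For the example table, φ can also be computed from the signed cross-product form, whose magnitude agrees with √(χ²/N); a brief sketch:

```python
from math import sqrt

# Phi coefficient for a 2 x 2 table via the signed closed form
# phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)),
# whose magnitude equals sqrt(chi2 / N).
a, b = 43, 9    # male: right-handed, left-handed
c, d = 44, 4    # female: right-handed, left-handed

phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 3))  # ≈ -0.133
```

The negative sign reflects that the product of the main-diagonal cells (43 × 4) is smaller than the product of the off-diagonal cells (9 × 44).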
Two alternatives are the contingency coefficient C and Cramér's V.
The formulae for the C and V coefficients are:

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}}, \qquad V = \sqrt{\frac{\chi^2}{N(k - 1)}},$$

k being the number of rows or the number of columns, whichever is less.
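Both coefficients can be evaluated on the example table from the χ² statistic; a minimal sketch in plain Python:

```python
from math import sqrt

# Contingency coefficient C and Cramer's V for the handedness table,
# both derived from the chi-squared statistic computed from the cell counts.
table = [[43, 9], [44, 4]]
rows = [sum(r) for r in table]
cols = [sum(c) for c in zip(*table)]
n = sum(rows)

chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
           for i in range(2) for j in range(2))

C = sqrt(chi2 / (n + chi2))
k = min(len(table), len(table[0]))   # smaller of row count and column count
V = sqrt(chi2 / (n * (k - 1)))
print(round(C, 3), round(V, 3))  # ≈ 0.132 0.133
```

For a 2 × 2 table, k − 1 = 1, so V coincides with |φ|, as the output shows.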
C suffers from the disadvantage that it does not reach a maximum of 1.0; notably, the highest it can reach in a 2 × 2 table is 0.707. It can reach values closer to 1.0 in contingency tables with more categories; for example, it can reach a maximum of 0.870 in a 4 × 4 table. It should therefore not be used to compare associations in different tables if they have different numbers of categories.[3]
C can be adjusted so it reaches a maximum of 1.0 when there is complete association in a table of any number of rows and columns by dividing C by $\sqrt{\frac{k-1}{k}}$, where k is the number of rows or columns, when the table is square[citation needed], or by $\sqrt[4]{\frac{r-1}{r}\times\frac{c-1}{c}}$, where r is the number of rows and c is the number of columns.[4]
Another choice is the tetrachoric correlation coefficient, but it is only applicable to 2 × 2 tables. Polychoric correlation is an extension of the tetrachoric correlation to tables involving variables with more than two levels.
Tetrachoric correlation assumes that the variable underlying each dichotomous measure is normally distributed.[5] The coefficient provides "a convenient measure of [the Pearson product-moment] correlation when graduated measurements have been reduced to two categories."[6]
The tetrachoric correlation coefficient should not be confused with the Pearson correlation coefficient computed by assigning, say, values 0.0 and 1.0 to represent the two levels of each variable (which is mathematically equivalent to the φ coefficient).
The lambda coefficient is a measure of the strength of association of the cross tabulations when the variables are measured at the nominal level. Values range from 0.0 (no association) to 1.0 (the maximum possible association).
Asymmetric lambda measures the percentage improvement in predicting the dependent variable. Symmetric lambda measures the percentage improvement when prediction is done in both directions.
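A sketch of the asymmetric lambda for the example table, predicting handedness (the column variable) from sex (the row variable): the measure compares row-wise modal counts against the overall modal column total.

```python
# Goodman and Kruskal's lambda (asymmetric), predicting the column
# variable (handedness) from the row variable (sex):
# lambda = (sum of row-wise modal counts - modal column total)
#          / (N - modal column total)
table = [[43, 9], [44, 4]]
n = sum(map(sum, table))

modal_per_row = sum(max(row) for row in table)          # 43 + 44 = 87
modal_col_total = max(sum(col) for col in zip(*table))  # 87

lam = (modal_per_row - modal_col_total) / (n - modal_col_total)
print(lam)  # 0.0 -- knowing sex does not improve the modal prediction here
```

Lambda is zero here because "right-handed" is the modal category in every row, so knowing a person's sex never changes the best guess of handedness.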
The uncertainty coefficient, or Theil's U, is another measure for variables at the nominal level. Its values range from −1.0 (100% negative association, or perfect inversion) to +1.0 (100% positive association, or perfect agreement). A value of 0.0 indicates the absence of association.
Also, the uncertainty coefficient is conditional and an asymmetrical measure of association, which can be expressed as

$$U(X|Y) \neq U(Y|X)$$

in general.
This asymmetrical property can lead to insights not as evident in symmetrical measures of association.[7]
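An illustrative computation of U(X|Y) for the example table, treating handedness as X and sex as Y (entropies taken with natural logarithms; the ratio is independent of the logarithm base):

```python
from math import log

# Theil's uncertainty coefficient U(X|Y): the fraction of the entropy of X
# (handedness) explained by knowing Y (sex), computed from first principles.
table = [[43, 9], [44, 4]]            # rows: sex, columns: handedness
n = sum(map(sum, table))
col_totals = [sum(c) for c in zip(*table)]

def entropy(probs):
    return -sum(p * log(p) for p in probs if p > 0)

h_x = entropy([c / n for c in col_totals])  # H(X), marginal entropy
h_x_given_y = sum(                          # H(X|Y), conditional entropy
    (sum(row) / n) * entropy([v / sum(row) for v in row])
    for row in table
)

u = (h_x - h_x_given_y) / h_x
print(round(u, 3))  # ≈ 0.024, a very weak association
```

Swapping the roles of the variables (conditioning handedness on sex versus sex on handedness) generally gives a different value, which is the asymmetry described above.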
Gamma, tau-b, and tau-c are used when the categories or levels of both variables have a natural order.