Rank correlation

From Wikipedia, the free encyclopedia
Statistic comparing ordinal rankings

In statistics, a rank correlation is any of several statistics that measure an ordinal association: the relationship between rankings of different ordinal variables, or between different rankings of the same variable, where a "ranking" is the assignment of the ordering labels "first", "second", "third", etc. to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. For example, two common nonparametric tests of significance that use rank correlation are the Mann–Whitney U test and the Wilcoxon signed-rank test.

Context


If, for example, one variable is the identity of a college basketball program and another variable is the identity of a college football program, one could test for a relationship between the poll rankings of the two types of program: do colleges with a higher-ranked basketball program tend to have a higher-ranked football program? A rank correlation coefficient can measure that relationship, and the measure of significance of the rank correlation coefficient can show whether the measured relationship is small enough to likely be a coincidence.

If there is only one variable, the identity of a college football program, but it is subject to two different poll rankings (say, one by coaches and one by sportswriters), then the similarity of the two different polls' rankings can be measured with a rank correlation coefficient.

As another example, in a contingency table with low income, medium income, and high income as the row variable and educational level (no high school, high school, university) as the column variable,[1] a rank correlation measures the relationship between income and educational level.

Correlation coefficients


Some of the more popular rank correlation statistics include

  1. Spearman's ρ
  2. Kendall's τ
  3. Goodman and Kruskal's γ
  4. Somers' D

An increasing rank correlation coefficient implies increasing agreement between rankings. The coefficient lies in the interval [−1, 1] and assumes the value:

  • 1 if the agreement between the two rankings is perfect; the two rankings are the same.
  • 0 if the rankings are completely independent.
  • −1 if the disagreement between the two rankings is perfect; one ranking is the reverse of the other.

Following Diaconis (1988), a ranking can be seen as a permutation of a set of objects. Thus we can look at observed rankings as data obtained when the sample space is (identified with) a symmetric group. We can then introduce a metric, making the symmetric group into a metric space. Different metrics will correspond to different rank correlations.
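To make the metric-space view concrete, here is a minimal Python sketch (the function name is illustrative) of the Kendall tau distance, the metric that counts pairs ordered differently by two rankings and that corresponds to Kendall's τ:

```python
from itertools import combinations

def kendall_tau_distance(p, q):
    """Number of object pairs that the two rankings p and q order
    differently; a metric on the symmetric group."""
    return sum(
        1
        for i, j in combinations(range(len(p)), 2)
        if (p[i] - p[j]) * (q[i] - q[j]) < 0
    )

# Identical rankings sit at distance 0; fully reversed rankings sit at
# the maximum distance n(n-1)/2, mirroring correlations of +1 and -1.
identity = [1, 2, 3, 4]
reverse = [4, 3, 2, 1]
```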

General correlation coefficient


Kendall (1970)[2] showed that his τ (tau) and Spearman's ρ (rho) are particular cases of a general correlation coefficient.

Suppose we have a set of n objects, which are being considered in relation to two properties, represented by x and y, forming the sets of values {x_i}_{i≤n} and {y_i}_{i≤n}. To any pair of individuals, say the i-th and the j-th, we assign an x-score, denoted by a_{ij}, and a y-score, denoted by b_{ij}. The only requirement for these functions is that they be anti-symmetric, so a_{ij} = −a_{ji} and b_{ij} = −b_{ji}. (Note that in particular a_{ij} = b_{ij} = 0 if i = j.) Then the generalized correlation coefficient Γ is defined as

    \Gamma = \frac{\sum_{i,j=1}^{n} a_{ij} b_{ij}}{\sqrt{\sum_{i,j=1}^{n} a_{ij}^{2} \, \sum_{i,j=1}^{n} b_{ij}^{2}}}

Equivalently, if all coefficients are collected into matrices A = (a_{ij}) and B = (b_{ij}), with A^T = −A and B^T = −B, then

    \Gamma = \frac{\langle A, B \rangle_{\mathrm{F}}}{\|A\|_{\mathrm{F}} \, \|B\|_{\mathrm{F}}}

where ⟨A, B⟩_F is the Frobenius inner product and ‖A‖_F = √⟨A, A⟩_F the Frobenius norm. In particular, the general correlation coefficient is the cosine of the angle between the matrices A and B.

See also: Inner product space § Norms on inner product spaces
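As a sketch of the definition above (function names are made up for illustration), the following Python computes Γ from a pair of anti-symmetric score matrices; sign-based scores are included because they appear in the next section:

```python
import math

def sign_scores(ranks):
    """Anti-symmetric score matrix a_ij = sgn(rank_j - rank_i)."""
    sgn = lambda v: (v > 0) - (v < 0)
    n = len(ranks)
    return [[sgn(ranks[j] - ranks[i]) for j in range(n)] for i in range(n)]

def general_correlation(a, b):
    """Γ = Σ a_ij b_ij / sqrt(Σ a_ij² · Σ b_ij²) for two n×n
    anti-symmetric score matrices given as lists of lists."""
    n = len(a)
    num = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))
    den = math.sqrt(
        sum(x * x for row in a for x in row)
        * sum(x * x for row in b for x in row)
    )
    return num / den
```

With `a = sign_scores([1, 2, 3])` and `b = sign_scores([1, 3, 2])`, Γ comes out to 1/3, which is Kendall's τ for those ranks.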

Kendall's τ as a particular case


If r_i and s_i are the ranks of the i-th member according to the x-quality and y-quality respectively, then we can define

    a_{ij} = \operatorname{sgn}(r_{j} - r_{i}), \quad b_{ij} = \operatorname{sgn}(s_{j} - s_{i}).

Summed over all ordered pairs, Σ a_{ij} b_{ij} is twice the number of concordant pairs minus twice the number of discordant pairs (see Kendall tau rank correlation coefficient). The sum Σ a_{ij}² is n(n − 1), the number of nonzero terms a_{ij}, as is Σ b_{ij}². Thus in this case,

    \Gamma = \frac{2\,\bigl((\text{number of concordant pairs}) - (\text{number of discordant pairs})\bigr)}{n(n-1)} = \text{Kendall's } \tau
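A direct pair-counting version of this formula, as a short Python sketch assuming tie-free ranks:

```python
from itertools import combinations

def kendall_tau(r, s):
    """Kendall's τ = 2(C − D) / (n(n − 1)) for tie-free rank lists,
    where C and D count concordant and discordant pairs."""
    n = len(r)
    c = d = 0
    for i, j in combinations(range(n), 2):
        prod = (r[j] - r[i]) * (s[j] - s[i])
        if prod > 0:
            c += 1   # concordant: both rankings order i, j the same way
        elif prod < 0:
            d += 1   # discordant: the rankings disagree on i, j
    return 2 * (c - d) / (n * (n - 1))
```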

Spearman's ρ as a particular case


If r_i and s_i are the ranks of the i-th member according to the x-quality and y-quality respectively, we may consider the matrices a, b ∈ M(n×n; ℝ) defined by

    a_{ij} := r_{j} - r_{i}
    b_{ij} := s_{j} - s_{i}

The sums Σ a_{ij}² and Σ b_{ij}² are equal, since both r_i and s_i range from 1 to n. Hence

    \Gamma = \frac{\sum_{i,j} (r_{j} - r_{i})(s_{j} - s_{i})}{\sum_{i,j} (r_{j} - r_{i})^{2}}

To simplify this expression, let d_i := r_i − s_i denote the difference in the ranks for each i. Further, let U be a uniformly distributed discrete random variable on {1, 2, …, n}. Since the ranks r, s are just permutations of 1, 2, …, n, we can view both as random variables distributed like U. Using basic summation results from discrete mathematics, it is easy to see that for the uniformly distributed random variable U we have E[U] = (n + 1)/2 and E[U²] = (n + 1)(2n + 1)/6, and thus Var(U) = (n + 1)(2n + 1)/6 − (n + 1)²/4 = (n² − 1)/12. Now, observing symmetries allows us to compute the parts of Γ as follows:

    \begin{aligned}
    \frac{1}{n^{2}} \sum_{i,j=1}^{n} (r_{j} - r_{i})(s_{j} - s_{i})
      &= 2\left(\frac{1}{n^{2}} \cdot n \sum_{i=1}^{n} r_{i} s_{i}
         - \left(\frac{1}{n} \sum_{i=1}^{n} r_{i}\right)\left(\frac{1}{n} \sum_{j=1}^{n} s_{j}\right)\right) \\
      &= \frac{1}{n} \sum_{i=1}^{n} \left(r_{i}^{2} + s_{i}^{2} - d_{i}^{2}\right) - 2\left(\mathbb{E}[U]\right)^{2} \\
      &= \frac{1}{n} \sum_{i=1}^{n} r_{i}^{2} + \frac{1}{n} \sum_{i=1}^{n} s_{i}^{2}
         - \frac{1}{n} \sum_{i=1}^{n} d_{i}^{2} - 2\left(\mathbb{E}[U]\right)^{2} \\
      &= 2\left(\mathbb{E}[U^{2}] - (\mathbb{E}[U])^{2}\right) - \frac{1}{n} \sum_{i=1}^{n} d_{i}^{2}
    \end{aligned}

and

    \begin{aligned}
    \frac{1}{n^{2}} \sum_{i,j=1}^{n} (r_{j} - r_{i})^{2}
      &= \frac{1}{n^{2}} \sum_{i,j=1}^{n} \left(r_{i}^{2} + r_{j}^{2} - 2 r_{i} r_{j}\right) \\
      &= \frac{2}{n} \sum_{i=1}^{n} r_{i}^{2}
         - 2\left(\frac{1}{n} \sum_{i=1}^{n} r_{i}\right)\left(\frac{1}{n} \sum_{j=1}^{n} r_{j}\right) \\
      &= 2\left(\mathbb{E}[U^{2}] - (\mathbb{E}[U])^{2}\right)
    \end{aligned}

Hence

    \Gamma = 1 - \frac{\sum_{i=1}^{n} d_{i}^{2}}{2n\,\mathrm{Var}(U)}
           = 1 - \frac{6 \sum_{i=1}^{n} d_{i}^{2}}{n(n^{2} - 1)}

where d_i = r_i − s_i is the difference between ranks. This is exactly Spearman's rank correlation coefficient ρ.
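The equivalence derived above can be checked numerically. A minimal Python sketch, assuming untied integer ranks (function names are illustrative):

```python
def spearman_rho(r, s):
    """Spearman's ρ via the closed form 1 − 6 Σ d_i² / (n(n² − 1))."""
    n = len(r)
    d2 = sum((ri - si) ** 2 for ri, si in zip(r, s))
    return 1 - 6 * d2 / (n * (n * n - 1))

def gamma_difference_scores(r, s):
    """The same quantity from the general coefficient with difference
    scores a_ij = r_j − r_i and b_ij = s_j − s_i; since Σ a_ij² = Σ b_ij²
    here, Γ reduces to Σ a_ij b_ij / Σ a_ij²."""
    n = len(r)
    num = sum((r[j] - r[i]) * (s[j] - s[i]) for i in range(n) for j in range(n))
    den = sum((r[j] - r[i]) ** 2 for i in range(n) for j in range(n))
    return num / den
```

Both functions agree; for the ranks [1, 2, 3, 4] and [2, 1, 4, 3], for instance, each gives 0.6.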

Rank-biserial correlation

Main article: Mann–Whitney U test § Rank-biserial correlation

Gene Glass (1965) noted that the rank-biserial can be derived from Spearman's ρ. "One can derive a coefficient defined on X, the dichotomous variable, and Y, the ranking variable, which estimates Spearman's rho between X and Y in the same way that biserial r estimates Pearson's r between two normal variables" (p. 91). The rank-biserial correlation had been introduced nine years earlier by Edward Cureton (1956) as a measure of rank correlation when the ranks are in two groups.[3]

Kerby simple difference formula


Dave Kerby (2014) recommended the rank-biserial as the measure for introducing students to rank correlation, because the general logic can be explained at an introductory level. The rank-biserial is the correlation used with the Mann–Whitney U test, a method commonly covered in introductory college courses on statistics. The data for this test consist of two groups, and for each member of the groups the outcome is ranked for the study as a whole.

Kerby showed that this rank correlation can be expressed in terms of two concepts: the percent of data that support a stated hypothesis, and the percent of data that do not support it. The Kerby simple difference formula states that the rank correlation can be expressed as the difference between the proportion of favorable evidence (f) minus the proportion of unfavorable evidence (u).

    r = f - u

Example and interpretation


To illustrate the computation, suppose a coach trains long-distance runners for one month using two methods. Group A has 5 runners, and Group B has 4 runners. The stated hypothesis is that method A produces faster runners. The race to assess the results finds that the runners from Group A do indeed run faster, with the following ranks: 1, 2, 3, 4, and 6. The slower runners from Group B thus have ranks of 5, 7, 8, and 9.

The analysis is conducted on pairs, defined as a member of one group compared to a member of the other group. For example, the fastest runner in the study is a member of four pairs: (1,5), (1,7), (1,8), and (1,9). All four of these pairs support the hypothesis, because in each pair the runner from Group A is faster than the runner from Group B. There are a total of 20 pairs, and 19 pairs support the hypothesis. The only pair that does not support the hypothesis is the one formed by the runners with ranks 5 and 6, because in this pair the runner from Group B had the faster time. By the Kerby simple difference formula, 95% of the data support the hypothesis (19 of 20 pairs), and 5% do not (1 of 20 pairs), so the rank correlation is r = .95 − .05 = .90.
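The pair count in this example can be reproduced with a few lines of Python (variable names are illustrative):

```python
def rank_biserial(group_a, group_b):
    """Kerby simple difference r = f − u: the proportion of favorable
    pairs minus the proportion of unfavorable pairs, where a pair is
    favorable when the Group A runner has the lower (faster) rank."""
    pairs = [(a, b) for a in group_a for b in group_b]
    favorable = sum(1 for a, b in pairs if a < b)
    unfavorable = sum(1 for a, b in pairs if a > b)
    return (favorable - unfavorable) / len(pairs)

ranks_a = [1, 2, 3, 4, 6]   # Group A finishing ranks
ranks_b = [5, 7, 8, 9]      # Group B finishing ranks
```

For the ranks above this gives (19 − 1)/20 = 0.90, the value computed in the text.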

The maximum value for the correlation isr = 1, which means that 100% of the pairs favor the hypothesis. A correlation ofr = 0 indicates that half the pairs favor the hypothesis and half do not; in other words, the sample groups do not differ in ranks, so there is no evidence that they come from two different populations. An effect size ofr = 0 can be said to describe no relationship between group membership and the members' ranks.

In their article “Improving statistical reporting in psychology,” Anna-Lena Schubert and her colleagues (2025) set forth guidelines for statistical reporting, drawing from established recommendations and emerging best practices. In Table 5 they offered an overview of effect sizes for commonly used statistical models. In this table they cited the Kerby (2014) paper as a key resource for the application and interpretation of the rank-biserial correlation.

References

  1. ^ Kruskal, William H. (1958). "Ordinal Measures of Association". Journal of the American Statistical Association. 53 (284): 814–861. doi:10.2307/2281954. JSTOR 2281954.
  2. ^ Kendall, Maurice G. (1970). Rank Correlation Methods (4th ed.). Griffin. ISBN 978-0-85264-199-6.
  3. ^ Zar, Jerrold H. (2005). "Spearman Rank Correlation". Encyclopedia of Biostatistics. John Wiley & Sons, Ltd. doi:10.1002/0470011815.b2a15150. ISBN 978-0-470-84907-1.
