Research Article | Open Access | Peer-reviewed

Why Cohen's Kappa should be avoided as performance measure in classification

Rosario Delgado (E-mail: delgado@mat.uab.cat), Department of Mathematics, Universitat Autònoma de Barcelona, Campus de la UAB, Cerdanyola del Vallès, Spain

Xavier-Andoni Tibau, Advanced Stochastic Modelling research group, Universitat Autònoma de Barcelona, Campus de la UAB, Cerdanyola del Vallès, Spain

Both authors contributed equally to this work. Roles: Writing – original draft, Writing – review & editing.

Published: September 26, 2019
https://doi.org/10.1371/journal.pone.0222916
Abstract
We show that Cohen's Kappa and Matthews Correlation Coefficient (MCC), both widespread and well-established measures of performance in multi-class classification, are correlated in most situations, albeit they can differ in others. Indeed, although both coincide in the symmetric case, we consider different unbalanced situations in which Kappa exhibits an undesired behaviour, i.e. a worse classifier gets a higher Kappa score, differing qualitatively from that of MCC. The debate about the incoherence in the behaviour of Kappa revolves around the convenience, or not, of using a relative metric, which makes the interpretation of its values difficult. We extend these concerns by showing that its pitfalls can go even further. Through experimentation, we present a novel approach to this topic. We carry out a comprehensive study that identifies a scenario in which the contradictory behaviour between MCC and Kappa emerges. Specifically, we find that when the entropy of the elements outside the diagonal of the confusion matrix associated with a classifier decreases to zero, the discrepancy between Kappa and MCC rises, pointing to an anomalous performance of the former. We believe that this finding disqualifies Kappa from being used, in general, as a performance measure to compare classifiers.
Citation: Delgado R, Tibau X-A (2019) Why Cohen's Kappa should be avoided as performance measure in classification. PLoS ONE 14(9): e0222916. https://doi.org/10.1371/journal.pone.0222916
Editor: Quanquan Gu, UCLA, UNITED STATES
Received: February 12, 2019; Accepted: September 10, 2019; Published: September 26, 2019
Copyright: © 2019 Delgado, Tibau. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper.
Funding: The authors are supported by Ministerio de Ciencia, Innovación y Universidades del Gobierno de España, project ref. PGC2018-097848-B-I0.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Classification is one of the cornerstones of Supervised Machine Learning. In parallel to the development of different methodologies for constructing classifiers, the process of evaluating classifiers in order to compare them, and of choosing the best among those available, has caught the attention of researchers.
The introduction of an adequate performance measure for classifiers is a subject not yet closed (see [1]-[3]), and different metrics have been introduced. Some measures arise naturally in the binary case, such as Accuracy, Sensitivity, Specificity and Area Under the ROC Curve (AUC), among others, but not all of them extend well to the multi-class setting.
One that does is Accuracy (i.e. the fraction of well-predicted cases over the total), which seems the most natural measure and has been used for decades. Notwithstanding, Accuracy is not an effective measure since, among other things, it takes into account neither the distribution of the misclassifications among classes nor the marginal distributions. Other more subtle measures have been introduced in the multi-class setting to address this issue, improving efficiency and class discrimination power.
We will focus our attention on the Matthews Correlation Coefficient (MCC) and Cohen's Kappa. The former was introduced in the binary setting by Matthews ([4]) and generalized to the multi-class case in [5], and is commonly used as a reference performance measure, especially for unbalanced data sets, in different fields such as, for example, bioinformatics (see [5]-[7]). On the other hand, Kappa is a traditional measure originally designed to quantify the agreement between two judges, based on Accuracy but corrected for chance agreement. At present, its use is not limited to medicine or psychology (see, for instance, [8] and [9]); it is also widely used in other fields such as ecology ([10] and [11]), neuroscience ([12]) or machine learning, where it evaluates the agreement between the actual classes and those assigned by a classifier. In the classification literature, the discussion on Kappa is mostly focused on its suitability compared to other performance metrics; for example, in [1] Kappa has been considered jointly with 17 other performance metrics in several scenarios.
It is not an overstatement to say that Kappa is one of the most widespread measures, in use across several fields and disciplines. Nevertheless, some authors, including the introducer of the Kappa statistic himself, Jacob Cohen, warned that Kappa could be inadequate in different circumstances, specifically when an imbalanced distribution of classes is involved, i.e. when the marginal probability of one class is much greater (or smaller) than that of the others (leaving aside the literature below, with which we will deal more closely, see also [13]-[17]). According to these authors, problems arise in such situations because it is not clear how the hypothetical probability of chance agreement should be defined. In [18] and [19], the so-called Kappa paradox is described. Roughly speaking, the Kappa paradox arises because, for a fixed agreement between judges, the Kappa statistic penalizes judges with similar marginals compared with judges with different ones. The authors show several examples where this happens.
This same obstacle is extensively studied in [20]-[22]. In the latter, two separate causes of the paradox are considered: (1) the prevalence paradox arises from the fact that when the hypothetical probability of chance agreement among raters is high, even high values of the relative observed agreement (which is identical to Accuracy) produce low values of Kappa, and (2) the bias paradox is the consequence of the fact that imbalanced marginal distributions produce higher scores of Kappa. The authors claim that reporting a single agreement coefficient makes interpretation and comparison difficult. Hence, they suggest a version of Kappa corrected for bias and prevalence (PABAK), which should be used together with Kappa.
Similar conclusions emerge from [23], where the authors claim that Kappa is a relative measure of agreement, which is an inadequate characteristic for assessment in a clinical setting, specifically if a high agreement among experts leads to lower values of Kappa. Instead, they suggest using the proportion of specific agreement ([24]), which divides the agreement into a positive and a negative rate, giving professionals an absolute measure and, at the same time, information about the marginal distributions. Regarding the effect of estimating the chance agreement, Albatineh et al. ([25]) analysed 28 different similarity measures for clustering purposes; they suggest adding a correction for chance, in a specific family of coefficients, which makes some of them equivalent, regardless of how expectations are calculated. This work is extended by Warrens in [26], where a more in-depth analysis is presented and several indices are generalized: Cohen's kappa ([27]), Scott's pi ([28]), Mak's rho ([29]), Goodman and Kruskal's lambda ([30]), and Hamann's eta ([31]).
On the other hand, several authors defend that Kappa is a useful measure of agreement when its limitations are taken into account. For example, in [32] the authors defend the use of Kappa in a previous study, and warn that it is a useful measure if marginal distributions are considered. A similar conclusion was reached in [33], where it is said that although Kappa is not suitable in certain circumstances, it is better than the raw proportion. In [34] the work of [22] is expanded and the Kappa pitfalls are explained for agreement between judgments, concluding that if it is used and interpreted properly, the Kappa coefficient provides valuable information. As in previous works, they also propose using corrected versions of the coefficient. In [16] the author argues that in the case of dichotomous variables Kappa is satisfactory (although it is not in other cases); as we show in the present work, even in the binary case Kappa can exhibit unexpected behaviour. Finally, some authors ([34]) do not agree with the use of weighted versions of the statistic such as PABAK, and suggest selecting the marginal distributions to be similar.
In general, the use of Kappa is not only widespread but accepted, and its pitfalls are overcome by considering the marginal distributions and using weighted alternatives such as, for example, the one suggested by Cohen ([15]), PABAK or other alternatives ([35] and [36]).
Despite the vast amount of existing literature in the fields of medicine and psychology pointing out the threats of Kappa, when classification methods in Machine Learning experienced their boom, Cohen's Kappa was adopted as a reliable performance metric. Indeed, it is incorporated into the most widespread software packages, such as SciKit Learn [37] for Python and Caret [38] for R. What is more, in recent studies such as [39]-[42] and [12], Kappa is still used as if it were a reliable performance metric. In fact, the literature reviewed recognizes the difficulty clinical professionals have in interpreting Kappa because it is a relative measure, that is, Kappa by itself is not enough to know whether two professionals agree or disagree. This does not seem to be a problem in machine learning classification, because different methods are always compared against the same ground truth, under the same marginal distributions. Therefore, it can be argued that we are not interested in the value of Kappa itself (as clinicians are), but in the differences between classifiers evaluated against the same ground truth, so Kappa would be a reliable metric for this task. However, this is not always the case. As we show, there are scenarios in which, given the same ground truth, a better classifier can obtain a lower value of Kappa. It is important to mention that some authors also highlight the problems associated with Kappa when it is used as a performance metric in classification (see for instance [43]-[45]), although they do not perform an exhaustive analysis like the one presented here.
Clearly, marginal distributions seem to play a key role in the problems surrounding Kappa. However, a consistent and satisfactory description is lacking of the cases in which the unwanted behaviour of Kappa appears, and of how this affects its use as a performance metric for classification.
In this paper, we deepen the study of the pitfalls discussed above by analysing in detail the unwanted behaviour of Kappa from a novel perspective. Our point of view is the identification of situations in which discrepancies between its behaviour and that of MCC become evident, with the two measures going in opposite directions. Indeed, we study varied scenarios of misclassification in settings with different marginal probabilities of the categories, and how these scenarios affect the statistics Kappa and MCC, by analysing both the asymmetry and the entropy of the confusion matrix. Considering Kappa as a relative measure of agreement, we provide a mathematical framework for understanding the problems associated with it when dealing with extremely unbalanced marginal distributions, which are frequent in machine learning problems.
Our goal is to present a systematic study, both analytical and through empirical experimentation, comparing the two performance measures. To this end, we investigate the similarities and differences in the behaviour of MCC and Kappa in different scenarios. In some of them, they are strongly correlated, and we show some mathematical relations and study some limit cases. In others, however, they exhibit very different behaviour, with that of Kappa contrary to common sense, to the point that we join the detractors of its use for the assessment of classifiers. This paper is an attempt to shed some light on the identification of the latter scenarios.
The paper is organized as follows: first, we introduce some definitions and fix notation. Next, we prove that if the confusion matrix, which allows visualization of the performance of a classifier, is symmetric, then Kappa and MCC coincide. Each column of the confusion matrix represents the cases in a predicted class, while each row represents the cases in an actual class. In the sequel, we study in some detail the binary case, in which the classes are named "positive" and "negative" and the confusion matrix has a general form, where a = true positive, b = false negative, c = false positive and d = true negative, splitting the study according to whether c = 0, the scenario in which Kappa behaves consistently with MCC, or c > 0, in which the opposite happens. For each of these cases, we consider particular sub-cases and deepen their study. We also consider a pathological multi-class unbalanced situation, in which one of the classes is much more common than the others and is mainly misclassified (the family of confusion matrices ZA introduced in [2]). We also perform empirical experimentation in dimension 3, considering some families of confusion matrices, and finish with a few concluding words.
Definitions and notations
Given a generic matrix M, let M^T denote its transpose, that is, the matrix obtained from M by interchanging columns and rows. The same notation applies to vectors, which by default are column vectors. We say that matrix Q is equivalent to M, and denote it by Q ≡ M, if Q can be obtained from M by multiplying it by a positive constant.
Classification
Classification consists of assigning a case to a class (category or label) on the basis of a known set of features or characteristics. This is usually done by a classifier learned from a training dataset. From the validation process of the classifier with a testing dataset, we obtain a confusion matrix C, which takes into account actual and predicted classes of the cases in the testing dataset. To fix ideas, assume that there are N different classes labeled {1, …, N}. Then, C = (C_ij)_{i,j=1,…,N} is an N × N matrix defined by: C_ij is the number of cases in the testing dataset that belong to class i and have been assigned to class j by the classifier. Note that C_ij ≥ 0. Let S denote the sum of all the elements of C (the number of cases in the testing dataset), that is, $S = \sum_{i,j=1}^{N} C_{ij}$. In the binary case N = 2, to abbreviate notation we preferably denote C by
$$C = \begin{pmatrix} a & b \\ c & d \end{pmatrix},$$
as previously mentioned in the Introduction.
In the context of classification, Accuracy (Acc for brief) is the fraction of correctly classified cases in the testing dataset, that is, $\mathrm{Acc} = \frac{1}{S}\sum_{i=1}^{N} C_{ii}$. This performance measure is one of the most intuitive, and it is naturally extended to multi-class from binary classification. Acc mainly considers the diagonal of the confusion matrix, and does not take into account how the off-diagonal elements, corresponding to misclassification, are distributed.
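To fix ideas, here is a minimal numpy sketch that computes Acc from a confusion matrix; the two illustrative matrices below (our own examples) share the same Accuracy despite very different misclassification patterns, and the function name acc_from_cm is ours:

```python
import numpy as np

def acc_from_cm(C):
    """Accuracy: fraction of correctly classified cases, trace(C) / S."""
    C = np.asarray(C, dtype=float)
    return np.trace(C) / C.sum()

# Two classifiers with the same Accuracy but different off-diagonal structure.
C1 = np.array([[40, 5, 5], [5, 40, 5], [5, 5, 40]])
C2 = np.array([[40, 10, 0], [0, 40, 10], [10, 0, 40]])
print(acc_from_cm(C1), acc_from_cm(C2))  # both 0.8
```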
Other more subtle performance measures based on the confusion matrix have been introduced to compare classifiers. We here compare two of the most commonly used. Note that these measures are invariant for equivalent confusion matrices.
Matthews correlation coefficient
The binary case.
The Matthews Correlation Coefficient (MCC) was first introduced in the binary case by B.W. Matthews [4] to assess the performance of protein secondary structure prediction, as the φ-coefficient, which is the measure of association obtained by discretization of Pearson's correlation coefficient for two binary vectors. That is, in the binary case, MCC = φ = ρ(x, y), where x = (x_1, …, x_S)^T and y = (y_1, …, y_S)^T are the S-dimensional binary vectors defined in this way: x_i = 1 if case i actually belongs to the positive class (and x_i = 0 otherwise), while y_i = 1 if case i is assigned to the positive class by the classifier (and y_i = 0 otherwise), and ρ is Pearson's correlation coefficient defined by
$$\rho(x, y) = \frac{Cov(x, y)}{\sqrt{Var(x)\,Var(y)}}, \qquad (1)$$
where, as usual, $\bar{x} = \frac{1}{S}\sum_{i=1}^{S} x_i$ and $\bar{y} = \frac{1}{S}\sum_{i=1}^{S} y_i$, and Cov(x, y) denotes the statistical covariance of x and y, that is, $Cov(x, y) = \frac{1}{S}\sum_{i=1}^{S} (x_i - \bar{x})(y_i - \bar{y})$, and when x = y, Cov(x, x) = Var(x) is the statistical (uncorrected) variance of x. Note that the square of the φ-coefficient is related to the chi-squared statistic for the 2 × 2 contingency table, χ², by means of $\phi^2 = \chi^2 / S$. Then, using some algebra and taking into account that, by definition of the vectors x and y, the elements of the confusion matrix are
$$a = \#\{i : x_i = 1, y_i = 1\}, \quad b = \#\{i : x_i = 1, y_i = 0\}, \quad c = \#\{i : x_i = 0, y_i = 1\}, \quad d = \#\{i : x_i = 0, y_i = 0\},$$
we obtain that
$$\sum_{i=1}^{S} x_i y_i = a, \qquad \sum_{i=1}^{S} x_i = a + b, \qquad \sum_{i=1}^{S} y_i = a + c,$$
and then using $x_i^2 = x_i$ and $y_i^2 = y_i$ for any i = 1, …, S, we can rewrite (1) as
$$\mathrm{MCC}(C) = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}. \qquad (2)$$
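As a sanity check of (2), the following sketch (with illustrative counts of our choosing) computes the binary MCC from a, b, c, d and verifies that it coincides with Pearson's correlation of the two indicator vectors x and y:

```python
import numpy as np

def mcc_binary(a, b, c, d):
    """MCC from a 2x2 confusion matrix [[a, b], [c, d]], Eq. (2)."""
    num = a * d - b * c
    den = np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return num / den

# Cross-check: MCC equals Pearson's correlation of the two indicator vectors.
a, b, c, d = 40, 10, 5, 45
x = np.array([1] * (a + b) + [0] * (c + d))          # actual class (1 = positive)
y = np.array([1] * a + [0] * b + [1] * c + [0] * d)  # predicted class
print(mcc_binary(a, b, c, d), np.corrcoef(x, y)[0, 1])  # identical values
```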
The multi-class case.
In [5] the problem of evaluating the prediction of RNA secondary structure is considered in cases where some predicted pairs go into the category of "unknown" due to lack of reliability. By introducing an extended correlation coefficient that applies to any number of categories, the author makes it possible to address the problem of predicting base pairs of RNA secondary structure as a three-category problem, instead of artificially forcing it into the binary case by fixing one of the categories and then considering which cases belong and which do not belong to that category, which leads to a loss of information and a suboptimal procedure. Indeed, MCC is generalized in [5] to classification with N > 2 classes by considering the expected covariance of all categories and constructing the following extension of Pearson's correlation coefficient ρ from a pair of binary vectors to a pair of binary matrices:
$$\mathrm{MCC} = \frac{\overline{Cov}(X, Y)}{\sqrt{\overline{Cov}(X, X)\,\overline{Cov}(Y, Y)}}, \qquad (3)$$
where, if X and Y are two S × N matrices, $\overline{Cov}(X, Y)$ is defined as the average of the N covariances between the different pairs of S-dimensional binary vectors given by the same column in matrices X and Y, that is, $\overline{Cov}(X, Y) = \frac{1}{N}\sum_{k=1}^{N} Cov(x_k, y_k)$, where x_k = (X_{1k}, …, X_{Sk})^T and y_k = (Y_{1k}, …, Y_{Sk})^T are the columns k of matrices X and Y, respectively. Therefore, by defining the S × N matrices X = (X_ij)_{i,j} and Y = (Y_ij)_{i,j} in the following way: X_ij = 1 if case i actually belongs to class j (and 0 otherwise), and Y_ij = 1 if case i is assigned to class j by the classifier (and 0 otherwise), for i = 1, …, S and j = 1, …, N, we finally introduce the multi-class extension by applying (3) to these matrices, and by using some algebra and the fact that, by definition of the matrices X and Y, $\sum_{i=1}^{S} X_{ik} Y_{il} = C_{kl}$, we obtain the known expression
$$\mathrm{MCC}(C) = \frac{\sum_{k,l,m=1}^{N} \big( C_{kk} C_{lm} - C_{kl} C_{mk} \big)}{\sqrt{\sum_{k=1}^{N} \Big(\sum_{l=1}^{N} C_{kl}\Big)\Big(\sum_{k' \neq k} \sum_{l'=1}^{N} C_{k'l'}\Big)} \; \sqrt{\sum_{k=1}^{N} \Big(\sum_{l=1}^{N} C_{lk}\Big)\Big(\sum_{k' \neq k} \sum_{l'=1}^{N} C_{l'k'}\Big)}}. \qquad (4)$$

We give below a sketch of the proof of the equivalence between (3) and (4). The key identities are that, by definition of X and Y, $\sum_{j=1}^{N} X_{ij} = \sum_{j=1}^{N} Y_{ij} = 1$ for every case i, that $\sum_{i=1}^{S} X_{ik} = C_{k\bullet}$ (the number of cases actually in class k) and $\sum_{i=1}^{S} Y_{ik} = C_{\bullet k}$ (the number of cases assigned to class k), and that $\sum_{i=1}^{S} X_{ik} Y_{il} = C_{kl}$. Developing the numerator of (3) with these identities gives, up to the positive factor $\frac{1}{NS^2}$, the numerator of (4); developing the term in the denominator of (3) corresponding to X (an analogous development holds for Y) gives, up to the same factor, the corresponding term under the square root in the denominator of (4), so the factor cancels out.

Note that in the binary case, expression (4) matches (2). Indeed, when N = 2, the numerator of (4) can be written as 2(C_11 C_22 − C_21 C_12) = 2(ad − bc), while the first term under the square roots in the denominator is 2(a + b)(c + d), and the second one coincides with 2(a + c)(b + d).
Software provided by the author of [5], allowing one to perform the calculations easily, is available at http://rk.kvl.dk/.
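For reference, a short numpy sketch of the multi-class MCC; it uses the row/column-sum form, which is algebraically equivalent to the triple sum in (4), and checks it against the binary expression (2) on an illustrative 2 × 2 matrix (the function name mcc_multiclass is ours):

```python
import numpy as np

def mcc_multiclass(C):
    """Multi-class MCC, Eq. (4), via the equivalent row/column-sum form."""
    C = np.asarray(C, dtype=float)
    S = C.sum()
    row = C.sum(axis=1)   # actual class counts
    col = C.sum(axis=0)   # predicted class counts
    num = S * np.trace(C) - row @ col
    den = np.sqrt(S**2 - row @ row) * np.sqrt(S**2 - col @ col)
    return num / den

# Sanity check against the binary expression (2) on an illustrative matrix.
a, b, c, d = 40, 10, 5, 45
C = np.array([[a, b], [c, d]])
binary = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(mcc_multiclass(C), binary)  # same value
```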
Cohen's Kappa
Cohen's Kappa statistic, or simply Kappa (henceforth also denoted by κ), was originally introduced by J. Cohen [27] in the field of psychology as a measure of agreement between two judges, and has later been used in the literature as a performance measure in classification, as for example in [46]. More concretely, Kappa is used in classification as a measure of agreement between observed and predicted (or inferred) classes for the cases in a testing dataset. Its definition is:
$$\kappa(C) = \frac{P_o - P_e}{1 - P_e}, \qquad (5)$$
where $P_o = \mathrm{Acc} = \frac{1}{S}\sum_{i=1}^{N} C_{ii}$ is the relative observed agreement and $P_e$ is the hypothetical probability of chance agreement, using the values of the confusion matrix to estimate the probabilities of randomly choosing each class, that is, $P_e = \frac{1}{S^2}\sum_{i=1}^{N} C_{i\bullet}\, C_{\bullet i}$, where, as usual, we use the notations $C_{i\bullet} = \sum_{j=1}^{N} C_{ij}$ (the sum of row i) and $C_{\bullet j} = \sum_{i=1}^{N} C_{ij}$ (the sum of column j).
Both MCC and Kappa assume their theoretical maximum value of +1 when classification is perfect; the larger the metric value, the better the classifier performance. MCC ranges between −1 and +1, while Kappa does not in general, although it does in the cases considered in this work. Moreover, it is straightforward to see that they are symmetric, that is, κ(C^T) = κ(C) and MCC(C^T) = MCC(C).
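A minimal sketch of (5) computed directly from a confusion matrix, also illustrating the symmetry property κ(C^T) = κ(C) on an arbitrary example matrix of our choosing (the function name kappa_from_cm is ours):

```python
import numpy as np

def kappa_from_cm(C):
    """Cohen's Kappa, Eq. (5), computed from a confusion matrix."""
    C = np.asarray(C, dtype=float)
    S = C.sum()
    p_o = np.trace(C) / S                            # observed agreement (= Accuracy)
    p_e = (C.sum(axis=1) @ C.sum(axis=0)) / S**2     # chance agreement
    return (p_o - p_e) / (1 - p_e)

C = np.array([[40, 10, 2], [6, 30, 8], [1, 3, 50]])
print(kappa_from_cm(C), kappa_from_cm(C.T))  # Kappa is symmetric: same value
```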
The symmetric case
In the case of a symmetric confusion matrix, it is known that the Kappa statistic is equivalent to Scott's pi ([28], [47]), which is a special case of Krippendorff's alpha ([48]). Scott's pi is a statistic with the same structure as Kappa but differing from it in the definition of P_e. Hereunder, we will show that if C is a symmetric matrix, Kappa and MCC are not only consistent with each other but coincide exactly. Although this result seems to be known, we could not find a reference for it and therefore we provide its proof here.
Proposition 1. Let C = (C_ij)_{i,j=1,…,N} be a symmetric confusion matrix in the general multi-class setting, that is, C = C^T. Then, κ(C) = MCC(C).

Proof. By (4), and taking into account that C_ij = C_ji by symmetry (so that $C_{k\bullet} = C_{\bullet k}$ for every k), we can write
$$\mathrm{MCC}(C) = \frac{S \sum_{k=1}^{N} C_{kk} - \sum_{k=1}^{N} C_{k\bullet}\, C_{\bullet k}}{S^2 - \sum_{k=1}^{N} C_{k\bullet}\, C_{\bullet k}}. \qquad (6)$$
On the other hand, by symmetry we can write
$$\kappa(C) = \frac{P_o - P_e}{1 - P_e} = \frac{\frac{1}{S}\sum_{k=1}^{N} C_{kk} - \frac{1}{S^2}\sum_{k=1}^{N} C_{k\bullet}\, C_{\bullet k}}{1 - \frac{1}{S^2}\sum_{k=1}^{N} C_{k\bullet}\, C_{\bullet k}} = \frac{S \sum_{k=1}^{N} C_{kk} - \sum_{k=1}^{N} C_{k\bullet}\, C_{\bullet k}}{S^2 - \sum_{k=1}^{N} C_{k\bullet}\, C_{\bullet k}},$$
which coincides with MCC(C) by (6).
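A quick numerical illustration of Proposition 1, using straightforward numpy implementations of (4) and (5) and an arbitrary symmetric confusion matrix of our choosing:

```python
import numpy as np

def mcc_from_cm(C):
    C = np.asarray(C, dtype=float); S = C.sum()
    r, c = C.sum(axis=1), C.sum(axis=0)
    return (S * np.trace(C) - r @ c) / (np.sqrt(S**2 - r @ r) * np.sqrt(S**2 - c @ c))

def kappa_from_cm(C):
    C = np.asarray(C, dtype=float); S = C.sum()
    p_o, p_e = np.trace(C) / S, (C.sum(axis=1) @ C.sum(axis=0)) / S**2
    return (p_o - p_e) / (1 - p_e)

# A symmetric confusion matrix: Kappa and MCC coincide exactly (Proposition 1).
C_sym = np.array([[50, 4, 2], [4, 30, 6], [2, 6, 20]])
print(np.isclose(kappa_from_cm(C_sym), mcc_from_cm(C_sym)))  # True
```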
The binary case
Let C be a generic confusion matrix in dimension 2, $C = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$. By (2) and (5), we have that
$$\mathrm{MCC}(C) = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \quad \text{and} \quad \kappa(C) = \frac{2\,(ad - bc)}{(a+b)(b+d) + (a+c)(c+d)},$$
and it turns out that κ(C) is the harmonic mean of α and β, while MCC(C) is their geometric mean, where
$$\alpha = \frac{ad - bc}{(a+b)(b+d)} \quad \text{and} \quad \beta = \frac{ad - bc}{(a+c)(c+d)}.$$
That is, $\kappa(C) = \frac{2\alpha\beta}{\alpha + \beta}$ and $\mathrm{MCC}(C) = \mathrm{sign}(\alpha)\sqrt{\alpha\beta}$. As a direct consequence of the known relationship between these two means, we have that in the binary case:
$$|\kappa(C)| \le |\mathrm{MCC}(C)|. \qquad (7)$$
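The following sketch verifies numerically, on an illustrative 2 × 2 matrix of our choosing, that κ(C) and MCC(C) are respectively the harmonic and the geometric mean of α and β, and that (7) holds:

```python
import numpy as np

a, b, c, d = 40, 10, 5, 45
alpha = (a * d - b * c) / ((a + b) * (b + d))
beta = (a * d - b * c) / ((a + c) * (c + d))

kappa = 2 * alpha * beta / (alpha + beta)        # harmonic mean of alpha, beta
mcc = np.sign(alpha) * np.sqrt(alpha * beta)     # geometric mean (same sign)

# Direct computation from the 2x2 matrix, for comparison.
kappa_direct = 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
mcc_direct = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(np.isclose(kappa, kappa_direct), np.isclose(mcc, mcc_direct), abs(kappa) <= abs(mcc))
```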
Now we delve a little deeper into the relationship between the two performance measures. By the property of invariance for equivalent confusion matrices, we can split the study of the binary case into two different scenarios: c = 0 and c = 1 (the latter corresponding to c > 0). These two cases cover all the possibilities, determining a partition of the set of binary confusion matrices into two subsets with clearly differentiated behaviour. As we will see next, when c = 0 there is agreement between MCC and Kappa. What is more, MCC and Kappa are linked by means of a functional relationship (see Proposition 2 below) that easily shows the monotone relationship between them, which implies that when one of them grows or decreases, the other does the same, that is, they behave consistently. On the contrary, when c = 1 an important disagreement between the two measures becomes apparent in different particular scenarios (see Corollaries 4, 5 and 6). Indeed, in all of them it is shown that while MCC monotonically decreases as the task done by the classifier gets worse, Kappa does not.
Moreover, as the row sums are the actual numbers of cases in the testing dataset belonging to each class, we assume that they are both strictly positive, that is, a + b > 0 and c + d > 0. We also must ensure that MCC can be calculated, i.e. that we do not divide by zero. For that, the column sums must also be strictly positive, that is, we additionally assume that a + c > 0 and b + d > 0.
The c = 0 case: Agreement between MCC and Kappa
This case corresponds to perfect classification of the negative class, since there are no cases of the negative class in the testing dataset that have been classified as belonging to the positive class. Then, we assume a > 0 and d > 0. Moreover, we assume b > 0, since b = 0 corresponds to the symmetric case already studied in the previous section, in which κ(C) = MCC(C). We use the notation
$$C_0 = \begin{pmatrix} a & b \\ 0 & d \end{pmatrix}.$$
We have, then,
$$\mathrm{MCC}(C_0) = \sqrt{\frac{ad}{(a+b)(b+d)}} \quad \text{and} \quad \kappa(C_0) = \frac{2ad}{(a+b)(b+d) + ad}.$$
We will show that in this case there is agreement between the behaviour of the two measures. Indeed, they are linked by means of a functional relationship, as can be seen in the next proposition.
Proposition 2. $\kappa(C_0) = \dfrac{2\,\mathrm{MCC}(C_0)^2}{1 + \mathrm{MCC}(C_0)^2}$, and the following properties hold:

- Since MCC(C_0) > 0, κ(C_0) is a monotonically increasing function of MCC(C_0), so they are consistent performance measures.
- κ(C_0) ≤ MCC(C_0).
- The maximum distance between them is achieved when MCC(C_0) ≈ 0.3, and is ≈ 0.13.

Moreover,

- Fixed a, d, $\lim_{b \to +\infty} \mathrm{MCC}(C_0) = \lim_{b \to +\infty} \kappa(C_0) = 0$, which corresponds to a scenario in which the negative class is underrepresented and cases actually in the positive class are mainly misclassified. On the other hand, $\lim_{b \to 0} \mathrm{MCC}(C_0) = \lim_{b \to 0} \kappa(C_0) = 1$, corresponding to perfect classification (see Fig 1(a)).
- Fixed b, d, $\lim_{a \to +\infty} \mathrm{MCC}(C_0) = \sqrt{\frac{d}{b+d}}$ and $\lim_{a \to +\infty} \kappa(C_0) = \frac{2d}{b + 2d}$, which corresponds to a scenario in which the negative class is underrepresented but cases actually in the positive class are mainly well classified. Note that as b → 0, both $\lim_{a \to +\infty} \kappa(C_0)$ and $\lim_{a \to +\infty} \mathrm{MCC}(C_0)$ tend to 1. On the other hand, $\lim_{a \to 0} \mathrm{MCC}(C_0) = \lim_{a \to 0} \kappa(C_0) = 0$, corresponding to complete misclassification of the positive class (see Fig 1(b)).
- The case with a, b fixed, considering MCC(C_0) and κ(C_0) as functions of d, is symmetric to the previous one, and is therefore omitted.
Fig 1. Unbalanced case with underrepresentation of the negative class, which is perfectly classified. (a) With a = d = 1, as a function of b: positive class mainly misclassified. (b) With b = d = 1, as a function of a: positive class mainly well classified.
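A short numerical check of the functional link stated in Proposition 2, using the closed forms of MCC(C_0) and κ(C_0) given above (the values of a, b, d in this sketch are illustrative):

```python
import numpy as np

def kappa_mcc_c0(a, b, d):
    """Kappa and MCC for C0 = [[a, b], [0, d]] (perfectly classified negative class)."""
    mcc = np.sqrt(a * d / ((a + b) * (b + d)))
    kappa = 2 * a * d / ((a + b) * (b + d) + a * d)
    return kappa, mcc

for b in [0.1, 1.0, 5.0, 50.0]:
    kappa, mcc = kappa_mcc_c0(a=1.0, b=b, d=1.0)
    # Functional link of Proposition 2: kappa = 2*mcc**2 / (1 + mcc**2)
    print(b, np.isclose(kappa, 2 * mcc**2 / (1 + mcc**2)))
```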
The c = 1 case: Disagreement between MCC and Kappa
This case corresponds to non-perfect classification of the negative class, since there is at least one case in the testing dataset belonging to this class that has been classified as being in the positive class. We assume b > 0 since, if b = 0, we are in the previous situation by the symmetry of MCC and Kappa. Although b = 1 corresponds to a symmetric confusion matrix, already studied, we include it in this section for the sake of completeness. We use the notation
$$C_1 = \begin{pmatrix} a & b \\ 1 & d \end{pmatrix}.$$
Then,
$$\mathrm{MCC}(C_1) = \frac{ad - b}{\sqrt{(a+b)(1+d)(a+1)(b+d)}} \quad \text{and} \quad \kappa(C_1) = \frac{2\,(ad - b)}{(a+b)(b+d) + (a+1)(1+d)}.$$
Proposition 3. If ad > b, then 0 < κ(C_1) ≤ MCC(C_1).
If ad < b, then MCC(C_1) ≤ κ(C_1) < 0.
Next we consider some particular scenarios of this case that should be explored.
- i) a = d > 0.
We use the notation
$$C_a = \begin{pmatrix} a & b \\ 1 & a \end{pmatrix}.$$
Fixed a > 0, if b > 1, the negative class is underrepresented and the positive class is mainly misclassified, while if b < 1, say b = 1/h with h > 1,
$$C_a \equiv \begin{pmatrix} ah & 1 \\ h & ah \end{pmatrix},$$
which is a confusion matrix that corresponds to underrepresentation of the positive class while it is mainly well classified (if b → 0, which is equivalent to h → +∞). Then,
$$\mathrm{MCC}(C_a) = \frac{a^2 - b}{(a+1)(a+b)} \quad \text{and} \quad \kappa(C_a) = \frac{2\,(a^2 - b)}{(a+b)^2 + (a+1)^2}.$$
From these expressions and Proposition 3, we obtain:
Corollary 4. If b < a², then 0 < κ(C_a) ≤ MCC(C_a). If b > a², then MCC(C_a) ≤ κ(C_a) < 0. Otherwise (b = a²), κ(C_a) = MCC(C_a) = 0.

Fixed a > 0, $\lim_{b \to +\infty} \mathrm{MCC}(C_a) = -\frac{1}{a+1}$, while $\lim_{b \to +\infty} \kappa(C_a) = 0$, and MCC(C_a), as a function of b, is monotonically decreasing when b increases, which agrees with the intuition, since when b monotonically increases the task done by the classifier is clearly getting worse, while κ(C_a) is not. Indeed, fixed a > 0, κ(C_a) has a global minimum at b = b_0 with
$$b_0 = a^2 + (a+1)\sqrt{a^2 + 1}.$$
See Fig 2 to observe the behaviour of MCC and Kappa, fixed a = 0.2, as a function of b.
Remark 1. Corollary 4 explains the behaviour of MCC and Kappa for a confusion matrix equivalent to C_a, according to the values of a = "true positive" = "true negative" and b = "false negative"/"false positive". In particular, fixing "true positive" = "true negative" and "false positive", we observe a contradictory behaviour between these two performance measures as b increases. Indeed, as "false negative"/"false positive" increases (implying that the negative class is underrepresented and the positive class is mainly misclassified), MCC monotonically decreases, which is reasonable, but Kappa does not. In fact, Kappa decreases for low values of b (b < b_0) but increases otherwise. This unreasonable behaviour of Kappa goes in the direction of the thesis defended in this work; a numerical sketch illustrating it is given at the end of this section. Fig 2 graphically shows this fact for the particular case a = 0.2, corresponding to a confusion matrix equivalent to
$$\begin{pmatrix} 0.2 & b \\ 1 & 0.2 \end{pmatrix}.$$
The case b > 1 with a = 1 corresponds to the matrix Z_A with A = b and dimension N = 2, which is a pathological situation that will be studied in the next section.
- ii) a > 0, d = 0.
We use the notation
$$D_a = \begin{pmatrix} a & b \\ 1 & 0 \end{pmatrix}.$$
In this case,
$$\mathrm{MCC}(D_a) = -\sqrt{\frac{b}{(a+b)(a+1)}} \quad \text{and} \quad \kappa(D_a) = \frac{-2b}{b^2 + ab + a + 1},$$
and application of Proposition 3 allows us to obtain the following result:
Corollary 5. For every b > 0, MCC(D_a) ≤ κ(D_a) < 0.

Although, fixed a > 0, MCC(D_a) is a monotonically decreasing function of b, coinciding with intuition, κ(D_a) is not, achieving its global minimum when b = √(a + 1). Moreover, fixed a > 0,
$$\lim_{b \to +\infty} \mathrm{MCC}(D_a) = -\frac{1}{\sqrt{a+1}}, \qquad \lim_{b \to +\infty} \kappa(D_a) = 0.$$
See Fig 3 to observe the behaviour of MCC and Kappa, fixed a = 1, as a function of b.
Remark 2. In Corollary 5 we can observe the behaviour of MCC and Kappa for a confusion matrix equivalent to D_a, corresponding to a scenario in which the negative class is underrepresented and the classifier systematically misclassifies this class, and generally also misclassifies the positive class if b = "false negative"/"false positive" is big. In particular, fixing "true positive" and "false positive", we observe a contradictory behaviour between MCC and Kappa as b increases: while MCC monotonically decreases, which is expected, Kappa decreases for b < √(a + 1) but increases otherwise. Again, we observe here an unreasonable behaviour of Kappa, which is graphically shown in Fig 3 for the particular case a = 1, corresponding to a confusion matrix equivalent to
$$\begin{pmatrix} 1 & b \\ 1 & 0 \end{pmatrix}.$$
- iii) d = 1, a ≥ 0.
We use the notation
$$E_a = \begin{pmatrix} a & b \\ 1 & 1 \end{pmatrix}.$$
Classification of the negative class is entirely done at random, that is, a case actually in the negative class is classified as belonging to either of the two classes with the same probability. If a, b > 1, the negative class is underrepresented. We have that
$$\mathrm{MCC}(E_a) = \frac{a - b}{\sqrt{2\,(a+b)(a+1)(b+1)}} \quad \text{and} \quad \kappa(E_a) = \frac{2\,(a - b)}{(a+b)(b+1) + 2\,(a+1)},$$
and application of Proposition 3 gives:
Corollary 6. If a > b, then 0 < κ(E_a) ≤ MCC(E_a); if a < b, then MCC(E_a) ≤ κ(E_a) < 0.

As in the previous cases with c = 1, although if we fix a > 0, then MCC(E_a) is a monotonically decreasing function of b, coinciding with intuition, we can see that κ(E_a) is not, achieving its global minimum when b = a + √2 (a + 1). Moreover, fixed a > 0,
$$\lim_{b \to +\infty} \mathrm{MCC}(E_a) = -\frac{1}{\sqrt{2\,(a+1)}}, \qquad \lim_{b \to +\infty} \kappa(E_a) = 0.$$
In Fig 4 we can observe the behaviour of MCC and Kappa, fixed a = 0.2, as a function of b.
Remark 3. Finally, Corollary 6 is dedicated to confusion matrices equivalent to E_a, which correspond to an unbalanced dataset if a, b > 1, with the negative class as minority class, randomly classified, that is, each class is imputed with the same probability to a case actually in the negative class. In addition, for fixed a = "true positive"/"true negative", when b = "false negative"/"false positive" increases, the positive class is mainly misclassified. While MCC in this situation behaves as expected and monotonically decreases, Kappa does not, increasing for b > a + √2 (a + 1). As in the previous corollaries, an unreasonable behaviour of Kappa is observed, which is shown in Fig 4 for the particular case a = 0.2, that is, for a confusion matrix equivalent to
$$\begin{pmatrix} 0.2 & b \\ 1 & 1 \end{pmatrix}.$$
Fig 2. If b > 1, the negative class is underrepresented and quite misclassified, and the positive class is mainly misclassified. (a) A zoom of the detail for b ≤ 2. (b) For b ≤ 30.

Fig 3. If b > 1, the negative class is underrepresented and systematically misclassified, and the positive class is also mainly misclassified. (a) A zoom of the detail for b ≤ 2. (b) For b ≤ 30.

Fig 4. The negative class is classified at random. If b > 1, the positive class is mainly misclassified, and the negative class is underrepresented. (a) A zoom of the detail for b ≤ 2. (b) For b ≤ 30.
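The numerical sketch announced in Remark 1 illustrates the disagreement described in Corollaries 4-6; it evaluates the closed forms (2) and (5) on the family of case i) with a = 0.2 (cf. Fig 2), the grid of b values being our illustrative choice:

```python
import numpy as np

def kappa_mcc_2x2(a, b, c, d):
    """Binary Kappa and MCC from [[a, b], [c, d]], Eqs. (5) and (2)."""
    det = a * d - b * c
    mcc = det / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    kappa = 2 * det / ((a + b) * (b + d) + (a + c) * (c + d))
    return kappa, mcc

# Case i) of the c = 1 scenario: C = [[a, b], [1, a]] with a = 0.2 fixed (cf. Fig 2).
a = 0.2
for b in [0.5, 1.0, 1.26, 2.0, 5.0, 30.0]:
    kappa, mcc = kappa_mcc_2x2(a, b, 1.0, a)
    print(f"b = {b:5.2f}   MCC = {mcc:+.3f}   Kappa = {kappa:+.3f}")
# MCC decreases monotonically with b, while Kappa reaches a minimum
# (near b = 1.26 for a = 0.2) and then increases back towards 0.
```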
The Z_A family
Finally, we consider another situation that highlights the incoherent behaviour of Kappa. {Z_A, A ≥ 0} has been introduced in [2] as a family of confusion matrices useful for analysing performance measures in unbalanced situations: it describes an unbalanced scenario in which one class is much more common than the others and is mainly misclassified (see [2] for the precise definition in dimension N); in dimension N = 2,
$$Z_A = \begin{pmatrix} 1 & A \\ 1 & 1 \end{pmatrix}.$$
We denote by MCC(A) and κ(A), respectively, the MCC and Kappa values of the matrix Z_A. Note that when N = 2, this family is a particular case of iii) with a = 1 and b = A. Then, we obtain from Corollary 6 the following result:
Corollary 7. If N = 2,
$$\mathrm{MCC}(A) = \frac{1 - A}{2\,(A + 1)} \quad \text{and} \quad \kappa(A) = \frac{2\,(1 - A)}{(A+1)^2 + 4}.$$
Although MCC(A) is a monotonically decreasing function of A, coinciding with intuition, κ(A) is not, achieving its global minimum when A = 1 + 2√2 ≈ 3.83. Moreover,
$$\lim_{A \to +\infty} \mathrm{MCC}(A) = -\frac{1}{2}, \qquad \lim_{A \to +\infty} \kappa(A) = 0.$$
We generalize the previous result to anyN ≥ 2 in the following proposition:
Proposition 8. For any N ≥ 2, MCC(A) and κ(A) admit explicit expressions as functions of A, with finite limits as A → +∞, and the following properties hold:

- MCC(A) is monotonically decreasing, while κ(A) is not. Indeed, κ(A) is a convex function of A, achieving its global minimum, which is a negative value, at a finite value of A.
- The divergence between MCC(A) and κ(A) increases monotonically as A → ∞.
Fig 5 shows the behaviour of MCC and Kappa as functions of A, in the cases N = 2 (both for A ≤ 5 and for A ≤ 100), N = 5 and N = 10. A desirable property of any performance measure is its internal coherence, which implies that if the classifier moves gradually towards a worsening of the classification process, as is the case when A increases for the family Z_A, the measure must reflect this fact with a consequent monotonic decrease (or increase, depending on the interpretation of the measure). Fig 5 highlights the incoherent behaviour of Kappa: as we monotonically increase A, it does not exhibit a monotonic decrease (as MCC does), and this anomaly not only happens in the binary case (N = 2), but continues to occur when we increase N above 2, although at a different scale. Therefore, we have seen that MCC shows internal coherence, unlike Kappa, which, after decreasing in accordance with the worsening of the classification as A increases, shows a monotonic growth that goes in just the opposite direction as A continues to increase, which is clearly inconsistent.
Fig 5. (a) N = 2, a zoom of the detail for A ≤ 5. (b) N = 2, A ≤ 100. (c) N = 5, A ≤ 500. (d) N = 10, A ≤ 1000.
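A minimal numerical sketch of Corollary 7, evaluating MCC(A) and κ(A) for Z_A in dimension N = 2 over an illustrative grid of A values:

```python
import numpy as np

def kappa_mcc_2x2(a, b, c, d):
    det = a * d - b * c
    mcc = det / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    kappa = 2 * det / ((a + b) * (b + d) + (a + c) * (c + d))
    return kappa, mcc

# Z_A in dimension N = 2, i.e. the matrix [[1, A], [1, 1]] (case iii) with a = 1, b = A).
for A in [1, 2, 4, 10, 50, 100]:
    kappa, mcc = kappa_mcc_2x2(1.0, float(A), 1.0, 1.0)
    print(f"A = {A:3d}   MCC = {mcc:+.3f}   Kappa = {kappa:+.3f}")
# MCC decreases monotonically towards -1/2, while Kappa attains a negative
# minimum (near A = 1 + 2*sqrt(2), about 3.83) and then creeps back up towards 0.
```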
Experimental results
To recapitulate, we have seen that both in the binary case with c = 1 and for the multidimensional Z_A family, as the asymmetry of the confusion matrix increases (b → +∞ and A → +∞, respectively) while its diagonal stays constant, the behaviours of Kappa and MCC differ more and more. This is in line with the proven fact that if there is perfect symmetry, then these measures match (Proposition 1). It seems natural to ask whether it is only the asymmetry that plays a determining role in the discrepancy observed in their linked behaviour (it seems that it should not be so, since the asymmetry of the matrix C_0 also increases as b → +∞, and yet the behaviours of Kappa and MCC agree), or whether, on the contrary, some other characteristic of the matrix drives this circumstance. To try to shed some light on this issue, we have carried out some empirical experimentation in dimension N = 3.
We start by introducing a measure of the asymmetry of a matrix, say Asy(M), by means of the Frobenius norm of the difference between the matrix and its transpose. That is to say, we define
$$Asy(M) = \| M - M^T \|_F = \sqrt{\sum_{i,j} \big( M_{ij} - M_{ji} \big)^2}.$$
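A minimal numpy sketch of Asy; the example matrix reproduces the off-diagonal pattern of M_1(A) described in Example (a) below, but its diagonal values and the exact arrangement of the off-diagonal entries are our illustrative choices:

```python
import numpy as np

def asy(M):
    """Asymmetry of M: Frobenius norm of M - M^T."""
    M = np.asarray(M, dtype=float)
    return np.linalg.norm(M - M.T, 'fro')

# A 3x3 matrix with off-diagonal entries {2A, A, A, 2A, A, A},
# arranged non-symmetrically (the diagonal plays no role in Asy).
A = 5.0
M = np.array([[1.0, 2 * A, A],
              [A, 1.0, A],
              [2 * A, A, 1.0]])
print(asy(M))  # 2A = 10.0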
Example (a). Let us consider a 3 × 3 matrix M_1(A), with A ≥ 1, whose diagonal is kept fixed and whose off-diagonal entries are 2A, A, A, 2A, A, A, arranged so that the matrix is not symmetric. Obviously, M_1(A) is not symmetric, with Asy(M_1(A)) = 2A, which increases with A, achieving its minimum value 2 when A = 1. We can make a graph showing the evolution of Kappa and MCC when increasing A, as shown in Fig 6, where it can be observed that the behaviour of Kappa is very similar to that of MCC. Then, asymmetry has not been enough to generate a different behaviour between them. What, then?
Fig 6. Increasing asymmetry but constant entropy.
Think about the entropy generated by the values of the matrix that lie outside the main diagonal. In general, given a set of non-negative numbers, say {n_1, …, n_r}, the Shannon entropy generated by the set can be defined by
$$Ent(\{n_1, \ldots, n_r\}) = -\sum_{i=1}^{r} p_i \log p_i, \quad \text{with } p_i = \frac{n_i}{\sum_{j=1}^{r} n_j},$$
with the convention $p_i \log p_i = 0$ if $p_i = 0$, where log denotes, as usual, the logarithm in base 2. With this definition, Ent(M_1(A)) = Ent({2A, A, A, 2A, A, A}) = 2.5, which is independent of A, so for the family of matrices M_1(A) entropy cannot play any role, since it remains constant when A varies. The same happens with the matrix C_0, for which the asymmetry increases as b → +∞ but the entropy remains constant. In other words: increasing asymmetry with constant entropy does not produce the phenomenon of inappropriate behaviour of Kappa in which we are interested.
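A sketch of the off-diagonal entropy in base 2; the example matrix uses the same off-diagonal multiset {2A, A, A, 2A, A, A} as M_1(A) (its diagonal and arrangement are our illustrative choices), recovering the constant value 2.5:

```python
import numpy as np

def off_diag_entropy(M):
    """Shannon entropy (base 2) of the off-diagonal entries, normalised to sum 1."""
    M = np.asarray(M, dtype=float)
    vals = M[~np.eye(M.shape[0], dtype=bool)]
    p = vals / vals.sum()
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return -(p * np.log2(p)).sum()

# Off-diagonal multiset {2A, A, A, 2A, A, A}: entropy is 2.5 bits for every A.
for A in [1.0, 10.0, 1000.0]:
    M = np.array([[1.0, 2 * A, A], [A, 1.0, A], [2 * A, A, 1.0]])
    print(off_diag_entropy(M))  # 2.5, independent of A
```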
Example (b). Consider now a 3 × 3 matrix M_2(A), with A > 1, whose diagonal is kept fixed. In this case, Asy(M_2(A)) increases with A, while Ent(M_2(A)) decreases, converging to 0 as A → +∞. The corresponding plots of Kappa, MCC and their difference, with respect to A, are shown in Fig 7.
Fig 7. Decreasing to zero entropy, which implies increasing asymmetry.
MCC(M_2(A)) is a decreasing function of A, while Kappa is increasing for A ≥ 4. We can therefore observe a contradictory behaviour of the two measures. Let us see this with the numerical examples in Table 1: as A increases (and then asymmetry increases while entropy decreases to zero), MCC decreases but Kappa increases.
Table 1. Values for A = 10, 25, 50, 75, 100.
Remark 4. Note that for the matrix M_2(A), MCC and Kappa diverge as A increases, as happens with the family of matrices Z_A and with the confusion matrix C_1 considered in Proposition 3 (the binary case with c = 1, in which the behaviour of Kappa appears contrary to common sense as b increases). In the three scenarios, entropy decreases to zero and the asymmetry of the confusion matrix grows to +∞. Indeed, for the matrices Z_A (as A → +∞) and C_1 (as b → +∞) we have that their off-diagonal entropy tends to 0 while Asy(Z_A) and Asy(C_1) tend to +∞.
In general, entropy of the elements outside the main diagonal and asymmetry are related in the sense given by the following lemma.
Lemma 9. Let C(A) = (C_ij(A))_{i,j=1,…,N} be a matrix of non-negative integers depending on a parameter A ∈ ℕ, and such that Ent(C(A)) > 0 for any A. Then, if the entropy of C(A) decreases to zero, the asymmetry must grow to infinity, that is,
$$\lim_{A \to +\infty} Ent(C(A)) = 0 \;\Longrightarrow\; \lim_{A \to +\infty} Asy(C(A)) = +\infty.$$

Proof: By definition of Shannon entropy, if Ent(C(A)) converges to zero, then in the limit there is no uncertainty outside the main diagonal, that is, there must exist a pair (i, j), with i ≠ j, such that
$$\lim_{A \to +\infty} \frac{C_{ij}(A)}{\sum_{r \neq s} C_{rs}(A)} = 1.$$
Then, with (r, s) = (j, i), we can write
$$\lim_{A \to +\infty} \big( C_{ij}(A) - C_{ji}(A) \big) = +\infty,$$
since
$$\lim_{A \to +\infty} \frac{C_{ji}(A)}{\sum_{r \neq s} C_{rs}(A)} = 0$$
and the off-diagonal total $\sum_{r \neq s} C_{rs}(A)$ grows to +∞ (the entries are non-negative integers and Ent(C(A)) > 0, so at least two off-diagonal entries are positive, which forces the off-diagonal total to grow in order for the entropy to vanish).

Finally, from the fact that Asy(C(A)) ≥ |C_ij(A) − C_ji(A)| → +∞, we finish the proof.
Lemma 9 confirms that what we have observed in the different examples (the confusion matrices C_1 as a function of b, Z_A and M_2(A)), in which the entropy tended to zero and the asymmetry grew towards infinity, is not a coincidence but the rule.
It remains to ask whether the role of asymmetry in the observed discrepancy between the behaviours of Kappa and MCC is cancelled out by entropy, that is, whether the phenomenon can still be observed if the asymmetry remains constant while the entropy does not decrease to zero. The negative answer is given by the following example, in which the asymmetry is constant, the entropy decreases to a positive limit, and the phenomenon of discrepancy between MCC and Kappa is no longer observed.
Example (c). Consider now a 3 × 3 matrix, with fixed diagonal, depending on A and on B = 1000 − A, for A = 0, …, 999. The corresponding plots of MCC, Kappa and their difference in absolute value are shown in Fig 8. In this setting, as in Example (a), there is agreement in the behaviour of MCC and Kappa. However, in this case there is no decrease of the entropy to zero as in Example (b). Indeed, the entropy of this matrix, with B = 1000 − A, is a monotonically decreasing function of A that converges to log(300) − log(100) > 0 as A → 1000, while its asymmetry remains constant.
Fig 8. Decreasing entropy to a positive limit and constant asymmetry.
The previous examples, in which the diagonal stays constant, show that it is not enough for the asymmetry to grow to infinity, or for the entropy to be constant or simply decreasing, for the phenomenon of discrepancy between Kappa and MCC to occur; heuristically, it seems that the entropy must decrease to zero, which at the same time implies that the asymmetry grows to infinity, by Lemma 9. At least, this is what experimentation has shown in the cases already discussed. To finish, we give two more examples in the same vein, the first corresponding to discrepancy, and the second to similarity, in the behaviours of MCC and Kappa.
Example (d). Let us consider a 3 × 3 confusion matrix with fixed diagonal, depending on A and on B = 100 − A, for A = 50, …, 100. In this case, as a function of A ∈ [50, 100], its asymmetry monotonically increases with A, and its entropy, which can be written in terms of g(A) = A(A + 1) + (100 − A)(101 − A) + 2, monotonically decreases (to zero if we increase the parameter 100). We can observe in Fig 9 that in this case the appearance of the described phenomenon of behaviour of Kappa against common sense is confirmed: for A > 50, MCC decreases and Kappa increases as A increases. By symmetry, for A < 50 we observe just the same when A decreases.
Fig 9. Entropy decreases to zero, which implies that asymmetry increases, for A increasing from 50 to 100 and for A decreasing from 50 to 0, by symmetry.
Table 2 illustrates this example numerically through a particular case in which we compare different values of A. We observe that when entropy decreases and asymmetry increases (A > 50), MCC decreases and Kappa increases, while a completely symmetrical behaviour is observed for A < 50, in accordance with Fig 9.
Table 2. Values for A = 50, 60, 70, 80, 90, 100.
Example (e). Finally, let us consider a 3 × 3 confusion matrix with fixed diagonal, depending on A ≥ 1. As a function of A ≥ 1, its asymmetry is increasing, while its entropy decreases to log(7) − 2/7 > 0 as A → +∞. In this case, MCC and Kappa agree in behaviour as A increases.
Conclusion
Accuracy is one of the most intuitive and widely used performance metrics for classification, although it is not appropriate when considering unbalanced cases. MCC and Kappa seem to correct this bias: the former was initially designed to deal with very unbalanced data, while the latter, which was not created to be a classification performance metric but is nevertheless widely used as one, takes into account the probability of getting the classification by pure chance. These two measures behave similarly in some situations. In fact, we show that they coincide precisely when the confusion matrix is perfectly symmetric. In other situations, however, their behaviour can diverge to the point that Kappa should be avoided as a performance measure for comparing classifiers, in favour of more robust measures such as MCC.
In the present work, similarities and differences between MCC and Kappa have been discussed and illustrated with synthetic confusion matrices, both in the binary and in the multi-class setting. Our mathematical analysis and heuristic study show that in situations in which the diagonal of the confusion matrix stays constant and, at the same time, the entropy of the elements outside the diagonal decreases to zero, which implies an increase in the asymmetry of the confusion matrix, the phenomenon of qualitative differentiation between the behaviour of Kappa and that of MCC clearly appears. Notwithstanding, neither increasing nor constant asymmetry seems to be enough to produce this phenomenon when the entropy does not decrease to zero. As far as we know, conclusions of this kind have not been reached before, so they represent a novelty in the study of Kappa.
From a clinical perspective, the fact that Kappa is a relative measure of agreement is problematic, since it is hard to set a threshold for good agreement. This does not seem to be a problem when it is used as a performance metric, because Kappa values are compared for each classifier given a unique ground truth, it being the relative difference, and not the value itself, that determines the best classifier. Notwithstanding, we have shown that if the marginal probabilities are really small, the distribution of the misclassifications also affects the value of Kappa, to the extent that worse classification results can nevertheless obtain higher values of the statistic. This is especially dramatic when the entropy of the elements outside the main diagonal of the confusion matrix decreases to zero.
A summary of the examples considered in this work, according to the agreement or disagreement between the behaviour of MCC and Kappa, can be found in Table 3.
Table 3. The disagreement scenario corresponds to entropy decreasing to zero, which implies, by Lemma 9, that asymmetry must grow to infinity.
The standard problems associated with Kappa are mainly related to unbalanced datasets (see for instance [36] and [17]). We show that an unbalanced situation can make Kappa not comparable between different situations, but that, to obtain counter-intuitive results, it is also necessary for the entropy of the elements outside the main diagonal to decrease to zero.
Nowadays, in the field of machine learning, such situations, in which the number of observations of one of the classes far exceeds those of the others, or in which the marginal distributions are small, are very common. Machine learning algorithms automatically scrutinize huge amounts of data, classifying them into hundreds of categories or looking for an unlikely but relevant event. In that framework, finding a dependable performance measure that is robust and reliable becomes of the utmost importance. Hence, we believe that it has been sufficiently justified that, unfortunately, Cohen's Kappa can no longer play this role, especially considering the existence of solid alternatives.
Acknowledgments
The authors wish to thank the anonymous referees for careful reading and helpful comments that resulted in an overall improvement of the paper.
References
- 1.Ferri C., Hernández-Orallo J., Modroiu R.: An experimental comparison of performance measures for classification. Pattern Recognition Letters 30(1), 27–38 (2009)
- 2.Jurman G., Riccadonna S., Furlanello C.: A comparison of MCC and CEN error measures in multi-class prediction. PLoS ONE 7(8), e41882 (2012)
- 3.Sokolova M., Lapalme G.: A systematic analysis of performance measures for classification tasks. Information Processing & Management 45(4), 427–437 (2009)
- 4.Matthews B.W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure 405(2), 442–451 (1975)
- 5.Gorodkin J.: Comparing two k-category assignments by a k-category correlation coefficient. Computational biology and chemistry 28(5-6), 367–374 (2004) pmid:15556477
- 6.Stokić D., Hanel R., Thurner S.: A fast and efficient gene-network reconstruction method from multiple over-expression experiments. BMC bioinformatics 10(1), 253 (2009) pmid:19686586
- 7.Supper, J., Spieth, C., Zell, A.: Reconstructing linear gene regulatory networks. In: European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, pp. 270–279. Springer (2007)
- 8.Blair E., Stanley F.: Interobserver agreement in the classification of cerebral palsy. Developmental Medicine & Child Neurology 27(5), 615–622 (1985)
- 9.Cameron M.L., Briggs K.K., Steadman J.R.: Reproducibility and reliability of the outerbridge classification for grading chondral lesions of the knee arthroscopically. The American journal of sports medicine 31(1), 83–86 (2003) pmid:12531763
- 10.Monserud R.A., Leemans R.: Comparing global vegetation maps with the Kappa statistic. Ecological modelling 62(4), 275–293 (1992)
- 11.Allouche O., Tsoar A., & Kadmon R.: Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS). Journal of applied ecology 43(6), 1223–1232 (2006)
- 12.Tian Y., Zhang H., Pang Y., Lin J.: Classification for single-trial N170 during responding to facial picture with emotion. Front. Comput. Neurosci. 12:68. pmid:30271337
- 13.Donker D., Hasman A., Van Geijn H.: Interpretation of low Kappa values. International journal of bio-medical computing 33(1), 55–64 (1993) pmid:8349359
- 14.Forbes A.D.: Classification-algorithm evaluation: Five performance measures based on confusion matrices. Journal of Clinical Monitoring 11(3), 189–206 (1995) pmid:7623060
- 15.Brennan R.L., Prediger D.J.: Coefficient Kappa: Some uses, misuses, and alternatives. Educational and psychological measurement 41(3), 687–699 (1981)
- 16.Maclure M., Willett W.C.: Misinterpretation and misuse of the Kappa statistic. American journal of epidemiology 126(2), 161–169 (1987) pmid:3300279
- 17.Uebersax J.S.: Diversity of decision-making models and the measurement of interrater agreement. Psychological bulletin 101(1), 140–146 (1987)
- 18.Feinstein A.R., Cicchetti D.V.: High agreement but low Kappa: I. the problems of two paradoxes. Journal of clinical epidemiology 43(6), 543–549 (1990) pmid:2348207
- 19.Cicchetti D.V., Feinstein A.R.: High agreement but low Kappa: Ii. resolving the paradoxes. Journal of clinical epidemiology 43(6), 551–558 (1990) pmid:2189948
- 20.Krippendorff K.: Reliability in content analysis: Some common misconceptions and recommendations. Human communication research 30(3), 411–433 (2004)
- 21.Warrens M.J.: A formal proof of a paradox associated with Cohen’s Kappa. Journal of Classification 27(3), 322–332 (2010)
- 22.Byrt T., Bishop J., & Carlin J. B.: Bias, prevalence and kappa. Journal of clinical epidemiology 46(5), 423–429 (1993) pmid:8501467
- 23.de Vet H.C., Mokkink L.B., Terwee C.B., Hoekstra O.S., Knol D.L.: Clinicians are right not to like Cohen’s Kappa. BMJ 346, f2125 (2013) pmid:23585065
- 24.Dice L. R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
- 25.Albatineh A. N., Niewiadomska-Bugaj M., & Mihalko D.: On similarity indices and correction for chance agreement. Journal of Classification 23(2), 301–313 (2006)
- 26.Warrens M. J.: On similarity coefficients for 2 × 2 tables and correction for chance. Psychometrika 73(3), 487 (2008) pmid:20037641
- 27.Cohen J.: A coefficient of agreement for nominal scales. Educational and psychological measurement 20(1), 37–46 (1960)
- 28.Scott W.A.: Reliability of content analysis: The case of nominal scale coding. Public opinion quarterly pp. 321–325 (1955)
- 29.Mak T. K.: Analysing intraclass correlation for dichotomous variables. Journal of the Royal Statistical Society: Series C (Applied Statistics) 37(3), 344–352 (1988)
- 30.Goodman L. A., & Kruskal W. H.: Measures of association for cross classifications III: Approximate sampling theory. Journal of the American Statistical Association, 58(302), 310–364 (1963)
- 31.Brennan R. L., & Light R. J.: Measuring agreement when two observers classify people into categories not defined in advance. British Journal of Mathematical and Statistical Psychology 27(2), 154–163 (1974)
- 32.Bexkens R., Claessen F. M., Kodde I. F., Oh L. S., Eygendaal D., & van den Bekerom M. P.: The kappa paradox. Shoulder & Elbow, 10(4), 308–308 (2018)
- 33.Viera A. J., & Garrett J. M.: Understanding interobserver agreement: the kappa statistic. Fam med 37(5), 360–363 (2005) pmid:15883903
- 34.Sim J., & Wright C. C.: The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical therapy 85(3), 257–268 (2005) pmid:15733050
- 35.Warrens M.J.: On association coefficients, correction for chance, and correction for maximum value. Journal of Modern Mathematics Frontier 2(4), 111–119 (2013)
- 36.Andrés A.M., Marzo P.F.: Delta: A new measure of agreement between two raters. British journal of mathematical and statistical psychology 57(1), 1–19 (2004) pmid:15171798
- 37.Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)
- 38.Kuhn M., et al.: Caret package. Journal of statistical software 28(5), 1–26 (2008)
- 39.Huang C., Davis L., Townshend J.: An assessment of support vector machines for land cover classification. International Journal of remote sensing 23(4), 725–749 (2002)
- 40.Duro D.C., Franklin S.E., Dubé M.G.: A comparison of pixel-based and object-based image analysis with selected machine learning algorithms for the classification of agricultural landscapes using spot-5 HRG imagery. Remote Sensing of Environment 118, 259–272 (2012)
- 41.Passos A.N., Kohara V.S., Freitas R.S.d., Vicentini A.P.: Immunological assays employed for the elucidation of an histoplasmosis outbreak in São Paulo, SP. Brazilian Journal of Microbiology 45(4), 1357–1361 (2014) pmid:25763041
- 42.Claessen F. M., van den Ende K. I., Doornberg J. N., Guitton T. G., Eygendaal D., van den Bekerom M. P., … & Wagener M.: Osteochondritis dissecans of the humeral capitellum: reliability of four classification systems using radiographs and computed tomography. Journal of shoulder and elbow surgery 24(10), 1613–1618 (2015) pmid:25953486
- 43.Powers, D.M.W.: The problem with Kappa. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 345–355. Association for Computational Linguistics (2012)
- 44.Jeni, L.A., Cohn, J.F., De La Torre, F.: Facing imbalanced data–recommendations for the use of performance metrics. In: Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, pp. 245–251. IEEE (2013)
- 45.Zhao X., Liu J.S., Deng K.: Assumptions behind intercoder reliability indices. In Salmon Charles T. (ed.) Communication Yearbook 36, 419–480. New York: Routledge (2013)
- 46.Witten I.H., Frank E., Hall M.A., Pal C.J.: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann (2016)
- 47.Krippendorff K.: Association, agreement, and equity. Quality and Quantity 21(2), 109–123 (1987)
- 48.Krippendorff K.: Content analysis: An introduction to its methodology (1980)