An intuitive way to calculate the similarity between two trees is tocalculate the resolution of their strict consensus, which corresponds tothe number of splits that occur in both trees(Schuh & Polhemus, 1980). The correspondingdistance measure, the Robinson–Foulds distance(Robinson & Foulds, 1981), counts the numberof splits that are unique to one of the two trees.
It is important to remember that counting splits is not the same ascounting clades, edges or nodes. If a tree is drawn as rooted (evenwithout a root edge), then the number of splits is one less than thenumber of edges or clades, and two less than the number of nodes. Forexample, the trees below have one split, two edges and three nodes incommon.
The simplicity of counting splits is appealing, but limited by theunderlying assumption that all splits are equivalent.
As an example, a split that separates eight leaves into two sets offour (as in the right-hand tree above) has a\(\frac{1}{35}\) chance of being compatiblewith the reference tree. In contrast, a split that separates two leavesfrom the other six has a\(\frac{1}{7}\) chance of matching thereference tree: the similarity observed is five times more likely tohave arisen by chance. In other words, failure to match an even split isless noteworthy than failure to match an uneven one.
As a consequence, trees whose splits are less even will, on average,exhibit higher Robinson–Foulds distances with comparison trees. Comparea balanced and an unbalanced eight-taxon tree:
Each tree divides the eight taxa into five splits. Thephylogenetic information content of a splitis a function of the probability that the split will match a uniformlychosen random tree, i.e. the proportion of eight-leaf binary trees thatcontain the split in question. (Information content, in bits, is definedas\(-\log_2(\textrm{probability})\).)This, in turn, is a function of the evenness of the split:
| Matching trees | P(Match in random tree) | Phylogenetic information content | |
|---|---|---|---|
| Split size: 2:6 | 945 / 10 395 | 0.0909 | 3.46 bits |
| Split size: 3:5 | 315 / 10 395 | 0.0303 | 5.04 bits |
| Split size: 4:4 | 225 / 10 395 | 0.0216 | 5.53 bits |
In the first tree, split 1 is even, dividing four taxa from fourothers (4|4); splits 2–5 are maximally uneven(2|6). If each split is treated as independent, then thetotal information content of the five splits is 19.37 bits, whereas thatof the five splits in the second tree, of sizes2|6,3|5,4|4,3|5 and2|6, is 22.54 bits. Put another way, a random tree will onaverage share more splits with the balanced tree (whose splits arepredominantly uneven and thus likely to be matched) than the asymmetrictree (which contains more even splits that are less likely to occur in arandom tree).
Indeed, of the 10 395 eight-leaf trees, many more bear at least onesplit in common with a balanced tree than with an asymmetric tree:
This differing information content can be accommodated by weightingeach split according to the amount of phylogenetic information itcontains(Smith, 2020). The two tree pairsbelow both have a Robinson–Foulds distance of two, but the first pairdiffer with regard to an uneven split (ABCDEF|GH), soobtain a total difference of 22.54 − (3.46 + 5.04 + 5.53 + 5.04) =3.46 bits:
tree1<- ape::read.tree(text='(1, (2, (3, (4, (5, (6, (7, 8)))))));')tree2<- ape::read.tree(text='(1, (2, (3, (4, (5, (7, (6, 8)))))));')tree3<- ape::read.tree(text='(1, (2, (3, (5, (4, (6, (7, 8)))))));')VisualizeMatching(InfoRobinsonFoulds, tree1, tree2,Plot = TreeDistPlot,prune =12)whereas the second pair differ in the resolution of a more even, andthus more information-rich, split (ABCD|EFGH), and soreceive a distance score of 5.53 bits:
Even when accounting for the information content of splits in thisway, the Robinson–Foulds distance is readily saturated: the maximumvalue can be obtained by moving a single leaf.
tree1<- ape::read.tree(text='(1, (2, (3, (4, (5, (6, (7, 8)))))));')tree2<- ape::read.tree(text='(8, (1, (2, (3, (4, (5, (6, 7)))))));')VisualizeMatching(RobinsonFouldsMatching, tree1, tree2,Plot = TreeDistPlot)Generalized Robinson–Foulds distances(Böcker,Canzar, & Klau, 2013; Nye, Liò, & Gilks, 2006) seek toaddress this issue. This category of metrics aim to acknowledgesemblances between similar-but-not-quite-identical pairs of splits,which would contribute zero to tree similarity under the standardRobinson–Foulds measure.
Generalized RF distances work by finding amatching thatpairs splits from one tree with splits in the other. Each pairing isscored according to the similarity of the paired splits; the sum ofthese scores is the score of the matching. The tree distance is given bythe score of the optimal matching.
Let’s consider two trees that differ in the position of one wildcardleaf, and in the resolution of a clade:
tree1<- ape::read.tree(text='((A, B), ((C, (D, E)), (F, (G, (H, I)))));')tree2<- ape::read.tree(text='((A, B), ((C, D, (E, I)), (F, (G, H))));')Plot(tree1, tree2,highlight ='I',prune =list(8,integer(0)))These trees obtain a Robinson–Foulds distance of nine: a largedistances, as the maximum possible for trees of this resolution iseleven.AB|CDEFGHI is the only split in common between thetwo trees:
This distance score is higher than might be expected, given how muchthe trees have in common; removing the single leaf ‘I’ results in twotrees that differ only in the resolution of a single node:
This hidden similarity can be better reflected if similar, butnon-identical, splits are assigned non-zero similarity scores.
There are various ways to score the similarity between two splits.One is to build on the idea introduced above, where identical splits arescored according to their phylogenetic information content. Non-matchingsplits can be scored according to the amount of phylogenetic informationthat they hold in common, which is a function of the proportion of treesthat are consistent with both splits. (A full explanation is provided inthe discussion ofGeneralizedRobinson–Foulds distances.)
Here, the splitAB|CDEFGHI occurs in both trees, and, asit happens, makes the largest contribution to the tree similarity score(3.70) for this particular pair of trees. This is the same contributionit would have made to the information-corrected Robinson–Fouldssimilarity.
The splitABCDEF|GHI in the left-hand tree is pairedwith the splitABCDEFI|GH in the right-hand tree. HadABCDEF|GHI been available in the right-hand tree, then thisperfect match would have been assigned a similarity ofSplitInformation(3, 6) = 5.57 bits. The partial match isinstead allocated a lower score of 2.12 bits. Pairings of incompatiblesplits, i.e. those that cannot co-exist on a tree, such asABCDEFG|HI -ABCDFGH|EI, have no phylogeneticinformation in common. (clusteringinformation is an alternative way to think about split similaritythat recognizes similarity even between incompatible splits.)
The matching depicted above is one of many. It happens to be optimal:an optimal matching can be found by considering the similarity score ofeach possible pairing, and solving a linear assignment problem to findthe optimal set of pairings.
We can view the splits in each tree, named according to the number oftheir associated node:
## 6 bipartition splits dividing 9 tips, A .. I## 123456789## 11 **.......## 13 ..***....## 14 ...**....## 15 .....****## 16 ......***## 17 .......**## ## Tip 1: A Tip 2: B Tip 3: C Tip 4: D Tip 5: E ## Tip 6: F Tip 7: G Tip 8: H Tip 9: I## 5 bipartition splits dividing 9 tips, A .. I## 123456789## 11 **.......## 13 ..***...*## 14 ....*...*## 15 .....***.## 16 ......**.## ## Tip 1: A Tip 2: B Tip 3: C Tip 4: D Tip 5: E ## Tip 6: F Tip 7: G Tip 8: H Tip 9: IWe can then see the similarity scores for each pair of splits, alongwith the optimal matching:
## $matching## [1] 1 2 NA 4 5 3## ## $matchedSplits## [1] "A B | C D E F G H I => A B | C D E F G H I"## [2] "C D E | A B F G H I => C D E I | A B F G H"## [3] "F G H I | A B C D E => F G H | A B C D E I"## [4] "G H I | A B C D E F => G H | A B C D E F I"## [5] "H I | A B C D E F G .. E I | A B C D F G H"## ## $matchedScores## [1] 3.700440 3.252981 NA 3.252981 2.115477 0.000000## ## $pairScores## [,1] [,2] [,3] [,4] [,5]## [1,] 3.7004397 0.8930848 0.2410081 0.5305147 0.2410081## [2,] 0.5305147 3.2529807 0.0000000 1.1825914 0.5305147## [3,] 0.2410081 1.3785116 0.0000000 0.5305147 0.2410081## [4,] 0.8930848 0.0000000 0.0000000 3.2529807 1.3785116## [5,] 0.5305147 0.0000000 0.0000000 0.0000000 2.1154772## [6,] 0.2410081 0.0000000 0.0000000 0.0000000 0.0000000.. denotes that the fifth matching contributes zero tosimilarity score; an alternative optimal matching would leave thesesplits unpaired.