The Robinson–Foulds distance

An intuitive way to calculate the similarity between two trees is tocalculate the resolution of their strict consensus, which corresponds tothe number of splits that occur in both trees(Schuh & Polhemus, 1980). The correspondingdistance measure, the Robinson–Foulds distance(Robinson & Foulds, 1981), counts the numberof splits that are unique to one of the two trees.

It is important to remember that counting splits is not the same ascounting clades, edges or nodes. If a tree is drawn as rooted (evenwithout a root edge), then the number of splits is one less than thenumber of edges or clades, and two less than the number of nodes. Forexample, the trees below have one split, two edges and three nodes incommon.

The simplicity of counting splits is appealing, but limited by theunderlying assumption that all splits are equivalent.

As an example, a split that separates eight leaves into two sets offour (as in the right-hand tree above) has a\(\frac{1}{35}\) chance of being compatiblewith the reference tree. In contrast, a split that separates two leavesfrom the other six has a\(\frac{1}{7}\) chance of matching thereference tree: the similarity observed is five times more likely tohave arisen by chance. In other words, failure to match an even split isless noteworthy than failure to match an uneven one.

As a consequence, trees whose splits are less even will, on average,exhibit higher Robinson–Foulds distances with comparison trees. Comparea balanced and an unbalanced eight-taxon tree:

Each tree divides the eight taxa into five splits. Thephylogenetic information content of a splitis a function of the probability that the split will match a uniformlychosen random tree, i.e. the proportion of eight-leaf binary trees thatcontain the split in question. (Information content, in bits, is definedas\(-\log_2(\textrm{probability})\).)This, in turn, is a function of the evenness of the split:

	Matching trees	P(Match in random tree)	Phylogenetic information content
Split size: 2:6	945 / 10 395	0.0909	3.46 bits
Split size: 3:5	315 / 10 395	0.0303	5.04 bits
Split size: 4:4	225 / 10 395	0.0216	5.53 bits

In the first tree, split 1 is even, dividing four taxa from fourothers (4|4); splits 2–5 are maximally uneven(2|6). If each split is treated as independent, then thetotal information content of the five splits is 19.37 bits, whereas thatof the five splits in the second tree, of sizes2|6,3|5,4|4,3|5 and2|6, is 22.54 bits. Put another way, a random tree will onaverage share more splits with the balanced tree (whose splits arepredominantly uneven and thus likely to be matched) than the asymmetrictree (which contains more even splits that are less likely to occur in arandom tree).

Indeed, of the 10 395 eight-leaf trees, many more bear at least onesplit in common with a balanced tree than with an asymmetric tree:

Generalized Robinson–Foulds distances

Even when accounting for the information content of splits in thisway, the Robinson–Foulds distance is readily saturated: the maximumvalue can be obtained by moving a single leaf.

tree1<- ape::read.tree(text='(1, (2, (3, (4, (5, (6, (7, 8)))))));')tree2<- ape::read.tree(text='(8, (1, (2, (3, (4, (5, (6, 7)))))));')VisualizeMatching(RobinsonFouldsMatching, tree1, tree2,Plot = TreeDistPlot)

Generalized Robinson–Foulds distances(Böcker,Canzar, & Klau, 2013; Nye, Liò, & Gilks, 2006) seek toaddress this issue. This category of metrics aim to acknowledgesemblances between similar-but-not-quite-identical pairs of splits,which would contribute zero to tree similarity under the standardRobinson–Foulds measure.

Generalized RF distances work by finding amatching thatpairs splits from one tree with splits in the other. Each pairing isscored according to the similarity of the paired splits; the sum ofthese scores is the score of the matching. The tree distance is given bythe score of the optimal matching.

Constructing a matching

Let’s consider two trees that differ in the position of one wildcardleaf, and in the resolution of a clade:

tree1<- ape::read.tree(text='((A, B), ((C, (D, E)), (F, (G, (H, I)))));')tree2<- ape::read.tree(text='((A, B), ((C, D, (E, I)), (F, (G, H))));')Plot(tree1, tree2,highlight ='I',prune =list(8,integer(0)))

These trees obtain a Robinson–Foulds distance of nine: a largedistances, as the maximum possible for trees of this resolution iseleven.AB|CDEFGHI is the only split in common between thetwo trees:

TwoTreePlot()VisualizeMatching(RobinsonFouldsMatching, tree1, tree2)

This distance score is higher than might be expected, given how muchthe trees have in common; removing the single leaf ‘I’ results in twotrees that differ only in the resolution of a single node:

TwoTreePlot()VisualizeMatching(RobinsonFouldsMatching,drop.tip(tree1,'I'),drop.tip(tree2,'I'))

This hidden similarity can be better reflected if similar, butnon-identical, splits are assigned non-zero similarity scores.

There are various ways to score the similarity between two splits.One is to build on the idea introduced above, where identical splits arescored according to their phylogenetic information content. Non-matchingsplits can be scored according to the amount of phylogenetic informationthat they hold in common, which is a function of the proportion of treesthat are consistent with both splits. (A full explanation is provided inthe discussion ofGeneralizedRobinson–Foulds distances.)

TwoTreePlot()VisualizeMatching(SharedPhylogeneticInfo, tree1, tree2)

Here, the splitAB|CDEFGHI occurs in both trees, and, asit happens, makes the largest contribution to the tree similarity score(3.70) for this particular pair of trees. This is the same contributionit would have made to the information-corrected Robinson–Fouldssimilarity.

The splitABCDEF|GHI in the left-hand tree is pairedwith the splitABCDEFI|GH in the right-hand tree. HadABCDEF|GHI been available in the right-hand tree, then thisperfect match would have been assigned a similarity ofSplitInformation(3, 6) = 5.57 bits. The partial match isinstead allocated a lower score of 2.12 bits. Pairings of incompatiblesplits, i.e. those that cannot co-exist on a tree, such asABCDEFG|HI -ABCDFGH|EI, have no phylogeneticinformation in common. (clusteringinformation is an alternative way to think about split similaritythat recognizes similarity even between incompatible splits.)

The matching depicted above is one of many. It happens to be optimal:an optimal matching can be found by considering the similarity score ofeach possible pairing, and solving a linear assignment problem to findthe optimal set of pairings.

We can view the splits in each tree, named according to the number oftheir associated node:

summary(TreeTools::as.Splits(tree1, LETTERS[1:9]))

## 6 bipartition splits dividing 9 tips, A .. I##      123456789##  11  **.......##  13  ..***....##  14  ...**....##  15  .....****##  16  ......***##  17  .......**## ##  Tip 1: A     Tip 2: B    Tip 3: C    Tip 4: D    Tip 5: E   ##  Tip 6: F     Tip 7: G    Tip 8: H    Tip 9: I

summary(TreeTools::as.Splits(tree2, LETTERS[1:9]))

## 5 bipartition splits dividing 9 tips, A .. I##      123456789##  11  **.......##  13  ..***...*##  14  ....*...*##  15  .....***.##  16  ......**.## ##  Tip 1: A     Tip 2: B    Tip 3: C    Tip 4: D    Tip 5: E   ##  Tip 6: F     Tip 7: G    Tip 8: H    Tip 9: I

We can then see the similarity scores for each pair of splits, alongwith the optimal matching:

attributes(SharedPhylogeneticInfo(tree1, tree2,reportMatching =TRUE))

## $matching## [1]  1  2 NA  4  5  3## ## $matchedSplits## [1] "A B | C D E F G H I => A B | C D E F G H I"## [2] "C D E | A B F G H I => C D E I | A B F G H"## [3] "F G H I | A B C D E => F G H | A B C D E I"## [4] "G H I | A B C D E F => G H | A B C D E F I"## [5] "H I | A B C D E F G .. E I | A B C D F G H"## ## $matchedScores## [1] 3.700440 3.252981       NA 3.252981 2.115477 0.000000## ## $pairScores##           [,1]      [,2]      [,3]      [,4]      [,5]## [1,] 3.7004397 0.8930848 0.2410081 0.5305147 0.2410081## [2,] 0.5305147 3.2529807 0.0000000 1.1825914 0.5305147## [3,] 0.2410081 1.3785116 0.0000000 0.5305147 0.2410081## [4,] 0.8930848 0.0000000 0.0000000 3.2529807 1.3785116## [5,] 0.5305147 0.0000000 0.0000000 0.0000000 2.1154772## [6,] 0.2410081 0.0000000 0.0000000 0.0000000 0.0000000

.. denotes that the fifth matching contributes zero tosimilarity score; an alternative optimal matching would leave thesesplits unpaired.

What next?

Alternatives measures of split similarity, such asmutual clustering information, give rise tootherGeneralized Robinson–Fouldsdistances, and can be used to generate meaningfultree spaces.

References

Böcker, S., Canzar, S., & Klau, G. W. (2013). The generalizedRobinson-Foulds metric. In A. Darling & J.Stoye (Eds.),Algorithms inBioinformatics.WABI 2013.LectureNotes inComputerScience, vol 8126 (pp. 156–169).Berlin, Heidelberg: Springer. doi:10.1007/978-3-642-40453-5_13

Nye, T. M. W., Liò, P., & Gilks, W. R. (2006).A novel algorithm and web-based tool for comparing twoalternative phylogenetic trees.Bioinformatics,22(1), 117–119. doi:10.1093/bioinformatics/bti720

Robinson, D. F., & Foulds, L. R. (1981). Comparison of phylogenetictrees.Mathematical Biosciences,53(1-2), 131–147.doi:10.1016/0025-5564(81)90043-2

Schuh, R. T., & Polhemus, J. T. (1980). Analysis of taxonomiccongruence among morphological, ecological and biogeographic data setsfor theLeptopodomorpha (Hemiptera).Systematic Biology,29(1), 1–26. doi:10.1093/sysbio/29.1.1

Smith, M. R. (2020). Information theoreticGeneralizedRobinson-Foulds metrics for comparing phylogenetic trees.Bioinformatics,36(20), 5007–5013. doi:10.1093/bioinformatics/btaa614

Movatterモバイル変換

Extending the Robinson-Foulds metric

Martin R.Smith

The Robinson–Foulds distance

Information-corrected Robinson–Foulds distance

Generalized Robinson–Foulds distances

Constructing a matching

What next?

References