Inferring a Tree of Blobs with TINNIK

library(MSCquartets)

Why use TINNIK?

TINNIK infers a “tree of blobs” under the network multispecies coalescent (NMSC) model. The resulting unrooted topological tree partially depicts the species relationships that led to a collection of gene trees, showing only the cut edges of the network (ones which join otherwise unconnected pieces of the network). Any cycles or more complicated blobs formed by reticulations in the network are contracted to multifurcations (polytomies) in this tree.

The tree of blobs thus isolates those parts of the network where reticulations have made relationships non-tree-like. A researcher might then apply other methods to investigate the structure of each blob, perhaps by reducing to consideration of a smaller number of taxa. However, with current methods, inferring complicated blob structure accurately may be difficult or impossible. Since the TINNIK algorithm is statistically consistent under the NMSC model regardless of the unknown blob structure, it may provide the strongest network inference possible without making assumptions on the unknown network structure.

Many current methods of network inference are statistically consistent only if the unknown network is level-1 (each blob is as simple as possible, containing only a single reticulation). They can only return a level-1 network as output, and may give no indication as to whether a level-1 network is an adequate model for the data. When a level-1 method is used for network inference, checking that TINNIK’s results are consistent with the level-1 output can provide a researcher some justification for that assumption.

Preparing the input data

TINNIK requires input in the form of a collection of gene trees on a common collection of taxa. Thus from multilocus sequence data it is necessary to first align each gene’s sequences and use standard phylogenetic methods and software (e.g., IQ-TREE, RAxML) to obtain an unrooted topological tree for each gene. While these trees are not strictly data, as they are themselves inferred, TINNIK treats them as such. If the gene trees are rooted, or have edge lengths, that information is simply ignored.

We work with an example data set of 1730 gene trees of 7 Papionini species, extracted from Vanderpool et al. (2020) and analyzed by TINNIK in Allman et al. (2024). Gene trees in Newick format are read from a text file to a multiPhylo object using an ape function:

# read text file of gene trees supplied with MSCquartets package
gts=read.tree(file =system.file("extdata","dataPapioniniVanderpool",package="MSCquartets"))

Notes:

While some missing taxa on gene trees can be handled by TINNIK, it is necessary that each subset of 4 taxa appears on at least 1 gene tree and desirable that it appears on many. Statistical tests are performed for each such set, and the amount of data for these is determined by the count of trees displaying the set.

All understanding of TINNIK’s statistical behavior is established by assuming that the input gene trees are a true sample under the NMSC model. In practice, inferred gene trees are used, which likely contain some error. In particular, as with other network inference methods using inferred gene trees as input, widespread lack of resolution in the gene trees is unlikely to lead to a good analysis.
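The coverage requirement in the first note can be checked before any analysis. The following is a minimal base-R sketch on a hypothetical presence/absence matrix (the matrix, the taxon names "a" to "e", and the variable names are all made up for illustration and are not part of MSCquartets). It counts, for each set of 4 taxa, the gene trees on which all 4 taxa are present.

```r
# Hypothetical presence/absence matrix: rows = gene trees, columns = taxa.
# (Illustrative values only; not derived from the Papionini data.)
presence <- rbind(
  c(1, 1, 1, 1, 1),
  c(1, 1, 1, 1, 0),
  c(0, 1, 1, 1, 1),
  c(1, 1, 0, 1, 1)
)
colnames(presence) <- c("a", "b", "c", "d", "e")

# All 4-taxon subsets, one per column.
subsets <- combn(colnames(presence), 4)

# For each subset, count the gene trees on which all 4 taxa are present.
coverage <- apply(subsets, 2, function(s) {
  sum(rowSums(presence[, s, drop = FALSE]) == 4)
})
names(coverage) <- apply(subsets, 2, paste, collapse = ",")

print(coverage)

# Subsets that appear on no gene tree; this should be empty.
uncovered <- names(coverage)[coverage == 0]
print(uncovered)
```

For real data, a matrix like this could be built from the tip labels of each gene tree (the tip.label element of an ape phylo object). Sets with low coverage may deserve attention, since their tests rest on little data.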

An initial analysis

The first steps of TINNIK are to count occurrences of each quartet tree topology across all gene trees, for each set of 4 taxa, and then to apply two hypothesis tests to these counts to judge their fit to 4-taxon star and resolved species trees, producing two p-values, “p_star” and “p_T3”.

The easiest way to do this is to run the commands below, where we save the table of quartet counts and associated p-values as “pT”. The TINNIK command here does a full analysis, which we generally should not take as our final one, as we are using default nominal levels for the hypothesis tests.

# perform initial TINNIK analysis for gene trees, using defaults
output=TINNIK(gts)
#> Analyzing 7 taxa: Cercocebus_atys, Macaca_fascicularis, Macaca_mulatta, Macaca_nemestrina, Mandrillus_leucophaeus, Papio_anubis, Theropithecus_gelada
#> Counting occurrences of displayed quartets for 35 four-taxon subsets of 7 taxa across 1730 gene trees.
#> Warning in quartetTable(genetrees, taxanames, epsilon = epsilon): Some taxa
#> missing from some trees.
#> Applying hypothesis test for model T3 to 35 quartets.
#> Applying hypothesis test for star tree model to 35 quartets.
# save table of quartet information and p-values
pT=output$pTable

The TINNIK command produced three plots. The first is a simplex plot of the hypothesis test results, at the specified levels. Each set of 4 taxa is represented by one plotted symbol, with those near the 3 lines indicating a tree-like relationship, and those farther from the lines a putative non-tree-like one. Those near the centroid of the simplex indicate a star-like relationship. (See Allman, Mitchell, and Rhodes (2021) for a more complete explanation of these plots.)

The second simplex plot shows the outcome of the TINNIK inference rule, where some of the quartets initially viewed as tree-like (T-quartets) from the hypothesis test are subsequently flagged as coming off a blob (B-quartets). (See Allman et al. (2022) or Allman et al. (2024) for a more complete explanation.)

The final plot shows TINNIK’s inferred tree of blobs, with 5- and 4- multifurcations joined by an edge, indicating 4- and 5-blobs. Note this result is dependent on the levels of the hypothesis tests shown in the subtitle of the plot. The default values are \(\alpha=.05\) for the test with null hypothesis “the quartet has a tree-like relationship”, and \(\beta=.95\) for the test with null hypothesis “the quartet has a star-like relationship”.
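To make the star-tree null hypothesis more concrete: under the coalescent on a star species tree, the three resolved quartet topologies are expected equally often. The sketch below applies a standard chi-squared goodness-of-fit test in base R to hypothetical counts. It is meant only to illustrate the idea of a p-value like “p_star”; it is an assumption that this resembles the package's computation, and it is not the package's own code. The T3 test for tree-likeness is more involved and is not sketched here.

```r
# Hypothetical counts of the three resolved quartet topologies
# for one set of 4 taxa (illustrative values only).
counts <- c("12|34" = 600, "13|24" = 210, "14|23" = 190)

# Goodness-of-fit test to equal expected frequencies (star tree).
star_fit <- chisq.test(counts, p = rep(1/3, 3))
print(star_fit$p.value)   # very small: the star tree is rejected

# Perfectly balanced counts fit the star tree exactly.
equal_fit <- chisq.test(c(500, 500, 500), p = rep(1/3, 3))
print(equal_fit$p.value)
```

A small p-value here corresponds to a symbol plotted away from the centroid of the simplex, while balanced counts correspond to a symbol at the centroid.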

Notes:

Since this data set is relatively small, it does not take long to produce the table of quartet counts and p-values, and we could simply recalculate it for additional runs. For a large data set (many taxa, many gene trees) producing this table is the most significant factor in run-time, so saving it for reuse is wise.

The table “pT” can be printed directly in R, or printed in a nicer format with quartetTablePrint. However, its rows should not be reordered, as the TINNIK algorithm uses an indexing function to access them quickly.

Instead of the T3 hypothesis test for tree-likeness, the analysis can be run using the cut test (see the TINNIK documentation). Doing so may produce fewer initial B-quartets, and thus a more resolved tree of blobs, but is more appropriate only for what might be considered extreme network structures (see Allman et al. (2024)).

Varying the test levels

An initial run of TINNIK with default test levels is rarely a sufficient analysis. One should always vary the test levels \(\alpha\) and \(\beta\) to judge robustness of the inferred tree of blobs to their choice.

We generally have little understanding of the error that might be in the gene trees, and making different choices of the levels can be used to address this somewhat. Setting \(\alpha=0\) will lead to all quartets being judged as tree-like, and increasing \(\alpha\) will potentially increase the number considered to have putative 4-blobs. In a noisy data set, the default \(\alpha=.05\) may result in many putative 4-blobs, and a smaller value of \(\alpha\) may be needed for a useful inference. Since hypothesis testing is done on each set of 4 taxa independently, even with no gene tree error, merely having more taxa present makes the erroneous judgment of some putative 4-blobs more likely.

Similarly, a value of \(\beta=1\) will result in no quartet relationship being judged as star-like, but decreasing \(\beta\) may result in more star-like quartets.

Smaller \(\alpha\)s and/or larger \(\beta\)s tend to produce more resolution in the inferred tree of blobs.
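The effect of the levels on the initial test outcomes can be sketched with a toy table of p-values. Everything below is hypothetical: the data frame "toy", its values, and the two helper functions are made up for illustration, and the strict inequalities at the boundaries are an assumption. The sketch covers only the initial hypothesis tests, not the TINNIK inference rule that is applied afterwards.

```r
# Hypothetical p-values for five sets of 4 taxa (not from the Papionini data).
toy <- data.frame(
  p_T3   = c(0.60, 0.03, 0.015, 0.004, 0.30),
  p_star = c(1e-20, 1e-12, 1e-8, 1e-30, 0.97)
)

# Number of quartets for which tree-likeness is rejected at level alpha
# (assumed rule: reject when p_T3 < alpha).
rejected_tree <- function(tab, alpha) sum(tab$p_T3 < alpha)

# Number of quartets for which the star tree is NOT rejected at level beta
# (assumed rule: fail to reject when p_star > beta).
not_rejected_star <- function(tab, beta) sum(tab$p_star > beta)

# Larger alpha rejects tree-likeness for more quartets.
n_tree_rejected <- sapply(c(0, 0.01, 0.02, 0.05),
                          function(a) rejected_tree(toy, a))
print(n_tree_rejected)   # 0 1 2 3

# Smaller beta leaves more quartets judged as star-like.
n_star_like <- sapply(c(1, 0.95, 1e-10),
                      function(b) not_rejected_star(toy, b))
print(n_star_like)       # 0 1 2
```

With \(\alpha=0\) no quartet is rejected as tree-like, and with \(\beta=1\) none is judged star-like, matching the limiting cases described above.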

For our example data set, we see no symbols plotted near the centroid in the first simplex plot, so only values of \(\beta\) very close to 0 are likely to result in any judgment of star trees. An examination of “pT” confirms this, as the maximum in the “p_star” column is \(\approx .112e-32\). To see the effect of a tiny \(\beta\), we reuse the tabulated quartet information in “pT” and enter:

TINNIK(pT,alpha=.05,beta=1e-40)
#> Some points (green) not rejected as star, but rejected as tree.

Here the first simplex plot shows a few symbols where the star tree is now not rejected. However, as these are treated by the TINNIK algorithm as B-quartets, we see no change in the next simplex plot or the inferred tree of blobs from our initial TINNIK run.

Since the original simplex plot shows rejection of tree-likeness for some quartets plotted near the line segments, varying \(\alpha\) is likely to be more interesting. Using a smaller \(\alpha\), to potentially increase the number of quartets which are judged as tree-like, we enter:

TINNIK(pT,alpha=.01,beta=.95)

The first simplex plot indeed shows fewer red triangles (initial B-quartets) and more blue circles, while the second plot has fewer gold squares (all B-quartets) and more green “x”s, producing an inferred tree of blobs with more resolution: in fact, a fully-resolved tree.

This leads us to try a choice of \(\alpha\) intermediate to the previous ones, so we enter:

TINNIK(pT,alpha=.02,beta=.95)

Here the simplex plots and the tree of blobs differ from both of the earlier ones, showing one of the multifurcations in our initial analysis, but resolving the other. Further varying \(\alpha\) can give intervals over which these analyses are stable.
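The three runs above can be collected in a single loop that reuses the saved table. This is only a convenience sketch: it uses the same TINNIK calls shown earlier, while the names "alphas" and "runs" and the guard that skips the analysis when MSCquartets is not installed are additions made here for illustration.

```r
# Levels of alpha to compare; beta is held at its default of .95.
alphas <- c(0.01, 0.02, 0.05)

# Skip the analysis quietly if the package is not installed.
if (requireNamespace("MSCquartets", quietly = TRUE)) {
  library(MSCquartets)

  # Read the example gene trees and build the quartet table once.
  gts <- read.tree(file = system.file("extdata", "dataPapioniniVanderpool",
                                      package = "MSCquartets"))
  pT <- TINNIK(gts)$pTable

  # Re-run TINNIK at each alpha, reusing the tabulated quartet information.
  runs <- lapply(alphas, function(a) TINNIK(pT, alpha = a, beta = 0.95))
  names(runs) <- paste0("alpha=", alphas)
}
```

Each call draws its own set of plots, so the resulting trees of blobs can be compared side by side. Extending "alphas" to a finer grid is one way to look for the intervals of stability mentioned above.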

Note:

In reporting any TINNIK analysis it is essential to report the test levels used, and we strongly recommend exploring with values beyond the defaults.

References

Allman, E. S., H. Baños, J. D. Mitchell, and J. A. Rhodes. 2022. “The Tree of Blobs of a Species Network: Identifiability Under the Coalescent.” Journal of Mathematical Biology 86 (1): 10. https://doi.org/10.1007/s00285-022-01838-9.
Allman, E. S., H. Baños, J. D. Mitchell, and J. A. Rhodes. 2024. “TINNiK: Inference of the Tree of Blobs of a Species Network Under the Coalescent.” bioRxiv. https://doi.org/10.1101/2024.04.20.590418.
Allman, E. S., J. D. Mitchell, and J. A. Rhodes. 2021. “Gene Tree Discord, Simplex Plots, and Statistical Tests under the Coalescent.” Systematic Biology 71 (4): 929–42. https://doi.org/10.1093/sysbio/syab008.
Vanderpool, D., B. Q. Minh, R. Lanfear, D. Hughes, S. Murali, R. A. Harris, M. Raveendran, et al. 2020. “Primate Phylogenomics Uncovers Multiple Rapid Radiations and Ancient Interspecific Introgression.” PLOS Biology 18 (12): 1–27. https://doi.org/10.1371/journal.pbio.3000954.
