Movatterモバイル変換


[0]ホーム

URL:


Inferring a Level-1 network with the NANUQ+routines

library(MSCquartets)#> Loading required package: ape#> Loading required package: phangorn

Why use NANUQ\(^+\) routines?

NANUQ\(^+\) is a collection ofroutines for inference of level-1 network features under the networkmultispecies coalescent (NMSC) model. Notably, they do not assume thatthe entire network is level-1. Taking as input a collection of unrootedtopological gene trees, output is an unrooted (semidirected) topologicallevel-1 network that captures species relationships, displaying cyclesand cut edges (tree-like network features). Complex blobs (those thatare not cycles) may be left as multifurcations of degree greater than 3(polytomies) in this structure.

The initial step infers the unknown network’s tree of blobs, usingthe functionTINNIK (described in another vignette). Thistree partially represents species relationships, showing only cut edgesand contracting all reticulation cycles to multifurcations.Multifurcations suspected to arise from cycles can then be resolved intooptimal cycle structures usingresolveCycle. Thisresolution relies on a least-squares approach, comparing an empiricalNANUQ distance among taxon groups around a multifurcation to expecteddistances for possible cycle configurations. For cycles under a defaultsize of 10, a full search is fast, while larger cycles are processedusing a quartet-based heuristic.

The user may combine cycle resolutions for some or all of theindividual multifurcations withcombineCycle. Blobs thatare not suspected to arise from cycles can be left unresolved. When allmultifurcations are believed to be cycle-derived,resolveLevel1 both resolves all multifurcations andcombines the resolutions into fully-resolved level-1 networks.

We recommend using theNANUQ function before resolvingany multifurcations, to assess which ones may have resulted from networkcycles. For a level-1 network,NANUQ generates a splitsgraph with distinctive structural features called ‘darts’. If thelevel-1 assumption is violated, this structural pattern is disrupted,providing insight into whether resolving blobs as cycles iswarranted.

For blobs that are cycles, all procedures here provide statisticallyconsistent estimates of network structure under the NMSC model.

Preparing the input data

The NANUQ\(^+\) routines take asinput a collection of gene trees for a set of taxa. From multilocussequence data, standard phylogenetic tools must first be used to inferunrooted topological trees for each gene. Although these trees are notraw data, they are treated as such. If the gene trees are rooted orcontain edge lengths, that information is simply ignored in theseroutines.

We work with an example data set of 16,338 gene trees of 16 Leopardusspecies, supplied byLescroart et al.(2023) and analyzed with NANUQ\(^+\) byAllman,Baños, Rhodes, et al. (2024). Gene trees in Newick format can beread from a text file and their displayed quartet trees counted withcommands such as:

# read text file of gene trees and count quartets on themgts<-read.tree(file ='genetreefile')tableLeopardusLescroart=quartetTable(gts)

We will use a table of this data supplied withinMSCquartets, pre-computed by the above commands. BothTINNIK andNANUQ can take as input either thegene tree file or the quartet table.

# load data file containing quartet counts for Leopardus data set supplied with MSCquartets package# These counts are will be accessed as `tableLeopardusLescroart`.data(tableLeopardusLescroart)

Notes:

All functions used here allow gene trees with missing taxa, providedeach subset of four taxa appears on at least one tree (ideally, onmany). Statistical tests are conducted for each of these subsets, withthe sample size being the number of trees displaying the subset.

The good statistical behavior of the analysis below is establishedunder the assumption that the input gene trees are a true sample fromthe NMSC model. In practice, inferred gene trees, containing some error,are used instead. Poorly inferred gene trees or those lacking sufficientresolution can compromise results.

Step 1: Inferring the tree of blobs

The functionTINNIK can be used to infer the tree ofblobs for an unknown network. This first applies two hypothesis testsfor each choice of 4 taxa, testing whether the gene tree counts allowfor the rejection of a resolved quartet tree or a star tree for thosespecies. (SeeTINNIK’s vignette or(Allman, Baños, Mitchell, et al. 2024) for moreinformation.)

# perform TINNIK analysis for gene trees, using defaultsoutput<-TINNIK(tableLeopardusLescroart)#> Applying hypothesis test for model T3 to 1820 quartets.#> Applying hypothesis test for star tree model to 1820 quartets.# save table of quartet information with p-valuespT<-output$pTable

The defaultTINNIK test levels are\(\alpha=.05\) for the null hypothesis “thequartet has a tree-like relationship”, and\(\beta=.95\) for the null hypothesis “thequartet has a star-like relationship”. An initial run ofTINNIK with default test levels is rarely a sufficientanalysis. One should always vary the\(\alpha\) and\(\beta\) to judge robustness of the inferredtree of blobs to their choice.

As detailed byAllman, Baños, Rhodes, et al.(2024), for this data we choose\(\alpha=5e-29\), imposing a stringent testfor judging non-tree-likeness. Note in the first plot below this gives aclear separation between red (quartets interpreted as 4-blobs) and blue(quartets interpreted as tree-like) symbols, allowing for noise in thegene trees without rejecting tree-like-ness too strongly.

For this dataset and test levels we obtained a tree of blobs with twomultifurcations, both of degree five.

# perform improved TINNIK analysis to infer the tree of blobsoutput<-TINNIK(pT,alpha=5e-29,beta =0.95)

Output produced byTINNIK includes the table of quartetcounts and associated test p-values and the tree of blobs. We renamethese to use as input in further functions.

# run TINNIK to infer the tree of blobsoutput<-TINNIK(pT,alpha=5e-29,beta =0.95)

# rename outputpT<-output$pTable#quartet count data with p-values for testsToB<-output$ToB#the TINNIK tree of blobs

UsingNANUQ to explore whether multifurcations shouldbe resolved into cycles

TheNANUQ function can help detect whether a blob mightbe a cycle. On it’s own,NANUQ is a statisticallyconsistent level-1 network inference algorithm under the NMSC model. Ittakes the same input as TINNIK (a set of gene trees or a quartet table,along with two test levels) and produces, among other outputs, adistance measure that can be used as input to theneighborNet algorithm (as implemented in thephangorn package) to generate a splits network withspecific structural properties. When the model is violated, thisstructure is disrupted, providing insight into potential violations ofthe level-1 assumption.

Further details on NANUQ and its applications are given byAllman, Baños, and Rhodes (2019) andRhodes et al. (2020). We recommend using thesame test levels but encourage testing alternative levels as well.

# perform NANUQ analysis for table of quartet information and p-valuesD<-NANUQ(pT,alpha =5e-29,beta =0.95)# Run the NANUQ routine#> Distance table written to file: NANUQdist_alpha5e-29_beta0.95.nexNN<-neighborNet(D$dist)# Run the NeighborNet algorithm on the NANUQ distanceplot(NN)# plot the splits-graph with neighborNet

As in(Allman, Baños, Rhodes, et al.2024), we identified one multifurcation that aligns well with themodel hypothesis, displaying a characteristic dart-like structure, andanother that does not. We cannot determine whether the lattermultifurcation results from a model violation (presumably level-1-ness)or simply noise. Nevertheless, this indicates caution is needed ininterpreting findings associated with this multifurcation.

Note:

We used the NeighborNet implementationneighborNet inthephangorn package here, to do the full analysis withinR. However, we recommend using SplitsTree(Husonand Bryant 2006) for the splits graph, as it has been morethoroughly tested. For more details, see(Rhodeset al. 2020).

Step 2: Resolving multifurcations

When resolving the multifurcations, the first step is to numericallylabel all internal nodes in the tree of blobs, so that we and the codecan refer to individual multifurcations.

#Label internal nodes of the tree of blobs, and plot  ToB<-labelIntNodes(ToB)

Here the multifurcations are nodes 18 and 20. We will resolve eachone independently using the functionresolveCycle. Whilethe same test levels as in TINNIK can be used, we suggest experimentingwith different levels.

# resolve node 18  resC18<-resolveCycle(ToB,18,pT,alpha=5e-29,beta=0.95)

The outputresC18 is a list that includes a networkdisplaying the best resolution of node 18, with the other multifurcationremaining unresolved. Other information in this list is needed forcombining resolutions of different multifurcations, or producing thehistogram of residuals for all possible resolutions.

# resolve node 2O  resC20<-resolveCycle(ToB,20,pT,alpha=5e-29,beta=0.95)#> For node 20, 5 resolutions found.

Note multifurcation 20 produced five optimal resolutions (that is,those 5 resolutions have residual sums of squares that are within asmall tolerance). This suggests that this multifurcation may not arisefrom a cycle, or that the data has too much noise to be sure.

Alternatively, if one suspects that all multifurcations arise from acycle, one could attempt to resolve all multifurcations to obtain alevel-1 network.

#Fully resolve the tree of blobs to a level-1 networkresN<-resolveLevel1(ToB=output$ToB,pTable=output$pTable,alpha=5e-29,beta=0.95,distance="NANUQ")#> 2 multifurcations on tree, at nodes 18, 20.#> For node 20, 5 resolutions found.

The outputresN includes, as its first element, a listof the 5 level-1 networks that accommodate all the optimal resolutionsthat can co-occur.

Note:

In reporting any NANUQ\(^+\)analysis, it is essential to report the test levels used, and, asmentioned throughout, we strongly recommend exploring with values beyondthe defaults.

Estimating numerical parameters and further comparing multipleoptimal resolutions

NANUQ\(^+\) yielded 5 fully-resolvedtopological networks for this data analysis.

One can use the Julia packagePhyloNetworks(Solı́s-Lemus, Bastide, and Ané 2017) to optimizenumerical parameters (edge lengths and hybridization parameters) onthese under a quartet pseudo-likelihood criterion, and also use thepseudo-likelihood score to further compare these networks. Note,however, that comparing networks by psuedo-likelihood is a compromisedue to the difficulty of computing full likelihoods.

The commands below (based on 2024PhyloNetworks) provide anexample of this. Note that this is Julia code and not R code, and itrequires additional installation of software.

using PhyloNetworks #Load packagegts = readMultiTopology("genetreefile") #load gene trees to Juliaq, t = countquartetsintrees(gts; showprogressbar = true) #compute the quartet countsglobal df = writeTableCF(q, t) # to get a DataFrame that can be saved to a file laterglobal empCFs = readTableCF(df)ntwk = readMultiTopology("Network.nwk", false) # read the topology of the newtwok to be optimizednet = topologyMaxQPseudolik!(ntwk,empCFs) # optimize network parameters to obtain the pseudo-likelihood score

Note:

Inferring numerical parameters is only suggested for networks thatare fully resolved, such as those from the functionresolveLevel1. Attempted optimization on networks that arepartially resolved, where some multifurcations represent unknown blobstructures, may lead to erroneous inference.

References

Allman, E. S., H. Baños, J. D. Mitchell, and J. A. Rhodes. 2024.“TINNiK: Inference of the Tree of Blobs of a Species Network Underthe Coalescent.”bioRxiv.https://doi.org/10.1101/2024.04.20.590418.
Allman, E. S., H. Baños, and J. A. Rhodes. 2019.NANUQ: A Method for Inferring Species Networks fromGene Trees Under the Coalescent Model.”Algorithms Mol.Biol. 14 (24): 1–25.https://doi.org/10.1186/s13015-019-0159-2.
Allman, E. S., H. Baños, J. A. Rhodes, and K. Wicke. 2024.NANUQ\(^+\): ADivide-and-Conquer Approach to Network Estimation.”
Huson, D. H., and D. Bryant. 2006.“Application of PhylogeneticNetworks in Evolutionary Studies.”Molecular Biology andEvolution 23 (2): 254–67.
Lescroart, Jonas, Alejandra Bonilla-Sánchez, Constanza Napolitano, DianaL Buitrago-Torres, Héctor E Ramı́rez-Chaves, Paola Pulido-Santacruz,William J Murphy, Hannes Svardal, and Eduardo Eizirik. 2023.Extensive Phylogenomic Discordance and theComplex Evolutionary History of the Neotropical Cat GenusLeopardus.”Molecular Biology and Evolution 40(12): msad255.https://doi.org/10.1093/molbev/msad255.
Rhodes, J. A., H. Baños, J. D. Mitchell, and E. S. Allman. 2020.MSCquartets 1.0: Quartet methods for speciestrees and networks under the multispecies coalescent model inR.”Bioinformatics, October.https://doi.org/10.1093/bioinformatics/btaa868.
Solı́s-Lemus, Claudia, Paul Bastide, and Cécile Ané. 2017.PhyloNetworks: A Package for PhylogeneticNetworks.”Molecular Biology and Evolution 34(12): 3292–98.https://doi.org/10.1093/molbev/msx235.

[8]ページ先頭

©2009-2025 Movatter.jp