Movatterモバイル変換

RWTY: R We There Yet?

This package implements various tests, visualizations, and metricsfor diagnosing convergence and mixing of MCMC chains in phylogenetics.It implements and automates many of the functions of the AWTY package inthe R environment. It also adds a number of new functions not availablein AWTY.

Citation

Warren D.L., A.J. Geneva, and R. Lanfear. 2017. RWTY (R We ThereYet): An R package for examining convergence of Bayesian phylogeneticanalyses. Molecular Biology and Evolution 34:1016-1020. doi:10.1093/molbev/msw279

Support

For help with RWTY problems, please check out theGoogle Groupfor RWTY.

Installation

At present, RWTY is downloadable fromhttps://github.com/danlwarren/RWTY. There are multipleways to download it. The easiest is to use devtools and install fromGitHub.

Installing from GitHubusing devtools

Run the following code from your R console:

install.packages("devtools")library(devtools)install_github("danlwarren/RWTY")library(rwty)

Install from zip file

A zipped version of the package is available athttps://github.com/danlwarren/RWTY/archive/master.zip.To install from the zip file, download a copy of it to your system. Onceit’s finished downloading, type the following (where PATH is the path tothe zip file):

install.packages("devtools")library(devtools)install_local("PATH")library(rwty)

Setting the number of cores

Some of the calculations RWTY performs benefit from the availabilityof multiple processor cores. To set the number of cores used by RWTY,all you need to do is assign a global variable called“rwty.processors”.

rwty.processors <<- 8

If no rwty.processors object exists in the global environment, RWTYwill default to using N-1 cores, where N is the total number availableon your system. Due to the limits of R’s internal parallelization, thisfunctionality is not available on Windows computers.

Interacting with RWTY

RWTY works with lists of rwty.chain objects. Each rwty.chain objectcontains outputs from a single Bayesian phylogenetic MCMC analysis, e.g.from BEAST or MrBayes. A rwty.chain object must contain the treetopologies from the chain (e.g. from the .t file from MrBayes), and canoptionally contain the continuous parameters as well (e.g. from the .pfile from MrBayes). A list of rwty.chain objects should all come fromanalyses of the same taxa. Most often, a list of rwty.chain objects willcontain multiple independent replicates of a single MCMC analysis, butit could also contain independent MCMC analyses of different loci fromthe same set of taxa, or a combination of both (e.g. 4 replicateanalyses of 3 different loci).

A lot of analyses in RWTY can be done using just three RWTYfunctions:load.trees(),load.multi(), andanalyze.rwty(). Other functions you might use are any ofthe functions that start withmakeplot., and thetopological.approx.ess() function.

`load.trees()`

This function is used to load in a single MCMC chain, where PATHrepresenting the path to the tree file. It will load the tree file andattempt find an associated log file, with the type determined by the“format” command. The default format is for MrBayes, where tree fileshave a “.t” extension and log files have a “.p” extension. For example,if your tree file is named “mytrees.t”, RWTY will try to find“mytrees.p” in the same directory, and will insert the file into thetrees object automatically if it’s found. RWTY can also automaticallyfind files for BEAST and *BEAST runs, so long as you provide theappropriate format command.

If you have a large number of trees and RWTY is taking too long toproduce plots, you can subsample your trees with the “trim” command. Forinstance, “trim = 10” means RWTY will only keep every tenth tree foranalysis

my.trees <- load.trees("PATH")my.beast.trees <- load.trees("PATH", format = "beast", trim = 5)my.starbeast.trees <- load.trees("PATH", format = "*beast")

If your trees or log files are named or formatted in a nonstandardway, you can still load your data into an RWTY object. The “logfile”argument takes a path to your logfile, the “type” argument tells RWTYwhether your trees are Nexus or Newick format, the “skip” argument tellsRWTY how many lines to skip in the log file before it reaches the headerline, and the “gens.per.tree” argument tells RWTY how many generationsthere were in the MCMC inbetween each tree that was sampled.

my.trees <- load.trees("path_to_tree_file", type = "nexus", logfile = "path_to_log_file", skip = 1, gens.per.tree = 1)

For example, if you have a collection of trees in Newick format likethis:

((A, B),(C, D));(((A, B), C), D);((A, C),(B, D));

and no logfile, you could load them into RWTY like this (note you’dneed to change the file path to suit your machine):

my.trees <- load.trees("~/Desktop/trees.phy", type = "newick", gens.per.tree = 1)

`load.multi()`

This function is used to load more than one chain into a list ofrwty.chains objects. It allows you to specify the format of your MCMCanalysis, so that it can automatically find tree and log files. Youprovide a path to the folder containing the tree files, as well as thefile format (default is MrBayes). As with load.trees, load.multi willattempt to automatically find log files. For example, to search adirectory named PATH for output of a MrBayes run, you would input thefollowing:

my.trees <- load.multi("PATH", format = "mb")

`analyze.rwty()`

This function conducts a suite of analyses that are applicable to thedata that are passed to it. The output from this function will depend onwhether you have loaded data from one or more than one chain, and onwhether your data comprises just phylogenetic trees, or whether it alsoincludes continuous parameters from log files. Since RWTY runs can takea while, and since a single RWTY run produces a number of plots andmetrics, it’s a good idea to always store them in an object for lateruse.

analyze.rwty() will set sensible defaults for mostparameters, but you should set theburnin and thefill.color to values that are appropriate for your data.Theburnin should be set to a value that removes trees andparameters not sampled from the stationary distribution, and thefill.color should be set to the name of the parameter fromyour log file with which you would like to colour the points in treespace.

analyze.rwty() returns a long list of plots.

data(salamanders)salamanders.rwty <- analyze.rwty(salamanders, burnin=50, fill.color = 'LnL')# to see which plots you havenames(salamanders.rwty)

`makeplot.x()`

Each of the functions that starts withmakeplot. has asimilar set of inputs. Each plot takes a list of one or morerwty.chain objects and the burnin as input, with otherparameters that are specific to certain plots. These functions areuseful to tune the parameters of particular plots (e.g. the burnin, orthe number of points to plot in tree space plots), which can then bepassed to a more thorough analysis through the additional arguments inanalyze.rwty().

For example, you might choose your burnin based on looking at tracesof parameters and trees.

makeplot.all.params(salamanders, burnin=0) # the LnL trace suggests burnin should be >0makeplot.all.params(salamanders, burnin=50) # this looks OK

topological.approx.ess()

This function calculates the approximate ESS of tree topologies fromall of the chains. It returns a data frame with approximate ESS values,which can be used to decide if the chain has been run for long enoughfor sufficient samples to be collected. The approximate ESS ismathematically similar to the ESS used for continuous parameters, but incertain cases an upper bound will be provided rather than a pointestimate. These are represented by ‘<’, and ‘=’ in the ‘operator’column of the data frame respectively.

approx.ess <- topological.approx.ess(salamanders, burnin = 50)

Which gives you a data frame that looks like this:

  operator approx.ess       chain1        =   243.0683 AMOTL2.run12        =   134.2827 AMOTL2.run23        =   222.0624   LHX2.run14        =   251.0000   LHX2.run25        =   251.0000  TRMT5.run16        =   251.0000  TRMT5.run2

RWTY outputs

RWTY outputs a number of useful plots for exploring the performanceof MCMC chains.

Parameter trace plots

If a logfile is provided, RWTY will produce a trace plot of the valueof each column in the log file as a function of position in the chain.If you have multiple chains, the plots will be faceted by chain (you canturn this off usingfacet = FALSE when callinganalyze.rwty(). The title of each plot shows the EffectiveSample Size (ESS) of the parameter. In general, you should aim to haveESS > 200 for all parameters.

salamanders.rwty$pi.C

parameter plot

Parameter correlation plots

If a logfile is provided, RWTY will produce a scatter plot matrixfrom the columns of the log file, along with the topological distancefrom the starting tree. Because the log file output of many MCMCprograms can contain a large number of columns, the default behavior ofmakeplot.pairs is to only plot the topological distance along with thefirst two data columns. In the density plots on the diagonal, red valuesindicate values outside the 95% CI. In the scatter plots in the lowerleft, brighter and more yellow colors indicate samples from later in thechain.

salamanders.rwty$AMOTL2.run1.correlations

If you would like to view correlations between likelihood, topology,and model parameters, you can do so by passing a vector of parameternames or “all” to the params argument of makeplot.pairs oranalyze.rwty.

data(fungus)plotparams <- c("LnL", "pi.A.", "pi.T.")makeplot.pairs(fungus$Fungus.Run1, burnin = 20, params = plotparams)

Tree topology trace plot

RWTY will produce a plot of tree topologies that is similar to thosefor parameters. In these plots, the y-axis represents the distance ofeach tree in the chain from the last tree of the burnin of the firstchain. These plots can reveal whether the MCMC was sampling from astationary distribution of chains (in which case it should be relativestable) or whether it is moving between modes (in which case you mightsee large jumps). They can also show how well the chain is mixing withrespect to tree topologies. Well mixed chains will show the topologytrace jumping rapidly between values. Poorly mixed chains will showchains remaining at similar values in successive samples. Topologytraces can also reveal whether different chains are sampling differentbits of tree space, in which case the stationary distrubtions will showtraces at different y values on the plot. Finally, the topology tracecontains an estimate of the ESS (called the approximate ESS) of treetopologies. Note that this estimate is only useful if the trees aresampled from a stationary distribution.

salamanders.rwty$topology.trace

topology plot

Autocorrelation plots

These plots demonstrate the change in average tree distance betweentrees at different sampling intervals. A well-mixed chain (in whichsequential samples of tree topologies are effectively independent) willshow flat lines on these plots. In a chain that is mixing poorly withrespect to tree topologies, you will see the plot rising to anasymptote. This reflects the fact that are close to each other in thechain are more similar than trees that are further away, otherwise knownas autocorrelation.

salamanders.rwty$autocorr.plot

Treespace plots

These plots demonstrate the movement of the chain in two dimensionalrepresentations of treespace. The tree space is based on an NMDS scalingof the distances between trees, much like TreeSetViz. The space iscreated using all chains together, so the treespaces represented fordifferent chains are comparable. If all chains are sampling similar treetopolgoies, then the treespace plots will look similar for both chains.In this case, we can see that independent runs of the same gene aresampling from similar regions of tree space. But that chains fromdifferent genes are sampling from different regions of treespace.

salamanders.rwty$treespace.heatmapsalamanders.rwty$treespace.points.plot

heatmap points plot

Since the salamanders data comes from different genes, and thosegenes are obviously sampling different bits of tree space, it might bemore informative to compare the replicate runs of each geneindividually.

salamanders.treespace = makeplot.treespace(salamanders[1:2], burnin = 50, n.points = 200, fill.color = "LnL")salamanders.treespace$treespace.heatmapsalamanders.treespace$treespace.points.plot

heatmap 2 points plot 2

This shows that the replicate analyses of the same gene are samplingvery similar parts of treespace.

Slidingwindow and cumulative split frequency plots

These plots show the split frequencies of clades in the chain, eitheras the frequency within a sliding window or as the frequency calculatedfrom cumulative samples in the chain at various points in analysis. Bydefault they show the 20 clades with the highest standard deviation ofsplit frequencies within each sample (e.g. the standard deviation of thesplit frequencies of a given clade in a given chain, across all slidingwindows). In an analysis that is mixing well, we expect the frequenciesin the sliding window to move around a lot, but the frequencies in thecumulative plot should level off. If the frequencies in the cumulativeplot have not levelled off, it might be necessary to run the chain forlonger. Plots can be produced for single chains, or for multiplechains.

salamanders.rwty$splitfreqs.sliding.plotsalamanders.rwty$splitfreqs.cumulative.plot

sliding split frequencies cumulative split frequencies

For a single chain:

makeplot.splitfreqs.cumulative(salamanders[[1]])makeplot.splitfreqs.sliding(salamanders[[1]])

cumulative split frequencies sliding split frequencies

Slidingwindow and cumulative average change in split frequency (ACSF)plots

These plots show the change in split frequencies of clades in thechain, either as the frequency within a sliding window or as thefrequency calculated from cumulative samples in the chain at variouspoints in the analysis. They are analagous to the previous two plots,but instead of showing split frequencies of clades on the y axis, theyshow the change in split frequencies of all clades on the y axis. Thesolid line is the average change in split frequencies (ACSF), and theribbons show the 100%, 95%, and 75% quantiles of the Change in SplitFrequencies (CSF). Ideally, we’d see the sliding window change in splitfrequencies remaining above zero but without big jumps (which wouldindicate the chain being stuck in one region of tree space, then movingto another), and we’d see the cumulative change in split frequenciesapproaching zero (indicating that the chain has settled on sampling froma single reqion of tree space, and that adding additional samples is notmaking any appreciable difference to the estimates of clade posteriorprobabilities one might make from the analysis).

salamanders.rwty$acsf.sliding.plotsalamanders.rwty$acsf.cumulative.plot

sliding change in split frequencies cumulative change in split frequencies

Multiple chains only

The following plots are only produced with analyze.rwty() is passedlist of rwty.chain object that contains more than one chain.

AverageStandard Deviation of Split Frequency (ASDSF) plot

This plot shows the ASDSF of all clades that appear in any of thechains, calculated from cumulative samples in the chain at variouspoints in the analysis. The solid line shows the ASDSF, and the ribbonsshow the 75%, 95%, and 100% quantiles of the Standard Deviation of SplitFrequencies (SDSF) across clades. I.e. the upper line of the grey ribbonshows the clade with the 87.5% highest SDSF between chains, and thelower line of the grey ribbon shows the clade with the 12.5% highestSDSF. Each generation in the plot represents the distribution of SDSFsif the chains were all stopped at that point. The lower the ASDSF, themore the chains agree on split frequencies. In general we expect theSDSF to get lower as the chains progress, and to eventually reach anasymptote at some low value once all the chains have been sampling froma stationary distribution for a long time. Because we are usuallyintersted in the right-hand end of this graph, where the ASDSF valuescan be much lower than at the start of the chain, the y axis is plottedon a log10 scale.

Note that the points and line represent the mean, not the median,SDSF. For this reason it is possible for the line to fall outside of theinner ribbons with the distribution of SDSFs is highly skewed.

makeplot.asdsf(salamanders[c(1,2)])

Split frequency scatterplotmatrix

These plots are another way of visualizing the agreement ordisagreement between the regions of tree space that chains have sampled.For each pair of chains, the cumulative final frequency of each clade(which is also our best estimate of that clade’s posterior probability)is plotted against ts frequency in another chain in the lower triangle,and the the correlation (Pearson’s R) and ASDSF of each pair of chainsis shown in the upper triangle.

salamanders.rwty$splitfreq.matrix

ASDSF tree

This plot shows the similarity of split frequencies between chains.Chains with the most similar split frequencies be clustered together inthe tree. This plot can quickly reveal which chains are sampling similarand different regions of tree space.

salamanders.rwty$asdsf.tree

[8]ページ先頭