Here, we introduce the R-packagePOUMM - animplementation of the Phylogenetic Ornstein-Uhlenbeck Mixed Model(POUMM) for univariate continuous traits(Mitovand Stadler 2018, 2019). In the sections below, we demonstratehow the package works. To that end, we run a simulation of a traitaccording to the POUMM model. Then, we execute a maximum likelihood (ML)and a Bayesian (MCMC) POUMM fit to the simulated data. We show how touse plots and some diagnostics to assess the quality of the fit,i.e. the mixing and the convergence of the MCMC, as well as theconsistency of the POUMM fit with the true POUMM parameters from thesimulation. Once you are familiar with these topics, we recommendreading the vignetteInterpreting thePOUMM.

Simulating trait evolution under the POUMM

Parameters of the simulation

First, we specify the parameters of the POUMM simulation:

N<-500g0<-0alpha<- .5theta<-2sigma<-0.2sigmae<-0.2

We briefly explain the above parameters. The first four of themdefine an OU-process with initial state$g_0$, a selection strength parameter,$\alpha$, a long-term mean,$\theta$, and a stochastic time-unitstandard deviation,$\sigma$. To getan intuition about the OU-parameters, one can consider randomOU-trajectories using the functionrTrajectoryOU(). On thefigure below, notice that doubling$\alpha$ speeds up the convergence of thetrajectory towards$\theta$ (magentaline) while doubling$\sigma$ resultsin bigger stochastic oscilations (blue line):

$Dashed black and magenta lines denote the deterministic trend towards the long-term mean \theta, fixing the stochastic parameter \sigma=0.$

Dashed black and magenta lines denote thedeterministic trend towards the long-term mean$\theta$, fixing the stochastic parameter$\sigma=0$.

The POUMM models the evolution of a continuous trait,$z$, along a phylogenetic tree, assumingthat$z$ is the sum of a genetic(heritable) component,$g$, and anindependent non-heritable (environmental) component,$e\sim N(0,\sigma_e^2)$. At every branchingin the tree, the daughter lineages inherit the$g$-value of their parent, adding their ownenvironmental component$e$. ThePOUMM assumes the genetic component,$g$, evolves along each lineage accordingto an OU-process with initial state the$g$ value inherited from the parent-lineageand global parameters$\alpha$,$\theta$ and$\sigma$.

Simulating the phylogeny

Once the POUMM parameters are specified, we use theape R-package to generate a random tree with 500tips:

# Number of tipstree<-rtree(N)plot(tree,show.tip.label =FALSE)

Simulating trait evolution on the phylogeny

Starting from the root value$g_0$, we simulate the genotypic values,$g$, and the environmentalcontributions,$e$, at all internalnodes down to the tips of the phylogeny:

# genotypic (heritable) valuesg<-rVNodesGivenTreePOUMM(tree, g0, alpha, theta, sigma)# environmental contributionse<-rnorm(length(g),0, sigmae)# phenotypic valuesz<- g+ e

Visualizing the data

In most real situations, only the phenotypic value, at the tips,i.e. will be observable. One useful way to visualize the observedtrait-values is to cluster the tips in the tree according to theirroot-tip distance, and to use box-whisker or violin plots to visualizethe trait distribution in each group. This allows to visually assess thetrend towards uni-modality and normality of the values - an importantprerequisite for the POUMM.

# This is easily done using the nodeTimes utility function in combination with# the cut-function from the base package.data<-data.table(z = z[1:N],t =nodeTimes(tree,tipsOnly =TRUE))data<- data[, group:=cut(t,breaks =5,include.lowest =TRUE)]ggplot(data = data,aes(x = t,y = z,group = group))+geom_violin(aes(col = group))+geom_point(aes(col = group),size=.5)

Distributions of the trait-values groupedaccording to their root-tip distances.

Fitting the POUMM

Once all simulated data is available, it is time proceed with a firstPOUMM fit. This is done easily by calling the functionPOUMM():

fitPOUMM<-POUMM(z[1:N], tree)

The above code runs for about 5 minutes on a MacBook Pro Retina (late2013) with a 2.3 GHz Intel Core i7 processor. Using default settings, itperforms a maximum likelihood (ML) and a Bayesian (MCMC) fit to thedata. First the ML-fit is done. Then, three MCMC chains are run asfollows: the first MCMC chain samples from the default priordistribution, i.e. assuming a constant POUMM likelihood; the second andthe third chains perform adaptive Metropolis sampling from the posteriorparameter distribution conditioned on the default prior and the data. Bydefault each chain is run for$10^5$iterations. This and other default POUMM settings are described in thehelp-page for the functionspecifyPOUMM().

The strategy of executing three MCMC chains instead of one allows toassess:

the quality of the MCMC fit: a mismatch between the samplingdistributions of the second and third chains suggests that at least oneof the chains has not converged to a region of high posterior density(HPD).
the presence of signal for the POUMM parameters in the data: a closematch between prior and posterior distributions suggests lack of signalin the data.

We plot traces and posterior sample densities from the MCMC fit:

# get a list of plotsplotList<-plot(fitPOUMM,showUnivarDensityOnDiag =TRUE,doPlot =FALSE)plotList$traceplot

MCMC traces from a POUMM MCMC-fit.

plotList$densplot

MCMC univariate density plots. Black dots on thex-axis indicate the ML-fit.

A mismatch of the posterior sample density plots from chains 2 and 3,in particular for the phylogenetic heritability,$H_{\bar{t}}^2$, indicates that the chainshave not converged. This can be confirmed quantitatively by theGelman-Rubin statistic (column called G.R.) in the summary of thefit:

summary(fitPOUMM)

##             stat   N       MLE PostMean               HPD    ESS  G.R.##  1:        alpha 500   0.41430  0.40221     0.1898,0.6705  77.34 1.126##  2:        theta 500   2.06446  2.08657       1.958,2.248 128.33 1.030##  3:        sigma 500   0.12321  0.12200   0.07823,0.17237  78.51 1.119##  4:       sigmae 500   0.19983  0.20199     0.1779,0.2252  99.72 1.083##  5:          H2e 500   0.43886  0.42389     0.2871,0.5550  99.58 1.100##  6:       H2tInf 500   0.31451  0.31965     0.1278,0.4446 131.32 1.102##  7:       H2tMax 500   0.31450  0.31935     0.1278,0.4445 131.24 1.100##  8:      H2tMean 500   0.31343  0.31641     0.1267,0.4429 130.67 1.096##  9:           g0 500        NA       NA             NA,NA   0.00    NA## 10: sigmaG2tMean 500   0.01823  0.01961 0.007174,0.029110 148.95 1.205## 11:  sigmaG2tMax 500   0.01832  0.02005 0.007235,0.029455 151.11 1.225## 12:  sigmaG2tInf 500   0.01832  0.02013 0.007236,0.029464 154.32 1.232## 13:      logpost 500        NA 14.28151      1.609,17.014  36.65 1.315## 14:       loglik 500  16.32733       NA             NA,NA   0.00    NA## 15:          AIC 500 -22.65466       NA             NA,NA   0.00    NA## 16:         AICc 500 -22.53320       NA             NA,NA   0.00    NA

The G.R. diagnostic is used to check whether two random samplesoriginate from the same distribution. Values that are substantiallydifferent from 1.00 (in this case greater than 1.01) indicatesignificant difference between the two samples and possible need toincrease the number of MCMC iterations. Therefore, we rerun the fitspecifying that each chain should be run for$4 \times 10^5$ iterations:

fitPOUMM2<-POUMM(z[1:N], tree,spec=list(nSamplesMCMC =4e5))

Now, both the density plots and the G.R. values indicate nearlyperfect convergence of the second and third chains. The agreementbetween the ML-estimates (black dots on the density plots) and theposterior density modes (approximate location of the peak in the densitycurves) shows that the prior does not inflict a bias on the MCMC sample.The mismatch between chain 1 and chains 2 and 3 suggests that theinformation about the POUMM parameters contained in the data disagreeswith or significantly improves our prior knowledge about theseparameters. This is the desired outcome of a Bayesian fit, inparticular, in the case of a weak (non-informed) prior, such as thedefault one.

plotList<-plot(fitPOUMM2,doPlot =FALSE)plotList$densplot

summary(fitPOUMM2)

##             stat   N       MLE PostMean               HPD    ESS   G.R.##  1:        alpha 500   0.41430  0.38942     0.2104,0.5747  720.0 1.0015##  2:        theta 500   2.06446  2.08602       1.987,2.195  720.0 1.0009##  3:        sigma 500   0.12321  0.11770   0.07336,0.16939  720.0 0.9986##  4:       sigmae 500   0.19983  0.20191     0.1802,0.2286  799.1 0.9994##  5:          H2e 500   0.43886  0.42494     0.2658,0.5435  799.5 0.9999##  6:       H2tInf 500   0.31451  0.30849     0.1470,0.4694 1274.0 1.0015##  7:       H2tMax 500   0.31450  0.30831     0.1468,0.4694 1273.0 1.0015##  8:      H2tMean 500   0.31343  0.30572     0.1437,0.4679 1263.7 1.0013##  9:           g0 500        NA       NA             NA,NA    0.0     NA## 10: sigmaG2tMean 500   0.01823  0.01815 0.009029,0.030213 1266.7 1.0053## 11:  sigmaG2tMax 500   0.01832  0.01838 0.008773,0.030115 1264.1 1.0058## 12:  sigmaG2tInf 500   0.01832  0.01840 0.008774,0.030178 1264.3 1.0057## 13:      logpost 500        NA 15.02485       12.17,17.07  720.0 1.0013## 14:       loglik 500  16.32733       NA             NA,NA    0.0     NA## 15:          AIC 500 -22.65466       NA             NA,NA    0.0     NA## 16:         AICc 500 -22.53320       NA             NA,NA    0.0     NA

Consistency of the fit with the “true” simulation parameters

The 95% high posterior density (HPD) intervals contain the truevalues for all five POUMM parameters ($\alpha$,$\theta$,$\sigma$,$\sigma_e$ and$g_0$). This is also true for the derivedstatistics. To check this, we calculate the true derived statistics fromthe true parameter values and check that these are well within thecorresponding HPD intervals:

tMean<-mean(nodeTimes(tree,tipsOnly =TRUE))tMax<-max(nodeTimes(tree,tipsOnly =TRUE))c(# phylogenetic heritability at mean root-tip distance:H2tMean =H2(alpha, sigma, sigmae,t = tMean),# phylogenetic heritability at long term equilibirium:H2tInf =H2(alpha, sigma, sigmae,t =Inf),# empirical (time-independent) phylogenetic heritability,H2e =H2e(z[1:N], sigmae),# genotypic variance at mean root-tip distance:sigmaG2tMean =varOU(t = tMean, alpha, sigma),# genotypic variance at max root-tip distance:sigmaG2tMean =varOU(t = tMax, alpha, sigma),# genotypic variance at long-term equilibrium:sigmaG2tInf =varOU(t =Inf, alpha, sigma)  )

##      H2tMean       H2tInf          H2e sigmaG2tMean sigmaG2tMean  sigmaG2tInf ##      0.49958      0.50000      0.43792      0.03993      0.04000      0.04000

Finally, we compare the ratio of empirical genotypic to totalphenotypic variance with the HPD-interval for the phylogeneticheritability.

c(H2empirical =var(g[1:N])/var(z[1:N]))

## H2empirical ##      0.6541

summary(fitPOUMM2)["H2e"==stat,unlist(HPD)]

##  lower  upper ## 0.2658 0.5435

Parallel execution

On computers with multiple core processors, it is possible tospeed-up the POUMM-fit by parallelization. The POUMM package supportsparallelization on two levels:

parallelizing the POUMM likelihood calculation. The POUMM packageparallelizes the likelihood calculation through the SPLITT library forparallel lineage traversal(Mitov and Stadler2019). This is a fine grain parallelization, which can benefitfrom modern single instruction multiple data (SIMD) processors as wellas multiple physical cores. Parallelization on multiple cores becomesbeneficial on trees exceeding several hundreds tips. For thisparallelization to work, the POUMM package must be compiled fromsource-code using an OpenMP 4.0-enabled C++ compiler. Open MP 4.0 issupported by several modern C++ compilers including Gnu-g++, Intel-icpcand clang version 6.0 or above (see also theSPLITTGet started guide for details).

To control the maximum number of threads (defaults to all physical orvirtual cores on the system) by specifying the environment variableOMP_NUM_THREADS, before starting R e.g.:

export OMP_NUM_THREADS=4

parallelizing the MCMC-chains - this can be done by creating acluster using the R-packageparallel. With thedefault settings of the MCMC-fit (executing two MCMC chains samplingfrom the posterior distribution and one MCMC chain sampling from theprior), this parallelization can result in about two times speed-up ofthe POUMM fit on a computer with at least two available physicalcores.

# set up a parallel cluster on the local computer for parallel MCMC:cluster<- parallel::makeCluster(parallel::detectCores(logical =FALSE))doParallel::registerDoParallel(cluster)fitPOUMM<-POUMM(z[1:N], tree,spec=list(parallelMCMC =TRUE))# Don't forget to destroy the parallel cluster to avoid leaving zombie worker-processes.parallel::stopCluster(cluster)

It is possible to use this parallelization in combination withparallelization of the likelihood calculation. This, however, has notbeen tested and, presumably, would be slower than a parallelization onthe likelihood level only. It may be appropriate to parallelize the MCMCchains on small trees, since for small trees, the parallel likelihoodcalculation is not likely to be much faster than a serial modecalculation. In this case the SPLITT library would switch automaticallyto serial mode, so there will be no parallel CPU at the likelihoodlevel.

References

Bates, Douglas, and Martin Maechler. 2017.Matrix: Sparse and DenseMatrix Classes and Methods.https://CRAN.R-project.org/package=Matrix.

Dowle, Matt, and Arun Srinivasan. 2016.Data.table: Extension of‘Data.frame‘.https://CRAN.R-project.org/package=data.table.

Eddelbuettel, Dirk, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou,Nathan Russell, Douglas Bates, and John Chambers. 2017.Rcpp:Seamless r and c++ Integration.https://CRAN.R-project.org/package=Rcpp.

Genz, Alan, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, and Torsten Hothorn.2016.Mvtnorm: Multivariate Normal and t Distributions.https://CRAN.R-project.org/package=mvtnorm.

Hothorn, Torsten, Achim Zeileis, Richard W. Farebrother, and ClintCummins. 2015.Lmtest: Testing Linear Regression Models.https://CRAN.R-project.org/package=lmtest.

Maechler, Martin. 2016.Rmpfr: R MPFR - Multiple PrecisionFloating-Point Reliable.https://CRAN.R-project.org/package=Rmpfr.

Mitov, Venelin, and Tanja Stadler. 2018.“A Practical Guide toEstimating the Heritability of Pathogen Traits.”MolecularBiology and Evolution 35 (3): 756–72.https://doi.org/10.1093/molbev/msx328.

———. 2019.“Parallel Likelihood Calculation for PhylogeneticComparative Models: The SPLITT c++ Library.”Methods inEcology and Evolution.https://doi.org/10.1111/2041-210X.13136.

Paradis, Emmanuel, Simon Blomberg, Ben Bolker, Julien Claude, Hoa SienCuong, Richard Desper, Gilles Didier, et al. 2016.Ape: Analyses ofPhylogenetics and Evolution.https://CRAN.R-project.org/package=ape.

Plummer, Martyn, Nicky Best, Kate Cowles, Karen Vines, Deepayan Sarkar,Douglas Bates, Russell Almond, and Arni Magnusson. 2016.Coda:Output Analysis and Diagnostics for MCMC.https://CRAN.R-project.org/package=coda.

Revolution Analytics, and Steve Weston. n.d.Foreach: ProvidesForeach Looping Construct for r.

Scheidegger, Andreas. 2012.adaptMCMC: Implementation of a GenericAdaptive Monte Carlo Markov Chain Sampler.https://CRAN.R-project.org/package=adaptMCMC.

Team, R Core. n.d.Support for Parallel Computation in r.

Wickham, Hadley. 2016.Testthat: Unit Testing for r.https://CRAN.R-project.org/package=testthat.

Wickham, Hadley, and Winston Chang. 2016.Ggplot2: Create ElegantData Visualisations Using the Grammar of Graphics.https://CRAN.R-project.org/package=ggplot2.

Movatterモバイル変換

A User Guide to the POUMM R-package

Venelin Mitov

Installing the POUMM R-package