Here, we introduce the R-packagePOUMM - animplementation of the Phylogenetic Ornstein-Uhlenbeck Mixed Model(POUMM) for univariate continuous traits(Mitovand Stadler 2018, 2019). In the sections below, we demonstratehow the package works. To that end, we run a simulation of a traitaccording to the POUMM model. Then, we execute a maximum likelihood (ML)and a Bayesian (MCMC) POUMM fit to the simulated data. We show how touse plots and some diagnostics to assess the quality of the fit,i.e. the mixing and the convergence of the MCMC, as well as theconsistency of the POUMM fit with the true POUMM parameters from thesimulation. Once you are familiar with these topics, we recommendreading the vignetteInterpreting thePOUMM.
But before we start, we need to install and load a few R-packagesfrom CRAN:
install.packages("data.table")install.packages("ggplot2")install.packages("lmtest")install.packages("ape")You can install the most recent version of the package fromgithub:
The above command will install the HEAD version from the mastergit-branch. This is the package branch, which evolves the fastest andgets the quickest bug-fixes.
A more stable version of the package can be installed from CRAN:
The above commands will install all 3rd party dependencies with oneexception: the package for high precision floating point arithmetics(Rmpfr). While the POUMM can be run without Rmpfr installed, installingthis package is recommended to prevent numerical instability in someextreme cases, i.e. very small positive values for the POUMM parametersalpha, sigma, and sigmae. Currently, Rmpfr is listed as “Suggests” inPOUMM’s DESCRIPTION, since it’s installation has been problematic onsome systems (i.e. Linux). If you have not done this before, you mayneed to enable Rcpp in your R environment. You can read how to do thisin the Rcpp-FAQ vignette available fromthis page.
Bear in mind that, for CRAN compliance, some of the functionality ofthe package might not be enabled on some systems, in particular,parallel likelihood calculation on Mac OS X El Capitan using clangcompiler prior to version 6 (see the sectionParallel execution below).
First, we specify the parameters of the POUMM simulation:
We briefly explain the above parameters. The first four of themdefine an OU-process with initial state\(g_0\), a selection strength parameter,\(\alpha\), a long-term mean,\(\theta\), and a stochastic time-unitstandard deviation,\(\sigma\). To getan intuition about the OU-parameters, one can consider randomOU-trajectories using the functionrTrajectoryOU(). On thefigure below, notice that doubling\(\alpha\) speeds up the convergence of thetrajectory towards\(\theta\) (magentaline) while doubling\(\sigma\) resultsin bigger stochastic oscilations (blue line):
The POUMM models the evolution of a continuous trait,\(z\), along a phylogenetic tree, assumingthat\(z\) is the sum of a genetic(heritable) component,\(g\), and anindependent non-heritable (environmental) component,\(e\sim N(0,\sigma_e^2)\). At every branchingin the tree, the daughter lineages inherit the\(g\)-value of their parent, adding their ownenvironmental component\(e\). ThePOUMM assumes the genetic component,\(g\), evolves along each lineage accordingto an OU-process with initial state the\(g\) value inherited from the parent-lineageand global parameters\(\alpha\),\(\theta\) and\(\sigma\).
Once the POUMM parameters are specified, we use theape R-package to generate a random tree with 500tips:
Starting from the root value\(g_0\), we simulate the genotypic values,\(g\), and the environmentalcontributions,\(e\), at all internalnodes down to the tips of the phylogeny:
In most real situations, only the phenotypic value, at the tips,i.e. will be observable. One useful way to visualize the observedtrait-values is to cluster the tips in the tree according to theirroot-tip distance, and to use box-whisker or violin plots to visualizethe trait distribution in each group. This allows to visually assess thetrend towards uni-modality and normality of the values - an importantprerequisite for the POUMM.
# This is easily done using the nodeTimes utility function in combination with# the cut-function from the base package.data<-data.table(z = z[1:N],t =nodeTimes(tree,tipsOnly =TRUE))data<- data[, group:=cut(t,breaks =5,include.lowest =TRUE)]ggplot(data = data,aes(x = t,y = z,group = group))+geom_violin(aes(col = group))+geom_point(aes(col = group),size=.5)Once all simulated data is available, it is time proceed with a firstPOUMM fit. This is done easily by calling the functionPOUMM():
The above code runs for about 5 minutes on a MacBook Pro Retina (late2013) with a 2.3 GHz Intel Core i7 processor. Using default settings, itperforms a maximum likelihood (ML) and a Bayesian (MCMC) fit to thedata. First the ML-fit is done. Then, three MCMC chains are run asfollows: the first MCMC chain samples from the default priordistribution, i.e. assuming a constant POUMM likelihood; the second andthe third chains perform adaptive Metropolis sampling from the posteriorparameter distribution conditioned on the default prior and the data. Bydefault each chain is run for\(10^5\)iterations. This and other default POUMM settings are described in thehelp-page for the functionspecifyPOUMM().
The strategy of executing three MCMC chains instead of one allows toassess:
We plot traces and posterior sample densities from the MCMC fit:
# get a list of plotsplotList<-plot(fitPOUMM,showUnivarDensityOnDiag =TRUE,doPlot =FALSE)plotList$traceplotA mismatch of the posterior sample density plots from chains 2 and 3,in particular for the phylogenetic heritability,\(H_{\bar{t}}^2\), indicates that the chainshave not converged. This can be confirmed quantitatively by theGelman-Rubin statistic (column called G.R.) in the summary of thefit:
## stat N MLE PostMean HPD ESS G.R.## 1: alpha 500 0.41430 0.40221 0.1898,0.6705 77.34 1.126## 2: theta 500 2.06446 2.08657 1.958,2.248 128.33 1.030## 3: sigma 500 0.12321 0.12200 0.07823,0.17237 78.51 1.119## 4: sigmae 500 0.19983 0.20199 0.1779,0.2252 99.72 1.083## 5: H2e 500 0.43886 0.42389 0.2871,0.5550 99.58 1.100## 6: H2tInf 500 0.31451 0.31965 0.1278,0.4446 131.32 1.102## 7: H2tMax 500 0.31450 0.31935 0.1278,0.4445 131.24 1.100## 8: H2tMean 500 0.31343 0.31641 0.1267,0.4429 130.67 1.096## 9: g0 500 NA NA NA,NA 0.00 NA## 10: sigmaG2tMean 500 0.01823 0.01961 0.007174,0.029110 148.95 1.205## 11: sigmaG2tMax 500 0.01832 0.02005 0.007235,0.029455 151.11 1.225## 12: sigmaG2tInf 500 0.01832 0.02013 0.007236,0.029464 154.32 1.232## 13: logpost 500 NA 14.28151 1.609,17.014 36.65 1.315## 14: loglik 500 16.32733 NA NA,NA 0.00 NA## 15: AIC 500 -22.65466 NA NA,NA 0.00 NA## 16: AICc 500 -22.53320 NA NA,NA 0.00 NAThe G.R. diagnostic is used to check whether two random samplesoriginate from the same distribution. Values that are substantiallydifferent from 1.00 (in this case greater than 1.01) indicatesignificant difference between the two samples and possible need toincrease the number of MCMC iterations. Therefore, we rerun the fitspecifying that each chain should be run for\(4 \times 10^5\) iterations:
Now, both the density plots and the G.R. values indicate nearlyperfect convergence of the second and third chains. The agreementbetween the ML-estimates (black dots on the density plots) and theposterior density modes (approximate location of the peak in the densitycurves) shows that the prior does not inflict a bias on the MCMC sample.The mismatch between chain 1 and chains 2 and 3 suggests that theinformation about the POUMM parameters contained in the data disagreeswith or significantly improves our prior knowledge about theseparameters. This is the desired outcome of a Bayesian fit, inparticular, in the case of a weak (non-informed) prior, such as thedefault one.
## stat N MLE PostMean HPD ESS G.R.## 1: alpha 500 0.41430 0.38942 0.2104,0.5747 720.0 1.0015## 2: theta 500 2.06446 2.08602 1.987,2.195 720.0 1.0009## 3: sigma 500 0.12321 0.11770 0.07336,0.16939 720.0 0.9986## 4: sigmae 500 0.19983 0.20191 0.1802,0.2286 799.1 0.9994## 5: H2e 500 0.43886 0.42494 0.2658,0.5435 799.5 0.9999## 6: H2tInf 500 0.31451 0.30849 0.1470,0.4694 1274.0 1.0015## 7: H2tMax 500 0.31450 0.30831 0.1468,0.4694 1273.0 1.0015## 8: H2tMean 500 0.31343 0.30572 0.1437,0.4679 1263.7 1.0013## 9: g0 500 NA NA NA,NA 0.0 NA## 10: sigmaG2tMean 500 0.01823 0.01815 0.009029,0.030213 1266.7 1.0053## 11: sigmaG2tMax 500 0.01832 0.01838 0.008773,0.030115 1264.1 1.0058## 12: sigmaG2tInf 500 0.01832 0.01840 0.008774,0.030178 1264.3 1.0057## 13: logpost 500 NA 15.02485 12.17,17.07 720.0 1.0013## 14: loglik 500 16.32733 NA NA,NA 0.0 NA## 15: AIC 500 -22.65466 NA NA,NA 0.0 NA## 16: AICc 500 -22.53320 NA NA,NA 0.0 NAThe 95% high posterior density (HPD) intervals contain the truevalues for all five POUMM parameters (\(\alpha\),\(\theta\),\(\sigma\),\(\sigma_e\) and\(g_0\)). This is also true for the derivedstatistics. To check this, we calculate the true derived statistics fromthe true parameter values and check that these are well within thecorresponding HPD intervals:
tMean<-mean(nodeTimes(tree,tipsOnly =TRUE))tMax<-max(nodeTimes(tree,tipsOnly =TRUE))c(# phylogenetic heritability at mean root-tip distance:H2tMean =H2(alpha, sigma, sigmae,t = tMean),# phylogenetic heritability at long term equilibirium:H2tInf =H2(alpha, sigma, sigmae,t =Inf),# empirical (time-independent) phylogenetic heritability,H2e =H2e(z[1:N], sigmae),# genotypic variance at mean root-tip distance:sigmaG2tMean =varOU(t = tMean, alpha, sigma),# genotypic variance at max root-tip distance:sigmaG2tMean =varOU(t = tMax, alpha, sigma),# genotypic variance at long-term equilibrium:sigmaG2tInf =varOU(t =Inf, alpha, sigma) )## H2tMean H2tInf H2e sigmaG2tMean sigmaG2tMean sigmaG2tInf ## 0.49958 0.50000 0.43792 0.03993 0.04000 0.04000Finally, we compare the ratio of empirical genotypic to totalphenotypic variance with the HPD-interval for the phylogeneticheritability.
## H2empirical ## 0.6541## lower upper ## 0.2658 0.5435On computers with multiple core processors, it is possible tospeed-up the POUMM-fit by parallelization. The POUMM package supportsparallelization on two levels:
To control the maximum number of threads (defaults to all physical orvirtual cores on the system) by specifying the environment variableOMP_NUM_THREADS, before starting R e.g.:
export OMP_NUM_THREADS=4parallel. With thedefault settings of the MCMC-fit (executing two MCMC chains samplingfrom the posterior distribution and one MCMC chain sampling from theprior), this parallelization can result in about two times speed-up ofthe POUMM fit on a computer with at least two available physicalcores.# set up a parallel cluster on the local computer for parallel MCMC:cluster<- parallel::makeCluster(parallel::detectCores(logical =FALSE))doParallel::registerDoParallel(cluster)fitPOUMM<-POUMM(z[1:N], tree,spec=list(parallelMCMC =TRUE))# Don't forget to destroy the parallel cluster to avoid leaving zombie worker-processes.parallel::stopCluster(cluster)It is possible to use this parallelization in combination withparallelization of the likelihood calculation. This, however, has notbeen tested and, presumably, would be slower than a parallelization onthe likelihood level only. It may be appropriate to parallelize the MCMCchains on small trees, since for small trees, the parallel likelihoodcalculation is not likely to be much faster than a serial modecalculation. In this case the SPLITT library would switch automaticallyto serial mode, so there will be no parallel CPU at the likelihoodlevel.
Apart from base R functionality, the POUMM package uses a number of3rd party R-packages: