Movatterモバイル変換


[0]ホーム

URL:


Analyzing multivariate count data with thePoisson log-normal model

PLN team

2025-03-21

Preliminaries

This vignette illustrates the use of thePLN functionand the methods accompanying the R6 classPLNfit.

From the statistical point of view, the functionPLNadjusts a multivariate Poisson lognormal model to a table of counts,possibly after correcting for effects of offsets and covariates.PLN is the building block for all the multivariate modelsfound in thePLNmodels package: having a basicunderstanding of both the mathematical background and the associated setofR functions is a good place to start.

Requirements

The packages required for the analysis arePLNmodelsplus some others for data manipulation and representation:

library(PLNmodels)library(ggplot2)library(corrplot)

Data set

We illustrate our point with the trichoptera data set, a fulldescription of which can be found inthecorresponding vignette. Data preparation is also detailed inthe specific vignette.

data(trichoptera)trichoptera<-prepare_data(trichoptera$Abundance, trichoptera$Covariate)

Thetrichoptera data frame stores a matrix of counts(trichoptera$Abundance), a matrix of offsets(trichoptera$Offset) and some vectors of covariates(trichoptera$Wind,trichoptera$Temperature,etc.)

Mathematical background

The multivariate Poisson lognormal model (in short PLN, seeAitchison and Ho (1989)) relates some\(p\)-dimensional observation vectors\(\mathbf{Y}_i\) to some\(p\)-dimensional vectors of Gaussian latentvariables\(\mathbf{Z}_i\) asfollows

\[\begin{equation} \begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim\mathcal{N}({\boldsymbol\mu},\boldsymbol\Sigma), \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.}& \mathbf{Y}_i |\mathbf{Z}_i\sim\mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right). \end{array}\end{equation}\]

The parameter\({\boldsymbol\mu}\)corresponds to the main effects and the latent covariance matrix\(\boldsymbol\Sigma\) describes theunderlying residual structure of dependence between the\(p\) variables. The following figureprovides insights about the role played by the different layers

PLN: geometrical view

PLN: geometrical view

Covariates and offsets

This model generalizes naturally to a formulation closer to amultivariate generalized linear model, where the main effect is due to alinear combination of\(d\) covariates\(\mathbf{x}_i\) (including a vector ofintercepts). We also let the possibility to add some offsets for the\(p\) variables in in each sample, thatis\(\mathbf{o}_i\). Hence, theprevious model generalizes to

\[\begin{equation} \mathbf{Y}_i | \mathbf{Z}_i \sim\mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right), \qquad \mathbf{Z}_i \sim\mathcal{N}({\mathbf{o}_i +\mathbf{x}_i^\top\mathbf{B}},\boldsymbol\Sigma), \\\end{equation}\] where\(\mathbf{B}\) is a\(d\times p\) matrix of regressionparameters. When all individuals\(i=1,\dots,n\) are stacked together, thedata matrices available to feed the model are

  • the\(n\times p\) matrix of counts\(\mathbf{Y}\)
  • the\(n\times d\) matrix of design\(\mathbf{X}\)
  • the\(n\times p\) matrix of offsets\(\mathbf{O}\)

Inference in PLN then focuses on the regression parameters\(\mathbf{B}\) and on the covariance matrix\(\boldsymbol\Sigma\).

Optimization by Variational inference

Technically speaking, we adopt inPLNmodels avariational strategy to approximate the log-likelihood function andoptimize the consecutive variational surrogate of the log-likelihoodwith a gradient-ascent-based approach. To this end, we rely on the CCSAalgorithm ofSvanberg (2002) implemented in the C++ library(Johnson 2011), whichwe link to the package.

Analysis of trichoptera data with a PLN model

The standard PLN model described above is adjusted with the functionPLN. We now review its usage on a the trichoptera dataset.

A PLN model with latent main effects

Adjusting a fit

In order to become familiar with the functionPLN andits outputs, let us first fit a simple PLN model with just an interceptfor each species:

myPLN<-PLN(Abundance~1, trichoptera)
## ##  Initialization...##  Adjusting a full covariance PLN model with nlopt optimizer##  Post-treatments...##  DONE!

Note the use of theformula object to specify the model:the vector\(\boldsymbol\mu\) of maineffects in the mathematical formulation (one per column species) isspecified in the call with the term~ 1 in theright-hand-side of the formula.Abundance is a variable inthe data frametrichoptera corresponding to a matrix of 17columns and theresponse in the model, occurring on theleft-hand-side of the formula.

ThePLNfit object

myPLN is anR6 object with classPLNfit, which comes with a couple of methods, as recalledwhen printing/showing such an object in theR console:

myPLN
## A multivariate Poisson Lognormal fit with full covariance model.## ==================================================================##  nb_param    loglik       BIC       ICL##       170 -1130.047 -1460.852 -2229.083## ==================================================================## * Useful fields##     $model_par, $latent, $latent_pos, $var_par, $optim_par##     $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria## * Useful S3 methods##     print(), coef(), sigma(), vcov(), fitted()##     predict(), predict_cond(), standard_error()

See also?PLNfit for more comprehensive information.

Field access

Accessing public fields of aPLNfit object can be donejust like with a traditional list,e.g.,

c(myPLN$loglik, myPLN$BIC, myPLN$ICL)
## [1] -1130.047 -1460.852 -2229.083
myPLN$criteria
##   nb_param    loglik       BIC       ICL## 1      170 -1130.047 -1460.852 -2229.083

GLM-like interface

We provide a set of S3-methods forPLNfit that mimic thestandard (G)LM-like interface ofR::stats, which we presentnow.

One can access the fitted value of the counts (Abundance\(\hat{\mathbf{Y}}\)) and check thatthe algorithm basically learned correctly from the data1:

data.frame(fitted   =as.vector(fitted(myPLN)),observed =as.vector(trichoptera$Abundance))%>%ggplot(aes(x = observed,y = fitted))+geom_point(size = .5,alpha =.25 )+scale_x_log10()+scale_y_log10()+theme_bw()+annotation_logticks()
fitted value vs. observation
fitted value vs. observation

The residual correlation matrix better displays as an imagematrix:

myPLN%>%sigma()%>%cov2cor()%>%corrplot()

Observation weights

It is also possible to use observation weights like in standard(G)LMs:

myPLN_weighted<-PLN(    Abundance~1,data    = trichoptera,weights =runif(nrow(trichoptera)),control =PLN_param(trace =0)  )data.frame(unweighted =as.vector(fitted(myPLN)),weighted   =as.vector(fitted(myPLN_weighted)))%>%ggplot(aes(x = unweighted,y = weighted))+geom_point(size = .5,alpha =.25 )+scale_x_log10()+scale_y_log10()+theme_bw()+annotation_logticks()

Accounting for covariates and offsets

For ecological count data, it is generally a good advice to includethe sampling effort via an offset term whenever available, otherwisesamples are not necessarily comparable:

myPLN_offsets<-PLN(Abundance~1+offset(log(Offset)),data = trichoptera,control =PLN_param(trace =0))

Note that we use the functionoffset with alog-transform of the total counts2 since it acts in the latent layer of themodel. Obviously the model with offsets is better since thelog-likelihood is higher with the same number of parameters3:

rbind(  myPLN$criteria,  myPLN_offsets$criteria)%>% knitr::kable()
nb_paramloglikBICICL
170-1130.047-1460.852-2229.083
170-1051.798-1382.603-2214.572

Let us try to correct for the wind effect in our model:

myPLN_wind<-PLN(Abundance~1+ Wind+offset(log(Offset)),data = trichoptera)
## ##  Initialization...##  Adjusting a full covariance PLN model with nlopt optimizer##  Post-treatments...##  DONE!

When we compare the models, the gain is clear in terms oflog-likelihood. However, the BIC chooses not to include thisvariable:

rbind(  myPLN_offsets$criteria,  myPLN_wind$criteria)%>% knitr::kable()
nb_paramloglikBICICL
170-1051.798-1382.603-2214.572
187-1028.174-1392.060-2100.579

Covariance models (full, diagonal, spherical)

It is possible to change a bit the parametrization used for modelingthe residual covariance matrix\(\boldsymbol\Sigma\), and thus reduce thetotal number of parameters used in the model. By default, the residualcovariance is fully parameterized (hence\(p\times (p+1)/2\) parameters). However, we can chose to only modelthe variances of the species and not the covariances, by means of adiagonal matrix\(\boldsymbol\Sigma_D\)with only\(p\) parameters. In anextreme situation, we may also chose a single variance parameter for thewhole matrix\(\boldsymbol\Sigma = \sigma\mathbf{I}_p\). This can be tuned inPLN with thecontrol argument, a list controlling various aspects of theunderlying optimization process:

myPLN_spherical<-PLN(    Abundance~1+offset(log(Offset)),data = trichoptera,control =PLN_param(covariance ="spherical",trace =0)  )
myPLN_diagonal<-PLN(    Abundance~1+offset(log(Offset)),data = trichoptera,control =PLN_param(covariance ="diagonal",trace =0)  )

Note that, by default, the model chosen iscovariance = "spherical", so that the two following callsare equivalents:

myPLN_default<-PLN(Abundance~1,data = trichoptera, )
## ##  Initialization...##  Adjusting a full covariance PLN model with nlopt optimizer##  Post-treatments...##  DONE!
myPLN_full<-PLN(Abundance~1,data = trichoptera,control =PLN_param(covariance ="full"))
## ##  Initialization...##  Adjusting a full covariance PLN model with nlopt optimizer##  Post-treatments...##  DONE!

Different covariance models can then be compared with the usualcriteria: it seems that the gain brought by passing from a diagonalmatrix to a fully parameterized covariance is not worth having so manyadditional parameters:

rbind(  myPLN_offsets$criteria,  myPLN_diagonal$criteria,  myPLN_spherical$criteria)%>%as.data.frame(row.names =c("full","diagonal","spherical"))%>%  knitr::kable()
nb_paramloglikBICICL
full170-1051.798-1382.603-2214.572
diagonal34-1109.725-1175.886-2065.220
spherical18-1158.526-1193.552-2152.309

A final model that we can try is the diagonal one with the wind as acovariate, which gives a slight improvement.

myPLN_final<-PLN(    Abundance~1+ Wind+offset(log(Offset)),data    = trichoptera,control =PLN_param(covariance ="diagonal",trace =0)  )rbind(  myPLN_wind$criteria,  myPLN_diagonal$criteria,  myPLN_final$criteria)%>% knitr::kable()
nb_paramloglikBICICL
187-1028.174-1392.060-2100.579
34-1109.725-1175.886-2065.220
51-1073.012-1172.253-1733.586

References

Aitchison, J., and C. H. Ho. 1989.“The Multivariate Poisson-LogNormal Distribution.”Biometrika 76 (4): 643–53.
Johnson, Steven G. 2011.The NLopt Nonlinear-OptimizationPackage.https://nlopt.readthedocs.io/en/latest/.
Svanberg, Krister. 2002.“A Class of Globally ConvergentOptimization Methods Based on Conservative Convex SeparableApproximations.”SIAM Journal on Optimization 12 (2):555–73.

  1. We use a log-log scale in our plot in order not to givean excessive importance to the higher counts in the fit↩︎

  2. Note that if the offset is not computed on the samescale as the count, you might need a different transformation than thelog. To ensure that the offset are on the count-scale, you can use thescale = "count" argument inprepare_data(),see also thecorresponding vignette.↩︎

  3. InPLNmodels the R-squared is apseudo-R-squared that can only be trusted between model where the sameoffsets term was used↩︎


[8]ページ先頭

©2009-2025 Movatter.jp