Movatterモバイル変換

Preliminaries

This vignette illustrates the use of thePLN functionand the methods accompanying the R6 classPLNfit.

From the statistical point of view, the functionPLNadjusts a multivariate Poisson lognormal model to a table of counts,possibly after correcting for effects of offsets and covariates.PLN is the building block for all the multivariate modelsfound in thePLNmodels package: having a basicunderstanding of both the mathematical background and the associated setofR functions is a good place to start.

Requirements

The packages required for the analysis arePLNmodelsplus some others for data manipulation and representation:

library(PLNmodels)library(ggplot2)library(corrplot)

Data set

We illustrate our point with the trichoptera data set, a fulldescription of which can be found inthecorresponding vignette. Data preparation is also detailed inthe specific vignette.

data(trichoptera)trichoptera<-prepare_data(trichoptera$Abundance, trichoptera$Covariate)

Thetrichoptera data frame stores a matrix of counts(trichoptera$Abundance), a matrix of offsets(trichoptera$Offset) and some vectors of covariates(trichoptera$Wind,trichoptera$Temperature,etc.)

Mathematical background

The multivariate Poisson lognormal model (in short PLN, seeAitchison and Ho (1989)) relates some$p$-dimensional observation vectors$\mathbf{Y}_i$ to some$p$-dimensional vectors of Gaussian latentvariables$\mathbf{Z}_i$ asfollows

\[\begin{equation} \begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim\mathcal{N}({\boldsymbol\mu},\boldsymbol\Sigma), \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.}& \mathbf{Y}_i |\mathbf{Z}_i\sim\mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right). \end{array}\end{equation}\]

The parameter${\boldsymbol\mu}$corresponds to the main effects and the latent covariance matrix$\boldsymbol\Sigma$ describes theunderlying residual structure of dependence between the$p$ variables. The following figureprovides insights about the role played by the different layers

PLN: geometrical view

Covariates and offsets

This model generalizes naturally to a formulation closer to amultivariate generalized linear model, where the main effect is due to alinear combination of$d$ covariates$\mathbf{x}_i$ (including a vector ofintercepts). We also let the possibility to add some offsets for the$p$ variables in in each sample, thatis$\mathbf{o}_i$. Hence, theprevious model generalizes to

\[\begin{equation} \mathbf{Y}_i | \mathbf{Z}_i \sim\mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right), \qquad \mathbf{Z}_i \sim\mathcal{N}({\mathbf{o}_i +\mathbf{x}_i^\top\mathbf{B}},\boldsymbol\Sigma), \\\end{equation}\] where$\mathbf{B}$ is a$d\times p$ matrix of regressionparameters. When all individuals$i=1,\dots,n$ are stacked together, thedata matrices available to feed the model are

the$n\times p$ matrix of counts$\mathbf{Y}$
the$n\times d$ matrix of design$\mathbf{X}$
the$n\times p$ matrix of offsets$\mathbf{O}$

Inference in PLN then focuses on the regression parameters$\mathbf{B}$ and on the covariance matrix$\boldsymbol\Sigma$.

Optimization by Variational inference

Technically speaking, we adopt inPLNmodels avariational strategy to approximate the log-likelihood function andoptimize the consecutive variational surrogate of the log-likelihoodwith a gradient-ascent-based approach. To this end, we rely on the CCSAalgorithm ofSvanberg (2002) implemented in the C++ library(Johnson 2011), whichwe link to the package.

Analysis of trichoptera data with a PLN model

The standard PLN model described above is adjusted with the functionPLN. We now review its usage on a the trichoptera dataset.

A PLN model with latent main effects

Adjusting a fit

In order to become familiar with the functionPLN andits outputs, let us first fit a simple PLN model with just an interceptfor each species:

myPLN<-PLN(Abundance~1, trichoptera)

## ##  Initialization...##  Adjusting a full covariance PLN model with nlopt optimizer##  Post-treatments...##  DONE!

Note the use of theformula object to specify the model:the vector$\boldsymbol\mu$ of maineffects in the mathematical formulation (one per column species) isspecified in the call with the term~ 1 in theright-hand-side of the formula.Abundance is a variable inthe data frametrichoptera corresponding to a matrix of 17columns and theresponse in the model, occurring on theleft-hand-side of the formula.

The`PLNfit` object

myPLN is anR6 object with classPLNfit, which comes with a couple of methods, as recalledwhen printing/showing such an object in theR console:

myPLN

## A multivariate Poisson Lognormal fit with full covariance model.## ==================================================================##  nb_param    loglik       BIC       ICL##       170 -1130.047 -1460.852 -2229.083## ==================================================================## * Useful fields##     $model_par, $latent, $latent_pos, $var_par, $optim_par##     $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria## * Useful S3 methods##     print(), coef(), sigma(), vcov(), fitted()##     predict(), predict_cond(), standard_error()

See also?PLNfit for more comprehensive information.

Field access

Accessing public fields of aPLNfit object can be donejust like with a traditional list,e.g.,

c(myPLN$loglik, myPLN$BIC, myPLN$ICL)

## [1] -1130.047 -1460.852 -2229.083

myPLN$criteria

##   nb_param    loglik       BIC       ICL## 1      170 -1130.047 -1460.852 -2229.083

GLM-like interface

We provide a set of S3-methods forPLNfit that mimic thestandard (G)LM-like interface ofR::stats, which we presentnow.

One can access the fitted value of the counts (Abundance–$\hat{\mathbf{Y}}$) and check thatthe algorithm basically learned correctly from the data¹:

data.frame(fitted   =as.vector(fitted(myPLN)),observed =as.vector(trichoptera$Abundance))%>%ggplot(aes(x = observed,y = fitted))+geom_point(size = .5,alpha =.25 )+scale_x_log10()+scale_y_log10()+theme_bw()+annotation_logticks()

fitted value vs. observation

The residual correlation matrix better displays as an imagematrix:

myPLN%>%sigma()%>%cov2cor()%>%corrplot()

Observation weights

It is also possible to use observation weights like in standard(G)LMs:

myPLN_weighted<-PLN(    Abundance~1,data    = trichoptera,weights =runif(nrow(trichoptera)),control =PLN_param(trace =0)  )data.frame(unweighted =as.vector(fitted(myPLN)),weighted   =as.vector(fitted(myPLN_weighted)))%>%ggplot(aes(x = unweighted,y = weighted))+geom_point(size = .5,alpha =.25 )+scale_x_log10()+scale_y_log10()+theme_bw()+annotation_logticks()

Accounting for covariates and offsets

For ecological count data, it is generally a good advice to includethe sampling effort via an offset term whenever available, otherwisesamples are not necessarily comparable:

myPLN_offsets<-PLN(Abundance~1+offset(log(Offset)),data = trichoptera,control =PLN_param(trace =0))

Note that we use the functionoffset with alog-transform of the total counts² since it acts in the latent layer of themodel. Obviously the model with offsets is better since thelog-likelihood is higher with the same number of parameters³:

rbind(  myPLN$criteria,  myPLN_offsets$criteria)%>% knitr::kable()

nb_param	loglik	BIC	ICL
170	-1130.047	-1460.852	-2229.083
170	-1051.798	-1382.603	-2214.572

Let us try to correct for the wind effect in our model:

myPLN_wind<-PLN(Abundance~1+ Wind+offset(log(Offset)),data = trichoptera)

## ##  Initialization...##  Adjusting a full covariance PLN model with nlopt optimizer##  Post-treatments...##  DONE!

When we compare the models, the gain is clear in terms oflog-likelihood. However, the BIC chooses not to include thisvariable:

rbind(  myPLN_offsets$criteria,  myPLN_wind$criteria)%>% knitr::kable()

nb_param	loglik	BIC	ICL
170	-1051.798	-1382.603	-2214.572
187	-1028.174	-1392.060	-2100.579

Covariance models (full, diagonal, spherical)

It is possible to change a bit the parametrization used for modelingthe residual covariance matrix$\boldsymbol\Sigma$, and thus reduce thetotal number of parameters used in the model. By default, the residualcovariance is fully parameterized (hence$p\times (p+1)/2$ parameters). However, we can chose to only modelthe variances of the species and not the covariances, by means of adiagonal matrix$\boldsymbol\Sigma_D$with only$p$ parameters. In anextreme situation, we may also chose a single variance parameter for thewhole matrix$\boldsymbol\Sigma = \sigma\mathbf{I}_p$. This can be tuned inPLN with thecontrol argument, a list controlling various aspects of theunderlying optimization process:

myPLN_spherical<-PLN(    Abundance~1+offset(log(Offset)),data = trichoptera,control =PLN_param(covariance ="spherical",trace =0)  )

myPLN_diagonal<-PLN(    Abundance~1+offset(log(Offset)),data = trichoptera,control =PLN_param(covariance ="diagonal",trace =0)  )

Note that, by default, the model chosen iscovariance = "spherical", so that the two following callsare equivalents:

myPLN_default<-PLN(Abundance~1,data = trichoptera, )

## ##  Initialization...##  Adjusting a full covariance PLN model with nlopt optimizer##  Post-treatments...##  DONE!

myPLN_full<-PLN(Abundance~1,data = trichoptera,control =PLN_param(covariance ="full"))

## ##  Initialization...##  Adjusting a full covariance PLN model with nlopt optimizer##  Post-treatments...##  DONE!

Different covariance models can then be compared with the usualcriteria: it seems that the gain brought by passing from a diagonalmatrix to a fully parameterized covariance is not worth having so manyadditional parameters:

rbind(  myPLN_offsets$criteria,  myPLN_diagonal$criteria,  myPLN_spherical$criteria)%>%as.data.frame(row.names =c("full","diagonal","spherical"))%>%  knitr::kable()

	nb_param	loglik	BIC	ICL
full	170	-1051.798	-1382.603	-2214.572
diagonal	34	-1109.725	-1175.886	-2065.220
spherical	18	-1158.526	-1193.552	-2152.309

A final model that we can try is the diagonal one with the wind as acovariate, which gives a slight improvement.

myPLN_final<-PLN(    Abundance~1+ Wind+offset(log(Offset)),data    = trichoptera,control =PLN_param(covariance ="diagonal",trace =0)  )rbind(  myPLN_wind$criteria,  myPLN_diagonal$criteria,  myPLN_final$criteria)%>% knitr::kable()

nb_param	loglik	BIC	ICL
187	-1028.174	-1392.060	-2100.579
34	-1109.725	-1175.886	-2065.220
51	-1073.012	-1172.253	-1733.586

Movatterモバイル変換

Analyzing multivariate count data with thePoisson log-normal model

PLN team

2025-03-21

Preliminaries

Requirements

Data set

Mathematical background

Covariates and offsets

Optimization by Variational inference

Analysis of trichoptera data with a PLN model

A PLN model with latent main effects

Adjusting a fit

The`PLNfit` object

Field access

GLM-like interface

Observation weights

Accounting for covariates and offsets

Covariance models (full, diagonal, spherical)

References

Movatterモバイル変換

Analyzing multivariate count data with thePoisson log-normal model

PLN team

2025-03-21

Preliminaries

Requirements

Data set

Mathematical background

Covariates and offsets

Optimization by Variational inference

Analysis of trichoptera data with a PLN model

A PLN model with latent main effects

Adjusting a fit

ThePLNfit object

Field access

GLM-like interface

Observation weights

Accounting for covariates and offsets

Covariance models (full, diagonal, spherical)

References

The`PLNfit` object