Movatterモバイル変換


[0]ホーム

URL:


Chapter 15 – Predicting multivariateabundances – Exercise solutions and Code Boxes

David Warton

2024-07-04

Exercise 15.1: Predicting fish communities at wind farms?

Consider again Lena’s wind farm study (Exericise 14.3). She wouldlike to predict what fish occur where. What type of model should sheuse, to get the best predictions?

The focus here is prediction so she should be using some modeldesigned to have good predictive performance. One thing that might helpis putting random effects on terms, that take different values fordifferent species, to borrow strength across species.

Exercise 15.2: Which environmental variables predict hunting spidercommunities?

Petrus set up 28 pitfall traps on sand dunes, in differentenvironmental conditions (in the open, under shrubs, etc) and classifiedall hunting spiders that fell into the trap to species. He measured sixenvironmental variables characterising each site. He would like to know:Which environmental variables best predict hunting spidercommunities?

What analysis approach should he use?

This is a variable selection problem, he should be using multivariatemodel selection techniques. One approach is to construct models designedto be pretty good for prediction, and choose the one with bestpredictive performance. For example, he could fit mixed models, withrandom effects for different species, and use cross-validation to seewhich set of predictors gives the model has best predictiveperformance.

Code Box 15.1: Predictive likelihood for the wind farm data

library(ecostats)data(windFarms)set.seed(5)# use this seed to get the same results as below:nStations=length(levels(as.factor(windFarms$X$Station)))isTestStn=sample(nStations,nStations/2)isTest= windFarms$X$Station%in%levels(windFarms$X$Station)[isTestStn]library(mvabund)windMV=mvabund(windFarms$abund)windFt_Train=manyglm(windMV[isTest==FALSE,]~Year+Zone,data=windFarms$X[isTest==FALSE,],family="poisson")windFt_Int_Train=manyglm(windMV[isTest==FALSE,]~Year*Zone,data=windFarms$X[isTest==FALSE,],family="poisson")prWind_Test=predict.manyglm(windFt_Train,newdata=windFarms$X[isTest,],type="response")prWind_Int_Test=predict.manyglm(windFt_Int_Train,newdata=windFarms$X[isTest,],type="response")predLogL=dpois(windMV[isTest,],lambda=prWind_Test,log=TRUE)predLogL_Int=dpois(windMV[isTest,],lambda=prWind_Int_Test,log=TRUE)c(sum(predLogL),sum(predLogL_Int))#> [1] -931.5643 -928.3885

What do these results tell us about which model ispreferred?

These results suggest the main effect model is better.

Exercise 15.3: Cross-validation for wind farm data and rarespecies

Repeat the analyses of Code Box 15.1 but after removing rarespecies (observed less than 10 times), using the followingcode:

notRare=colSums(windMV>0)>10windMVnotRare=mvabund(windFarms$abund[,notRare])
windFt_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year+Zone,data=windFarms$X[isTest==FALSE,],family="poisson")windFt_Int_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year*Zone,data=windFarms$X[isTest==FALSE,],family="poisson")prWind_TestRare=predict.manyglm(windFt_TrainRare,newdata=windFarms$X[isTest,],type="response")prWind_Int_TestRare=predict.manyglm(windFt_Int_TrainRare,newdata=windFarms$X[isTest,],type="response")predLogLRare=dpois(windMVnotRare[isTest,],lambda=prWind_TestRare,log=TRUE)predLogL_IntRare=dpois(windMVnotRare[isTest,],lambda=prWind_Int_TestRare,log=TRUE)c(sum(predLogLRare),sum(predLogL_IntRare))#> [1] -781.0415 -788.9616

Did you get a similar answer?

Nope – now the model with an interaction term looks better!

Note that so far we have only considered one test sample, andthere is randomness in the choice of training/test split. Repeat theanalyses of Code Box 15.1 as well as those you have done here, with andwithout rare species, multiple times (which is a form ofcross-validation).

set.seed(1)# use this seed to get the same results as below:nStations=length(levels(as.factor(windFarms$X$Station)))isTestStn=sample(nStations,nStations/2)isTest= windFarms$X$Station%in%levels(windFarms$X$Station)[isTestStn]windFt_Train=manyglm(windMV[isTest==FALSE,]~Year+Zone,data=windFarms$X[isTest==FALSE,],family="poisson")windFt_Int_Train=manyglm(windMV[isTest==FALSE,]~Year*Zone,data=windFarms$X[isTest==FALSE,],family="poisson")prWind_Test=predict.manyglm(windFt_Train,newdata=windFarms$X[isTest,],type="response")prWind_Int_Test=predict.manyglm(windFt_Int_Train,newdata=windFarms$X[isTest,],type="response")predLogL=dpois(windMV[isTest,],lambda=prWind_Test,log=TRUE)predLogL_Int=dpois(windMV[isTest,],lambda=prWind_Int_Test,log=TRUE)c(sum(predLogL),sum(predLogL_Int))#> [1] -915.8434 -970.8314windFt_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year+Zone,data=windFarms$X[isTest==FALSE,],family="poisson")windFt_Int_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year*Zone,data=windFarms$X[isTest==FALSE,],family="poisson")prWind_TestRare=predict.manyglm(windFt_TrainRare,newdata=windFarms$X[isTest,],type="response")prWind_Int_TestRare=predict.manyglm(windFt_Int_TrainRare,newdata=windFarms$X[isTest,],type="response")predLogLRare=dpois(windMVnotRare[isTest,],lambda=prWind_TestRare,log=TRUE)predLogL_IntRare=dpois(windMVnotRare[isTest,],lambda=prWind_Int_TestRare,log=TRUE)c(sum(predLogLRare),sum(predLogL_IntRare))#> [1] -734.4901 -780.1963
set.seed(2)# use this seed to get the same results as below:nStations=length(levels(as.factor(windFarms$X$Station)))isTestStn=sample(nStations,nStations/2)isTest= windFarms$X$Station%in%levels(windFarms$X$Station)[isTestStn]library(mvabund)windMV=mvabund(windFarms$abund)windFt_Train=manyglm(windMV[isTest==FALSE,]~Year+Zone,data=windFarms$X[isTest==FALSE,],family="poisson")windFt_Int_Train=manyglm(windMV[isTest==FALSE,]~Year*Zone,data=windFarms$X[isTest==FALSE,],family="poisson")prWind_Test=predict.manyglm(windFt_Train,newdata=windFarms$X[isTest,],type="response")prWind_Int_Test=predict.manyglm(windFt_Int_Train,newdata=windFarms$X[isTest,],type="response")predLogL=dpois(windMV[isTest,],lambda=prWind_Test,log=TRUE)predLogL_Int=dpois(windMV[isTest,],lambda=prWind_Int_Test,log=TRUE)c(sum(predLogL),sum(predLogL_Int))#> [1] -886.4941 -992.3476windFt_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year+Zone,data=windFarms$X[isTest==FALSE,],family="poisson")windFt_Int_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year*Zone,data=windFarms$X[isTest==FALSE,],family="poisson")prWind_TestRare=predict.manyglm(windFt_TrainRare,newdata=windFarms$X[isTest,],type="response")prWind_Int_TestRare=predict.manyglm(windFt_Int_TrainRare,newdata=windFarms$X[isTest,],type="response")predLogLRare=dpois(windMVnotRare[isTest,],lambda=prWind_TestRare,log=TRUE)predLogL_IntRare=dpois(windMVnotRare[isTest,],lambda=prWind_Int_TestRare,log=TRUE)c(sum(predLogLRare),sum(predLogL_IntRare))#> [1] -756.8341 -865.5058
set.seed(3)# use this seed to get the same results as below:nStations=length(levels(as.factor(windFarms$X$Station)))isTestStn=sample(nStations,nStations/2)isTest= windFarms$X$Station%in%levels(windFarms$X$Station)[isTestStn]library(mvabund)windMV=mvabund(windFarms$abund)windFt_Train=manyglm(windMV[isTest==FALSE,]~Year+Zone,data=windFarms$X[isTest==FALSE,],family="poisson")windFt_Int_Train=manyglm(windMV[isTest==FALSE,]~Year*Zone,data=windFarms$X[isTest==FALSE,],family="poisson")prWind_Test=predict.manyglm(windFt_Train,newdata=windFarms$X[isTest,],type="response")prWind_Int_Test=predict.manyglm(windFt_Int_Train,newdata=windFarms$X[isTest,],type="response")predLogL=dpois(windMV[isTest,],lambda=prWind_Test,log=TRUE)predLogL_Int=dpois(windMV[isTest,],lambda=prWind_Int_Test,log=TRUE)c(sum(predLogL),sum(predLogL_Int))#> [1] -830.5662 -832.3659windFt_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year+Zone,data=windFarms$X[isTest==FALSE,],family="poisson")windFt_Int_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year*Zone,data=windFarms$X[isTest==FALSE,],family="poisson")prWind_TestRare=predict.manyglm(windFt_TrainRare,newdata=windFarms$X[isTest,],type="response")prWind_Int_TestRare=predict.manyglm(windFt_Int_TrainRare,newdata=windFarms$X[isTest,],type="response")predLogLRare=dpois(windMVnotRare[isTest,],lambda=prWind_TestRare,log=TRUE)predLogL_IntRare=dpois(windMVnotRare[isTest,],lambda=prWind_Int_TestRare,log=TRUE)c(sum(predLogLRare),sum(predLogL_IntRare))#> [1] -734.7738 -738.2911

Which set of results tends to be more reliable (less variable) –the ones with or without the rare species? Why do you think thishappened?

Without the rare species, the results are more reliable, withpredictive likelihood usually in the mid -700’s, and the interactionmodel consistently performing better (although sometimes narrowly so).With rare species, it bounces around a lot more, with the smaller model(main effects only) sometimes having better predictive performance. Thisis likely because rare species with all absences in one treatmentcombination can blow up the predictive likelihood (if they are predictedto have a mean of zero, but have some non-zero cases in test data).

Code Box 15.2: Fitting a mixed model to the wind farm data

windComp=manyglm(windMV~Zone*Year,data=windFarms$X,composition=TRUE)library(glmmTMB)wind_glmm=glmmTMB(windMV~Year*Zone+diag(Year*Zone|cols),family=poisson(),data=windComp$data)summary(wind_glmm)#>  Family: poisson  ( log )#> Formula:          windMV ~ Year * Zone + diag(Year * Zone | cols)#> Data: windComp$data#>#>      AIC      BIC   logLik deviance df.resid#>   3259.9   3331.4  -1617.9   3235.9     2852#>#> Random effects:#>#> Conditional model:#>  Groups Name           Variance Std.Dev. Corr#>  cols   (Intercept)    5.3421   2.3113#>         Year2010       1.6302   1.2768   0.00#>         ZoneN          0.6807   0.8251   0.00 0.00#>         ZoneS          3.9012   1.9751   0.00 0.00 0.00#>         Year2010:ZoneN 0.8519   0.9230   0.00 0.00 0.00 0.00#>         Year2010:ZoneS 0.3777   0.6146   0.00 0.00 0.00 0.00 0.00#> Number of obs: 2864, groups:  cols, 16#>#> Conditional model:#>                Estimate Std. Error z value Pr(>|z|)#> (Intercept)    -3.18913    0.65414  -4.875 1.09e-06 ***#> Year2010       -0.26989    0.45008  -0.600    0.549#> ZoneN          -0.15541    0.35209  -0.441    0.659#> ZoneS          -0.16619    0.61670  -0.269    0.788#> Year2010:ZoneN  0.08998    0.41305   0.218    0.828#> Year2010:ZoneS  0.56211    0.37247   1.509    0.131#> ---#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise 15.4: Predictive likelihood for wind farm mixed model

Calculate the predictive likelihood for the mixed model fitted tothe wind farm data (Code Box 15.2), following the method in Code Box15.1.

isTest= windComp$data$Station%in%levels(windComp$data$Station)[isTestStn]wind_glmm=glmmTMB(windMV~Year*Zone+diag(Year*Zone|cols),family=poisson(),data=windComp$data[isTest==FALSE,])prGLMM_Test=predict(wind_glmm,newdata=windComp$data[isTest==TRUE,],type="response")predGLMM=sum(dpois(windComp$data$windMV[isTest==TRUE],lambda=prGLMM_Test,log=TRUE))c(predGLMM,sum(predLogL))#> [1] -744.1239 -830.5662

Did this have better predictive performance? Why do you thinkthis happened?

This did waaaay better thanmanyglm. Probably because itwas borrowing strength across species to better predict rare ones, notoverfitting quite as badly.

Code Box 15.3: Fitting a LASSO to the wind farm data via glmnet

library(glmnet)X=model.matrix(windMV~Year*Zone*cols,data=windComp$data)y= windComp$data$windMVwindLasso=glmnet(X,y,family="poisson")isTest= windComp$data$Station%in%levels(windComp$data$Station)[isTestStn]windLassoTrain=glmnet(X[isTest==FALSE,], y[isTest==FALSE],family="poisson")prLassoTest=predict(windLassoTrain,X[isTest,],type="response")predLLlasso=colSums(dpois(windComp$data$windMV[isTest],prLassoTest,log=TRUE))plot(windLassoTrain$lambda,predLLlasso,type="l",log="x")
plot of chunk code15.3
plot of chunk code15.3
isBestLambda=which(predLLlasso==max(predLLlasso))
matplot(windLasso$lambda,t(windLasso$beta),type="n",log="x")matlines(windLasso$lambda,t(windLasso$beta))
plot of chunk code15.3beta
plot of chunk code15.3beta

Code Box 15.4: Fitting a group-LASSO to the wind farm data

library(grplasso)windLambdaTrain=lambdamax(windMV~Year*Zone*cols,data=windComp$data,subset=isTest==FALSE,model =PoissReg())*0.7^(0:19)windGrplasso=grplasso(windMV~Year*Zone*cols,data=windComp$data,lambda=windLambdaTrain,subset=isTest==FALSE,model =PoissReg())#> Lambda: 556.505  nr.var: 16#> Lambda: 389.5535  nr.var: 31#> Lambda: 272.6875  nr.var: 61#> Lambda: 190.8812  nr.var: 61#> Lambda: 133.6169  nr.var: 61#> Lambda: 93.5318  nr.var: 61#> Lambda: 65.47226  nr.var: 91#> Lambda: 45.83058  nr.var: 91#> Lambda: 32.08141  nr.var: 91#> Lambda: 22.45699  nr.var: 91#> Lambda: 15.71989  nr.var: 91#> Lambda: 11.00392  nr.var: 91#> Lambda: 7.702746  nr.var: 91#> Lambda: 5.391922  nr.var: 91#> Lambda: 3.774346  nr.var: 91#> Lambda: 2.642042  nr.var: 91#> Lambda: 1.849429  nr.var: 91#> Lambda: 1.294601  nr.var: 91#> Lambda: 0.9062204  nr.var: 91#> Lambda: 0.6343543  nr.var: 91prGrpTest=predict(windGrplasso,newdata=windComp$data[isTest,],type="response")predLLgrplasso=colSums(dpois(windComp$data$windMV[isTest], prGrpTest,log=TRUE))plot(windGrplasso$lambda,predLLgrplasso,log="x",type="l")
plot of chunk code15.4
plot of chunk code15.4
isBestLambdaGrplasso=which(predLLgrplasso==max(predLLgrplasso))
matplot(windGrplasso$lambda,t(windGrplasso$coefficients),type="n",log="x")matlines(windGrplasso$lambda,t(windGrplasso$coefficients))
plot of chunk code15.4beta
plot of chunk code15.4beta

Exercise 15.5: Comparing predictive likelihoods for the wind farmdata

Compare the predictive log-likelihoods from Code Box 15.1 andFigures 15.1c-d. These models were all applied using the same testdataset. Which model seems to fit the data better? Why do you think thisis?

c(sum(predLogL),sum(predLogL_Int),max(predLLlasso),max(predLLgrplasso))#> [1] -830.5662 -832.3659 -744.2159 -739.6487

The LASSO models fit way better, as expected. These models shrinkparameters together, borrowing strength across taxa. The two LASSO fitsdid about as well as each other.

Code Box 15.5: Reduced rank regression for the wind farm data

library(VGAM)wind_RR2=rrvglm(as.matrix(windFarms$abund)~Year*Zone,family=poissonff,data=windFarms$X,Rank=2)wind_manyglm=manyglm(windMV~Year*Zone,data=windFarms$X,family=poisson())c(BIC(wind_RR2),sum(BIC(wind_manyglm)))#> [1] 3317.985 3471.901zoneyear=interaction(windFarms$X$Zone,windFarms$X$Year)matplot(as.numeric(zoneyear),latvar(wind_RR2),pch=c(1,19))
plot of chunk code15.5
plot of chunk code15.5

Which model fits better, according to BIC?

The reduced rank approach fits better.

Code Box 15.6: Using the LASSO for Petrus’s spider data

library(mvabund)data(spider)spid.trait=traitglm(spider$abund,spider$x,method="cv.glm1path")#> No traits matrix entered, so will fit SDMs with different env response for each spp
plot of chunk code15.6
plot of chunk code15.6
library(lattice)a=max(abs(spid.trait$fourth.corner) )colort=colorRampPalette(c("blue","white","red"))plot.4th=levelplot(t(as.matrix(spid.trait$fourth.corner)),xlab="Environmental Variables",ylab="Species",col.regions=colort(100),at=seq(-a, a,length=100),scales =list(x=list(rot =45)) )print(plot.4th)
plot of chunk code15.6lattice
plot of chunk code15.6lattice

Which environmental variables seem to have the strongest effecton spider abundances?

soil.dry has some strongly negative coefficients,fallen.leaves andherb.layer also have somestrong (but positive) interactions.

Code Box 15.7: Mixed model prediction of spider abundances

spidXY=data.frame(scale(spider$x),spider$abund)# scale standardises data!library(reshape2)spiderLong=melt(id=1:6,spidXY,variable.name="cols")Xformula=paste(colnames(spider$x),collapse="+")fullFormula=formula(paste0("value~cols+",Xformula,"+(",Xformula,"|cols)"))library(glmmTMB)spid_glmm=glmmTMB(fullFormula,family=nbinom2(),data=spiderLong)#> Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence problem; non-positive-definite#> Hessian matrix. See vignette('troubleshooting')#> Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence problem; false convergence (8).#> See vignette('troubleshooting'), help('diagnose')summary(spid_glmm)#>  Family: nbinom2  ( log )#> Formula:          value ~ cols + soil.dry + bare.sand + fallen.leaves + moss +#>     herb.layer + reflection + (soil.dry + bare.sand + fallen.leaves +#>     moss + herb.layer + reflection | cols)#> Data: spiderLong#>#>      AIC      BIC   logLik deviance df.resid#>       NA       NA       NA       NA      289#>#> Random effects:#>#> Conditional model:#>  Groups Name          Variance Std.Dev. Corr#>  cols   (Intercept)   0.6278   0.7923#>         soil.dry      1.8414   1.3570   -0.87#>         bare.sand     0.1583   0.3979   -0.13 -0.06#>         fallen.leaves 0.4065   0.6376    0.66 -0.66 -0.71#>         moss          0.2394   0.4893   -0.68  0.38  0.53 -0.60#>         herb.layer    0.6218   0.7885   -0.91  0.80 -0.18 -0.37  0.44#>         reflection    0.6865   0.8286    0.18 -0.55  0.32  0.21  0.37 -0.36#> Number of obs: 336, groups:  cols, 12#>#> Dispersion parameter for nbinom2 family (): 1.51#>#> Conditional model:#>               Estimate Std. Error z value Pr(>|z|)#> (Intercept)    0.14046    0.38463   0.365 0.714972#> colsAlopcune   0.57040    0.35671   1.599 0.109811#> colsAlopfabr  -1.63338    0.58814  -2.777 0.005483 **#> colsArctlute  -1.95026    0.73851  -2.641 0.008270 **#> colsArctperi  -4.59785    1.36464  -3.369 0.000754 ***#> colsAuloalbi   0.08365    0.45974   0.182 0.855622#> colsPardlugu  -1.35449    0.69980  -1.936 0.052924 .#> colsPardmont   1.72206    0.42390   4.062 4.86e-05 ***#> colsPardnigr   0.96816    0.47566   2.035 0.041808 *#> colsPardpull   1.18339    0.67418   1.755 0.079205 .#> colsTrocterr   2.28614    0.37644   6.073 1.26e-09 ***#> colsZoraspin   0.37109    0.43828   0.847 0.397160#> soil.dry       0.63734    0.42406   1.503 0.132855#> bare.sand      0.06513    0.15932   0.409 0.682691#> fallen.leaves -0.35355    0.26504  -1.334 0.182218#> moss           0.13821    0.18615   0.743 0.457782#> herb.layer     1.11253    0.27857   3.994 6.50e-05 ***#> reflection    -0.01724    0.29874  -0.058 0.953983#> ---#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There are some convergence warnings here, which is fair enoughbecause there are only 12 species but heaps of correlated terms in thismodel. Results seem to be reasonably stable on multiple runs, tweakingsome settings, so we’ll go with this fit.

Which predictors seem to be most important to spider communitycomposition?

The largest variance component, by some margin, is forsoil.dry (about 1.8), followed byherb.layerandreflection.


[8]ページ先頭

©2009-2025 Movatter.jp