Consider again Lena’s wind farm study (Exercise 14.3). She would like to predict which fish occur where. What type of model should she use, to get the best predictions?
The focus here is prediction, so she should use a model designed for good predictive performance. One thing that might help is putting random effects on model terms, taking different values for different species, to borrow strength across species.
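As a concrete sketch, one option is a GLMM with species-specific random effects, fitted with glmmTMB; this is the same construction used for the wind farm data in Code Box 15.2 later in this chapter (the composition=TRUE trick stacks the abundances in long format, with a species identifier called cols):

library(ecostats)
library(mvabund)
library(glmmTMB)
data(windFarms)
windMV = mvabund(windFarms$abund)
# stack abundances in long format, with a 'cols' column identifying species
windComp = manyglm(windMV ~ Zone * Year, data = windFarms$X, composition = TRUE)
# Poisson GLMM with Year*Zone effects that vary randomly across species
wind_glmm = glmmTMB(windMV ~ Year * Zone + diag(Year * Zone | cols),
    family = poisson(), data = windComp$data)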
Petrus set up 28 pitfall traps on sand dunes, in different environmental conditions (in the open, under shrubs, etc.), and classified all hunting spiders that fell into the traps to species. He measured six environmental variables characterising each site. He would like to know: which environmental variables best predict hunting spider communities?
What analysis approach should he use?
This is a variable selection problem, so he should use multivariate model selection techniques. One approach is to construct models designed to be good at prediction, and choose the one with the best predictive performance. For example, he could fit mixed models with random effects for different species, and use cross-validation to see which set of predictors gives the best predictive performance.
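Below is a hedged sketch of that idea for Petrus’s spider data (the particular split, candidate predictor sets, and object names are illustrative, not from the text): hold out some sites, fit GLMMs with species-specific random slopes for two candidate predictor sets, and compare predictive log-likelihoods on the held-out sites.

library(mvabund)
library(glmmTMB)
library(reshape2)
data(spider)
# long format: six standardised predictors, a site label, one row per site x species
spidXY = data.frame(scale(spider$x), spider$abund,
    site = factor(seq_len(nrow(spider$abund))))
spiderLong = melt(spidXY, id.vars = c(colnames(spider$x), "site"),
    variable.name = "cols")
set.seed(1)
isTest = spiderLong$site %in% sample(levels(spiderLong$site), 9)
# two candidate predictor sets, each with random species-specific slopes
fit1 = glmmTMB(value ~ cols + soil.dry + (soil.dry | cols),
    family = nbinom2(), data = spiderLong[!isTest, ])
fit2 = glmmTMB(value ~ cols + soil.dry + herb.layer +
    (soil.dry + herb.layer | cols),
    family = nbinom2(), data = spiderLong[!isTest, ])
# negative binomial predictive log-likelihood on the held-out sites
predLL = function(fit) {
    mu = predict(fit, newdata = spiderLong[isTest, ], type = "response")
    sum(dnbinom(spiderLong$value[isTest], mu = mu, size = sigma(fit), log = TRUE))
}
c(predLL(fit1), predLL(fit2))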
library(ecostats)
data(windFarms)
set.seed(5) # use this seed to get the same results as below
nStations = length(levels(as.factor(windFarms$X$Station)))
isTestStn = sample(nStations, nStations/2)
isTest = windFarms$X$Station %in% levels(windFarms$X$Station)[isTestStn]
library(mvabund)
windMV = mvabund(windFarms$abund)
windFt_Train = manyglm(windMV[isTest==FALSE,] ~ Year + Zone,
    data = windFarms$X[isTest==FALSE,], family = "poisson")
windFt_Int_Train = manyglm(windMV[isTest==FALSE,] ~ Year * Zone,
    data = windFarms$X[isTest==FALSE,], family = "poisson")
prWind_Test = predict.manyglm(windFt_Train, newdata = windFarms$X[isTest,],
    type = "response")
prWind_Int_Test = predict.manyglm(windFt_Int_Train, newdata = windFarms$X[isTest,],
    type = "response")
predLogL = dpois(windMV[isTest,], lambda = prWind_Test, log = TRUE)
predLogL_Int = dpois(windMV[isTest,], lambda = prWind_Int_Test, log = TRUE)
c(sum(predLogL), sum(predLogL_Int))
#> [1] -931.5643 -928.3885

What do these results tell us about which model is preferred?
These results suggest the interaction model is slightly better: it has the higher predictive log-likelihood (-928.4, versus -931.6 for the main effects model).
Repeat the analyses of Code Box 15.1, but after removing rare species (observed fewer than 10 times), using the following code:
# windMVnotRare is not defined in the original code; one way to construct it
# (assuming "observed" means recorded as present in fewer than 10 samples):
windMVnotRare = mvabund(windFarms$abund[, colSums(windFarms$abund > 0) >= 10])
windFt_TrainRare = manyglm(windMVnotRare[isTest==FALSE,] ~ Year + Zone,
    data = windFarms$X[isTest==FALSE,], family = "poisson")
windFt_Int_TrainRare = manyglm(windMVnotRare[isTest==FALSE,] ~ Year * Zone,
    data = windFarms$X[isTest==FALSE,], family = "poisson")
prWind_TestRare = predict.manyglm(windFt_TrainRare, newdata = windFarms$X[isTest,],
    type = "response")
prWind_Int_TestRare = predict.manyglm(windFt_Int_TrainRare, newdata = windFarms$X[isTest,],
    type = "response")
predLogLRare = dpois(windMVnotRare[isTest,], lambda = prWind_TestRare, log = TRUE)
predLogL_IntRare = dpois(windMVnotRare[isTest,], lambda = prWind_Int_TestRare, log = TRUE)
c(sum(predLogLRare), sum(predLogL_IntRare))
#> [1] -781.0415 -788.9616

Did you get a similar answer?
Nope – now the main effects model looks better!
Note that so far we have only considered one test sample, and there is randomness in the choice of training/test split. Repeat the analyses of Code Box 15.1, as well as those you have done here, with and without rare species, multiple times (which is a form of cross-validation).
# loop over a few seeds, repeating the train/test comparison each time,
# both with and without the rare species
for (iSeed in 1:3) {
    set.seed(iSeed) # use these seeds to get the same results as below
    isTestStn = sample(nStations, nStations/2)
    isTest = windFarms$X$Station %in% levels(windFarms$X$Station)[isTestStn]
    windFt_Train = manyglm(windMV[isTest==FALSE,] ~ Year + Zone,
        data = windFarms$X[isTest==FALSE,], family = "poisson")
    windFt_Int_Train = manyglm(windMV[isTest==FALSE,] ~ Year * Zone,
        data = windFarms$X[isTest==FALSE,], family = "poisson")
    prWind_Test = predict.manyglm(windFt_Train, newdata = windFarms$X[isTest,],
        type = "response")
    prWind_Int_Test = predict.manyglm(windFt_Int_Train, newdata = windFarms$X[isTest,],
        type = "response")
    predLogL = dpois(windMV[isTest,], lambda = prWind_Test, log = TRUE)
    predLogL_Int = dpois(windMV[isTest,], lambda = prWind_Int_Test, log = TRUE)
    print(c(sum(predLogL), sum(predLogL_Int)))
    windFt_TrainRare = manyglm(windMVnotRare[isTest==FALSE,] ~ Year + Zone,
        data = windFarms$X[isTest==FALSE,], family = "poisson")
    windFt_Int_TrainRare = manyglm(windMVnotRare[isTest==FALSE,] ~ Year * Zone,
        data = windFarms$X[isTest==FALSE,], family = "poisson")
    prWind_TestRare = predict.manyglm(windFt_TrainRare, newdata = windFarms$X[isTest,],
        type = "response")
    prWind_Int_TestRare = predict.manyglm(windFt_Int_TrainRare, newdata = windFarms$X[isTest,],
        type = "response")
    predLogLRare = dpois(windMVnotRare[isTest,], lambda = prWind_TestRare, log = TRUE)
    predLogL_IntRare = dpois(windMVnotRare[isTest,], lambda = prWind_Int_TestRare, log = TRUE)
    print(c(sum(predLogLRare), sum(predLogL_IntRare)))
}
#> [1] -915.8434 -970.8314
#> [1] -734.4901 -780.1963
#> [1] -886.4941 -992.3476
#> [1] -756.8341 -865.5058
#> [1] -830.5662 -832.3659
#> [1] -734.7738 -738.2911

Which set of results tends to be more reliable (less variable) – the ones with or without the rare species? Why do you think this happened?
Without the rare species, the results are more reliable, with predictive log-likelihoods usually in the mid -700s, and the main effects model consistently performing better (although sometimes only narrowly). With the rare species included, results bounce around a lot more, and the larger model (with the interaction) sometimes has better predictive performance. This is likely because rare species with all absences in one treatment combination can blow up the predictive log-likelihood: if a species is predicted to have a mean of zero but has some non-zero counts in the test data, those test observations are essentially impossible under the fitted model.
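A quick numeric illustration of that last point, computed directly from the Poisson log-likelihood:

# a test count of 1 for a species whose fitted mean is essentially zero
# contributes a huge negative term to the predictive log-likelihood
dpois(1, lambda = 1e-8, log = TRUE)
#> [1] -18.42068
# the same count under a modest fitted mean is far less costly
dpois(1, lambda = 0.1, log = TRUE)
#> [1] -2.402585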
windComp = manyglm(windMV ~ Zone * Year, data = windFarms$X, composition = TRUE)
library(glmmTMB)
wind_glmm = glmmTMB(windMV ~ Year * Zone + diag(Year * Zone | cols),
    family = poisson(), data = windComp$data)
summary(wind_glmm)
#>  Family: poisson  ( log )
#> Formula: windMV ~ Year * Zone + diag(Year * Zone | cols)
#> Data: windComp$data
#>
#>      AIC      BIC   logLik deviance df.resid
#>   3259.9   3331.4  -1617.9   3235.9     2852
#>
#> Random effects:
#>
#> Conditional model:
#>  Groups Name           Variance Std.Dev. Corr
#>  cols   (Intercept)    5.3421   2.3113
#>         Year2010       1.6302   1.2768   0.00
#>         ZoneN          0.6807   0.8251   0.00 0.00
#>         ZoneS          3.9012   1.9751   0.00 0.00 0.00
#>         Year2010:ZoneN 0.8519   0.9230   0.00 0.00 0.00 0.00
#>         Year2010:ZoneS 0.3777   0.6146   0.00 0.00 0.00 0.00 0.00
#> Number of obs: 2864, groups:  cols, 16
#>
#> Conditional model:
#>                Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    -3.18913    0.65414  -4.875 1.09e-06 ***
#> Year2010       -0.26989    0.45008  -0.600    0.549
#> ZoneN          -0.15541    0.35209  -0.441    0.659
#> ZoneS          -0.16619    0.61670  -0.269    0.788
#> Year2010:ZoneN  0.08998    0.41305   0.218    0.828
#> Year2010:ZoneS  0.56211    0.37247   1.509    0.131
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Calculate the predictive log-likelihood for the mixed model fitted to the wind farm data (Code Box 15.2), following the method in Code Box 15.1.
isTest = windComp$data$Station %in% levels(windComp$data$Station)[isTestStn]
wind_glmm = glmmTMB(windMV ~ Year * Zone + diag(Year * Zone | cols),
    family = poisson(), data = windComp$data[isTest==FALSE,])
prGLMM_Test = predict(wind_glmm, newdata = windComp$data[isTest==TRUE,],
    type = "response")
predGLMM = sum(dpois(windComp$data$windMV[isTest==TRUE], lambda = prGLMM_Test,
    log = TRUE))
c(predGLMM, sum(predLogL))
#> [1] -744.1239 -830.5662

Did this have better predictive performance? Why do you think this happened?
This did much better than manyglm, probably because it was borrowing strength across species to better predict the rare ones, without overfitting as badly.
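If you want to see the shrinkage at work (a quick check, not part of the original code box), the species-level deviations estimated by the mixed model can be extracted directly; rare species get pulled towards the community-level fixed effects rather than extreme estimates of their own:

# species-specific random effect estimates from the training fit
head(ranef(wind_glmm)$cond$cols, 3)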
library(glmnet)
X = model.matrix(windMV ~ Year * Zone * cols, data = windComp$data)
y = windComp$data$windMV
windLasso = glmnet(X, y, family = "poisson")
isTest = windComp$data$Station %in% levels(windComp$data$Station)[isTestStn]
windLassoTrain = glmnet(X[isTest==FALSE,], y[isTest==FALSE], family = "poisson")
prLassoTest = predict(windLassoTrain, X[isTest,], type = "response")
predLLlasso = colSums(dpois(windComp$data$windMV[isTest], prLassoTest, log = TRUE))
plot(windLassoTrain$lambda, predLLlasso, type = "l", log = "x")
matplot(windLasso$lambda, t(windLasso$beta), type = "n", log = "x")
matlines(windLasso$lambda, t(windLasso$beta))
library(grplasso)
windLambdaTrain = lambdamax(windMV ~ Year * Zone * cols, data = windComp$data,
    subset = isTest==FALSE, model = PoissReg()) * 0.7^(0:19)
windGrplasso = grplasso(windMV ~ Year * Zone * cols, data = windComp$data,
    lambda = windLambdaTrain, subset = isTest==FALSE, model = PoissReg())
#> Lambda: 556.505     nr.var: 16
#> Lambda: 389.5535    nr.var: 31
#> Lambda: 272.6875    nr.var: 61
#> Lambda: 190.8812    nr.var: 61
#> Lambda: 133.6169    nr.var: 61
#> Lambda: 93.5318     nr.var: 61
#> Lambda: 65.47226    nr.var: 91
#> Lambda: 45.83058    nr.var: 91
#> Lambda: 32.08141    nr.var: 91
#> Lambda: 22.45699    nr.var: 91
#> Lambda: 15.71989    nr.var: 91
#> Lambda: 11.00392    nr.var: 91
#> Lambda: 7.702746    nr.var: 91
#> Lambda: 5.391922    nr.var: 91
#> Lambda: 3.774346    nr.var: 91
#> Lambda: 2.642042    nr.var: 91
#> Lambda: 1.849429    nr.var: 91
#> Lambda: 1.294601    nr.var: 91
#> Lambda: 0.9062204   nr.var: 91
#> Lambda: 0.6343543   nr.var: 91
prGrpTest = predict(windGrplasso, newdata = windComp$data[isTest,],
    type = "response")
predLLgrplasso = colSums(dpois(windComp$data$windMV[isTest], prGrpTest, log = TRUE))
plot(windGrplasso$lambda, predLLgrplasso, log = "x", type = "l")
matplot(windGrplasso$lambda, t(windGrplasso$coefficients), type = "n", log = "x")
matlines(windGrplasso$lambda, t(windGrplasso$coefficients))

Compare the predictive log-likelihoods from Code Box 15.1 and Figures 15.1c-d. These models were all applied to the same test dataset. Which model seems to fit the data better? Why do you think this is?
c(sum(predLogL), sum(predLogL_Int), max(predLLlasso), max(predLLgrplasso))
#> [1] -830.5662 -832.3659 -744.2159 -739.6487

The LASSO models fit much better, as expected: they shrink parameters together, borrowing strength across taxa. The two LASSO fits did about as well as each other.
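The tuning constants that achieved the best hold-out performance can also be read straight off the stored paths:

# lambda values maximising the predictive log-likelihood, for each LASSO fit
windLassoTrain$lambda[which.max(predLLlasso)]
windGrplasso$lambda[which.max(predLLgrplasso)]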
library(VGAM)
wind_RR2 = rrvglm(as.matrix(windFarms$abund) ~ Year * Zone, family = poissonff,
    data = windFarms$X, Rank = 2)
wind_manyglm = manyglm(windMV ~ Year * Zone, data = windFarms$X, family = poisson())
c(BIC(wind_RR2), sum(BIC(wind_manyglm)))
#> [1] 3317.985 3471.901
zoneyear = interaction(windFarms$X$Zone, windFarms$X$Year)
matplot(as.numeric(zoneyear), latvar(wind_RR2), pch = c(1, 19))

Which model fits better, according to BIC?
The reduced rank model fits better, with a BIC of about 3318 compared to 3472 for the separate manyglm fits.
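A natural follow-up (not in the original code box) is to check whether the second latent variable earns its keep, by fitting a rank-1 model and comparing BIC again:

# a reduced rank model with a single latent variable, for comparison
wind_RR1 = rrvglm(as.matrix(windFarms$abund) ~ Year * Zone, family = poissonff,
    data = windFarms$X, Rank = 1)
c(BIC(wind_RR1), BIC(wind_RR2))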
library(mvabund)
data(spider)
spid.trait = traitglm(spider$abund, spider$x, method = "cv.glm1path")
#> No traits matrix entered, so will fit SDMs with different env response for each spp
library(lattice)
a = max(abs(spid.trait$fourth.corner))
colort = colorRampPalette(c("blue", "white", "red"))
plot.4th = levelplot(t(as.matrix(spid.trait$fourth.corner)),
    xlab = "Environmental Variables", ylab = "Species",
    col.regions = colort(100), at = seq(-a, a, length = 100),
    scales = list(x = list(rot = 45)))
print(plot.4th)

Which environmental variables seem to have the strongest effect on spider abundances?
soil.dry has some strongly negative coefficients; fallen.leaves and herb.layer also have some strong (but positive) coefficients.
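To make the plot’s message concrete, the coefficients can be ranked by absolute size (a small helper, not part of the original code; row and col here are just the dimnames of the fourth-corner matrix):

fc = as.matrix(spid.trait$fourth.corner)
fcLong = data.frame(row = rownames(fc)[row(fc)], col = colnames(fc)[col(fc)],
    coef = as.vector(fc))
# the largest standardised coefficients, in decreasing order of magnitude
head(fcLong[order(-abs(fcLong$coef)), ])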
spidXY = data.frame(scale(spider$x), spider$abund) # scale standardises the predictors!
library(reshape2)
spiderLong = melt(spidXY, id = 1:6, variable.name = "cols")
Xformula = paste(colnames(spider$x), collapse = "+")
fullFormula = formula(paste0("value~cols+", Xformula, "+(", Xformula, "|cols)"))
library(glmmTMB)
spid_glmm = glmmTMB(fullFormula, family = nbinom2(), data = spiderLong)
#> Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence problem; non-positive-definite
#> Hessian matrix. See vignette('troubleshooting')
#> Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence problem; false convergence (8).
#> See vignette('troubleshooting'), help('diagnose')
summary(spid_glmm)
#>  Family: nbinom2  ( log )
#> Formula: value ~ cols + soil.dry + bare.sand + fallen.leaves + moss +
#>     herb.layer + reflection + (soil.dry + bare.sand + fallen.leaves +
#>     moss + herb.layer + reflection | cols)
#> Data: spiderLong
#>
#>      AIC      BIC   logLik deviance df.resid
#>       NA       NA       NA       NA      289
#>
#> Random effects:
#>
#> Conditional model:
#>  Groups Name          Variance Std.Dev. Corr
#>  cols   (Intercept)   0.6278   0.7923
#>         soil.dry      1.8414   1.3570   -0.87
#>         bare.sand     0.1583   0.3979   -0.13 -0.06
#>         fallen.leaves 0.4065   0.6376    0.66 -0.66 -0.71
#>         moss          0.2394   0.4893   -0.68  0.38  0.53 -0.60
#>         herb.layer    0.6218   0.7885   -0.91  0.80 -0.18 -0.37  0.44
#>         reflection    0.6865   0.8286    0.18 -0.55  0.32  0.21  0.37 -0.36
#> Number of obs: 336, groups:  cols, 12
#>
#> Dispersion parameter for nbinom2 family (): 1.51
#>
#> Conditional model:
#>                Estimate Std. Error z value Pr(>|z|)
#> (Intercept)     0.14046    0.38463   0.365 0.714972
#> colsAlopcune    0.57040    0.35671   1.599 0.109811
#> colsAlopfabr   -1.63338    0.58814  -2.777 0.005483 **
#> colsArctlute   -1.95026    0.73851  -2.641 0.008270 **
#> colsArctperi   -4.59785    1.36464  -3.369 0.000754 ***
#> colsAuloalbi    0.08365    0.45974   0.182 0.855622
#> colsPardlugu   -1.35449    0.69980  -1.936 0.052924 .
#> colsPardmont    1.72206    0.42390   4.062 4.86e-05 ***
#> colsPardnigr    0.96816    0.47566   2.035 0.041808 *
#> colsPardpull    1.18339    0.67418   1.755 0.079205 .
#> colsTrocterr    2.28614    0.37644   6.073 1.26e-09 ***
#> colsZoraspin    0.37109    0.43828   0.847 0.397160
#> soil.dry        0.63734    0.42406   1.503 0.132855
#> bare.sand       0.06513    0.15932   0.409 0.682691
#> fallen.leaves  -0.35355    0.26504  -1.334 0.182218
#> moss            0.13821    0.18615   0.743 0.457782
#> herb.layer      1.11253    0.27857   3.994 6.50e-05 ***
#> reflection     -0.01724    0.29874  -0.058 0.953983
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There are some convergence warnings here, which is fair enough: there are only 12 species but plenty of correlated terms in this model. Results seem reasonably stable across multiple runs with tweaked settings, so we’ll go with this fit.
Which predictors seem to be most important to spider community composition?
The largest variance component, by some margin, is for soil.dry (about 1.8), followed by reflection (0.69) and herb.layer (0.62).
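That ranking can be read straight from the estimated variance components (the diagonal of the random-effects covariance matrix):

# species-level variance components, largest first
sort(diag(VarCorr(spid_glmm)$cond$cols), decreasing = TRUE)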