ubms is an R package for fitting models of wildlifeoccurrence and abundance to with datasets where animals are notindividually marked. It provides a nearly identical interface to thepopularunmarked package(Fiske, Chandler, and others 2011).Instead of using maximum likelihood to fit models (as withunmarked), models are fit in a Bayesian framework usingStan(Carpenter et al. 2017). It isgenerally expected that you are already familiar withunmarked when usingubms. You can downloadubms, report issues, or help with development onGithub.
There are several advantages to usingubms overunmarked. First, it is possible to include random effectsinubms models, which is not currently possible inunmarked. These are specified using the familiar syntax oflme4(Bates et al.2015). Second,ubms generates posteriordistributions for model parameters, including latent occupancy andabundance parameters. These can be useful for post-hoc analyses anddiagnostics. Finally, fitting models with Stan gives you access to thelarge ecosystem of Stan-related tools, such asLOO(leave-one-out-cross validation;(Vehtari, Gelman, and Gabry2017)).
Another alternative toubms would be to fit models in anexisting modeling language such as BUGS,JAGS, or directly in Stan.ubms abstracts away the complex process of defining andwriting custom models in these languages which need to be updated eachtime you make changes to your model. It also provides many useful helperfunctions (e.g. predict) which would otherwise requirecustom code. Finally, because of Stan’s efficient sampler(Hoffman and Gelman2014) and because the underlying likelihoods inubms are marginalized,ubms will often fitmodels much faster than equivalent models in BUGS or JAGS(Yackulic et al.2020).
Relative tounmarked,ubms has fewer typesof models available. For example, models that incorporate temporaryemigration (likegdistsamp)(Chandler, Royle, and King 2011) arecurrently not available inubms. Models should run muchfaster inunmarked as well. If you do not need one of thespecific benefits ofubms described above, it makes senseto stick withunmarked. Even if you do plan to useubms, it makes sense to test the models inunmarked first. The similar interface between the twopackages makes this very easy, as you will see in the next section.
Relative to BUGS/JAGS/Stan,ubms is less flexiblebecause you cannot customize your model structure. You are limited tothe provided model types. Furthermore, you cannot currently customizeprior distributions (although I plan to add this in the future in someform). Finally, writing your own BUGS/JAGS model can be valuable forgaining a deeper understanding of how a model works;ubms,likeunmarked, is essentially a black box.
To summarize the advantages and disadvantages: I seeubms as an intermediate step along the continuum fromunmarked to custom models in BUGS/JAGS. It is not meant toreplace either approach but rather to supplement them, for situationswhen a Bayesian framework is needed and “off-the-shelf” model structuresare adequate.
Occupancy models estimate the probability\(\psi\) that a species occupies a site,while accounting for detection probability\(p< 1\)(MacKenzie et al. 2002). In orderto estimate both\(p\) and\(\psi\), repeated observations(detection/non-detection data) at each site are required.
First, load the dataset we’ll be using, which comes withunmarked:
Thecrossbill dataset is adata.frame withmany columns. It contains detection/non-detection data for the Europeancrossbill (Loxia curvirostra) in Switzerland(Schmid, Zbinden, and Keller2004).
## [1] 267 58## [1] "id" "ele" "forest" "surveys" "det991" "det992" "det993" ## [8] "det001" "det002" "det003" "det011" "det012" "det013" "det021" ## [15] "det022" "det023" "det031" "det032" "det033" "det041" "det042" ## [22] "det043" "det051" "det052" "det053" "det061" "det062" "det063" ## [29] "det071" "det072" "det073" "date991" "date992" "date993" "date001"## [36] "date002" "date003" "date011" "date012" "date013" "date021" "date022"## [43] "date023" "date031" "date032" "date033" "date041" "date042" "date043"## [50] "date051" "date052" "date053" "date061" "date062" "date063" "date071"## [57] "date072" "date073"Check?crossbill for details about each column. Thefirst three columnsid,ele, andforest are site covariates.
The following 27 columns beginning withdet are thebinary detection/non-detection data; 9 years with 3 observations peryear. For this example we want to fit a single-season occupancy model;thus we will use only the first three columns (year 1) ofdet as our response variabley.
## det991 det992 det993## 1 0 0 0## 2 0 0 0## 3 NA NA NA## 4 0 0 0## 5 0 0 0## 6 NA NA NANote that missing values are possible. The final 27 columns beginningwithdate are the Julian dates for each observation. Aswithy we want only the first three columns correspondingto year 1.
Finally, we build ourunmarkedFrame object holding ourdetection/non-detection data, site covariates, and observationcovariates. Since we will conduct a single-season occupancy analysis, weneed to useunmarkedFrameOccu specifically. The resultingunmarkedFrame can be used by bothunmarked andubms.
## Data frame representation of unmarkedFrame object.## y.1 y.2 y.3 id ele forest date.1 date.2 date.3## 1 0 0 0 1 450 3 34 59 65## 2 0 0 0 2 450 21 17 33 65## 3 NA NA NA 3 1050 32 NA NA NA## 4 0 0 0 4 950 9 29 59 65## 5 0 0 0 5 1150 35 24 45 65## 6 NA NA NA 6 550 2 NA NA NAFirst, we fit a null model (no covariates) inunmarkedusing theoccu function. Theoccu functionrequires as input a double formula (for detection and occupancy,respectively) along with ourunmarkedFrame.
## ## Call:## occu(formula = ~1 ~ 1, data = umf)## ## Occupancy (logit-scale):## Estimate SE z P(>|z|)## -0.546 0.218 -2.51 0.0121## ## Detection (logit-scale):## Estimate SE z P(>|z|)## -0.594 0.208 -2.86 0.00426## ## AIC: 511.2538 ## Number of sites: 245## ID of sites removed due to NA: 3 6 38 63 71 106 116 118 125 126 152 163 168 178 181 197 199 212 218 220 238 257Next, we fit the same model inubms. The equivalent tooccu inubms isstan_occu.Functions inubms generally use thisstan_prefix, based on the approach used in packagerstanarm forGLMs. We need to provide the same arguments tostan_occu.In addition, we will specify that we want 3 MCMC chains(chains=3), with 500 iterations per chain(iter=500) of which the first half will be warmupiterations. It is beyond the scope of this vignette to discuss theappropriate number or length of chains; seetheStan user’s guide for more details. Generally 4 chains of 2000iterations each is recommended (of which 1000 per chain are warmups).Thus, 500 iterations per chain is probably not enough, but to keepthings running quickly it is sufficient for this vignette. Note that ifyou are more familiar with BUGS or JAGS, Stan generally requires asmaller number of iterations to reach convergence thanks to its defaultNUTS sampler(Hoffmanand Gelman 2014). If you have a good multi-core CPU, you canrun chains in parallel. Tell Stan how many parallel cores you want touse by assigning a value to thecores argument.
## ## Call:## stan_occu(formula = ~1 ~ 1, data = umf, chains = 3, iter = 500, ## refresh = 0, seed = 123)## ## Occupancy (logit-scale):## Estimate SD 2.5% 97.5% n_eff Rhat## -0.483 0.234 -0.882 0.00981 236 1## ## Detection (logit-scale):## Estimate SD 2.5% 97.5% n_eff Rhat## -0.642 0.226 -1.11 -0.229 285 1## ## LOOIC: 511.954## Runtime: 24.968 secThe structure of the output fromunmarked andubms is intentionally similar. Estimates of the occupancyand detection parameters are also similar, but not identical. For a moredirect comparison, call thecoef function on both modelobjects:
## unmarked stan## psi(Int) -0.5461203 -0.4834253## p(Int) -0.5939612 -0.6419488Let’s look at the output from ourfit_stan modelagain:
## ## Call:## stan_occu(formula = ~1 ~ 1, data = umf, chains = 3, iter = 500, ## refresh = 0, seed = 123)## ## Occupancy (logit-scale):## Estimate SD 2.5% 97.5% n_eff Rhat## -0.483 0.234 -0.882 0.00981 236 1## ## Detection (logit-scale):## Estimate SD 2.5% 97.5% n_eff Rhat## -0.642 0.226 -1.11 -0.229 285 1## ## LOOIC: 511.954## Runtime: 24.968 secThe first part (underCall:) is the command we used toget this model output. Underneath are two tables, one per submodel,corresponding to the occupancy and detection parts of the model. Withineach table there is one row per parameter in the submodel. Sincefit_stan had no covariates, there is only an intercept termfor each submodel. Model parameters in this summary table are alwaysshown on the appropriate transformed scale, in this case logit. To getthe corresponding probabilities, you can use thepredictfunction, which we will demonstrate later.
For each parameter, the mean and standard deviation of the posteriordistribution are given. Unlikeunmarked output, there is no\(Z\) or\(p\)-value. Instead, there is a 95%uncertainty interval provided.
The final two columns in each summary tablen_eff andRhat are MCMC diagnostics. We will discuss their meaninglater.
To extract summary values into an R table for easy manipulation, usethesummary method. Note that you have to specify whichsubmodel you want ("state" for occupancy or"det" for detection).
## [1] -0.4834253To extract the entire posterior for a parameter, use theextract method. To avoid name collisions you need to usethe full name of the parameter (which contains both the submodel and theparameter name) when extracting. To see a list of the available fullparameter names, use thenames method.
## [1] "beta_state[(Intercept)]" "beta_det[(Intercept)]"occ_intercept<-extract(fit_stan,"beta_state[(Intercept)]")[[1]]hist(occ_intercept,freq=FALSE)lines(density(occ_intercept),col='red',lwd=2)Now we’ll fit two candidate models to thecrossbill datainubms and compare them.
Along with our previous null model, we’ll fit a “global” model withboth site and observation covariates. This is just an example; perhapsother models should also be considered if we were preparing thisanalysis for publication. In our model formulas, we have normalized allcovariates withscale so they have a mean of 0 and astandard deviation of 1. This can help improve model convergence and isgenerally a good idea.
fit_null<- fit_stanfit_global<-stan_occu(~scale(date)~scale(forest)+scale(ele),data=umf,chains=3,iter=500,seed=123)## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.## Running the chains for more iterations may help. See## https://mc-stan.org/misc/warnings.html#bulk-ess## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.## Running the chains for more iterations may help. See## https://mc-stan.org/misc/warnings.html#tail-essThefit_global model gave us some warnings about theeffective sample size (n_eff) along with a suggestedsolution. We will ignore this warning for now but normally it is a goodidea to pay close attention to these warnings.
First we combine the models into afitList:
Then we generate a model selection table:
## elpd nparam elpd_diff se_diff## fit_global -236.776 5.984 0.000 0.000## fit_null -255.977 2.524 -19.201 6.618Instead of AIC, models are compared using leave-one-outcross-validation (LOO)(Vehtari, Gelman, and Gabry 2017) viatheloo package. Based on this cross-validation, theexpected predictive accuracy (elpd) for each model iscalculated. The model with the largestelpd(fit_global) performed best. Theelpd_diffcolumn shows the difference inelpd between a model and thetop model; if this difference is several times larger than the standarderror of the difference (se_diff), we are confident themodel with the largerelpd performed better. We can seethat thefit_global model is clearly the best performingmodel.
You can obtain LOO information for a single model using theloo method:
## ## Computed from 750 by 245 log-likelihood matrix.## ## Estimate SE## elpd_loo -236.8 19.2## p_loo 6.0 0.6## looic 473.6 38.5## ------## MCSE of elpd_loo is 0.1.## MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 1.5]).## ## All Pareto k estimates are good (k < 0.65).## See help('pareto-k-diagnostic') for details.Thelooic value is analogous to AIC.
You can also obtain the WAIC (Widely Applicable InformationCriterion) if you prefer(Vehtari, Gelman, and Gabry2017):
## ## Computed from 750 by 245 log-likelihood matrix.## ## Estimate SE## elpd_waic -236.7 19.2## p_waic 5.9 0.6## waic 473.5 38.5We’ll define the global model as our top model:
## ## Call:## stan_occu(formula = ~scale(date) ~ scale(forest) + scale(ele), ## data = umf, chains = 3, iter = 500, refresh = 0, seed = 123)## ## Occupancy (logit-scale):## Estimate SD 2.5% 97.5% n_eff Rhat## (Intercept) -0.603 0.359 -1.1896 0.293 208 1.01## scale(forest) 1.077 0.325 0.5595 1.925 211 1.01## scale(ele) 0.566 0.245 0.0924 1.054 533 1.00## ## Detection (logit-scale):## Estimate SD 2.5% 97.5% n_eff Rhat## (Intercept) -0.754 0.253 -1.290 -0.316 300 1.01## scale(date) 0.558 0.170 0.233 0.883 1030 1.00## ## LOOIC: 473.552## Runtime: 37.205 secAgain looking at the summary offit_top, we concludeMCMC chains have converged if all\(\hat{R}< 1.05\). To visualize convergence, look at thetraceplots:
The traceplots look a little messy. We may also get a warning thatn_eff is lacking for some parameters. The rule of thumb isto haven_eff > 100 * number of chains (300). The easysolution to both problems would be to re-run this model with moreiterations.
Calculating residuals is tricky for occupancy models. There isn’t onewidely accepted way of doing it.ubms implements theapproach of Wright(Wright, Irvine, and Higgs 2019) inwhich residuals are calculated separately for the state and observationprocesses. To quickly plot residuals against fitted values, use theplot method:
Note that the residuals are automatically binned, which isappropriate for a binomial response(Gelman and Hill 2007). If the modelfits the data well, you would expect 95% of the binned residual pointsto fall within the shaded area. You can also plot residuals againstcovariate values using theplot_residuals function:
ubms also supports goodness-of-fit tests (posteriorpredictive checks) for some models. In the case of occupancy models,there is support for the MacKenzie-Bailey (M-B) chi-square test(MacKenzie and Bailey2004) via thegof function. For each posteriordraw, the M-B statistic is calculated for the actual data and for asimulated dataset. The proportion of draws for which the simulatedstatistic is larger than the actual statistic should be near 0.5 if themodel fits well.
## MacKenzie-Bailey Chi-square ## Point estimate = 28.241## Posterior predictive p = 0.04Our model does not appear to fit well based on this posteriorpredictive check. The first step to addressing this would be to run themodel for more iterations to make sure that isn’t the reason.
You can use theposterior_predict function to simulatenew datasets, which you can use to calculate your own fit statistic. Thefollowing command generates 100 simulated datasets.
## [1] 100 801The output is a matrix with dimensions draws x observations (insite-major order). As an example, we can calculate the proportion ofzeros in each simulated dataset
and compare that to the proportion of zeros in the actualdataset.
actual_prop0<-mean(getY(fit_top)==0,na.rm=TRUE)#Comparehist(prop0,col='gray')abline(v=actual_prop0,col='red',lwd=2)Based on the 95% uncertainty intervals, both forest and elevationhave a positive effect on occupancy probability (both intervals do notcontain 0). Similarly, Julian date has a positive impact on detectionprobability. We can quickly visualize these marginal covariate effectswith theplot_effects function:
As withunmarked, we can get the predicted\(psi\) or\(p\) for each site or observation using thepredict function. For example, to get occupancy:
## Predicted SD 2.5% 97.5%## 1 0.08475341 0.04074847 0.02603972 0.1870306## 2 0.15341676 0.06590045 0.05564030 0.3312959## 3 0.30784613 0.07968393 0.18695527 0.5197789## 4 0.14593709 0.04629648 0.07153534 0.2537933## 5 0.35156491 0.08389258 0.22562494 0.5726406## 6 0.08783817 0.04008446 0.02912174 0.1868335You can also supply newdata as adata.frame.
## Predicted SD 2.5% 97.5%## 1 0.1401127 0.07725292 0.03806715 0.3547347One of the advantages of BUGS/JAGS is that you can directly modellatent parameters, such as the true unobserved occupancy state of a site\(z\). Using theposterior_predict function inubms, you cangenerate an equivalent posterior distribution of\(z\).
## [1] 100 267The output has one row per posterior draw, and one column per site.The posterior of\(z\) can be usefulfor post-hoc analyses. For example, suppose you wanted to test for adifference in mean occupancy probability between sites 1-50 and sites51-100:
group1<-rowMeans(zpost[,1:50],na.rm=TRUE)group2<-rowMeans(zpost[,51:100],na.rm=TRUE)plot_dat<-rbind(data.frame(group="group1",occ=mean(group1),lower=quantile(group1,0.025),upper=quantile(group1,0.975)),data.frame(group="group2",occ=mean(group2),lower=quantile(group2,0.025),upper=quantile(group2,0.975)))Now plot the posterior distributions of the two means:
library(ggplot2)ggplot(plot_dat,aes(x=group,y=occ))+geom_errorbar(aes(ymin=lower,ymax=upper),width=0.2)+geom_point(size=3)+ylim(0,1)+labs(x="Group",y="Occupancy + 95% UI")+theme_bw()+theme(panel.grid.major=element_blank(),panel.grid.minor=element_blank(),axis.text=element_text(size=12),axis.title=element_text(size=14))