SSP is an R package designed to estimate samplingeffort in studies of ecological communities based on the definition ofpseudo-multivariate standard error (MultSE) (Anderson& Santana-Garcon 2015) and simulation of data. This guide willprovide you with a brief overview on how to useSSP.The theoretical background is described in a submitted paper byGuerra-Castro et al. (2020).
The protocol inSSP consists of simulating severalextensive data matrices that mimic some of the relevant ecologicalfeatures of the community of interest using a pilot data set. For eachsimulated data, several sampling efforts are repeatedly executed andMultSE is calculated to each one. The mean value, 0.025 and0.975 quantiles ofMultSE for each sampling effort across allsimulated data are then estimated and plotted. The mean values arestandardized in relation to the lowest sampling effort (consequently,the worst precision), and an optimal sampling effort can be identifiedas that in which the increase in sample size do not improve theprecision beyond a threshold value (e.g. 2.5 %).
SSP includes seven functions:assemparfor extrapolation of assemblage parameters using pilot data;simdata for simulation of several data sets based onextrapolated parameters;datquality for evaluation ofplausibility of simulated data;sampsd for repeatedestimations ofMultSE for different sampling designs insimulated data sets;summary_sd for summarizing thebehavior ofMultSE for each sampling design across allsimulated data sets,ioptimum for identification of theoptimal sampling effort, andplot_ssp to plot samplingeffort vsMultSE.
The first function to use isassempar. The arguments ofthis functions are:
| Argument | Description |
|---|---|
data | data.frame with species names (columns) and samples (rows)information. The first column should indicate the site to which thesample belongs, regardless of whether a single site has been sampled |
type | Nature of the data to be processed. It may be presence / absence(“P/A”), counts of individuals (“counts”), or coverage (“cover”) |
Sest.method | Method for estimating species richness. The functionspecpool is used for this. Available methods are theincidence-based Chao “chao”, first order jackknife “jack1”, second orderjackknife “jack2” and Bootstrap “boot”. By default, the “average” of thefour estimates is used |
assempar extracts the main parameters of the pilot datausing basic R functions combined withvegan functions likespecpool anddispweight. The expected numberof species in the assemblage is estimated using non-parametric methods(Gotelli et al. 2011). Due to the variability of the estimates of eachapproximation (Reese et al. 2014), we recommend using an average ofthese. The probability of occurrence of each species is estimatedbetween and within sites. The former is calculated as the frequency ofoccurrences of each of the species against the number of sites sampled,the second as the weighted average frequencies in sites where thespecies were present. Also, the degree of spatial aggregation of species(only for real counts of individuals), is identified with the index ofdispersionD (Clarke et al. 2006). The corresponding propertiesof unseen species are approximated using information of observedspecies. Specifically, the probabilities of occurrences are assumed tobe equal to the rarest species of pilot data. The mean (and variance) ofthe abundances are defined using random Poisson values with lambda asthe overall mean of species abundances.assempar returns anobject of classlist, to be used bysimdata.
The second function to use issimdata, with thefollowing arguments:
| Argument | Description |
|---|---|
Par | list of parameters estimated byassempar |
cases | Number of data sets to be simulated |
N | Total number of samples to be simulated in each site |
sites | Total number of sites to be simulated in each data set |
jitter.base | Optional numeric value controlling the magnitude of stochasticperturbations applied to site-level parameters. This jitter introducesvariability among sites while preserving the overall probabilisticstructure of each species. Default isNULL (noperturbation). |
The presence/absence of each species at each site is simulated usingBernoulli trials, with probability of success equal to the empiricalfrequency of occurrence among sites in the pilot data. For sites wherethe species is present, Bernoulli trials are used again with probabilityof success equal to the empirical frequency estimated within sites. Tobetter reflect realistic variability among nested sampling units (e.g.,sites within regions), the function can apply controlled perturbationsto the base parameters via the argumentjitter.base. Thisstochastic adjustment introduces site-level heterogeneity in occurrenceprobabilities, producing simulated communities with multivariatedispersion closer to empirical expectations.
If required, the presence/absence matrices are converted intoabundance matrices by replacing presences with random values drawn fromappropriate statistical distributions, using parameters estimated fromthe pilot data. Counts of individuals are generated using Poisson ornegative binomial distributions, depending on the degree of aggregationof each species (McArdle & Anderson 2004; Anderson & Walsh2013). Continuous abundance measures (e.g., coverage, biomass) aregenerated using the lognormal distribution.
The simulation procedure is repeatedcases times,producing a list of simulated data sets that reproduce the statisticalproperties of the original assemblage. The resulting object of classlist is used bysampsd anddatquality.
The third function issampsd. If several virtual siteshave been simulated, subsets of sites of size 2 tom aresampled, followed by the selection of sampling units (from 2 ton) using inclusion probabilities and self-weighted two-stagesampling (Tille, 2006). Each combination of sampling effort (number ofsample units and sites), are repeated several times (e.g. k =100) for all simulated matrixes. If simulated data correspond to asingle site, sampling without replacement is performed several times(e.g. k = 100) for each sample size (from 2 ton)within each simulated matrix. This approach is computationallyintensive, especially whenk is high. Keep this in mind becauseit will affect the time to get results. For each sample, suitablepre-treatments are applied and distance/similarity matrixes estimatedusing the appropriate coefficient. When simulations were done for asingle site, theMultSE is estimated as , beingV thepseudo-variance measured at each sample of sizen(Anderson & Santana-Garcon, 2015). When several sites weregenerated,MultSE are estimated using the residual mean squaresand the sites mean squares from a distance-based multivariate analysisof variance (Anderson & Santana-Garcon, 2015). The arguments of thisfunction are:
| Argument | Description |
|---|---|
dat.sim | list of data sets generated bysimdata |
Par | list of parameters estimated byassempar |
transformation | Mathematical function to reduce the weight of very dominant species:‘square root’, ‘fourth root’, ‘Log (X+1)’, ‘P/A’, ‘none’ |
method | The appropiate distance/dissimilarity metric. The functionvegdist is called for that purpose |
n | Maximum number of samples to take at each site. Can be equal or lessthanN |
m | Maximum number of sites to sample at each data set. Can be equal or lessthansites |
k | Number of repetitions of each sampling effort (samples and sites) foreach data set |
The fourth function issummary_ssp. This function isrequired to estimate an average of allMultSE obtained with thek repetitions for each sampling effort within each case, andthen an overall mean as well as the lower and upper intervals of meansfor each sampling effort among all cases. In order to have a general andcomparable criteria to evaluate the rate of change of the averagedMultSE with respect to the sampling effort, a relativization tothe maximumMultSE value (obtained with the lower samplingeffort) is calculated; then, a standard forward finite derivation iscomputed. All these results are presented in a data frame that will beused to plotMultSE with respect to the sampling effort, usingssp_plot. The arguments of this function are:
| Argument | Description |
|---|---|
results | matrix generated bysampsd |
multi.site | Logical argument indicating whether several sites were simulated |
The fifth function,ioptimum estimates what we considerthe optimal improvement. The arguments are:
| Argument | Description |
|---|---|
xx | data frame generated bysummary_ssp |
multi.site | Logical argument indicating whether several sites were simulated |
This function identifies three cut-off points based on the finitederivative between the standardizedMultSE and the samplingeffort (as the percentage of improvement in precision per sample unit),allowing us to identify: (1) the minimum improvement required, (2)sub-optimal improvement, (3) optimal improvement and (4) unnecessaryimprovement. It is possible that the cut-off points defined by omissionwill not be achieved, which might occurs if the argumentsn orm ofsampsd were set small. In situations likethis, a warning message will specifically indicate which cut-off pointwas not achieved, and will return the maximum effort currently executed.To achieve 5%, 3%, and 1% of optimization,sampsd andsummary_ssp must be run again with higher values ofn orm. Alternatively, the cut-off points can be madeflexible, say, 5%, 4%, 3%, respectively, or higher.
This function allows to visualize the behavior of theMultSEas sampling effort increases. When the simulation involves two samplingscales, a graph for samples and one for sites will be generated. AbovetheMultSE~Sampling effort projection, two shadedareas are drawn, highlighting: sub-optimal improvement (light gray), andoptimal improvement (dark gray). Both reflect the sampling effort thatimproves the precision at acceptable (light gray) or desirable levels(dark gray), but beyond the later, any gain could be consideredunnecessary. In addition, for each sampling effort, the relativizedimprovement (in relation to theMultSE estimated with the lowersampling effort) is presented cumulatively (as percentages). This isvery useful because it indicates exactly how much the precision isimproved for each sampling effort. The graphic is generated withggplot2, so the resulting object can be modified usingfunctions of that package. The arguments of this function are:
| Argument | Description |
|---|---|
xx | data frame generated bysummary_ssp |
opt | vector ormatrix generated byioptimum |
multi.site | Logical argument indicating whether several sites were simulated |
It could be desirable to measure the plausibility of simulated data.This can be done by measuring the similarity between the simulated andthe pilot data. This can be assessed withdatquality, afunction that estimate: (i) the average number of species per samplingunit, (ii) the average species diversity (Simpson diversity index) persampling unit, and (iii) the multivariate dispersion (MVD),measured as the average dissimilarity from all sampling units to themain centroid in the space of the dissimilarity measure used (Anderson2006). The arguments of this functions are:
| Argument | Description |
|---|---|
data | adata.frame with species names (columns) and samples(rows) information. The first column should indicate the site to whichthe sample belongs, regardless of whether a single site has been sampledor not |
dat.sim | alist of data sets generated bysimdata |
Par | alist of parameters estimated byassempar |
transformation | Mathematical function to reduce the weight of very dominant species:‘square root’, ‘fourth root’, ‘Log (X+1)’, ‘P/A’, ‘none’ |
method | The appropriate distance/dissimilarity metric. The functionvegdist is called for that purpose |
Micromollusks of marine shallow sandy bottoms:Presence/absence of 68 species were registered in six cores of 10 cmdiameter and 10 cm deep taken in sandy bottoms around Cayo Nuevo, Gulfof Mexico, Mexico (a small reef cay located 240 km off the North-Westerncoast of Yucatan). Data correspond to a study on the biodiversity ofmarine benthic reef habitats off the Yucatan shelf (Ortigosa etal. 2018). The main objective was to estimate an adequate samplingeffort for further quantitative studies to characterize the variabilityin species composition. To do this, the pilot data was characterizedwithassempar and several matrices of P/A data weresimulated withsimdata. To speed up the process, only 20data sets (cases) were simulated, each data matrix consisted ofN = 100 potential sampling replicates in onesite.Several sample size’s subsets (n = 2 to 50) were repeatedlysampled (k = 10) withsampsd. The Jaccard indexwas used as the similarity measure between sample units. Keep in mindthat you can simulate many more matrices (cases), number ofsamples (N andn), and repetitions (k), aslong as you have the time and patience to wait for the results!
library(SSP)data(micromollusk)#Estimation of parameterspar.mic<-assempar(data = micromollusk,type ="P/A")#Simulation of datasim.mic<-simdata(Par = par.mic,cases =20,N =100,site =1)# Quality of simulated dataqua.mic<-datquality(data = micromollusk,dat.sim = sim.mic,Par = par.mic,transformation ="none",method ="jaccard")#Sampling and estimation of MultSEsamp.mic<-sampsd(sim.mic, par.mic,transformation ="P/A",method ="jaccard",n =50,m =1,k =10)#Summarizing resultssum.mic<-summary_ssp(results = samp.mic,multi.site =FALSE)#Identification of optimal effortopt.mic<-ioptimum(xx = sum.mic,multi.site =FALSE)#plotfig.1<-plot_ssp(xx = sum.mic,opt = opt.mic,multi.site =FALSE)fig.1The behavior ofMultSE for each sampling effort on simulateddata sets can be plotted usingplot_ssp (Fig.1). The shadedareas indicate the range of samples in which each increase in samplingeffort provides suboptimal improvement (light gray), and optimalimprovement (dark gray). The suboptimal region highlights improvementbetween 37% and 47%, while the optimal indicates that up to 55% could beimproved with 10 samples, with remarkable precision gained with eachadditional sample. After 11 samples, the improvement obtained with theincrement of the sampling effort is small enough to consider the extraeffort unnecessary and redundant.
Fig. 1. MultSE and sampling effort relationship using micromollusksimulated data
Coral reef sponges: Structure and composition ofsponge assemblages associated with reefs from Alacranes Reef NationalPark (ARNP), Gulf of Mexico, Mexico, was estimated in 36 transects of 20m × 1 m across 6 sites (≈ 4 to 5 transect per site). In each transect,the colonies of 41 species of sponges were counted. This datacorresponds to a pilot study on sponge biodiversity in reef habitats ofthe Yucatán shelf (Ugalde et al. 2015). The main objective was toestimate an adequate sampling effort at two spatial scales(i.e. transect and sites) for further quantitative studies. The studiedarea represented the leeward area of the reef, with very similar reefgeomorphology. Therefore, it cannot be argued a priori that spatialdifferences in sponge diversity (if any) were due to some environmentalmodel. In this sense, we considered valid to simulate data for theentire leeward area, using the information of the six sites. Severalmatrices were simulated, each consisting in 20 potentialsiteswith 20 potentialN transects per site. As for the firstexample, the pilot data was characterized withassempar andseveral matrices with counts of individual sponges were simulated withsimdata. Again, to speed up the process, only 10 data sets(cases) were simulated. Each combination ofn (from 2to 20) andm (from 2 to 20) were repeatedly sampled (k= 10) withsampsd. The Bray-Curtis index was used as thesimilarity measure between sample units after square root transformationof simulated abundances. Similarly, the optimal sampling effort wasrepresented usingplot_ssp.
data(sponges)#Estimation of parameterspar.spo<-assempar(data = sponges,type ="counts")#Simulation of datasim.spo<-simdata(Par = par.spo,cases =10,N =20,sites =20,jitter.base =0.5)# Quality of simulated dataqua.spo<-datquality(data = sponges,dat.sim = sim.spo,Par = par.spo,transformation ="square root",method ="bray")#Sampling and estimation of MultSEsamp.spo<-sampsd(sim.spo, par.spo,transformation ="square root",method ="bray",n =20,m =20,k =10)#Summarizing resultssum.spo<-summary_ssp(results = samp.spo,multi.site =TRUE)#Identification of optimal effortopt.spo<-ioptimum(xx = sum.spo,multi.site =TRUE)#plotfig.2<-plot_ssp(xx = sum.spo,opt = opt.spo,multi.site =TRUE)fig.2Fig. 2. MultSE and sampling effort relationship using sponge simulateddata
At the scale of sites, there was a noticeable decrease of theMultSE between 5 and 10 sites (Fig. 2). This range of samplingis represented by two shaded areas: subotimal improvement (light gray)and optimal improvement (dark gray), respectively. The first representimprovements of 45% with 7 sites, the second 54% with 10 sites. Sampling9 sites will improve the worst precision in approximately 52%. At thescale of transect, the suboptimal improvement is accomplished with 5 or6 replicates, whereas the optimal is beyond 7 replicates, but below 10.It can also be noted that each additional sample improves the highestMultSE by 2-3 %, achieving a 55% improvement with 10 transects.It should be pointed out the marked differences inMultSEobtained for the two sources of variation. The magnitude of thedifference in variation fits that obtained in thepseudo-components of variation estimated for sites andresiduals in a PERMANOVA analysis of the pilot data σsites =26.4, σtransects = 34.7):
dat<-sponges[,2:length(sponges)]#Square root transformation of abundancesdat.t<-sqrt(dat)#Bray-Curtyslibrary(vegan)bc<-vegdist(dat.t,method ="bray")#function to estimate components of variation in PERMANOVAcv.permanova<-function(D, y) { D=as.matrix(D) N=dim(D)[1] g=length(levels(y[,1])) X=model.matrix(~y[,1])#model matrix H= X%*%solve(t(X)%*% X)%*%t(X)#Hat matrix I=diag(N)#Identity matrix A=-0.5* D^2 G= A-apply(A,1, mean)%o%rep(1, N)-rep(1, N)%o%apply(A,2, mean)+mean(A) MS1=sum(G*t(H))/(g-1)#Mean square of sites MS2=sum(G*t(I- H))/(N- g)#Mean square of residuals CV1= (MS1- MS2)/(N/g)# Components of variation of sites CV2= MS2# Components of variation of samples CV=c(CV1, CV2) sqrtCV=sqrt(CV)return(sqrtCV)#square root of components of variation}cv<-cv.permanova(D = bc,y = sponges)cv#> [1] 0.2644432 0.3467063Based on these results, and considering the costs (time and/or money)of visiting a site and doing each transect, as well as the relativeimportance of each spatial scale, it will be convenient to keep thenumber of sites within suboptimal improvement (i.e. 5-7), and set thenumber of transects to 8.
These examples demonstrate thatSSP has keyadvantages in estimating sample effort to study communities whose datawill be analyzed with statistical techniques based on similarityindices, such as PERMANOVA (Anderson, 2017). However, it is important tomention thatSSP is not free from assumptions. First,simulations insimdata do not consider any environmentalconstraint, nor the co-occurrence of species neither their jointdistribution functions. In addition, theSSP protocolassumes that potential differences in species composition among sites isdue to spatial aggregation of species as estimated from the pilot data.Thus, any spatial structure of species that was not captured by thepilot data will not be reflected in the simulation. Although theprotocol performs well with small pilot data, it is recommendable thatthe pilot sampling captures the greatest possible variability in thesystem under study. Despite these limitations, if the properties of thesimulated data resemble the community of interest and show ecologicalplausibility, the extrapolations derived from the procedure presentedhere will hold valid to define an optimal sampling effort for any studybased on dissimilarity-based multivariate analysis.
-Anderson, M. J. (2017). Permutational Multivariate Analysis ofVariance (PERMANOVA). Wiley StatsRef: Statistics Reference Online. JohnWiley & Sons, Ltd.
-Anderson, M. J. (2006). Distance-based tests for homogeneity ofmultivariate dispersions. Biometrics 62:245-253.
-Anderson, M. J., & J. Santana-Garcon. (2015). Measures ofprecision for dissimilarity-based multivariate analysis of ecologicalcommunities. Ecology Letters 18:66-73.
-Anderson, M. J., & D. C. I. Walsh. (2013). PERMANOVA, ANOSIM,and the Mantel test in the face of heterogeneous dispersions: What nullhypothesis are you testing? Ecological Monographs 83:557-574.
-Clarke, K. R., Chapman, M. G., Somerfield, P. J., & Needham, H.R. (2006). Dispersion-based weighting of species counts in assemblageanalyses. Journal of Experimental Marine Biology and Ecology, 320,11-27.
-Gotelli, N. J., & Colwell, R. K. (2011). Estimating speciesrichness. Pages 39– 54 in A Magurranand B McGill editors. Biologicaldiversity: frontiers in measurement and assessment. Oxford UniversityPress, Oxford, UK.
-Guerra-Castro, E.J., Cajas, J.C., Simões, N., Cruz-Motta, J.J.,& Mascaró, M. (2021). SSP: an R package to estimate sampling effortin studies of ecological communities. Ecography 44(4), 561-573. doi:
-McArdle, B. H., & M. J. Anderson. (2004). Varianceheterogeneity, transformations, and models of species abundance: acautionary tale. Canadian Journal of Fisheries and Aquatic Sciences61:1294-1302.
-Oksanen, J., F. G. Blanchet, R. Kindt, P. Legendre, P. R. Minchin,R. B. O’Hara, G. L. Simpson, P. Solymos, M. H. H. Stevens, & H.Wagner. 2019. Vegan: Community Ecology Package. R package version2.5-6.
-Ortigosa, D., N. Y. Suárez-Mozo, N. C. Barrera, & N. Simões.(2018). First survey of Interstitial molluscs from Cayo Nuevo, CampecheBank, Gulf of Mexico. Zookeys 779.
-Reese, G. C., Wilson, K. R., & Flather, C. H. (2014).Performance of species richness estimators across assemblage types andsurvey parameters. Global Ecology and Biogeography, 23(5), 585-594.
-Tillé, Y. (2006). Sampling algorithms. Springer, New York, NY.
-Ugalde, D., P. Gómez, & N. Simoes. (2015). Marine sponges(Porifera: Demospongiae) from the Gulf of México, new records andredescription of Erylus trisphaerus (de Laubenfels, 1953). Zootaxa3911:151-183.