spsurvey 5.6.0

An Overview of Statistical Surveys for Ecological Applications

Michael Dumelle, Amanda M. Nahlik, and Sarah Lehmann

Source: vignettes/articles/overview.Rmd

Introduction

Statistical surveys provide a framework to help scientists answer important questions like “What proportion of lakes in a region of interest (e.g., a district, state, or country) support healthy populations of fish?” When we want to answer questions about a set of units (e.g., all lakes in the region of interest), it is not usually feasible to collect data at every unit because of financial or time constraints. Statistical surveys select a random sample¹ of units and use the resulting data to infer broad characteristics about all units. The “random” part of random sampling is particularly important: it keeps the sample selection process unbiased and allows results to be generalized to the broader population of interest (Paulsen et al., 1998). Random sampling also ensures our analyses reflect the characteristics we want to measure, with a known margin of error and appropriate confidence intervals. Below, each aspect of a statistical survey is presented, from defining a target population to interpreting the statistical estimates. We focus on applications to aquatic resources (e.g., lakes, streams, estuaries), but this discussion also applies to many other ecological contexts (e.g., forestry, biology, wildlife).

Statistical Survey Components

Target Population (i.e., Population of Interest)

The target population (i.e., population of interest) is the set of units we want to describe. This definition should be clear, concise, and easy to confirm during data collection (e.g., by field crews). For example, a target population definition may be “All natural and constructed freshwater lakes in a region of interest not used for aquaculture or disposal (e.g., mine tailings, sewage treatment), greater than 1 hectare with more than 1,000 square meters of open water, and greater than 1 meter in depth.” In this example, the field crews could easily determine on-site whether all five criteria in this definition are met. It is crucial that the target population definition aligns with the statistical survey’s goals.

Sampling Frame

When developing a statistical survey, statisticians refer to the units available for sampling as “sample units” (i.e., individual lakes). For use in the statistical survey process, these units are contained in a sampling frame.

Sampling frames for natural resources are often represented using a Geographic Information System (GIS) with spatial information like latitude and longitude coordinates. Ideally, the sampling frame and target population coincide (i.e., the GIS coverage contains all lakes in the region of interest that meet the aforementioned target population definition and omits all lakes that do not). In practice, however, it is common for a portion of sample units identified in the sampling frame to not meet the target population definition. For example, the target population definition excludes stormwater ponds, but these ponds may incorrectly appear as lakes in a GIS coverage. This is called “overcoverage.” On the other hand, a portion of sample units that exist on the landscape and meet the target population definition may be missing from the GIS coverage because coverage is imperfect. This is called “undercoverage.” Handling overcoverage and undercoverage requires subject matter expertise and additional analysis and interpretation techniques, though overcoverage is generally easier to accommodate in an ecological survey design.

Selecting a Random Sample

To ensure a sample is unbiased and representative of the target population, it should be randomly selected. One approach to random sampling is to select a simple random sample, whereby each possible sample of a given size is equally likely. However, simple random samples ignore the spatial distribution of units on a landscape, often yielding samples that have noticeable spatial gaps or an excess of units in certain areas (i.e., clumping). One approach to alleviating this drawback is the Generalized Random Tessellation Stratified (GRTS) algorithm (Stevens and Olsen, 2004), which generates random samples that are well-spread in space, or spatially balanced. The spsurvey R package makes GRTS sampling available via the grts() function.

Example: Part 1

Suppose a region of interest has 100 lakes and the goal is to use a random selection of 10 lakes to draw inferences about the target population. The random sample may or may not be spatially balanced. Figure 1 compares simple random sampling and spatially balanced random sampling.
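The simple random sampling half of this example can be sketched in a few lines. The sketch below is written in Python purely for illustration and does not use spsurvey or the GRTS algorithm. The lake coordinates are made up, and the nearest-neighbor distance is just one convenient, assumed way to look for clumping.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 100 lakes with made-up centroid coordinates.
frame = [(lake_id, rng.uniform(0, 100), rng.uniform(0, 100))
         for lake_id in range(100)]

# Simple random sample: every set of 10 distinct lakes is equally likely,
# and the coordinates play no role in the selection.
srs = rng.sample(frame, 10)


def nearest_neighbor_distances(sample):
    """Distance from each sampled lake to its closest sampled neighbor."""
    distances = []
    for i, (_, x1, y1) in enumerate(sample):
        others = [math.hypot(x1 - x2, y1 - y2)
                  for j, (_, x2, y2) in enumerate(sample) if j != i]
        distances.append(min(others))
    return distances


nn = nearest_neighbor_distances(srs)
print(f"closest pair: {min(nn):.1f}, most isolated lake: {max(nn):.1f}")
```

A small closest-pair distance alongside a large most-isolated distance is the clumping-and-gaps pattern that spatially balanced methods such as GRTS are designed to reduce.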

Figure 1: Comparison of two random sampling methods. A simple random sampling method selects 10 lakes from a sampling frame of 100 while ignoring their spatial distribution, which can leave some sampled lakes clumped together and other areas unsampled. In comparison, a spatially balanced random sampling method accounts for the spatial distribution and produces a sample well-spread in space.


Design Weights

A design weight is a continuous, numeric value assigned to each unit in the sample; it indicates how many units in the sampling frame, including similar units not selected, that sampled unit represents. For example, if a sampled unit has a design weight of 10, it represents ten similar units from the sampling frame. When each sample unit has an equal chance of being selected, all weights are equal.
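As a minimal sketch of the equal-probability case, a design weight can be computed as the frame size divided by the sample size, which is the inverse of each unit's chance of selection. The Python below is illustrative only; the variable names are assumptions, not spsurvey's.

```python
frame_size = 100   # lakes in the sampling frame
sample_size = 10   # lakes selected with equal probability

# Each lake's chance of being selected, and the weight it implies.
inclusion_probability = sample_size / frame_size
design_weight = frame_size / sample_size  # inverse of the inclusion probability

weights = [design_weight] * sample_size
print(inclusion_probability, design_weight, sum(weights))
```

The weights sum to the number of lakes in the sampling frame, which is a useful sanity check on any set of design weights.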

Target Population Parameter Estimates

Design weights are combined with observed data for each sampled unit (e.g., fish species richness) to estimate target population parameters like means, proportions, and totals that inform the original research question(s). For example, based on the sample and observed data, a researcher may estimate that 34.6% of lakes in the region of interest have fish populations in good condition, with a 95% confidence interval of 26.4% to 42.8%.
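To make the combination of weights and data concrete, here is a small sketch of weighted estimates of a total, a mean, and a proportion. The richness values are invented for illustration, and the richness threshold of 6 is an arbitrary assumption. The sketch produces point estimates only and does not attempt the confidence intervals described above.

```python
# Hypothetical fish species richness observed at 10 sampled lakes.
richness = [4, 7, 5, 9, 6, 3, 8, 5, 6, 7]
weights = [10.0] * len(richness)  # each sampled lake represents 10 lakes

# Weighted total: richness summed over all lakes the sample represents.
estimated_total = sum(w * y for w, y in zip(weights, richness))

# Weighted mean: the total divided by the number of lakes represented.
estimated_mean = estimated_total / sum(weights)

# Weighted proportion of lakes with richness of at least 6 (assumed threshold).
estimated_proportion = (
    sum(w for w, y in zip(weights, richness) if y >= 6) / sum(weights)
)

print(estimated_total, estimated_mean, estimated_proportion)
```

With equal weights these reduce to the ordinary sample mean and proportion; the weights start to matter once units have unequal chances of selection.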

Example: Part 2

Three separate spatially balanced random samples of 10 lakes are selected from the sampling frame of 100 lakes (Figure 2). Each sampled lake is assigned a weight of 10, meaning that each site represents an equal share of the sampling frame (10 similar lakes). The weights are combined with data at each site to estimate the proportion of the target population in Good, Fair, or Poor condition.
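The idea that different random samples give different estimates can be sketched as follows. The frame's condition counts (35 Good, 40 Fair, 25 Poor) are made up, and simple random sampling stands in for spatially balanced sampling to keep the sketch short, so this is not a GRTS example.

```python
import random
from collections import Counter

# Hypothetical frame of 100 lakes with an assumed true condition for each.
frame = ["Good"] * 35 + ["Fair"] * 40 + ["Poor"] * 25


def estimate_proportions(sample, weight):
    """Weighted proportion of the target population in each condition class."""
    weighted = Counter()
    for condition in sample:
        weighted[condition] += weight
    total = weight * len(sample)
    return {c: weighted[c] / total for c in ("Good", "Fair", "Poor")}


estimates = []
for seed in (1, 2, 3):  # three independent samples
    sample = random.Random(seed).sample(frame, 10)
    weight = len(frame) / len(sample)  # 100 / 10 = 10 lakes per sampled lake
    estimates.append(estimate_proportions(sample, weight))

for estimate in estimates:
    print(estimate)
```

The three sets of proportions will generally differ from one another and from the assumed true values, which is the sampling variability that confidence intervals are meant to describe.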

Figure 2: An overview of the survey design process. Three different spatially balanced random samples produce three different sets of population estimates.


Stratification

Sometimes stratification is used as part of the design process to ensure random samples are evenly distributed among different strata (i.e., groups). For example, small lakes are more prevalent than large lakes on the landscape, so a random sample ignoring lake size would yield mostly small lakes. As a result, the final dataset would not have very much information on large lakes. However, sample sizes can be set separately for small and large lakes by stratifying, allowing the selection of similarly sized spatially balanced samples for each group. By stratifying the sample selection into such subgroups (sometimes called subpopulations), we more effectively ensure the sample accurately reflects the characteristics of each subgroup within the overall population. Additionally, stratifying ensures that enough sites can be selected in each group to allow for comparisons between them (i.e., comparing small lakes with large lakes in the analyses).

Stratifying changes the likelihood of some units being selected; therefore, the design weights must be adjusted to account for those unequal selections.

Example: Part 3

Suppose that, of the 100 lakes in the sampling frame, 90 are small lakes and 10 are large lakes. Selecting 10 sites randomly does not guarantee a specific distribution of small and large lakes. For example, one random sample could select 10 small lakes and 0 large lakes, another could select 9 small lakes and 1 large lake, and yet another could select 3 small lakes and 7 large lakes. In an unstratified random sample (as in Example: Part 2), each of these lakes, whether small or large, would be assigned a weight of 10. While the sample would represent all lakes, reporting on small lakes and large lakes separately may be difficult, or even impossible, due to sample size constraints. For example, if 10 small lakes and 0 large lakes are selected, reporting on large lakes as a subgroup is impossible.
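How likely is it that an unstratified sample misses large lakes? Assuming a simple random sample of 10 from the 90 small and 10 large lakes, the number of large lakes selected follows a hypergeometric distribution. The sketch below works out those probabilities; the function name is hypothetical.

```python
from math import comb


def prob_large_lakes(k, small=90, large=10, n=10):
    """Probability that a simple random sample of n lakes has k large lakes."""
    return comb(large, k) * comb(small, n - k) / comb(small + large, n)


p_none = prob_large_lakes(0)
p_two_or_fewer = sum(prob_large_lakes(k) for k in range(3))

print(f"P(no large lakes)        = {p_none:.3f}")
print(f"P(two or fewer large)    = {p_two_or_fewer:.3f}")
```

Under these assumptions, roughly one sample in three contains no large lakes at all, which is why subgroup reporting cannot be left to chance.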

A stratified random sample is used to select a prespecified number of sites from each group (or “stratum”); in Figure 3, for example, 5 small lakes and 5 large lakes. Because the sites in these strata now represent different portions of the sampling frame, the weights will be different between the two groups. Each of the 5 small lakes will represent an equal proportion of the 90 small lakes in the sampling frame and receive a weight of 18, while each of the 5 large lakes will represent an equal proportion of the 10 large lakes in the sampling frame and receive a weight of 2. Stratification ensures that there are enough small lakes and large lakes sampled to draw inferences for both groups separately.
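The stratum weights in this example follow from dividing each stratum's frame size by its sample size. The sketch below selects the stratified sample and attaches those weights. The lake identifiers are invented, and simple random sampling within each stratum stands in for a spatially balanced selection.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical frame split into strata by lake size.
strata = {
    "small": [f"small_{i}" for i in range(90)],
    "large": [f"large_{i}" for i in range(10)],
}
n_per_stratum = {"small": 5, "large": 5}

sample = []
for name, units in strata.items():
    # Weight = lakes in the stratum / lakes sampled from the stratum.
    weight = len(units) / n_per_stratum[name]
    for unit in rng.sample(units, n_per_stratum[name]):
        sample.append((unit, name, weight))

for unit, name, weight in sample:
    print(unit, name, weight)
```

Small lakes receive a weight of 18 and large lakes a weight of 2, and the ten weights together still sum to the 100 lakes in the frame.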

Another twist: Sometimes lakes cannot be sampled (e.g., lack of access, lake is dry, etc.). When data are not collected from a sample unit, statisticians call this “nonresponse.” The reason for nonresponse informs the approach statisticians use to adjust the original weights to account for the missing data, ensuring the sample remains representative of the target population.
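One common adjustment, shown here only as an assumed illustration, spreads the weight of nonresponding units across the responding units in the same stratum. This is reasonable only if the unsampled lakes resemble the sampled ones within that stratum, and as noted above, the actual reason for nonresponse should guide the choice of method. The data are hypothetical: one of the five small lakes could not be accessed.

```python
from collections import defaultdict

# (stratum, design weight, data collected?) for each selected lake.
sample = (
    [("small", 18.0, True)] * 4
    + [("small", 18.0, False)]   # access denied: nonresponse
    + [("large", 2.0, True)] * 5
)

selected_total = defaultdict(float)    # weight of all selected lakes
responding_total = defaultdict(float)  # weight of lakes with data
for stratum, weight, responded in sample:
    selected_total[stratum] += weight
    if responded:
        responding_total[stratum] += weight

# Scale each respondent's weight so the stratum total is preserved.
adjusted = [
    (stratum, weight * selected_total[stratum] / responding_total[stratum])
    for stratum, weight, responded in sample
    if responded
]

for stratum, weight in adjusted:
    print(stratum, weight)
```

The four responding small lakes now carry a weight of 22.5 each, so the small-lake stratum still represents 90 lakes, while the large-lake weights are unchanged.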

Figure 3: Applying stratification to a spatially balanced random sample. Stratifying by small lakes and large lakes selects an equal sample size from both groups, allowing for all lakes to be reported in addition to both small lakes and large lakes separately.


Stratification Tradeoffs

A benefit of stratification is that units in subgroups are guaranteed to be sampled the desired number of times, and thus, reporting for that subgroup is possible. A drawback of stratification, however, is that increasing samples in small subgroups requires decreasing samples in large subgroups, which can increase uncertainty associated with large-scale (e.g., national) population estimates. These benefits and drawbacks should be evaluated and prioritized when designing a statistical survey.
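One rough way to see this tradeoff is Kish's approximate effective sample size, (sum of weights)² / (sum of squared weights). This approximation is not part of the article and is brought in here only as an assumed illustration: it ignores any precision gained from stratification itself and simply shows how unequal weights can reduce the information available for overall estimates.

```python
def effective_sample_size(weights):
    """Kish's approximation: (sum of w)^2 / (sum of w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


equal_weights = [10.0] * 10                    # unstratified, as in Part 2
stratified_weights = [18.0] * 5 + [2.0] * 5    # stratified, as in Part 3

n_eff_equal = effective_sample_size(equal_weights)
n_eff_stratified = effective_sample_size(stratified_weights)

print(f"equal weights:      {n_eff_equal:.1f}")
print(f"stratified weights: {n_eff_stratified:.1f}")
```

Under this approximation, the ten stratified lakes carry about as much information for an all-lakes estimate as six equally weighted lakes would, in exchange for the ability to report on large lakes separately.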

Point, Linear, and Areal Resources

A point resource is represented by a POINT geometry in a GIS and encompasses a finite collection of units in the sampling frame. Lakes, treated as a whole with location represented as a single centroid, are an example of a point resource; there are a finite, defined number of individual lakes to sample. In contrast, a linear resource is represented by a LINESTRING geometry in a GIS and encompasses an infinite number of units in the sampling frame. An example of a linear resource is a stream network; there are an infinite number of possible locations to sample along the length of any given stream. Finally, an areal resource is represented by a POLYGON geometry in a GIS and, like a linear resource, encompasses an infinite number of units in the sampling frame. An example of an areal resource is an estuary; there are an infinite number of possible locations to sample within the boundaries of any given estuary.

Survey design principles are consistent regardless of resource type. However, the interpretation of the weights and resulting population estimates does change depending on the resource type (Table 1).

Table 1: Differences across resource types

| Resource Type | Units | Example Interpretation |
|---|---|---|
| Point | Individual units | Each weight represents a number of individual lakes similar to the one being sampled; population estimates are for all lakes in the target population |
| Linear | Lengths | Each weight represents a number of stream miles similar to the reach being sampled; population estimates are for all stream miles in the target population |
| Areal | Areas | Each weight represents the area similar to the estuary site being sampled; population estimates are for all square miles of estuary in the target population |
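For a linear resource, the arithmetic is the same as for lakes, but the weights are lengths rather than counts. The sketch below uses four invented stream sites, each assumed to represent 25 stream miles, to estimate the stream miles in Good condition.

```python
# Hypothetical stream sites: (condition, weight in stream miles).
sites = [("Good", 25.0), ("Fair", 25.0), ("Good", 25.0), ("Poor", 25.0)]

# Total stream miles represented by the sample.
total_miles = sum(weight for _, weight in sites)

# Stream miles estimated to be in Good condition.
good_miles = sum(weight for condition, weight in sites if condition == "Good")

good_share = good_miles / total_miles
print(f"{good_miles:.0f} of {total_miles:.0f} stream miles ({good_share:.0%}) in Good condition")
```

The estimate is read as "50% of stream miles," not "50% of streams." For an areal resource the same calculation would use weights in square miles.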

Conclusion

Ecological and environmental monitoring efforts can significantly benefit from incorporating statistical surveys. Statistical surveys leverage a target population definition, a sampling frame, a random sampling mechanism that may be spatially balanced, design weights, and target population parameter estimates to answer scientific questions of interest. spsurvey provides tools for users to select spatially balanced GRTS samples of point, linear, and areal resources, construct design weights, and estimate target population parameters (Dumelle et al., 2023). Please see the other vignettes on our website (in the “Articles” tab) to learn more.

References

Dumelle, Michael, Kincaid, Tom, Olsen, Anthony R., and Weber, Marc. (2023). spsurvey: Spatial Sampling Design and Analysis in R. Journal of Statistical Software, 105(3), 1-29.

Paulsen, Steven, Hughes, Robert M., and Larsen, David P. (1998). Critical Elements in Describing and Understanding our Nation’s Aquatic Resources. Journal of the American Water Resources Association, 34(5), 995-1005.

Stevens Jr, D. L. and Olsen, A. R. (2004). Spatially balanced sampling of natural resources. Journal of the American Statistical Association, 99(465), 262-278.


  1. For ecological monitoring projects, it is important to note the difference between the statistical use of “sample” or “sampling” and the ecological use of the word. In ecological monitoring, sampling means collection of chemical, physical, or biological data. To reduce confusion, we will refer to “ecological sampling” as monitoring or data collection, and “sampling” and “sample units” as part of the statistical design process.

