Methodology

Source: vignettes/methodology.Rmd

In the following we present the methodology in surveysd by applying the workflow described in vignette("surveysd") to multiple consecutive years of EU-SILC data for one country. The methodology consists of the following steps, in this order:

  • Draw $B$ bootstrap replicates from EU-SILC data for each year $y_t$, $t=1,\ldots,n_y$, separately. Since EU-SILC has a rotating panel design, the bootstrap replicate of a household is carried forward through the years. That is, the bootstrap replicate of a household in the follow-up years is set equal to the bootstrap replicate of the same household when it first enters EU-SILC.
  • Multiply each set of bootstrap replicates by the sampling weights to obtain uncalibrated bootstrap weights, and calibrate each set of uncalibrated bootstrap weights using iterative proportional fitting.
  • Estimate the point estimate of interest $\theta$ for each year and each calibrated bootstrap weight to obtain $\theta^{(i,y_t)}$, $t=1,\ldots,n_y$, $i=1,\ldots,B$. For fixed $y_t$ and each $i$, apply a filter with equal weights on $\theta^{(i,y^*)}$, $y^*\in \{y_{t-1},y_{t},y_{t+1}\}$, to obtain $\tilde{\theta}^{(i,y_t)}$. Estimate the variance of $\theta$ using the distribution of $\tilde{\theta}^{(i,y_t)}$.

Bootstrapping

Bootstrapping has long been used to estimate confidence intervals and standard errors of point estimates (Efron 1979). Given a random sample $(X_1,\ldots,X_n)$ drawn from an unknown distribution $F$, the distribution of a point estimate $\theta(X_1,\ldots,X_n;F)$ can in many cases not be determined analytically. With bootstrapping, however, one can simulate the distribution of $\theta$.

Let $s_{(.)}$ be a bootstrap sample, e.g. obtained by drawing $n$ observations with replacement from the sample $(X_1,\ldots,X_n)$. Then one can estimate the standard deviation of $\theta$ using $B$ bootstrap samples through

$$sd(\theta) = \sqrt{\frac{1}{B-1}\sum\limits_{i=1}^B (\theta(s_i)-\overline{\theta})^2} \quad,$$

with $\overline{\theta}:=\frac{1}{B}\sum\limits_{i=1}^B\theta(s_i)$ as the sample mean over all bootstrap samples.
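The estimator above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names `bootstrap_sd` and `mean` and the toy data are assumptions made for this example and are not part of surveysd.

```python
import math
import random

def bootstrap_sd(sample, estimator, B=1000, seed=0):
    """Estimate sd(theta) from B bootstrap samples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    # theta(s_i) for each bootstrap sample s_i of size n
    thetas = [estimator(rng.choices(sample, k=n)) for _ in range(B)]
    theta_bar = sum(thetas) / B
    return math.sqrt(sum((t - theta_bar) ** 2 for t in thetas) / (B - 1))

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical toy data
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
sd_mean = bootstrap_sd(data, mean, B=500, seed=42)
```

Any function of a sample can be passed as `estimator`; the seed only makes the sketch reproducible.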

In the context of sample surveys with sampling weights one can use bootstrapping to calculate so-called bootstrap weights. These are computed via the bootstrap samples $s_{i}$, $i=1,\ldots,B$, where for each $s_{i}$ every unit of the original sample can appear $0$ to $m$ times. With $f_j^{i}$ as the frequency of occurrence of observation $j$ in bootstrap sample $s_i$, the uncalibrated bootstrap weights $\tilde{b}_{j}^{i}$ are defined as

$$\tilde{b}_{j}^{i} = f_j^{i} w_j \quad,$$

with $w_j$ as the calibrated sampling weight of the original sample. Using iterative proportional fitting procedures one can recalibrate the bootstrap weights $\tilde{b}_{j}^{i}$, $i=1,\ldots,B$, to get the adapted or calibrated bootstrap weights $b_j^i$, $i=1,\ldots,B$.
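As a rough sketch of the definition $\tilde{b}_{j}^{i} = f_j^{i} w_j$ for the naive bootstrap, the frequencies $f_j^i$ can be counted from a with-replacement draw of unit indices. The helper name `uncalibrated_weights` and the weights are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter

def uncalibrated_weights(weights, rng):
    """Return f_j * w_j for one naive bootstrap sample of the units."""
    n = len(weights)
    # f_j: how often unit j appears when n units are drawn with replacement
    freq = Counter(rng.choices(range(n), k=n))
    return [freq[j] * weights[j] for j in range(n)]

rng = random.Random(1)
w = [10.0, 12.5, 8.0, 20.0]   # hypothetical calibrated sampling weights w_j
B = 3
boot = [uncalibrated_weights(w, rng) for _ in range(B)]
```

Units that are not drawn receive weight 0 in that replicate, which is one reason the replicate totals drift away from the calibrated margins and need recalibration.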

Rescaled Bootstrap

Since EU-SILC is a stratified sample drawn without replacement from a finite population, the naive bootstrap procedure described above does not take into account the heterogeneous inclusion probabilities of the sample units. Thus it will not yield satisfactory results. Therefore we use the so-called rescaled bootstrap procedure introduced and investigated by Rao and Wu (1988). The bootstrap samples are selected without replacement and incorporate the stratification as well as clustering on multiple stages (see Chipperfield and Preston 2007; Preston 2009).

For simplicity we will only describe the rescaled bootstrap procedure for a two-stage stratified sampling design. For more details on a general formulation please see Preston (2009).

Sampling design

Consider the finite population $U$, which is divided into $H$ non-overlapping strata, $\bigcup\limits_{h=1,\ldots,H} U_h = U$, where each stratum $h$ contains $N_h$ clusters. For each stratum $h$, $n_h$ clusters $C_{hc}$, $c=1,\ldots,n_h$, are drawn, each containing $N_{hc}$ households. Furthermore, in each cluster $C_{hc}$ of each stratum $h$, simple random sampling is performed to select a set of households $Y_{hcj}$, $j=1,\ldots,n_{hc}$.

Bootstrap procedure

In contrast to the naive bootstrap procedure, where for a stage containing $n$ sampling units the bootstrap replicate is obtained by drawing $n$ sampling units with replacement, the rescaled bootstrap procedure draws $n^*=\left\lfloor\frac{n}{2}\right\rfloor$ sampling units without replacement. Given a value $x$, $\lfloor x\rfloor$ denotes the largest integer less than or equal to $x$, whereas $\lceil x\rceil$ denotes the smallest integer greater than or equal to $x$. Chipperfield and Preston (2007) have shown that the choice of either $\left\lfloor\frac{n}{2}\right\rfloor$ or $\left\lceil\frac{n}{2}\right\rceil$ is optimal for bootstrap samples without replacement, although $\left\lfloor\frac{n}{2}\right\rfloor$ has the desirable property that the resulting uncalibrated bootstrap weights will never be negative.

At the first stage the $i$-th bootstrap replicate, $f^{i,1}_{hc}$, for each cluster $C_{hc}$, $c=1,\ldots,n_h$, belonging to stratum $h$, is defined by

$$f^{i,1}_{hc} = 1-\lambda_h+\lambda_h\frac{n_h}{n_h^*}\delta_{hc} \quad\quad \forall c \in \{1,\ldots,n_h\}$$

with

$$n_h^* = \left\lfloor\frac{n_h}{2}\right\rfloor$$

$$\lambda_h = \sqrt{\frac{n_h^*(1-\frac{n_h}{N_h})}{n_h-n_h^*}} \quad ,$$

where $\delta_{hc}=1$ if cluster $c$ is selected in the sub-sample of size $n_h^*$ and 0 otherwise.
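The first-stage factor can be sketched for a single stratum as follows. This is an illustration of the formula above only, not the surveysd implementation; the function name `first_stage_factors` and the stratum sizes are assumptions for the example.

```python
import math
import random

def first_stage_factors(n_h, N_h, rng):
    """Rescaled bootstrap factors f^{i,1}_{hc} for the n_h sampled clusters
    of one stratum that contains N_h clusters in the population."""
    if n_h < 2:
        raise ValueError("single PSU: combine strata before bootstrapping")
    n_star = n_h // 2                                # n_h^* = floor(n_h / 2)
    lam = math.sqrt(n_star * (1 - n_h / N_h) / (n_h - n_star))
    # delta_hc = 1 for the n_h^* clusters drawn without replacement
    selected = set(rng.sample(range(n_h), n_star))
    return [1 - lam + lam * (n_h / n_star) * (1 if c in selected else 0)
            for c in range(n_h)]

rng = random.Random(7)
factors = first_stage_factors(n_h=10, N_h=100, rng=rng)
```

Because exactly $n_h^*$ clusters have $\delta_{hc}=1$, the factors of a stratum always sum to $n_h$, and with $n_h^* = \lfloor n_h/2 \rfloor$ we have $\lambda_h \le 1$, so no factor is negative.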

The $i$-th bootstrap replicate at the second stage, $f^{i,2}_{hcj}$, for each household $Y_{hcj}$, $j=1,\ldots,n_{hc}$, belonging to cluster $c$ in stratum $h$ is defined by

$$f^{i,2}_{hcj} = f^{i,1}_{hc} - \lambda_{hc}\sqrt{\frac{n_h}{n_h^*}}\delta_{hc}\left[\frac{n_{hc}}{n_{hc}^*}\delta_{hcj}-1\right] \quad\quad \forall c \in \{1,\ldots,n_h\}$$

with

$$n_{hc}^* = \left\lfloor\frac{n_{hc}}{2}\right\rfloor$$

$$\lambda_{hc} = \sqrt{\frac{n_{hc}^*N_h(1-\frac{n_{hc}}{N_{hc}})}{n_{hc}-n_{hc}^*}} \quad ,$$

where $\delta_{hcj}=1$ if household $j$ is selected in the sub-sample of size $n_{hc}^*$ and 0 otherwise.

Single PSUs

When dealing with multistage sampling designs, the issue of single PSUs, i.e. only a single response unit being present at a stage or in a stratum, can occur. When applying bootstrapping procedures these single PSUs can lead to a variety of issues. For the methodology proposed in this work we combined single PSUs at each stage with the next smallest stratum or cluster before applying the bootstrap procedure.

Taking bootstrap replicates forward

The bootstrap procedure above is applied to the EU-SILC data for each year $y_t$, $t=1,\ldots,n_y$, separately. Since EU-SILC is a yearly survey with a rotating panel design, the $i$-th bootstrap replicate at the second stage, $f^{i,2}_{hcj}$, for a household $Y_{hcj}$ is taken forward until the household $Y_{hcj}$ drops out of the sample. That is, for a household $Y_{hcj}$ which enters EU-SILC in year $y_1$ and drops out in year $y_{\tilde{t}}$, the bootstrap replicates for the years $y_2,\ldots,y_{\tilde{t}}$ are set to the bootstrap replicate of the year $y_1$.
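A minimal sketch of carrying a replicate forward, for one fixed $i$: the data layout (a dictionary per year keyed by household id) and the name `carry_forward` are assumptions for illustration. For simplicity the toy input contains a drawn factor for every household in every year, although only the factor from the entry year is ever used.

```python
def carry_forward(drawn):
    """drawn maps year -> {household_id: factor drawn in that year}.
    Returns the same structure with each household's entry-year factor
    repeated in all later years in which the household is present."""
    first = {}      # household_id -> factor from the year it entered
    carried = {}
    for year in sorted(drawn):
        carried[year] = {}
        for hid, factor in drawn[year].items():
            first.setdefault(hid, factor)   # keep only the first factor seen
            carried[year][hid] = first[hid]
    return carried

# Hypothetical rotating panel: A leaves after 2020, C enters 2020, D enters 2021
drawn = {
    2019: {"A": 1.9, "B": 0.1},
    2020: {"A": 0.1, "B": 0.1, "C": 1.9},
    2021: {"B": 1.9, "C": 0.1, "D": 1.9},
}
carried = carry_forward(drawn)
```

The sketch treats a household that leaves and later reappears as continuing, which is a simplification.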

Split households

Due to the rotating panel design, so-called split households can occur. For a household participating in the EU-SILC survey it is possible that one or more residents move to a new, so-called split household, which is followed up in the next wave. To take this dynamic into account we extended the procedure of taking forward the bootstrap replicate of a household for consecutive waves of EU-SILC by also taking the bootstrap replicate forward to the split household. This means that any new individuals in the split household also inherit this bootstrap replicate.

Taking bootstrap replicates forward as well as considering split households ensures that the bootstrap replicates are more comparable in structure with the actual design of EU-SILC.
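The treatment of split households can be sketched for one wave as follows. The mapping `parent_of` from a split household to the household it originated from, and all identifiers, are hypothetical names introduced for this example.

```python
def replicate_for_wave(prev, current_ids, parent_of, new_draws):
    """Assign a bootstrap replicate factor to every household in a wave.

    prev        : {household_id: factor} from the previous wave
    current_ids : household ids present in the current wave
    parent_of   : {split_household_id: originating household_id}
    new_draws   : {household_id: factor} drawn for newly entering households
    """
    out = {}
    for hid in current_ids:
        if hid in prev:                       # continuing household
            out[hid] = prev[hid]
        elif parent_of.get(hid) in prev:      # split household inherits
            out[hid] = prev[parent_of[hid]]
        else:                                 # new entrant gets a fresh draw
            out[hid] = new_draws[hid]
    return out

prev = {"A": 1.9, "B": 0.1}
wave = replicate_for_wave(
    prev,
    current_ids=["A", "A_split", "C"],
    parent_of={"A_split": "A"},
    new_draws={"C": 0.1},
)
```

Since the factor is attached to the household, every person in `A_split`, including people who were not in household `A` before, receives the inherited factor.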

Uncalibrated bootstrap weights

Using the $i$-th bootstrap replicates at the second stage, one can calculate the $i$-th uncalibrated bootstrap weights $\tilde{b}_{hcj}^{i}$ for each household $Y_{hcj}$ in cluster $c$ contained in stratum $h$ by

$$\tilde{b}_{hcj}^{i} = f^{i,2}_{hcj} w_{hcj} \quad,$$

where $w_{hcj}$ corresponds to the original household weight contained in the sample.

For ease of readability we will drop the subindices for stratum $h$ and cluster $c$ in the following sections, meaning that the $j$-th household in cluster $c$ contained in stratum $h$, $Y_{hcj}$, will now be denoted as the $j$-th household, $Y_{j}$, where $j$ is the position of the household in the data. Accordingly, the $i$-th uncalibrated bootstrap weight for household $j$ is denoted as $\tilde{b}_j^{i}$ and the original household weight as $w_j$.

Iterative proportional fitting (IPF)

The uncalibrated bootstrap weights $\tilde{b}_j^{i}$ computed through the rescaled bootstrap procedure yield population statistics that differ from the known population margins of the sociodemographic variables for which the base weights $w_j$ have been calibrated. To adjust for this, the bootstrap weights $\tilde{b}_{j}^{i}$ can be recalibrated using iterative proportional fitting as described in Meraner, Gumprecht, and Kowarik (2016).

Let the original weight $w_{j}$ be calibrated for $n=n_P+n_H$ sociodemographic variables, which are divided into the sets $\mathcal{P}:=\{p_{c}, c=1,\ldots,n_P\}$ and $\mathcal{H}:=\{h_{c}, c=1,\ldots,n_H\}$. $\mathcal{P}$ and $\mathcal{H}$ correspond to personal variables, for example gender or age, and household variables, like region or household size, respectively. Each variable in either $\mathcal{P}$ or $\mathcal{H}$ can take on $P_{c}$ or $H_{c}$ values, with $N^{p_c}_v$, $v=1,\ldots,P_c$, or $N^{h_c}_v$, $v=1,\ldots,H_c$, as the corresponding population margins. Starting with $k=0$, the iterative proportional fitting procedure is applied to each $\tilde{b}_j^{i}$, $i=1,\ldots, B$, separately. The weights are first updated for personal and afterwards for household variables. If the constraints regarding the population margins are not met, $k$ is raised by 1 and the procedure starts from the beginning. In the following, denote the starting weight as $\tilde{b}_j^{[0]}:=\tilde{b}_j^{i}$ for fixed $i$.

Adjustment and trimming for $\mathcal{P}$

The uncalibrated bootstrap weight $\tilde{b}_j^{[(n+1)k+c-1]}$ for the $j$-th observation is iteratively multiplied by a factor so that the projected distribution of the population matches the respective calibration specification $N^{p_c}_v$, $c=1, \ldots,n_P$. For each $c \in \left\{1, \ldots,n_P\right\}$ the calibrated weights against $N^{p_c}_v$ are computed as

$$\tilde{b}_j^{[(n+1)k+c]} = \tilde{b}_j^{[(n+1)k+c-1]}\frac{N^{p_c}_v}{\sum\limits_l \tilde{b}_l^{[(n+1)k+c-1]}} \quad,$$

where the summation in the denominator extends over all observations which have the same value $v$ as observation $j$ for the sociodemographic variable $p_c$. If any weights $\tilde{b}_j^{[(n+1)k+c]}$ fall outside the range $\left[\frac{w_j}{4};4w_j\right]$, they are recoded to the nearer of the two boundaries. The choice of the boundaries results from expert-based opinions and restricts the variance of the weights, which has a positive effect on the sampling error. This procedure represents a common form of weight trimming, where very large or small weights are trimmed in order to reduce variance in exchange for a possible increase in bias (Potter 1990; Potter 1993).
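One adjustment-and-trimming step for a single personal variable can be sketched as below. The function name `rake_and_trim`, the category labels and the margins are made up for illustration; the sketch assumes every category has a positive weighted total.

```python
def rake_and_trim(weights, categories, margins, base_weights):
    """One IPF step for one variable, followed by trimming to [w_j/4, 4*w_j].

    weights      : current bootstrap weights, one per observation
    categories   : value of the calibration variable for each observation
    margins      : {value: known population total N_v}
    base_weights : original weights w_j used for the trimming bounds
    """
    totals = {}
    for b, v in zip(weights, categories):
        totals[v] = totals.get(v, 0.0) + b          # weighted total per value
    adjusted = [b * margins[v] / totals[v] for b, v in zip(weights, categories)]
    # Recode weights outside [w_j/4, 4*w_j] to the nearer boundary
    return [min(max(b, w0 / 4), 4 * w0) for b, w0 in zip(adjusted, base_weights)]

base = [10.0, 10.0, 10.0, 10.0]
boot = [19.0, 1.0, 10.0, 10.0]
sex = ["f", "f", "m", "m"]
out = rake_and_trim(boot, sex, {"f": 22, "m": 18}, base)
```

In this toy case the second weight is raked to 1.1 and then lifted to the lower bound 2.5, so the margin for `"f"` is no longer met exactly after trimming; this is one reason the procedure has to iterate.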

Averaging weights within households

Since the sociodemographic variables $p_1,\ldots,p_{n_P}$ include person-specific variables, the weights $\tilde{b}_j^{[(n+1)k+n_P]}$ resulting from the iterative multiplication can be unequal for members of the same household. This can lead to inconsistencies between results projected with household and person weights. To avoid such inconsistencies each household member is assigned the mean of the weights within the household. That is, for each person $j$ in household $a$ with $h_a$ household members, the weights are defined by

$$\tilde{b}_j^{[(n+1)k+n_P+1]} = \frac{\sum\limits_{l\in a} \tilde{b}_l^{[(n+1)k+n_P]}}{h_a} \quad.$$

This can distort the fit to the personal population margins achieved in the previous subsection.
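The averaging step can be sketched as follows; `average_within_households` and the household ids are hypothetical names used only for this illustration.

```python
def average_within_households(weights, household_ids):
    """Replace each person's weight by the mean weight of their household."""
    sums, counts = {}, {}
    for b, a in zip(weights, household_ids):
        sums[a] = sums.get(a, 0.0) + b
        counts[a] = counts.get(a, 0) + 1            # h_a: household size
    return [sums[a] / counts[a] for a in household_ids]

person_weights = [20.9, 2.5, 9.0, 9.0]
households = ["a", "a", "b", "c"]
averaged = average_within_households(person_weights, households)
```

The total weight is preserved, but the totals within categories of a personal variable generally change, because household members may fall into different categories.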

Adjustment and trimming for $\mathcal{H}$

After the adjustment for personal variables the weights $\tilde{b}_j^{[(n+1)k+n_P+1]}$ are updated for the set of household variables $\mathcal{H}$ according to a household convergence constraint parameter $\epsilon_h$. The parameter $\epsilon_h$ represents the allowed deviation of the population totals estimated with the current weights from the margins $N^{h_c}_v$, $c=1,\ldots,n_H$, $v=1,\ldots,H_c$. For each $c \in \{1,\ldots,n_H\}$ the updated weights are computed as

$$\tilde{b}_j^{[(n+1)k+n_P+c+1]} = \begin{cases} \tilde{b}_j^{[(n+1)k+n_P+c]}\frac{N^{h_c}_v}{\sum\limits_{l} \tilde{b}_l^{[(n+1)k+n_P+c]}} \quad \text{if } \sum\limits_{l} \tilde{b}_l^{[(n+1)k+n_P+c]} \notin ((1-0.9\epsilon_h)N^{h_c}_v,(1+0.9\epsilon_h)N^{h_c}_v) \\ \tilde{b}_j^{[(n+1)k+n_P+c]} \quad \text{otherwise} \end{cases}$$

with the summations ranging over all households $l$ which take on the same value $v$ for $h_c$ as observation $j$. As described in the previous subsection, the new weights are recoded if they fall outside the interval $\left[\frac{w_j}{4};4w_j\right]$ and are set to the lower or upper bound, depending on whether $\tilde{b}_j^{[(n+1)k+n_P+c+1]}$ falls below or above the interval, respectively.
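A sketch of the conditional household update: weights in a category are only rescaled when the current weighted total lies outside the $0.9\,\epsilon_h$ band around the margin, and are trimmed afterwards. The function name, the region labels and the margins are assumptions for this example.

```python
def adjust_household(weights, categories, margins, base_weights, eps_h):
    """Conditional IPF step for one household variable with trimming."""
    totals = {}
    for b, v in zip(weights, categories):
        totals[v] = totals.get(v, 0.0) + b
    out = []
    for b, v, w0 in zip(weights, categories, base_weights):
        N = margins[v]
        inside = (1 - 0.9 * eps_h) * N < totals[v] < (1 + 0.9 * eps_h) * N
        if not inside:
            b = b * N / totals[v]                # rescale only if needed
        out.append(min(max(b, w0 / 4), 4 * w0))  # trim to [w_j/4, 4*w_j]
    return out

hh_weights = [10.0, 10.0, 10.0, 10.0]
region = ["east", "east", "west", "west"]
hh_margins = {"east": 20.1, "west": 30}
updated = adjust_household(hh_weights, region, hh_margins, hh_weights, eps_h=0.01)
```

Here the total for `"east"` (20) is already within 0.9% of its margin and is left untouched, while `"west"` is scaled up by the factor 1.5.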

Convergence

For each adjustment and trimming step $j\in \{1,\ldots,n+1\}\backslash \{n_P+1\}$, the weighted total $\sum\limits_{l} \tilde{b}_l^{[(n+1)k+j]}$ entering the factor $\frac{N^{(.)}_v}{\sum\limits_{l} \tilde{b}_l^{[(n+1)k+j]}}$ is checked against the convergence constraints for household variables, $\epsilon_h$, or personal variables, $\epsilon_p$, where $(.)$ corresponds to either a household or a personal variable. To be more precise, for variables in $\mathcal{P}$ the constraints

$$\sum\limits_l \tilde{b}_l^{[(n+1)k+j]} \in ((1-\epsilon_p)N^{p_c}_v,(1+\epsilon_p)N^{p_c}_v)$$

and for variables in $\mathcal{H}$ the constraints

$$\sum\limits_l \tilde{b}_l^{[(n+1)k+j]} \in ((1-\epsilon_h)N^{h_c}_v,(1+\epsilon_h)N^{h_c}_v)$$

are verified, where the sum extends over all observations which have the same value $v$ for the variable $p_c$ or $h_c$. If these constraints hold true, the algorithm has converged; otherwise $k$ is raised by 1 and the procedure is repeated.

The calibration procedure described above is applied to each year $y_t$ of EU-SILC separately, $t=1,\ldots,n_y$, thus resulting in so-called calibrated bootstrap sample weights $b_{j}^{(i,{y_t})}$, $i=1,\ldots,B$, for each year $y_t$ and each household $j$.
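The convergence check can be sketched as below, under the assumption that the constraints are read as requiring every weighted total to lie within a relative tolerance $\epsilon$ of its population margin. The function name `converged` and the toy values are hypothetical.

```python
def converged(weights, categories, margins, eps):
    """True if every weighted category total lies within eps of its margin."""
    totals = {}
    for b, v in zip(weights, categories):
        totals[v] = totals.get(v, 0.0) + b
    return all(
        (1 - eps) * N < totals.get(v, 0.0) < (1 + eps) * N
        for v, N in margins.items()
    )

cal_weights = [10.0, 10.0, 15.0, 15.0]
cal_region = ["east", "east", "west", "west"]
cal_margins = {"east": 20.1, "west": 30}

ok_loose = converged(cal_weights, cal_region, cal_margins, eps=0.01)
ok_strict = converged(cal_weights, cal_region, cal_margins, eps=0.001)
```

A full calibration loop would run the personal steps, the household averaging and the household steps, call such a check for every calibration variable, and repeat with $k+1$ until all checks pass or an iteration limit is reached.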

Variance estimation

Applying the previously described algorithms to EU-SILC data for multiple consecutive years $y_t$, $t=1,\ldots,n_y$, yields calibrated bootstrap sample weights $b_{j}^{(i,{y_t})}$ for each year $y_t$. Using the calibrated bootstrap sample weights it is straightforward to compute the standard error of a point estimate $\theta(\textbf{X}^{y_t},\textbf{w}^{y_t})$ for year $y_t$, with $\textbf{X}^{y_t}=(X_1^{y_t},\ldots,X_n^{y_t})$ as the vector of observations for the variable of interest in the survey and $\textbf{w}^{y_t}=(w_1^{y_t},\ldots,w_n^{y_t})$ as the corresponding weight vector, with

$$sd(\theta) = \sqrt{\frac{1}{B-1}\sum\limits_{i=1}^B (\theta^{(i,y_t)}-\overline{\theta^{(.,y_t)}})^2}$$

with

$$\overline{\theta^{(.,y_t)}} = \frac{1}{B}\sum\limits_{i=1}^B\theta^{(i,y_t)} \quad,$$

where $\theta^{(i,y_t)}:=\theta(\textbf{X}^{y_t},\textbf{b}^{(i,{y_t})})$ is the estimate of $\theta$ in the year $y_t$ using the $i$-th vector of calibrated bootstrap weights.
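As a small illustration for one year, the estimate is recomputed with each vector of calibrated bootstrap weights and the standard deviation of these $B$ values is taken. The weighted mean stands in for an arbitrary indicator; the function names and the numbers are assumptions for this sketch.

```python
import math

def weighted_mean(x, w):
    return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)

def bootstrap_se(x, boot_weights, estimator=weighted_mean):
    """Standard error of estimator(x, w) from B bootstrap weight vectors."""
    thetas = [estimator(x, b) for b in boot_weights]   # theta^(i, y_t)
    B = len(thetas)
    theta_bar = sum(thetas) / B
    return math.sqrt(sum((t - theta_bar) ** 2 for t in thetas) / (B - 1))

# Hypothetical binary indicator and B = 3 calibrated bootstrap weight vectors
x = [1.0, 0.0, 1.0, 0.0]
boot_weights = [
    [1.0, 1.0, 1.0, 1.0],
    [3.0, 1.0, 3.0, 1.0],
    [1.0, 3.0, 1.0, 3.0],
]
se = bootstrap_se(x, boot_weights)
```

The three replicate estimates are 0.5, 0.75 and 0.25, which gives a standard error of 0.25. In practice $B$ would be far larger than 3.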

The standard error estimation for indicators in EU-SILC yields high-quality results at NUTS1 or country level. When estimating indicators at regional or other sub-aggregate levels, however, one is confronted with point estimates that have high variance.

To overcome this issue we propose to estimate $\theta$ for 3 consecutive years using the calibrated bootstrap weights, thus calculating $\{\theta^{(i,y_{t-1})},\theta^{(i,y_t)},\theta^{(i,y_{t+1})}\}$, $i=1,\ldots,B$. For fixed $i$ one can apply a filter with equal filter weights to the time series $\{\theta^{(i,y_{t-1})},\theta^{(i,y_t)},\theta^{(i,y_{t+1})}\}$ to create $\tilde{\theta}^{(i,y_t)}$:

$$\tilde{\theta}^{(i,y_t)} = \frac{1}{3}\left[\theta^{(i,y_{t-1})}+\theta^{(i,y_t)}+\theta^{(i,y_{t+1})}\right] \quad .$$

Doing this for all $i$, $i=1,\ldots,B$, yields $\tilde{\theta}^{(i,y_t)}$, $i=1,\ldots,B$. The standard error of $\theta$ can then be estimated with

$$sd(\theta) = \sqrt{\frac{1}{B-1}\sum\limits_{i=1}^B (\tilde{\theta}^{(i,y_t)}-\overline{\tilde{\theta}^{(.,y_t)}})^2}$$

with

$$\overline{\tilde{\theta}^{(.,y_t)}}=\frac{1}{B}\sum\limits_{i=1}^B\tilde{\theta}^{(i,y_t)} \quad.$$

Applying the filter over the time series of estimated $\theta^{(i,y_t)}$ leads to a reduction of variance for $\theta$, since the filter reduces the noise in $\{\theta^{(i,y_{t-1})},\theta^{(i,y_t)},\theta^{(i,y_{t+1})}\}$ and thus leads to a narrower distribution of $\tilde{\theta}^{(i,y_t)}$.
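The filter and the resulting standard error can be sketched as follows. The dictionary layout (year mapped to a list of $B$ replicate estimates), the assumption that years are consecutive integers, and the numbers themselves are all made up for illustration.

```python
import math

def sd(values):
    """Sample standard deviation with divisor B - 1."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def smoothed_estimates(theta, year):
    """Equal-weight 3-year filter applied replicate by replicate.
    theta maps year -> list of B estimates; position i is the i-th replicate."""
    prev, cur, nxt = theta[year - 1], theta[year], theta[year + 1]
    return [(a + b + c) / 3 for a, b, c in zip(prev, cur, nxt)]

# Hypothetical replicate estimates for B = 4 and three consecutive years
theta = {
    2019: [10.0, 12.0, 11.0, 9.0],
    2020: [11.0, 9.0, 12.0, 10.0],
    2021: [12.0, 9.0, 10.0, 11.0],
}
smooth = smoothed_estimates(theta, 2020)
se_raw = sd(theta[2020])       # standard error without filtering
se_smooth = sd(smooth)         # standard error of the filtered estimates
```

Averaging is done within each replicate $i$ before the spread across replicates is computed, so the dependence between years that arises from carrying replicates forward is reflected in the result without being modelled explicitly.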

It should also be noted that estimating indicators from a survey with a rotating panel design is in general not straightforward because of the high correlation between consecutive years. However, with our approach of using bootstrap weights, which are independent of each other, we can bypass the cumbersome calculation of various correlations and apply the weights directly to estimate the standard error. Bauer et al. (2013) showed that, using the proposed method on EU-SILC data for Austria, the reduction in the resulting standard errors corresponds to a theoretical increase in sample size of about $25\%$. Furthermore, this study compared the method to small area estimation techniques, and on average the use of bootstrap sample weights yielded more stable results.

References

Bauer, Martin, Matthias Till, Richard Heuberger, Marcel Bilgili, Thomas Glaser, Elisabeth Kafka, Johannes Klotz, et al. 2013. “Studie Zu Armut Und Sozialer Eingliederung in Den Bundesländern.” Statistik Austria [in German].
Chipperfield, James, and John Preston. 2007. “Efficient Bootstrap for Business Surveys.” Survey Methodology 33 (December): 167–72. https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X200700210494.
Efron, B. 1979. “Bootstrap Methods: Another Look at the Jackknife.” Ann. Statist. 7 (1): 1–26. https://doi.org/10.1214/aos/1176344552.
Meraner, Angelika, Daniela Gumprecht, and Alexander Kowarik. 2016. “Weighting Procedure of the Austrian Microcensus Using Administrative Data.” Austrian Journal of Statistics 45 (June): 3. https://doi.org/10.17713/ajs.v45i3.120.
Potter, Frank J. 1990. “A Study of Procedures to Identify and Trim Extreme Sampling Weights.” Proceedings of the American Statistical Association, Section on Survey Research Methods, 225–30. http://www.asasrms.org/Proceedings/papers/1990_034.pdf.
Potter, Frank J. 1993. “The Effect of Weight Trimming on Nonlinear Survey Estimates.” Proceedings of the American Statistical Association, Section on Survey Research Methods 2: 758–63. http://www.asasrms.org/Proceedings/papers/1993_127.pdf.
Preston, J. 2009. “Rescaled Bootstrap for Stratified Multistage Sampling.” Survey Methodology 35 (December): 227–34. https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X200900211044.
Rao, J. N. K., and C. F. J. Wu. 1988. “Resampling Inference with Complex Survey Data.” Journal of the American Statistical Association 83 (401): 231–41.
