| Type: | Package |
| Title: | Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation |
| Version: | 5.7.9 |
| Date: | 2025-08-01 |
| Description: | Data from statistical agencies and other institutions are mostly confidential. This package, introduced in Templ, Kowarik and Meindl (2015) <doi:10.18637/jss.v067.i04>, can be used for the generation of anonymized (micro)data, i.e. for the creation of public- and scientific-use files. The theoretical basis for the methods implemented can be found in Templ (2017) <doi:10.1007/978-3-319-50272-4>. Various risk estimation and anonymization methods are included. Note that the package includes a graphical user interface, published in Meindl and Templ (2019) <doi:10.3390/a12090191>, that allows the use of various methods of this package. |
| LazyData: | TRUE |
| ByteCompile: | TRUE |
| LinkingTo: | Rcpp |
| Depends: | R (≥ 2.10) |
| Suggests: | laeken, parallel, testthat |
| Imports: | utils, stats, graphics, car, carData, rmarkdown, knitr, data.table, xtable, robustbase, cluster, MASS, e1071, tools, Rcpp, methods, ggplot2, shiny (≥ 1.4.0), haven, rhandsontable, DT, shinyBS, prettydoc, VIM (≥ 4.7.0) |
| License: | GPL-2 |
| URL: | https://github.com/sdcTools/sdcMicro |
| Collate: | '0classes.r' 'addGhostVars.R' 'addNoise.r' 'aux_functions.r' 'createDat.R' 'createNewID.R' 'dataGen.r' 'dataSets.R' 'dRisk.R' 'dRiskRMD.R' 'dUtility.R' 'freqCalc.r' 'globalRecode.R' 'groupAndRename.R' 'GUIfunctions.R' 'indivRisk.R' 'infoLoss.R' 'LocalRecProg.R' 'localSupp.R' 'localSuppression.R' 'mdav.R' 'measure_risk.R' 'methods.r' 'microaggregation.R' 'modRisk.R' 'muargus_compatibility_functions.R' 'mvTopCoding.R' 'plotFunctions.R' 'plotMicro.R' 'pram.R' 'rankSwap.R' 'RcppExports.R' 'recordSwap.R' 'report.R' 'riskyCells.R' 'sdcMicro-package.R' 'shuffle.R' 'suda2.R' 'timeEstimation.R' 'topBotCoding.R' 'valTable.R' 'zzz.R' 'printFunctions.R' 'mafast.R' 'maG.R' 'sdcApp.R' 'show_sdcMicroObj.R' |
| RoxygenNote: | 7.3.2 |
| VignetteBuilder: | knitr |
| Encoding: | UTF-8 |
| NeedsCompilation: | yes |
| Packaged: | 2025-08-06 11:24:10 UTC; matthias |
| Author: | Matthias Templ |
| Maintainer: | Matthias Templ <matthias.templ@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2025-08-22 14:40:02 UTC |
sdcMicro: Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation
Description
Data from statistical agencies and other institutions are mostly confidential. This package, introduced in Templ, Kowarik and Meindl (2015) doi:10.18637/jss.v067.i04, can be used for the generation of anonymized (micro)data, i.e. for the creation of public- and scientific-use files. The theoretical basis for the methods implemented can be found in Templ (2017) doi:10.1007/978-3-319-50272-4. Various risk estimation and anonymization methods are included. Note that the package includes a graphical user interface, published in Meindl and Templ (2019) doi:10.3390/a12090191, that allows the use of various methods of this package.
This package includes all methods of the popular software mu-Argus plus several new methods. Compared with mu-Argus, the advantages of this package are that the results are fully reproducible even with the included GUI, that the package can be used in batch mode from other software, that the functions can be used in a very flexible way, that everybody can inspect the source code, and that no time-consuming metadata management is necessary. However, users should have detailed knowledge of SDC when applying the methods to data.
Details
The package is programmed using S4 classes and comes with a well-defined class structure.
The implemented graphical user interface (GUI) for microdata protection serves as an easy-to-handle tool for users who want to use the sdcMicro package for statistical disclosure control but are not used to the native R command line interface. In addition, interactions between objects resulting from the anonymization process are handled within the GUI. This allows frequency counts, individual risk, information loss and data utility to be automatically recalculated and displayed after each anonymization step. Moreover, the code for every anonymization step carried out within the GUI is saved in a script which can then easily be modified and reloaded.
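A minimal way to start the GUI from an interactive R session (assuming the launcher function sdcApp(), whose source file is listed in the Collate field above):

library(sdcMicro)
## opens the shiny-based GUI in the browser
sdcApp()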
| Package: | sdcMicro |
| Type: | Package |
| Version: | 5.7.9 |
| Date: | 2025-08-01 |
| License: | GPL-2 |
Author(s)
Maintainer: Matthias Templ <matthias.templ@gmail.com> (ORCID)
Authors:
Bernhard Meindl <Bernhard.Meindl@statistik.gv.at>
Alexander Kowarik <Alexander.Kowarik@statistik.gv.at> (ORCID)
Johannes Gussenbauer <johannes.gussenbauer@statistik.gv.at>
Other contributors:
Organisation For Economic Co-Operation And Development (initially published C++ code (under LGPL) for rank swapping, mdav-microaggregation, suda2 and other (hierarchical) risk measures) [copyright holder]
Statistics Netherlands (microaggregation C++ code (under EUPL v1.1)) [copyright holder]
Pascal Heus (original measure-threshold C++ code (under LGPL)) [copyright holder]
References
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Templ, M., Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67(4), 1-36, 2015. doi:10.18637/jss.v067.i04
Templ, M. and Meindl, B. Practical Applications in Statistical Disclosure Control Using R, Privacy and Anonymity in Information Management Systems, book chapter, Springer London, pp. 31-62, 2010. doi:10.1007/978-1-84996-238-4_3
Kowarik, A., Templ, M., Meindl, B., Fonteneau, F. and Prantner, B.: Testing of IHSN C++ Code and Inclusion of New Methods into sdcMicro, in: Lecture Notes in Computer Science, J. Domingo-Ferrer, I. Tinnirello (editors); Springer, Berlin, 2012, ISBN: 978-3-642-33626-3, pp. 63-77. doi:10.1007/978-3-642-33627-0_6
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php
Templ, M. New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.
See Also
Useful links: https://github.com/sdcTools/sdcMicro
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
f <- freqCalc(francdat, keyVars = c(2, 4:6), w = 8)
f
f$fk
f$Fk
## dealing with missing values:
x <- francdat
x[3,5] <- NA
x[4,2] <- x[4,4] <- NA
x[5,6] <- NA
x[6,2] <- NA
f2 <- freqCalc(x, keyVars = c(2, 4:6), w = 8)
f2$fk
f2$Fk
## individual risk calculation:
indivf <- indivRisk(f)
indivf$rk
## Local Suppression
localS <- localSupp(f, keyVar = 2, threshold = 0.25)
f2 <- freqCalc(localS$freqCalc, keyVars = c(2, 4:6), w = 8)
indivf2 <- indivRisk(f2)
indivf2$rk
## select another keyVar and run localSupp() once again,
## if you think the table is not fully protected
data(free1)
free1 <- as.data.frame(free1)
f <- freqCalc(x = free1, keyVars = 1:3, w = 30)
ind <- indivRisk(f)
## and now you can use the interactive plot for individual risk objects:
## plot(ind)
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
l1 <- localSuppression(obj = francdat, keyVars = c(2, 4:6), importance = c(1, 3, 2, 4))
l1
l1$x
l2 <- localSuppression(obj = francdat, keyVars = c(2, 4:6), k = 2)
l3 <- localSuppression(obj = francdat, keyVars = c(2, 4:6), k = 4)
## Global recoding:
data(free1)
free1 <- as.data.frame(free1)
free1[, "AGE"] <- globalRecode(obj = free1[, "AGE"],
  breaks = c(1,9,19,29,39,49,59,69,100), labels = 1:8)
## Top coding:
topBotCoding(obj = free1[, "DEBTS"], value = 9000, replacement = 9100, kind = "top")
## Numerical Rank Swapping:
data(Tarragona)
Tarragona1 <- rankSwap(Tarragona, P = 10, K0 = NULL, R0 = NULL)
## Microaggregation:
m1 <- microaggregation(Tarragona, method = "onedims", aggr = 3)
m2 <- microaggregation(Tarragona, method = "pca", aggr = 3)
## using a subset because of computation time
valTable(Tarragona[1:50, ], method = c("simple", "onedims", "pca"))
data(microData)
microData <- as.data.frame(microData)
m_micro <- microaggregation(microData, method = "mdav")
summary(m_micro)
plotMicro(m_micro, 1, which.plot = 1) # not enough observations...
data(free1)
free1 <- as.data.frame(free1)
plotMicro(x = microaggregation(free1[, 31:34], method = "onedims"), p = 1, which.plot = 1)
## disclosure risk (interval) and data utility:
m1 <- microaggregation(Tarragona, method = "onedims", aggr = 3)
dRisk(obj = Tarragona, xm = m1$mx)
dRisk(obj = Tarragona, xm = m2$mx)
dUtility(obj = Tarragona, xm = m1$mx)
dUtility(obj = Tarragona, xm = m2$mx)
## Fast generation of synthetic data with approximately
## the same covariance matrix as the original one.
data(mtcars)
cov(mtcars[, 4:6])
df_gen <- dataGen(obj = mtcars[, 4:6], n = 200)
cov(df_gen)
pairs(mtcars[, 4:6])
pairs(df_gen)
## Post-Randomization (PRAM)
x <- factor(sample(1:4, 250, replace = TRUE))
pr1 <- pram(x)
length(which(pr1$x_pram == x))
summary(pr1)
x2 <- factor(sample(1:4, 250, replace = TRUE))
length(which(pram(x2)$x_pram == x2))
data(free1)
marstat <- as.factor(free1[, "MARSTAT"])
marstatPramed <- pram(marstat)
summary(marstatPramed)
## The same functionality can also be applied to `sdcMicroObj`-objects
data(testdata)
## undo-functionality is by default restricted to data sets
## with <= `1e5` rows; to modify, env-var `sdcMicro_maxsize_undo`
## can be changed before creating a problem instance
Sys.setenv("sdcMicro_maxsize_undo" = 1e6)
## create an object
testdata$water <- factor(testdata$water)
sdc <- createSdcObj(dat = testdata,
  keyVars = c("urbrur", "roof", "walls", "electcon", "water", "relat", "sex"),
  numVars = c("expend", "income", "savings"), w = "sampling_weight")
head(sdc@manipNumVars)
## Display risk-measures
sdc@risk$global
sdc <- dRisk(sdc)
sdc@risk$numeric
## Generation of synthetic data
synthdat <- dataGen(sdc)
## use addNoise with default parameters (not suggested)
sdc <- addNoise(sdc, variables = c("expend", "income"))
head(sdc@manipNumVars)
sdc@risk$numeric
## undo last step (remove adding noise)
sdc <- undolast(sdc)
head(sdc@manipNumVars)
sdc@risk$numeric
## apply addNoise() with custom parameters
sdc <- addNoise(sdc, noise = 0.2)
head(sdc@manipNumVars)
sdc@risk$numeric
## LocalSuppression
sdc <- undolast(sdc)
head(sdc@risk$individual)
sdc@risk$global
sdc <- localSuppression(sdc)
head(sdc@risk$individual)
sdc@risk$global
## microaggregation
sdc <- undolast(sdc)
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
sdc <- microaggregation(sdc)
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
## Post-Randomization
sdc <- undolast(sdc)
head(sdc@risk$individual)
sdc@risk$global
sdc <- pram(sdc, variables = "water")
head(sdc@risk$individual)
sdc@risk$global
## rankSwap
sdc <- undolast(sdc)
head(sdc@risk$individual)
sdc@risk$global
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
sdc <- rankSwap(sdc)
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
head(sdc@risk$individual)
sdc@risk$global
## topBotCoding
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
sdc@risk$numeric
sdc <- topBotCoding(obj = sdc, value = 60000000, replacement = 62000000, column = "income")
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
sdc@risk$numeric
## LocalRecProg
data(testdata2)
keyVars <- c("urbrur", "roof", "walls", "water", "sex")
w <- "sampling_weight"
sdc <- createSdcObj(testdata2, keyVars = keyVars, weightVar = w)
sdc@risk$global
sdc <- LocalRecProg(sdc)
sdc@risk$global
## Model-based risks using a formula
form <- as.formula(paste("~", paste(keyVars, collapse = "+")))
sdc <- modRisk(sdc, method = "default", formulaM = form)
get.sdcMicroObj(sdc, "risk")$model
sdc <- modRisk(sdc, method = "CE", formulaM = form)
get.sdcMicroObj(sdc, "risk")$model
sdc <- modRisk(sdc, method = "PML", formulaM = form)
get.sdcMicroObj(sdc, "risk")$model
sdc <- modRisk(sdc, method = "weightedLLM", formulaM = form)
get.sdcMicroObj(sdc, "risk")$model
sdc <- modRisk(sdc, method = "IPF", formulaM = form)
get.sdcMicroObj(sdc, "risk")$model
Census data set
Description
This test data set was obtained on July 27, 2000 using the public use Data Extraction System of the U.S. Bureau of the Census.
Format
A data frame sampled from year 1995 with 1080 observations on the following 13 variables.
- AFNLWGT
Final weight (2 implied decimal places)
- AGI
Adjusted gross income
- EMCONTRB
Employer contribution for hlth insurance
- FEDTAX
Federal income tax liability
- PTOTVAL
Total person income
- STATETAX
State income tax liability
- TAXINC
Taxable income amount
- POTHVAL
Total other persons income
- INTVAL
Amt of interest income
- PEARNVAL
Total person earnings
- FICA
Soc. sec. retirement payroll deduction
- WSALVAL
Amount: Total Wage and salary
- ERNVAL
Business or Farm net earnings
Source
Public use file from the CASC project. More information on this test data can be found in the paper listed below.
References
Brand, R., Domingo-Ferrer, J. and Mateo-Sanz, J.M., Reference data sets to test and compare SDC methods for protection of numerical microdata. Unpublished. https://research.cbs.nl/casc/CASCrefmicrodata.pdf
Examples
data(CASCrefmicrodata)
str(CASCrefmicrodata)
EIA data set
Description
Data set obtained from the U.S. Energy Information Authority.
Format
A data frame with 4092 observations on the following 15 variables.
- UTILITYID
UNIQUE UTILITY IDENTIFICATION NUMBER
- UTILNAME
UTILITY NAME. A factor whose levels are the individual utility names (e.g. 4-County Electric Power Assn, Alabama Power Co, Alaska Electric, ..., Wisconsin Public Service Corp, Wright-Hennepin Coop Elec Assn, Yellowstone Vlly Elec Coop Inc).
- STATE
STATE FOR WHICH THE UTILITY IS REPORTING. A factor with levels
AK, AL, AR, AZ, CA, CO, CT, DC, DE, FL, GA, HI, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MS, MT, NC, ND, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VA, VT, WA, WI, WV, WY
- YEAR
REPORTING YEAR FOR THE DATA
- MONTH
REPORTING MONTH FOR THE DATA
- RESREVENUE
REVENUE FROM SALES TO RESIDENTIAL CONSUMERS
- RESSALES
SALES TO RESIDENTIAL CONSUMERS
- COMREVENUE
REVENUE FROM SALES TO COMMERCIAL CONSUMERS
- COMSALES
SALES TO COMMERCIAL CONSUMERS
- INDREVENUE
REVENUE FROM SALES TO INDUSTRIAL CONSUMERS
- INDSALES
SALES TO INDUSTRIAL CONSUMERS
- OTHREVENUE
REVENUE FROM SALES TO OTHER CONSUMERS
- OTHRSALES
SALES TO OTHER CONSUMERS
- TOTREVENUE
REVENUE FROM SALES TO ALL CONSUMERS
- TOTSALES
SALES TO ALL CONSUMERS
Source
Public use file from the CASC project.
References
Brand, R., Domingo-Ferrer, J. and Mateo-Sanz, J.M., Reference data sets to test and compare SDC methods for protection of numerical microdata. Unpublished. https://research.cbs.nl/casc/CASCrefmicrodata.pdf
Examples
data(EIA)
head(EIA)
Additional Information-Loss measures
Description
Measures IL_correl() and IL_variables() were proposed by Andrzej Mlodak and are (theoretically) bounded between 0 and 1.
Usage
IL_correl(x, xm)
## S3 method for class 'il_correl'
print(x, digits = 3, ...)
IL_variables(x, xm)
## S3 method for class 'il_variables'
print(x, digits = 3, ...)
Arguments
x | an object coercible to a data.frame |
xm | an object coercible to a data.frame |
digits | number of digits used for rounding when displaying results |
... | additional parameter for print-methods; currently ignored |
Details
IL_correl() is an information-loss measure that can be applied to common numerically scaled variables in x and xm. It is based on the diagonal entries of the inverse correlation matrices in the original and perturbed data. IL_variables(): for common variables in x and xm, the individual distance functions depend on the class of the variable; specifically, these functions differ for numeric variables, ordered factors and character/factor variables. The individual distances are summed up and scaled by n * m, with n being the number of records and m being the number of (common) variables.
Details can be found in the references below.
The implementation of IL_correl() differs slightly from the original proposition in Mlodak, A. (2020), as the constant multiplier was changed to 1 / sqrt(2) instead of 1/2 for better efficiency and interpretability of the measure.
Value
the corresponding information-loss measure
Author(s)
Bernhard Meindl <bernhard.meindl@statistik.gv.at>
References
Mlodak, A. (2020). Information loss resulting from statistical disclosure control of output data,Wiadomosci Statystyczne. The Polish Statistician, 2020, 65(9), 7-27, DOI: 10.5604/01.3001.0014.4121
Mlodak, A. (2019). Using the Complex Measure in an Assessment of the Information Loss Due to the Microdata Disclosure Control, Przegląd Statystyczny, 2019, 66(1), 7-26, DOI: 10.5604/01.3001.0013.8285
Examples
data("Tarragona", package = "sdcMicro")res1 <- addNoise(obj = Tarragona, variables = colnames(Tarragona), noise = 100)IL_correl(x = as.data.frame(res1$x), xm = as.data.frame(res1$xm))res2 <- addNoise(obj = Tarragona, variables = colnames(Tarragona), noise = 25)IL_correl(x = as.data.frame(res2$x), xm = as.data.frame(res2$xm))# creating test-inputsn <- 150x <- xm <- data.frame( v1 = factor(sample(letters[1:5], n, replace = TRUE), levels = letters[1:5]), v2 = rnorm(n), v3 = runif(3), v4 = ordered(sample(LETTERS[1:3], n, replace = TRUE), levels = c("A", "B", "C")))xm$v1[1:5] <- "a"xm$v2 <- rnorm(n, mean = 5)xm$v4[1:5] <- "A"IL_variables(x, xm)Local recoding via Edmond's maximum weighted matching algorithm
Description
To be used on both categorical and numeric input variables, although usage on categorical variables is the focus of the development of this software.
Usage
LocalRecProg(
  obj,
  ancestors = NULL,
  ancestor_setting = NULL,
  k_level = 2,
  FindLowestK = TRUE,
  weight = NULL,
  lowMemory = FALSE,
  missingValue = NA,
  ...
)
Arguments
obj | a data.frame or an object of class sdcMicroObj-class |
ancestors | Names of ancestors of the categorical variables |
ancestor_setting | For each ancestor the corresponding categorical variable |
k_level | Level for k-anonymity |
FindLowestK | requests the program to look for the smallest k that results in complete matches of the data. |
weight | A weight for each variable (Default=1) |
lowMemory | Slower algorithm with less memory consumption |
missingValue | The output value for a suppressed value. |
... | see arguments below
|
Details
Each record in the data represents a category of the original data, and hence all records in the input data should be unique by the N input variables. To achieve bigger category sizes (k-anonymity), one can form new categories based on the recoding result and repeatedly apply this algorithm.
Value
a data.frame with the original variables and the suppressed variables (suffix _lr), or the modified sdcMicroObj-class object
Methods
- list("signature(obj=\"sdcMicroObj\")")
Author(s)
Alexander Kowarik, Bernd Prantner, IHSN C++ source, Akimichi Takemura
References
Kowarik, A., Templ, M., Meindl, B., Fonteneau, F. and Prantner, B.: Testing of IHSN C++ Code and Inclusion of New Methods into sdcMicro, in: Lecture Notes in Computer Science, J. Domingo-Ferrer, I. Tinnirello (editors); Springer, Berlin, 2012, ISBN: 978-3-642-33626-3, pp. 63-77. doi:10.1007/978-3-642-33627-0_6
Examples
data(testdata2)
cat_vars <- c("urbrur", "roof", "walls", "water", "sex", "relat")
anc_var <- c("water2", "water3", "relat2")
anc_setting <- c("water", "water", "relat")
r1 <- LocalRecProg(obj = testdata2, categorical = cat_vars, missingValue = -99)
r2 <- LocalRecProg(obj = testdata2, categorical = cat_vars, ancestor = anc_var,
  ancestor_setting = anc_setting, missingValue = -99)
r3 <- LocalRecProg(obj = testdata2, categorical = cat_vars, ancestor = anc_var,
  ancestor_setting = anc_setting, missingValue = -99, FindLowestK = FALSE)
# for objects of class sdcMicro:
sdc <- createSdcObj(dat = testdata2,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"), w = "sampling_weight")
sdc <- LocalRecProg(sdc)
Tarragona data set
Description
A real data set comprising figures of 834 companies in the Tarragona area. Data correspond to year 1995.
Format
A data frame with 834 observations on the following 13 variables.
- FIXED.ASSETS
a numeric vector
- CURRENT.ASSETS
a numeric vector
- TREASURY
a numeric vector
- UNCOMMITTED.FUNDS
a numeric vector
- PAID.UP.CAPITAL
a numeric vector
- SHORT.TERM.DEBT
a numeric vector
- SALES
a numeric vector
- LABOR.COSTS
a numeric vector
- DEPRECIATION
a numeric vector
- OPERATING.PROFIT
a numeric vector
- FINANCIAL.OUTCOME
a numeric vector
- GROSS.PROFIT
a numeric vector
- NET.PROFIT
a numeric vector
Source
Public use data from the CASC project.
References
Brand, R., Domingo-Ferrer, J. and Mateo-Sanz, J.M., Reference data sets to test and compare SDC methods for protection of numerical microdata. Unpublished. https://research.cbs.nl/casc/CASCrefmicrodata.pdf
Examples
data(Tarragona)
head(Tarragona)
dim(Tarragona)
addGhostVars
Description
Specify variables that are linked to a key variable. This results in all suppressions of the key variable also being applied to the corresponding 'ghost' variables.
Usage
addGhostVars(obj, keyVar, ghostVars)
Arguments
obj | an object of class sdcMicroObj-class |
keyVar | character vector of length 1 referring to a categorical key variable within obj |
ghostVars | a character vector specifying variables that are linked to keyVar |
Value
a modified sdcMicroObj-class object.
Author(s)
Bernhard Meindl
References
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Examples
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
## we want to link the anonymization status of key variable 'urbrur' to 'hhcivil'
sdc <- addGhostVars(sdc, keyVar = "urbrur", ghostVars = c("hhcivil"))
## we want to link the anonymization status of key variable 'roof' to 'represent'
sdc <- addGhostVars(sdc, keyVar = "roof", ghostVars = c("represent"))
Adding noise to perturb data
Description
Various methods for adding noise to perturb continuous scaled variables.
Usage
addNoise(obj, variables = NULL, noise = 150, method = "additive", ...)
Arguments
obj | either a data.frame, a matrix or an object of class sdcMicroObj-class |
variables | vector with names of variables that should be perturbed |
noise | amount of noise (in percentages) |
method | choose between ‘additive’, ‘correlated’,‘correlated2’, ‘restr’, ‘ROMM’, ‘outdect’ |
... | see possible arguments below |
Details
If ‘obj’ is of class sdcMicroObj-class, all continuous key variables are selected per default. If ‘obj’ is of class “data.frame” or “matrix”, the continuous variables have to be specified.
Method ‘additive’ adds noise completely at random to each variable depending on its size and standard deviation. Methods ‘correlated’ and ‘correlated2’ add noise and preserve the covariances as described in R. Brand (2001) or in the reference given below. Method ‘restr’ takes the sample size into account when adding noise. Method ‘ROMM’ is an implementation of the algorithm ROMM (Random Orthogonalized Matrix Masking) (Fienberg, 2004). Method ‘outdect’ adds noise only to outliers. The outliers are identified with univariate and robust multivariate procedures based on robust Mahalanobis distances calculated by the MCD estimator.
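As a rough illustration of the ‘additive’ idea (a minimal sketch only, not the exact formula used internally by addNoise()), noise proportional to each variable's standard deviation could be added like this:

## sketch: add zero-mean normal noise scaled by 'noise' percent of each column's sd
add_additive_noise <- function(x, noise = 150) {
  as.data.frame(lapply(x, function(v) v + rnorm(length(v), mean = 0, sd = noise / 100 * sd(v))))
}
data(Tarragona)
xm <- add_additive_noise(as.data.frame(Tarragona[, 1:3]), noise = 150)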
Value
If ‘obj’ was of class sdcMicroObj-class, the corresponding slots are filled, like manipNumVars, risk and utility.
If ‘obj’ was of class “data.frame” or “matrix”, an object of class “micro” with the following entities is returned:
x | the original data |
xm | the modified (perturbed) data |
method | method used for perturbation |
noise | amount of noise |
Author(s)
Matthias Templ and Bernhard Meindl
References
Domingo-Ferrer, J., Sebe, F. and Castella, J., “On the security of noise addition for privacy in statistical databases”, Lecture Notes in Computer Science, vol. 3050, pp. 149-161, 2004. ISSN 0302-9743. Vol. Privacy in Statistical Databases, eds. J. Domingo-Ferrer and V. Torra, Berlin: Springer-Verlag.
Ting, D., Fienberg, S.E. and Trottini, M. “ROMM Methodology for Microdata Release”, Joint UNECE/Eurostat work session on statistical data confidentiality, Geneva, Switzerland, 2005, https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2005/wp.11.e.pdf
Ting, D., Fienberg, S.E., Trottini, M. “Random orthogonal matrix masking methodology for microdata release”, International Journal of Information and Computer Security, vol. 2, pp. 86-105, 2008.
Templ, M. and Meindl, B., Robustification of Microdata Masking Methods and the Comparison with Existing Methods, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 5262, pp. 177-189, 2008.
Templ, M. New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.
Templ, M., Meindl, B. and Kowarik, A.: Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro, Journal of Statistical Software, 67(4), 1-36, 2015. doi:10.18637/jss.v067.i04
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
See Also
sdcMicroObj-class, summary.micro
Examples
data(Tarragona)
a1 <- addNoise(Tarragona)
a1
data(testdata)
# donttest because Examples with CPU time > 2.5 times elapsed time
testdata[, c('expend','income','savings')] <-
  addNoise(testdata[, c('expend','income','savings')])$xm
## for objects of class sdcMicroObj:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- addNoise(sdc)
argus_microaggregation
Description
Calls microaggregation code from mu-Argus. In case only one variable should be microaggregated and useOptimal is TRUE, the Hansen-Mukherjee polynomial exact method is applied. In any other case, the Mateo-Domingo method is used.
Usage
argus_microaggregation(df, k, useOptimal = FALSE)
Arguments
df | a data.frame |
k | required group size |
useOptimal | (logical) should optimal microaggregation be applied (only possible in the case of a single variable) |
Value
a list with two elements
original: the originally provided input data
microaggregated: the microaggregated data.frame
See Also
mu-Argus manual at https://github.com/sdcTools/manuals/raw/master/mu-argus/MUmanual5.1.pdf
Examples
mat <- matrix(sample(1:100, 50, replace = TRUE), nrow = 10, ncol = 5)
df <- as.data.frame(mat)
res <- argus_microaggregation(df, k = 5, useOptimal = FALSE)
argus_rankswap
Description
argus_rankswap
Usage
argus_rankswap(df, perc)
Arguments
df | a data.frame |
perc | a number defining the swapping percentage |
Value
a list with two elements
original: the originally provided input data
swapped: the data.frame containing the swapped values
See Also
mu-Argus manual at https://github.com/sdcTools/manuals/raw/master/mu-argus/MUmanual5.1.pdf
Examples
mat <- matrix(sample(1:100, 50, replace = TRUE), nrow = 10, ncol = 5)
df <- as.data.frame(mat)
res <- argus_rankswap(df, perc = 10)
Recompute Risk and Frequencies for a sdcMicroObj
Description
Recomputation of risk measures should be done after manually changing the content of an object of class sdcMicroObj.
Usage
calcRisks(obj, ...)
Arguments
obj | a sdcMicroObj object |
... | no arguments at the moment |
Details
By applying this function, the disclosure risk is re-estimated and the corresponding slots of an object of class sdcMicroObj are updated. This function is mostly used internally to automatically update the risk after an SDC method is applied.
Value
a sdcMicroObj object with updated risk values
See Also
Examples
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- calcRisks(sdc)
Small Artificial Data set
Description
A small toy example data set which was used by Sanz-Mateo et al.
Format
The format is: int [1:13, 1:7] 10 12 17 21 9 12 12 14 13 15 ... -attr(*, "dimnames")=List of 2 ..$ : chr [1:13] "1" "2" "3" "4" ... ..$ :chr [1:7] "1" "2" "3" "4" ...
Examples
data(casc1)
casc1
Dummy Dataset for Record Swapping
Description
createDat() returns dummy data to illustrate targeted record swapping. The generated data contain household ids ('hid'), geographic variables ('nuts1', 'nuts2', 'nuts3', 'lau2') as well as some other household or personal variables.
Usage
createDat(N = 10000)
Arguments
N | integer, number of households to generate |
Value
'data.table' containing dummy data
See Also
recordSwap
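A minimal usage sketch (the generated column set is the one described above):

d <- createDat(N = 1000)   # 1000 dummy households
head(d)
## geographic variables such as 'nuts1', 'nuts2', 'nuts3', 'lau2' and the
## household id 'hid' can then be used as inputs for recordSwap()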
Creates new randomized IDs
Description
This is useful if the record IDs consist, for example, of a geo identifier and the household line number. This method can be used to create new, random IDs that cannot be reconstructed.
Usage
createNewID(obj, newID, withinVar)
Arguments
obj | an sdcMicroObj-class object |
newID | a character specifying the desired variable name of the new ID |
withinVar | if not |
Value
an sdcMicroObj-class object with updated slot origData
Overall disclosure risk
Description
Distance-based disclosure risk estimation via standard deviation-based intervals around observations.
Usage
dRisk(obj, ...)
Arguments
obj | a data.frame, matrix or an object of class sdcMicroObj-class |
... | possible arguments are:
|
Details
An interval (based on the standard deviation) is built around each value of the perturbed variable. Then it is checked whether the original values lie within these intervals. With parameter k one can enlarge or downscale the intervals.
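Conceptually (a simplified sketch, not the exact rule implemented in dRisk()), the interval check for a single numeric variable could look like this:

## sketch: flag original values that fall inside an sd-based interval
## around their perturbed counterparts
data(Tarragona)
x  <- Tarragona[, "FIXED.ASSETS"]          # original values
xm <- x + rnorm(length(x), sd = sd(x))     # some perturbed version
k  <- 0.05                                 # scaling factor for the interval width
inside <- abs(x - xm) <= k * sd(x)         # TRUE = original value lies in the interval
mean(inside)                               # share of potentially disclosive records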
Value
The disclosure risk and/or the modified sdcMicroObj-class object
Author(s)
Matthias Templ
References
see method SDID in Mateo-Sanz, Sebe, Domingo-Ferrer. Outlier Protection in Continuous Microdata Masking. International Workshop on Privacy in Statistical Databases. PSD 2004: Privacy in Statistical Databases, pp. 201-215.
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
See Also
Examples
data(free1)
free1 <- as.data.frame(free1)
m1 <- microaggregation(free1[, 31:34], method = "onedims", aggr = 3)
m2 <- microaggregation(free1[, 31:34], method = "pca", aggr = 3)
dRisk(obj = free1[, 31:34], xm = m1$mx)
dRisk(obj = free1[, 31:34], xm = m2$mx)
dUtility(obj = free1[, 31:34], xm = m1$mx)
dUtility(obj = free1[, 31:34], xm = m2$mx)
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
## this is already made internally: sdc <- dRisk(sdc)
## and already stored in sdc
RMD based disclosure risk
Description
Distance-based disclosure risk estimation via robust Mahalanobis Distances.
Usage
dRiskRMD(obj, ...)
Arguments
obj | a data.frame, matrix or an object of class sdcMicroObj-class |
... | see possible arguments below
|
Details
This method is an extension of method SDID because it accounts for the “outlyingness” of each observation. This is a quite natural approach, since outliers have a higher risk of re-identification and therefore should have larger disclosure risk intervals than observations in the center of the data cloud.
The algorithm works as follows (a conceptual sketch of the first two steps is given after this list):
1. Robust Mahalanobis distances are estimated in order to get a robust multivariate distance for each observation.
2. Intervals are estimated around every data point of the original data, where the length of the interval is defined/weighted by the squared robust Mahalanobis distance and the parameter $k$. The higher the RMD of an observation, the larger the interval.
3. Check whether the corresponding masked values fall into the intervals around the original values or not. If the value of the corresponding observation is within such an interval, the whole observation is considered unsafe. We thus obtain a vector indicating which observations are safe or not, and we are already finished when using method RMDID1.
4. For method RMDID1w: we return the (via RMD) weighted vector of disclosure risk.
5. For method RMDID2: whenever an observation is considered unsafe, it is checked whether $m$ other observations from the masked data are very close (defined by a parameter $k2$ for the length of the intervals, as for SDID or RSDID) to such an unsafe observation, using Euclidean distances. If more than $m$ points are in such a small interval, we conclude that this observation is “safe”.
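A conceptual sketch of steps 1 and 2, using robust Mahalanobis distances via the MCD estimator from the robustbase package (which is among the package imports); the exact interval weighting used by dRiskRMD() may differ:

library(robustbase)
data(Tarragona)
x <- as.matrix(Tarragona[, 5:7])
mcd <- covMcd(x)                                   # robust location and scatter (MCD)
rmd <- mahalanobis(x, center = mcd$center, cov = mcd$cov)
## larger robust distances -> larger (risk) intervals around the original values
k <- 0.01
interval_width <- k * outer(rmd, apply(x, 2, sd))  # n x p matrix of interval half-widths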
Value
The disclosure risk or the modified sdcMicroObj-class object
risk1 | percentage of sensitive observations according to method RMDID1. |
risk2 | standardized version of risk1 |
wrisk1 | amount of sensitive observations according to RMDID1, weighted by their corresponding robust Mahalanobis distances. |
wrisk2 | RMDID2 measure |
indexRisk1 | index of observations with high risk according to risk1 measure |
indexRisk2 | index of observations with high risk according to wrisk2 measure |
Author(s)
Matthias Templ
References
Templ, M. and Meindl, B., Robust Statistics Meets SDC: New Disclosure Risk Measures for Continuous Microdata Masking, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 5262, pp. 113-126, 2008.
Templ, M. New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.
See Also
Examples
data(Tarragona)
x <- Tarragona[, 5:7]
y <- addNoise(x)$xm
dRiskRMD(x, xm = y)
dRisk(x, xm = y)
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- dRiskRMD(sdc)
Data-Utility measures
Description
dUtility() allows different measures of data utility to be computed, based on various distances between original and perturbed variables.
Usage
dUtility(obj, ...)
Arguments
obj | original data or an object of class sdcMicroObj |
... | see arguments below
|
Details
The standardised distances of the perturbed data values to the original ones are measured. The following measures are available:
- "IL1": sum of absolute distances between original and perturbed variables, scaled by the absolute values of the original variables
- "IL1s": measures the absolute distances between original and perturbed values, scaled by the standard deviation of the original variables times the square root of 2
- "eigen": compares the eigenvalues of original and perturbed data
- "robeigen": compares robust eigenvalues of original and perturbed data
Value
the data utility measure, or the sdcMicroObj with a modified data-utility entry.
Author(s)
Matthias Templ
References
for IL1 and IL1s: see Mateo-Sanz, Sebe, Domingo-Ferrer. Outlier Protection in Continuous Microdata Masking. International Workshop on Privacy in Statistical Databases. PSD 2004: Privacy in Statistical Databases, pp. 201-215.
Templ, M. and Meindl, B., Robust Statistics Meets SDC: New Disclosure Risk Measures for Continuous Microdata Masking, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 5262, pp. 113-126, 2008.
See Also
Examples
data(free1)
free1 <- as.data.frame(free1)
m1 <- microaggregation(free1[, 31:34], method = "onedims", aggr = 3)
m2 <- microaggregation(free1[, 31:34], method = "pca", aggr = 3)
dRisk(obj = free1[, 31:34], xm = m1$mx)
dRisk(obj = free1[, 31:34], xm = m2$mx)
dUtility(obj = free1[, 31:34], xm = m1$mx)
dUtility(obj = free1[, 31:34], xm = m2$mx)
data(Tarragona)
x <- Tarragona[, 5:7]
y <- addNoise(x)$xm
dRiskRMD(x, xm = y)
dRisk(x, xm = y)
dUtility(x, xm = y, method = "IL1")
dUtility(x, xm = y, method = "IL1s")
dUtility(x, xm = y, method = "eigen")
dUtility(x, xm = y, method = "robeigen")
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
## this is already made internally, so you don't need to run this:
sdc <- dUtility(sdc)
Fast generation of synthetic data
Description
Fast generation of (primitive) synthetic multivariate normal data.
Usage
dataGen(obj, ...)
Arguments
obj | a data.frame, matrix or an object of class sdcMicroObj-class |
... | see possible arguments below
|
Details
Uses the Cholesky decomposition to generate synthetic data with approximately the same means and covariances. For details, see the reference.
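A minimal sketch of the underlying idea (the Cholesky factor of the sample covariance applied to independent standard normal draws; the exact implementation in dataGen() may differ in details such as mean handling):

data(mtcars)
x  <- as.matrix(mtcars[, 4:6])
n  <- 200
R  <- chol(cov(x))                                  # upper triangular, t(R) %*% R = cov(x)
z  <- matrix(rnorm(n * ncol(x)), ncol = ncol(x))    # independent N(0, 1) draws
synth <- z %*% R + matrix(colMeans(x), n, ncol(x), byrow = TRUE)
cov(synth)                                          # approximately cov(x)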
Value
the generated synthetic data.
Note
With this method, only multivariate normally distributed data with approximately the same covariance as the original data can be generated; it does not reflect the distribution of real, complex data, which in general do not follow a multivariate normal distribution.
Author(s)
Matthias Templ
References
Mateo-Sanz, Martinez-Balleste, Domingo-Ferrer. Fast Generation of Accurate Synthetic Microdata. International Workshop on Privacy in Statistical Databases PSD 2004: Privacy in Statistical Databases, pp 298-306.
See Also
Examples
data(mtcars)
cov(mtcars[, 4:6])
cov(dataGen(mtcars[, 4:6]))
pairs(mtcars[, 4:6])
pairs(dataGen(mtcars[, 4:6]))
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- dataGen(sdc)
Distribute number of swaps
Description
Distribute the number of swaps across the lowest hierarchy level according to a predefined swaprate. The swaprate is applied such that a single swap counts as swapping 2 households. Numbers of swaps are randomly rounded up or down, if needed, such that the total number of swaps is consistent with the swaprate.
NOTE: This is an internal function used for testing the C++ function distributeDraws, which is used inside the C++ function recordSwap().
Usage
distributeDraws_cpp(data, hierarchy, hid, swaprate, seed = 123456L)
Arguments
data | micro data containing the hierarchy levels and household ID |
hierarchy | column indices of variables in data which refer to the geographic hierarchy |
hid | column index in data which refers to the household identifier |
swaprate | double between 0 and 1 defining the proportion of households which should be swapped, see details for more explanations |
seed | integer setting the sampling seed |
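A small illustration of the swaprate logic described above (plain arithmetic, not a call to the internal C++ routine): with a swaprate of 0.05 and 10,000 households, roughly 0.05 * 10000 / 2 = 250 swaps are targeted, since each swap involves two households.

n_hh <- 10000
swaprate <- 0.05
round(swaprate * n_hh / 2)   # approximate total number of swaps to distribute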
Distribute
Description
Distribute 'totalDraws' using the ratio/probability vector 'inputRatio' and randomly round each entry up or down such that the distribution results in an integer vector. Returns an integer vector containing the number of units in 'totalDraws' distributed according to the proportions in 'inputRatio'.
NOTE: This is an internal function used for testing the C++ function distributeRandom, which is used inside the C++ function recordSwap().
Usage
distributeRandom_cpp(inputRatio, totalDraws, seed)
Arguments
inputRatio | vector containing ratios which are used to distribute the number of units in 'totalDraws'. |
totalDraws | number of units to distribute |
seed | integer setting the sampling seed |
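A simplified sketch of such a randomized rounding (a hypothetical helper, not the C++ implementation): the total is first distributed proportionally, then the remaining units are assigned by sampling with probabilities given by the fractional parts, so that the rounded vector always sums to 'totalDraws'.

distribute_random_sketch <- function(inputRatio, totalDraws) {
  raw  <- inputRatio / sum(inputRatio) * totalDraws  # exact (non-integer) allocation
  base <- floor(raw)
  rest <- totalDraws - sum(base)                     # units still to assign
  if (rest > 0) {
    idx <- sample(seq_along(raw), rest, prob = raw - base)
    base[idx] <- base[idx] + 1
  }
  base
}
distribute_random_sketch(c(0.25, 0.35, 0.40), 10)    # sums to 10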
Remove certain variables from the data set inside a sdc object.
Description
Extract the manipulated data from an object of class sdcMicroObj-class
Usage
extractManipData(
  obj,
  ignoreKeyVars = FALSE,
  ignorePramVars = FALSE,
  ignoreNumVars = FALSE,
  ignoreGhostVars = FALSE,
  ignoreStrataVar = FALSE,
  randomizeRecords = "no"
)
Arguments
obj | object of class sdcMicroObj-class |
ignoreKeyVars | if manipulated key variables should be returned or the unchanged original variables |
ignorePramVars | if manipulated PRAM variables should be returned or the unchanged original variables |
ignoreNumVars | if manipulated numeric variables should be returned or the unchanged original variables |
ignoreGhostVars | if manipulated ghost (linked) variables should be returned or the unchanged original variables |
ignoreStrataVar | if manipulated strata variables should be returned or the unchanged original variables |
randomizeRecords | (logical) specifies whether the output records should be randomized. The following options are possible:
|
Value
a data.frame containing the anonymized data set
Author(s)
Alexander Kowarik, Bernhard Meindl
Examples
## for objects of class sdcMicro:
data(testdata)
sdc <- createSdcObj(testdata, keyVars = c('urbrur','roof'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- removeDirectID(sdc, var = "age")
dataM <- extractManipData(sdc)
data from the casc project
Description
Small synthetic data from Capobianchi, Polettini, Lucarelli
Format
A data frame with 8 observations on the following 8 variables.
- Num1
a numeric vector
- Key1
Key variable 1. A numeric vector
- Num2
a numeric vector
- Key2
Key variable 2. A numeric vector
- Key3
Key variable 3. A numeric vector
- Key4
Key variable 4. A numeric vector
- Num3
a numeric vector
- w
The weight vector. A numeric vector
Details
This data set is very similar to the one used by the authors of the paper given below. It is included only for demonstration purposes, i.e. to show that the package provides the same results as their software.
Source
https://research.cbs.nl/casc/deliv/12d1.pdf
Examples
data(francdat)
francdat
Demo data set from mu-Argus
Description
The public use toy demo data set from the mu-Argus software for SDC.
Format
The format is: num [1:4000, 1:34] 36 36 36 36 36 36 36 36 36 36 ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:34] "REGION" "SEX" "AGE" "MARSTAT" ...
Details
Please see the link given below. Note that the correlation structure of the data is not very realistic, especially concerning the continuous scaled variables, which are drawn independently from a multivariate uniform distribution.
Source
Public use file from the CASC project.
Examples
data(free1)
head(free1)
Freq
Description
Extract sample frequency counts (fk) or estimated population frequency counts (Fk)
Usage
freq(obj, type = "fk")Arguments
obj | an |
type | either |
Value
a vector containing sample frequencies or weighted frequencies
Author(s)
Bernhard Meindl
Examples
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c('urbrur','roof','walls','relat','sex'),
  pramVars = c('water','electcon'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
head(freq(sdc, type = "fk"))
head(freq(sdc, type = "Fk"))
Frequencies calculation for risk estimation
Description
Computation and estimation of the sample and population frequency counts.
Usage
freqCalc(x, keyVars, w = NULL, alpha = 1)
Arguments
x | data frame or matrix |
keyVars | key variables |
w | column index of the weight variable. Should be set to NULL if one deals with a population. |
alpha | numeric value between 0 and 1 specifying how much keys that contain missing values (NAs) should contribute to the calculation of fk and Fk |
Details
The function considers the case of missing values in the data. A missing value stands for any of the possible categories of the variable considered. It is possible to apply this function to large data sets with many (categorical) key variables, since the computation is done in C.
freqCalc() does not support sdcMicro S4 class objects.
Value
An object of class freqCalc.
freqCalc | data set |
keyVars | variables used for frequency calculation |
w | index of weight vector. NULL if you do not have a sample. |
alpha | value of parameter alpha |
fk | the frequency of equal observations in the key-variables subset of the sample, given for each observation |
Fk | estimated frequency in the population |
n1 | number of observations with fk=1 |
n2 | number of observations with fk=2 |
Author(s)
Bernhard Meindl
References
look e.g. in https://research.cbs.nl/casc/deliv/12d1.pdf
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. https://www.tdp.cat/issues/abs.a004a08.php
Templ, M. New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Templ, M. and Meindl, B.: Practical Applications in Statistical Disclosure Control Using R, Privacy and Anonymity in Information Management Systems: New Techniques for New Practical Problems, Springer, 31-62, 2010, ISBN: 978-1-84996-237-7.
See Also
Examples
data(francdat)
f <- freqCalc(francdat, keyVars = c(2,4,5,6), w = 8)
f
f$freqCalc
f$fk
f$Fk
## with missings:
x <- francdat
x[3,5] <- NA
x[4,2] <- x[4,4] <- NA
x[5,6] <- NA
x[6,2] <- NA
f2 <- freqCalc(x, keyVars = c(2,4,5,6), w = 8)
cbind(f2$fk, f2$Fk)
## test parameter 'alpha'
f3a <- freqCalc(x, keyVars = c(2,4,5,6), w = 8, alpha = 1)
f3b <- freqCalc(x, keyVars = c(2,4,5,6), w = 8, alpha = 0.5)
f3c <- freqCalc(x, keyVars = c(2,4,5,6), w = 8, alpha = 0.1)
data.frame(fka = f3a$fk, fkb = f3b$fk, fkc = f3c$fk)
data.frame(Fka = f3a$Fk, Fkb = f3b$Fk, Fkc = f3c$Fk)
Generate one strata variable from multiple factors
Description
For strata defined by multiple variables (e.g. sex, age, country), one combined variable is generated.
Usage
generateStrata(df, stratavars, name)
Arguments
df | a data.frame |
stratavars | character vector with variable names |
name | name of the newly generated variable |
Value
The original data set with one new column.
Author(s)
Alexander Kowarik
Examples
x <- testdata
x <- generateStrata(x, c("sex", "urbrur"), "strataIDvar")
head(x)
get.sdcMicroObj
Description
Extract information from sdcMicroObj-class objects depending on argument type
Usage
get.sdcMicroObj(object, type)
Arguments
object | a sdcMicroObj-class object |
type | a character vector of length 1 defining what to calculate/return/modify. Allowed types are all slotNames of sdcMicroObj-class objects. |
Value
a slot of a sdcMicroObj-class object depending on argument type
Examples
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sl <- slotNames(sdc)
res <- sapply(sl, function(x) get.sdcMicroObj(sdc, type = x))
str(res)
Global Recoding
Description
Global recoding of variables
Usage
globalRecode(obj, ...)
Arguments
obj | a numeric vector, a data.frame or an object of class sdcMicroObj-class |
... | see possible arguments below
|
Details
If a labels parameter is specified, its values are used to name the factor levels. If none is specified, the factor level labels are constructed.
Value
the modified sdcMicroObj-class object or a factor, unless labels = FALSE, which results in the mere integer level codes.
Note
globalRecode() cannot be applied to vectors stored as factors from sdcMicro >= 4.7.0!
Author(s)
Matthias Templ and Bernhard Meindl
References
Templ, M., Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67(4), 1-36, 2015. doi:10.18637/jss.v067.i04
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
See Also
Examples
data(free1)
free1 <- as.data.frame(free1)
## application to a vector
head(globalRecode(free1$AGE, breaks = c(1,9,19,29,39,49,59,69,100), labels = 1:8))
table(globalRecode(free1$AGE, breaks = c(1,9,19,29,39,49,59,69,100), labels = 1:8))
## application to a data.frame
# automatic labels
table(globalRecode(free1, column = "AGE", breaks = c(1,9,19,29,39,49,59,69,100))$AGE)
## calculation of break-points using different algorithms
table(globalRecode(free1$AGE, breaks = 6))
table(globalRecode(free1$AGE, breaks = 6, method = "logEqui"))
table(globalRecode(free1$AGE, breaks = 6, method = "equalAmount"))
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- globalRecode(sdc, column = "water", breaks = 3)
table(get.sdcMicroObj(sdc, type = "manipKeyVars")$water)
Join levels of a variable in an object of class sdcMicroObj-class or factor or data.frame
Description
If the input is an object of class sdcMicroObj-class, the specified factor variable is recoded into a factor with fewer levels and risk measures are automatically recomputed.
Usage
groupAndRename(obj, var, before, after, addNA = FALSE)
Arguments
obj | object of class sdcMicroObj-class, factor or data.frame |
var | name of the keyVariable to change |
before | vector of levels before recoding |
after | name of new level after recoding |
addNA | logical, if TRUE missing values in the input variables are added to the level specified in argument after |
Details
If the input is of class data.frame, the result is a data.frame with a modified column specified by var.
If the input is of class factor, the result is a factor with different levels.
Value
the modified sdcMicroObj-class object
Author(s)
Bernhard Meindl
References
Templ, M., Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67(4), 1-36, 2015. doi:10.18637/jss.v067.i04
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Examples
## for objects of class sdcMicro:
data(testdata2)
testdata2$urbrur <- as.factor(testdata2$urbrur)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- groupAndRename(sdc, var = "urbrur", before = c("1", "2"), after = c("1"))
importProblem
Description
Reads an sdcProblem with code that has been exported within sdcApp.
Usage
importProblem(path)
Arguments
path | a file path |
Value
an object of class sdcMicro_GUI_export or an object of class 'simple.error'
Author(s)
Bernhard Meindl
Individual Risk computation
Description
Estimation of the risk for each observation. After the risk is computed, one can use e.g. the function localSupp() for the protection of values of high risk. Further details can be found at the link given below.
Usage
indivRisk(x, method = "approx", qual = 1, survey = TRUE)Arguments
x | object from class freqCalc |
method | approx (default) or exact |
qual | final correction factor |
survey | TRUE, if we have survey data and FALSE if we deal with a population. |
Details
S4 class sdcMicro objects are only supported by the function measure_risk, which also estimates the individual risk with the same method.
Value
- rk:
base individual risk
- method:
method
- qual:
final correction factor
- fk:
frequency count
- knames:
colnames of the key variables
Note
The base individual risk method was developed by Benedetti, Capobianchi and Franconi.
Author(s)
Matthias Templ. Bug in method “exact” fixed since version 2.6.5 by Youri Baeyens.
References
Templ, M., Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67(4), 1-36, 2015. doi:10.18637/jss.v067.i04
Franconi, L. and Polettini, S. (2004) Individual risk estimation in mu-Argus: a review. Privacy in Statistical Databases, Lecture Notes in Computer Science, 262-272. Springer.
Machanavajjhala, A., Kifer, D., Gehrke, J. and Venkitasubramaniam, M. (2007) l-Diversity: Privacy Beyond k-Anonymity. ACM Trans. Knowl. Discov. Data, 1(1).
additionally, have a look at the vignettes of sdcMicro for further reading.
See Also
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
f <- freqCalc(francdat, keyVars = c(2,4,5,6), w = 8)
f
f$fk
f$Fk
## individual risk calculation:
indivf <- indivRisk(f)
indivf$rk
Calculate information loss after targeted record swapping
Description
Calculate information loss after targeted record swapping, using both the original and the swapped micro data. Information loss is calculated on table counts defined by parameter 'table_vars' using either the implemented information loss measures, such as absolute deviation, relative absolute deviation and absolute deviation of square roots, or a custom metric; see details below.
Usage
infoLoss(
  data,
  data_swapped,
  table_vars,
  metric = c("absD", "relabsD", "abssqrtD"),
  custom_metric = NULL,
  hid = NULL,
  probs = sort(c(seq(0, 1, by = 0.1), 0.95, 0.99)),
  quantvals = c(0, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, Inf),
  apply_quantvals = c("relabsD", "abssqrtD"),
  exclude_zeros = FALSE,
  only_inner_cells = FALSE
)
Arguments
data | original micro data set, must be either a 'data.table' or 'data.frame'. |
data_swapped | micro data set after targeted record swapping was applied. Must be either a 'data.table' or 'data.frame'. |
table_vars | column names in both 'data' and 'data_swapped'. Defines the variables over which a (multidimensional) frequency table is constructed. Information loss is then calculated by applying the metrics in 'metric' and 'custom_metric' over the cell counts and margin counts of the tables from 'data' and 'data_swapped'. |
metric | character vector containing one or more of the already implemented metrics: "absD", "relabsD" and/or "abssqrtD". |
custom_metric | function or (named) list of functions. Functions defined here must be of the form 'fun(x,y,...)' where 'x' and 'y' expect numeric values of the same length. The output of these functions must be a numeric vector of the same length as 'x' and 'y'. |
hid | 'NULL' or character containing household id in 'data' and 'data_swapped'. If not 'NULL' frequencies will reflect number of households, otherwise frequencies will reflect number of persons. |
probs | numeric vector containing values in the interval [0,1]. |
quantvals | optional numeric vector which defines the groups used for the cumulative outputs. Is applied on the results 'm' from each information loss metric as 'cut(m,breaks=quantvals,include.lowest=TRUE)', see also return values. |
apply_quantvals | character vector defining to the output of which metrics 'quantvals' should be applied. |
exclude_zeros | 'TRUE' or 'FALSE', if 'TRUE' 0 cells in the frequency table using 'data_swapped' will be ignored. |
only_inner_cells | 'TRUE' or 'FALSE', if 'TRUE' only inner cells of the frequency table defined by 'table_vars' will be compared. Otherwise all table margins will be calculated as well. |
Details
First, frequency tables are built from both 'data' and 'data_swapped' using the variables defined in 'table_vars'. By default all table margins are calculated as well, see parameter 'only_inner_cells = FALSE'. After that, the information loss metrics defined in either 'metric' or 'custom_metric' are applied to each of the table cells from both frequency tables. This is done in the sense of 'metric(x,y)' where 'metric' is the information loss, 'x' a cell from the table created from 'data' and 'y' the same cell from the table created from 'data_swapped'. One or more custom metrics can be applied using the parameter 'custom_metric', see also examples.
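A custom metric has to accept two numeric vectors of equal length and return a numeric vector of that length. The following sketch defines a hypothetical relative squared deviation and indicates how it could be passed via 'custom_metric' (the objects dat and dat_s are assumed to be the original and swapped data, as in the examples below):

## hypothetical custom information loss metric
relSqD <- function(x, y) {
  (x - y)^2 / pmax(x, 1)
}
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#   table_vars = c("nuts2", "national"),
#   custom_metric = list(relSqD = relSqD))
# iloss$measures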
Value
Returns a list containing:
* 'cellvalues': 'data.table' showing in a long format for each table cell the frequency counts for 'data' ~ 'count_o' and 'data_swapped' ~ 'count_s'.
* 'overview': 'data.table' containing the distribution of the 'noise' in number of cells and percentage. The 'noise' is calculated as the difference between the cell values of the frequency tables generated from the original and the swapped data.
* 'measures': 'data.table' containing the quantiles and mean (column 'what') of the distribution of the information loss metrics applied to each table cell. The quantiles are defined by parameter 'probs'.
* 'cumdistr\*': 'data.table' containing the cumulative distribution of the information loss metrics. The distribution is shown in number of cells ('cnt') and percentage ('pct'). Column 'cat' shows all unique values of the information loss metric or the grouping defined by 'quantvals'.
* 'false_zero': number of table cells which are non-zero when using 'data' and zero when using 'data_swapped'.
* 'false_nonzero': number of table cells which are zero when using 'data' and non-zero when using 'data_swapped'.
* 'exclude_zeros': value passed to 'exclude_zeros' when calling the function.
Examples
# generate dummy data
seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- createDat(nhid)

# define parameters for swapping
k_anonymity <- 1
swaprate <- .05
similar <- list(c("hsize"))
hier <- c("nuts1","nuts2")
carry_along <- c("nuts3","lau2")
risk_variables <- c("ageGroup","national")
hid <- "hid"

# apply record swapping
# dat_s <- recordSwap(data = dat, hid = hid, hierarchy = hier,
#   similar = similar, swaprate = swaprate,
#   k_anonymity = k_anonymity,
#   risk_variables = risk_variables,
#   carry_along = carry_along,
#   return_swapped_id = TRUE,
#   seed = seed)

# calculate information loss
# for the table nuts2 x national
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#   table_vars = c("nuts2","national"))
# iloss$measures      # distribution of information loss measures
# iloss$false_zero    # no false zeros
# iloss$false_nonzero # no false non-zeros

# frequency tables of households across
# nuts2 x hincome
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#   table_vars = c("nuts2","hincome"),
#   hid = "hid")
# iloss$measures

# define custom metric
# squareD <- function(x,y){
#   (x-y)^2
# }
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#   table_vars = c("nuts2","national"),
#   custom_metric = list(squareD = squareD))
# iloss$measures # includes custom loss as well

kAnon_violations
Description
returns the number of observations violating k-anonymity.
Usage
kAnon_violations(object, weighted, k)

## S4 method for signature 'sdcMicroObj,logical,numeric'
kAnon_violations(object, weighted, k)
Arguments
object | an sdcMicroObj-class object |
weighted | logical; if TRUE, sampling weights are taken into account and the number of violations is estimated for the population, otherwise only the unweighted sample counts are used |
k | a positive number defining parameter k |
Value
the number of records that are violating k-anonymity based on unweighted sample data only (in case parameter weighted is FALSE), or the number of observations that are estimated to violate k-anonymity in the population (in case parameter weighted equals TRUE).
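A minimal sketch, assuming the testdata set shipped with the package, contrasting the unweighted and the weighted variant:

data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c("urbrur", "roof", "walls", "water", "sex"),
  w = "sampling_weight")
## observations violating 3-anonymity in the sample
kAnon_violations(sdc, weighted = FALSE, k = 3)
## estimated number of violations in the population
kAnon_violations(sdc, weighted = TRUE, k = 3)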
Local Suppression
Description
A simple method to perform local suppression.
Usage
localSupp(obj, threshold = 0.15, keyVar)
Arguments
obj | object of class freqCalc or sdcMicroObj-class |
threshold | threshold for individual risk |
keyVar | Variable on which some values might be suppressed |
Details
Values of high risk (above the threshold) of a certain variable (parameter keyVar) are suppressed.
Value
an updated object of class freqCalc or the sdcMicroObj-class object with manipulated data.
Author(s)
Matthias Templ and Bernhard Meindl
References
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
See Also
Examples
data(francdat)
keyVars <- paste0("Key", 1:4)
f <- freqCalc(francdat, keyVars = keyVars, w = 8)
f
f$fk
f$Fk
## individual risk calculation:
indivf <- indivRisk(f)
indivf$rk
## Local Suppression
localS <- localSupp(f, keyVar = "Key4", threshold = 0.15)
f2 <- freqCalc(localS$freqCalc, keyVars = keyVars, w = 8)
indivf2 <- indivRisk(f2)
indivf2$rk
identical(indivf$rk, indivf2$rk)
## select another keyVar and run localSupp once again,
## if you think the table is not fully protected

## for objects of class sdcMicro:
data(testdata)
sdc <- createSdcObj(
  dat = testdata,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  w = "sampling_weight")
sdc <- localSupp(sdc, keyVar = "urbrur", threshold = 0.045)
print(sdc, type = "ls")

Local Suppression to obtain k-anonymity
Description
Algorithm to achieve k-anonymity by performing local suppression.
Usage
localSuppression(obj, k = 2, importance = NULL, combs = NULL, ...)

kAnon(obj, k = 2, importance = NULL, combs = NULL, ...)
Arguments
obj | a data.frame or an sdcMicroObj-class object |
k | Threshold for k-anonymity |
importance | Numeric vector of values between 1 andn ( |
combs | Numeric vector. If specified, the algorithm provides k-anonymity for each combination of n key variables (with n being the value of the ith element of this parameter). For example, |
... | see additional arguments below:
|
Details
The algorithm provides a k-anonymized data set by suppressing values in key variables. The algorithm tries to find an optimal solution to suppress as few values as possible and considers the specified importance vector. If not specified, the importance vector is constructed in a way such that key variables with a high number of characteristics are considered less important than key variables with a low number of characteristics.
The implementation provides k-anonymity per strata, if slot strataVar has been set in sdcMicroObj-class or if parameter strataVar is used when applying the data.frame method. For details, see the examples provided.
For the parameter alpha:
alpha = 1 counts all wildcard matches (i.e. NAs match everything). alpha = 0 assumes missing values form their own categories.
These are two extremes. With alpha = 0, frequencies are likely underestimated when NAs are present. If combs is used with alpha = 0, the heuristic nature of kAnon() may lead to technically correct, but not always intuitively understandable frequency evaluations.
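The sketch below contrasts the two extremes of alpha for the data.frame method, assuming that alpha can be supplied through the additional arguments ('...') and using the testdata2 set shipped with the package:

data(testdata2)
kv <- c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex")
inp <- testdata2[, kv]
## NAs as wildcards (alpha = 1) vs. NAs as their own category (alpha = 0)
ls_wild <- kAnon(inp, keyVars = 1:7, alpha = 1)
ls_own  <- kAnon(inp, keyVars = 1:7, alpha = 0)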
Value
A modified dataset with suppressions that meets k-anonymity based on the specified key variables, or the modified sdcMicroObj-class object.
Note
Deprecated methods localSupp2 and localSupp2Wrapper are no longer available in sdcMicro versions > 4.5.0. kAnon() is a more intuitive term for local suppression, since the goal is to achieve k-anonymity.
Author(s)
Bernhard Meindl, Matthias Templ
References
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN: 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Templ, M., Kowarik, A., Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67(4), 1–36, 2015. doi:10.18637/jss.v067.i04
Examples
data(francdat)
## Local Suppression
localS <- localSuppression(francdat, keyVars = c(4, 5, 6))
localS
plot(localS)

## for objects of class sdcMicro, no stratification
data(testdata2)
kv <- c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex")
sdc <- createSdcObj(testdata2, keyVars = kv, w = "sampling_weight")
sdc <- localSuppression(sdc)

## for objects of class sdcMicro, with stratification
testdata2$ageG <- cut(testdata2$age, 5, labels = paste0("AG", 1:5))
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = kv,
  w = "sampling_weight",
  strataVar = "ageG")
sdc <- localSuppression(sdc, nc = 1)

## it is also possible to provide k-anonymity for subsets of key-variables
## with different parameter k!
## in this case we want to provide 10-anonymity for all combinations
## of 5 key variables, 20-anonymity for all combinations with 4 key variables
## and 30-anonymity for all combinations of 3 key variables.
sdc <- createSdcObj(testdata2, keyVars = kv, w = "sampling_weight")
combs <- 5:3
k <- c(10, 20, 30)
sdc <- localSuppression(sdc, k = k, combs = combs)

## data.frame method (no stratification)
inp <- testdata2[, c(kv, "ageG")]
ls <- localSuppression(inp, keyVars = 1:7)
print(ls)
plot(ls)

## data.frame method (with stratification)
ls <- kAnon(inp, keyVars = 1:7, strataVars = 8)
print(ls)
plot(ls)

Fast and Simple Microaggregation
Description
Function to perform a fast and simple (primitive) method of microaggregation, suitable for large datasets.
Usage
mafast(obj, variables = NULL, by = NULL, aggr = 3, measure = mean)
Arguments
obj | either a data.frame/matrix or an object of class sdcMicroObj-class |
variables | variables to microaggregate. If obj is of class sdcMicroObj the numerical key variables are chosen per default. |
by | grouping variable for microaggregation. If obj is of class sdcMicroObj the strata variables are chosen per default. |
aggr | aggregation level (default=3) |
measure | aggregation statistic, mean, median, trim, onestep (default =mean) |
Value
If ‘obj’ was of class sdcMicroObj-class the corresponding slots are filled, like manipNumVars, risk and utility. If ‘obj’ was of class “data.frame” or “matrix” an object of the same class is returned.
Author(s)
Alexander Kowarik
See Also
Examples
data(Tarragona)
m1 <- mafast(Tarragona, variables = c("GROSS.PROFIT","OPERATING.PROFIT","SALES"), aggr = 3)
data(testdata)
m2 <- mafast(testdata, variables = c("expend","income","savings"), aggr = 50, by = "sex")
summary(m2)

## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'),
  w = 'sampling_weight')
sdc <- dRisk(sdc)
sdc@risk$numeric
sdc1 <- mafast(sdc, aggr = 4)
sdc1@risk$numeric
sdc2 <- mafast(sdc, aggr = 10)
sdc2@risk$numeric

### Performance tests
x <- testdata
for (i in 1:20) {
  x <- rbind(x, testdata)
}
system.time({
  xx <- mafast(
    obj = x,
    variables = c("expend", "income", "savings"),
    aggr = 50,
    by = "sex"
  )
})

Disclosure Risk for Categorical Variables
Description
The function measures the disclosure risk for weighted or unweighted data.It computes the individual risk (and household risk if reasonable) and theglobal risk. It also computes a risk threshold based on a global risk value.
Prints a 'measure_risk'-object
Prints a 'ldiversity'-object
Usage
measure_risk(obj, ...)

ldiversity(obj, ldiv_index = NULL, l_recurs_c = 2, missing = -999, ...)

## S3 method for class 'measure_risk'
print(x, ...)

## S3 method for class 'ldiversity'
print(x, ...)
Arguments
obj | Object of class sdcMicroObj-class, data.frame or matrix |
... | see arguments below
|
ldiv_index | indices (or names) of the variables used for l-diversity |
l_recurs_c | l-Diversity Constant |
missing | an integer value to be used as missing value in the C++ routine |
x | Output of measure_risk() or ldiversity() |
Details
To be used when the risk of disclosure for individuals within a family is considered to be statistically independent.
Internally, the functions freqCalc() and indivRisk() are used for estimation.
Measuring individual risk: the individual risk approach is based on so-called super-population models. In such models, population frequency counts are modeled given a certain distribution. The estimation procedure of sample frequency counts given the population frequency counts is modeled by assuming a negative binomial distribution. This is used for the estimation of the individual risk. The extensive theory can be found in Skinner (1998); the approximation formulas for the individual risk used are described in Franconi and Polettini (2004).
Measuring hierarchical risk: if "hid" - the index of the variable holding information on the hierarchical cluster structures (e.g., individuals that are clustered in households) - is provided, the hierarchical risk is additionally estimated. Note that the risk of re-identifying an individual within a household may also affect the probability of disclosure of other members in the same household. Thus, the household or cluster structure of the data must be taken into account when estimating disclosure risks. It is commonly assumed that the risk of re-identification of a household is the risk that at least one member of the household can be disclosed. Thus this probability can simply be estimated from individual risks as 1 minus the probability that no member of the household can be identified.
Global risk: the sum of the individual risks in the dataset gives the expected number of re-identifications, which serves as a measure of the global risk.
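Both aggregations can be written down directly; the sketch below assumes a small vector rk of (already estimated) individual risks that all belong to one household:

## hypothetical individual risks of the members of one household
rk <- c(0.01, 0.20, 0.05, 0.05)
## contribution to the global risk: expected number of re-identifications
sum(rk)
## household risk: 1 minus the probability that no member is re-identified
1 - prod(1 - rk)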
l-Diversity: if "ldiv_index" is unequal to NULL, i.e. if the indices of sensitive variables are specified, various measures for l-diversity are calculated. l-diversity is an extension of the well-known k-anonymity approach in which also the uniqueness in sensitive variables for each pattern spanned by the key variables is evaluated.
Value
A modified sdcMicroObj-class object or a list with the following elements:
- global_risk_ER:
expected number of re-identifications.
- global_risk:
global risk (sum of individual risks).
- global_risk_pct:
global risk in percent.
- Res:
matrix with the risk, frequency in the sample and grossed-up frequency in the population (and the hierarchical risk) for each observation.
- global_threshold:
for a given max_global_risk the threshold for the risk of observations.
- max_global_risk:
the input max_global_risk of the function.
- hier_risk_ER:
expected number of re-identifications with household structure.
- hier_risk:
global risk with household structure (sum of individual risks).
- hier_risk_pct:
global risk with household structure in percent.
- ldiversity:
Matrix with Distinct_Ldiversity, Entropy_Ldiversity and Recursive_Ldiversity for each sensitive variable.
Prints risk-information into the console
Information on L-Diversity Measures in the console
Author(s)
Alexander Kowarik, Bernhard Meindl, Matthias Templ, Bernd Prantner, minor parts of IHSN C++ source
References
Franconi, L. and Polettini, S. (2004)Individual riskestimation in mu-Argus: a review. Privacy in Statistical Databases, LectureNotes in Computer Science, 262–272. Springer
Machanavajjhala, A. and Kifer, D. and Gehrke, J. and Venkitasubramaniam, M.(2007)l-Diversity: Privacy Beyond k-Anonymity. ACM Trans. Knowl.Discov. Data, 1(1)
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Templ, M., Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67 (4), 1–36, 2015. doi:10.18637/jss.v067.i04
See Also
Examples
## measure_risk with sdcMicro objects:
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c('urbrur','roof','walls','water','electcon'),
  numVars = c('expend','income','savings'),
  w = 'sampling_weight')
## risk is already estimated and available in...
names(sdc@risk)

## measure risk on data frames or matrices:
res <- measure_risk(testdata, keyVars = c("urbrur","roof","walls","water","sex"))
print(res)
head(res$Res)
resw <- measure_risk(testdata, keyVars = c("urbrur","roof","walls","water","sex"), w = "sampling_weight")
print(resw)
head(resw$Res)
res1 <- ldiversity(testdata, keyVars = c("urbrur","roof","walls","water","sex"), ldiv_index = "electcon")
print(res1)
head(res1)
res2 <- ldiversity(testdata, keyVars = c("urbrur","roof","walls","water","sex"), ldiv_index = c("electcon","relat"))
print(res2)
head(res2)

# measure risk with household risk
resh <- measure_risk(testdata, keyVars = c("urbrur","roof","walls","water","sex"), w = "sampling_weight", hid = "ori_hid")
print(resh)

# change max_global_risk
rest <- measure_risk(testdata, keyVars = c("urbrur","roof","walls","water","sex"), w = "sampling_weight", max_global_risk = 0.0001)
print(rest)

## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'),
  w = 'sampling_weight')
## -> when using `createSdcObj()`, the risks are already internally computed
## and it is not required to explicitly run `sdc <- measure_risk(sdc)`

Replaces the raw household-level data with the anonymized household-level data in the full dataset for anonymization of data with a household structure (or other hierarchical structure). Requires a matching household ID in both files.
Description
Replaces the raw household-level data with the anonymized household-level data in the full dataset for anonymization of data with a household structure (or other hierarchical structure). Requires a matching household ID in both files.
Usage
mergeHouseholdData(dat, hhId, dathh)
Arguments
dat | a data.frame with the full dataset |
hhId | name of the household (cluster) ID (identical in both datasets) |
dathh | a data.frame with the treated household-level data (generated for example with selectHouseholdData) |
Value
a data.frame with the treated household level variables and the raw individual level variables
Author(s)
Thijs Benschop and Bernhard Meindl
Examples
## Load data
x <- testdata
## donttest is necessary because of
## Examples with CPU time > 2.5 times elapsed time
## caused by using C++ code and/or data.table

## Create household level dataset
x_hh <- selectHouseholdData(dat = x, hhId = "ori_hid",
  hhVars = c("urbrur", "roof", "walls", "water", "electcon", "household_weights"))

## Anonymize household level dataset and extract data
sdc_hh <- createSdcObj(x_hh, keyVars = c('urbrur','roof'), w = 'household_weights')
sdc_hh <- kAnon(sdc_hh, k = 3)
x_hh_anon <- extractManipData(sdc_hh)

## Merge anonymized household level data back into the full dataset
x_anonhh <- mergeHouseholdData(x, "ori_hid", x_hh_anon)

## Anonymize full dataset and extract data
sdc_full <- createSdcObj(x_anonhh, keyVars = c('sex', 'age', 'urbrur', 'roof'), w = 'sampling_weight')
sdc_full <- kAnon(sdc_full, k = 3)
x_full_anon <- extractManipData(sdc_full)

microData
Description
Small artificial toy data set.
Format
The format is: num [1:13, 1:5] 5 7 2 1 7 8 12 3 15 4 ... - attr(*,"dimnames")=List of 2 ..$ : chr [1:13] "10000" "11000" "12000" "12100" .....$ : chr [1:5] "one" "two" "three" "four" ...
Examples
data(microData)
microData <- as.data.frame(microData)
m1 <- microaggregation(microData, method = "mdav")
summary(m1)

Microaggregation for numerical and categorical key variables based on a distance similar to the Gower Distance
Description
The microaggregation is based on distances computed similarly to the Gower distance. The distance function distinguishes between the variable types factor, ordered, numerical and mixed (semi-continuous variables with a fixed probability mass at a constant value, e.g. 0).
Usage
microaggrGower(
  obj,
  variables = NULL,
  aggr = 3,
  dist_var = NULL,
  by = NULL,
  mixed = NULL,
  mixed.constant = NULL,
  trace = FALSE,
  weights = NULL,
  numFun = mean,
  catFun = VIM::sampleCat,
  addRandom = FALSE
)
Arguments
obj | a data.frame or an object of class sdcMicroObj-class |
variables | character vector with names of variables to be aggregated (default for sdcMicroObj is all keyVariables and all numeric key variables) |
aggr | aggregation level (default=3) |
dist_var | character vector with variable names for distance computation |
by | character vector with variable names to split the dataset before performing microaggregation (default for sdcMicroObj is strataVar) |
mixed | character vector with names of mixed variables |
mixed.constant | numeric vector with length equal to mixed, where themixed variables have the probability mass |
trace | TRUE/FALSE for some console output |
weights | numerical vector with length equal to the number of variables used for distance computation |
numFun | function to be used to aggregate numerical variables |
catFun | function to be used to aggregate categorical variables |
addRandom | TRUE/FALSE, if a random value should be added for the distance computation. |
Details
The function sampleCat samples with probabilities corresponding to the occurrence of the levels among the nearest neighbours. The function maxCat chooses the level with the most occurrences, and chooses randomly if the maximum is not unique.
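As a hedged sketch, a deterministic aggregation of the categorical variables could be obtained by supplying the mode-based function instead of the default (this assumes that maxCat is exported by VIM analogously to sampleCat):

data(testdata, package = "sdcMicro")
testdata <- testdata[1:200, ]
for (i in c(1:7, 9)) testdata[, i] <- as.factor(testdata[, i])
## aggregate categorical variables by their most frequent level
m <- microaggrGower(testdata,
  variables = c("relat", "age", "expend"),
  dist_var = c("age", "sex", "income", "savings"),
  catFun = VIM::maxCat)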
Value
The function returns the updated sdcMicroObj or simply an altered data frame.
Note
In each by-group all distances are computed; therefore, introducing more by-groups significantly decreases the computation time and memory consumption.
Author(s)
Alexander Kowarik
See Also
Examples
data(testdata, package = "sdcMicro")
testdata <- testdata[1:200, ]
for (i in c(1:7, 9)) testdata[, i] <- as.factor(testdata[, i])
test <- microaggrGower(testdata,
  variables = c("relat", "age", "expend"),
  dist_var = c("age", "sex", "income", "savings"),
  by = c("urbrur", "roof"))
sdc <- createSdcObj(testdata,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'),
  w = 'sampling_weight')
sdc <- microaggrGower(sdc)

Microaggregation
Description
Function to perform various methods of microaggregation.
Usage
microaggregation(
  obj,
  variables = NULL,
  aggr = 3,
  strata_variables = NULL,
  method = "mdav",
  weights = NULL,
  nc = 8,
  clustermethod = "clara",
  measure = "mean",
  trim = 0,
  varsort = 1,
  transf = "log"
)
Arguments
obj | either an object of class sdcMicroObj-class or a data.frame/matrix |
variables | variables to microaggregate. For |
aggr | aggregation level (default=3) |
strata_variables | for |
method | pca, rmd, onedims, single, simple, clustpca, pppca,clustpppca, mdav, clustmcdpca, influence, mcdpca |
weights | sampling weights. If obj is of class sdcMicroObj the vector of sampling weights is chosen automatically. If supplied, a weighted version of the aggregation measure is chosen automatically, e.g. weighted median or weighted mean. |
nc | number of clusters, if the chosen method performs cluster analysis |
clustermethod | clustermethod, if necessary |
measure | aggregation statistic, mean, median, trim, onestep (default=mean) |
trim | trimming percentage, if measure=trim |
varsort | variable for sorting, if method=single |
transf | transformation for data x |
Details
On https://research.cbs.nl/casc/glossary.htm one can find the “official” definition of microaggregation:
Records are grouped based on a proximity measure of variables of interest, and the same small groups of records are used in calculating aggregates for those variables. The aggregates are released instead of the individual record values.
The recommended method is “rmd” which forms the proximity using multivariate distances based on robust methods. It is an extension of the well-known method “mdav”. However, when computational speed is important, method “mdav” is the preferable choice.
While very different concepts can be used for the proximity measure, the aggregation itself is naturally done with the arithmetic mean. Nevertheless, other measures of location can be used for aggregation, especially when the group size for aggregation has been chosen higher than 3. Since the median seems to be unsuitable for microaggregation because it is highly robust, other measures which are included can be chosen. If a complex sample survey is microaggregated, the corresponding sampling weights should be specified to either aggregate the values by the weighted arithmetic mean or the weighted median.
This function also contains a method with which the data can be clustered with a variety of different clustering algorithms. Clustering observations before applying microaggregation might be useful. Note that the data are automatically standardised before clustering.
The usage of clustering method ‘Mclust’ requires package mclust02, which must be loaded first. The package is not loaded automatically, since it is not under GPL but comes with a different licence.
There are also some projection methods for microaggregation included. The robust versions ‘pppca’ and ‘clustpppca’ (clustering first) are fast implementations and provide almost always the best results.
Univariate statistics are preserved best with the individual ranking method (we call it ‘onedims’; however, this method is often named ‘individual ranking’), but multivariate statistics are strongly affected.
With method ‘simple’ one can apply microaggregation directly on the (unsorted) data. It is useful as a benchmark for the comparison with other methods, i.e. it answers the question of how much better a sorting of the data before aggregation is.
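A minimal sketch of a weighted aggregation for a complex sample, assuming that the data.frame method accepts the sampling weights as a numeric vector via the weights argument:

data(testdata)
m_w <- microaggregation(
  obj = testdata[, c("expend", "income", "savings")],
  method = "onedims",
  aggr = 3,
  ## a weighted aggregation measure is chosen automatically once weights are given
  weights = testdata$sampling_weight)
summary(m_w)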
Value
If ‘obj’ was of class sdcMicroObj-class the corresponding slots are filled, like manipNumVars, risk and utility. If ‘obj’ was of class “data.frame”, an object of class “micro” with the following entities is returned:
x: original data
mx: the microaggregated dataset
method: method
aggr: aggregation level
measure: proximity measure for aggregation
Note
If only one variable is specified, mafast is applied and argument method is ignored. Parameter measure is ignored for methods mdav and rmd.
Author(s)
Matthias Templ, Bernhard Meindl
For method “mdav”: This work is being supported by the International Household Survey Network (IHSN) and funded by a DGF Grant provided by the World Bank to the PARIS21 Secretariat at the Organisation for Economic Co-operation and Development (OECD). This work builds on previous work which is elsewhere acknowledged.
Author for the integration of the code for mdav in R: Alexander Kowarik.
References
Templ, M. and Meindl, B., Robust Statistics Meets SDC: New Disclosure Risk Measures for Continuous Microdata Masking, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 5262, pp. 113-126, 2008.
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php
Templ, M. New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Templ, M., Meindl, B. and Kowarik, A.: Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro, Journal of Statistical Software, 67 (4), 1–36, 2015.
See Also
summary.micro, plotMicro, valTable
Examples
data(testdata)
# donttest since Examples with CPU time larger 2.5 times elapsed time, because
# of using data.table and multicore computation.
m <- microaggregation(
  obj = testdata[1:100, c("expend", "income", "savings")],
  method = "mdav",
  aggr = 4)
summary(m)

## for objects of class sdcMicro:
## no stratification because `@strataVar` is `NULL`
data(testdata2)
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
sdc <- microaggregation(
  obj = sdc,
  variables = c("expend", "income"))

## with stratification using variable `"relat"`
strataVar(sdc) <- "relat"
sdc <- microaggregation(
  obj = sdc,
  variables = "savings")

Global risk using log-linear models.
Description
The sample frequencies are assumed to be independent and Poisson distributed. The parameters of the corresponding distribution are estimated by a log-linear model including the main effects and possible interactions.
Usage
modRisk(obj, method = "default", weights, formulaM, bound = Inf, ...)
Arguments
obj | an sdcMicroObj-class object or a data.frame |
method | choose the method for model-based risk estimation. Currently, the following methods can be selected:
|
weights | a variable name specifying sampling weights |
formulaM | A formula specifying the model. |
bound | a number specifying a threshold for 'risky' observations in the sample. |
... | additional parameters passed through, currently ignored. |
Details
This measure aims to (1) calculate the number of sample uniques that are population uniques with a probabilistic Poisson model and (2) estimate the expected number of correct matches for sample uniques.
ad 1) this risk measure is defined over all sample uniques as
\tau_1= \sum\limits_{j:f_j=1} P(F_j=1 | f_j=1) \quad ,
i.e. the expected number of sample uniques that are population uniques.
ad 2) this risk measure is defined over all sample uniques as
\tau_2= \sum\limits_{j:f_j=1} P(1 / F_j | f_j=1) \quad .
Since population frequencies F_k are unknown, they need to be estimated.
The iterative proportional fitting method is used to fit the parameters of the Poisson distributed frequency counts related to the specified model. The obtained parameters are used to estimate a global risk, as defined in Skinner and Holmes (1998).
Value
Two global risk measures and some model output given the specified model. If this method is applied to an sdcMicroObj-class object, the slot 'risk' of the object is updated with the result of the model-based risk calculation.
Author(s)
Matthias Templ, Marius Totter, Bernhard Meindl
References
Skinner, C.J. and Holmes, D.J. (1998) Estimating the re-identification risk per record in microdata. Journal of Official Statistics, 14:361-372, 1998.
Rinott, Y. and Shlomo, N. (1998). A Generalized Negative Binomial Smoothing Model for Sample Disclosure Risk Estimation. Privacy in Statistical Databases. Lecture Notes in Computer Science. Springer-Verlag, 82–93.
Clogg, C.C. and Eliasson, S.R. (1987).Some Common Problems in Log-Linear Analysis. Sociological Methods and Research, 8-44.
See Also
Examples
## data.frame method
data(testdata2)
form <- ~ sex + water + roof
w <- "sampling_weight"
(modRisk(testdata2, method = "default", formulaM = form, weights = w))
(modRisk(testdata2, method = "CE", formulaM = form, weights = w))
(modRisk(testdata2, method = "PML", formulaM = form, weights = w))
(modRisk(testdata2, method = "weightedLLM", formulaM = form, weights = w))
(modRisk(testdata2, method = "IPF", formulaM = form, weights = w))

## application to a sdcMicroObj
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c("urbrur", "roof", "walls", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
sdc <- modRisk(sdc, form = ~ sex + water + roof)
slot(sdc, "risk")$model

# an example using data from the laeken-pkg
library(laeken)
data(eusilc)
f <- as.formula(paste(" ~ ", "db040 + hsize + rb090 + age + pb220a + age:rb090 + age:hsize + hsize:rb090"))
w <- "rb050"
(modRisk(eusilc, method = "default", weights = w, formulaM = f, bound = 5))
(modRisk(eusilc, method = "CE", weights = w, formulaM = f, bound = 5))
(modRisk(eusilc, method = "PML", weights = w, formulaM = f, bound = 5))
(modRisk(eusilc, method = "weightedLLM", weights = w, formulaM = f, bound = 5))

Detection and winsorization of multivariate outliers
Description
Imputation and detection of outliers
Usage
mvTopCoding(x, maha = NULL, center = NULL, cov = NULL, alpha = 0.025)
Arguments
x | an object coercible to a |
maha | squared mahalanobis distance of each observation |
center | center of the data, needed for calculation of the Mahalanobis distance (if not provided) |
cov | covariance matrix of the data, needed for calculation of the Mahalanobis distance (if not provided) |
alpha | significance level, determining the ellipsoid onto which outliers are placed |
Details
Winsorizes the potential outliers on the ellipsoid defined by (robust) Mahalanobis distances in the direction of the center of the data.
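If robust estimates are preferred, center and cov can be supplied directly; the sketch below uses the MCD estimator from the robustbase package, which is one possible choice and not mandated by mvTopCoding():

set.seed(1)
x <- MASS::mvrnorm(50, mu = c(5, 5), Sigma = matrix(c(1, 0.9, 0.9, 1), ncol = 2))
## robust location and scatter, so the winsorizing ellipsoid is not
## distorted by the outliers themselves
rob <- robustbase::covMcd(x)
ximp <- mvTopCoding(x, center = rob$center, cov = rob$cov, alpha = 0.025)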
Value
the imputed winsorized data
Author(s)
Johannes Gussenbauer, Matthias Templ
Examples
set.seed(123)
x <- MASS::mvrnorm(20, mu = c(5,5), Sigma = matrix(c(1,0.9,0.9,1), ncol = 2))
x[1, 1] <- 3
x[1, 2] <- 6
plot(x)
ximp <- mvTopCoding(x)
points(ximp, col = "blue", pch = 4)

# more dimensions
Sigma <- diag(5)
Sigma[upper.tri(Sigma)] <- 0.9
Sigma[lower.tri(Sigma)] <- 0.9
x <- MASS::mvrnorm(20, mu = rep(5,5), Sigma = Sigma)
x[1, 1] <- 3
x[1, 2] <- 6
pairs(x)
ximp <- mvTopCoding(x)
xnew <- data.frame(rbind(x, ximp))
xnew$beforeafter <- rep(c(0,1), each = nrow(x))
pairs(xnew, col = xnew$beforeafter, pch = 4)

# by hand (non-robust)
x[2,2] <- NA
m <- colMeans(x, na.rm = TRUE)
s <- cov(x, use = "complete.obs")
md <- stats::mahalanobis(x, m, s)
ximp <- mvTopCoding(x, center = m, cov = s, maha = md)
plot(x)
points(ximp, col = "blue", pch = 4)

nextSdcObj
Description
internal function used to provide the undo-functionality.
Usage
nextSdcObj(obj)
Arguments
obj | a sdcMicroObj-class object |
Value
a modified sdcMicroObj-class object
Reorder data
Description
Reorders the data according to a column in the data set.
NOTE: This is an internal function used for testing the C++ function orderData which is used inside the C++ function recordSwap() to speed up performance.
Usage
orderData_cpp(data, orderIndex)
Arguments
data | micro data set containing only numeric values. |
orderIndex | column index in data defining the order by which the data set is sorted |
Value
ordered data set.
Plots for localSuppression objects
Description
This function creates barplots to display the number of suppressed values in categorical key variables to achieve k-anonymity.
Usage
## S3 method for class 'localSuppression'
plot(x, ...)
Arguments
x | an object derived from localSuppression() |
... | Additional arguments, currently available are:
|
Value
a ggplot plot object
Author(s)
Bernhard Meindl, Matthias Templ
See Also
Examples
data(francdat)

Plot functions for objects of class sdcMicroObj
Description
Descriptive plot function for sdcMicroObj-objects. Currently only visualization of local suppression is implemented.
Usage
## S3 method for class 'sdcMicroObj'
plot(x, type = "ls", ...)
Arguments
x | An object of class sdcMicroObj |
type | specifies what kind of plot will be generated
|
... | currently ignored |
Value
a ggplot plot object or (invisible) NULL if local suppression using kAnon() has not been applied
Author(s)
Bernhard Meindl
Examples
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c("urbrur", "roof", "walls", "relat", "sex"),
  w = "sampling_weight")
sdc <- kAnon(sdc, k = 3)
plot(sdc, type = "ls")

Comparison plots
Description
Plots for the comparison of the original data and perturbed data.
Usage
plotMicro(x, p, which.plot = 1:3)Arguments
x | an output object of microaggregation() |
p | necessary parameter for the box cox transformation ( |
which.plot | which plot should be created?
|
Details
Univariate and multivariate comparison plots are implemented to detect differences between the perturbed and the original data, but also to compare perturbed data produced by different methods.
Value
returns NULL; the selected plot is displayed
Author(s)
Matthias Templ
References
Templ, M. and Meindl, B., Software Development for SDC in R, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 4302, pp. 347-359, 2006.
See Also
Examples
data(free1)
df <- as.data.frame(free1)[, 31:34]
m1 <- microaggregation(df, method = "onedims", aggr = 3)
plotMicro(m1, p = 1, which.plot = 1)
plotMicro(m1, p = 1, which.plot = 2)
plotMicro(m1, p = 1, which.plot = 3)

Post Randomization
Description
To be used on categorical data stored as factors. The algorithm randomly changes the values of variables in selected records (usually the risky ones) according to an invariant probability transition matrix or a custom-defined transition matrix.
Usage
pram(obj, variables = NULL, strata_variables = NULL, pd = 0.8, alpha = 0.5)
Arguments
obj | Input data. Allowed input data are objects of class factor, data.frame or sdcMicroObj-class |
variables | Names of variables in |
strata_variables | names of variables for stratification (will be set automatically for an object of class sdcMicroObj). One can also specify an integer vector or factor that specifies the desired groups. This vector must match the dimension of the input data set, however. For a possible use case, have a look at the examples. |
pd | minimum diagonal entries for the generated transition matrix P. Either a vector of length 1 (which is recycled) or a vector of the same length as the number of variables that should be postrandomized. It is also possible to set
It is also possible to combine the different ways. For details have a look at the examples. |
alpha | amount of perturbation for the invariant Pram method. This is a numeric vector of length 1 (that will be recycled if necessary) or a vector of the same length as the number of variables. If one specifies a transition matrix directly, |
Value
a modified sdcMicroObj object or a new object containing the original and post-randomized variables (with suffix "_pram").
Note
Deprecated method 'pram_strata' is no longer available in sdcMicro > 4.5.0.
Author(s)
Alexander Kowarik, Matthias Templ, Bernhard Meindl
References
https://www.gnu.org/software/glpk/
Kowarik, A. and Templ, M. and Meindl, B. and Fonteneau, F. and Prantner, B.:Testing of IHSN Cpp Code and Inclusion of New Methods into sdcMicro,in: Lecture Notes in Computer Science, J. Domingo-Ferrer, I. Tinnirello(editors.); Springer, Berlin, 2012, ISBN: 978-3-642-33626-3, pp. 63-77.doi:10.1007/978-3-642-33627-0_6
Templ, M. and Kowarik, A. and Meindl, B.:Statistical Disclosure Control forMicro-Data Using the R Package sdcMicro. in: Journal of Statistical Software,67 (4), 1–36, 2015.doi:10.18637/jss.v067.i04
Templ, M.:Statistical Disclosure Control for Microdata: Methods and Applications in R.in: Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4.doi:10.1007/978-3-319-50272-4
Examples
data(testdata)
## donttest is necessary because of
## Examples with CPU time > 2.5 times elapsed time
## caused by using C++ code and/or data.table

## using a factor variable as input
res <- pram(as.factor(testdata$roof))
print(res)
summary(res)

## using a data.frame as input
## pram can only be applied to factors
## --> we have to recode to factors beforehand
testdata$roof <- factor(testdata$roof)
testdata$walls <- factor(testdata$walls)
testdata$water <- factor(testdata$water)

## pram() is applied within subgroups defined by
## variables "urbrur" and "sex"
res <- pram(
  obj = testdata,
  variables = "roof",
  strata_variables = c("urbrur", "sex"))
print(res)
summary(res)

## default parameters (pd = 0.8 and alpha = 0.5) for the generation
## of the invariant transition matrix will be used for all variables
res1 <- pram(
  obj = testdata,
  variables = c("roof", "walls", "water"))
print(res1)

## specific parameter settings for each variable
res2 <- pram(
  obj = testdata,
  variables = c("roof", "walls", "water"),
  pd = c(0.95, 0.8, 0.9),
  alpha = 0.5)
print(res2)

## detailed information on pram-parameters (such as the transition matrix 'Rs')
## is stored in the output, eg. for variable 'roof'
#attr(res2, "pram_params")$roof

## we can also specify a custom transition-matrix directly
mat <- diag(length(levels(testdata$roof)))
rownames(mat) <- colnames(mat) <- levels(testdata$roof)
res3 <- pram(
  obj = testdata,
  variables = "roof",
  pd = mat)
print(res3) # of course, nothing has changed!

## it is possible to use a transition matrix for a variable and use the 'traditional' way
## of specifying a number for the minimal diagonal entries of the transition matrix
## for other variables. In this case we must supply `pd` as list.
res4 <- pram(
  obj = testdata,
  variables = c("roof", "walls"),
  pd = list(mat, 0.5),
  alpha = c(NA, 0.5))
print(res4)
summary(res4)
attr(res4, "pram_params")

## application to objects of class sdcMicro with default parameters
data(testdata2)
testdata2$urbrur <- factor(testdata2$urbrur)
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
sdc <- pram(
  obj = sdc,
  variables = "urbrur")
print(sdc, type = "pram")

## this is equal to the previous application. If argument 'variables' is NULL,
## all variables from slot 'pramVars' will be used if possible.
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight",
  pramVars = "urbrur")
sdc <- pram(sdc)
print(sdc, type = "pram")

## we can specify transition matrices for sdcMicroObj-objects too
testdata2$roof <- factor(testdata2$roof)
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
mat <- diag(length(levels(testdata2$roof)))
rownames(mat) <- colnames(mat) <- levels(testdata2$roof)
mat[1,] <- c(0.9, 0, 0, 0.05, 0.05)
sdc <- pram(
  obj = sdc,
  variables = "roof",
  pd = mat)
print(sdc, type = "pram")

## we can also have a look at the transitions
get.sdcMicroObj(sdc, "pram")$transitions

Print method for objects from class freqCalc.
Description
Print method for objects from class freqCalc.
Usage
## S3 method for class 'freqCalc'
print(x, ...)
Arguments
x | object from class freqCalc |
... | Additional arguments passed through. |
Value
information about the frequency counts for key variables for an object of class freqCalc.
Author(s)
Matthias Templ
See Also
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
f <- freqCalc(francdat, keyVars = c(2,4,5,6), w = 8)
f

Print method for objects from class indivRisk
Description
Print method for objects from class indivRisk
Usage
## S3 method for class 'indivRisk'
print(x, ...)
Arguments
x | object from class indivRisk |
... | Additional arguments passed through. |
Value
some information about the method and the final correction factor for objects of class ‘indivRisk’.
Author(s)
Matthias Templ
See Also
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
f1 <- freqCalc(francdat, keyVars = c(2,4,5,6), w = 8)
data.frame(fk = f1$fk, Fk = f1$Fk)
## individual risk calculation:
indivRisk(f1)

Print method for objects from class localSuppression
Description
Print method for objects from class localSuppression
Usage
## S3 method for class 'localSuppression'
print(x, ...)
Arguments
x | object from class localSuppression |
... | Additional arguments passed through. |
Value
Information about the frequency counts for key variables for an object of class ‘localSuppression’.
Author(s)
Matthias Templ
See Also
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
l1 <- localSuppression(francdat, keyVars = c(2, 4, 5, 6))
l1

Print method for objects from class micro
Description
printing an object of classmicro
Usage
## S3 method for class 'micro'
print(x, ...)
Arguments
x | object from class micro |
... | Additional arguments passed through. |
Value
information about the method and aggregation level for objects of class micro.
Author(s)
Matthias Templ
See Also
Examples
data(free1)
free1 <- as.data.frame(free1)
m1 <- microaggregation(free1[, 31:34], method = 'onedims', aggr = 3)
m1

Print method for objects from class modrisk
Description
Print method for objects from class modrisk
Usage
## S3 method for class 'modrisk'
print(x, ...)
Arguments
x | an object of class modrisk |
... | Additional arguments passed through. |
Value
Output of model-based risk estimation
Author(s)
Bernhard Meindl
See Also
Print method for objects from class pram
Description
Print method for objects from class pram
Usage
## S3 method for class 'pram'print(x, ...)Arguments
x | an object of class pram |
... | Additional arguments passed through. |
Value
absolute and relative frequencies of changed observations in each modified variable
Author(s)
Bernhard Meindl, Matthias Templ
See Also
Print and Extractor Functions for objects of class sdcMicroObj-class
Description
Descriptive print function for frequencies, local suppression, recoding, categorical risk and numerical risk.
Usage
## S4 method for signature 'sdcMicroObj'
print(x, type = "kAnon", docat = TRUE, ...)
Arguments
x | An object of class sdcMicroObj-class |
type | Selection of the content to be returned or printed |
docat | logical, if TRUE (default) the results will be actually printed |
... | the type argument for the print method, currently supported are:
|
Details
Possible values for the type argument of the print function are: "freq" for frequencies, "ls" for local suppression output, "pram" for results of post-randomization, "recode" for recodes, "risk" for categorical risk and "numrisk" for numerical risk.
Possible values for the type argument of the freq function are: "fk" for sample frequencies and "Fk" for weighted frequencies.
Author(s)
Alexander Kowarik, Matthias Templ, Bernhard Meindl
Examples
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c('urbrur','roof','walls','relat','sex'),
  pramVars = c('water','electcon'),
  numVars = c('expend','income','savings'),
  w = 'sampling_weight')
sdc <- microaggregation(sdc, method = "mdav", aggr = 3)
print(sdc)
print(sdc, type = "general")
print(sdc, type = "ls")
print(sdc, type = "recode")
print(sdc, type = "risk")
print(sdc, type = "numrisk")
print(sdc, type = "pram")
print(sdc, type = "kAnon")
print(sdc, type = "comp_numvars")

Print method for objects from class suda2
Description
Print method for objects from class suda2.
Usage
## S3 method for class 'suda2'
print(x, ...)
Arguments
x | an object of class suda2 |
... | additional arguments passed through. |
Value
Table of dis suda scores.
Author(s)
Matthias Templ
See Also
Random Sampling
Description
Randomly select records given a probability weight vector prob.
NOTE: This is an internal function used for testing the C++ function randSample which is used inside the C++ function recordSwap().
Usage
randSample_cpp(ID, N, prob, IDused, seed)
Arguments
ID | vector containing record IDs from which to sample |
N | integer defining the number of records to be sampled |
prob | a vector of probability weights for obtaining the elements of the vector being sampled. |
IDused | vector containing IDs which must not be sampled |
seed | integer setting the sampling seed |
Rank Swapping
Description
Swapping values within a range so that, first, the correlation structure of the original variables is preserved, and second, the values in each record are disturbed. To be used on numeric or ordinal variables where the rank can be determined and the correlation coefficient makes sense.
Usage
rankSwap(
  obj,
  variables = NULL,
  TopPercent = 5,
  BottomPercent = 5,
  K0 = NULL,
  R0 = NULL,
  P = NULL,
  missing = NA,
  seed = NULL
)
Arguments
obj | a data.frame/matrix or an sdcMicroObj-class object |
variables | names or indices of variables to which rank swapping is applied. For an object of class |
TopPercent | Percentage of largest values that are grouped togetherbefore rank swapping is applied. |
BottomPercent | Percentage of lowest values that are grouped togetherbefore rank swapping is applied. |
K0 | Subset-mean preservation factor. Preserves the means before and after rank swapping within a range based on K0. K0 is the subset-mean preservation factor such that |
R0 | Multivariate preservation factor. Preserves the correlation between variables within a certain range based on the given constant R0. We can specify the preservation factor as |
P | Rank range as percentage of total sample size. We can specify the rank range itself directly, noted as |
missing | the value to be used as missing value in the C++ routine instead of NA. If NA, a suitable value is calculated internally. Note that in the returned dataset, all NA values (if any) will be replaced with this value. |
seed | Seed. |
Details
Rank swapping sorts the values of one numeric variable by their numerical values (ranking). The restricted range is determined by the ranks of two swapped values, which cannot differ, by definition, by more than P percent of the total number of observations. Only positive P, R0 and K0 are used and only one of them must be supplied. If none is supplied, sdcMicro sets parameter R0 to 0.95 internally.
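A minimal sketch restricting the rank range directly via P (assuming, as the argument description suggests, that P is given as a percentage of the sample size):

data(testdata2)
## swap only within a rank range of about 10 percent of the sample size
swapped <- rankSwap(testdata2,
  variables = c("income", "expend", "savings"),
  P = 10)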
Value
The rank-swapped data set or a modifiedsdcMicroObj-class object.
Author(s)
Alexander Kowarik for the interface, Bernhard Meindl for improvements.
For the underlying C++ code: This work is being supported by the International Household Survey Network (IHSN) and funded by a DGF Grant provided by the World Bank to the PARIS21 Secretariat at the Organisation for Economic Co-operation and Development (OECD). This work builds on previous work which is elsewhere acknowledged.
References
Moore, Jr. R. (1996) Controlled data-swapping techniques for masking public use microdata, U.S. Bureau of the Census Statistical Research Division Report Series, RR 96-04.
Kowarik, A. and Templ, M. and Meindl, B. and Fonteneau, F. and Prantner, B.:Testing of IHSN Cpp Code and Inclusion of New Methods into sdcMicro,in: Lecture Notes in Computer Science, J. Domingo-Ferrer, I. Tinnirello(editors.); Springer, Berlin, 2012, ISBN: 978-3-642-33626-3, pp. 63-77.doi:10.1007/978-3-642-33627-0_6
Examples
data(testdata2)
data_swap <- rankSwap(
  obj = testdata2,
  variables = c("age", "income", "expend", "savings"))

## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
sdc <- rankSwap(sdc)

readMicrodata
Description
reads data from various formats into R. Used in sdcApp.
Usage
readMicrodata(
  path,
  type,
  convertCharToFac = TRUE,
  drop_all_missings = TRUE,
  ...
)
Arguments
path | a file path |
type | which format does the file have. currently allowed values are
|
convertCharToFac | (logical) if TRUE, all character vectors are automatically converted to factors |
drop_all_missings | (logical) if TRUE, all variables that contain only NA values will be dropped |
... | additional parameters. Currently used only if |
Value
a data.frame or an object of class 'simple.error'. If a stata file was read in, the resulting data.frame has an additional attribute lab in which variable and value labels are stored.
Note
if type is either 'sas', 'spss' or 'stata', values read in as NaN will be converted to NA.
Author(s)
Bernhard Meindl
Targeted Record Swapping
Description
Applies targeted record swapping on micro data, considering the identification risk of each record as well as the geographic topology.
Usage
recordSwap(data, ...)

## S3 method for class 'sdcMicroObj'
recordSwap(data, ...)

## Default S3 method:
recordSwap(
  data,
  hid,
  hierarchy,
  similar,
  swaprate = 0.05,
  risk = NULL,
  risk_threshold = 0,
  k_anonymity = 3,
  risk_variables = NULL,
  carry_along = NULL,
  return_swapped_id = FALSE,
  log_file_name = "TRS_logfile.txt",
  seed = NULL,
  ...
)
Arguments
data | must be either a micro data set in the form of a 'data.table' or 'data.frame', or an 'sdcObject', see createSdcObj. |
... | parameters passed to 'recordSwap.default()' |
hid | column index or column name in 'data' which refers to the household identifier. |
hierarchy | column indices or column names of variables in 'data' which refer to the geographic hierarchy in the micro data set. For instance county > municipality > district. |
similar | vector or list of integer vectors or column names containing similarity profiles, see details for more explanations. |
swaprate | double between 0 and 1 defining the proportion of households which should be swapped, see details for more explanations |
risk | either column indices or column names in 'data', or a 'data.table', 'data.frame' or 'matrix' indicating the risk of each record at each hierarchy level. If a 'risk' matrix is supplied, the swapping procedure will not use the k-anonymity rule but the values found in this matrix for swapping. When using the risk parameter, each member of a household is expected to have been assigned the maximum risk value within that household. If this condition is not satisfied, the risk parameter is automatically adjusted to comply with it. If the risk parameter is provided, the k-anonymity rule is suppressed. |
risk_threshold | single numeric value indicating when a household is considered "high risk", e.g. when this household must be swapped. It is only used when 'risk' is not 'NULL'. The risk threshold indicates households that have to be swapped, but be aware that households with a risk lower than the threshold, yet still high enough, may be swapped as well. Only households with risk set to 0 are never swapped. Risk and risk threshold must be greater than or equal to 0. |
k_anonymity | integer defining the threshold of high risk households (counts < k) for using the k-anonymity rule |
risk_variables | column indices or column names of variables in 'data' which will be considered for estimating the risk. Only used when the k-anonymity rule is applied. |
carry_along | integer vector indicating additional variables to swap besides the hierarchy variables. These variables do not interfere with the procedure of finding a record to swap with or with calculating risk. This parameter is only used at the end of the procedure when swapping the hierarchies. We note that the variables to be used as 'carry_along' should be at household level. In case it is detected that they are at individual level (different values within 'hid'), a warning is given. |
return_swapped_id | boolean, if 'TRUE' the output includes an additional column showing the 'hid' with which a record was swapped. The new column will have the name 'paste0(hid,"_swapped")'. |
log_file_name | character, path for writing a log file. The log file contains a list of household IDs ('hid') which could not be swapped and is only created if any such households exist. |
seed | integer defining the seed for the random number generator, for reproducibility. If 'NULL' a random seed will be set using 'sample(1e5, 1)'. |
Details
The procedure accepts a 'data.frame' or 'data.table'containing all necessary information for the record swapping, e.gparameter 'hid', 'similar', 'hierarchy', etc ...First, the micro data in 'data' is ordered by 'hid' and the identificationrisk is calculated for each record in each hierarchy level. As of rightnow only counts is used as identification risk and the inverse of countsis used as sampling probability.NOTE: It will be possible to supply an identification risk for each recordand hierarchy level which will be passed down to the C++-function. Thisis however not fully implemented.
With the parameter 'k_anonymity' a k-anonymity rule is applied to definerisky households in each hierarchy level. A household is set to riskyif counts < k_anonymity in any hierarchy level and the household needsto be swapped across this hierarchy level.For instance, having a geographic hierarchy of NUTS1 > NUTS2 > NUTS3 thecounts are calculated for each geographic variable and defined'risk_variables'. If the counts for a record falls below 'k_anonymity'for hierarchy county (NUTS1, NUTS2, ...) then this record needs to be swapped across counties.Setting 'k_anonymity = 0' disables this feature and no risky householdsare defined.
After that the targeted record swapping is applied, starting from the highest to the lowest hierarchy level and cycling through all possible geographic areas at each hierarchy level, e.g. every county, every municipality in every county, etc.
At each geographic area, a set of records to be swapped is created. In all but the lowest hierarchy level, this set is ONLY made up of records which do not fulfil the k-anonymity and have not already been swapped. Those records are swapped with records not belonging to the same geographic area which have not already been swapped beforehand. Swapping refers to the interchange of the geographic variables defined in 'hierarchy'. When a record is swapped, all other records containing the same 'hid' are swapped as well.
At the lowest hierarchy level, in every geographic area the set of records to be swapped is made up of all records which do not fulfil the k-anonymity, plus as many additional records as needed so that the proportion of swapped records in the geographic area is in coherence with the 'swaprate'. If, due to the k-anonymity condition, more records have already been swapped in this geographic area, then only the records which do not fulfil the k-anonymity are swapped.
Using the parameter 'similar' one can define similarity profiles. 'similar' needs to be a list of vectors with each list entry containing column indices of 'data'. These entries are used when searching for donor households, meaning that for a specific record the set of all donor records is made up of records which have the same values in 'similar[[1]]'. It is however important to note that these variables can only be variables related to households (not persons!). If no suitable donor can be found, the next similarity profile is used, 'similar[[2]]', and the set of all donors is then made up of all records which have the same values in the column indices in 'similar[[2]]'. This procedure continues until a donor record has been found or all similarity profiles have been used; see the sketch below.
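For example, a chain of two similarity profiles could be defined as below; 'hsize' is a household-level variable of the dummy data from createDat(), while 'hsize_group' is a hypothetical, coarser variable created here purely for illustration.

library(data.table)
dat <- as.data.table(sdcMicro::createDat(1000))
# hypothetical coarser household-size variable, for illustration only
dat[, hsize_group := cut(hsize, breaks = c(0, 2, 4, Inf), labels = FALSE)]
similar <- list(
  c("hsize"),       # first profile: donors must match the exact household size
  c("hsize_group")  # fallback profile: only the coarser size group must match
)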
'swaprate' sets the proportion of households to be swapped, where a single swap counts as swapping 2 households: the sampled household and the corresponding donor. Prior to the procedure, the swaprate is applied at the lowest hierarchy level to determine the target number of swapped households in each of the lowest-level areas. If the target numbers have a decimal part, they are randomly rounded up or down such that the total number of households swapped is in coherence with the swaprate; see the sketch below.
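The randomized rounding of the per-area targets can be pictured as follows (a schematic sketch of the idea, not the routine used internally): fractional targets are rounded up with probability equal to their fractional part, so the expected total equals the sum of the fractional targets.

set.seed(1)
targets <- c(123, 87, 240, 55) * 0.05   # fractional swap targets for 4 areas
frac <- targets - floor(targets)
rounded <- floor(targets) + (runif(length(targets)) < frac)
sum(targets)   # 25.25
sum(rounded)   # an integer whose expected value equals sum(targets)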
Value
'data.table' with swapped records.
Author(s)
Johannes Gussenbauer
Examples
# generate 10000 dummy households
library(data.table)
seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- sdcMicro::createDat(nhid)

# define parameters for swapping
k_anonymity <- 1
swaprate <- .05 # 5%
similar <- list(c("hsize"))
hier <- c("nuts1", "nuts2")
risk_variables <- c("ageGroup", "national")
hid <- "hid"

## apply record swapping
# dat_s <- recordSwap(data = dat, hid = hid, hierarchy = hier, similar = similar,
#   swaprate = swaprate, k_anonymity = k_anonymity, risk_variables = risk_variables,
#   carry_along = NULL, return_swapped_id = TRUE, seed = seed)
#
## number of swapped households
# dat_s[hid != hid_swapped, uniqueN(hid)]
#
## hierarchies are not consistently swapped
# dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]
#
## use parameter carry_along
# dat_s <- recordSwap(data = dat, hid = hid, hierarchy = hier, similar = similar,
#   swaprate = swaprate, k_anonymity = k_anonymity, risk_variables = risk_variables,
#   carry_along = c("nuts3", "lau2"), return_swapped_id = TRUE, seed = seed)
#
# dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]

Targeted Record Swapping
Description
Applies targeted record swapping on a micro data set, see ?recordSwap for details.
NOTE: This is an internal function called by the R function recordSwap(). Its only purpose is to include the C++ function recordSwap() using Rcpp.
Usage
recordSwap_cpp(data, hid, hierarchy, similar_cpp, swaprate, risk, risk_threshold,
  k_anonymity, risk_variables, carry_along, log_file_name, seed = 123456L)

Arguments
data | micro data set containing only integer values. A data.frame or data.table from R needs to be transposed beforehand so that data.size() ~ number of records and data[0].size() ~ number of variables per record. NOTE: data has to be ordered by hid beforehand. |
hid | column index in data defining the household ID |
hierarchy | column indices of variables in data defining the geographic hierarchy |
similar_cpp | List where each entry corresponds to column indices of variables in data defining a similarity profile used when searching for donor households |
swaprate | double between 0 and 1 defining the proportion of households which should be swapped, see details for more explanations |
risk | vector of vectors containing risks of each individual in each hierarchy level. |
risk_threshold | double indicating the risk threshold above which every household needs to be swapped. |
k_anonymity | integer defining the threshold of high risk households (k-anonymity). This is used as k_anonymity <= counts. |
risk_variables | column indices of variables in data which will be considered for estimating the risk |
carry_along | integer vector indicating additional variables to swap besides the hierarchy variables. These variables do not interfere with the procedure of finding a record to swap with or calculating risk. This parameter is only used at the end of the procedure when swapping the hierarchies. |
log_file_name | character, path for writing a log file. The log file contains a list of household IDs ('hid') which could not have been swapped and is only created if any such households exist. |
seed | integer defining the seed for the random number generator, for reproducibility. |
Value
Returns data set with swapped records.
Remove certain variables from the data set inside a sdc object.
Description
Delete variables without changing anything else in the sdcObject (writing NAs).
Usage
removeDirectID(obj, var)

Arguments
obj | object of class sdcMicroObj-class |
var | name of the variable(s) to be removed |
Value
the modified sdcMicroObj-class object
Author(s)
Alexander Kowarik
Examples
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2, keyVars=c('urbrur','roof'),
  numVars=c('expend','income','savings'), w='sampling_weight')
sdc <- removeDirectID(sdc, var="age")

Generate an Html-report from an sdcMicroObj
Description
Summary statistics of the original and the perturbed data set
Usage
report(obj, outdir = tempdir(), filename = "SDC-Report", title = "SDC-Report",
  internal = FALSE, verbose = FALSE)

Arguments
obj | an object of class sdcMicroObj-class |
outdir | output folder |
filename | output filename |
title | Title for the report |
internal | TRUE/FALSE, if TRUE a detailed internal report is produced, else a non-disclosive overview |
verbose | TRUE/FALSE, if TRUE, some additional information is printed. |
Details
The application of this function provides you with an html-report for your sdcMicro object that contains useful summaries about the anonymization process.
Author(s)
Matthias Templ, Bernhard Meindl
Examples
data(testdata2)
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
report(sdc)

riskyCells
Description
Allows computing risky (unweighted) combinations of key variables either up to a specified dimension or using identification levels. This mimics the approach taken in mu-argus.
Usage
riskyCells(obj, useIdentificationLevel = FALSE, threshold, ...)

Arguments
obj | a data.frame or an object of class sdcMicroObj-class |
useIdentificationLevel | (logical) specifies if tabulation should be done up to a specific dimension (useIdentificationLevel = FALSE) or using identification levels (useIdentificationLevel = TRUE) |
threshold | a numeric vector specifying the thresholds at which cells are considered to be unsafe. In case a tabulation is done up to a specific level (useIdentificationLevel = FALSE), a separate threshold can be supplied per dimension, see the examples. |
... | see possible arguments below, e.g. keyVars, maxDim and level as used in the examples |
Value
a data.table showing the number of unsafe cells and thresholds for any combination of the key variables. If the input was a sdcMicroObj object and some modifications have already been applied to the categorical key variables, the resulting output contains the number of unsafe cells both for the original and the modified data.
Author(s)
Bernhard Meindl
Examples
## data.frame method / all combinations up to maxDim
# riskyCells(obj = testdata2, keyVars = 1:5, threshold = c(50, 25, 10, 5),
#   useIdentificationLevel = FALSE, maxDim = 4)
# riskyCells(obj = testdata2, keyVars = 1:5, threshold = 10,
#   useIdentificationLevel = FALSE, maxDim = 3)

## data.frame method / using identification levels
# riskyCells(obj = testdata2, keyVars = 1:6, threshold = 20,
#   useIdentificationLevel = TRUE, level = c(1, 1, 2, 3, 3, 5))
# riskyCells(obj = testdata2, keyVars = c(1, 3, 4, 6), threshold = 10,
#   useIdentificationLevel = TRUE, level = c(1, 2, 2, 4))

## sdcMicroObj-method / all combinations up to maxDim
# testdata2[1:6] <- lapply(1:6, function(x) {
#   testdata2[[x]] <- as.factor(testdata2[[x]])
# })
# sdc <- createSdcObj(dat = testdata2,
#   keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
#   numVars = c("expend", "income", "savings"), w = "sampling_weight")
# r0 <- riskyCells(obj = sdc, useIdentificationLevel = FALSE,
#   threshold = c(20, 10, 5), maxDim = 3)

## in case key-variables have been modified, we get counts for
## original and modified data
# sdc <- groupAndRename(obj = sdc, var = "roof", before = c("5", "6", "9"), after = "5+")
# r1 <- riskyCells(obj = sdc, useIdentificationLevel = FALSE,
#   threshold = c(10, 5, 3), maxDim = 3)

## sdcMicroObj-method / using identification levels
# riskyCells(obj = sdc, useIdentificationLevel = TRUE, threshold = 10,
#   level = c(1, 1, 3, 4, 5, 5, 5))

Random sample for donor records
Description
Randomly select donor records given a probability weight vector. This sampling procedure is implemented differently than randSample_cpp to speed up performance of the C++ function recordSwap().
NOTE: This is an internal function used for testing the C++ function sampleDonor which is used inside the C++ function recordSwap().
Usage
sampleDonor_cpp(data, similar_cpp, hid, IDswap, IDswap_pool_vec, prob, seed = 123456L)

Arguments
data | micro data containing the hierarchy levels and household ID |
similar_cpp | List where each entry corresponds to column indices of variables in data defining a similarity profile used when searching for donor households |
hid | column index in data defining the household ID |
IDswap | vector containing records for which a donor needs to be sampled |
IDswap_pool_vec | set from which 'IDswap' was drawn |
prob | a vector of probability weights for obtaining the elements of the vector being sampled. |
seed | integer setting the sampling seed |
sdcApp
Description
starts the graphical user interface developed with shiny.
Usage
sdcApp(maxRequestSize = 50, debug = FALSE, theme = "IHSN", ..., shiny.server = FALSE)

Arguments
maxRequestSize | (numeric) number defining the maximum allowed file size (in megabytes) for uploaded files, defaults to 50 MB |
debug | logical if |
theme | select stylesheet for the interface. Supported choices include "IHSN" (the default) and "flatly" (see the example). |
... | arguments (e.g |
shiny.server | Setting this parameter to |
Value
starts the interactive graphical user interface which may be used to perform the anonymization process.
Examples
if (interactive()) {
  sdcApp(theme = "flatly")
}

Class "sdcMicroObj"
Description
Class to save all information about the SDC process
Usage
createSdcObj(dat, keyVars, numVars = NULL, pramVars = NULL, ghostVars = NULL,
  weightVar = NULL, hhId = NULL, strataVar = NULL, sensibleVar = NULL,
  excludeVars = NULL, options = NULL, seed = NULL, randomizeRecords = FALSE,
  alpha = 1)

undolast(object)

strataVar(object) <- value

## S4 replacement method for signature 'sdcMicroObj,characterOrNULL'
strataVar(object) <- value

Arguments
dat | The microdata set. A numeric matrix or data frame containing the data. |
keyVars | Indices or names of categorical key variables. They must, of course, match the columns of ‘dat’. |
numVars | Index or names of continuous key variables. |
pramVars | Indices or names of categorical variables considered to be pramed. |
ghostVars | if specified, a list with each element being a list of exactly two elements. The first element must be a character vector specifying exactly one variable name that was also specified as a categorical key variable (keyVars); the second element must be a character vector of variable names that should be linked to this key variable, i.e. that should receive the same suppression pattern (see the examples). |
weightVar | Indices or name determining the vector of sampling weights. |
hhId | Index or name of the cluster ID (if available). |
strataVar | Indices or names of stratification variables. |
sensibleVar | Indices or names of sensible variables (for l-diversity) |
excludeVars | which variables of dat should not be included in the sdcMicroObj; they are dropped from the origData slot (see the examples) |
options | additional options (if specified, a list must be used as input) |
seed | (numeric) number specifying the seed which will be set to allow for reproducibility. The number will be rounded and saved as element 'seed' in slot options. |
randomizeRecords | (logical) if |
alpha | numeric between 0 and 1 specifying how much keys containing missing values (NAs) should contribute to the calculation of frequency counts |
object | a sdcMicroObj-class object |
value | the new value for slot @strataVar (used in the strataVar<- replacement method) |
Value
a sdcMicroObj-class object
an object of class sdcMicroObj with modified slot @strataVar
Objects from the Class
Objects can be created by calls of the form new("sdcMicroObj", ...).
Author(s)
Bernhard Meindl, Alexander Kowarik, Matthias Templ, Elias Rut
References
Templ, M. and Meindl, B. and Kowarik, A.: Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro, Journal of Statistical Software, 67 (4), 1–36, 2015. doi:10.18637/jss.v067.i04
Examples
## we can also specify ghost (linked) variables
## these variables are linked to some categorical key variables
## and have the same suppression pattern as the variable that they
## are linked to after localSuppression() has been applied
data(testdata)
testdata$electcon2 <- testdata$electcon
testdata$electcon3 <- testdata$electcon
testdata$water2 <- testdata$water

keyVars <- c("urbrur","roof","walls","water","electcon","relat","sex")
numVars <- c("expend","income","savings")
w <- "sampling_weight"

## we want to make sure that some variables not used as key-variables
## have the same suppression pattern as variables that have been
## selected as key variables. Thus, we are using 'ghost'-variables.
ghostVars <- list()

## we want variables 'electcon2' and 'electcon3' to be linked
## to key-variable 'electcon'
ghostVars[[1]] <- list()
ghostVars[[1]][[1]] <- "electcon"
ghostVars[[1]][[2]] <- c("electcon2","electcon3")

## we want variable 'water2' to be linked to key-variable 'water'
ghostVars[[2]] <- list()
ghostVars[[2]][[1]] <- "water"
ghostVars[[2]][[2]] <- "water2"

## create the sdcMicroObj
obj <- createSdcObj(testdata, keyVars=keyVars, numVars=numVars, w=w, ghostVars=ghostVars)

## apply 3-anonymity to selected key variables
obj <- kAnon(obj, k=3); obj

## check, if the suppression patterns are identical
manipGhostVars <- get.sdcMicroObj(obj, "manipGhostVars")
manipKeyVars <- get.sdcMicroObj(obj, "manipKeyVars")
all(is.na(manipKeyVars$electcon) == is.na(manipGhostVars$electcon2))
all(is.na(manipKeyVars$electcon) == is.na(manipGhostVars$electcon3))
all(is.na(manipKeyVars$water) == is.na(manipGhostVars$water2))

## exclude some variables
obj <- createSdcObj(testdata, keyVars=c("urbrur","roof","walls"), numVars="savings",
  weightVar=w, excludeVars=c("relat","electcon","hhcivil","ori_hid","expend"))
colnames(get.sdcMicroObj(obj, "origData"))

Creates a household level file from a dataset with a household structure.
Description
It removes individual level variables and selects one record per household based on a household ID. The function can also be used for other hierarchical structures.
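Conceptually, the operation amounts to keeping one row per household ID together with the household-level columns; a minimal base-R sketch (not the function's actual implementation) is:

data(testdata, package = "sdcMicro")
hhVars <- c("urbrur", "roof", "walls", "water", "electcon", "household_weights")
# keep the first record of each household and only household-level columns
x_hh_sketch <- testdata[!duplicated(testdata$ori_hid), c("ori_hid", hhVars)]
nrow(x_hh_sketch)   # one row per household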
Usage
selectHouseholdData(dat, hhId, hhVars)

Arguments
dat | a data.frame with the full dataset |
hhId | name of the variable with the household (cluster) ID |
hhVars | character vector with names of all household level variables |
Value
a data.frame with only household level variables and one record per household
Note
It is of great importance that users include variables containing information on household IDs and sampling weights in hhVars.
Author(s)
Thijs Benschop and Bernhard Meindl
Examples
## ori-hid: household-ids; household_weights: sampling weights for households
x_hh <- selectHouseholdData(dat=testdata, hhId="ori_hid",
  hhVars=c("urbrur", "roof", "walls", "water", "electcon", "household_weights"))

set.sdcMicroObj
Description
modify sdcMicroObj-class objects depending on argument type
Usage
set.sdcMicroObj(object, type, input)

Arguments
object | a sdcMicroObj-class object |
type | a character vector of length 1 defining what to calculate/return/modify; the slot with the corresponding name will be replaced by the content of input |
input | a list depending on argument type |
Value
a sdcMicroObj-class object
Examples
sdc <- createSdcObj(testdata2,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars=c('expend','income','savings'), w='sampling_weight')
ind_pram <- match(c("sex"), colnames(testdata2))
get.sdcMicroObj(sdc, type="pramVars")
sdc <- set.sdcMicroObj(sdc, type="pramVars", input=list(ind_pram))
get.sdcMicroObj(sdc, type="pramVars")

Define Swap-Levels
Description
Define the hierarchy levels over which a record needs to be swapped according to the risk variables.
NOTE: This is an internal function used for testing the C++ function setLevels() which is applied inside recordSwap().
Usage
setLevels_cpp(risk, risk_threshold)

Arguments
risk | vector of vectors containing risks of each individual in each hierarchy level. |
risk_threshold | double defining the risk threshold beyond which a record/household needs to be swapped. This is understood as risk >= risk_threshold. |
Value
Integer vector with the hierarchy level over which each record needs to be swapped.
Calculate Risk
Description
Calculate the risk for records to be swapped and for donor records. Risks are defined by 1/counts, where counts is the number of records with the same values for the specified risk_variables in each geographic hierarchy level. This risk will be used as sampling probability for both the sampling set and the donor set; see the sketch below.
NOTE: This is an internal function used for testing the C++ function setRisk which is used inside the C++ function recordSwap().
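A minimal plain-R sketch of the 1/counts risk at one hierarchy level (illustration only, not the C++ routine), again using the dummy data from createDat() and the risk variables of the recordSwap() examples:

library(data.table)
dat <- as.data.table(sdcMicro::createDat(1000))
dat[, counts := .N, by = .(nuts1, ageGroup, national)]  # pattern frequencies per nuts1 area
dat[, risk := 1 / counts]                               # risk = inverse of counts
head(dat[, .(hid, nuts1, counts, risk)])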
Usage
setRisk_cpp(data, hierarchy, risk_variables, hid)

Arguments
data | micro data set containing only numeric values. |
hierarchy | column indices of variables in data defining the geographic hierarchy |
risk_variables | column indices of variables in data which will be considered for estimating the risk |
hid | column index in data defining the household ID |
Show
Description
show a sdcMicro object
Usage
## S4 method for signature 'sdcMicroObj'
show(object)

Arguments
object | an sdcmicro obj |
Value
a sdcMicro object
Author(s)
Bernhard Meindl
Shuffling and EGADP
Description
Data shuffling and General Additive Data Perturbation.
Usage
shuffle(obj, form, method = "ds", weights = NULL, covmethod = "spearman",
  regmethod = "lm", gadp = TRUE)

Arguments
obj | An object of class sdcMicroObj or a data.frame including the data. |
form | An object of class “formula” (or one that can be coerced to that class): a symbolic description of the model to be fitted. The response has to consist of at least two variables, and the response variables have to be of class numeric. The response variables belong to the numeric key variables (quasi-identifiers of numeric scale). The predictors can be distributed in any way (numeric, factor, ordered factor). |
method | currently either the original form of data shuffling(“ds” - default), “mvn” or “mlm”, see the detailssection. The last method is in experimental mode and almost untested. |
weights | Survey sampling weights. Automatically chosen when obj is of class sdcMicroObj-class. |
covmethod | Method for covariance estimation. “spearman”, “pearson” and “mcd” are possible. For the latter, the implementation in package robustbase is used. |
regmethod | Method for multivariate regression. “lm” and “MM” are possible. For method “MM”, the function “rlm” from package MASS is applied. |
gadp | TRUE, if the egadp results from a fit on the original data are returned. |
Details
Perturbed values for the sensitive variables are generated. The sensitive variables have to be stored as responses in the argument ‘form’, which is the usual formula interface for regression models in R.
For method “ds” the EGADP method is applied on the norm inverse percentiles. Shuffling then ranks the original values according to the GADP output. For further details, please see the references.
Method “mvn” uses a simplification and draws from the normal copulas directly before these draws are shuffled.
Method “mlm” is also a simplification. A linear model is applied, and the expected values are used as perturbed values before shuffling is applied; the shuffling step itself is sketched below.
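The shuffling step can be pictured as a reverse mapping of ranks; the following lines are a conceptual sketch only, not the code used by shuffle(): the released values are the original values, re-ordered so that their ranks follow the ranks of the model-based (perturbed) values.

set.seed(1)
x <- rlnorm(10, meanlog = 10)        # original sensitive variable
y <- x + rnorm(10, sd = sd(x) / 2)   # stand-in for a model-based perturbation
shuffled <- sort(x)[rank(y, ties.method = "first")]
all(sort(shuffled) == sort(x))       # same set of values, different order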
Value
If ‘obj’ is of class sdcMicroObj-class the corresponding slots are filled, like manipNumVars, risk and utility. If ‘obj’ is of class “data.frame” an object of class “micro” with the following entities is returned:
shConf | the shuffled numeric key variables |
egadp | the perturbed (using the gadp method) numeric key variables |
Note
In this version, the covariance method chosen is used for any covariance and correlation estimations in the whole gadp and shuffling function.
Author(s)
Matthias Templ, Alexander Kowarik, Bernhard Meindl
References
K. Muralidhar, R. Parsa, R. Sarathy (1999). A general additive data perturbation method for database security. Management Science, 45, 1399-1415.
K. Muralidhar, R. Sarathy (2006). Data shuffling - a new masking approach for numerical data. Management Science, 52(5), 658-670, 2006.
M. Templ, B. Meindl (2008). Robustification of Microdata Masking Methods and the Comparison with Existing Methods, in: Lecture Notes on Computer Science, J. Domingo-Ferrer, Y. Saygin (editors); Springer, Berlin/Heidelberg, 2008, ISBN: 978-3-540-87470-6, pp. 14-25.
See Also
Examples
data(Prestige, package="carData")
form <- formula(income + education ~ women + prestige + type, data=Prestige)
sh <- shuffle(obj=Prestige, form)
plot(Prestige[,c("income", "education")])
plot(sh$sh)
colMeans(Prestige[,c("income", "education")])
colMeans(sh$sh)
cor(Prestige[,c("income", "education")], method="spearman")
cor(sh$sh, method="spearman")

## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars=c('expend','income','savings'), w='sampling_weight')
sdc <- shuffle(sdc, method=c('ds'), regmethod=c('lm'), covmethod=c('spearman'),
  form=savings+expend ~ urbrur+walls)

subsetMicrodata
Description
allows restricting the original data to only a subset. This may be useful to test some anonymization methods. This function will only be used in the graphical user interface sdcApp.
Usage
subsetMicrodata(obj, type, n)

Arguments
obj | an object of class sdcMicroObj-class |
type | algorithm used to sample from original microdata. Currently supported choices are
|
n | numeric vector of length 1 specifying the specific parameter with respect to argument type |
Value
an object of class sdcMicroObj-class with modified slot @origData.
Author(s)
Bernhard Meindl
Suda2: Detecting Special Uniques
Description
SUDA risk measure for data from (stratified) simple random sampling.
Usage
suda2(obj, ...)

Arguments
obj | a data.frame or an object of class sdcMicroObj-class |
... | see arguments below
|
Details
Suda 2 is a recursive algorithm for finding Minimal Sample Uniques. The algorithm generates all possible variable subsets of the defined categorical key variables and scans them for unique patterns in the subsets of variables. The fewer variables needed to obtain uniqueness, the higher the risk of the corresponding observation; a toy illustration of this idea is given below.
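The idea of scanning variable subsets for sample uniques can be illustrated with a small base-R toy example (this is not the recursive SUDA2 algorithm itself):

d <- data.frame(a = c(1, 1, 2, 2, 1), b = c(1, 2, 1, 2, 1))
subset_counts <- function(cols) {
  key <- do.call(paste, c(d[cols], sep = "\r"))
  ave(rep(1L, nrow(d)), key, FUN = length)  # frequency of each record's pattern
}
subset_counts(c("a", "b"))  # records 2, 3 and 4 are unique on the pair {a, b} ...
subset_counts("a")          # ... but not on "a" alone ...
subset_counts("b")          # ... nor on "b" alone: {a, b} is their minimal sample unique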
Value
A modified sdcMicroObj object or the following list:
- ContributionPercent: the contribution of each key variable to the SUDA score, calculated for each row.
- score: the suda score.
- disScore: the dis suda score.
- attribute_contributions: a data.frame showing how much of the total risk is contributed by each variable, stored in the following two variables: variable (the name of the variable) and contribution (how much risk the variable contributes to the total risk).
- attribute_level_contributions: the risks of each attribute level as a data.frame with the following three columns: variable (the variable name), attribute (the relevant level codes) and contribution (the risk of this level within the variable).
Note
Since version >5.0.2, the computation of suda-scores has changed and is now by default as described in the original paper by Elliot et al.
Author(s)
Alexander Kowarik and Bernhard Meindl (based on the C++ code from the Organisation For Economic Co-Operation And Development).
For the C++ code: This work is being supported by the International Household Survey Network and funded by a DGF Grant provided by the World Bank to the PARIS21 Secretariat at the Organisation for Economic Co-operation and Development (OECD). This work builds on previous work which is elsewhere acknowledged.
References
C. J. Skinner; M. J. Elliot (20xx) A Measure of Disclosure Risk for Microdata. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 64 (4), pp 855–867.
M. J. Elliot, A. Manning, K. Mayes, J. Gurd and M. Bane (20xx) SUDA: A Program for Detecting Special Uniques, Using DIS to Modify the Classification of Special Uniques
Anna M. Manning, David J. Haglin, John A. Keane (2008) A recursive search algorithm for statistical disclosure assessment. Data Min Knowl Disc 16:165–196
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Summary method for objects from class freqCalc
Description
Summary method for objects of class ‘freqCalc’ to provide information about local suppressions.
Usage
## S3 method for class 'freqCalc'
summary(object, ...)

Arguments
object | object from class freqCalc |
... | Additional arguments passed through. |
Details
Shows the amount of local suppressions on each variable in which local suppression was applied.
Value
Information about local suppression in each variable (only if a local suppression is already done).
Author(s)
Matthias Templ
See Also
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
f <- freqCalc(francdat, keyVars=c(2,4,5,6), w=8)
f
f$fk
f$Fk

## individual risk calculation:
indivf <- indivRisk(f)
indivf$rk

## Local Suppression
localS <- localSupp(f, keyVar=2, threshold=0.25)
f2 <- freqCalc(localS$freqCalc, keyVars=c(4,5,6), w=8)
summary(f2)

Summary method for objects from class micro
Description
Summary method for objects from class ‘micro’.
Usage
## S3 method for class 'micro'
summary(object, ...)

Arguments
object | objects from class micro |
... | Additional arguments passed through. |
Details
This function computes several measures of information loss; the returned components are listed below, and a small sketch of one such measure is given after the table.
Value
meanx | A conventional summary of the original data |
meanxm | A conventional summary of the microaggregated data |
amean | average relative absolute deviation of means |
amedian | average relative absolute deviation of medians |
aonestep | average relative absolute deviation of onestep from median |
devvar | average relative absolute deviation of variances |
amad | average relative absolute deviation of the mad |
acov | average relative absolute deviation of covariances |
arcov | average relative absolute deviation of robust (with mcd) covariances |
acor | average relative absolute deviation of correlations |
arcor | average relative absolute deviation of robust (with mcd) correlations |
acors | average relative absolute deviation of rank-correlations |
adlm | average absolute deviation of lm regression coefficients (without intercept) |
adlts | average absolute deviation of lts regression coefficients (without intercept) |
apcaload | average absolute deviation of pca loadings |
apppacaload | average absolute deviation of robust (with projection pursuit approach) pca loadings |
atotals | average relative absolute deviation of totals |
pmtotals | average relative deviation of totals |
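As an illustration of how such a deviation measure can be computed, an amean-type quantity can be sketched as below; this is one plausible formulation for illustration only, and the exact formulas used by summary.micro() may differ.

# "average relative absolute deviation of means" between original data x
# and perturbed data xm (illustrative formulation only)
amean_sketch <- function(x, xm) {
  100 * mean(abs(colMeans(xm) - colMeans(x)) / abs(colMeans(x)))
}
set.seed(1)
x  <- matrix(rlnorm(300), ncol = 3)
xm <- x + rnorm(300, sd = 0.05)   # stand-in for microaggregated/perturbed data
amean_sketch(x, xm)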
Author(s)
Matthias Templ
References
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php
See Also
Examples
data(Tarragona)
m1 <- microaggregation(Tarragona, method = "onedims", aggr = 3)
summary(m1)

Summary method for objects from class pram
Description
Summary method for objects from class ‘pram’ to provide information about transitions.
Usage
## S3 method for class 'pram'
summary(object, ...)

Arguments
object | object from class ‘pram’ |
... | Additional arguments passed through. |
Details
Shows various information about the transitions.
Value
The summary of object from class ‘pram’.
Author(s)
Matthias Templ and Bernhard Meindl
References
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php
See Also
Examples
data(free1)
x <- as.factor(free1[,"MARSTAT"])
x2 <- pram(x)
x2
summary(x2)

A real-world data set on household income and expenditures
Description
Example microdata sets (testdata and testdata2) with household-level income and expenditure variables, used throughout the package examples.
Format
testdata: a data frame with 4580 observations on the following 15 variables.
- urbrur
a numeric vector
- roof
a numeric vector
- walls
a numeric vector
- water
a numeric vector
- electcon
a numeric vector
- relat
a numeric vector
- sex
a numeric vector
- age
a numeric vector
- hhcivil
a numeric vector
- expend
a numeric vector
- income
a numeric vector
- savings
a numeric vector
- ori_hid
a numeric vector
- sampling_weight
a numeric vector
- household_weights
a numeric vector
testdata2: A data frame with 93 observations on the following 19 variables.
- urbrur
a numeric vector
- roof
a numeric vector
- walls
a numeric vector
- water
a numeric vector
- electcon
a numeric vector
- relat
a numeric vector
- sex
a numeric vector
- age
a numeric vector
- hhcivil
a numeric vector
- expend
a numeric vector
- income
a numeric vector
- savings
a numeric vector
- ori_hid
a numeric vector
- sampling_weight
a numeric vector
- represent
a numeric vector
- category_count
a numeric vector
- relat2
a numeric vector
- water2
a numeric vector
- water3
a numeric vector
References
The International Household Survey Network, www.ihsn.org
Examples
head(testdata)
head(testdata2)

Top and Bottom Coding
Description
Function for Top and Bottom Coding.
Usage
topBotCoding(obj, value, replacement, kind = "top", column = NULL)

Arguments
obj | a numeric vector, a data.frame or an object of class sdcMicroObj-class |
value | limit, from where it should be top- or bottom-coded |
replacement | replacement value. |
kind | top or bottom |
column | variable name in case the input is a data.frame or an object of class sdcMicroObj-class |
Details
Extreme values larger or lower than value are replaced by a different value (replacement) in order to reduce the disclosure risk.
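For a plain numeric vector the operation reduces to a simple replacement; a base-R sketch of the idea (not the function's internals):

x <- c(1200, 3400, 8700, 15000, 9100)
value <- 9000; replacement <- 9100
x[x > value] <- replacement   # top coding: values above the limit are capped
x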
Value
Top or bottom coded data or a modified sdcMicroObj-class object.
Note
top-/bottom coding of factors is no longer possible as of sdcMicro >=4.7.0
Author(s)
Matthias Templ and Bernhard Meindl
References
Templ, M. and Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67 (4), 1–36, 2015. doi:10.18637/jss.v067.i04
See Also
Examples
data(free1)
res <- topBotCoding(free1[,"DEBTS"], value=9000, replacement=9100, kind="top")
max(res)

data(testdata)
range(testdata$age)
testdata <- topBotCoding(testdata, value=80, replacement=81, kind="top", column="age")
range(testdata$age)

## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars=c('expend','income','savings'), w='sampling_weight')
sdc <- topBotCoding(sdc, value=500000, replacement=1000, column="income")
testdataout <- extractManipData(sdc)

Comparison of different microaggregation methods
Description
A Function for the comparison of different perturbation methods.
Usage
valTable(x,
  method = c("simple", "onedims", "clustpppca", "addNoise: additive", "swappNum"),
  measure = "mean", clustermethod = "clara", aggr = 3, nc = 8, transf = "log",
  p = 15, noise = 15, w = 1:dim(x)[2], delta = 0.1)

Arguments
x | a |
method | character vector defining names of microaggregation-, adding-noiseor rank swapping methods. |
measure | FUN for aggregation. Possible values are mean (default), median, trim, onestep. |
clustermethod | clustermethod, if a method will need a clustering procedure |
aggr | aggregation level (default=3) |
nc | number of clusters. Necessary, if a method will need a clustering procedure |
transf | Transformation of variables before clustering. |
p | Swapping range, if method swappNum has been chosen |
noise | noise addition, if an addNoise method has been chosen |
w | variables for swapping, if method swappNum has been chosen |
delta | parameter for adding noise method |
Details
Tabularize the output from summary.micro(). Will be enhanced to all perturbation methods in future versions.
Methods for adding noise should be named via addNoise:{method}, e.g. addNoise:correlated, where {method} specifies the desired method as described in addNoise(); see the call sketch below.
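For instance, a comparison that includes a correlated-noise variant could be requested as sketched below; this is an untested call sketch following the naming scheme above, with the data and the remaining methods taken from the Examples section.

data(Tarragona)
valTable(
  x = Tarragona[100:200, ],
  method = c("simple", "onedims", "addNoise:correlated"),
  noise = 15
)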
Value
Measures of information loss, split for the comparison of different methods.
Author(s)
Matthias Templ
References
Templ, M. and Meindl, B., Software Development for SDC in R, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 4302, pp. 347-359, 2006.
See Also
microaggregation(), summary.micro()
Examples
data(Tarragona)
valTable(
  x = Tarragona[100:200, ],
  method = c("simple", "onedims", "pca"))

Change a keyVariable of an object of class sdcMicroObj-class from Numeric to Factor or from Factor to Numeric
Description
Change the scale of a variable
Usage
varToFactor(obj, var)

varToNumeric(obj, var)

Arguments
obj | object of class |
var | name of the keyVariable to change |
Value
the modified sdcMicroObj-class object
Examples
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars=c('expend','income','savings'), w='sampling_weight')
sdc <- varToFactor(sdc, var="urbrur")

writeSafeFile
Description
writes an anonymized dataset to a file. This function should be used in the graphical user interface sdcApp() only.
Usage
writeSafeFile(obj, format, randomizeRecords, fileOut, ...)

Arguments
obj | a sdcMicroObj-class object |
format | (character) specifies the output file format. Acceptedvalues are:
|
randomizeRecords | (logical) specifies, if the output records shouldbe randomized. The following options are possible:
|
fileOut | (character) file to which output should be written |
... | optional arguments used for |
Value
invisible NULL if the file was successfully written
Author(s)
Bernhard Meindl