| Type: | Package |
| Title: | Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation |
| Version: | 5.7.9 |
| Date: | 2025-08-01 |
| Description: | Data from statistical agencies and other institutions are mostly confidential. This package, introduced in Templ, Kowarik and Meindl (2015) <doi:10.18637/jss.v067.i04>, can be used for the generation of anonymized (micro)data, i.e. for the creation of public- and scientific-use files. The theoretical basis for the methods implemented can be found in Templ (2017) <doi:10.1007/978-3-319-50272-4>. Various risk estimation and anonymization methods are included. Note that the package includes a graphical user interface, published in Meindl and Templ (2019) <doi:10.3390/a12090191>, that allows the use of various methods of this package. |
| LazyData: | TRUE |
| ByteCompile: | TRUE |
| LinkingTo: | Rcpp |
| Depends: | R (≥ 2.10) |
| Suggests: | laeken, parallel, testthat |
| Imports: | utils, stats, graphics, car, carData, rmarkdown, knitr, data.table, xtable, robustbase, cluster, MASS, e1071, tools, Rcpp, methods, ggplot2, shiny (≥ 1.4.0), haven, rhandsontable, DT, shinyBS, prettydoc, VIM (≥ 4.7.0) |
| License: | GPL-2 |
| URL: | https://github.com/sdcTools/sdcMicro |
| Collate: | '0classes.r' 'addGhostVars.R' 'addNoise.r' 'aux_functions.r' 'createDat.R' 'createNewID.R' 'dataGen.r' 'dataSets.R' 'dRisk.R' 'dRiskRMD.R' 'dUtility.R' 'freqCalc.r' 'globalRecode.R' 'groupAndRename.R' 'GUIfunctions.R' 'indivRisk.R' 'infoLoss.R' 'LocalRecProg.R' 'localSupp.R' 'localSuppression.R' 'mdav.R' 'measure_risk.R' 'methods.r' 'microaggregation.R' 'modRisk.R' 'muargus_compatibility_functions.R' 'mvTopCoding.R' 'plotFunctions.R' 'plotMicro.R' 'pram.R' 'rankSwap.R' 'RcppExports.R' 'recordSwap.R' 'report.R' 'riskyCells.R' 'sdcMicro-package.R' 'shuffle.R' 'suda2.R' 'timeEstimation.R' 'topBotCoding.R' 'valTable.R' 'zzz.R' 'printFunctions.R' 'mafast.R' 'maG.R' 'sdcApp.R' 'show_sdcMicroObj.R' |
| RoxygenNote: | 7.3.2 |
| VignetteBuilder: | knitr |
| Encoding: | UTF-8 |
| NeedsCompilation: | yes |
| Packaged: | 2025-08-06 11:24:10 UTC; matthias |
| Author: | Matthias Templ |
| Maintainer: | Matthias Templ <matthias.templ@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2025-08-22 14:40:02 UTC |
sdcMicro: Statistical Disclosure Control Methods for Anonymization of Data and Risk Estimation
Description
Data from statistical agencies and other institutions are mostly confidential. This package, introduced in Templ, Kowarik and Meindl (2015) doi:10.18637/jss.v067.i04, can be used for the generation of anonymized (micro)data, i.e. for the creation of public- and scientific-use files. The theoretical basis for the methods implemented can be found in Templ (2017) doi:10.1007/978-3-319-50272-4. Various risk estimation and anonymization methods are included. Note that the package includes a graphical user interface, published in Meindl and Templ (2019) doi:10.3390/a12090191, that allows the use of various methods of this package.
This package includes all methods of the popular software mu-Argus plus several new methods. Compared with mu-Argus, the advantages of this package are that the results are fully reproducible even with the included GUI, that the package can be used in batch mode from other software, that the functions can be used in a very flexible way, that everybody can inspect the source code, and that no time-consuming metadata management is necessary. However, users should have detailed knowledge of SDC when applying the methods to data.
Details
The package is programmed using S4 classes and comes with a well-defined class structure.
The implemented graphical user interface (GUI) for microdata protection serves as an easy-to-handle tool for users who want to use the sdcMicro package for statistical disclosure control but are not used to the native R command line interface. In addition, interactions between objects resulting from the anonymization process are handled within the GUI. This allows frequency counts, individual risk, information loss and data utility to be automatically recalculated and displayed after each anonymization step. Moreover, the code for every anonymization step carried out within the GUI is saved in a script which can then easily be modified and reloaded.
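A minimal way to start the GUI from an interactive R session (assuming the launcher function sdcApp(), whose source file is listed in the Collate field above):

library(sdcMicro)
## opens the shiny-based GUI in the browser
sdcApp()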
| Package: | sdcMicro |
| Type: | Package |
| Version: | 5.7.9 |
| Date: | 2025-08-01 |
| License: | GPL-2 |
Author(s)
Maintainer: Matthias Templ <matthias.templ@gmail.com> (ORCID)
Authors:
Bernhard Meindl <Bernhard.Meindl@statistik.gv.at>
Alexander Kowarik <Alexander.Kowarik@statistik.gv.at> (ORCID)
Johannes Gussenbauer <johannes.gussenbauer@statistik.gv.at>
Other contributors:
Organisation For Economic Co-Operation And Development (initially published C++ code (under LGPL) for rank swapping, mdav-microaggregation, suda2 and other (hierarchical) risk measures) [copyright holder]
Statistics Netherlands (microaggregation C++ code (under EUPL v1.1)) [copyright holder]
Pascal Heus (original measure-threshold C++ code (under LGPL)) [copyright holder]
References
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Templ, M., Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67(4), 1-36, 2015. doi:10.18637/jss.v067.i04
Templ, M. and Meindl, B. Practical Applications in Statistical Disclosure Control Using R, Privacy and Anonymity in Information Management Systems, book chapter, Springer London, pp. 31-62, 2010. doi:10.1007/978-1-84996-238-4_3
Kowarik, A., Templ, M., Meindl, B., Fonteneau, F. and Prantner, B.: Testing of IHSN C++ Code and Inclusion of New Methods into sdcMicro, in: Lecture Notes in Computer Science, J. Domingo-Ferrer, I. Tinnirello (editors); Springer, Berlin, 2012, ISBN: 978-3-642-33626-3, pp. 63-77. doi:10.1007/978-3-642-33627-0_6
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php
Templ, M. New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.
See Also
Useful links: https://github.com/sdcTools/sdcMicro
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
f <- freqCalc(francdat, keyVars = c(2, 4:6), w = 8)
f
f$fk
f$Fk
## dealing with missing values:
x <- francdat
x[3,5] <- NA
x[4,2] <- x[4,4] <- NA
x[5,6] <- NA
x[6,2] <- NA
f2 <- freqCalc(x, keyVars = c(2, 4:6), w = 8)
f2$fk
f2$Fk
## individual risk calculation:
indivf <- indivRisk(f)
indivf$rk
## Local Suppression
localS <- localSupp(f, keyVar = 2, threshold = 0.25)
f2 <- freqCalc(localS$freqCalc, keyVars = c(2, 4:6), w = 8)
indivf2 <- indivRisk(f2)
indivf2$rk
## select another keyVar and run localSupp() once again,
## if you think the table is not fully protected
data(free1)
free1 <- as.data.frame(free1)
f <- freqCalc(x = free1, keyVars = 1:3, w = 30)
ind <- indivRisk(f)
## and now you can use the interactive plot for individual risk objects:
## plot(ind)
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
l1 <- localSuppression(obj = francdat, keyVars = c(2, 4:6), importance = c(1, 3, 2, 4))
l1
l1$x
l2 <- localSuppression(obj = francdat, keyVars = c(2, 4:6), k = 2)
l3 <- localSuppression(obj = francdat, keyVars = c(2, 4:6), k = 4)
## Global recoding:
data(free1)
free1 <- as.data.frame(free1)
free1[, "AGE"] <- globalRecode(obj = free1[, "AGE"],
  breaks = c(1,9,19,29,39,49,59,69,100), labels = 1:8)
## Top coding:
topBotCoding(obj = free1[, "DEBTS"], value = 9000, replacement = 9100, kind = "top")
## Numerical Rank Swapping:
data(Tarragona)
Tarragona1 <- rankSwap(Tarragona, P = 10, K0 = NULL, R0 = NULL)
## Microaggregation:
m1 <- microaggregation(Tarragona, method = "onedims", aggr = 3)
m2 <- microaggregation(Tarragona, method = "pca", aggr = 3)
## using a subset because of computation time
valTable(Tarragona[1:50, ], method = c("simple", "onedims", "pca"))
data(microData)
microData <- as.data.frame(microData)
m_micro <- microaggregation(microData, method = "mdav")
summary(m_micro)
plotMicro(m_micro, 1, which.plot = 1) # not enough observations...
data(free1)
free1 <- as.data.frame(free1)
plotMicro(x = microaggregation(free1[, 31:34], method = "onedims"), p = 1, which.plot = 1)
## disclosure risk (interval) and data utility:
m1 <- microaggregation(Tarragona, method = "onedims", aggr = 3)
dRisk(obj = Tarragona, xm = m1$mx)
dRisk(obj = Tarragona, xm = m2$mx)
dUtility(obj = Tarragona, xm = m1$mx)
dUtility(obj = Tarragona, xm = m2$mx)
## Fast generation of synthetic data with approximately
## the same covariance matrix as the original one.
data(mtcars)
cov(mtcars[, 4:6])
df_gen <- dataGen(obj = mtcars[, 4:6], n = 200)
cov(df_gen)
pairs(mtcars[, 4:6])
pairs(df_gen)
## Post-Randomization (PRAM)
x <- factor(sample(1:4, 250, replace = TRUE))
pr1 <- pram(x)
length(which(pr1$x_pram == x))
summary(pr1)
x2 <- factor(sample(1:4, 250, replace = TRUE))
length(which(pram(x2)$x_pram == x2))
data(free1)
marstat <- as.factor(free1[, "MARSTAT"])
marstatPramed <- pram(marstat)
summary(marstatPramed)
## The same functionality can also be applied to `sdcMicroObj`-objects
data(testdata)
## undo-functionality is by default restricted to data sets
## with <= `1e5` rows; to modify, env-var `sdcMicro_maxsize_undo`
## can be changed before creating a problem instance
Sys.setenv("sdcMicro_maxsize_undo" = 1e6)
## create an object
testdata$water <- factor(testdata$water)
sdc <- createSdcObj(dat = testdata,
  keyVars = c("urbrur", "roof", "walls", "electcon", "water", "relat", "sex"),
  numVars = c("expend", "income", "savings"), w = "sampling_weight")
head(sdc@manipNumVars)
## Display risk-measures
sdc@risk$global
sdc <- dRisk(sdc)
sdc@risk$numeric
## Generation of synthetic data
synthdat <- dataGen(sdc)
## use addNoise with default parameters (not suggested)
sdc <- addNoise(sdc, variables = c("expend", "income"))
head(sdc@manipNumVars)
sdc@risk$numeric
## undo last step (remove adding noise)
sdc <- undolast(sdc)
head(sdc@manipNumVars)
sdc@risk$numeric
## apply addNoise() with custom parameters
sdc <- addNoise(sdc, noise = 0.2)
head(sdc@manipNumVars)
sdc@risk$numeric
## LocalSuppression
sdc <- undolast(sdc)
head(sdc@risk$individual)
sdc@risk$global
sdc <- localSuppression(sdc)
head(sdc@risk$individual)
sdc@risk$global
## microaggregation
sdc <- undolast(sdc)
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
sdc <- microaggregation(sdc)
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
## Post-Randomization
sdc <- undolast(sdc)
head(sdc@risk$individual)
sdc@risk$global
sdc <- pram(sdc, variables = "water")
head(sdc@risk$individual)
sdc@risk$global
## rankSwap
sdc <- undolast(sdc)
head(sdc@risk$individual)
sdc@risk$global
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
sdc <- rankSwap(sdc)
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
head(sdc@risk$individual)
sdc@risk$global
## topBotCoding
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
sdc@risk$numeric
sdc <- topBotCoding(obj = sdc, value = 60000000, replacement = 62000000, column = "income")
head(get.sdcMicroObj(sdc, type = "manipNumVars"))
sdc@risk$numeric
## LocalRecProg
data(testdata2)
keyVars <- c("urbrur", "roof", "walls", "water", "sex")
w <- "sampling_weight"
sdc <- createSdcObj(testdata2, keyVars = keyVars, weightVar = w)
sdc@risk$global
sdc <- LocalRecProg(sdc)
sdc@risk$global
## Model-based risks using a formula
form <- as.formula(paste("~", paste(keyVars, collapse = "+")))
sdc <- modRisk(sdc, method = "default", formulaM = form)
get.sdcMicroObj(sdc, "risk")$model
sdc <- modRisk(sdc, method = "CE", formulaM = form)
get.sdcMicroObj(sdc, "risk")$model
sdc <- modRisk(sdc, method = "PML", formulaM = form)
get.sdcMicroObj(sdc, "risk")$model
sdc <- modRisk(sdc, method = "weightedLLM", formulaM = form)
get.sdcMicroObj(sdc, "risk")$model
sdc <- modRisk(sdc, method = "IPF", formulaM = form)
get.sdcMicroObj(sdc, "risk")$model
Census data set
Description
This test data set was obtained on July 27, 2000 using the public use Data Extraction System of the U.S. Bureau of the Census.
Format
A data frame sampled from year 1995 with 1080 observations on the following 13 variables.
- AFNLWGT
Final weight (2 implied decimal places)
- AGI
Adjusted gross income
- EMCONTRB
Employer contribution for hlth insurance
- FEDTAX
Federal income tax liability
- PTOTVAL
Total person income
- STATETAX
State income tax liability
- TAXINC
Taxable income amount
- POTHVAL
Total other persons income
- INTVAL
Amt of interest income
- PEARNVAL
Total person earnings
- FICA
Soc. sec. retirement payroll deduction
- WSALVAL
Amount: Total Wage and salary
- ERNVAL
Business or Farm net earnings
Source
Public use file from the CASC project. More information on this test data can be found in the paper listed below.
References
Brand, R., Domingo-Ferrer, J. and Mateo-Sanz, J.M., Reference data sets to test and compare SDC methods for protection of numerical microdata. Unpublished. https://research.cbs.nl/casc/CASCrefmicrodata.pdf
Examples
data(CASCrefmicrodata)
str(CASCrefmicrodata)
EIA data set
Description
Data set obtained from the U.S. Energy Information Authority.
Format
A data frame with 4092 observations on the following 15 variables.
- UTILITYID
UNIQUE UTILITY IDENTIFICATION NUMBER
- UTILNAME
UTILITY NAME. A factor whose levels are the individual utility names (e.g. 4-County Electric Power Assn, Alabama Power Co, Alaska Electric, ..., Wisconsin Public Service Corp, Wright-Hennepin Coop Elec Assn, Yellowstone Vlly Elec Coop Inc).
- STATE
STATE FOR WHICH THE UTILITY IS REPORTING. A factor with levels
AK, AL, AR, AZ, CA, CO, CT, DC, DE, FL, GA, HI, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MS, MT, NC, ND, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VA, VT, WA, WI, WV, WY
- YEAR
REPORTING YEAR FOR THE DATA
- MONTH
REPORTING MONTH FOR THE DATA
- RESREVENUE
REVENUE FROM SALES TO RESIDENTIAL CONSUMERS
- RESSALES
SALES TO RESIDENTIAL CONSUMERS
- COMREVENUE
REVENUE FROM SALES TO COMMERCIAL CONSUMERS
- COMSALES
SALES TO COMMERCIAL CONSUMERS
- INDREVENUE
REVENUE FROM SALES TO INDUSTRIAL CONSUMERS
- INDSALES
SALES TO INDUSTRIAL CONSUMERS
- OTHREVENUE
REVENUE FROM SALES TO OTHER CONSUMERS
- OTHRSALES
SALES TO OTHER CONSUMERS
- TOTREVENUE
REVENUE FROM SALES TO ALL CONSUMERS
- TOTSALES
SALES TO ALL CONSUMERS
Source
Public use file from the CASC project.
References
Brand, R., Domingo-Ferrer, J. and Mateo-Sanz, J.M., Reference data sets to test and compare SDC methods for protection of numerical microdata. Unpublished. https://research.cbs.nl/casc/CASCrefmicrodata.pdf
Examples
data(EIA)
head(EIA)
Additional Information-Loss measures
Description
Measures IL_correl() and IL_variables() were proposed by Andrzej Mlodak and are (theoretically) bounded between 0 and 1.
Usage
IL_correl(x, xm)
## S3 method for class 'il_correl'
print(x, digits = 3, ...)
IL_variables(x, xm)
## S3 method for class 'il_variables'
print(x, digits = 3, ...)
Arguments
x | an object coercible to a data.frame |
xm | an object coercible to a data.frame |
digits | number of digits used for rounding when displaying results |
... | additional parameter for print-methods; currently ignored |
Details
IL_correl() is an information-loss measure that can be applied to common numerically scaled variables in x and xm. It is based on the diagonal entries of the inverse correlation matrices in the original and perturbed data. IL_variables(): for common variables in x and xm, the individual distance functions depend on the class of the variable; specifically, these functions differ for numeric variables, ordered factors and character/factor variables. The individual distances are summed up and scaled by n * m, with n being the number of records and m being the number of (common) variables.
Details can be found in the references below.
The implementation of IL_correl() differs slightly from the original proposition in Mlodak, A. (2020), as the constant multiplier was changed to 1 / sqrt(2) instead of 1/2 for better efficiency and interpretability of the measure.
Value
the corresponding information-loss measure
Author(s)
Bernhard Meindl <bernhard.meindl@statistik.gv.at>
References
Mlodak, A. (2020). Information loss resulting from statistical disclosure control of output data,Wiadomosci Statystyczne. The Polish Statistician, 2020, 65(9), 7-27, DOI: 10.5604/01.3001.0014.4121
Mlodak, A. (2019). Using the Complex Measure in an Assessment of the Information Loss Due to the Microdata Disclosure Control, Przegląd Statystyczny, 2019, 66(1), 7-26, DOI: 10.5604/01.3001.0013.8285
Examples
data("Tarragona", package = "sdcMicro")res1 <- addNoise(obj = Tarragona, variables = colnames(Tarragona), noise = 100)IL_correl(x = as.data.frame(res1$x), xm = as.data.frame(res1$xm))res2 <- addNoise(obj = Tarragona, variables = colnames(Tarragona), noise = 25)IL_correl(x = as.data.frame(res2$x), xm = as.data.frame(res2$xm))# creating test-inputsn <- 150x <- xm <- data.frame( v1 = factor(sample(letters[1:5], n, replace = TRUE), levels = letters[1:5]), v2 = rnorm(n), v3 = runif(3), v4 = ordered(sample(LETTERS[1:3], n, replace = TRUE), levels = c("A", "B", "C")))xm$v1[1:5] <- "a"xm$v2 <- rnorm(n, mean = 5)xm$v4[1:5] <- "A"IL_variables(x, xm)Local recoding via Edmond's maximum weighted matching algorithm
Description
To be used on both categorical and numeric input variables, although usage on categorical variables is the focus of the development of this software.
Usage
LocalRecProg(
  obj,
  ancestors = NULL,
  ancestor_setting = NULL,
  k_level = 2,
  FindLowestK = TRUE,
  weight = NULL,
  lowMemory = FALSE,
  missingValue = NA,
  ...
)
Arguments
obj | a data.frame or an object of class sdcMicroObj-class |
ancestors | Names of ancestors of the categorical variables |
ancestor_setting | For each ancestor the corresponding categorical variable |
k_level | Level for k-anonymity |
FindLowestK | requests the program to look for the smallest k that results in complete matches of the data. |
weight | A weight for each variable (Default=1) |
lowMemory | Slower algorithm with less memory consumption |
missingValue | The output value for a suppressed value. |
... | see arguments below
|
Details
Each record in the data represents a category of the original data, and hence all records in the input data should be unique by the N input variables. To achieve bigger category sizes (k-anonymity), one can form new categories based on the recoding result and repeatedly apply this algorithm.
Value
a data.frame with the original variables and the suppressed variables (suffix _lr), or the modified sdcMicroObj-class object
Methods
- list("signature(obj=\"sdcMicroObj\")")
Author(s)
Alexander Kowarik, Bernd Prantner, IHSN C++ source, Akimichi Takemura
References
Kowarik, A., Templ, M., Meindl, B., Fonteneau, F. and Prantner, B.: Testing of IHSN C++ Code and Inclusion of New Methods into sdcMicro, in: Lecture Notes in Computer Science, J. Domingo-Ferrer, I. Tinnirello (editors); Springer, Berlin, 2012, ISBN: 978-3-642-33626-3, pp. 63-77. doi:10.1007/978-3-642-33627-0_6
Examples
data(testdata2)
cat_vars <- c("urbrur", "roof", "walls", "water", "sex", "relat")
anc_var <- c("water2", "water3", "relat2")
anc_setting <- c("water", "water", "relat")
r1 <- LocalRecProg(obj = testdata2, categorical = cat_vars, missingValue = -99)
r2 <- LocalRecProg(obj = testdata2, categorical = cat_vars, ancestor = anc_var,
  ancestor_setting = anc_setting, missingValue = -99)
r3 <- LocalRecProg(obj = testdata2, categorical = cat_vars, ancestor = anc_var,
  ancestor_setting = anc_setting, missingValue = -99, FindLowestK = FALSE)
# for objects of class sdcMicro:
sdc <- createSdcObj(dat = testdata2,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"), w = "sampling_weight")
sdc <- LocalRecProg(sdc)
Tarragona data set
Description
A real data set comprising figures of 834 companies in the Tarragona area. Data correspond to year 1995.
Format
A data frame with 834 observations on the following 13 variables.
- FIXED.ASSETS
a numeric vector
- CURRENT.ASSETS
a numeric vector
- TREASURY
a numeric vector
- UNCOMMITTED.FUNDS
a numeric vector
- PAID.UP.CAPITAL
a numeric vector
- SHORT.TERM.DEBT
a numeric vector
- SALES
a numeric vector
- LABOR.COSTS
a numeric vector
- DEPRECIATION
a numeric vector
- OPERATING.PROFIT
a numeric vector
- FINANCIAL.OUTCOME
a numeric vector
- GROSS.PROFIT
a numeric vector
- NET.PROFIT
a numeric vector
Source
Public use data from the CASC project.
References
Brand, R., Domingo-Ferrer, J. and Mateo-Sanz, J.M., Reference data sets to test and compare SDC methods for protection of numerical microdata. Unpublished. https://research.cbs.nl/casc/CASCrefmicrodata.pdf
Examples
data(Tarragona)
head(Tarragona)
dim(Tarragona)
addGhostVars
Description
Specify variables that are linked to a key variable. This results in all suppressions of the key variable also being applied to the corresponding 'ghost' variables.
Usage
addGhostVars(obj, keyVar, ghostVars)
Arguments
obj | an object of class sdcMicroObj-class |
keyVar | character vector of length 1 referring to a categorical key variable within obj |
ghostVars | a character vector specifying variables that are linked to keyVar |
Value
a modified sdcMicroObj-class object.
Author(s)
Bernhard Meindl
References
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Examples
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
## we want to link the anonymization status of key variable 'urbrur' to 'hhcivil'
sdc <- addGhostVars(sdc, keyVar = "urbrur", ghostVars = c("hhcivil"))
## we want to link the anonymization status of key variable 'roof' to 'represent'
sdc <- addGhostVars(sdc, keyVar = "roof", ghostVars = c("represent"))
Adding noise to perturb data
Description
Various methods for adding noise to perturb continuous scaled variables.
Usage
addNoise(obj, variables = NULL, noise = 150, method = "additive", ...)
Arguments
obj | either a data.frame, a matrix or an object of class sdcMicroObj-class |
variables | vector with names of variables that should be perturbed |
noise | amount of noise (in percentages) |
method | choose between ‘additive’, ‘correlated’,‘correlated2’, ‘restr’, ‘ROMM’, ‘outdect’ |
... | see possible arguments below |
Details
If ‘obj’ is of class sdcMicroObj-class, all continuous key variables are selected per default. If ‘obj’ is of class “data.frame” or “matrix”, the continuous variables have to be specified.
Method ‘additive’ adds noise completely at random to each variable depending on its size and standard deviation. Methods ‘correlated’ and ‘correlated2’ add noise and preserve the covariances as described in R. Brand (2001) or in the reference given below. Method ‘restr’ takes the sample size into account when adding noise. Method ‘ROMM’ is an implementation of the algorithm ROMM (Random Orthogonalized Matrix Masking) (Fienberg, 2004). Method ‘outdect’ adds noise only to outliers. The outliers are identified with univariate and robust multivariate procedures based on robust Mahalanobis distances calculated by the MCD estimator.
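As a rough illustration of the ‘additive’ idea (a minimal sketch only, not the exact formula used internally by addNoise()), noise proportional to each variable's standard deviation could be added like this:

## sketch: add zero-mean normal noise scaled by 'noise' percent of each column's sd
add_additive_noise <- function(x, noise = 150) {
  as.data.frame(lapply(x, function(v) v + rnorm(length(v), mean = 0, sd = noise / 100 * sd(v))))
}
data(Tarragona)
xm <- add_additive_noise(as.data.frame(Tarragona[, 1:3]), noise = 150)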
Value
If ‘obj’ was of class sdcMicroObj-class, the corresponding slots are filled, like manipNumVars, risk and utility.
If ‘obj’ was of class “data.frame” or “matrix”, an object of class “micro” with the following entities is returned:
x | the original data |
xm | the modified (perturbed) data |
method | method used for perturbation |
noise | amount of noise |
Author(s)
Matthias Templ and Bernhard Meindl
References
Domingo-Ferrer, J., Sebe, F. and Castella, J., “On the security of noise addition for privacy in statistical databases”, Lecture Notes in Computer Science, vol. 3050, pp. 149-161, 2004. ISSN 0302-9743. Vol. Privacy in Statistical Databases, eds. J. Domingo-Ferrer and V. Torra, Berlin: Springer-Verlag.
Ting, D., Fienberg, S.E. and Trottini, M. “ROMM Methodology for Microdata Release”, Joint UNECE/Eurostat work session on statistical data confidentiality, Geneva, Switzerland, 2005, https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2005/wp.11.e.pdf
Ting, D., Fienberg, S.E., Trottini, M. “Random orthogonal matrix masking methodology for microdata release”, International Journal of Information and Computer Security, vol. 2, pp. 86-105, 2008.
Templ, M. and Meindl, B., Robustification of Microdata Masking Methods and the Comparison with Existing Methods, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 5262, pp. 177-189, 2008.
Templ, M. New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.
Templ, M., Meindl, B. and Kowarik, A.: Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro, Journal of Statistical Software, 67(4), 1-36, 2015. doi:10.18637/jss.v067.i04
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
See Also
sdcMicroObj-class, summary.micro
Examples
data(Tarragona)
a1 <- addNoise(Tarragona)
a1
data(testdata)
# donttest because Examples with CPU time > 2.5 times elapsed time
testdata[, c('expend','income','savings')] <-
  addNoise(testdata[, c('expend','income','savings')])$xm
## for objects of class sdcMicroObj:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- addNoise(sdc)
argus_microaggregation
Description
Calls microaggregation code from mu-Argus. In case only one variable should be microaggregated and useOptimal is TRUE, the Hansen-Mukherjee polynomial exact method is applied. In any other case, the Mateo-Domingo method is used.
Usage
argus_microaggregation(df, k, useOptimal = FALSE)
Arguments
df | a data.frame |
k | required group size |
useOptimal | (logical) should optimal microaggregation be applied (only possible in the case of a single variable) |
Value
a list with two elements
original: the originally provided input data
microaggregated: the microaggregated data.frame
See Also
mu-Argus manual at https://github.com/sdcTools/manuals/raw/master/mu-argus/MUmanual5.1.pdf
Examples
mat <- matrix(sample(1:100, 50, replace = TRUE), nrow = 10, ncol = 5)
df <- as.data.frame(mat)
res <- argus_microaggregation(df, k = 5, useOptimal = FALSE)
argus_rankswap
Description
argus_rankswap
Usage
argus_rankswap(df, perc)
Arguments
df | a data.frame |
perc | a number defining the swapping percentage |
Value
a list with two elements
original: the originally provided input data
swapped: the data.frame containing the swapped values
See Also
mu-Argus manual at https://github.com/sdcTools/manuals/raw/master/mu-argus/MUmanual5.1.pdf
Examples
mat <- matrix(sample(1:100, 50, replace = TRUE), nrow = 10, ncol = 5)
df <- as.data.frame(mat)
res <- argus_rankswap(df, perc = 10)
Recompute Risk and Frequencies for a sdcMicroObj
Description
Recomputation of risk measures should be done after manually changing the content of an object of class sdcMicroObj.
Usage
calcRisks(obj, ...)
Arguments
obj | a sdcMicroObj object |
... | no arguments at the moment |
Details
By applying this function, the disclosure risk is re-estimated and the corresponding slots of an object of class sdcMicroObj are updated. This function is mostly used internally to automatically update the risk after an SDC method is applied.
Value
a sdcMicroObj object with updated risk values
See Also
Examples
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- calcRisks(sdc)
Small Artificial Data set
Description
A small toy example data set which was used by Sanz-Mateo et al.
Format
The format is: int [1:13, 1:7] 10 12 17 21 9 12 12 14 13 15 ... -attr(*, "dimnames")=List of 2 ..$ : chr [1:13] "1" "2" "3" "4" ... ..$ :chr [1:7] "1" "2" "3" "4" ...
Examples
data(casc1)
casc1
Dummy Dataset for Record Swapping
Description
createDat() returns dummy data to illustrate targeted record swapping. The generated data contain household ids ('hid'), geographic variables ('nuts1', 'nuts2', 'nuts3', 'lau2') as well as some other household or personal variables.
Usage
createDat(N = 10000)
Arguments
N | integer, number of households to generate |
Value
'data.table' containing dummy data
See Also
recordSwap
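A minimal usage sketch (the generated column set is the one described above):

d <- createDat(N = 1000)   # 1000 dummy households
head(d)
## geographic variables such as 'nuts1', 'nuts2', 'nuts3', 'lau2' and the
## household id 'hid' can then be used as inputs for recordSwap()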
Creates new randomized IDs
Description
This is useful if the record IDs consist, for example, of a geo identifier and the household line number. This method can be used to create new, random IDs that cannot be reconstructed.
Usage
createNewID(obj, newID, withinVar)
Arguments
obj | an sdcMicroObj-class object |
newID | a character specifying the desired variable name of the new ID |
withinVar | if not |
Value
an sdcMicroObj-class object with updated slot origData
Overall disclosure risk
Description
Distance-based disclosure risk estimation via standard deviation-based intervals around observations.
Usage
dRisk(obj, ...)
Arguments
obj | a data.frame, matrix or an object of class sdcMicroObj-class |
... | possible arguments are:
|
Details
An interval (based on the standard deviation) is built around each value of the perturbed variable. Then it is checked whether the original values lie within these intervals. With parameter k one can enlarge or downscale the intervals.
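Conceptually (a simplified sketch, not the exact rule implemented in dRisk()), the interval check for a single numeric variable could look like this:

## sketch: flag original values that fall inside an sd-based interval
## around their perturbed counterparts
data(Tarragona)
x  <- Tarragona[, "FIXED.ASSETS"]          # original values
xm <- x + rnorm(length(x), sd = sd(x))     # some perturbed version
k  <- 0.05                                 # scaling factor for the interval width
inside <- abs(x - xm) <= k * sd(x)         # TRUE = original value lies in the interval
mean(inside)                               # share of potentially disclosive records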
Value
The disclosure risk and/or the modified sdcMicroObj-class object
Author(s)
Matthias Templ
References
see method SDID in Mateo-Sanz, Sebe, Domingo-Ferrer. Outlier Protection in Continuous Microdata Masking. International Workshop on Privacy in Statistical Databases. PSD 2004: Privacy in Statistical Databases, pp. 201-215.
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
See Also
Examples
data(free1)
free1 <- as.data.frame(free1)
m1 <- microaggregation(free1[, 31:34], method = "onedims", aggr = 3)
m2 <- microaggregation(free1[, 31:34], method = "pca", aggr = 3)
dRisk(obj = free1[, 31:34], xm = m1$mx)
dRisk(obj = free1[, 31:34], xm = m2$mx)
dUtility(obj = free1[, 31:34], xm = m1$mx)
dUtility(obj = free1[, 31:34], xm = m2$mx)
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
## this is already made internally: sdc <- dRisk(sdc)
## and already stored in sdc
RMD based disclosure risk
Description
Distance-based disclosure risk estimation via robust Mahalanobis Distances.
Usage
dRiskRMD(obj, ...)
Arguments
obj | a data.frame, matrix or an object of class sdcMicroObj-class |
... | see possible arguments below
|
Details
This method is an extension of method SDID because it accounts for the “outlyingness” of each observation. This is a quite natural approach, since outliers have a higher risk of re-identification and therefore should have larger disclosure risk intervals than observations in the center of the data cloud.
The algorithm works as follows (a conceptual sketch of the first two steps is given after this list):
1. Robust Mahalanobis distances are estimated in order to get a robust multivariate distance for each observation.
2. Intervals are estimated around every data point of the original data, where the length of the interval is defined/weighted by the squared robust Mahalanobis distance and the parameter $k$. The higher the RMD of an observation, the larger the interval.
3. Check whether the corresponding masked values fall into the intervals around the original values or not. If the value of the corresponding observation is within such an interval, the whole observation is considered unsafe. We thus obtain a vector indicating which observations are safe or not, and we are already finished when using method RMDID1.
4. For method RMDID1w: we return the (via RMD) weighted vector of disclosure risk.
5. For method RMDID2: whenever an observation is considered unsafe, it is checked whether $m$ other observations from the masked data are very close (defined by a parameter $k2$ for the length of the intervals, as for SDID or RSDID) to such an unsafe observation, using Euclidean distances. If more than $m$ points are in such a small interval, we conclude that this observation is “safe”.
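A conceptual sketch of steps 1 and 2, using robust Mahalanobis distances via the MCD estimator from the robustbase package (which is among the package imports); the exact interval weighting used by dRiskRMD() may differ:

library(robustbase)
data(Tarragona)
x <- as.matrix(Tarragona[, 5:7])
mcd <- covMcd(x)                                   # robust location and scatter (MCD)
rmd <- mahalanobis(x, center = mcd$center, cov = mcd$cov)
## larger robust distances -> larger (risk) intervals around the original values
k <- 0.01
interval_width <- k * outer(rmd, apply(x, 2, sd))  # n x p matrix of interval half-widths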
Value
The disclosure risk or the modified sdcMicroObj-class object
risk1 | percentage of sensitive observations according to method RMDID1. |
risk2 | standardized version of risk1 |
wrisk1 | amount of sensitive observations according to RMDID1, weighted by their corresponding robust Mahalanobis distances. |
wrisk2 | RMDID2 measure |
indexRisk1 | index of observations with high risk according to risk1 measure |
indexRisk2 | index of observations with high risk according to wrisk2 measure |
Author(s)
Matthias Templ
References
Templ, M. and Meindl, B., Robust Statistics Meets SDC: New Disclosure Risk Measures for Continuous Microdata Masking, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 5262, pp. 113-126, 2008.
Templ, M. New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.
See Also
Examples
data(Tarragona)
x <- Tarragona[, 5:7]
y <- addNoise(x)$xm
dRiskRMD(x, xm = y)
dRisk(x, xm = y)
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- dRiskRMD(sdc)
Data-Utility measures
Description
dUtility() allows different measures of data utility to be computed, based on various distances between original and perturbed variables.
Usage
dUtility(obj, ...)
Arguments
obj | original data or an object of class sdcMicroObj |
... | see arguments below
|
Details
The standardised distances of the perturbed data values to the original ones are measured. The following measures are available:
- "IL1": sum of absolute distances between original and perturbed variables, scaled by the absolute values of the original variables
- "IL1s": measures the absolute distances between original and perturbed values, scaled by the standard deviation of the original variables times the square root of 2
- "eigen": compares the eigenvalues of original and perturbed data
- "robeigen": compares robust eigenvalues of original and perturbed data
Value
the data utility measure, or the sdcMicroObj with a modified data-utility entry.
Author(s)
Matthias Templ
References
for IL1 and IL1s: see Mateo-Sanz, Sebe, Domingo-Ferrer. Outlier Protection in Continuous Microdata Masking. International Workshop on Privacy in Statistical Databases. PSD 2004: Privacy in Statistical Databases, pp. 201-215.
Templ, M. and Meindl, B., Robust Statistics Meets SDC: New Disclosure Risk Measures for Continuous Microdata Masking, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 5262, pp. 113-126, 2008.
See Also
Examples
data(free1)
free1 <- as.data.frame(free1)
m1 <- microaggregation(free1[, 31:34], method = "onedims", aggr = 3)
m2 <- microaggregation(free1[, 31:34], method = "pca", aggr = 3)
dRisk(obj = free1[, 31:34], xm = m1$mx)
dRisk(obj = free1[, 31:34], xm = m2$mx)
dUtility(obj = free1[, 31:34], xm = m1$mx)
dUtility(obj = free1[, 31:34], xm = m2$mx)
data(Tarragona)
x <- Tarragona[, 5:7]
y <- addNoise(x)$xm
dRiskRMD(x, xm = y)
dRisk(x, xm = y)
dUtility(x, xm = y, method = "IL1")
dUtility(x, xm = y, method = "IL1s")
dUtility(x, xm = y, method = "eigen")
dUtility(x, xm = y, method = "robeigen")
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
## this is already made internally, so you don't need to run this:
sdc <- dUtility(sdc)
Fast generation of synthetic data
Description
Fast generation of (primitive) synthetic multivariate normal data.
Usage
dataGen(obj, ...)
Arguments
obj | a data.frame, matrix or an object of class sdcMicroObj-class |
... | see possible arguments below
|
Details
Uses the Cholesky decomposition to generate synthetic data with approximately the same means and covariances. For details, see the reference.
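A minimal sketch of the underlying idea (the Cholesky factor of the sample covariance applied to independent standard normal draws; the exact implementation in dataGen() may differ in details such as mean handling):

data(mtcars)
x  <- as.matrix(mtcars[, 4:6])
n  <- 200
R  <- chol(cov(x))                                  # upper triangular, t(R) %*% R = cov(x)
z  <- matrix(rnorm(n * ncol(x)), ncol = ncol(x))    # independent N(0, 1) draws
synth <- z %*% R + matrix(colMeans(x), n, ncol(x), byrow = TRUE)
cov(synth)                                          # approximately cov(x)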
Value
the generated synthetic data.
Note
With this method, only multivariate normally distributed data with approximately the same covariance as the original data can be generated; it does not reflect the distribution of real, complex data, which in general do not follow a multivariate normal distribution.
Author(s)
Matthias Templ
References
Mateo-Sanz, Martinez-Balleste, Domingo-Ferrer. Fast Generation of Accurate Synthetic Microdata. International Workshop on Privacy in Statistical Databases PSD 2004: Privacy in Statistical Databases, pp 298-306.
See Also
Examples
data(mtcars)
cov(mtcars[, 4:6])
cov(dataGen(mtcars[, 4:6]))
pairs(mtcars[, 4:6])
pairs(dataGen(mtcars[, 4:6]))
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- dataGen(sdc)
Distribute number of swaps
Description
Distribute the number of swaps across the lowest hierarchy level according to a predefined swaprate. The swaprate is applied such that a single swap counts as swapping 2 households. Numbers of swaps are randomly rounded up or down, if needed, such that the total number of swaps is consistent with the swaprate.
NOTE: This is an internal function used for testing the C++ function distributeDraws, which is used inside the C++ function recordSwap().
Usage
distributeDraws_cpp(data, hierarchy, hid, swaprate, seed = 123456L)
Arguments
data | micro data containing the hierarchy levels and household ID |
hierarchy | column indices of variables in data which refer to the geographic hierarchy |
hid | column index in data which refers to the household identifier |
swaprate | double between 0 and 1 defining the proportion of households which should be swapped, see details for more explanations |
seed | integer setting the sampling seed |
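A small illustration of the swaprate logic described above (plain arithmetic, not a call to the internal C++ routine): with a swaprate of 0.05 and 10,000 households, roughly 0.05 * 10000 / 2 = 250 swaps are targeted, since each swap involves two households.

n_hh <- 10000
swaprate <- 0.05
round(swaprate * n_hh / 2)   # approximate total number of swaps to distribute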
Distribute
Description
Distribute 'totalDraws' using the ratio/probability vector 'inputRatio' and randomly round each entry up or down such that the distribution results in an integer vector. Returns an integer vector containing the number of units in 'totalDraws' distributed according to the proportions in 'inputRatio'.
NOTE: This is an internal function used for testing the C++ function distributeRandom, which is used inside the C++ function recordSwap().
Usage
distributeRandom_cpp(inputRatio, totalDraws, seed)
Arguments
inputRatio | vector containing ratios which are used to distribute the number of units in 'totalDraws'. |
totalDraws | number of units to distribute |
seed | integer setting the sampling seed |
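A simplified sketch of such a randomized rounding (a hypothetical helper, not the C++ implementation): the total is first distributed proportionally, then the remaining units are assigned by sampling with probabilities given by the fractional parts, so that the rounded vector always sums to 'totalDraws'.

distribute_random_sketch <- function(inputRatio, totalDraws) {
  raw  <- inputRatio / sum(inputRatio) * totalDraws  # exact (non-integer) allocation
  base <- floor(raw)
  rest <- totalDraws - sum(base)                     # units still to assign
  if (rest > 0) {
    idx <- sample(seq_along(raw), rest, prob = raw - base)
    base[idx] <- base[idx] + 1
  }
  base
}
distribute_random_sketch(c(0.25, 0.35, 0.40), 10)    # sums to 10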
Remove certain variables from the data set inside a sdc object.
Description
Extract the manipulated data from an object of class sdcMicroObj-class
Usage
extractManipData(
  obj,
  ignoreKeyVars = FALSE,
  ignorePramVars = FALSE,
  ignoreNumVars = FALSE,
  ignoreGhostVars = FALSE,
  ignoreStrataVar = FALSE,
  randomizeRecords = "no"
)
Arguments
obj | object of class sdcMicroObj-class |
ignoreKeyVars | if manipulated key variables should be returned or the unchanged original variables |
ignorePramVars | if manipulated PRAM variables should be returned or the unchanged original variables |
ignoreNumVars | if manipulated numeric variables should be returned or the unchanged original variables |
ignoreGhostVars | if manipulated ghost (linked) variables should be returned or the unchanged original variables |
ignoreStrataVar | if manipulated strata variables should be returned or the unchanged original variables |
randomizeRecords | (logical) specifies whether the output records should be randomized. The following options are possible:
|
Value
a data.frame containing the anonymized data set
Author(s)
Alexander Kowarik, Bernhard Meindl
Examples
## for objects of class sdcMicro:
data(testdata)
sdc <- createSdcObj(testdata, keyVars = c('urbrur','roof'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- removeDirectID(sdc, var = "age")
dataM <- extractManipData(sdc)
data from the casc project
Description
Small synthetic data from Capobianchi, Polettini, Lucarelli
Format
A data frame with 8 observations on the following 8 variables.
- Num1
a numeric vector
- Key1
Key variable 1. A numeric vector
- Num2
a numeric vector
- Key2
Key variable 2. A numeric vector
- Key3
Key variable 3. A numeric vector
- Key4
Key variable 4. A numeric vector
- Num3
a numeric vector
- w
The weight vector. A numeric vector
Details
This data set is very similar to the one used by the authors of the paper given below. It is included only for demonstration purposes, i.e. to show that the package provides the same results as their software.
Source
https://research.cbs.nl/casc/deliv/12d1.pdf
Examples
data(francdat)
francdat
Demo data set from mu-Argus
Description
The public use toy demo data set from the mu-Argus software for SDC.
Format
The format is: num [1:4000, 1:34] 36 36 36 36 36 36 36 36 36 36 ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:34] "REGION" "SEX" "AGE" "MARSTAT" ...
Details
Please see the link given below. Note that the correlation structure of the data is not very realistic, especially concerning the continuous scaled variables, which are drawn independently from a multivariate uniform distribution.
Source
Public use file from the CASC project.
Examples
data(free1)
head(free1)
Freq
Description
Extract sample frequency counts (fk) or estimated population frequency counts (Fk)
Usage
freq(obj, type = "fk")Arguments
obj | an |
type | either |
Value
a vector containing sample frequencies or weighted frequencies
Author(s)
Bernhard Meindl
Examples
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c('urbrur','roof','walls','relat','sex'),
  pramVars = c('water','electcon'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
head(freq(sdc, type = "fk"))
head(freq(sdc, type = "Fk"))
Frequencies calculation for risk estimation
Description
Computation and estimation of the sample and population frequency counts.
Usage
freqCalc(x, keyVars, w = NULL, alpha = 1)
Arguments
x | data frame or matrix |
keyVars | key variables |
w | column index of the weight variable. Should be set to NULL if one deals with a population. |
alpha | numeric value between 0 and 1 specifying how much keys that contain missing values (NAs) should contribute to the calculation of fk and Fk |
Details
The function considers the case of missing values in the data. A missing value stands for any of the possible categories of the variable considered. It is possible to apply this function to large data sets with many (categorical) key variables, since the computation is done in C.
freqCalc() does not support sdcMicro S4 class objects.
Value
An object of class freqCalc.
freqCalc | data set |
keyVars | variables used for frequency calculation |
w | index of weight vector. NULL if you do not have a sample. |
alpha | value of parameter alpha |
fk | the frequency of equal observations in the key-variables subset of the sample, given for each observation |
Fk | estimated frequency in the population |
n1 | number of observations with fk=1 |
n2 | number of observations with fk=2 |
Author(s)
Bernhard Meindl
References
look e.g. in https://research.cbs.nl/casc/deliv/12d1.pdf
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. https://www.tdp.cat/issues/abs.a004a08.php
Templ, M. New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Templ, M. and Meindl, B.: Practical Applications in Statistical Disclosure Control Using R, Privacy and Anonymity in Information Management Systems: New Techniques for New Practical Problems, Springer, 31-62, 2010, ISBN: 978-1-84996-237-7.
See Also
Examples
data(francdat)
f <- freqCalc(francdat, keyVars = c(2,4,5,6), w = 8)
f
f$freqCalc
f$fk
f$Fk
## with missings:
x <- francdat
x[3,5] <- NA
x[4,2] <- x[4,4] <- NA
x[5,6] <- NA
x[6,2] <- NA
f2 <- freqCalc(x, keyVars = c(2,4,5,6), w = 8)
cbind(f2$fk, f2$Fk)
## test parameter 'alpha'
f3a <- freqCalc(x, keyVars = c(2,4,5,6), w = 8, alpha = 1)
f3b <- freqCalc(x, keyVars = c(2,4,5,6), w = 8, alpha = 0.5)
f3c <- freqCalc(x, keyVars = c(2,4,5,6), w = 8, alpha = 0.1)
data.frame(fka = f3a$fk, fkb = f3b$fk, fkc = f3c$fk)
data.frame(Fka = f3a$Fk, Fkb = f3b$Fk, Fkc = f3c$Fk)
Generate one strata variable from multiple factors
Description
For strata defined by multiple variables (e.g. sex, age, country), one combined variable is generated.
Usage
generateStrata(df, stratavars, name)
Arguments
df | a data.frame |
stratavars | character vector with variable names |
name | name of the newly generated variable |
Value
The original data set with one new column.
Author(s)
Alexander Kowarik
Examples
x <- testdata
x <- generateStrata(x, c("sex", "urbrur"), "strataIDvar")
head(x)
get.sdcMicroObj
Description
Extract information from sdcMicroObj-class objects depending on argument type
Usage
get.sdcMicroObj(object, type)
Arguments
object | a sdcMicroObj-class object |
type | a character vector of length 1 defining what to calculate/return/modify. Allowed types are all slotNames of sdcMicroObj-class objects. |
Value
a slot of a sdcMicroObj-class object depending on argument type
Examples
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sl <- slotNames(sdc)
res <- sapply(sl, function(x) get.sdcMicroObj(sdc, type = x))
str(res)
Global Recoding
Description
Global recoding of variables
Usage
globalRecode(obj, ...)
Arguments
obj | a numeric vector, a data.frame or an object of class sdcMicroObj-class |
... | see possible arguments below
|
Details
If a labels parameter is specified, its values are used to name the factor levels. If none is specified, the factor level labels are constructed.
Value
the modified sdcMicroObj-class object or a factor, unless labels = FALSE, which results in the mere integer level codes.
Note
globalRecode() cannot be applied to vectors stored as factors from sdcMicro >= 4.7.0!
Author(s)
Matthias Templ and Bernhard Meindl
References
Templ, M., Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67(4), 1-36, 2015. doi:10.18637/jss.v067.i04
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
See Also
Examples
data(free1)
free1 <- as.data.frame(free1)
## application to a vector
head(globalRecode(free1$AGE, breaks = c(1,9,19,29,39,49,59,69,100), labels = 1:8))
table(globalRecode(free1$AGE, breaks = c(1,9,19,29,39,49,59,69,100), labels = 1:8))
## application to a data.frame
# automatic labels
table(globalRecode(free1, column = "AGE", breaks = c(1,9,19,29,39,49,59,69,100))$AGE)
## calculation of break-points using different algorithms
table(globalRecode(free1$AGE, breaks = 6))
table(globalRecode(free1$AGE, breaks = 6, method = "logEqui"))
table(globalRecode(free1$AGE, breaks = 6, method = "equalAmount"))
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- globalRecode(sdc, column = "water", breaks = 3)
table(get.sdcMicroObj(sdc, type = "manipKeyVars")$water)
Join levels of a variable in an object of class sdcMicroObj-class or factor or data.frame
Description
If the input is an object of class sdcMicroObj-class, the specified factor variable is recoded into a factor with fewer levels and risk measures are automatically recomputed.
Usage
groupAndRename(obj, var, before, after, addNA = FALSE)
Arguments
obj | object of class sdcMicroObj-class, factor or data.frame |
var | name of the keyVariable to change |
before | vector of levels before recoding |
after | name of new level after recoding |
addNA | logical, if TRUE missing values in the input variables are added to the level specified in argument after |
Details
If the input is of class data.frame, the result is a data.frame with a modified column specified by var.
If the input is of class factor, the result is a factor with different levels.
Value
the modified sdcMicroObj-class object
Author(s)
Bernhard Meindl
References
Templ, M., Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67(4), 1-36, 2015. doi:10.18637/jss.v067.i04
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Examples
## for objects of class sdcMicro:
data(testdata2)
testdata2$urbrur <- as.factor(testdata2$urbrur)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'), w = 'sampling_weight')
sdc <- groupAndRename(sdc, var = "urbrur", before = c("1", "2"), after = c("1"))
importProblem
Description
Reads an sdcProblem with code that has been exported within sdcApp.
Usage
importProblem(path)
Arguments
path | a file path |
Value
an object of class sdcMicro_GUI_export or an object of class 'simple.error'
Author(s)
Bernhard Meindl
Individual Risk computation
Description
Estimation of the risk for each observation. After the risk is computed, one can use e.g. the function localSupp() for the protection of values of high risk. Further details can be found at the link given below.
Usage
indivRisk(x, method = "approx", qual = 1, survey = TRUE)Arguments
x | object from class freqCalc |
method | approx (default) or exact |
qual | final correction factor |
survey | TRUE, if we have survey data and FALSE if we deal with a population. |
Details
S4 class sdcMicro objects are only supported by the function measure_risk, which also estimates the individual risk with the same method.
Value
- rk:
base individual risk
- method:
method
- qual:
final correction factor
- fk:
frequency count
- knames:
colnames of the key variables
Note
The base individual risk method was developed by Benedetti, Capobianchi and Franconi.
Author(s)
Matthias Templ. Bug in method “exact” fixed since version 2.6.5 by Youri Baeyens.
References
Templ, M., Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67(4), 1-36, 2015. doi:10.18637/jss.v067.i04
Franconi, L. and Polettini, S. (2004) Individual risk estimation in mu-Argus: a review. Privacy in Statistical Databases, Lecture Notes in Computer Science, 262-272. Springer.
Machanavajjhala, A., Kifer, D., Gehrke, J. and Venkitasubramaniam, M. (2007) l-Diversity: Privacy Beyond k-Anonymity. ACM Trans. Knowl. Discov. Data, 1(1).
additionally, have a look at the vignettes of sdcMicro for further reading.
See Also
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
f <- freqCalc(francdat, keyVars = c(2,4,5,6), w = 8)
f
f$fk
f$Fk
## individual risk calculation:
indivf <- indivRisk(f)
indivf$rk
Calculate information loss after targeted record swapping
Description
Calculate information loss after targeted record swapping, using both the original and the swapped micro data. Information loss is calculated on table counts defined by parameter 'table_vars' using either the implemented information loss measures, such as absolute deviation, relative absolute deviation and absolute deviation of square roots, or a custom metric; see details below.
Usage
infoLoss(
  data,
  data_swapped,
  table_vars,
  metric = c("absD", "relabsD", "abssqrtD"),
  custom_metric = NULL,
  hid = NULL,
  probs = sort(c(seq(0, 1, by = 0.1), 0.95, 0.99)),
  quantvals = c(0, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, Inf),
  apply_quantvals = c("relabsD", "abssqrtD"),
  exclude_zeros = FALSE,
  only_inner_cells = FALSE
)
Arguments
data | original micro data set, must be either a 'data.table' or 'data.frame'. |
data_swapped | micro data set after targeted record swapping was applied. Must be either a 'data.table' or 'data.frame'. |
table_vars | column names in both 'data' and 'data_swapped'. Defines the variables over which a (multidimensional) frequency table is constructed. Information loss is then calculated by applying the metrics in 'metric' and 'custom_metric' over the cell counts and margin counts of the tables from 'data' and 'data_swapped'. |
metric | character vector containing one or more of the already implemented metrics: "absD", "relabsD" and/or "abssqrtD". |
custom_metric | function or (named) list of functions. Functions defined here must be of the form 'fun(x,y,...)' where 'x' and 'y' expect numeric values of the same length. The output of these functions must be a numeric vector of the same length as 'x' and 'y'. |
hid | 'NULL' or character containing household id in 'data' and 'data_swapped'. If not 'NULL' frequencies will reflect number of households, otherwise frequencies will reflect number of persons. |
probs | numeric vector containing values in the interval [0,1]. |
quantvals | optional numeric vector which defines the groups used for the cumulative outputs. Is applied on the results 'm' from each information loss metric as 'cut(m,breaks=quantvals,include.lowest=TRUE)', see also return values. |
apply_quantvals | character vector defining to the output of which metrics 'quantvals' should be applied. |
exclude_zeros | 'TRUE' or 'FALSE', if 'TRUE' 0 cells in the frequency table using 'data_swapped' will be ignored. |
only_inner_cells | 'TRUE' or 'FALSE', if 'TRUE' only inner cells of the frequency table defined by 'table_vars' will be compared. Otherwise all table margins will be calculated as well. |
Details
First, frequency tables are built from both 'data' and 'data_swapped' using the variables defined in 'table_vars'. By default all table margins are calculated as well, see parameter 'only_inner_cells = FALSE'. After that, the information loss metrics defined in either 'metric' or 'custom_metric' are applied to each of the table cells from both frequency tables. This is done in the sense of 'metric(x,y)' where 'metric' is the information loss, 'x' a cell from the table created from 'data' and 'y' the same cell from the table created from 'data_swapped'. One or more custom metrics can be applied using the parameter 'custom_metric', see also examples.
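A custom metric has to accept two numeric vectors of equal length and return a numeric vector of that length. The following sketch defines a hypothetical relative squared deviation and indicates how it could be passed via 'custom_metric' (the objects dat and dat_s are assumed to be the original and swapped data, as in the examples below):

## hypothetical custom information loss metric
relSqD <- function(x, y) {
  (x - y)^2 / pmax(x, 1)
}
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#   table_vars = c("nuts2", "national"),
#   custom_metric = list(relSqD = relSqD))
# iloss$measures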
Value
Returns a list containing:
* 'cellvalues': 'data.table' showing in a long format for each table cell the frequency counts for 'data' ~ 'count_o' and 'data_swapped' ~ 'count_s'.
* 'overview': 'data.table' containing the distribution of the 'noise' in number of cells and percentage. The 'noise' is calculated as the difference between the cell values of the frequency tables generated from the original and the swapped data.
* 'measures': 'data.table' containing the quantiles and mean (column 'what') of the distribution of the information loss metrics applied to each table cell. The quantiles are defined by parameter 'probs'.
* 'cumdistr\*': 'data.table' containing the cumulative distribution of the information loss metrics. The distribution is shown in number of cells ('cnt') and percentage ('pct'). Column 'cat' shows all unique values of the information loss metric or the grouping defined by 'quantvals'.
* 'false_zero': number of table cells which are non-zero when using 'data' and zero when using 'data_swapped'.
* 'false_nonzero': number of table cells which are zero when using 'data' and non-zero when using 'data_swapped'.
* 'exclude_zeros': value passed to 'exclude_zeros' when calling the function.
Examples
# generate dummy data
seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- createDat(nhid)

# define parameters for swapping
k_anonymity <- 1
swaprate <- .05
similar <- list(c("hsize"))
hier <- c("nuts1","nuts2")
carry_along <- c("nuts3","lau2")
risk_variables <- c("ageGroup","national")
hid <- "hid"

# apply record swapping
# dat_s <- recordSwap(data = dat, hid = hid, hierarchy = hier,
#   similar = similar, swaprate = swaprate,
#   k_anonymity = k_anonymity,
#   risk_variables = risk_variables,
#   carry_along = carry_along,
#   return_swapped_id = TRUE,
#   seed = seed)

# calculate information loss
# for the table nuts2 x national
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#   table_vars = c("nuts2","national"))
# iloss$measures      # distribution of information loss measures
# iloss$false_zero    # no false zeros
# iloss$false_nonzero # no false non-zeros

# frequency tables of households across
# nuts2 x hincome
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#   table_vars = c("nuts2","hincome"),
#   hid = "hid")
# iloss$measures

# define custom metric
# squareD <- function(x,y){
#   (x-y)^2
# }
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#   table_vars = c("nuts2","national"),
#   custom_metric = list(squareD = squareD))
# iloss$measures # includes custom loss as well

kAnon_violations
Description
returns the number of observations violating k-anonymity.
Usage
kAnon_violations(object, weighted, k)

## S4 method for signature 'sdcMicroObj,logical,numeric'
kAnon_violations(object, weighted, k)
Arguments
object | an sdcMicroObj-class object |
weighted | logical; if TRUE, sampling weights are taken into account and the number of violations is estimated for the population, otherwise only the unweighted sample counts are used |
k | a positive number defining parameter k |
Value
the number of records that are violating k-anonymity based on unweighted sample data only (in case parameter weighted is FALSE), or the number of observations that are estimated to violate k-anonymity in the population (in case parameter weighted equals TRUE).
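A minimal sketch, assuming the testdata set shipped with the package, contrasting the unweighted and the weighted variant:

data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c("urbrur", "roof", "walls", "water", "sex"),
  w = "sampling_weight")
## observations violating 3-anonymity in the sample
kAnon_violations(sdc, weighted = FALSE, k = 3)
## estimated number of violations in the population
kAnon_violations(sdc, weighted = TRUE, k = 3)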
Local Suppression
Description
A simple method to perform local suppression.
Usage
localSupp(obj, threshold = 0.15, keyVar)
Arguments
obj | object of class freqCalc or sdcMicroObj-class |
threshold | threshold for individual risk |
keyVar | Variable on which some values might be suppressed |
Details
Values of high risk (above the threshold) of a certain variable (parameter keyVar) are suppressed.
Value
an updated object of class freqCalc or the sdcMicroObj-class object with manipulated data.
Author(s)
Matthias Templ and Bernhard Meindl
References
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
See Also
Examples
data(francdat)
keyVars <- paste0("Key", 1:4)
f <- freqCalc(francdat, keyVars = keyVars, w = 8)
f
f$fk
f$Fk
## individual risk calculation:
indivf <- indivRisk(f)
indivf$rk
## Local Suppression
localS <- localSupp(f, keyVar = "Key4", threshold = 0.15)
f2 <- freqCalc(localS$freqCalc, keyVars = keyVars, w = 8)
indivf2 <- indivRisk(f2)
indivf2$rk
identical(indivf$rk, indivf2$rk)
## select another keyVar and run localSupp once again,
## if you think the table is not fully protected

## for objects of class sdcMicro:
data(testdata)
sdc <- createSdcObj(
  dat = testdata,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  w = "sampling_weight")
sdc <- localSupp(sdc, keyVar = "urbrur", threshold = 0.045)
print(sdc, type = "ls")

Local Suppression to obtain k-anonymity
Description
Algorithm to achieve k-anonymity by performing local suppression.
Usage
localSuppression(obj, k = 2, importance = NULL, combs = NULL, ...)

kAnon(obj, k = 2, importance = NULL, combs = NULL, ...)
Arguments
obj | a data.frame or an sdcMicroObj-class object |
k | Threshold for k-anonymity |
importance | Numeric vector of values between 1 andn ( |
combs | Numeric vector. If specified, the algorithm provides k-anonymity for each combination of n key variables (with n being the value of the ith element of this parameter). For example, |
... | see additional arguments below:
|
Details
The algorithm provides a k-anonymized data set by suppressing values in key variables. The algorithm tries to find an optimal solution to suppress as few values as possible and considers the specified importance vector. If not specified, the importance vector is constructed in a way such that key variables with a high number of characteristics are considered less important than key variables with a low number of characteristics.
The implementation provides k-anonymity per strata, if slot strataVar has been set in sdcMicroObj-class or if parameter strataVar is used when applying the data.frame method. For details, see the examples provided.
For the parameter alpha:
alpha = 1 counts all wildcard matches (i.e. NAs match everything). alpha = 0 assumes missing values form their own categories.
These are two extremes. With alpha = 0, frequencies are likely underestimated when NAs are present. If combs is used with alpha = 0, the heuristic nature of kAnon() may lead to technically correct, but not always intuitively understandable frequency evaluations.
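The sketch below contrasts the two extremes of alpha for the data.frame method, assuming that alpha can be supplied through the additional arguments ('...') and using the testdata2 set shipped with the package:

data(testdata2)
kv <- c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex")
inp <- testdata2[, kv]
## NAs as wildcards (alpha = 1) vs. NAs as their own category (alpha = 0)
ls_wild <- kAnon(inp, keyVars = 1:7, alpha = 1)
ls_own  <- kAnon(inp, keyVars = 1:7, alpha = 0)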
Value
A modified dataset with suppressions that meets k-anonymity based on the specified key variables, or the modified sdcMicroObj-class object.
Note
Deprecated methods localSupp2 and localSupp2Wrapper are no longer available in sdcMicro versions > 4.5.0. kAnon() is a more intuitive term for local suppression, since the goal is to achieve k-anonymity.
Author(s)
Bernhard Meindl, Matthias Templ
References
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN: 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Templ, M., Kowarik, A., Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67(4), 1–36, 2015. doi:10.18637/jss.v067.i04
Examples
data(francdat)
## Local Suppression
localS <- localSuppression(francdat, keyVars = c(4, 5, 6))
localS
plot(localS)

## for objects of class sdcMicro, no stratification
data(testdata2)
kv <- c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex")
sdc <- createSdcObj(testdata2, keyVars = kv, w = "sampling_weight")
sdc <- localSuppression(sdc)

## for objects of class sdcMicro, with stratification
testdata2$ageG <- cut(testdata2$age, 5, labels = paste0("AG", 1:5))
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = kv,
  w = "sampling_weight",
  strataVar = "ageG")
sdc <- localSuppression(sdc, nc = 1)

## it is also possible to provide k-anonymity for subsets of key-variables
## with different parameter k!
## in this case we want to provide 10-anonymity for all combinations
## of 5 key variables, 20-anonymity for all combinations with 4 key variables
## and 30-anonymity for all combinations of 3 key variables.
sdc <- createSdcObj(testdata2, keyVars = kv, w = "sampling_weight")
combs <- 5:3
k <- c(10, 20, 30)
sdc <- localSuppression(sdc, k = k, combs = combs)

## data.frame method (no stratification)
inp <- testdata2[, c(kv, "ageG")]
ls <- localSuppression(inp, keyVars = 1:7)
print(ls)
plot(ls)

## data.frame method (with stratification)
ls <- kAnon(inp, keyVars = 1:7, strataVars = 8)
print(ls)
plot(ls)

Fast and Simple Microaggregation
Description
Function to perform a fast and simple (primitive) method of microaggregation, suitable for large datasets.
Usage
mafast(obj, variables = NULL, by = NULL, aggr = 3, measure = mean)
Arguments
obj | either a data.frame/matrix or an object of class sdcMicroObj-class |
variables | variables to microaggregate. If obj is of class sdcMicroObj the numerical key variables are chosen per default. |
by | grouping variable for microaggregation. If obj is of class sdcMicroObj the strata variables are chosen per default. |
aggr | aggregation level (default=3) |
measure | aggregation statistic, mean, median, trim, onestep (default =mean) |
Value
If ‘obj’ was of class sdcMicroObj-class the corresponding slots are filled, like manipNumVars, risk and utility. If ‘obj’ was of class “data.frame” or “matrix” an object of the same class is returned.
Author(s)
Alexander Kowarik
See Also
Examples
data(Tarragona)
m1 <- mafast(Tarragona, variables = c("GROSS.PROFIT","OPERATING.PROFIT","SALES"), aggr = 3)
data(testdata)
m2 <- mafast(testdata, variables = c("expend","income","savings"), aggr = 50, by = "sex")
summary(m2)

## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'),
  w = 'sampling_weight')
sdc <- dRisk(sdc)
sdc@risk$numeric
sdc1 <- mafast(sdc, aggr = 4)
sdc1@risk$numeric
sdc2 <- mafast(sdc, aggr = 10)
sdc2@risk$numeric

### Performance tests
x <- testdata
for (i in 1:20) {
  x <- rbind(x, testdata)
}
system.time({
  xx <- mafast(
    obj = x,
    variables = c("expend", "income", "savings"),
    aggr = 50,
    by = "sex"
  )
})

Disclosure Risk for Categorical Variables
Description
The function measures the disclosure risk for weighted or unweighted data.It computes the individual risk (and household risk if reasonable) and theglobal risk. It also computes a risk threshold based on a global risk value.
Prints a 'measure_risk'-object
Prints a 'ldiversity'-object
Usage
measure_risk(obj, ...)

ldiversity(obj, ldiv_index = NULL, l_recurs_c = 2, missing = -999, ...)

## S3 method for class 'measure_risk'
print(x, ...)

## S3 method for class 'ldiversity'
print(x, ...)
Arguments
obj | Object of class sdcMicroObj-class, data.frame or matrix |
... | see arguments below
|
ldiv_index | indices (or names) of the variables used for l-diversity |
l_recurs_c | l-Diversity Constant |
missing | an integer value to be used as missing value in the C++ routine |
x | Output of measure_risk() or ldiversity() |
Details
To be used when the risk of disclosure for individuals within a family is considered to be statistically independent.
Internally, the functions freqCalc() and indivRisk() are used for estimation.
Measuring individual risk: the individual risk approach is based on so-called super-population models. In such models, population frequency counts are modeled given a certain distribution. The estimation procedure of sample frequency counts given the population frequency counts is modeled by assuming a negative binomial distribution. This is used for the estimation of the individual risk. The extensive theory can be found in Skinner (1998); the approximation formulas for the individual risk used are described in Franconi and Polettini (2004).
Measuring hierarchical risk: if "hid" - the index of the variable holding information on the hierarchical cluster structures (e.g., individuals that are clustered in households) - is provided, the hierarchical risk is additionally estimated. Note that the risk of re-identifying an individual within a household may also affect the probability of disclosure of other members in the same household. Thus, the household or cluster structure of the data must be taken into account when estimating disclosure risks. It is commonly assumed that the risk of re-identification of a household is the risk that at least one member of the household can be disclosed. Thus this probability can simply be estimated from individual risks as 1 minus the probability that no member of the household can be identified.
Global risk: the sum of the individual risks in the dataset gives the expected number of re-identifications, which serves as a measure of the global risk.
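Both aggregations can be written down directly; the sketch below assumes a small vector rk of (already estimated) individual risks that all belong to one household:

## hypothetical individual risks of the members of one household
rk <- c(0.01, 0.20, 0.05, 0.05)
## contribution to the global risk: expected number of re-identifications
sum(rk)
## household risk: 1 minus the probability that no member is re-identified
1 - prod(1 - rk)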
l-Diversity: if "ldiv_index" is unequal to NULL, i.e. if the indices of sensitive variables are specified, various measures for l-diversity are calculated. l-diversity is an extension of the well-known k-anonymity approach in which also the uniqueness in sensitive variables for each pattern spanned by the key variables is evaluated.
Value
A modified sdcMicroObj-class object or a list with the following elements:
- global_risk_ER:
expected number of re-identifications.
- global_risk:
global risk (sum of individual risks).
- global_risk_pct:
global risk in percent.
- Res:
matrix with the risk, frequency in the sample and grossed-up frequency in the population (and the hierarchical risk) for each observation.
- global_threshold:
for a given max_global_risk the threshold for the risk of observations.
- max_global_risk:
the input max_global_risk of the function.
- hier_risk_ER:
expected number of re-identifications with household structure.
- hier_risk:
global risk with household structure (sum of individual risks).
- hier_risk_pct:
global risk with household structure in percent.
- ldiversity:
Matrix with Distinct_Ldiversity, Entropy_Ldiversity and Recursive_Ldiversity for each sensitive variable.
Prints risk-information into the console
Information on L-Diversity Measures in the console
Author(s)
Alexander Kowarik, Bernhard Meindl, Matthias Templ, Bernd Prantner, minor parts of IHSN C++ source
References
Franconi, L. and Polettini, S. (2004)Individual riskestimation in mu-Argus: a review. Privacy in Statistical Databases, LectureNotes in Computer Science, 262–272. Springer
Machanavajjhala, A. and Kifer, D. and Gehrke, J. and Venkitasubramaniam, M.(2007)l-Diversity: Privacy Beyond k-Anonymity. ACM Trans. Knowl.Discov. Data, 1(1)
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Templ, M., Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67 (4), 1–36, 2015. doi:10.18637/jss.v067.i04
See Also
Examples
## measure_risk with sdcMicro objects:
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c('urbrur','roof','walls','water','electcon'),
  numVars = c('expend','income','savings'),
  w = 'sampling_weight')
## risk is already estimated and available in...
names(sdc@risk)

## measure risk on data frames or matrices:
res <- measure_risk(testdata, keyVars = c("urbrur","roof","walls","water","sex"))
print(res)
head(res$Res)
resw <- measure_risk(testdata, keyVars = c("urbrur","roof","walls","water","sex"), w = "sampling_weight")
print(resw)
head(resw$Res)
res1 <- ldiversity(testdata, keyVars = c("urbrur","roof","walls","water","sex"), ldiv_index = "electcon")
print(res1)
head(res1)
res2 <- ldiversity(testdata, keyVars = c("urbrur","roof","walls","water","sex"), ldiv_index = c("electcon","relat"))
print(res2)
head(res2)

# measure risk with household risk
resh <- measure_risk(testdata, keyVars = c("urbrur","roof","walls","water","sex"), w = "sampling_weight", hid = "ori_hid")
print(resh)

# change max_global_risk
rest <- measure_risk(testdata, keyVars = c("urbrur","roof","walls","water","sex"), w = "sampling_weight", max_global_risk = 0.0001)
print(rest)

## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'),
  w = 'sampling_weight')
## -> when using `createSdcObj()`, the risks are already internally computed
## and it is not required to explicitly run `sdc <- measure_risk(sdc)`

Replaces the raw household-level data with the anonymized household-level data in the full dataset for anonymization of data with a household structure (or other hierarchical structure). Requires a matching household ID in both files.
Description
Replaces the raw household-level data with the anonymized household-level data in the full dataset for anonymization of data with a household structure (or other hierarchical structure). Requires a matching household ID in both files.
Usage
mergeHouseholdData(dat, hhId, dathh)
Arguments
dat | a data.frame with the full dataset |
hhId | name of the household (cluster) ID (identical in both datasets) |
dathh | a data.frame with the treated household-level data (generated for example with selectHouseholdData) |
Value
a data.frame with the treated household level variables and the raw individual level variables
Author(s)
Thijs Benschop and Bernhard Meindl
Examples
## Load data
x <- testdata
## donttest is necessary because of
## Examples with CPU time > 2.5 times elapsed time
## caused by using C++ code and/or data.table

## Create household level dataset
x_hh <- selectHouseholdData(dat = x, hhId = "ori_hid",
  hhVars = c("urbrur", "roof", "walls", "water", "electcon", "household_weights"))

## Anonymize household level dataset and extract data
sdc_hh <- createSdcObj(x_hh, keyVars = c('urbrur','roof'), w = 'household_weights')
sdc_hh <- kAnon(sdc_hh, k = 3)
x_hh_anon <- extractManipData(sdc_hh)

## Merge anonymized household level data back into the full dataset
x_anonhh <- mergeHouseholdData(x, "ori_hid", x_hh_anon)

## Anonymize full dataset and extract data
sdc_full <- createSdcObj(x_anonhh, keyVars = c('sex', 'age', 'urbrur', 'roof'), w = 'sampling_weight')
sdc_full <- kAnon(sdc_full, k = 3)
x_full_anon <- extractManipData(sdc_full)

microData
Description
Small artificial toy data set.
Format
The format is: num [1:13, 1:5] 5 7 2 1 7 8 12 3 15 4 ... - attr(*,"dimnames")=List of 2 ..$ : chr [1:13] "10000" "11000" "12000" "12100" .....$ : chr [1:5] "one" "two" "three" "four" ...
Examples
data(microData)
microData <- as.data.frame(microData)
m1 <- microaggregation(microData, method = "mdav")
summary(m1)

Microaggregation for numerical and categorical key variables based on a distance similar to the Gower Distance
Description
The microaggregation is based on distances computed similarly to the Gower distance. The distance function distinguishes between the variable types factor, ordered, numerical and mixed (semi-continuous variables with a fixed probability mass at a constant value, e.g. 0).
Usage
microaggrGower(
  obj,
  variables = NULL,
  aggr = 3,
  dist_var = NULL,
  by = NULL,
  mixed = NULL,
  mixed.constant = NULL,
  trace = FALSE,
  weights = NULL,
  numFun = mean,
  catFun = VIM::sampleCat,
  addRandom = FALSE
)
Arguments
obj | a data.frame or an object of class sdcMicroObj-class |
variables | character vector with names of variables to be aggregated (default for sdcMicroObj is all keyVariables and all numeric key variables) |
aggr | aggregation level (default=3) |
dist_var | character vector with variable names for distance computation |
by | character vector with variable names to split the dataset before performing microaggregation (default for sdcMicroObj is strataVar) |
mixed | character vector with names of mixed variables |
mixed.constant | numeric vector with length equal to mixed, where themixed variables have the probability mass |
trace | TRUE/FALSE for some console output |
weights | numerical vector with length equal to the number of variables used for distance computation |
numFun | function to be used to aggregate numerical variables |
catFun | function to be used to aggregate categorical variables |
addRandom | TRUE/FALSE, if a random value should be added for the distance computation. |
Details
The function sampleCat samples with probabilities corresponding to the occurrence of the levels among the nearest neighbours. The function maxCat chooses the level with the most occurrences, and chooses randomly if the maximum is not unique.
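As a hedged sketch, a deterministic aggregation of the categorical variables could be obtained by supplying the mode-based function instead of the default (this assumes that maxCat is exported by VIM analogously to sampleCat):

data(testdata, package = "sdcMicro")
testdata <- testdata[1:200, ]
for (i in c(1:7, 9)) testdata[, i] <- as.factor(testdata[, i])
## aggregate categorical variables by their most frequent level
m <- microaggrGower(testdata,
  variables = c("relat", "age", "expend"),
  dist_var = c("age", "sex", "income", "savings"),
  catFun = VIM::maxCat)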
Value
The function returns the updated sdcMicroObj or simply an altered data frame.
Note
In each by-group all distances are computed; therefore, introducing more by-groups significantly decreases the computation time and memory consumption.
Author(s)
Alexander Kowarik
See Also
Examples
data(testdata, package = "sdcMicro")
testdata <- testdata[1:200, ]
for (i in c(1:7, 9)) testdata[, i] <- as.factor(testdata[, i])
test <- microaggrGower(testdata,
  variables = c("relat", "age", "expend"),
  dist_var = c("age", "sex", "income", "savings"),
  by = c("urbrur", "roof"))
sdc <- createSdcObj(testdata,
  keyVars = c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars = c('expend','income','savings'),
  w = 'sampling_weight')
sdc <- microaggrGower(sdc)

Microaggregation
Description
Function to perform various methods of microaggregation.
Usage
microaggregation(
  obj,
  variables = NULL,
  aggr = 3,
  strata_variables = NULL,
  method = "mdav",
  weights = NULL,
  nc = 8,
  clustermethod = "clara",
  measure = "mean",
  trim = 0,
  varsort = 1,
  transf = "log"
)
Arguments
obj | either an object of class sdcMicroObj-class or a data.frame/matrix |
variables | variables to microaggregate. For |
aggr | aggregation level (default=3) |
strata_variables | for |
method | pca, rmd, onedims, single, simple, clustpca, pppca,clustpppca, mdav, clustmcdpca, influence, mcdpca |
weights | sampling weights. If obj is of class sdcMicroObj the vector of sampling weights is chosen automatically. If supplied, a weighted version of the aggregation measure is chosen automatically, e.g. weighted median or weighted mean. |
nc | number of clusters, if the chosen method performs cluster analysis |
clustermethod | clustermethod, if necessary |
measure | aggregation statistic, mean, median, trim, onestep (default=mean) |
trim | trimming percentage, if measure=trim |
varsort | variable for sorting, if method=single |
transf | transformation for data x |
Details
On https://research.cbs.nl/casc/glossary.htm one can find the “official” definition of microaggregation:
Records are grouped based on a proximity measure of variables of interest, and the same small groups of records are used in calculating aggregates for those variables. The aggregates are released instead of the individual record values.
The recommended method is “rmd” which forms the proximity using multivariate distances based on robust methods. It is an extension of the well-known method “mdav”. However, when computational speed is important, method “mdav” is the preferable choice.
While very different concepts can be used for the proximity measure, the aggregation itself is naturally done with the arithmetic mean. Nevertheless, other measures of location can be used for aggregation, especially when the group size for aggregation has been chosen higher than 3. Since the median seems to be unsuitable for microaggregation because it is highly robust, other measures which are included can be chosen. If a complex sample survey is microaggregated, the corresponding sampling weights should be specified to either aggregate the values by the weighted arithmetic mean or the weighted median.
This function also contains a method with which the data can be clustered with a variety of different clustering algorithms. Clustering observations before applying microaggregation might be useful. Note that the data are automatically standardised before clustering.
The usage of clustering method ‘Mclust’ requires package mclust02, which must be loaded first. The package is not loaded automatically, since it is not under GPL but comes with a different licence.
There are also some projection methods for microaggregation included. The robust versions ‘pppca’ and ‘clustpppca’ (clustering first) are fast implementations and provide almost always the best results.
Univariate statistics are preserved best with the individual ranking method (we call it ‘onedims’; however, this method is often named ‘individual ranking’), but multivariate statistics are strongly affected.
With method ‘simple’ one can apply microaggregation directly on the (unsorted) data. It is useful as a benchmark for the comparison with other methods, i.e. it answers the question of how much better a sorting of the data before aggregation is.
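A minimal sketch of a weighted aggregation for a complex sample, assuming that the data.frame method accepts the sampling weights as a numeric vector via the weights argument:

data(testdata)
m_w <- microaggregation(
  obj = testdata[, c("expend", "income", "savings")],
  method = "onedims",
  aggr = 3,
  ## a weighted aggregation measure is chosen automatically once weights are given
  weights = testdata$sampling_weight)
summary(m_w)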
Value
If ‘obj’ was of class sdcMicroObj-class the corresponding slots are filled, like manipNumVars, risk and utility. If ‘obj’ was of class “data.frame”, an object of class “micro” with the following entities is returned:
x: original data
mx: the microaggregated dataset
method: method
aggr: aggregation level
measure: proximity measure for aggregation
Note
If only one variable is specified, mafast is applied and argument method is ignored. Parameter measure is ignored for methods mdav and rmd.
Author(s)
Matthias Templ, Bernhard Meindl
For method “mdav”: This work is being supported by the International Household Survey Network (IHSN) and funded by a DGF Grant provided by the World Bank to the PARIS21 Secretariat at the Organisation for Economic Co-operation and Development (OECD). This work builds on previous work which is elsewhere acknowledged.
Author for the integration of the code for mdav in R: Alexander Kowarik.
References
Templ, M. and Meindl, B., Robust Statistics Meets SDC: New Disclosure Risk Measures for Continuous Microdata Masking, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 5262, pp. 113-126, 2008.
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php
Templ, M. New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Templ, M., Meindl, B. and Kowarik, A.: Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro, Journal of Statistical Software, 67 (4), 1–36, 2015.
See Also
summary.micro, plotMicro, valTable
Examples
data(testdata)
# donttest since Examples with CPU time larger 2.5 times elapsed time, because
# of using data.table and multicore computation.
m <- microaggregation(
  obj = testdata[1:100, c("expend", "income", "savings")],
  method = "mdav",
  aggr = 4)
summary(m)

## for objects of class sdcMicro:
## no stratification because `@strataVar` is `NULL`
data(testdata2)
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
sdc <- microaggregation(
  obj = sdc,
  variables = c("expend", "income"))

## with stratification using variable `"relat"`
strataVar(sdc) <- "relat"
sdc <- microaggregation(
  obj = sdc,
  variables = "savings")

Global risk using log-linear models.
Description
The sample frequencies are assumed to be independent and Poisson distributed. The parameters of the corresponding distribution are estimated by a log-linear model including the main effects and possible interactions.
Usage
modRisk(obj, method = "default", weights, formulaM, bound = Inf, ...)
Arguments
obj | an sdcMicroObj-class object or a data.frame |
method | choose the method for model-based risk estimation. Currently, the following methods can be selected:
|
weights | a variable name specifying sampling weights |
formulaM | A formula specifying the model. |
bound | a number specifying a threshold for 'risky' observations in the sample. |
... | additional parameters passed through, currently ignored. |
Details
This measure aims to (1) calculate the number of sample uniques that are population uniques with a probabilistic Poisson model and (2) estimate the expected number of correct matches for sample uniques.
ad 1) this risk measure is defined over all sample uniques as
\tau_1= \sum\limits_{j:f_j=1} P(F_j=1 | f_j=1) \quad ,
i.e. the expected number of sample uniques that are population uniques.
ad 2) this risk measure is defined over all sample uniques as
\tau_2= \sum\limits_{j:f_j=1} P(1 / F_j | f_j=1) \quad .
Since population frequencies F_k are unknown, they need to be estimated.
The iterative proportional fitting method is used to fit the parameters of the Poisson distributed frequency counts related to the specified model. The obtained parameters are used to estimate a global risk, as defined in Skinner and Holmes (1998).
Value
Two global risk measures and some model output given the specified model. If this method is applied to an sdcMicroObj-class object, the slot 'risk' of the object is updated with the result of the model-based risk calculation.
Author(s)
Matthias Templ, Marius Totter, Bernhard Meindl
References
Skinner, C.J. and Holmes, D.J. (1998) Estimating the re-identification risk per record in microdata. Journal of Official Statistics, 14:361-372, 1998.
Rinott, Y. and Shlomo, N. (1998). A Generalized Negative Binomial Smoothing Model for Sample Disclosure Risk Estimation. Privacy in Statistical Databases. Lecture Notes in Computer Science. Springer-Verlag, 82–93.
Clogg, C.C. and Eliasson, S.R. (1987).Some Common Problems in Log-Linear Analysis. Sociological Methods and Research, 8-44.
See Also
Examples
## data.frame method
data(testdata2)
form <- ~ sex + water + roof
w <- "sampling_weight"
(modRisk(testdata2, method = "default", formulaM = form, weights = w))
(modRisk(testdata2, method = "CE", formulaM = form, weights = w))
(modRisk(testdata2, method = "PML", formulaM = form, weights = w))
(modRisk(testdata2, method = "weightedLLM", formulaM = form, weights = w))
(modRisk(testdata2, method = "IPF", formulaM = form, weights = w))

## application to a sdcMicroObj
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars = c("urbrur", "roof", "walls", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
sdc <- modRisk(sdc, form = ~ sex + water + roof)
slot(sdc, "risk")$model

# an example using data from the laeken-pkg
library(laeken)
data(eusilc)
f <- as.formula(paste(" ~ ", "db040 + hsize + rb090 + age + pb220a + age:rb090 + age:hsize + hsize:rb090"))
w <- "rb050"
(modRisk(eusilc, method = "default", weights = w, formulaM = f, bound = 5))
(modRisk(eusilc, method = "CE", weights = w, formulaM = f, bound = 5))
(modRisk(eusilc, method = "PML", weights = w, formulaM = f, bound = 5))
(modRisk(eusilc, method = "weightedLLM", weights = w, formulaM = f, bound = 5))

Detection and winsorization of multivariate outliers
Description
Imputation and detection of outliers
Usage
mvTopCoding(x, maha = NULL, center = NULL, cov = NULL, alpha = 0.025)
Arguments
x | an object coercible to a |
maha | squared mahalanobis distance of each observation |
center | center of the data, needed for calculation of the Mahalanobis distance (if not provided) |
cov | covariance matrix of the data, needed for calculation of the Mahalanobis distance (if not provided) |
alpha | significance level, determining the ellipsoid onto which outliers are placed |
Details
Winsorizes the potential outliers on the ellipsoid defined by (robust) Mahalanobis distances in the direction of the center of the data.
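If robust estimates are preferred, center and cov can be supplied directly; the sketch below uses the MCD estimator from the robustbase package, which is one possible choice and not mandated by mvTopCoding():

set.seed(1)
x <- MASS::mvrnorm(50, mu = c(5, 5), Sigma = matrix(c(1, 0.9, 0.9, 1), ncol = 2))
## robust location and scatter, so the winsorizing ellipsoid is not
## distorted by the outliers themselves
rob <- robustbase::covMcd(x)
ximp <- mvTopCoding(x, center = rob$center, cov = rob$cov, alpha = 0.025)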
Value
the imputed winsorized data
Author(s)
Johannes Gussenbauer, Matthias Templ
Examples
set.seed(123)
x <- MASS::mvrnorm(20, mu = c(5,5), Sigma = matrix(c(1,0.9,0.9,1), ncol = 2))
x[1, 1] <- 3
x[1, 2] <- 6
plot(x)
ximp <- mvTopCoding(x)
points(ximp, col = "blue", pch = 4)

# more dimensions
Sigma <- diag(5)
Sigma[upper.tri(Sigma)] <- 0.9
Sigma[lower.tri(Sigma)] <- 0.9
x <- MASS::mvrnorm(20, mu = rep(5,5), Sigma = Sigma)
x[1, 1] <- 3
x[1, 2] <- 6
pairs(x)
ximp <- mvTopCoding(x)
xnew <- data.frame(rbind(x, ximp))
xnew$beforeafter <- rep(c(0,1), each = nrow(x))
pairs(xnew, col = xnew$beforeafter, pch = 4)

# by hand (non-robust)
x[2,2] <- NA
m <- colMeans(x, na.rm = TRUE)
s <- cov(x, use = "complete.obs")
md <- stats::mahalanobis(x, m, s)
ximp <- mvTopCoding(x, center = m, cov = s, maha = md)
plot(x)
points(ximp, col = "blue", pch = 4)

nextSdcObj
Description
internal function used to provide the undo-functionality.
Usage
nextSdcObj(obj)
Arguments
obj | a sdcMicroObj-class object |
Value
a modified sdcMicroObj-class object
Reorder data
Description
Reorders the data according to a column in the data set.
NOTE: This is an internal function used for testing the C++ function orderData which is used inside the C++ function recordSwap() to speed up performance.
Usage
orderData_cpp(data, orderIndex)
Arguments
data | micro data set containing only numeric values. |
orderIndex | column index in data defining the order by which the data set is sorted |
Value
ordered data set.
Plots for localSuppression objects
Description
This function creates barplots to display the number of suppressed values in categorical key variables to achieve k-anonymity.
Usage
## S3 method for class 'localSuppression'
plot(x, ...)
Arguments
x | an object derived from localSuppression() |
... | Additional arguments, currently available are:
|
Value
a ggplot plot object
Author(s)
Bernhard Meindl, Matthias Templ
See Also
Examples
data(francdat)

Plot functions for objects of class sdcMicroObj
Description
Descriptive plot function for sdcMicroObj-objects. Currently only visualization of local suppression is implemented.
Usage
## S3 method for class 'sdcMicroObj'
plot(x, type = "ls", ...)
Arguments
x | An object of class sdcMicroObj |
type | specifies what kind of plot will be generated
|
... | currently ignored |
Value
a ggplot plot object or (invisible) NULL if local suppression using kAnon() has not been applied
Author(s)
Bernhard Meindl
Examples
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c("urbrur", "roof", "walls", "relat", "sex"),
  w = "sampling_weight")
sdc <- kAnon(sdc, k = 3)
plot(sdc, type = "ls")

Comparison plots
Description
Plots for the comparison of the original data and perturbed data.
Usage
plotMicro(x, p, which.plot = 1:3)Arguments
x | an output object of microaggregation() |
p | necessary parameter for the box cox transformation ( |
which.plot | which plot should be created?
|
Details
Univariate and multivariate comparison plots are implemented to detect differences between the perturbed and the original data, but also to compare perturbed data produced by different methods.
Value
returns NULL; the selected plot is displayed
Author(s)
Matthias Templ
References
Templ, M. and Meindl, B., Software Development for SDC in R, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 4302, pp. 347-359, 2006.
See Also
Examples
data(free1)
df <- as.data.frame(free1)[, 31:34]
m1 <- microaggregation(df, method = "onedims", aggr = 3)
plotMicro(m1, p = 1, which.plot = 1)
plotMicro(m1, p = 1, which.plot = 2)
plotMicro(m1, p = 1, which.plot = 3)

Post Randomization
Description
To be used on categorical data stored as factors. The algorithm randomly changes the values of variables in selected records (usually the risky ones) according to an invariant probability transition matrix or a custom-defined transition matrix.
Usage
pram(obj, variables = NULL, strata_variables = NULL, pd = 0.8, alpha = 0.5)
Arguments
obj | Input data. Allowed input data are objects of class factor, data.frame or sdcMicroObj-class |
variables | Names of variables in |
strata_variables | names of variables for stratification (will be set automatically for an object of class sdcMicroObj). One can also specify an integer vector or factor that specifies the desired groups. This vector must match the dimension of the input data set, however. For a possible use case, have a look at the examples. |
pd | minimum diagonal entries for the generated transition matrix P. Either a vector of length 1 (which is recycled) or a vector of the same length as the number of variables that should be postrandomized. It is also possible to set
It is also possible to combine the different ways. For details have a look at the examples. |
alpha | amount of perturbation for the invariant Pram method. This is a numeric vector of length 1 (that will be recycled if necessary) or a vector of the same length as the number of variables. If one specifies a transition matrix directly, |
Value
a modified sdcMicroObj object or a new object containing the original and post-randomized variables (with suffix "_pram").
Note
Deprecated method 'pram_strata' is no longer available in sdcMicro > 4.5.0.
Author(s)
Alexander Kowarik, Matthias Templ, Bernhard Meindl
References
https://www.gnu.org/software/glpk/
Kowarik, A. and Templ, M. and Meindl, B. and Fonteneau, F. and Prantner, B.:Testing of IHSN Cpp Code and Inclusion of New Methods into sdcMicro,in: Lecture Notes in Computer Science, J. Domingo-Ferrer, I. Tinnirello(editors.); Springer, Berlin, 2012, ISBN: 978-3-642-33626-3, pp. 63-77.doi:10.1007/978-3-642-33627-0_6
Templ, M. and Kowarik, A. and Meindl, B.:Statistical Disclosure Control forMicro-Data Using the R Package sdcMicro. in: Journal of Statistical Software,67 (4), 1–36, 2015.doi:10.18637/jss.v067.i04
Templ, M.:Statistical Disclosure Control for Microdata: Methods and Applications in R.in: Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4.doi:10.1007/978-3-319-50272-4
Examples
data(testdata)
## donttest is necessary because of
## Examples with CPU time > 2.5 times elapsed time
## caused by using C++ code and/or data.table

## using a factor variable as input
res <- pram(as.factor(testdata$roof))
print(res)
summary(res)

## using a data.frame as input
## pram can only be applied to factors
## --> we have to recode to factors beforehand
testdata$roof <- factor(testdata$roof)
testdata$walls <- factor(testdata$walls)
testdata$water <- factor(testdata$water)

## pram() is applied within subgroups defined by
## variables "urbrur" and "sex"
res <- pram(
  obj = testdata,
  variables = "roof",
  strata_variables = c("urbrur", "sex"))
print(res)
summary(res)

## default parameters (pd = 0.8 and alpha = 0.5) for the generation
## of the invariant transition matrix will be used for all variables
res1 <- pram(
  obj = testdata,
  variables = c("roof", "walls", "water"))
print(res1)

## specific parameter settings for each variable
res2 <- pram(
  obj = testdata,
  variables = c("roof", "walls", "water"),
  pd = c(0.95, 0.8, 0.9),
  alpha = 0.5)
print(res2)

## detailed information on pram-parameters (such as the transition matrix 'Rs')
## is stored in the output, eg. for variable 'roof'
#attr(res2, "pram_params")$roof

## we can also specify a custom transition-matrix directly
mat <- diag(length(levels(testdata$roof)))
rownames(mat) <- colnames(mat) <- levels(testdata$roof)
res3 <- pram(
  obj = testdata,
  variables = "roof",
  pd = mat)
print(res3) # of course, nothing has changed!

## it is possible to use a transition matrix for a variable and use the 'traditional' way
## of specifying a number for the minimal diagonal entries of the transition matrix
## for other variables. In this case we must supply `pd` as list.
res4 <- pram(
  obj = testdata,
  variables = c("roof", "walls"),
  pd = list(mat, 0.5),
  alpha = c(NA, 0.5))
print(res4)
summary(res4)
attr(res4, "pram_params")

## application to objects of class sdcMicro with default parameters
data(testdata2)
testdata2$urbrur <- factor(testdata2$urbrur)
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
sdc <- pram(
  obj = sdc,
  variables = "urbrur")
print(sdc, type = "pram")

## this is equal to the previous application. If argument 'variables' is NULL,
## all variables from slot 'pramVars' will be used if possible.
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight",
  pramVars = "urbrur")
sdc <- pram(sdc)
print(sdc, type = "pram")

## we can specify transition matrices for sdcMicroObj-objects too
testdata2$roof <- factor(testdata2$roof)
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
mat <- diag(length(levels(testdata2$roof)))
rownames(mat) <- colnames(mat) <- levels(testdata2$roof)
mat[1,] <- c(0.9, 0, 0, 0.05, 0.05)
sdc <- pram(
  obj = sdc,
  variables = "roof",
  pd = mat)
print(sdc, type = "pram")

## we can also have a look at the transitions
get.sdcMicroObj(sdc, "pram")$transitions

Print method for objects from class freqCalc.
Description
Print method for objects from class freqCalc.
Usage
## S3 method for class 'freqCalc'
print(x, ...)
Arguments
x | object from class freqCalc |
... | Additional arguments passed through. |
Value
information about the frequency counts for key variables for an object of class freqCalc.
Author(s)
Matthias Templ
See Also
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
f <- freqCalc(francdat, keyVars = c(2,4,5,6), w = 8)
f

Print method for objects from class indivRisk
Description
Print method for objects from class indivRisk
Usage
## S3 method for class 'indivRisk'
print(x, ...)
Arguments
x | object from class indivRisk |
... | Additional arguments passed through. |
Value
some information about the method and the final correction factor for objects of class ‘indivRisk’.
Author(s)
Matthias Templ
See Also
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
f1 <- freqCalc(francdat, keyVars = c(2,4,5,6), w = 8)
data.frame(fk = f1$fk, Fk = f1$Fk)
## individual risk calculation:
indivRisk(f1)

Print method for objects from class localSuppression
Description
Print method for objects from class localSuppression
Usage
## S3 method for class 'localSuppression'
print(x, ...)
Arguments
x | object from class localSuppression |
... | Additional arguments passed through. |
Value
Information about the frequency counts for key variables for an object of class ‘localSuppression’.
Author(s)
Matthias Templ
See Also
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
l1 <- localSuppression(francdat, keyVars = c(2, 4, 5, 6))
l1

Print method for objects from class micro
Description
printing an object of classmicro
Usage
## S3 method for class 'micro'
print(x, ...)
Arguments
x | object from class micro |
... | Additional arguments passed through. |
Value
information about the method and aggregation level for objects of class micro.
Author(s)
Matthias Templ
See Also
Examples
data(free1)
free1 <- as.data.frame(free1)
m1 <- microaggregation(free1[, 31:34], method = 'onedims', aggr = 3)
m1

Print method for objects from class modrisk
Description
Print method for objects from class modrisk
Usage
## S3 method for class 'modrisk'
print(x, ...)
Arguments
x | an object of class modrisk |
... | Additional arguments passed through. |
Value
Output of model-based risk estimation
Author(s)
Bernhard Meindl
See Also
Print method for objects from class pram
Description
Print method for objects from class pram
Usage
## S3 method for class 'pram'print(x, ...)Arguments
x | an object of class pram |
... | Additional arguments passed through. |
Value
absolute and relative frequencies of changed observations in each modified variable
Author(s)
Bernhard Meindl, Matthias Templ
See Also
Print and Extractor Functions for objects of class sdcMicroObj-class
Description
Descriptive print function for frequencies, local suppression, recoding, categorical risk and numerical risk.
Usage
## S4 method for signature 'sdcMicroObj'
print(x, type = "kAnon", docat = TRUE, ...)
Arguments
x | An object of class sdcMicroObj-class |
type | Selection of the content to be returned or printed |
docat | logical, if TRUE (default) the results will be actually printed |
... | the type argument for the print method, currently supported are:
|
Details
Possible values for the type argument of the print function are: "freq" for frequencies, "ls" for local suppression output, "pram" for results of post-randomization, "recode" for recodes, "risk" for categorical risk and "numrisk" for numerical risk.
Possible values for the type argument of the freq function are: "fk" for sample frequencies and "Fk" for weighted frequencies.
Author(s)
Alexander Kowarik, Matthias Templ, Bernhard Meindl
Examples
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c('urbrur','roof','walls','relat','sex'),
  pramVars = c('water','electcon'),
  numVars = c('expend','income','savings'),
  w = 'sampling_weight')
sdc <- microaggregation(sdc, method = "mdav", aggr = 3)
print(sdc)
print(sdc, type = "general")
print(sdc, type = "ls")
print(sdc, type = "recode")
print(sdc, type = "risk")
print(sdc, type = "numrisk")
print(sdc, type = "pram")
print(sdc, type = "kAnon")
print(sdc, type = "comp_numvars")

Print method for objects from class suda2
Description
Print method for objects from class suda2.
Usage
## S3 method for class 'suda2'
print(x, ...)
Arguments
x | an object of class suda2 |
... | additional arguments passed through. |
Value
Table of dis suda scores.
Author(s)
Matthias Templ
See Also
Random Sampling
Description
Randomly select records given a probability weight vector prob.
NOTE: This is an internal function used for testing the C++ function randSample which is used inside the C++ function recordSwap().
Usage
randSample_cpp(ID, N, prob, IDused, seed)
Arguments
ID | vector containing record IDs from which to sample |
N | integer defining the number of records to be sampled |
prob | a vector of probability weights for obtaining the elements of the vector being sampled. |
IDused | vector containing IDs which must not be sampled |
seed | integer setting the sampling seed |
Rank Swapping
Description
Swapping values within a range so that, first, the correlation structure of the original variables is preserved, and second, the values in each record are disturbed. To be used on numeric or ordinal variables where the rank can be determined and the correlation coefficient makes sense.
Usage
rankSwap(
  obj,
  variables = NULL,
  TopPercent = 5,
  BottomPercent = 5,
  K0 = NULL,
  R0 = NULL,
  P = NULL,
  missing = NA,
  seed = NULL
)
Arguments
obj | a data.frame/matrix or an sdcMicroObj-class object |
variables | names or indices of variables to which rank swapping is applied. For an object of class |
TopPercent | Percentage of largest values that are grouped togetherbefore rank swapping is applied. |
BottomPercent | Percentage of lowest values that are grouped togetherbefore rank swapping is applied. |
K0 | Subset-mean preservation factor. Preserves the means before and after rank swapping within a range based on K0. K0 is the subset-mean preservation factor such that |
R0 | Multivariate preservation factor. Preserves the correlation between variables within a certain range based on the given constant R0. We can specify the preservation factor as |
P | Rank range as percentage of total sample size. We can specify the rank range itself directly, noted as |
missing | the value to be used as missing value in the C++ routine instead of NA. If NA, a suitable value is calculated internally. Note that in the returned dataset, all NA values (if any) will be replaced with this value. |
seed | Seed. |
Details
Rank swapping sorts the values of one numeric variable by their numerical values (ranking). The restricted range is determined by the ranks of two swapped values, which cannot differ, by definition, by more than P percent of the total number of observations. Only positive P, R0 and K0 are used and only one of them must be supplied. If none is supplied, sdcMicro sets parameter R0 to 0.95 internally.
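A minimal sketch restricting the rank range directly via P (assuming, as the argument description suggests, that P is given as a percentage of the sample size):

data(testdata2)
## swap only within a rank range of about 10 percent of the sample size
swapped <- rankSwap(testdata2,
  variables = c("income", "expend", "savings"),
  P = 10)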
Value
The rank-swapped data set or a modifiedsdcMicroObj-class object.
Author(s)
Alexander Kowarik for the interface, Bernhard Meindl for improvements.
For the underlying C++ code: This work is being supported by the International Household Survey Network (IHSN) and funded by a DGF Grant provided by the World Bank to the PARIS21 Secretariat at the Organisation for Economic Co-operation and Development (OECD). This work builds on previous work which is elsewhere acknowledged.
References
Moore, Jr. R. (1996) Controlled data-swapping techniques for masking public use microdata, U.S. Bureau of the Census Statistical Research Division Report Series, RR 96-04.
Kowarik, A. and Templ, M. and Meindl, B. and Fonteneau, F. and Prantner, B.:Testing of IHSN Cpp Code and Inclusion of New Methods into sdcMicro,in: Lecture Notes in Computer Science, J. Domingo-Ferrer, I. Tinnirello(editors.); Springer, Berlin, 2012, ISBN: 978-3-642-33626-3, pp. 63-77.doi:10.1007/978-3-642-33627-0_6
Examples
data(testdata2)
data_swap <- rankSwap(
  obj = testdata2,
  variables = c("age", "income", "expend", "savings"))

## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
sdc <- rankSwap(sdc)

readMicrodata
Description
reads data from various formats into R. Used in sdcApp.
Usage
readMicrodata(
  path,
  type,
  convertCharToFac = TRUE,
  drop_all_missings = TRUE,
  ...
)
Arguments
path | a file path |
type | which format does the file have. currently allowed values are
|
convertCharToFac | (logical) if TRUE, all character vectors are automatically converted to factors |
drop_all_missings | (logical) if TRUE, all variables that contain only NA values will be dropped |
... | additional parameters. Currently used only if |
Value
a data.frame or an object of class 'simple.error'. If a stata file was read in, the resulting data.frame has an additional attribute lab in which variable and value labels are stored.
Note
if type is either 'sas', 'spss' or 'stata', values read in as NaN will be converted to NA.
Author(s)
Bernhard Meindl
Targeted Record Swapping
Description
Applies targeted record swapping on micro data, considering the identification risk of each record as well as the geographic topology.
Usage
recordSwap(data, ...)

## S3 method for class 'sdcMicroObj'
recordSwap(data, ...)

## Default S3 method:
recordSwap(
  data,
  hid,
  hierarchy,
  similar,
  swaprate = 0.05,
  risk = NULL,
  risk_threshold = 0,
  k_anonymity = 3,
  risk_variables = NULL,
  carry_along = NULL,
  return_swapped_id = FALSE,
  log_file_name = "TRS_logfile.txt",
  seed = NULL,
  ...
)
Arguments
data | must be either a micro data set in the form of a 'data.table' or 'data.frame', or an 'sdcObject', see createSdcObj. |
... | parameters passed to 'recordSwap.default()' |
hid | column index or column name in 'data' which refers to the household identifier. |
hierarchy | column indices or column names of variables in 'data' which refer to the geographic hierarchy in the micro data set. For instance county > municipality > district. |
similar | vector or list of integer vectors or column names containing similarity profiles, see details for more explanations. |
swaprate | double between 0 and 1 defining the proportion of households which should be swapped, see details for more explanations |
risk | either column indices or column names in 'data', or a 'data.table', 'data.frame' or 'matrix' indicating the risk of each record at each hierarchy level. If a 'risk' matrix is supplied, the swapping procedure will not use the k-anonymity rule but the values found in this matrix for swapping. When using the risk parameter, each member of a household is expected to have been assigned the maximum risk value within that household. If this condition is not satisfied, the risk parameter is automatically adjusted to comply with it. If the risk parameter is provided, the k-anonymity rule is suppressed. |
risk_threshold | single numeric value indicating when a household is considered "high risk", e.g. when this household must be swapped. It is only used when 'risk' is not 'NULL'. The risk threshold indicates households that have to be swapped, but be aware that households with a risk lower than the threshold, yet still high enough, may be swapped as well. Only households with risk set to 0 are never swapped. Risk and risk threshold must be greater than or equal to 0. |
k_anonymity | integer defining the threshold of high risk households (counts < k) for using the k-anonymity rule |
risk_variables | column indices or column names of variables in 'data' which will be considered for estimating the risk. Only used when the k-anonymity rule is applied. |
carry_along | integer vector indicating additional variables to swap besides the hierarchy variables. These variables do not interfere with the procedure of finding a record to swap with or with calculating risk. This parameter is only used at the end of the procedure when swapping the hierarchies. We note that the variables to be used as 'carry_along' should be at household level. In case it is detected that they are at individual level (different values within 'hid'), a warning is given. |
return_swapped_id | boolean, if 'TRUE' the output includes an additional column showing the 'hid' with which a record was swapped. The new column will have the name 'paste0(hid,"_swapped")'. |
log_file_name | character, path for writing a log file. The log file contains a list of household IDs ('hid') which could not be swapped and is only created if any such households exist. |
seed | integer defining the seed for the random number generator, for reproducibility. If 'NULL' a random seed will be set using 'sample(1e5, 1)'. |
Details
The procedure accepts a 'data.frame' or 'data.table'containing all necessary information for the record swapping, e.gparameter 'hid', 'similar', 'hierarchy', etc ...First, the micro data in 'data' is ordered by 'hid' and the identificationrisk is calculated for each record in each hierarchy level. As of rightnow only counts is used as identification risk and the inverse of countsis used as sampling probability.NOTE: It will be possible to supply an identification risk for each recordand hierarchy level which will be passed down to the C++-function. Thisis however not fully implemented.
With the parameter 'k_anonymity' a k-anonymity rule is applied to definerisky households in each hierarchy level. A household is set to riskyif counts < k_anonymity in any hierarchy level and the household needsto be swapped across this hierarchy level.For instance, having a geographic hierarchy of NUTS1 > NUTS2 > NUTS3 thecounts are calculated for each geographic variable and defined'risk_variables'. If the counts for a record falls below 'k_anonymity'for hierarchy county (NUTS1, NUTS2, ...) then this record needs to be swapped across counties.Setting 'k_anonymity = 0' disables this feature and no risky householdsare defined.
After that the targeted record swapping is applied, starting from the highest to the lowest hierarchy level and cycling through all possible geographic areas at each hierarchy level, e.g. every county, every municipality in every county, etc.
At each geographic area, a set of records to be swapped is created. In all but the lowest hierarchy level, this set is ONLY made up of records which do not fulfil the k-anonymity and have not already been swapped. Those records are swapped with records not belonging to the same geographic area which have not already been swapped beforehand. Swapping refers to the interchange of the geographic variables defined in 'hierarchy'. When a record is swapped, all other records containing the same 'hid' are swapped as well.
At the lowest hierarchy level, in every geographic area the set of records to be swapped is made up of all records which do not fulfil the k-anonymity, plus as many additional records as needed so that the proportion of swapped records in the geographic area is in coherence with the 'swaprate'. If, due to the k-anonymity condition, more records have already been swapped in this geographic area, then only the records which do not fulfil the k-anonymity are swapped.
Using the parameter 'similar' one can define similarity profiles. 'similar' needs to be a list of vectors with each list entry containing column indices of 'data'. These entries are used when searching for donor households, meaning that for a specific record the set of all donor records is made up of records which have the same values in 'similar[[1]]'. It is however important to note that these variables can only be variables related to households (not persons!). If no suitable donor can be found, the next similarity profile is used, 'similar[[2]]', and the set of all donors is then made up of all records which have the same values in the column indices in 'similar[[2]]'. This procedure continues until a donor record has been found or all similarity profiles have been used; see the sketch below.
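For example, a chain of two similarity profiles could be defined as below; 'hsize' is a household-level variable of the dummy data from createDat(), while 'hsize_group' is a hypothetical, coarser variable created here purely for illustration.

library(data.table)
dat <- as.data.table(sdcMicro::createDat(1000))
# hypothetical coarser household-size variable, for illustration only
dat[, hsize_group := cut(hsize, breaks = c(0, 2, 4, Inf), labels = FALSE)]
similar <- list(
  c("hsize"),       # first profile: donors must match the exact household size
  c("hsize_group")  # fallback profile: only the coarser size group must match
)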
'swaprate' sets the proportion of households to be swapped, where a single swap counts as swapping 2 households: the sampled household and the corresponding donor. Prior to the procedure, the swaprate is applied at the lowest hierarchy level to determine the target number of swapped households in each of the lowest-level areas. If the target numbers have a decimal part, they are randomly rounded up or down such that the total number of households swapped is in coherence with the swaprate; see the sketch below.
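The randomized rounding of the per-area targets can be pictured as follows (a schematic sketch of the idea, not the routine used internally): fractional targets are rounded up with probability equal to their fractional part, so the expected total equals the sum of the fractional targets.

set.seed(1)
targets <- c(123, 87, 240, 55) * 0.05   # fractional swap targets for 4 areas
frac <- targets - floor(targets)
rounded <- floor(targets) + (runif(length(targets)) < frac)
sum(targets)   # 25.25
sum(rounded)   # an integer whose expected value equals sum(targets)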
Value
'data.table' with swapped records.
Author(s)
Johannes Gussenbauer
Examples
# generate 10000 dummy households
library(data.table)
seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- sdcMicro::createDat(nhid)

# define parameters for swapping
k_anonymity <- 1
swaprate <- .05 # 5%
similar <- list(c("hsize"))
hier <- c("nuts1", "nuts2")
risk_variables <- c("ageGroup", "national")
hid <- "hid"

## apply record swapping
# dat_s <- recordSwap(data = dat, hid = hid, hierarchy = hier, similar = similar,
#   swaprate = swaprate, k_anonymity = k_anonymity, risk_variables = risk_variables,
#   carry_along = NULL, return_swapped_id = TRUE, seed = seed)
#
## number of swapped households
# dat_s[hid != hid_swapped, uniqueN(hid)]
#
## hierarchies are not consistently swapped
# dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]
#
## use parameter carry_along
# dat_s <- recordSwap(data = dat, hid = hid, hierarchy = hier, similar = similar,
#   swaprate = swaprate, k_anonymity = k_anonymity, risk_variables = risk_variables,
#   carry_along = c("nuts3", "lau2"), return_swapped_id = TRUE, seed = seed)
#
# dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]

Targeted Record Swapping
Description
Applies targeted record swapping on a micro data set, see ?recordSwap for details.
NOTE: This is an internal function called by the R function recordSwap(). Its only purpose is to include the C++ function recordSwap() using Rcpp.
Usage
recordSwap_cpp(data, hid, hierarchy, similar_cpp, swaprate, risk, risk_threshold,
  k_anonymity, risk_variables, carry_along, log_file_name, seed = 123456L)

Arguments
data | micro data set containing only integer values. A data.frame or data.table from R needs to be transposed beforehand so that data.size() ~ number of records and data[0].size() ~ number of variables per record. NOTE: data has to be ordered by hid beforehand. |
hid | column index in data defining the household ID |
hierarchy | column indices of variables in data defining the geographic hierarchy |
similar_cpp | List where each entry corresponds to column indices of variables in data defining a similarity profile used when searching for donor households |
swaprate | double between 0 and 1 defining the proportion of households which should be swapped, see details for more explanations |
risk | vector of vectors containing risks of each individual in each hierarchy level. |
risk_threshold | double indicating the risk threshold above which every household needs to be swapped. |
k_anonymity | integer defining the threshold of high risk households (k-anonymity). This is used as k_anonymity <= counts. |
risk_variables | column indices of variables in data which will be considered for estimating the risk |
carry_along | integer vector indicating additional variables to swap besides the hierarchy variables. These variables do not interfere with the procedure of finding a record to swap with or calculating risk. This parameter is only used at the end of the procedure when swapping the hierarchies. |
log_file_name | character, path for writing a log file. The log file contains a list of household IDs ('hid') which could not have been swapped and is only created if any such households exist. |
seed | integer defining the seed for the random number generator, for reproducibility. |
Value
Returns data set with swapped records.
Remove certain variables from the data set inside a sdc object.
Description
Delete variables without changing anything else in the sdcObject (writing NAs).
Usage
removeDirectID(obj, var)

Arguments
obj | object of class sdcMicroObj-class |
var | name of the variable(s) to be removed |
Value
the modified sdcMicroObj-class object
Author(s)
Alexander Kowarik
Examples
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2, keyVars=c('urbrur','roof'),
  numVars=c('expend','income','savings'), w='sampling_weight')
sdc <- removeDirectID(sdc, var="age")

Generate an Html-report from an sdcMicroObj
Description
Summary statistics of the original and the perturbed data set
Usage
report(obj, outdir = tempdir(), filename = "SDC-Report", title = "SDC-Report",
  internal = FALSE, verbose = FALSE)

Arguments
obj | an object of class sdcMicroObj-class |
outdir | output folder |
filename | output filename |
title | Title for the report |
internal | TRUE/FALSE, if TRUE a detailed internal report is produced, else a non-disclosive overview |
verbose | TRUE/FALSE, if TRUE, some additional information is printed. |
Details
The application of this function provides you with an html-report for your sdcMicro object that contains useful summaries about the anonymization process.
Author(s)
Matthias Templ, Bernhard Meindl
Examples
data(testdata2)
sdc <- createSdcObj(
  dat = testdata2,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  w = "sampling_weight")
report(sdc)

riskyCells
Description
Allows computing risky (unweighted) combinations of key variables either up to a specified dimension or using identification levels. This mimics the approach taken in mu-argus.
Usage
riskyCells(obj, useIdentificationLevel = FALSE, threshold, ...)

Arguments
obj | a data.frame or an object of class sdcMicroObj-class |
useIdentificationLevel | (logical) specifies if tabulation should be done up to a specific dimension (useIdentificationLevel = FALSE) or using identification levels (useIdentificationLevel = TRUE) |
threshold | a numeric vector specifying the thresholds at which cells are considered to be unsafe. In case a tabulation is done up to a specific level (useIdentificationLevel = FALSE), a separate threshold can be supplied per dimension, see the examples. |
... | see possible arguments below, e.g. keyVars, maxDim and level as used in the examples |
Value
a data.table showing the number of unsafe cells and thresholds for any combination of the key variables. If the input was a sdcMicroObj object and some modifications have already been applied to the categorical key variables, the resulting output contains the number of unsafe cells both for the original and the modified data.
Author(s)
Bernhard Meindl
Examples
## data.frame method / all combinations up to maxDim
# riskyCells(obj = testdata2, keyVars = 1:5, threshold = c(50, 25, 10, 5),
#   useIdentificationLevel = FALSE, maxDim = 4)
# riskyCells(obj = testdata2, keyVars = 1:5, threshold = 10,
#   useIdentificationLevel = FALSE, maxDim = 3)

## data.frame method / using identification levels
# riskyCells(obj = testdata2, keyVars = 1:6, threshold = 20,
#   useIdentificationLevel = TRUE, level = c(1, 1, 2, 3, 3, 5))
# riskyCells(obj = testdata2, keyVars = c(1, 3, 4, 6), threshold = 10,
#   useIdentificationLevel = TRUE, level = c(1, 2, 2, 4))

## sdcMicroObj-method / all combinations up to maxDim
# testdata2[1:6] <- lapply(1:6, function(x) {
#   testdata2[[x]] <- as.factor(testdata2[[x]])
# })
# sdc <- createSdcObj(dat = testdata2,
#   keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
#   numVars = c("expend", "income", "savings"), w = "sampling_weight")
# r0 <- riskyCells(obj = sdc, useIdentificationLevel = FALSE,
#   threshold = c(20, 10, 5), maxDim = 3)

## in case key-variables have been modified, we get counts for
## original and modified data
# sdc <- groupAndRename(obj = sdc, var = "roof", before = c("5", "6", "9"), after = "5+")
# r1 <- riskyCells(obj = sdc, useIdentificationLevel = FALSE,
#   threshold = c(10, 5, 3), maxDim = 3)

## sdcMicroObj-method / using identification levels
# riskyCells(obj = sdc, useIdentificationLevel = TRUE, threshold = 10,
#   level = c(1, 1, 3, 4, 5, 5, 5))

Random sample for donor records
Description
Randomly select donor records given a probability weight vector. This sampling procedure is implemented differently than randSample_cpp to speed up performance of the C++ function recordSwap().
NOTE: This is an internal function used for testing the C++ function sampleDonor which is used inside the C++ function recordSwap().
Usage
sampleDonor_cpp(data, similar_cpp, hid, IDswap, IDswap_pool_vec, prob, seed = 123456L)

Arguments
data | micro data containing the hierarchy levels and household ID |
similar_cpp | List where each entry corresponds to column indices of variables in data defining a similarity profile used when searching for donor households |
hid | column index in data defining the household ID |
IDswap | vector containing records for which a donor needs to be sampled |
IDswap_pool_vec | set from which 'IDswap' was drawn |
prob | a vector of probability weights for obtaining the elements of the vector being sampled. |
seed | integer setting the sampling seed |
sdcApp
Description
starts the graphical user interface developed with shiny.
Usage
sdcApp(maxRequestSize = 50, debug = FALSE, theme = "IHSN", ..., shiny.server = FALSE)

Arguments
maxRequestSize | (numeric) number defining the maximum allowed file size (in megabytes) for uploaded files, defaults to 50 MB |
debug | logical if |
theme | select stylesheet for the interface. Supported choices include "IHSN" (the default) and "flatly" (see the example). |
... | arguments (e.g |
shiny.server | Setting this parameter to |
Value
starts the interactive graphical user interface which may be used to perform the anonymization process.
Examples
if (interactive()) {
  sdcApp(theme = "flatly")
}

Class "sdcMicroObj"
Description
Class to save all information about the SDC process
Usage
createSdcObj(dat, keyVars, numVars = NULL, pramVars = NULL, ghostVars = NULL,
  weightVar = NULL, hhId = NULL, strataVar = NULL, sensibleVar = NULL,
  excludeVars = NULL, options = NULL, seed = NULL, randomizeRecords = FALSE,
  alpha = 1)

undolast(object)

strataVar(object) <- value

## S4 replacement method for signature 'sdcMicroObj,characterOrNULL'
strataVar(object) <- value

Arguments
dat | The microdata set. A numeric matrix or data frame containing the data. |
keyVars | Indices or names of categorical key variables. They must, of course, match the columns of ‘dat’. |
numVars | Index or names of continuous key variables. |
pramVars | Indices or names of categorical variables considered to be pramed. |
ghostVars | if specified, a list with each element being a list of exactly two elements. The first element must be a character vector specifying exactly one variable name that was also specified as a categorical key variable (keyVars); the second element must be a character vector of variable names that should be linked to this key variable, i.e. that should receive the same suppression pattern (see the examples). |
weightVar | Indices or name determining the vector of sampling weights. |
hhId | Index or name of the cluster ID (if available). |
strataVar | Indices or names of stratification variables. |
sensibleVar | Indices or names of sensible variables (for l-diversity) |
excludeVars | which variables of dat should not be included in the sdcMicroObj; they are dropped from the origData slot (see the examples) |
options | additional options (if specified, a list must be used as input) |
seed | (numeric) number specifying the seed which will be set to allow for reproducibility. The number will be rounded and saved as element 'seed' in slot options. |
randomizeRecords | (logical) if |
alpha | numeric between 0 and 1 specifying how much keys containing missing values (NAs) should contribute to the calculation of frequency counts |
object | a sdcMicroObj-class object |
value | the new value for slot @strataVar (used in the strataVar<- replacement method) |
Value
a sdcMicroObj-class object
an object of class sdcMicroObj with modified slot @strataVar
Objects from the Class
Objects can be created by calls of the form new("sdcMicroObj", ...).
Author(s)
Bernhard Meindl, Alexander Kowarik, Matthias Templ, Elias Rut
References
Templ, M. and Meindl, B. and Kowarik, A.: Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro, Journal of Statistical Software, 67 (4), 1–36, 2015. doi:10.18637/jss.v067.i04
Examples
## we can also specify ghost (linked) variables
## these variables are linked to some categorical key variables
## and have the same suppression pattern as the variable that they
## are linked to after localSuppression() has been applied
data(testdata)
testdata$electcon2 <- testdata$electcon
testdata$electcon3 <- testdata$electcon
testdata$water2 <- testdata$water

keyVars <- c("urbrur","roof","walls","water","electcon","relat","sex")
numVars <- c("expend","income","savings")
w <- "sampling_weight"

## we want to make sure that some variables not used as key-variables
## have the same suppression pattern as variables that have been
## selected as key variables. Thus, we are using 'ghost'-variables.
ghostVars <- list()

## we want variables 'electcon2' and 'electcon3' to be linked
## to key-variable 'electcon'
ghostVars[[1]] <- list()
ghostVars[[1]][[1]] <- "electcon"
ghostVars[[1]][[2]] <- c("electcon2","electcon3")

## we want variable 'water2' to be linked to key-variable 'water'
ghostVars[[2]] <- list()
ghostVars[[2]][[1]] <- "water"
ghostVars[[2]][[2]] <- "water2"

## create the sdcMicroObj
obj <- createSdcObj(testdata, keyVars=keyVars, numVars=numVars, w=w, ghostVars=ghostVars)

## apply 3-anonymity to selected key variables
obj <- kAnon(obj, k=3); obj

## check, if the suppression patterns are identical
manipGhostVars <- get.sdcMicroObj(obj, "manipGhostVars")
manipKeyVars <- get.sdcMicroObj(obj, "manipKeyVars")
all(is.na(manipKeyVars$electcon) == is.na(manipGhostVars$electcon2))
all(is.na(manipKeyVars$electcon) == is.na(manipGhostVars$electcon3))
all(is.na(manipKeyVars$water) == is.na(manipGhostVars$water2))

## exclude some variables
obj <- createSdcObj(testdata, keyVars=c("urbrur","roof","walls"), numVars="savings",
  weightVar=w, excludeVars=c("relat","electcon","hhcivil","ori_hid","expend"))
colnames(get.sdcMicroObj(obj, "origData"))

Creates a household level file from a dataset with a household structure.
Description
It removes individual level variables and selects one record per household based on a household ID. The function can also be used for other hierarchical structures.
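Conceptually, the operation amounts to keeping one row per household ID together with the household-level columns; a minimal base-R sketch (not the function's actual implementation) is:

data(testdata, package = "sdcMicro")
hhVars <- c("urbrur", "roof", "walls", "water", "electcon", "household_weights")
# keep the first record of each household and only household-level columns
x_hh_sketch <- testdata[!duplicated(testdata$ori_hid), c("ori_hid", hhVars)]
nrow(x_hh_sketch)   # one row per household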
Usage
selectHouseholdData(dat, hhId, hhVars)

Arguments
dat | a data.frame with the full dataset |
hhId | name of the variable with the household (cluster) ID |
hhVars | character vector with names of all household level variables |
Value
a data.frame with only household level variables and one record per household
Note
It is of great importance that users include variables containing information on household IDs and sampling weights in hhVars.
Author(s)
Thijs Benschop and Bernhard Meindl
Examples
## ori-hid: household-ids; household_weights: sampling weights for households
x_hh <- selectHouseholdData(dat=testdata, hhId="ori_hid",
  hhVars=c("urbrur", "roof", "walls", "water", "electcon", "household_weights"))

set.sdcMicroObj
Description
modify sdcMicroObj-class objects depending on argument type
Usage
set.sdcMicroObj(object, type, input)

Arguments
object | a sdcMicroObj-class object |
type | a character vector of length 1 defining what to calculate/return/modify; the slot with the corresponding name will be replaced by the content of input |
input | a list depending on argument type |
Value
a sdcMicroObj-class object
Examples
sdc <- createSdcObj(testdata2,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars=c('expend','income','savings'), w='sampling_weight')
ind_pram <- match(c("sex"), colnames(testdata2))
get.sdcMicroObj(sdc, type="pramVars")
sdc <- set.sdcMicroObj(sdc, type="pramVars", input=list(ind_pram))
get.sdcMicroObj(sdc, type="pramVars")

Define Swap-Levels
Description
Define the hierarchy levels over which a record needs to be swapped according to the risk variables.
NOTE: This is an internal function used for testing the C++ function setLevels() which is applied inside recordSwap().
Usage
setLevels_cpp(risk, risk_threshold)

Arguments
risk | vector of vectors containing risks of each individual in each hierarchy level. |
risk_threshold | double defining the risk threshold beyond which a record/household needs to be swapped. This is understood as risk >= risk_threshold. |
Value
Integer vector with the hierarchy level over which each record needs to be swapped.
Calculate Risk
Description
Calculate the risk for records to be swapped and for donor records. Risks are defined by 1/counts, where counts is the number of records with the same values for the specified risk_variables in each geographic hierarchy level. This risk will be used as sampling probability for both the sampling set and the donor set; see the sketch below.
NOTE: This is an internal function used for testing the C++ function setRisk which is used inside the C++ function recordSwap().
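A minimal plain-R sketch of the 1/counts risk at one hierarchy level (illustration only, not the C++ routine), again using the dummy data from createDat() and the risk variables of the recordSwap() examples:

library(data.table)
dat <- as.data.table(sdcMicro::createDat(1000))
dat[, counts := .N, by = .(nuts1, ageGroup, national)]  # pattern frequencies per nuts1 area
dat[, risk := 1 / counts]                               # risk = inverse of counts
head(dat[, .(hid, nuts1, counts, risk)])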
Usage
setRisk_cpp(data, hierarchy, risk_variables, hid)

Arguments
data | micro data set containing only numeric values. |
hierarchy | column indices of variables in data defining the geographic hierarchy |
risk_variables | column indices of variables in data which will be considered for estimating the risk |
hid | column index in data defining the household ID |
Show
Description
show a sdcMicro object
Usage
## S4 method for signature 'sdcMicroObj'
show(object)

Arguments
object | an sdcmicro obj |
Value
a sdcMicro object
Author(s)
Bernhard Meindl
Shuffling and EGADP
Description
Data shuffling and General Additive Data Perturbation.
Usage
shuffle(obj, form, method = "ds", weights = NULL, covmethod = "spearman",
  regmethod = "lm", gadp = TRUE)

Arguments
obj | An object of class sdcMicroObj or a data.frame including the data. |
form | An object of class “formula” (or one that can be coerced to that class): a symbolic description of the model to be fitted. The response has to consist of at least two variables, and the response variables have to be of class numeric. The response variables belong to the numeric key variables (quasi-identifiers of numeric scale). The predictors can be distributed in any way (numeric, factor, ordered factor). |
method | currently either the original form of data shuffling(“ds” - default), “mvn” or “mlm”, see the detailssection. The last method is in experimental mode and almost untested. |
weights | Survey sampling weights. Automatically chosen when obj is of class sdcMicroObj-class. |
covmethod | Method for covariance estimation. “spearman”, “pearson” and “mcd” are possible. For the latter, the implementation in package robustbase is used. |
regmethod | Method for multivariate regression. “lm” and “MM” are possible. For method “MM”, the function “rlm” from package MASS is applied. |
gadp | TRUE, if the egadp results from a fit on the original data are returned. |
Details
Perturbed values for the sensitive variables are generated. The sensitive variables have to be stored as responses in the argument ‘form’, which is the usual formula interface for regression models in R.
For method “ds” the EGADP method is applied on the norm inverse percentiles. Shuffling then ranks the original values according to the GADP output. For further details, please see the references.
Method “mvn” uses a simplification and draws from the normal copulas directly before these draws are shuffled.
Method “mlm” is also a simplification. A linear model is applied, and the expected values are used as perturbed values before shuffling is applied; the shuffling step itself is sketched below.
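The shuffling step can be pictured as a reverse mapping of ranks; the following lines are a conceptual sketch only, not the code used by shuffle(): the released values are the original values, re-ordered so that their ranks follow the ranks of the model-based (perturbed) values.

set.seed(1)
x <- rlnorm(10, meanlog = 10)        # original sensitive variable
y <- x + rnorm(10, sd = sd(x) / 2)   # stand-in for a model-based perturbation
shuffled <- sort(x)[rank(y, ties.method = "first")]
all(sort(shuffled) == sort(x))       # same set of values, different order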
Value
If ‘obj’ is of class sdcMicroObj-class the corresponding slots are filled, like manipNumVars, risk and utility. If ‘obj’ is of class “data.frame” an object of class “micro” with the following entities is returned:
shConf | the shuffled numeric key variables |
egadp | the perturbed (using the gadp method) numeric key variables |
Note
In this version, the covariance method chosen is used for any covariance and correlation estimations in the whole gadp and shuffling function.
Author(s)
Matthias Templ, Alexander Kowarik, Bernhard Meindl
References
K. Muralidhar, R. Parsa, R. Sarathy (1999). A general additive data perturbation method for database security. Management Science, 45, 1399-1415.
K. Muralidhar, R. Sarathy (2006). Data shuffling - a new masking approach for numerical data. Management Science, 52(5), 658-670, 2006.
M. Templ, B. Meindl (2008). Robustification of Microdata Masking Methods and the Comparison with Existing Methods, in: Lecture Notes on Computer Science, J. Domingo-Ferrer, Y. Saygin (editors); Springer, Berlin/Heidelberg, 2008, ISBN: 978-3-540-87470-6, pp. 14-25.
See Also
Examples
data(Prestige, package="carData")
form <- formula(income + education ~ women + prestige + type, data=Prestige)
sh <- shuffle(obj=Prestige, form)
plot(Prestige[,c("income", "education")])
plot(sh$sh)
colMeans(Prestige[,c("income", "education")])
colMeans(sh$sh)
cor(Prestige[,c("income", "education")], method="spearman")
cor(sh$sh, method="spearman")

## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars=c('expend','income','savings'), w='sampling_weight')
sdc <- shuffle(sdc, method=c('ds'), regmethod=c('lm'), covmethod=c('spearman'),
  form=savings+expend ~ urbrur+walls)

subsetMicrodata
Description
allows restricting the original data to only a subset. This may be useful to test some anonymization methods. This function will only be used in the graphical user interface sdcApp.
Usage
subsetMicrodata(obj, type, n)

Arguments
obj | an object of class sdcMicroObj-class |
type | algorithm used to sample from original microdata. Currently supported choices are
|
n | numeric vector of length 1 specifying the specific parameter with respect to argument type |
Value
an object of class sdcMicroObj-class with modified slot @origData.
Author(s)
Bernhard Meindl
Suda2: Detecting Special Uniques
Description
SUDA risk measure for data from (stratified) simple random sampling.
Usage
suda2(obj, ...)

Arguments
obj | a data.frame or an object of class sdcMicroObj-class |
... | see arguments below
|
Details
Suda 2 is a recursive algorithm for finding Minimal Sample Uniques. The algorithm generates all possible variable subsets of the defined categorical key variables and scans them for unique patterns in the subsets of variables. The fewer variables needed to obtain uniqueness, the higher the risk of the corresponding observation; a toy illustration of this idea is given below.
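The idea of scanning variable subsets for sample uniques can be illustrated with a small base-R toy example (this is not the recursive SUDA2 algorithm itself):

d <- data.frame(a = c(1, 1, 2, 2, 1), b = c(1, 2, 1, 2, 1))
subset_counts <- function(cols) {
  key <- do.call(paste, c(d[cols], sep = "\r"))
  ave(rep(1L, nrow(d)), key, FUN = length)  # frequency of each record's pattern
}
subset_counts(c("a", "b"))  # records 2, 3 and 4 are unique on the pair {a, b} ...
subset_counts("a")          # ... but not on "a" alone ...
subset_counts("b")          # ... nor on "b" alone: {a, b} is their minimal sample unique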
Value
A modified sdcMicroObj object or the following list:
- ContributionPercent: the contribution of each key variable to the SUDA score, calculated for each row.
- score: the suda score.
- disScore: the dis suda score.
- attribute_contributions: a data.frame showing how much of the total risk is contributed by each variable, stored in the following two variables: variable (the name of the variable) and contribution (how much risk the variable contributes to the total risk).
- attribute_level_contributions: the risks of each attribute level as a data.frame with the following three columns: variable (the variable name), attribute (the relevant level codes) and contribution (the risk of this level within the variable).
Note
Since version >5.0.2, the computation of suda-scores has changed and is now by default as described in the original paper by Elliot et al.
Author(s)
Alexander Kowarik and Bernhard Meindl (based on the C++ code from the Organisation For Economic Co-Operation And Development).
For the C++ code: This work is being supported by the International Household Survey Network and funded by a DGF Grant provided by the World Bank to the PARIS21 Secretariat at the Organisation for Economic Co-operation and Development (OECD). This work builds on previous work which is elsewhere acknowledged.
References
C. J. Skinner; M. J. Elliot (20xx) A Measure of Disclosure Risk for Microdata. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 64 (4), pp 855–867.
M. J. Elliot, A. Manning, K. Mayes, J. Gurd and M. Bane (20xx) SUDA: A Program for Detecting Special Uniques, Using DIS to Modify the Classification of Special Uniques
Anna M. Manning, David J. Haglin, John A. Keane (2008) A recursive search algorithm for statistical disclosure assessment. Data Min Knowl Disc 16:165–196
Templ, M. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer International Publishing, 287 pages, 2017. ISBN 978-3-319-50272-4. doi:10.1007/978-3-319-50272-4
Summary method for objects from class freqCalc
Description
Summary method for objects of class ‘freqCalc’ to provide information about local suppressions.
Usage
## S3 method for class 'freqCalc'
summary(object, ...)

Arguments
object | object from class freqCalc |
... | Additional arguments passed through. |
Details
Shows the amount of local suppressions on each variable in which local suppression was applied.
Value
Information about local suppression in each variable (only if a local suppression is already done).
Author(s)
Matthias Templ
See Also
Examples
## example from Capobianchi, Polettini and Lucarelli:
data(francdat)
f <- freqCalc(francdat, keyVars=c(2,4,5,6), w=8)
f
f$fk
f$Fk

## individual risk calculation:
indivf <- indivRisk(f)
indivf$rk

## Local Suppression
localS <- localSupp(f, keyVar=2, threshold=0.25)
f2 <- freqCalc(localS$freqCalc, keyVars=c(4,5,6), w=8)
summary(f2)

Summary method for objects from class micro
Description
Summary method for objects from class ‘micro’.
Usage
## S3 method for class 'micro'
summary(object, ...)

Arguments
object | objects from class micro |
... | Additional arguments passed through. |
Details
This function computes several measures of information loss; the returned components are listed below, and a small sketch of one such measure is given after the table.
Value
meanx | A conventional summary of the original data |
meanxm | A conventional summary of the microaggregated data |
amean | average relative absolute deviation of means |
amedian | average relative absolute deviation of medians |
aonestep | average relative absolute deviation of onestep from median |
devvar | average relative absolute deviation of variances |
amad | average relative absolute deviation of the mad |
acov | average relative absolute deviation of covariances |
arcov | average relative absolute deviation of robust (with mcd) covariances |
acor | average relative absolute deviation of correlations |
arcor | average relative absolute deviation of robust (with mcd) correlations |
acors | average relative absolute deviation of rank-correlations |
adlm | average absolute deviation of lm regression coefficients (without intercept) |
adlts | average absolute deviation of lts regression coefficients (without intercept) |
apcaload | average absolute deviation of pca loadings |
apppacaload | average absolute deviation of robust (with projection pursuit approach) pca loadings |
atotals | average relative absolute deviation of totals |
pmtotals | average relative deviation of totals |
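As an illustration of how such a deviation measure can be computed, an amean-type quantity can be sketched as below; this is one plausible formulation for illustration only, and the exact formulas used by summary.micro() may differ.

# "average relative absolute deviation of means" between original data x
# and perturbed data xm (illustrative formulation only)
amean_sketch <- function(x, xm) {
  100 * mean(abs(colMeans(xm) - colMeans(x)) / abs(colMeans(x)))
}
set.seed(1)
x  <- matrix(rlnorm(300), ncol = 3)
xm <- x + rnorm(300, sd = 0.05)   # stand-in for microaggregated/perturbed data
amean_sketch(x, xm)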
Author(s)
Matthias Templ
References
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php
See Also
Examples
data(Tarragona)
m1 <- microaggregation(Tarragona, method = "onedims", aggr = 3)
summary(m1)

Summary method for objects from class pram
Description
Summary method for objects from class ‘pram’ to provide information about transitions.
Usage
## S3 method for class 'pram'
summary(object, ...)

Arguments
object | object from class ‘pram’ |
... | Additional arguments passed through. |
Details
Shows various information about the transitions.
Value
The summary of object from class ‘pram’.
Author(s)
Matthias Templ and Bernhard Meindl
References
Templ, M. Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php
See Also
Examples
data(free1)
x <- as.factor(free1[,"MARSTAT"])
x2 <- pram(x)
x2
summary(x2)

A real-world data set on household income and expenditures
Description
Example microdata sets (testdata and testdata2) with household-level income and expenditure variables, used throughout the package examples.
Format
testdata: a data frame with 4580 observations on the following 15 variables.
- urbrur
a numeric vector
- roof
a numeric vector
- walls
a numeric vector
- water
a numeric vector
- electcon
a numeric vector
- relat
a numeric vector
- sex
a numeric vector
- age
a numeric vector
- hhcivil
a numeric vector
- expend
a numeric vector
- income
a numeric vector
- savings
a numeric vector
- ori_hid
a numeric vector
- sampling_weight
a numeric vector
- household_weights
a numeric vector
testdata2: A data frame with 93 observations on the following 19 variables.
- urbrur
a numeric vector
- roof
a numeric vector
- walls
a numeric vector
- water
a numeric vector
- electcon
a numeric vector
- relat
a numeric vector
- sex
a numeric vector
- age
a numeric vector
- hhcivil
a numeric vector
- expend
a numeric vector
- income
a numeric vector
- savings
a numeric vector
- ori_hid
a numeric vector
- sampling_weight
a numeric vector
- represent
a numeric vector
- category_count
a numeric vector
- relat2
a numeric vector
- water2
a numeric vector
- water3
a numeric vector
References
The International Household Survey Network, www.ihsn.org
Examples
head(testdata)
head(testdata2)

Top and Bottom Coding
Description
Function for Top and Bottom Coding.
Usage
topBotCoding(obj, value, replacement, kind = "top", column = NULL)

Arguments
obj | a numeric vector, a data.frame or an object of class sdcMicroObj-class |
value | limit, from where it should be top- or bottom-coded |
replacement | replacement value. |
kind | top or bottom |
column | variable name in case the input is a data.frame or an object of class sdcMicroObj-class |
Details
Extreme values larger or lower than value are replaced by a different value (replacement) in order to reduce the disclosure risk.
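For a plain numeric vector the operation reduces to a simple replacement; a base-R sketch of the idea (not the function's internals):

x <- c(1200, 3400, 8700, 15000, 9100)
value <- 9000; replacement <- 9100
x[x > value] <- replacement   # top coding: values above the limit are capped
x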
Value
Top or bottom coded data or a modified sdcMicroObj-class object.
Note
top-/bottom coding of factors is no longer possible as of sdcMicro >=4.7.0
Author(s)
Matthias Templ and Bernhard Meindl
References
Templ, M. and Kowarik, A. and Meindl, B. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software, 67 (4), 1–36, 2015. doi:10.18637/jss.v067.i04
See Also
Examples
data(free1)
res <- topBotCoding(free1[,"DEBTS"], value=9000, replacement=9100, kind="top")
max(res)

data(testdata)
range(testdata$age)
testdata <- topBotCoding(testdata, value=80, replacement=81, kind="top", column="age")
range(testdata$age)

## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars=c('expend','income','savings'), w='sampling_weight')
sdc <- topBotCoding(sdc, value=500000, replacement=1000, column="income")
testdataout <- extractManipData(sdc)

Comparison of different microaggregation methods
Description
A Function for the comparison of different perturbation methods.
Usage
valTable(x,
  method = c("simple", "onedims", "clustpppca", "addNoise: additive", "swappNum"),
  measure = "mean", clustermethod = "clara", aggr = 3, nc = 8, transf = "log",
  p = 15, noise = 15, w = 1:dim(x)[2], delta = 0.1)

Arguments
x | a |
method | character vector defining names of microaggregation-, adding-noiseor rank swapping methods. |
measure | FUN for aggregation. Possible values are mean (default), median, trim, onestep. |
clustermethod | clustermethod, if a method will need a clustering procedure |
aggr | aggregation level (default=3) |
nc | number of clusters. Necessary, if a method will need a clustering procedure |
transf | Transformation of variables before clustering. |
p | Swapping range, if method swappNum has been chosen |
noise | noise addition, if an addNoise method has been chosen |
w | variables for swapping, if method swappNum has been chosen |
delta | parameter for adding noise method |
Details
Tabularize the output from summary.micro(). Will be enhanced to all perturbation methods in future versions.
Methods for adding noise should be named via addNoise:{method}, e.g. addNoise:correlated, where {method} specifies the desired method as described in addNoise(); see the call sketch below.
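For instance, a comparison that includes a correlated-noise variant could be requested as sketched below; this is an untested call sketch following the naming scheme above, with the data and the remaining methods taken from the Examples section.

data(Tarragona)
valTable(
  x = Tarragona[100:200, ],
  method = c("simple", "onedims", "addNoise:correlated"),
  noise = 15
)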
Value
Measures of information loss, split for the comparison of different methods.
Author(s)
Matthias Templ
References
Templ, M. and Meindl, B., Software Development for SDC in R, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 4302, pp. 347-359, 2006.
See Also
microaggregation(), summary.micro()
Examples
data(Tarragona)
valTable(
  x = Tarragona[100:200, ],
  method = c("simple", "onedims", "pca"))

Change a keyVariable of an object of class sdcMicroObj-class from Numeric to Factor or from Factor to Numeric
Description
Change the scale of a variable
Usage
varToFactor(obj, var)

varToNumeric(obj, var)

Arguments
obj | object of class |
var | name of the keyVariable to change |
Value
the modified sdcMicroObj-class object
Examples
## for objects of class sdcMicro:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars=c('expend','income','savings'), w='sampling_weight')
sdc <- varToFactor(sdc, var="urbrur")

writeSafeFile
Description
writes an anonymized dataset to a file. This function should be used in the graphical user interface sdcApp() only.
Usage
writeSafeFile(obj, format, randomizeRecords, fileOut, ...)

Arguments
obj | a sdcMicroObj-class object |
format | (character) specifies the output file format. Acceptedvalues are:
|
randomizeRecords | (logical) specifies, if the output records shouldbe randomized. The following options are possible:
|
fileOut | (character) file to which output should be written |
... | optional arguments used for |
Value
invisible NULL if the file was successfully written
Author(s)
Bernhard Meindl