The "foreign" package for R already provides facilities to import data from other statistical software packages such as SPSS or Stata, but they are limited by the way survey data are generally represented in R. Since variables in an R data frame are essentially numeric vectors or factors, any direct translation of SPSS or Stata data sets into data frames leads to a loss of information, such as variable labels or user-specified missing values. (Value labels can be preserved by translating them into factor levels, but this means losing information about the original codes. It also leads to undesired missing values if variables in the original data sets are only partially labelled.) For this reason, the "memisc" package provides functions for importing SPSS or Stata data sets into objects of the class "data.set", which it defines.
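The information loss described above can be reproduced in a few lines of base R. The following is a minimal sketch with made-up codes and labels (the names `codes`, `value_labels`, and `f` are hypothetical and not taken from any of the data sets discussed here):

```r
# Hypothetical raw codes of a survey item; the code 3 carries no value label.
codes <- c(1, 2, -99, 3, 1)

# Hypothetical value labels, as they might be stored in an SPSS or Stata file.
value_labels <- c("yes" = 1, "no" = 2, "no answer" = -99)

# Translating the value labels into factor levels:
f <- factor(codes,
            levels = unname(value_labels),
            labels = names(value_labels))

# The unlabelled code 3 has become NA ...
is.na(f)

# ... and the original codes are gone: -99 is now stored as level number 3.
as.integer(f)
```

The factor keeps the labels but neither the original codes nor the distinction between a substantive answer and a missing-value code, which is the gap that "data.set" objects are meant to fill.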
Importing data using the facilities provided by the "memisc" package consists of two steps. In the first step, a description of the data in the file is collected in an object of class "importer". In the second step, data are imported into "data.set" objects with the help of these "importer" objects. The "importer" objects contain only metadata, e.g. about variable labels, value labels, and user-defined missing values. This makes it possible to get an overview of the structure of the file without loading the complete data, which is especially advantageous if the data set is large. For example, with the help of an "importer" object it is possible to inspect the variable labels and then select only those variables from the data file that are actually needed. The data set object in R memory can then be created (if imprtr is an importer object) by calls like subset(imprtr,...), imprtr[...], or as.data.set(imprtr). Some examples are given in the following.
Note that these examples require data not included in the package (you need to register with GESIS to download the data). The vignette code cannot be run without this additional data.
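For readers without access to the GESIS data, the two-step workflow can be tried on a small file instead. The sketch below assumes that the installed version of "memisc" bundles the archive "anes/NES1948.ZIP" (it is used in the package's own help examples, as far as I am aware) and that UnZip() can extract it; if either assumption does not hold, the block is simply skipped. The object names `por_path`, `nes1948`, `var_names`, and `nes1948_ds` are chosen for illustration only.

```r
# Run only if memisc is installed; the bundled file is an assumption (see above).
if (requireNamespace("memisc", quietly = TRUE)) {
  library(memisc)

  # Extract the bundled SPSS portable file into a temporary directory.
  por_path <- UnZip("anes/NES1948.ZIP", "NES1948.POR", package = "memisc")

  # Step 1: create an importer object. Only metadata are read at this point.
  nes1948 <- spss.portable.file(por_path)
  var_names <- names(nes1948)
  print(head(var_names))

  # Step 2: load the data into a "data.set" object.
  nes1948_ds <- as.data.set(nes1948)
  print(class(nes1948_ds))
}
```

With a large file, one would typically replace the as.data.set() call in step 2 by a subset() call that selects only the variables needed.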
In order to import data from an SPSS "system" file, the binary format in which SPSS data are now usually saved and often distributed, one first needs to make the file that contains the data known to R, as in the following example:
ZA5702 <- spss.system.file("Data/ZA5702_v2-0-0.sav")
ZA5702
SPSS system file 'Data/ZA5702_v2-0-0.sav'
    with 979 variables and 3911 observations

Once the "system file" is declared using the function spss.system.file(), metadata become available, such as the number of cases and variables (as just seen) and the names and labels of the variables (as seen below):
description(ZA5702)
 study    'Studiennummer'
 version  'GESIS Archiv Version'
 year     'Erhebungsjahr'
 field    'Erhebungszeitraum'
 glescomp 'GLES-Komponente'
 survey   'Erhebung/Welle'
 lfdn     'Laufende Nummer (Kumulation)'
 vlfdn    'Laufende Nummer (Vorwahl)'
 nlfdn    'Laufende Nummer (Nachwahl)'
 datum    'Datum der Befragung (Monat/Tag/Jahr)'

(Only an extract of the full output is shown here, since the data set contains as many as 979 variables.)
An "importer" object, such as ZA5702 in this example, would also allow us to obtain a full codebook with

codebook(ZA5702)

but we refrain from showing such a codebook here, as the output would be far too long. As the inspection of the data in the file shows, most variable names have a standardised, yet non-mnemonic structure. Variables referring to questions asked in the pre-election wave of the GLES 2013 study have names starting with "v", those referring to questions asked in the post-election wave have names starting with "n", while those referring to questions asked in both waves have names starting with "nv". For a specific analysis, such variable names are not very useful, so we want to rename them. We could do this after loading the data, but it is more convenient to do the data import and the renaming in one step, as in the example below:
gles2013work <- subset(ZA5702,
                       select = c(
                           wave                  = survey,
                           intent.turnout        = v10,
                           turnout               = n10,
                           voteint.candidate     = v11aa,
                           voteint.list          = v11ba,
                           postal.vote.candidate = v12aa,
                           postal.vote.list      = v12ba,
                           vote.candidate        = n11aa,
                           vote.list             = n11ba,
                           bula                  = bl
                       ))

The variable names to the left of the equality sign are the variable names as they will appear in the data set after import, while the variable names to the right of the equality sign are the variable names as they exist in the data file.
As a demonstration of what information can be extracted from the datafile, we create a codebook for one of the items in the data set:
codebook(gles2013work$turnout)

================================================================================

   gles2013work$turnout 'Wahlbeteiligung'

--------------------------------------------------------------------------------

   Storage mode: double
   Measurement: interval
   Missing values: -Inf - -1

   Values and labels                      N Valid Total

   -99 M 'keine Angabe'                   3         0.1
   -97 M 'trifft nicht zu'               20         0.5
   -94 M 'nicht in Auswahlgesamtheit'  2003        51.2
     1   'ja, habe gewaehlt'           1596  84.7  40.8
     2   'nein, habe nicht gewaehlt'    289  15.3   7.4

        Min:   1.000
        Max:   2.000
       Mean:   1.153
   Std.Dev.:   0.360

Data from SPSS "portable" files are imported in essentially the same way as data from SPSS "system" files. The first step again is to make the data set known to R:
ZA3861 <- spss.portable.file("Data/ZA3861.por")
ZA3861
SPSS portable file 'Data/ZA3861.por'
    with 331 variables and 3263 observations

Since this file contains German umlauts (in contrast to the previous example), we need to convert the character encoding of the value labels etc. from "Latin-1" (the original encoding of the data) into the native encoding of the system (unless the computer natively uses "Latin-1" rather than a variant of UTF-8, as most Mac and Linux systems do).
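What such a conversion does to a single label can be illustrated with base R's iconv(). This is only a sketch of the underlying idea on a hand-made string, not the call used for the importer object itself (to my knowledge "memisc" offers an Iconv() function for that purpose, but treat that as an assumption and check the package documentation). The names `latin1_label` and `utf8_label` are hypothetical.

```r
# The word "gewaehlt" with a real a-umlaut, written as raw Latin-1 bytes
# (0xe4 is the a-umlaut in Latin-1).
latin1_label <- rawToChar(as.raw(c(0x67, 0x65, 0x77, 0xe4, 0x68, 0x6c, 0x74)))

# Convert from Latin-1 to UTF-8, the native encoding on most Mac and Linux systems.
utf8_label <- iconv(latin1_label, from = "latin1", to = "UTF-8")

# In UTF-8 the umlaut takes two bytes, so the byte count grows from 7 to 8,
# while the number of characters stays at 7.
nchar(latin1_label, type = "bytes")
nchar(utf8_label, type = "bytes")
nchar(utf8_label, type = "chars")
```

Without the conversion, labels containing umlauts would be displayed incorrectly, or rejected as invalid strings, on a system that expects UTF-8.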
Importer objects created from "portable" files can be examined in thesame way as importer objects created from "system" files. For example,we get a description of the variables in the data set (the variablelabels) and a codebook.
description(ZA3861)
 vvpnid   'Fallnummer'
 vsplitwo 'West-Ost-Kennung'
 vvornach 'Vor-/Nachwahl'
 vland    'Bundesland'
 v10      'Wirtschaftl. Lage allgemein'
 v20      'Wirtschaftl. Lage retrospektiv'
 v30      'Wirtschaftl. Lage prospektiv'
 v31      'Wichtigkeit Erst/Zweitstimme BTW (nicht 94)'
 v40      'Demokratiezufriedenheit'
 v50      'Staerke Politikinteresse'

To actually import the data and make them accessible for analysis we can (as above) use as.data.set() or subset(), as in this example:
work2002 <- subset(ZA3861,
                   select = c(
                       respid         = VVPNID,
                       split.wo       = VSPLITWO,
                       split.vor.nach = VVORNACH,
                       Bundesland     = VLAND,
                       Erststimme     = V69,
                       Zweitstimme    = V70,
                       Geschlecht     = VSEX,
                       GebMonat       = VMONAT,
                       GebJahr        = VJAHR,
                       Konfession     = VRELIG,
                       Kirchgang      = VKIRCHG,
                       Erwerbst       = VBERUFTG,
                       FrErwerbst     = VFRBERTG,
                       Beruf          = VBERUF,
                       Famstand       = VFAMSTDN,
                       Partner        = VPARTNER,
                       BildungP       = VPBILDGA,
                       BerufstP       = VPBERUFT,
                       FrBerufstP     = VPFBERTG,
                       BerufP         = VPBERUF,
                       ReprGewicht    = VGVWNW
                   ))

Data from more recent study components of the American National Election Study come in fixed-width format, with additional SPSS syntax files that define columns, variable labels, value labels, and missing values. memisc also provides an importer function for such data. Naturally, this requires a little more information. In addition to the actual data file, we also need a file with SPSS syntax specifying the data columns. Optionally, syntax files that define variable labels, value labels, and missing values can also be specified.
anes2008TS <- spss.fixed.file("Data/anes2008/anes2008TS_dat.txt",
                              columns.file = "Data/anes2008/anes2008TS_col.sps",
                              varlab.file  = "Data/anes2008/anes2008TS_lab.sps",
                              codes.file   = "Data/anes2008/anes2008TS_cod.sps",
                              missval.file = "Data/anes2008/anes2008TS_md.sps")
anes2008TS
SPSS fixed column file 'issues/anes2008/anes2008TS_dat.txt'
    with 1954 variables and 2322 observations
    with variable labels from file 'issues/anes2008/anes2008TS_lab.sps'
    with value labels from file 'issues/anes2008/anes2008TS_cod.sps'
    with missing value definitions from file 'issues/anes2008/anes2008TS_md.sps'

Further information about the data can now be obtained from the returned importer object in the same way as from importer objects that describe SPSS "system" or SPSS "portable" files. That is, we can use names(), description(), and codebook(). To get the data into the memory of R we can use (as above) the functions as.data.set() and subset().
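To make the role of the column definitions more concrete, the following base R sketch reads a tiny, made-up fixed-width file with read.fwf(). It is not how spss.fixed.file() works internally, and the file contents, column widths, and variable names are all invented for illustration. It only shows why a fixed-width data file cannot be interpreted without a separate description of its columns.

```r
# Two hypothetical records: columns 1-3 hold an ID, column 4 a one-digit
# answer code, columns 5-6 the respondent's age.
fwf_path <- tempfile(fileext = ".txt")
writeLines(c("001134", "002245"), fwf_path)

# The widths play the part of the SPSS "columns" syntax file: without them
# the digits of each record cannot be split into variables.
toy <- read.fwf(fwf_path,
                widths     = c(3, 1, 2),
                col.names  = c("id", "vote", "age"),
                colClasses = c("character", "integer", "integer"))
toy
```

Variable labels, value labels, and missing-value definitions have no counterpart in this base R call, which is why the optional syntax files mentioned above are useful.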
Data from Stata files (up to Stata Version 12) can be imported in the same way as data from SPSS files. The main differences are the function used for the import and the fact that user-defined missing values do not exist in Stata. See the following example:
ZA5702.dta <- Stata.file("Data/ZA5702_v2-0-0.dta")
ZA5702.dta
Stata file 'Data/ZA5702_v2-0-0.dta'
    with 874 variables and 3911 observations

gles2013work.dta <- subset(ZA5702.dta,
                           select = c(
                               wave                  = survey,
                               intent.turnout        = v10,
                               turnout               = n10,
                               voteint.candidate     = v11aa,
                               voteint.list          = v11ba,
                               postal.vote.candidate = v12aa,
                               postal.vote.list      = v12ba,
                               vote.candidate        = n11aa,
                               vote.list             = n11ba,
                               bula                  = bl
                           ))
codebook(gles2013work.dta$turnout)

================================================================================

   gles2013work.dta$turnout 'Wahlbeteiligung'

--------------------------------------------------------------------------------

   Storage mode: integer
   Measurement: nominal
   Missing values: 100 - 127

   Values and labels                           N Percent

   -99 'keine Angabe'                          3     0.1
   -98 'weiss nicht'                           0     0.0
   -97 'trifft nicht zu'                      20     0.5
   -96 'Split'                                 0     0.0
   -95 'nicht teilgenommen'                    0     0.0
   -94 'nicht in Auswahlgesamtheit'         2003    51.2
   -93 'Interview abgebrochen'                 0     0.0
   -92 'Fehler in Daten'                       0     0.0
   -86 'nicht wahlberechtigt'                  0     0.0
   -85 'nicht waehlen'                         0     0.0
   -84 'keine Erst-/Zweitstimme abgegeben'     0     0.0
   -83 'ungueltig waehlen'                     0     0.0
   -82 'keine andere Partei waehlen'           0     0.0
   -81 'noch nicht entschieden'                0     0.0
   -72 'nicht einzuschaetzen'                  0     0.0
   -71 'nicht bekannt'                         0     0.0
     1 'ja, habe gewaehlt'                  1596    40.8
     2 'nein, habe nicht gewaehlt'           289     7.4