Movatterモバイル変換

Normally, R processes data column-by-column. The data step allows youto process data row-by-row. Row-by-row processing of data is useful whenyou have related columns, and wish to perform conditional logic on thosecolumns. Thedatastep() function allows you to realize thisstyle of data processing. It is particularly advantageous when you wishto perform deeply nested conditional logic. It is also very useful forby-group processing.

Keep, Drop, and Rename

The data step has parameters to perform basic shaping of theresulting data frame. These parameters are ‘keep’, ‘drop’, and ‘rename’.For example, the above data step could have been performed by sendingall columns into the data step, and keeping only the desired columns.Using thekeep parameter also allows you to order theresulting columns.

Example 2: Keeping Data Step Variables

library(libr)# Keep and order output columnsdf<-datastep(mtcars[1:10,],keep =c("mpg","cyl","disp","mpgcat","recdt"), {if (mpg>=20)      mpgcat<-"High"else      mpgcat<-"Low"    recdt<-as.Date("1974-06-10")if (cyl==8)      is8cyl<-TRUE  })df#                    mpg cyl  disp mpgcat      recdt# Mazda RX4         21.0   6 160.0   High 1974-06-10# Mazda RX4 Wag     21.0   6 160.0   High 1974-06-10# Datsun 710        22.8   4 108.0   High 1974-06-10# Hornet 4 Drive    21.4   6 258.0   High 1974-06-10# Hornet Sportabout 18.7   8 360.0    Low 1974-06-10# Valiant           18.1   6 225.0    Low 1974-06-10# Duster 360        14.3   8 360.0    Low 1974-06-10# Merc 240D         24.4   4 146.7   High 1974-06-10# Merc 230          22.8   4 140.8   High 1974-06-10# Merc 280          19.2   6 167.6    Low 1974-06-10

By Group Processing

Thedatastep() function also has the capabilities ofperforming by-group processing. A by-group is accomplished using theby parameter, and passing a vector of column names thatdefine the group. Once a by-group is defined, thefirst.andlast. automatic variables become active, which allowyou to identify the boundaries between groups. Note that, by default,your data must be sorted properly before sending it into the data step.To turn the sort check off, set thesort_check parameter toFALSE.

Example 4: By Groups

library(libr)# Identify start and end of by-groupsdf<-datastep(mtcars[1:10,],keep =c("mpg","cyl","gear","grp"),by =c("gear"),sort_check =FALSE, {if (first.& last.)      grp<-"Start - End"elseif (first.)      grp<-"Start"elseif (last.)      grp<-"End"else      grp<-"-"  })df#                    mpg cyl gear   grp# Mazda RX4         21.0   6    4 Start# Mazda RX4 Wag     21.0   6    4     -# Datsun 710        22.8   4    4   End# Hornet 4 Drive    21.4   6    3 Start# Hornet Sportabout 18.7   8    3     -# Valiant           18.1   6    3     -# Duster 360        14.3   8    3   End# Merc 240D         24.4   4    4 Start# Merc 230          22.8   4    4     -# Merc 280          19.2   6    4   End

By Group Processing of Multiple Variables

If desired, you can pass multiple variables on thebyparameter. When there are multiple by groups, thefirst.andlast. automatic variables described above willrepresent an “or” combination of values for all by-variables. Inaddition, automatic variables will be created for each variable in theby group, similar to SAS®. Observe:

Example 5: Multiple By Groups

library(libr)# Create sample datadf<-data.frame(HairEyeColor)[seq(2,32,2), ]# Sort by groupsdf<-sort(df,by =c("Sex","Hair"))# Identify start and end of by-groupsdf2<-datastep(df,drop =c("Eye","Freq"),by =c("Sex","Hair"), {    fSex<- first.Sex    lSex<- last.Sex    fHair<- first.Hair    lHair<- last.Hair  })df2#     Hair    Sex  fSex  lSex fHair lHair# 1  Brown   Male  TRUE FALSE  TRUE FALSE# 2  Brown   Male FALSE FALSE FALSE FALSE# 3  Brown   Male FALSE FALSE FALSE FALSE# 4  Brown   Male FALSE FALSE FALSE  TRUE# 5  Blond   Male FALSE FALSE  TRUE FALSE# 6  Blond   Male FALSE FALSE FALSE FALSE# 7  Blond   Male FALSE FALSE FALSE FALSE# 8  Blond   Male FALSE  TRUE FALSE  TRUE# 9  Brown Female  TRUE FALSE  TRUE FALSE# 10 Brown Female FALSE FALSE FALSE FALSE# 11 Brown Female FALSE FALSE FALSE FALSE# 12 Brown Female FALSE FALSE FALSE  TRUE# 13 Blond Female FALSE FALSE  TRUE FALSE# 14 Blond Female FALSE FALSE FALSE FALSE# 15 Blond Female FALSE FALSE FALSE FALSE# 16 Blond Female FALSE  TRUE FALSE  TRUE

The abovefirst.Sex,last.Sex,first.Hair, andlast.Hair variables may alsobe used in conditions, functions, or any other expression inside yourdatastep. Note that likefirst. andlast. theyare dropped automatically at the end of the datastep. If you want toretain their values, assign them to a new variable as shown above.

Using Summary Functions

There may be times when you want to combine row-by-row conditionalprocessing with column-by-column vector operations. For example, let’ssay you want to calculate a mean and then perform conditional processingon that mean. This situation can be handled using thecalculate parameter on thedatastep()function. The function will execute thecalculate blockfirst, add any assigned variables to the data frame, and then executethe data step. Below is an example of such a scenario:

Example 6: Calculate Block

library(libr)# Categorize mpg as above or below the meandf<-datastep(mtcars,keep =c("mpg","cyl","mean_mpg","mpgcat"),calculate = { mean_mpg=mean(mpg) },  {if (mpg>= mean_mpg)      mpgcat<-"High"else      mpgcat<-"Low"  })df[1:10,]#                    mpg cyl mean_mpg mpgcat# Mazda RX4         21.0   6 20.09062   High# Mazda RX4 Wag     21.0   6 20.09062   High# Datsun 710        22.8   4 20.09062   High# Hornet 4 Drive    21.4   6 20.09062   High# Hornet Sportabout 18.7   8 20.09062    Low# Valiant           18.1   6 20.09062    Low# Duster 360        14.3   8 20.09062    Low# Merc 240D         24.4   4 20.09062   High# Merc 230          22.8   4 20.09062   High# Merc 280          19.2   6 20.09062    Low

Data Steps with`dplyr`

Note that thedatastep() function is pipe-friendly, andcan be combined withdplyr functions in a datapipeline. Also note that thedatastep() function willrecognize any group attributes added by thegroup_by()function. Therefore, within adplyr pipeline, it is notnecessary to use anydatastep parameters. The followingexample recreates the above data frame from Example 5, but with adplyr pipeline.

Example 7: Data Pipeline

library(libr)library(dplyr)library(magrittr)# Add datastep to dplyr pipelinedf<- mtcars%>%select(mpg, cyl, gear)%>%mutate(mean_mpg =mean(mpg))%>%datastep({if (mpg>= mean_mpg)      mpgcat<-"High"else      mpgcat<-"Low"  })%>%filter(row_number()<=10)df#     mpg cyl gear mean_mpg mpgcat# 1  21.0   6    4 20.09062   High# 2  21.0   6    4 20.09062   High# 3  22.8   4    4 20.09062   High# 4  21.4   6    3 20.09062   High# 5  18.7   8    3 20.09062    Low# 6  18.1   6    3 20.09062    Low# 7  14.3   8    3 20.09062    Low# 8  24.4   4    4 20.09062   High# 9  22.8   4    4 20.09062   High# 10 19.2   6    4 20.09062    Low

Data Attributes

Thelibr package recognizes several useful dataattributes that are not normally recognized by other R functions. Forexample, it is very convenient to assignlabel anddescription attributes to your columns, so other people canunderstand what data the columns contain.

For this reason, thedatastep() function provides anattrib parameter that allows you to supply such attributes aspart of a data step. Attributes are assigned with a named list and thedsattr() object.

Example 8: Attributes

library(libr)# Assign label attributes to all columnsdf<-datastep(mtcars[1:10, ],keep =c("mpg","cyl","mpgcat"),calculate = { mean_mpg=mean(mpg) },attrib =list(mpg =dsattr(label ="Miles Per Gallon"),cyl =dsattr(label ="Cylinders"),mpgcat =dsattr(label ="Mileage Category")), {if (mpg>= mean_mpg)      mpgcat<-"High"else      mpgcat<-"Low"  })# View attributes in dictionarydictionary(df)# # A tibble: 3 x 10#   Name  Column Class     Label            Description Format Width Justify  Rows   NAs#   <chr> <chr>  <chr>     <chr>            <chr>       <lgl>  <int> <chr>   <int> <int># 1 df    mpg    numeric   Miles Per Gallon NA          NA        NA NA         10     0# 2 df    cyl    numeric   Cylinders        NA          NA        NA NA         10     0# 3 df    mpgcat character Mileage Category NA          NA         4 NA         10     0

Data Step Array

As mentioned previously, R typically operates in a column-wisemanner. That is, R processes data column-by-column. But what if you needto get a sum or mean across a row?

This situation is what led to the development of thedatastep array. The data step array allows you to define a list ofcolumns and
iterate over the list inside a data step. Data step arrays are definedwith thearrays parameter, which accepts a named list ofdsarray() objects.

To see the array in action, we’ll use theAirPassengerssample data. This data shows international airline passengers by monthbetween 1949 and 1960. The data looks like this:

AirPassengers#      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec# 1949 112 118 132 129 121 135 148 148 136 119 104 118# 1950 115 126 141 135 125 149 170 170 158 133 114 140# 1951 145 150 178 163 172 178 199 199 184 162 146 166# 1952 171 180 193 181 183 218 230 242 209 191 172 194# 1953 196 196 236 235 229 243 264 272 237 211 180 201# 1954 204 188 235 227 234 264 302 293 259 229 203 229# 1955 242 233 267 269 270 315 364 347 312 274 237 278# 1956 284 277 317 313 318 374 413 405 355 306 271 306# 1957 315 301 356 348 355 422 465 467 404 347 305 336# 1958 340 318 362 348 363 435 491 505 404 359 310 337# 1959 360 342 406 396 420 472 548 559 463 407 362 405# 1960 417 391 419 461 472 535 622 606 508 461 390 432

This example illustrates how to create row totals, row means, andfind the top month using a data step array. The array has an indexer toextract values. You can use the indexer to extract a single value or asubset of values. An empty indexer will return all the values in thearray.

Example 9: Using a Data Step Array

library(libr)# Create AirPassengers Data Framedf<-as.data.frame(t(matrix(AirPassengers,12,dimnames =list(month.abb,seq(1949,1960)))),stringsAsFactors =FALSE)# Use datastep array to get year tot, mean, and top monthdat<-datastep(df,arrays =list(months =dsarray(names(df))),attrib =list(Tot =0,Mean =0,Top =""),drop ="mth",                {                  Tot<-sum(months[])                  Mean<-mean(months[])for (mthin months) {if (months[mth]==max(months[])) {                      Top<- mth                    }                  }                })dat#      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec  Tot     Mean Top# 1949 112 118 132 129 121 135 148 148 136 119 104 118 1520 126.6667 Aug# 1950 115 126 141 135 125 149 170 170 158 133 114 140 1676 139.6667 Aug# 1951 145 150 178 163 172 178 199 199 184 162 146 166 2042 170.1667 Aug# 1952 171 180 193 181 183 218 230 242 209 191 172 194 2364 197.0000 Aug# 1953 196 196 236 235 229 243 264 272 237 211 180 201 2700 225.0000 Aug# 1954 204 188 235 227 234 264 302 293 259 229 203 229 2867 238.9167 Jul# 1955 242 233 267 269 270 315 364 347 312 274 237 278 3408 284.0000 Jul# 1956 284 277 317 313 318 374 413 405 355 306 271 306 3939 328.2500 Jul# 1957 315 301 356 348 355 422 465 467 404 347 305 336 4421 368.4167 Aug# 1958 340 318 362 348 363 435 491 505 404 359 310 337 4572 381.0000 Aug# 1959 360 342 406 396 420 472 548 559 463 407 362 405 5140 428.3333 Aug# 1960 417 391 419 461 472 535 622 606 508 461 390 432 5714 476.1667 Jul

In the example above, the “Tot”, “Mean”, and “Top” columns were allcalculated using the datastep array. These types of row-wise statisticsare hard to calculate otherwise.

Filtering and Duplicating Rows

The datastep provides different ways to control which rows areoutput.

First, the function has awhere parameter to pass afilter expression to the datastep. The where clause will be executed atthe end of datastep processing. Pass in the where clause using theexpression() function. Like so:

# Prepare sample datadat<-as.data.frame(HairEyeColor)# Filter for black hair and blue eyesres<-datastep(dat,where =expression(Hair=="Black"& Eye=="Blue"),                {})res#    Hair  Eye    Sex Freq# 1 Black Blue   Male   11# 2 Black Blue Female    9

The datastep also recognizes thedelete() andoutput() functions to remove or duplicate rows from insidethe datastep. These functions give you conditional control over whichrows are output.

# Delete rows with frequencies less than 25res1<-datastep(dat, {if (Freq<25)delete()       })res1#     Hair   Eye    Sex Freq# 1  Black Brown   Male   32# 2  Brown Brown   Male   53# 3  Brown  Blue   Male   50# 4  Blond  Blue   Male   30# 5  Brown Hazel   Male   25# 6  Black Brown Female   36# 7  Brown Brown Female   66# 8  Brown  Blue Female   34# 9  Blond  Blue Female   64# 10 Brown Hazel Female   29# Only output rows for brown-eyes and frequencies over 25res2<-datastep(dat, {if (Eye=="Brown") {if (Freq>=25) {output()            }          }        })res2#    Hair   Eye    Sex Freq# 1 Black Brown   Male   32# 2 Brown Brown   Male   53# 3 Black Brown Female   36# 4 Brown Brown Female   66

You can also use theoutput() function to createdatasets from scratch, just like in SAS®. To create a dataset fromscratch, simply pass in an empty data frame and output the desiredvalues.

# Create metadatares3<-datastep(data.frame(), {          name<-"mtcars"          rows<-nrow(mtcars)          cols<-ncol(mtcars)output()          name<-"iris"          rows<-nrow(iris)          cols<-ncol(iris)output()          name<-"beaver1"          rows<-nrow(beaver1)          cols<-ncol(beaver1)output()        })res3#      name rows cols# 1  mtcars   32   11# 2    iris  150    5# 3 beaver1  114    4

Set and Merge Operations

When working with data, joining datasets is an essential activity.While there are many different functions in R to perform joins, thedatastep() “set” and “merge” parameters offer unusualflexibility.

The “set” parameter stacks two or more datasets. The “merge”parameter joins two or more datasets. Together, these two parametersallow you to perform the most common types of data combinations.

To illustrate, first let’s create up some sample datasets. Thedatasets we will create include one “region” dataset, and two “stores”datasets. Note that the columns on the stores datasets are notidentical.

# Create sample dataregion<-read.table(header =TRUE,text ='  REGION   NAME  R01      East  R02      West  R03      North  R04      South',stringsAsFactors =FALSE)# First stores datasetstores1<-read.table(header =TRUE,text ='  ID  NAME             SIZE REGION FRANCHISE  A01 "Eastern Lumber"    L    R01        T  A02 "Tri-City Hardwood" M    R02        F  A05 "Reliable Hardware" S    R01        T',stringsAsFactors =FALSE)# Extra column on this onestores2<-read.table(header =TRUE,text ='  ID  NAME             SIZE REGION  A03 "AAA Mills"         S    R05  A04 "Home and Yard"     L    R03',stringsAsFactors =FALSE)

Despite not having the same columns, the two stores datasets can beset using thedatastep() function. The function will fillin the missing values automatically. Like so:

# Set operationallstores<-datastep(stores1,set = stores2, {})# Extra values filled with NAallstores#    ID              NAME SIZE REGION FRANCHISE# 1 A01    Eastern Lumber    L    R01      TRUE# 2 A02 Tri-City Hardwood    M    R02     FALSE# 3 A05 Reliable Hardware    S    R01      TRUE# 4 A03         AAA Mills    S    R05        NA# 5 A04     Home and Yard    L    R03        NA

Let’s pretend we noticed the missing data, and decide to fill it in.We can do that by merging the missing FRANCHISE values to the secondstores dataset. First let’s create the missing data:

# Create small dataset of missing FRANCHISE valuesfranchises<-data.frame(FRANCHISE =c(F, F),stringsAsFactors =FALSE)franchises#   FRANCHISE# 1     FALSE# 2     FALSE

Next we can merge in the missing data on “stores2”, and set the twostore datasets again:

# Merge in missing FRANCHISE columnstores2mod<-datastep(stores2,merge = franchises, {})stores2mod#    ID          NAME SIZE REGION FRANCHISE# 1 A03     AAA Mills    S    R05     FALSE# 2 A04 Home and Yard    L    R03     FALSE# Set againallstores<-datastep(stores1,set = stores2mod, {})# Now everything is alignedallstores#    ID              NAME SIZE REGION FRANCHISE# 1 A01    Eastern Lumber    L    R01      TRUE# 2 A02 Tri-City Hardwood    M    R02     FALSE# 3 A05 Reliable Hardware    S    R01      TRUE# 4 A03         AAA Mills    S    R05     FALSE# 5 A04     Home and Yard    L    R03     FALSE

Observe that we did not have to specify a join condition on themerge. When no “merge_by” is indicated, the datastep will simply appendthe new columns to the right - without complaining. This behavior isvery convenient.

Now let’s do another join, but this time we will specify a joincondition. We will join in the store regions by the region ID. We willalso set up merge flags so we can see which rows were in which inputdataset.

# Merge operation - Outer Joinres<-datastep(allstores,merge = region,merge_by ="REGION",merge_in =c("inA","inB"), {})# View resultsres#     ID            NAME.1 SIZE REGION FRANCHISE NAME.2 inA inB# 1  A01    Eastern Lumber    L    R01      TRUE   East   1   1# 2  A05 Reliable Hardware    S    R01      TRUE   East   1   1# 3  A02 Tri-City Hardwood    M    R02     FALSE   West   1   1# 4  A04     Home and Yard    L    R03     FALSE  North   1   1# 5  A03         AAA Mills    S    R05     FALSE   <NA>   1   0# 6 <NA>              <NA> <NA>    R04        NA  South   0   1

Notice three things:

The “Name” field appeared in both datasets, and was thereforeappended with suffixes to distinguish them.
The region value “R05” is in the stores dataset, but not in theregion dataset. It appears this value was coded incorrectly.
Region “R04” had no stores.

Let’s try one last time to fix the above issues. We can fix thecolumn names with the “rename” parameter, and exclude rows with a“where” expression. Also, we can recode “R05” to “R04” inside thedatastep. Finally, we’ll drop the merge flags to clean up thecolumns.

# Merge operation - Left join and clean upres<-datastep(allstores,merge = region,merge_by ="REGION",merge_in =c("inA","inB"),rename =c(NAME.1 ="STORE_NAME",NAME.2 ="REGION_NAME"),where =expression(inA==TRUE),drop =c("inA","inB"),                {if (REGION=="R05") {                    REGION<-"R04"                    NAME.2<-"South"                  }                })#'# View resultsres#    ID        STORE_NAME SIZE REGION FRANCHISE REGION_NAME# 1 A01    Eastern Lumber    L    R01      TRUE        East# 2 A05 Reliable Hardware    S    R01      TRUE        East# 3 A02 Tri-City Hardwood    M    R02     FALSE        West# 4 A04     Home and Yard    L    R03     FALSE       North# 5 A03         AAA Mills    S    R04     FALSE       South

Datastep Performance

One weakness of thelibrdatastep()function is performance. The function is far slower than the equivalentSAS® datastep. The performance profile may limit the number of recordsyou are able to reasonably process with thedatastep().

One thing you can do to increase performance is to reduce the numberof rows and columns on the input data. You can perform thispre-filtering with Base R orTidyverse functions. Thisstrategy is particularly recommended if you were planning to subset thedata anyway using the “where” or “keep” options.

The Base Rsubset() function is convenient to usebecause it is always available. Here is an example showing how to reducethe size of theiris sample dataframe using Base Rsubset() before sending it to a datastep.

Example 10: Increasing Performance

# Subset the input dataset first for only needed rows and columnsdat <- subset(iris, Species == 'versicolor', c('Petal.Length', 'Petal.Width')) |>        datastep({                if (Petal.Length < 3.5)            Petal.Size <- "Short"         else if (Petal.Length > 4.5)            Petal.Size <- "Long"         else            Petal.Size <- "Medium"              })# View Some Resultsdat[1:10, ]#    Petal.Length Petal.Width Petal.Size# 1           4.7         1.4       Long# 2           4.5         1.5     Medium# 3           4.9         1.5       Long# 4           4.0         1.3     Medium# 5           4.6         1.5       Long# 6           4.5         1.3     Medium# 7           4.7         1.6       Long# 8           3.3         1.0      Short# 9           4.6         1.3       Long# 10          3.9         1.4     Medium

Next:Disclaimer

Movatterモバイル変換

Data Step Operations

Example 1: Simple Data Step

Keep, Drop, and Rename

Example 2: Keeping Data Step Variables

The Retain Parameter

Example 3: Drop, Retain, and Rename Parameters

By Group Processing

Example 4: By Groups

By Group Processing of Multiple Variables

Example 5: Multiple By Groups

Using Summary Functions

Example 6: Calculate Block

Data Steps with`dplyr`

Example 7: Data Pipeline

Data Attributes

Example 8: Attributes

Data Step Array

Example 9: Using a Data Step Array

Filtering and Duplicating Rows

Set and Merge Operations

Datastep Performance

Example 10: Increasing Performance

Movatterモバイル変換

Data Step Operations

Example 1: Simple Data Step

Keep, Drop, and Rename

Example 2: Keeping Data Step Variables

The Retain Parameter

Example 3: Drop, Retain, and Rename Parameters

By Group Processing

Example 4: By Groups

By Group Processing of Multiple Variables

Example 5: Multiple By Groups

Using Summary Functions

Example 6: Calculate Block

Data Steps withdplyr

Example 7: Data Pipeline

Data Attributes

Example 8: Attributes

Data Step Array

Example 9: Using a Data Step Array

Filtering and Duplicating Rows

Set and Merge Operations

Datastep Performance

Example 10: Increasing Performance

Data Steps with`dplyr`