ruler offers a set of tools for creating tidy data validation reports using the dplyr grammar of data manipulation. It is structured to be flexible and extendable in terms of creating rules and using their output.

To fully use this package, a solid knowledge of dplyr is required. The key idea behind ruler’s design is to validate data by modifying regular dplyr code with as little overhead as possible.

Some functionality is powered by the keyholder package. It is highly recommended to use its supported functions during rule construction. All one- and two-table dplyr verbs applied to local data frames are supported and considered the most appropriate way to create rules.

Installation

Example

library(dplyr)
library(ruler)

# Utility functions
is_integerish <- function(x) {
  all(x == as.integer(x))
}
z_score <- function(x) {
  abs(x - mean(x)) / sd(x)
}

# Define rule packs
my_packs <- list(
  data_packs(
    dims = . %>% summarise(
      nrow_low = nrow(.) >= 10,
      nrow_high = nrow(.) <= 15,
      ncol_low = ncol(.) >= 20,
      ncol_high = ncol(.) <= 30
    )
  ),
  group_packs(
    vs_am_num = . %>%
      group_by(vs, am) %>%
      summarise(vs_am_low = n() >= 7),
    .group_vars = c("vs", "am")
  ),
  col_packs(
    enough_col_sum = . %>% summarise_if(
      is_integerish,
      rules(is_enough = sum(.) >= 14)
    )
  ),
  row_packs(
    enough_row_sum = . %>%
      filter(vs == 1) %>%
      transmute(is_enough = rowSums(.) >= 200)
  ),
  cell_packs(
    dbl_not_outlier = . %>%
      transmute_if(
        is.numeric,
        rules(is_not_out = z_score(.) < 1)
      ) %>%
      slice(-(1:5))
  )
)

# Expose data to rules
mtcars_exposed <- mtcars %>%
  as_tibble() %>%
  expose(my_packs)

# View exposure
mtcars_exposed %>% get_exposure()
#>   Exposure
#>
#> Packs info:
#> # A tibble: 5 × 4
#>   name            type       fun        remove_obeyers
#>   <chr>           <chr>      <list>     <lgl>
#> 1 dims            data_pack  <data_pck> TRUE
#> 2 vs_am_num       group_pack <grop_pck> TRUE
#> 3 enough_col_sum  col_pack   <col_pack> TRUE
#> 4 enough_row_sum  row_pack   <row_pack> TRUE
#> 5 dbl_not_outlier cell_pack  <cell_pck> TRUE
#>
#> Tidy data validation report:
#> # A tibble: 117 × 5
#>   pack            rule       var      id value
#>   <chr>           <chr>      <chr> <int> <lgl>
#> 1 dims            nrow_high  .all      0 FALSE
#> 2 dims            ncol_low   .all      0 FALSE
#> 3 vs_am_num       vs_am_low  0.1       0 FALSE
#> 4 enough_col_sum  is_enough  am        0 FALSE
#> 5 enough_row_sum  is_enough  .all     19 FALSE
#> 6 dbl_not_outlier is_not_out mpg      15 FALSE
#> # ℹ 111 more rows

# Assert any breaker
invisible(mtcars_exposed %>% assert_any_breaker())
#>   Breakers report
#> Tidy data validation report:
#> # A tibble: 117 × 5
#>   pack            rule       var      id value
#>   <chr>           <chr>      <chr> <int> <lgl>
#> 1 dims            nrow_high  .all      0 FALSE
#> 2 dims            ncol_low   .all      0 FALSE
#> 3 vs_am_num       vs_am_low  0.1       0 FALSE
#> 4 enough_col_sum  is_enough  am        0 FALSE
#> 5 enough_row_sum  is_enough  .all     19 FALSE
#> 6 dbl_not_outlier is_not_out mpg      15 FALSE
#> # ℹ 111 more rows
#> Error: assert_any_breaker: Some breakers found in exposure.

Overview

A rule is a function which converts a data unit of interest (data, group, column, row, cell) to a logical value indicating whether this object satisfies a certain condition.
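
To make the definition concrete, here is a minimal base R sketch with one hypothetical rule per kind of data unit. The function names are illustrative and not part of ruler; in practice rules are usually written inside packs rather than as standalone functions.

```r
# Hypothetical rules (illustrative names, not part of ruler).
# Each one maps a data unit to logical value(s).

# Data rule: the whole data frame gives one logical value
has_enough_rows <- function(df) nrow(df) >= 10

# Column rule: one column gives one logical value
col_sum_is_enough <- function(col) sum(col) >= 14

# Row rule: every row gives one logical value
row_sum_is_enough <- function(df) rowSums(df) >= 200

# Cell rule: every cell of a column gives one logical value
cell_is_above_mean <- function(col) col > mean(col)

data_result <- has_enough_rows(mtcars)
col_result <- col_sum_is_enough(mtcars$am)
row_result <- row_sum_is_enough(mtcars)
cell_result <- cell_is_above_mean(mtcars$mpg)

# Data and column rules return a single value,
# row and cell rules return one value per row
str(list(data_result, col_result, row_result, cell_result))
```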

A rule pack is a function which combines several rules into one functional block. The recommended way of creating rules is by creating packs right away with the use of dplyr and magrittr’s pipe operator.
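
As a rough illustration of why a pack is "just a function", the sketch below builds a magrittr functional sequence with dplyr and calls it directly on mtcars. The pack name and thresholds are made up for this example, and the function is deliberately not wrapped in a pack constructor such as data_packs().

```r
library(dplyr)

# A hypothetical pack: a functional sequence bundling three data rules
dims_pack <- . %>%
  summarise(
    nrow_low = nrow(.) >= 10,
    nrow_high = nrow(.) <= 40,
    ncol_high = ncol(.) <= 30
  )

# The pack is an ordinary function of a data frame;
# its output has one logical column per rule
pack_output <- dims_pack(mtcars)
pack_output
```

Calling the pack by hand like this is only meant for inspecting what the rules return.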

Exposing data to rules means applying rules to data, collecting results in a common format and attaching them to the data as an exposure attribute. In this way actual exposure can be done in multiple steps and also be a part of a general data preparation pipeline.
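
A minimal sketch of exposing in two steps, assuming that a second expose() call adds its results to the exposure already attached. The pack definitions are copied from the Usage section; the variable names dims_packs, subset_packs, step_1 and step_2 are made up for this example.

```r
library(dplyr)
library(ruler)

dims_packs <- data_packs(
  data_dims = . %>% summarise(
    ncol = ncol(.) == 12,
    nrow = nrow(.) == 32
  )
)

subset_packs <- data_packs(
  vs_1 = . %>%
    filter(vs == 1) %>%
    summarise(
      nrow_low = nrow(.) > 10,
      nrow_high = nrow(.) < 30
    )
)

# First step: check dimensions of the whole data
step_1 <- mtcars %>% expose(dims_packs)

# Second step: expose the already exposed data to another pack
step_2 <- step_1 %>% expose(subset_packs)

# Packs info is expected to list packs from both steps
packs_info <- get_packs_info(step_2)
packs_info
```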

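Exposure is a format designed to contain uniform information about validation of different data units. For reproducibility it also saves information about applied packs. Basically exposure is a list with two elements:

- Packs info: a tibble with columns name, type, fun and remove_obeyers, one row per applied pack.
- Tidy data validation report: a tibble with columns pack, rule, var, id and value.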

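There are four basic combinations of var and id values which define five basic data units:

- var is '.all' and id is 0: data as a whole.
- var is not '.all' and id is 0: a group (var is not an actual column name) or a column (var is an actual column name) as a whole.
- var is '.all' and id is not 0: a row as a whole.
- var is not '.all' and id is not 0: a single cell.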
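
These combinations can be read off a report mechanically. The base R sketch below assumes, as the report printed in the Example section suggests, that var equal to '.all' means "all columns" and id equal to 0 means "all rows". The helper classify_unit() is hypothetical and not part of ruler; the report rows are copied from the Example output.

```r
# Rows copied from the validation report printed in the Example section
report <- data.frame(
  pack = c("dims", "vs_am_num", "enough_col_sum",
           "enough_row_sum", "dbl_not_outlier"),
  rule = c("nrow_high", "vs_am_low", "is_enough", "is_enough", "is_not_out"),
  var = c(".all", "0.1", "am", ".all", "mpg"),
  id = c(0L, 0L, 0L, 19L, 15L),
  value = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of ruler): name the data unit behind a row
classify_unit <- function(var, id, col_names) {
  ifelse(
    var == ".all",
    ifelse(id == 0, "data", "row"),
    ifelse(
      id != 0,
      "cell",
      # With id == 0, an actual column name means a column, otherwise a group
      ifelse(var %in% col_names, "column", "group")
    )
  )
}

report$unit <- classify_unit(report$var, report$id, names(mtcars))
report[, c("pack", "var", "id", "unit")]
```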

With exposure attached to data one can perform different kinds of actions: exploration, assertion, imputation and so on.

Usage

Creating packs

Data packs

# List of two rule packs for checking data properties
my_data_packs <- data_packs(
  # data_dims is a pack name
  data_dims = . %>% summarise(
    # ncol and nrow are rule names
    ncol = ncol(.) == 12,
    nrow = nrow(.) == 32
  ),

  # Data after subsetting should have number of rows in between 10 and 30
  # Rules are applied separately
  vs_1 = . %>%
    filter(vs == 1) %>%
    summarise(
      nrow_low = nrow(.) > 10,
      nrow_high = nrow(.) < 30
    )
)

Group packs

# List of one nameless rule pack for checking group property
my_group_packs <- group_packs(
  # Name will be imputed during exposure
  . %>%
    group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  # One should supply grouping variables for correct interpretation of output
  .group_vars = c("vs", "am")
)

Column packs

# rules() defines function predicates with necessary name imputations

# List of two rule packs for checking certain columns' properties
my_col_packs <- col_packs(
  sum_bounds = . %>% summarise_at(
    # Check only columns with names starting with 'c'
    vars(starts_with("c")),
    rules(sum_low = sum(.) > 300, sum_high = sum(.) < 400)
  ),

  # In the edge case of checking one column with one rule there is a need
  # for forcing inclusion of names in the output of summarise_at().
  # This is done with naming argument in vars()
  vs_mean = . %>% summarise_at(vars(vs = vs), rules(mean(.) > 0.5))
)

Row packs

z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

# List of one rule pack checking certain rows' property
my_row_packs <- row_packs(
  row_mean = . %>%
    mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    # Check only rows 10-15
    # Values in 'id' column of report will be based on input data (i.e. 10-15)
    # and not on output data (1-6)
    slice(10:15)
)

Cell packs

is_integerish <- function(x) {
  all(x == as.integer(x))
}

# List of two cell packs checking certain cells' property
my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>%
    transmute_if(
      # Check only integer-like columns
      is_integerish,
      rules(is_common = abs(z_score(.)) < 1)
    ) %>%
    # Check only rows 20-30
    slice(20:30),

  # The same edge case as in column rule pack
  vs_side = . %>% transmute_at(vars(vs = "vs"), rules(. > mean(.)))
)

Exposing

mtcars %>%
  expose(my_data_packs, my_group_packs) %>%
  get_exposure()
#>   Exposure
#>
#> Packs info:
#> # A tibble: 3 × 4
#>   name          type       fun        remove_obeyers
#>   <chr>         <chr>      <list>     <lgl>
#> 1 data_dims     data_pack  <data_pck> TRUE
#> 2 vs_1          data_pack  <data_pck> TRUE
#> 3 group_pack__1 group_pack <grop_pck> TRUE
#>
#> Tidy data validation report:
#> # A tibble: 3 × 5
#>   pack          rule      var      id value
#>   <chr>         <chr>     <chr> <int> <lgl>
#> 1 data_dims     ncol      .all      0 FALSE
#> 2 group_pack__1 any_cyl_6 0.0       0 FALSE
#> 3 group_pack__1 any_cyl_6 1.1       0 FALSE

mtcars %>%
  expose(my_data_packs, my_group_packs, .remove_obeyers = FALSE) %>%
  get_exposure()
#>   Exposure
#>
#> Packs info:
#> # A tibble: 3 × 4
#>   name          type       fun        remove_obeyers
#>   <chr>         <chr>      <list>     <lgl>
#> 1 data_dims     data_pack  <data_pck> FALSE
#> 2 vs_1          data_pack  <data_pck> FALSE
#> 3 group_pack__1 group_pack <grop_pck> FALSE
#>
#> Tidy data validation report:
#> # A tibble: 8 × 5
#>   pack          rule      var      id value
#>   <chr>         <chr>     <chr> <int> <lgl>
#> 1 data_dims     ncol      .all      0 FALSE
#> 2 data_dims     nrow      .all      0 TRUE
#> 3 vs_1          nrow_low  .all      0 TRUE
#> 4 vs_1          nrow_high .all      0 TRUE
#> 5 group_pack__1 any_cyl_6 0.0       0 FALSE
#> 6 group_pack__1 any_cyl_6 0.1       0 TRUE
#> # ℹ 2 more rows

By default expose() guesses the pack type if a ‘not-pack’ function is supplied. This behaviour has some edge cases but is useful for interactive use.

mtcars %>%
  expose(
    some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
    some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.)))
  ) %>%
  get_exposure()
#>   Exposure
#>
#> Packs info:
#> # A tibble: 2 × 4
#>   name           type      fun        remove_obeyers
#>   <chr>          <chr>     <list>     <lgl>
#> 1 some_data_pack data_pack <data_pck> TRUE
#> 2 some_col_pack  col_pack  <col_pack> TRUE
#>
#> Tidy data validation report:
#> # A tibble: 2 × 5
#>   pack           rule    var      id value
#>   <chr>          <chr>   <chr> <int> <lgl>
#> 1 some_data_pack nrow    .all      0 FALSE
#> 2 some_col_pack  rule__1 vs        0 FALSE

mtcars %>%
  expose(
    some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
    some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.))),
    .guess = FALSE
  ) %>%
  get_exposure()
#> Error in expose_single.default(X[[i]], ...): There is unsupported class of rule pack.

Acting after exposure

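General actions are recommended to be done with act_after_exposure(). It takes two arguments:

- .trigger: a function which takes the data with attached exposure and returns TRUE if some action needs to be performed.
- .actor: a function which takes the same data and performs the action.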

If the trigger didn’t notify then the input data is returned untouched. Otherwise the output of .actor() is returned. Note that act_after_exposure() is often used for creating side effects (printing, throwing an error etc.) and in that case should invisibly return its input (to be able to use it with the pipe).

trigger_one_pack <- function(.tbl) {
  packs_number <- .tbl %>%
    get_packs_info() %>%
    nrow()

  packs_number > 1
}

actor_one_pack <- function(.tbl) {
  cat("More than one pack was applied.\n")

  invisible(.tbl)
}

mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  act_after_exposure(
    .trigger = trigger_one_pack,
    .actor = actor_one_pack
  ) %>%
  invisible()
#> More than one pack was applied.
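
The example above uses the actor only for a side effect. Since the output of .actor() is what gets returned, an actor can also return changed data. The following is a minimal sketch under that reading: the pack, the trigger and the actor are all hypothetical, and it assumes that get_report() returns the validation report with the var, id and value columns shown earlier.

```r
library(dplyr)
library(ruler)

# Hypothetical row pack: a row obeys the rule if the car weighs less than 5
weight_packs <- row_packs(
  weight = . %>% transmute(is_light = wt < 5)
)

# Hypothetical trigger: TRUE if the report has at least one breaking row
trigger_row_breakers <- function(.tbl) {
  report <- get_report(.tbl)
  any(report$var == ".all" & report$id != 0 & !report$value)
}

# Hypothetical actor: drop the breaking rows and return the rest.
# It is only called when the trigger found breakers, so bad_ids is not empty.
actor_drop_rows <- function(.tbl) {
  report <- get_report(.tbl)
  bad_ids <- report$id[report$var == ".all" & report$id != 0 & !report$value]
  .tbl[-unique(bad_ids), , drop = FALSE]
}

cleaned <- mtcars %>%
  expose(weight_packs) %>%
  act_after_exposure(
    .trigger = trigger_row_breakers,
    .actor = actor_drop_rows
  )

nrow(mtcars) - nrow(cleaned)
```

Base R subsetting is used in the actor on purpose, so the sketch does not rely on how dplyr verbs treat the attached exposure.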

ruler has the function assert_any_breaker() which can notify about the presence of any breaker in exposure.

mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  assert_any_breaker()
#>   Breakers report
#> Tidy data validation report:
#> # A tibble: 4 × 5
#>   pack       rule               var      id value
#>   <chr>      <chr>              <chr> <int> <lgl>
#> 1 sum_bounds sum_low            cyl       0 FALSE
#> 2 sum_bounds sum_low            carb      0 FALSE
#> 3 vs_mean    rule__1            vs        0 FALSE
#> 4 row_mean   is_common_row_mean .all     15 FALSE
#> Error: assert_any_breaker: Some breakers found in exposure.
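
Because the assertion stops with an error, a script that should keep running needs to handle it. The sketch below wraps the call in base R's tryCatch(). The pack strict_packs and the fallback value NULL are choices made for this example, not something ruler prescribes.

```r
library(dplyr)
library(ruler)

# Hypothetical pack which mtcars is known to break (it has 32 rows)
strict_packs <- data_packs(
  dims = . %>% summarise(nrow_is_10 = nrow(.) == 10)
)

checked <- tryCatch(
  mtcars %>%
    expose(strict_packs) %>%
    assert_any_breaker(),
  error = function(e) {
    # Report the failure and fall back to NULL instead of stopping
    message("Validation failed: ", conditionMessage(e))
    NULL
  }
)

is.null(checked)
```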