Movatterモバイル変換


[0]ホーム

URL:


Data wrangling

library(scrutiny)

In general, scrutiny’s techniques for error detection are designedfor a focus on the essential points, cutting out time-consumingrepetition. There are some bottlenecks, however, such as enteringdecimal numbers as strings, or splitting strings that look like"7.64 (1.5)".

This vignette shows how to save your time preparing data for errordetection. It gives some general tips for these tasks, and then presentsscrutiny’s own specialized wrangling functions.

Trailing zeros

Motivation

One particular challenge when looking for numeric irregularitiesusing R is that numbers often have to be treated as strings. The reasonis that numeric values don’t preserve any trailing zeros. This is amajor problem because trailing zeros are as important to, e.g., GRIM orDEBIT as any other trailing digits would be.

The only solution I know of is to work with strings — namely, stringsthat can be converted to non-NA numeric values. I willdiscuss two ways to work with them: (1) directly entering or importingnumbers as strings, and (2) restoring trailing zeros.

Enter numbers as strings

Automated

Several R packages help to extract tables from PDF. I recommendtabulizer (not currently on CRAN; seeinstallationnotes). There are also thepdftables andpdftoolspackages.

Using tabulizer requires Java to be installed. When it works well,tabulizer is a great tool for importing tables quickly and efficiently.It automatically captures values as strings, so trailing zeros aretreated just like other digits.

However, tabulizer might sometimes struggle, especially with olderPDF files. That is most likely the fault of the PDF format itselfbecause it has no inbuilt support for tables, so any effort to extractthem faces serious ambiguities. (See below,Replace column names byrow values, for a solution to one such issue.)

If there are many tables in multiple files formatted in the same way,it can be useful to check if tabulizer reliably and accurately capturesthem. If it doesn’t, you might have to use copy and paste.

With copy and paste

Perhaps not all R users know that RStudio features an option formultiple cursors. These are especially useful in conjunction withtibble::tribble(), which is available via scrutiny. Here’show to use multiple cursors in the present context:

  1. Copy a column of numbers from PDF, pressing and holdingAlt on Windows oroption on Mac. (This worksat least in Adobe Acrobat.)
  2. Paste it into atribble() call as below.
  3. Pressing and holdingAlt/option, selectall the copied numbers.
  4. Enter quotation marks and, fortribble()’s syntax, acomma.

You should then get something like this:

flights1<- tibble::tribble(~x,"8.97","2.61","7.26","3.64","9.26","10.46","7.39",)

All that’s missing is the sample size. Add it either via anothertribble() column as above or viadplyr::mutate(), which also comes with scrutiny:

flights1<- flights1%>%  dplyr::mutate(n =28)flights1#> # A tibble: 7 × 2#>   x         n#>   <chr> <dbl>#> 1 8.97     28#> 2 2.61     28#> 3 7.26     28#> 4 3.64     28#> 5 9.26     28#> 6 10.46    28#> 7 7.39     28

Restore trailing zeros

When dealing with numbers that used to have trailing zeros but lostthem from being registered as numeric, callrestore_zeros()to format them correctly. Suppose all of the following numbersoriginally had one decimal place, but some no longer do:

vec<-c(4,6.9,5,4.2,4.8,7,4)vec%>%decimal_places()#> [1] 0 1 0 1 1 0 0

Now, get them back withrestore_zeros():

vec%>%restore_zeros()#> [1] "4.0" "6.9" "5.0" "4.2" "4.8" "7.0" "4.0"vec%>%restore_zeros()%>%decimal_places()#> [1] 1 1 1 1 1 1 1

This uses the default of going by the longest mantissa and paddingthe other strings with decimal zeros until they have that many decimalplaces. However, this is just a heuristic: The longest mantissa mightitself have lost decimal places. Specify thewidth argumentto explicitly state the desired mantissa length:

vec%>%restore_zeros(width =2)#> [1] "4.00" "6.90" "5.00" "4.20" "4.80" "7.00" "4.00"vec%>%restore_zeros(width =2)%>%decimal_places()#> [1] 2 2 2 2 2 2 2

A convenient way to restore trailing zeros in a data frame isrestore_zeros_df(). By default, it operates on all columnsthat are coercible to numeric (factors don’t count):

iris<- tibble::as_tibble(iris)iris%>%restore_zeros_df(width =3)#> # A tibble: 150 × 5#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species#>    <chr>        <chr>       <chr>        <chr>       <fct>#>  1 5.100        3.500       1.400        0.200       setosa#>  2 4.900        3.000       1.400        0.200       setosa#>  3 4.700        3.200       1.300        0.200       setosa#>  4 4.600        3.100       1.500        0.200       setosa#>  5 5.000        3.600       1.400        0.200       setosa#>  6 5.400        3.900       1.700        0.400       setosa#>  7 4.600        3.400       1.400        0.300       setosa#>  8 5.000        3.400       1.500        0.200       setosa#>  9 4.400        2.900       1.400        0.200       setosa#> 10 4.900        3.100       1.500        0.100       setosa#> # ℹ 140 more rows

Specify columns mostly like you would indplyr::select():

iris%>%restore_zeros_df(starts_with("Sepal"),width =3)#> # A tibble: 150 × 5#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species#>    <chr>        <chr>              <dbl>       <dbl> <fct>#>  1 5.100        3.500                1.4         0.2 setosa#>  2 4.900        3.000                1.4         0.2 setosa#>  3 4.700        3.200                1.3         0.2 setosa#>  4 4.600        3.100                1.5         0.2 setosa#>  5 5.000        3.600                1.4         0.2 setosa#>  6 5.400        3.900                1.7         0.4 setosa#>  7 4.600        3.400                1.4         0.3 setosa#>  8 5.000        3.400                1.5         0.2 setosa#>  9 4.400        2.900                1.4         0.2 setosa#> 10 4.900        3.100                1.5         0.1 setosa#> # ℹ 140 more rows

Split strings by parentheses

Basic usage

With summary data copied or extracted from PDF (see above), you mightencounter values presented like5.22 (0.73). Instead ofmanually teasing them apart, callsplit_by_parens():

flights2<- tibble::tribble(~drone,~selfpilot,"0.09 (0.21)","0.19 (0.13)","0.19 (0.28)","0.53 (0.10)","0.62 (0.16)","0.50 (0.11)","0.15 (0.35)","0.57 (0.16)",)flights2%>%split_by_parens()#> # A tibble: 4 × 4#>   drone_x drone_sd selfpilot_x selfpilot_sd#>   <chr>   <chr>    <chr>       <chr>#> 1 0.09    0.21     0.19        0.13#> 2 0.19    0.28     0.53        0.10#> 3 0.62    0.16     0.50        0.11#> 4 0.15    0.35     0.57        0.16

Optionally, transform these values into a more useful format:

flights2%>%split_by_parens(transform =TRUE)#> # A tibble: 8 × 3#>   .origin   x     sd#>   <chr>     <chr> <chr>#> 1 drone     0.09  0.21#> 2 drone     0.19  0.28#> 3 drone     0.62  0.16#> 4 drone     0.15  0.35#> 5 selfpilot 0.19  0.13#> 6 selfpilot 0.53  0.10#> 7 selfpilot 0.50  0.11#> 8 selfpilot 0.57  0.16

From here, you can calldebit_map() almost right away(supposing you deal with binary distributions’ means and standarddeviations):

flights2%>%split_by_parens(transform =TRUE)%>%  dplyr::mutate(n =80)%>%debit_map()#> # A tibble: 8 × 12#>   x     sd        n consistency rounding   sd_lower sd_incl_lower sd_upper#>   <chr> <chr> <int> <lgl>       <chr>         <dbl> <lgl>            <dbl>#> 1 0.09  0.21     80 FALSE       up_or_down    0.205 TRUE             0.215#> 2 0.19  0.28     80 FALSE       up_or_down    0.275 TRUE             0.285#> 3 0.62  0.16     80 FALSE       up_or_down    0.155 TRUE             0.165#> 4 0.15  0.35     80 TRUE        up_or_down    0.345 TRUE             0.355#> 5 0.19  0.13     80 FALSE       up_or_down    0.125 TRUE             0.135#> 6 0.53  0.10     80 FALSE       up_or_down    0.095 TRUE             0.105#> 7 0.50  0.11     80 FALSE       up_or_down    0.105 TRUE             0.115#> 8 0.57  0.16     80 FALSE       up_or_down    0.155 TRUE             0.165#> # ℹ 4 more variables: sd_incl_upper <lgl>, x_lower <dbl>, x_upper <dbl>,#> #   .origin <chr>

If your strings look like"2.65 [0.27]", specify thesep argument as"brackets". Likewise for"2.65 {0.27}" andsep = "braces". What aboutother separators, as in"2.65 <0.27>"? Specifysep as those two substrings, likesep = c("<", ">"). In all of these cases, the outputwill be the same as the default would be if the strings were like"2.65 (0.27)".

Column name suffixes

The defaults for column name suffixes are (1)"x" forthe part before the parentheses and (2)"sd" for the partinside of them. However, this won’t fit for all data presented like5.22 (0.73). Override the defaults by specifyingcol1 and/orcol2:

flights2%>%split_by_parens(end1 ="beta",end2 ="se")#> # A tibble: 4 × 4#>   drone_beta drone_se selfpilot_beta selfpilot_se#>   <chr>      <chr>    <chr>          <chr>#> 1 0.09       0.21     0.19           0.13#> 2 0.19       0.28     0.53           0.10#> 3 0.62       0.16     0.50           0.11#> 4 0.15       0.35     0.57           0.16

These suffixes become column names iftransform is settoTRUE:

flights2%>%split_by_parens(end1 ="beta",end2 ="se",transform =TRUE)#> # A tibble: 8 × 3#>   .origin   beta  se#>   <chr>     <chr> <chr>#> 1 drone     0.09  0.21#> 2 drone     0.19  0.28#> 3 drone     0.62  0.16#> 4 drone     0.15  0.35#> 5 selfpilot 0.19  0.13#> 6 selfpilot 0.53  0.10#> 7 selfpilot 0.50  0.11#> 8 selfpilot 0.57  0.16

Extract substrings frombefore_parens() andinside_parens()

There also are specific functions for extracting the parts of theindividual string vectors before or inside the parentheses:

flights3<- flights2%>%  dplyr::pull(selfpilot)flights3#> [1] "0.19 (0.13)" "0.53 (0.10)" "0.50 (0.11)" "0.57 (0.16)"flights3%>%before_parens()#> [1] "0.19" "0.53" "0.50" "0.57"flights3%>%inside_parens()#> [1] "0.13" "0.10" "0.11" "0.16"

Replace column names by row values

When extracting tables from PDF with tabulizer, you might get dataframes (converted from matrices) that have wrong, nondescript columnnames, while the correct column names are stored in one or more rowswithin the data frame itself.

I will first simulate the problem.x andnshould be column names, but instead they are values in the firstrow:

flights1_with_issues<- flights1%>%    dplyr::mutate(n =as.character(n))%>%    tibble::add_row(x ="x",n ="n",.before =1)colnames(flights1_with_issues)<-c("Var1","Var2")flights1_with_issues#> # A tibble: 8 × 2#>   Var1  Var2#>   <chr> <chr>#> 1 x     n#> 2 8.97  28#> 3 2.61  28#> 4 7.26  28#> 5 3.64  28#> 6 9.26  28#> 7 10.46 28#> 8 7.39  28

To remedy the issue, callrow_to_colnames() on the dataframe. It will replace the column names by the values of one or morerows. The latter are specified by their position numbers as indplyr::slice(). For these numbers, the default is1 because the column names will often be stored in thefirst row, if at all. The specified row or rows are then dropped becausethey shouldn’t have been rows in the first place.

With the above example:

flights1_with_issues%>%row_to_colnames()#> # A tibble: 7 × 2#>   x     n#>   <chr> <chr>#> 1 8.97  28#> 2 2.61  28#> 3 7.26  28#> 4 3.64  28#> 5 9.26  28#> 6 10.46 28#> 7 7.39  28

Note thatn is still a string vector, but this is truefor all columns in tables extracted with tabulizer.


[8]ページ先頭

©2009-2025 Movatter.jp