Assertive programming for R analysis pipelines


tonyfischetti/assertr


What is it?

The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline so that data errors are spotted early and can be addressed quickly.

This package does not need to be used with the magrittr/dplyr piping mechanism, but the examples in this README use it for clarity.
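
As a minimal sketch of pipe-free use: each verb takes the data frame as its first argument and hands it back when the check passes, so the calls can simply be written one after another. The variable name `checked` is only for illustration.

```r
library(assertr)

# Each verb returns its input data frame when the check passes,
# so the result can be passed to the next call by hand.
checked <- verify(mtcars, mpg > 0)
checked <- assert(checked, in_set(0, 1), am, vs)

# The data that come out have the same shape as the data that went in.
dim(checked)
```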

Installation

You can install the latest version from CRAN like this:

    install.packages("assertr")

or you can install the bleeding-edge development version like this:

    install.packages("devtools")
    devtools::install_github("ropensci/assertr")

What does it look like?

This package offers five assertion functions, assert, verify, insist, assert_rows, and insist_rows, that are designed to be used shortly after data-loading in an analysis pipeline...

Let's say, for example, that R's built-in car dataset, mtcars, was not built-in but rather procured from an external source that was known for making errors in data entry or coding. Pretend we wanted to find the average miles per gallon for each number of engine cylinders. We might first want to confirm

  • that it has the columns "mpg", "vs", "am", and "wt"
  • that the dataset contains more than 10 observations
  • that the column for 'miles per gallon' (mpg) contains only positive numbers
  • that the column for 'miles per gallon' (mpg) does not contain a datum that is outside 4 standard deviations from its mean
  • that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
  • that each row contains at most 2 NAs
  • that each row is unique jointly between the "mpg", "am", and "wt" columns, and
  • that each row's Mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)

This could be written (in order) using assertr like this:

    library(dplyr)
    library(assertr)

    mtcars %>%
      verify(has_all_names("mpg", "vs", "am", "wt")) %>%
      verify(nrow(.) > 10) %>%
      verify(mpg > 0) %>%
      insist(within_n_sds(4), mpg) %>%
      assert(in_set(0, 1), am, vs) %>%
      assert_rows(num_row_NAs, within_bounds(0, 2), everything()) %>%
      assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
      insist_rows(maha_dist, within_n_mads(10), everything()) %>%
      group_by(cyl) %>%
      summarise(avg.mpg = mean(mpg))

If any of these assertions were violated, an error would have been raised and the pipeline would have been terminated early.
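
To see the early termination in isolation, here is a small sketch that wraps a deliberately failing pipeline in `tryCatch`. The flag `reached_summary` and the labels "completed" and "stopped early" are made up for this illustration; only the assertr and dplyr calls come from the example above.

```r
library(dplyr)
library(assertr)

reached_summary <- FALSE

outcome <- tryCatch({
  summary_tbl <- mtcars %>%
    verify(mpg > 100) %>%            # deliberately false for every row
    group_by(cyl) %>%
    summarise(avg.mpg = mean(mpg))   # never runs, because verify() errors first
  reached_summary <- TRUE
  "completed"
}, error = function(e) "stopped early")

outcome
reached_summary
```

Because the failed assertion raises an ordinary R error, nothing after `verify` is evaluated and the flag keeps its initial value.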

Let's see what the error messages look like when you chain a bunch of failing assertions together.

    > mtcars %>%
    +   chain_start %>%
    +   assert(in_set(1,2,3,4), carb) %>%
    +   assert_rows(rowMeans, within_bounds(0,5), gear:carb) %>%
    +   verify(nrow(.)==10) %>%
    +   verify(mpg<32) %>%
    +   chain_end
    There are 7 errors across 4 verbs:
    -
             verb redux_fn          predicate     column index value
    1      assert     <NA>    in_set(1,2,3,4)       carb    30   6.0
    2      assert     <NA>    in_set(1,2,3,4)       carb    31   8.0
    3 assert_rows rowMeans within_bounds(0,5) ~gear:carb    30   5.5
    4 assert_rows rowMeans within_bounds(0,5) ~gear:carb    31   6.5
    5      verify     <NA>        nrow(.)==10       <NA>     1    NA
    6      verify     <NA>             mpg<32       <NA>    18    NA
    7      verify     <NA>             mpg<32       <NA>    20    NA

    Error: assertr stopped execution

What does assertr give me?

  • verify - takes a data frame (its first argument is provided by the %>% operator above), and a logical (boolean) expression. Then, verify evaluates that expression using the scope of the provided data frame. If any of the logical values of the expression's result are FALSE, verify will raise an error that terminates any further processing of the pipeline.

  • assert - takes a data frame, a predicate function, and an arbitrary number of columns to apply the predicate function to. The predicate function (a function that returns a logical/boolean value) is then applied to every element of the columns selected, and will raise an error if it finds any violations. Internally, the assert function uses dplyr's select function to extract the columns to test the predicate function on.

  • insist - takes a data frame, a predicate-generating function, and an arbitrary number of columns. For each column, the predicate-generating function is applied, returning a predicate. The predicate is then applied to every element of that column, and will raise an error if it finds any violations. The reason for using a predicate-generating function to return a predicate to use against each value in each of the selected columns is so that, for example, bounds can be dynamically generated based on what the data look like; this is the only way to, say, create bounds that check if each datum is within x z-scores, since the standard deviation isn't known a priori. Internally, the insist function uses dplyr's select function to extract the columns to test the predicate function on.

  • assert_rows - takes a data frame, a row reduction function, a predicate function, and an arbitrary number of columns to apply the predicate function to. The row reduction function is applied to the data frame, and returns a value for each row. The predicate function is then applied to every element of the vector returned from the row reduction function, and will raise an error if it finds any violations. This functionality is useful, for example, in conjunction with the num_row_NAs() function to ensure that each row has fewer than a certain number of missing values. Internally, the assert_rows function uses dplyr's select function to extract the columns to test the predicate function on.

  • insist_rows - takes a data frame, a row reduction function, a predicate-generating function, and an arbitrary number of columns to apply the predicate function to. The row reduction function is applied to the data frame, and returns a value for each row. The predicate-generating function is then applied to the vector returned from the row reduction function and the resultant predicate is applied to each element of that vector. It will raise an error if it finds any violations. This functionality is useful, for example, in conjunction with the maha_dist() function to ensure that there are no flagrant outliers. Internally, the insist_rows function uses dplyr's select function to extract the columns to test the predicate function on.
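
The difference between a predicate (for assert) and a predicate-generating function (for insist) may be easier to see with hand-written ones. The sketch below assumes two hypothetical helpers that are not part of assertr: `is_positive`, a plain predicate, and `within_iqr_fence`, a generator that derives its bounds from the column it is given. The multiplier of 3 is an arbitrary choice for the illustration.

```r
library(assertr)

# Hypothetical custom predicate: TRUE for strictly positive values.
is_positive <- function(x) x > 0

# Hypothetical predicate generator: looks at the whole column first,
# then returns a predicate whose bounds depend on that column.
within_iqr_fence <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  spread <- q[2] - q[1]
  within_bounds(q[1] - 3 * spread, q[2] + 3 * spread)
}

# assert: the same fixed predicate is applied to every selected column.
checked <- assert(mtcars, is_positive, mpg, wt)

# insist: the generator is called once per column to build that column's predicate.
checked <- insist(checked, within_iqr_fence, mpg)
```

With assert the rule is known before the data are seen; with insist the rule is only known once the column has been inspected.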

assertr also offers four (so far) predicate functions designed to be used with the assert and assert_rows functions:

  • not_na - checks if an element is not NA
  • within_bounds - returns a predicate function that checks if a numeric value falls within the bounds supplied
  • in_set - returns a predicate function that checks if an element is a member of the set supplied (also allows inverse for "not in set")
  • is_uniq - checks that each element appears only once
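
These predicates are ordinary R functions, so they can be tried on bare vectors outside of any verb. A short sketch follows; the names `in_unit_range`, `is_binary`, and `not_binary` are made up here, and the `inverse = TRUE` argument reflects my understanding of how the "not in set" option mentioned above is spelled.

```r
library(assertr)

# not_na() is itself a predicate.
not_na(c(1, NA, 3))

# within_bounds() and in_set() are factories: calling them returns the predicate.
in_unit_range <- within_bounds(0, 1)
in_unit_range(c(0.5, 2))

is_binary <- in_set(0, 1)
is_binary(c(0, 1, 2))

# Assumed spelling of the "not in set" option.
not_binary <- in_set(0, 1, inverse = TRUE)
not_binary(c(0, 1, 2))

# is_uniq() looks at the vector as a whole.
is_uniq(c("a", "b", "b"))
```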

assertr also provides predicate generators designed to be used with the insist and insist_rows functions:

  • within_n_sds - used to dynamically create bounds to check vector elements against, based on standard z-scores
  • within_n_mads - a better method for dynamically creating bounds to check vector elements against, based on 'robust' z-scores (using median absolute deviation)
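
To make the two-step nature of a predicate generator concrete, the sketch below calls the generators by hand instead of through insist. The variable names are illustrative only.

```r
library(assertr)

mpg <- mtcars$mpg

# Step 1: within_n_sds(2) returns a function that expects a vector.
# Step 2: calling that function with the vector returns the actual
#         predicate, with its bounds fixed from the data.
sd_predicate  <- within_n_sds(2)(mpg)
mad_predicate <- within_n_mads(2)(mpg)

sd_predicate(mean(mpg))     # the mean always lies inside bounds centred on it
sd_predicate(1000)          # far outside any plausible mpg bound
mad_predicate(median(mpg))  # likewise for the median and the MAD-based bounds
```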

It also provides the following row reduction functions designed to be used with assert_rows and insist_rows:

  • num_row_NAs - counts the number of missing values in each row
  • maha_dist - computes the Mahalanobis distance of each row (for outlier detection). It will coerce categorical variables into numerics if it needs to.
  • col_concat - concatenates the values in each row into a single string
  • duplicated_across_cols - checks whether a row contains a duplicated value across columns
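
A row reduction function turns a data frame into one value per row, which is what assert_rows and insist_rows then test. A small sketch, using a made-up data frame `toy`:

```r
library(assertr)

# Hypothetical toy data with some missing values.
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

# One count per row: how many cells in that row are NA.
na_counts <- num_row_NAs(toy)
na_counts

# One distance per row of mtcars.
distances <- maha_dist(mtcars)
length(distances) == nrow(mtcars)
```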

Finally, it provides some other utilities for use with verify:

  • has_all_names - checks if the data frame or list has all supplied names
  • has_only_names - checks that a data frame or list has only the names requested
  • has_class - checks if passed data has a particular class
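
A brief sketch of these helpers inside verify follows. The `class =` argument to has_class is my understanding of its signature rather than something stated in this README, and the labels "passed" and "failed" are invented for the illustration.

```r
library(assertr)

# Both checks pass for mtcars, so the data frame is returned each time.
checked <- verify(mtcars, has_all_names("mpg", "cyl"))
checked <- verify(checked, has_class("mpg", "wt", class = "numeric"))

# has_only_names() fails here because mtcars has more columns than these two.
outcome <- tryCatch({
  verify(mtcars, has_only_names("mpg", "cyl"))
  "passed"
}, error = function(e) "failed")

outcome
```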

More info

For more info, check out the assertr vignette

    > vignette("assertr")

Or read it here
