The checkthat philosophy is that you already perform good data checks and you should keep doing so. But those checks would be even better if they lived in the code, rather than in your head. Checkthat therefore provides functions that closely resemble the checks you already do by hand or by eye, so that it is easy for you to also express them in code as you go.
Checkthat’s main function is check_that(.data, ...), which takes a dataframe as its first argument, followed by any number of assertions you want to check for that dataframe.
When all checks pass, you get a brief message confirming that’s the case.
When at least one check fails, check_that() throws an error, halting the potentially risky execution of subsequent code. It then gives you a detailed breakdown of the outcome of each test.
    mtcars |>
      check_that(
        all(cyl > 2),
        any(mpg > 35)
      )
    #>
    #> ── Data Checks ─────────────────────────────────────────────────────────────────
    #>
    #> ✔ all(cyl > 2) --> TRUE
    #> ✖ any(mpg > 35) --> FALSE
    #>
    #> ────────────────────────────────────────────────────────────────────────────────
    #>
    #> Error in `cli_throw_test_error()`:
    #> ! At least one data check failed.

The check_that() function is designed to work with both base R’s existing logical functions (e.g., all(), any()) and its own set of more specialized helper functions. These helper functions are designed to be readable and to mirror in code what you already do manually by eyeballing a dataset.
The check_that() function always invisibly returns the same .data you gave it (always unmodified). This allows you to easily integrate it directly into your data manipulation pipelines.
    library(dplyr)
    new_mtcars <- mtcars |>
      select(mpg) |>
      mutate(km_per_litre = 0.425 * mpg) |>
      check_that(max(km_per_litre) < 15)
    #> ✔ all data checks passing
    head(new_mtcars)
    #>                    mpg km_per_litre
    #> Mazda RX4         21.0       8.9250
    #> Mazda RX4 Wag     21.0       8.9250
    #> Datsun 710        22.8       9.6900
    #> Hornet 4 Drive    21.4       9.0950
    #> Hornet Sportabout 18.7       7.9475
    #> Valiant           18.1       7.6925

Because it returns the same dataframe it received, check_that() can also be used at multiple points in a single pipeline. That way, you can check that multi-step processes are unfolding according to plan. This is especially important for data tasks that are sensitive to the order of operations, or for checks on intermediate data that won’t be available at the end.
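The pass-through behaviour described above can be illustrated with a minimal base R sketch. The function name check_sketch() is hypothetical and this is not checkthat's implementation; it only assumes that each check is an expression evaluated against the columns of the data, that any failure raises an error, and that the input is otherwise returned invisibly and unmodified.

```r
# Hypothetical stand-in for illustration only; NOT checkthat's implementation.
check_sketch <- function(.data, ...) {
  exprs <- as.list(substitute(list(...)))[-1]  # capture the checks unevaluated
  caller <- parent.frame()                     # where non-column names are looked up

  # Evaluate each check with the data frame's columns visible by name.
  passed <- vapply(
    exprs,
    function(e) isTRUE(eval(e, .data, caller)),
    logical(1)
  )

  if (!all(passed)) {
    failed <- vapply(
      exprs[!passed],
      function(e) paste(deparse(e), collapse = " "),
      character(1)
    )
    stop("At least one data check failed: ",
         paste(failed, collapse = "; "), call. = FALSE)
  }

  invisible(.data)  # hand the untouched data to the next pipeline step
}

# Because the input comes back unchanged, the call can sit mid-pipeline.
out <- mtcars |> check_sketch(all(cyl > 2), max(mpg) < 40)
identical(out, mtcars)
```

Returning the data invisibly keeps the console quiet when the call ends a pipeline, while still letting the result be assigned or piped onward.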
Consider a surprisingly tricky example. Imagine we wanted to (1) create a factor variable (type) designating cars as either small ("sm") or large ("lg") based on their weight (wt). Further imagine that we then (2) planned to filter in only the small cars and (3) calculate their mean mpg as our desired_mpg. This value might then be used to inform a personal purchase decision or perhaps to establish an industry benchmark for a manufacturer.
The resulting data pipeline should be simple, but let’s use check_that() at multiple points to be safe.
1. We will no longer have the wt variable at the end of the pipeline. So, right after we use wt to compute type, we immediately check that all the weights in the "sm" group are less than those in the "lg" group, as intended.
2. At the end of the pipeline, we check that desired_mpg is within a plausible range.

Here, the first check throws an error and stops the pipeline. It also saves us from an inaccurate desired_mpg that the second check would not have caught.
    mtcars |>
      mutate(type = factor(wt < 3, labels = c("sm", "lg"), ordered = TRUE)) |>
      check_that(max(wt[type == "sm"]) <= min(wt[type == "lg"])) |>
      filter(type == "sm") |>
      summarise(desired_mpg = mean(mpg)) |>
      check_that(desired_mpg > 15)
    #>
    #> ── Data Checks ─────────────────────────────────────────────────────────────────
    #>
    #> ✖ max(wt[type == "sm"]) <= min(wt[type == "lg"]) --> FALSE
    #>
    #> ────────────────────────────────────────────────────────────────────────────────
    #>
    #> Error in `cli_throw_test_error()`:
    #> ! At least one data check failed.

What happened? At a glance, factor(wt < 3, labels = c("sm", "lg"), ordered = TRUE) looks like it would assign cars to the correct group. However, the labels are out of order in the function call.[1] As a result, the heavy cars are mistakenly labelled "sm" and vice versa.[2]
Importantly, this mistake (a) would have given us an erroneously low desired_mpg and (b) would have gone undetected by our final check_that(desired_mpg > 15). It was a call to check_that() earlier in the pipeline that caught the error and prevented us from drawing a bad conclusion about our data later on.
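The labelling pitfall can be reproduced in base R on a two-element vector. The sketch below uses made-up weights; the version with explicit levels is one possible fix, offered as an assumption rather than as the correction the author used.

```r
wt <- c(2.2, 3.5)  # one light car, one heavy car (illustrative values)

# factor() sorts its levels, so FALSE comes before TRUE.
levels(factor(wt < 3))  # "FALSE" "TRUE"

# Labels are matched to the sorted levels, so FALSE (heavy) gets "sm".
bad <- factor(wt < 3, labels = c("sm", "lg"), ordered = TRUE)
as.character(bad)       # "lg" "sm": the light car is labelled large

# One possible fix: state the level order explicitly so TRUE maps to "sm".
good <- factor(wt < 3, levels = c(TRUE, FALSE),
               labels = c("sm", "lg"), ordered = TRUE)
as.character(good)      # "sm" "lg"

# The same sanity check used in the pipeline now holds.
max(wt[good == "sm"]) <= min(wt[good == "lg"])  # TRUE
```

Spelling out levels alongside labels makes the mapping visible in the call itself, instead of relying on the default sort order.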
Checkthat’s philosophy is that your existing data checks by eye are probably already good. Their only major problem is that they live in your head and not in your code. So, checkthat provides a range of helper functions to work alongside base R’s existing collection (e.g., all(), any()). These include both basic and more specialized varieties.
The most basic helpers are just syntactic sugar around R’s existing comparison operators: ==, <, <=, >, >=. Each of them takes a logical vector as its first argument and requires you to specify a proportion (p) or count (n) of those values that must be true.
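To make the proportion-or-count idea concrete, here is a minimal base R sketch of a ">=" style helper. The name at_least_sketch() is hypothetical, the argument names p and n are taken from the description above, and missing values are not handled; it is not checkthat's own code.

```r
# Hypothetical illustration of a ">=" helper; NOT checkthat's implementation.
at_least_sketch <- function(logical_vec, p = NULL, n = NULL) {
  # Exactly one of `p` (proportion) or `n` (count) must be supplied.
  stopifnot(is.logical(logical_vec), xor(is.null(p), is.null(n)))

  if (!is.null(p)) {
    mean(logical_vec) >= p  # share of TRUE values meets the proportion
  } else {
    sum(logical_vec) >= n   # number of TRUE values meets the count
  }
}

# In mtcars, 21 of 32 cars have more than 4 cylinders.
at_least_sketch(mtcars$cyl > 4, p = 0.5)  # TRUE
at_least_sketch(mtcars$cyl > 4, n = 25)   # FALSE
```

Because the result is a single TRUE or FALSE, a helper of this shape can be passed to any function that expects a logical assertion.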
The remaining helpers include some_of(), whenever(), and for_case(), and are more flexible than their basic counterparts. They’re optimized for the kind of semi-approximate data checking you are likely already doing by eye.
For most people, this involves a general sense of what most of the data should look like most of the time, but not exact knowledge of specific proportions or counts. For example, you might have good reason to think some_of() the cyl values should be greater than 4, but you don’t know exactly how many. However, you do know it should probably be at_least 30%, but at_most 25 total cases in your dataset. Anything outside that range would be implausible, so you want to guard against it with check_that().
    mtcars |>
      check_that(
        some_of(cyl > 4, at_least = .30, at_most = 25),
        whenever(is_observed = wt < 3, then_expect = mpg > 19),
        for_case(2, mpg == 21, hp == 110)
      )
    #> ✔ all data checks passing

Just like unit tests for production code, the tests created with these special helper functions will be technically imperfect and leave some (possibly important) scenarios unaddressed. After all, there’s a big range of possibilities between at_least = .30 and at_most = 25, and some of them might involve an undetected data problem.
However, checkthat takes the position that imperfect tests are still valuable and informative, and you should be able to take advantage of them. For example, if you have reasons to be concerned about the data in your column crossing the at_most = 25 threshold, you should be able to quickly and easily write that test with a combination of check_that() and some_of().
Moreover, a world of no tests at all is much worse than a world of some tests that fail to cover every case. With that in mind, checkthat’s special helper functions are designed to bring you from not writing down any tests in your code to quickly and easily coding the tests you already do by eye.
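For readers who want to see what the three special helpers in the earlier example amount to, the following base R sketch spells out one plausible reading of each. The interpretations are assumptions drawn from the surrounding prose: a bound below 1 is read as a proportion and a bound of 1 or more as a count, whenever() is read as "every row where the condition holds also meets the expectation", and for_case(2, ...) is read as a check on row 2.

```r
# Assumed base R equivalents of the three helper calls; the semantics are
# inferred from the prose, not taken from checkthat's source code.

# some_of(cyl > 4, at_least = .30, at_most = 25)
more_than_4 <- mtcars$cyl > 4
some_ok <- mean(more_than_4) >= 0.30 && sum(more_than_4) <= 25

# whenever(is_observed = wt < 3, then_expect = mpg > 19)
light <- mtcars$wt < 3
whenever_ok <- all(mtcars$mpg[light] > 19)

# for_case(2, mpg == 21, hp == 110)
case_ok <- mtcars$mpg[2] == 21 && mtcars$hp[2] == 110

c(some_of = some_ok, whenever = whenever_ok, for_case = case_ok)
```

Written out this way, the checks are correct but harder to scan, which is the gap the helper functions are meant to close.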
In addition to concerns about the individual rows or columns in your data, you may also want to perform checks on the entire dataframe in question. For those cases, check_that() provides the .d pronoun, which works similarly to .x in the purrr package.
In short, .d is a copy of the data you provided to check_that(), which you can use to write checks about the whole dataset.
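One way such a pronoun could be made available is to evaluate each check in a context that holds both the columns and the whole data frame under the name .d. The base R sketch below shows the idea; check_whole_sketch() is a hypothetical name and this is an assumption about the mechanism, not a description of checkthat's internals.

```r
# Hypothetical illustration of a `.d` pronoun; NOT checkthat's implementation.
check_whole_sketch <- function(.data, expr) {
  # Columns are visible by name, and the full data frame is visible as `.d`.
  mask <- c(as.list(.data), list(.d = .data))
  isTRUE(eval(substitute(expr), mask, parent.frame()))
}

# Whole-dataset checks can refer to `.d` ...
check_whole_sketch(mtcars, ncol(.d) == 11 && nrow(.d) == 32)  # TRUE

# ... and can be mixed with ordinary column references.
check_whole_sketch(mtcars, nrow(.d) == length(mpg))           # TRUE
```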
This is especially useful for operations that could change the shape of your dataset (e.g., pivots, nests, joins). In the case of pivoting, you might want to check that the dataset has the anticipated dimensions.
    library(tidyr)
    mtcars |>
      check_that(ncol(.d) == 11, nrow(.d) == 32) |>  # original dimensions
      pivot_longer(
        cols = everything(),
        names_to = "name",
        values_to = "values"
      ) |>
      check_that(ncol(.d) == 2, nrow(.d) == 32 * 11)  # check that cols became rows
    #> ✔ all data checks passing
    #> ✔ all data checks passing

After a join, you may want to check that there is a new column in the expected location, but also that there are no unanticipated new rows.
    cyl_ratings_df <- data.frame(cyl = c(4, 6, 8), group = c("A", "B", "C"))
    mtcars |>
      left_join(cyl_ratings_df, by = "cyl") |>
      check_that(
        ncol(.d) == 12,                           # check that there's one new column
        names(.d)[length(names(.d))] == "group",  # check new column is "group"
        nrow(.d) == 32                            # check that no new rows
      )
    #> ✔ all data checks passing

[1] The factor() function sorts its levels, and FALSE < TRUE, so in this case FALSE (wt >= 3) becomes level 1 and TRUE (wt < 3) becomes level 2. So, when wt < 3 is TRUE, factor(..., labels = c("sm", "lg")) will assign the 2nd label (mistakenly, from our perspective).
[2] Full disclosure: I made this mistake when preparing this example, and it was the check_that() function that pointed it out to me. So, I decided to include it.