14.9 A more transparent caching mechanism

Note: A new caching mechanism, xfun::cache_exec(), has been introduced to supersede the xfun::cache_rds() function described in this section. We now recommend xfun::cache_exec(), which is also transparent yet still flexible.

If you feel the caching mechanism of knitr introduced in Section 11.4 is too complicated (it is!), you may consider a simpler caching mechanism based on the function xfun::cache_rds(), e.g.,

xfun::cache_rds({
  # write your time-consuming code in this expression
})

The tricky thing about knitr’s caching is how it decides when to invalidate the cache. For xfun::cache_rds(), it is much clearer: the first time you pass an R expression to this function, it evaluates the expression and saves the result to an .rds file; the next time you run cache_rds(), it reads the .rds file and returns the result immediately without evaluating the expression again. The most obvious way to invalidate the cache is to delete the .rds file. If you do not want to delete it manually, you may call xfun::cache_rds() with the argument rerun = TRUE.
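The read-or-compute logic described above can be sketched in a few lines of base R. This is only an illustration of the idea, not the actual implementation of xfun::cache_rds(); the names simple_cache(), slow_sum(), and n_runs are hypothetical, and unlike cache_rds() the sketch does not add a hash to the filename.

```r
# A hypothetical, simplified stand-in for xfun::cache_rds() (not xfun's code)
simple_cache <- function(expr, file, rerun = FALSE) {
  # reuse the saved result unless a rerun is requested
  if (!rerun && file.exists(file)) return(readRDS(file))
  value <- expr  # the expression is only evaluated here (lazy evaluation)
  saveRDS(value, file)
  value
}

cache_file <- tempfile(fileext = ".rds")
n_runs <- 0  # counts how many times the expression is really evaluated

slow_sum <- function(rerun = FALSE) {
  simple_cache({
    n_runs <<- n_runs + 1
    sum(1:10)
  }, file = cache_file, rerun = rerun)
}

r1 <- slow_sum()              # evaluates the expression and writes the file
r2 <- slow_sum()              # reads the file; the expression is not run
r3 <- slow_sum(rerun = TRUE)  # forces re-evaluation
n_runs                        # number of real evaluations so far
```

The counter makes the cache hit visible: the second call returns the saved value without touching the expression, and only rerun = TRUE (or deleting the file) triggers a new evaluation.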

When xfun::cache_rds() is called inside a code chunk in a knitr source document, the path of the .rds file is determined by the chunk option cache.path and the chunk label. For example, for a code chunk with the chunk label foo in the Rmd document input.Rmd:

```{r, foo}
res <- xfun::cache_rds({
  Sys.sleep(3)
  1:10
})
```

The path of the .rds file will be of the form input_cache/FORMAT/foo_HASH.rds, where FORMAT is the Pandoc output format name (e.g., html or latex), and HASH is an MD5 hash of 32 hexadecimal digits (0-9 and a-f), e.g., input_cache/html/foo_7a3f22c4309d400eff95de0e8bddac71.rds.

As documented on the help page ?xfun::cache_rds, there are two common cases in which you may want to invalidate the cache: 1) the code in the expression to be evaluated has changed; 2) the code uses an external variable, and the value of that variable has changed. Next we will explain how these two ways of cache invalidation work, as well as how to keep multiple copies of the cache corresponding to different versions of the code.

14.9.1 Invalidate the cache by changing code in the expression

When you change the code in cache_rds() (e.g., from cache_rds({x + 1}) to cache_rds({x + 2})), the cache will be automatically invalidated and the expression will be re-evaluated. However, please note that changes in white spaces or comments do not matter. More generally, as long as the change does not affect the parsed expression, the cache will not be invalidated. For example, the two expressions passed to cache_rds() below are essentially identical:

res <- xfun::cache_rds({
  Sys.sleep(3  );
  x<-1:10;  # semi-colons won't matter
  x+1;
})

res <- xfun::cache_rds({
  Sys.sleep(3)
  x <- 1:10  # a comment
  x + 1  # feel free to make any changes in white spaces
})

Hence if you have executed cache_rds() on the first expression, the second expression will be able to take advantage of the cache. This feature is helpful because it allows you to make cosmetic changes in your code without invalidating the cache.

If you are not sure whether two versions of code are equivalent, you may try the parse_code() function below:

parse_code <- function(expr) {
  deparse(substitute(expr))
}
# white spaces and semi-colons do not matter
parse_code({x+1})
## [1] "{"         "    x + 1" "}"
parse_code({ x+1; })
## [1] "{"         "    x + 1" "}"
# left arrow and right arrow are equivalent
identical(parse_code({x <- 1}), parse_code({1 -> x}))
## [1] TRUE

14.9.2 Invalidate the cache by changes in global variables

There are two types of variables in an expression: global variables and local variables. Global variables are those created outside the expression, and local variables are those created inside the expression. If the value of a global variable in the expression has changed, your cached result will no longer reflect the result that you would obtain by running the expression again. For example, in the expression below, if y has changed, you will most likely want to invalidate the cache and rerun the expression; otherwise you still get the result computed from the old value of y:

y <- 2
res <- xfun::cache_rds({
  x <- 1:10
  x + y
})

To invalidate the cache when y has changed, you may let cache_rds() know through the hash argument that y needs to be considered when deciding if the cache should be invalidated:

res <- xfun::cache_rds({
  x <- 1:10
  x + y
}, hash = list(y))

When the value of the hash argument changes, the 32-digit hash in the cache filename (as mentioned earlier) changes accordingly, and therefore the cache is invalidated. This provides a way to specify the cache’s dependency on other R objects. For example, if you want the cache to depend on the version of R, you may specify the dependency like this:

res <- xfun::cache_rds({
  x <- 1:10
  x + y
}, hash = list(y, getRversion()))

Or if you want the cache to depend on when a data file was last modified:

res <- xfun::cache_rds({
  x <- read.csv("data.csv")
  x[[1]] + y
}, hash = list(y, file.mtime("data.csv")))
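To see why changing the hash argument invalidates the cache, it may help to think of the cache filename as being keyed on both the deparsed code and the hash values. The sketch below illustrates that idea with base R only; cache_key() is a hypothetical helper, and the exact way xfun computes its MD5 hash is an assumption that this section does not spell out.

```r
# Hypothetical helper: derive a 32-digit key from the code and the hash values.
# It mimics the idea only; xfun may compute its hash differently.
cache_key <- function(expr, hash = NULL) {
  code <- deparse(substitute(expr))  # normalized text of the expression
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(list(code, hash), NULL), tmp)
  unname(tools::md5sum(tmp))  # MD5 of the serialized code + hash values
}

y <- 2
key1 <- cache_key({x <- 1:10; x + y}, hash = list(y))
key2 <- cache_key({
  x <- 1:10
  x + y  # cosmetic changes do not alter the key
}, hash = list(y))

y <- 3
key3 <- cache_key({x <- 1:10; x + y}, hash = list(y))

identical(key1, key2)  # same code, same value of y
identical(key1, key3)  # same code, different value of y
```

Because the key feeds into the filename, a new value of y points to a file that does not exist yet, so the expression has to be evaluated again.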

If you do not want to provide this list of global variables to the hash argument, you may try hash = "auto" instead, which tells cache_rds() to try to figure out all global variables automatically and use a list of their values as the value for the hash argument, e.g.,

res <- xfun::cache_rds({
  x <- 1:10
  x + y + z  # y and z are global variables
}, hash = "auto")

This is equivalent to:

res <- xfun::cache_rds({
  x <- 1:10
  x + y + z  # y and z are global variables
}, hash = list(y = y, z = z))

The global variables are identified by codetools::findGlobals() when hash = "auto", which may not be completely reliable. You know your own code the best, so we recommend that you specify the list of values explicitly in the hash argument if you want to be completely sure which variables can invalidate the cache.
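If you want to check which variables would be detected before relying on hash = "auto", you could call codetools::findGlobals() yourself. The sketch below wraps the expression in a function, because findGlobals() analyzes functions; the wrapper f is only for illustration, and how cache_rds() calls findGlobals() internally is not shown here.

```r
# Wrap the expression in a function so that codetools can analyze it
f <- function() {
  x <- 1:10
  x + y + z
}

# merge = FALSE separates global functions from global variables
globals <- codetools::findGlobals(f, merge = FALSE)
globals$variables  # x is assigned inside f, so it is treated as local
```

The detection is based on static analysis of the code, so a variable that is only reached indirectly (e.g., via get("y")) can be missed, which is one reason to list the values explicitly when it matters.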

14.9.3 Keep multiple copies of the cache

Since the cache is typically used for time-consuming code, perhaps you should invalidate it conservatively. You might regret invalidating the cache too soon or too aggressively, because if you need an older version of the cache again, you would have to wait a long time for the computing to be redone.

The clean argument of cache_rds() allows you to keep older copies of the cache if you set it to FALSE. You can also set the global R option options(xfun.cache_rds.clean = FALSE) if you want this to be the default behavior throughout the entire R session. By default, clean = TRUE and cache_rds() will try to delete the older cache every time. Setting clean = FALSE can be useful if you are still experimenting with the code. For example, you can cache two versions of a linear model:

model <- xfun::cache_rds({
  lm(dist ~ speed, data = cars)
}, clean = FALSE)

model <- xfun::cache_rds({
  lm(dist ~ speed + I(speed^2), data = cars)
}, clean = FALSE)

After you decide which model to use, you can set clean = TRUE again, or delete this argument (so the default TRUE is used).
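To see what clean = FALSE keeps on disk, you could run the two calls outside knitr and list the cache directory. This sketch assumes the xfun package is installed; the dir argument of cache_rds() is pointed at a temporary directory chosen here purely for illustration, and it is passed with a trailing slash because the cache path is built by pasting dir and the filename together.

```r
# A throwaway cache directory for this experiment
cache_dir <- tempfile("cache_")

# two versions of the model, each cached without cleaning older copies
m1 <- xfun::cache_rds({
  lm(dist ~ speed, data = cars)
}, dir = paste0(cache_dir, "/"), clean = FALSE)

m2 <- xfun::cache_rds({
  lm(dist ~ speed + I(speed^2), data = cars)
}, dir = paste0(cache_dir, "/"), clean = FALSE)

# each version of the code should have its own .rds file
cache_files <- list.files(cache_dir, pattern = "[.]rds$")
cache_files
```

Switching back to the first version of the code would then load its existing file instead of refitting the model.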

14.9.4 Comparison with knitr’s caching

You may wonder when to use knitr’s caching (i.e., set the chunk option cache = TRUE), and when to use xfun::cache_rds() in a knitr source document. The biggest disadvantage of xfun::cache_rds() is that it does not cache side effects (but only the value of the expression), whereas knitr does. Some side effects may be useful, such as printed output or plots. For example, in the code below, the text output and the plot will be lost when cache_rds() loads the cache the next time, and only the value 1:10 will be returned:

xfun::cache_rds({
  print("Hello world!")
  plot(cars)
  1:10
})
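One way to observe the lost side effect outside knitr is to capture the printed output of two consecutive calls. The sketch below assumes xfun is installed and uses a temporary cache directory; the helper run() is hypothetical, and the plot is left out so that no graphics device is needed.

```r
cache_dir <- tempfile("cache_")

# Hypothetical helper that runs the same cached expression each time
run <- function() {
  xfun::cache_rds({
    print("Hello world!")
    1:10
  }, dir = paste0(cache_dir, "/"))
}

out1 <- capture.output(val1 <- run())  # first run: the expression is evaluated
out2 <- capture.output(val2 <- run())  # second run: the value comes from the cache

out1                   # the text printed during the first run
length(out2)           # nothing is printed when the cache is loaded
identical(val1, val2)  # the returned value is the same in both runs
```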

By comparison, for a code chunk with the option cache = TRUE, everything will be cached:

```{r, cache=TRUE}
print("Hello world!")
plot(cars)
1:10
```

The biggest disadvantage of knitr’s caching (and also what users complain about most frequently) is that your cache might be inadvertently invalidated, because the cache is determined by too many factors. For example, any change in chunk options can invalidate the cache,[1] but some chunk options may not be relevant to the computing. In the code chunk below, changing the chunk option fig.width = 6 to fig.width = 10 should not invalidate the cache, but it will:

```{r, cache=TRUE, fig.width=6}
# there are no plots in this chunk
x <- rnorm(1000)
mean(x)
```

Actually, knitr caching is quite powerful and flexible, and its behavior can be tweaked in many ways. As its author, I often doubt whether it is worth introducing these lesser-known features, because you may end up spending much more time learning and understanding how the cache works than the actual computing takes.

In case it is not clear, xfun::cache_rds() is a general way to cache computations, and it works anywhere, whereas knitr’s caching only works in knitr documents.


  1. This is the default behavior, and you can change it. See https://yihui.org/knitr/demo/cache/ for how you can make the cache more granular, so not all chunk options affect the cache.