The where package has one main function, run(), that provides a clean syntax for vectorising the use of NSE (non-standard evaluation), for example in ggplot2, dplyr, or data.table. There are also two infix wrappers, %where% and %for%, that provide arguably cleaner syntax. A typical example might look like:

    subgroups <- .(all = TRUE,
                   long_sepal = Sepal.Length > 6,
                   long_petal = Petal.Length > 5.5)

    (iris %>%
      filter(x) %>%
      summarise(across(Sepal.Length:Petal.Width, mean),
                .by = Species)) %for% subgroups

Here we have a population dataset and various subpopulations of interest, and we want to apply the same code over all subpopulations. If the subpopulations were a partition of the data (for example, a census population divided into 5-year age bands), then we could use group_by() in dplyr or faceting in ggplot2, for example, to apply the same code over all subpopulations. In general, however, the populations will not be so easy to apply over, for example if some are defined by age, others by gender, and others by a combination of the two. A variable that allows multiple options to be selected (for example, ethnicity in the New Zealand Census) can alone define subpopulations (in this case ethnic groups) that cannot be vectorised over with the partitioning functionality (like group by and faceting) in standard packages. The where package makes these examples straightforward.
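To make the partition point concrete, the following base R sketch checks how many of the three example subgroups each row of iris belongs to. It does not use the where package: the `masks` data frame is an illustrative stand-in that writes the conditions out as plain logical vectors rather than captured expressions.

```r
# Base R only; `iris` ships with R. Each column mirrors one of the subgroup
# conditions used in the example above (an illustrative stand-in, not .()).
masks <- with(iris, data.frame(
  all        = rep(TRUE, nrow(iris)),
  long_sepal = Sepal.Length > 6,
  long_petal = Petal.Length > 5.5
))

# Number of subgroups each row belongs to.
memberships <- rowSums(masks)

# A partition would place every row in exactly one group.
is_partition <- all(memberships == 1)
is_partition

# Distribution of membership counts across the rows.
table(memberships)
```

Because some rows fall into more than one subgroup, a single grouping column (as used by group_by() or faceting) cannot represent these populations.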
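As a running example we will use the iris dataset and the following (largely unnatural) sub-populations of irises:

- all irises
- irises with a sepal length greater than 6
- irises with a petal length greater than 5.5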
These subgroups can be defined with the .() function, which captures the filter conditions that define the populations:

    subgroups <- .(all = TRUE,
                   long_sepal = Sepal.Length > 6,
                   long_petal = Petal.Length > 5.5)

Utilising these subgroups directly with standard R is tricky. For example, we could form the separate populations with repeated code:

    # With base R
    iris
    iris[iris[["Sepal.Length"]] > 6, ]    # or with(iris, iris[Sepal.Length > 6, ])
    iris[iris[["Petal.Length"]] > 5.5, ]  # or with(iris, iris[Petal.Length > 5.5, ])

    # With dplyr
    iris
    filter(iris, Sepal.Length > 6)
    filter(iris, Petal.Length > 5.5)

    # With data.table
    iris
    as.data.table(iris)[Sepal.Length > 6]
    as.data.table(iris)[Petal.Length > 5.5]

Alternatively, this could be done by first explicitly capturing expressions (as done above with .()) and then evaluating them:

    lapply(subgroups, function(group) with(iris, iris[eval(group), ]))

This requires some comfort with managing expressions in R and can quickly get messy with more complex queries, particularly if we want to apply across more than one set of expressions. The run() function hides these manipulations:

    run(with(iris, iris[subgroup, ]), subgroup = subgroups)

    # or
    with(iris, iris[x, ]) %for% subgroups

A standard group by and summarise operation:

    library(dplyr)

    subgroups = .(all = TRUE,
                  long_sepal = Sepal.Length > 6,
                  long_petal = Petal.Length > 5.5)
    functions = .(mean, sum, prod)

    run(
      iris %>%
        filter(subgroup) %>%
        summarise(across(Sepal.Length:Petal.Width, summary),
                  .by = Species),
      subgroup = subgroups,
      summary = functions
    )

The same using data.table:

    library(data.table)

    df <- as.data.table(iris)
    run(df[subgroup, lapply(.SD, functions), keyby = "Species",
           .SDcols = Sepal.Length:Petal.Width],
        subgroup = subgroups,
        functions = functions)

Producing the same ggplot over the different populations:

    library(ggplot2)

    plots <- run(
      ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
        geom_point() +
        theme_minimal(),
      subgroup = subgroups
    )
    Map(function(plot, name) plot + ggtitle(name), plots, names(plots))

Or different plots for the full population:

    run(
      ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
        plot +
        theme_minimal(),
      plot = .(geom_point(), geom_smooth())
    )

A natural extension of the previous example can fail in a non-obvious way, because expressions are executed differently than might be intended. For example, the following does not work:

    # Fails
    run(
      ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
        plot +
        theme_minimal(),
      plot = .(geom_point(),
               geom_smooth(),
               geom_quantile() + geom_rug())
    )

since, for the third plot, it tries to evaluate

    # Fails
    ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
      (geom_quantile() + geom_rug()) +
      theme_minimal()

and geom_quantile() + geom_rug() throws an error. This particular use case can be accomplished by putting the separate geoms in a list:

    run(
      ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
        plot +
        theme_minimal(),
      plot = .(point = geom_point(),
               smooth = geom_smooth(),
               quantilerug = list(geom_quantile(), geom_rug()))
    )

    # or by separating out the combined geoms as a function (also using a list)
    geom_quantilerug <- function() list(geom_quantile(), geom_rug())
    run(
      ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
        plot +
        theme_minimal(),
      plot = .(point = geom_point(),
               smooth = geom_smooth(),
               quantilerug = geom_quantilerug())
    )

We can call run() from within a function to further hide details. For example, we could produce subpopulation summaries for the different species of iris:

    population_summaries <- function(df) {
      run(with(df, df[subgroup, ]), subgroup = subgroups)
    }
    as.data.table(iris)[, .(population_summaries(.SD)), keyby = "Species"]

As a more general example, if we are undertaking an analysis of different subpopulations, then we could fix the populations in a function and apply code immediately over all groups:

    on_subpopulations <- function(expr, populations = subgroups)
      eval(substitute(run(expr, subgroup = populations),
                      list(expr = substitute(expr))))

    on_subpopulations(as.data.table(iris)[subgroup])

    on_subpopulations(
      iris %>%
        filter(subgroup) %>%
        summarise(across(Sepal.Length:Petal.Width, mean),
                  .by = Species)
    )

    on_subpopulations(
      ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
        geom_point() +
        theme_minimal()
    )

As when following the DRY (Don’t Repeat Yourself) principle in general, this isolation makes it straightforward to add a new subpopulation, here by editing the subgroups:

    subgroups = .(all = TRUE,
                  long_sepal = Sepal.Length > 6,
                  long_petal = Petal.Length > 5.5,
                  versicolor = Species == "versicolor")

Taking things to the absurd, we can also isolate out the analysis code:

    analyses <- .(
      subset = as.data.table(iris)[subgroup],
      summarise = iris %>%
        filter(subgroup) %>%
        summarise(across(Sepal.Length:Petal.Width, mean),
                  .by = Species),
      plot = ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
        geom_point() +
        theme_minimal()
    )
    lapply(analyses, function(expr) do.call("on_subpopulations", list(expr)))

The ggplot example

    on_subpopulations(
      ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
        geom_point() +
        theme_minimal()
    )

does not give identical results to executing the ggplot code with the given subgroups, since the ggplot object stores the execution environment, which will be different.
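The same effect can be seen in base R without ggplot2. The sketch below uses a formula as an analogy, since a formula also records the environment in which it was created. The helpers `wrapped()` and `wrapped_in_caller()` are hypothetical names introduced only for illustration and are not part of the where package.

```r
# A formula records the environment it was created in, which makes it a
# convenient base R analogy for an object that stores its environment.
direct <- y ~ x

# Hypothetical wrapper: the formula is created inside the function's own frame.
wrapped <- function() y ~ x

# Hypothetical wrapper that evaluates the same expression in the caller's frame.
wrapped_in_caller <- function() {
  e <- parent.frame()
  eval(quote(y ~ x), e)
}

from_wrapper <- wrapped()
from_caller  <- wrapped_in_caller()

# The formulas read the same when printed...
same_formula <- identical(deparse(direct), deparse(from_wrapper))

# ...but only the one evaluated in the caller's frame shares its environment
# with the directly created formula.
same_env_wrapped <- identical(environment(direct), environment(from_wrapper))
same_env_caller  <- identical(environment(direct), environment(from_caller))
```

Here `same_env_wrapped` is FALSE while `same_env_caller` is TRUE, which is the kind of difference that makes two otherwise equal objects fail an identical() comparison.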
If important, this can be remedied by capturing and passing the calling environment in the on_subpopulations() function:

    on_subpopulations <- function(expr, populations = subgroups) {
      e <- parent.frame()
      eval(substitute(run(expr, subgroup = populations, e = e),
                      list(expr = substitute(expr))))
    }

As some syntactic sugar, there are also two infix versions of run():
- %where% is a full infix version of run(), taking the expression as the left argument and a named list of values to be substituted as the right argument.
- %for% has slightly simplified syntax but only allows one substitution, for the symbol x.

For example:

    as.data.table(iris)[subgroup, lapply(.SD, summary), keyby = "Species",
                        .SDcols = Sepal.Length:Petal.Width] %where%
      list(subgroup = subgroups[1:3], summary = functions)

    # note `subgroup` replaced with 'x'
    as.data.table(iris)[x, lapply(.SD, mean), keyby = "Species",
                        .SDcols = Sepal.Length:Petal.Width] %for% subgroups

Complex expressions (for example, with pipes or +) need to be wrapped with “()” or “{}”. For example:

    (iris %>%
      filter(x) %>%
      summarise(across(Sepal.Length:Petal.Width, mean),
                .by = Species)) %for% subgroups

An additional %with% function provides a similar syntax to %where% for standard evaluation:

    (a + b) %with% {
      a = 1
      b = 2
    }
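To give a rough idea of what infix operators of this kind involve, here is a minimal base R sketch. It is not the where package's implementation: `capture()`, `%for_sketch%` and `%with_sketch%` are hypothetical names, and the sketch assumes the captured values are plain R expressions held in a named list.

```r
# Hypothetical stand-in for .(): capture arguments unevaluated, keeping names.
capture <- function(...) as.list(substitute(list(...)))[-1]

# Hypothetical %for%-like operator: for each captured value, replace the
# symbol `x` in the left-hand expression, then evaluate in the caller's frame.
`%for_sketch%` <- function(expr, values) {
  expr <- substitute(expr)
  env <- parent.frame()
  lapply(values, function(value) {
    filled <- do.call(substitute, list(expr, list(x = value)))
    eval(filled, env)
  })
}

# Hypothetical %with%-like operator: run the right-hand block in a fresh
# environment, then evaluate the left-hand expression in that environment.
`%with_sketch%` <- function(expr, definitions) {
  env <- new.env(parent = parent.frame())
  eval(substitute(definitions), env)
  eval(substitute(expr), env)
}

# Non-standard evaluation: `x` is replaced by each captured condition.
groups <- capture(all = TRUE, long_sepal = Sepal.Length > 6)
subsets <- with(iris, iris[x, ]) %for_sketch% groups

# Standard evaluation: `a` and `b` are ordinary variables defined in the block.
total <- (a + b) %with_sketch% {
  a = 1
  b = 2
}
```

In this sketch `subsets` is a named list with one data frame per captured condition, and `total` is 3. The contrast is that the first operator rewrites the expression before evaluating it, whereas the second only supplies values for it.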