Where

The where package has one main function run() that provides a clean syntax for vectorising the use of NSE (non-standard evaluation), for example in ggplot2, dplyr, or data.table. There are also two (infix) wrappers %where% and %for% that provide arguably cleaner syntax. A typical example might look like

subgroups <- .(all        = TRUE,
               long_sepal = Sepal.Length > 6,
               long_petal = Petal.Length > 5.5)

(iris %>%
   filter(x) %>%
   summarise(across(Sepal.Length:Petal.Width, mean),
             .by = Species)) %for% subgroups

Here we have a population dataset and various subpopulations of interest, and we want to apply the same code over all subpopulations. If the subpopulations were a partition of the data (for example, a census population divided into 5 year age bands), then we could use group_by() in dplyr or faceting in ggplot2, for example, to apply the same code over all subpopulations. In general, however, the populations will not be so easy to apply over, for example if some are defined by age, others by gender, and others by a combination of the two. A variable that allows multiple options to be selected (for example, ethnicity in the New Zealand Census) can alone define subpopulations (in this case ethnic groups) that cannot be vectorised over with the partitioning functionality (like group by and faceting) in standard packages. The where package makes these examples straightforward.
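To make the multi-select case concrete, here is a minimal base R sketch. The people data frame and its columns are invented for illustration, and the code does not use the where package; it only shows that overlapping groups cannot be represented by a single grouping column.

```r
# Hypothetical multi-response data: a person may select several ethnicities.
people <- data.frame(
  age      = c(23, 35, 47, 61),
  european = c(TRUE, TRUE, FALSE, TRUE),
  maori    = c(TRUE, FALSE, TRUE, FALSE)
)

# Each subpopulation is a filter condition, stored unevaluated.
groups <- list(european = quote(european), maori = quote(maori))

# Evaluate each condition inside the data and summarise the matching rows.
group_sizes <- sapply(groups, function(g) sum(eval(g, people)))
mean_age    <- sapply(groups, function(g) mean(people$age[eval(g, people)]))

# The group sizes add up to more than the number of rows, so the groups
# overlap and no single grouping column can represent them.
sum(group_sizes) > nrow(people)
```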

Simple example

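As a running example we will use the iris dataset and the following (largely unnatural) sub-populations of irises: all irises, those with a sepal length greater than 6, and those with a petal length greater than 5.5.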

These subgroups can be captured with the .() function, which records the filter conditions used to define these populations without evaluating them:

library(where)  # provides .(), run(), %where%, %for% and %with%

subgroups <- .(all        = TRUE,
               long_sepal = Sepal.Length > 6,
               long_petal = Petal.Length > 5.5)
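For readers unfamiliar with capturing expressions, the idea can be sketched in base R. The capture() function below is a hypothetical stand-in written for this illustration; the package's own .() may be implemented differently.

```r
# Hypothetical stand-in for .(): returns its arguments unevaluated.
capture <- function(...) as.list(substitute(list(...)))[-1]

groups <- capture(all        = TRUE,
                  long_sepal = Sepal.Length > 6,
                  long_petal = Petal.Length > 5.5)

# The conditions are stored as calls; Sepal.Length is never looked up here.
class(groups$long_sepal)

# A captured condition can be evaluated later, with a data frame as context.
head(eval(groups$long_sepal, iris))
```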

Using these subgroups directly with standard R is tricky. For example, we could form the separate populations with repeated code.

# With base R
iris
iris[iris[["Sepal.Length"]] > 6, ]    # or with(iris, iris[Sepal.Length > 6, ])
iris[iris[["Petal.Length"]] > 5.5, ]  # or with(iris, iris[Petal.Length > 5.5, ])

# With dplyr
iris
filter(iris, Sepal.Length > 6)
filter(iris, Petal.Length > 5.5)

# With data.table
iris
as.data.table(iris)[Sepal.Length > 6]
as.data.table(iris)[Petal.Length > 5.5]

or this could be done by first explicitly capturing expressions (as done above with .()) and then evaluating them:

lapply(subgroups, function(group) with(iris, iris[eval(group), ]))

This requires some comfort with managing expressions in R and can quickly get messy with more complex queries, particularly if we want to apply across more than one set of expressions. The run() function hides these manipulations:

run(with(iris, iris[subgroup, ]), subgroup = subgroups)
# or
with(iris, iris[x, ]) %for% subgroups
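The kind of manipulation being hidden can be sketched in base R as follows. This is not the package's implementation: run_sketch() and its arguments are hypothetical names chosen for this illustration, and it handles only a single placeholder.

```r
# Hypothetical helper, not the where package's run(): substitutes each value
# in turn for the placeholder `name` in `expr`, then evaluates the result.
run_sketch <- function(expr, name, values) {
  env  <- parent.frame()      # evaluate where the helper was called
  expr <- substitute(expr)    # the expression as written, unevaluated
  lapply(values, function(value) {
    filled <- do.call(substitute, list(expr, setNames(list(value), name)))
    eval(filled, env)
  })
}

groups <- list(all        = TRUE,
               long_sepal = quote(Sepal.Length > 6),
               long_petal = quote(Petal.Length > 5.5))

# One subset of iris per group, named after the groups.
subsets <- run_sketch(with(iris, iris[subgroup, ]), "subgroup", groups)
sapply(subsets, nrow)
```

Substituting before evaluating is what lets the placeholder appear anywhere inside the expression, including inside functions that themselves use non-standard evaluation.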

More interesting examples

A standard group by and summarise operation:

library(dplyr)

subgroups = .(all        = TRUE,
              long_sepal = Sepal.Length > 6,
              long_petal = Petal.Length > 5.5)

functions = .(mean, sum, prod)

run(
  iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width, summary),
              .by = Species),
  subgroup = subgroups,
  summary  = functions
)

The same using data.table:

library(data.table)

df <- as.data.table(iris)

run(df[subgroup, lapply(.SD, functions), keyby = "Species",
       .SDcols = Sepal.Length:Petal.Width],
    subgroup  = subgroups,
    functions = functions)

Producing the same ggplot over the different populations:

library(ggplot2)

plots <- run(
  ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal(),
  subgroup = subgroups
)

Map(function(plot, name) plot + ggtitle(name), plots, names(plots))

Or different plots for the full population:

run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(geom_point(), geom_smooth())
)

A limitation

A natural extension of the previous example can fail in a non-obvious way, because expressions are evaluated differently than might be intended. For example, the following does not work

# Fails
run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(geom_point(), geom_smooth(), geom_quantile() + geom_rug())
)

since, for the third plot, it tries to evaluate

# Fails
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  (geom_quantile() + geom_rug()) +
  theme_minimal()

and geom_quantile() + geom_rug() throws an error. This particular use case can be accomplished by putting the separate geoms in a list

run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(point       = geom_point(),
           smooth      = geom_smooth(),
           quantilerug = list(geom_quantile(), geom_rug()))
)

# or by separating out the combined geoms as a function (also using a list)
geom_quantilerug <- function() list(geom_quantile(), geom_rug())

run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(point       = geom_point(),
           smooth      = geom_smooth(),
           quantilerug = geom_quantilerug())
)
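The grouping behind the failing version can be inspected without ggplot2, since quote() only parses code and never calls the functions involved. The sketch below assumes that the placeholder is filled by substitution, as the evaluated expression shown earlier suggests; the variable names are chosen for this illustration.

```r
template <- quote(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) + plot + theme_minimal()
)
combined <- quote(geom_quantile() + geom_rug())

# Substitute the combined layers for the `plot` placeholder.
filled <- do.call(substitute, list(template, list(plot = combined)))

# The outer call is `<left> + theme_minimal()`. The right operand of <left>
# is the whole combined call, so the two geoms would be added to each other
# before either is added to the plot.
identical(filled[[2]][[3]], combined)
```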

run in a function

We can call run() from within a function to further hide details. For example, we could produce subpopulation summaries for the different species of iris:

population_summaries <- function(df)
  run(with(df, df[subgroup, ]), subgroup = subgroups)

as.data.table(iris)[, .(population_summaries(.SD)), keyby = "Species"]

As a more general example, if we are undertaking an analysis of different subpopulations, then we could fix the populations in a function and apply code immediately over all groups.

on_subpopulations <- function(expr, populations = subgroups)
  eval(substitute(run(expr, subgroup = populations),
                  list(expr = substitute(expr))))

on_subpopulations(as.data.table(iris)[subgroup])

on_subpopulations(
  iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width, mean),
              .by = Species)
)

on_subpopulations(
  ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal()
)
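The eval(substitute(...)) construction in on_subpopulations() is needed because a function that captures its argument unevaluated would otherwise see only the wrapper's argument name. A minimal base R sketch of this, using hypothetical function names and no package code:

```r
# Stand-in for any function that captures its argument unevaluated.
show_expr <- function(expr) deparse(substitute(expr))

# Passing the argument straight through loses the original expression.
naive_wrapper <- function(expr) show_expr(expr)

# Splicing the caller's expression into the call first preserves it.
forwarding_wrapper <- function(expr)
  eval(substitute(show_expr(expr), list(expr = substitute(expr))))

naive_wrapper(x + 1)       # "expr"
forwarding_wrapper(x + 1)  # "x + 1"
```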

As when following the DRY (Don’t Repeat Yourself) principle in general, this isolation makes it straightforward to add a new subpopulation, here by editing the subgroups:

subgroups = .(all        = TRUE,
              long_sepal = Sepal.Length > 6,
              long_petal = Petal.Length > 5.5,
              versicolor = Species == "versicolor")

Taking things to the absurd, we can also isolate out the analysis code:

analyses <- .(
  subset    = as.data.table(iris)[subgroup],
  summarise = iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width, mean),
              .by = Species),
  plot      = ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal()
)

lapply(analyses, function(expr) do.call("on_subpopulations", list(expr)))

A small warning

The ggplot example

on_subpopulations(
  ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal()
)

does not give identical results to executing the ggplot code with the given subgroups, since the ggplot object stores the execution environment, which will be different.

If important, this can be remedied by capturing and passing the calling environment in the on_subpopulations() function:

on_subpopulations <- function(expr, populations = subgroups) {
  e <- parent.frame()
  eval(substitute(run(expr, subgroup = populations, e = e),
                  list(expr = substitute(expr))))
}
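The effect of capturing the calling environment with parent.frame() can be seen in a small base R sketch. The function and variable names are hypothetical, and the example concerns only where names are looked up, not ggplot2 itself.

```r
# Evaluates the expression inside the function's own frame.
eval_here <- function(expr) {
  threshold <- 0
  eval(substitute(expr))
}

# Evaluates the expression in the frame the function was called from.
eval_in_caller <- function(expr) {
  threshold <- 0
  e <- parent.frame()
  eval(substitute(expr), e)
}

threshold <- 6
eval_here(threshold)       # 0: finds the function's local variable
eval_in_caller(threshold)  # 6: finds the caller's variable
```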

Infix notation

As some syntactic sugar, there are also two infix versions of run:

as.data.table(iris)[subgroup, lapply(.SD, summary), keyby = "Species",
                    .SDcols = Sepal.Length:Petal.Width] %where%
  list(subgroup = subgroups[1:3],
       summary  = functions)

# note `subgroup` replaced with 'x'
as.data.table(iris)[x, lapply(.SD, mean), keyby = "Species",
                    .SDcols = Sepal.Length:Petal.Width] %for%
  subgroups

Complex expressions (for example, with pipes or +) need to be wrapped with “()” or “{}”. For example

(iris %>%
   filter(x) %>%
   summarise(across(Sepal.Length:Petal.Width, mean),
             .by = Species)) %for% subgroups
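For the + case, the need for brackets follows from R's operator precedence: user-defined %...% operators bind more tightly than +. This can be checked with quote(), which only parses the code; %op% below is a placeholder name and need not be defined.

```r
unwrapped <- quote(a + b %op% c)
wrapped   <- quote((a + b) %op% c)

# The outermost call shows which operator is applied last.
as.character(unwrapped[[1]])  # "+": the infix call only received `b`
as.character(wrapped[[1]])    # "%op%": the infix call received `(a + b)`
```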

An additional %with% function provides a similar syntax to %where% for standard evaluation:

(a + b) %with% {
  a = 1
  b = 2
}
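One way such an operator could work is sketched below in base R. The %with_sketch% operator is a hypothetical illustration, not the package's %with%, whose implementation and exact semantics may differ.

```r
# Hypothetical operator: run the right-hand block in a fresh environment,
# then evaluate the left-hand expression using the variables it defined.
`%with_sketch%` <- function(expr, bindings) {
  env <- new.env(parent = parent.frame())
  eval(substitute(bindings), env)  # performs the assignments inside env
  eval(substitute(expr), env)      # evaluates the expression with them
}

result <- (a + b) %with_sketch% {
  a = 1
  b = 2
}
result
```

Because the assignments happen in a temporary environment, a and b are not created in the calling environment.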
