Herein we present the R package rFSA, which implements an algorithm for improved variable selection. The algorithm searches a data space for models of a user-specified form that are statistically optimal under a measure of model quality. Many iterations afford a set of feasible solutions (or candidate models) that the researcher can evaluate for relevance to his or her questions of interest. The algorithm can be used to formulate new or to improve upon existing models in bioinformatics, health care, and myriad other fields in which the volume of available data has outstripped researchers' practical and computational ability to explore larger subsets or higher-order interaction terms. The package accommodates linear and generalized linear models, as well as a variety of criterion functions such as Allen's PRESS and AIC. New modeling strategies and criterion functions can be adapted easily to work with rFSA.
In recent years, novel statistical modeling techniques have become more computationally intensive in an effort to accommodate the massive datasets afforded by advances in fields such as data mining and genetic sequencing. The volume and complexity of available data continue to grow, overwhelming even the fastest modern computers (Goudey 2015). Efforts to analyze such complex data usually involve some level of data reduction (e.g., Principal Component Analysis, Partial Least Squares, Sufficient Dimension Reduction), which can yield convoluted statistical models whose parameters researchers and statisticians alike struggle to interpret. Furthermore, many of these strategies are limited to specific model forms and lack the flexibility to explore higher-order interactions. Herein we present an alternative to data reduction that accommodates a variety of modeling strategies, including large-subset selection and identification of higher-order interaction terms. The resulting models, coefficient estimates, and predictions remain easily interpretable and flexible to traditional modes of statistical inference.
The Feasible Solutions Algorithm (FSA) overcomes the problems described above by searching a reduced data space to produce feasible solutions (Miller 1984; Hawkins 1994). Feasible models (Hawkins 1994) are optimal under a given criterion function in the sense that no single exchange of an explanatory variable contained in the model for a variable outside the model will yield an improvement to the criterion. Miller (1984) first introduced the idea of a sequential-replacement algorithm such as the FSA. According to Miller, this computationally "cheap" method boasts rapid convergence, flexible implementation, and improved results compared to forward or backward selection. Miller also noted that the replacement algorithm could give too many solutions if repeated. However, Hawkins (1994) applied a similar exchange algorithm to the problem of finding robust regression estimators and minimum-volume ellipsoid estimators in multivariate data, demonstrating that with a sufficient number of random starts, the algorithm could find the optimal solution with arbitrarily high probability. Upon testing the algorithm, Hawkins also exhibited its superior performance relative to exhaustive search.
Exhaustively searching for an optimal model containing interactions is particularly demanding, computationally speaking, and not always reasonable or even attainable. For this reason, many published analyses do not even attempt to explore interactions. In other cases, interaction terms are identified by a primary investigator on the basis of his or her prior knowledge of the field, and then screened on an individual level by a statistician or data scientist. Such a process is tedious and time-consuming, and usually results in interactions being ignored or overlooked due to the sheer number of possibilities. These factors unite to afford a widespread lack of consideration for interactions, thereby undermining the predictive power of models attempting to capture complex relationships (Foster and Stine 2004). We address these limitations by implementing an FSA with the capacity to explore higher-order terms, combined with the accessibility and ease of use associated with an R package.
The R implementation of our Feasible Solutions Algorithm, rFSA 0.9.1, is now accessible via GitHub and CRAN.
Statisticians are often faced with the problem of identifying an informative subset of \(m\) explanatory variables from a set of \(p\) predictor terms, which we will denote \({\bf X}^p\). Consider selecting \(p^+\ge0\) explanatory variables, denoted \({\bf X}^{p^+}\), to compose a preliminary model. Let \(g({\bf Y},{\bf X})\) be an objective (criterion) function, generally a measure of model quality such as \(R^2\), AIC, or PRESS. We wish to identify the \(m\) additional variables, \({\bf X}^m\), that when added to the existing model serve to optimize the objective function \(g({\bf Y},{\bf X}^{(p^+ + m)})\).
The Feasible Solutions Algorithm implements the following strategy, where we initially consider only first-order predictor terms:
1. Randomly select \(m\) variables to compose \({\bf X}^m\), and compute the objective function \(g({\bf Y},{\bf X}^{(p^+ + m)})\).
2. Under each possible exchange of one of the \(m\) variables in the working model for a variable not contained in the model, compute the new value of the objective function.
3. Make the single exchange of variables from Step 2 that most improves the objective function.
4. Continue making exchanges until no single additional exchange improves the objective function. The variables \({\bf X}^{p^+},{\bf X}^{m^*}\) composing the final model constitute a single feasible solution.
5. Return to Step 1 to search for another feasible solution.
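The exchange procedure above can be sketched compactly. The following is a minimal, language-agnostic sketch (written here in Python purely for illustration; the package itself is implemented in R), in which a generic `criterion` callable stands in for the objective function \(g\):

```python
def feasible_solution(variables, criterion, start, maximize=True):
    """Greedy sequential replacement: repeatedly make the single best
    exchange of one variable inside the model for one outside it,
    stopping when no exchange improves the criterion."""
    current = set(start)                       # random start of size m
    sign = 1 if maximize else -1
    best = sign * criterion(current)
    while True:
        best_swap = None
        for inside in list(current):           # try every one-variable exchange
            for outside in set(variables) - current:
                candidate = (current - {inside}) | {outside}
                value = sign * criterion(candidate)
                if value > best:
                    best, best_swap = value, candidate
        if best_swap is None:                  # no exchange helps: feasible
            return current
        current = best_swap                    # make the single best exchange
```

Repeating the call from many random starts yields the set of feasible solutions.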
Miller (1984) illustrates the general procedure as follows: Suppose you have a dataset containing 26 predictor variables labeled A through Z, and imagine you wish to find the best subset of four predictor variables. Randomly select four initial predictors; suppose these are \(ABCD\). Consider exchanging one of \(A\), \(B\), \(C\), or \(D\) with each of the 22 remaining variables. Make the change that most improves the objective function; suppose we swap \(C\) for \(X\), so that the working subset becomes \(ABXD\). Next consider exchanging one of \(A\), \(B\), \(X\), or \(D\). (In point of fact, considering an exchange involving \(X\) is here redundant and unnecessary.) Repeat this process until no additional exchange of variables yields further improvement to the objective function.
If instead a user requests interaction terms of \(m^{\text{th}}\) order, our FSA randomly selects \(m\) effects to compose an initial interaction term in step (1). In step (2), the algorithm considers exchanging any one of the \(m\) variables involved in the interaction term. In accordance with the hierarchical paradigm, all associated lower-order interactions and main effects are exchanged as well, prior to calculating the associated objective function; no linear or lower-order terms are included in the model beyond those required to complete the hierarchy unless the user requests a variable be fixed in the model. The sequential-replacement procedure otherwise operates as described above, but this extension permits optimization with respect to the \(p\)-value of an interaction term as an alternative to standard model-building criteria.
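To make the hierarchy rule concrete, the terms implied by an interaction can be enumerated as all non-empty subsets of its variables. This is an illustrative sketch (Python, with a hypothetical `hierarchy` helper; it is not code from the package):

```python
from itertools import combinations

def hierarchy(term):
    """All main effects and lower-order interactions implied by an
    interaction term under the hierarchical paradigm: every non-empty
    subset of the term's variables, ordered by interaction order."""
    variables = sorted(term)
    return [":".join(combo)
            for order in range(1, len(variables) + 1)
            for combo in combinations(variables, order)]
```

For example, a three-way interaction among A, B, and C brings in the main effects A, B, C, the two-way terms A:B, A:C, B:C, and the three-way term A:B:C, so exchanging one of its variables replaces this entire hierarchy.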
Each iteration of the FSA yields a single feasible solution, which is the best model under a given criterion that can be reached from the (random) starting configuration by means of sequential replacement. Of course, depending on the initialized model, the algorithm may not converge to the global optimum with respect to the specified objective function. However, if a truly correct model exists and consists of explanatory variables present in the dataset, then under an appropriate objective function and with many iterations, rFSA is increasingly likely to discover it: A model containing correct explanatory variables should yield a value for the criterion superior to that of a model containing explanatory variables unrelated to the response. In our experience with rFSA, the models best supported by the data tend to manifest more often than inferior models, but it can be shown that with many random starts even relatively rare (locally or globally) optimal configurations may be identified (Hawkins 1994). Thus, with multiple random starts, the FSA is likely to discover the optimal model, as well as additional models that would be overlooked if one considered only the 'best' model under a specified criterion, but which may contain additional information of interest from a clinical or scientific viewpoint.
Many existing methods attempt to identify the best subset of predictors that adequately explains a response variable. Common automatic variable-selection techniques include forward selection, backward elimination, and stepwise regression methods, while penalized regression techniques, such as LASSO, are commonly used for subset selection with many variables. Exhaustive search procedures simply examine all possible combinations of predictors under a viable model structure; but even on the most-powerful computers, this approach becomes impracticable for datasets containing many explanatory variables (Goudey 2015). Algorithms implementing these procedures in the context of linear or generalized linear models are currently available in the form of the R packages leaps (Lumley and Miller 2009), glmulti (Calcagno 2013), and glmnet (Friedman 2008).
Computational methods for exploring potential interactions include Random Forest (Jiang 2009) and Boosting (Lampa 2014) methods. In the former approach, researchers examine branches of a regression tree to identify splits that contribute downstream to alternative classifications of the response variable. Bayesian methods also exist for identifying interactions, although they tend to specialize in exploration of large genetic datasets (Zhang and Liu 2007). LASSO can be used to identify interactions by selecting informative variables from an initial pool containing all possible interactions as well as first-order terms (Bien et al. 2013), and LASSO for Hierarchical Interactions has been implemented in the R package hierNet (Bien and Tibshirani 2014). Although all of these methods have been used previously for identifying interactions, the software implementing them is limited with respect to the statistical methods it utilizes and the criterion functions it obliges. The statistical community would therefore benefit from a robust algorithm whose package can accommodate multiple statistical methods and criterion functions.
Our R package rFSA implements the FSA described above for use in subset selection and identification of interaction terms. The primary function, FSA, accepts as two of its arguments any user-specified R functions for use in fitting models (e.g., lm, glm) and calculating the model criterion (e.g., AIC, BIC, r.squared), with only the restrictions that the criterion function must (1) accept as its argument the model object returned by the specified model-fitting function and (2) return a single numeric value. Additional arguments to FSA include the number of parameters to consider, whether to investigate interactions with or without quadratic terms, and whether to minimize or maximize the criterion function.
Within the R package rFSA, the function FSA is used to identify feasible models. A full list of the parameters belonging to FSA is described below, for reference in illustrating the architecture of the package:
Parameters:
formula | symbolic description of the functional model form to be fitted. All subsequent analysis will model the response variable designated by this parameter. |
data | data frame containing the set of predictor variables of interest. |
fitfunc | function to fit the model. Defaults to lm. |
fixvar | vector of one or more variable names to be fixed in the model as main effects. Defaults to NULL. |
quad | logical: whether to consider higher-order interactions containing two instances of the same variable. Defaults to FALSE. |
m | number of predictors to compose a model. If interactions is TRUE then \(m\) is the order of interactions to consider. |
numrs | number of random starts to perform. |
cores | number of cores to use while running. Restricted to 1 for a Windows machine. |
interactions | logical: whether to include interactions in the model. Defaults to TRUE. |
criterion | vector of criterion function(s) to maximize or minimize. Defaults to AIC. |
minmax | vector of strings "min" or "max" specifying whether to minimize or maximize each criterion function. Defaults to "min". |
checkfeas | vector of variable names composing a model to test for feasibility. If multiple random starts are specified, the test of feasibility will occupy the final random start. Defaults to NULL. |
var4int | variable to be fixed in the model as one component of the interaction term. Defaults to NULL. |
min.nonmissing | minimum threshold for observations required to fit a model. Defaults to 1. |
return.models | logical: whether to return fitted model objects. Defaults to FALSE. |
... | additional arguments to be passed to the model-fitting function. |
Assume we wish to discover the first-order linear model containing three predictors, chosen from a pool of eight predictors, with maximal \(R^2\). To increase our probability of discovering the global optimum with respect to our chosen criterion, we choose to execute ten random starts. To decrease the time until completion, we run the algorithm on five cores. Let dat.all denote the data frame containing the response variable, denoted Y, and all eight predictors. Then the code required to execute this search is provided below, acting on a simulated response variable and eight continuous predictors.
```r
set.seed(40508)
# simulate explanatory variables, response, and uninformative variables
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
dat.all <- data.frame(Y = 2 + 3*x1 - 4*x2 + 0.5*x3 + rnorm(100),
                      X1 = x1, X2 = x2, X3 = x3,
                      X4 = rnorm(100), X5 = rnorm(100), X6 = rnorm(100),
                      X7 = rnorm(100), X8 = rnorm(100))

# load package and execute rFSA
library(rFSA)
fsa.fit <- FSA(formula = 'Y~1', data = dat.all, fitfunc = lm,
               quad = FALSE, m = 3, numrs = 10, cores = 5,
               interactions = FALSE, criterion = r.squared,
               minmax = "max", return.models = FALSE)
print(fsa.fit)
```

rFSA first generates the ten random starts, each of which consists of three indices: one corresponding to each predictor in the initial model. For each random start, the algorithm generates every unique exchange of variables that can be achieved from the initial model. Corresponding models are fitted by calling the fitfunc, in this case lm, and the criterion function is calculated for each of the resulting models. If the user specifies multiple cores, then the mclapply function from the R package parallel is employed to fit multiple models simultaneously. Criterion values, here \(R^2\) values, are stored in a hash table to prevent having to fit the same model multiple times across different random starts, thereby achieving an additional gain in computational efficiency as well as reducing the memory requirement. (If return.models is TRUE, then the model objects themselves are stored in a separate list.) rFSA makes the exchange that achieves the greatest improvement in model criterion, and then repeats the process of generating and making exchanges. The algorithm ceases iterating on a single random start when the best exchange yields no change in the model, or when the working model converges to a feasible solution discovered previously.
The FSA function is written as an S3 object. Returned results assume class definition FSA and can be used in conjunction with standard S3 functions print, summary, predict, fitted, and plot, described in greater detail below.
The core of the package is the FSA optimization procedure, which is depicted as a flowchart in Figure (1). As this figure illustrates, the procedure can be divided into several sub-procedures as follows:
1. The algorithm first generates a number (specified by the user, here denoted \(M\)) of random starts \(P_1, P_2, \cdots, P_M\). Each member \(P_i\) is a random combination of variables from the predictor set.
2. For each random start \(P_i\), rFSA then generates all possible variable exchanges \(Q_{i1}, Q_{i2}, \cdots, Q_{iN}\). Similar to a random start, each \(Q_{ij}\) is also a combination of variables, but differs from its corresponding start position \(P_i\) by precisely one variable. As a result, we obtain \(M \times N\) combinations.
3. Next, for each combination \(Q_{ij}\), rFSA fits the model using the function specified by the input argument fitfunc and calculates the corresponding criterion value using the function specified by criterion. Thus we obtain a criterion value \(C_{ij}\) for each combination \(Q_{ij}\).
4. The criterion values \(C_{i1}, C_{i2}, \cdots, C_{iN}\) are used to identify the optimal exchange among \(Q_{i1}, Q_{i2}, \cdots, Q_{iN}\) from the starting configuration \(P_i\). The best swap combination, denoted \(P_i'\), is that with the smallest or largest criterion depending on the user-specified argument minmax.
5. Finally, the algorithm decides whether to continue iterating by comparing the best swap \(P_i'\) and the initial combination \(P_i\). If they are equal, then the algorithm stops processing and stores \(P_i\) as a feasible solution. Otherwise, it updates \(P_i\) to \(P_i'\) and continues iterating until a feasible solution is found.
Note that the computationally most-intensive stage of the algorithm occurs when obtaining the model fit and criterion values for combinations \(Q_{11} \cdots Q_{1N}, \cdots, Q_{M1} \cdots Q_{MN}\). Moreover, some of this work is redundant, as the same model fit may be required as many as \(M\) times. We therefore elect to store criterion values upon calculation, thereby reducing the computational burden and improving execution efficiency. A detailed schematic of this implementation is provided in Figure (2).
The FSA function implements the procedure depicted in Figure (2) to maintain a global table of criterion values for easy reference. Prior to calculating a model criterion, rFSA first checks whether the model has been fitted previously and, if so, uses the stored criterion value rather than refitting the model. If the model's criterion is not represented in the existing table, then rFSA fits the model, calculates the criterion value, and stores that value in the table. Specifically, we adopt the following techniques to optimize our implementation of the algorithm:
By maintaining a look-up table, we avoid redundant computations to fit the same model multiple times.
We implement the look-up table as a hash table, which achieves constant-time insertion and querying.
We use the R package hashmap (Russell 2017) to implement the criterion table. hashmap is written in C++, yielding further improvement to the execution speed of rFSA.
We fit models in parallel on multiple computation cores using the mclapply function of the parallel package. Because model fitting is independent across different sets of variables, the task is appropriate for multi-thread or multi-process parallelization.
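The caching idea can be sketched independently of R: key the look-up table by the (order-free) set of variables, and fit only on a cache miss. Below is a minimal Python analogue of the hash-table technique (the names `make_cached_criterion`, `fit`, and `counter` are illustrative, not the package's API):

```python
def make_cached_criterion(fit, criterion):
    """Wrap a fit/criterion pair with a hash-table cache so that a model
    shared by several random starts is fitted at most once."""
    cache = {}
    counter = {"fits": 0}              # how many real model fits occurred
    def cached(combo):
        key = frozenset(combo)         # variable order is irrelevant
        if key not in cache:
            counter["fits"] += 1
            cache[key] = criterion(fit(combo))
        return cache[key]
    return cached, counter
```

Because the key is the set of variables rather than their order, two random starts that wander onto the same candidate model share a single fit.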
In this section, we consider some nuances related to the arguments of the function FSA, and end with details of the results returned by the same function.
The formula parameter allows users to specify an initial fit. Because initial models are randomly generated, it is sufficient to specify a null model containing solely the desired response variable and an intercept term, e.g., Y \(\sim\) 1. Variables to be fixed in the model should be provided in the form of a vector of character strings representing the variable names, provided to FSA as the argument fixvar.
The data structure should contain only the target response variable and predictor variables of interest, with each variable occupying a distinct column of the data structure. Variables known to be superfluous should be omitted from the data structure prior to executing FSA. Furthermore, categorical variables should be denoted in quotes or stored as factors to distinguish them from quantitative variables, and should assume at least two levels to avoid complications due to invariance.
Models will be fitted using the user-specified fitfunc, to which the formula, data, and any additional arguments will be passed. The objective function criterion may be any user-specified R function that accepts as its argument the model object returned by fitfunc, provided the criterion function returns a single numeric value (or NA) that can be minimized or maximized. If users intend to write their own criterion functions, we recommend that they protect against errors by wrapping the functions in tryCatch statements, available in base R. As an alternative, users may specify one of the criterion functions built into rFSA, which are detailed below. In either case, the specified criterion function will be maximized or minimized in accordance with the value of minmax.
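The tryCatch advice amounts to a defensive wrapper: any error raised while evaluating the criterion degrades to a missing value, so one ill-conditioned model cannot abort the whole search. A Python stand-in for that pattern (illustrative only; in R one would wrap the criterion body in tryCatch and return NA):

```python
import math

def safe_criterion(criterion):
    """Return a wrapped criterion that yields NaN instead of raising,
    mirroring the tryCatch protection recommended for user-written
    criterion functions."""
    def wrapped(model):
        try:
            return float(criterion(model))
        except Exception:
            return math.nan
    return wrapped
```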
The mclapply function used to support parallel processing in rFSA accommodates the use of multiple cores for Unix environments only. For this reason, rFSA automatically instantiates the cores parameter to 1 for Windows users, regardless of the user-specified value. For Unix machines with multiple cores, we recommend a maximum threshold for cores of one fewer than the number of cores available on the machine. Users can execute parallel::detectCores() to determine the number of cores on their computers.
To request subset selection without interactions, interactions must be set to FALSE. (In this case, the quad parameter will be ignored.) In this scenario, m denotes the number of predictors to compose each resulting subset. In the case of interactions = TRUE, the meaning of m changes to represent the desired interaction order. (During model fitting, all lower-order terms associated with the interactions are also included.) In such an instance, the parameter var4int can be used to fix a variable as one member of the interaction term, thereby restricting the FSA function to identify \(m\)-way interactions containing the specified predictor along with any other \(m-1\) variables. The quad parameter specifies whether higher-order interactions should be permitted to contain the same variable twice. Regardless of these parameters, the FSA function yields feasible solutions equal in number to numrs. However, these models may not be unique if a feasible solution is accessible from multiple random starting positions. In order to provide cleaner results, rFSA produces a table summarizing these results (see Function Outputs, below).
Finally, the FSA function accommodates testing for the feasibility of a model, where the desired predictors may be provided as the argument checkfeas. If the model is feasible, then the resulting feasible solution will contain the same variables as checkfeas; otherwise, the FSA function returns the feasible model accessible from the initial model specified in checkfeas. If the user specifies multiple random starts, then the test of feasibility occupies the final random start.
Criterion Functions
Criterion functions built into rFSA include \(R^2\) (r.squared) and adjusted \(R^2\) (adj.r.squared), Allen's PRESS statistic (apress), interaction \(p\)-values (int.p.val), Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC), penalized Quasi-likelihood under the Independence model Criterion (QICu.geeglm), and Bhattacharyya Distance (bdist). These criterion functions may not be appropriate for every fitfunc; for example, r.squared does not accommodate generalized linear models.
Most of the above metrics are well known and thus do not require introduction, but we summarize briefly two of the less-common statistics. The Bhattacharyya Distance (Bhattacharyya 1946) is a distance measure that quantifies the disparity between two distributions. The Bhattacharyya Distance is particularly useful for a binary response when the user's interest lies in exploring two-way interactions between continuous explanatory variables (Janse 2017). This is a common problem in large genetic datasets, in which context bdist is doubly useful because it converges faster than the other criterion functions. The Quasi-likelihood under the Independence model Criterion (QIC) (Pan 2001) is a goodness-of-fit statistic for generalized estimating equations, analogous to AIC. The function built into rFSA accepts an object of class geeglm, fitted with geepack (Yan 2002; Yan and Fine 2004; Højsgaard et al. 2006), and returns the penalized QIC, denoted QICu, which is preferred for subset selection.
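For reference, the Bhattacharyya distance between two univariate normal distributions has the closed form \(D_B = \frac{1}{4}\frac{(\mu_1-\mu_2)^2}{\sigma_1^2+\sigma_2^2} + \frac{1}{2}\ln\frac{\sigma_1^2+\sigma_2^2}{2\sigma_1\sigma_2}\), sketched below in Python (illustrative only; the bdist criterion in rFSA operates on fitted model objects and may differ in detail):

```python
import math

def bhattacharyya_normal(mu1, var1, mu2, var2):
    """Closed-form Bhattacharyya distance between N(mu1, var1) and
    N(mu2, var2); zero iff the two distributions coincide."""
    mean_part = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_part = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_part + var_part
```

For a binary response, the distance between the two within-class distributions of an explanatory feature grows as the classes separate, which is what makes it usable as a screening criterion.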
Function Outputs
A successful run of the FSA function returns a list whose elements are described below:
Returned Values:
$originalfit | model object representing the fit of the user-specified original model. |
$call | list ofFSA function parameters and their actualized values. |
$solutions | list of initial and final (i.e., feasible) predictors contained in each model; criterion function values for feasible models; and number of exchanges made. Ifreturn.models=TRUE thensolutions will contain an object of all fitted models calledchecked.model. |
$table | data frame of unique feasible solutions; corresponding criterion values; and the number of random starts that converged to each solution. |
$efficiency | character string contrasting the total number of fitted models and criterion values calculated using rFSA against those required for exhaustive search on the same dataset. |
$nfits | integer representing the total number of fitted models. |
Given an FSA object as its argument, the print command will display a table containing the original user-specified model and all feasible solutions that the algorithm found over numrs random starts. The table also contains criterion values for each model, and the number of random starts that converged to each feasible solution. Generally speaking, the original fit is not a feasible solution, and thus the number of random starts for that model will be listed as NA. The summary command acting on an object of class FSA will display a list containing as its elements the summary output of each fitted model; the format therefore depends on the chosen fitfunc. The predict and fitted functions operate in a similar fashion to summary, returning a list of predicted or fitted values, respectively, for each model. Finally, the plot function generates a Q-Q plot and a plot of residuals against fitted values for each model, displayed in a compact manner.
Currently, rFSA version 0.9.1 is available to download from the Comprehensive R Archive Network at https://cran.r-project.org/web/packages/rFSA. To install the newest beta version of rFSA, first install the devtools package in R, and then run devtools::install_github("joshuawlambert/rFSA").
An easy-to-use Shiny Application is also available to facilitate the basic functions of the package. We believe this application will serve an important role for researchers unfamiliar with R, who would nonetheless benefit from the ability to describe and/or make predictions from large datasets. Our Shiny Application allows users to upload their own data and specify parameter values via radio buttons and drop-down boxes, as well as to visualize fundamental elements of the resulting models. The application currently subsists on a server hosted by the University of Kentucky, accessible from https://shiny.as.uky.edu/mcfsa/. Important: Users should not upload sensitive data to our FSA Shiny Application. To accommodate protected data, a future version of rFSA will permit users to run local instances of the Shiny Application.
Several R packages exist already for finding best subsets and exploring pairwise interactions. Two of the most-popular packages in current use are the leaps (Lumley and Miller 2009) package and glmulti (Calcagno 2013). These packages use criterion functions such as \(R^2\), adjusted \(R^2\), Mallows' \(C_p\), residual sum of squares, and AIC/BIC to identify the best subset of predictor variables to describe a given response variable. Although these packages are useful, they offer a limited selection of statistical models and criterion functions. In this section, we describe these and other limitations of existing packages, and explain how rFSA seeks to overcome them.
The leaps package leverages exhaustive search, forward or backward selection, or a sequential-replacement algorithm to find the best subset of predictors to compose a model. This sequential-replacement algorithm is a variation on the FSA introduced by Miller (1984) and described above, and therefore serves as a relatively efficient option for analyzing large datasets. Although the leaps package is flexible and robust, at present it does not accommodate interaction terms, model forms other than first-order linear, or the use of alternative criterion functions to the aforementioned five. The bestglm (McLeod and Xu 2017) package seeks to extend best-subset model selection to generalized linear models but is not natively capable of looking for higher-order interactions, using external criterion functions, or accommodating other statistical methods. It also makes no special consideration for large problems, rendering it unsuitable for datasets with more than 100 predictors. glmulti improves on the aforementioned packages in that it is capable of incorporating pairwise interaction terms into exhaustive search or a genetic algorithm, but it does not support other statistical methods, higher-order interactions, or flexible criterion functions.
We now present the results of a simulation conducted to compare the performance of rFSA against that of exhaustive search with leaps. We include neither glmulti nor bestglm in the comparison because neither package was able to accommodate the high-dimensional datasets of interest (\(p>100\)).
Simulations were conducted on twenty-five datasets of \(N=250\) observations for various numbers of predictors, again denoted by \(p\). In each dataset, a continuous response was generated randomly from a standard normal distribution. Half of the predictors were generated randomly from a standard normal distribution, and the other half, from a Bernoulli distribution with \(P(X=1)=0.50\).
The execution speed of each package was measured individually for each dataset. Simulations were conducted in R version 3.4.1 on a Linux machine with an Intel(R) Xeon E5-2698 v4 @ 2.20GHz and 128.00 GB of memory. A single core was allocated for each of regsubsets and the FSA function. Code follows for calling the relevant function in each of packages leaps (version 3.0) and rFSA (version 0.9.1) to search for best subsets of size three:
```r
leaps::regsubsets(x = ..., y = ..., nbest = 1, nvmax = 3,
                  really.big = T, method = "exhaustive")

rFSA::FSA(X1 ~ 1, data = ..., interactions = F, m = 3,
          numrs = 1, criterion = AIC, minmax = "min")
```

Figure 3 compares the run-time in \(\log(\text{seconds})\) for these commands over twenty-five simulations and \(p=(150, 250, 350, 450, 650, 750, 850)\) when searching for best subsets of size three. Exhaustive search with leaps exhibits a superior run-time on average for \(p \leq 350\), while rFSA is more efficient when \(p>350\). rFSA was approximately 13 minutes faster for \(p=750\). The growth in execution time for rFSA slows relative to that of leaps as the number of variables increases: although the overhead is higher for FSA, the costs associated with analyzing incrementally larger volumes of data are lower.
Figure 4 depicts run-times for twenty-five simulations of \(p=(50,100,150,200,250,300,350)\) when searching for best subsets of size five. In this scenario, exhaustive search with leaps evinces a faster time for \(p=50\), while rFSA exhibits a superior run-time for all other datasets (\(p>50\)). The largest difference in time between leaps::regsubsets and rFSA::FSA for best subsets of size five and a fixed \(p\) was \(89,815\) seconds (or \(24.9\) hours) for \(p=350\). Figures 3 and 4 both illustrate that the growth in run-time of our implemented FSA algorithm is much slower than that of leaps, which confirms the better time complexity of rFSA. As a result, our implemented algorithm is able to analyze datasets with large \(p\) at a much faster rate than the conventional alternative package.
Across the 175 simulations for best subsets of size three, rFSA identified the optimal solution in 158 of the simulations. When searching for best subsets of size five, rFSA discovered the optimal solution in 144 simulations and afforded gains in run-time of up to 24 hours. Although rFSA does not implement an exhaustive search, execution with many random starts often requires fewer computations while yet producing the optimal solution with arbitrarily high probability (Hawkins 1994). Thus for large \(p\), we argue that rFSA is a practical solution for researchers who wish to consider high-dimensional data, higher-order terms, generalized linear or mixed models, or other non-traditional statistical methods and criterion functions.
The 2014 Planning Database Block Group Data (PDB) from the Census Bureau is publicly available at http://www.census.gov/research/data/planning_database/2014/. Descriptions of the variables can be found in the PDB documentation on the aforementioned website. Herein we examine only the census blocks associated with Kentucky, with several variables removed because they were transformations of other variables. The final dataset contained 3285 observations and 67 quantitative explanatory variables in addition to the quantitative response variable, mail response rate. The revised dataset can be downloaded from https://github.com/joshuawlambert/data/raw/master/census_data_nopct.csv.
In this example, we search for the best linear model containing (a) two main effects and their interaction, or (b) three main effects, as well as their pairwise and three-way interactions. To fit linear models using the FSA function, we set fitfunc equal to lm. For simplicity, we choose not to fix any variables in the models, which we achieve by setting fixvar equal to NULL. We provide a null model regressing the response variable \(y\) (Mail Response Rate) solely on an intercept term, thereby restricting rFSA to use \(y\) as the response variable throughout the procedure. The criterion function is taken to be int.p.val (interaction \(p\)-value) and minimized on each iteration. To specify a desire for two-way interactions, we set parameters interactions = TRUE and m = 2; for three-way interactions, interactions = TRUE and m = 3. We request 50 random starts, which rFSA completed in approximately one minute for m = 2 or five minutes for m = 3 on a Windows 7 machine with Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz and 24.00 GB of memory.
R code follows for reproduction of these results:
```r
install.packages("rFSA")
library(rFSA)

download.file("http://raw.githubusercontent.com/joshuawlambert/data/master/census_data_nopct.csv",
              destfile = "tmp.csv")
census_data_nopct <- read.csv(file = "tmp.csv")

# find two-way interactions
set.seed(123)
fit_2_way <- FSA(formula = "y~1", data = census_data_nopct, fitfunc = lm,
                 fixvar = NULL, quad = FALSE, m = 2, numrs = 50, cores = 1,
                 interactions = TRUE, criterion = int.p.val, minmax = "min")
print(fit_2_way)    # summary of solutions found
summary(fit_2_way)  # list of summaries from each lm fit
plot(fit_2_way)     # diagnostic plots

# find three-way interactions
set.seed(1234)
fit_3_way <- FSA(formula = "y~1", data = census_data_nopct, fitfunc = lm,
                 fixvar = NULL, quad = FALSE, m = 3, numrs = 50, cores = 1,
                 interactions = TRUE, criterion = int.p.val, minmax = "min")
print(fit_3_way)    # summary of solutions found
summary(fit_3_way)  # list of summaries from each lm fit
plot(fit_3_way)     # diagnostic plots
```

Across 50 random starts seeking two-way interactions, rFSA discovered three unique feasible solutions. The solution with the smallest two-way interaction \(p\)-value occurred 24 times, ranking second in prevalence. The three interactions suggest relationships between (a) the number of households with both spouses as members and the number of persons aged 18 to 24 (\(p = 6.19 \times 10^{-38}\)); (b) average household income and average aggregate house value in dollars (\(p = 1.68 \times 10^{-43}\)); and (c) the number of mobile homes and total vacant housing units (\(p = 7.26 \times 10^{-19}\)). All three interaction \(p\)-values are significant using a Bonferroni-adjusted cutoff of \(\frac{0.05}{\binom{67}{2}} \approx 2.3 \times 10^{-5}\).
Having identified a set of feasible solutions, an investigator generally wishes to examine a summary of each model fit. Given an object, fit, returned by FSA, the function call summary(fit) returns a list of summaries for each model fit (including the original). Assessing the fit of each feasible solution can be facilitated with plot(fit), which displays diagnostic plots for the original and feasible models. Each solution should be considered in a practical manner, with reasonable interpretations of the interaction term being considered prior to its inclusion in a final model.
The three interactions listed above are informative for understanding the types of results returned by the algorithm. For example, one might expect that a larger average household income would be correlated positively with the average value of houses in a census block. Thus, as in any linear regression setting, multicollinearity may account for apparent relationships in the data. The interaction between the number of households with both spouses as members and the number of persons aged 18 to 24, by contrast, may be more meaningful in the description of mail response rate (\(y\)). Because this interaction is both justifiable and statistically significant, an investigator would have considerable evidence to warrant its inclusion in a final model.
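Once an interaction survives this scrutiny, the corresponding final model can be refit and inspected directly with lm. In the sketch below, spouse_hhs and pop_18_24 are hypothetical stand-ins for the relevant PDB column names, which differ in the actual dataset:

```r
# Refit the suggested final model by hand for traditional inference.
# spouse_hhs and pop_18_24 are placeholder names for the PDB columns
# discussed in the text (households with both spouses; persons aged 18-24).
final_fit <- lm(y ~ spouse_hhs * pop_18_24, data = census_data_nopct)

summary(final_fit)  # coefficient table, including the interaction term
confint(final_fit)  # confidence intervals for the main and interaction effects
```

Because the refit model is an ordinary lm object, all conventional diagnostics and inference tools apply unchanged.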
Across 50 random starts seeking three-way interactions, rFSA identified 14 feasible solutions. The most frequently observed feasible solution occurred 13 times and boasted the smallest three-way interaction \(p\)-value. This interaction contains two variables identified previously in our search for two-way interactions. The three-way interaction among the number of households in which both spouses are members, the number of persons aged 18 to 24, and the number of people who indicate no Hispanic origin and their sole race as "White," affords an interaction \(p\)-value of \(7.62 \times 10^{-24}\) and remains significant under a Bonferroni correction. This interaction suggests that the combined effect of married couples and young adults is further modified by the number of people who identify as "White." That is, in locations with a larger population of whites, the effect on mail response rate of having more married couples changes depending on the prevalence of young adults aged 18 to 24. The interaction is reasonably easy both to interpret and to justify, as well as being statistically significant, and thus would warrant additional investigation by an interested researcher.
In sum, leveraging rFSA can highlight associations and interactions underlying a dataset that may not be immediately apparent to the researcher. The package is easy to manipulate (as demonstrated by the sparseness of the code provided in this example) as well as highly efficient, and generally returns multiple subsets of variables to permit flexible exploration and validation.
In this paper, we discuss the implementation in R of a complex algorithm originally proposed by Miller (1984) and Hawkins (1994). We further demonstrate its versatility and computational efficiency by comparison with existing subset-selection packages and by execution on a real-life census dataset. Our package, rFSA, is available on CRAN and GitHub. Additionally, a lightweight version of the algorithm is available as a Shiny Application for the convenience of users with limited familiarity with R.
rFSA boasts several improvements over existing R packages for finding best subsets, some of the most popular of which include leaps, bestglm, and glmulti. Although these packages all perform well in implementing exhaustive searches without interactions, they offer a limited choice of statistical technique, and often struggle when confronted with high-dimensional datasets. By contrast, rFSA accommodates large datasets, higher-order interaction terms, and a variety of model forms and criterion functions, and can be adapted to work with less-traditional statistical methods such as survey-weighted, penalized, hierarchical, zero-inflated, survival, and many other regression techniques. The results permit facile interpretation and remain flexible to conventional modes of statistical inference. Finally, the algorithm exhibits a considerable speedup on single-core operations for large subsets, and large \(p\), relative to variable-selection packages in common use.
To improve the accessibility and convenience of the FSA for use by thegeneral research community, we plan to incorporate more features ofrFSA into the existing Shiny Application. Moreover, we intend toimprove the data-visualization features of both App and package, and tobuild an off-line version of the Shiny App for users with secure data.We will seek further improvements to execution speed and resourcemanagement to encourage analysis of yet-larger datasets, and incorporateadditional model forms and criterion functions to permit efficientanalysis by non-traditional methods.
Through its versatility and flexibility, rFSA provides an alternative algorithm for model selection that allows users to find statistically optimal subsets and interaction effects in a variety of datasets. Improved model selection may afford better predictive power, which can in turn illuminate relationships underlying large datasets in a variety of research fields.
This paper outlines an R package, named rFSA, for subset selection and discovery of higher-order interactions. rFSA is flexible to a variety of statistical models and criterion functions, including those implemented by the user, and boasts execution speeds superior to existing subset-selection packages in the context of large datasets. The release version of rFSA is hosted on CRAN, and the development version can be accessed from GitHub by calling devtools::install_github("joshuawlambert/rFSA").
This research and package creation were supported in part by theKentucky Biomedical Research Infrastructure Network and INBRE Grant (P20RR16481) and a National Multiple Sclerosis Society Pilot Grant(PP-1609-25975).
Supplementary materials are available in addition to this article; they can be downloaded at RJ-2018-059.zip.
CRAN packages used: rFSA, leaps, glmulti, glmnet, hierNet, hashmap, geepack, devtools
CRAN Task Views implied by cited packages: ChemPhys, Econometrics, MachineLearning, MixedModels, Survival
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Lambert, et al., "rFSA: An R Package for Finding Best Subsets and Interactions", The R Journal, 2018
BibTeX citation
@article{RJ-2018-059,
  author = {Lambert, Joshua and Gong, Liyu and Elliott, Corrine F. and Thompson, Katherine and Stromberg, Arnold},
  title = {rFSA: An R Package for Finding Best Subsets and Interactions},
  journal = {The R Journal},
  year = {2018},
  note = {https://doi.org/10.32614/RJ-2018-059},
  doi = {10.32614/RJ-2018-059},
  volume = {10},
  issue = {2},
  issn = {2073-4859},
  pages = {295-308}
}

© The R Foundation.