21 Feature Selection using Genetic Algorithms

Contents

21.1 Genetic Algorithms

Genetic algorithms (GAs) mimic Darwinian forces of natural selection to find optimal values of some function (Mitchell, 1998). An initial set of candidate solutions are created and their correspondingfitness values are calculated (where larger values are better). This set of solutions is referred to as a population and each solution as anindividual. The individuals with the best fitness values are combined randomly to produce offsprings which make up the next population. To do so, individual are selected and undergo cross-over (mimicking genetic reproduction) and also are subject to random mutations. This process is repeated again and again and many generations are produced (i.e. iterations of the search procedure) that should create better and better solutions.

For feature selection, the individuals are subsets of predictors that are encoded as binary; a feature is either included or not in the subset. The fitness values are some measure of model performance, such as the RMSE or classification accuracy. One issue with using GAs for feature selection is that the optimization process can be very aggressive and their is potential for the GA to overfit to the predictors (much like the previous discussion for RFE).

21.2 Internal and External Performance Estimates

The genetic algorithm code incaret conducts the search of the feature space repeatedly within resampling iterations. First, the training data are split be whatever resampling method was specified in the control function. For example, if 10-fold cross-validation is selected, the entire genetic algorithm is conducted 10 separate times. For the first fold, nine tenths of the data are used in the search while the remaining tenth is used to estimate the external performance since these data points were not used in the search.

During the genetic algorithm, a measure of fitness is needed to guide the search. This is the internal measure of performance. During the search, the data that are available are the instances selected by the top-level resampling (e.g. the nine tenths mentioned above). A common approach is to conduct another resampling procedure. Another option is to use a holdout set of samples to determine the internal estimate of performance (see the holdout argument of the control function). While this is faster, it is more likely to cause overfitting of the features and should only be used when a large amount of training data are available. Yet another idea is to use a penalized metric (such as the AIC statistic) but this may not exist for some metrics (e.g. the area under the ROC curve).

The internal estimates of performance will eventually overfit the subsets to the data. However, since the external estimate is not used by the search, it is able to make better assessments of overfitting. After resampling, this function determines the optimal number of generations for the GA.

Finally, the entire data set is used in the last execution of the genetic algorithm search and the final model is built on the predictor subset that is associated with the optimal number of generations determined by resampling (although the update function can be used to manually set the number of generations).

21.3 Basic Syntax

The most basic usage of the function is:

obj <-gafs(x = predictors,y = outcome,iters =100)

where

  • x: a data frame or matrix of predictor values
  • y: a factor or numeric vector of outcomes
  • iters: the number of generations for the GA

This isn’t very specific. All of the action is in the control function. That can be used to specify the model to be fit, how predictions are made and summarized as well as the genetic operations.

Suppose that we want to fit a linear regression model. To do this, we can usetrain as an interface and pass arguments to that function throughgafs:

ctrl <-gafsControl(functions = caretGA)obj <-gafs(x = predictors,y = outcome,iters =100,gafsControl = ctrl,            ## Now pass options to `train`method ="lm")

Other options, such aspreProcess, can be passed in as well.

Some important options togafsControl are:

  • method,number,repeats,index,indexOut, etc: options similar to those fortrain top control resampling.
  • metric: this is similar totrain’s option but, in this case, the value should be a named vector with values for the internal and external metrics. If none are specified, the first value returned by the summary functions (see details below) are used and a warning is issued. A similar two-element vector for the optionmaximize is also required. See thelast example here for an illustration.
  • holdout: this is a number between[0, 1) that can be used to hold out samples for computing the internal fitness value. Note that this is independent of the external resampling step. Suppose 10-fold CV is being used. Within a resampling iteration,holdout can be used to sample an additional proportion of the 90% resampled data to use for estimating fitness. This may not be a good idea unless you have a very large training set and want to avoid an internal resampling procedure to estimate fitness.
  • allowParallel andgenParallel: these are logicals to control where parallel processing should be used (if at all). The former will parallelize the external resampling while the latter parallelizes the fitness calculations within a generation.allowParallel will almost always be more advantageous.

There are a few built-in sets of functions to use withgafs:caretGA,rfGA, andtreebagGA. The first is a simple interface totrain. When using this, as shown above, arguments can be passed totrain using the... structure and the resampling estimates of performance can be used as the internal fitness value. The functions provided byrfGA andtreebagGA avoid usingtrain and their internal estimates of fitness come from using the out-of-bag estimates generated from the model.

The GA implementation incaret uses the underlying code from theGA package (Scrucca, 2013).

21.4 Genetic Algorithm Example

Using the example from theprevious page where there are five real predictors and 40 noise predictors:

library(mlbench)n <-100p <-40sigma <-1set.seed(1)sim <-mlbench.friedman1(n,sd = sigma)colnames(sim$x) <-c(paste("real",1:5,sep =""),paste("bogus",1:5,sep =""))bogus <-matrix(rnorm(n*p),nrow = n)colnames(bogus) <-paste("bogus",5+(1:ncol(bogus)),sep ="")x <-cbind(sim$x, bogus)y <-sim$ynormalization <-preProcess(x)x <-predict(normalization, x)x <-as.data.frame(x)

We’ll fit a random forest model and use the out-of-bag RMSE estimate as the internal performance metric and use the same repeated 10-fold cross-validation process used with the search. To do this, we’ll use the built-inrfGA object for this purpose. The default GA operators will be used and conduct 200 generations of the algorithm.

ga_ctrl <-gafsControl(functions = rfGA,method ="repeatedcv",repeats =5)## Use the same random number seed as the RFE process## so that the same CV folds are used for the external## resampling.set.seed(10)rf_ga <-gafs(x = x,y = y,iters =200,gafsControl = ga_ctrl)rf_ga
## ## Genetic Algorithm Feature Selection## ## 100 samples## 50 predictors## ## Maximum generations: 200 ## Population per generation: 50 ## Crossover probability: 0.8 ## Mutation probability: 0.1 ## Elitism: 0 ## ## Internal performance values: RMSE, Rsquared## Subset selection driven to minimize internal RMSE ## ## External performance values: RMSE, Rsquared, MAE## Best iteration chose by minimizing external RMSE ## External resampling method: Cross-Validated (10 fold, repeated 5 times) ## ## During resampling:##   * the top 5 selected variables (out of a possible 50):##     real1 (100%), real2 (100%), real4 (100%), real5 (100%), real3 (92%)##   * on average, 9.3 variables were selected (min = 6, max = 15)## ## In the final search using the entire training set:##    * 12 features selected at iteration 195 including:##      real1, real2, real3, real4, real5 ... ##    * external performance at this iteration is## ##        RMSE    Rsquared         MAE ##      2.8056      0.7607      2.3640

With 5 repeats of 10-fold cross-validation, the GA was executed 50 times. The average external performance is calculated across resamples and these results are used to determine the optimal number of iterations for the final GA to avoid over-fitting. Across the resamples, an average of 9.3 predictors were selected at the end of each of the algorithms.

Theplot function is used to monitor the average of the internal out-of-bag RMSE estimates as well as the average of the external performance estimates calculated from the 50 out-of-sample predictions. By default, this function usesggplot2 package. A black and white theme can be “added” to the output object:

plot(rf_ga)+theme_bw()

Based on these results, the generation associated with the best external RMSE estimate was 2.81.

Using the entire training set, the final GA is conducted and, at generation 195, there were 12 that were selected: real1, real2, real3, real4, real5, bogus3, bogus5, bogus7, bogus8, bogus14, bogus17, bogus29. The random forest model with these predictors is created using the entire training set is trained and this is the model that is used whenpredict.gafs is executed.

Note: the correlation between the internal and external fitness values is somewhat atypical for most real-world problems. This is a function of the nature of the simulations (a small number of uncorrelated informative predictors) and that the OOB error estimate from random forest is a product of hundreds of trees. Your mileage may vary.

21.5 Customizing the Search

21.5.1 Thefit Function

This function builds the model based on a proposed current subset. The arguments for the function must be:

  • x: the current training set of predictor data with the appropriate subset of variables
  • y: the current outcome data (either a numeric or factor vector)
  • lev: a character vector with the class levels (orNULL for regression problems)
  • last: a logical that isTRUE when the final GA search is conducted on the entire data set
  • ...: optional arguments to pass to the fit function in the call togafs

The function should return a model object that can be used to generate predictions. For random forest, the fit function is simple:

rfGA$fit
## function (x, y, lev = NULL, last = FALSE, ...) ## {##     loadNamespace("randomForest")##     randomForest::randomForest(x, y, ...)## }## <bytecode: 0x7fa7162692d8>## <environment: namespace:caret>

21.5.2 Thepred Function

This function returns a vector of predictions (numeric or factors) from the current model . The input arguments must be

  • object: the model generated by thefit function
  • x: the current set of predictor set for the held-back samples

For random forests, the function is a simple wrapper for the predict function:

rfGA$pred
## function (object, x) ## {##     tmp <- predict(object, x)##     if (is.factor(object$y)) {##         out <- cbind(data.frame(pred = tmp), as.data.frame(predict(object, ##             x, type = "prob")))##     }##     else out <- tmp##     out## }## <bytecode: 0x7fa716269770>## <environment: namespace:caret>

For classification, it is probably a good idea to ensure that the resulting factor variables of predictions has the same levels as the input data.

21.5.3 Thefitness_intern Function

Thefitness_intern function takes the fitted model and computes one or more performance metrics. The inputs to this function are:

  • object: the model generated by thefit function
  • x: the current set of predictor set. If the optiongafsControl$holdout is zero, these values will be from the current resample (i.e. the same data used to fit the model). Otherwise, the predictor values are from the hold-out set created bygafsControl$holdout.
  • y: outcome values. See the note for thex argument to understand which data are presented to the function.
  • maximize: a logical fromgafsControl that indicates whether the metric should be maximized or minimized
  • p: the total number of possible predictors

The output should be anamed numeric vector of performance values.

In many cases, some resampled measure of performance is used. In the example above using random forest, the OOB error was used. In other cases, the resampled performance fromtrain can be used and, ifgafsControl$holdout is not zero, a static hold-out set can be used. This depends on the data and problem at hand.

The example function for random forest is:

rfGA$fitness_intern
## function (object, x, y, maximize, p) ## rfStats(object)## <bytecode: 0x7fa71626a570>## <environment: namespace:caret>

21.5.4 Thefitness_extern Function

Thefitness_extern function takes the observed and predicted values form the external resampling process and computes one or more performance metrics. The input arguments are:

  • data: a data frame or predictions generated by thefit function. For regression, the predicted values in a column calledpred. For classification,pred is a factor vector. Class probabilities are usually attached as columns whose names are the class levels (see the random forest example for thefit function above)
  • lev: a character vector with the class levels (orNULL for regression problems)

The output should be anamed numeric vector of performance values.

The example function for random forest is:

rfGA$fitness_extern
## function (data, lev = NULL, model = NULL) ## {##     if (is.character(data$obs)) ##         data$obs <- factor(data$obs, levels = lev)##     postResample(data[, "pred"], data[, "obs"])## }## <bytecode: 0x7fa71626a7a0>## <environment: namespace:caret>

Two functions incaret that can be used as the summary function aredefaultSummary andtwoClassSummary (for classification problems with two classes).

21.5.5 Theinitial Function

This function creates an initial generation. Inputs are:

  • vars: the number of possible predictors
  • popSize: the population size for each generation
  • ...: not currently used

The output should be a binary 0/1 matrix where there arevars columns corresponding to the predictors andpopSize rows for the individuals in the population.

The default function populates the rows randomly with subset sizes varying between 10% and 90% of number of possible predictors. For example:

set.seed(128)starting <-rfGA$initial(vars =12,popSize =8)starting
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]## [1,]    0    1    0    0    1    0    0    0    0     0     0     0## [2,]    0    0    0    0    0    0    1    0    0     0     1     0## [3,]    0    0    1    1    1    0    0    0    1     0     1     0## [4,]    0    1    0    0    1    0    0    1    0     1     0     1## [5,]    0    1    1    1    0    0    1    0    0     1     0     0## [6,]    1    1    1    1    1    1    0    1    1     1     1     1## [7,]    1    1    1    1    1    0    0    1    0     1     1     1## [8,]    1    0    1    1    0    1    1    1    0     1     1     1
apply(starting,1, mean)
## [1] 0.1666667 0.1666667 0.4166667 0.4166667 0.4166667 0.9166667 0.7500000## [8] 0.7500000

gafs has an argument calledsuggestions that is similar to the one in thega function where the initial population can be seeded with specific subsets.

21.5.6 Theselection Function

This function conducts the genetic selection. Inputs are:

  • population: the indicators for the current population
  • fitness: the corresponding fitness values for the population. Note that if the internal performance value is to be minimized, these are the negatives of the actual values
  • r,q: tuning parameters for specific selection functions. Seegafs_lrSelection andgafs_nlrSelection
  • ...: not currently used

The output should be a list with named elements.

  • population: the indicators for the selected individuals
  • fitness: the fitness values for the selected individuals

The default function is a version of theGA package’s ga_lrSelection function.

21.5.7 Thecrossover Function

This function conducts the genetic crossover. Inputs are:

  • population: the indicators for the current population
  • fitness: the corresponding fitness values for the population. Note that if the internal performance value is to be minimized, these are the negatives of the actual values
  • parents: a matrix with two rows containing indicators for the parent individuals.
  • ...: not currently used

The default function is a version of theGA package’sga_spCrossover function. Another function that is a version of that package’s uniform cross-over function is also available.

.The output should be a list with named elements.

  • children: from?ga_spCrossover: “a matrix of dimension 2 times the number of decision variables containing the generated offsprings”"
  • fitness: “a vector of length 2 containing the fitness values for the offsprings. A valueNA is returned if an offspring is different (which is usually the case) from the two parents.”"

21.5.8 Themutation Function

This function conducts the genetic mutation. Inputs are:

  • population: the indicators for the current population
  • parents: a vector of indices for where the mutation should occur.
  • ...: not currently used

The default function is a version of theGA package’sgabin_raMutation function.

.The output should the mutated population.

21.5.9 TheselectIter Function

This function determines the optimal number of generations based on the resampling output. Inputs for the function are:

  • x: a matrix with columns for the performance metrics averaged over resamples
  • metric: a character string of the performance measure to optimize (e.g. RMSE, Accuracy)
  • maximize: a single logical for whether the metric should be maximized

This function should return an integer corresponding to the optimal subset size.

21.6 The Example Revisited

The previous GA included some of the non-informative predictors. We can cheat a little and try to bias the search to get the right solution.

We can try to encourage the algorithm to choose fewer predictors, we can penalize the the RMSE estimate. Normally, a metric like the Akaike information criterion (AIC) statistic would be used. However, with a random forest model, there is no real notion of model degrees of freedom. As an alternative, we can usedesirability functions to penalize the RMSE. To do this, two functions are created that translate the number of predictors and the RMSE values to a measure of “desirability”. For the number of predictors, the most desirable property would be a single predictor and the worst situation would be if the model required all 50 predictors. That desirability function is visualized as:

For the RMSE, the best case would be zero. Many poor models have values around four. To give the RMSE value more weight in the overall desirability calculation, we use a scale parameter value of 2. This desirability function is:

To use the overall desirability to drive the feature selection, theinternal function requires replacement. We make a copy ofrfGA and add code using thedesirability package and the function returns the estimated RMSE and the overall desirability. ThegafsControl function also need changes. Themetric argument needs to reflect that the overall desirability score should be maximized internally but the RMSE estimate should be minimized externally.

library(desirability)rfGA2 <-rfGArfGA2$fitness_intern <-function (object, x, y, maximize, p) {  RMSE <-rfStats(object)[1]  d_RMSE <-dMin(0,4)  d_Size <-dMin(1, p,2)  overall <-dOverall(d_RMSE, d_Size)  D <-predict(overall,data.frame(RMSE,ncol(x)))c(D = D,RMSE =as.vector(RMSE))  }ga_ctrl_d <-gafsControl(functions = rfGA2,method ="repeatedcv",repeats =5,metric =c(internal ="D",external ="RMSE"),maximize =c(internal =TRUE,external =FALSE))set.seed(10)rf_ga_d <-gafs(x = x,y = y,iters =150,gafsControl = ga_ctrl_d)rf_ga_d
## ## Genetic Algorithm Feature Selection## ## 100 samples## 50 predictors## ## Maximum generations: 150 ## Population per generation: 50 ## Crossover probability: 0.8 ## Mutation probability: 0.1 ## Elitism: 0 ## ## Internal performance values: D, RMSE## Subset selection driven to maximize internal D ## ## External performance values: RMSE, Rsquared, MAE## Best iteration chose by minimizing external RMSE ## External resampling method: Cross-Validated (10 fold, repeated 5 times) ## ## During resampling:##   * the top 5 selected variables (out of a possible 50):##     real1 (100%), real2 (100%), real4 (100%), real5 (100%), real3 (40%)##   * on average, 5.2 variables were selected (min = 4, max = 6)## ## In the final search using the entire training set:##    * 6 features selected at iteration 146 including:##      real1, real2, real3, real4, real5 ... ##    * external performance at this iteration is## ##        RMSE    Rsquared         MAE ##      2.7046      0.7665      2.2730

Here are the RMSE values for this search:

plot(rf_ga_d)+theme_bw()

The final GA found 6 that were selected: real1, real2, real3, real4, real5, bogus43. During resampling, the average number of predictors selected was 5.2, indicating that the penalty on the number of predictors was effective.

21.7 Using Recipes

Like the other feature selection routines,gafs can take a data recipe as an input. This is advantageous when your data needs preprocessing before the model, such as:

  • creation of dummy variables from factors
  • specification of interactions
  • missing data imputation
  • more complex feature engineering methods

Liketrain, the recipe’s preprocessing steps are calculated within each resample. This makes sure that the resampling statistics capture the variation and effect that the preprocessing has on the model.

As an example, the Ames housing data is used. These data contain a number of categorical predictors that require conversion to indicators as well as other variables that require processing. To load (and split) the data:

library(AmesHousing)library(rsample)# Create the data and remove one column that is more of# an outcome.ames <-make_ames()%>%select(-Overall_Qual)ncol(ames)
## [1] 80
# How many factor variables?sum(vapply(ames, is.factor,logical(1)))
## [1] 45
# We'll use `rsample` to make the initial split to be consistent with other# analyses of these data. Set the seed first to make sure that you get the# same random numbersset.seed(4595)data_split <-initial_split(ames,strata ="Sale_Price",prop =3/4)ames_train <-training(data_split)%>%as.data.frame()ames_test  <-testing(data_split)%>%as.data.frame()

Here is a recipe that does differetn types of preprocssing on the predictor set:

library(recipes)ames_rec <-recipe(Sale_Price~.,data = ames_train)%>%step_log(Sale_Price,base =10)%>%step_other(Neighborhood,threshold =0.05)%>%step_dummy(all_nominal(),-Bldg_Type)%>%step_interact(~starts_with("Central_Air"):Year_Built)%>%step_zv(all_predictors())%>%step_bs(Longitude, Latitude,options =list(df =5))

If this were executed on the training set, it would produce 280 predictor columns out of the original 79.

Let’s tune some linear models withgafs and, for the sake of computational time, only use 10 generations of the algorithm:

lm_ga_ctrl <-gafsControl(functions = caretGA,method ="cv",number =10)set.seed(23555)lm_ga_search <-gafs(  ames_rec,data = ames_train,iters =10,gafsControl = lm_ga_ctrl,# now options to `train` for caretGAmethod ="lm",trControl =trainControl(method ="cv",allowParallel =FALSE)) lm_ga_search
## ## Genetic Algorithm Feature Selection## ## 2199 samples## 273 predictors## ## Maximum generations: 10 ## Population per generation: 50 ## Crossover probability: 0.8 ## Mutation probability: 0.1 ## Elitism: 0 ## ## Internal performance values: RMSE, Rsquared, MAE## Subset selection driven to minimize internal RMSE ## ## External performance values: RMSE, Rsquared, MAE## Best iteration chose by minimizing external RMSE ## External resampling method: Cross-Validated (10 fold) ## ## During resampling:##   * the top 5 selected variables (out of a possible 273):##     Bldg_Type (100%), Bsmt_Exposure_No (100%), First_Flr_SF (100%), MS_Zoning_Residential_High_Density (100%), Neighborhood_Gilbert (100%)##   * on average, 171.7 variables were selected (min = 150, max = 198)## ## In the final search using the entire training set:##    * 155 features selected at iteration 9 including:##      Lot_Frontage, Year_Built, Year_Remod_Add, BsmtFin_SF_2, Gr_Liv_Area ... ##    * external performance at this iteration is## ##         RMSE     Rsquared          MAE ##      0.06923      0.84659      0.04260