| Title: | Regularized Random Forest |
| Version: | 1.9.4.1 |
| Date: | 2022-05-30 |
| Depends: | R (≥ 2.5.0), stats |
| Suggests: | RColorBrewer, MASS |
| Description: | Feature Selection with Regularized Random Forest. This package is based on the 'randomForest' package by Andy Liaw. The key difference is the RRF() function that builds a regularized random forest. Fortran original by Leo Breiman and Adele Cutler, R port by Andy Liaw and Matthew Wiener, Regularized random forest for classification by Houtao Deng, Regularized random forest for regression by Xin Guan. Reference: Houtao Deng (2013) <doi:10.48550/arXiv.1306.0237>. |
| Maintainer: | Houtao Deng <softwaredeng@gmail.com> |
| License: | GPL-2 |GPL-3 [expanded from: GPL (≥ 2)] |
| URL: | https://sites.google.com/site/houtaodeng/rrf |
| Repository: | CRAN |
| Date/Publication: | 2024-11-04 18:35:21 UTC |
| Packaged: | 2024-11-04 17:49:15 UTC; hornik |
| NeedsCompilation: | yes |
| RoxygenNote: | 5.0.1 |
| Author: | Houtao Deng [aut, cre], Xin Guan [aut], Andy Liaw [aut], Leo Breiman [aut], Adele Cutler [aut] |
Multi-dimensional Scaling Plot of Proximity matrix from RRF
Description
Plot the scaling coordinates of the proximity matrix from RRF.
Usage
MDSplot(rf, fac, k=2, palette=NULL, pch=20, ...)
Arguments
| rf | an object of class RRF that contains the proximity component. |
| fac | a factor that was used as response to train rf. |
k | number of dimensions for the scaling coordinates. |
| palette | colors to use to distinguish the classes; length must be equal to the number of levels. |
pch | plotting symbols to use. |
... | other graphical parameters. |
Value
The output of cmdscale on 1 - rf$proximity is returned invisibly.
Note
If k > 2, pairs is used to produce the scatterplot matrix of the coordinates.
Author(s)
Robert Gentleman, with slight modifications by Andy Liaw
See Also
Examples
set.seed(1)
data(iris)
iris.rf <- RRF(Species ~ ., iris, proximity=TRUE, keep.forest=FALSE)
MDSplot(iris.rf, iris$Species)
## Using different symbols for the classes:
MDSplot(iris.rf, iris$Species, palette=rep(1, 3), pch=as.numeric(iris$Species))
Feature Selection with Regularized Random Forest
Description
RRF implements the regularized random forest algorithm. It is based on the randomForest R package by Andy Liaw, Matthew Wiener, Leo Breiman and Adele Cutler.
Usage
## S3 method for class 'formula'
RRF(formula, data=NULL, ..., subset, na.action=na.fail)
## Default S3 method:
RRF(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
    mtry=if (!is.null(y) && !is.factor(y))
        max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
    replace=TRUE, classwt=NULL, cutoff, strata,
    sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
    nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
    maxnodes = NULL,
    importance=FALSE, localImp=FALSE, nPerm=1,
    proximity, oob.prox=proximity,
    norm.votes=TRUE, do.trace=FALSE,
    keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
    keep.inbag=FALSE, coefReg=NULL, flagReg=1, feaIni=NULL, ...)
## S3 method for class 'RRF'
print(x, ...)
Arguments
| data | an optional data frame containing the variables in the model. By default the variables are taken from the environment which RRF is called from. |
| subset | an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.) |
| na.action | A function to specify the action to be taken if NAs are found. (NOTE: If given, this argument must be named.) |
| x, formula | a data frame or a matrix of predictors, or a formula describing the model to be fitted (for the print method, an RRF object). |
| y | A response vector. If a factor, classification is assumed, otherwise regression is assumed. If omitted, RRF will run in unsupervised mode. |
| xtest | a data frame or matrix (like x) containing predictors for the test set. |
ytest | response for the test set. |
| ntree | Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times. |
| mtry | Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3). |
replace | Should sampling of cases be done with or withoutreplacement? |
| classwt | Priors of the classes. Need not add up to one. Ignored for regression. |
| cutoff | (Classification only) A vector of length equal to the number of classes. The ‘winning’ class for an observation is the one with the maximum ratio of proportion of votes to cutoff. Default is 1/k where k is the number of classes (i.e., majority vote wins). |
strata | A (factor) variable that is used for stratified sampling. |
| sampsize | Size(s) of sample to draw. For classification, if sampsize is a vector of length equal to the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata. |
| nodesize | Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). Note that the default values are different for classification (1) and regression (5). |
| maxnodes | Maximum number of terminal nodes trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize). |
importance | Should importance of predictors be assessed? |
| localImp | Should casewise importance measure be computed? (Setting this to TRUE will override importance.) |
| nPerm | Number of times the OOB data are permuted per tree for assessing variable importance. A number larger than 1 gives slightly more stable estimates, but is not very effective. Currently only implemented for regression. |
| proximity | Should proximity measure among the rows be calculated? |
| oob.prox | Should proximity be calculated only on “out-of-bag” data? |
| norm.votes | If TRUE (default), the final result of votes is expressed as fractions. If FALSE, raw vote counts are returned (useful for combining results from different runs). Ignored for regression. |
| do.trace | If set to TRUE, give a more verbose output as RRF is run. If set to some integer, then running output is printed for every do.trace trees. |
| keep.forest | If set to FALSE, the forest will not be retained in the output object. If xtest is given, defaults to FALSE. |
| corr.bias | perform bias correction for regression? Note: Experimental. Use at your own risk. |
| keep.inbag | Should an n by ntree matrix be returned that keeps track of which samples are “in-bag” in which trees (but not how many times, if sampling is with replacement)? |
coefReg | the coefficient(s) of regularization. A smaller coefficient may lead to a smaller feature subset, i.e. there are fewer variables with non-zero importance scores. coefReg must be either a single value (all variables have the same coefficient) or a numeric vector of length equal to the number of predictor variables. default: 0.8 |
flagReg | 1: with regularization; 0: without regularization. default: 1 |
feaIni | initial feature subset, useful only when flagReg = 1 |
| ... | optional parameters to be passed to the low level function RRF.default. |
Value
An object of class RRF, which is a list with the following components:
| call | the original call to RRF. |
| type | one of regression, classification, or unsupervised. |
| predicted | the predicted values of the input data based on out-of-bag samples. |
| importance | a matrix with nclass + 2 (for classification) or two (for regression) columns of importance measures; see importance for details. |
| importanceSD | The “standard errors” of the permutation-based importance measure. For classification, a p by nclass + 1 matrix corresponding to the first nclass + 1 columns of the importance matrix. For regression, a length p vector. |
| localImp | a p by n matrix containing the casewise importance measures, the [i,j] element of which is the importance of the i-th variable on the j-th case. |
ntree | number of trees grown. |
| mtry | number of predictors sampled for splitting at each node. |
| forest | (a list that contains the entire forest; NULL if RRF is run in unsupervised mode or if keep.forest=FALSE.) |
| err.rate | (classification only) vector of error rates of the prediction on the input data, the i-th element being the (OOB) error rate for all trees up to the i-th. |
confusion | (classification only) the confusion matrix of theprediction (based on OOB data). |
| votes | (classification only) a matrix with one row for each input data point and one column for each class, giving the fraction or number of (OOB) ‘votes’ from the random forest. |
| oob.times | number of times cases are ‘out-of-bag’ (and thus used in computing the OOB error estimate). |
| proximity | if proximity=TRUE when RRF is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes). |
| feaSet | the feature subset selected. |
| mse | (regression only) vector of mean square errors: sum of squared residuals divided by n. |
| rsq | (regression only) “pseudo R-squared”: 1 - mse / Var(y). |
| test | if a test set is given (through the xtest or additionally ytest arguments), this component contains the corresponding predicted values and error measures for the test set. |
Note
For large data sets, especially those with a large number of variables, calling RRF via the formula interface is not advised: there may be too much overhead in handling the formula.
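For instance, a minimal sketch (using the iris data) of the direct matrix/vector interface that avoids this overhead:
library(RRF)
data(iris)
## Pass the predictor columns and response directly instead of a formula,
## avoiding the formula-handling overhead for wide data.
iris.rrf <- RRF(x = iris[, -5], y = iris$Species, flagReg = 1)
iris.rrf$feaSet   # the selected feature subset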
Author(s)
Houtao Deng <softwaredeng@gmail.com>, based on the randomForest R package by Andy Liaw, Matthew Wiener, Leo Breiman and Adele Cutler.
References
Houtao Deng and George C. Runger (2013), Gene Selection with Guided Regularized Random Forest, Pattern Recognition 46(12): 3483-3489.
Houtao Deng and George C. Runger (2012), Feature Selection via Regularized Trees, the 2012 International Joint Conference on Neural Networks (IJCNN).
Houtao Deng (2013), Guided Random Forest in the RRF Package, arXiv:1306.0237.
Examples
#-----Example 1 -----
library(RRF); set.seed(1)
# only the first feature and last feature are truly useful
X <- matrix(runif(50*50), ncol=50)
class <- (X[,1])^2 + (X[,50])^2
class[class > median(class)] <- 1; class[class <= median(class)] <- 0

# ordinary random forest
rf <- RRF(X, as.factor(class), flagReg = 0)
impRF <- rf$importance
impRF <- impRF[, "MeanDecreaseGini"]
rf$feaSet

# regularized random forest
rrf <- RRF(X, as.factor(class), flagReg = 1)
rrf$feaSet

# guided regularized random forest
imp <- impRF/(max(impRF))  # normalize the importance scores
gamma <- 0.5
coefReg <- (1-gamma) + gamma*imp  # weighted average
grrf <- RRF(X, as.factor(class), coefReg=coefReg, flagReg=1)
grrf$feaSet

# guided random forest
gamma <- 1
coefReg <- (1-gamma) + gamma*imp
grf <- RRF(X, as.factor(class), coefReg=coefReg, flagReg=0)
grf$feaSet

#-----Example 2 XOR learning-----
# only the first 3 features are needed
# and each individual feature is not useful
# bSample <- sample(0:1, 20000, replace=TRUE)
# X <- matrix(bSample, ncol=40)
# class <- xor(xor(X[,1], X[,2]), X[,3])
Prototypes of groups.
Description
Prototypes are ‘representative’ cases of a group of data points, given the similarity matrix among the points. They are very similar to medoids. The function is named ‘classCenter’ to avoid conflict with the function prototype in the methods package.
Usage
classCenter(x, label, prox, nNbr = min(table(label))-1)
Arguments
x | a matrix or data frame |
| label | group labels of the rows in x. |
| prox | the proximity (or similarity) matrix, assumed to be symmetric with 1 on the diagonal and in [0, 1] off the diagonal (the order of rows/columns must match that of x). |
nNbr | number of nearest neighbors used to find the prototypes. |
Details
This version only computes one prototype per class. For each case in x, the nNbr nearest neighbors are found. Then, for each class, the case that has the most neighbors of that class is identified. The prototype for that class is then the medoid of these neighbors (coordinate-wise medians for numerical variables and modes for categorical variables).
This version only computes one prototype per class. In the future more prototypes may be computed (by removing the ‘neighbors’ used, then iterating).
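A minimal sketch of the coordinate-wise "medoid" used for numeric variables; nbrs is a hypothetical set of neighbor rows chosen for illustration, not something the package exposes:
## Pretend these rows are the nNbr nearest neighbors falling in one class.
nbrs <- iris[iris$Species == "setosa", 1:4][1:10, ]
## For numeric variables the prototype is the vector of column-wise medians.
prototype <- apply(nbrs, 2, median)
prototype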
Value
A data frame containing one prototype in each row.
Author(s)
Andy Liaw
See Also
Examples
data(iris)
iris.rf <- RRF(iris[,-5], iris[,5], prox=TRUE)
iris.p <- classCenter(iris[,-5], iris[,5], iris.rf$prox)
plot(iris[,3], iris[,4], pch=21, xlab=names(iris)[3], ylab=names(iris)[4],
     bg=c("red", "blue", "green")[as.numeric(factor(iris$Species))],
     main="Iris Data with Prototypes")
points(iris.p[,3], iris.p[,4], pch=21, cex=2, bg=c("red", "blue", "green"))
Combine Ensembles of Trees
Description
Combine two or more ensembles of trees into one.
Usage
combine(...)
Arguments
| ... | two or more objects of class RRF, to be combined into one. |
Value
An object of class RRF.
Note
The confusion, err.rate, mse and rsq components (as well as the corresponding components in the test component, if they exist) of the combined object will be NULL.
Author(s)
Andy Liaw <andy_liaw@merck.com>
See Also
Examples
data(iris)
rf1 <- RRF(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf2 <- RRF(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf3 <- RRF(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf.all <- combine(rf1, rf2, rf3)
print(rf.all)
Extract a single tree from a forest.
Description
This function extracts the structure of a tree from an RRF object.
Usage
getTree(rfobj, k=1, labelVar=FALSE)
Arguments
| rfobj | an RRF object. |
k | which tree to extract? |
| labelVar | Should better labels be used for splitting variables and predicted class? |
Details
For numerical predictors, data with values of the variable less than or equal to the splitting point go to the left daughter node.
For categorical predictors, the splitting point is represented by an integer whose binary expansion gives the identities of the categories that go to the left or right. For example, suppose a predictor has four categories and the split point is 13. The binary expansion of 13 is (1, 0, 1, 1) (because 13 = 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3), so cases with categories 1, 3, or 4 in this predictor are sent to the left, and the rest to the right.
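The rule above can be checked with a short sketch; split.point and n.cat are illustrative values here, not output of getTree:
split.point <- 13   # hypothetical split point for a 4-category predictor
n.cat <- 4
bits <- as.integer(intToBits(split.point))[1:n.cat]   # binary expansion, lowest bit first
which(bits == 1)    # categories sent to the left daughter: 1, 3 and 4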
Value
A matrix (or data frame, if labelVar=TRUE) with six columns and number of rows equal to the total number of nodes in the tree. The six columns are:
| left daughter | the row where the left daughter node is; 0 if the node is terminal |
| right daughter | the row where the right daughter node is; 0 if the node is terminal |
| split var | which variable was used to split the node; 0 if the node is terminal |
| split point | where the best split is; see Details for categorical predictors |
| status | is the node terminal (-1) or not (1) |
| prediction | the prediction for the node; 0 if the node is not terminal |
Author(s)
Andy Liaw <andy_liaw@merck.com>
See Also
Examples
data(iris)
## Look at the third tree in the forest.
getTree(RRF(iris[,-5], iris[,5], ntree=10), 3, labelVar=TRUE)
Add trees to an ensemble
Description
Add additional trees to an existing ensemble of trees.
Usage
## S3 method for class 'RRF'
grow(x, how.many, ...)
Arguments
| x | an object of class RRF. |
| how.many | number of trees to add to the RRF object. |
... | currently ignored. |
Value
An object of class RRF, containing how.many additional trees.
Note
The confusion, err.rate, mse and rsq components (as well as the corresponding components in the test component, if they exist) of the combined object will be NULL.
Author(s)
Andy Liaw <andy_liaw@merck.com>
See Also
Examples
data(iris)
iris.rf <- RRF(Species ~ ., iris, ntree=50, norm.votes=FALSE)
iris.rf <- grow(iris.rf, 50)
print(iris.rf)
Extract variable importance measure
Description
This is the extractor function for variable importance measures as produced by RRF.
Usage
## S3 method for class 'RRF'
importance(x, type=NULL, class=NULL, scale=TRUE, ...)
Arguments
| x | an object of class RRF. |
| type | either 1 or 2, specifying the type of importance measure (1 = mean decrease in accuracy, 2 = mean decrease in node impurity). |
| class | for classification problems, which class-specific measure to return. |
| scale | For permutation-based measures, should the measures be divided by their “standard errors”? |
... | not used. |
Details
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: for each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The differences between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by the residual sum of squares.
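A small sketch contrasting the scaled and unscaled permutation importance via the scale argument (it mirrors the example below with a smaller forest):
library(RRF)
set.seed(4543)
data(mtcars)
mtcars.rf <- RRF(mpg ~ ., data = mtcars, ntree = 300, importance = TRUE)
## Permutation importance normalized by its standard error (the default) ...
importance(mtcars.rf, type = 1, scale = TRUE)
## ... versus the raw, unscaled mean decrease in accuracy.
importance(mtcars.rf, type = 1, scale = FALSE)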
Value
A matrix of importance measures, one row for each predictor variable. The column(s) are different importance measures.
See Also
Examples
set.seed(4543)
data(mtcars)
mtcars.rf <- RRF(mpg ~ ., data=mtcars, ntree=1000, keep.forest=FALSE, importance=TRUE)
importance(mtcars.rf)
importance(mtcars.rf, type=1)
The Automobile Data
Description
This is the ‘Automobile’ data from the UCI Machine Learning Repository.
Usage
data(imports85)
Format
imports85 is a data frame with 205 cases (rows) and 26 variables (columns). This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process ‘symboling’. A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.
The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc.), and represents the average loss per car per year.
Author(s)
Andy Liaw
Source
Originally created by Jeffrey C. Schlimmer, from 1985 Model Import Carand Truck Specifications, 1985 Ward's Automotive Yearbook, PersonalAuto Manuals, Insurance Services Office, and Insurance CollisionReport, Insurance Institute for Highway Safety.
The original data is at http://www.ics.uci.edu/~mlearn/MLSummary.html.
References
1985 Model Import Car and Truck Specifications, 1985 Ward's AutomotiveYearbook.
Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037
See Also
Examples
data(imports85)
imp85 <- imports85[,-2]  # Too many NAs in normalizedLosses.
imp85 <- imp85[complete.cases(imp85), ]
## Drop empty levels for factors.
imp85[] <- lapply(imp85, function(x) if (is.factor(x)) x[, drop=TRUE] else x)

stopifnot(require(RRF))
price.rf <- RRF(price ~ ., imp85, do.trace=10, ntree=100)
print(price.rf)
numDoors.rf <- RRF(numOfDoors ~ ., imp85, do.trace=10, ntree=100)
print(numDoors.rf)
Margins of RRF Classifier
Description
Compute or plot the margin of predictions from an RRF classifier.
Usage
## S3 method for class 'RRF'
margin(x, ...)
## Default S3 method:
margin(x, observed, ...)
## S3 method for class 'margin'
plot(x, sort=TRUE, ...)
Arguments
x | an object of class |
| observed | the true response corresponding to the data in x. |
sort | Should the data be sorted by their class labels? |
... | other graphical parameters to be passed to |
Value
For margin, the margin of observations from the RRF classifier (or whatever classifier produced the predicted probability matrix given to margin) is returned. The margin of a data point is defined as the proportion of votes for the correct class minus the maximum proportion of votes for the other classes. Thus under majority votes, a positive margin means correct classification, and vice versa.
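A minimal sketch of this definition, computed directly from the votes component; it assumes votes holds one row per training case, in input order, with columns in the order of the factor levels:
library(RRF)
data(iris)
set.seed(1)
iris.rf <- RRF(Species ~ ., iris)
v <- iris.rf$votes / rowSums(iris.rf$votes)          # vote proportions per case
idx <- cbind(seq_len(nrow(v)), as.integer(iris$Species))
p.true   <- v[idx]                                   # proportion for the true class
p.others <- apply(replace(v, idx, -Inf), 1, max)     # best of the remaining classes
head(p.true - p.others)                              # > 0 means correct classification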
Author(s)
Robert Gentleman, with slight modifications by Andy Liaw
See Also
Examples
set.seed(1)
data(iris)
iris.rf <- RRF(Species ~ ., iris, keep.forest=FALSE)
plot(margin(iris.rf))
Rough Imputation of Missing Values
Description
Impute Missing Values by median/mode.
Usage
na.roughfix(object, ...)
Arguments
object | a data frame or numeric matrix. |
... | further arguments special methods could require. |
Value
A completed data matrix or data frame. For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.
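As an illustration, a rough equivalent of the numeric part of this rule (a sketch, not the package code):
m <- matrix(c(1, NA, 3, 4, 5, NA), ncol = 2)
## Replace each NA with the median of its column, as described above.
apply(m, 2, function(col) { col[is.na(col)] <- median(col, na.rm = TRUE); col })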
Note
This is used as a starting point for imputing missing values by random forest.
Author(s)
Andy Liaw
See Also
Examples
data(iris)
iris.na <- iris
set.seed(111)
## artificially drop some data values.
for (i in 1:4) iris.na[sample(150, 20), i] <- NA
iris.roughfix <- na.roughfix(iris.na)
iris.narf <- RRF(Species ~ ., iris.na, na.action=na.roughfix)
print(iris.narf)
Compute outlying measures
Description
Compute outlying measures based on a proximity matrix.
Usage
## Default S3 method:
outlier(x, cls=NULL, ...)
## S3 method for class 'RRF'
outlier(x, ...)
Arguments
| x | a proximity matrix (a square matrix with 1 on the diagonal and values between 0 and 1 in the off-diagonal positions); or an object of class RRF. |
| cls | the classes the rows in the proximity matrix belong to. If not given, all data are assumed to come from the same class. |
... | arguments for other methods. |
Value
A numeric vector containing the outlying measures. The outlying measure of a case is computed as n / sum(squared proximity), normalized by subtracting the median and dividing by the MAD, within each class.
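A minimal sketch of that computation, taken directly from a proximity matrix; whether the proximities are restricted to cases of the same class, and the exact MAD constant, are assumptions here rather than documented behavior:
library(RRF)
set.seed(1)
iris.rf <- RRF(iris[, -5], iris[, 5], proximity = TRUE)
prox <- iris.rf$proximity
out <- numeric(nrow(prox))
for (cl in levels(iris$Species)) {
  i <- which(iris$Species == cl)
  raw <- length(i) / rowSums(prox[i, i]^2)      # n / sum of squared proximities
  out[i] <- (raw - median(raw)) / mad(raw)      # center by median, scale by MAD
}
head(out)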
See Also
Examples
set.seed(1)
iris.rf <- RRF(iris[,-5], iris[,5], proximity=TRUE)
plot(outlier(iris.rf), type="h", col=c("red", "green", "blue")[as.numeric(iris$Species)])
Partial dependence plot
Description
A partial dependence plot gives a graphical depiction of the marginal effect of a variable on the class probability (classification) or response (regression).
Usage
## S3 method for class 'RRF'
partialPlot(x, pred.data, x.var, which.class, w, plot = TRUE, add = FALSE,
    n.pt = min(length(unique(pred.data[, xname])), 51), rug = TRUE,
    xlab=deparse(substitute(x.var)), ylab="",
    main=paste("Partial Dependence on", deparse(substitute(x.var))), ...)
Arguments
| x | an object of class RRF, which contains a forest component. |
| pred.data | a data frame used for constructing the plot, usually the training data used to construct the random forest. |
| x.var | name of the variable for which partial dependence is to be examined. |
| which.class | For classification data, the class to focus on (default: the first class). |
| w | weights to be used in averaging; if not supplied, the mean is not weighted. |
plot | whether the plot should be shown on the graphic device. |
| add | whether to add to an existing plot (TRUE). |
| n.pt | if x.var is continuous, the number of points on the grid for evaluating partial dependence. |
| rug | whether to draw hash marks at the bottom of the plot indicating the deciles of x.var. |
xlab | label for the x-axis. |
ylab | label for the y-axis. |
main | main title for the plot. |
| ... | other graphical parameters to be passed on to plot or lines. |
Details
The function being plotted is defined as:
\tilde{f}(x) = \frac{1}{n} \sum_{i=1}^n f(x, x_{iC}),
where x is the variable for which partial dependence is sought, and x_{iC} is the other variables in the data. The summand is the predicted regression function for regression, and the logits (i.e., log of the fraction of votes) for which.class for classification:
f(x) = \log p_k(x) - \frac{1}{K} \sum_{j=1}^K \log p_j(x),
where K is the number of classes, k is which.class, and p_j is the proportion of votes for class j.
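For the regression case, the first formula can be reproduced by brute force: fix the chosen variable at a grid value in every training row and average the forest predictions. This is a sketch with assumed variable names, not the partialPlot implementation:
library(RRF)
data(airquality)
airquality <- na.omit(airquality)
set.seed(131)
ozone.rf <- RRF(Ozone ~ ., airquality)
grid <- quantile(airquality$Temp, probs = seq(0.1, 0.9, by = 0.2))
pd <- sapply(grid, function(v) {
  tmp <- airquality
  tmp$Temp <- v                       # substitute the grid value into every row
  mean(predict(ozone.rf, tmp))        # average over the data, as in the formula
})
cbind(Temp = grid, partial = pd)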
Value
A list with two components: x and y, which are the values used in the plot.
Note
The RRF object must contain the forest component; i.e., created with RRF(..., keep.forest=TRUE).
This function runs quite slowly for large data sets.
Author(s)
Andy Liaw <andy_liaw@merck.com>
References
Friedman, J. (2001). Greedy function approximation: a gradient boosting machine, Ann. of Stat.
See Also
Examples
data(airquality)
airquality <- na.omit(airquality)
set.seed(131)
ozone.rf <- RRF(Ozone ~ ., airquality)
partialPlot(ozone.rf, airquality, Temp)

data(iris)
set.seed(543)
iris.rf <- RRF(Species~., iris)
partialPlot(iris.rf, iris, Petal.Width, "versicolor")
Plot method for RRF objects
Description
Plot the error rates or MSE of an RRF object.
Usage
## S3 method for class 'RRF'
plot(x, type="l", main=deparse(substitute(x)), ...)
Arguments
| x | an object of class RRF. |
type | type of plot. |
main | main title of the plot. |
... | other graphical parameters. |
Value
Invisibly, the error rates or MSE of the RRF object. If the object has a non-null test component, then the returned object is a matrix where the first column is the out-of-bag estimate of error, and the second column is for the test set.
Note
This function does not work for RRF objects that have type=unsupervised.
If x has a non-null test component, then the test set errors are also plotted.
Author(s)
Andy Liaw
See Also
Examples
data(mtcars)
plot(RRF(mpg ~ ., mtcars, keep.forest=FALSE, ntree=100), log="y")
predict method for random forest objects
Description
Prediction of test data using random forest.
Usage
## S3 method for class 'RRF'
predict(object, newdata, type="response", norm.votes=TRUE, predict.all=FALSE,
    proximity=FALSE, nodes=FALSE, cutoff, ...)
Arguments
| object | an object of class RRF, as created by the function RRF. |
| newdata | a data frame or matrix containing new data. (Note: If not given, the out-of-bag prediction in object is returned.) |
| type | one of response, prob, or votes, indicating the type of output: predicted values, matrix of class probabilities, or matrix of vote counts. |
| norm.votes | Should the vote counts be normalized (i.e., expressed as fractions)? Ignored if object$type is regression. |
predict.all | Should the predictions of all trees be kept? |
| proximity | Should proximity measures be computed? An error is issued if object$type is regression. |
| nodes | Should the terminal node indicators (an n by ntree matrix) be returned? If so, it is in the “nodes” attribute of the returned object. |
| cutoff | (Classification only) A vector of length equal to the number of classes. The ‘winning’ class for an observation is the one with the maximum ratio of proportion of votes to cutoff. Default is taken from the forest$cutoff component of object (i.e., the setting used when running RRF). |
... | not used currently. |
Value
If object$type is regression, a vector of predicted values is returned. If predict.all=TRUE, then the returned object is a list of two components: aggregate, which is the vector of predicted values by the forest, and individual, which is a matrix where each column contains the prediction by a tree in the forest.
If object$type is classification, the object returned depends on the argument type:
response | predicted classes (the classes with majority vote). |
| prob | matrix of class probabilities (one column for each class and one row for each input). |
| vote | matrix of vote counts (one column for each class and one row for each new input); either in raw counts or in fractions (if norm.votes=TRUE). |
If predict.all=TRUE, then the individual component of the returned object is a character matrix where each column contains the predicted class by a tree in the forest.
If proximity=TRUE, the returned object is a list with two components: pred is the prediction (as described above) and proximity is the proximity matrix. An error is issued if object$type is regression.
If nodes=TRUE, the returned object has a “nodes” attribute, which is an n by ntree matrix, each column containing the node number that the cases fall in for that tree.
NOTE: If the object inherits from RRF.formula, then any data with NA are silently omitted from the prediction. The returned value will contain NA correspondingly in the aggregated and individual tree predictions (if requested), but not in the proximity or node matrices.
NOTE2: Any ties are broken at random, so if this is undesirable, avoid it by using an odd number for ntree in RRF().
Author(s)
Andy Liaw <andy_liaw@merck.com> and Matthew Wiener <matthew_wiener@merck.com>, based on original Fortran code by Leo Breiman and Adele Cutler.
References
Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.
See Also
Examples
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- RRF(Species ~ ., data=iris[ind == 1,])
iris.pred <- predict(iris.rf, iris[ind == 2,])
table(observed = iris[ind==2, "Species"], predicted = iris.pred)
## Get prediction for all trees.
predict(iris.rf, iris[ind == 2,], predict.all=TRUE)
## Proximities.
predict(iris.rf, iris[ind == 2,], proximity=TRUE)
## Nodes matrix.
str(attr(predict(iris.rf, iris[ind == 2,], nodes=TRUE), "nodes"))
Missing Value Imputations by RRF
Description
Impute missing values in predictor data using proximity from RRF.
Usage
## Default S3 method:
rrfImpute(x, y, iter=5, ntree=300, ...)
## S3 method for class 'formula'
rrfImpute(x, data, ..., subset)
Arguments
| x | A data frame or matrix of predictors, some containing NAs. |
| y | Response vector (NAs not allowed). |
data | A data frame containing the predictors and response. |
iter | Number of iterations to run the imputation. |
| ntree | Number of trees to grow in each iteration of RRF. |
| ... | Other arguments to be passed to RRF. |
subset | A logical vector indicating which observations to use. |
Details
The algorithm starts by imputing NAs using na.roughfix. Then RRF is called with the completed data. The proximity matrix from the RRF is used to update the imputation of the NAs. For continuous predictors, the imputed value is the weighted average of the non-missing observations, where the weights are the proximities. For categorical predictors, the imputed value is the category with the largest average proximity. This process is iterated iter times.
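A minimal sketch of one such update for a single continuous predictor; the variable names are assumptions for illustration and this is not the rrfImpute code:
library(RRF)
data(iris)
iris.na <- iris
set.seed(111)
iris.na[sample(150, 20), 1] <- NA                  # knock out some Sepal.Length values
filled <- na.roughfix(iris.na)                     # rough starting imputation
rf <- RRF(Species ~ ., filled, proximity = TRUE)
miss <- which(is.na(iris.na[, 1]))
obs  <- which(!is.na(iris.na[, 1]))
for (i in miss) {
  w <- rf$proximity[i, obs]
  filled[i, 1] <- sum(w * iris.na[obs, 1]) / sum(w)   # proximity-weighted average
}
head(filled[miss, 1])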
Note: Imputation has not (yet) been implemented for the unsupervised case. Also, Breiman (2003) notes that the OOB estimate of error from RRF tends to be optimistic when run on the data matrix with imputed values.
Value
A data frame or matrix containing the completed data matrix, where NAs are imputed using proximity from RRF. The first column contains the response.
Author(s)
Andy Liaw
References
Leo Breiman (2003). Manual for Setting Up, Using, and Understanding Random Forest V4.0. https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf
See Also
Examples
data(iris)
iris.na <- iris
set.seed(111)
## artificially drop some data values.
for (i in 1:4) iris.na[sample(150, 20), i] <- NA
set.seed(222)
iris.imputed <- rrfImpute(Species ~ ., iris.na)
set.seed(333)
iris.rf <- RRF(Species ~ ., iris.imputed)
print(iris.rf)
Show the NEWS file
Description
Show the NEWS file of the RRF package.
Usage
rrfNews()
Value
None.
Random Forest Cross-Validation for feature selection
Description
This function shows the cross-validated prediction performance of models with sequentially reduced numbers of predictors (ranked by variable importance) via a nested cross-validation procedure.
Usage
rrfcv(trainx, trainy, cv.fold=5, scale="log", step=0.5,
      mtry=function(p) max(1, floor(sqrt(p))), recursive=FALSE, ...)
Arguments
| trainx | matrix or data frame containing columns of predictor variables |
| trainy | vector of response, must have length equal to the number of rows in trainx |
cv.fold | number of folds in the cross-validation |
| scale | if "log", reduce a fixed proportion (step) of variables at each step, otherwise reduce step variables at a time |
| step | if log=TRUE, the fraction of variables to remove at each step, else remove this many variables at a time |
| mtry | a function of the number of remaining predictor variables to use as the mtry parameter in the RRF call |
recursive | whether variable importance is (re-)assessed at eachstep of variable reduction |
| ... | other arguments passed on to RRF |
Value
A list with the following components:
list(n.var=n.var, error.cv=error.cv, predicted=cv.pred)
n.var | vector of number of variables used at each step |
| error.cv | corresponding vector of error rates or MSEs at each step |
| predicted | list of n.var components, each containing the predicted values from the cross-validation |
Author(s)
Andy Liaw
References
Svetnik, V., Liaw, A., Tong, C. and Wang, T., “Application of Breiman's Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules”, MCS 2004, Roli, F. and Windeatt, T. (Eds.) pp. 334-343.
See Also
Examples
## The following can take a while to run, so if you really want to try
## it, copy and paste the code into R.
set.seed(647)
myiris <- cbind(iris[1:4], matrix(runif(508 * nrow(iris)), nrow(iris), 508))
result <- rrfcv(myiris, iris$Species)
with(result, plot(n.var, error.cv, log="x", type="o", lwd=2))

result <- replicate(5, rrfcv(myiris, iris$Species), simplify=FALSE)
error.cv <- sapply(result, "[[", "error.cv")
matplot(result[[1]]$n.var, cbind(rowMeans(error.cv), error.cv), type="l",
        lwd=c(2, rep(1, ncol(error.cv))), col=1, lty=1, log="x",
        xlab="Number of variables", ylab="CV Error")
Size of trees in an ensemble
Description
Size of trees (number of nodes) in an ensemble.
Usage
treesize(x, terminal=TRUE)
Arguments
| x | an object of class RRF, which contains a forest component. |
| terminal | count terminal nodes only (TRUE) or all nodes (FALSE)? |
Value
A vector containing the number of nodes for the trees in the RRF object.
Note
The RRF object must contain the forest component; i.e., created with RRF(..., keep.forest=TRUE).
Author(s)
Andy Liaw <andy_liaw@merck.com>
See Also
Examples
data(iris)
iris.rf <- RRF(Species ~ ., iris)
hist(treesize(iris.rf))
Tune RRF for the optimal mtry parameter
Description
Starting with the default value of mtry, search for the optimal value (with respect to the out-of-bag error estimate) of mtry for RRF.
Usage
tuneRRF(x, y, mtryStart, ntreeTry=50, stepFactor=2, improve=0.05,
        trace=TRUE, plot=TRUE, doBest=FALSE, ...)
Arguments
x | matrix or data frame of predictor variables |
| y | response vector (factor for classification, numeric for regression) |
| mtryStart | starting value of mtry; default is the same as in RRF |
ntreeTry | number of trees used at the tuning step |
| stepFactor | at each iteration, mtry is inflated (or deflated) by this value |
| improve | the (relative) improvement in OOB error must be by this much for the search to continue |
trace | whether to print the progress of the search |
plot | whether to plot the OOB error as function of mtry |
doBest | whether to run a forest using the optimal mtry found |
| ... | options to be given to RRF |
Value
If doBest=FALSE (default), it returns a matrix whose first column contains the mtry values searched, and the second column the corresponding OOB error.
If doBest=TRUE, it returns the RRF object produced with the optimal mtry.
See Also
Examples
data(fgl, package="MASS")
fgl.res <- tuneRRF(fgl[,-10], fgl[,10], stepFactor=1.5)
Variable Importance Plot
Description
Dotchart of variable importance as measured by a Random Forest
Usage
varImpPlot(x, sort=TRUE, n.var=min(30, nrow(x$importance)),
           type=NULL, class=NULL, scale=TRUE,
           main=deparse(substitute(x)), ...)
Arguments
| x | An object of class RRF. |
| sort | Should the variables be sorted in decreasing order of importance? |
| n.var | How many variables to show? (Ignored if sort=FALSE.) |
| type, class, scale | arguments to be passed on to importance. |
main | plot title. |
| ... | Other graphical parameters to be passed on to dotchart. |
Value
Invisibly, the importance of the variables that were plotted.
Author(s)
Andy Liaw <andy_liaw@merck.com>
See Also
Examples
set.seed(4543)
data(mtcars)
mtcars.rf <- RRF(mpg ~ ., data=mtcars, ntree=1000, keep.forest=FALSE, importance=TRUE)
varImpPlot(mtcars.rf)
Variables used in a random forest
Description
Find out which predictor variables are actually used in the random forest.
Usage
varUsed(x, by.tree=FALSE, count=TRUE)
Arguments
| x | An object of class RRF. |
| by.tree | Should the list of variables used be broken down by trees in the forest? |
| count | Should the frequencies that variables appear in trees be returned? |
Value
If count=TRUE and by.tree=FALSE, an integer vector containing the frequencies with which variables are used in the forest is returned. If by.tree=TRUE, a matrix is returned, breaking down the counts by tree (each column corresponding to one tree and each row to a variable).
If count=FALSE and by.tree=TRUE, a list of integer indices is returned giving the variables used in the trees; else if by.tree=FALSE, a vector of integer indices giving the variables used in the entire forest.
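A small sketch of the by.tree form described above, reusing the example below:
library(RRF)
data(iris)
set.seed(17)
iris.rf <- RRF(Species ~ ., iris, ntree = 100)
vu <- varUsed(iris.rf, by.tree = TRUE, count = TRUE)
dim(vu)        # one row per variable, one column per tree
rowSums(vu)    # total number of times each variable is used across the forest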
Author(s)
Andy Liaw
See Also
Examples
data(iris)
set.seed(17)
varUsed(RRF(Species~., iris, ntree=100))