Title:

Breiman and Cutlers Random Forests for Classification and Regression

Version:

4.7-1.2

Date:

2022-01-24

Depends:

R (≥ 4.1.0), stats

Suggests:

RColorBrewer, MASS

Description:

Classification and regression based on a forest of trees using random inputs, based on Breiman (2001) <doi:10.1023/A:1010933404324>.

License:

GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

URL:

https://www.stat.berkeley.edu/~breiman/RandomForests/

NeedsCompilation:

yes

Packaged:

2024-09-22 08:30:17 UTC; hornik

Repository:

CRAN

Date/Publication:

2024-09-22 09:14:44 UTC

Author:

Leo Breiman [aut] (Fortran original), Adele Cutler [aut] (Fortran original), Andy Liaw [aut, cre] (R port), Matthew Wiener [aut] (R port)

Maintainer:

Andy Liaw <andy_liaw@merck.com>

Multi-dimensional Scaling Plot of Proximity matrix from randomForest

Description

Plot the scaling coordinates of the proximity matrix from randomForest.

Usage

MDSplot(rf, fac, k=2, palette=NULL, pch=20, ...)

Arguments

rf

an object of class randomForest that contains the proximity component.

fac

a factor that was used as response to train rf.

k

number of dimensions for the scaling coordinates.

palette

colors to use to distinguish the classes; length must be equal to the number of levels.

pch

plotting symbols to use.

...

other graphical parameters.

Value

The output of cmdscale on 1 - rf$proximity is returned invisibly.

Note

If k > 2, pairs is used to produce the scatterplot matrix of the coordinates.
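The scaling step described above can be sketched in base R without fitting a forest. The toy proximity matrix below is an assumption made for illustration (it is derived from distances between six made-up points); pts, d, prox and coords are hypothetical names, and only cmdscale itself comes from the text.

```r
# Hypothetical toy data: six points in the plane forming two groups.
pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
d <- as.matrix(dist(pts))

# Stand-in proximity matrix: 1 on the diagonal, values in [0, 1] elsewhere.
prox <- 1 - d / max(d)

# Classical MDS on 1 - proximity, keeping k = 2 coordinates per case.
coords <- cmdscale(1 - prox, k = 2)
coords
```

Each row of coords is one case; plotting the two columns against each other gives the kind of picture MDSplot draws for k = 2.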

Author(s)

Robert Gentleman, with slight modifications by Andy Liaw

Examples

set.seed(1)
data(iris)
iris.rf <- randomForest(Species ~ ., iris, proximity=TRUE,
                        keep.forest=FALSE)
MDSplot(iris.rf, iris$Species)
## Using different symbols for the classes:
MDSplot(iris.rf, iris$Species, palette=rep(1, 3), pch=as.numeric(iris$Species))

Prototypes of groups.

Description

Prototypes are ‘representative’ cases of a group of data points, given the similarity matrix among the points. They are very similar to medoids. The function is named ‘classCenter’ to avoid conflict with the function prototype in the methods package.

Usage

classCenter(x, label, prox, nNbr = min(table(label))-1)

Arguments

x

a matrix or data frame

label

group labels of the rows in x

prox

the proximity (or similarity) matrix, assumed to be symmetric with 1 on the diagonal and in [0, 1] off the diagonal (the order of row/column must match that of x)

nNbr

number of nearest neighbors used to find the prototypes.

Details

This version only computes one prototype per class. For each case in x, the nNbr nearest neighbors are found. Then, for each class, the case that has the most neighbors of that class is identified. The prototype for that class is then the medoid of these neighbors (coordinate-wise medians for numerical variables and modes for categorical variables).

In the future, more prototypes may be computed (by removing the ‘neighbors’ used, then iterating).
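The procedure in Details can be sketched in base R for numeric predictors. This is not the package's implementation: proto_sketch is a hypothetical name, the toy data are made up, and the sketch assumes that a case counts among its own nearest neighbors and that ties are resolved by position.

```r
# Sketch only: numeric predictors, one prototype per class.
# Assumption: a case counts among its own nearest neighbors.
proto_sketch <- function(x, label, prox, nNbr = min(table(label)) - 1) {
  x <- as.matrix(x)
  label <- as.factor(label)
  # Indices of the nNbr most proximate cases for every row of prox.
  nbr <- lapply(seq_len(nrow(x)), function(i) {
    order(prox[i, ], decreasing = TRUE)[seq_len(nNbr)]
  })
  protos <- lapply(levels(label), function(cl) {
    # Case whose neighborhood holds the most members of class cl.
    counts <- vapply(nbr, function(idx) sum(label[idx] == cl), integer(1))
    best <- nbr[[which.max(counts)]]
    keep <- best[label[best] == cl]
    apply(x[keep, , drop = FALSE], 2, median)  # coordinate-wise medians
  })
  out <- do.call(rbind, protos)
  rownames(out) <- levels(label)
  out
}

# Hypothetical toy data: two well-separated groups of three points.
pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
prox <- 1 - as.matrix(dist(pts)) / max(dist(pts))
protos <- proto_sketch(pts, rep(c("a", "b"), each = 3), prox)
protos
```

The result has one row per class, matching the "one prototype in each row" layout described under Value.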

Value

A data frame containing one prototype in each row.

Author(s)

Andy Liaw

Examples

data(iris)
iris.rf <- randomForest(iris[,-5], iris[,5], prox=TRUE)
iris.p <- classCenter(iris[,-5], iris[,5], iris.rf$prox)
plot(iris[,3], iris[,4], pch=21, xlab=names(iris)[3], ylab=names(iris)[4],
     bg=c("red", "blue", "green")[as.numeric(factor(iris$Species))],
     main="Iris Data with Prototypes")
points(iris.p[,3], iris.p[,4], pch=21, cex=2, bg=c("red", "blue", "green"))

Combine Ensembles of Trees

Description

Combine two or more ensembles of trees into one.

Usage

combine(...)

Arguments

...

two or more objects of class randomForest, to be combined into one.

Value

An object of class randomForest.

Note

The confusion, err.rate, mse and rsq components (as well as the corresponding components in the test component, if it exists) of the combined object will be NULL.

Author(s)

Andy Liaw <andy_liaw@merck.com>

Examples

data(iris)
rf1 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf2 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf3 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf.all <- combine(rf1, rf2, rf3)
print(rf.all)

Extract a single tree from a forest.

Description

This function extracts the structure of a tree from a randomForest object.

Usage

getTree(rfobj, k=1, labelVar=FALSE)

Arguments

rfobj

a randomForest object.

k

which tree to extract?

labelVar

Should better labels be used for splitting variables and predicted class?

Details

For numerical predictors, data with values of the variable less than or equal to the splitting point go to the left daughter node.

For categorical predictors, the splitting point is represented by an integer, whose binary expansion gives the identities of the categories that go to the left or right. For example, suppose a predictor has four categories and the split point is 13. The binary expansion of 13 is (1, 0, 1, 1) (because 13 = 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3), so cases with categories 1, 3, or 4 in this predictor get sent to the left, and the rest to the right.
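The binary expansion in the example above can be reproduced with a few lines of base R. decode_split is a hypothetical helper written for this illustration, not a function of the package.

```r
# Return the categories sent to the left daughter for a categorical split.
# Bit j (counting from the least significant bit) corresponds to category j.
decode_split <- function(split_point, n_categories) {
  bits <- (split_point %/% 2^(seq_len(n_categories) - 1)) %% 2
  which(bits == 1)
}

left <- decode_split(13, 4)
left  # categories 1, 3 and 4 go left; category 2 goes right
```

Integer division and remainder are used instead of bitwise operators so that the sketch is not limited to 31 bits.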

Value

A matrix (or data frame, if labelVar=TRUE) with six columns and number of rows equal to the total number of nodes in the tree. The six columns are:

left daughter

the row where the left daughter node is; 0 if the node is terminal

right daughter

the row where the right daughter node is; 0 if the node is terminal

split var

which variable was used to split the node; 0 if the node is terminal

split point

where the best split is; see Details for categorical predictors

status

is the node terminal (-1) or not (1)

prediction

the prediction for the node; 0 if the node is not terminal
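To show how the six columns fit together, here is a minimal sketch that walks a hand-made tree matrix for one case. The matrix is a made-up three-node tree in the layout described above, walk_tree is a hypothetical helper, and only numerical predictors are handled.

```r
# Hypothetical tree: the root splits variable 1 at 2.5; nodes 2 and 3 are terminal.
tree <- cbind(
  "left daughter"  = c(2, 0, 0),
  "right daughter" = c(3, 0, 0),
  "split var"      = c(1, 0, 0),
  "split point"    = c(2.5, 0, 0),
  "status"         = c(1, -1, -1),
  "prediction"     = c(0, 1, 2)
)

# Follow the daughters from the root until a terminal node (status -1) is reached.
walk_tree <- function(tree, case) {
  node <- 1
  while (tree[node, "status"] == 1) {
    go_left <- case[tree[node, "split var"]] <= tree[node, "split point"]
    node <- if (go_left) tree[node, "left daughter"] else tree[node, "right daughter"]
  }
  tree[node, "prediction"]
}

walk_tree(tree, c(1.4, 0.2))  # value 1.4 <= 2.5, so the case ends in node 2
walk_tree(tree, c(4.7, 1.4))  # value 4.7 >  2.5, so the case ends in node 3
```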

Author(s)

Andy Liaw <andy_liaw@merck.com>

Examples

data(iris)
## Look at the third tree in the forest.
getTree(randomForest(iris[,-5], iris[,5], ntree=10), 3, labelVar=TRUE)

Add trees to an ensemble

Description

Add additional trees to an existing ensemble of trees.

Usage

## S3 method for class 'randomForest'
grow(x, how.many, ...)

Arguments

x

an object of class randomForest, which contains a forest component.

how.many

number of trees to add to the randomForest object.

...

currently ignored.

Value

An object of class randomForest, containing how.many additional trees.

Note

The confusion, err.rate, mse and rsq components (as well as the corresponding components in the test component, if it exists) of the combined object will be NULL.

Author(s)

Andy Liaw <andy_liaw@merck.com>

Examples

data(iris)
iris.rf <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
iris.rf <- grow(iris.rf, 50)
print(iris.rf)

Extract variable importance measure

Description

This is the extractor function for variable importance measures as produced by randomForest.

Usage

## S3 method for class 'randomForest'
importance(x, type=NULL, class=NULL, scale=TRUE, ...)

Arguments

x

an object of class randomForest

type

either 1 or 2, specifying the type of importance measure (1=mean decrease in accuracy, 2=mean decrease in node impurity).

class

for classification problem, which class-specific measure to return.

scale

For permutation based measures, should the measures be divided by their “standard errors”?

...

not used.

Details

Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The differences between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
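The permute-and-compare step can be illustrated on a single model in base R. This sketch swaps in a linear model (lm) for a tree and a held-out half of mtcars for the out-of-bag data, so it shows the idea only; it omits the averaging over trees and the normalization described above, and all names are hypothetical.

```r
set.seed(1)
train <- mtcars[seq(1, 32, by = 2), ]
held_out <- mtcars[seq(2, 32, by = 2), ]  # stands in for the OOB cases

fit <- lm(mpg ~ wt + hp + qsec, data = train)  # stands in for one tree
mse <- function(data) mean((data$mpg - predict(fit, data))^2)
base_mse <- mse(held_out)

# Increase in MSE after shuffling one predictor at a time.
perm_increase <- sapply(c("wt", "hp", "qsec"), function(v) {
  shuffled <- held_out
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(shuffled) - base_mse
})
perm_increase
```

A large increase suggests that the model relies on that predictor; values near zero (or negative) suggest it does not.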

The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
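For a single classification split, the decrease in Gini impurity can be sketched as follows. Weighting each node's impurity by its number of cases is an assumption of this sketch (a common convention), and gini and impurity_decrease are hypothetical helpers rather than package functions.

```r
# Gini index of a node: 1 minus the sum of squared class proportions.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Size-weighted impurity of the parent minus that of the two daughters.
impurity_decrease <- function(y, go_left) {
  length(y) * gini(y) -
    sum(go_left) * gini(y[go_left]) -
    sum(!go_left) * gini(y[!go_left])
}

y <- factor(c("a", "a", "b", "b"))
dec <- impurity_decrease(y, c(TRUE, TRUE, FALSE, FALSE))
dec  # a perfect split removes all impurity: 4 * 0.5 - 0 - 0 = 2
```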

Value

A matrix of importance measures, one row for each predictor variable. The column(s) are different importance measures.

Examples

set.seed(4543)
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data=mtcars, ntree=1000,
                          keep.forest=FALSE, importance=TRUE)
importance(mtcars.rf)
importance(mtcars.rf, type=1)

The Automobile Data

Description

This is the ‘Automobile’ data from the UCI Machine Learning Repository.

Usage

data(imports85)

Format

imports85 is a data frame with 205 cases (rows) and 26 variables (columns). This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with its price. Then, if it is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process ‘symboling’. A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.

The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc.), and represents the average loss per car per year.

Author(s)

Andy Liaw

Source

Originally created by Jeffrey C. Schlimmer, from 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook, Personal Auto Manuals, Insurance Services Office, and Insurance Collision Report, Insurance Institute for Highway Safety.

The original data is at doi:10.24432/C5B01C.

References

1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.

Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038

Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037

Examples

data(imports85)
imp85 <- imports85[,-2]  # Too many NAs in normalizedLosses.
imp85 <- imp85[complete.cases(imp85), ]
## Drop empty levels for factors.
imp85[] <- lapply(imp85, function(x) if (is.factor(x)) x[, drop=TRUE] else x)
stopifnot(require(randomForest))
price.rf <- randomForest(price ~ ., imp85, do.trace=10, ntree=100)
print(price.rf)
numDoors.rf <- randomForest(numOfDoors ~ ., imp85, do.trace=10, ntree=100)
print(numDoors.rf)

Margins of randomForest Classifier

Description

Compute or plot the margin of predictions from a randomForest classifier.

Usage

## S3 method for class 'randomForest'
margin(x, ...)
## Default S3 method:
margin(x, observed, ...)
## S3 method for class 'margin'
plot(x, sort=TRUE, ...)

Arguments

x

an object of class randomForest, whose type is not regression, or a matrix of predicted probabilities, one column per class and one row per observation. For the plot method, x should be an object returned by margin.

observed

the true response corresponding to the data in x.

sort

Should the data be sorted by their class labels?

...

other graphical parameters to be passed to plot.default.

Value

For margin, the margin of observations from the randomForest classifier (or whatever classifier that produced the predicted probability matrix given to margin). The margin of a data point is defined as the proportion of votes for the correct class minus the maximum proportion of votes for the other classes. Thus under majority votes, positive margin means correct classification, and vice versa.
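The margin definition above translates directly into a few lines of base R. margin_sketch and the vote matrix below are hypothetical and only illustrate the definition; they are not the package's code.

```r
# Margin = vote share of the true class minus the largest share among the others.
margin_sketch <- function(votes, observed) {
  observed <- as.character(observed)
  vapply(seq_len(nrow(votes)), function(i) {
    correct <- votes[i, observed[i]]
    others <- votes[i, colnames(votes) != observed[i]]
    correct - max(others)
  }, numeric(1))
}

# Hypothetical vote fractions for two cases whose true class is "a".
votes <- rbind(c(0.7, 0.2, 0.1), c(0.3, 0.6, 0.1))
colnames(votes) <- c("a", "b", "c")
m <- margin_sketch(votes, c("a", "a"))
m  # about 0.5 (correctly classified) and -0.3 (misclassified)
```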

Author(s)

Robert Gentleman, with slight modifications by Andy Liaw

Examples

set.seed(1)
data(iris)
iris.rf <- randomForest(Species ~ ., iris, keep.forest=FALSE)
plot(margin(iris.rf))

Rough Imputation of Missing Values

Description

Impute Missing Values by median/mode.

Usage

na.roughfix(object, ...)

Arguments

object

a data frame or numeric matrix.

...

further arguments special methods could require.

Value

A completed data matrix or data frame. For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.
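The median/mode rule can be sketched for a data frame in base R. rough_fix_sketch is a hypothetical stand-in written for illustration; it follows the description above but is not the package's na.roughfix.

```r
rough_fix_sketch <- function(df) {
  df[] <- lapply(df, function(col) {
    if (!anyNA(col)) return(col)  # nothing to fill in this column
    if (is.numeric(col)) {
      col[is.na(col)] <- median(col, na.rm = TRUE)
    } else if (is.factor(col)) {
      freq <- table(col)
      top <- names(freq)[freq == max(freq)]
      col[is.na(col)] <- top[sample.int(length(top), 1)]  # break ties at random
    }
    col
  })
  df
}

# Hypothetical toy data with one missing value per column.
toy <- data.frame(num = c(1, NA, 3, 10), fac = factor(c("u", "u", NA, "v")))
fixed <- rough_fix_sketch(toy)
fixed  # num[2] becomes the median 3, fac[3] becomes the mode "u"
```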

Note

This is used as a starting point for imputing missing values by random forest.

Author(s)

Andy Liaw

Examples

data(iris)
iris.na <- iris
set.seed(111)
## artificially drop some data values.
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA
iris.roughfix <- na.roughfix(iris.na)
iris.narf <- randomForest(Species ~ ., iris.na, na.action=na.roughfix)
print(iris.narf)

Compute outlying measures

Description

Compute outlying measures based on a proximity matrix.

Usage

## Default S3 method:
outlier(x, cls=NULL, ...)
## S3 method for class 'randomForest'
outlier(x, ...)

Arguments

x

a proximity matrix (a square matrix with 1 on the diagonal and values between 0 and 1 in the off-diagonal positions); or an object of class randomForest, whose type is not regression.

cls

the classes the rows in the proximity matrix belong to. If not given, all data are assumed to come from the same class.

...

arguments for other methods.

Value

A numeric vector containing the outlying measures. The outlying measure of a case is computed as n / sum(squared proximity), normalized by subtracting the median and dividing by the MAD, within each class.
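The formula above can be sketched in base R on a toy proximity matrix. outlier_sketch is a hypothetical helper; restricting the sum of squared proximities to cases of the same class is an assumption of this sketch, and the proximity matrix is derived from made-up points rather than from a forest.

```r
outlier_sketch <- function(prox, cls = rep(1, nrow(prox))) {
  n <- nrow(prox)
  out <- numeric(n)
  for (lv in unique(cls)) {
    in_class <- cls == lv
    # Assumption: the sum of squared proximities runs over same-class cases.
    raw <- n / rowSums(prox[in_class, in_class, drop = FALSE]^2)
    out[in_class] <- (raw - median(raw)) / mad(raw)
  }
  out
}

# Hypothetical toy data: five nearby points and one far away (row 6).
pts <- rbind(c(0, 0), c(0, 1), c(1, 0.2), c(1.3, 1), c(0.5, 0.4), c(10, 10))
d <- as.matrix(dist(pts))
prox <- 1 - d / max(d)
scores <- outlier_sketch(prox)
which.max(scores)  # the isolated sixth point gets the largest measure
```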

Examples

set.seed(1)
iris.rf <- randomForest(iris[,-5], iris[,5], proximity=TRUE)
plot(outlier(iris.rf), type="h",
     col=c("red", "green", "blue")[as.numeric(iris$Species)])

Partial dependence plot

Description

Partial dependence plot gives a graphical depiction of the marginal effect of a variable on the class probability (classification) or response (regression).

Usage

## S3 method for class 'randomForest'
partialPlot(x, pred.data, x.var, which.class,
      w, plot = TRUE, add = FALSE,
      n.pt = min(length(unique(pred.data[, xname])), 51),
      rug = TRUE, xlab=deparse(substitute(x.var)), ylab="",
      main=paste("Partial Dependence on", deparse(substitute(x.var))),
      ...)

Arguments

x

an object of class randomForest, which contains a forest component.

pred.data

a data frame used for constructing the plot, usually the training data used to construct the random forest.

x.var

name of the variable for which partial dependence is to be examined.

which.class

For classification data, the class to focus on (default the first class).

w

weights to be used in averaging; if not supplied, mean is not weighted

plot

whether the plot should be shown on the graphic device.

add

whether to add to existing plot (TRUE).

n.pt

if x.var is continuous, the number of points on the grid for evaluating partial dependence.

rug

whether to draw hash marks at the bottom of the plot indicating the deciles of x.var.

xlab

label for the x-axis.

ylab

label for the y-axis.

main

main title for the plot.

...

other graphical parameters to be passed on to plot or lines.

Details

The function being plotted is defined as:

\tilde{f}(x) = \frac{1}{n} \sum_{i=1}^n f(x, x_{iC}),

where x is the variable for which partial dependence is sought, and x_{iC} is the other variables in the data. The summand is the predicted regression function for regression, and logits (i.e., log of fraction of votes) for which.class for classification:

f(x) = \log p_k(x) - \frac{1}{K} \sum_{j=1}^K \log p_j(x),

where K is the number of classes, k is which.class, and p_j is the proportion of votes for class j.
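Both formulas can be sketched in base R. Here a linear model (lm) on mtcars stands in for the fitted forest, which is an assumption made so that the example runs without the package; partial_dep and centered_logit are hypothetical helpers.

```r
# Regression case: average the predictions over the data with x.var fixed at
# each grid value, i.e. (1/n) * sum_i f(x, x_iC).
partial_dep <- function(fit, data, var, grid) {
  vapply(grid, function(v) {
    data[[var]] <- v  # overwrite the variable of interest for every row
    mean(predict(fit, data))
  }, numeric(1))
}

fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for a random forest
grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 5)
pd <- partial_dep(fit, mtcars, "wt", grid)

# Classification case: centered logit of the vote proportions for class k.
centered_logit <- function(p, k) log(p[k]) - mean(log(p))
cl <- centered_logit(c(0.5, 0.25, 0.25), 1)
```

For the linear stand-in, pd is a straight line in grid whose slope equals the wt coefficient, which makes the averaging easy to check by hand.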

Value

A list with two components: x and y, which are the values used in the plot.

Note

The randomForest object must contain the forest component; i.e., created with randomForest(..., keep.forest=TRUE).

This function can be quite slow for large data sets.

Author(s)

Andy Liaw <andy_liaw@merck.com>

References

Friedman, J. (2001). Greedy function approximation: the gradient boosting machine, Ann. of Stat.

Examples

data(iris)
set.seed(543)
iris.rf <- randomForest(Species~., iris)
partialPlot(iris.rf, iris, Petal.Width, "versicolor")
## Looping over variables ranked by importance:
data(airquality)
airquality <- na.omit(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., airquality, importance=TRUE)
imp <- importance(ozone.rf)
impvar <- rownames(imp)[order(imp[, 1], decreasing=TRUE)]
op <- par(mfrow=c(2, 3))
for (i in seq_along(impvar)) {
    partialPlot(ozone.rf, airquality, impvar[i], xlab=impvar[i],
                main=paste("Partial Dependence on", impvar[i]),
                ylim=c(30, 70))
}
par(op)

Plot method for randomForest objects

Description

Plot the error rates or MSE of a randomForest object

Usage

## S3 method for class 'randomForest'
plot(x, type="l", main=deparse(substitute(x)), ...)

Arguments

x

an object of class randomForest.

type

type of plot.

main

main title of the plot.

...

other graphical parameters.

Value

Invisibly, the error rates or MSE of the randomForest object. If the object has a non-null test component, then the returned object is a matrix where the first column is the out-of-bag estimate of error, and the second column is for the test set.

Note

This function does not work for randomForest objects that have type=unsupervised.

If x has a non-null test component, then the test set errors are also plotted.

Author(s)

Andy Liaw

Examples

data(mtcars)
plot(randomForest(mpg ~ ., mtcars, keep.forest=FALSE, ntree=100), log="y")

predict method for random forest objects

Description

Prediction of test data using random forest.

Usage

## S3 method for class 'randomForest'
predict(object, newdata, type="response",
  norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE,
  cutoff, ...)

Arguments

object

an object of class randomForest, as that created by the function randomForest.

newdata

a data frame or matrix containing new data. (Note: If not given, the out-of-bag prediction in object is returned.)

type

one of response, prob, or votes, indicating the type of output: predicted values, matrix of class probabilities, or matrix of vote counts. class is allowed, but automatically converted to "response", for backward compatibility.

norm.votes

Should the vote counts be normalized (i.e., expressed as fractions)? Ignored if object$type is regression.

predict.all

Should the predictions of all trees be kept?

proximity

Should proximity measures be computed? An error is issued if object$type is regression.

nodes

Should the terminal node indicators (an n by ntree matrix) be returned? If so, it is in the “nodes” attribute of the returned object.

cutoff

(Classification only) A vector of length equal to number of classes. The ‘winning’ class for an observation is the one with the maximum ratio of proportion of votes to cutoff. Default is taken from the forest$cutoff component of object (i.e., the setting used when running randomForest).

...

not used currently.
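The cutoff rule described for the cutoff argument can be sketched in base R. pick_class and the vote matrix are hypothetical; ties are resolved here by taking the first class, whereas the text notes that the package breaks ties at random.

```r
# The winning class maximizes (proportion of votes) / cutoff.
pick_class <- function(votes, cutoff) {
  ratio <- sweep(votes, 2, cutoff, "/")
  colnames(votes)[max.col(ratio, ties.method = "first")]
}

# Hypothetical vote fractions for two cases and two classes.
votes <- rbind(c(a = 0.55, b = 0.45), c(a = 0.80, b = 0.20))
pick_class(votes, c(0.5, 0.5))  # equal cutoffs: plain majority vote, "a" "a"
pick_class(votes, c(0.7, 0.3))  # lower cutoff for "b" flips the first case: "b" "a"
```

Lowering the cutoff of a class makes it easier for that class to win, which is one way to trade off errors between classes.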

Value

If object$type is regression, a vector of predicted values is returned. If predict.all=TRUE, then the returned object is a list of two components: aggregate, which is the vector of predicted values by the forest, and individual, which is a matrix where each column contains prediction by a tree in the forest.

If object$type is classification, the object returned depends on the argument type:

response

predicted classes (the classes with majority vote).

prob

matrix of class probabilities (one column for each class and one row for each input).

votes

matrix of vote counts (one column for each class and one row for each new input); either in raw counts or in fractions (if norm.votes=TRUE).

If predict.all=TRUE, then the individual component of the returned object is a character matrix where each column contains the predicted class by a tree in the forest.

If proximity=TRUE, the returned object is a list with two components: pred is the prediction (as described above) and proximity is the proximity matrix. An error is issued if object$type is regression.

If nodes=TRUE, the returned object has a “nodes” attribute, which is an n by ntree matrix, each column containing the node number that the cases fall in for that tree.

NOTE: If the object inherits from randomForest.formula, then any data with NA are silently omitted from the prediction. The returned value will contain NA correspondingly in the aggregated and individual tree predictions (if requested), but not in the proximity or node matrices.

NOTE2: Any ties are broken at random, so if this is undesirable, avoid it by using an odd number for ntree in randomForest().

Author(s)

Andy Liaw <andy_liaw@merck.com> and Matthew Wiener <matthew_wiener@merck.com>, based on original Fortran code by Leo Breiman and Adele Cutler.

References

Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.

Examples

data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.pred <- predict(iris.rf, iris[ind == 2,])
table(observed = iris[ind==2, "Species"], predicted = iris.pred)
## Get prediction for all trees.
predict(iris.rf, iris[ind == 2,], predict.all=TRUE)
## Proximities.
predict(iris.rf, iris[ind == 2,], proximity=TRUE)
## Nodes matrix.
str(attr(predict(iris.rf, iris[ind == 2,], nodes=TRUE), "nodes"))

Classification and Regression with Random Forest

Description

randomForest implements Breiman's random forest algorithm (based on Breiman and Cutler's original Fortran code) for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.

Usage

## S3 method for class 'formula'
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
## Default S3 method:
randomForest(x, y=NULL,  xtest=NULL, ytest=NULL, ntree=500,
             mtry=if (!is.null(y) && !is.factor(y))
             max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
             weights=NULL,
             replace=TRUE, classwt=NULL, cutoff, strata,
             sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
             nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
             maxnodes = NULL,
             importance=FALSE, localImp=FALSE, nPerm=1,
             proximity, oob.prox=proximity,
             norm.votes=TRUE, do.trace=FALSE,
             keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
             keep.inbag=FALSE, ...)
## S3 method for class 'randomForest'
print(x, ...)

Arguments

data

an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from.

subset

an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)

na.action

A function to specify the action to be taken if NAs are found. (NOTE: If given, this argument must be named.)

x, formula

a data frame or a matrix of predictors, or a formula describing the model to be fitted (for the print method, a randomForest object).

y

A response vector. If a factor, classification is assumed, otherwise regression is assumed. If omitted, randomForest will run in unsupervised mode.

xtest

a data frame or matrix (like x) containing predictors for the test set.

ytest

response for the test set.

ntree

Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.

mtry

Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)

weights

A vector of length same as y that are positive weights used only in sampling data to grow each tree (not used in any other calculation)

replace

Should sampling of cases be done with or without replacement?

classwt

Priors of the classes. Need not add up to one. Ignored for regression.

cutoff

(Classification only) A vector of length equal to number of classes. The ‘winning’ class for an observation is the one with the maximum ratio of proportion of votes to cutoff. Default is 1/k where k is the number of classes (i.e., majority vote wins).

strata

A (factor) variable that is used for stratified sampling.

sampsize

Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.

nodesize

Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). Note that the default values are different for classification (1) and regression (5).

maxnodes

Maximum number of terminal nodes trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize). If set larger than maximum possible, a warning is issued.

importance

Should importance of predictors be assessed?

localImp

Should casewise importance measure be computed? (Setting this to TRUE will override importance.)

nPerm

Number of times the OOB data are permuted per tree for assessing variable importance. Number larger than 1 gives slightly more stable estimate, but not very effective. Currently only implemented for regression.

proximity

Should proximity measure among the rows be calculated?

oob.prox

Should proximity be calculated only on “out-of-bag” data?

norm.votes

If TRUE (default), the final result of votes are expressed as fractions. If FALSE, raw vote counts are returned (useful for combining results from different runs). Ignored for regression.

do.trace

If set to TRUE, give a more verbose output as randomForest is run. If set to some integer, then running output is printed for every do.trace trees.

keep.forest

If set to FALSE, the forest will not be retained in the output object. If xtest is given, defaults to FALSE.

corr.bias

perform bias correction for regression? Note: Experimental. Use at your own risk.

keep.inbag

Should an n by ntree matrix be returned that keeps track of which samples are “in-bag” in which trees (but not how many times, if sampling with replacement)

...

optional parameters to be passed to the low level function randomForest.default.

Value

An object of class randomForest, which is a list with the following components:

call

the original call to randomForest

type

one of regression, classification, or unsupervised.

predicted

the predicted values of the input data based on out-of-bag samples.

importance

a matrix with nclass + 2 (for classification) or two (for regression) columns. For classification, the first nclass columns are the class-specific measures computed as mean decrease in accuracy. The nclass + 1st column is the mean decrease in accuracy over all classes. The last column is the mean decrease in Gini index. For regression, the first column is the mean decrease in accuracy and the second the mean decrease in MSE. If importance=FALSE, the last measure is still returned as a vector.

importanceSD

The “standard errors” of the permutation-based importance measure. For classification, a p by nclass + 1 matrix corresponding to the first nclass + 1 columns of the importance matrix. For regression, a length p vector.

localImp

a p by n matrix containing the casewise importance measures, the [i,j] element of which is the importance of i-th variable on the j-th case. NULL if localImp=FALSE.

ntree

number of trees grown.

mtry

number of predictors sampled for splitting at each node.

forest

a list that contains the entire forest; NULL if randomForest is run in unsupervised mode or if keep.forest=FALSE.

err.rate

(classification only) vector of error rates of the prediction on the input data, the i-th element being the (OOB) error rate for all trees up to the i-th.

confusion

(classification only) the confusion matrix of the prediction (based on OOB data).

votes

(classification only) a matrix with one row for each input data point and one column for each class, giving the fraction or number of (OOB) ‘votes’ from the random forest.

oob.times

number of times cases are ‘out-of-bag’ (and thus used in computing OOB error estimate)

proximity

if proximity=TRUE when randomForest is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes).

mse

(regression only) vector of mean square errors: sum of squared residuals divided by n.

rsq

(regression only) “pseudo R-squared”: 1 - mse / Var(y).

test

if test set is given (through the xtest or additionally ytest arguments), this component is a list which contains the corresponding predicted, err.rate, confusion, votes (for classification) or predicted, mse and rsq (for regression) for the test set. If proximity=TRUE, there is also a component, proximity, which contains the proximity among the test set as well as proximity between test and training data.
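The mse and rsq components listed above follow simple formulas that can be checked by hand. In this sketch y and pred are made-up numbers standing in for a response and its out-of-bag predictions; whether Var(y) uses divisor n or n - 1 is not stated above, and the sketch uses n as an assumption.

```r
# Hypothetical response and (stand-in) out-of-bag predictions.
y <- c(3, 5, 7, 9)
pred <- c(2.5, 5.5, 6.5, 9.5)

mse <- mean((y - pred)^2)        # sum of squared residuals divided by n: 0.25
var_y <- mean((y - mean(y))^2)   # assumption: variance with divisor n, here 5
rsq <- 1 - mse / var_y           # pseudo R-squared: 1 - 0.25 / 5 = 0.95
c(mse = mse, rsq = rsq)
```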

Note

The forest structure is slightly different between classification and regression. For details on how the trees are stored, see the help page for getTree.

If xtest is given, prediction of the test set is done “in place” as the trees are grown. If ytest is also given, and do.trace is set to some positive integer, then for every do.trace trees, the test set error is printed. Results for the test set are returned in the test component of the resulting randomForest object. For classification, the votes component (for training or test set data) contains the votes the cases received for the classes. If norm.votes=TRUE, the fraction is given, which can be taken as predicted probabilities for the classes.

For large data sets, especially those with large number of variables, calling randomForest via the formula interface is not advised: There may be too much overhead in handling the formula.

The “local” (or casewise) variable importance is computed as follows: For classification, it is the increase in percent of times a case is OOB and misclassified when the variable is permuted. For regression, it is the average increase in squared OOB residuals when the variable is permuted.

Author(s)

Andy Liaw <andy_liaw@merck.com> and Matthew Wiener <matthew_wiener@merck.com>, based on original Fortran code by Leo Breiman and Adele Cutler.

References

Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.

Breiman, L. (2002), “Manual On Setting Up, Using, And Understanding Random Forests V3.1”, https://www.stat.berkeley.edu/~breiman/Using_random_forests_V3.1.pdf.

Examples

## Classification:
##data(iris)
set.seed(71)
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE,
                        proximity=TRUE)
print(iris.rf)
## Look at variable importance:
round(importance(iris.rf), 2)
## Do MDS on 1 - proximity:
iris.mds <- cmdscale(1 - iris.rf$proximity, eig=TRUE)
op <- par(pty="s")
pairs(cbind(iris[,1:4], iris.mds$points), cex=0.6, gap=0,
      col=c("red", "green", "blue")[as.numeric(iris$Species)],
      main="Iris Data: Predictors and MDS of Proximity Based on RandomForest")
par(op)
print(iris.mds$GOF)
## The `unsupervised' case:
set.seed(17)
iris.urf <- randomForest(iris[, -5])
MDSplot(iris.urf, iris$Species)
## stratified sampling: draw 20, 30, and 20 of the species to grow each tree.
(iris.rf2 <- randomForest(iris[1:4], iris$Species,
                          sampsize=c(20, 30, 20)))
## Regression:
## data(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., data=airquality, mtry=3,
                         importance=TRUE, na.action=na.omit)
print(ozone.rf)
## Show "importance" of variables: higher values mean more important:
round(importance(ozone.rf), 2)
## "x" can be a matrix instead of a data frame:
set.seed(17)
x <- matrix(runif(5e2), 100)
y <- gl(2, 50)
(myrf <- randomForest(x, y))
(predict(myrf, x))
## "complicated" formula:
(swiss.rf <- randomForest(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50),
                          data=swiss))
(predict(swiss.rf, swiss))
## Test use of 53-level factor as a predictor:
set.seed(1)
x <- data.frame(x1=gl(53, 10), x2=runif(530), y=rnorm(530))
(rf1 <- randomForest(x[-3], x[[3]], ntree=10))
## Grow no more than 4 nodes per tree:
(treesize(randomForest(Species ~ ., data=iris, maxnodes=4, ntree=30)))
## test proximity in regression
iris.rrf <- randomForest(iris[-1], iris[[1]], ntree=101, proximity=TRUE, oob.prox=FALSE)
str(iris.rrf$proximity)
## Using weights: make versicolors having 3 times larger weights
iris_wt <- ifelse( iris$Species == "versicolor", 3, 1 )
set.seed(15)
iris.wcrf <- randomForest(iris[-5], iris[[5]], weights=iris_wt, keep.inbag=TRUE)
print(rowSums(iris.wcrf$inbag))
set.seed(15)
iris.wrrf <- randomForest(iris[-1], iris[[1]], weights=iris_wt, keep.inbag=TRUE)
print(rowSums(iris.wrrf$inbag))

Missing Value Imputations by randomForest

Description

Impute missing values in predictor data using proximity from randomForest.

Usage

## Default S3 method:
rfImpute(x, y, iter=5, ntree=300, ...)
## S3 method for class 'formula'
rfImpute(x, data, ..., subset)

Arguments

x

A data frame or matrix of predictors, some containing NAs, or a formula.

y

Response vector (NA's not allowed).

data

A data frame containing the predictors and response.

iter

Number of iterations to run the imputation.

ntree

Number of trees to grow in each iteration of randomForest.

...

Other arguments to be passed to randomForest.

subset

A logical vector indicating which observations to use.

Details

The algorithm starts by imputing NAs using na.roughfix. Then randomForest is called with the completed data. The proximity matrix from the randomForest is used to update the imputation of the NAs. For continuous predictors, the imputed value is the weighted average of the non-missing observations, where the weights are the proximities. For categorical predictors, the imputed value is the category with the largest average proximity. This process is iterated iter times.

Note: Imputation has not (yet) been implemented for the unsupervised case. Also, Breiman (2003) notes that the OOB estimate of error from randomForest tends to be optimistic when run on the data matrix with imputed values.
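The proximity-based update described above can be sketched in base R. This is only an illustration of the description, not the package's internal code: the proximity matrix, the data values, and the helper names impute_numeric and impute_factor are all made up for the example.

```r
# Toy symmetric proximity matrix for 4 observations (hypothetical values).
prox <- matrix(c(1.0, 0.8, 0.1, 0.1,
                 0.8, 1.0, 0.2, 0.1,
                 0.1, 0.2, 1.0, 0.7,
                 0.1, 0.1, 0.7, 1.0), 4, 4)
x_num <- c(2.0, NA, 10.0, 12.0)          # continuous predictor with one NA
x_cat <- factor(c("a", NA, "b", "b"))    # categorical predictor with one NA

# Continuous case: proximity-weighted average of the non-missing observations.
impute_numeric <- function(x, prox) {
  miss <- is.na(x)
  for (i in which(miss)) {
    w <- prox[i, !miss]
    x[i] <- sum(w * x[!miss]) / sum(w)
  }
  x
}

# Categorical case: category with the largest average proximity.
impute_factor <- function(x, prox) {
  miss <- is.na(x)
  for (i in which(miss)) {
    avg <- tapply(prox[i, !miss], x[!miss], mean)
    x[i] <- names(avg)[which.max(avg)]
  }
  x
}

num_filled <- impute_numeric(x_num, prox)
cat_filled <- impute_factor(x_cat, prox)
print(num_filled)
print(cat_filled)
```

In this toy case the second observation is closest to the first, so its imputed numeric value is pulled toward 2.0 and its imputed category is "a".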

Value

A data frame or matrix containing the completed data matrix, where NAs are imputed using proximity from randomForest. The first column contains the response.

Author(s)

Andy Liaw

References

Leo Breiman (2003). Manual for Setting Up, Using, and Understanding Random Forest V4.0. https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf

Examples

data(iris)
iris.na <- iris
set.seed(111)
## artificially drop some data values.
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA
set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)
print(iris.rf)

Show the NEWS file

Description

Show the NEWS file of the randomForest package.

Usage

rfNews()

Value

None.

Random Forest Cross-Validation for feature selection

Description

This function shows the cross-validated prediction performance of models with sequentially reduced number of predictors (ranked by variable importance) via a nested cross-validation procedure.

Usage

rfcv(trainx, trainy, cv.fold=5, scale="log", step=0.5,
     mtry=function(p) max(1, floor(sqrt(p))), recursive=FALSE, ...)

Arguments

trainx

matrix or data frame containing columns of predictor variables

trainy

vector of response, must have length equal to the number of rows in trainx

cv.fold

number of folds in the cross-validation

scale

if "log", reduce a fixed proportion (step) of variables at each step, otherwise reduce step variables at a time

step

if scale="log", the fraction of variables to remove at each step, else remove this many variables at a time

mtry

a function of the number of remaining predictor variables, used as the mtry parameter in the randomForest call

recursive

whether variable importance is (re-)assessed at each step of variable reduction

...

other arguments passed on to randomForest
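As a small illustration of the mtry argument, the default function from the Usage section can be evaluated on its own to see which mtry it yields as predictors are removed. The alternative rule mtry_third and the vector p_remaining are hypothetical, chosen only for the example.

```r
# Default from the Usage section: about sqrt(p), but never below 1.
mtry_default <- function(p) max(1, floor(sqrt(p)))

# Hypothetical alternative rule: a third of the remaining predictors.
mtry_third <- function(p) max(1, floor(p / 3))

# Illustrative numbers of remaining predictors (not produced by rfcv).
p_remaining <- c(100, 50, 25, 12, 6, 3, 1)

mtry_table <- data.frame(p          = p_remaining,
                         sqrt_rule  = sapply(p_remaining, mtry_default),
                         third_rule = sapply(p_remaining, mtry_third))
print(mtry_table)
```

A custom rule of this kind would be supplied as, for instance, rfcv(trainx, trainy, mtry=mtry_third).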

Value

A list with the following components:

list(n.var=n.var, error.cv=error.cv, predicted=cv.pred)

n.var

vector of number of variables used at each step

error.cv

corresponding vector of error rates or MSEs at each step

predicted

list of n.var components, each containing the predicted values from the cross-validation
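A minimal sketch of how the returned components might be used to pick a model size. The list below is a hand-made stand-in with invented numbers that has the same component names as the documented return value; it is not output from rfcv.

```r
# Hypothetical stand-in for an rfcv result (numbers invented for illustration).
result <- list(n.var    = c(100, 50, 25, 12, 6, 3, 1),
               error.cv = c(0.10, 0.08, 0.06, 0.05, 0.04, 0.05, 0.30))

# Number of variables with the smallest cross-validated error.
best_n <- result$n.var[which.min(result$error.cv)]

# Smallest model whose error is within 0.01 of the minimum
# (the 0.01 tolerance is an arbitrary choice for this sketch).
ok <- result$error.cv <= min(result$error.cv) + 0.01
smallest_ok <- min(result$n.var[ok])

print(c(best = best_n, smallest_within_tolerance = smallest_ok))
```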

Author(s)

Andy Liaw

References

Svetnik, V., Liaw, A., Tong, C. and Wang, T., “Application of Breiman's Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules”, MCS 2004, Roli, F. and Windeatt, T. (Eds.) pp. 334-343.

Examples

set.seed(647)
myiris <- cbind(iris[1:4], matrix(runif(96 * nrow(iris)), nrow(iris), 96))
result <- rfcv(myiris, iris$Species, cv.fold=3)
with(result, plot(n.var, error.cv, log="x", type="o", lwd=2))
## The following can take a while to run, so if you really want to try
## it, copy and paste the code into R.
## Not run:
result <- replicate(5, rfcv(myiris, iris$Species), simplify=FALSE)
error.cv <- sapply(result, "[[", "error.cv")
matplot(result[[1]]$n.var, cbind(rowMeans(error.cv), error.cv), type="l",
        lwd=c(2, rep(1, ncol(error.cv))), col=1, lty=1, log="x",
        xlab="Number of variables", ylab="CV Error")
## End(Not run)

Size of trees in an ensemble

Description

Size of trees (number of nodes) in an ensemble.

Usage

treesize(x, terminal=TRUE)

Arguments

x

an object of class randomForest, which contains a forest component.

terminal

count terminal nodes only (TRUE) or all nodes (FALSE)

Value

A vector containing number of nodes for the trees in the randomForest object.

Note

The randomForest object must contain the forest component; i.e., created with randomForest(..., keep.forest=TRUE).
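A short sketch contrasting the two settings of terminal, assuming the randomForest package is installed. The seed and ntree=25 are arbitrary choices for the example.

```r
library(randomForest)

set.seed(42)
# keep.forest=TRUE so that the forest component is retained.
rf <- randomForest(Species ~ ., data = iris, ntree = 25, keep.forest = TRUE)

term_nodes <- treesize(rf)                    # terminal nodes per tree
all_nodes  <- treesize(rf, terminal = FALSE)  # all nodes per tree

# One value per tree in both cases.
print(summary(term_nodes))
print(summary(all_nodes))
```

Because every split produces two children, one would expect the count of all nodes to be close to twice the count of terminal nodes in each tree.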

Author(s)

Andy Liaw andy_liaw@merck.com

Examples

data(iris)
iris.rf <- randomForest(Species ~ ., iris)
hist(treesize(iris.rf))

Tune randomForest for the optimal mtry parameter

Description

Starting with the default value of mtry, search for the optimal value (with respect to Out-of-Bag error estimate) of mtry for randomForest.

Usage

tuneRF(x, y, mtryStart, ntreeTry=50, stepFactor=2, improve=0.05,
       trace=TRUE, plot=TRUE, doBest=FALSE, ...)

Arguments

x

matrix or data frame of predictor variables

y

response vector (factor for classification, numeric for regression)

mtryStart

starting value of mtry; default is the same as in randomForest

ntreeTry

number of trees used at the tuning step

stepFactor

at each iteration, mtry is inflated (or deflated) by this value

improve

the (relative) improvement in OOB error must be by this much for the search to continue

trace

whether to print the progress of the search

plot

whether to plot the OOB error as function of mtry

doBest

whether to run a forest using the optimal mtry found

...

options to be given to randomForest

Value

If doBest=FALSE (default), it returns a matrix whose first column contains the mtry values searched, and the second column the corresponding OOB error.

If doBest=TRUE, it returns the randomForest object produced with the optimal mtry.
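The roles of mtryStart, stepFactor and improve can be sketched with a toy search in base R. This is one reading of the argument descriptions, not the package's implementation: oob_error is a made-up stand-in for "grow a forest with this mtry and return its OOB error", and search_mtry is a hypothetical helper.

```r
# Hypothetical stand-in for the OOB error of a forest grown with a given mtry.
oob_error <- function(mtry) 0.10 + 0.01 * (mtry - 4)^2

search_mtry <- function(start, n_var, step_factor = 2, improve = 0.05) {
  tried <- start
  errs  <- oob_error(start)
  for (direction in c("down", "up")) {
    cur <- start
    err_old <- errs[1]
    repeat {
      # Deflate or inflate mtry by step_factor, staying within 1..n_var.
      if (direction == "down") {
        nxt <- max(1, ceiling(cur / step_factor))
      } else {
        nxt <- min(n_var, floor(cur * step_factor))
      }
      if (nxt == cur) break
      err_new <- oob_error(nxt)
      tried <- c(tried, nxt)
      errs  <- c(errs, err_new)
      # Continue only if the relative improvement exceeds the threshold.
      if (1 - err_new / err_old <= improve) break
      cur <- nxt
      err_old <- err_new
    }
  }
  ord <- order(tried)
  cbind(mtry = tried[ord], OOBError = errs[ord])
}

res <- search_mtry(start = 2, n_var = 9)
print(res)
best_mtry <- res[which.min(res[, "OOBError"]), "mtry"]
```

The result has the same two-column shape as the matrix described for doBest=FALSE, so the best mtry can be read off the row with the smallest error.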

Examples

data(fgl, package="MASS")
fgl.res <- tuneRF(fgl[,-10], fgl[,10], stepFactor=1.5)

Variable Importance Plot

Description

Dotchart of variable importance as measured by a Random Forest

Usage

varImpPlot(x, sort=TRUE, n.var=min(30, nrow(x$importance)),
           type=NULL, class=NULL, scale=TRUE,
           main=deparse(substitute(x)), ...)

Arguments

x

An object of class randomForest.

sort

Should the variables be sorted in decreasing order of importance?

n.var

How many variables to show? (Ignored if sort=FALSE.)

type, class, scale

arguments to be passed on to importance

main

plot title.

...

Other graphical parameters to be passed on to dotchart.

Value

Invisibly, the importance of the variables that were plotted.

Author(s)

Andy Liaw andy_liaw@merck.com

Examples

set.seed(4543)
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data=mtcars, ntree=1000, keep.forest=FALSE,
                          importance=TRUE)
varImpPlot(mtcars.rf)

Variables used in a random forest

Description

Find out which predictor variables are actually used in the random forest.

Usage

varUsed(x, by.tree=FALSE, count=TRUE)

Arguments

x

An object of class randomForest.

by.tree

Should the list of variables used be broken down by trees in the forest?

count

Should the frequencies that variables appear in trees be returned?

Value

If count=TRUE and by.tree=FALSE, an integer vector containing frequencies that variables are used in the forest. If by.tree=TRUE, a matrix is returned, breaking down the counts by tree (each column corresponding to one tree and each row to a variable).

If count=FALSE and by.tree=TRUE, a list of integer indices is returned giving the variables used in the trees, else if by.tree=FALSE, a vector of integer indices giving the variables used in the entire forest.
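The return shapes described above can be compared side by side, assuming the randomForest package is installed. The seed and ntree=20 are arbitrary, and attaching names from rownames(rf$importance) is a convenience added for this sketch.

```r
library(randomForest)

set.seed(17)
rf <- randomForest(Species ~ ., data = iris, ntree = 20)

counts  <- varUsed(rf)                  # count=TRUE, by.tree=FALSE: one count per variable
by_tree <- varUsed(rf, by.tree = TRUE)  # count=TRUE, by.tree=TRUE: variables x trees
used    <- varUsed(rf, count = FALSE)   # count=FALSE, by.tree=FALSE: indices of variables used

# Label the counts with the predictor names stored in the fitted object.
print(setNames(counts, rownames(rf$importance)))
print(dim(by_tree))
print(used)
```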

Author(s)

Andy Liaw

Examples

data(iris)
set.seed(17)
varUsed(randomForest(Species~., iris, ntree=100))

