| Type: | Package |
| Version: | 1.2.1 |
| Title: | Machine Learning Algorithms with Unified Interface and Confusion Matrices |
| Description: | A unified interface is provided to various machine learning algorithms like linear or quadratic discriminant analysis, k-nearest neighbors, random forest, support vector machine, ... It allows training, testing, and applying cross-validation using similar functions and function arguments, with a minimalist and clean, formula-based interface. Missing data are processed the same way as in base and stats R functions for all algorithms, both in training and testing. Confusion matrices are also provided, with a rich set of calculated metrics and a few specific plots. |
| Maintainer: | Philippe Grosjean <phgrosjean@sciviews.org> |
| Depends: | R (≥ 3.0.4) |
| Imports: | stats, grDevices, class, nnet, MASS, e1071, randomForest, ipred, rpart |
| Suggests: | mlbench, datasets, RColorBrewer, spelling, knitr, rmarkdown, covr |
| URL: | https://www.sciviews.org/mlearning/ |
| BugReports: | https://github.com/SciViews/mlearning/issues |
| License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
| RoxygenNote: | 7.2.3 |
| Config/testthat/edition: | 3 |
| Encoding: | UTF-8 |
| Language: | en-US |
| NeedsCompilation: | no |
| Packaged: | 2023-08-30 18:46:17 UTC; phgrosjean |
| Author: | Philippe Grosjean |
| Repository: | CRAN |
| Date/Publication: | 2023-08-30 19:10:02 UTC |
Machine Learning Algorithms with Unified Interface and Confusion Matrices
Description
This package provides wrappers around several existing machine learning algorithms in R, under a unified user interface. Confusion matrices can also be calculated and viewed as tables or plots. Key features are:
- Unified, formula-based interface for all algorithms, similar to stats::lm().
- Optimized code when the simplified formula y ~ . is used, meaning all variables in data are used. One of them (y here) is the class to be predicted (classification problem, a factor variable), or the dependent variable of the model (regression problem, a numeric variable).
- Similar way of dealing with missing data, both in the training set and in predictions. Underlying algorithms deal differently with missing data: some accept them, others do not.
- Unified way of dealing with factor levels that have no cases in the training set. The training succeeds, but the classifier is, of course, unable to classify items in the missing class.
- The predict() methods have similar arguments. They return the class, membership to the classes, both, or something else (probabilities, raw predictions, ...) depending on the algorithm or the problem (classification or regression).
- The cvpredict() method is available for all algorithms and performs cross-validation, or even a leave-one-out validation (when cv.k = number of cases), very easily. It operates transparently for the end-user.
- The confusion() method creates a confusion matrix; the object can be printed, summarized, and plotted. Various metrics are easily derived from the confusion matrix. It also allows adjusting prior probabilities of the classes in a classification problem, in order to obtain more representative estimates of the metrics when priors are adjusted to values close to the real proportions of classes in the data.
See mlearning() for further explanations and an example analysis. See mlLda() for examples of the different forms of the formula that can be used. See plot.confusion() for the different ways to explore the confusion matrix.
Important functions
- ml_lda(), ml_qda(), ml_naive_bayes(), ml_knn(), ml_lvq(), ml_nnet(), ml_rpart(), ml_rforest() and ml_svm() to train classifiers or regressors with the different algorithms that are supported in the package,
- predict() and cvpredict() for predictions, including using cross-validation,
- confusion() to calculate the confusion matrix (with various methods to analyze it and to calculate derived metrics like recall, precision, F-score, ...),
- prior() to adjust prior probabilities,
- response() and train() to extract response and training variables from an mlearning object.
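As a minimal sketch of the typical workflow (assuming the mlearning package is installed and attached; the calls mirror the examples further down this page), one can train a classifier, cross-validate it, and derive metrics from the confusion matrix:

library(mlearning)
data("iris", package = "datasets")
# Train a linear discriminant classifier with the simplified formula
iris_lda <- ml_lda(data = iris, Species ~ .)
# Cross-validated predictions, tabulated against the true classes
iris_conf <- confusion(cvpredict(iris_lda), iris$Species)
iris_conf            # Print the confusion matrix
summary(iris_conf)   # Recall, precision, F-score, ... per class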
Construct and analyze confusion matrices
Description
Confusion matrices compare two classifications (usually one done automatically using a machine learning algorithm versus the true classification done by a specialist, but one can also compare two automatic or two manual classifications against each other).
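For instance, a minimal sketch with two small hypothetical factor vectors (any pair of classifications of the same items would do), calling the default method directly:

library(mlearning)
# Two classifications of the same six items (hypothetical data)
predicted <- factor(c("a", "b", "b", "b", "c", "a"), levels = c("a", "b", "c"))
actual    <- factor(c("a", "a", "b", "b", "c", "c"), levels = c("a", "b", "c"))
conf <- confusion(predicted, actual)
conf            # Contingency table with sums and per-class error
summary(conf)   # Metrics derived from the confusion matrix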
Usage
confusion(x, ...)

## Default S3 method:
confusion(
  x,
  y = NULL,
  vars = c("Actual", "Predicted"),
  labels = vars,
  merge.by = "Id",
  useNA = "ifany",
  prior,
  ...
)

## S3 method for class 'mlearning'
confusion(
  x,
  y = response(x),
  labels = c("Actual", "Predicted"),
  useNA = "ifany",
  prior,
  ...
)

## S3 method for class 'confusion'
print(x, sums = TRUE, error.col = sums, digits = 0, sort = "ward.D2", ...)

## S3 method for class 'confusion'
summary(object, type = "all", sort.by = "Fscore", decreasing = TRUE, ...)

## S3 method for class 'summary.confusion'
print(x, ...)

Arguments
x | an object with a confusion() method implemented. |
... | further arguments passed to the method. |
y | another object, from which to extract the second classification, or NULL if not used. |
vars | the variables of interest in the first and second classification in the case the objects are lists or data frames. Otherwise, this argument is ignored and x and y must be factors. |
labels | labels to use for the two classifications. By default, they are the same as vars. |
merge.by | a character string with the name of variables to use to merge the two data frames, or NULL. |
useNA | do we keep missing values as a separate category? The default "ifany" creates this category only if there are missing values; other possibilities are "no" and "always". |
prior | class frequencies to use for the first classifier, which is tabulated in the rows of the confusion matrix. For its value, see the value argument of prior(). |
sums | is the confusion matrix printed with row and column sums? |
error.col | is a column with the class error for the first classifier added (equivalent to the false negative rate, or FNR)? |
digits | the number of digits after the decimal point to print in the confusion matrix. The default (zero) leads to the most compact presentation and is suitable for frequencies, but not for relative frequencies. |
sort | are rows and columns of the confusion matrix sorted so that classes with larger confusion are closer together? Sorting is done using a hierarchical clustering with hclust(), with "ward.D2" as the default clustering method. |
object | a confusion object. |
type | either "all" (the default) to compute all available metrics, or a character vector with the names of the metrics to compute (e.g., "Fscore", "Recall", "Precision"). |
sort.by | the statistic to use to sort the table (by default, Fmeasure, the F1 score for each class = 2 * recall * precision / (recall + precision)). |
decreasing | do we sort in increasing or decreasing order? |
Value
A confusion matrix in a confusion object.
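To illustrate the metrics reported by summary() (plain base R arithmetic on hypothetical counts for one class, not a package call):

# Hypothetical counts for one class
tp <- 40 # items of the class correctly predicted
fn <- 10 # items of the class predicted as something else
fp <- 5  # items of other classes wrongly predicted as this class
recall <- tp / (tp + fn)                                 # 0.8
precision <- tp / (tp + fp)                              # ~0.889
fscore <- 2 * recall * precision / (recall + precision)  # ~0.842
fscore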
See Also
mlearning(), plot.confusion(), prior()
Examples
data("Glass", package = "mlbench")# Use a little bit more informative labels for TypeGlass$Type <- as.factor(paste("Glass", Glass$Type))# Use learning vector quantization to classify the glass types# (using default parameters)summary(glass_lvq <- ml_lvq(Type ~ ., data = Glass))# Calculate cross-validated confusion matrix(glass_conf <- confusion(cvpredict(glass_lvq), Glass$Type))# Raw confusion matrix: no sort and no marginsprint(glass_conf, sums = FALSE, sort = FALSE)summary(glass_conf)summary(glass_conf, type = "Fscore")Supervised classification using k-nearest neighbor
Description
Unified (formula-based) interface version of the k-nearest neighbor algorithm provided by class::knn().
Usage
mlKnn(train, ...)

ml_knn(train, ...)

## S3 method for class 'formula'
mlKnn(formula, data, k.nn = 5, ..., subset, na.action)

## Default S3 method:
mlKnn(train, response, k.nn = 5, ...)

## S3 method for class 'mlKnn'
summary(object, ...)

## S3 method for class 'summary.mlKnn'
print(x, ...)

## S3 method for class 'mlKnn'
predict(
  object,
  newdata,
  type = c("class", "prob", "both"),
  method = c("direct", "cv"),
  na.action = na.exclude,
  ...
)

Arguments
train | a matrix or data frame with predictors. |
... | further arguments passed to the classification method or its predict() method. |
formula | a formula with left term being the factor variable to predict and the right term with the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the simplified y ~ . form. |
data | a data.frame to use as a training set. |
k.nn | k, the number of neighbors considered for k-NN. Default is 5. |
subset | index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action | function to specify the action to be taken if NA values are found in the data (for instance, na.omit to drop incomplete cases, or na.fail to stop with an error). |
response | a factor with the classes for the classification. |
x, object | an mlKnn object |
newdata | a new dataset with the same conformation as the training set (same variables, except maybe the class for classification or the dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
type | the type of prediction to return. |
method | "direct" (the default) to predict with the classifier itself, or "cv" to get cross-validated predictions (see cvpredict()). |
Value
ml_knn()/mlKnn() creates an mlKnn, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().
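A minimal sketch of such an inspection (the component names vary by algorithm, so none are assumed here):

library(mlearning)
data("iris", package = "datasets")
iris_knn <- ml_knn(data = iris, Species ~ .)
# Look at the raw list underneath the mlKnn / mlearning classes
str(unclass(iris_knn), max.level = 1)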
See Also
mlearning(), cvpredict(), confusion(), also class::knn() and ipred::predict.ipredknn() that actually do the classification.
Examples
# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA
iris_knn <- ml_knn(data = iris_train, Species ~ .)
summary(iris_knn)
predict(iris_knn) # This object only returns classes
# Self-consistency, do not use for assessing classifier performances!
confusion(iris_knn)
# Use an independent test set instead
confusion(predict(iris_knn, newdata = iris_test), iris_test$Species)

Supervised classification using linear discriminant analysis
Description
Unified (formula-based) interface version of the linear discriminant analysis algorithm provided by MASS::lda().
Usage
mlLda(train, ...)

ml_lda(train, ...)

## S3 method for class 'formula'
mlLda(formula, data, ..., subset, na.action)

## Default S3 method:
mlLda(train, response, ...)

## S3 method for class 'mlLda'
predict(
  object,
  newdata,
  type = c("class", "membership", "both", "projection"),
  prior = object$prior,
  dimension = NULL,
  method = c("plug-in", "predictive", "debiased", "cv"),
  ...
)

Arguments
train | a matrix or data frame with predictors. |
... | further arguments passed to MASS::lda() or its predict() method. |
formula | a formula with left term being the factor variable to predict and the right term with the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the simplified y ~ . form. |
data | a data.frame to use as a training set. |
subset | index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action | function to specify the action to be taken if NA values are found in the data (for instance, na.omit to drop incomplete cases, or na.fail to stop with an error). |
response | a factor with the classes for the classification. |
object | an mlLda object |
newdata | a new dataset with the same conformation as the training set (same variables, except maybe the class for classification or the dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
type | the type of prediction to return. |
prior | the prior probabilities of class membership. By default, the priors are obtained from the object and, if they were not changed, correspond to the proportions observed in the training set. |
dimension | the number of dimensions of the predictive space to use. If NULL (the default), a reasonable value is used. |
method | "plug-in" (the default), "predictive" or "debiased", as in MASS::predict.lda(), or "cv" for cross-validated predictions (see cvpredict()). |
Value
ml_lda()/mlLda() creates an mlLda, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().
See Also
mlearning(), cvpredict(), confusion(), also MASS::lda() that actually does the classification.
Examples
# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA

iris_lda <- ml_lda(data = iris_train, Species ~ .)
iris_lda
summary(iris_lda)
plot(iris_lda, col = as.numeric(response(iris_lda)) + 1)
# Prediction using a test set
predict(iris_lda, newdata = iris_test) # class (default type)
predict(iris_lda, type = "membership") # posterior probability
predict(iris_lda, type = "both") # both class and membership in a list
# Type projection
predict(iris_lda, type = "projection") # Projection on the LD axes
# Add test set items to the previous plot
points(predict(iris_lda, newdata = iris_test, type = "projection"),
  col = as.numeric(predict(iris_lda, newdata = iris_test)) + 1, pch = 19)

# predict() and confusion() should be used on a separate test set
# for unbiased estimation (or using cross-validation, bootstrap, ...)
# Wrong, cf. biased estimation (so-called, self-consistency)
confusion(iris_lda)
# Estimation using a separate test set
confusion(predict(iris_lda, newdata = iris_test), iris_test$Species)

# Another dataset (binary predictor... not optimal for lda, just for test)
data("HouseVotes84", package = "mlbench")
house_lda <- ml_lda(data = HouseVotes84, na.action = na.omit, Class ~ .)
summary(house_lda)
confusion(house_lda) # Self-consistency (biased metrics)
print(confusion(house_lda), error.col = FALSE) # Without error column

# More complex formulas
# Exclude one or more variables
iris_lda2 <- ml_lda(data = iris, Species ~ . - Sepal.Width)
summary(iris_lda2)
# With calculation
iris_lda3 <- ml_lda(data = iris,
  Species ~ log(Petal.Length) + log(Petal.Width) + I(Petal.Length / Sepal.Length))
summary(iris_lda3)

# Factor levels with missing items are allowed
ir2 <- iris[-(51:100), ] # No Iris versicolor in the training set
iris_lda4 <- ml_lda(data = ir2, Species ~ .)
summary(iris_lda4) # missing class
# Missing levels are reinjected in class or membership by predict()
predict(iris_lda4, type = "both")
# ... but, of course, the classifier is wrong for Iris versicolor
confusion(predict(iris_lda4, newdata = iris), iris$Species)

# Simpler interface, but more memory-efficient
iris_lda5 <- ml_lda(train = iris[, -5], response = iris$Species)
summary(iris_lda5)

Supervised classification using learning vector quantization
Description
Unified (formula-based) interface version of the learning vector quantization algorithms provided by class::olvq1(), class::lvq1(), class::lvq2(), and class::lvq3().
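For instance (a minimal sketch; it is assumed here that the algorithm argument listed below accepts the names of these class functions, with "olvq1" as the default):

library(mlearning)
data("iris", package = "datasets")
# Optimized LVQ1 (the default variant)
iris_olvq <- ml_lvq(data = iris, Species ~ .)
# Same unified interface, switching to the lvq1 variant
iris_lvq1 <- ml_lvq(data = iris, Species ~ ., algorithm = "lvq1")
summary(iris_lvq1)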
Usage
mlLvq(train, ...)

ml_lvq(train, ...)

## S3 method for class 'formula'
mlLvq(
  formula,
  data,
  k.nn = 5,
  size,
  prior,
  algorithm = "olvq1",
  ...,
  subset,
  na.action
)

## Default S3 method:
mlLvq(train, response, k.nn = 5, size, prior, algorithm = "olvq1", ...)

## S3 method for class 'mlLvq'
summary(object, ...)

## S3 method for class 'summary.mlLvq'
print(x, ...)

## S3 method for class 'mlLvq'
predict(
  object,
  newdata,
  type = "class",
  method = c("direct", "cv"),
  na.action = na.exclude,
  ...
)

Arguments
train | a matrix or data frame with predictors. |
... | further arguments passed to the classification method or its predict() method. |
formula | a formula with left term being the factor variable to predict and the right term with the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the simplified y ~ . form. |
data | a data.frame to use as a training set. |
k.nn | k, the number of neighbors considered for k-NN. Default is 5. |
size | the size of the codebook. Defaults to min(round(0.4 * nc * (nc - 1 + p/2), 0), n), where nc is the number of classes. |
prior | probabilities to represent classes in the codebook (defaultvalues are the proportions in the training set). |
algorithm | the LVQ variant to use: "olvq1" (the default), "lvq1", "lvq2" or "lvq3" (see the corresponding functions in the class package). |
subset | index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action | function to specify the action to be taken if NA values are found in the data (for instance, na.omit to drop incomplete cases, or na.fail to stop with an error). |
response | a factor with the classes. |
x, object | an mlLvq object |
newdata | a new dataset with the same conformation as the training set (same variables, except maybe the class for classification or the dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
type | the type of prediction to return. For this method, only "class" is supported. |
method | "direct" (the default) to predict with the classifier itself, or "cv" to get cross-validated predictions (see cvpredict()). |
Value
ml_lvq()/mlLvq() creates an mlLvq, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().
See Also
mlearning(), cvpredict(), confusion(), also class::olvq1(), class::lvq1(), class::lvq2(), and class::lvq3() that actually do the classification.
Examples
# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA
iris_lvq <- ml_lvq(data = iris_train, Species ~ .)
summary(iris_lvq)
predict(iris_lvq) # This object only returns classes
# Self-consistency, do not use for assessing classifier performances!
confusion(iris_lvq)
# Use an independent test set instead
confusion(predict(iris_lvq, newdata = iris_test), iris_test$Species)

Supervised classification using naive Bayes
Description
Unified (formula-based) interface version of the naive Bayes algorithm provided by e1071::naiveBayes().
Usage
mlNaiveBayes(train, ...)

ml_naive_bayes(train, ...)

## S3 method for class 'formula'
mlNaiveBayes(formula, data, laplace = 0, ..., subset, na.action)

## Default S3 method:
mlNaiveBayes(train, response, laplace = 0, ...)

## S3 method for class 'mlNaiveBayes'
predict(
  object,
  newdata,
  type = c("class", "membership", "both"),
  method = c("direct", "cv"),
  na.action = na.exclude,
  threshold = 0.001,
  eps = 0,
  ...
)

Arguments
train | a matrix or data frame with predictors. |
... | further arguments passed to the classification method or its predict() method. |
formula | a formula with left term being the factor variable to predict and the right term with the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the simplified y ~ . form. |
data | a data.frame to use as a training set. |
laplace | positive number controlling Laplace smoothing for the naive Bayes classifier. The default (0) disables Laplace smoothing. |
subset | index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action | function to specify the action to be taken if NA values are found in the data (for instance, na.omit to drop incomplete cases, or na.fail to stop with an error). |
response | a factor with the classes. |
object | an mlNaiveBayes object |
newdata | a new dataset with the same conformation as the training set (same variables, except maybe the class for classification or the dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
type | the type of prediction to return. |
method | "direct" (the default) to predict with the classifier itself, or "cv" to get cross-validated predictions (see cvpredict()). |
threshold | value replacing cells with probabilities within 'eps' range. |
eps | number for specifying an epsilon-range to apply Laplace smoothing (to replace zero or close-zero probabilities by 'threshold'). |
Value
ml_naive_bayes()/mlNaiveBayes() creates an mlNaiveBayes, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().
See Also
mlearning(), cvpredict(), confusion(), also e1071::naiveBayes() that actually does the classification.
Examples
# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA
iris_nb <- ml_naive_bayes(data = iris_train, Species ~ .)
summary(iris_nb)
predict(iris_nb) # Default type is class
predict(iris_nb, type = "membership")
predict(iris_nb, type = "both")
# Self-consistency, do not use for assessing classifier performances!
confusion(iris_nb)
# Use an independent test set instead
confusion(predict(iris_nb, newdata = iris_test), iris_test$Species)

# Another dataset
data("HouseVotes84", package = "mlbench")
house_nb <- ml_naive_bayes(data = HouseVotes84, Class ~ ., na.action = na.omit)
summary(house_nb)
confusion(house_nb) # Self-consistency
confusion(cvpredict(house_nb), na.omit(HouseVotes84)$Class)

Supervised classification and regression using neural network
Description
Unified (formula-based) interface version of the single-hidden-layer neural network algorithm, possibly with skip-layer connections, provided by nnet::nnet().
Usage
mlNnet(train, ...)

ml_nnet(train, ...)

## S3 method for class 'formula'
mlNnet(
  formula,
  data,
  size = NULL,
  rang = NULL,
  decay = 0,
  maxit = 1000,
  ...,
  subset,
  na.action
)

## Default S3 method:
mlNnet(train, response, size = NULL, rang = NULL, decay = 0, maxit = 1000, ...)

## S3 method for class 'mlNnet'
predict(
  object,
  newdata,
  type = c("class", "membership", "both", "raw"),
  method = c("direct", "cv"),
  na.action = na.exclude,
  ...
)

Arguments
train | a matrix or data frame with predictors. |
... | further arguments passed to nnet::nnet() or its predict() method. |
formula | a formula with left term being the factor variable to predict (for supervised classification) or a vector of numbers (for regression), and the right term with the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the simplified y ~ . form. |
data | a data.frame to use as a training set. |
size | number of units in the hidden layer. Can be zero if there are skip-layer units. If NULL (the default), a reasonable value is computed from the data. |
rang | initial random weights on [-rang, rang]. Value about 0.5 unless the inputs are large, in which case it should be chosen so that rang * max(|x|) is about 1. If NULL (the default), a reasonable value is computed from the data. |
decay | parameter for weight decay. Defaults to 0. |
maxit | maximum number of iterations. Default is 1000 (it is 100 in nnet::nnet()). |
subset | index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action | function to specify the action to be taken if NA values are found in the data (for instance, na.omit to drop incomplete cases, or na.fail to stop with an error). |
response | a vector of factor (classification) or numeric (regression). |
object | an mlNnet object |
newdata | a new dataset with the same conformation as the training set (same variables, except maybe the class for classification or the dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
type | the type of prediction to return. |
method | "direct" (the default) to predict with the classifier itself, or "cv" to get cross-validated predictions (see cvpredict()). |
Value
ml_nnet()/mlNnet() creates an mlNnet, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().
See Also
mlearning(), cvpredict(), confusion(), also nnet::nnet() that actually does the classification.
Examples
# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA
set.seed(689) # Useful for reproducibility, use a different value each time!
iris_nnet <- ml_nnet(data = iris_train, Species ~ .)
summary(iris_nnet)
predict(iris_nnet) # Default type is class
predict(iris_nnet, type = "membership")
predict(iris_nnet, type = "both")
# Self-consistency, do not use for assessing classifier performances!
confusion(iris_nnet)
# Use an independent test set instead
confusion(predict(iris_nnet, newdata = iris_test), iris_test$Species)

# Idem, but two classes prediction
data("HouseVotes84", package = "mlbench")
set.seed(325)
house_nnet <- ml_nnet(data = HouseVotes84, Class ~ ., na.action = na.omit)
summary(house_nnet)
# Cross-validated confusion matrix
confusion(cvpredict(house_nnet), na.omit(HouseVotes84)$Class)

# Regression
data(airquality, package = "datasets")
set.seed(74)
ozone_nnet <- ml_nnet(data = airquality, Ozone ~ ., na.action = na.omit,
  skip = TRUE, decay = 1e-3, size = 20, linout = TRUE)
summary(ozone_nnet)
plot(na.omit(airquality)$Ozone, predict(ozone_nnet, type = "raw"))
abline(a = 0, b = 1)

Supervised classification using quadratic discriminant analysis
Description
Unified (formula-based) interface version of the quadratic discriminant analysis algorithm provided by MASS::qda().
Usage
mlQda(train, ...)

ml_qda(train, ...)

## S3 method for class 'formula'
mlQda(formula, data, ..., subset, na.action)

## Default S3 method:
mlQda(train, response, ...)

## S3 method for class 'mlQda'
predict(
  object,
  newdata,
  type = c("class", "membership", "both"),
  prior = object$prior,
  method = c("plug-in", "predictive", "debiased", "looCV", "cv"),
  ...
)

Arguments
train | a matrix or data frame with predictors. |
... | further arguments passed to MASS::qda() or its predict() method. |
formula | a formula with left term being the factor variable to predict and the right term with the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the simplified y ~ . form. |
data | a data.frame to use as a training set. |
subset | index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action | function to specify the action to be taken if NA values are found in the data (for instance, na.omit to drop incomplete cases, or na.fail to stop with an error). |
response | a factor with the classes for the classification. |
object | an mlQda object |
newdata | a new dataset with the same conformation as the training set (same variables, except maybe the class for classification or the dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
type | the type of prediction to return. |
prior | the prior probabilities of class membership. By default, the priors are obtained from the object and, if they were not changed, correspond to the proportions observed in the training set. |
method | "plug-in" (the default), "predictive", "debiased" or "looCV" (leave-one-out cross-validation), as in MASS::predict.qda(), or "cv" for cross-validated predictions (see cvpredict()). |
Value
ml_qda()/mlQda() creates an mlQda, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().
See Also
mlearning(), cvpredict(), confusion(), also MASS::qda() that actually does the classification.
Examples
# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA
iris_qda <- ml_qda(data = iris_train, Species ~ .)
summary(iris_qda)
confusion(iris_qda)
confusion(predict(iris_qda, newdata = iris_test), iris_test$Species)

# Another dataset (binary predictor... not optimal for qda, just for test)
data("HouseVotes84", package = "mlbench")
house_qda <- ml_qda(data = HouseVotes84, Class ~ ., na.action = na.omit)
summary(house_qda)

Supervised classification and regression using random forest
Description
Unified (formula-based) interface version of the random forest algorithm provided by randomForest::randomForest().
Usage
mlRforest(train, ...)

ml_rforest(train, ...)

## S3 method for class 'formula'
mlRforest(
  formula,
  data,
  ntree = 500,
  mtry,
  replace = TRUE,
  classwt = NULL,
  ...,
  subset,
  na.action
)

## Default S3 method:
mlRforest(
  train,
  response,
  ntree = 500,
  mtry,
  replace = TRUE,
  classwt = NULL,
  ...
)

## S3 method for class 'mlRforest'
predict(
  object,
  newdata,
  type = c("class", "membership", "both", "vote"),
  method = c("direct", "oob", "cv"),
  ...
)

Arguments
train | a matrix or data frame with predictors. |
... | further arguments passed to randomForest::randomForest() or its predict() method. |
formula | a formula with left term being the factor variable to predict (for supervised classification), a vector of numbers (for regression) or nothing (for unsupervised classification), and the right term with the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the simplified y ~ . form. |
data | a data.frame to use as a training set. |
ntree | the number of trees to generate (use a value large enough to get at least a few predictions for each input row). Default is 500 trees. |
mtry | number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3). |
replace | sample cases with or without replacement (TRUE, i.e., with replacement, by default). |
classwt | priors of the classes. Need not add up to one. Ignored for regression. |
subset | index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action | function to specify the action to be taken if NA values are found in the data (for instance, na.omit to drop incomplete cases, or na.fail to stop with an error). |
response | a vector of factor (classification) or numeric (regression), or NULL (unsupervised classification). |
object | an mlRforest object |
newdata | a new dataset with the same conformation as the training set (same variables, except maybe the class for classification or the dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
type | the type of prediction to return. |
method | "direct" (the default) to predict with the classifier itself, "oob" to use the out-of-bag predictions, or "cv" to get cross-validated predictions (see cvpredict()). |
Value
ml_rforest()/mlRforest() creates an mlRforest, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().
See Also
mlearning(), cvpredict(), confusion(), also randomForest::randomForest() that actually does the classification.
Examples
# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA

iris_rf <- ml_rforest(data = iris_train, Species ~ .)
summary(iris_rf)
plot(iris_rf) # Useful to look at the effect of ntree=
# For such a relatively simple case, 50 trees are enough
iris_rf <- ml_rforest(data = iris_train, Species ~ ., ntree = 50)
summary(iris_rf)
predict(iris_rf) # Default type is class
predict(iris_rf, type = "membership")
predict(iris_rf, type = "both")
predict(iris_rf, type = "vote")
# Out-of-bag prediction (unbiased)
predict(iris_rf, method = "oob")
# Self-consistency (always very high for random forest, biased, do not use!)
confusion(iris_rf)
# This one is better
confusion(iris_rf, method = "oob") # Out-of-bag performances
# Cross-validation prediction is also a good choice when there is no test set
predict(iris_rf, method = "cv") # Idem: cvpredict(res)
# Cross-validation for performances estimation
confusion(iris_rf, method = "cv")
# Evaluation of performances using a separate test set
confusion(predict(iris_rf, newdata = iris_test), iris_test$Species)

# Regression using random forest (from ?randomForest)
set.seed(131) # Useful for reproducibility (use a different number each time)
ozone_rf <- ml_rforest(data = airquality, Ozone ~ ., mtry = 3,
  importance = TRUE, na.action = na.omit)
summary(ozone_rf)
# Show "importance" of variables: higher values mean more important variables
round(randomForest::importance(ozone_rf), 2)
plot(na.omit(airquality)$Ozone, predict(ozone_rf))
abline(a = 0, b = 1)

# Unsupervised classification using random forest (from ?randomForest)
set.seed(17)
iris_urf <- ml_rforest(train = iris[, -5]) # Use only quantitative data
summary(iris_urf)
randomForest::MDSplot(iris_urf, iris$Species)
plot(stats::hclust(stats::as.dist(1 - iris_urf$proximity),
  method = "average"), labels = iris$Species)

Supervised classification and regression using recursive partitioning
Description
Unified (formula-based) interface version of the recursive partitioning algorithm as implemented in rpart::rpart().
Usage
mlRpart(train, ...)

ml_rpart(train, ...)

## S3 method for class 'formula'
mlRpart(formula, data, ..., subset, na.action)

## Default S3 method:
mlRpart(train, response, ..., .args. = NULL)

## S3 method for class 'mlRpart'
predict(
  object,
  newdata,
  type = c("class", "membership", "both"),
  method = c("direct", "cv"),
  ...
)

Arguments
train | a matrix or data frame with predictors. |
... | further arguments passed to rpart::rpart() or its predict() method. |
formula | a formula with left term being the factor variable to predict (for supervised classification) or a vector of numbers (for regression), and the right term with the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the simplified y ~ . form. |
data | a data.frame to use as a training set. |
subset | index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action | function to specify the action to be taken if NA values are found in the data (for instance, na.omit to drop incomplete cases, or na.fail to stop with an error). |
response | a vector of factor (classification) or numeric (regression). |
.args. | used internally, do not provide anything here. |
object | an mlRpart object |
newdata | a new dataset with the same conformation as the training set (same variables, except maybe the class for classification or the dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
type | the type of prediction to return. |
method | "direct" (the default) to predict with the classifier itself, or "cv" to get cross-validated predictions (see cvpredict()). |
Value
ml_rpart()/mlRpart() creates an mlRpart, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().
See Also
mlearning(), cvpredict(), confusion(), also rpart::rpart() that actually does the classification.
Examples
# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA
iris_rpart <- ml_rpart(data = iris_train, Species ~ .)
summary(iris_rpart)
# Plot the decision tree for this classifier
plot(iris_rpart, margin = 0.03, uniform = TRUE)
text(iris_rpart, use.n = FALSE)
# Predictions
predict(iris_rpart) # Default type is class
predict(iris_rpart, type = "membership")
predict(iris_rpart, type = "both")
# Self-consistency, do not use for assessing classifier performances!
confusion(iris_rpart)
# Cross-validation prediction is a good choice when there is no test set
predict(iris_rpart, method = "cv") # Idem: cvpredict(res)
confusion(iris_rpart, method = "cv")
# Evaluation of performances using a separate test set
confusion(predict(iris_rpart, newdata = iris_test), iris_test$Species)

Supervised classification and regression using support vector machine
Description
Unified (formula-based) interface version of the support vector machine algorithm provided by e1071::svm().
Usage
mlSvm(train, ...)

ml_svm(train, ...)

## S3 method for class 'formula'
mlSvm(
  formula,
  data,
  scale = TRUE,
  type = NULL,
  kernel = "radial",
  classwt = NULL,
  ...,
  subset,
  na.action
)

## Default S3 method:
mlSvm(
  train,
  response,
  scale = TRUE,
  type = NULL,
  kernel = "radial",
  classwt = NULL,
  ...
)

## S3 method for class 'mlSvm'
predict(
  object,
  newdata,
  type = c("class", "membership", "both"),
  method = c("direct", "cv"),
  na.action = na.exclude,
  ...
)

Arguments
train | a matrix or data frame with predictors. |
... | further arguments passed to the classification or regression method. See e1071::svm(). |
formula | a formula with left term being the factor variable to predict (for supervised classification) or a vector of numbers (for regression), and the right term with the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the simplified y ~ . form. |
data | a data.frame to use as a training set. |
scale | are the variables scaled (so that mean = 0 and standard deviation = 1)? TRUE by default. |
type | for ml_svm(), the type of machine used by e1071::svm(); NULL (the default) lets it be chosen automatically depending on the response (classification or regression). For predict(), the type of prediction to return. |
kernel | the kernel used by svm, see e1071::svm() for further details. |
classwt | priors of the classes. Need not add up to one. |
subset | index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action | function to specify the action to be taken if NA values are found in the data (for instance, na.omit to drop incomplete cases, or na.fail to stop with an error). |
response | a vector of factor (classification) or numeric (regression). |
object | an mlSvm object |
newdata | a new dataset with the same conformation as the training set (same variables, except maybe the class for classification or the dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
method | "direct" (the default) to predict with the classifier itself, or "cv" to get cross-validated predictions (see cvpredict()). |
Value
ml_svm()/mlSvm() creates an mlSvm, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().
See Also
mlearning(), cvpredict(), confusion(), also e1071::svm() that actually does the calculation.
Examples
# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA
iris_svm <- ml_svm(data = iris_train, Species ~ .)
summary(iris_svm)
predict(iris_svm) # Default type is class
predict(iris_svm, type = "membership")
predict(iris_svm, type = "both")
# Self-consistency, do not use for assessing classifier performances!
confusion(iris_svm)
# Use an independent test set instead
confusion(predict(iris_svm, newdata = iris_test), iris_test$Species)

# Another dataset
data("HouseVotes84", package = "mlbench")
house_svm <- ml_svm(data = HouseVotes84, Class ~ ., na.action = na.omit)
summary(house_svm)
# Cross-validated confusion matrix
confusion(cvpredict(house_svm), na.omit(HouseVotes84)$Class)

# Regression using support vector machine
data(airquality, package = "datasets")
ozone_svm <- ml_svm(data = airquality, Ozone ~ ., na.action = na.omit)
summary(ozone_svm)
plot(na.omit(airquality)$Ozone, predict(ozone_svm))
abline(a = 0, b = 1)

Machine learning model for (un)supervised classification or regression
Description
An mlearning object provides a unified (formula-based) interface to several machine learning algorithms. They share the same interface and very similar arguments. They conform to the formula-based approach of, say, stats::lm() in base R, but with a coherent handling of missing data and missing class levels. An optimized version exists for the simplified y ~ . formula. Finally, cross-validation is also built in.
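A minimal sketch of that shared interface (assuming the mlbench package provides the dataset, as in the examples below): the same formula, data and cross-validation calls work unchanged across algorithms.

library(mlearning)
data("HouseVotes84", package = "mlbench")
# Identical calls, only the ml_*() function changes
house_nb <- ml_naive_bayes(data = HouseVotes84, Class ~ ., na.action = na.omit)
house_rf <- ml_rforest(data = HouseVotes84, Class ~ ., na.action = na.omit)
# Cross-validated confusion matrices, built the same way for both
confusion(cvpredict(house_nb), na.omit(HouseVotes84)$Class)
confusion(cvpredict(house_rf), na.omit(HouseVotes84)$Class)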
Usage
mlearning(
  formula,
  data,
  method,
  model.args,
  call = match.call(),
  ...,
  subset,
  na.action = na.fail
)

## S3 method for class 'mlearning'
print(x, ...)

## S3 method for class 'mlearning'
summary(object, ...)

## S3 method for class 'summary.mlearning'
print(x, ...)

## S3 method for class 'mlearning'
plot(x, y, ...)

## S3 method for class 'mlearning'
predict(
  object,
  newdata,
  type = c("class", "membership", "both"),
  method = c("direct", "cv"),
  na.action = na.exclude,
  ...
)

cvpredict(object, ...)

## S3 method for class 'mlearning'
cvpredict(
  object,
  type = c("class", "membership", "both"),
  cv.k = 10,
  cv.strat = TRUE,
  ...
)

Arguments
formula | a formula with left term being the factor variable to predict (for supervised classification), a vector of numbers (for regression) or nothing (for unsupervised classification), and the right term with the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the simplified y ~ . form. |
data | a data.frame to use as a training set. |
method | for mlearning(), the name of the underlying function that actually does the calculation (used internally by the ml_*() functions, not by the end-user); for predict(), "direct" (the default) or "cv" for cross-validated predictions (see cvpredict()). |
model.args | arguments for formula modeling with substituted data and subset... Not to be used by the end-user. |
call | the function call. Not to be used by the end-user. |
... | further arguments (depends on the method). |
subset | index vector with the cases to define the training set in use (this argument must be named, if provided). |
na.action | function to specify the action to be taken if NA values are found in the data (na.fail, the default for mlearning(), stops with an error; na.omit drops incomplete cases). |
x, object | an mlearning object |
y | a second mlearning object or nothing (not used in several plots) |
newdata | a new dataset with the same conformation as the training set (same variables, except maybe the class for classification or the dependent variable for regression). Usually a test set, or a new dataset to be predicted. |
type | the type of prediction to return. |
cv.k | k for k-fold cross-validation, cf ipred::errorest(). Default is 10. |
cv.strat | is the subsampling stratified or not in cross-validation, cf ipred::errorest()? Default is TRUE. |
Value
an mlearning object for mlearning(). Methods return their own results that can be an mlearning, data.frame, vector, etc.
See Also
ml_lda(), ml_qda(), ml_naive_bayes(), ml_nnet(), ml_rpart(), ml_rforest(), ml_svm(), confusion() and prior(). Also ipred::errorest() that internally computes the cross-validation in cvpredict().
Examples
# mlearning() should not be called directly. Use the mlXXX() functions instead
# for instance, for Random Forest, use ml_rforest()/mlRforest()
# A typical classification involves several steps:
#
# 1) Prepare data: split into training set (2/3) and test set (1/3)
# Data cleaning (elimination of unwanted variables), transformation of
# others (scaling, log, ratios, numeric to factor, ...) may be necessary
# here. Apply the same treatments on the training and test sets
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133) # Also random or stratified sampling
iris_train <- iris[train, ]
iris_test <- iris[-train, ]

# 2) Train the classifier, use of the simplified formula class ~ . encouraged
# so, you may have to prepare the train/test sets to keep only relevant
# variables and to possibly transform them before use
iris_rf <- ml_rforest(data = iris_train, Species ~ .)
iris_rf
summary(iris_rf)
train(iris_rf)
response(iris_rf)

# 3) Find optimal values for the parameters of the model
# This is usually done iteratively. Just an example with ntree where a plot
# exists to help finding optimal value
plot(iris_rf)
# For such a relatively simple case, 50 trees are enough, retrain with it
iris_rf <- ml_rforest(data = iris_train, Species ~ ., ntree = 50)
summary(iris_rf)

# 4) Study the classifier performances. Several metrics and tools exist
# like ROC curves, AUC, etc. Tools provided here are the confusion matrix
# and the metrics that are calculated on it.
predict(iris_rf) # Default type is class
predict(iris_rf, type = "membership")
predict(iris_rf, type = "both")
# Confusion matrix and metrics using 10-fold cross-validation
iris_rf_conf <- confusion(iris_rf, method = "cv")
iris_rf_conf
summary(iris_rf_conf)
# Note you may want to manipulate priors too, see ?prior

# 5) Go back to step #1 and refine the process until you are happy with the
# results. Then, you can use the classifier to predict unknown items.

Plot a confusion matrix
Description
Several graphical representations of confusion objects are possible: an image of the matrix with colored squares, a barplot comparing recall and precision, a stars plot also comparing two metrics, possibly also comparing two different classifiers of the same dataset, or a dendrogram grouping the classes relative to the errors observed in the confusion matrix (classes with more errors are pooled together more rapidly).
Usage
## S3 method for class 'confusion'
plot(
  x,
  y = NULL,
  type = c("image", "barplot", "stars", "dendrogram"),
  stat1 = "Recall",
  stat2 = "Precision",
  names,
  ...
)

confusion_image(
  x,
  y = NULL,
  labels = names(dimnames(x)),
  sort = "ward.D2",
  numbers = TRUE,
  digits = 0,
  mar = c(3.1, 10.1, 3.1, 3.1),
  cex = 1,
  asp = 1,
  colfun,
  ncols = 41,
  col0 = FALSE,
  grid.col = "gray",
  ...
)

confusionImage(
  x,
  y = NULL,
  labels = names(dimnames(x)),
  sort = "ward.D2",
  numbers = TRUE,
  digits = 0,
  mar = c(3.1, 10.1, 3.1, 3.1),
  cex = 1,
  asp = 1,
  colfun,
  ncols = 41,
  col0 = FALSE,
  grid.col = "gray",
  ...
)

confusion_barplot(
  x,
  y = NULL,
  col = c("PeachPuff2", "green3", "lemonChiffon2"),
  mar = c(1.1, 8.1, 4.1, 2.1),
  cex = 1,
  cex.axis = cex,
  cex.legend = cex,
  main = "F-score (precision versus recall)",
  numbers = TRUE,
  min.width = 17,
  ...
)

confusionBarplot(
  x,
  y = NULL,
  col = c("PeachPuff2", "green3", "lemonChiffon2"),
  mar = c(1.1, 8.1, 4.1, 2.1),
  cex = 1,
  cex.axis = cex,
  cex.legend = cex,
  main = "F-score (precision versus recall)",
  numbers = TRUE,
  min.width = 17,
  ...
)

confusion_stars(
  x,
  y = NULL,
  stat1 = "Recall",
  stat2 = "Precision",
  names,
  main,
  col = c("green2", "blue2", "green4", "blue4"),
  ...
)

confusionStars(
  x,
  y = NULL,
  stat1 = "Recall",
  stat2 = "Precision",
  names,
  main,
  col = c("green2", "blue2", "green4", "blue4"),
  ...
)

confusion_dendrogram(
  x,
  y = NULL,
  labels = rownames(x),
  sort = "ward.D2",
  main = "Groups clustering",
  ...
)

confusionDendrogram(
  x,
  y = NULL,
  labels = rownames(x),
  sort = "ward.D2",
  main = "Groups clustering",
  ...
)

Arguments
x | a confusion object |
y | NULL (the default) or a second confusion object, to compare two classifiers on the same dataset (used by the "stars" type, see the examples). |
type | the kind of plot to produce: "image" (the default), "barplot", "stars" or "dendrogram". |
stat1 | the first metric to plot for the "stars" type ("Recall" by default). |
stat2 | the second metric to plot for the "stars" type ("Precision" by default). |
names | names of the two classifiers to compare |
... | further arguments passed to the function; they can be any of the arguments of the corresponding plot. |
labels | labels to use for the two classifications. By default, they are the same as the names of the dimensions of x. |
sort | are rows and columns of the confusion matrix sorted so that classes with larger confusion are closer together? Sorting is done using a hierarchical clustering with hclust(), with "ward.D2" as the default clustering method. |
numbers | are actual numbers indicated in the confusion matrix image? |
digits | the number of digits after the decimal point to print in the confusion matrix. The default (zero) leads to the most compact presentation and is suitable for frequencies, but not for relative frequencies. |
mar | graph margins. |
cex | text magnification factor. |
asp | graph aspect ratio. There is little reason to change the default value of 1. |
colfun | a function that calculates a series of colors (like, e.g., grDevices::cm.colors()), accepting a single argument giving the number of colors to generate. |
ncols | the number of colors to generate. It should preferably be 2 * number of levels + 1, where levels is the number of frequencies you want to evidence in the plot. Defaults to 41. |
col0 | should null values be colored or not (no, by default)? |
grid.col | color to use for grid lines, or NULL for no grid. |
col | color(s) to use for the plot. |
cex.axis | idem for the axes. If NULL, no axis is drawn. |
cex.legend | idem for the legend text. If NULL, no legend is added. |
main | main title of the plot. |
min.width | minimum bar width required to add numbers. |
Value
Data calculated to create the plots are returned invisibly. These functions are mostly used for their side effect of producing a plot.
Examples
data("Glass", package = "mlbench")# Use a little bit more informative labels for TypeGlass$Type <- as.factor(paste("Glass", Glass$Type))# Use learning vector quantization to classify the glass types# (using default parameters)summary(glass_lvq <- ml_lvq(Type ~ ., data = Glass))# Calculate cross-validated confusion matrix and plot it in different ways(glass_conf <- confusion(cvpredict(glass_lvq), Glass$Type))# Raw confusion matrix: no sort and no marginsprint(glass_conf, sums = FALSE, sort = FALSE)# Plotsplot(glass_conf) # Image by defaultplot(glass_conf, sort = FALSE) # No sortingplot(glass_conf, type = "barplot")plot(glass_conf, type = "stars")plot(glass_conf, type = "dendrogram")# Build another classifier and make a comparisonsummary(glass_naive_bayes <- ml_naive_bayes(Type ~ ., data = Glass))(glass_conf2 <- confusion(cvpredict(glass_naive_bayes), Glass$Type))# Comparison plot for two classifiersplot(glass_conf, glass_conf2)Get or set priors on a confusion matrix
Description
Most metrics in supervised classifications are sensitive to the relative proportion of the items in the different classes. When a confusion matrix is calculated on a test set, it uses the proportions observed on that test set. If they are representative of the proportions in the population, metrics are not biased. When it is not the case, priors of a confusion object can be adjusted to better reflect proportions that are supposed to be observed in the different classes, in order to get more accurate metrics.
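For example (a minimal sketch reusing the Glass example from below; the priors are hypothetical population frequencies):

library(mlearning)
data("Glass", package = "mlbench")
Glass$Type <- as.factor(paste("Glass", Glass$Type))
glass_lvq <- ml_lvq(Type ~ ., data = Glass)
glass_conf <- confusion(cvpredict(glass_lvq), Glass$Type)
# Hypothetical class frequencies in the population (one value per class)
prior(glass_conf) <- c(10, 10, 10, 100, 100, 100)
summary(glass_conf, type = "Fscore") # Metrics now reflect these priors
prior(glass_conf) <- NULL            # Reset to the observed proportions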
Usage
prior(object, ...)

## S3 method for class 'confusion'
prior(object, ...)

prior(object, ...) <- value

## S3 replacement method for class 'confusion'
prior(object, ...) <- value

Arguments
object | a confusion object (or another class if a method is implemented) |
... | further arguments passed to methods |
value | a (named) vector of positive numbers or zeros, of the same length as the number of classes in the confusion object. It can also be a single number >= 0 and, in this case, equal probabilities are applied to all the classes (use 1 for relative frequencies and 100 for relative frequencies in percent). If the value has zero length or is NULL, the original proportions (as tabulated in the confusion matrix) are restored. |
Value
prior() returns the current class frequencies associated with the first classification tabulated in the confusion object, i.e., for rows in the confusion matrix.
See Also
Examples
data("Glass", package = "mlbench")# Use a little bit more informative labels for TypeGlass$Type <- as.factor(paste("Glass", Glass$Type))# Use learning vector quantization to classify the glass types# (using default parameters)summary(glass_lvq <- ml_lvq(Type ~ ., data = Glass))# Calculate cross-validated confusion matrix(glass_conf <- confusion(cvpredict(glass_lvq), Glass$Type))# When the probabilities in each class do not match the proportions in the# training set, all these calculations are useless. Having an idea of# the real proportions (so-called, priors), one should first reweight the# confusion matrix before calculating statistics, for instance:prior1 <- c(10, 10, 10, 100, 100, 100) # Glass types 1-3 are rareprior(glass_conf) <- prior1glass_confsummary(glass_conf, type = c("Fscore", "Recall", "Precision"))# This is very different than if glass types 1-3 are abundants!prior2 <- c(100, 100, 100, 10, 10, 10) # Glass types 1-3 are abundantsprior(glass_conf) <- prior2glass_confsummary(glass_conf, type = c("Fscore", "Recall", "Precision"))# Weight can also be used to construct a matrix of relative frequencies# In this case, all rows sum to oneprior(glass_conf) <- 1print(glass_conf, digits = 2)# However, it is easier to work with relative frequencies in percent# and one gets a more compact presentationprior(glass_conf) <- 100glass_conf# To reset row class frequencies to original propotions, just assign NULLprior(glass_conf) <- NULLglass_confprior(glass_conf)Get the response variable for a mlearning object
Description
The response is either the class to be predicted for a classification problem (and it is a factor), or the dependent variable in a regression model (and it is numeric in that case). For unsupervised classification, the response is not provided and should return NULL.
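A minimal sketch contrasting both situations (classification versus regression), reusing datasets from the examples elsewhere on this page:

library(mlearning)
data("iris", package = "datasets")
iris_lda <- ml_lda(data = iris, Species ~ .)
head(response(iris_lda))  # A factor: the classes to predict
data("airquality", package = "datasets")
ozone_svm <- ml_svm(data = airquality, Ozone ~ ., na.action = na.omit)
head(response(ozone_svm)) # A numeric vector: the dependent variable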
Usage
response(object, ...)

## Default S3 method:
response(object, ...)

Arguments
object | an object having a response variable. |
... | further parameter (depends on the method). |
Value
The response variable of the training set, or NULL for unsupervised classification.
See Also
mlearning(), train(), confusion()
Examples
data("HouseVotes84", package = "mlbench")house_rf <- ml_rforest(data = HouseVotes84, Class ~ .)house_rfresponse(house_rf)Get the training variable for a mlearning object
Description
The training variables (train) are the variables used to train a classifier, except the prediction (class or dependent variable).
Usage
train(object, ...)

## Default S3 method:
train(object, ...)

Arguments
object | an object having a train attribute. |
... | further parameter (depends on the method). |
Value
A data frame containing the training variables of the model.
See Also
mlearning(), response(), confusion()
Examples
data("HouseVotes84", package = "mlbench")house_rf <- ml_rforest(data = HouseVotes84, Class ~ .)house_rftrain(house_rf)