Type: Package
Version: 1.2.1
Title: Machine Learning Algorithms with Unified Interface and Confusion Matrices
Description: A unified interface is provided to various machine learning algorithms like linear or quadratic discriminant analysis, k-nearest neighbors, random forest, support vector machine, ... It allows one to train, test, and apply cross-validation using similar functions and function arguments with a minimalist and clean, formula-based interface. Missing data are processed the same way as base and stats R functions for all algorithms, both in training and testing. Confusion matrices are also provided with a rich set of metrics calculated and a few specific plots.
Maintainer: Philippe Grosjean <phgrosjean@sciviews.org>
Depends: R (≥ 3.0.4)
Imports: stats, grDevices, class, nnet, MASS, e1071, randomForest, ipred, rpart
Suggests: mlbench, datasets, RColorBrewer, spelling, knitr, rmarkdown, covr
URL: https://www.sciviews.org/mlearning/
BugReports: https://github.com/SciViews/mlearning/issues
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
RoxygenNote: 7.2.3
Config/testthat/edition: 3
Encoding: UTF-8
Language: en-US
NeedsCompilation: no
Packaged: 2023-08-30 18:46:17 UTC; phgrosjean
Author: Philippe Grosjean [aut, cre] (ORCID iD), Kevin Denis [aut]
Repository: CRAN
Date/Publication: 2023-08-30 19:10:02 UTC

Machine Learning Algorithms with Unified Interface and Confusion Matrices

Description

This package provides wrappers around several existing machine learning algorithms in R, under a unified user interface. Confusion matrices can also be calculated and viewed as tables or plots. Key features are: a unified, formula-based interface for training, testing and cross-validating the various algorithms; missing data processed the same way as base and stats R functions, both in training and testing; and confusion matrices with a rich set of calculated metrics and a few specific plots.

See mlearning() for further explanations and an example analysis. See mlLda() for examples of the different forms of the formula that can be used. See plot.confusion() for the different ways to explore the confusion matrix.
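
A minimal end-to-end sketch (assuming mlearning and its suggested datasets are installed):

library(mlearning)
data("iris", package = "datasets")
# Train a classifier through the unified formula interface
iris_lda <- ml_lda(data = iris, Species ~ .)
# Cross-validated predictions and the corresponding confusion matrix
iris_conf <- confusion(cvpredict(iris_lda), iris$Species)
summary(iris_conf)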

Important functions

The package exposes ml_lda(), ml_qda(), ml_naive_bayes(), ml_knn(), ml_lvq(), ml_nnet(), ml_rpart(), ml_rforest() and ml_svm() to train the various classifiers or regressors through the same formula interface, predict() and cvpredict() for direct or cross-validated predictions, confusion() and its summary() method for confusion matrices and their metrics, prior() to inspect or change class frequencies, and plot() methods for mlearning and confusion objects.


Construct and analyze confusion matrices

Description

Confusion matrices compare two classifications (usually one done automatically by a machine learning algorithm versus the true classification done by a specialist... but one can also compare two automatic or two manual classifications against each other).
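
For instance (a minimal sketch with two small hypothetical factors sharing the same levels):

# Compare an automatic and a manual classification of the same six items
manual <- factor(c("A", "A", "B", "B", "C", "C"))
auto <- factor(c("A", "B", "B", "B", "C", "A"))
confusion(auto, manual)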

Usage

confusion(x, ...)

## Default S3 method:
confusion(
  x,
  y = NULL,
  vars = c("Actual", "Predicted"),
  labels = vars,
  merge.by = "Id",
  useNA = "ifany",
  prior,
  ...
)

## S3 method for class 'mlearning'
confusion(
  x,
  y = response(x),
  labels = c("Actual", "Predicted"),
  useNA = "ifany",
  prior,
  ...
)

## S3 method for class 'confusion'
print(x, sums = TRUE, error.col = sums, digits = 0, sort = "ward.D2", ...)

## S3 method for class 'confusion'
summary(object, type = "all", sort.by = "Fscore", decreasing = TRUE, ...)

## S3 method for class 'summary.confusion'
print(x, ...)

Arguments

x

an object with a confusion() method implemented.

...

further arguments passed to the method.

y

another object, from which to extract the second classification, or NULL if not used.

vars

the variables of interest in the first and second classification in the case the objects are lists or data frames. Otherwise, this argument is ignored and x and y must be factors with the same length and the same levels.

labels

labels to use for the two classifications. By default, they are the same as vars, or the ones in the confusion matrix.

merge.by

a character string with the name of variables to use to merge the two data frames, or NULL.

useNA

do we keep NAs as a separate category? The default "ifany" creates this category only if there are missing values. Other possibilities are "no", or "always".

prior

class frequencies to use for the first classifier, which is tabulated in the rows of the confusion matrix. For its value, see the value= argument below.

sums

is the confusion matrix printed with row and column sums?

error.col

is a column with the class error for the first classifier added (equivalent to the false negative rate, FNR)?

digits

the number of digits after the decimal point to print in the confusion matrix. The default of zero leads to the most compact presentation and is suitable for frequencies, but not for relative frequencies.

sort

are rows and columns of the confusion matrix sorted so that classes with larger confusion are closer together? Sorting is done using hierarchical clustering with hclust(). The clustering method is "ward.D2" by default, but see the hclust() help for other options. If FALSE or NULL, no sorting is done.

object

a confusion object.

type

either "all" (by default), or, considering TP is the true positives, FP is the false positives, TN is the true negatives and FN is the false negatives, one can also specify: "Fscore" (F-score = F-measure = F1 score = harmonic mean of precision and recall), "Recall" (TP / (TP + FN) = 1 - FNR), "Precision" (TP / (TP + FP) = 1 - FDR), "Specificity" (TN / (TN + FP) = 1 - FPR), "NPV" (negative predicted value = TN / (TN + FN) = 1 - FOR), "FPR" (false positive rate = 1 - specificity = FP / (FP + TN)), "FNR" (false negative rate = 1 - recall = FN / (TP + FN)), "FDR" (false discovery rate = 1 - precision = FP / (TP + FP)), "FOR" (false omission rate = 1 - NPV = FN / (FN + TN)), "LRPT" (likelihood ratio for positive tests = recall / FPR = recall / (1 - specificity)), "LRNT" (likelihood ratio for negative tests = FNR / specificity = (1 - recall) / specificity), "LRPS" (likelihood ratio for positive subjects = precision / FOR = precision / (1 - NPV)), "LRNS" (likelihood ratio for negative subjects = FDR / NPV = (1 - precision) / (1 - FOR)), "BalAcc" (balanced accuracy = (sensitivity + specificity) / 2), "MCC" (Matthews correlation coefficient), "Chisq" (chi-squared metric), or "Bray" (Bray-Curtis metric). A quick check of some of these metrics is sketched after this list.

sort.by

the statistic to use to sort the table (by default, "Fscore", the F1 score for each class = 2 * recall * precision / (recall + precision)).

decreasing

do we sort in increasing or decreasing order?
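
As announced above, a quick check of some of these formulas (a minimal sketch with two tiny hypothetical factors):

pred <- factor(c("a", "a", "b", "b", "b"))
ref <- factor(c("a", "b", "b", "b", "a"))
conf <- confusion(pred, ref)
summary(conf, type = "Recall")    # Recall = TP / (TP + FN), per class
summary(conf, type = "Precision") # Precision = TP / (TP + FP), per class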

Value

A confusion matrix in a confusion object.

See Also

mlearning(), plot.confusion(), prior()

Examples

data("Glass", package = "mlbench")# Use a little bit more informative labels for TypeGlass$Type <- as.factor(paste("Glass", Glass$Type))# Use learning vector quantization to classify the glass types# (using default parameters)summary(glass_lvq <- ml_lvq(Type ~ ., data = Glass))# Calculate cross-validated confusion matrix(glass_conf <- confusion(cvpredict(glass_lvq), Glass$Type))# Raw confusion matrix: no sort and no marginsprint(glass_conf, sums = FALSE, sort = FALSE)summary(glass_conf)summary(glass_conf, type = "Fscore")

Supervised classification using k-nearest neighbor

Description

Unified (formula-based) interface version of the k-nearest neighbor algorithm provided by class::knn().

Usage

mlKnn(train, ...)

ml_knn(train, ...)

## S3 method for class 'formula'
mlKnn(formula, data, k.nn = 5, ..., subset, na.action)

## Default S3 method:
mlKnn(train, response, k.nn = 5, ...)

## S3 method for class 'mlKnn'
summary(object, ...)

## S3 method for class 'summary.mlKnn'
print(x, ...)

## S3 method for class 'mlKnn'
predict(
  object,
  newdata,
  type = c("class", "prob", "both"),
  method = c("direct", "cv"),
  na.action = na.exclude,
  ...
)

Arguments

train

a matrix or data frame with predictors.

...

further arguments passed to the classification method or its predict() method (not used here for now).

formula

a formula with the left term being the factor variable to predict and the right term being the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the class ~ . short version (which is strongly encouraged). Variables with a minus sign are eliminated. Calculations on variables are possible according to the usual formula convention (possibly protected by using I()).

data

a data.frame to use as a training set.

k.nn

the k used for k-NN, that is, the number of neighbors considered. Default is 5.

subset

index vector with the cases to define the training set in use (this argument must be named, if provided).

na.action

function to specify the action to be taken if NAs are found. For ml_knn(), na.fail is used by default: the calculation is stopped if there is any NA in the data. Another option is na.omit, where cases with missing values on any required variable are dropped (this argument must be named, if provided). For the predict() method, the default and most suitable option is na.exclude. In that case, rows with NAs in newdata= are excluded from prediction, but reinjected in the final results so that the number of items is still the same (and in the same order as newdata=). A short sketch of these options follows the arguments list.

response

a vector of factors for the classification.

x,object

an mlKnn object

newdata

a new dataset with the same conformation as the training set (same variables, except maybe the class for classification, or the dependent variable for regression). Usually a test set, or a new dataset to be predicted.

type

the type of prediction to return. "class" by default, the predicted classes. Other options are "prob", the "probability" for the different classes as assessed by the number of neighbors of these classes, or "both" to return classes and "probabilities".

method

"direct" (default) or "cv". "direct" predicts new cases in newdata= if this argument is provided, or the cases in the training set if not. Take care that not providing newdata= means that you just calculate the self-consistency of the classifier, but you cannot use the metrics derived from these results for the assessment of its performances. Either use a different dataset in newdata= or use the alternate cross-validation ("cv") technique. If you specify method = "cv", then cvpredict() is used and you cannot provide newdata= in that case.

Value

ml_knn()/mlKnn() creates an mlKnn, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it, like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().
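
For instance (a minimal sketch), to list the components stored in such an object:

data("iris", package = "datasets")
iris_knn <- ml_knn(data = iris, Species ~ .)
# The classifier is a classed list; unclass() reveals its components
names(unclass(iris_knn))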

See Also

mlearning(), cvpredict(), confusion(), also class::knn() and ipred::predict.ipredknn() that actually do the classification.

Examples

# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA

iris_knn <- ml_knn(data = iris_train, Species ~ .)
summary(iris_knn)
predict(iris_knn) # This object only returns classes

# Self-consistency, do not use for assessing classifier performances!
confusion(iris_knn)
# Use an independent test set instead
confusion(predict(iris_knn, newdata = iris_test), iris_test$Species)

Supervised classification using linear discriminant analysis

Description

Unified (formula-based) interface version of the linear discriminant analysis algorithm provided by MASS::lda().

Usage

mlLda(train, ...)

ml_lda(train, ...)

## S3 method for class 'formula'
mlLda(formula, data, ..., subset, na.action)

## Default S3 method:
mlLda(train, response, ...)

## S3 method for class 'mlLda'
predict(
  object,
  newdata,
  type = c("class", "membership", "both", "projection"),
  prior = object$prior,
  dimension = NULL,
  method = c("plug-in", "predictive", "debiased", "cv"),
  ...
)

Arguments

train

a matrix or data frame with predictors.

...

further arguments passed to MASS::lda() or its predict() method (see the corresponding help page).

formula

a formula with the left term being the factor variable to predict and the right term being the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the class ~ . short version (which is strongly encouraged). Variables with a minus sign are eliminated. Calculations on variables are possible according to the usual formula convention (possibly protected by using I()).

data

a data.frame to use as a training set.

subset

index vector with the cases to define the training set in use (this argument must be named, if provided).

na.action

function to specify the action to be taken if NAs are found. For ml_lda(), na.fail is used by default: the calculation is stopped if there is any NA in the data. Another option is na.omit, where cases with missing values on any required variable are dropped (this argument must be named, if provided). For the predict() method, the default and most suitable option is na.exclude. In that case, rows with NAs in newdata= are excluded from prediction, but reinjected in the final results so that the number of items is still the same (and in the same order as newdata=).

response

a vector of factors for the classification.

object

an mlLda object

newdata

a new dataset with the same conformation as the training set (same variables, except maybe the class for classification, or the dependent variable for regression). Usually a test set, or a new dataset to be predicted.

type

the type of prediction to return. "class" by default, the predicted classes. Other options are "membership", the membership (a number between 0 and 1) to the different classes, or "both" to return classes and memberships. The type = "projection" returns a projection of the individuals in the plane represented by the dimension= discriminant components.

prior

the prior probabilities of class membership. By default, the priors are obtained from the object and, if they were not changed, correspond to the proportions observed in the training set.

dimension

the number of dimensions of the predictive space to use. If NULL (the default), a reasonable value is used. If this is less than min(p, ng - 1), only the first dimension discriminant components are used (except for method = "predictive"), and only those dimensions are returned in x. A sketch follows the arguments list.

method

"plug-in", "predictive", "debiased", or "cv". With "plug-in" (the default), the usual unbiased parameter estimates are used. With "predictive", the parameters are integrated out using a vague prior. With "debiased", an unbiased estimator of the log posterior probabilities is used. With "cv", cross-validation is used instead. If you specify method = "cv", then cvpredict() is used and you cannot provide newdata= in that case.

Value

ml_lda()/mlLda() creates an mlLda, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it, like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().

See Also

mlearning(), cvpredict(), confusion(), also MASS::lda() that actually does the classification.

Examples

# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA

iris_lda <- ml_lda(data = iris_train, Species ~ .)
iris_lda
summary(iris_lda)
plot(iris_lda, col = as.numeric(response(iris_lda)) + 1)
# Prediction using a test set
predict(iris_lda, newdata = iris_test) # class (default type)
predict(iris_lda, type = "membership") # posterior probability
predict(iris_lda, type = "both") # both class and membership in a list
# Type projection
predict(iris_lda, type = "projection") # Projection on the LD axes
# Add test set items to the previous plot
points(predict(iris_lda, newdata = iris_test, type = "projection"),
  col = as.numeric(predict(iris_lda, newdata = iris_test)) + 1, pch = 19)

# predict() and confusion() should be used on a separate test set
# for unbiased estimation (or using cross-validation, bootstrap, ...)
# Wrong, cf. biased estimation (so-called, self-consistency)
confusion(iris_lda)
# Estimation using a separate test set
confusion(predict(iris_lda, newdata = iris_test), iris_test$Species)

# Another dataset (binary predictor... not optimal for lda, just for test)
data("HouseVotes84", package = "mlbench")
house_lda <- ml_lda(data = HouseVotes84, na.action = na.omit, Class ~ .)
summary(house_lda)
confusion(house_lda) # Self-consistency (biased metrics)
print(confusion(house_lda), error.col = FALSE) # Without error column

# More complex formulas
# Exclude one or more variables
iris_lda2 <- ml_lda(data = iris, Species ~ . - Sepal.Width)
summary(iris_lda2)
# With calculation
iris_lda3 <- ml_lda(data = iris, Species ~ log(Petal.Length) +
  log(Petal.Width) + I(Petal.Length/Sepal.Length))
summary(iris_lda3)

# Factor levels with missing items are allowed
ir2 <- iris[-(51:100), ] # No Iris versicolor in the training set
iris_lda4 <- ml_lda(data = ir2, Species ~ .)
summary(iris_lda4) # missing class
# Missing levels are reinjected in class or membership by predict()
predict(iris_lda4, type = "both")
# ... but, of course, the classifier is wrong for Iris versicolor
confusion(predict(iris_lda4, newdata = iris), iris$Species)

# Simpler interface, but more memory-effective
iris_lda5 <- ml_lda(train = iris[, -5], response = iris$Species)
summary(iris_lda5)

Supervised classification using learning vector quantization

Description

Unified (formula-based) interface version of the learning vector quantization algorithms provided by class::olvq1(), class::lvq1(), class::lvq2(), and class::lvq3().

Usage

mlLvq(train, ...)

ml_lvq(train, ...)

## S3 method for class 'formula'
mlLvq(
  formula,
  data,
  k.nn = 5,
  size,
  prior,
  algorithm = "olvq1",
  ...,
  subset,
  na.action
)

## Default S3 method:
mlLvq(train, response, k.nn = 5, size, prior, algorithm = "olvq1", ...)

## S3 method for class 'mlLvq'
summary(object, ...)

## S3 method for class 'summary.mlLvq'
print(x, ...)

## S3 method for class 'mlLvq'
predict(
  object,
  newdata,
  type = "class",
  method = c("direct", "cv"),
  na.action = na.exclude,
  ...
)

Arguments

train

a matrix or data frame with predictors.

...

further arguments passed to the classification method or its predict() method (not used here for now).

formula

a formula with the left term being the factor variable to predict and the right term being the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the class ~ . short version (which is strongly encouraged). Variables with a minus sign are eliminated. Calculations on variables are possible according to the usual formula convention (possibly protected by using I()).

data

a data.frame to use as a training set.

k.nn

the k used for k-NN, that is, the number of neighbors considered. Default is 5.

size

the size of the codebook. Defaults to min(round(0.4 * nc * (nc - 1 + p/2), 0), n), where nc is the number of classes. A worked sketch of this default follows the arguments list.

prior

probabilities to represent classes in the codebook (default values are the proportions in the training set).

algorithm

"olvq1" (by default, the optimized 'lvq1' version), "lvq1", "lvq2", or "lvq3".

subset

index vector with the cases to define the training set in use (this argument must be named, if provided).

na.action

function to specify the action to be taken if NAs are found. For ml_lvq(), na.fail is used by default: the calculation is stopped if there is any NA in the data. Another option is na.omit, where cases with missing values on any required variable are dropped (this argument must be named, if provided). For the predict() method, the default and most suitable option is na.exclude. In that case, rows with NAs in newdata= are excluded from prediction, but reinjected in the final results so that the number of items is still the same (and in the same order as newdata=).

response

a vector of factors with the classes.

x,object

an mlLvq object

newdata

a new dataset with the same conformation as the training set (same variables, except maybe the class for classification, or the dependent variable for regression). Usually a test set, or a new dataset to be predicted.

type

the type of prediction to return. For this method, only "class" is accepted, and it is the default. It returns the predicted classes.

method

"direct" (default) or "cv". "direct" predicts new cases in newdata= if this argument is provided, or the cases in the training set if not. Take care that not providing newdata= means that you just calculate the self-consistency of the classifier, but you cannot use the metrics derived from these results for the assessment of its performances. Either use a different dataset in newdata= or use the alternate cross-validation ("cv") technique. If you specify method = "cv", then cvpredict() is used and you cannot provide newdata= in that case.

Value

ml_lvq()/mlLvq() creates an mlLvq, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it, like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().

See Also

mlearning(), cvpredict(), confusion(), also class::olvq1(), class::lvq1(), class::lvq2(), and class::lvq3() that actually do the classification.

Examples

# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA

iris_lvq <- ml_lvq(data = iris_train, Species ~ .)
summary(iris_lvq)
predict(iris_lvq) # This object only returns classes

# Self-consistency, do not use for assessing classifier performances!
confusion(iris_lvq)
# Use an independent test set instead
confusion(predict(iris_lvq, newdata = iris_test), iris_test$Species)

Supervised classification using naive Bayes

Description

Unified (formula-based) interface version of the naive Bayes algorithm provided by e1071::naiveBayes().

Usage

mlNaiveBayes(train, ...)

ml_naive_bayes(train, ...)

## S3 method for class 'formula'
mlNaiveBayes(formula, data, laplace = 0, ..., subset, na.action)

## Default S3 method:
mlNaiveBayes(train, response, laplace = 0, ...)

## S3 method for class 'mlNaiveBayes'
predict(
  object,
  newdata,
  type = c("class", "membership", "both"),
  method = c("direct", "cv"),
  na.action = na.exclude,
  threshold = 0.001,
  eps = 0,
  ...
)

Arguments

train

a matrix or data frame with predictors.

...

further arguments passed to the classification method or its predict() method (not used here for now).

formula

a formula with the left term being the factor variable to predict and the right term being the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the class ~ . short version (which is strongly encouraged). Variables with a minus sign are eliminated. Calculations on variables are possible according to the usual formula convention (possibly protected by using I()).

data

a data.frame to use as a training set.

laplace

positive number controlling Laplace smoothing for the naive Bayes classifier. The default (0) disables Laplace smoothing. A short sketch follows the arguments list.

subset

index vector with the cases to define the training set in use (this argument must be named, if provided).

na.action

function to specify the action to be taken if NAs are found. For ml_naive_bayes(), na.fail is used by default: the calculation is stopped if there is any NA in the data. Another option is na.omit, where cases with missing values on any required variable are dropped (this argument must be named, if provided). For the predict() method, the default and most suitable option is na.exclude. In that case, rows with NAs in newdata= are excluded from prediction, but reinjected in the final results so that the number of items is still the same (and in the same order as newdata=).

response

a vector of factors with the classes.

object

an mlNaiveBayes object

newdata

a new dataset with the same conformation as the training set (same variables, except maybe the class for classification, or the dependent variable for regression). Usually a test set, or a new dataset to be predicted.

type

the type of prediction to return. "class" by default, the predicted classes. Other options are "membership", the posterior probability, or "both" to return classes and memberships.

method

"direct" (default) or "cv". "direct" predicts new cases in newdata= if this argument is provided, or the cases in the training set if not. Take care that not providing newdata= means that you just calculate the self-consistency of the classifier, but you cannot use the metrics derived from these results for the assessment of its performances. Either use a different dataset in newdata= or use the alternate cross-validation ("cv") technique. If you specify method = "cv", then cvpredict() is used and you cannot provide newdata= in that case.

threshold

value replacing cells with probabilities within the 'eps' range.

eps

number for specifying an epsilon-range to apply Laplace smoothing (to replace zero or close-to-zero probabilities by 'threshold').
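
As announced in the laplace= entry above, a minimal sketch with smoothing enabled (mirroring the HouseVotes84 example below):

data("HouseVotes84", package = "mlbench")
# laplace = 1 avoids zero conditional probabilities for predictor levels
# that are absent from the training set
house_nb <- ml_naive_bayes(data = HouseVotes84, Class ~ .,
  laplace = 1, na.action = na.omit)
summary(house_nb)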

Value

ml_naive_bayes()/mlNaiveBayes() creates an mlNaiveBayes, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it, like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().

See Also

mlearning(), cvpredict(), confusion(), also e1071::naiveBayes() that actually does the classification.

Examples

# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA

iris_nb <- ml_naive_bayes(data = iris_train, Species ~ .)
summary(iris_nb)
predict(iris_nb) # Default type is class
predict(iris_nb, type = "membership")
predict(iris_nb, type = "both")
# Self-consistency, do not use for assessing classifier performances!
confusion(iris_nb)
# Use an independent test set instead
confusion(predict(iris_nb, newdata = iris_test), iris_test$Species)

# Another dataset
data("HouseVotes84", package = "mlbench")
house_nb <- ml_naive_bayes(data = HouseVotes84, Class ~ .,
  na.action = na.omit)
summary(house_nb)
confusion(house_nb) # Self-consistency
confusion(cvpredict(house_nb), na.omit(HouseVotes84)$Class)

Supervised classification and regression using neural network

Description

Unified (formula-based) interface version of the single-hidden-layer neural network algorithm, possibly with skip-layer connections, provided by nnet::nnet().

Usage

mlNnet(train, ...)

ml_nnet(train, ...)

## S3 method for class 'formula'
mlNnet(
  formula,
  data,
  size = NULL,
  rang = NULL,
  decay = 0,
  maxit = 1000,
  ...,
  subset,
  na.action
)

## Default S3 method:
mlNnet(train, response, size = NULL, rang = NULL, decay = 0, maxit = 1000, ...)

## S3 method for class 'mlNnet'
predict(
  object,
  newdata,
  type = c("class", "membership", "both", "raw"),
  method = c("direct", "cv"),
  na.action = na.exclude,
  ...
)

Arguments

train

a matrix or data frame with predictors.

...

further arguments passed to nnet::nnet(), which has many more parameters (see its help page).

formula

a formula with the left term being the factor variable to predict (for supervised classification) or a vector of numbers (for regression), and the right term being the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the class ~ . short version (which is strongly encouraged). Variables with a minus sign are eliminated. Calculations on variables are possible according to the usual formula convention (possibly protected by using I()).

data

a data.frame to use as a training set.

size

number of units in the hidden layer. Can be zero if there are skip-layer units. If NULL (the default), a reasonable value is computed.

rang

initial random weights on [-rang, rang]. Value about 0.5 unless the inputs are large, in which case it should be chosen so that rang * max(|x|) is about 1. If NULL, a reasonable default is computed. A sketch of this rule of thumb follows the arguments list.

decay

parameter for weight decay. Defaults to 0.

maxit

maximum number of iterations. Default is 1000 (it is 100 in nnet::nnet()).

subset

index vector with the cases to define the training set in use (this argument must be named, if provided).

na.action

function to specify the action to be taken if NAs are found. For ml_nnet(), na.fail is used by default: the calculation is stopped if there is any NA in the data. Another option is na.omit, where cases with missing values on any required variable are dropped (this argument must be named, if provided). For the predict() method, the default and most suitable option is na.exclude. In that case, rows with NAs in newdata= are excluded from prediction, but reinjected in the final results so that the number of items is still the same (and in the same order as newdata=).

response

a vector of factors (classification) or numeric values (regression).

object

an mlNnet object

newdata

a new dataset with the same conformation as the training set (same variables, except maybe the class for classification, or the dependent variable for regression). Usually a test set, or a new dataset to be predicted.

type

the type of prediction to return. "class" by default, the predicted classes. Other options are "membership", the membership (a number between 0 and 1) to the different classes, or "both" to return classes and memberships. There is also type "raw" for the non-normalized result as returned by nnet::nnet() (useful for regression, see the examples).

method

"direct" (default) or "cv". "direct" predicts new cases in newdata= if this argument is provided, or the cases in the training set if not. Take care that not providing newdata= means that you just calculate the self-consistency of the classifier, but you cannot use the metrics derived from these results for the assessment of its performances. Either use a different data set in newdata= or use the alternate cross-validation ("cv") technique. If you specify method = "cv", then cvpredict() is used and you cannot provide newdata= in that case.

Value

ml_nnet()/mlNnet() creates an mlNnet, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it, like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().

See Also

mlearning(), cvpredict(), confusion(), also nnet::nnet() that actually does the classification.

Examples

# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA

set.seed(689) # Useful for reproducibility, use a different value each time!
iris_nnet <- ml_nnet(data = iris_train, Species ~ .)
summary(iris_nnet)
predict(iris_nnet) # Default type is class
predict(iris_nnet, type = "membership")
predict(iris_nnet, type = "both")
# Self-consistency, do not use for assessing classifier performances!
confusion(iris_nnet)
# Use an independent test set instead
confusion(predict(iris_nnet, newdata = iris_test), iris_test$Species)

# Idem, but two classes prediction
data("HouseVotes84", package = "mlbench")
set.seed(325)
house_nnet <- ml_nnet(data = HouseVotes84, Class ~ ., na.action = na.omit)
summary(house_nnet)
# Cross-validated confusion matrix
confusion(cvpredict(house_nnet), na.omit(HouseVotes84)$Class)

# Regression
data(airquality, package = "datasets")
set.seed(74)
ozone_nnet <- ml_nnet(data = airquality, Ozone ~ ., na.action = na.omit,
  skip = TRUE, decay = 1e-3, size = 20, linout = TRUE)
summary(ozone_nnet)
plot(na.omit(airquality)$Ozone, predict(ozone_nnet, type = "raw"))
abline(a = 0, b = 1)

Supervised classification using quadratic discriminant analysis

Description

Unified (formula-based) interface version of the quadratic discriminant analysis algorithm provided by MASS::qda().

Usage

mlQda(train, ...)

ml_qda(train, ...)

## S3 method for class 'formula'
mlQda(formula, data, ..., subset, na.action)

## Default S3 method:
mlQda(train, response, ...)

## S3 method for class 'mlQda'
predict(
  object,
  newdata,
  type = c("class", "membership", "both"),
  prior = object$prior,
  method = c("plug-in", "predictive", "debiased", "looCV", "cv"),
  ...
)

Arguments

train

a matrix or data frame with predictors.

...

further arguments passed to MASS::qda() or its predict() method (see the corresponding help page).

formula

a formula with the left term being the factor variable to predict and the right term being the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the class ~ . short version (which is strongly encouraged). Variables with a minus sign are eliminated. Calculations on variables are possible according to the usual formula convention (possibly protected by using I()).

data

a data.frame to use as a training set.

subset

index vector with the cases to define the training set in use (this argument must be named, if provided).

na.action

function to specify the action to be taken if NAs are found. For ml_qda(), na.fail is used by default: the calculation is stopped if there is any NA in the data. Another option is na.omit, where cases with missing values on any required variable are dropped (this argument must be named, if provided). For the predict() method, the default and most suitable option is na.exclude. In that case, rows with NAs in newdata= are excluded from prediction, but reinjected in the final results so that the number of items is still the same (and in the same order as newdata=).

response

a vector of factors for the classification.

object

an mlQda object

newdata

a new dataset with the same conformation as the training set (same variables, except maybe the class for classification, or the dependent variable for regression). Usually a test set, or a new dataset to be predicted.

type

the type of prediction to return. "class" by default, the predicted classes. Other options are "membership", the membership (a number between 0 and 1) to the different classes, or "both" to return classes and memberships.

prior

the prior probabilities of class membership. By default, the priors are obtained from the object and, if they were not changed, correspond to the proportions observed in the training set.

method

"plug-in", "predictive", "debiased", "looCV", or "cv". With "plug-in" (the default), the usual unbiased parameter estimates are used. With "predictive", the parameters are integrated out using a vague prior. With "debiased", an unbiased estimator of the log posterior probabilities is used. With "looCV", the leave-one-out cross-validation fits to the original dataset are computed and returned. With "cv", cross-validation is used instead. If you specify method = "cv", then cvpredict() is used and you cannot provide newdata= in that case.

Value

ml_qda()/mlQda() creates an mlQda, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it, like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().

See Also

mlearning(), cvpredict(), confusion(), also MASS::qda() that actually does the classification.

Examples

# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA

iris_qda <- ml_qda(data = iris_train, Species ~ .)
summary(iris_qda)
confusion(iris_qda)
confusion(predict(iris_qda, newdata = iris_test), iris_test$Species)

# Another dataset (binary predictor... not optimal for qda, just for test)
data("HouseVotes84", package = "mlbench")
house_qda <- ml_qda(data = HouseVotes84, Class ~ ., na.action = na.omit)
summary(house_qda)

Supervised classification and regression using random forest

Description

Unified (formula-based) interface version of the random forest algorithm provided by randomForest::randomForest().

Usage

mlRforest(train, ...)

ml_rforest(train, ...)

## S3 method for class 'formula'
mlRforest(
  formula,
  data,
  ntree = 500,
  mtry,
  replace = TRUE,
  classwt = NULL,
  ...,
  subset,
  na.action
)

## Default S3 method:
mlRforest(
  train,
  response,
  ntree = 500,
  mtry,
  replace = TRUE,
  classwt = NULL,
  ...
)

## S3 method for class 'mlRforest'
predict(
  object,
  newdata,
  type = c("class", "membership", "both", "vote"),
  method = c("direct", "oob", "cv"),
  ...
)

Arguments

train

a matrix or data frame with predictors.

...

further arguments passed to randomForest::randomForest() or its predict() method. There are many more arguments; see the corresponding help page.

formula

a formula with the left term being the factor variable to predict (for supervised classification), a vector of numbers (for regression) or nothing (for unsupervised classification), and the right term being the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the class ~ . short version (which is strongly encouraged). Variables with a minus sign are eliminated. Calculations on variables are possible according to the usual formula convention (possibly protected by using I()).

data

a data.frame to use as a training set.

ntree

the number of trees to generate (use a value large enough to get at least a few predictions for each input row). Default is 500 trees.

mtry

number of variables randomly sampled as candidates at each split. Note that the default values differ for classification (sqrt(p), where p is the number of variables in x) and regression (p/3). A sketch of these defaults follows the arguments list.

replace

sample cases with or without replacement (TRUE by default)?

classwt

priors of the classes. Need not add up to one. Ignored for regression.

subset

index vector with the cases to define the training set in use (this argument must be named, if provided).

na.action

function to specify the action to be taken if NAs are found. For ml_rforest(), na.fail is used by default: the calculation is stopped if there is any NA in the data. Another option is na.omit, where cases with missing values on any required variable are dropped (this argument must be named, if provided). For the predict() method, the default and most suitable option is na.exclude. In that case, rows with NAs in newdata= are excluded from prediction, but reinjected in the final results so that the number of items is still the same (and in the same order as newdata=).

response

a vector of factors (classification) or numeric values (regression), or NULL (unsupervised classification).

object

an mlRforest object

newdata

a new dataset with the same conformation as the training set (same variables, except maybe the class for classification, or the dependent variable for regression). Usually a test set, or a new dataset to be predicted.

type

the type of prediction to return. "class" by default, the predicted classes. Other options are "membership", the membership (a number between 0 and 1) to the different classes, or "both" to return classes and memberships. One can also use "vote", which returns the number of trees that voted for each class.

method

"direct" (default), "oob" or "cv". "direct" predicts new cases in newdata= if this argument is provided, or the cases in the training set if not. Take care that not providing newdata= means that you just calculate the self-consistency of the classifier, but you cannot use the metrics derived from these results for the assessment of its performances (in the case of random forest, these metrics would most certainly falsely indicate a perfect classifier). Either use a different data set in newdata= or use the alternate approaches: out-of-bag ("oob") or cross-validation ("cv"). The out-of-bag approach uses the individuals that are not used to build the trees to assess performances; it provides unbiased estimates. If you specify method = "cv", then cvpredict() is used and you cannot provide newdata= in that case.

Value

ml_rforest()/mlRforest() creates an mlRforest, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it, like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().

See Also

mlearning(), cvpredict(), confusion(), also randomForest::randomForest() that actually does the classification.

Examples

# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA

iris_rf <- ml_rforest(data = iris_train, Species ~ .)
summary(iris_rf)
plot(iris_rf) # Useful to look at the effect of ntree=
# For such a relatively simple case, 50 trees are enough
iris_rf <- ml_rforest(data = iris_train, Species ~ ., ntree = 50)
summary(iris_rf)
predict(iris_rf) # Default type is class
predict(iris_rf, type = "membership")
predict(iris_rf, type = "both")
predict(iris_rf, type = "vote")
# Out-of-bag prediction (unbiased)
predict(iris_rf, method = "oob")
# Self-consistency (always very high for random forest, biased, do not use!)
confusion(iris_rf)
# This one is better
confusion(iris_rf, method = "oob") # Out-of-bag performances
# Cross-validation prediction is also a good choice when there is no test set
predict(iris_rf, method = "cv")  # Idem: cvpredict(res)
# Cross-validation for performances estimation
confusion(iris_rf, method = "cv")
# Evaluation of performances using a separate test set
confusion(predict(iris_rf, newdata = iris_test), iris_test$Species)

# Regression using random forest (from ?randomForest)
set.seed(131) # Useful for reproducibility (use a different number each time)
ozone_rf <- ml_rforest(data = airquality, Ozone ~ ., mtry = 3,
  importance = TRUE, na.action = na.omit)
summary(ozone_rf)
# Show "importance" of variables: higher values mean more important variables
round(randomForest::importance(ozone_rf), 2)
plot(na.omit(airquality)$Ozone, predict(ozone_rf))
abline(a = 0, b = 1)

# Unsupervised classification using random forest (from ?randomForest)
set.seed(17)
iris_urf <- ml_rforest(train = iris[, -5]) # Use only quantitative data
summary(iris_urf)
randomForest::MDSplot(iris_urf, iris$Species)
plot(stats::hclust(stats::as.dist(1 - iris_urf$proximity),
  method = "average"), labels = iris$Species)

Supervised classification and regression using recursive partitioning

Description

Unified (formula-based) interface version of the recursive partitioning algorithm as implemented in rpart::rpart().

Usage

mlRpart(train, ...)

ml_rpart(train, ...)

## S3 method for class 'formula'
mlRpart(formula, data, ..., subset, na.action)

## Default S3 method:
mlRpart(train, response, ..., .args. = NULL)

## S3 method for class 'mlRpart'
predict(
  object,
  newdata,
  type = c("class", "membership", "both"),
  method = c("direct", "cv"),
  ...
)

Arguments

train

a matrix or data frame with predictors.

...

further arguments passed to rpart::rpart() or its predict() method (see the corresponding help page).

formula

a formula with the left term being the factor variable to predict (for supervised classification) or a vector of numbers (for regression), and the right term being the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the class ~ . short version (which is strongly encouraged). Variables with a minus sign are eliminated. Calculations on variables are possible according to the usual formula convention (possibly protected by using I()).

data

a data.frame to use as a training set.

subset

index vector with the cases to define the training set in use (this argument must be named, if provided).

na.action

function to specify the action to be taken if NAs are found. For ml_rpart(), na.fail is used by default: the calculation is stopped if there is any NA in the data. Another option is na.omit, where cases with missing values on any required variable are dropped (this argument must be named, if provided). For the predict() method, the default and most suitable option is na.exclude. In that case, rows with NAs in newdata= are excluded from prediction, but reinjected in the final results so that the number of items is still the same (and in the same order as newdata=).

response

a vector of factors (classification) or numeric values (regression).

.args.

used internally; do not provide anything here.

object

an mlRpart object

newdata

a new dataset with the same conformation as the training set (same variables, except maybe the class for classification, or the dependent variable for regression). Usually a test set, or a new dataset to be predicted.

type

the type of prediction to return. "class" by default, the predicted classes. Other options are "membership", the membership (a number between 0 and 1) to the different classes, or "both" to return classes and memberships.

method

"direct" (default) or "cv". "direct" predicts new cases in newdata= if this argument is provided, or the cases in the training set if not. Take care that not providing newdata= means that you just calculate the self-consistency of the classifier, but you cannot use the metrics derived from these results for the assessment of its performances. Either use a different data set in newdata= or use the alternate cross-validation ("cv") technique. If you specify method = "cv", then cvpredict() is used and you cannot provide newdata= in that case.

Value

ml_rpart()/mlRpart() creates an mlRpart, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it, like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().

See Also

mlearning(), cvpredict(), confusion(), also rpart::rpart() that actually does the classification.

Examples

# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA

iris_rpart <- ml_rpart(data = iris_train, Species ~ .)
summary(iris_rpart)
# Plot the decision tree for this classifier
plot(iris_rpart, margin = 0.03, uniform = TRUE)
text(iris_rpart, use.n = FALSE)
# Predictions
predict(iris_rpart) # Default type is class
predict(iris_rpart, type = "membership")
predict(iris_rpart, type = "both")
# Self-consistency, do not use for assessing classifier performances!
confusion(iris_rpart)
# Cross-validation prediction is a good choice when there is no test set
predict(iris_rpart, method = "cv")  # Idem: cvpredict(res)
confusion(iris_rpart, method = "cv")
# Evaluation of performances using a separate test set
confusion(predict(iris_rpart, newdata = iris_test), iris_test$Species)

Supervised classification and regression using support vector machine

Description

Unified (formula-based) interface version of the support vector machine algorithm provided by e1071::svm().

Usage

mlSvm(train, ...)

ml_svm(train, ...)

## S3 method for class 'formula'
mlSvm(
  formula,
  data,
  scale = TRUE,
  type = NULL,
  kernel = "radial",
  classwt = NULL,
  ...,
  subset,
  na.action
)

## Default S3 method:
mlSvm(
  train,
  response,
  scale = TRUE,
  type = NULL,
  kernel = "radial",
  classwt = NULL,
  ...
)

## S3 method for class 'mlSvm'
predict(
  object,
  newdata,
  type = c("class", "membership", "both"),
  method = c("direct", "cv"),
  na.action = na.exclude,
  ...
)

Arguments

train

a matrix or data frame with predictors.

...

further arguments passed to the classification or regression method. See e1071::svm().

formula

a formula with the left term being the factor variable to predict (for supervised classification), a vector of numbers (for regression) or nothing (for unsupervised classification), and the right term being the list of independent, predictive variables, separated with a plus sign. If the data frame provided contains only the dependent and independent variables, one can use the class ~ . short version (which is strongly encouraged). Variables with a minus sign are eliminated. Calculations on variables are possible according to the usual formula convention (possibly protected by using I()).

data

a data.frame to use as a training set.

scale

are the variables scaled (so that mean = 0 and standard deviation = 1)? TRUE by default. If a vector is provided, it is applied to the variables with recycling.

type

For ml_svm()/mlSvm(), the type of classification or regression machine to use. The default value of NULL uses "C-classification" if the response variable is a factor and "eps-regression" if it is numeric. It can also be "nu-classification" or "nu-regression". The "C" and "nu" versions are basically the same but with a different parameterisation: the range of C is from zero to infinity, while the range of nu is from zero to one. A fifth option is "one_classification", which is specific to novelty detection (finding the items that are different from the rest). For predict(), the type of prediction to return: "class" by default, the predicted classes. Other options are "membership", the membership (a number between 0 and 1) to the different classes, or "both" to return classes and memberships. A short sketch of the nu parameterisation follows the arguments list.

kernel

the kernel used by svm; see e1071::svm() for further explanations. Can be "radial", "linear", "polynomial" or "sigmoid".

classwt

priors of the classes. Need not add up to one.

subset

index vector with the cases to define the training set in use (this argument must be named, if provided).

na.action

function to specify the action to be taken if NAs are found. For ml_svm(), na.fail is used by default: the calculation is stopped if there is any NA in the data. Another option is na.omit, where cases with missing values on any required variable are dropped (this argument must be named, if provided). For the predict() method, the default and most suitable option is na.exclude. In that case, rows with NAs in newdata= are excluded from prediction, but reinjected in the final results so that the number of items is still the same (and in the same order as newdata=).

response

a vector of factors (classification) or numeric values (regression).

object

an mlSvm object

newdata

a new dataset with the same conformation as the training set (same variables, except maybe the class for classification, or the dependent variable for regression). Usually a test set, or a new dataset to be predicted.

method

"direct" (default) or "cv". "direct" predicts new cases in newdata= if this argument is provided, or the cases in the training set if not. Take care that not providing newdata= means that you just calculate the self-consistency of the classifier, but you cannot use the metrics derived from these results for the assessment of its performances. Either use a different data set in newdata= or use the alternate cross-validation ("cv") technique. If you specify method = "cv", then cvpredict() is used and you cannot provide newdata= in that case.

Value

ml_svm()/mlSvm() creates an mlSvm, mlearning object containing the classifier and a lot of additional metadata used by the functions and methods you can apply to it, like predict() or cvpredict(). In case you want to program new functions or extract specific components, inspect the "unclassed" object using unclass().

See Also

mlearning(), cvpredict(), confusion(), also e1071::svm() that actually does the calculation.

Examples

# Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_train <- iris[train, ]
iris_test <- iris[-train, ]
# One case with missing data in train set, and another case in test set
iris_train[1, 1] <- NA
iris_test[25, 2] <- NA

iris_svm <- ml_svm(data = iris_train, Species ~ .)
summary(iris_svm)
predict(iris_svm) # Default type is class
predict(iris_svm, type = "membership")
predict(iris_svm, type = "both")
# Self-consistency, do not use for assessing classifier performances!
confusion(iris_svm)
# Use an independent test set instead
confusion(predict(iris_svm, newdata = iris_test), iris_test$Species)

# Another dataset
data("HouseVotes84", package = "mlbench")
house_svm <- ml_svm(data = HouseVotes84, Class ~ ., na.action = na.omit)
summary(house_svm)
# Cross-validated confusion matrix
confusion(cvpredict(house_svm), na.omit(HouseVotes84)$Class)

# Regression using support vector machine
data(airquality, package = "datasets")
ozone_svm <- ml_svm(data = airquality, Ozone ~ ., na.action = na.omit)
summary(ozone_svm)
plot(na.omit(airquality)$Ozone, predict(ozone_svm))
abline(a = 0, b = 1)

Machine learning model for (un)supervised classification or regression

Description

An mlearning object provides a unified (formula-based) interface to several machine learning algorithms. They share the same interface and very similar arguments. They conform to the formula-based approach of, say, stats::lm() in base R, but with a coherent handling of missing data and missing class levels. An optimized version exists for the simplified y ~ . formula. Finally, cross-validation is also built-in.

Usage

mlearning(
  formula,
  data,
  method,
  model.args,
  call = match.call(),
  ...,
  subset,
  na.action = na.fail
)

## S3 method for class 'mlearning'
print(x, ...)

## S3 method for class 'mlearning'
summary(object, ...)

## S3 method for class 'summary.mlearning'
print(x, ...)

## S3 method for class 'mlearning'
plot(x, y, ...)

## S3 method for class 'mlearning'
predict(
  object,
  newdata,
  type = c("class", "membership", "both"),
  method = c("direct", "cv"),
  na.action = na.exclude,
  ...
)

cvpredict(object, ...)

## S3 method for class 'mlearning'
cvpredict(
  object,
  type = c("class", "membership", "both"),
  cv.k = 10,
  cv.strat = TRUE,
  ...
)

Arguments

formula

a formula with left term being the factor variable to predict(for supervised classification), a vector of numbers (for regression) ornothing (for unsupervised classification) and the right term with the listof independent, predictive variables, separated with a plus sign. If thedata frame provided contains only the dependent and independent variables,one can use theclass ~ . short version (that one is strongly encouraged).Variables with minus sign are eliminated. Calculations on variables arepossible according to usual formula convention (possibly protected by usingI()). Supervised classification, regression or unsupervised classificationare not available for all algorithms. Check respective help pages.

data

a data.frame to use as a training set.

method

"direct" (default) or"cv"."direct" predicts new casesin⁠newdata=⁠ if this argument is provided, or the cases in the trainingset if not. Take care that not providing⁠newdata=⁠ means that you justcalculate theself-consistency of the classifier but cannot use themetrics derived from these results for the assessment of its performances.Either use a different dataset in⁠newdata=⁠ or use the alternatecross-validation ("cv") technique. If you specifymethod = "cv" thencvpredict() is used and you cannot provide⁠newdata=⁠ in that case. Othermethods may be provided by the various algorithms (check their help pages)

model.args

arguments for formula modeling with substituted data andsubset... Not to be used by the end-user.

call

the function call. Not to be used by the end-user.

...

further arguments (depends on the method).

subset

index vector with the cases to define the training set in use(this argument must be named, if provided).

na.action

function to specify the action to be taken ifNAs arefound. Forml_qda()na.fail is used by default. The calculation isstopped if there is anyNA in the data. Another option isna.omit,where cases with missing values on any required variable are dropped (thisargument must be named, if provided). For thepredict() method, thedefault, and most suitable option, isna.exclude. In that case, rows withNAs in⁠newdata=⁠ are excluded from prediction, but reinjected in thefinal results so that the number of items is still the same (and in thesame order as⁠newdata=⁠).

x,object

anmlearning object

y

a secondmlearning object or nothing (not used in several plots)

newdata

a new dataset with same conformation as the training set (samevariables, except may by the class for classification or dependent variablefor regression). Usually a test set, or a new dataset to be predicted.

type

the type of prediction to return."class" by default, thepredicted classes. Other options are"membership" the membership (anumber between 0 and 1) to the different classes, or"both" to returnclasses and memberships. Other types may be provided for some algorithms(read respective help pages).

cv.k

the k for k-fold cross-validation, cf. ipred::errorest(). By default, 10.

cv.strat

is the subsampling stratified or not in cross-validation, cf. ipred::errorest()? TRUE by default.
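
As an illustration of the formula forms described above, here is a minimal sketch using ml_rforest() on the iris data (any other mlXXX() function should accept the same forms):

data("iris", package = "datasets")
# Short form: all remaining variables are predictors (strongly encouraged)
m1 <- ml_rforest(Species ~ ., data = iris)
# Explicit list of predictors, separated by plus signs
m2 <- ml_rforest(Species ~ Petal.Length + Petal.Width, data = iris)
# A minus sign eliminates a variable from the predictors
m3 <- ml_rforest(Species ~ . - Sepal.Width, data = iris)
# A calculation on variables, protected by I()
m4 <- ml_rforest(Species ~ I(Petal.Length / Petal.Width), data = iris)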
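
A minimal sketch contrasting "direct" prediction (self-consistency versus an independent test set) with cross-validated prediction, reusing the iris split from the examples below:

data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
iris_rf <- ml_rforest(Species ~ ., data = iris[train, ])
predict(iris_rf)                            # Self-consistency only (optimistic)
predict(iris_rf, newdata = iris[-train, ]) # Honest predictions on a test set
cvpredict(iris_rf, cv.k = 5)               # 5-fold stratified cross-validation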
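
And a sketch of the default na.exclude behavior of the predict() method; iris_na is a small, hypothetical test set built here just to show that rows with NAs come back as NA predictions, preserving the length and order of newdata=:

data("iris", package = "datasets")
iris_rf <- ml_rforest(Species ~ ., data = iris)
iris_na <- iris[1:6, ]
iris_na$Petal.Length[2] <- NA # Introduce one missing value
predict(iris_rf, newdata = iris_na) # The second prediction is NA
length(predict(iris_rf, newdata = iris_na)) == nrow(iris_na) # TRUE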

Value

an mlearning object for mlearning(). Methods return their own results, which can be an mlearning object, a data.frame, a vector, etc.

See Also

ml_lda(), ml_qda(), ml_naive_bayes(), ml_nnet(), ml_rpart(), ml_rforest(), ml_svm(), confusion() and prior(). Also ipred::errorest() that internally computes the cross-validation in cvpredict().

Examples

# mlearning() should not be called directly. Use the mlXXX() functions instead
# for instance, for Random Forest, use ml_rforest()/mlRforest()
# A typical classification involves several steps:
#
# 1) Prepare data: split into training set (2/3) and test set (1/3)
#    Data cleaning (elimination of unwanted variables), transformation of
#    others (scaling, log, ratios, numeric to factor, ...) may be necessary
#    here. Apply the same treatments on the training and test sets
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133) # Also random or stratified sampling
iris_train <- iris[train, ]
iris_test <- iris[-train, ]

# 2) Train the classifier, use of the simplified formula class ~ . encouraged
#    so, you may have to prepare the train/test sets to keep only relevant
#    variables and to possibly transform them before use
iris_rf <- ml_rforest(data = iris_train, Species ~ .)
iris_rf
summary(iris_rf)
train(iris_rf)
response(iris_rf)

# 3) Find optimal values for the parameters of the model
#    This is usually done iteratively. Just an example with ntree where a plot
#    exists to help finding the optimal value
plot(iris_rf)
# For such a relatively simple case, 50 trees are enough, retrain with it
iris_rf <- ml_rforest(data = iris_train, Species ~ ., ntree = 50)
summary(iris_rf)

# 4) Study the classifier performances. Several metrics and tools exist,
#    like ROC curves, AUC, etc. Tools provided here are the confusion matrix
#    and the metrics that are calculated on it.
predict(iris_rf) # Default type is class
predict(iris_rf, type = "membership")
predict(iris_rf, type = "both")
# Confusion matrix and metrics using 10-fold cross-validation
iris_rf_conf <- confusion(iris_rf, method = "cv")
iris_rf_conf
summary(iris_rf_conf)
# Note you may want to manipulate priors too, see ?prior

# 5) Go back to step #1 and refine the process until you are happy with the
#    results. Then, you can use the classifier to predict unknown items.

Plot a confusion matrix

Description

Several graphical representations of confusion objects are possible: an image of the matrix with colored squares, a barplot comparing recall and precision, a stars plot also comparing two metrics, possibly between two different classifiers of the same dataset, or a dendrogram grouping the classes relative to the errors observed in the confusion matrix (classes with more errors are pooled together more rapidly).

Usage

## S3 method for class 'confusion'
plot(
  x,
  y = NULL,
  type = c("image", "barplot", "stars", "dendrogram"),
  stat1 = "Recall",
  stat2 = "Precision",
  names,
  ...
)

confusion_image(
  x,
  y = NULL,
  labels = names(dimnames(x)),
  sort = "ward.D2",
  numbers = TRUE,
  digits = 0,
  mar = c(3.1, 10.1, 3.1, 3.1),
  cex = 1,
  asp = 1,
  colfun,
  ncols = 41,
  col0 = FALSE,
  grid.col = "gray",
  ...
)

confusion_barplot(
  x,
  y = NULL,
  col = c("PeachPuff2", "green3", "lemonChiffon2"),
  mar = c(1.1, 8.1, 4.1, 2.1),
  cex = 1,
  cex.axis = cex,
  cex.legend = cex,
  main = "F-score (precision versus recall)",
  numbers = TRUE,
  min.width = 17,
  ...
)

confusion_stars(
  x,
  y = NULL,
  stat1 = "Recall",
  stat2 = "Precision",
  names,
  main,
  col = c("green2", "blue2", "green4", "blue4"),
  ...
)

confusion_dendrogram(
  x,
  y = NULL,
  labels = rownames(x),
  sort = "ward.D2",
  main = "Groups clustering",
  ...
)

# confusionImage(), confusionBarplot(), confusionStars() and
# confusionDendrogram() are synonyms of the corresponding snake_case
# functions above, with identical arguments.

Arguments

x

a confusion object.

y

NULL (not used), or a second confusion object when two different classifications are compared in the plot ("stars" type).

type

the kind of plot to produce: "image" (the default), "barplot", "stars", or "dendrogram".

stat1

the first metric to plot for the "stars" type ("Recall" by default).

stat2

the second metric to plot for the "stars" type ("Precision" by default).

names

names of the two classifiers to compare.

...

further arguments passed to the method. These can be any of the arguments of the corresponding plot function.

labels

labels to use for the two classifications. By default, they are the same as vars, or the ones in the confusion matrix.

sort

are the rows and columns of the confusion matrix sorted so that classes with larger confusion are closer together? Sorting is done using a hierarchical clustering with hclust(). The clustering method is "ward.D2" by default; see the hclust() help for other options. If FALSE or NULL, no sorting is done.

numbers

are actual numbers indicated in the confusion matrix image?

digits

the number of digits after the decimal point to print in the confusion matrix. The default of zero leads to the most compact presentation and is suitable for frequencies, but not for relative frequencies.

mar

graph margins.

cex

text magnification factor.

asp

graph aspect ratio. There is little reason to change the default value of 1.

colfun

a function that calculates a series of colors, e.g., cm.colors(), accepting a single argument: the number of colors to be generated (see the sketch after this argument list).

ncols

the number of colors to generate. It should preferably be 2 * number of levels + 1, where levels is the number of frequencies you want to evidence in the plot. Defaults to 41.

col0

should null values be colored or not (no, by default)?

grid.col

color to use for grid lines, or NULL for not drawing grid lines.

col

color(s) to use for the plot.

cex.axis

idem for the axes. If NULL, the axis is not drawn.

cex.legend

idem for the legend text. If NULL, no legend is added.

main

main title of the plot.

min.width

minimum bar width required to add numbers.
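
A hedged sketch of these customizations, assuming (as in the examples below) a glass_conf confusion object, and assuming plot.confusion() passes colfun=, ncols= and sort= through ... to confusion_image():

data("Glass", package = "mlbench")
Glass$Type <- as.factor(paste("Glass", Glass$Type))
glass_lvq <- ml_lvq(Type ~ ., data = Glass)
glass_conf <- confusion(cvpredict(glass_lvq), Glass$Type)
# Custom color function with 13 colors (2 * 6 + 1, to evidence 6 levels)
plot(glass_conf, colfun = cm.colors, ncols = 2 * 6 + 1)
# Another hclust() agglomeration method to sort the classes
plot(glass_conf, sort = "complete")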

Value

Data calculated to create the plots are returned invisibly. These functions are mostly used for their side effect of producing a plot.

Examples

data("Glass", package = "mlbench")# Use a little bit more informative labels for TypeGlass$Type <- as.factor(paste("Glass", Glass$Type))# Use learning vector quantization to classify the glass types# (using default parameters)summary(glass_lvq <- ml_lvq(Type ~ ., data = Glass))# Calculate cross-validated confusion matrix and plot it in different ways(glass_conf <- confusion(cvpredict(glass_lvq), Glass$Type))# Raw confusion matrix: no sort and no marginsprint(glass_conf, sums = FALSE, sort = FALSE)# Plotsplot(glass_conf) # Image by defaultplot(glass_conf, sort = FALSE) # No sortingplot(glass_conf, type = "barplot")plot(glass_conf, type = "stars")plot(glass_conf, type = "dendrogram")# Build another classifier and make a comparisonsummary(glass_naive_bayes <- ml_naive_bayes(Type ~ ., data = Glass))(glass_conf2 <- confusion(cvpredict(glass_naive_bayes), Glass$Type))# Comparison plot for two classifiersplot(glass_conf, glass_conf2)

Get or set priors on a confusion matrix

Description

Most metrics in supervised classification are sensitive to the relative proportion of the items in the different classes. When a confusion matrix is calculated on a test set, it uses the proportions observed on that test set. If they are representative of the proportions in the population, the metrics are not biased. When that is not the case, the priors of a confusion object can be adjusted to better reflect the proportions that are supposed to be observed in the different classes, in order to get more accurate metrics.

Usage

prior(object, ...)

## S3 method for class 'confusion'
prior(object, ...)

prior(object, ...) <- value

## S3 replacement method for class 'confusion'
prior(object, ...) <- value

Arguments

object

a confusion object (or another class if a method is implemented).

...

further arguments passed to methods

value

a (named) vector of positive numbers or zeros of the same length as the number of classes in the confusion object. It can also be a single number >= 0: in this case, equal probabilities are applied to all the classes (use 1 for relative frequencies and 100 for relative frequencies in percent). If the value has zero length or is NULL, the original prior probabilities (from the test set) are used. If the vector is named, the names must correspond to existing class names in the confusion object (see the sketch after this argument list).
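
A minimal sketch of a named prior vector; it assumes the six Glass class labels built in the examples below (the names must match the class names in the confusion object):

data("Glass", package = "mlbench")
Glass$Type <- as.factor(paste("Glass", Glass$Type))
glass_lvq <- ml_lvq(Type ~ ., data = Glass)
glass_conf <- confusion(cvpredict(glass_lvq), Glass$Type)
prior(glass_conf) <- c(
  `Glass 1` = 10, `Glass 2` = 10, `Glass 3` = 10,    # Rare classes
  `Glass 5` = 100, `Glass 6` = 100, `Glass 7` = 100) # Abundant classes
summary(glass_conf, type = "Fscore")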

Value

prior() returns the current class frequencies associated with the first classification tabulated in the confusion object, i.e., for the rows of the confusion matrix.

See Also

confusion()

Examples

data("Glass", package = "mlbench")# Use a little bit more informative labels for TypeGlass$Type <- as.factor(paste("Glass", Glass$Type))# Use learning vector quantization to classify the glass types# (using default parameters)summary(glass_lvq <- ml_lvq(Type ~ ., data = Glass))# Calculate cross-validated confusion matrix(glass_conf <- confusion(cvpredict(glass_lvq), Glass$Type))# When the probabilities in each class do not match the proportions in the# training set, all these calculations are useless. Having an idea of# the real proportions (so-called, priors), one should first reweight the# confusion matrix before calculating statistics, for instance:prior1 <- c(10, 10, 10, 100, 100, 100) # Glass types 1-3 are rareprior(glass_conf) <- prior1glass_confsummary(glass_conf, type = c("Fscore", "Recall", "Precision"))# This is very different than if glass types 1-3 are abundants!prior2 <- c(100, 100, 100, 10, 10, 10) # Glass types 1-3 are abundantsprior(glass_conf) <- prior2glass_confsummary(glass_conf, type = c("Fscore", "Recall", "Precision"))# Weight can also be used to construct a matrix of relative frequencies# In this case, all rows sum to oneprior(glass_conf) <- 1print(glass_conf, digits = 2)# However, it is easier to work with relative frequencies in percent# and one gets a more compact presentationprior(glass_conf) <- 100glass_conf# To reset row class frequencies to original propotions, just assign NULLprior(glass_conf) <- NULLglass_confprior(glass_conf)

Get the response variable for a mlearning object

Description

The response is either the class to be predicted for a classification problem (a factor in that case), or the dependent variable in a regression model (numeric in that case). For unsupervised classification, no response is provided and the method should return NULL.

Usage

response(object, ...)

## Default S3 method:
response(object, ...)

Arguments

object

an object having a response variable.

...

further parameters (depending on the method).

Value

The response variable of the training set, or NULL for unsupervised classification.

See Also

mlearning(), train(), confusion()

Examples

data("HouseVotes84", package = "mlbench")house_rf <- ml_rforest(data = HouseVotes84, Class ~ .)house_rfresponse(house_rf)

Get the training variable for a mlearning object

Description

The training variables (train) are the variables used to train a classifier, except the prediction target (the class or dependent variable).

Usage

train(object, ...)

## Default S3 method:
train(object, ...)

Arguments

object

an object having a train attribute.

...

further parameters (depending on the method).

Value

A data frame containing the training variables of the model.

See Also

mlearning(), response(), confusion()

Examples

data("HouseVotes84", package = "mlbench")house_rf <- ml_rforest(data = HouseVotes84, Class ~ .)house_rftrain(house_rf)
