Learning Tasks

Source: vignettes/tutorial/task.Rmd


Learning tasks encapsulate the data set and further relevant information about a machine learning problem, for example the name of the target variable for supervised problems.

Task types and creation

The tasks are organized in a hierarchy, with the generic Task() at the top. The following tasks can be instantiated and all inherit from the virtual superclass Task():

RegrTask() for regression problems,
ClassifTask() for binary and multi-class classification problems (cost-sensitive classification with class-dependent costs can be handled as well),
SurvTask() for survival analysis,
ClusterTask() for cluster analysis,
MultilabelTask() for multilabel classification problems,
CostSensTask() for general cost-sensitive classification (with example-specific costs).

To create a task, call make<TaskType>, e.g., makeClassifTask(). Every task has an identifier (argument id) and requires a base::data.frame() (argument data). If no ID is provided, it is generated automatically from the variable name of the data. The ID is later used to name results, for example of benchmark experiments, and to annotate plots. Depending on the nature of the learning problem, additional arguments may be required; these are discussed in the following sections.
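As a small illustration of the ID rule described above, the following sketch creates one task without an id and one with an explicit id, then reads both back with getTaskId(). It assumes the mlr package is installed; the built-in iris data and the ID "iris-demo" are chosen only for illustration.

```r
library(mlr)

# No id given: mlr derives it from the name of the variable passed as data
auto.task = makeClassifTask(data = iris, target = "Species")
getTaskId(auto.task)

# Explicit id (hypothetical name, for illustration only)
named.task = makeClassifTask(id = "iris-demo", data = iris, target = "Species")
getTaskId(named.task)
```

An explicit id is usually the better choice when the data frame has a generic name such as df, because the ID ends up in result tables and plot annotations.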

Regression

For supervised learning like regression (as well as classification and survival analysis) we have to specify, in addition to data, the name of the target variable.

data(BostonHousing, package = "mlbench")
regr.task = makeRegrTask(id = "bh", data = BostonHousing, target = "medv")
regr.task
## Supervised task: bh
## Type: regr
## Target: medv
## Observations: 506
## Features:
##    numerics     factors     ordered functionals
##          12           1           0           0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE

As you can see, the Task() records the type of the learning problem and basic information about the data set, e.g., the types of the features (base::numeric() vectors, base::factor()s or ordered factors), the number of observations, or whether missing values are present.

Creating tasks for classification and survival analysis follows the same scheme; only the data type of the target variables included in data differs. For each of these learning problems some specifics are described below.

Classification

For classification the target column has to be a factor.

In the following example we define a classification task for the mlbench::BreastCancer() data set and exclude the variable Id from all further model fitting and evaluation.

data(BreastCancer, package = "mlbench")
df = BreastCancer
df$Id = NULL
classif.task = makeClassifTask(id = "BreastCancer", data = df, target = "Class")
classif.task
## Supervised task: BreastCancer
## Type: classif
## Target: Class
## Observations: 699
## Features:
##    numerics     factors     ordered functionals
##           0           4           5           0
## Missings: TRUE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
##    benign malignant
##       458       241
## Positive class: benign

In binary classification the two classes are usually referred to as the positive and the negative class, with the positive class being the category of greater interest. This is relevant for many performance measures like the true positive rate or ROC analysis. Moreover, mlr, where possible, lets you set options (like setThreshold() or makeWeightedClassesWrapper()) and returns and plots results (like class posterior probabilities) for the positive class only.

makeClassifTask() by default selects the first factor level of the target variable as the positive class, in the above example benign. Class malignant can be selected manually as follows:

classif.task = makeClassifTask(id = "BreastCancer", data = df, target = "Class", positive = "malignant")
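The default rule for the positive class can be checked on any two-class factor. The sketch below assumes mlr is installed and uses a two-class subset of the built-in iris data, which is not part of this tutorial's examples. It reads the positive and negative elements from the task description.

```r
library(mlr)

# Two-class subset of iris; drop the now-empty "setosa" level
two = droplevels(iris[iris$Species != "setosa", ])
levels(two$Species)

# Default: the first factor level becomes the positive class
default.task = makeClassifTask(data = two, target = "Species")
getTaskDesc(default.task)$positive

# Explicit choice via the positive argument
flipped.task = makeClassifTask(data = two, target = "Species", positive = "virginica")
getTaskDesc(flipped.task)$positive
getTaskDesc(flipped.task)$negative
```

Setting positive explicitly keeps the task independent of the level order in the data, which can change when a factor is re-created or re-leveled.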

Survival analysis

Survival tasks use two target columns. For left and right censored problems these consist of the survival time and a binary event indicator. For interval censored data the two target columns must be specified in the "interval2" format (see survival::Surv()).

data(lung, package = "survival")
lung$status = (lung$status == 2) # convert to logical
surv.task = makeSurvTask(data = lung, target = c("time", "status"))
surv.task
## Supervised task: lung
## Type: surv
## Target: time,status
## Events: 165
## Observations: 228
## Features:
##    numerics     factors     ordered functionals
##           8           0           0           0
## Missings: TRUE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE

The type of censoring can be specified via the argument censoring, which defaults to "rcens" for right censored data.

Multilabel classification

In multilabel classification each object can belong to more than one category at the same time.

The data are expected to contain as many target columns as there are class labels. The target columns should be logical vectors that indicate which class labels are present. The names of the target columns are taken as class labels and need to be passed to the target argument of makeMultilabelTask().

In the following example we get the data of the yeast data set, extract the label names, and pass them to the target argument in makeMultilabelTask().

yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)
yeast.task
## Supervised task: multi
## Type: multilabel
## Target: label1,label2,label3,label4,label5,label6,label7,label8,label9,label10,label11,label12,label13,label14
## Observations: 2417
## Features:
##    numerics     factors     ordered functionals
##         103           0           0           0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 14
##  label1  label2  label3  label4  label5  label6  label7  label8  label9 label10
##     762    1038     983     862     722     597     428     480     178     253
## label11 label12 label13 label14
##     289    1816    1799      34
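To make the expected data layout concrete, here is a minimal sketch with a hand-made data frame. The data, the column names sports and politics, and the ID "toy" are invented for illustration; only the requirement of one logical column per label comes from the text above. It assumes mlr is installed.

```r
library(mlr)

set.seed(1)  # arbitrary seed, only to make the toy features reproducible

# Two numeric features and one logical column per label
toy = data.frame(
  x1 = rnorm(6),
  x2 = rnorm(6),
  sports = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  politics = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# The names of the logical columns are passed as target
toy.task = makeMultilabelTask(id = "toy", data = toy, target = c("sports", "politics"))
getTaskTargetNames(toy.task)
getTaskNFeats(toy.task)
```

Note that the third and fifth rows carry both labels at once, which is what distinguishes a multilabel problem from multi-class classification.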

Cluster analysis

As cluster analysis is unsupervised, the only mandatory argument to construct a cluster analysis task is the data. Below we create a learning task from the data set datasets::mtcars().

data(mtcars, package = "datasets")
cluster.task = makeClusterTask(data = mtcars)
cluster.task
## Unsupervised task: mtcars
## Type: cluster
## Observations: 32
## Features:
##    numerics     factors     ordered functionals
##          11           0           0           0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE

Cost-sensitive classification

The standard objective in classification is to obtain a high prediction accuracy, i.e., to minimize the number of errors. All types of misclassification errors are thereby deemed equally severe. However, in many applications different kinds of errors cause different costs.

In the case of class-dependent costs, which depend solely on the actual and predicted class labels, it is sufficient to create an ordinary ClassifTask().

In order to handle example-specific costs it is necessary to generate a CostSensTask(). In this scenario, each example \((x, y)\) is associated with an individual cost vector of length \(K\), with \(K\) denoting the number of classes. The \(k\)-th component indicates the cost of assigning \(x\) to class \(k\). Naturally, it is assumed that the cost of the intended class label \(y\) is minimal.

As the cost vector contains all relevant information about the intended class \(y\), only the feature values \(x\) and a cost matrix, which contains the cost vectors for all examples in the data set, are required to create the CostSensTask().

In the following example we use the datasets::iris() data and an artificial cost matrix (which is generated as proposed by Beygelzimer et al., 2005):

df = iris
cost = matrix(runif(150 * 3, 0, 2000), 150) * (1 - diag(3))[df$Species, ]
df$Species = NULL
costsens.task = makeCostSensTask(data = df, cost = cost)
costsens.task
## Supervised task: df
## Type: costsens
## Observations: 150
## Features:
##    numerics     factors     ordered functionals
##           4           0           0           0
## Missings: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 3
## y1, y2, y3

For more details see the page on cost-sensitive classification.
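The one-line construction of the cost matrix is compact, so the following base R sketch unpacks it step by step. The intermediate names mask, row.mask and cheapest and the seed are introduced here for illustration only. No mlr functions are needed.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# 3 x 3 mask: 0 on the diagonal (correct class), 1 elsewhere
mask = 1 - diag(3)

# Indexing by the factor uses its integer codes, so each observation
# picks the mask row that belongs to its true class
row.mask = mask[iris$Species, ]

# Random costs between 0 and 2000, zeroed for the true class
cost = matrix(runif(150 * 3, 0, 2000), 150) * row.mask

# The cheapest class per row is the true class
cheapest = apply(cost, 1, which.min)
all(cheapest == as.integer(iris$Species))
```

This shows why the target column can be dropped before calling makeCostSensTask(): the position of the zero in each row of the cost matrix already encodes the intended class.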

Further settings

The Task() help page also lists several other arguments to describe further details of the learning problem.

For example, we could include a blocking factor in the task. This would indicate that some observations "belong together" and should not be separated when splitting the data into training and test sets for resampling.

Another option is to assign weights to observations. These can simply indicate observation frequencies or result from the sampling scheme used to collect the data. Note that you should use this option only if the weights really belong to the task. If you plan to train some learning algorithms with different weights on the same Task(), mlr offers several other ways to set observation or class weights (for supervised classification). See for example the tutorial page about training or the function makeWeightedClassesWrapper().
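As a rough sketch of both settings, the example below passes a weight vector and a blocking factor when creating a task on the built-in iris data. The weights and the grouping are invented for illustration; the arguments weights and blocking are those documented on the Task() help page. It assumes mlr is installed.

```r
library(mlr)

n = nrow(iris)

# Hypothetical weights: pretend every second flower was observed twice
w = rep(c(1, 2), length.out = n)

# Hypothetical blocks: 30 groups of 5 consecutive rows that must stay together
blocks = factor(rep(seq_len(30), each = 5))

wb.task = makeClassifTask(data = iris, target = "Species",
  weights = w, blocking = blocks)

# The task description records that both settings are present
desc = getTaskDesc(wb.task)
desc$has.weights
desc$has.blocking
```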

Accessing a learning task

We provide many operators to access the elements stored in a Task(). The most important ones are listed in the documentation of Task() and getTaskData().

To access the TaskDesc() that contains basic information about the task you can use:

getTaskDesc(classif.task)
## $id
## [1] "BreastCancer"
##
## $type
## [1] "classif"
##
## $target
## [1] "Class"
##
## $size
## [1] 699
##
## $n.feat
##    numerics     factors     ordered functionals
##           0           4           5           0
##
## $has.missings
## [1] TRUE
##
## $has.weights
## [1] FALSE
##
## $has.blocking
## [1] FALSE
##
## $has.coordinates
## [1] FALSE
##
## $class.levels
## [1] "benign"    "malignant"
##
## $positive
## [1] "malignant"
##
## $negative
## [1] "benign"
##
## $class.distribution
##
##    benign malignant
##       458       241
##
## attr(,"class")
## [1] "ClassifTaskDesc"    "SupervisedTaskDesc" "TaskDesc"

Note that TaskDesc() objects have slightly different elements for different types of Task()s. Frequently required elements can also be accessed directly.

# Get the ID
getTaskId(classif.task)
## [1] "BreastCancer"

# Get the type of task
getTaskType(classif.task)
## [1] "classif"

# Get the names of the target columns
getTaskTargetNames(classif.task)
## [1] "Class"

# Get the number of observations
getTaskSize(classif.task)
## [1] 699

# Get the number of input variables
getTaskNFeats(classif.task)
## [1] 9

# Get the class levels in classif.task
getTaskClassLevels(classif.task)
## [1] "benign"    "malignant"

Moreover, mlr provides several functions to extract data from a Task().

# Accessing the data set in classif.task
str(getTaskData(classif.task))
## 'data.frame':    699 obs. of  10 variables:
##  $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

# Get the names of the input variables in cluster.task
getTaskFeatureNames(cluster.task)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

# Get the values of the target variables in surv.task
head(getTaskTargets(surv.task))
##   time status
## 1  306   TRUE
## 2  455   TRUE
## 3 1010  FALSE
## 4  210   TRUE
## 5  883   TRUE
## 6 1022  FALSE

# Get the cost matrix in costsens.task
head(getTaskCosts(costsens.task))
##      y1        y2         y3
## [1,]  0 1694.9063 1569.15053
## [2,]  0  995.0545   18.85981
## [3,]  0  775.8181 1558.13177
## [4,]  0  492.8980 1458.78130
## [5,]  0  222.1929 1260.26371
## [6,]  0  779.9889  961.82166

Note that getTaskData() offers many options for converting the data set into a convenient format. This especially comes in handy when you integrate a new learner from another R package into mlr. In this regard the function getTaskFormula() is also useful.
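A few of these options can be sketched on the pre-defined iris.task that ships with mlr. The arguments target.extra, subset and features are taken from the getTaskData() documentation; the variable names parts and small are introduced here only for illustration.

```r
library(mlr)

# Features and target returned separately, as a list with two elements
parts = getTaskData(iris.task, target.extra = TRUE)
names(parts)
dim(parts$data)
head(parts$target)

# Only selected rows and features (the target column is still included)
small = getTaskData(iris.task, subset = 1:10,
  features = c("Sepal.Length", "Petal.Length"))
colnames(small)

# Model formula, useful for fitting functions with a formula interface
getTaskFormula(iris.task)
```

The split into data and target suits fitting functions that take a feature matrix and a response separately, while the formula suits functions with a formula interface.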

Modifying a learning task

mlr provides several functions to alter an existing Task(), which is often more convenient than creating a new Task() from scratch. Here are some examples.

# Select observations and/or features
cluster.task = subsetTask(cluster.task, subset = 4:17)

# It may happen, especially after selecting observations, that features are constant.
# These should be removed.
removeConstantFeatures(cluster.task)
## Removing 1 columns: am
## Unsupervised task: mtcars
## Type: cluster
## Observations: 14
## Features:
##    numerics     factors     ordered functionals
##          10           0           0           0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE

# Remove selected features
dropFeatures(surv.task, c("meal.cal", "wt.loss"))
## Supervised task: lung
## Type: surv
## Target: time,status
## Events: 165
## Observations: 228
## Features:
##    numerics     factors     ordered functionals
##           6           0           0           0
## Missings: TRUE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE

# Standardize numerical features
task = normalizeFeatures(cluster.task, method = "range")
summary(getTaskData(task))
##       mpg              cyl              disp              hp
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.3161   1st Qu.:0.5000   1st Qu.:0.1242   1st Qu.:0.2801
##  Median :0.5107   Median :1.0000   Median :0.4076   Median :0.6311
##  Mean   :0.4872   Mean   :0.7143   Mean   :0.4430   Mean   :0.5308
##  3rd Qu.:0.6196   3rd Qu.:1.0000   3rd Qu.:0.6618   3rd Qu.:0.7473
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##       drat              wt              qsec              vs
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.2672   1st Qu.:0.1275   1st Qu.:0.2302   1st Qu.:0.0000
##  Median :0.3060   Median :0.1605   Median :0.3045   Median :0.0000
##  Mean   :0.4544   Mean   :0.3268   Mean   :0.3752   Mean   :0.4286
##  3rd Qu.:0.7026   3rd Qu.:0.3727   3rd Qu.:0.4908   3rd Qu.:1.0000
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##        am           gear             carb
##  Min.   :0.5   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.5   1st Qu.:0.0000   1st Qu.:0.3333
##  Median :0.5   Median :0.0000   Median :0.6667
##  Mean   :0.5   Mean   :0.2857   Mean   :0.6429
##  3rd Qu.:0.5   3rd Qu.:0.7500   3rd Qu.:1.0000
##  Max.   :0.5   Max.   :1.0000   Max.   :1.0000

For more functions and more detailed explanations have a look at the data preprocessing page.
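The examples above select only observations. As a complementary sketch, subsetTask() can restrict rows and features in one call through its features argument. The pre-defined iris.task and the chosen rows and columns are used purely for illustration, and mlr is assumed to be installed.

```r
library(mlr)

# Every second observation, two of the four features
sub.task = subsetTask(iris.task, subset = seq(1L, 150L, by = 2L),
  features = c("Sepal.Length", "Sepal.Width"))

getTaskSize(sub.task)
getTaskFeatureNames(sub.task)

# The original task is left unchanged; subsetTask() returns a new task
getTaskSize(iris.task)
```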

Example tasks and convenience functions

For your convenience mlr provides pre-defined Task()s for each type of learning problem. These are also used throughout this tutorial in order to get shorter and more readable code. A list of all Task()s can be found in the Appendix.

Moreover, mlr's function convertMLBenchObjToTask() can generate Task()s from the data sets and data generating functions in package mlbench::mlbench().
