The FFTrees package makes it very easy to visualize and evaluate fast-and-frugal trees (FFTs):
Use the main `FFTrees()` function to create a set of FFTs (e.g., as an R object `x` of type `FFTrees`).
Visualize a tree in `x` by evaluating `plot(x)`.
The two key arguments for plotting are `what` and `tree`: Whereas the `tree` argument allows selecting between different trees in `x` (using `tree = 1` by default), the `what` argument distinguishes between five main types of plots:
- `plot(x, what = 'all')` visualizes a tree and corresponding performance statistics. This is also the default when evaluating `plot(x)`.

- `plot(x, what = 'tree')` visualizes only the tree diagram of the selected tree (without performance statistics).

- `plot(x, what = 'icontree')` visualizes the tree diagram of the selected tree with icon arrays on exit nodes (with additional options for `show.iconguide` and `n.per.icon`).

- `plot(x, what = 'cues')` visualizes the current cue accuracies in ROC space (by calling the `showcues()` function).

- `plot(x, what = 'roc')` visualizes a performance comparison of FFTs and competing algorithms in ROC space.
The other arguments of the `plot.FFTrees()` function allow further customization of the plot (e.g., by defining labels and parameters, or selectively hiding or showing elements).
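As a minimal sketch of this workflow (using the `titanic` data that is introduced below):

```r
library(FFTrees)

# Create a set of FFTs (here based on the included 'titanic' data):
x <- FFTrees(formula = survived ~ ., data = titanic)

# Visualize the first FFT together with its performance statistics (the defaults):
plot(x, what = "all", tree = 1)
```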
In the following, we illustrate these options by creating FFTs based on the `titanic` data (included in the FFTrees package).
The `titanic` dataset contains basic survival statistics of Titanic passengers. For each passenger, we know in which class s/he traveled, as well as binary categories specifying age, sex, and survival information. To get a first impression, we inspect a random sample of cases:
```r
set.seed(12)  # reproducible randomness

rcases <- sort(sample(1:nrow(titanic), 10))

# Sample of data:
knitr::kable(titanic[rcases, ],
             caption = "A sample of 10 observations from the `titanic` data.")
```

|      | class  | age   | sex    | survived |
|------|--------|-------|--------|----------|
| 82 | first | adult | male | FALSE |
| 91 | first | adult | male | FALSE |
| 336 | second | adult | male | TRUE |
| 346 | second | adult | male | FALSE |
| 450 | second | adult | male | FALSE |
| 546 | second | adult | female | TRUE |
| 1093 | third | adult | female | TRUE |
| 1160 | third | adult | female | FALSE |
| 1271 | third | child | male | FALSE |
| 1500 | crew | adult | male | TRUE |
Our current goal is to fit FFTs to this dataset. This essentially asks: How can we use the cues `class`, `age`, and `sex` to decide whether a passenger survived?

First, let's create an `FFTrees` object (called `titanic.fft`) from the `titanic` dataset:
```r
# Create FFTs for the titanic data:
titanic.fft <- FFTrees(formula = survived ~ .,
                       data = titanic,
                       main = "Surviving the Titanic",
                       decision.labels = c("Died", "Survived"))
```

Note that we used the entire `titanic` data (i.e., all 2201 cases) to train `titanic.fft`, rather than specifying `train.p` to set aside some proportion of it or specifying a dedicated `data.test` set for predictive purposes. This implies that our present goal is to fit FFTs to the historic data, rather than to create and use FFTs to predict new cases.
We can visualize individual cue accuracies (specifically their sensitivities and specificities) by including the `what = 'cues'` argument within the `plot()` function. Let's apply the function to the `titanic.fft` object to see how accurate each of the cues was on its own in predicting survival.
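The call producing this plot is presumably as simple as:

```r
# Show individual cue accuracies in ROC space:
plot(titanic.fft, what = "cues")
```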
```r
#> Plotting cue training statistics:
#> — Cue accuracies ranked by bacc
```

Figure 1: Cue accuracies of FFTs predicting survival in the `titanic` dataset.
Given the axes of this plot, well-performing cues should be near the top-left corner of the graph (i.e., exhibit both a low false alarm rate and a high hit rate). For the `titanic` data, this implies that none of the cues predicts very well on its own. The best individual cue appears to be `sex` (indicated as 1), followed by `class` (2). By contrast, `age` (3) seems a pretty poor cue for predicting survival on its own (despite its specificity of 97%).
Inspecting cue accuracies can provide valuable information for constructing FFTs. While they provide lower bounds on the performance of trees (as combining cues is only worthwhile when this yields a benefit), even poor individual cues can shine in combination with other predictors.
To visualize a tree from an `FFTrees` object, use `plot()`. Let's plot one of the trees (Tree #1, i.e., the best one, given our current goal):
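A call along these lines would do (with `tree = 1` also being the default):

```r
# Plot Tree #1 of the titanic.fft object (with performance statistics):
plot(titanic.fft, tree = 1)
```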
Figure 2: Plotting the best FFT of an `FFTrees` object.
The resulting plot visualizes one out of the 4 possible trees in the `titanic.fft` object. As `tree = 1` corresponds to the best tree given our current `goal` for selecting FFTs, we could have plotted the same tree by specifying `tree = "best.train"`.
As Figure 2 contains a lot of information in three distinct panels, let's briefly consider their contents:
1. Basic dataset information: The top row of the plot shows basic information on the current dataset: its population size (N) and the baseline frequencies of the two categories of the criterion variable.

2. FFT and classification performance: The middle row shows the tree (in the center) as well as how many cases (here: persons) were classified at each level in the tree (on either side). For example, the current tree (Tree #1 of 4) can be read as a short sequence of if-then rules: each node checks one cue, and at least one of its answers leads directly to an exit decision (Died or Survived).

3. Accuracy and performance information: The bottom row shows general performance statistics of the FFT:
As our models in `titanic.fft` were trained on the entire `titanic` dataset, we fitted FFTs to its 2201 cases, rather than setting aside some data for predictive purposes. The panel label reflects this important distinction:
- If the results of fitting data (i.e., the data used to build the tree) are displayed, we'll see a "Training" label.

- If a testing dataset separate from the one used to build the tree is used, we'll see a "Prediction" label.
The bottom panel provides performance information and is structuredinto three subpanels:
- The classification table (on the left) shows the relationship between the true criterion states (as columns) and predicted decisions (as rows). The abbreviations `hi` (hits) and `cr` (correct rejections) denote correct decisions; `mi` (misses) and `fa` (false alarms) denote incorrect decisions.

- A range of vertical levels (in the middle) shows the tree's cumulative performance in terms of two frugality measures (`mcu` and `pci`) and various accuracy measures (sensitivity, specificity, accuracy, and balanced accuracy); see Accuracy statistics for details.

- Finally, the plot (on the right) shows an ROC curve comparing the performance of all trees in the `FFTrees` object. Additionally, the performance of logistic regression (blue) and CART (red) is shown. The tree plotted in the middle panel is highlighted in a solid green color (i.e., Figure 2 shows Tree #1).
Specifying additional arguments of `plot()` changes which elements are displayed and how.
`what`: What should be visualized? The `what` argument selects the content to be plotted:
When `what = 'all'` (as per default), the plot shows both a tree diagram and a range of corresponding performance statistics. Using one of the other `what` options narrows the range of what is being shown:
- To only visualize a bare tree diagram (without performance statistics), we specify `what = "tree"` (formerly `stats = FALSE`).

- To visualize the tree diagram with icon arrays on exit nodes, we specify `what = "icontree"` (with additional options for `show.iconguide` and `n.per.icon`).

- To visualize the performance comparison (for different FFTs and competing algorithms) in ROC space, we specify `what = "roc"`.
The following examples illustrate the range of corresponding plots:
```r
# Plot tree diagram with icon arrays:
plot(titanic.fft,
     what = "icontree",
     n.per.icon = 50,
     show.iconguide = TRUE)
```

Figure 3: An FFT diagram with icon arrays on exit nodes.
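Similarly, the ROC comparison shown in Figure 4 would result from a call like:

```r
# Compare all FFTs (and competing algorithms) in ROC space:
plot(titanic.fft, what = "roc")
```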
Figure 4: Performance comparison of FFTs in ROC space.
When the options provided by `what` do not suffice, various `show.` arguments (i.e., `show.header`, `show.tree`, `show.confusion`, `show.levels`, `show.roc`, `show.icons`, and `show.iconguide`) allow us to selectively turn on or turn off specific elements of the overall plot. For example:

```r
# Hide some elements of the FFT plot:
plot(titanic.fft,
     show.icons = FALSE,      # hide icons
     show.iconguide = FALSE,  # hide icon guide
     show.header = FALSE)     # hide header
```

Figure 5: Plotting selected elements.
`tree`: Which tree do we want to plot? As `FFTrees` objects typically contain multiple FFTs, we need to indicate which tree we want to visualize. We usually specify the tree to show by an integer value, such as `tree = 2`, which will plot the corresponding tree (i.e., Tree #2) of the `FFTrees` object. Alternatively, we can specify `tree = "best.train"` or `tree = "best.test"` to visualize the best training or prediction tree, respectively. This selects and shows the tree with the highest goal value (e.g., weighted accuracy `wacc`) when fitting or testing data.
`data`: Which data do we want to apply the tree to? We can specify `data = "train"` or `data = "test"` to distinguish between a training and testing dataset (if available) in the `FFTrees` object. As not all `FFTrees` objects contain test data, `data` is set to `data = "train"` by default.
As the `data` and `tree` arguments can both refer to datasets used for training or testing (i.e., the "train" or "test" sets), they should be specified consistently. For instance, the following command would visualize the best training tree in `titanic.fft`:
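A sketch of such a command (relying on the default `data = "train"`):

```r
# Visualize the best training tree (data = "train" by default):
plot(titanic.fft, tree = "best.train")
```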
This works, as `data = "train"` holds by default. However, the following analogous expression would fail:
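The failing counterpart would presumably be:

```r
# Asking for the best 'test' tree while data remains "train" (the default):
plot(titanic.fft, tree = "best.test")
```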
It fails for two distinct reasons:
1. When `data` remains unspecified, its default is `data = "train"`. Thus, asking for `tree = "best.test"` would require switching to `data = "test"`.

2. More crucially, `titanic.fft` was created without any test data. Hence, asking for the best test tree does not make sense, which is why `plot()` will show the best training tree (with a warning).
Shifting our emphasis from fitting to prediction, we primarily need to specify some test data that was not used to train the `FFTrees` object. When predicting performance for a new dataset (e.g., `data = test.data`), the plotting and printing functions will automatically apply an existing `FFTrees` object to the new data and compute corresponding performance statistics (using the `fftrees_apply()` function). However, when applying existing FFTs to new data, the changes to the `FFTrees` object are not stored in the input object, unless the (invisible) output of `plot.FFTrees()` or `print.FFTrees()` is re-assigned to that object. The best way to fit FFTs to training data and evaluate them on test data is to explicitly include both datasets in the original `FFTrees()` command, by using either its `data.test` or its `train.p` argument.
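As a sketch of the re-assignment point (assuming a hypothetical data frame `test.data` with the same variables as `titanic`):

```r
# Hypothetical new data with the same variables as 'titanic':
# test.data <- ...

# Apply the existing FFTs to the new data and store the updated statistics
# by re-assigning the (invisible) output of plot():
titanic.fft <- plot(titanic.fft, data = test.data)
```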
For example, we can repeat the previous analysis, but now let's create separate training and test datasets by including the `train.p = .50` argument. This will split the dataset into a 50% training set and a distinct 50% testing set. (Alternatively, we could specify a dedicated test dataset by using the `data.test` argument.)
```r
set.seed(100)  # for replicability of the training/test split

titanic.pred.fft <- FFTrees(formula = survived ~ .,
                            data = titanic,
                            train.p = .50,  # use 50% to train, 50% to test
                            main = "Titanic",
                            decision.labels = c("Died", "Survived"))
```

Here is the best training tree applied to the training data:
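The corresponding plotting call is presumably of this form (Tree #1 turns out to be the best training tree, as noted below):

```r
# Plot Tree #1 on the training data (data = "train" by default):
plot(titanic.pred.fft, tree = 1)
```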
Figure 6: Plotting the best FFT on training data.
Tree #1 is the best training tree, and could also be visualized by `plot(titanic.pred.fft, tree = "best.train")`. This tree has a high specificity of 92%, but a much lower sensitivity of just 51%. The overall accuracy of the tree's classifications is 79%, which exceeds the baseline, but is far from perfect. However, as we can see in the ROC plot, a logistic regression (LR) would not perform much better, and CART performed even worse than Tree #1.
Now let's inspect the performance of the same tree on the test data:
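A call of the following form (now switching `data` to the test set) would show this:

```r
# Apply the same tree to the 50% of cases held out for testing:
plot(titanic.pred.fft, data = "test", tree = 1)
```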
Figure 7: Plotting the best FFT on test data.
We could have visualized the same tree by asking for `plot(titanic.pred.fft, data = "test", tree = "best.test")`. Note that the label of the bottom panel has now switched from "Accuracy (Training)" to "Accuracy (Testing)". Both the sensitivity and specificity values have decreased somewhat, which is typical when using a model (fitted on training data) for predicting new (test) data.
Let's visualize the prediction performance of Tree #2, the most liberal tree (i.e., the one with the highest sensitivity):
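This presumably corresponds to:

```r
# Prediction performance of Tree #2 (the most liberal tree):
plot(titanic.pred.fft, data = "test", tree = 2)
```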
Figure 8: Plotting Tree #2.
This alternative tree has a better sensitivity (of 63%), but its overall accuracy decreased to about baseline level (of 67%).
Whereas comparing training with test performance illustrates the trade-offs between mere fitting and genuine predictive modeling, comparing the performance details of various FFTs illustrates the typical trade-offs that any model for solving binary classification problems engages in. Importantly, both types of trade-offs are rendered transparent when using FFTrees.
Here is a complete list of the vignettes available in the FFTrees package:
|   | Vignette | Description |
|---|----------|-------------|
|   | Main guide: FFTrees overview | An overview of the FFTrees package |
| 1 | Tutorial: FFTs for heart disease | An example of using `FFTrees()` to model heart disease diagnosis |
| 2 | Accuracy statistics | Definitions of accuracy statistics used throughout the package |
| 3 | Creating FFTs with `FFTrees()` | Details on the main `FFTrees()` function |
| 4 | Manually specifying FFTs | How to directly create FFTs without using the built-in algorithms |
| 5 | Visualizing FFTs | Plotting `FFTrees` objects, from full trees to icon arrays |
| 6 | Examples of FFTs | Examples of FFTs from different datasets contained in the package |