Type: Package
Title: Graphical User Interface for Data Science in R
Version: 5.5.1
Date: 2022-03-20
Depends: R (≥ 3.5.0), tibble, bitops
Imports: stats, utils, ggplot2, grDevices, graphics, magrittr, methods, stringi, stringr, tidyr, dplyr, XML, rpart.plot
Suggests: pmml (≥ 1.2.13), colorspace, ada, amap, arules, arulesViz, biclust, cairoDevice, cba, cluster, corrplot, descr, doBy, e1071, ellipse, fBasics, foreign, fpc, gdata, ggdendro, gplots, grid, gridExtra, gtools, Hmisc, janitor, kernlab, Matrix, mice, nnet, party, plyr, psych, RGtk2, randomForest, RColorBrewer, readxl, reshape, ROCR, RODBC, rpart, scales, SnowballC, survival, timeDate, tm, xgboost
Description: The R Analytic Tool To Learn Easily (Rattle) provides a collection of utility functions for the data scientist. A Gnome (RGtk2) based graphical interface is included with the aim to provide a simple and intuitive introduction to R for data science, allowing a user to quickly load data from a CSV file (or via ODBC), transform and explore the data, build and evaluate models, and export models as PMML (predictive modelling markup language) or as scores. A key aspect of the GUI is that all R commands are logged and commented through the log tab. This can be saved as a standalone R script file and as an aid for the user to learn R or to copy-and-paste directly into R itself. Note that RGtk2 and cairoDevice have been archived on CRAN. See https://rattle.togaware.com for installation instructions.
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
LazyLoad: yes
LazyData: yes
URL: https://rattle.togaware.com/
NeedsCompilation: no
Packaged: 2022-03-20 00:54:54 UTC; gjw
Author: Graham Williams [aut, cph, cre], Mark Vere Culp [cph], Ed Cox [ctb], Anthony Nolan [ctb], Denis White [cph], Daniele Medri [ctb], Akbar Waljee [ctb] (OOB AUC for Random Forest), Brian Ripley [cph] (print.summary.nnet), Jose Magana [ctb] (ggpairs plots), Surendra Tipparaju [ctb] (initial RevoScaleR/XDF), Durga Prasad Chappidi [ctb] (initial RevoScaleR/XDF), Dinesh Manyam Venkata [ctb] (initial RevoScaleR/XDF), Mrinal Chakraborty [ctb] (initial RevoScaleR/XDF), Fang Zhou [ctb] (initial xgboost), Cameron Chisholm [ctb] (risk plot on risk chart)
Maintainer: Graham Williams <Graham.Williams@togaware.com>
Repository: CRAN
Date/Publication: 2022-03-21 13:10:02 UTC

Generate the audit dataset.

Description

Rattle uses an artificial dataset for demonstration purposes. This function retrieves the source data https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data and then transforms the data in a variety of ways.

Usage

acquireAuditData(write.to.file=FALSE)

Arguments

write.to.file

Whether to generate a collection of files based on the data. The files generated include: audit.csv, audit.Rdata, audit.arf, and audit_missing.csv.

Details

See the function definition for details of the processing done on the data downloaded from the UCI repository.

Value

By default the function returns a data frame containing the audit dataset. If write.to.file is TRUE then the data frame is returned invisibly.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

See Also

audit, rattle.


List the rules corresponding to the rpart decision tree

Description

Display a list of rules for an rpart decision tree.

Usage

asRules(model, compact=FALSE, ...)

Arguments

model

an rpart model.

compact

whether to list categoricals compactly.

...

further arguments passed to or from other methods.

Details

Traverse a decision tree to generate the equivalent set of rules, one rule for each path from the root node to a leaf node.
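The idea of one rule per root-to-leaf path can be sketched with the rpart package alone. This is a minimal illustration, not the asRules implementation: it assumes rpart is installed and uses rpart's own path.rpart() to collect the conditions along each path. The iris data and the variable names (fit, leaves, rules) are chosen only for the example.

```r
library(rpart)

# Fit a small classification tree on a built-in dataset.
fit <- rpart(Species ~ ., data = iris)

# Leaf nodes are the rows of fit$frame whose split variable is "<leaf>".
leaves <- as.numeric(rownames(fit$frame)[fit$frame$var == "<leaf>"])

# path.rpart() returns, for each requested node, the sequence of
# conditions on the path from the root to that node.
paths <- path.rpart(fit, nodes = leaves, print.it = FALSE)

# Turn each path into one rule; the first element of a path is "root".
rules <- vapply(paths, function(p) paste(p[-1], collapse = " AND "),
                character(1))

for (i in seq_along(rules)) {
  cat("Rule for leaf node", leaves[i], ":", rules[i], "\n")
}
```

The output format of asRules itself differs; the sketch only shows where the rule conditions come from.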

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

Examples

## Not run: asRules.rpart(my.rpart)

List the rules corresponding to the rpart decision tree

Description

Display a list of rules for an rpart decision tree.

Usage

## S3 method for class 'rpart'
asRules(model, compact=FALSE, classes=NULL, ...)

Arguments

model

an rpart model.

compact

whether to list categoricals compactly (default FALSE).

classes

which target classes should be listed (default all).

...

further arguments passed to or from other methods.

Details

Traverse a decision tree to generate the equivalent set of rules, one rule for each path from the root node to a leaf node.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

Examples

## Not run: asRules.rpart(my.rpart)

Sample dataset to illustrate Rattle functionality.

Description

The audit dataset is an artificially constructed dataset that has some of the characteristics of a true financial audit dataset for modelling productive and non-productive audits of a person's financial statement. A productive audit is one which identifies errors or inaccuracies in the information provided by a client. A non-productive audit is usually an audit which found all supplied information to be in order.

The audit dataset is used to illustrate binary classification. The target variable is identified as TARGET_Adjusted.

The dataset is quite small, consisting of just 2000 entities. Its primary purpose is to illustrate modelling in Rattle, so a minimally sized dataset is suitable.

The dataset itself is derived from publicly available data (which has nothing to do with audits).

Format

A data frame. In line with data mining terminology we refer to the rows of the data frame (or the observations) as entities. The columns are referred to as variables. The entities represent people in this case. We describe the variables here:

ID

This is a unique identifier for each person.

Age

The age.

Employment

The type of employment.

Education

The highest level of education.

Marital

Current marital status.

Occupation

The type of occupation.

Income

The amount of income declared.

Gender

The person's gender.

Deductions

Total amount of expenses that a person claims in their financial statement.

Hours

The average hours worked on a weekly basis.

IGNORE_Accounts

The main country in which the person has most of their money banked. Note that the variable name is prefixed with IGNORE. This is recognised by Rattle as the default role for this variable.

RISK_Adjustment

This variable records the monetary amount of any adjustment to the person's financial claims as a result of a productive audit. This variable, which should not be treated as an input variable, is thus a measure of the size of the risk associated with the person.

TARGET_Adjusted

The target variable for modelling (generally for classification modelling). This is a numeric field of class integer, but limited to 0 and 1, indicating non-productive and productive audits, respectively. Productive audits are those that result in an adjustment being made to a client's financial statement.


Perform binning over numeric data

Description

Perform binning.

Usage

binning(x, bins=4, method=c("quantile", "wtd.quantile", "kmeans"),
        labels=NULL, ordered=TRUE, weights=NULL)

Arguments

x

the numeric data to bin.

bins

the number of bins to use.

method

whether to use "quantile", weighted quantile "wtd.quantile" or "kmeans" binning.

labels

the labels or names to use for each of the bins.

ordered

whether to build an ordered factor or not.

weights

vector of numeric weights for each observation for weighted quantile binning.

Details

Bin the provided numeric data into the specified number of bins using one of the supported methods. The bins will have the names specified by labels, if supplied. The result can optionally be an ordered factor.

Value

A factor is returned.
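For readers unfamiliar with quantile binning, the following base R sketch shows the general technique using quantile() and cut(). It is an illustration under stated assumptions, not the code behind binning(): the sample vector x and the names breaks and binned are made up for the example, and it assumes the quantile break points are all distinct (cut() fails on duplicated breaks).

```r
# Illustrative data; any numeric vector without heavy ties works.
x <- c(3, 8, 1, 15, 22, 7, 9, 30, 12, 5, 18, 25)
bins <- 4

# Break points at the 0%, 25%, 50%, 75% and 100% quantiles.
breaks <- quantile(x, probs = seq(0, 1, length.out = bins + 1))

# Assign each value to its bin, producing an ordered factor.
binned <- cut(x, breaks = breaks, include.lowest = TRUE,
              ordered_result = TRUE)

# Quantile binning aims for roughly equal counts per bin.
print(table(binned))
```

Passing a character vector to the labels argument of cut() would name the bins, which mirrors the role of labels described above.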

Author(s)

Daniele Medri and Graham Williams

References

Package home page:https://rattle.togaware.com


Generate a frequency count of the initial digits

Description

In the context of Benford's Law calculate the distribution of the frequencies of the first digit of the numbers supplied as the argument.

Usage

calcInitialDigitDistr(l, digit=1, len=1, sp=c("none", "positive", "negative"))

Arguments

l

a vector of numbers.

digit

the digit to generate frequencies for.

len

The number of digits.

sp

whether and how to split the digits.
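As background, a first-digit frequency count and the Benford reference distribution can be sketched in base R as follows. This is a simplified illustration rather than the calcInitialDigitDistr code: the helper name first_digit_counts is hypothetical, only the leading digit is handled (no digit, len or sp options), and the input numbers are invented.

```r
# Count the leading significant digit (1 to 9) of each number.
first_digit_counts <- function(v) {
  v <- abs(v[is.finite(v) & v != 0])
  # Scientific notation puts the leading significant digit first.
  d <- as.integer(substr(sprintf("%.15e", v), 1, 1))
  table(factor(d, levels = 1:9))
}

observed <- first_digit_counts(c(123, 19, 1.5, 250, 31, 987))

# Benford's Law: P(first digit = d) = log10(1 + 1/d).
expected <- log10(1 + 1 / (1:9))

comparison <- rbind(observed = as.vector(observed) / sum(observed),
                    benford  = expected)
colnames(comparison) <- 1:9
print(round(comparison, 3))
```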

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com


Determine area under a curve (e.g. a risk or recall curve) of a risk chart

Description

Given the evaluation returned by evaluateRisk, for example, calculate the area under the risk or recall curves, to use as a metric to compare the performance of a model.

Usage

calculateAUC(x, y)

Arguments

x

a vector of values for the x points.

y

a vector of values for the y points.

Details

The area is returned.
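One common way to compute such an area from a set of points is the trapezoid rule. The sketch below is an assumption about the general approach, not a statement of how calculateAUC is implemented; the function name trapezoid_auc is hypothetical, and the data are the imitation values used in the Examples for this function.

```r
# Area under a piecewise-linear curve through the points (x, y).
trapezoid_auc <- function(x, y) {
  o <- order(x)                      # integrate from left to right
  x <- x[o]
  y <- y[o]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

caseload <- c(1.0, 0.8, 0.6, 0.4, 0.2, 0)
recall   <- c(1.0, 0.95, 0.80, 0.75, 0.5, 0.0)
risk     <- c(1.0, 0.98, 0.90, 0.77, 0.30, 0.0)

recall_area <- trapezoid_auc(caseload, recall)
risk_area   <- trapezoid_auc(caseload, risk)
print(c(recall = recall_area, risk = risk_area))
```

A diagonal baseline from (0, 0) to (1, 1) has area 0.5 under this rule, so areas above 0.5 indicate a curve that sits above the baseline on balance.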

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

See Also

evaluateRisk.

Examples

## this is usually used in the context of the evaluateRisk function
## Not run: ev <- evaluateRisk(predicted, actual, risk)

## imitate this output here
ev <- data.frame(Caseload=c(1.0, 0.8, 0.6, 0.4, 0.2, 0),
                 Precision=c(0.15, 0.18, 0.21, 0.25, 0.28, 0.30),
                 Recall=c(1.0, 0.95, 0.80, 0.75, 0.5, 0.0),
                 Risk=c(1.0, 0.98, 0.90, 0.77, 0.30, 0.0))

## Calculate the areas under the Risk and the Recall curves.
calculateAUC(ev$Caseload, ev$Risk)
calculateAUC(ev$Caseload, ev$Recall)

List Cluster Centers for a Hierarchical Cluster

Description

Generate a matrix of centers from a hierarchical cluster.

Usage

centers.hclust(x, object, nclust=10, use.median=FALSE)

Arguments

x

The data used to build the cluster.

object

A hclust object.

nclust

Number of clusters.

use.median

Use median instead of mean.

Details

For the specified number of clusters, cut the hierarchical cluster appropriately to that number of clusters, and return the mean (or median) of each resulting cluster.
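The cut-then-summarise procedure described above can be approximated with functions from the stats package. This is a sketch of the idea, not the centers.hclust source; the iris measurements, the choice of three clusters and the variable names are assumptions made for illustration.

```r
# Numeric data used to build the hierarchical cluster.
x <- as.matrix(iris[, 1:4])
hc <- hclust(dist(x))

# Cut the dendrogram into the requested number of clusters.
nclust <- 3
groups <- cutree(hc, k = nclust)

# One row per cluster, one column per variable, holding the means.
# Replace mean with median to mimic use.median=TRUE.
centers <- apply(x, 2, function(col) tapply(col, groups, mean))
print(round(centers, 2))
```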

Author(s)

Daniele Medri and Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com


Echo data in a human readable form.

Description

Format data in the most appropriate human readable form.

Usage

comcat(x, ...)

Arguments

x

object.

...

additional arguments passed on to format.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

Examples

  comcat(dim(iris))

Draw nodes of a decision tree

Description

Draw the nodes of a decision tree

Usage

drawTreeNodes(tree, cex = par("cex"), pch = par("pch"),
              size = 4 * cex, col = NULL, nodeinfo = FALSE,
              units = "", cases = "obs",
              digits = getOption("digits"),
              decimals = 2,
              print.levels = TRUE, new = TRUE)

Arguments

tree

an rpart decision tree.

cex

.

pch

.

size

.

col

.

nodeinfo

.

units

.

cases

.

digits

.

decimals

the number of decimal digits to include in numeric split nodes.

print.levels

.

new

.

Details

A variation of draw.tree() from the maptree package.

Author(s)

Graham.Williams@togaware.com, Denis White

References

Package home page:https://rattle.togaware.com

Examples

## this is usually used in the context of the plotRisk function
## Not run: drawTreeNodes(rpart(Species ~ ., iris))

Draw trees from an Ada model

Description

Using the Rattle drawTreeNodes, draw a selection of Ada trees.

Usage

drawTreesAda(model, trees=0, title="")

Arguments

model

an ada model.

trees

The list of trees to draw. Use 0 to draw all trees.

title

An optional title to add.

Details

Using Rattle's drawTreeNodes underneath, a plot for each of the specified trees from an Ada model will be displayed.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

Examples

## Not run: drawTreesAda(ds.ada)

Generate an error matrix from actual and predicted data.

Description

An error matrix reports the true/false positive/negative rates.

Usage

errorMatrix(actual,
            predicted,
            percentage=TRUE,
            digits=ifelse(percentage,1,3),
            count=FALSE)

Arguments

actual

a vector of true values.

predicted

a vector of predicted values.

percentage

return percentages.

digits

the number of digits to round results.

count

return counts.
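To make the notion of an error matrix concrete, here is a base R sketch that cross-tabulates actual against predicted classes and expresses the cells as percentages. It illustrates the concept only; the exact layout returned by errorMatrix (for example any per-class error column) is not reproduced, and the vectors below are invented.

```r
# Invented binary outcomes and predictions for eight cases.
actual    <- c(0, 0, 1, 1, 1, 0, 1, 0)
predicted <- c(0, 1, 1, 1, 0, 0, 1, 0)

# Raw counts: rows are actual classes, columns are predicted classes.
counts <- table(Actual = actual, Predicted = predicted)

# The same matrix as percentages of all cases, rounded to 1 digit.
percentages <- round(100 * counts / sum(counts), 1)

# Proportion of each actual class that was misclassified.
class_error <- 1 - diag(counts) / rowSums(counts)

print(counts)
print(percentages)
print(class_error)
```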

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

Examples

  ## Not run: errorMatrix(model)

Summarise the performance of a data mining model

Description

By taking predicted values, actual values, and measures of the risk associated with each case, generate a summary that groups the distinct predicted values, calculating the accumulative percentage Caseload, Recall, Risk, Precision, and Measure.

Usage

evaluateRisk(predicted, actual, risks)

Arguments

predicted

a numeric vector of probabilities (between 0 and 1) representing the probability of each entity being a 1.

actual

a numeric vector of classes (0 or 1).

risks

a numeric vector of risk (e.g., dollar amounts) associated with each entity that has an actual of 1.
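The cumulative quantities named in the Description can be sketched as follows. This is a simplified reading of the idea, not the evaluateRisk algorithm: it assumes every predicted value is distinct (so no grouping of ties is needed), omits the Measure column, and uses invented data and a hypothetical result name, summary_tbl.

```r
# Invented scores, outcomes and dollar risks for five entities.
predicted <- c(0.9, 0.8, 0.7, 0.4, 0.2)
actual    <- c(1, 1, 0, 1, 0)
risks     <- c(500, 300, 0, 200, 0)

# Work from the highest probability down, as a cutoff would.
o <- order(predicted, decreasing = TRUE)
n <- seq_along(o)

summary_tbl <- data.frame(
  Cutoff    = predicted[o],
  Caseload  = n / length(o),                    # share of entities covered
  Recall    = cumsum(actual[o]) / sum(actual),  # share of positives found
  Risk      = cumsum(risks[o]) / sum(risks),    # share of total risk found
  Precision = cumsum(actual[o]) / n             # positives among covered
)
print(summary_tbl)
```

Each row answers the question: if every entity scoring at or above this cutoff were reviewed, what fraction of the workload, the positive cases and the risk would be covered?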

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

See Also

plotRisk.

Examples

## simulate the data that is typical in data mining

## we often have only a small number of positive known cases
cases <- 1000
actual <- as.integer(rnorm(cases) > 1)
adjusted <- sum(actual)
nfa <- cases - adjusted

## risks might be dollar values associated with adjusted cases
risks <- rep(0, cases)
risks[actual==1] <- round(abs(rnorm(adjusted, 10000, 5000)), 2)

## our models will generate a probability of a case being a 1
predicted <- rep(0.1, cases)
predicted[actual==1] <- predicted[actual==1] + rnorm(adjusted, 0.3, 0.1)
predicted[actual==0] <- predicted[actual==0] + rnorm(nfa, 0.1, 0.08)
predicted <- signif(predicted)

## call upon evaluateRisk to generate performance summary
ev <- evaluateRisk(predicted, actual, risks)

## have a look at the first few and last few
head(ev)
tail(ev)

## the performance is usually presented as a Risk Chart
## under the CRAN MS/Windows this causes a problem, so don't run for now
## Not run: plotRisk(ev$Caseload, ev$Precision, ev$Recall, ev$Risk)

A wrapper for plotting rpart trees using prp

Description

Plots a fancy RPart decision tree using the pretty rpart plotter.

Usage

fancyRpartPlot(model, main="", sub, caption, palettes, type=2, ...)

Arguments

model

an rpart object.

main

title for the plot.

sub

sub title for the plot. The default is a Rattle string with date, time and username.

caption

caption for bottom right of plot.

palettes

a list of sequential palettes names. As supported by RColorBrewer::brewer.pal the available names are Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu Reds YlGn YlGnBu YlOrBr YlOrRd.

type

the type of plot to generate (2).

...

additional arguments passed on to prp.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

Examples

## Use rpart to build a decision tree.
## Not run: 
library(rpart)

## Set up the data for modelling.
set.seed(42)
ds     <- weather
target <- "RainTomorrow"
risk   <- "RISK_MM"
ignore <- c("Date", "Location", risk)
vars   <- setdiff(names(ds), ignore)
nobs   <- nrow(ds)
form   <- formula(paste(target, "~ ."))
train  <- sample(nobs, 0.7*nobs)
test   <- setdiff(seq_len(nobs), train)
actual <- ds[test, target]
risks  <- ds[test, risk]

# Fit the model.
fit <- rpart(form, data=ds[train, vars])

## Plot the model.
fancyRpartPlot(fit)

## Choose different colours.
fancyRpartPlot(fit, palettes=c("Greys", "Oranges"))

## Add a main title to the plot.
fancyRpartPlot(fit, main=target)

## End(Not run)

Generate a string to add a title to a plot

Description

Generate a string that is intended to be eval'd that will add a title and sub-title to a plot. The string is a call to title, supplying the given arguments, pasted together, as the main title, and generating a sub-title that begins with ‘Rattle’ and continues with the current date and time, and finishes with the current user's username. This is used internally in Rattle to adorn a plot with relevant information, but may be useful outside of Rattle.

Usage

genPlotTitleCmd(..., vector=FALSE)

Arguments

...

one or more strings that will be pasted together to form the main title.

vector

whether to return a vector as the result.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

See Also

eval, title, plotRisk.

Examples

# generate some random plot
plot(rnorm(100))

# generate the string representing the command to add titles
tl <- genPlotTitleCmd("Sample Plot of", "No Particular Importance")

# cause the string to be executed as an R command
eval(parse(text=tl))

Plot variable importance for a model.

Description

Plot the variable importance of a fitted model using ggplot2.

Usage

ggVarImp(model, ...)

Arguments

model

object.

...

arguments passed on.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

Examples

## Not run: ggVarImp(model)

List the variables used by an adaboost model

Description

Returns a list of the variables used and their frequencies.

Usage

listAdaVarsUsed(model)

Arguments

model

an ada model.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com


List trees from an Ada model

Description

Display the textual representation of a selection of Ada trees.

Usage

listTreesAda(model, trees=0)

Arguments

model

an ada model.

trees

The list of trees to list. Use 0 to list all trees.

Details

Using rpart's print method display each of the specified trees from an Ada model.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

Examples

## Not run: listTreesAda(ds.ada)

Versions of Installed Packages

Description

Generate a list of packages installed and their version number.

Usage

listVersions(file="", ...)

Arguments

file

a character string naming a file or a connection open for writing. "" indicates output to the console.

...

arguments to write.csv.

Details

This function is useful in reporting problems or bugs, to ensure there is a clear match of R package versions between the system exhibiting the issue and the test system replicating the issue.

By default the information is written to the console in a comma separated form, that is ideally designed to be written to a CSV file for emailing.
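A comparable listing can be produced with base R alone, which may help when rattle itself is not installed on the system being diagnosed. The sketch below is an approximation of the behaviour described above, not the listVersions source; the variable name pkgs is arbitrary.

```r
# Name and version of every package visible in the library paths.
pkgs <- installed.packages()[, c("Package", "Version"), drop = FALSE]

# file = "" writes comma-separated output to the console; supply a
# file name instead to save the listing for emailing.
write.csv(pkgs, file = "", row.names = FALSE)
```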

Author(s)

Graham.Williams@togaware.com

See Also

write.csv


Calculate the mode of a vector, array or list.

Description

The mode is the most common or modal value of a list.

Usage

modalvalue(x, na.rm=FALSE)

Arguments

x

A vector, array or list.

na.rm

Whether to remove missing values.

Details

This function calculates the mode of a vector, array or list (lists are flattened). This code originated from an anonymous post on the R Wiki.
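A typical base R idiom for the most common value is shown below. It is offered as a sketch of the technique, not as the modalvalue source; the function name modal_value is hypothetical, and when several values tie for most frequent this version returns the one that appears first.

```r
modal_value <- function(x, na.rm = FALSE) {
  x <- unlist(x)                     # flatten lists
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)
  # Count occurrences of each distinct value and pick the largest.
  ux[which.max(tabulate(match(x, ux)))]
}

print(modal_value(c(1, 2, 2, 3)))
print(modal_value(list(1, c(3, 3), 3)))
print(modal_value(c(NA, NA, 1), na.rm = TRUE))
```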


Plot three lines on a risk chart, one vertical and two horizontal

Description

Plots a vertical line at x up to max of y1 and y2, then horizontal from this line at y1 and y2. Intended for plotting on a plotRisk.

Usage

plotOptimalLine(x, y1, y2, pr = NULL, colour = "plum", label = NULL)

Arguments

x

location of vertical line.

y1

location of one horizontal line.

y2

location of other horizontal line.

pr

Print a percentage at this point.

colour

of the line.

label

at bottom of line.

Details

Intended to plot an optimal line on a Risk Chart as plotted by plotRisk.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

See Also

plotRisk.

Examples

## this is usually used in the context of the plotRisk function
## Not run: ev <- evaluateRisk(predicted, actual, risk)

## imitate this output here
ev <- NULL
ev$Caseload  <- c(1.0, 0.8, 0.6, 0.4, 0.2, 0)
ev$Precision <- c(0.15, 0.18, 0.21, 0.25, 0.28, 0.30)
ev$Recall    <- c(1.0, 0.95, 0.80, 0.75, 0.5, 0.0)
ev$Risk      <- c(1.0, 0.98, 0.90, 0.77, 0.30, 0.0)

## plot the Risk Chart
plotRisk(ev$Caseload, ev$Precision, ev$Recall, ev$Risk,
         chosen=60, chosen.label="Pr=0.45")

## plot the optimal point
plotOptimalLine(40, 77, 75, colour="maroon")

Plot a risk chart

Description

Plots a Rattle Risk Chart. Such a chart has been developed in a practical context to present the performance of data mining models to clients, plotting a caseload against performance, allowing a client to see the tradeoff between coverage and performance.

Usage

plotRisk(cl, pr, re, ri = NULL, title = NULL,
    show.legend = TRUE, xleg = 60, yleg = 55,
    optimal = NULL, optimal.label = "", chosen = NULL, chosen.label = "",
    include.baseline = TRUE, dev = "", filename = "", show.knots = NULL,
    show.lift=TRUE, show.precision=TRUE,
    risk.name = "Risk", recall.name = "Recall",
    precision.name = "Precision")

Arguments

cl

a vector of caseloads corresponding to different probability cutoffs. Can be either percentages (between 0 and 100) or fractions (between 0 and 1).

pr

a vector of precision values for each probability cutoff. Can be either percentages (between 0 and 100) or fractions (between 0 and 1).

re

a vector of recall values for each probability cutoff. Can be either percentages (between 0 and 100) or fractions (between 0 and 1).

ri

a vector of risk values for each probability cutoff. Can be either percentages (between 0 and 100) or fractions (between 0 and 1).

title

the main title to place at the top of the plot.

show.legend

whether to display the legend in the plot.

xleg

the x coordinate for the placement of the legend.

yleg

the y coordinate for the placement of the legend.

optimal

a caseload (percentage or fraction) that represents an optimal performance point which is also plotted. If instead the value is TRUE then the optimal point is identified internally (maximum value for (recall-caseload)+(risk-caseload)) and plotted.

optimal.label

a string which is added to label the line drawn as the optimal point.

chosen

a caseload (percentage or fraction) that represents a user chosen optimal performance point which is also plotted.

chosen.label

a string which is added to label the line drawn as the chosen point.

include.baseline

if TRUE (the default) then display the diagonal baseline.

dev

a string which, if supplied, identifies a device type as the target for the plot. This might be one of wmf (for generating a Windows Metafile, but only available on MS/Windows), pdf, or png.

filename

a string naming a file. If dev is not given then the filename extension is used to identify the image format as one of those recognised by the dev argument.

show.knots

a vector of caseload values at which a vertical line should be drawn. These might correspond, for example, to individual paths through a decision tree, illustrating the impact of each path on the caseload and performance.

show.lift

whether to label the right axis with lift.

show.precision

whether to show the precision plot.

risk.name

a string used within the plot's legend that gives a name to the risk. Often the risk is a dollar amount at risk from a fraud or from a bank loan point of view, so a name such as Revenue may be appropriate. The default shown in Usage is "Risk".

recall.name

a string used within the plot's legend that gives a name to the recall. The recall is often the percentage of cases that are positive hits, and in practice these might correspond to known cases of fraud or reviews where some adjustment to perhaps an income tax return or application for credit had to be made on reviewing the case, so a name such as Adjustments may be appropriate. The default shown in Usage is "Recall".

precision.name

a string used within the plot's legend that gives a name to the precision. A common name for precision is Strike Rate. The default shown in Usage is "Precision".

Details

Caseload is the percentage of the entities in the dataset covered by the model at a particular probability cutoff, so that with a cutoff of 0, all (100%) of the entities are covered by the model. With a cutoff of 1, none (0%) of the entities are covered by the model. A diagonal line is drawn to represent a baseline random performance. Then the percentage of positive cases (the recall) covered for a particular caseload is plotted, and optionally a measure of the percentage of the total risk that is also covered for a particular caseload may be plotted. Such a chart allows a user to select an appropriate tradeoff between caseload and performance. The charts are similar to ROC curves. The precision (i.e., strike rate) is also plotted.
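The criterion stated for the optimal argument, the maximum of (recall-caseload)+(risk-caseload), can be evaluated by hand. The sketch below applies it to the imitation data from the Examples; it illustrates the formula only and makes no claim about how plotRisk locates the point internally. The variable names are arbitrary.

```r
caseload <- c(1.0, 0.8, 0.6, 0.4, 0.2, 0)
recall   <- c(1.0, 0.95, 0.80, 0.75, 0.5, 0.0)
risk     <- c(1.0, 0.98, 0.90, 0.77, 0.30, 0.0)

# Gain of recall and of risk over the diagonal baseline, summed.
score <- (recall - caseload) + (risk - caseload)

# The caseload at which the combined gain is largest.
best <- which.max(score)
optimal_caseload <- caseload[best]

print(data.frame(caseload, recall, risk, score))
print(optimal_caseload)
```

For these values the largest combined gain falls at a caseload of 0.4, where recall is 0.75 and risk is 0.77, which matches the point drawn by plotOptimalLine(40, 77, 75) in the plotOptimalLine example.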

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

See Also

evaluateRisk, genPlotTitleCmd.

Examples

## this is usually used in the context of the evaluateRisk function
## Not run: ev <- evaluateRisk(predicted, actual, risk)

## imitate this output here
ev <- NULL
ev$Caseload  <- c(1.0, 0.8, 0.6, 0.4, 0.2, 0)
ev$Precision <- c(0.15, 0.18, 0.21, 0.25, 0.28, 0.30)
ev$Recall    <- c(1.0, 0.95, 0.80, 0.75, 0.5, 0.0)
ev$Risk      <- c(1.0, 0.98, 0.90, 0.77, 0.30, 0.0)

## plot the Risk Chart
plotRisk(ev$Caseload, ev$Precision, ev$Recall, ev$Risk,
         chosen=60, chosen.label="Pr=0.45")

## Add a title
eval(parse(text=genPlotTitleCmd("Sample Risk Chart")))

Print a representation of the Random Forest models to the console

Description

A randomForest model, by default, consists of 500 decision trees. This function walks through each tree and generates a set of rules which are printed to the console. This takes a considerable amount of time and is provided for users to access the actual model, but it is not yet used within the Rattle GUI. It may be used to display the output of the RF (but it takes longer to generate than the model itself!). Or it might only be used on export to PMML or SQL.

Usage

printRandomForests(model, models=NULL, include.class=NULL, format="")

Arguments

model

a randomForest model.

models

a list of integers limiting the models in MODEL that are displayed.

include.class

limit the output to the specific class.

format

possible values are "VB".

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

Examples

## Display a ruleset for a specific model amongst the 500.
## Not run: printRandomForests(rfmodel, 5)
## Display a ruleset for specific models amongst the 500.
## Not run: printRandomForests(rfmodel, c(5,10,15))
## Display a ruleset for each of the 500 models.
## Not run: printRandomForests(rfmodel)

Generate accessible data structure of a randomForest model

Description

A randomForest model, by default, consists of 500 decision trees. This function walks through each tree and generates a set of rules. This takes a considerable amount of time and is provided for users to access the actual model, but it is not yet used within the Rattle GUI. It may be used to display the output of the RF (but it takes longer to generate than the model itself!). Or it might only be used on export to PMML or SQL.

Usage

randomForest2Rules(model, models=NULL)

Arguments

model

a randomForest model.

models

a list of integers limiting the models in MODEL that are converted.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

Examples

## Generate a ruleset for a specific model amongst the 500.
## Not run: randomForest2Rules(rfmodel, 5)
## Generate a ruleset for specific models amongst the 500.
## Not run: randomForest2Rules(rfmodel, c(5,10,15))
## Generate a ruleset for each of the 500 models.
## Not run: randomForest2Rules(rfmodel)

Display the Rattle User Interface

Description

The Rattle user interface uses the RGtk2 package to present an intuitive point and click interface for data mining, extensively building on the excellent collection of R packages by very many authors for data manipulation, exploration, analysis, and evaluation.

Usage

rattle(csvname=NULL, dataset=NULL, useGtkBuilder=TRUE)

Arguments

csvname

the optional name of a CSV file to load into Rattle on startup.

dataset

The optional name as a character string of a dataset to load into Rattle on startup.

useGtkBuilder

if not supplied then automatically determine whether to use the new GtkBuilder rather than the deprecated libglade. A user can override the heuristic choice with TRUE or FALSE.

Details

Refer to the Rattle home page in the URL below for a growing reference manual for using Rattle.

Whilst the underlying functionality of Rattle is built upon a vast collection of other R packages, Rattle itself provides a collection of utility functions used within Rattle. These are made available through loading the rattle package into your R library. The See Also section lists these utility functions that may be useful outside of Rattle.

Rattle can initialise some options using a .Rattle file in the folder in which Rattle is started. The currently supported options are .RATTLE.DATA, .RATTLE.SCORE.IN, and .RATTLE.SCORE.OUT.

If the environment variable RATTLE_DATA is defined then that is set as the default CSV file name to load. Otherwise, if .RATTLE.DATA is defined then that will be used as the CSV file to load. Otherwise, if csvname is provided then that will be used.
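The precedence just described (environment variable first, then the .RATTLE.DATA option, then the csvname argument) can be expressed as a small lookup. The function below is purely illustrative: resolve_csv and its envir parameter are hypothetical names, and the way Rattle itself reads the .Rattle file is not shown.

```r
resolve_csv <- function(csvname = NULL, envir = globalenv()) {
  # 1. The RATTLE_DATA environment variable wins if it is set.
  from_env <- Sys.getenv("RATTLE_DATA", unset = "")
  if (nzchar(from_env)) return(from_env)

  # 2. Otherwise use .RATTLE.DATA if that option has been defined.
  if (exists(".RATTLE.DATA", envir = envir, inherits = FALSE)) {
    return(get(".RATTLE.DATA", envir = envir))
  }

  # 3. Otherwise fall back to the csvname argument (possibly NULL).
  csvname
}

# Demonstrate the three cases using a scratch environment.
Sys.unsetenv("RATTLE_DATA")
opts <- new.env()
print(resolve_csv("from_argument.csv", envir = opts))

assign(".RATTLE.DATA", "from_option.csv", envir = opts)
print(resolve_csv("from_argument.csv", envir = opts))

Sys.setenv(RATTLE_DATA = "from_envvar.csv")
print(resolve_csv("from_argument.csv", envir = opts))
Sys.unsetenv("RATTLE_DATA")
```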

Two environments are exported by Rattle, capturing the current rattle state (crs) and the current rattle variables (crv).

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

See Also

evaluateRisk, genPlotTitleCmd, plotRisk.

Examples

# You can start rattle with a path to a csv file to pre-specify the
# dataset. You then need to click Execute to load the data.
## Not run: rattle(system.file("csv", "weather.csv", package = "rattle"))

Print information about a multinomial model

Description

Displays a textual review of the performance of a multinom model.

Usage

rattle.print.summary.multinom(x, digits = x$digits, ...)

Arguments

x

An rpart object.

digits

Number of digits to print for numbers.

...

Other arguments.

Details

Print a summary of a multinom model. This is simply a modification of the print.summary.multinom function to add the number of entities.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com


Extract Rattle and related package information.

Description

Display system information, including versions of Rattle and R, operating system, and versions of other packages used by Rattle. Useful for reporting bugs but also invisibly returns a list of packages that have updates available and can be passed to install.packages().

Usage

rattleInfo(all.dependencies=FALSE,
           include.not.installed=FALSE,
           include.not.available=FALSE,
           include.libpath=FALSE)

Arguments

all.dependencies

If TRUE then check the full dependency graph for Rattle and list all of those packages (which may take quite a few seconds to compute), or else just list those key packages that Rattle Depends on and Suggests.

include.not.installed

If TRUE then make mention of any packages that are not installed, but are available.

include.not.available

If TRUE then make mention of any packages that are not available from CRAN.

include.libpath

If TRUE then list the library location where each package is installed.

Details

This is a support function to list useful information to provide the developers with information about the system environment when running Rattle. It is intended to provide the information that is useful in reporting bugs.

It also lists the currently installed version of a number of packages that Rattle makes use of as well as checking for any updates available for those packages.

If updates are found then a command is generated and printed so that a user can simply copy and paste the command to update the relevant packages. The function also invisibly returns the list of packages that can be updated, so that we can do something like: install.packages(rattleInfo()).

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

See Also

rattle.


Internal Rattle user interface callbacks.

Description

These are exported from the package so that the GUI can access the callbacks. They should otherwise be ignored.

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com


Transform a numeric vector by grouping it according to the values of the supplied factor and then rescaling within the groups.

Description

The numeric vector is remapped to integers from 0 to max-1, with any missing values mapped to the midpoint. Original idea from Tony Nolan. This will eventually be generalised to do the remapping using any of the rescaling functions.

Usage

rescale.by.group(x, by=NULL, type = "irank", itop = 100)

Arguments

x

The numeric vector to rescale.

by

A factor of the same length as x used to define the groups.

type

The type of rescaling to perform.

itop

For an integer remapping this is the number of groups, so that the numeric values are mapped to the integers from 0 to (max-1).

Details

This Rattle support function, which is also useful by itself, provides a simple mechanism to rescale a numeric variable. Several rescalings are possible. The rescaling is done by first grouping the observations according to the by argument.
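The grouping-then-rescaling idea can be sketched in base R. This is only an illustration, not Rattle's implementation: the function name rescale_by_group_sketch is hypothetical, the rank-based mapping is one plausible reading of the "irank" type, and it assumes that "max" in the description corresponds to itop and that the midpoint for missing values is floor(itop / 2).

```r
# Hypothetical sketch: rank-based integer rescaling within groups.
# Not the rattle implementation; names and details are assumptions.
rescale_by_group_sketch <- function(x, by = NULL, itop = 100) {
  # With no grouping factor, treat all observations as one group.
  if (is.null(by)) by <- factor(rep("all", length(x)))

  irank <- function(v) {
    out <- rep(NA_real_, length(v))
    ok  <- !is.na(v)
    n   <- sum(ok)
    if (n > 1) {
      # Map the ranks 1..n linearly onto the integers 0..(itop - 1).
      r <- rank(v[ok], ties.method = "min")
      out[ok] <- floor((r - 1) / (n - 1) * (itop - 1))
    } else if (n == 1) {
      out[ok] <- 0
    }
    # Assumption: missing values go to the midpoint of the range.
    out[!ok] <- floor(itop / 2)
    out
  }

  # ave() applies irank to each group and keeps the original order.
  ave(x, by, FUN = irank)
}

x  <- c(10, 20, 30, 1, 2, 3)
by <- factor(c("a", "a", "a", "b", "b", "b"))
rescale_by_group_sketch(x, by)
```

Because the ranking is done per group, the smallest value in each group maps to 0 and the largest to itop - 1, regardless of the scale of the group.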

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

See Also

rattle.


Plot a risk chart

Description

Plots a Rattle Risk Chart for binary classification models using ggplot2. Such a chart has been developed in a practical context to present the performance of data mining models to clients, plotting a caseload against performance, allowing a client to see the tradeoff between coverage and performance.

Usage

riskchart(pr,
          ac,
          ri               = NULL,
          title            = "Risk Chart",
          title.size       = 10,
          subtitle         = NULL,
          caption          = TRUE,
          show.legend      = TRUE,
          optimal          = NULL,
          optimal.label    = "",
          chosen           = NULL,
          chosen.label     = "",
          include.baseline = TRUE,
          dev              = "",
          filename         = "",
          show.knots       = NULL,
          show.lift        = TRUE,
          show.precision   = TRUE,
          show.maximal     = TRUE,
          risk.name        = "Risk",
          recall.name      = "Recall",
          precision.name   = "Precision",
          thresholds       = NULL,
          legend.horiz     = TRUE)

Arguments

pr

The predicted class for each observation.

ac

The actual class for each observation.

ri

The risk class for each observation.

title

the main title to place at the top of the plot.

title.size

font size for the main title.

subtitle

subtitle under the main title.

caption

caption for the bottom right of plot.

show.legend

whether to display the legend in the plot.

optimal

a caseload (percentage or fraction) that represents an optimal performance point which is also plotted. If instead the value is TRUE then the optimal point is identified internally (maximum value for (recall-caseload)+(risk-caseload)) and plotted.

optimal.label

a string which is added to label the line drawn as the optimal point.

chosen

a caseload (percentage or fraction) that represents a user chosen optimal performance point which is also plotted.

chosen.label

a string which is added to label the line drawn as the chosen point.

include.baseline

if TRUE (the default) then display the diagonal baseline.

dev

a string which, if supplied, identifies a device type as the target for the plot. This might be one of wmf (for generating a Windows Metafile, but only available on MS/Windows), pdf, or png.

filename

a string naming a file. If dev is not given then the filename extension is used to identify the image format as one of those recognised by the dev argument.

show.knots

a vector of caseload values at which a vertical line should be drawn. These might correspond, for example, to individual paths through a decision tree, illustrating the impact of each path on the caseload and performance.

show.lift

whether to label the right axis with lift.

show.precision

whether to show the precision plot.

show.maximal

whether to show the maximal performance line.

risk.name

a string used within the plot's legend that gives a name to the risk. Often the risk is a dollar amount at risk from a fraud or from a bank loan point of view, so a name such as Revenue may be appropriate. The default shown under Usage is "Risk".

recall.name

a string used within the plot's legend that gives a name to the recall. The recall is often the percentage of cases that are positive hits, and in practice these might correspond to known cases of fraud or reviews where some adjustment to perhaps an income tax return or application for credit had to be made on reviewing the case, so a name such as Adjustments may be appropriate. The default shown under Usage is "Recall".

precision.name

a string used within the plot's legend that gives a name to the precision. A common name for precision is Strike Rate. The default shown under Usage is "Precision".

thresholds

whether to display scores along the top axis.

legend.horiz

whether to display a horizontal legend.

Details

Caseload is the percentage of the entities in the dataset covered by the model at a particular probability cutoff, so that with a cutoff of 0, all (100%) of the entities are covered by the model. With a cutoff of 1, none (0%) of the entities are covered by the model. A diagonal line is drawn to represent a baseline random performance. Then the percentage of positive cases (the recall) covered for a particular caseload is plotted, and optionally a measure of the percentage of the total risk that is also covered for a particular caseload may be plotted. Such a chart allows a user to select an appropriate tradeoff between caseload and performance. The charts are similar to ROC curves. The precision (i.e., strike rate) is also plotted.
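The quantities described above can be computed by hand with a few lines of base R. The following is a minimal sketch, not the code used by riskchart: the function name risk_curve_sketch and its positive argument are hypothetical, and ties between scores are broken in whatever order order() returns them.

```r
# Hypothetical sketch: caseload, recall and precision at each cutoff.
# pr: predicted scores; ac: actual classes; positive: label of the
# positive class (an assumption, not a riskchart argument).
risk_curve_sketch <- function(pr, ac, positive = "Yes") {
  # Review entities from the highest score to the lowest.
  ord <- order(pr, decreasing = TRUE)
  hit <- as.character(ac[ord]) == positive
  n   <- length(hit)
  k   <- seq_len(n)
  data.frame(
    caseload  = k / n,                   # fraction of entities covered
    recall    = cumsum(hit) / sum(hit),  # fraction of positives found
    precision = cumsum(hit) / k          # strike rate within the caseload
  )
}

pr <- c(0.9, 0.8, 0.3, 0.1)
ac <- c("Yes", "No", "Yes", "No")
risk_curve_sketch(pr, ac)
```

Plotting recall against caseload gives the performance curve, and the line recall = caseload is the random baseline that the chart draws as a diagonal.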

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

See Also

evaluateRisk, genPlotTitleCmd.

Examples

## Not run: 
## Use rpart to build a decision tree.
library(rpart)

## Set up the data for modelling.
set.seed(42)
ds     <- weather
target <- "RainTomorrow"
risk   <- "RISK_MM"
ignore <- c("Date", "Location", risk)
vars   <- setdiff(names(ds), ignore)
nobs   <- nrow(ds)
form   <- formula(paste(target, "~ ."))
train  <- sample(nobs, 0.7*nobs)
test   <- setdiff(seq_len(nobs), train)
actual <- ds[test, target]
risks  <- ds[test, risk]

## Build the model.
model <- rpart(form, data=ds[train, vars])

## Obtain predictions.
predicted <- predict(model, ds[test, vars], type="prob")[,2]

## Plot the Risk Chart.
riskchart(predicted, actual, risks)

## End(Not run)

Save a plot in some way

Description

For the current device, or for the device identified, save the plot displayed there in some way. This is either saved to file, copied to the clipboard for pasting into other applications, or sent to the printer for saving a hard copy.

Usage

savePlotToFile(file.name, dev.num=dev.cur())
copyPlotToClipboard(dev.num=dev.cur())
printPlot(dev.num=dev.cur())

Arguments

file.name

Character string naming the file, including the filename extension, which is used to specify the type of file to save.

dev.num

A device number indicating which device to save.
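One way to pick the output format from the filename extension is sketched below using only grDevices and tools. This is an illustration under stated assumptions rather than the code behind savePlotToFile: the helper names device_for_file and save_plot_sketch are hypothetical, and only the pdf and png extensions are handled.

```r
# Hypothetical helper: map a filename extension to a graphics device
# function. Only pdf and png are covered in this sketch.
device_for_file <- function(file.name) {
  ext <- tolower(tools::file_ext(file.name))
  switch(ext,
         pdf = grDevices::pdf,
         png = grDevices::png,
         stop("Unsupported file extension: ", ext))
}

# Hypothetical sketch: copy the plot on device dev.num to a file.
save_plot_sketch <- function(file.name, dev.num = grDevices::dev.cur()) {
  device <- device_for_file(file.name)
  grDevices::dev.set(dev.num)             # make the source device current
  grDevices::dev.copy(device, file.name)  # replay the plot on a file device
  grDevices::dev.off()                    # close the file device to flush it
  invisible(file.name)
}
```

With a plot already displayed on an interactive device, a call such as save_plot_sketch("chart.pdf") would then write a copy of it to chart.pdf.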

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com


Given specific contents of env add other dataset related variables.

Description

This rattle support function is used for encapsulating data mining objects. The supplied environment is augmented with other data derived from the supplied data, such as a sample training dataset, list of numeric variables, and a formula for modelling.

Usage

setupDataset(env, seed=NULL)

Arguments

env

the environment to modify.

seed

optionally set the seed for repeatability.

Details

The supplied object (an environment) is assumed to also contain the variables data (a data frame), target (a character string naming the target variable), risk (a character string naming the risk variable), and inputs (a character vector naming all the input variables). This function then adds in the variables vars (the variables used for modelling), numerics (the numeric vars within inputs), nobs (the number of observations), form (the formula for building models), and train (a 70% training dataset).
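The augmentation described above can be sketched in base R as follows. This is not the source of setupDataset: the function name setup_dataset_sketch is hypothetical, and it assumes that vars is the inputs plus the target and that train holds row indices rather than a data frame.

```r
# Hypothetical sketch of augmenting an environment with derived variables.
setup_dataset_sketch <- function(env, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Assumption: modelling variables are the inputs plus the target.
  env$vars     <- c(env$inputs, env$target)
  # Names of the input variables that are numeric.
  is.num       <- vapply(env$data[env$inputs], is.numeric, logical(1))
  env$numerics <- env$inputs[is.num]
  env$nobs     <- nrow(env$data)
  env$form     <- formula(paste(env$target, "~ ."))
  # Assumption: the 70% training dataset is stored as row indices.
  env$train    <- sample(env$nobs, floor(0.7 * env$nobs))
  invisible(env)
}

ds        <- new.env()
ds$data   <- iris
ds$target <- "Species"
ds$inputs <- setdiff(names(iris), "Species")
setup_dataset_sketch(ds, seed = 42)
ls(ds)
```

Because environments are modified in place, the caller sees the new variables without needing to reassign the result.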

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com


Generate a representation of a tree in a Random Forest

Description

Often we want to view the actual trees built by a random forest. Although reviewing all 500 trees might be a bit much, this function allows us to at least list them.

Usage

treeset.randomForest(model, n=1, root=1, format="R")

Arguments

model

a randomForest model.

n

a specific tree to list.

root

where to start the tree from, primarily for internal use.

format

one of "R", "VB".

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com

Examples

## Display a treeset for a specific model amongst the 500.
## Not run: treeset.randomForest(rfmodel, 5)

Sample dataset of daily weather observations from Canberra airport in Australia.

Description

One year of daily weather observations collected from the Canberra airport in Australia was obtained from the Australian Commonwealth Bureau of Meteorology and processed to create this sample dataset for illustrating data mining using R and Rattle.

The data has been processed to provide a target variable RainTomorrow (whether there is rain on the following day - No/Yes) and a risk variable RISK_MM (how much rain was recorded in millimetres). Various transformations were performed on the source data. The dataset is quite small and is useful only for repeatable demonstration of various data science operations.

The source dataset is Copyright by the Australian Commonwealth Bureau of Meteorology and is provided as part of the rattle package with permission.

Usage

weather

Format

The weather dataset is a data frame containing one year of daily observations from a single weather station (Canberra).

Date

The date of observation (a Date object).

Location

The common name of the location of the weather station.

MinTemp

The minimum temperature in degrees Celsius.

MaxTemp

The maximum temperature in degrees Celsius.

Rainfall

The amount of rainfall recorded for the day in mm.

Evaporation

The so-called Class A pan evaporation (mm) in the 24 hours to 9am.

Sunshine

The number of hours of bright sunshine in the day.

WindGustDir

The direction of the strongest wind gust in the 24 hours to midnight.

WindGustSpeed

The speed (km/h) of the strongest wind gust in the 24 hours to midnight.

Temp9am

Temperature (degrees C) at 9am.

RelHumid9am

Relative humidity (percent) at 9am.

Cloud9am

Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A 0 measure indicates completely clear sky whilst an 8 indicates that it is completely overcast.

WindSpeed9am

Wind speed (km/hr) averaged over 10 minutes prior to 9am.

Pressure9am

Atmospheric pressure (hPa) reduced to mean sea level at 9am.

Temp3pm

Temperature (degrees C) at 3pm.

RelHumid3pm

Relative humidity (percent) at 3pm.

Cloud3pm

Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm. See Cloud9am for a description of the values.

WindSpeed3pm

Wind speed (km/hr) averaged over 10 minutes prior to 3pm.

Pressure3pm

Atmospheric pressure (hPa) reduced to mean sea level at 3pm.

ChangeTemp

Change in temperature.

ChangeTempDir

Direction of change in temperature.

ChangeTempMag

Magnitude of change in temperature.

ChangeWindDirect

Direction of wind change.

MaxWindPeriod

Period of maximum wind.

RainToday

Integer: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0.

TempRange

Difference between minimum and maximum temperatures (degrees C) in the 24 hours to 9am.

PressureChange

Change in pressure.

RISK_MM

The amount of rain. A kind of measure of the "risk".

RainTomorrow

The target variable. Did it rain tomorrow?
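Several of the variables above are defined in terms of others, and those definitions can be written out directly. The sketch below follows the wording of the list rather than the actual processing used to build the dataset. The toy data frame and its values are made up for illustration, and okta_fraction is a hypothetical helper.

```r
# Made-up toy observations, only to illustrate the definitions above.
toy <- data.frame(MinTemp  = c(8.0, 14.0, 13.7),
                  MaxTemp  = c(24.3, 26.9, 23.4),
                  Rainfall = c(0.0, 3.6, 1.0))

# RainToday: 1 if the recorded rainfall exceeds 1mm, otherwise 0.
toy$RainToday <- as.integer(toy$Rainfall > 1)

# TempRange: difference between the maximum and minimum temperatures.
toy$TempRange <- toy$MaxTemp - toy$MinTemp

# Hypothetical helper: convert a cloud measure in oktas (0 to 8)
# into the fraction of sky obscured.
okta_fraction <- function(oktas) oktas / 8

toy
okta_fraction(c(0, 4, 8))
```

Note that a rainfall of exactly 1mm does not count as rain under the "exceeds 1mm" rule.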

Author(s)

Graham.Williams@togaware.com

Source

The daily observations are available from https://www.bom.gov.au/climate/data. Copyright Commonwealth of Australia 2010, Bureau of Meteorology.

Definitions adapted from https://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml

References

Package home page: https://rattle.togaware.com. Data source: https://www.bom.gov.au/climate/dwo/ and https://www.bom.gov.au/climate/data.

See Also

weatherAUS, audit.


Daily weather observations from multiple Australian weather stations.

Description

Daily weather observations from multiple locations around Australia, obtained from the Australian Commonwealth Bureau of Meteorology and processed to create this relatively large sample dataset for illustrating analytics, data mining, and data science using R and Rattle.

The data has been processed to provide a target variable RainTomorrow (whether there is rain on the following day - No/Yes) and a risk variable RISK_MM (how much rain was recorded in millimetres). Various transformations are performed on the data.

The weatherAUS dataset is regularly updated, and updates of this package usually correspond to updates to this dataset. The data is updated from the Bureau of Meteorology web site.

The locationsAUS dataset records the location of each weather station.

The source dataset comes from the Australian Commonwealth Bureau of Meteorology. The Bureau provided permission to use the data with the Bureau of Meteorology acknowledged as the source of the data, as per email from Cathy Toby (C.Toby@bom.gov.au) of the Climate Information Services of the National Climate Centre, 17 Dec 2008.

A CSV version of this dataset is available as https://rattle.togaware.com/weatherAUS.csv.

Usage

weatherAUS

Format

The weatherAUS dataset is a data frame containing over 140,000 daily observations from over 45 Australian weather stations.

Date

The date of observation (a Date object).

Location

The common name of the location of the weather station.

MinTemp

The minimum temperature in degrees Celsius.

MaxTemp

The maximum temperature in degrees Celsius.

Rainfall

The amount of rainfall recorded for the day in mm.

Evaporation

The so-called Class A pan evaporation (mm) in the 24 hours to 9am.

Sunshine

The number of hours of bright sunshine in the day.

WindGustDir

The direction of the strongest wind gust in the 24 hours to midnight.

WindGustSpeed

The speed (km/h) of the strongest wind gust in the 24 hours to midnight.

Temp9am

Temperature (degrees C) at 9am.

RelHumid9am

Relative humidity (percent) at 9am.

Cloud9am

Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A 0 measure indicates completely clear sky whilst an 8 indicates that it is completely overcast.

WindSpeed9am

Wind speed (km/hr) averaged over 10 minutes prior to 9am.

Pressure9am

Atmospheric pressure (hPa) reduced to mean sea level at 9am.

Temp3pm

Temperature (degrees C) at 3pm.

RelHumid3pm

Relative humidity (percent) at 3pm.

Cloud3pm

Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm. See Cloud9am for a description of the values.

WindSpeed3pm

Wind speed (km/hr) averaged over 10 minutes prior to 3pm.

Pressure3pm

Atmospheric pressure (hPa) reduced to mean sea level at 3pm.

ChangeTemp

Change in temperature.

ChangeTempDir

Direction of change in temperature.

ChangeTempMag

Magnitude of change in temperature.

ChangeWindDirect

Direction of wind change.

MaxWindPeriod

Period of maximum wind.

RainToday

Integer: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0.

TempRange

Difference between minimum and maximum temperatures (degrees C) in the 24 hours to 9am.

PressureChange

Change in pressure.

RISK_MM

The amount of rain. A kind of measure of the "risk".

RainTomorrow

The target variable. Did it rain tomorrow?

Author(s)

Graham.Williams@togaware.com

Source

Observations were drawn from numerous weather stations. The daily observations are available from https://www.bom.gov.au/climate/data. Copyright Commonwealth of Australia 2010, Bureau of Meteorology.

Definitions adapted from https://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml

References

Package home page: https://rattle.togaware.com. Data source: https://www.bom.gov.au/climate/dwo/ and https://www.bom.gov.au/climate/data.

See Also

weather, audit.


Returns a list of the names of the numeric variables in a data frame.

Description

A rattle support function.

Usage

whichNumerics(data)

Arguments

data

a data frame.
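The idea can be expressed in one line of base R. The sketch below is an assumption about the behaviour suggested by the title (returning variable names), not the package source, and the function name which_numerics_sketch is hypothetical.

```r
# Hypothetical sketch: names of the numeric columns of a data frame.
which_numerics_sketch <- function(data) {
  # vapply() checks each column and guarantees a logical vector result.
  names(data)[vapply(data, is.numeric, logical(1))]
}

# The built-in iris data has four numeric columns and one factor.
which_numerics_sketch(iris)
```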

Author(s)

Graham.Williams@togaware.com

References

Package home page:https://rattle.togaware.com


The wine dataset from the UCI Machine Learning Repository.

Description

The wine dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. The Type variable has been transformed into a categoric variable.

The data contains no missing values and consists of only numeric data, with a three class target variable (Type) for classification.

Usage

wine

Format

A data frame containing 178 observations of 14 variables: the Type target plus the 13 chemical measurements.

Type

The type of wine, into one of three classes, 1 (59 obs), 2 (71 obs), and 3 (48 obs).

Alcohol

Alcohol

Malic

Malic acid

Ash

Ash

Alcalinity

Alcalinity of ash

Magnesium

Magnesium

Phenols

Total phenols

Flavanoids

Flavanoids

Nonflavanoids

Nonflavanoid phenols

Proanthocyanins

Proanthocyanins

Color

Color intensity.

Hue

Hue

Dilution

OD280/OD315 of diluted wines.

Proline

Proline

Source

The data was downloaded from the UCI Machine Learning Repository.

It was read as a CSV file with no header using read.csv. The columns were then given the appropriate names using colnames and the Type was transformed into a factor using as.factor. The compressed R data file was saved using save:

  UCI <- "https://archive.ics.uci.edu/ml"
  REPOS <- "machine-learning-databases"
  wine.url <- sprintf("
  wine <- read.csv(wine.url, header=FALSE)
  colnames(wine) <- c('Type', 'Alcohol', 'Malic', 'Ash',
                      'Alcalinity', 'Magnesium', 'Phenols',
                      'Flavanoids', 'Nonflavanoids',
                      'Proanthocyanins', 'Color', 'Hue',
                      'Dilution', 'Proline')
  wine$Type <- as.factor(wine$Type)
  save(wine, file="wine.Rdata", compress=TRUE)

References

Asuncion, A. & Newman, D.J. (2007). UCI Machine Learning Repository [https://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science.

