Aurelius randomForest
Download and install Aurelius. This article was tested with Aurelius 0.8.3; newer versions should work with no modification. R >= 3.0.1 is required.
Launch an R prompt and load the aurelius library:
    R version 3.0.1 (2013-05-16) -- "Good Sport"
    Copyright (C) 2013 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

      Natural language support but running in an English locale

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

    > library(aurelius)
    >
The randomForest library for random forest models usually isn't packaged with an R distribution, but it is available on CRAN. This page assumes that you are not only familiar with the randomForest package, but have already created and fine-tuned your random forest model, producing a forestObject of class "randomForest".
Conversion to PFA proceeds in three steps:
- Extract parameters from the forestObject.
- Format them as an R list-of-lists that is equivalent to a data structure in PFA.
- Create the PFA document, including the line of PFA code that evaluates the forest.
These steps are deliberately kept separate rather than combined into one function call. This allows for variations in how the model is invoked, including preprocessing, postprocessing, and attaching additional information to the forest object.
Tree structures in PFA differ depending on whether the splits are all numerical or some are categorical (levels in R). The comparison value at each tree node needs a data type that can encompass all possible splits: if the predictors are all numerical, it can be avro.double; if they are all categorical, it can be avro.string; but if they are mixed, it must be avro.union(avro.double, avro.string).
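The rule above can be checked mechanically before building the forest. The following is a minimal base-R sketch, not part of Aurelius: the helper name `comparisonValueType` and the `trainingData` frame are hypothetical, and the function only returns a label naming the Avro type that the paragraph above calls for.

```r
# Hypothetical helper (not an Aurelius function): report which comparison-value
# type a set of predictors needs, following the rule described above.
comparisonValueType <- function(predictors) {
    isCategorical <- vapply(predictors,
                            function(col) is.factor(col) || is.character(col),
                            logical(1))
    if (all(isCategorical)) {
        "avro.string"
    } else if (!any(isCategorical)) {
        "avro.double"
    } else {
        "avro.union(avro.double, avro.string)"
    }
}

# Hypothetical training data with two numerical and one categorical predictor
trainingData <- data.frame(
    age    = c(23, 41, 35),
    income = c(31000, 52000, 47000),
    region = factor(c("north", "south", "north"))
)

comparisonValueType(trainingData)                        # mixed predictors
comparisonValueType(trainingData[, c("age", "income")])  # numerical only
```

The mixed case is also the one that calls for labelVar = TRUE in the extraction step.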
Furthermore, the randomForest library summarizes the level data very succinctly, while PFA represents them by their string values. You need to make a lookup table. For each categorical feature in your tree, create an ordered list of levels and put all of these in one named list, where the names are feature names. For instance,
    dataLevels <- list(field1 = list("value1", "value2", "value3"),
                       field2 = list("val1", "val2", "val3", "val4", "val5"),
                       ...)
You may be able to automate this from the level data in R.
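One way to automate the lookup table, sketched in base R under the assumption that the data frame used to train the forest is still available (the name `trainingData` and its columns are hypothetical). Each factor column contributes its levels in R's stored order, which is the order the forest's categorical splits refer to only if the same factors were used for training.

```r
# Hypothetical training data; only the factor columns contribute to dataLevels.
trainingData <- data.frame(
    field1 = factor(c("value1", "value2", "value3", "value1")),
    field2 = factor(c("val1", "val2", "val1", "val3")),
    field3 = c(1.5, 2.5, 3.5, 4.5)
)

isCategorical <- vapply(trainingData, is.factor, logical(1))

# as.list turns each character vector of levels into a list of strings,
# matching the shape of the hand-written dataLevels example.
dataLevels <- lapply(trainingData[isCategorical],
                     function(col) as.list(levels(col)))

str(dataLevels)
```

If the levels were reordered or dropped between training and this step, the table would no longer line up with the forest, so it is worth building it from exactly the data the model saw.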
Now you can extract the forestObject into a list-of-lists. Use labelVar = TRUE if the comparison value is a union (mixed numerical and categorical regressors), labelVar = FALSE if it is not (all numerical or all categorical).
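    labelVar <- TRUE   # FALSE if the splits are all numerical or all categorical
    forest <- list()
    # randomForest objects store the number of trees as ntree
    for (i in 1:forestObject$ntree) {
        treeTable <- pfa.randomForest.extractTree(forestObject, i, labelVar = labelVar)
        cat(paste("tree", i, "has", length(treeTable$status), "nodes\n"))
        forest[[length(forest) + 1]] <- pfa.randomForest.buildOneTree(
            treeTable, 1, labelVar, dataLevels,
            lapply(dataLevels, function (v) avro.string))$TreeNode
    }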
where the cat line is there to track progress.
Assuming that the input schema contains the tree fields and nothing else (no preprocessing), build the input schema like the following.
    fieldNames <- as.list(forestObject$xNames)
    fieldTypes <- rep(avro.double, length(fieldNames))
    names(fieldTypes) <- fieldNames
    for (n in names(dataLevels))
        fieldTypes[[n]] <- avro.string
    inputSchema <- avro.record(fieldTypes, "Input")
If any of your field names contain dots (.), you'll have to convert them, for example to underscores (Avro field names accept only alphanumeric characters and underscores).
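A small base-R sketch of that renaming follows. The helper name `sanitizeFieldNames` is hypothetical and not part of Aurelius. Beyond dots, it replaces any other character Avro would reject, prefixes names that start with a digit (Avro names must begin with a letter or underscore), and stops if two names collapse to the same result.

```r
# Hypothetical helper: make predictor names acceptable as Avro field names.
sanitizeFieldNames <- function(fieldNames) {
    # Replace everything outside [A-Za-z0-9_], including dots, with an underscore
    cleaned <- gsub("[^A-Za-z0-9_]", "_", fieldNames)
    # Avro names may not start with a digit, so prefix those with an underscore
    cleaned <- sub("^([0-9])", "_\\1", cleaned)
    if (anyDuplicated(cleaned) > 0)
        stop("sanitized field names collide: ",
             paste(unique(cleaned[duplicated(cleaned)]), collapse = ", "))
    cleaned
}

sanitizeFieldNames(c("Sepal.Length", "Petal.Width", "species"))
# "Sepal_Length" "Petal_Width"  "species"
```

Renaming the columns of the training data before fitting the forest is probably the simplest way to keep the forest, dataLevels, and the input schema in agreement.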
The pfa.randomForest.buildOneTree function above formats each tree from the forest as a PFA list-of-lists, and it's more convenient to leave that transformation in the same loop as the extraction.
It is good practice to use an avro.typemap to ensure that named types are declared only once in the output PFA.
If the trees in the forest are classification trees (categorical decisions), the Output type should be avro.string and the score type (unioned with "TreeNode" in pass and fail) should also be avro.string. If they are regression trees, the Output and score types should be avro.double.
If the tree splits are purely numeric, the value is avro.double; if purely categorical, the value is avro.string; if any splits include subset comparisons (e.g. field value is X, Y, or Z), then the value is avro.array(avro.string). If it is any combination of these, take the appropriate union. A wide union consisting of all three is safe.
    tm <- avro.typemap(
        Input = inputSchema,
        Output = avro.string,
        TreeNode = avro.record(list(
            field = avro.enum(fieldNames),
            operator = avro.string,
            value = avro.union(avro.double, avro.string, avro.array(avro.string)),
            pass = avro.union("TreeNode", avro.string),
            fail = avro.union("TreeNode", avro.string)),
            "TreeNode"))
The following PFA document applies standard tree-scoring to each tree in the forest and reports the majority vote.
    pfaDocument <- pfa.config(
        input = tm("Input"),
        output = tm("Output"),
        cells = list(forest = pfa.cell(avro.array(tm("TreeNode")), forest)),
        action = expression(
            treeScores <- a.map(forest, function(tree = tm("TreeNode") -> avro.string)
                model.tree.simpleTree(input, tree)),
            a.mode(treeScores)
        ))
If, instead of reporting the most popular score, you want to report the fraction of trees that score a particular way, you could use this PFA document instead.
    pfaDocument <- pfa.config(
        input = tm("Input"),
        output = avro.double,
        cells = list(forest = pfa.cell(avro.array(tm("TreeNode")), forest)),
        action = expression(
            treeScores <- a.map(forest, function(tree = tm("TreeNode") -> avro.string)
                model.tree.simpleTree(input, tree)),
            a.count(treeScores, "favoriteScore") / a.len(treeScores)
        ))
Or, to output a map giving the fraction of trees that vote for each output category, use something like this.
    pfaDocument <- pfa.config(
        input = tm("Input"),
        output = avro.map(avro.double),
        cells = list(forest = pfa.cell(avro.array(tm("TreeNode")), forest)),
        action = expression(
            treeScores <- a.map(forest, function(tree = tm("TreeNode") -> avro.string)
                model.tree.simpleTree(input, tree)),
            new(avro.map(avro.double),
                score1 = a.count(treeScores, "score1") / a.len(treeScores),
                score2 = a.count(treeScores, "score2") / a.len(treeScores),
                score3 = a.count(treeScores, "score3") / a.len(treeScores))
        ))
For many possible scores, the latter could be automated.
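As a cross-check of what the three documents above compute, the same aggregations can be written in plain R. This is only an analogue for reasoning about expected outputs, not PFA code; the `treeScores` vector and the score names are hypothetical, and tie-breaking in the majority vote may differ from a.mode.

```r
# Hypothetical per-tree scores for one input record
treeScores <- c("score1", "score2", "score1", "score3", "score1")

# Majority vote (the role a.mode plays in the first document)
voteCounts <- table(treeScores)
majority <- names(voteCounts)[which.max(voteCounts)]

# Fraction of trees voting for one particular score (second document)
fractionScore1 <- sum(treeScores == "score1") / length(treeScores)

# Fraction for every possible score, including ones no tree voted for
# (third document, with the list of scores supplied once instead of spelled out)
possibleScores <- c("score1", "score2", "score3", "score4")
fractions <- vapply(possibleScores,
                    function(s) sum(treeScores == s) / length(treeScores),
                    numeric(1))

print(majority)
print(fractions)
```

Supplying the list of possible scores explicitly, rather than deriving it from the votes, keeps categories with zero votes in the output map.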
The tree-scoring process can also be expanded for more functionality. Consider, for instance,
    pfaDocument <- pfa.config(
        input = tm("Input"),
        output = tm("Output"),
        cells = list(forest = pfa.cell(avro.array(tm("TreeNode")), forest)),
        action = expression(
            treeScores <- a.map(forest, function(tree = tm("TreeNode") -> avro.string)
                model.tree.simpleWalk(input, tree,
                    function(d = tm("Input"), t = tm("TreeNode") -> avro.boolean)
                        model.tree.simpleTest(d, t)
                )),
            a.mode(treeScores)
        ))
where the model.tree.simpleTree function has been expanded into two parts, model.tree.simpleWalk and model.tree.simpleTest. The model.tree.simpleTest function decides how to evaluate each tree node as "pass" or "fail" and the model.tree.simpleWalk function repeatedly applies it from tree root to tree leaf. Multi-branch trees or trees with missing values can be evaluated by swapping one or both functions for alternatives from the model.tree library.
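The division of labor between the two functions can be illustrated in plain R. The sketch below is a conceptual analogue, not the PFA implementation: the functions `simpleTest` and `simpleWalk` here are ordinary R functions that borrow the PFA names, only a handful of operators are handled, and the tree is a simplified nested list in which a leaf is a bare string (the list-of-lists that Aurelius produces may tag union values differently).

```r
# Decide "pass" (TRUE) or "fail" (FALSE) at a single node.
simpleTest <- function(datum, node) {
    fieldValue <- datum[[node$field]]
    switch(node$operator,
        "<"  = fieldValue <  node$value,
        "<=" = fieldValue <= node$value,
        ">"  = fieldValue >  node$value,
        ">=" = fieldValue >= node$value,
        "==" = fieldValue == node$value,
        "in" = fieldValue %in% unlist(node$value),
        stop("unsupported operator: ", node$operator))
}

# Follow pass/fail branches from the root until a leaf (a non-list score) is reached.
simpleWalk <- function(datum, node, test) {
    while (is.list(node)) {
        node <- if (test(datum, node)) node$pass else node$fail
    }
    node
}

# Hypothetical two-level tree: a numerical split, then a subset comparison
tree <- list(field = "field1", operator = "<", value = 2.5,
    pass = "low",
    fail = list(field = "field2", operator = "in", value = list("val1", "val2"),
        pass = "mid",
        fail = "high"))

simpleWalk(list(field1 = 3.14, field2 = "val2"), tree, simpleTest)
# "mid"
```

Because the walk takes the test as an argument, a different per-node rule (for instance one that handles missing values) can be substituted without touching the traversal, which mirrors how the PFA functions are swapped.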
To write the PFA to a file, use
    json(pfaDocument, fileName = "mymodel.pfa")
If you have Titus and rPython installed (see the installation page), you can test the scoring engine without leaving R.
    engine <- pfa.engine(pfaDocument)   # verifies that pfaDocument is internally consistent
    engine$action(list(field1 = 3.14, field2 = "hello"))
where field1, field2, etc. are named fields.
Licensed under the Hadrian Personal Use and Evaluation License (PUEL).