Machine Learning Glossary: Decision Forests

  • Decision forests are models composed of multiple decision trees that work together to make predictions.

  • Decision trees use conditions to split data and make decisions, with leaves representing the final predictions.

  • Various techniques like bagging, attribute sampling, and gradient boosting are used to improve the accuracy and robustness of decision forests.

  • Feature importances reveal which input features are most influential in a decision forest's predictions.

  • Ensembles, including random forests and gradient boosted trees, leverage the wisdom of the crowd for enhanced performance.

This page contains Decision Forests glossary terms. For all glossary terms, click here.

A

attribute sampling

#df

A tactic for training a decision forest in which each decision tree considers only a random subset of possible features when learning the condition. Generally, a different subset of features is sampled for each node. In contrast, when training a decision tree without attribute sampling, all possible features are considered for each node.
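
For illustration only (this sketch is not from the glossary), one way a splitter might draw a fresh random subset of features for each node, assuming NumPy and a hypothetical num_to_sample parameter:

  import numpy as np

  def sample_attributes(all_features, num_to_sample, rng=None):
      # Each call returns a fresh random subset, so each node considers
      # different candidate features when searching for its best condition.
      rng = rng or np.random.default_rng()
      return list(rng.choice(all_features, size=num_to_sample, replace=False))

  # Example: at a given node, only 2 of these 4 features are candidates.
  print(sample_attributes(["area", "age", "style", "num_rooms"], num_to_sample=2))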

axis-aligned condition

#df

In a decision tree, a condition that involves only a single feature. For example, if area is a feature, then the following is an axis-aligned condition:

area > 200

Contrast with oblique condition.

B

bagging

#df

A method to train an ensemble where each constituent model trains on a random subset of training examples sampled with replacement. For example, a random forest is a collection of decision trees trained with bagging.

The term bagging is short for bootstrap aggregating.
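
A minimal sketch of bagging, assuming NumPy arrays and scikit-learn's DecisionTreeClassifier as the constituent model (neither is prescribed by the glossary): each tree trains on a bootstrap sample drawn with replacement.

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def train_bagged_trees(X, y, num_trees=10, rng=None):
      rng = rng or np.random.default_rng()
      n = len(y)
      trees = []
      for _ in range(num_trees):
          # Draw n row indices with replacement: some examples repeat,
          # others are left out (those become the out-of-bag examples).
          idx = rng.integers(0, n, size=n)
          trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
      return trees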

See Random forests in the Decision Forests course for more information.

binary condition

#df

In a decision tree, a condition that has only two possible outcomes, typically yes or no. For example, the following is a binary condition:

temperature >= 100

Contrast with non-binary condition.

See Types of conditions in the Decision Forests course for more information.

C

condition

#df

In a decision tree, any node that performs a test. For example, the following decision tree contains two conditions:

A decision tree consisting of two conditions: (x > 0) and (y > 0).

A condition is also called a split or a test.

Contrast condition with leaf.

See Types of conditions in the Decision Forests course for more information.

D

decision forest

#df

A model created from multiple decision trees. A decision forest makes a prediction by aggregating the predictions of its decision trees. Popular types of decision forests include random forests and gradient boosted trees.

See the Decision forests section in the Decision Forests course for more information.

decision tree

#df

A supervised learning model composed of a set of conditions and leaves organized hierarchically. For example, the following is a decision tree:

A decision tree consisting of four conditions arranged hierarchically, which lead to five leaves.
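
Conceptually, a decision tree behaves like a set of nested if/else tests. The sketch below uses made-up features and leaves (not the ones in the figure):

  def predict_price_band(example):
      # Root condition.
      if example["area"] > 200:
          # Second condition, on the "yes" branch of the root.
          if example["num_rooms"] >= 4:
              return "high"    # leaf
          return "medium"      # leaf
      return "low"             # leaf

  print(predict_price_band({"area": 250, "num_rooms": 5}))  # "high"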

E

entropy

#df
#Metric

In information theory, a description of how unpredictable a probability distribution is. Alternatively, entropy is also defined as how much information each example contains. A distribution has the highest possible entropy when all values of a random variable are equally likely.

The entropy of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) has the following formula:

$$H = -p \log p - q \log q = -p \log p - (1-p) \log (1-p)$$

where:

  • H is the entropy.
  • p is the fraction of "1" examples.
  • q is the fraction of "0" examples. Note that q = (1 - p)
  • log is generally log2. In this case, the entropy unit is a bit.

For example, suppose the following:

  • 100 examples contain the value "1"
  • 300 examples contain the value "0"

Therefore, the entropy value is:

  • p = 0.25
  • q = 0.75
  • H = (-0.25)log2(0.25) - (0.75)log2(0.75) = 0.81 bits per example

A set that is perfectly balanced (for example, 200 "0"s and 200 "1"s) would have an entropy of 1.0 bit per example. As a set becomes more imbalanced, its entropy moves towards 0.0.
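
The following small helper (illustrative only, not part of the glossary) computes this binary entropy and reproduces the numbers above:

  import math

  def binary_entropy(p):
      # H = -p log2(p) - (1-p) log2(1-p); by convention, 0 * log2(0) = 0.
      q = 1 - p
      return -sum(x * math.log2(x) for x in (p, q) if x > 0)

  print(binary_entropy(0.25))  # ~0.81 bits per example
  print(binary_entropy(0.5))   # 1.0 bit per example (perfectly balanced)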

In decision trees, entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Entropy is often called Shannon's entropy.

See Exact splitter for binary classification with numerical features in the Decision Forests course for more information.

F

feature importances

#df
#Metric

Synonym for variable importances.

G

gini impurity

#df
#Metric

A metric similar to entropy. Splitters use values derived from either gini impurity or entropy to compose conditions for classification decision trees. Information gain is derived from entropy. No universally accepted equivalent term for the metric derived from gini impurity exists; however, this unnamed metric is just as important as information gain.

Gini impurity is also called gini index, or simply gini.

Gini impurity is the probability of misclassifying a new piece of data taken from the same distribution. The gini impurity of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) is calculated from the following formula:

$$I = 1 - (p^2 + q^2) = 1 - (p^2 + (1-p)^2)$$

where:

  • I is the gini impurity.
  • p is the fraction of "1" examples.
  • q is the fraction of "0" examples. Note that q = 1 - p

For example, consider the following dataset:

  • 100 labels (0.25 of the dataset) contain the value "1"
  • 300 labels (0.75 of the dataset) contain the value "0"

Therefore, the gini impurity is:

  • p = 0.25
  • q = 0.75
  • I = 1 - (0.25^2 + 0.75^2) = 0.375

Consequently, a random label from the same dataset would have a 37.5% chance of being misclassified, and a 62.5% chance of being properly classified.

A perfectly balanced label (for example, 200 "0"s and 200 "1"s) would have a gini impurity of 0.5. A highly imbalanced label would have a gini impurity close to 0.0.
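
The same arithmetic as a small, illustrative helper (not part of the glossary):

  def binary_gini_impurity(p):
      # I = 1 - (p^2 + q^2), where q = 1 - p.
      q = 1 - p
      return 1 - (p**2 + q**2)

  print(binary_gini_impurity(0.25))  # 0.375
  print(binary_gini_impurity(0.5))   # 0.5 (perfectly balanced)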


gradient boosted (decision) trees (GBT)

#df

A type of decision forest in which:

  • Training relies on gradient boosting.
  • The weak model is a decision tree.

See Gradient Boosted Decision Trees in the Decision Forests course for more information.

gradient boosting

#df

A training algorithm where weak models are trained to iteratively improve the quality (reduce the loss) of a strong model. For example, a weak model could be a linear or small decision tree model. The strong model becomes the sum of all the previously trained weak models.

In the simplest form of gradient boosting, at each iteration, a weak model is trained to predict the loss gradient of the strong model. Then, the strong model's output is updated by subtracting the predicted gradient, similar to gradient descent.

$$F_{0} = 0$$
$$F_{i+1} = F_{i} - \xi f_{i}$$

where:

  • $F_{0}$ is the starting strong model.
  • $F_{i+1}$ is the next strong model.
  • $F_{i}$ is the current strong model.
  • $\xi$ is a value between 0.0 and 1.0 called shrinkage, which is analogous to the learning rate in gradient descent.
  • $f_{i}$ is the weak model trained to predict the loss gradient of $F_{i}$.

Modern variations of gradient boosting also include the second derivative (Hessian) of the loss in their computation.

Decision trees are commonly used as weak models in gradient boosting. See gradient boosted (decision) trees.
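
A minimal sketch of this loop, assuming squared-error loss (so fitting the weak model to the residual is equivalent to fitting it to the negative loss gradient) and scikit-learn regression trees as the weak models; the parameter names are illustrative:

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  def gradient_boost(X, y, num_iterations=50, shrinkage=0.1, max_depth=2):
      strong = np.zeros(len(y))       # F_0 = 0
      weak_models = []
      for _ in range(num_iterations):
          residual = y - strong       # negative gradient of the squared loss
          f_i = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
          weak_models.append(f_i)
          # Equivalent to F_{i+1} = F_i - shrinkage * (predicted gradient),
          # because this tree was fit to the negative gradient.
          strong = strong + shrinkage * f_i.predict(X)
      return weak_models

  def predict(weak_models, X, shrinkage=0.1):
      # The strong model is the shrinkage-scaled sum of the weak models.
      return shrinkage * sum(model.predict(X) for model in weak_models)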

I

inference path

#df

In a decision tree, during inference, the route a particular example takes from the root to other conditions, terminating with a leaf. For example, in the following decision tree, the thicker arrows show the inference path for an example with the following feature values:

  • x = 7
  • y = 12
  • z = -3

The inference path in the following illustration travels through three conditions before reaching the leaf (Zeta).

A decision tree consisting of four conditions and five leaves. The root condition is (x > 0). Since the answer is Yes, the inference path travels from the root to the next condition (y > 0). Since the answer is Yes, the inference path then travels to the next condition (z > 0). Since the answer is No, the inference path travels to its terminal node, which is the leaf (Zeta).

The three thick arrows show the inference path.
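
Written as code, the described path might look like the following sketch. Only the three conditions on the path and the leaf Zeta come from the figure; the remaining branches are placeholders:

  def traverse(example):
      if example["x"] > 0:            # root condition: Yes for x = 7
          if example["y"] > 0:        # next condition: Yes for y = 12
              if example["z"] > 0:    # next condition: No for z = -3
                  return "some other leaf"
              return "Zeta"           # terminal leaf of this inference path
          return "some other leaf"
      return "some other leaf"

  print(traverse({"x": 7, "y": 12, "z": -3}))  # "Zeta"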

See Decision trees in the Decision Forests course for more information.

information gain

#df
#Metric

In decision forests, the difference between a node's entropy and the weighted (by number of examples) sum of the entropy of its children nodes. A node's entropy is the entropy of the examples in that node.

For example, consider the following entropy values:

  • entropy of parent node = 0.6
  • entropy of one child node with 16 relevant examples = 0.2
  • entropy of another child node with 24 relevant examples = 0.1

So 40% of the examples are in one child node and 60% are in theother child node. Therefore:

  • weighted entropy sum of child nodes = (0.4 * 0.2) + (0.6 * 0.1) = 0.14

So, the information gain is:

  • information gain = entropy of parent node - weighted entropy sum of child nodes
  • information gain = 0.6 - 0.14 = 0.46

Most splitters seek to create conditions that maximize information gain.
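
A small illustrative helper (not from the glossary) that reproduces the arithmetic above:

  def information_gain(parent_entropy, children):
      # children: list of (num_examples, entropy) pairs for the child nodes.
      total = sum(n for n, _ in children)
      weighted_entropy = sum((n / total) * h for n, h in children)
      return parent_entropy - weighted_entropy

  # Matches the worked example: 0.6 - ((0.4 * 0.2) + (0.6 * 0.1)) = 0.46
  print(information_gain(0.6, [(16, 0.2), (24, 0.1)]))  # ~0.46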

in-set condition

#df

In a decision tree, a condition that tests for the presence of one item in a set of items. For example, the following is an in-set condition:

  house-style in [tudor, colonial, cape]

During inference, if the value of the house-style feature is tudor or colonial or cape, then this condition evaluates to Yes. If the value of the house-style feature is something else (for example, ranch), then this condition evaluates to No.

In-set conditions usually lead to more efficient decision trees than conditions that test one-hot encoded features.
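
In code, an in-set condition is just a set-membership test; this sketch uses the values from the example above:

  ALLOWED_STYLES = {"tudor", "colonial", "cape"}

  def in_set_condition(example):
      # True corresponds to the "Yes" branch, False to the "No" branch.
      return example["house_style"] in ALLOWED_STYLES

  print(in_set_condition({"house_style": "tudor"}))  # True (Yes)
  print(in_set_condition({"house_style": "ranch"}))  # False (No)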

L

leaf

#df

Any endpoint in a decision tree. Unlike a condition, a leaf doesn't perform a test. Rather, a leaf is a possible prediction. A leaf is also the terminal node of an inference path.

For example, the following decision tree contains three leaves:

A decision tree with two conditions leading to three leaves.

See Decision trees in the Decision Forests course for more information.

N

node (decision tree)

#df

In a decision tree, any condition or leaf.

A decision tree with two conditions and three leaves.

See Decision Trees in the Decision Forests course for more information.

non-binary condition

#df

A condition containing more than two possible outcomes. For example, the following non-binary condition contains three possible outcomes:

A condition (number_of_legs = ?) that leads to three possible outcomes. One outcome (number_of_legs = 8) leads to a leaf named spider. A second outcome (number_of_legs = 4) leads to a leaf named dog. A third outcome (number_of_legs = 2) leads to a leaf named penguin.

See Types of conditions in the Decision Forests course for more information.

O

oblique condition

#df

In a decision tree, a condition that involves more than one feature. For example, if height and width are both features, then the following is an oblique condition:

  height > width

Contrast with axis-aligned condition.

See Types of conditions in the Decision Forests course for more information.

out-of-bag evaluation (OOB evaluation)

#df

A mechanism for evaluating the quality of a decision forest by testing each decision tree against the examples not used during training of that decision tree. For example, in the following diagram, notice that the system trains each decision tree on about two-thirds of the examples and then evaluates against the remaining one-third of the examples.

A decision forest consisting of three decision trees. One decision tree trains on two-thirds of the examples and then uses the remaining one-third for OOB evaluation. A second decision tree trains on a different two-thirds of the examples than the previous decision tree, and then uses a different one-third for OOB evaluation than the previous decision tree.

Out-of-bag evaluation is a computationally efficient and conservative approximation of the cross-validation mechanism. In cross-validation, one model is trained for each cross-validation round (for example, 10 models are trained in a 10-fold cross-validation). With OOB evaluation, a single model is trained. Because bagging withholds some data from each tree during training, OOB evaluation can use that data to approximate cross-validation.
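
For example, scikit-learn's random forest can report an out-of-bag score; the sketch below assumes that library and synthetic data:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier

  X, y = make_classification(n_samples=500, random_state=0)

  # oob_score=True evaluates each example only with the trees that did not
  # see it during training, and aggregates the results into a single score.
  forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                  random_state=0).fit(X, y)
  print(forest.oob_score_)  # out-of-bag accuracy estimate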

See Out-of-bag evaluation in the Decision Forests course for more information.

P

permutation variable importances

#df
#Metric

A type of variable importance that evaluates the increase in the prediction error of a model after permuting the feature's values. Permutation variable importance is a model-independent metric.
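
A minimal sketch using scikit-learn's implementation, assuming that library and synthetic data (ideally the scores would be computed on held-out data):

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.inspection import permutation_importance

  X, y = make_classification(n_samples=500, n_features=5, random_state=0)
  model = RandomForestClassifier(random_state=0).fit(X, y)

  # Shuffle each feature in turn and measure how much the model's score drops.
  result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
  print(result.importances_mean)  # one importance score per feature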

R

random forest

#df

An ensemble of decision trees in which each decision tree is trained with a specific random noise, such as bagging.

Random forests are a type of decision forest.

See Random Forest in the Decision Forests course for more information.

root

#df

The starting node (the first condition) in a decision tree. By convention, diagrams put the root at the top of the decision tree. For example:

A decision tree with two conditions and three leaves. The starting condition (x > 2) is the root.

S

sampling with replacement

#df

A method of picking items from a set of candidate items in which the same item can be picked multiple times. The phrase "with replacement" means that after each selection, the selected item is returned to the pool of candidate items. The inverse method, sampling without replacement, means that a candidate item can only be picked once.

For example, consider the following fruit set:

fruit = {kiwi, apple, pear, fig, cherry, lime, mango}

Suppose that the system randomly picks fig as the first item. If using sampling with replacement, then the system picks the second item from the following set:

fruit = {kiwi, apple, pear, fig, cherry, lime, mango}

Yes, that's the same set as before, so the system could potentially pick fig again.

If using sampling without replacement, once picked, a sample can't be picked again. For example, if the system randomly picks fig as the first sample, then fig can't be picked again. Therefore, the system picks the second sample from the following (reduced) set:

fruit = {kiwi, apple, pear, cherry, lime, mango}
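
In Python (illustrative, not part of the glossary), the two sampling modes look like this:

  import random

  fruit = ["kiwi", "apple", "pear", "fig", "cherry", "lime", "mango"]

  # With replacement: the same fruit (for example, fig) can be picked twice.
  print(random.choices(fruit, k=3))

  # Without replacement: each fruit can be picked at most once.
  print(random.sample(fruit, k=3))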

The word replacement in sampling with replacement confuses many people. In English, replacement means "substitution." However, sampling with replacement actually uses the French definition for replacement, which means "putting something back."

The English word replacement is translated as the French word remplacement.


shrinkage

#df

A hyperparameter in gradient boosting that controls overfitting. Shrinkage in gradient boosting is analogous to learning rate in gradient descent. Shrinkage is a decimal value between 0.0 and 1.0. A lower shrinkage value reduces overfitting more than a larger shrinkage value.

split

#df

In a decision tree, another name for a condition.

splitter

#df

While training a decision tree, the routine (and algorithm) responsible for finding the best condition at each node.

T

test

#df

In a decision tree, another name for a condition.

threshold (for decision trees)

#df

In an axis-aligned condition, the value that a feature is being compared against. For example, 75 is the threshold value in the following condition:

grade >= 75

This form of the term threshold is different than classification threshold.

See Exact splitter for binary classification with numerical features in the Decision Forests course for more information.

V

variable importances

#df
#Metric

A set of scores that indicates the relative importance of each feature to the model.

For example, consider a decision tree that estimates house prices. Suppose this decision tree uses three features: size, age, and style. If a set of variable importances for the three features is calculated to be {size=5.8, age=2.5, style=4.7}, then size is more important to the decision tree than age or style.

Different variable importance metrics exist, which can inform ML experts about different aspects of models.
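
As a trivial illustration of reading such scores, using the values from the example above:

  variable_importances = {"size": 5.8, "age": 2.5, "style": 4.7}

  # Rank the features from most to least important.
  ranked = sorted(variable_importances, key=variable_importances.get, reverse=True)
  print(ranked)  # ['size', 'style', 'age']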

W

wisdom of the crowd

#df

The idea that averaging the opinions or estimates of a large group of people ("the crowd") often produces surprisingly good results. For example, consider a game in which people guess the number of jelly beans packed into a large jar. Although most individual guesses will be inaccurate, the average of all the guesses has been empirically shown to be surprisingly close to the actual number of jelly beans in the jar.

Ensembles are a software analog of wisdom of the crowd. Even if individual models make wildly inaccurate predictions, averaging the predictions of many models often generates surprisingly good predictions. For example, although an individual decision tree might make poor predictions, a decision forest often makes very good predictions.
