| Type: | Package |
| Title: | Partial Dependence Plots |
| Version: | 0.8.2 |
| Description: | A general framework for constructing partial dependence (i.e., marginal effect) plots from various types of machine learning models in R. |
| License: | GPL-2 \| GPL-3 [expanded from: GPL (≥ 2)] |
| URL: | https://github.com/bgreenwell/pdp, http://bgreenwell.github.io/pdp/ |
| BugReports: | https://github.com/bgreenwell/pdp/issues |
| Depends: | R (≥ 3.6.0) |
| Suggests: | adabag, AmesHousing, C50, caret, covr, Cubist, doParallel, dplyr, e1071, earth, gbm, gridExtra, ICEbox, ipred, keras, kernlab, magrittr, MASS, Matrix, mda, mlbench, nnet, party, partykit, randomForest, ranger, reticulate, rpart, tinytest, xgboost (≥ 0.6-0), knitr, rmarkdown, vip |
| Imports: | foreach, ggplot2 (≥ 3.0.0), grDevices, lattice, methods, rlang (≥ 0.3.0), stats, utils |
| LazyData: | TRUE |
| RoxygenNote: | 7.3.2 |
| Encoding: | UTF-8 |
| VignetteBuilder: | knitr |
| NeedsCompilation: | yes |
| Packaged: | 2024-10-28 17:22:10 UTC; bgreenwell |
| Author: | Brandon M. Greenwell |
| Maintainer: | Brandon M. Greenwell <greenwell.brandon@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2024-10-28 17:50:02 UTC |
pdp: A general framework for constructing partial dependence (i.e., marginal effect) plots from various types of machine learning models in R.
Description
Partial dependence plots (PDPs) help visualize the relationship between a subset of the features (typically 1-3) and the response while accounting for the average effect of the other predictors in the model. They are particularly effective with black box models like random forests and support vector machines.
Details
The development version can be found on GitHub: https://github.com/bgreenwell/pdp. As of right now, pdp exports four functions:

- partial - construct partial dependence functions (i.e., objects of class "partial") from various fitted model objects;
- plotPartial - plot partial dependence functions (i.e., objects of class "partial") using lattice graphics;
- autoplot - plot partial dependence functions (i.e., objects of class "partial") using ggplot2 graphics;
- topPredictors - extract the most "important" predictors from various types of fitted models.
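For orientation, here is a minimal sketch of how these functions are typically combined. It assumes the randomForest package (listed in Suggests) is installed and uses the boston data shipped with pdp:

library(randomForest)
library(pdp)

# Fit a random forest to the Boston housing data bundled with pdp
data(boston, package = "pdp")
set.seed(101)
fit <- randomForest(cmedv ~ ., data = boston, importance = TRUE)

# Compute and plot partial dependence of cmedv on lstat
pd <- partial(fit, pred.var = "lstat")
plotPartial(pd)   # lattice graphics
# autoplot(pd)    # ggplot2 graphics (requires ggplot2 to be loaded)

# Extract the most "important" predictors
# (may require the caret package, which provides the varImp() generic)
topPredictors(fit, n = 3)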
Author(s)
Maintainer: Brandon M. Greenwell <greenwell.brandon@gmail.com> (ORCID)
See Also
Useful links:

https://github.com/bgreenwell/pdp

http://bgreenwell.github.io/pdp/

Report bugs at https://github.com/bgreenwell/pdp/issues
Plotting Partial Dependence Functions
Description
Plots partial dependence functions (i.e., marginal effects) using ggplot2 graphics.
Usage
## S3 method for class 'partial'
autoplot(
  object,
  center = FALSE,
  plot.pdp = TRUE,
  pdp.color = "red",
  pdp.size = 1,
  pdp.linetype = 1,
  rug = FALSE,
  smooth = FALSE,
  smooth.method = "auto",
  smooth.formula = y ~ x,
  smooth.span = 0.75,
  smooth.method.args = list(),
  contour = FALSE,
  contour.color = "white",
  train = NULL,
  xlab = NULL,
  ylab = NULL,
  main = NULL,
  legend.title = "yhat",
  ...
)

## S3 method for class 'ice'
autoplot(
  object,
  center = FALSE,
  plot.pdp = TRUE,
  pdp.color = "red",
  pdp.size = 1,
  pdp.linetype = 1,
  rug = FALSE,
  train = NULL,
  xlab = NULL,
  ylab = NULL,
  main = NULL,
  ...
)

## S3 method for class 'cice'
autoplot(
  object,
  plot.pdp = TRUE,
  pdp.color = "red",
  pdp.size = 1,
  pdp.linetype = 1,
  rug = FALSE,
  train = NULL,
  xlab = NULL,
  ylab = NULL,
  main = NULL,
  ...
)

Arguments
| object | An object that inherits from the "partial", "ice", or "cice" class. |
| center | Logical indicating whether or not to produce centered ICE curves (c-ICE curves). Only useful when object represents ICE curves. Default is FALSE. |
| plot.pdp | Logical indicating whether or not to plot the partial dependence function on top of the ICE curves. Default is TRUE. |
| pdp.color | Character string specifying the color to use for the partial dependence function when plot.pdp = TRUE. Default is "red". |
| pdp.size | Positive number specifying the line width to use for the partial dependence function when plot.pdp = TRUE. Default is 1. |
| pdp.linetype | Positive number specifying the line type to use for the partial dependence function when plot.pdp = TRUE. Default is 1. |
| rug | Logical indicating whether or not to include rug marks on the predictor axes. Default is FALSE. |
| smooth | Logical indicating whether or not to overlay a LOESS smooth. Default is FALSE (see the sketch following this table). |
| smooth.method | Character string specifying the smoothing method (function) to use (e.g., "lm", "glm", "gam", "loess", or "rlm"). Default is "auto". |
| smooth.formula | Formula to use in the smoothing function (e.g., y ~ x or y ~ poly(x, 2)). Default is y ~ x. |
| smooth.span | Controls the amount of smoothing for the default loess smoother. Smaller numbers produce wigglier lines, larger numbers produce smoother lines. Default is 0.75. |
| smooth.method.args | List containing additional arguments to be passed on to the modeling function defined by smooth.method. |
| contour | Logical indicating whether or not to add contour lines to the level plot. Default is FALSE. |
| contour.color | Character string specifying the color to use for the contour lines when contour = TRUE. Default is "white". |
| train | Data frame containing the original training data. Only required if rug = TRUE. Default is NULL. |
xlab | Character string specifying the text for the x-axis label. |
ylab | Character string specifying the text for the y-axis label. |
| main | Character string specifying the text for the main title of the plot. |
| legend.title | Character string specifying the text for the legend title. Default is "yhat". |
| ... | Additional (optional) arguments to be passed on to the underlying ggplot2 plotting functions. |
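As a brief illustration of the smoothing arguments above, the following sketch (assuming a fitted randomForest model boston.rf, as in the examples that follow) overlays a wigglier LOESS smooth on a single-predictor display:

library(ggplot2)  # for the autoplot() generic

# Compute partial dependence, then add a LOESS smooth with a smaller span
pd <- partial(boston.rf, pred.var = "lstat")
autoplot(pd, smooth = TRUE, smooth.span = 0.3) +  # smaller span = wigglier smooth
  theme_bw()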
Value
A"ggplot" object.
Examples
## Not run:
#
# Regression example (requires randomForest package to run)
#

# Load required packages
library(ggplot2)    # for autoplot() generic
library(gridExtra)  # for `grid.arrange()`
library(magrittr)   # for forward pipe operator `%>%`
library(randomForest)

# Fit a random forest to the Boston housing data
data(boston)  # load the boston housing data
set.seed(101)  # for reproducibility
boston.rf <- randomForest(cmedv ~ ., data = boston)

# Partial dependence of cmedv on lstat
boston.rf %>%
  partial(pred.var = "lstat") %>%
  autoplot(rug = TRUE, train = boston) +
  theme_bw()

# Partial dependence of cmedv on lstat and rm
boston.rf %>%
  partial(pred.var = c("lstat", "rm"), chull = TRUE, progress = TRUE) %>%
  autoplot(contour = TRUE, legend.title = "cmedv",
           option = "B", direction = -1) +
  theme_bw()

# ICE curves and c-ICE curves
age.ice <- partial(boston.rf, pred.var = "lstat", ice = TRUE)
grid.arrange(
  autoplot(age.ice, alpha = 0.1),                 # ICE curves
  autoplot(age.ice, center = TRUE, alpha = 0.1),  # c-ICE curves
  ncol = 2
)

## End(Not run)

Boston Housing Data
Description
Data on median housing values from 506 census tracts in the suburbs of Boston from the 1970 census. This data frame is a corrected version of the original data by Harrison and Rubinfeld (1978) with additional spatial information. The data were taken directly from BostonHousing2 and unneeded columns (i.e., name of town, census tract, and the uncorrected median home value) were removed.
Usage
data(boston)

Format
A data frame with 506 rows and 16 variables.
lon: Longitude of census tract.
lat: Latitude of census tract.
cmedv: Corrected median value of owner-occupied homes in USD 1000's.
crim: Per capita crime rate by town.
zn: Proportion of residential land zoned for lots over 25,000 sq. ft.
indus: Proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: Nitric oxides concentration (parts per 10 million).
rm: Average number of rooms per dwelling.
age: Proportion of owner-occupied units built prior to 1940.
dis: Weighted distances to five Boston employment centers.
rad: Index of accessibility to radial highways.
tax: Full-value property-tax rate per USD 10,000.
ptratio: Pupil-teacher ratio by town.
b: 1000(B - 0.63)^2, where B is the proportion of blacks by town.
lstat: Percentage of lower status of the population.
References
Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81-102.

Gilley, O.W., and R. Kelley Pace (1996). On the Harrison and Rubinfeld Data. Journal of Environmental Economics and Management, 31, 403-405.

Newman, D.J., Hettich, S., Blake, C.L., and Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

Pace, R. Kelley, and O.W. Gilley (1997). Using the Spatial Configuration of the Data to Improve Estimation. Journal of the Real Estate Finance and Economics, 14, 333-340.

Friedrich Leisch and Evgenia Dimitriadou (2010). mlbench: Machine Learning Benchmark Problems. R package version 2.1-1.
Examples
head(boston)

Exemplar observation
Description
Construct a single "exemplar" record from a data frame. For now, all numeric columns (including "Date" objects) are replaced with their corresponding median value and non-numeric columns are replaced with their most frequent value.
Usage
exemplar(object)

## S3 method for class 'data.frame'
exemplar(object)

## S3 method for class 'matrix'
exemplar(object)

## S3 method for class 'dgCMatrix'
exemplar(object)

Arguments
| object | A data frame, matrix, or dgCMatrix (sparse matrix). |
Value
A data frame with the same number of columns as object and a single row.
Examples
set.seed(1554)  # for reproducibility
train <- data.frame(
  x = rnorm(100),
  y = sample(letters[1L:3L], size = 100, replace = TRUE,
             prob = c(0.1, 0.1, 0.8))
)
exemplar(train)

Partial Dependence Functions
Description
Compute partial dependence functions (i.e., marginal effects) for various model fitting objects.
Usage
partial(object, ...)

## Default S3 method:
partial(
  object,
  pred.var,
  pred.grid,
  pred.fun = NULL,
  grid.resolution = NULL,
  ice = FALSE,
  center = FALSE,
  approx = FALSE,
  quantiles = FALSE,
  probs = 1:9/10,
  trim.outliers = FALSE,
  type = c("auto", "regression", "classification"),
  inv.link = NULL,
  which.class = 1L,
  prob = FALSE,
  recursive = TRUE,
  plot = FALSE,
  plot.engine = c("lattice", "ggplot2"),
  smooth = FALSE,
  rug = FALSE,
  chull = FALSE,
  levelplot = TRUE,
  contour = FALSE,
  contour.color = "white",
  alpha = 1,
  train,
  cats = NULL,
  check.class = TRUE,
  progress = FALSE,
  parallel = FALSE,
  paropts = NULL,
  ...
)

## S3 method for class 'model_fit'
partial(object, ...)

Arguments
| object | A fitted model object of appropriate class (e.g., "gbm", "lm", "randomForest", "train", etc.). |
| ... | Additional optional arguments to be passed on to the underlying prediction method. |
| pred.var | Character string giving the names of the predictor variables of interest. For reasons of computation/interpretation, this should include no more than three variables. |
| pred.grid | Data frame containing the joint values of interest for the variables listed in pred.var. If not supplied, a grid is constructed from the training data. |
| pred.fun | Optional prediction function that requires two arguments: object and newdata. If specified, the function must return a single prediction or a vector of predictions (e.g., one for each observation in newdata). Default is NULL. |
| grid.resolution | Integer giving the number of equally spaced points to use for the continuous variables listed in pred.var when pred.grid is not supplied. Default is NULL. |
| ice | Logical indicating whether or not to compute individual conditional expectation (ICE) curves. Default is FALSE. |
| center | Logical indicating whether or not to produce centered ICE curves (c-ICE curves). Only used when ice = TRUE. Default is FALSE. |
| approx | Logical indicating whether or not to compute a faster, but approximate, marginal effect plot (similar in spirit to the plotmo package). If TRUE, the other predictors are held fixed at a single "exemplar" value (see exemplar) rather than averaged over. Default is FALSE. |
| quantiles | Logical indicating whether or not to use the sample quantiles of the continuous predictors listed in pred.var when constructing the prediction grid. Default is FALSE. |
| probs | Numeric vector of probabilities with values in [0,1]. (Values up to 2e-14 outside that range are accepted and moved to the nearby endpoint.) Default is 1:9/10 (i.e., the deciles); only used when quantiles = TRUE. |
| trim.outliers | Logical indicating whether or not to trim off outliers from the continuous predictors listed in pred.var before constructing the prediction grid. Default is FALSE. |
| type | Character string specifying the type of supervised learning. Current options are "auto", "regression", and "classification". Default is "auto". |
| inv.link | Function specifying the transformation to be applied to the predictions before the partial dependence function is computed (experimental). Default is NULL (i.e., no transformation). |
| which.class | Integer specifying which column of the matrix of predicted probabilities to use as the "focus" class. Default is to use the first class. Only used for classification problems. |
| prob | Logical indicating whether or not partial dependence for classification problems should be returned on the probability scale, rather than the centered logit. Default is FALSE. |
| recursive | Logical indicating whether or not to use the weighted tree traversal method described in Friedman (2001). This only applies to objects that inherit from class "gbm". Default is TRUE. |
| plot | Logical indicating whether to return a data frame containing the partial dependence values (FALSE) or to plot the partial dependence function directly (TRUE). Default is FALSE. |
| plot.engine | Character string specifying which plotting engine to use whenever plot = TRUE. Options are "lattice" (default) or "ggplot2". |
| smooth | Logical indicating whether or not to overlay a LOESS smooth. Default is FALSE; only used when plot = TRUE. |
| rug | Logical indicating whether or not to include a rug display on the predictor axes. The tick marks indicate the min/max and deciles of the predictor distributions. This helps reduce the risk of interpreting the partial dependence plot outside the region of the data (i.e., extrapolating). Only used when plot = TRUE. Default is FALSE. |
| chull | Logical indicating whether or not to restrict the values of the first two variables in pred.var to lie within the convex hull of their training values; this helps avoid extrapolation. Default is FALSE. |
| levelplot | Logical indicating whether or not to use a false color level plot (TRUE) or a 3-D surface (FALSE) for two-predictor displays. Default is TRUE. |
| contour | Logical indicating whether or not to add contour lines to the level plot. Only used when levelplot = TRUE. Default is FALSE. |
| contour.color | Character string specifying the color to use for the contour lines when contour = TRUE. Default is "white". |
| alpha | Numeric value in [0, 1] specifying the opacity of the plotted curves; useful when plotting many ICE curves. Default is 1. |
| train | An optional data frame, matrix, or sparse matrix containing the original training data. This may be required depending on the class of object, since some objects do not store a copy of the training data. |
| cats | Character string indicating which columns of train should be treated as categorical variables. Only used when train is a matrix or sparse matrix. Default is NULL. |
| check.class | Logical indicating whether or not to make sure each column in pred.grid has the correct class, levels, etc. Default is TRUE. |
| progress | Logical indicating whether or not to display a text-based progress bar. Default is FALSE. |
| parallel | Logical indicating whether or not to run partial in parallel using a backend provided by the foreach package. Default is FALSE (see the sketch following this table for one way to register a backend). |
| paropts | List containing additional options to be passed on to foreach when parallel = TRUE. |
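The following is a hedged sketch of the parallel workflow referred to by parallel and paropts above. It assumes the doParallel package (listed in Suggests) is available and that boston.rf is a fitted randomForest model as in the examples below:

library(doParallel)  # provides a foreach parallel backend

cl <- makeCluster(2)    # start a small cluster
registerDoParallel(cl)  # register it with foreach

# Compute partial dependence in parallel; export the randomForest
# package to the workers via paropts
pd <- partial(boston.rf, pred.var = c("lstat", "rm"), parallel = TRUE,
              paropts = list(.packages = "randomForest"))

stopCluster(cl)  # shut the cluster down when finished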
Value
By default, partial returns an object of class c("data.frame", "partial"). If ice = TRUE and center = FALSE, then an object of class c("data.frame", "ice") is returned. If ice = TRUE and center = TRUE, then an object of class c("data.frame", "cice") is returned. These three classes determine the behavior of the plotPartial function, which is automatically called whenever plot = TRUE. Specifically, when plot = TRUE, a "trellis" object is returned (see lattice for details); the "trellis" object will also include an additional attribute, "partial.data", containing the data displayed in the plot.
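As a quick illustration of these return classes (a sketch assuming boston.rf is a fitted randomForest model, as in the examples below):

pd   <- partial(boston.rf, pred.var = "lstat")                             # partial dependence
ice  <- partial(boston.rf, pred.var = "lstat", ice = TRUE)                 # ICE curves
cice <- partial(boston.rf, pred.var = "lstat", ice = TRUE, center = TRUE)  # centered ICE curves

inherits(pd, "partial")  # TRUE
inherits(ice, "ice")     # TRUE
inherits(cice, "cice")   # TRUE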
Note
In some cases it is difficult for partial to extract the original training data from object. In these cases an error message is displayed requesting the user to supply the training data via the train argument in the call to partial. In most cases where partial can extract the required training data from object, it is taken from the same environment in which partial is called. Therefore, it is important not to change the training data used to construct object before calling partial. This problem is completely avoided when the training data are passed to the train argument in the call to partial.

It is recommended to call partial with plot = FALSE and store the results. This allows for more flexible plotting, and the user will not have to waste time calling partial again if the default plot is not sufficient.
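A sketch combining both recommendations (pass the training data explicitly and store the results rather than plotting directly), again assuming boston.rf and boston as in the examples below:

# Supply the training data explicitly and keep the raw partial dependence values
pd <- partial(boston.rf, pred.var = "lstat", train = boston, plot = FALSE)

# Plot later, in whichever system is convenient, without recomputing
plotPartial(pd)  # lattice graphics
# library(ggplot2); autoplot(pd) + theme_bw()  # ggplot2 alternative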
It is possible to retrieve the last printed "trellis" object, such as those produced by plotPartial, using trellis.last.object().

If ice = TRUE or the prediction function given to pred.fun returns a prediction for each observation in newdata, then the result will be a curve for each observation. These are called individual conditional expectation (ICE) curves; see Goldstein et al. (2015) and ice for details.
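For example, a user-supplied prediction function that returns one prediction per observation (rather than their average) yields ICE curves. The following sketch assumes boston.rf as in the examples below:

# Custom prediction function returning a vector of predictions
# (one per row of newdata) instead of their average
pred.ice <- function(object, newdata) {
  predict(object, newdata = newdata)
}

# Similar in spirit to calling partial() with ice = TRUE
rm.ice <- partial(boston.rf, pred.var = "rm", pred.fun = pred.ice)
plotPartial(rm.ice, alpha = 0.2)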
References
J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29: 1189-1232, 2001.

Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E. Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation. Journal of Computational and Graphical Statistics, 24(1): 44-65, 2015.
Examples
## Not run:
#
# Regression example (requires randomForest package to run)
#

# Fit a random forest to the boston housing data
library(randomForest)
data(boston)  # load the boston housing data
set.seed(101)  # for reproducibility
boston.rf <- randomForest(cmedv ~ ., data = boston)

# Using randomForest's partialPlot function
partialPlot(boston.rf, pred.data = boston, x.var = "lstat")

# Using pdp's partial function
head(partial(boston.rf, pred.var = "lstat"))  # returns a data frame
partial(boston.rf, pred.var = "lstat", plot = TRUE, rug = TRUE)

# The partial function allows for multiple predictors
partial(boston.rf, pred.var = c("lstat", "rm"), grid.resolution = 40,
        plot = TRUE, chull = TRUE, progress = TRUE)

# The plotPartial function offers more flexible plotting
pd <- partial(boston.rf, pred.var = c("lstat", "rm"), grid.resolution = 40)
plotPartial(pd, levelplot = FALSE, zlab = "cmedv", drape = TRUE,
            colorkey = FALSE, screen = list(z = -20, x = -60))

# The autoplot function can be used to produce graphics based on ggplot2
library(ggplot2)
autoplot(pd, contour = TRUE, legend.title = "Partial\ndependence")

#
# Individual conditional expectation (ICE) curves
#

# Use partial to obtain ICE/c-ICE curves
rm.ice <- partial(boston.rf, pred.var = "rm", ice = TRUE)
plotPartial(rm.ice, rug = TRUE, train = boston, alpha = 0.2)
autoplot(rm.ice, center = TRUE, alpha = 0.2, rug = TRUE, train = boston)

#
# Classification example (requires randomForest package to run)
#

# Fit a random forest to the Pima Indians diabetes data
data(pima)  # load the Pima Indians diabetes data
set.seed(102)  # for reproducibility
pima.rf <- randomForest(diabetes ~ ., data = pima, na.action = na.omit)

# Partial dependence of positive test result on glucose (default logit scale)
partial(pima.rf, pred.var = "glucose", plot = TRUE, chull = TRUE,
        progress = TRUE)

# Partial dependence of positive test result on glucose (probability scale)
partial(pima.rf, pred.var = "glucose", prob = TRUE, plot = TRUE, chull = TRUE,
        progress = TRUE)

## End(Not run)

Pima Indians Diabetes Data
Description
Diabetes test results collected by the US National Institute of Diabetes and Digestive and Kidney Diseases from a population of women who were at least 21 years old, of Pima Indian heritage, and living near Phoenix, Arizona. The data were taken directly from PimaIndiansDiabetes2.
Usage
data(pima)

Format
A data frame with 768 observations on 9 variables.
pregnant: Number of times pregnant.
glucose: Plasma glucose concentration (glucose tolerance test).
pressure: Diastolic blood pressure (mm Hg).
triceps: Triceps skin fold thickness (mm).
insulin: 2-Hour serum insulin (mu U/ml).
mass: Body mass index (weight in kg/(height in m)^2).
pedigree: Diabetes pedigree function.
age: Age (years).
diabetes: Factor indicating the diabetes test result (neg/pos).
References
Newman, D.J., Hettich, S., Blake, C.L., and Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

Brian D. Ripley (1996). Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge.

Grace Wahba, Chong Gu, Yuedong Wang, and Richard Chappell (1995). Soft Classification, a.k.a. Risk Estimation, via Penalized Log Likelihood and Smoothing Spline Analysis of Variance, in D. H. Wolpert (1995), The Mathematics of Generalization, 331-359, Addison-Wesley, Reading, MA.

Friedrich Leisch and Evgenia Dimitriadou (2010). mlbench: Machine Learning Benchmark Problems. R package version 2.1-1.
Examples
head(pima)

Plotting Partial Dependence Functions
Description
Plots partial dependence functions (i.e., marginal effects) using lattice graphics.
Usage
plotPartial(object, ...)

## S3 method for class 'ice'
plotPartial(
  object,
  center = FALSE,
  plot.pdp = TRUE,
  pdp.col = "red2",
  pdp.lwd = 2,
  pdp.lty = 1,
  rug = FALSE,
  train = NULL,
  ...
)

## S3 method for class 'cice'
plotPartial(
  object,
  plot.pdp = TRUE,
  pdp.col = "red2",
  pdp.lwd = 2,
  pdp.lty = 1,
  rug = FALSE,
  train = NULL,
  ...
)

## S3 method for class 'partial'
plotPartial(
  object,
  center = FALSE,
  plot.pdp = TRUE,
  pdp.col = "red2",
  pdp.lwd = 2,
  pdp.lty = 1,
  smooth = FALSE,
  rug = FALSE,
  chull = FALSE,
  levelplot = TRUE,
  contour = FALSE,
  contour.color = "white",
  col.regions = NULL,
  number = 4,
  overlap = 0.1,
  train = NULL,
  ...
)

Arguments
| object | An object that inherits from the "partial", "ice", or "cice" class. |
| ... | Additional optional arguments to be passed on to the underlying lattice plotting functions (e.g., xyplot or levelplot). |
| center | Logical indicating whether or not to produce centered ICE curves (c-ICE curves). Only useful when object represents ICE curves. Default is FALSE. |
| plot.pdp | Logical indicating whether or not to plot the partial dependence function on top of the ICE curves. Default is TRUE. |
| pdp.col | Character string specifying the color to use for the partial dependence function when plot.pdp = TRUE. Default is "red2". |
| pdp.lwd | Integer specifying the line width to use for the partial dependence function when plot.pdp = TRUE. Default is 2. |
| pdp.lty | Integer or character string specifying the line type to use for the partial dependence function when plot.pdp = TRUE. Default is 1. |
| rug | Logical indicating whether or not to include rug marks on the predictor axes. Default is FALSE. |
| train | Data frame containing the original training data. Only required if rug = TRUE or chull = TRUE. Default is NULL. |
| smooth | Logical indicating whether or not to overlay a LOESS smooth. Default is FALSE. |
| chull | Logical indicating whether or not to restrict the first two variables in pred.var to lie within the convex hull of their training values. Default is FALSE. |
| levelplot | Logical indicating whether or not to use a false color level plot (TRUE) or a 3-D surface (FALSE) for two-predictor displays. Default is TRUE. |
| contour | Logical indicating whether or not to add contour lines to the level plot. Only used when levelplot = TRUE. Default is FALSE. |
| contour.color | Character string specifying the color to use for the contour lines when contour = TRUE. Default is "white". |
| col.regions | Vector of colors to be passed on to the lattice level plot whenever levelplot = TRUE. Default is NULL. |
| number | Integer specifying the number of conditional intervals to use for the continuous panel variables. See lattice::equal.count() for details and the sketch following this table. Default is 4. |
| overlap | The fraction of overlap of the conditioning variables. See lattice::equal.count() for details. Default is 0.1. |
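A hedged sketch of how number and overlap come into play when a third predictor is conditioned on (assuming boston.rf as in the examples below; the choice of predictors and grid resolution here is purely illustrative):

# Partial dependence on lstat and rm, conditioned on shingled intervals of age
pd3 <- partial(boston.rf, pred.var = c("lstat", "rm", "age"),
               chull = TRUE, grid.resolution = 10)
plotPartial(pd3, number = 3, overlap = 0.2)  # three conditioning panels for age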
Examples
## Not run:
#
# Regression example (requires randomForest package to run)
#

# Load required packages
library(gridExtra)  # for `grid.arrange()`
library(magrittr)   # for forward pipe operator `%>%`
library(randomForest)

# Fit a random forest to the Boston housing data
data(boston)  # load the boston housing data
set.seed(101)  # for reproducibility
boston.rf <- randomForest(cmedv ~ ., data = boston)

# Partial dependence of cmedv on lstat
boston.rf %>%
  partial(pred.var = "lstat") %>%
  plotPartial(rug = TRUE, train = boston)

# Partial dependence of cmedv on lstat and rm
boston.rf %>%
  partial(pred.var = c("lstat", "rm"), chull = TRUE, progress = TRUE) %>%
  plotPartial(contour = TRUE, legend.title = "rm")

# ICE curves and c-ICE curves
age.ice <- partial(boston.rf, pred.var = "lstat", ice = TRUE)
p1 <- plotPartial(age.ice, alpha = 0.1)
p2 <- plotPartial(age.ice, center = TRUE, alpha = 0.1)
grid.arrange(p1, p2, ncol = 2)

## End(Not run)

Extract Most "Important" Predictors (Experimental)
Description
Extract the most "important" predictors for regression and classification models.
Usage
topPredictors(object, n = 1L, ...)

## Default S3 method:
topPredictors(object, n = 1L, ...)

## S3 method for class 'train'
topPredictors(object, n = 1L, ...)

Arguments
| object | A fitted model object of appropriate class (e.g., "gbm", "randomForest", "train", etc.). |
| n | Integer specifying the number of predictors to return. Default is 1L. |
| ... | Additional optional arguments to be passed on to varImp. |
Details
This function uses the generic function varImp to calculate variable importance scores for each predictor. After that, they are sorted and the names of the n highest scoring predictors are returned.
Examples
## Not run:
#
# Regression example (requires randomForest package to run)
#

# Load required packages
library(ggplot2)
library(randomForest)

# Fit a random forest to the mtcars dataset
data(mtcars, package = "datasets")
set.seed(101)
mtcars.rf <- randomForest(mpg ~ ., data = mtcars, mtry = 5, importance = TRUE)

# Top four predictors
top4 <- topPredictors(mtcars.rf, n = 4)

# Construct partial dependence functions for top four predictors
pd <- NULL
for (i in top4) {
  tmp <- partial(mtcars.rf, pred.var = i)
  names(tmp) <- c("x", "y")
  pd <- rbind(pd, cbind(tmp, predictor = i))
}

# Display partial dependence functions
ggplot(pd, aes(x, y)) +
  geom_line() +
  facet_wrap(~ predictor, scales = "free") +
  theme_bw() +
  ylab("mpg")

## End(Not run)

Retrieve the last trellis object
Description
See trellis.last.object for more details.
Usage
trellis.last.object(..., prefix)