The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns it into tidy tibbles.
The concept of “tidy data”, as introduced by Hadley Wickham, offers a powerful framework for data manipulation and analysis. That paper makes a convincing statement of the problem this package tries to solve (emphasis mine):
While model inputs usually require tidy inputs, such attention to detail doesn’t carry over to model outputs. Outputs such as predictions and estimated coefficients aren’t always tidy. This makes it more difficult to combine results from multiple models. For example, in R, the default representation of model coefficients is not tidy because it does not have an explicit variable that records the variable name for each estimate, they are instead recorded as row names. In R, row names must be unique, so combining coefficients from many models (e.g., from bootstrap resamples, or subgroups) requires workarounds to avoid losing important information. This knocks you out of the flow of analysis and makes it harder to combine the results from multiple models. I’m not currently aware of any packages that resolve this problem.
broom is an attempt to bridge the gap from untidy outputs of predictions and estimations to the tidy data we want to work with. It centers around three S3 methods, each of which takes common objects produced by R statistical functions (lm, t.test, nls, etc.) and converts them into a tibble. broom is particularly designed to work with Hadley’s dplyr package (see the broom+dplyr vignette for more).
broom should be distinguished from packages like reshape2 and tidyr, which rearrange and reshape data frames into different forms. Those packages perform critical tasks in tidy data analysis but focus on converting data frames from one specific format into another. In contrast, broom is designed to take output that is not in a tabular data format (sometimes not anywhere close) and convert it to a tidy tibble.
Tidying model outputs is not an exact science, and it’s based on a judgment of the kinds of values a data scientist typically wants out of a tidy analysis (for instance, estimates, test statistics, and p-values). You may lose some of the information in the original object that you wanted, or keep more information than you need. If you think the tidy output for a model should be changed, or if you’re missing a tidying function for an S3 class that you’d like, I strongly encourage you to open an issue or a pull request.
This package provides three S3 methods that do three distinct kinds of tidying.
- tidy: constructs a tibble that summarizes the model’s statistical findings. This includes coefficients and p-values for each term in a regression, per-cluster information in clustering applications, or per-test information for multtest functions.
- augment: adds columns to the original data that was modeled. This includes predictions, residuals, and cluster assignments.
- glance: constructs a concise one-row summary of the model. This typically contains values such as R^2, adjusted R^2, and residual standard error that are computed once for the entire model.

Note that some classes may have only one or two of these methods defined.
Consider as an illustrative example a linear fit on the built-in mtcars dataset.
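The printed output that follows comes from a fit of this kind and its summary. A minimal sketch of the calls, assuming the model is stored under the name lmfit (the name that appears later in coef(summary(lmfit))):

```r
# Fit miles per gallon as a linear function of weight on the built-in mtcars data.
lmfit <- lm(mpg ~ wt, data = mtcars)

# Print the bare fit, then the more detailed summary.
print(lmfit)
print(summary(lmfit))
```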
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

This summary output is useful enough if you just want to read it. However, converting it to tabular data that contains all the same information, so that you can combine it with other models or do further analysis, is not trivial. You have to do coef(summary(lmfit)) to get a matrix of coefficients, the terms are still stored in row names, and the column names are inconsistent with other packages (e.g. Pr(>|t|) compared to p.value).
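To make that difficulty concrete, here is a rough sketch of the manual conversion using base R only. The object names coefs and manual, and the hand-picked column names, are illustrative choices rather than anything a package prescribes.

```r
lmfit <- lm(mpg ~ wt, data = mtcars)

# coef(summary()) returns a numeric matrix with the terms stored as row names.
coefs <- coef(summary(lmfit))

# Convert to a data frame, move the row names into a column, and drop them.
manual <- as.data.frame(coefs)
manual <- cbind(term = rownames(manual), manual)
rownames(manual) <- NULL

# Replace names such as "Pr(>|t|)" with ones that are easier to work with.
names(manual) <- c("term", "estimate", "std.error", "statistic", "p.value")
manual
```

Every step here has to be repeated, and possibly adapted, for each new kind of model object.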
Instead, you can use the tidy function, from the broom package, on the fit:
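A minimal sketch of that call, assuming broom is installed; the name lm_tidy is only there so the result can be reused:

```r
library(broom)

lmfit <- lm(mpg ~ wt, data = mtcars)

# One row per model term, with the term name in an ordinary column.
lm_tidy <- tidy(lmfit)
lm_tidy

# Columns can be pulled out by name.
lm_tidy$p.value
```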
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    37.3      1.88      19.9  8.24e-19
## 2 wt             -5.34     0.559     -9.56 1.29e-10

This gives you a tabular data representation. Note that the row names have been moved into a column called term, and the column names are simple and consistent (and can be accessed using $).
Instead of viewing the coefficients, you might be interested in the fitted values and residuals for each of the original points in the regression. For this, use augment, which augments the original data with information from the model:
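A short sketch of the call, again assuming the fit is named lmfit; lm_aug is an illustrative name:

```r
library(broom)

lmfit <- lm(mpg ~ wt, data = mtcars)

# One row per observation: the modeled columns plus per-observation model output.
lm_aug <- augment(lmfit)
lm_aug
```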
## # A tibble: 32 × 9
##    .rownames           mpg    wt .fitted .resid   .hat .sigma .cooksd .std.resid
##    <chr>             <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
##  1 Mazda RX4          21    2.62    23.3 -2.28  0.0433   3.07 1.33e-2    -0.766
##  2 Mazda RX4 Wag      21    2.88    21.9 -0.920 0.0352   3.09 1.72e-3    -0.307
##  3 Datsun 710         22.8  2.32    24.9 -2.09  0.0584   3.07 1.54e-2    -0.706
##  4 Hornet 4 Drive     21.4  3.22    20.1  1.30  0.0313   3.09 3.02e-3     0.433
##  5 Hornet Sportabout  18.7  3.44    18.9 -0.200 0.0329   3.10 7.60e-5    -0.0668
##  6 Valiant            18.1  3.46    18.8 -0.693 0.0332   3.10 9.21e-4    -0.231
##  7 Duster 360         14.3  3.57    18.2 -3.91  0.0354   3.01 3.13e-2    -1.31
##  8 Merc 240D          24.4  3.19    20.2  4.16  0.0313   3.00 3.11e-2     1.39
##  9 Merc 230           22.8  3.15    20.5  2.35  0.0314   3.07 9.96e-3     0.784
## 10 Merc 280           19.2  3.44    18.9  0.300 0.0329   3.10 1.71e-4     0.100
## # ℹ 22 more rows

Note that each of the new columns begins with a . (to avoid overwriting any of the original columns).
Finally, several summary statistics are computed for the entire regression, such as R^2 and the F-statistic. These can be accessed with the glance function:
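As a sketch, with lm_glance as an illustrative name for the result:

```r
library(broom)

lmfit <- lm(mpg ~ wt, data = mtcars)

# A single row of model-level statistics.
lm_glance <- glance(lmfit)
lm_glance
```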
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.753         0.745  3.05      91.4 1.29e-10     1  -80.0  166.  170.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

This distinction between the tidy, augment and glance functions is explored in a different context in the k-means vignette.
These functions apply equally well to the output from glm:
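The calls behind the output below are not shown in the text. The sketch that follows assumes a logistic regression of am on wt, which matches the am and wt columns in that output; the binomial family and the name glmfit are assumptions.

```r
library(broom)

# Assumed model: transmission type (am) as a function of weight, logistic link.
glmfit <- glm(am ~ wt, data = mtcars, family = binomial)

glm_tidy   <- tidy(glmfit)     # one row per coefficient
glm_aug    <- augment(glmfit)  # one row per observation
glm_glance <- glance(glmfit)   # one row for the whole model

glm_tidy
glm_aug
glm_glance
```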
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    12.0       4.51      2.67 0.00759
## 2 wt             -4.02      1.44     -2.80 0.00509

## # A tibble: 32 × 9
##    .rownames            am    wt .fitted .resid   .hat .sigma .cooksd .std.resid
##    <chr>             <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
##  1 Mazda RX4             1  2.62   1.50   0.635 0.126   0.803 0.0184       0.680
##  2 Mazda RX4 Wag         1  2.88   0.471  0.985 0.108   0.790 0.0424       1.04
##  3 Datsun 710            1  2.32   2.70   0.360 0.0963  0.810 0.00394      0.379
##  4 Hornet 4 Drive        0  3.22  -0.897 -0.827 0.0744  0.797 0.0177      -0.860
##  5 Hornet Sportabout     0  3.44  -1.80  -0.553 0.0681  0.806 0.00647     -0.572
##  6 Valiant               0  3.46  -1.88  -0.532 0.0674  0.807 0.00590     -0.551
##  7 Duster 360            0  3.57  -2.33  -0.432 0.0625  0.809 0.00348     -0.446
##  8 Merc 240D             0  3.19  -0.796 -0.863 0.0755  0.796 0.0199      -0.897
##  9 Merc 230              0  3.15  -0.635 -0.922 0.0776  0.793 0.0242      -0.960
## 10 Merc 280              0  3.44  -1.80  -0.553 0.0681  0.806 0.00647     -0.572
## # ℹ 22 more rows

## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          43.2      31  -9.59  23.2  26.1     19.2          30    32

Note that the statistics computed by glance are different for glm objects than for lm (e.g. deviance rather than R^2).
These functions also work on other fits, such as nonlinear models (nls):
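The model formula is not given in the text. The sketch below assumes a model of the form mpg = k / wt + b, chosen because the tidy output below has terms named k and b; the formula, the starting values, and the name nlsfit are all assumptions.

```r
library(broom)

# Assumed nonlinear model with parameters k and b; starting values are a guess.
nlsfit <- nls(mpg ~ k / wt + b, data = mtcars, start = list(k = 1, b = 0))

nls_tidy   <- tidy(nlsfit)
# Passing the original data keeps every mtcars column alongside .fitted and .resid.
nls_aug    <- augment(nlsfit, mtcars)
nls_glance <- glance(nlsfit)

nls_tidy
nls_aug
nls_glance
```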
## # A tibble: 2 × 5
##   term  estimate std.error statistic  p.value
##   <chr>    <dbl>     <dbl>     <dbl>    <dbl>
## 1 k        45.8       4.25     10.8  7.64e-12
## 2 b         4.39      1.54      2.85 7.74e- 3

## # A tibble: 32 × 14
##    .rownames     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ℹ 22 more rows
## # ℹ 2 more variables: .fitted <dbl>, .resid <dbl>

## # A tibble: 1 × 9
##   sigma isConv        finTol logLik   AIC   BIC deviance df.residual  nobs
##   <dbl> <lgl>          <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1  2.77 TRUE   0.00000000681  -77.0  160.  164.     231.          30    32

The tidy function can also be applied to htest objects, such as those output by popular built-in functions like t.test, cor.test, and wilcox.test.
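As a sketch, the output below is consistent with a two-sample t-test of weight by transmission type; the formula wt ~ am and the name tt are assumptions inferred from the group means shown there.

```r
library(broom)

# Welch two-sample t-test comparing car weight between the two am groups.
tt <- t.test(wt ~ am, data = mtcars)

tt_tidy <- tidy(tt)
tt_tidy
```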
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic    p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>      <dbl>     <dbl>    <dbl>     <dbl>
## 1     1.36      3.77      2.41      5.49 0.00000627      29.2    0.853      1.86
## # ℹ 2 more variables: method <chr>, alternative <chr>

Some cases might have fewer columns (for example, no confidence interval):
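A sketch using a Wilcoxon rank sum test on the same comparison. The formula and the name wilcox_fit are assumptions; exact = FALSE is set explicitly here only to avoid the warning R emits when it falls back to the normal approximation because of ties.

```r
library(broom)

# Rank-based test; by default no confidence interval is computed.
wilcox_fit <- wilcox.test(wt ~ am, data = mtcars, exact = FALSE)

wilcox_tidy <- tidy(wilcox_fit)
wilcox_tidy
```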
## # A tibble: 1 × 4
##   statistic   p.value method                                         alternative
##       <dbl>     <dbl> <chr>                                          <chr>
## 1      230. 0.0000435 Wilcoxon rank sum test with continuity correc… two.sided

Since the tidy output is already only one row, glance returns the same output:
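A quick way to see this, sketched with the same assumed t-test as before:

```r
library(broom)

tt <- t.test(wt ~ am, data = mtcars)

# For htest objects, both calls describe the test in a single row.
tt_tidy   <- tidy(tt)
tt_glance <- glance(tt)

tt_glance
```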
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic    p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>      <dbl>     <dbl>    <dbl>     <dbl>
## 1     1.36      3.77      2.41      5.49 0.00000627      29.2    0.853      1.86
## # ℹ 2 more variables: method <chr>, alternative <chr>

## # A tibble: 1 × 4
##   statistic   p.value method                                         alternative
##       <dbl>     <dbl> <chr>                                          <chr>
## 1      230. 0.0000435 Wilcoxon rank sum test with continuity correc… two.sided

The augment method is defined only for chi-squared tests, since there is no meaningful sense, for other tests, in which a hypothesis test produces output about each initial data point.
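The output below has Sex and Class columns, which suggests a test on the built-in Titanic table. The sketch assumes exactly that; the cross-tabulation and the name chit are assumptions.

```r
library(broom)

# Collapse the 4-way Titanic table to a Sex x Class contingency table.
sex_class <- xtabs(Freq ~ Sex + Class, data = as.data.frame(Titanic))

chit <- chisq.test(sex_class)

chit_tidy <- tidy(chit)     # one row for the test
chit_aug  <- augment(chit)  # one row per cell of the table

chit_tidy
chit_aug
```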
## # A tibble: 1 × 4
##   statistic  p.value parameter method
##       <dbl>    <dbl>     <int> <chr>
## 1      350. 1.56e-75         3 Pearson's Chi-squared test

## # A tibble: 8 × 9
##   Sex    Class .observed  .prop .row.prop .col.prop .expected .resid .std.resid
##   <fct>  <fct>     <dbl>  <dbl>     <dbl>     <dbl>     <dbl>  <dbl>      <dbl>
## 1 Male   1st         180 0.0818    0.104     0.554      256.   -4.73     -11.1
## 2 Female 1st         145 0.0659    0.309     0.446       69.4   9.07      11.1
## 3 Male   2nd         179 0.0813    0.103     0.628      224.   -3.02      -6.99
## 4 Female 2nd         106 0.0482    0.226     0.372       60.9   5.79       6.99
## 5 Male   3rd         510 0.232     0.295     0.722      555.   -1.92      -5.04
## 6 Female 3rd         196 0.0891    0.417     0.278      151.    3.68       5.04
## 7 Male   Crew        862 0.392     0.498     0.974      696.    6.29      17.6
## 8 Female Crew         23 0.0104    0.0489    0.0260     189.  -12.1      -17.6

In order to maintain consistency, we attempt to follow some conventions regarding the structure of returned data.
- The output of the tidy, augment and glance functions is always a tibble.
- Some column names are kept consistent across models, so that you know what to expect (rather than asking “is it pval or PValue?” every time). The examples below are not all the possible column names, nor will all tidy output contain all or even any of these columns.
- Each row in a tidy output typically represents some well-defined concept, such as one term in a regression, one test, or one cluster/class. This meaning varies across models but is usually self-evident. The one thing each row cannot represent is a point in the initial data (for that, use the augment method).
- Common column names in tidy output include:
  - term: the term in a regression or model that is being estimated.
  - p.value: this spelling was chosen (over common alternatives such as pvalue, PValue, or pval) to be consistent with functions in R’s built-in stats package.
  - statistic: a test statistic, usually the one used to compute the p-value. Combining these across many sub-groups is a reliable way to perform (e.g.) bootstrap hypothesis testing.
  - estimate
  - conf.low: the low end of a confidence interval on the estimate.
  - conf.high: the high end of a confidence interval on the estimate.
  - df: degrees of freedom.
- augment(model, data) adds columns to the original data.
  - If the data argument is missing, augment attempts to reconstruct the data from the model (note that this may not always be possible, and usually won’t contain columns not used in the model).
- Each row in an augment output matches the corresponding row in the original data.
- If the original data contained row names, augment turns them into a column called .rownames.
- Newly added column names begin with . to avoid overwriting columns in the original data.
- Common column names in augment output include:
  - .fitted: the predicted values, on the same scale as the data.
  - .resid: residuals: the actual y values minus the fitted values.
  - .cluster: cluster assignments.
- glance always returns a one-row tibble.
  - The only exception is that glance(NULL) returns an empty tibble.
- Arguments that the user supplied to the modeling function are left out. For example, a glm glance output does not need to contain a field for family, since that is decided by the user calling glm rather than the modeling function itself.
- Common column names in glance output include:
  - r.squared: the fraction of variance explained by the model.
  - adj.r.squared: \(R^2\) adjusted based on the degrees of freedom.
  - sigma: the square root of the estimated variance of the residuals.
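These conventions are what make results easy to combine. The sketch below is only an illustration: the grouping by cyl, the helper names fits and per_group, and the use of base rbind are choices made here, not something the package prescribes.

```r
library(broom)

# Fit the same model within each cylinder group (an illustrative grouping).
fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))

# Tidy each fit and stack the results; identical column names and the absence
# of row names mean a plain rbind is enough.
per_group <- do.call(rbind, lapply(names(fits), function(g) {
  cbind(cyl = g, as.data.frame(tidy(fits[[g]])))
}))
per_group

# glance() gives one row per model, and an empty tibble for NULL.
n_glance_rows <- nrow(glance(fits[["4"]]))
n_null_rows   <- nrow(glance(NULL))
```

Because term is an ordinary column, the same term names can appear once per group without any clash.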