Introduction to broom

2025-12-03

broom: let’s tidy up a bit

The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy tibbles.

The concept of “tidy data”, as introduced by Hadley Wickham, offers a powerful framework for data manipulation and analysis. That paper makes a convincing statement of the problem this package tries to solve (emphasis mine):

While model inputs usually require tidy inputs, such attention to detail doesn’t carry over to model outputs. Outputs such as predictions and estimated coefficients aren’t always tidy. This makes it more difficult to combine results from multiple models. For example, in R, the default representation of model coefficients is not tidy because it does not have an explicit variable that records the variable name for each estimate, they are instead recorded as row names. In R, row names must be unique, so combining coefficients from many models (e.g., from bootstrap resamples, or subgroups) requires workarounds to avoid losing important information. This knocks you out of the flow of analysis and makes it harder to combine the results from multiple models. I’m not currently aware of any packages that resolve this problem.

broom is an attempt to bridge the gap from untidy outputs of predictions and estimations to the tidy data we want to work with. It centers around three S3 methods, each of which takes common objects produced by R statistical functions (lm, t.test, nls, etc.) and converts them into a tibble. broom is particularly designed to work with Hadley’s dplyr package (see the broom+dplyr vignette for more).

broom should be distinguished from packages like reshape2 and tidyr, which rearrange and reshape data frames into different forms. Those packages perform critical tasks in tidy data analysis but focus on manipulating data frames in one specific format into another. In contrast, broom is designed to take output that is not in a tabular data format (sometimes not anywhere close) and convert it to a tidy tibble.

Tidying model outputs is not an exact science, and it’s based on a judgment of the kinds of values a data scientist typically wants out of a tidy analysis (for instance, estimates, test statistics, and p-values). You may lose some of the information in the original object that you wanted, or keep more information than you need. If you think the tidy output for a model should be changed, or if you’re missing a tidying function for an S3 class that you’d like, I strongly encourage you to open an issue or a pull request.

Tidying functions

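This package provides three S3 methods that do three distinct kinds of tidying: tidy, which summarizes the components of a model (for example, one row per regression term); augment, which adds per-observation columns such as fitted values and residuals to the original data; and glance, which returns a one-row summary of the model as a whole.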

Note that some classes may have only one or two of these methods defined.

Consider as an illustrative example a linear fit on the built-in mtcars dataset.

lmfit <- lm(mpg ~ wt, mtcars)
lmfit
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt
##      37.285       -5.344
summary(lmfit)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

This summary output is useful enough if you just want to read it. However, converting it to tabular data that contains all the same information, so that you can combine it with other models or do further analysis, is not trivial. You have to do coef(summary(lmfit)) to get a matrix of coefficients, the terms are still stored in row names, and the column names are inconsistent with other packages (e.g. Pr(>|t|) compared to p.value).
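
To make that difficulty concrete, here is a minimal base-R sketch of the manual conversion. The object name manual and the choice of new column names are illustrative assumptions, not something any package prescribes.

```r
# Fit the same model as in the example above.
lmfit <- lm(mpg ~ wt, mtcars)

# coef(summary(...)) returns a numeric matrix with the terms as row names.
coefs <- coef(summary(lmfit))

# Convert by hand: move the row names into a column, then rename the columns.
manual <- as.data.frame(coefs)
manual <- cbind(term = rownames(manual), manual)
rownames(manual) <- NULL
names(manual) <- c("term", "estimate", "std.error", "statistic", "p.value")
manual
```

Each of these steps has to be repeated, and possibly adapted, for every model class whose summary is laid out differently.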

Instead, you can use the tidy function, from the broom package, on the fit:

library(broom)
tidy(lmfit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    37.3      1.88      19.9  8.24e-19
## 2 wt             -5.34     0.559     -9.56 1.29e-10

This gives you a tabular data representation. Note that the row names have been moved into a column called term, and the column names are simple and consistent (and can be accessed using $).
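
As a small sketch of what access with $ looks like in practice, the following pulls individual values out of the tidied fit. The variable name td and the 5% threshold are chosen only for illustration.

```r
library(broom)

lmfit <- lm(mpg ~ wt, mtcars)
td <- tidy(lmfit)

# Columns are ordinary vectors, so $ works as on any data frame.
td$estimate

# Look up the slope for wt by term name rather than by row name.
td$estimate[td$term == "wt"]

# Keep only the terms with a p-value below 0.05.
td[td$p.value < 0.05, ]
```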

Instead of viewing the coefficients, you might be interested in the fitted values and residuals for each of the original points in the regression. For this, use augment, which augments the original data with information from the model:

augment(lmfit)
## # A tibble: 32 × 9
##    .rownames           mpg    wt .fitted .resid   .hat .sigma .cooksd .std.resid
##    <chr>             <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
##  1 Mazda RX4          21    2.62    23.3 -2.28  0.0433   3.07 1.33e-2    -0.766
##  2 Mazda RX4 Wag      21    2.88    21.9 -0.920 0.0352   3.09 1.72e-3    -0.307
##  3 Datsun 710         22.8  2.32    24.9 -2.09  0.0584   3.07 1.54e-2    -0.706
##  4 Hornet 4 Drive     21.4  3.22    20.1  1.30  0.0313   3.09 3.02e-3     0.433
##  5 Hornet Sportabout  18.7  3.44    18.9 -0.200 0.0329   3.10 7.60e-5    -0.0668
##  6 Valiant            18.1  3.46    18.8 -0.693 0.0332   3.10 9.21e-4    -0.231
##  7 Duster 360         14.3  3.57    18.2 -3.91  0.0354   3.01 3.13e-2    -1.31
##  8 Merc 240D          24.4  3.19    20.2  4.16  0.0313   3.00 3.11e-2     1.39
##  9 Merc 230           22.8  3.15    20.5  2.35  0.0314   3.07 9.96e-3     0.784
## 10 Merc 280           19.2  3.44    18.9  0.300 0.0329   3.10 1.71e-4     0.100
## # ℹ 22 more rows

Note that each of the new columns begins with a . (to avoid overwriting any of the original columns).
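
The naming rule can be checked directly. This is a small sketch; the names aug and new_cols are illustrative.

```r
library(broom)

lmfit <- lm(mpg ~ wt, mtcars)
aug <- augment(lmfit)

# Columns that augment() added on top of the two variables used in the model.
new_cols <- setdiff(names(aug), c("mpg", "wt"))
new_cols

# Check that every added column name starts with a dot.
all(startsWith(new_cols, "."))
```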

Finally, several summary statistics are computed for the entire regression, such as R^2 and the F-statistic. These can be accessed with the glance function:

glance(lmfit)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.753         0.745  3.05      91.4 1.29e-10     1  -80.0  166.  170.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

This distinction between the tidy, augment and glance functions is explored in a different context in the k-means vignette.
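
One way to keep the three functions apart is to look at how many rows each returns for the same fit. The sketch below collects the counts in a named vector; the name shapes is illustrative.

```r
library(broom)

lmfit <- lm(mpg ~ wt, mtcars)

shapes <- c(
  tidy    = nrow(tidy(lmfit)),     # one row per model term
  augment = nrow(augment(lmfit)),  # one row per observation in the data
  glance  = nrow(glance(lmfit))    # one row for the whole model
)
shapes
```

For this model the counts correspond to the two coefficients, the 32 cars in mtcars, and the single fit.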

Other Examples

Generalized linear and non-linear models

These functions apply equally well to the output from glm:

glmfit <- glm(am ~ wt, mtcars, family = "binomial")
tidy(glmfit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    12.0       4.51      2.67 0.00759
## 2 wt             -4.02      1.44     -2.80 0.00509
augment(glmfit)
## # A tibble: 32 × 9
##    .rownames            am    wt .fitted .resid   .hat .sigma .cooksd .std.resid
##    <chr>             <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
##  1 Mazda RX4             1  2.62   1.50   0.635 0.126   0.803 0.0184       0.680
##  2 Mazda RX4 Wag         1  2.88   0.471  0.985 0.108   0.790 0.0424       1.04
##  3 Datsun 710            1  2.32   2.70   0.360 0.0963  0.810 0.00394      0.379
##  4 Hornet 4 Drive        0  3.22  -0.897 -0.827 0.0744  0.797 0.0177      -0.860
##  5 Hornet Sportabout     0  3.44  -1.80  -0.553 0.0681  0.806 0.00647     -0.572
##  6 Valiant               0  3.46  -1.88  -0.532 0.0674  0.807 0.00590     -0.551
##  7 Duster 360            0  3.57  -2.33  -0.432 0.0625  0.809 0.00348     -0.446
##  8 Merc 240D             0  3.19  -0.796 -0.863 0.0755  0.796 0.0199      -0.897
##  9 Merc 230              0  3.15  -0.635 -0.922 0.0776  0.793 0.0242      -0.960
## 10 Merc 280              0  3.44  -1.80  -0.553 0.0681  0.806 0.00647     -0.572
## # ℹ 22 more rows
glance(glmfit)
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          43.2      31  -9.59  23.2  26.1     19.2          30    32

Note that the statistics computed by glance are different for glm objects than for lm (e.g. deviance rather than R^2).
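
A quick way to see which statistics differ is to compare the column names of the two glance outputs. This is a sketch using base set operations; lm_cols and glm_cols are illustrative names.

```r
library(broom)

lmfit  <- lm(mpg ~ wt, mtcars)
glmfit <- glm(am ~ wt, mtcars, family = "binomial")

lm_cols  <- names(glance(lmfit))
glm_cols <- names(glance(glmfit))

setdiff(lm_cols, glm_cols)    # reported only for the lm fit
setdiff(glm_cols, lm_cols)    # reported only for the glm fit
intersect(lm_cols, glm_cols)  # shared by both
```

The shared columns are the ones that can be compared or stacked across the two model types without further work.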

These functions also work on other fits, such as nonlinear models (nls):

nlsfit <- nls(mpg ~ k / wt + b, mtcars, start = list(k = 1, b = 0))
tidy(nlsfit)
## # A tibble: 2 × 5
##   term  estimate std.error statistic  p.value
##   <chr>    <dbl>     <dbl>     <dbl>    <dbl>
## 1 k        45.8       4.25     10.8  7.64e-12
## 2 b         4.39      1.54      2.85 7.74e- 3
augment(nlsfit, mtcars)
## # A tibble: 32 × 14
##    .rownames     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ℹ 22 more rows
## # ℹ 2 more variables: .fitted <dbl>, .resid <dbl>
glance(nlsfit)
## # A tibble: 1 × 9
##   sigma isConv        finTol logLik   AIC   BIC deviance df.residual  nobs
##   <dbl> <lgl>          <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1  2.77 TRUE   0.00000000681  -77.0  160.  164.     231.          30    32

Hypothesis testing

The tidy function can also be applied to htest objects, such as those output by popular built-in functions like t.test, cor.test, and wilcox.test.

tt <- t.test(wt ~ am, mtcars)
tidy(tt)
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic    p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>      <dbl>     <dbl>    <dbl>     <dbl>
## 1     1.36      3.77      2.41      5.49 0.00000627      29.2    0.853      1.86
## # ℹ 2 more variables: method <chr>, alternative <chr>

Some cases might have fewer columns (for example, no confidence interval):

wt <- wilcox.test(wt ~ am, mtcars)
tidy(wt)
## # A tibble: 1 × 4
##   statistic   p.value method                                         alternative
##       <dbl>     <dbl> <chr>                                          <chr>
## 1      230. 0.0000435 Wilcoxon rank sum test with continuity correc… two.sided
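
To see exactly which columns drop out, the two tidied tests can be compared by name. This is a sketch: the object names are illustrative, and exact = FALSE is an assumption added here only to request the normal approximation explicitly and so avoid the warning that wilcox.test raises for tied values.

```r
library(broom)

tt_tidy <- tidy(t.test(wt ~ am, mtcars))

# exact = FALSE is an addition for this sketch (see the note above).
wt_tidy <- tidy(wilcox.test(wt ~ am, mtcars, exact = FALSE))

# Columns present for the t-test but absent for the Wilcoxon test.
setdiff(names(tt_tidy), names(wt_tidy))

# Columns the two outputs have in common.
intersect(names(tt_tidy), names(wt_tidy))
```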

Since the tidy output is already only one row, glance returns the same output:

glance(tt)
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic    p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>      <dbl>     <dbl>    <dbl>     <dbl>
## 1     1.36      3.77      2.41      5.49 0.00000627      29.2    0.853      1.86
## # ℹ 2 more variables: method <chr>, alternative <chr>
glance(wt)
## # A tibble: 1 × 4
##   statistic   p.value method                                         alternative
##       <dbl>     <dbl> <chr>                                          <chr>
## 1      230. 0.0000435 Wilcoxon rank sum test with continuity correc… two.sided

The augment method is defined only for chi-squared tests, since there is no meaningful sense, for other tests, in which a hypothesis test produces output about each initial data point.

chit <- chisq.test(xtabs(Freq ~ Sex + Class, data = as.data.frame(Titanic)))
tidy(chit)
## # A tibble: 1 × 4
##   statistic  p.value parameter method
##       <dbl>    <dbl>     <int> <chr>
## 1      350. 1.56e-75         3 Pearson's Chi-squared test
augment(chit)
## # A tibble: 8 × 9
##   Sex    Class .observed  .prop .row.prop .col.prop .expected .resid .std.resid
##   <fct>  <fct>     <dbl>  <dbl>     <dbl>     <dbl>     <dbl>  <dbl>      <dbl>
## 1 Male   1st         180 0.0818    0.104     0.554      256.   -4.73     -11.1
## 2 Female 1st         145 0.0659    0.309     0.446       69.4   9.07      11.1
## 3 Male   2nd         179 0.0813    0.103     0.628      224.   -3.02      -6.99
## 4 Female 2nd         106 0.0482    0.226     0.372       60.9   5.79       6.99
## 5 Male   3rd         510 0.232     0.295     0.722      555.   -1.92      -5.04
## 6 Female 3rd         196 0.0891    0.417     0.278      151.    3.68       5.04
## 7 Male   Crew        862 0.392     0.498     0.974      696.    6.29      17.6
## 8 Female Crew         23 0.0104    0.0489    0.0260     189.  -12.1      -17.6

Conventions

In order to maintain consistency, we attempt to follow some conventions regarding the structure of returned data.

All functions

  • The output of the tidy, augment and glance functions is always a tibble.
  • The output never has rownames. This ensures that you can combine it with other tidy outputs without fear of losing information (since rownames in R cannot contain duplicates).
  • Some column names are kept consistent, so that they can be combined across different models and so that you know what to expect (in contrast to asking “is it pval or PValue?” every time). The examples below are not all the possible column names, nor will all tidy output contain all or even any of these columns.
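
The following sketch shows why the absence of rownames and the shared column names matter when combining results. It fits the same regression within each cylinder group of mtcars and stacks the tidied outputs. The grouping variable and the names by_cyl, tidied, and combined are chosen for illustration, and base R is used for the stacking to keep the example self-contained.

```r
library(broom)

# Split the data into subgroups by number of cylinders.
by_cyl <- split(mtcars, mtcars$cyl)

# Tidy one fit per subgroup, recording which subgroup each row came from.
tidied <- lapply(names(by_cyl), function(g) {
  out <- tidy(lm(mpg ~ wt, data = by_cyl[[g]]))
  out$cyl <- g
  out
})

# The terms live in a column rather than in row names, so repeated
# "(Intercept)" and "wt" rows stack without any collision.
combined <- do.call(rbind, tidied)
combined
```

The same stacking would fail or silently rename rows if the terms were still stored as row names, which is the problem described in the quotation at the top of this vignette.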

tidy functions

  • Each row in a tidy output typically represents some well-defined concept, such as one term in a regression, one test, or one cluster/class. This meaning varies across models but is usually self-evident. The one thing each row cannot represent is a point in the initial data (for that, use the augment method).
  • Common column names include:
    • term: the term in a regression or model that is being estimated.
    • p.value: this spelling was chosen (over common alternatives such as pvalue, PValue, or pval) to be consistent with functions in R’s built-in stats package.
    • statistic: a test statistic, usually the one used to compute the p-value. Combining these across many sub-groups is a reliable way to perform (e.g.) bootstrap hypothesis testing.
    • estimate
    • conf.low: the low end of a confidence interval on the estimate.
    • conf.high: the high end of a confidence interval on the estimate.
    • df: degrees of freedom.
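
For linear models, the confidence interval columns are not included by default, as the earlier tidy(lmfit) output shows. A minimal sketch of requesting them through the conf.int and conf.level arguments of the lm method follows; the name td_ci is illustrative.

```r
library(broom)

lmfit <- lm(mpg ~ wt, mtcars)

# Ask for a 95% confidence interval on each estimate.
td_ci <- tidy(lmfit, conf.int = TRUE, conf.level = 0.95)

# The interval appears in the conventionally named columns.
td_ci[, c("term", "estimate", "conf.low", "conf.high")]
```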

augment functions

  • augment(model, data) adds columns to the original data.
    • If the data argument is missing, augment attempts to reconstruct the data from the model (note that this may not always be possible, and usually won’t contain columns not used in the model).
  • Each row in an augment output matches the corresponding row in the original data.
  • If the original data contained rownames, augment turns them into a column called .rownames.
  • Newly added column names begin with . to avoid overwriting columns in the original data.
  • Common column names include:
    • .fitted: the predicted values, on the same scale as the data.
    • .resid: the residuals, i.e. the actual y values minus the fitted values.
    • .cluster: cluster assignments.
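
The effect of the data argument, and the definition of .resid for a linear model, can be sketched as follows. The names aug_model and aug_full are illustrative.

```r
library(broom)

lmfit <- lm(mpg ~ wt, mtcars)

aug_model <- augment(lmfit)                 # data reconstructed from the model
aug_full  <- augment(lmfit, data = mtcars)  # data supplied explicitly

# cyl was not used in the model, so it only appears when data is supplied.
"cyl" %in% names(aug_model)
"cyl" %in% names(aug_full)

# For this lm fit, .resid is the observed response minus .fitted.
all.equal(aug_full$.resid, aug_full$mpg - aug_full$.fitted,
          check.attributes = FALSE)
```

Supplying data explicitly is the safer choice when later steps need columns that the model formula did not reference.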

glance functions

  • glance always returns a one-row tibble.
    • The only exception is that glance(NULL) returns an empty tibble.
  • We avoid including arguments that were given to the modeling function. For example, a glm glance output does not need to contain a field for family, since that is decided by the user calling glm rather than the modeling function itself.
  • Common column names include:
    • r.squared: the fraction of variance explained by the model.
    • adj.r.squared: R^2 adjusted based on the degrees of freedom.
    • sigma: the square root of the estimated variance of the residuals.
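
As a sanity check on these column meanings, the glance output for the linear fit can be compared with the corresponding components of summary(lmfit). This is a sketch; gl and s are illustrative names.

```r
library(broom)

lmfit <- lm(mpg ~ wt, mtcars)
gl <- glance(lmfit)
s  <- summary(lmfit)

# A single row describing the whole model.
nrow(gl)

# The columns should agree with the values base R reports in summary().
all.equal(gl$r.squared, s$r.squared)
all.equal(gl$adj.r.squared, s$adj.r.squared)
all.equal(gl$sigma, s$sigma)
```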
