library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())

This vignette illustrates the effects on posterior inference of pooling data (a.k.a. sharing strength) across units for repeated binary trial data. It provides R code to fit and check predictive models for three situations: (a) complete pooling, which assumes each unit is the same, (b) no pooling, which assumes the units are unrelated, and (c) partial pooling, where the similarity among the units is estimated. The note explains with working examples how to (i) fit the models using rstanarm and plot the results, (ii) estimate event probabilities, (iii) evaluate posterior predictive densities to evaluate model predictions on held-out data, (iv) rank units by chance of success, (v) perform multiple comparisons in several settings, (vi) replicate new data for posterior \(p\)-values, and (vii) perform graphical posterior predictive checks.
The content of the vignette is based on Bob Carpenter's Stan tutorial Hierarchical Partial Pooling for Repeated Binary Trials, but here we show how to fit the models and carry out predictions and model checking and comparison using rstanarm. Most of the text is taken from the original, with some additions and subtractions to make the content more useful for rstanarm users. The Stan code from the original tutorial has also been entirely removed, as rstanarm will fit all of the models in Stan without the user having to write the underlying Stan programs. The Stan code in the original document is a good reference for anyone interested in how these models are estimated "under the hood", though the parameterizations used internally by rstanarm differ somewhat from those in the original.
Suppose that for each of \(N\) units \(n \in 1{:}N\), we observe \(y_n\) successes out of \(K_n\) trials. For example, the data may consist of
rat tumor development, with \(y_n\) rats developing tumors of \(K_n\) total rats in experimental control group \(n \in 1{:}N\) (Tarone 1982)
surgical mortality, with \(y_n\) surgical patients dying in \(K_n\) surgeries for hospitals \(n \in 1{:}N\) (Spiegelhalter et al. 1996)
baseball batting ability, with \(y_n\) hits in \(K_n\) at bats for baseball players \(n \in 1{:}N\) (Efron and Morris 1975; Carpenter 2009)
machine learning system accuracy, with \(y_n\) correct classifications out of \(K_n\) examples for systems \(n \in 1{:}N\) (ML conference proceedings; Kaggle competitions)
In this vignette we use the small baseball data set of Efron and Morris (1975), but we also provide the rat control data of Tarone (1982), the surgical mortality data of Spiegelhalter et al. (1996), and the extended baseball data set of Carpenter (2009).
As a running example, we will use the data from Table 1 of Efron and Morris (1975), which is included in rstanarm under the name bball1970 (it was downloaded 24 Dec 2015 from here). It is drawn from the 1970 Major League Baseball season (from both leagues).
library(rstanarm)
data(bball1970)
bball <- bball1970
print(bball)

       Player AB Hits RemainingAB RemainingHits
1    Clemente 45   18         367           127
2    Robinson 45   17         426           127
3      Howard 45   16         521           144
4   Johnstone 45   15         275            61
5       Berry 45   14         418           114
6     Spencer 45   14         466           126
7   Kessinger 45   13         586           155
8    Alvarado 45   12         138            29
9       Santo 45   11         510           137
10    Swaboda 45   11         200            46
11 Petrocelli 45   10         538           142
12  Rodriguez 45   10         186            42
13      Scott 45   10         435           132
14      Unser 45   10         277            73
15   Williams 45   10         591           195
16 Campaneris 45    9         558           159
17     Munson 45    8         408           129
18      Alvis 45    7          70            14

# A few quantities we'll use throughout
N <- nrow(bball)
K <- bball$AB
y <- bball$Hits
K_new <- bball$RemainingAB
y_new <- bball$RemainingHits

The data separates the outcome from the initial 45 at-bats from the rest of the season. After running this code, N is the number of units (players). Then for each unit n, K[n] is the number of initial trials (at-bats), y[n] is the number of initial successes (hits), K_new[n] is the remaining number of trials (remaining at-bats), and y_new[n] is the number of successes in the remaining trials (remaining hits).
The remaining data can be used to evaluate the predictive performance of our models conditioned on the observed data. That is, we will "train" on the first 45 at-bats and see how well our various models do at predicting the rest of the season.
With complete pooling, each unit is assumed to have the same chance of success. With no pooling, each unit is assumed to have a completely unrelated chance of success. With partial pooling, each unit is assumed to have a different chance of success, but the data for all of the observed units informs the estimates for each unit.
Partial pooling is typically accomplished through hierarchical models. Hierarchical models directly model the population of units. From a population model perspective, no pooling corresponds to infinite population variance, whereas complete pooling corresponds to zero population variance.
In the following sections, all three types of pooling models will be fit for the baseball data.
First we'll create some useful objects to use throughout the rest of this vignette. One of them is a function, batting_avg, which just formats a number to include three decimal places to the right of zero when printing, as is customary for batting averages.
batting_avg <- function(x) print(format(round(x, digits = 3), nsmall = 3), quote = FALSE)
player_avgs <- y / K  # player avgs through 45 AB
tot_avg <- sum(y) / sum(K)  # overall avg through 45 AB
cat("Player averages through 45 at-bats:\n")
batting_avg(player_avgs)
cat("Overall average through 45 at-bats:\n")
batting_avg(tot_avg)

Player averages through 45 at-bats:
 [1] 0.400 0.378 0.356 0.333 0.311 0.311 0.289 0.267 0.244 0.244 0.222 0.222
[13] 0.222 0.222 0.222 0.200 0.178 0.156
Overall average through 45 at-bats:
[1] 0.265

The complete pooling model assumes a single parameter \(\theta\) representing the chance of success for all units (in this case players).
Assuming each player's at-bats are independent Bernoulli trials, the probability distribution for each player's number of hits \(y_n\) is modeled as
\[p(y_n \, | \, \theta) \ = \ \mathsf{Binomial}(y_n \, | \, K_n, \theta).\]
When viewed as a function of \(\theta\) for fixed \(y_n\), this is called the likelihood function.
Assuming each player is independent leads to the complete data likelihood
\[p(y \, | \, \theta) = \prod_{n=1}^N \mathsf{Binomial}(y_n \, | \, K_n,\theta).\]
Using family = binomial("logit"), the stan_glm function in rstanarm will parameterize the model in terms of the log-odds \(\alpha\), which are defined by the logit transform as
\[\alpha= \mathrm{logit}(\theta)= \log \, \frac{\theta}{1 - \theta}.\]
For example, \(\theta = 0.25\) corresponds to odds of \(0.25\) to \(0.75\) (equivalently, \(1\) to \(3\)), or log-odds of \(\log(0.25 / 0.75) \approx -1.1\).
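As a quick numeric check (a sketch, not part of the original code), the built-in R functions qlogis and plogis implement the logit and inverse-logit transforms:

qlogis(0.25)           # log(0.25 / 0.75), approximately -1.1
plogis(qlogis(0.25))   # the inverse logit recovers 0.25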
The model is therefore
\[p(y_n \, | \, K_n, \alpha)\ = \ \mathsf{Binomial}(y_n \, | \, K_n, \ \mathrm{logit}^{-1}(\alpha))\]
The inverse logit function is the logistic sigmoid, from which logistic regression gets its name; the inverse logit function is also the standard logistic cumulative distribution function (CDF),
\[\mathrm{logit}^{-1}(\alpha) = \frac{1}{1 + \exp(-\alpha)} = \theta.\]
By construction, for any \(\alpha \in (-\infty, \infty)\), \(\mathrm{logit}^{-1}(\alpha) \in (0, 1)\); the sigmoid converts arbitrary log-odds back to the probability scale.
We will use a normal distribution with mean \(-1\) and standard deviation \(1\) as the prior on the log-odds \(\alpha\). This is a weakly informative prior that places about 95% of the prior probability in the interval \((-3, 1)\), which inverse-logit transforms to the interval \((0.05, 0.73)\). The prior median \(-1\) corresponds to a \(0.27\) chance of success. In fact, an even narrower prior is actually motivated here from substantial baseball knowledge.
The figure below shows both this prior on \(\alpha\) as well as the prior it implies on the probability \(\theta\).
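A minimal sketch of how such a figure could be produced (this is not the original plotting code; the variable names are ours) is to simulate from the prior on the log-odds and push the draws through the inverse logit:

alpha_prior_draws <- rnorm(1e5, mean = -1, sd = 1)   # Normal(-1, 1) prior on log-odds
theta_prior_draws <- plogis(alpha_prior_draws)       # implied prior on the probability scale
par(mfrow = c(1, 2))
hist(alpha_prior_draws, breaks = 50, main = "Prior on log-odds", xlab = "alpha")
hist(theta_prior_draws, breaks = 50, main = "Implied prior on theta", xlab = "theta")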
To fit the model we call stan_glm with the formula cbind(Hits, AB - Hits) ~ 1. The left-hand side of the formula specifies the binomial outcome by providing the number of successes (hits) and failures (at-bats minus hits) for each player, and the right-hand side indicates that we want an intercept-only model.
SEED <- 101
wi_prior <- normal(-1, 1)  # weakly informative prior on log-odds
fit_pool <- stan_glm(cbind(Hits, AB - Hits) ~ 1, data = bball,
                     family = binomial("logit"),
                     prior_intercept = wi_prior, seed = SEED)

The summary function will compute all sorts of summary statistics from the fitted model, but here we'll create a small function that will compute just a few posterior summary statistics that we'll want for each of the models we estimate. The summary_stats function, defined below, will take a matrix of posterior draws as its input, apply an inverse-logit transformation (to convert from log-odds to probabilities), and then compute the median and 80% interval.
invlogit <- plogis  # function(x) 1/(1 + exp(-x))
summary_stats <- function(posterior) {
  x <- invlogit(posterior)  # log-odds -> probabilities
  t(apply(x, 2, quantile, probs = c(0.1, 0.5, 0.9)))
}
pool <- summary_stats(as.matrix(fit_pool))  # as.matrix extracts the posterior draws
pool <- matrix(pool,  # replicate to give each player the same estimates
               nrow(bball), ncol(pool), byrow = TRUE,
               dimnames = list(bball$Player, c("10%", "50%", "90%")))
batting_avg(pool)

           10%   50%   90%
Clemente   0.245 0.265 0.285
Robinson   0.245 0.265 0.285
Howard     0.245 0.265 0.285
Johnstone  0.245 0.265 0.285
Berry      0.245 0.265 0.285
Spencer    0.245 0.265 0.285
Kessinger  0.245 0.265 0.285
Alvarado   0.245 0.265 0.285
Santo      0.245 0.265 0.285
Swaboda    0.245 0.265 0.285
Petrocelli 0.245 0.265 0.285
Rodriguez  0.245 0.265 0.285
Scott      0.245 0.265 0.285
Unser      0.245 0.265 0.285
Williams   0.245 0.265 0.285
Campaneris 0.245 0.265 0.285
Munson     0.245 0.265 0.285
Alvis      0.245 0.265 0.285

With more data, such as from more players or from the rest of the season, the posterior approaches a delta function around the maximum likelihood estimate and the central posterior intervals will shrink. Nevertheless, even if we know a player's chance of success exactly, there is a large amount of uncertainty in running \(K\) binary trials with that chance of success; using a binomial model fundamentally bounds our prediction accuracy.
Although this model will be a good baseline for comparison, we have good reason to believe from a large amount of prior data (players with as many as 10,000 trials) that it is very unlikely that all baseball players have the same chance of success.
A model with no pooling involves a separate chance-of-success parameter \(\theta_n \in [0, 1]\) for each player \(n\), where the \(\theta_n\) are assumed to be independent.
rstanarm will again parameterize the model in terms of the log-odds, \(\alpha_n = \mathrm{logit}(\theta_n)\), so the likelihood then uses the log-odds of success \(\alpha_n\) for unit \(n\) in modeling the number of successes \(y_n\) as
\[p(y_n \, | \, \alpha_n) =\mathsf{Binomial}(y_n \, | \, K_n, \mathrm{logit}^{-1}(\alpha_n)).\]
Assuming the \(y_n\) are independent (conditional on \(\theta\)), this leads to the total data likelihood
\[p(y \, | \, \alpha) = \prod_{n=1}^N \mathsf{Binomial}(y_n \, | \, K_n,\mathrm{logit}^{-1}(\alpha_n)).\]
To fit the model we need only tweak the model formula used for the complete pooling model to drop the intercept and instead include as the only predictor the factor variable Player. This is equivalent to estimating a separate intercept on the log-odds scale for each player. We'll also use the prior (rather than prior_intercept) argument since Player is considered a predictor rather than an intercept from R's perspective. Using the same weakly informative prior now means that each \(\alpha_n\) gets a \(\mathsf{Normal}(-1, 1)\) prior, independent of the others.
fit_nopool <- update(fit_pool, formula = . ~ 0 + Player, prior = wi_prior)
nopool <- summary_stats(as.matrix(fit_nopool))
rownames(nopool) <- as.character(bball$Player)
batting_avg(nopool)

           10%   50%   90%
Clemente   0.300 0.386 0.473
Robinson   0.279 0.366 0.458
Howard     0.263 0.344 0.435
Johnstone  0.244 0.326 0.414
Berry      0.227 0.305 0.393
Spencer    0.226 0.306 0.389
Kessinger  0.209 0.284 0.370
Alvarado   0.190 0.266 0.352
Santo      0.172 0.244 0.330
Swaboda    0.172 0.243 0.328
Petrocelli 0.154 0.223 0.305
Rodriguez  0.157 0.226 0.305
Scott      0.156 0.224 0.306
Unser      0.156 0.225 0.303
Williams   0.156 0.225 0.305
Campaneris 0.138 0.204 0.282
Munson     0.124 0.185 0.258
Alvis      0.108 0.166 0.241

Each 80% interval is much wider than the estimated interval for the population in the complete pooling model; this is to be expected: there are only 45 data units for each parameter here as opposed to 810 in the complete pooling case. If the units each had different numbers of trials, the intervals would also vary based on size.
As the estimated chance of success goes up toward 0.5, the 80% intervals get wider. This is to be expected for chance-of-success parameters, because the variance is maximized when \(\theta = 0.5\).
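To see this numerically (a small sketch, not from the original text), the standard error of a binomial proportion based on 45 trials, \(\sqrt{\theta (1 - \theta) / 45}\), grows as \(\theta\) approaches 0.5:

theta <- c(0.2, 0.3, 0.4, 0.5)
round(sqrt(theta * (1 - theta) / 45), 3)   # 0.060 0.068 0.073 0.075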
Based on our existing knowledge of baseball, the no-pooling model is almost certainly overestimating the high abilities and underestimating the lower abilities (Ted Williams, 30 years prior to the year this data was collected, was the last player with a 40% observed success rate over a season, whereas 20% or less is too low for all but a few rare defensive specialists).
Complete pooling provides estimated abilities that are too narrowly distributed for the units and removes any chance of modeling population variation. Estimating each chance of success separately without any pooling provides estimated abilities that are too broadly distributed for the units and hence too variable. Clearly some amount of pooling between these two extremes is called for. But how much?
A hierarchical model treats the players as belonging to a population of players. The properties of this population will be estimated along with player abilities, implicitly controlling the amount of pooling that is applied. The more variable the (estimate of the) population, the less pooling is applied. Mathematically, the hierarchical model places a prior on the abilities with parameters that are themselves estimated.
This model can be estimated using the stan_glmer function.

fit_partialpool <- stan_glmer(cbind(Hits, AB - Hits) ~ (1 | Player), data = bball,
                              family = binomial("logit"),
                              prior_intercept = wi_prior, seed = SEED)

Because stan_glmer (like glmer) estimates the varying intercepts for Player by estimating a single global intercept \(\alpha_0\) and individual deviations from that intercept for each player \(\delta_n = \alpha_n - \alpha_0\), to get the posterior distribution for each \(\alpha_n\) we need to shift each of the posterior draws by the corresponding draw for the intercept. We can do this easily using the sweep function.
# shift each player's estimate by intercept (and then drop intercept)
shift_draws <- function(draws) {
  sweep(draws[, -1], MARGIN = 1, STATS = draws[, 1], FUN = "+")
}
alphas <- shift_draws(as.matrix(fit_partialpool))
partialpool <- summary_stats(alphas)
partialpool <- partialpool[-nrow(partialpool), ]
rownames(partialpool) <- as.character(bball$Player)
batting_avg(partialpool)

           10%   50%   90%
Clemente   0.249 0.283 0.349
Robinson   0.245 0.281 0.341
Howard     0.243 0.277 0.331
Johnstone  0.240 0.274 0.323
Berry      0.238 0.271 0.317
Spencer    0.235 0.271 0.316
Kessinger  0.233 0.268 0.310
Alvarado   0.229 0.265 0.303
Santo      0.222 0.262 0.299
Swaboda    0.223 0.262 0.298
Petrocelli 0.216 0.258 0.294
Rodriguez  0.218 0.259 0.293
Scott      0.217 0.259 0.294
Unser      0.217 0.258 0.293
Williams   0.215 0.258 0.296
Campaneris 0.211 0.255 0.290
Munson     0.205 0.253 0.287
Alvis      0.198 0.250 0.285

Here the estimates are less extreme than in the no-pooling case, which we should expect due to the partial pooling. It is also clear from the wide posteriors for the \(\theta_n\) that there is considerable uncertainty in the estimates of chance of success on a unit-by-unit (player-by-player) basis.
Figure 5.4 of Gelman et al. (2013) plots the observed number of successes \(y_n\) for the first \(K_n\) trials versus the median and 80% intervals for the estimated chance-of-success parameters \(\theta_n\) in the posterior. The following R code reproduces a similar plot for our data.
library(ggplot2)
models <- c("complete pooling", "no pooling", "partial pooling")
estimates <- rbind(pool, nopool, partialpool)
colnames(estimates) <- c("lb", "median", "ub")
plotdata <- data.frame(estimates,
                       observed = rep(player_avgs, times = length(models)),
                       model = rep(models, each = N),
                       row.names = NULL)
ggplot(plotdata, aes(x = observed, y = median, ymin = lb, ymax = ub)) +
  geom_hline(yintercept = tot_avg, color = "lightpink", size = 0.75) +
  geom_abline(intercept = 0, slope = 1, color = "skyblue") +
  geom_linerange(color = "gray60", size = 0.75) +
  geom_point(size = 2.5, shape = 21, fill = "gray30", color = "white", stroke = 0.2) +
  facet_grid(. ~ model) +
  coord_fixed() +
  scale_x_continuous(breaks = c(0.2, 0.3, 0.4)) +
  labs(x = "Observed Hits / AB", y = "Predicted chance of hit") +
  ggtitle("Posterior Medians and 80% Intervals")

The horizontal axis is the observed rate of success, broken out by player (the overplotting is from players with the same number of successes; they all had the same number of trials in this data). The dots are the posterior medians with bars extending to cover the central 80% posterior interval. Players with the same observed rates are indistinguishable; any differences in estimates are due to MCMC error.
The horizontal red line has an intercept equal to the overall success rate. The overall success rate is also the posterior mode (i.e., maximum likelihood estimate) for the complete pooling model. The diagonal blue line has intercept 0 and slope 1. Estimates falling on this line make up the maximum likelihood estimates for the no-pooling model. Overall, the plot makes the amount of pooling toward the prior evident.
After we have fit a model using some "training" data, we are usually interested in the predictions of the fitted model for new data, which we can use to
make predictions for new data points; e.g., predict how many hits Roberto Clemente will get in the rest of the season,
evaluate predictions against observed future data; e.g., assess how well we predicted the number of hits Roberto Clemente actually got in the rest of the season, and
generate new simulated data to validate our model fits.
With full Bayesian inference, we do not make a point estimate of the parameters and use it for prediction; instead we use an average of predictions weighted by the posterior.
Given data \(y\) and a model with parameters \(\theta\), the posterior predictive distribution for new data \(\tilde{y}\) is defined by
\[p(\tilde{y} \, | \, y) \ = \ \int_{\Theta} p(\tilde{y} \, | \, \theta) \ p(\theta \, | \, y) \ \mathrm{d}\theta,\]
where \(\Theta\) is the support of the parameters \(\theta\). What an integral of this form says is that \(p(\tilde{y} \, | \, y)\) is defined as a weighted average over the legal parameter values \(\theta \in \Theta\) of the likelihood function \(p(\tilde{y} \, | \, \theta)\), with weights given by the posterior, \(p(\theta \, | \, y)\). While we do not want to get sidetracked with the notational and mathematical subtleties of expectations here, the posterior predictive density reduces to the expectation of \(p(\tilde{y} \, | \, \theta)\) conditioned on \(y\).
Because the posterior predictive density is formulated as an expectation over the posterior, it is possible to compute via MCMC. With \(M\) draws \(\theta^{(m)}\) from the posterior \(p(\theta \, | \, y)\), the posterior predictive log density for new data \(y^{\mathrm{new}}\) is given by the MCMC approximation
\[\log \frac{1}{M} \, \sum_{m=1}^M \ p\left( y^{\mathrm{new}} \, | \,\theta^{(m)} \right).\]
In practice, this requires care to prevent underflow in floating point calculations; a robust calculation on the log scale is provided below.
It is also straightforward to use forward simulation from the probability distribution of the data \(p(y \, | \, \theta)\) to generate replicated data \(y^{\mathrm{rep}}\) according to the posterior predictive distribution. (Recall that \(p(y \, | \, \theta)\) is called the probability distribution when \(\theta\) is fixed and the likelihood when \(y\) is fixed.)
With \(M\) draws \(\theta^{(m)}\) from the posterior \(p(\theta \, | \, y)\), replicated data can be simulated as a sequence of \(M\) simulations \(y^{\mathrm{rep} \ (m)}\), each drawn according to the distribution \(p(y \, | \, \theta^{(m)})\). This latter random variate generation can usually be done efficiently (both computationally and statistically) by means of forward simulation from the probability distribution of the data; we provide an example below.
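For the binomial models in this vignette, forward simulation amounts to drawing binomial outcomes for each posterior draw of the chance of success. Here is a minimal sketch (the helper name simulate_yrep and the argument theta_draws are ours; theta_draws is assumed to be an \(M \times N\) matrix of posterior draws of \(\theta\), such as the thetas_partialpool matrix constructed later in this vignette). This is essentially what rstanarm's posterior_predict function does for us.

# theta_draws: M x N matrix of posterior draws of the chance of success (assumed)
# K: vector of N trial counts
simulate_yrep <- function(theta_draws, K) {
  M <- nrow(theta_draws)
  N <- ncol(theta_draws)
  yrep <- matrix(NA_integer_, nrow = M, ncol = N)
  for (m in 1:M) {
    yrep[m, ] <- rbinom(N, size = K, prob = theta_draws[m, ])  # one replicated data set per draw
  }
  yrep
}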
Efron and Morris's (1975) baseball data includes not only the observed hit rate in the initial 45 at-bats, but also the data for how the player did for the rest of the season. The question arises as to how well these models predict a player's performance for the rest of the season based on their initial 45 at-bats.
A well calibrated statistical model is one in which the uncertainty in the predictions matches the uncertainty in further data. That is, if we estimate posterior 50% intervals for predictions on new data (here, the number of hits in the rest of the season for each player), roughly 50% of the new data should fall in its predicted 50% interval. If the model is true in the sense of correctly describing the generative process of the data, then Bayesian inference is guaranteed to be well calibrated. Given that our models are rarely correct in this deep sense, in practice we are concerned with testing their calibration on quantities of interest.
Given two well calibrated models, the one that makes the more precise predictions in the sense of having narrower intervals is better predictively (Gneiting et al. 2007). To see this in an example, we would rather have a well-calibrated prediction that there's a 90% chance the number of hits for a player in the rest of the season will fall in \((120, 130)\) than a 90% prediction that the number of hits will fall in \((100, 150)\).
For the models introduced here, a posterior that is a delta function provides the sharpest predictions. Even so, there is residual uncertainty due to the repeated trials; with \(K^{\mathrm{new}}_n\) further trials and a fixed chance of success \(\theta_n\), the random variable \(Y^{\mathrm{new}}_n\) denoting the number of further successes for unit \(n\) has a standard deviation from the repeated binary trials of
\[\mathrm{sd}[Y^{\mathrm{new}}_n] \ = \ \sqrt{K^{\mathrm{new}}_n \, \theta_n \, (1 - \theta_n)}.\]
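For instance (a small sketch, not from the original text, using Clemente's 367 remaining at-bats and, purely for illustration, a fixed chance of success of 0.346):

sqrt(K_new[1] * 0.346 * (1 - 0.346))   # roughly 9.1 hits of standard deviation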
The predictive posterior density directly measures the probability of seeing the new data. The higher the probability assigned to the new data, the better job the model has done at predicting the outcome. In the limit, an ideal model would perfectly predict the new outcome with no uncertainty (probability of 1 for a discrete outcome or a delta function at the true value for the density of a continuous outcome). This notion is related to the notion of sharpness discussed in the previous section, because if the new observations have higher predictive densities, they're probably within narrower posterior intervals (Gneiting et al. 2007).
The log of the posterior predictive density is defined in the obvious way as
\[\log p(\tilde{y} \, | \, y)= \log \int_{\Theta} p(\tilde{y} \, | \, \theta) \ p(\theta \, | \, y) \ \mathrm{d}\theta.\]
This is not a posterior expectation, but rather the log of a posterior expectation. In particular, it should not be confused with the posterior expectation of the log predictive density, which is given by
\[\int_{\Theta} \left( \log p(\tilde{y} \, | \, \theta) \right) \ p(\theta \, | \, y) \ \mathrm{d}\theta.\]
Although this is easy to compute in Stan in a stable fashion, it does not produce the same answer (as we show below).
Because \(-\log(u)\) is convex, a little wrangling with Jensen's inequality shows that the expectation of the log is less than or equal to the log of the expectation,
\[\int_{\Theta} \left( \, \log p(\tilde{y} \, | \, \theta) \, \right) \ p(\theta \, | \, y) \ \mathrm{d}\theta \ \leq \ \log \int_{\Theta} p(\tilde{y} \, | \, \theta) \ p(\theta \, | \, y) \ \mathrm{d}\theta.\]
We'll compute both expectations and demonstrate Jensen's inequality in our running example.
The variables K_new[n] and y_new[n] hold the number of at-bats (trials) and the number of hits (successes) for player (unit) n. With the held-out data we can compute the log density of each data point using the log_lik function, which, like posterior_predict, accepts a newdata argument. The log_lik function will return an \(M \times N\) matrix, where \(M\) is the size of the posterior sample (the number of draws we obtained from the posterior distribution) and \(N\) is the number of data points in newdata. We can then take the row sums of this matrix to sum over the data points.
newdata <- data.frame(Hits = y_new, AB = K_new, Player = bball$Player)
fits <- list(Pooling = fit_pool, NoPooling = fit_nopool, PartialPooling = fit_partialpool)

# compute log_p_new matrix with each of the models in 'fits'
log_p_new_mats <- lapply(fits, log_lik, newdata = newdata)

# for each matrix in the list take the row sums
log_p_new <- sapply(log_p_new_mats, rowSums)
M <- nrow(log_p_new)
head(log_p_new)

       Pooling NoPooling PartialPooling
[1,] -87.50510 -270.1454      -98.24389
[2,] -73.97575 -310.3988     -116.16486
[3,] -80.95743 -250.4451      -93.64268
[4,] -77.54662 -281.8998     -101.46736
[5,] -74.39093 -172.3741     -115.82690
[6,] -88.18157 -171.3092      -74.83273

We now have the distributions of log_p_new in a matrix with a column for each model.
For each model, the posterior mean for log_p_new will give us
\[\int_{\Theta} \left( \log p(\tilde{y} \, | \, \theta) \right) \ p(\theta \, | \, y) \ \mathrm{d}\theta \ \approx \ \frac{1}{M} \, \sum_{m=1}^M \log p(y^{\mathrm{new}} \, | \, \theta^{(m)}).\]
To compute this for each of the models we only need to take the mean of the corresponding column of log_p_new.

mean_log_p_new <- colMeans(log_p_new)
round(sort(mean_log_p_new, decreasing = TRUE), digits = 1)

       Pooling PartialPooling      NoPooling 
         -81.8          -99.6         -207.8 

From a predictive standpoint, the models are ranked by the amount of pooling they do, with complete pooling being the best, and no pooling being the worst predictively. All of these models do predictions by averaging over their posteriors, with the amount of posterior uncertainty also being ranked in reverse order of the amount of pooling they do.
As we will now see, the ranking of the models can change when we instead compute the log of the posterior predictive density.
The straight path to calculate this would be to define a generated quantity \(p(y^{\mathrm{new}} \, | \, y)\), look at the posterior mean computed by Stan, and take its log. That is,
\[\log p(y^{\mathrm{new}} \, | \, y) \ \approx \ \log \frac{1}{M} \, \sum_{m=1}^M p(y^{\mathrm{new}} \, | \, \theta^{(m)}).\]
Unfortunately, this won't work in most cases because when we try to compute \(p(y^{\mathrm{new}} \, | \, \theta^{(m)})\) directly, it is prone to underflow. For example, 2000 outcomes \(y^{\mathrm{new}}_n\), each with likelihood 0.5 for \(\theta^{(m)}\), will underflow, because \(0.5^{2000}\) is smaller than the smallest positive number that a computer can represent using standard double-precision floating point (used by Stan, R, etc.).
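A two-line illustration of the problem (a sketch, not from the original code):

0.5^2000          # underflows to 0 in double precision
2000 * log(0.5)   # about -1386, easily representable on the log scale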
In contrast, if we work on the log scale, \(\log p(y^{\mathrm{new}} \, | \, y)\) will not underflow. It's a sum of a bunch of terms of order 1. But we already saw we can't just average the log to get the log of the average.
To avoid underflow, we're going to use the log-sum-of-exponentials trick, which begins by noting the obvious,
\[\log \frac{1}{M} \, \sum_{m=1}^M \ p(y^{\mathrm{new}} \, | \, \theta^{(m)}) \ = \ \log \frac{1}{M} \, \sum_{m=1}^M \ \exp \left( \log p(y^{\mathrm{new}} \, | \, \theta^{(m)}) \right).\]
We’ll then write that last expression as
\[-\log M+ \mathrm{log\_sum\_exp \, } \ \log p(y^{\mathrm{new}} \, | \, \theta^{(m)})\]
We can compute \(\mathrm{log\_sum\_exp}\) stably by subtracting the max value. Suppose \(u = u_1, \ldots, u_M\), and \(\max(u)\) is the largest \(u_m\). We can calculate
\[\mathrm{log\_sum\_exp \,} \ u \ = \ \log \sum_{m=1}^M \exp(u_m) \ = \ \max(u) + \log \sum_{m=1}^M \exp(u_m - \max(u)).\]
Because \(u_m - \max(u) \leq 0\), the exponentiations cannot overflow. They may underflow to zero, but this will not lose precision because of the leading \(\max(u)\) term; the only way underflow can arise is if \(u_m - \max(u)\) is very small, meaning that it would not have added significant digits to \(\max(u)\) had it not underflowed.
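A quick numeric check of this trick (a sketch; the values are arbitrary):

u <- c(-1000, -1001, -1002)
log(sum(exp(u)))                      # -Inf: each exp(u) underflows to 0
max(u) + log(sum(exp(u - max(u))))    # about -999.59: the stable computation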
We can implement \(\mathrm{log\_sum\_exp}\) in R as follows:

log_sum_exp <- function(u) {
  max_u <- max(u)
  a <- 0
  for (n in 1:length(u)) {
    a <- a + exp(u[n] - max_u)
  }
  max_u + log(a)
}

# Or equivalently using vectorization
log_sum_exp <- function(u) {
  max_u <- max(u)
  max_u + log(sum(exp(u - max_u)))
}

and then include the \(-\log M\) term to make it log_mean_exp:

log_mean_exp <- function(u) {
  M <- length(u)
  -log(M) + log_sum_exp(u)
}

We can then use it to compute the log posterior predictive densities for each of the models:

new_lps <- lapply(log_p_new_mats, function(x) apply(x, 2, log_mean_exp))

# sum over the data points
new_lps_sums <- sapply(new_lps, sum)
round(sort(new_lps_sums, decreasing = TRUE), digits = 1)

PartialPooling        Pooling      NoPooling 
         -71.8          -73.1          -81.5 

Now the ranking is different! As expected, the values here are greater than the expectation of the log density due to Jensen's inequality. The partial pooling model appears to be making slightly better predictions than the full pooling model, which in turn is making slightly better predictions than the no pooling model.
Vehtari, Gelman, and Gabry (2016) show that the expected log predictive density can be approximated using the loo function for each model and then compared across models:
loo_compare(loo(fit_partialpool), loo(fit_pool), loo(fit_nopool))

                elpd_diff se_diff
fit_pool         0.0       0.0   
fit_partialpool -0.1       0.5   
fit_nopool      -6.0       2.6   

The leave-one-out (loo) approximation to the expected log predictive density is only asymptotically valid and, with only 18 observations in this case, substantially underestimates the expected log predictive densities found in the previous subsection. Nevertheless, the relative ranking of the models is essentially the same, with the pooled and partially pooled models being virtually indistinguishable but much better than the no pooling model.
With rstanarm it is straightforward to generate draws from the posterior predictive distribution using the posterior_predict function. With this capability, we can either generate predictions for new data or we can apply it to the predictors we already have.
There will be two sources of uncertainty in our predictions, the first being the uncertainty about \(\theta\) in the posterior \(p(\theta \, | \, y)\) and the second being the uncertainty due to the likelihood \(p(\tilde{y} \, | \, \theta)\).
We let \(z_n\) be the number of successes for unit \(n\) in \(K^{\mathrm{new}}_n\) further trials. It might seem tempting to eliminate that second source of uncertainty and set \(z_n^{(m)}\) to its expectation, \(\theta_n^{(m)} \, K^{\mathrm{new}}_n\), at each draw \(m\) from the posterior rather than simulating a new value. Or it might seem tempting to remove the first source of uncertainty and use the posterior mean (or median or mode or …) rather than draws from the posterior. Either way, the resulting values would suffice for estimating the posterior mean, but would not capture the uncertainty in the prediction for \(y^{\mathrm{new}}_n\) and would thus not be useful in estimating predictive standard deviations or quantiles or as the basis for decision making under uncertainty. In other words, the predictions would not be properly calibrated (in the sense discussed below).
To predict \(z\) for each player we can use the following code:

newdata <- data.frame(Hits = y_new, AB = K_new, Player = bball$Player)
ppd_pool <- posterior_predict(fit_pool, newdata)
ppd_nopool <- posterior_predict(fit_nopool, newdata)
ppd_partialpool <- posterior_predict(fit_partialpool, newdata)
colnames(ppd_pool) <- colnames(ppd_nopool) <- colnames(ppd_partialpool) <- as.character(bball$Player)
colMeans(ppd_partialpool)

  Clemente   Robinson     Howard  Johnstone      Berry    Spencer  Kessinger 
 107.23925  122.72675  147.32300   76.56375  114.86050  127.86900  158.09975 
  Alvarado      Santo    Swaboda Petrocelli  Rodriguez      Scott      Unser 
  36.63425  133.22975   52.15450  137.93575   47.70450  111.53375   71.25625 
  Williams Campaneris     Munson      Alvis 
 151.86475  141.23425  101.53275   17.14850 

Translating the posterior number of hits into a season batting average, \(\frac{y_n + z_n}{K_n + K^{\mathrm{new}}_n}\), we get an 80% posterior interval of

z_1 <- ppd_partialpool[, 1]
clemente_80pct <- (y[1] + quantile(z_1, prob = c(0.1, 0.9))) / (K[1] + K_new[1])
batting_avg(clemente_80pct)

  10%   90% 
0.255 0.362 

for Roberto Clemente from the partial pooling model. Part of our uncertainty here is due to our uncertainty in Clemente's underlying chance of success, and part of our uncertainty is due to there being 367 remaining trials (at-bats) modeled as binomial. In the remaining at-bats for the season, Clemente's success rate (batting average) was \(127 / 367 = 0.346\).
For each model, the following plot shows each player's posterior predictive 50% interval for predicted batting average (success rate) in his remaining at-bats (trials); the observed success rate in the remainder of the season is shown as a blue dot.

ppd_intervals <- function(x) t(apply(x, 2, quantile, probs = c(0.25, 0.75)))
ppd_summaries <- (1 / K_new) * rbind(ppd_intervals(ppd_pool),
                                     ppd_intervals(ppd_nopool),
                                     ppd_intervals(ppd_partialpool))
df_ppd <- data.frame(player = rep(1:length(y_new), 3),
                     y = rep(y_new / K_new, 3),
                     lb = ppd_summaries[, "25%"],
                     ub = ppd_summaries[, "75%"],
                     model = rep(models, each = length(y_new)))
ggplot(df_ppd, aes(x = player, y = y, ymin = lb, ymax = ub)) +
  geom_linerange(color = "gray60", size = 2) +
  geom_point(size = 2.5, color = "skyblue4") +
  facet_grid(. ~ model) +
  labs(x = NULL, y = "batting average") +
  scale_x_continuous(breaks = NULL) +
  ggtitle(expression(atop("Posterior Predictions for Batting Average in Remainder of Season",
                          atop("50% posterior predictive intervals (gray bars); observed (blue dots)", ""))))

We choose to plot 50% posterior intervals as they are a good single point for checking calibration. Rather than plotting the number of hits on the vertical axis, we have standardized all the predictions and outcomes to a success rate. Because each unit (player) has a different number of subsequent trials (at-bats), the posterior intervals are relatively wider or narrower within the plots for each model (more trials imply narrower intervals for the average). Because each unit had the same number of initial observed trials, this variation is primarily due to the uncertainty from the binomial model of outcomes.
With 50% intervals, we expect half of our estimates to lie outside their intervals in a well-calibrated model. If fewer than the expected number of outcomes lie in their estimated posterior intervals, we have reason to believe the model is not well calibrated; its posterior intervals are too narrow. This is also true if too many outcomes lie in their estimated posterior intervals; in that case the intervals are too broad. Of course, there is variation in the tests, as the number of units lying in their intervals is itself a random variable (see the exercises), so in practice we are only looking for extreme values as indicators of miscalibration.
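As a rough numeric check (a sketch reusing the df_ppd data frame constructed above, not part of the original code), we can count how many of the 18 observed rest-of-season averages fall inside their 50% intervals for each model; around 9 per model is what good calibration would suggest:

in_interval <- df_ppd$y >= df_ppd$lb & df_ppd$y <= df_ppd$ub
tapply(in_interval, df_ppd$model, sum)   # count out of 18 players per model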
Each of the models other than the complete pooling model appears to be reasonably well calibrated, and even the calibration for the complete pooling model is not bad (the variation in chance of success among players has low enough variance that the complete pooling model cannot be rejected as a possibility with only the amount of data we used here).
Consider the width of the posterior predictive intervals for the units across the models. The model with no pooling has the broadest posterior predictive intervals and the complete pooling model the narrowest. This is to be expected given the number of observations used to fit each model: 45 each in the no pooling case, 810 in the complete pooling case, and something in between for the partial pooling model.
For two well calibrated models, the one with the narrower posterior intervals is preferable because its predictions are tighter. The term introduced for this by Gneiting et al. (2007) is "sharpness." In the limit, a perfect model would provide a delta function at the true answer with a vanishing posterior interval.
The 80% interval in the partial pooling model coincidentally shows us that our model estimates a roughly 10% chance of Roberto Clemente batting 0.400 or better for the season based on batting 0.400 in his first 45 at-bats. Not great, but non-trivial. Rather than fishing for the right quantile and hoping to get lucky, we can directly estimate event probabilities, such as the probability that Roberto Clemente's batting average is 0.400 or better for the season.
Event probabilities are defined as expectations of indicator functions over parameters and data. For example, the probability of player \(n\)'s batting average being 0.400 or better conditioned on the data \(y\) is defined by the conditional event probability
\[\mathrm{Pr}\left[\frac{(y_n + z_n)}{(45 + K^{\mathrm{new}}_n)} \geq 0.400 \, \Big| \, y\right] \ = \ \int_{\Theta} \mathrm{I}\left[\frac{(y_n + z_n)}{(45 + K^{\mathrm{new}}_n)} \geq 0.400\right] \ p(z_n \, | \, \theta_n, K^{\mathrm{new}}_n) \ p(\theta \, | \, y, K) \ \mathrm{d}\theta.\]
The indicator function \(\mathrm{I}[c]\) evaluates to 1 if the condition \(c\) is true and 0 if it is false. Because it is just another expectation with respect to the posterior, we can calculate this event probability using MCMC as
\[\mathrm{Pr}\left[\frac{(y_n + z_n)}{(45 + K^{\mathrm{new}}_n)} \geq 0.400 \, \Big| \, y \right] \ \approx \ \frac{1}{M} \, \sum_{m=1}^M \mathrm{I}\left[\frac{(y_n + z_n^{(m)})}{(45 + K^{\mathrm{new}}_n)} \geq 0.400\right].\]
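As a concrete illustration (a sketch reusing the ppd_partialpool draws computed above; the variable names are ours), the probability of Clemente finishing the season at 0.400 or better can be estimated by averaging the indicator over the posterior predictive draws of his remaining hits:

z_clemente <- ppd_partialpool[, "Clemente"]
season_avg_clemente <- (y[1] + z_clemente) / (K[1] + K_new[1])
mean(season_avg_clemente >= 0.400)   # proportion of draws at or above 0.400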
This event is about the season batting average being greater than or equal to 0.400. What if we care about ability (chance of success), not batting average (success rate), for the rest of the season? Then we would ask about \(\mathrm{Pr}[\theta_n \geq 0.400 \, | \, y]\). This is defined as a weighted average over the posterior and computed via MCMC as in the previous case.
\[\mathrm{Pr}\left[\theta_n \geq 0.400 \, | \, y \right] \ = \ \int_{\Theta} \mathrm{I}\left[\theta_n \geq 0.400\right] \ p(\theta \, | \, y, K) \ \mathrm{d}\theta \ \approx \ \frac{1}{M} \, \sum_{m=1}^M \mathrm{I}[\theta_n^{(m)} \geq 0.400].\]
draws_partialpool <- shift_draws(as.matrix(fit_partialpool))
thetas_partialpool <- plogis(draws_partialpool)
thetas_partialpool <- thetas_partialpool[, -ncol(thetas_partialpool)]
colnames(thetas_partialpool) <- as.character(bball$Player)

ability_gt_400 <- thetas_partialpool > 0.4
cat("Pr(theta_n >= 0.400 | y)\n")
colMeans(ability_gt_400)[c(1, 5, 10)]

some_gt_350 <- apply(thetas_partialpool, 1, function(x) max(x) > 0.35)
cat("Pr(at least one theta_n >= 0.350 | y)\n")
mean(some_gt_350)

Pr(theta_n >= 0.400 | y)
Clemente    Berry  Swaboda 
 0.02175  0.00325  0.00050 
Pr(at least one theta_n >= 0.350 | y)
[1] 0.22875

We snuck in a "multiple comparison" event in the last section, namely whether there was some player with a chance of success for hits of 0.350 or greater.
With traditional significance testing over multiple trials, it is common to adjust for falsely rejecting the null hypothesis (a so-called Type I error) by lowering the conventional (and arguably far too low) 5% target for reporting "significance."
For example, suppose we have our 18 players with ability parameters \(\theta_n\) and we have \(N\) null hypotheses of the form \(H_0^n: \theta_n < 0.350\). Now suppose we evaluate each of these 18 hypotheses independently at the conventional \(p = 0.05\) significance level, giving each a 5% chance of rejecting the null hypothesis in error. When we run all 18 hypothesis tests, the overall chance of falsely rejecting at least one of the null hypotheses is a whopping \(1 - (1 - 0.05)^{18} = 0.60\).
The traditional solution to this problem is to apply a Bonferroni adjustment to control the false rejection rate; the typical adjustment is to divide the \(p\)-value threshold by the number of hypothesis tests in the "family" (that is, the collective test being done). Here that sets the threshold to \(p = 0.05/18\), or approximately \(p = 0.003\), and results in a slightly less than 5% chance of falsely rejecting at least one null hypothesis.
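The arithmetic (a quick sketch):

1 - (1 - 0.05)^18        # about 0.60: chance of at least one false rejection, unadjusted
1 - (1 - 0.05 / 18)^18   # about 0.049: with the Bonferroni-adjusted threshold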
Although the Bonferroni correction does reduce the overall chance of falsely rejecting a null hypothesis, it also reduces the statistical power of the test to the same degree. This means that many null hypotheses that should be rejected will erroneously fail to be rejected.
Rather than doing classical multiple comparison adjustments to control the false-discovery rate, such as a Bonferroni correction, Gelman et al. (2012) suggest using a hierarchical model to perform partial pooling instead. As already shown, hierarchical models partially pool the data, which pulls estimates toward the population mean with a strength determined by the amount of observed variation in the population (see also Figure 2 of Gelman et al. (2012)). This automatically reduces the false-discovery rate, though not in a way that is intrinsically calibrated to false discovery, which is good, because reducing the overall false discovery rate in and of itself reduces the true discovery rate at the same time.
The quantity some_gt_350 computed above is set to 1 for each posterior draw in which the maximum ability estimate in \(\theta\) is greater than 0.35, and thus its posterior mean is the event probability
\[\mathrm{Pr}[\mathrm{max}(\theta) > 0.350] \ = \ \int_{\Theta} \mathrm{I}[\mathrm{max}(\theta) > 0.35] \ p(\theta \, | \, y, K) \ \mathrm{d}\theta \ \approx \ \frac{1}{M} \, \sum_{m=1}^M \ \mathrm{I}[\mathrm{max}(\theta^{(m)}) > 0.35],\]
where \(\theta^{(m)}\) is the \(m\)th posterior draw of the ability parameter vector. This is just the posterior mean of some_gt_350 computed above, which takes on the value \(\mathrm{I}[\mathrm{max}(\theta^{(m)}) > 0.35]\) in each iteration.
The probability estimate of there being a player with an ability (chance of success) greater than 0.350 is essentially zero in the complete pooling model and is essentially guaranteed in the no pooling model. The partially pooled estimate would not be considered significant at conventional \(p = 0.05\) thresholds. One way to get a handle on what's going on is to inspect the posterior 80% intervals for the chance-of-success estimates in the first graph above.
In addition to multiple comparisons, we can use the simultaneous estimation of the ability parameters to rank the units. In this section, we rank ballplayers by (estimated) chance of success (i.e., batting ability).
Of course, ranking players by ability makes no sense for the complete pooling model, where every player is assumed to have the same ability.

reverse_rank <- function(x) 1 + length(x) - rank(x)  # so lower rank is better
rank <- apply(thetas_partialpool, 1, reverse_rank)
t(apply(rank, 1, quantile, prob = c(0.1, 0.5, 0.9)))

           10% 50% 90%
Clemente     1   5  14
Robinson     1   5  14
Howard       1   6  14
Johnstone    2   7  15
Berry        2   8  15
Spencer      2   8  15
Kessinger    2   9  16
Alvarado     3   9  16
Santo        3  10  17
Swaboda      3  10  17
Petrocelli   4  11  17
Rodriguez    4  11  17
Scott        4  11  17
Unser        4  11  17
Williams     3  11  17
Campaneris   4  12  17
Munson       4  13  18
Alvis        5  14  18

It is again abundantly clear from the posterior intervals that our uncertainty is very great after only 45 at-bats.
In the original Volume I BUGS example of surgical mortality, the posterior distribution over ranks was plotted for each hospital. It is now straightforward to reproduce that figure here for the baseball data.

df_rank <- data.frame(name = rep(bball$Player, each = M),
                      rank = c(t(rank)))
ggplot(df_rank, aes(rank)) +
  stat_count(width = 0.8) +
  facet_wrap(~ name) +
  scale_x_discrete("Rank", limits = c(1, 5, 10, 15)) +
  scale_y_discrete("Probability", limits = c(0, 0.1 * M, 0.2 * M),
                   labels = c("0.0", "0.1", "0.2")) +
  ggtitle("Rankings for Partial Pooling Model")

We can use our ranking statistic to calculate the event probability for unit \(n\) that the unit has the highest chance of success using MCMC as
\[\mathrm{Pr}[\theta_n = \max(\theta)] \ = \ \int_{\Theta} \mathrm{I}[\theta_n = \mathrm{max}(\theta)] \ p(\theta \, | \, y, K) \ \mathrm{d}\theta \ \approx \ \frac{1}{M} \, \sum_{m=1}^M \mathrm{I}[\theta^{(m)}_n = \mathrm{max}(\theta^{(m)})].\]
Like our other models, the partial pooling mitigates the implicit multiple comparisons being done to calculate the probabilities of rankings. Contrast this with an approach that does a pairwise significance test and then applies a false-discovery correction.
We can compute this straightforwardly using the rank data we have already computed, or we could compute it directly as above. Because \(\mathrm{Pr}[\theta_n = \theta_{n'}] = 0\) for \(n \neq n'\), we don't have to worry about ties.
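For instance, a direct computation from the posterior draws (a sketch, not part of the original code) is:

best_idx <- apply(thetas_partialpool, 1, which.max)   # index of the best player in each draw
is_best_direct <- prop.table(table(factor(best_idx, levels = 1:N,
                                          labels = colnames(thetas_partialpool))))
round(sort(is_best_direct, decreasing = TRUE)[1:3], 3)   # top three players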
thetas_nopool <- plogis(as.matrix(fit_nopool))
colnames(thetas_nopool) <- as.character(bball$Player)
rank_nopool <- apply(thetas_nopool, 1, reverse_rank)
is_best_nopool <- rowMeans(rank_nopool == 1)
is_best_partialpool <- rowMeans(rank == 1)

df_is_best <- data.frame(unit = rep(bball$Player, 2),
                         is_best = c(is_best_partialpool, is_best_nopool),
                         model = rep(c("partial pooling", "no pooling"), each = N))
ggplot(df_is_best, aes(x = unit, y = is_best)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ model) +
  scale_y_continuous(name = "Pr[player is best]") +
  ggtitle("Who is the Best Player?") +
  theme(axis.text.x = element_text(angle = -45, vjust = 1, hjust = 0))

This question of which player has the highest chance of success (batting ability) doesn't even make sense in the complete pooling model, because the chance-of-success parameters are all the same by definition. In the other models, the amount of pooling directly determines the probabilities of being the best player. That is, the probability of being best goes down for high-performing players with more pooling, whereas it goes up for below-average players.
We can simulate data from the predictive distribution and compare it to the original data used for fitting the model. If they are not consistent, then either our model is not capturing the aspects of the data we are probing with test statistics or the measurement we made is highly unlikely. That is, extreme \(p\)-values lead us to suspect there is something wrong with our model that deserves further exploration.
In some cases, we are willing to work with models that are wrong in some measurable aspects but accurately capture quantities of interest for an application. That is, it's possible for a model to capture some, but not all, aspects of a data set, and still be useful.
A test statistic \(T\) is a function from data to a real value. Following Gelman et al. (2013), we will concentrate on four specific test statistics for repeated binary trial data (though these choices are fairly general): minimum value, maximum value, sample mean, and sample standard deviation.
Given a test statistic \(T\) and data \(y\), the Bayesian \(p\)-value has a direct definition as a probability,
\[p_B = \mathrm{Pr}[T(y^{\mathrm{rep}}) \geq T(y) \, | \, y].\]
Bayesian \(p\)-values, like their traditional counterparts, are probabilities, but not probabilities that a model is true. They simply measure discrepancies between the observed data and what we would expect if the model is true.
Values of Bayesian \(p\)-values near 0 or 1 indicate that the data \(y\) used to estimate the model is unlikely to have been generated by the estimated model. As with other forms of full Bayesian inference, our estimate is the full posterior, not just a point estimate.
As with other Bayesian inferences, we average over the posterior rather than working from a point estimate of the parameters. Expanding this as an expectation of an indicator function,
\[p_B \ = \ \int_{\Theta, Y^{\mathrm{rep}}} \mathrm{I}[T(y^{\mathrm{rep}}) \geq T(y)] \ p(y^{\mathrm{rep}} \, | \, \theta) \ p(\theta \, | \, y) \ \mathrm{d}\theta.\]
We treat \(y^{\mathrm{rep}}\) as a parameter in parallel with \(\theta\), integrating over possible values \(y^{\mathrm{rep}} \in Y^{\mathrm{rep}}\). As usual, we use the integration sign in a general way intended to include summation, as with the discrete variable \(y^{\mathrm{rep}}\).
The formulation as an expectation leads to the obvious MCMC calculation based on posterior draws \(y^{\mathrm{rep} \ (m)}\) for \(m \in 1{:}M\),
\[p_B\approx\frac{1}{M}\,\sum_{m=1}^M\mathrm{I}[T(y^{\mathrm{rep} \ (m)}) \geq T(y)].\]
Using the pp_check function in rstanarm, we can easily reproduce Figure 6.12 from Gelman et al. (2013), which shows the posterior predictive distribution for the test statistic, the observed value as a vertical line, and the \(p\)-value for each of the tests. First, here is just the plot for the no pooling model using the mean as the test statistic:
pp_check(fit_nopool, plotfun = "stat", stat = "mean")

The stat argument can be the name of any R function (including your own functions defined in the Global Environment) that takes a vector as an input and returns a scalar.
To make plots for each of the models for several test statistics we can use the following code, which will create a list of ggplot objects for each model and then arrange everything in a single plot.

tstat_plots <- function(model, stats) {
  lapply(stats, function(stat) {
    graph <- pp_check(model, plotfun = "stat", stat = stat, seed = SEED)  # optional arguments
    graph + xlab(stat) + theme(legend.position = "none")
  })
}

Tstats <- c("mean", "sd", "min", "max")
ppcs_pool <- tstat_plots(fit_pool, Tstats)
ppcs_nopool <- tstat_plots(fit_nopool, Tstats)
ppcs_partialpool <- tstat_plots(fit_partialpool, Tstats)

if (require(gridExtra)) {
  grid.arrange(
    arrangeGrob(grobs = ppcs_pool, nrow = 1, left = "Pooling"),
    arrangeGrob(grobs = ppcs_nopool, nrow = 1, left = "No Pooling"),
    arrangeGrob(grobs = ppcs_partialpool, nrow = 1, left = "Partial Pooling")
  )
}

The only worrisomely extreme value visible in the plots is the \(p\)-value for the standard deviation in the no-pooling model, where the vast majority of the simulated data sets under the model had standard deviations greater than the actual data.
We didn't actually compute this \(p\)-value because extreme \(p\)-values are easy to detect visually, and whether or not the \(p\)-value is less than \(0.05\) or some other arbitrary value is of little use to us beyond what we can already see in the plot. However, if we did want to actually compute the \(p\)-value we can do so easily:

yrep <- posterior_predict(fit_nopool, seed = SEED)  # seed is optional
Ty <- sd(y)
Tyrep <- apply(yrep, 1, sd)

# tail-area probability
p <- 1 - mean(Tyrep > Ty)
print(p)

[1] 0.01325

Following the advice of Gelman et al. (2013), we will take the fitted parameters of the data set and generate replicated data sets, then compare the replicated data sets visually to the observed data we used to fit the model. In this section we'll create the plots for the model using partial pooling, but the same plots can be made for the other models too.
Again using rstanarm's pp_check function, we can plot some of the simulated data sets along with the original data set to do a visual inspection as suggested by Gelman et al. (2013). For this type of posterior predictive check we set the plotfun argument to "hist" and we use nreps to specify how many replicated sets of data to generate from the posterior predictive distribution. Because our models have a binomial outcome, instead of plotting the number of successes (hits in this case) on the x-axis, pp_check will plot the proportion of successes.
pp_check(fit_partialpool, plotfun = "hist", nreps = 15, binwidth = 0.025) +
  ggtitle("Model: Partial Pooling")

These simulations are not unreasonable for a binomial likelihood, but they are more spread out than the actual data. In this case, this may actually have more to do with how the data were selected out of all the major league baseball players than with the actual data distribution. Efron and Morris (1975, p. 312) write
This sample was chosen because we wanted between 30 and 50 at bats to assure a satisfactory approximation of the binomial by the normal distribution while leaving the bulk of at bats to be estimated. We also wanted to include an unusually good hitter (Clemente) to test the method with at least one extreme parameter, a situation expected to be less favorable to Stein's estimator. Stein's estimator requires equal variances, or in this situation, equal at bats, so the remaining 17 players are all whom either the April 26 or May 3 New York Times reported with 45 at bats.
A hierarchical model introduces an estimation bias toward the population mean and the stronger the bias, the less variance there is in the estimates for the units. Exactly how much bias and variance is warranted can be estimated by further calibrating the model and testing where its predictions do not bear out.
With very little data, there is very little we can do to gain sharp inferences other than provide more informative priors, which is well worth doing when prior information is available.
On the other hand, with more data, the models provide similar results (see the exercises), and in the limit, all of the models (other than complete pooling) converge to posteriors that are delta functions around the empirical chance of success (i.e., the maximum likelihood estimate). Meanwhile, Bayesian inference is allowing us to make more accurate predictions with the data available before we hit that asymptotic regime.
Generate fake data according to the pooling, no-pooling, and partial pooling models. Fit the models and consider the coverage of the posterior 80% intervals.
Try generating data where each player has a different number of at-bats (trials) and then fitting the models. What effect does the number of initial trials have on the posterior? Is there a way to quantify the effect?
In the section where we fit the complete pooling model we show a plot of the prior distribution on the probability of success \(\theta\) implied by the \(\mathsf{Normal}(-1, 1)\) prior on the log-odds \(\alpha\). If \(\theta = \mathrm{logit}^{-1}(\alpha)\) and \(p(\alpha) = \mathsf{Normal}(\alpha \, | \, -1, 1)\), what is \(p(\theta)\)? For a hint, see here.
How sensitive is the basic no-pooling model to the choice of prior? We used a somewhat informative prior due to our knowledge of baseball, but the prior could be made more or less informative. How, if at all, does this affect posterior inference?
What are some other test statistics that might be used to evaluate our model fit to data? Try some out using pp_check(model, plotfun = "stat", stat = "my_test"), where my_test is your function that computes the test statistic. For example, to check the 25% quantile you could first define a function q25 <- function(x) quantile(x, 0.25) and then call pp_check(model, plotfun = "stat", stat = "q25").
Discuss the difference between batting average and on-base percentage as random variables. Consider particularly the denominator (at-bat versus plate appearance). Is the denominator in these kinds of problems always a random variable itself? Why might this be important in inference?
Betancourt, M. and Girolami, M. (2015) Hamiltonian Monte Carlo for hierarchical models. Current Trends in Bayesian Methodology with Applications 79.
Efron, B. and Morris, C. (1975) Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association 70(350), 311–319. [pdf]
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013) Bayesian Data Analysis, 3rd Edition. Chapman & Hall/CRC Press, London.
Gelman, A. and Hill, J. (2007) Data Analysis Using Regression and Multilevel-Hierarchical Models. Cambridge University Press, Cambridge, United Kingdom.
Gelman, A., Hill, J., and Yajima, M. (2012) Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness 5, 189–211. [preprint]
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007) Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(2), 243–268.
Lunn, D., Jackson, C., Best, N., Thomas, A., and Spiegelhalter, D. (2013) The BUGS Book: A Practical Introduction to Bayesian Analysis. Chapman & Hall/CRC Press.
Neal, R. M. (2003) Slice sampling. Annals of Statistics 31(3), 705–767.
Papaspiliopoulos, O., Roberts, G. O., and Skold, M. (2003) Non-centered parameterisations for hierarchical models and data augmentation. In Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, edited by Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M. Oxford University Press, Chicago.
Plummer, M., Best, N., Cowles, K., and Vines, K. (2006) CODA: Convergence diagnosis and output analysis for MCMC. R News 6(1), 7–11.
Spiegelhalter, D., Thomas, A., Best, N., and Gilks, W. (1996) BUGS 0.5 Examples. MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK.
Stan Development Team (2015) Stan Modeling Language User's Guide and Reference Manual. [web page]
Tarone, R. E. (1982) The use of historical control information in testing for a trend in proportions. Biometrics 38(1), 215–220.
Vehtari, A., Gelman, A., and Gabry, J. (2016) Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. [pdf]
The following additional data sets have a similar structure to the baseball data used in this vignette and are included with rstanarm.
Tarone (1982) provides a data set of tumor incidence in historical control groups of rats; specifically endometrial stromal polyps in female lab rats of type F344. The data set is taken from the book site for Gelman et al. (2013):

data(tumors, package = "rstanarm")

Spiegelhalter et al. (1996) provide a data set of mortality rates in 12 hospitals performing cardiac surgery in babies. We just manually entered the data from the paper; it is also available in the Stan example models repository in R format.

data(mortality, package = "rstanarm")

Carpenter (2009) updates Efron and Morris's (1975) data set for the entire set of players for the entire 2006 American League season of Major League Baseball. The data was originally downloaded from seanlahman.com, which is currently not working.
data(bball2006, package = "rstanarm")