This vignette provides an introduction on how to fit non-linear multilevel models with brms. Non-linear models are incredibly flexible and powerful, but require much more care with respect to model specification and priors than typical generalized linear models. Ignoring group-level effects for the moment, the predictor term \(\eta_n\) of a generalized linear model for observation \(n\) can be written as follows:
\[\eta_n = \sum_{i = 1}^K b_ix_{ni}\]
where \(b_i\) is the regression coefficient of predictor \(i\) and \(x_{ni}\) is the data of predictor \(i\) for observation \(n\). This also comprises interaction terms and various other data transformations. However, the structure of \(\eta_n\) is always linear in the sense that the regression coefficients \(b_i\) are multiplied by some predictor values and then summed up. This implies that the hypothetical predictor term
\[\eta_n = b_1 \exp(b_2 x_n)\]
would not be a linear predictor anymore and we could not fit it using classical techniques of generalized linear models. We thus need a more general model class, which we will call non-linear models. Note that the term ‘non-linear’ does not say anything about the assumed distribution of the response variable. In particular, it does not mean ‘not normally distributed’ as we can apply non-linear predictor terms to all kinds of response distributions (for more details on response distributions available in brms see vignette("brms_families")).
We begin with a simple example using simulated data.
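A minimal simulation sketch is shown below; the true values b1 = 2 and b2 = 0.75 and the unit-variance Gaussian noise are assumptions chosen to be consistent with the estimates reported further down.

```r
# assumed true values of b1 and b2 (consistent with the estimates below)
b <- c(2, 0.75)
x <- rnorm(100)
# response generated from y = b1 * exp(b2 * x) plus unit-variance noise
y <- rnorm(100, mean = b[1] * exp(b[2] * x))
dat1 <- data.frame(x, y)
```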
As stated above, we cannot use a generalized linear model to estimate \(b\), so we go ahead and specify a non-linear model.
```r
prior1 <- prior(normal(1, 2), nlpar = "b1") +
  prior(normal(0, 2), nlpar = "b2")
fit1 <- brm(bf(y ~ b1 * exp(b2 * x), b1 + b2 ~ 1, nl = TRUE),
            data = dat1, prior = prior1)
```

When looking at the above code, the first thing that becomes obvious is that we changed the formula syntax to display the non-linear formula including predictors (i.e., x) and parameters (i.e., b1 and b2) wrapped in a call to bf. This stands in contrast to classical R formulas, where only predictors are given and parameters are implicit. The argument b1 + b2 ~ 1 serves two purposes. First, it specifies which variables in formula are parameters, and second, it specifies the linear predictor terms for each parameter. In fact, we should think of non-linear parameters as placeholders for linear predictor terms rather than as parameters themselves (see also the following examples). In the present case, we have no further variables to predict b1 and b2 and thus we just fit intercepts that represent our estimates of \(b_1\) and \(b_2\) in the model equation above. The formula b1 + b2 ~ 1 is a short form of b1 ~ 1, b2 ~ 1 that can be used if multiple non-linear parameters share the same formula. Setting nl = TRUE tells brms that the formula should be treated as non-linear.
In contrast to generalized linear models, priors on population-level parameters (i.e., ‘fixed effects’) are often mandatory to identify a non-linear model. Thus, brms requires the user to explicitly specify these priors. In the present example, we used a normal(1, 2) prior on (the population-level intercept of) b1, while we used a normal(0, 2) prior on (the population-level intercept of) b2. Setting priors is a non-trivial task in all kinds of models, especially in non-linear models, so you should always invest some time to think of appropriate priors. Quite often, you may be forced to change your priors after fitting a non-linear model for the first time, when you observe different MCMC chains converging to different posterior regions. This is a clear sign of an identification problem and one solution is to set stronger (i.e., more narrow) priors.
To obtain summaries of the fitted model, we apply the summary, plot, and conditional_effects methods.
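A sketch of these calls (the plotting arguments are illustrative):

```r
summary(fit1)                                   # posterior summaries
plot(fit1)                                      # trace and density plots
plot(conditional_effects(fit1), points = TRUE)  # model-implied regression line
```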
```
 Family: gaussian 
  Links: mu = identity 
Formula: y ~ b1 * exp(b2 * x) 
         b1 ~ 1
         b2 ~ 1
   Data: dat1 (Number of observations: 100) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
b1_Intercept     1.90      0.09     1.72     2.09 1.00     1374     1976
b2_Intercept     0.75      0.03     0.70     0.80 1.00     1474     1858

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.94      0.07     0.82     1.08 1.00     2098     1749

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

The summary method reveals that we were able to recover the true parameter values pretty nicely. According to the plot method, our MCMC chains have converged well and to the same posterior. The conditional_effects method visualizes the model-implied (non-linear) regression line.
We might also be interested in comparing our non-linear model to a classical linear model.
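A sketch of such a fit, matching the formula and data shown in the summary output below (gaussian is the default family):

```r
fit2 <- brm(y ~ x, data = dat1)  # classical linear model for comparison
summary(fit2)
```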
```
 Family: gaussian 
  Links: mu = identity 
Formula: y ~ x 
   Data: dat1 (Number of observations: 100) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     2.49      0.16     2.17     2.80 1.00     4469     3040
x             2.30      0.16     1.99     2.61 1.00     4784     3164

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.53      0.11     1.33     1.76 1.00     3720     2914

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

To investigate and compare model fit, we can apply graphical posterior predictive checks, which make use of the bayesplot package on the backend.
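For instance, via the pp_check method:

```r
pp_check(fit1)  # posterior predictive check for the non-linear model
pp_check(fit2)  # the same check for the linear model
```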
We can also easily compare model fit using leave-one-out cross-validation.
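The comparison below can be produced with a single loo call:

```r
loo(fit1, fit2)  # leave-one-out cross-validation for both models
```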
```
Output of model 'fit1':

Computed from 4000 by 100 log-likelihood matrix.

         Estimate   SE
elpd_loo   -136.8  7.2
p_loo         2.8  0.8
looic       273.7 14.4
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 1.1]).

Pareto k diagnostic values:
                         Count Pct.   Min. ESS
(-Inf, 0.7]  (good)      99    99.0%  1142
   (0.7, 1]  (bad)        1     1.0%  <NA>
   (1, Inf)  (very bad)   0     0.0%  <NA>
See help('pareto-k-diagnostic') for details.

Output of model 'fit2':

Computed from 4000 by 100 log-likelihood matrix.

         Estimate   SE
elpd_loo   -189.4 17.2
p_loo         9.0  5.9
looic       378.7 34.4
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.3]).

Pareto k diagnostic values:
                         Count Pct.   Min. ESS
(-Inf, 0.7]  (good)      99    99.0%  595
   (0.7, 1]  (bad)        0     0.0%  <NA>
   (1, Inf)  (very bad)   1     1.0%  <NA>
See help('pareto-k-diagnostic') for details.

Model comparisons:
     elpd_diff se_diff
fit1   0.0       0.0
fit2 -52.5      16.9
```

Since smaller LOOIC values indicate better model fit, it is immediately evident that the non-linear model fits the data better, which is of course not too surprising since we simulated the data from exactly that model.
On his blog, Markus Gesmann predicts the growth of cumulative insurance loss payments over time, originating from different origin years (see https://www.magesblog.com/post/2015-11-03-loss-developments-via-growth-curves-and/). We will use a slightly simplified version of his model for demonstration purposes here. It looks as follows:
\[cum_{AY, dev} \sim N(\mu_{AY, dev}, \sigma)\]
\[\mu_{AY, dev} = ult_{AY} \left(1 - \exp\left(- \left( \frac{dev}{\theta} \right)^\omega \right)\right)\]
The cumulative insurance payments \(cum\) will grow over time, and we model this dependency using the variable \(dev\). Further, \(ult_{AY}\) is the (to be estimated) ultimate loss of each accident year. It constitutes a non-linear parameter in our framework along with the parameters \(\theta\) and \(\omega\), which are responsible for the growth of the cumulative loss and are assumed to be the same across years. The data is already shipped with brms.
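We take a quick look at the first few rows (head simply previews the loss data set):

```r
head(loss)
```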
```
    AY dev      cum premium
1 1991   6  357.848   10000
2 1991  18 1124.788   10000
3 1991  30 1735.330   10000
4 1991  42 2182.708   10000
5 1991  54 2745.596   10000
6 1991  66 3319.994   10000
```

and translate the proposed model into a non-linear brms model.
```r
fit_loss <- brm(
  bf(cum ~ ult * (1 - exp(-(dev / theta)^omega)),
     ult ~ 1 + (1 | AY), omega ~ 1, theta ~ 1,
     nl = TRUE),
  data = loss, family = gaussian(),
  prior = c(
    prior(normal(5000, 1000), nlpar = "ult"),
    prior(normal(1, 2), nlpar = "omega"),
    prior(normal(45, 10), nlpar = "theta")
  ),
  control = list(adapt_delta = 0.9)
)
```

We estimate a group-level effect of accident year (variable AY) for the ultimate loss ult. This also shows nicely how a non-linear parameter is actually a placeholder for a linear predictor, which, in the case of ult, contains only a varying intercept over years. Again, priors on population-level effects are required and, for the present model, are actually mandatory to ensure identifiability. We summarize the model using well-known methods.
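For instance, via the summary method:

```r
summary(fit_loss)
```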
```
 Family: gaussian 
  Links: mu = identity 
Formula: cum ~ ult * (1 - exp(-(dev/theta)^omega)) 
         ult ~ 1 + (1 | AY)
         omega ~ 1
         theta ~ 1
   Data: loss (Number of observations: 55) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~AY (Number of levels: 10) 
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(ult_Intercept)   747.55    240.07   423.46  1381.79 1.00     1090     1644

Regression Coefficients:
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
ult_Intercept    5283.29    298.18  4708.31  5916.86 1.00     1110     1746
omega_Intercept     1.34      0.05     1.24     1.43 1.00     2436     2558
theta_Intercept    46.24      2.25    42.50    51.14 1.00     2272     2002

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   140.21     16.06   113.42   176.58 1.00     3289     2676

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

Next, we show marginal effects separately for each year.
```r
conditions <- data.frame(AY = unique(loss$AY))
rownames(conditions) <- unique(loss$AY)
me_loss <- conditional_effects(
  fit_loss, conditions = conditions,
  re_formula = NULL, method = "predict"
)
plot(me_loss, ncol = 5, points = TRUE)
```

It is evident that there is some variation in cumulative loss across accident years, for instance due to natural disasters happening only in certain years. Further, we see that the uncertainty in the predicted cumulative loss is larger for later years with fewer available data points. For a more detailed discussion of this data set, see Section 4.5 in Gesmann & Morris (2020).
As a third example, we want to show how to model more advanced item-response models using the non-linear model framework of brms. For simplicity, suppose we have a single forced choice item with three alternatives of which only one is correct. Our response variable is whether a person answers the item correctly (1) or not (0). Persons are assumed to vary in their ability to answer the item correctly. However, every person has a 33% chance of getting the item right just by guessing. We thus simulate some data to reflect this situation.
```r
inv_logit <- function(x) 1 / (1 + exp(-x))
ability <- rnorm(300)
p <- 0.33 + 0.67 * inv_logit(ability)
answer <- ifelse(runif(300, 0, 1) < p, 1, 0)
dat_ir <- data.frame(ability, answer)
```

The most basic item-response model is equivalent to a simple logistic regression model.
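A sketch of this baseline fit, matching the formula and family shown in the summary output below:

```r
# simple logistic regression: bernoulli family with its default logit link
fit_ir1 <- brm(answer ~ ability, data = dat_ir, family = bernoulli())
```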
However, this model completely ignores the guessing probability and will thus likely produce biased estimates and predictions.
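We inspect it via the summary method (the call producing the output below):

```r
summary(fit_ir1)
```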
```
 Family: bernoulli 
  Links: mu = logit 
Formula: answer ~ ability 
   Data: dat_ir (Number of observations: 300) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.74      0.14     0.47     1.01 1.00     2751     2374
ability       0.86      0.16     0.57     1.19 1.00     2477     2175

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

A more sophisticated approach incorporating the guessing probability looks as follows:
```r
fit_ir2 <- brm(
  bf(answer ~ 0.33 + 0.67 * inv_logit(eta), eta ~ ability, nl = TRUE),
  data = dat_ir, family = bernoulli("identity"),
  prior = prior(normal(0, 5), nlpar = "eta")
)
```

It is very important to set the link function of the bernoulli family to identity or else we will apply two link functions. This is because our non-linear predictor term already contains the desired link function (0.33 + 0.67 * inv_logit), but the bernoulli family applies the default logit link on top of it. This will of course lead to strange and uninterpretable results. Thus, please make sure that you set the link function to identity whenever your non-linear predictor term already contains the desired link function.
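Again, the summary method produces the output below:

```r
summary(fit_ir2)
```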
```
 Family: bernoulli 
  Links: mu = identity 
Formula: answer ~ 0.33 + 0.67 * inv_logit(eta) 
         eta ~ ability
   Data: dat_ir (Number of observations: 300) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
eta_Intercept    -0.10      0.20    -0.52     0.28 1.00     2120     1779
eta_ability       1.39      0.29     0.86     1.99 1.00     2478     2301

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

We again compare model fit via leave-one-out cross-validation.
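The call mirrors the earlier comparison:

```r
loo(fit_ir1, fit_ir2)
```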
```
Output of model 'fit_ir1':

Computed from 4000 by 300 log-likelihood matrix.

         Estimate   SE
elpd_loo   -174.7  7.5
p_loo         2.2  0.2
looic       349.3 15.0
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 0.9]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

Output of model 'fit_ir2':

Computed from 4000 by 300 log-likelihood matrix.

         Estimate   SE
elpd_loo   -173.2  7.6
p_loo         1.9  0.2
looic       346.5 15.1
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 1.0]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

Model comparisons:
        elpd_diff se_diff
fit_ir2   0.0       0.0
fit_ir1  -1.4       1.5
```

The comparison shows that both models fit the data equally well, but remember that predictions of the first model might still be misleading as they may well be below the guessing probability for low ability values. Now, suppose that we don’t know the guessing probability and want to estimate it from the data. This can easily be done by changing the previous model just a bit.
```r
fit_ir3 <- brm(
  bf(answer ~ guess + (1 - guess) * inv_logit(eta),
     eta ~ 0 + ability, guess ~ 1, nl = TRUE),
  data = dat_ir, family = bernoulli("identity"),
  prior = c(
    prior(normal(0, 5), nlpar = "eta"),
    prior(beta(1, 1), nlpar = "guess", lb = 0, ub = 1)
  )
)
```

Here, we model the guessing probability as a non-linear parameter, making sure that it stays within the interval \([0, 1]\). We did not estimate an intercept for eta, as this would lead to a bias in the estimated guessing parameter (try it out; this is an excellent example of how careful one has to be in non-linear models).
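Once more, the summary method:

```r
summary(fit_ir3)
```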
```
 Family: bernoulli 
  Links: mu = identity 
Formula: answer ~ guess + (1 - guess) * inv_logit(eta) 
         eta ~ 0 + ability
         guess ~ 1
   Data: dat_ir (Number of observations: 300) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
eta_ability         1.32      0.26     0.85     1.87 1.00     2163     1850
guess_Intercept     0.31      0.05     0.21     0.41 1.00     2725     2217

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

The results show that we are able to recover the simulated model parameters with this non-linear model. Of course, real item-response data have multiple items so that accounting for item and person variability (e.g., using a multilevel model with varying intercepts) becomes necessary as we have multiple observations per item and person. Luckily, this can all be done within the non-linear framework of brms and I hope that this vignette serves as a good starting point.
Gesmann, M. & Morris, J. (2020). Hierarchical Compartmental Reserving Models. CAS Research Papers.