The package hdpGLM makes it easy to estimate semi-parametric regression models and to summarize and visualize the results. The package is useful for many purposes.
The `hdpGLM` works similarly to linear regression models. It estimates coefficients of linear regressions, including generalized linear models such as logit coefficients. But it simultaneously searches for latent clusters in the data and estimates the linear coefficients for those clusters. The result of the estimation is \(K\) vectors of linear coefficients, one vector for each cluster. If there are no clusters in the data, it returns estimated coefficients similar to classical regression models.
The clustering procedure is based on a hierarchical semi-parametric Bayesian model proposed in Ferrari (2020). The model, called Hierarchical Dirichlet process Generalized Linear Model (hdpGLM), can be used to deal with latent heterogeneity in different situations, including those that emerge due to unobserved variables. It deals with latent heterogeneity in two ways: (1) it finds latent clusters that are better described by different regression models, and (2) it estimates the coefficients of those models. The hdpGLM can also be used with hierarchical data to estimate latent heterogeneity in multiple contexts and to check whether the clusters are context-dependent (see an example in the section Estimating Context-dependent latent heterogeneity).
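In sketch form (the notation below is mine, not taken verbatim from Ferrari (2020)), the Gaussian case of the model can be written as a Dirichlet process mixture of linear regressions:

\[
y_i \mid z_i = k \sim \mathcal{N}\!\left(x_i^{\top}\beta_k,\ \sigma_k^{2}\right),
\qquad \beta_k \sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0),
\]

where \(z_i\) is the latent cluster of observation \(i\), \(G_0\) is a base distribution over regression coefficients, and the Dirichlet process prior lets the number of occupied clusters \(K\) be inferred from the data rather than fixed in advance.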
The linear model is estimated by sampling from the posterior distribution using a Gibbs sampler. Non-linear models (e.g., logit with binary outcomes) use Hamiltonian Monte Carlo within Gibbs. The algorithms are presented in Ferrari (2020).
Why should we estimate clusters of linear regressions instead offitting a single regression model?
First, it improves the predictive performance of the regression model while keeping the interpretability of the regression coefficients. `hdpGLM` is much more flexible than traditional regression and produces monotonically lower root mean square error (see Ferrari (2020) for details).
Second, latent heterogeneity emerges when variables are omitted from the estimation of regression models. The `hdpGLM` can be used to estimate marginal effects even when interactions were omitted, because it recovers the linear coefficients of each latent group.
The function `hdpGLM` estimates a semi-parametric Bayesian regression model. The syntax is similar to other R functions such as `lm()`, `glm()`, and `lmer()`.
Here is a toy example. Suppose we are studying how income inequality affects support for policies that help alleviate poverty in a given country A. Suppose further that (1) the effect of inequality varies between groups of people: for some people, inequality increases support for welfare policies, but for others, it decreases welfare policy support; and (2) we don't know which individual belongs to which group. The data set `welfare` contains simulated data for this example.
```r
## loading and looking at the data
welfare = read.csv2('welfare.csv')
head(welfare)
#>       support inequality     income   ideology
#> 1 -18.5649610  0.3392724  0.1425111  1.9225985
#> 2  -9.3905812 -0.9906646 -0.5117102  0.2483346
#> 3   0.9276234 -2.2318510 -0.3856288 -1.3619216
#> 4 -12.3594498 -3.0079501 -0.9440585 -0.2088675
#> 5  -2.4834411  0.1000455  0.8322192  0.1321378
#> 6 -11.4187853 -0.9543883 -0.8810503  0.2916444
```

Now, suppose that inequality increases support for welfare only among women, but it decreases support among men. We didn't collect data on gender (male versus female). We could estimate the `hdpGLM` and recover the coefficients even if gender wasn't observed. The package provides a function called `hdpGLM`, which estimates a semi-parametric Bayesian generalized linear model using a Dirichlet mixture. Let's estimate the model. The example uses only a few iterations in the MCMC, but in real applications, one should use a much larger number.
```r
library(hdpGLM)
#>
#> ## ===============================================================
#> ## Hierarchial Dirichlet Process Generalized Linear Model (hdpGLM)
#> ## ===============================================================
#>
#> Author: Diogo Ferrari
#> Usage : https://github.com/DiogoFerrari/hdpGLM
#>
#> Attaching package: 'hdpGLM'
#> The following object is masked _by_ '.GlobalEnv':
#>
#>     welfare

## estimating the model
mcmc = list(burn.in = 10,  ## MCMC burn-in period
            n.iter  = 500) ## number of MCMC iterations to keep
mod  = hdpGLM(support ~ inequality + income + ideology,
              data = welfare, mcmc = mcmc)

## printing the outcome
summary(mod)
#>
#> --------------------------------
#> dpGLM model object
#>
#> Maximum number of clusters activated during the estimation: 12
#> Number of MCMC iterations: 500
#> burn-in: 10
#> --------------------------------
#>
#> Summary statistics of clusters with data points
#>
#> --------------------------------
#> Coefficients for cluster 1 (cluster label 1)
#>
#>                Post.Mean Post.Median  HPD.lower HPD.upper
#> 1 (Intercept) -3.8159618  -3.8162366 -3.9052748 -3.744552
#> 2  inequality -1.5269843  -1.5263007 -1.5927229 -1.453077
#> 3      income  3.8842866   3.8824522  3.8165851  3.950905
#> 4    ideology -8.2581637  -8.2578851 -8.3304965 -8.181748
#> 5       sigma  0.9623287   0.9646702  0.8064385  1.071583
#>
#> --------------------------------
#> Coefficients for cluster 2 (cluster label 2)
#>
#>               Post.Mean Post.Median HPD.lower HPD.upper
#> 1 (Intercept) -3.849547  -3.8535366 -3.9941828 -3.720849
#> 2  inequality  1.999751   1.9959474  1.8553687  2.160528
#> 3      income  3.850860   3.8486272  3.7392669  4.013699
#> 4    ideology -8.294882  -8.2964950 -8.4246809 -8.141463
#> 5       sigma  0.996870   0.9992607  0.8938969  1.095496
#>
#> --------------------------------
```

The summary function prints the result in a tidy format. The column `k` in the summary shows the label of the estimated clusters. The column `Mean` is the average of the posterior distribution for each linear coefficient in each cluster.
The function `classify` can be used to classify the data points into clusters based on the estimation.
```r
welfare_clustered = classify(welfare, mod)
head(welfare_clustered)
#>   Cluster     support inequality     income   ideology
#> 1       2 -18.5649610  0.3392724  0.1425111  1.9225985
#> 2       2  -9.3905812 -0.9906646 -0.5117102  0.2483346
#> 3       2   0.9276234 -2.2318510 -0.3856288 -1.3619216
#> 4       2 -12.3594498 -3.0079501 -0.9440585 -0.2088675
#> 5       1  -2.4834411  0.1000455  0.8322192  0.1321378
#> 6       2 -11.4187853 -0.9543883 -0.8810503  0.2916444
tail(welfare_clustered)
#>      Cluster     support  inequality     income   ideology
#> 1995       1  -1.5230053 1.055855140 -0.7295937 -0.7067871
#> 1996       1   0.4814892 0.582588091  2.0051082  0.3090389
#> 1997       1 -14.1929956 0.391164197 -0.9607449  0.7765482
#> 1998       1  -8.2396789 0.074437376  1.2020300  1.0874928
#> 1999       1 -23.1583753 0.434223018 -0.6176438  2.0387294
#> 2000       1  -7.2075582 0.008355317 -0.4538951  0.2268072
```

There are a series of built-in functions, with various options, to plot the results. In the example below, you see two of those options. The `separate` parameter plots the posterior samples for each cluster separately, and the option `ncols` controls how many columns to use for the panels in the figure (to see more, run `help(plot.hdpGLM)` and `help(plot.dpGLM)`).
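A sketch of those two plot calls (assuming the fitted object `mod` from the estimation above; `separate` and `ncols` are the options the text describes):

```r
## posterior samples of the coefficients, all clusters in one panel
plot(mod)

## one panel per cluster, laid out in four columns
plot(mod, separate = TRUE, ncols = 4)
```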
To continue the previous toy example, suppose that we are analyzing data from many countries, and we suspect that the latent heterogeneity is different in each country. The effect of inequality on support for welfare may be gender-specific only in some countries (contexts). Or maybe the way it is gender-specific varies from country to country. Suppose we didn't have data on gender, but we collected information on countries' gender gap in welfare provision. Let's look at this new data set.
```r
## loading and looking at the data
welfare = read.csv2('welfare2.csv')
head(welfare)
#>       support inequality     income   ideology country gap
#> 1 -18.5649610  0.3392724  0.1425111  1.9225985       0 0.1
#> 2  -9.3905812 -0.9906646 -0.5117102  0.2483346       0 0.1
#> 3   0.9276234 -2.2318510 -0.3856288 -1.3619216       0 0.1
#> 4 -12.3594498 -3.0079501 -0.9440585 -0.2088675       0 0.1
#> 5  -2.4834411  0.1000455  0.8322192  0.1321378       0 0.1
#> 6 -11.4187853 -0.9543883 -0.8810503  0.2916444       0 0.1
tail(welfare)
#>         support inequality     income    ideology country        gap
#> 3195  0.3190583 -0.7504798 -0.7839583  0.92300705       4 -0.8280808
#> 3196 -1.3837239  0.6620435 -1.5566268  0.05634618       4 -0.8280808
#> 3197 -1.3820016 -0.4298706 -1.0945688  0.71559078       4 -0.8280808
#> 3198  0.6878775  0.5450604  2.6175887 -1.94844469       4 -0.8280808
#> 3199 -7.9282930  1.7846004  1.6755823  1.29160208       4 -0.8280808
#> 3200 -1.7472485  0.5030992 -0.5395479  0.20109879       4 -0.8280808
```

The variable `country` indicates the country (context) of the observation, and the variable `gap` the gender gap in welfare provision in the respective country. The estimation is similar to the previous example, but now there is a second `formula` for the context-level variables. Again, the example below uses only a few iterations in the MCMC; in practical applications, one needs to increase that.
```r
## estimating the model
mcmc = list(burn.in = 1,  ## MCMC burn-in period
            n.iter  = 50) ## number of MCMC iterations to keep
mod  = hdpGLM(support ~ inequality + income + ideology, support ~ gap,
              data = welfare, mcmc = mcmc)
summary(mod)
#>
#> --------------------------------
#> hdpGLM Object
#>
#> Maximum number of clusters activated during the estimation: 1
#> Number of MCMC iterations: 50
#> Burn-in: 1
#>
#> Number of contexts : 5
#>
#> Number of clusters (summary across contexts):
#>
#>   Average Std.Dev Median Min. Max.
#> 1     5.2 0.83666      5    4    6
#> --------------------------------
#>
#>
#> Summary statistics of clusters with data points in each context
#>
#> --------------------------------
#> Coefficients and clusters for context 1
#>
#>                Post.Mean Post.Median  HPD.lower  HPD.upper Cluster
#> 1  (Intercept) -3.8508252   -3.859614 -3.9361270 -3.7581585       1
#> 2   inequality  1.6795992    1.976401  0.4014574  2.0598792       1
#> 3       income  3.8990512    3.902687  3.8247336  4.0075751       1
#> 4     ideology -8.3435473   -8.333404 -8.4502791 -8.2766985       1
#> 5  (Intercept) -3.8130032   -3.749660 -4.5050084 -2.7374457       4
#> 6   inequality -0.6098537   -2.354290 -2.6987625  5.4618768       4
#> 7       income  4.9778185    4.776308  3.6026619  7.0462220       4
#> 8     ideology -5.1363714   -6.628349 -7.9231736  0.4146712       4
#> 9  (Intercept) -4.1094749   -4.034605 -4.6699483 -3.8500810       5
#> 10  inequality -1.6150049   -1.583848 -2.1899147 -1.4298019       5
#> 11      income  3.3949692    3.807179  0.1061649  3.9317688       5
#> 12    ideology -7.3069598   -8.304223 -8.5459941 -1.2372392       5
#> 13 (Intercept) -2.7642664   -3.752799 -4.1028144  1.3576243       7
#> 14  inequality -1.8505055   -1.991604 -2.2634982 -0.3813180       7
#> 15      income  3.1157268    4.080310 -1.8121934  4.4820476       7
#> 16    ideology -6.4502089   -8.116473 -8.9164903  3.0623095       7
#> 17 (Intercept) -2.9969480   -3.195679 -4.1504863 -1.8258749       9
#> 18  inequality  1.2620466    1.210754  0.1562227  2.6864306       9
#> 19      income  2.0307599    2.992239 -2.7150366  3.9775826       9
#> 20    ideology -7.8537355   -8.156346 -9.4268634 -5.3086520       9
#> 21 (Intercept) -2.8118313   -3.465384 -3.8990752  0.7209031      10
#> 22  inequality -1.6164010   -1.409314 -3.1191718 -1.0297680      10
#> 23      income  2.3078304    3.599242 -4.6264123  3.9325062      10
#> 24    ideology -7.6774134   -8.192444 -8.4266731 -4.4648880      10
#>
#> --------------------------------
#> Coefficients and clusters for context 3
#>
#>                Post.Mean Post.Median   HPD.lower  HPD.upper Cluster
#> 1  (Intercept) -1.50736723 -0.55398852 -9.57535319  5.0568153       1
#> 2   inequality -0.17440831 -0.02412360 -5.45599405  4.8790720       1
#> 3       income -2.63626822 -3.69800081 -8.17463363  3.8587637       1
#> 4     ideology  0.61498813  1.15448983 -5.79959247  4.4157765       1
#> 5  (Intercept)  0.44883979  0.40998951 -0.07548416  1.0820132       3
#> 6   inequality  0.01843115 -0.02895256 -0.38730503  0.5045607       3
#> 7       income -3.28732657 -3.34621408 -3.53823586 -2.4938661       3
#> 8     ideology  1.51530337  1.50109885  0.93528785  1.9591949       3
#> 9  (Intercept) -1.75425838 -1.73811077 -2.21158097 -1.4521884       4
#> 10  inequality -0.09797227 -0.02860764 -1.03270897  0.1358764       4
#> 11      income -3.28988798 -3.34992919 -3.63177050 -2.4613382       4
#> 12    ideology  1.23380538  1.19142051  1.05233989  1.5956233       4
#> 13 (Intercept) -0.20792714 -0.32803836 -0.87764404 -0.1176504       5
#> 14  inequality -0.61587268 -0.69178150 -1.21359296 -0.4394695       5
#> 15      income -4.35718369 -4.35827326 -4.61384573 -4.1245655       5
#> 16    ideology  0.32889861  0.44317796  0.11515730  1.7881627       5
#>
#> --------------------------------
#> Coefficients and clusters for context 4
#>
#>                Post.Mean  Post.Median   HPD.lower   HPD.upper Cluster
#> 1  (Intercept) -0.20181718  0.146005784 -3.89001067  1.21316926       1
#> 2   inequality -0.86428581 -1.094719571 -2.24969881  4.44579396       1
#> 3       income -2.25908252 -2.652068169 -4.42721732  2.46706764       1
#> 4     ideology -0.41386360 -0.064351304 -6.51780691  2.00942006       1
#> 5  (Intercept)  1.05384653  1.015728561  0.26466007  1.70891447       2
#> 6   inequality -0.99780553 -1.003502944 -1.14413300 -0.83636254       2
#> 7       income -2.53009250 -2.555908866 -2.90747311 -1.74957676       2
#> 8     ideology  0.22951716  0.131210799 -0.07568282  0.92556620       2
#> 9  (Intercept) -0.89252572 -0.768736315 -1.21016392 -0.46761110       3
#> 10  inequality -1.17908119 -1.221185196 -1.36694150 -0.97207171       3
#> 11      income -2.25089335 -2.134428976 -2.80868158 -1.98835575       3
#> 12    ideology -0.13686049 -0.116948999 -0.54723653  0.10373661       3
#> 13 (Intercept) -0.38128570 -0.343802217 -0.75693338  0.04738585       4
#> 14  inequality -1.14370069 -1.159222929 -1.37857461 -0.86958602       4
#> 15      income -2.88773203 -2.898615665 -3.14150593 -2.50697579       4
#> 16    ideology -0.11721832 -0.008659932 -0.98810644  0.47618514       4
#> 17 (Intercept) -0.09939041  0.342077677 -3.62305077  1.99814891       5
#> 18  inequality -1.29862724 -1.271922381 -2.52327117  0.38119463       5
#> 19      income -2.61597705 -2.483197201 -4.76438420  2.11035619       5
#> 20    ideology -1.14440054 -0.933366756 -3.03739425  0.78382085       5
#>
#> --------------------------------
#> Coefficients and clusters for context 2
#>
#>                Post.Mean Post.Median    HPD.lower  HPD.upper Cluster
#> 1  (Intercept) -0.39780706 -0.37809565 -0.779061996 -0.08694575       2
#> 2   inequality -1.31534899 -1.45838086 -1.966488480  0.19746934       2
#> 3       income -0.54347475 -0.53702891 -0.786955921 -0.29637545       2
#> 4     ideology -2.42271850 -2.42127509 -2.741445947 -1.94173523       2
#> 5  (Intercept)  0.76243202  0.89946840 -0.643099795  1.45156453       3
#> 6   inequality  1.62118686  1.66024836  1.110827597  2.27615000       3
#> 7       income  0.93654257  0.67746126  0.009106206  1.72531122       3
#> 8     ideology -2.55558755 -2.52379355 -3.344904028 -2.13508020       3
#> 9  (Intercept) -1.15212302 -1.32272073 -3.662128021  4.78558107       6
#> 10  inequality -0.89480477 -0.41952795 -6.698527871  2.43085273       6
#> 11      income  1.13532111  0.98438175 -0.951793411  3.40071745       6
#> 12    ideology -1.68021405 -1.70799230 -6.914250491  3.24176008       6
#> 13 (Intercept)  1.26181317  1.26942674  1.040925276  1.51558814       7
#> 14  inequality  0.80953306  0.84248475  0.231085561  1.06958125       7
#> 15      income  0.87106506  0.84954669  0.685260937  1.23096260       7
#> 16    ideology -1.52507078 -1.64658555 -1.807455273 -0.83657062       7
#> 17 (Intercept)  0.24636885  0.14258258 -0.583773465  1.95988339       8
#> 18  inequality -0.86642276 -0.85949975 -1.470426825 -0.34399141       8
#> 19      income -0.35318895 -0.40457542 -0.886326048  0.66112793       8
#> 20    ideology -1.88373346 -1.88212600 -2.372975397 -0.97347723       8
#> 21 (Intercept)  0.05388184  0.06101514 -0.495474888  0.73339434       9
#> 22  inequality -1.37448170 -1.45032499 -1.912511155  0.16795850       9
#> 23      income -0.76298729 -0.66805991 -1.861040125 -0.15009456       9
#> 24    ideology -2.00827911 -2.13122983 -2.529214836 -1.06262452       9
#>
#> --------------------------------
#> Coefficients and clusters for context 5
#>
#>                Post.Mean Post.Median   HPD.lower  HPD.upper Cluster
#> 1  (Intercept) -0.24240173 -0.22694438 -1.37392885  0.48264961       2
#> 2   inequality  1.17259165  1.10433414  0.11165336  2.54947605       2
#> 3       income -0.84368226 -0.70317826 -1.59194138  0.05606758       2
#> 4     ideology -2.45344228 -2.48267526 -2.99850367 -1.83030247       2
#> 5  (Intercept) -1.86511951 -1.79551079 -2.49660056 -1.43323752       4
#> 6   inequality -0.98377933 -0.93409981 -2.00748031 -0.48056738       4
#> 7       income -0.24914309 -0.02543453 -2.20809386  0.30870344       4
#> 8     ideology -1.49043107 -1.48673526 -1.85450353 -1.18101234       4
#> 9  (Intercept)  0.29352997  0.49635257 -1.30750781  1.02142811       5
#> 10  inequality  1.27350873  1.65669950 -0.69709761  1.85155475       5
#> 11      income  0.25310930 -0.09395183 -1.71661454  2.50254166       5
#> 12    ideology -1.49312313 -1.23333972 -4.79083142 -1.03114579       5
#> 13 (Intercept) -0.06073803 -0.08218137 -0.50728697  0.59054012       6
#> 14  inequality  0.79956869  0.85322344  0.09015116  1.33014352       6
#> 15      income -0.63267516 -0.25395468 -4.45622532  0.32483259       6
#> 16    ideology -1.72853382 -1.90779076 -2.31122456  0.06124412       6
#> 17 (Intercept)  0.05967276  0.21739651 -2.09959632  0.45102184       8
#> 18  inequality -2.97911217 -2.82584439 -5.05762731 -2.63972645       8
#> 19      income -0.06461465 -0.14988090 -0.45124027  1.49177503       8
#> 20    ideology -2.21842948 -2.27832015 -2.55764076 -1.50517650       8
#>
#> --------------------------------
#> Context-level coefficients:
#>                Description  Post.Mean HPD.lower HPD.upper
#> 1     Intercept of beta[0] -0.9709097 -3.305093 1.2161629
#> 2     Intercept of beta[1] -0.6572283 -2.469259 2.1801314
#> 3     Intercept of beta[2] -0.1530786 -3.084569 2.2066620
#> 4     Intercept of beta[3] -1.9637478 -4.986880 0.8479351
#> 5 Effect of gap on beta[0] -0.2784919 -2.840100 2.2355586
#> 6 Effect of gap on beta[1]  0.2859149 -2.078423 2.8677946
#> 7 Effect of gap on beta[2] -0.6870031 -2.855376 1.8402540
#> 8 Effect of gap on beta[3]  0.5121591 -1.653244 2.0679228
#>
#> --------------------------------
```

The `summary` contains more information now. As before, the column `k` indicates the estimated clusters. The column `j` indicates the country (context) of the estimated value for the respective cluster's coefficient. The second summary (`$tau`) shows the marginal effect of the context-level feature (`gap`). Details of the interpretation can be found in Ferrari (2020).
There are a series of built-in functions to visualize the output. The function `plot_tau()` displays the estimation of the effect of the context-level variables.
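A minimal call might look like this (a sketch, assuming the fitted `hdpGLM` object `mod` from the estimation above):

```r
## posterior distribution of the context-level effects (tau)
plot_tau(mod)
```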
The function `plot_pexp_beta()` displays the association between the context-level features and the latent heterogeneity in the effect of the linear coefficients in each context. The parameter `smooth.line` plots a line representing the linear association between the context-level feature (`gap`) and the posterior averages of the marginal effects in each cluster. The parameter `ncol.beta` controls the number of columns in the figure for the panels. For more options, see `help(plot_pexp_beta)`.
```r
plot_pexp_beta(mod, smooth.line = TRUE, ncol.beta = 2)
#>
#>
#> Generating plots ...
#> Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
#> of ggplot2 3.3.4.
#> ℹ The deprecated feature was likely used in the hdpGLM package.
#>   Please report the issue at <https://github.com/DiogoFerrari/hdpGLM/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> `geom_smooth()` using formula = 'y ~ x'
#> `geom_smooth()` using formula = 'y ~ x'
```