- Notifications
You must be signed in to change notification settings - Fork14
An R package for interpretable visualizations of bivariate density estimates
License
Unknown, MIT licenses found
Licenses found
jamesotto852/ggdensity
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
ggdensity extendsggplot2 providing moreinterpretable visualizations of density estimates based on highestdensity regions (HDRs).ggdensity offers drop-in replacements forggplot2 functions:
- instead of
ggplot2::geom_density_2d_filled(), useggdensity::geom_hdr(); - instead of
ggplot2::geom_density_2d(), useggdensity::geom_hdr_lines().
Also included are the functionsgeom_hdr_fun() andgeom_hdr_lines_fun() for plotting HDRs of user-specified bivariateprobability density functions.
ggdensity is available on CRAN and can be installed with:
install.packages("ggdensity")Alternatively, you can install the latest development version fromGitHub with:
if (!requireNamespace("remotes")) install.packages("remotes")remotes::install_github("jamesotto852/ggdensity")
The standard way to visualize the joint distribution of two continuousvariables inggplot2 is to useggplot2::geom_density_2d() orgeom_density_2d_filled(). Here’s an example:
library("ggplot2"); theme_set(theme_minimal())theme_update(panel.grid.minor= element_blank())library("ggdensity")library("patchwork")df<-data.frame("x"= rnorm(1000),"y"= rnorm(1000))p<- ggplot(df, aes(x,y))+ coord_equal()p+ geom_density_2d_filled()
While it’s a nice looking plot, it isn’t immediately clear how we shouldunderstand it. That’s becausegeom_density_2d_filled() generates itscontours as equidistant level sets of the estimated bivariate density,i.e. taking horizontal slices of the 3d surface at equally-spacedheights, and projecting the intersections down into the plane. So youget a general feel of where the density is high, but not much else. Tointerpret a contour, you would need to multiply its height by the areait bounds, which of course is very challenging to do by just looking atit.
geom_hdr() tries to get around this problem by presenting you withregions of the estimated distribution that are immediatelyinterpretable:
p+ geom_hdr()
probs here tells us the probability bounded by the correspondingregion, and the regions are computed to be the smallest such regionsthat bound that level of probability; these are called highest densityregions or HDRs. By default, the plotted regions show theprobs argument togeom_hdr(). Notice that yourtake-away from the plot made withgeom_density_2d_filled() is subtlelyyet significantly different than that of the plot made bygeom_hdr().
ggdensity’s functions were designed to be seamlessly consistent withthe rest of theggplot2 framework. As a consequence, pretty mucheverything you would expect to just work does. (Well, we hope!Let usknow if that’snot true.)
For example, becausegeom_hdr() maps probability to thealphaaesthetic, thefill andcolor aesthetics are available for mappingto variables. You can use them to visualize subpopulations in your data.For example, in thepenguins data frompalmerpenguins youmay want to look at how the relationship between bill length and flipperlength changes across different species of penguins. Here’s one way youcould look at that:
library("palmerpenguins")ggplot(penguins, aes(flipper_length_mm,bill_length_mm,fill=species))+ geom_hdr(xlim= c(160,240),ylim= c(30,70))+ geom_point(shape=21)
Nice, but a bit overplotted. To alleviate overplotting, we can usegeom_hdr_lines():
ggplot(penguins, aes(flipper_length_mm,bill_length_mm,color=species))+ geom_hdr_lines(xlim= c(160,240),ylim= c(30,70))+ geom_point(size=1)
Or you could facet the plot:
ggplot(penguins, aes(flipper_length_mm,bill_length_mm,fill=species))+ geom_hdr(xlim= c(160,240),ylim= c(30,70))+ geom_point(shape=21)+ facet_wrap(vars(species))
The main point here is that you should really think ofgeom_hdr() andgeom_hdr_lines() as drop-in replacements for functions likegeom_density_2d_filled(),geom_density2d(), and so on, and you canexpect all of the rest of theggplot2 stuff to just work.
The underlying stat used bygeom_hdr() creates the computed variableprobs that can be mapped in the standard way you map computedvariables inggplot2, withafter_stat().
For example,geom_hdr() andgeom_hdr_lines() mapprobs to thealpha aesthetic by default. But you can override it like this, just besure to override thealpha aesthetic by settingalpha = 1.
ggplot(faithful, aes(eruptions,waiting))+ geom_hdr( aes(fill= after_stat(probs)),alpha=1,xlim= c(0,8),ylim= c(30,110) )
ggplot(faithful, aes(eruptions,waiting))+ geom_hdr_lines( aes(color= after_stat(probs)),alpha=1,xlim= c(0,8),ylim= c(30,110) )
In addition to trying to make the visuals clean and the functions whatyou would expect as aggplot2 user, we’ve spent considerable effortin trying to ensure that the graphics you’re getting withggdensityare statistically rigorous and provide a range of estimation options formore detailed control.
To that end, you can pass amethod argument intogeom_hdr() andgeom_hdr_lines() that allows you to specify various nonparametric andparametric ways to estimate the underlying bivariate distribution, andwe have plans for even more. Each of the estimators below offersadvantages in certain contexts. For example, histogram estimators resultin HDRs that obey constrained supports. Normal estimators can be helpfulin providing simplified visuals that give the viewer a sense of wherethe distributions are, potentially at the expense of over-simplifyingand removing important features of how the variables (co-)vary.
Themethod argument may be specified either as a character vector(method = "kde") or as a function call (method = method_kde()). Whena function call is used, it may be possible to specify parametersgoverning the density estimation procedure. For example,method_kde()accepts parametersh andadjust, both related to the kernel’sbandwidth. For details see?method_kde orvignette("method", "ggdensity").
The above discussion has focused around densities that are estimatedfrom data. But in some instances, you have the distribution in the formof a function that encodes thejointPDF. Inthose circumstances, you can usegeom_hdr_fun() andgeom_hdr_lines_fun() to make the analogous plots. These functionsbehave similarly togeom_function() fromggplot2, accepting theargumentfun specifying the pdf to be summarized. Here’s an example:
f<-function(x,y) dnorm(x)* dgamma(y,5,3)ggplot()+ geom_hdr_fun(fun=f,xlim= c(-4,4),ylim= c(0,5))
In addition to all of the methods of density estimation available withgeom_hdr(), one of the perks of havinggeom_hdr_fun() is that itallows you to plot parametric densities that you estimate outside theggdensity framework. The basic idea is that you fit yourdistribution outsideggdensity calls with your method of choice, saymaximum likelihood, and then plug the maximum likelihood estimate intothe density formula to obtain a function to plug intogeom_hdr_fun().
Here’s an example of how you can do that that assuming that theunderlying data are independent and exponentially distributed withunknown rates.
set.seed(123)th<- c(3,5)df<-data.frame("x"= rexp(1000,th[1]),"y"= rexp(1000,th[2]))# construct the likelihood functionl<-function(th) {log_liks<- apply(df,1,function(xy) { dexp(xy[1],rate=th[1],log=TRUE)+ dexp(xy[2],rate=th[2],log=TRUE) }) sum(log_liks)}# compute the mle(th_hat<- optim(c(2,2),l,control=list(fnscale=-1))$par)#> [1] 2.912736 5.032125# construct the parametric density estimatef<-function(x,y,th) dexp(x,th[1])* dexp(y,th[2])# pass estimated density into geom_hdr_fun()ggplot(df, aes(x,y))+ geom_hdr_fun(fun=f,args=list(th=th_hat))+ geom_point(shape=21,fill="lightgreen",alpha=.25)+ coord_equal()
Inspired byggpointdensity,ggdensity provides a scatterplot geom whereby the individual datapoints can be seen simultaneously with HDRs. This is most useful insituations with significant overplotting.
p_points<- ggplot(diamonds, aes(carat,price))+ geom_point()p_hdr_points<- ggplot(diamonds, aes(carat,price))+ geom_hdr_points()p_points+p_hdr_points
Rug plots are standard additions to plots with densities:
ggplot(cars, aes(speed,dist))+ geom_density_2d()+ geom_point()+ geom_rug()
With HDRs, these can be used to visualize joint and marginal HDRssimultaneously. The marginal HDRs are computed off of only thecorrespondingx andy aesthetic variables. Note that these can besubstantially different: the joint HDR isnot theproduct of themarginal HDRs.
ggplot(cars, aes(speed,dist))+ geom_hdr()+ geom_point(color="red")+ geom_hdr_rug()
Likegeom_rug(), these can be placed on different sides of the object:
ggplot(cars, aes(speed,dist))+ geom_hdr()+ geom_point(color="red")+ geom_hdr_rug(sides="tr",outside=TRUE)+ coord_cartesian(clip="off")
We sometimes find it easier to view if the rug intervals are colored:
ggplot(cars, aes(speed,dist))+ geom_hdr()+ geom_point(color="red")+ geom_hdr_rug(aes(fill= after_stat(probs)),length= unit(.2,"cm"),alpha=1)+ scale_fill_viridis_d(option="magma",begin=.8,end=0)
The secondprobs guide is currently a bug. As a work around, you cansolve it by addingguides(alpha = "none"). Note also the use oflength = unit(.2, "cm"), this allows us to make the thickness the sameon both axes and reasonable on the plot. (Compare those rug plots tothose on the previous graphic.)
It is possible to access numerical summaries of the estimated densitiesand HDRs computed byggdensity withget_hdr():
df<-data.frame(x= rnorm(1e3),y= rnorm(1e3))res<- get_hdr(df,method="kde")str(res)#> List of 3#> $ df_est:'data.frame': 10000 obs. of 5 variables:#> ..$ x : num [1:10000] -3.05 -2.99 -2.93 -2.86 -2.8 ...#> ..$ y : num [1:10000] -3.13 -3.13 -3.13 -3.13 -3.13 ...#> ..$ fhat : num [1:10000] 1.58e-09 4.49e-09 1.30e-08 3.66e-08 9.83e-08 ...#> ..$ fhat_discretized: num [1:10000] 6.43e-12 1.83e-11 5.29e-11 1.49e-10 4.00e-10 ...#> ..$ hdr : num [1:10000] 1 1 1 1 1 1 1 1 1 1 ...#> $ breaks: Named num [1:5] 0.00257 0.00887 0.02929 0.07574 Inf#> ..- attr(*, "names")= chr [1:5] "99%" "95%" "80%" "50%" ...#> $ data :'data.frame': 1000 obs. of 3 variables:#> ..$ x : num [1:1000] -0.817 -2.463 -1.343 0.136 0.883 ...#> ..$ y : num [1:1000] -0.5277 -1.4411 -1.9568 0.0287 1.5382 ...#> ..$ hdr_membership: num [1:1000] 0.5 0.99 0.95 0.5 0.8 0.99 0.8 0.95 0.5 0.5 ...
Similarly, there isget_hdr_1d() for univariate data:
x<- rnorm(1e3)res<- get_hdr_1d(x,method="kde")str(res)#> List of 3#> $ df_est:'data.frame': 512 obs. of 4 variables:#> ..$ x : num [1:512] -2.89 -2.88 -2.86 -2.85 -2.84 ...#> ..$ fhat : num [1:512] 0.00441 0.0046 0.00479 0.00499 0.0052 ...#> ..$ fhat_discretized: num [1:512] 5.46e-05 5.70e-05 5.94e-05 6.19e-05 6.45e-05 ...#> ..$ hdr : num [1:512] 1 1 1 1 1 1 1 1 1 1 ...#> $ breaks: Named num [1:5] 0.0141 0.0563 0.1757 0.317 Inf#> ..- attr(*, "names")= chr [1:5] "99%" "95%" "80%" "50%" ...#> $ data :'data.frame': 1000 obs. of 2 variables:#> ..$ x : num [1:1000] -0.4301 -1.5792 0.1929 -0.4973 -0.0859 ...#> ..$ hdr_membership: num [1:1000] 0.5 0.95 0.5 0.5 0.5 0.5 0.8 0.5 0.5 0.99 ...
For details on the objects returned by these functions, see?get_hdrand?get_hdr_1d.
geom_hdr() and related functions were written with the intent ofplaying nicely withggplot2, so that what the typicalggplot2user would expect from the rest of theggplot2 ecosystem would workin the same way withggdensity.
One place where the effect isn’t ideal is in the limits of thex andy scales. Without getting into too much detail, these key off of theobserved points themselves, and not properties of the estimated density.This is consistent withgeom_density_2d() andstat_smooth(), forexample: computed aesthetics don’t extend past the range of the data.
One potential danger here is that the estimated HDRs are computed basedon not the estimated density directly, but a discretization of it. Thisis how all non-parametric density estimation in R works,e.g. MASS::kde2d(), and most parametric density estimation, too. Inother words: the density estimate itself is only known at points on agrid over thex-y aesthetic space. As a consequence, if that rangeis too small, it’s possible that a probabilistically non-trivialproportion of the density is excluded from the computations, biasing theresulting HDRs.
The punch line is that whenever you see an HDR getting truncated by thewindow of the plot, it’s probably a good idea to manually increase theaesthetic limits and usecoord_cartesian() to zoom in as needed.Here’s an example using the previously created graphic. The limits giventocoord_cartesian() and the call toscale_y_continuous() is simplyan effort to make the third plot comparable to the first.
Note: The support of the data isn’t respected here-the estimateddensity doesn’t know speed can’t go negative. That’s not an artifact ofthe effect described above, that’s just because that’s what KDE’s do.
p1<- ggplot(cars, aes(speed,dist))+ geom_hdr()+ geom_point(color="red")+ guides(alpha="none")+ ggtitle("Default geom_hdr()")p2<- ggplot(cars, aes(speed,dist))+ geom_hdr(xlim= c(-20,50),ylim= c(-40,140))+ geom_point(color="red")+ guides(alpha="none")+ ggtitle("Manually set xlim, ylim")p3<- ggplot(cars, aes(speed,dist))+ geom_hdr(xlim= c(-20,50),ylim= c(-40,140))+ geom_point(color="red")+ guides(alpha="none")+ scale_y_continuous(breaks=25*(0:5))+ coord_cartesian(xlim= c(4,25),ylim= c(-1,120))+ ggtitle("Zoom with coord_cartesian()")(p1/p2/p3)& theme(title= element_text(size=9))
There are a few other great packages out there you should know about ifyou’re interested inggdensity.
Theggdist package providesseveral flexible geoms for visualizing distributions of data, mostlyunivariate data.
Thehdrcde packageallows you to make bivariate HDR plots as well. At the surface, the maindifference is thathdrcde doesn’t useggplot2 graphics; however,under the hood there are many more differences. (More coming onexplaining these discrepancies.)
The code illustrating the two strategies is quite simple, but trying tomake the graphics more directly comparable requires some effort. Here’sa pretty good rendition on thefaithful dataset, which has 272observations.
p_hdr_scale<- ggplot(faithful, aes(eruptions,waiting))+ geom_hdr(xlim=scales::expand_range(range(faithful$eruptions),mul=.25),ylim=scales::expand_range(range(faithful$waiting),mul=.25) )+ geom_point(color="red")+ scale_x_continuous(breaks=0:6)+ scale_y_continuous(breaks= (3:10)*10)+ guides(alpha="none")den<- with(faithful,MASS::kde2d(eruptions,waiting,n=100,lims= c(0,6,30,105)))if (!requireNamespace("hdrcde")) install.packages("hdrcde")library("hdrcde")p_den<-~ with(faithful, plot( hdr.2d(eruptions,waiting,prob= c(50,80,95,99),den=den),pointcol="red",show.points=TRUE,xlim= c(0,6),ylim= c(30,105) ))par(mar= c(0,1.75,0,0),bg=NA)p_hdr_scale+ coord_cartesian(xlim= c(0,6),ylim= c(30,105),expand=FALSE)+ wrap_elements(panel=p_den,clip=FALSE)
These look quite different, and they are. It’s worth noting that evenwithinhdrcde there is variability as well:
par(mar= c(3,3,1,1)+0.1,mfrow= c(1,2))with(faithful, plot( hdr.2d(eruptions,waiting,prob= c(50,80,95,99),kde.package="ash",xextend=.20),pointcol="red",show.points=TRUE,xlim= c(0,6),ylim= c(30,105) ))with(faithful, plot( hdr.2d(eruptions,waiting,prob= c(50,80,95,99),kde.package="ks",xextend=.20),pointcol="red",show.points=TRUE,xlim= c(0,6),ylim= c(30,105) ))
gghdr is somewhat of aggplot2 port ofhdrcde, developed by some of the same teammembers. In some ways, it’s very similar toggdensity. For example,it contains a functiongghdr::geom_hdr_rug() that does effectively thesame asggdensity::geom_hdr_rug(); it implements a kind ofggdensity::geom_hdr_pointdensity() via a functiongghdr::hdr_bin()plus the color aesthetic togeom_point(); and it provides a boxplotalternativegghdr::geom_hdr_boxplot(). To the extent the similaritiesbetweenggdensity andhdrcde/gghdr exist (and they obviouslydo), they are an example ofconvergentevolution. Thepresent authors only discovered those projects after writing most ofggdensity, unfortunately. Interestingly, we also had designs on theCDE part as well (“conditional density estimation”, think models);however had not implemented it before seeinghdrcde. You can expectthose to come down the road.
Perhaps the most important difference betweenggdensity andgghdr is that the latter doesn’t implement bivariate HDRs in theggplot2 framework, which was the original motivation ofggdensity. For that purpose, it seems the only project available isggdensity.
About
An R package for interpretable visualizations of bivariate density estimates
Resources
License
Unknown, MIT licenses found
Licenses found
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Releases
Packages0
Uh oh!
There was an error while loading.Please reload this page.


















