Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

An R package for interpretable visualizations of bivariate density estimates

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md
NotificationsYou must be signed in to change notification settings

jamesotto852/ggdensity

Repository files navigation

check-releasetest-coverageCRAN_Status_BadgeCRAN_Download_Badge

ggdensity extendsggplot2 providing moreinterpretable visualizations of density estimates based on highestdensity regions (HDRs).ggdensity offers drop-in replacements forggplot2 functions:

  • instead ofggplot2::geom_density_2d_filled(), useggdensity::geom_hdr();
  • instead ofggplot2::geom_density_2d(), useggdensity::geom_hdr_lines().

Also included are the functionsgeom_hdr_fun() andgeom_hdr_lines_fun() for plotting HDRs of user-specified bivariateprobability density functions.

Installation

ggdensity is available on CRAN and can be installed with:

install.packages("ggdensity")

Alternatively, you can install the latest development version fromGitHub with:

if (!requireNamespace("remotes")) install.packages("remotes")remotes::install_github("jamesotto852/ggdensity")

geom_density_2d_filled() vs. geom_hdr()

The standard way to visualize the joint distribution of two continuousvariables inggplot2 is to useggplot2::geom_density_2d() orgeom_density_2d_filled(). Here’s an example:

library("ggplot2"); theme_set(theme_minimal())theme_update(panel.grid.minor= element_blank())library("ggdensity")library("patchwork")df<-data.frame("x"= rnorm(1000),"y"= rnorm(1000))p<- ggplot(df, aes(x,y))+ coord_equal()p+ geom_density_2d_filled()

While it’s a nice looking plot, it isn’t immediately clear how we shouldunderstand it. That’s becausegeom_density_2d_filled() generates itscontours as equidistant level sets of the estimated bivariate density,i.e. taking horizontal slices of the 3d surface at equally-spacedheights, and projecting the intersections down into the plane. So youget a general feel of where the density is high, but not much else. Tointerpret a contour, you would need to multiply its height by the areait bounds, which of course is very challenging to do by just looking atit.

geom_hdr() tries to get around this problem by presenting you withregions of the estimated distribution that are immediatelyinterpretable:

p+ geom_hdr()

probs here tells us the probability bounded by the correspondingregion, and the regions are computed to be the smallest such regionsthat bound that level of probability; these are called highest densityregions or HDRs. By default, the plotted regions show the$50%$,$80%$,$95%$, and$99%$ HDRs of the estimated density, but this canbe changed with theprobs argument togeom_hdr(). Notice that yourtake-away from the plot made withgeom_density_2d_filled() is subtlelyyet significantly different than that of the plot made bygeom_hdr().

Visualizing subpopulations andgeom_hdr_lines()

ggdensity’s functions were designed to be seamlessly consistent withthe rest of theggplot2 framework. As a consequence, pretty mucheverything you would expect to just work does. (Well, we hope!Let usknow if that’snot true.)

For example, becausegeom_hdr() maps probability to thealphaaesthetic, thefill andcolor aesthetics are available for mappingto variables. You can use them to visualize subpopulations in your data.For example, in thepenguins data frompalmerpenguins youmay want to look at how the relationship between bill length and flipperlength changes across different species of penguins. Here’s one way youcould look at that:

library("palmerpenguins")ggplot(penguins, aes(flipper_length_mm,bill_length_mm,fill=species))+  geom_hdr(xlim= c(160,240),ylim= c(30,70))+  geom_point(shape=21)

Nice, but a bit overplotted. To alleviate overplotting, we can usegeom_hdr_lines():

ggplot(penguins, aes(flipper_length_mm,bill_length_mm,color=species))+  geom_hdr_lines(xlim= c(160,240),ylim= c(30,70))+  geom_point(size=1)

Or you could facet the plot:

ggplot(penguins, aes(flipper_length_mm,bill_length_mm,fill=species))+  geom_hdr(xlim= c(160,240),ylim= c(30,70))+  geom_point(shape=21)+  facet_wrap(vars(species))

The main point here is that you should really think ofgeom_hdr() andgeom_hdr_lines() as drop-in replacements for functions likegeom_density_2d_filled(),geom_density2d(), and so on, and you canexpect all of the rest of theggplot2 stuff to just work.

A deeper cut illustratingggplot2 integration

The underlying stat used bygeom_hdr() creates the computed variableprobs that can be mapped in the standard way you map computedvariables inggplot2, withafter_stat().

For example,geom_hdr() andgeom_hdr_lines() mapprobs to thealpha aesthetic by default. But you can override it like this, just besure to override thealpha aesthetic by settingalpha = 1.

ggplot(faithful, aes(eruptions,waiting))+  geom_hdr(    aes(fill= after_stat(probs)),alpha=1,xlim= c(0,8),ylim= c(30,110)  )

ggplot(faithful, aes(eruptions,waiting))+  geom_hdr_lines(    aes(color= after_stat(probs)),alpha=1,xlim= c(0,8),ylim= c(30,110)  )

Statistics details

In addition to trying to make the visuals clean and the functions whatyou would expect as aggplot2 user, we’ve spent considerable effortin trying to ensure that the graphics you’re getting withggdensityare statistically rigorous and provide a range of estimation options formore detailed control.

To that end, you can pass amethod argument intogeom_hdr() andgeom_hdr_lines() that allows you to specify various nonparametric andparametric ways to estimate the underlying bivariate distribution, andwe have plans for even more. Each of the estimators below offersadvantages in certain contexts. For example, histogram estimators resultin HDRs that obey constrained supports. Normal estimators can be helpfulin providing simplified visuals that give the viewer a sense of wherethe distributions are, potentially at the expense of over-simplifyingand removing important features of how the variables (co-)vary.

Themethod argument may be specified either as a character vector(method = "kde") or as a function call (method = method_kde()). Whena function call is used, it may be possible to specify parametersgoverning the density estimation procedure. For example,method_kde()accepts parametersh andadjust, both related to the kernel’sbandwidth. For details see?method_kde orvignette("method", "ggdensity").

If you know your PDF

The above discussion has focused around densities that are estimatedfrom data. But in some instances, you have the distribution in the formof a function that encodes thejointPDF. Inthose circumstances, you can usegeom_hdr_fun() andgeom_hdr_lines_fun() to make the analogous plots. These functionsbehave similarly togeom_function() fromggplot2, accepting theargumentfun specifying the pdf to be summarized. Here’s an example:

f<-function(x,y) dnorm(x)* dgamma(y,5,3)ggplot()+  geom_hdr_fun(fun=f,xlim= c(-4,4),ylim= c(0,5))

Visualizing custom parametric density estimates withgeom_hdr_fun()

In addition to all of the methods of density estimation available withgeom_hdr(), one of the perks of havinggeom_hdr_fun() is that itallows you to plot parametric densities that you estimate outside theggdensity framework. The basic idea is that you fit yourdistribution outsideggdensity calls with your method of choice, saymaximum likelihood, and then plug the maximum likelihood estimate intothe density formula to obtain a function to plug intogeom_hdr_fun().

Here’s an example of how you can do that that assuming that theunderlying data are independent and exponentially distributed withunknown rates.

set.seed(123)th<- c(3,5)df<-data.frame("x"= rexp(1000,th[1]),"y"= rexp(1000,th[2]))# construct the likelihood functionl<-function(th) {log_liks<- apply(df,1,function(xy) {    dexp(xy[1],rate=th[1],log=TRUE)+    dexp(xy[2],rate=th[2],log=TRUE)  })  sum(log_liks)}# compute the mle(th_hat<- optim(c(2,2),l,control=list(fnscale=-1))$par)#> [1] 2.912736 5.032125# construct the parametric density estimatef<-function(x,y,th) dexp(x,th[1])* dexp(y,th[2])# pass estimated density into geom_hdr_fun()ggplot(df, aes(x,y))+  geom_hdr_fun(fun=f,args=list(th=th_hat))+  geom_point(shape=21,fill="lightgreen",alpha=.25)+  coord_equal()

Other perks

geom_hdr_points()

Inspired byggpointdensity,ggdensity provides a scatterplot geom whereby the individual datapoints can be seen simultaneously with HDRs. This is most useful insituations with significant overplotting.

p_points<- ggplot(diamonds, aes(carat,price))+  geom_point()p_hdr_points<- ggplot(diamonds, aes(carat,price))+  geom_hdr_points()p_points+p_hdr_points

geom_hdr_rug()

Rug plots are standard additions to plots with densities:

ggplot(cars, aes(speed,dist))+  geom_density_2d()+  geom_point()+  geom_rug()

With HDRs, these can be used to visualize joint and marginal HDRssimultaneously. The marginal HDRs are computed off of only thecorrespondingx andy aesthetic variables. Note that these can besubstantially different: the joint HDR isnot theproduct of themarginal HDRs.

ggplot(cars, aes(speed,dist))+  geom_hdr()+  geom_point(color="red")+  geom_hdr_rug()

Likegeom_rug(), these can be placed on different sides of the object:

ggplot(cars, aes(speed,dist))+  geom_hdr()+  geom_point(color="red")+  geom_hdr_rug(sides="tr",outside=TRUE)+  coord_cartesian(clip="off")

We sometimes find it easier to view if the rug intervals are colored:

ggplot(cars, aes(speed,dist))+  geom_hdr()+  geom_point(color="red")+  geom_hdr_rug(aes(fill= after_stat(probs)),length= unit(.2,"cm"),alpha=1)+   scale_fill_viridis_d(option="magma",begin=.8,end=0)

The secondprobs guide is currently a bug. As a work around, you cansolve it by addingguides(alpha = "none"). Note also the use oflength = unit(.2, "cm"), this allows us to make the thickness the sameon both axes and reasonable on the plot. (Compare those rug plots tothose on the previous graphic.)

Numerical summaries of HDRs

It is possible to access numerical summaries of the estimated densitiesand HDRs computed byggdensity withget_hdr():

df<-data.frame(x= rnorm(1e3),y= rnorm(1e3))res<- get_hdr(df,method="kde")str(res)#> List of 3#>  $ df_est:'data.frame':  10000 obs. of  5 variables:#>   ..$ x               : num [1:10000] -3.05 -2.99 -2.93 -2.86 -2.8 ...#>   ..$ y               : num [1:10000] -3.13 -3.13 -3.13 -3.13 -3.13 ...#>   ..$ fhat            : num [1:10000] 1.58e-09 4.49e-09 1.30e-08 3.66e-08 9.83e-08 ...#>   ..$ fhat_discretized: num [1:10000] 6.43e-12 1.83e-11 5.29e-11 1.49e-10 4.00e-10 ...#>   ..$ hdr             : num [1:10000] 1 1 1 1 1 1 1 1 1 1 ...#>  $ breaks: Named num [1:5] 0.00257 0.00887 0.02929 0.07574 Inf#>   ..- attr(*, "names")= chr [1:5] "99%" "95%" "80%" "50%" ...#>  $ data  :'data.frame':  1000 obs. of  3 variables:#>   ..$ x             : num [1:1000] -0.817 -2.463 -1.343 0.136 0.883 ...#>   ..$ y             : num [1:1000] -0.5277 -1.4411 -1.9568 0.0287 1.5382 ...#>   ..$ hdr_membership: num [1:1000] 0.5 0.99 0.95 0.5 0.8 0.99 0.8 0.95 0.5 0.5 ...

Similarly, there isget_hdr_1d() for univariate data:

x<- rnorm(1e3)res<- get_hdr_1d(x,method="kde")str(res)#> List of 3#>  $ df_est:'data.frame':  512 obs. of  4 variables:#>   ..$ x               : num [1:512] -2.89 -2.88 -2.86 -2.85 -2.84 ...#>   ..$ fhat            : num [1:512] 0.00441 0.0046 0.00479 0.00499 0.0052 ...#>   ..$ fhat_discretized: num [1:512] 5.46e-05 5.70e-05 5.94e-05 6.19e-05 6.45e-05 ...#>   ..$ hdr             : num [1:512] 1 1 1 1 1 1 1 1 1 1 ...#>  $ breaks: Named num [1:5] 0.0141 0.0563 0.1757 0.317 Inf#>   ..- attr(*, "names")= chr [1:5] "99%" "95%" "80%" "50%" ...#>  $ data  :'data.frame':  1000 obs. of  2 variables:#>   ..$ x             : num [1:1000] -0.4301 -1.5792 0.1929 -0.4973 -0.0859 ...#>   ..$ hdr_membership: num [1:1000] 0.5 0.95 0.5 0.5 0.5 0.5 0.8 0.5 0.5 0.99 ...

For details on the objects returned by these functions, see?get_hdrand?get_hdr_1d.

A caveat and recommendation for cropped HDRs

geom_hdr() and related functions were written with the intent ofplaying nicely withggplot2, so that what the typicalggplot2user would expect from the rest of theggplot2 ecosystem would workin the same way withggdensity.

One place where the effect isn’t ideal is in the limits of thex andy scales. Without getting into too much detail, these key off of theobserved points themselves, and not properties of the estimated density.This is consistent withgeom_density_2d() andstat_smooth(), forexample: computed aesthetics don’t extend past the range of the data.

One potential danger here is that the estimated HDRs are computed basedon not the estimated density directly, but a discretization of it. Thisis how all non-parametric density estimation in R works,e.g. MASS::kde2d(), and most parametric density estimation, too. Inother words: the density estimate itself is only known at points on agrid over thex-y aesthetic space. As a consequence, if that rangeis too small, it’s possible that a probabilistically non-trivialproportion of the density is excluded from the computations, biasing theresulting HDRs.

The punch line is that whenever you see an HDR getting truncated by thewindow of the plot, it’s probably a good idea to manually increase theaesthetic limits and usecoord_cartesian() to zoom in as needed.Here’s an example using the previously created graphic. The limits giventocoord_cartesian() and the call toscale_y_continuous() is simplyan effort to make the third plot comparable to the first.

Note: The support of the data isn’t respected here-the estimateddensity doesn’t know speed can’t go negative. That’s not an artifact ofthe effect described above, that’s just because that’s what KDE’s do.

p1<- ggplot(cars, aes(speed,dist))+  geom_hdr()+  geom_point(color="red")+  guides(alpha="none")+  ggtitle("Default geom_hdr()")p2<- ggplot(cars, aes(speed,dist))+  geom_hdr(xlim= c(-20,50),ylim= c(-40,140))+  geom_point(color="red")+  guides(alpha="none")+  ggtitle("Manually set xlim, ylim")p3<- ggplot(cars, aes(speed,dist))+  geom_hdr(xlim= c(-20,50),ylim= c(-40,140))+  geom_point(color="red")+  guides(alpha="none")+  scale_y_continuous(breaks=25*(0:5))+  coord_cartesian(xlim= c(4,25),ylim= c(-1,120))+   ggtitle("Zoom with coord_cartesian()")(p1/p2/p3)& theme(title= element_text(size=9))

Related projects

There are a few other great packages out there you should know about ifyou’re interested inggdensity.

Theggdist package providesseveral flexible geoms for visualizing distributions of data, mostlyunivariate data.

Thehdrcde packageallows you to make bivariate HDR plots as well. At the surface, the maindifference is thathdrcde doesn’t useggplot2 graphics; however,under the hood there are many more differences. (More coming onexplaining these discrepancies.)

The code illustrating the two strategies is quite simple, but trying tomake the graphics more directly comparable requires some effort. Here’sa pretty good rendition on thefaithful dataset, which has 272observations.

p_hdr_scale<- ggplot(faithful, aes(eruptions,waiting))+  geom_hdr(xlim=scales::expand_range(range(faithful$eruptions),mul=.25),ylim=scales::expand_range(range(faithful$waiting),mul=.25)  )+  geom_point(color="red")+  scale_x_continuous(breaks=0:6)+  scale_y_continuous(breaks= (3:10)*10)+  guides(alpha="none")den<- with(faithful,MASS::kde2d(eruptions,waiting,n=100,lims= c(0,6,30,105)))if (!requireNamespace("hdrcde")) install.packages("hdrcde")library("hdrcde")p_den<-~ with(faithful,  plot(    hdr.2d(eruptions,waiting,prob= c(50,80,95,99),den=den),pointcol="red",show.points=TRUE,xlim= c(0,6),ylim= c(30,105)  ))par(mar= c(0,1.75,0,0),bg=NA)p_hdr_scale+  coord_cartesian(xlim= c(0,6),ylim= c(30,105),expand=FALSE)+  wrap_elements(panel=p_den,clip=FALSE)

These look quite different, and they are. It’s worth noting that evenwithinhdrcde there is variability as well:

par(mar= c(3,3,1,1)+0.1,mfrow= c(1,2))with(faithful,  plot(    hdr.2d(eruptions,waiting,prob= c(50,80,95,99),kde.package="ash",xextend=.20),pointcol="red",show.points=TRUE,xlim= c(0,6),ylim= c(30,105)  ))with(faithful,  plot(    hdr.2d(eruptions,waiting,prob= c(50,80,95,99),kde.package="ks",xextend=.20),pointcol="red",show.points=TRUE,xlim= c(0,6),ylim= c(30,105)  ))

gghdr is somewhat of aggplot2 port ofhdrcde, developed by some of the same teammembers. In some ways, it’s very similar toggdensity. For example,it contains a functiongghdr::geom_hdr_rug() that does effectively thesame asggdensity::geom_hdr_rug(); it implements a kind ofggdensity::geom_hdr_pointdensity() via a functiongghdr::hdr_bin()plus the color aesthetic togeom_point(); and it provides a boxplotalternativegghdr::geom_hdr_boxplot(). To the extent the similaritiesbetweenggdensity andhdrcde/gghdr exist (and they obviouslydo), they are an example ofconvergentevolution. Thepresent authors only discovered those projects after writing most ofggdensity, unfortunately. Interestingly, we also had designs on theCDE part as well (“conditional density estimation”, think models);however had not implemented it before seeinghdrcde. You can expectthose to come down the road.

Perhaps the most important difference betweenggdensity andgghdr is that the latter doesn’t implement bivariate HDRs in theggplot2 framework, which was the original motivation ofggdensity. For that purpose, it seems the only project available isggdensity.

About

An R package for interpretable visualizations of bivariate density estimates

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages


[8]ページ先頭

©2009-2025 Movatter.jp