- Notifications
You must be signed in to change notification settings - Fork0
A collection of ggplot2 extensions and scripts for graphics in R
License
bdilday/grcdr
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This package provides a collection of customggplot2
extensions -Geoms
,Stats
,Coords
,Themes
, etc. The namegrcdr
is a contraction ofGraphs with Code and Data in R and is a reference to theGraphs with Code and Data blog, to which this package is loosely related.
You can installgrcdr
from github with:
# install.packages("devtools")devtools::install_github("bdilday/grcdr")
The below will usedplyr
andggplot2
library(grcdr)## load basicslibrary(dplyr)library(ggplot2)## set the themeggplot2::theme_set(theme_minimal(base_size=14))
geom_excursion
plots running quantities as connected scatter plots. It requires anx
andy
aesthetic and also at
aesthetic to give the ordering (i.e. "time")
Here's some example data provided with the package that gives team-level stats for 4 MLB teams (Cleveland 1999, New York Yankees 1998, Houston Astros 2017, Milwaukee Brewers 1982).
# load some example data from the packageteams_df= read.csv(system.file("extdata/team_stats.csv",package="grcdr"),stringsAsFactors=FALSE)teams_df$game_date= as.Date(teams_df$game_date)head(teams_df,2)#> k game_key game_source game_date game_number site_key#> 1 CLE_1999 ANA199904060 evt 1999-04-06 0 ANA01#> 2 CLE_1999 ANA199904070 evt 1999-04-07 0 ANA01#> season_phase team_alignment team_key opponent_key r_g r_w r_l r_t b_g#> 1 R 0 CLE ANA 1 0 1 0 1#> 2 R 0 CLE ANA 1 1 0 0 1#> b_pa b_ab b_r b_h b_tb b_2b b_3b b_hr b_hr4 b_rbi b_gw b_bb b_ibb b_so#> 1 41 36 5 10 16 1 1 1 0 4 NA 4 0 6#> 2 47 40 9 13 17 4 0 0 0 9 NA 6 0 7#> b_gdp b_hp b_sh b_sf b_sb b_cs b_xi b_lob p_g p_gs p_cg p_sho p_csho#> 1 0 0 1 0 0 0 0 9 1 1 0 0 0#> 2 1 1 0 0 2 0 0 11 1 1 0 0 0#> p_gf p_w p_l p_sv p_out p_tbf p_ab p_r season season_game_number#> 1 1 0 1 0 24 35 32 6 1999 1#> 2 1 1 0 0 27 34 31 1 1999 2
We can plot runs scored (b_r
) on the x-axis and runs allowed (p_r
) on the y-axis. The time coordinate is the season game number.
base_plot=teams_df %>% ggplot()+ labs(x="runs scored",y="runs allowed")
By default the data aren't averaged (the run length is 1)
p=base_plot+ geom_excursion(aes(x=b_r,y=p_r,t=season_game_number))+ facet_wrap(~k)print(p)
Therun_length
parameter controls how many items get summed. The following usesrun_length = 10
p=base_plot+ geom_excursion(aes(x=b_r,y=p_r,t=season_game_number),run_length=10)+ facet_wrap(~k)print(p)
If you givex_weight
ory_weight
aesthetics then weighted averages are computed instead of sums. Passing_weight = 1
therefore results in straight averages.
p=base_plot+ geom_excursion(aes(x=b_r,y=p_r,t=season_game_number,x_weight=1,y_weight=1),run_length=10)+ facet_wrap(~k)print(p)
Additionally, the time ordering can be changed
set.seed(101)random_idx= sample(1:nrow(teams_df), nrow(teams_df))# order by randomp=teams_df %>% cbind.data.frame(random_idx=random_idx) %>% ggplot()+ labs(x="runs scored",y="runs allowed")+ geom_excursion(aes(x=b_r,y=p_r,t=random_idx,x_weight=1,y_weight=1),run_length=10)+ facet_wrap(~k)print(p)
stat_run
is a lower-level utility thangeom_excursion
. It can change thegeom
(frompath
) and also can plot running line charts in addition to connected scatterplots.
Here I set the y aesthetic but not x, which is effectively a line chart.
p=teams_df %>% ggplot()+ stat_run(aes(y=b_r,t=season_game_number))+ facet_wrap(~k)+ labs(x="game number",y="runs scored")print(p)
However, unlike a traditional line chart, I can average or sum the y variable
p=teams_df %>% ggplot()+ stat_run(aes(y=b_r,t=season_game_number),run_length=10)+ facet_wrap(~k)+ labs(x="game number",y="runs scored")print(p)
The default is to generate running totals by using the cumulative sum function and taking differences. A different cumulative aggregation function can be specified, however. Note that following is for illustration and that the run_fun doesn't correspond to any particularly useful quantity(that I'm aware of, anyway).
fun_with_cumsum_fun=function(x) { cumsum(x* (x-1)* sin(x/5*pi))}p=teams_df %>% ggplot()+ stat_run(aes(y=p_r,t=season_game_number),run_length=10,y_run_fun=fun_with_cumsum_fun)+ facet_wrap(~k)+ labs(x="game number",y="runs scored")print(p)
Setting thex
aesthetic reproduces ageom_excursion
p=teams_df %>% ggplot()+ stat_run(aes(x=b_r,y=p_r,t=season_game_number),run_length=10)+ facet_wrap(~k)+ labs(x="runs scored",y="runs allowed")print(p)
If the time coordinate is missing values, we can fill in the corresponding x and y.
censored_df=teams_df %>% filter(season_game_number<40|season_game_number>60)p=censored_df %>% ggplot()+ stat_run(aes(y=b_r,t=season_game_number,x=season_game_number))+ facet_wrap(~k)print(p)
p=censored_df %>% ggplot()+ stat_run(aes(y=b_r,t=season_game_number,x=season_game_number),run_length=10)+ facet_wrap(~k)print(p)
p=censored_df %>% ggplot()+ stat_run(aes(y=b_r,t=season_game_number,x=season_game_number),run_length=10,y_run_fill_value=20,x_run_fill_value=50)+ facet_wrap(~k)print(p)
Thegeom_excursion
layer forces apath
Geom
, but the lower-levelstat_run
layer can change theGeom
. For example it can usepolygon
(although unclear what the interpretation is)
p=teams_df %>% ggplot()+ stat_run(aes(x=b_r,y=p_r,t=game_date),geom='polygon')+ facet_wrap(~k)print(p)
This stat applies dimensionality reduction using multi-dimensional scaling. As of this writing the available algorithms are principal components analysis (pca
) or t-distributed stochastic neighbor embedding (tsne
). The variables to use in the dimensionality reduction are passed in the aestheticsx#
where#
is an arbitrary integer. The defaultgeom
isGeomPoint
.
set.seed(101)df1=data.frame(x1= rnorm(100))for (iin2:10) {k= sprintf("x%d",i)df1[,k]= rnorm(100)}# now, for the last 25 add a constant to create two well separated groupsdf1[75:100, ]=df1[75:100,]+2
Use only 2 variables
set.seed(101)p=df1 %>% ggplot(aes(x1=x1,x2=x2))+ stat_mds(mds_method="pca")print(p)
Use them all
set.seed(101)p=df1 %>% ggplot(aes(x1=x1,x2=x2,x3=x3,x4=x4,x5=x5,x6=x6,x7=x7,x8=x8,x9=x9,x10=x10))+ stat_mds(mds_method="pca")print(p)
Use them all and label them
set.seed(101)p=df1 %>% mutate(rn=row_number()) %>% ggplot(aes(x1=x1,x2=x2,x3=x3,x4=x4,x5=x5,x6=x6,x7=x7,x8=x8,x9=x9,x10=x10))+ stat_mds(mds_method="pca",geom="text", aes(label=rn))print(p)
Apply t-SNE. This requires theRtsne
package.
set.seed(101)p=df1 %>% mutate(rn=row_number()) %>% ggplot(aes(x1=x1,x2=x2,x3=x3,x4=x4,x5=x5,x6=x6,x7=x7,x8=x8,x9=x9,x10=x10))+ stat_mds(mds_method="tsne",geom="text", aes(label=rn))print(p)#> Loading required package: Rtsne
This geom implements a tail scatter plot. It is inspired by thexenographics project. Thex
andy
aesthetics are points in a two-d plane. Subsequent variables are passed in aesthetics namedx#
where x is an arbitrary integer. They do not need to start at1
, however, the order will be interpreted lexigraphically. Thex#
variables are mapped to lines extending at an angle of-(15 + 30 * i)
degrees. This means that variables trying to use 12 or more variables in addition tox
andy
is not supported at this time and will result in lines that overlap.
Some simulated data
set.seed(101)df1=data.frame(x1= rnorm(100),x2= rnorm(100))df1$x3= with(df1,x1**2+ abs(x2))df1$x4=100*df1$x1**2# make a categorical vardf1$g=factor(sample(c(0,1),100,replace=TRUE))
Plot withgeom_tailscatter
p=df1 %>% ggplot(aes(x=x1,y=x2,x3=x3,x4=x4))+ geom_tailscatter(size=2)print(p)
The parametertail_scale
controls the length of the tail lines
p=df1 %>% ggplot(aes(x=x1,y=x2,x3=x3,x4=x4))+ geom_tailscatter(size=2,tail_scale=0.5)print(p)
Color by group
p=df1 %>% ggplot(aes(x=x1,y=x2,x3=x3,x4=x4,color=g))+ geom_tailscatter(size=2)print(p)
tsne_linked
is anhtmlwidget
. It takes a data set, projects it into 2-dimensions using thet-SNE
algorithm, and then plots a 2-d scatter plot. The points in the scatter plot are linked to a bar graph that shows the values of the coordinates that went into thet-SNE
calculation. The scatter plot uses a Voronoi tessellation to make the mouse-over highlighting smoother.
simulated data
set.seed(101)df1=data.frame(x1= rnorm(100),x2= rnorm(100))df1$x3= rnorm(100)df1$x4= rnorm(100)df1$id= row.names(df1)df1$g= ifelse(df1$x1>0,1,0)tsne_coords= c("x1","x2","x3","x4")
tsne_linked(df1,tsne_coords=tsne_coords,label_var="id",group_var="g")