matthieugomez/statar
This package contains R functions corresponding to useful Stata commands.
The package includes:
- panel data functions (monthly/quarterly dates, lead/lag, fillin)
- data.frame functions (tabulate, merge)
- vector functions (xtile, pctile, winsorize)
- graph functions (binscatter)
`sum_up` prints detailed summary statistics (corresponds to Stata `summarize`).
```r
library(dplyr)   # tibble(), group_by(), %>%
library(statar)

N <- 100
df <- tibble(id = 1:N, v1 = sample(5, N, TRUE), v2 = sample(1e6, N, TRUE))
sum_up(df)
df %>% sum_up(starts_with("v"), d = TRUE)
df %>% group_by(v1) %>% sum_up()
```
`tab` prints distinct rows with their count. Compared to the dplyr function `count`, this command adds frequency, percent, and cumulative percent.
```r
N <- 1e2; K <- 10
df <- tibble(
  id = sample(c(NA, 1:5), N/K, TRUE),
  v1 = sample(1:5, N/K, TRUE)
)
tab(df, id)
tab(df, id, na.rm = TRUE)
tab(df, id, v1)
```
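To make the three extra columns concrete, here is a minimal base R sketch that computes a count, a percent, and a cumulative percent for a small fixed vector. It does not call `tab`; the names `tab_sketch`, `freq`, `percent`, and `cum` are illustrative assumptions, not the column names statar prints.

```r
# Illustration only: base R, not statar's implementation of tab().
id <- c(1, 1, 2, 3, 3, 3, 2, 2)

counts <- table(id)                      # number of rows per distinct value
tab_sketch <- data.frame(
  id      = as.numeric(names(counts)),
  freq    = as.integer(counts),
  percent = 100 * as.integer(counts) / length(id)
)
tab_sketch$cum <- cumsum(tab_sketch$percent)  # cumulative percent

print(tab_sketch)
```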
`join` is a wrapper for dplyr merge functionalities, with three added options:
- The option `check` checks that there are no duplicates in the master or using datasets (as in Stata).

  ```r
  # merge m:1 v1
  join(x, y, kind = "full", check = m~1)
  ```

- The option `gen` specifies the name of a new variable that identifies non-matched and matched rows (as in Stata).

  ```r
  # merge m:1 v1, gen(_merge)
  join(x, y, kind = "full", gen = "_merge")
  ```

- The option `update` updates missing values of the master dataset with the values in the using dataset.
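The idea behind `update` and `gen` can be sketched in base R without statar. This is only an illustration of the semantics described above, under the assumption that a master value is kept when present and the using value fills in otherwise; the data frames `master` and `using` and the column `source` are hypothetical.

```r
# Illustration only: base R merge, not statar's join().
master <- data.frame(id = c(1, 2, 3), v = c(10, NA, 30))
using  <- data.frame(id = c(2, 3, 4), v = c(20, 99, 40))

# Full merge on id, keeping both versions of v side by side.
merged <- merge(master, using, by = "id", all = TRUE,
                suffixes = c(".master", ".using"))

# "update" idea: keep the master value, fall back to using when it is missing.
merged$v <- ifelse(is.na(merged$v.master), merged$v.using, merged$v.master)

# "gen" idea: record where each row came from.
merged$source <- ifelse(!merged$id %in% using$id, "master only",
                 ifelse(!merged$id %in% master$id, "using only", "matched"))

merged[, c("id", "v", "source")]
```

Note that the non-missing master value for `id == 3` is left untouched; only the missing one for `id == 2` is filled.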
```r
# pctile computes quantile and weighted quantile of type 2 (similarly to Stata _pctile)
v <- c(NA, 1:10)
pctile(v, probs = c(0.3, 0.7), na.rm = TRUE)

# xtile creates integer variable for quantile categories (corresponds to Stata xtile)
v <- c(NA, 1:10)
xtile(v, n_quantiles = 3)      # 3 groups based on terciles
xtile(v, probs = c(0.3, 0.7))  # 3 groups based on two quantiles
xtile(v, cutpoints = c(2, 3))  # 3 groups based on two cutpoints

# winsorize (default based on 5 x interquartile range)
v <- c(1:4, 99)
winsorize(v)
winsorize(v, replace = NA)
winsorize(v, probs = c(0.01, 0.99))
winsorize(v, cutpoints = c(1, 50))
```
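As a rough illustration of what winsorizing at fixed cutpoints means, the sketch below clamps a vector to a lower and an upper bound in base R. It assumes that values outside the cutpoints are replaced by the nearest cutpoint; it is not statar's code, and the name `clamped` is hypothetical.

```r
# Illustration only: clamp values to [lower, upper] with base R.
v <- c(1:4, 99)
cutpoints <- c(1, 50)

# pmax raises anything below the lower bound, pmin lowers anything above the upper bound.
clamped <- pmin(pmax(v, cutpoints[1]), cutpoints[2])
clamped
```

Here only the outlier `99` is changed; it is pulled down to the upper cutpoint `50`.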
The classes "monthly" and "quarterly" print as dates and are compatible with the usual time extraction functions (i.e. `month`, `year`, etc.). Yet they are stored as integers representing the number of elapsed periods since 1970/01/01 (in months and quarters, respectively). This is particularly handy for simple algebra:
```r
# elapsed dates
library(lubridate)
date <- mdy(c("04/03/1992", "01/04/1992", "03/15/1992"))
datem <- as.monthly(date)
# displays as a period
datem
#> [1] "1992m04" "1992m01" "1992m03"
# behaves as an integer for numerical operations:
datem + 1
#> [1] "1992m05" "1992m02" "1992m04"
# behaves as a date for period extractions:
year(datem)
#> [1] 1992 1992 1992
```
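The "elapsed periods" representation can be sketched in base R. The snippet below counts whole months since January 1970 and formats the result back as a period label; it is an assumption about the encoding for illustration, not statar's internal code, and `elapsed_months` and `labels` are hypothetical names.

```r
# Illustration only: months elapsed since 1970-01, computed with base R.
dates <- as.Date(c("1992-04-03", "1992-01-04", "1992-03-15"))
lt <- as.POSIXlt(dates)

# lt$year counts years since 1900; lt$mon is zero-based (January is 0).
elapsed_months <- (lt$year + 1900 - 1970) * 12 + lt$mon

# Adding 1 to the integer moves each date forward by exactly one month.
next_month <- elapsed_months + 1
labels <- sprintf("%dm%02d", 1970 + next_month %/% 12, next_month %% 12 + 1)
labels
```

Because the stored value is a plain count, shifting by a number of periods is ordinary integer addition, with no need to handle month lengths or year boundaries.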
`tlag`/`tlead` lag or lead a vector with respect to a number of periods, not with respect to the number of rows.
```r
year <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)
tlag(value, 1, time = year)

library(lubridate)
date <- mdy(c("01/04/1992", "03/15/1992", "04/03/1992"))
datem <- as.monthly(date)
value <- c(4.1, 4.5, 3.3)
tlag(value, time = datem)
```
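The difference between lagging by period and lagging by row can be sketched in base R. The helper `lag_by_period` below is hypothetical and only illustrates the idea of looking up the value observed `n` periods earlier; it is not statar's implementation.

```r
# Illustration only: a period-based lag using match(), in base R.
lag_by_period <- function(x, n = 1, time) {
  # For each observation, find the position whose time equals (time - n);
  # match() returns NA when that period is absent, so the lag is NA too.
  x[match(time - n, time)]
}

year  <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)

by_period <- lag_by_period(value, 1, time = year)  # NA NA 4.5
by_row    <- c(NA, head(value, -1))                # NA 4.1 4.5
```

The row-based lag wrongly treats 1989 as the period before 1991, whereas the period-based lag returns a missing value because 1990 is not observed.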
In contrast to comparable functions in `zoo` and `xts`, these functions can be applied to any vector and be used within a `dplyr` chain:
```r
df <- tibble(
  id    = c(1, 1, 1, 2, 2),
  year  = c(1989, 1991, 1992, 1991, 1992),
  value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id) %>% mutate(value_l = tlag(value, time = year))
```
`is.panel` checks whether a dataset is a panel, i.e. the time variable is never missing and the combinations (id, time) are unique.
```r
df <- tibble(
  id1   = c(1, 1, 1, 2, 2),
  id2   = 1:5,
  year  = c(1991, 1993, NA, 1992, 1992),
  value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id1) %>% is.panel(year)
df1 <- df %>% filter(!is.na(year))
df1 %>% is.panel(year)
df1 %>% group_by(id1) %>% is.panel(year)
df1 %>% group_by(id1, id2) %>% is.panel(year)
```
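The two conditions stated above (no missing time, unique (id, time) combinations) can be written as a short base R check. The helper `is_panel_sketch` is hypothetical and only mirrors the definition; it is not how statar implements `is.panel`.

```r
# Illustration only: the panel definition as a base R predicate.
is_panel_sketch <- function(ids, time) {
  # ids: data frame of identifier columns; time: vector of periods.
  !anyNA(time) && !any(duplicated(data.frame(ids, time = time)))
}

id1  <- c(1, 1, 1, 2, 2)
id2  <- 1:5
year <- c(1991, 1993, NA, 1992, 1992)

with_na <- is_panel_sketch(data.frame(id1), year)   # FALSE: year has a missing value

keep <- !is.na(year)
one_id  <- is_panel_sketch(data.frame(id1 = id1[keep]), year[keep])
# FALSE: (id1 = 2, year = 1992) appears twice
two_ids <- is_panel_sketch(data.frame(id1 = id1[keep], id2 = id2[keep]), year[keep])
# TRUE: adding id2 makes every (id, time) combination unique
```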
`fill_gap` transforms an unbalanced panel into a balanced panel. It corresponds to the Stata command `tsfill`. Missing observations are added as rows with missing values.
```r
df <- tibble(
  id    = c(1, 1, 1, 2),
  datem = as.monthly(mdy(c("04/03/1992", "01/04/1992", "03/15/1992", "05/11/1992"))),
  value = c(4.1, 4.5, 3.3, 3.2)
)
df %>% group_by(id) %>% fill_gap(datem)
df %>% group_by(id) %>% fill_gap(datem, full = TRUE)
df %>% group_by(id) %>% fill_gap(datem, roll = "nearest")
```
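A rough base R sketch of gap filling is shown below. It assumes the time variable is an integer period and that each id is filled between its own first and last observation; that range is an assumption made for illustration, and the names `grid` and `filled` are hypothetical. It does not reproduce the `full` or `roll` options.

```r
# Illustration only: fill missing periods per id with base R.
df <- data.frame(
  id    = c(1, 1, 1, 2),
  t     = c(267, 264, 266, 268),   # integer periods, e.g. elapsed months
  value = c(4.1, 4.5, 3.3, 3.2)
)

# For each id, enumerate every period between its first and last observation.
grid <- do.call(rbind, lapply(split(df, df$id), function(g) {
  data.frame(id = g$id[1], t = seq(min(g$t), max(g$t)))
}))

# Left-join the observed rows onto the grid; unobserved periods get NA values.
filled <- merge(grid, df, by = c("id", "t"), all.x = TRUE)
filled
```

In this example id 1 gains one row for period 265 with a missing `value`, while id 2, observed only once, is unchanged.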
`stat_binmean()` (a `stat` for ggplot2) returns the mean of `y` and `x` within 20 bins of `x`. It is a barebones version of the Stata command `binscatter`.
```r
library(ggplot2)

ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) + stat_binmean()
# change number of bins
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + stat_binmean(n = 10)
# add regression line
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  stat_binmean() +
  stat_smooth(method = "lm", se = FALSE)
```
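The binned means themselves can be approximated in base R. The sketch below assumes equal-count (quantile) bins, as in Stata's `binscatter`; whether `stat_binmean()` bins exactly this way is an assumption, and the names `breaks`, `bin`, and `binned` are hypothetical.

```r
# Illustration only: mean of x and y within quantile bins of x, in base R.
x <- iris$Sepal.Width
y <- iris$Sepal.Length
n <- 20

# Quantile breakpoints; unique() drops duplicates caused by tied values.
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n + 1)))
bin <- cut(x, breaks = breaks, include.lowest = TRUE)

binned <- data.frame(
  x = as.vector(tapply(x, bin, mean)),
  y = as.vector(tapply(y, bin, mean))
)
binned <- binned[!is.na(binned$x), ]   # drop bins that ended up empty
binned
```

Plotting `binned$y` against `binned$x` gives the same kind of picture as the first `ggplot` call above, with at most `n` points.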
You can install