Faster implementation of Stata's collapse, reshape, xtile, egen, isid, and more using C plugins
mcaceresb/stata-gtools
Overview | Installation | Examples | Remarks | FAQs | Benchmarks | Compiling
Faster Stata for big data. This package uses C plugins and hashes to provide massive speed improvements to common Stata commands, including: reshape, collapse, xtile, tabstat, isid, egen, pctile, winsor, contract, levelsof, duplicates, unique/distinct, and more.
This package provides a fast implementation of various Stata commands using hashes and C plugins. The syntax and purpose are largely analogous to their Stata counterparts; for example, you can replace `collapse` with `gcollapse`, `reshape` with `greshape`, and so on. For a comprehensive list of differences (including some extra features!) see the remarks below; for details and examples see the official project page.
Quickstart

    ssc install gtools
    gtools, upgrade

Some quick benchmarks:
NOTE: Stata 17 introduced massive speed improvements to sort and collapse. In the MP version, in particular with many cores available, the native `collapse` can be up to twice as fast. (YMMV; overall, native collapses could still be slower in some use cases.) `gcollapse` remains faster in SE and older Stata versions.
Gtools commands with a Stata equivalent
Function | Replaces | Speedup (IC / MP) | Unsupported | Extras |
---|---|---|---|---|
gcollapse | collapse | -0.5 to 2 (Stata 17+); 4 to 100 (Stata 16 and earlier) | | Quantiles, merge, labels, nunique, etc. |
greshape | reshape | 4 to 20 / 4 to 15 | "advanced syntax" | fast, spread/gather (tidyr equiv) |
gegen | egen | 9 to 26 / 4 to 9 (+,.) | labels | Weights, quantiles, nunique, etc. |
gcontract | contract | 5 to 7 / 2.5 to 4 | | |
gisid | isid | 8 to 30 / 4 to 14 | using, sort | if, in |
glevelsof | levelsof | 3 to 13 / 2 to 7 | | Multiple variables, arbitrary levels |
gduplicates | duplicates | 8 to 16 / 3 to 10 | | |
gquantiles | xtile | 10 to 30 / 13 to 25 (-) | | by(), various (see usage) |
| | pctile | 13 to 38 / 3 to 5 (-) | | Ibid. |
| | _pctile | 25 to 40 / 3 to 5 | | Ibid. |
gstats tab | tabstat | 10 to 50 / 5 to 30 (-) | See remarks | various (see usage) |
gstats sum | sum, detail | 10 to 20 / 5 to 10 | See remarks | various (see usage) |
(+) The upper end of the speed improvements is for quantiles (e.g. median, iqr, p90) and few groups. Weights have not been benchmarked.
(.) Only gegen group was benchmarked rigorously.
(-) Benchmarks computed 10 quantiles. When computing a large number of quantiles (e.g. thousands) `pctile` and `xtile` are prohibitively slow due to the way they are written; in that case gquantiles is hundreds or thousands of times faster, but this is an edge case.
Extra commands
Function | Similar (SSC/SJ) | Speedup (IC / MP) | Notes |
---|---|---|---|
fasterxtile | fastxtile | 20 to 30 / 2.5 to 3.5 | Allows by() |
| | egenmisc (SSC) (-) | 8 to 25 / 2.5 to 6 | |
| | astile (SSC) (-) | 8 to 12 / 3.5 to 6 | |
gstats hdfe | | (.) | Allows weights, by() |
gstats winsor | winsor2 | 10 to 40 / 10 to 20 | Allows weights |
gunique | unique | 4 to 26 / 4 to 12 | |
gdistinct | distinct | 4 to 26 / 4 to 12 | Also saves results in matrix |
gtop (gtoplevelsof) | groups, select() | (+) | See table notes (+) |
gstats range | rangestat | 10 to 20 / 10 to 20 | Allows weights; no flex stats |
gstats transform | | | Various statistical functions |
(-) `fastxtile` from egenmisc and `astile` were benchmarked against `gquantiles, xtile` (`fasterxtile`) using `by()`.
(+) While similar to the user command 'groups' with the 'select' option, gtoplevelsof does not really have an equivalent. It is several dozen times faster than 'groups, select', but that command was not written with the goal of gleaning the most common levels of a varlist. Rather, it has a plethora of features and that one is somewhat incidental. As such, the benchmark is not equivalent and `gtoplevelsof` does not attempt to implement the features of 'groups'.
(.) Other than the dated 'hdfe' command, I do not know of a Stata command that residualizes variables from a set of fixed effects. The 'hdfe' command, as far as I can tell, morphed into the 'reghdfe' package; the latter, however, is a fully-functioning regression command, while 'gstats hdfe' only residualizes a set of variables.
Regression models
WARNING: Regression models are in beta and are only intended as utilities to compute coefficients and standard errors. I do not recommend their use in production; various post-estimation commands and statistics are not available. (See `gstats hdfe` for residualizing variables net of fixed effects.)
Function | Model | Similar |
---|---|---|
gregress | OLS | regress, reghdfe |
givregress | 2SLS | ivregress 2sls, ivreghdfe |
gglm | IRLS | logit, poisson, ppmlhdfe |
All commands allow the user to optionally add:
- `absorb()` for high-dimensional fixed effects absorptions.
- `cluster()` for clustering (multiple covariates assume clusters are nested).
- `by()` for regressions by group.
- `weights` for weighted versions. Unlike other weights, `fweights` are assumed to refer to the number of observations.

Linear regression is computed via OLS (or WLS), IV regression is computed via two-stage least squares (2SLS), and GLM (poisson or logit) regression is computed via iteratively reweighted least squares (IRLS). See the TODO section for planned features, or the Missing Features section in the documentation for what is missing before the first non-beta release.
Extra features
Several commands offer additional features on top of the massive speedup. See the remarks section below for an overview; for details and examples, see each command's help page:
- gcollapse
- greshape
- gquantiles
- gstats sum/tab
- gstats transform/range/moving
- glevelsof
- gtoplevelsof
- gegen
- gdistinct
- gregress
- givregress
- gglm (poisson and logit)
In addition, several commands take gsort-style input, that is
[+|-]varname [[+|-]varname ...]
This does not affect the results in most cases, just the sort order. Commands that take this type of input include:
- gcollapse
- gcontract
- gegen
- glevelsof
- gtop (gtoplevelsof)
Ftools
The commands here are also faster than the commands provided by `ftools`; further, `gtools` commands accept a mix of string and numeric variables, which `ftools` does not. (Note I could not get several parts of `ftools` working on the Linux server where I have access to Stata/MP; hence the IC benchmarks.)
Gtools | Ftools | Speedup (IC) |
---|---|---|
gcollapse | fcollapse | 2-9 |
gegen | fegen | 2.5-4 (+) |
gisid | fisid | 4-14 |
glevelsof | flevelsof | 1.5-13 |
hashsort | fsort | 2.5-4 |
(+) Only egen group was benchmarked rigorously.
Limitations
- `strL` variables are only partially supported on Stata 14 and above; `gcollapse`, `gcontract`, and `greshape` do not support `strL` variables.
- Due to a Stata bug, gtools cannot support more than `2^31-1` (2.1 billion) observations. See this issue.
- Due to limitations in the Stata Plugin Interface, gtools can only handle as many variables as the largest `matsize` in the user's Stata version. For MP this is more than 10,000 variables but in IC this is only 800. See this issue.
- Gtools uses compiled C code to achieve its massive increases in speed. This has two side-effects users might notice. First, it is sometimes not possible to break the program's execution. While this is already true for at least some parts of most Stata commands, there are fewer opportunities to break Gtools commands relative to their Stata counterparts. Second, the Stata GUI might appear frozen when running Gtools commands. If the system then runs out of RAM (memory), it could look like Stata has crashed (it may show a "(Not Responding)" message on Windows or it may darken on *nix systems). However, the program has not crashed; it is merely trying to swap memory. To check this is the case, the user can monitor disk activity or monitor their system's pagefile or swap space directly.
The OSX version of gtools was implemented with invaluable help from @fbelotti in issue 11.
Gtools was largely inspired by Sergio Correia's (@sergiocorreia) excellent ftools package. Further, several improvements and bug fixes have come from @sergiocorreia's helpful comments.
With the exception of `greshape`, every gtools command has been written almost entirely from scratch (and even `greshape` is mostly new code). However, gtools commands typically mimic the functionality of existing Stata commands, including community-contributed programs, meaning many of the ideas and options are based on them (see the respective help files for details). `gtools` commands based on community-contributed programs include:
- `gstats winsor`, based on `winsor2` by Lian (Arlion) Yujun.
- `gunique`, based on `unique` by Michael Hills and Tony Brady.
- `gdistinct`, based on `distinct` by Gary Longton and Nicholas J. Cox.
I only have access to Stata 13.1, so I impose that to be the minimum. You can install `gtools` from Stata via SSC:

    ssc install gtools
    gtools, upgrade

By default this syncs to the master branch, which is stable. To install the latest version directly, type:

    local github "https://raw.githubusercontent.com"
    net install gtools, from(`github'/mcaceresb/stata-gtools/master/build/)

The syntax is generally analogous to the standard commands (see the correspondinghelp files for full syntax and options):

    sysuse auto, clear

    * gstats {hdfe|residualize} varlist [if] [in] [weight], [absorb(varlist) options]
    gstats hdfe hdfe_price = price, absorb(foreign rep78)
    gstats residualize price mpg, absorb(foreign rep78) prefix(res_)

    * gstats {sum|tab} varlist [if] [in] [weight], [by(varlist) options]
    gstats sum price [pw = gear_ratio / 4]
    gstats tab price mpg, by(foreign) matasave

    * gquantiles [newvarname =] exp [if] [in] [weight], {_pctile|xtile|pctile} [options]
    gquantiles 2 * price, _pctile nq(10)
    gquantiles p10 = 2 * price, pctile nq(10)
    gquantiles x10 = 2 * price, xtile nq(10) by(rep78)
    fasterxtile xx = log(price) [w = weight], cutpoints(p10) by(foreign)

    * gstats winsor varlist [if] [in] [weight], [by(varlist) cuts(# #) options]
    gstats winsor price gear_ratio mpg, cuts(5 95) s(_w1)
    gstats winsor price gear_ratio mpg, cuts(5 95) by(foreign) s(_w2)
    drop *_w?

    * hashsort varlist, [options]
    hashsort -make
    hashsort foreign -rep78, benchmark verbose mlast

    * gegen target = stat(source) [if] [in] [weight], by(varlist) [options]
    gegen tag = tag(foreign)
    gegen group = tag(-price make)
    gegen p2_5 = pctile(price) [w = weight], by(foreign) p(2.5)

    * gisid varlist [if] [in], [options]
    gisid make, missok
    gisid price in 1/2

    * gduplicates varlist [if] [in], [options gtools(gtools_options)]
    gduplicates report foreign
    gduplicates report rep78 if foreign, gtools(bench(3))

    * glevelsof varlist [if] [in], [options]
    glevelsof rep78, local(levels) sep(" |")
    glevelsof foreign mpg if price < 4000, loc(lvl) sep(" |") colsep(",")
    glevelsof foreign mpg in 10/70, gen(uniq_) nolocal

    * gtop varlist [if] [in] [weight], [options]
    * gtoplevelsof varlist [if] [in] [weight], [options]
    gtoplevelsof foreign rep78
    gtop foreign rep78 [w = weight], ntop(5) missrow groupmiss pctfmt(%6.4g) colmax(3)

    * gregress depvar indepvars [if] [in] [weight], [by(varlist) options]
    gregress price mpg rep78, mata(coefs) prefix(b(_b_) se(_se_))
    gregress price mpg [fw = rep78], by(foreign) absorb(rep78 headroom) cluster(rep78)

    * givregress depvar (endog = instruments) exog [if] [in] [weight], [by(varlist) options]
    givregress price (mpg = gear_ratio) rep78, mata(coefs) prefix(b(_b_) se(_se_)) replace
    givregress price (mpg = gear_ratio) [fw = rep78], by(foreign) absorb(rep78 headroom) cluster(rep78)

    * gglm depvar indepvars [if] [in] [weight], family(...) [by(varlist) options]
    gglm price mpg rep78, family(poisson) mata(coefs) prefix(b(_b_) se(_se_)) replace
    gglm price mpg [fw = trunk], family(poisson) by(foreign) absorb(rep78 headroom) cluster(rep78)
    gglm foreign price rep78 [fw = trunk], family(binomial) absorb(headroom) mata(coefs)
    gglm foreign price if rep78 > 2, family(binomial) by(rep78) prefix(b(_b_) se(_se_)) replace

    * gcollapse (stat) out = src [(stat) out = src ...] [if] [in] [weight], by(varlist) [options]
    gen h1 = headroom
    gen h2 = headroom
    local lbl labelformat(#stat:pretty# #sourcelabel#)

    gcollapse (mean) mean = price (median) p50 = gear_ratio, by(make) merge v `lbl'
    disp "`:var label mean', `:var label p50'"
    gcollapse (iqr) irq? = h? (nunique) turn (p97.5) mpg, by(foreign rep78) bench(2) wild

    * gcontract varlist [if] [in] [fweight], [options]
    gcontract foreign [fw = turn], freq(f) percent(p)

    * greshape wide varlist, i(i) j(j) [options]
    * greshape long prefixlist, i(i) [j(j) string options]
    *
    * greshape spread varlist, j(j) [options]
    * greshape gather varlist, j(j) value(value) [options]
    gen j = _n
    greshape wide f p, i(foreign) j(j)
    greshape long f p, i(foreign) j(j)
    greshape spread f p, j(j)
    greshape gather f? p?, j(j) value(fp)

    * gstats transform (stat) out = src [(stat) out = src ...] [if] [in] [weight], by(varlist) [options]
    * gstats range (stat) out = src [...] [if] [in] [weight], by(varlist) [options]
    * gstats moving (stat) out = src [...] [if] [in] [weight], by(varlist) [options]
    sysuse auto, clear
    gstats transform (normalize) price (demean) price (range mean -sd sd) price, auto
    gstats range (mean) mean_r = price (sd) sd_r = price, interval(-10 10 mpg)
    gstats moving (mean) mean_m = price (sd) sd_m = price, by(foreign) window(-5 5)

See the FAQs or the respective documentation for a list of supported `gcollapse` and `gegen` functions.
Functions available with `gegen`, `gcollapse`, `gstats tab`
`gcollapse` supports every `collapse` function, including their weighted versions. In addition, weights can be selectively applied via `rawstat()`, and several additional statistics are allowed, including `nunique`, `select#`, and so on.
`gegen` technically does not support all of `egen`, but whenever a function that is not supported is requested, `gegen` hashes the data and calls `egen` grouping by the hash, which is often faster (`gegen` only supports weights for internal functions, since `egen` does not normally allow weights).
Hence both should be able to replicate all of the functionality of their Stata counterparts. Last, `gstats tab` allows every statistic allowed by `tabstat` as well as any statistic allowed by `gcollapse`; the syntax for the statistics specified via `statistics()` is the same as in `tabstat`.
The following are implemented internally in C:
Function | gcollapse | gegen | gstats tab |
---|---|---|---|
tag | | X | |
group | | X | |
total | | X | |
count | X | X | X |
nunique | X | X | X |
nmissing | X | X (+) | X |
sum | X | X | X |
nansum | X | X | X |
rawsum | X | X | |
rawnansum | X | X | |
mean | X | X | X |
geomean | X | X | X |
median | X | X | X |
percentiles | X | X | X |
iqr | X | X | X |
sd | X | X | X |
variance | X | X (+) | X |
cv | X | X | X |
max | X | X | X |
min | X | X | X |
range | X | X | X |
select | X | X | X |
rawselect | X | X | |
percent | X | X | X |
first | X | X (+) | X |
last | X | X (+) | X |
firstnm | X | X (+) | X |
lastnm | X | X (+) | X |
semean | X | X (+) | X |
sebinomial | X | X | X |
sepoisson | X | X | X |
skewness | X | X | X |
kurtosis | X | X | X |
gini | X | X | X |
gini dropneg | X | X | X |
gini keepneg | X | X | X |
(+) indicates the function has the same or a very similar name to a function in the "egenmore" package, but the function was independently implemented and is hence analogous to its gcollapse counterpart, not necessarily the function in egenmore.
The percentile syntax mimics that of `collapse` and `egen`, with the addition that quantiles are also supported. That is,

    gcollapse (p#) target = var [target = var ...] , by(varlist)
    gegen target = pctile(var), by(varlist) p(#)

where # is a "percentile" with arbitrary decimal places (e.g. 2.5 or 97.5). `gtools` also supports selecting the `#`th smallest or largest value:

    gcollapse (select#) target = var [(select-#) target = var ...] , by(varlist)
    gegen target = select(var), by(varlist) n(#)
    gegen target = select(var), by(varlist) n(-#)

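The selection rule can be illustrated outside Stata. The following is a minimal Python sketch, not gtools code: `select_by_group` is a hypothetical helper, and returning `None` when a group has fewer than `|n|` values is an assumption made for the illustration, not documented gtools behavior.

```python
from collections import defaultdict

def select_by_group(rows, n):
    """Pick the n-th smallest (n > 0) or |n|-th largest (n < 0) value per group.

    rows is an iterable of (group, value) pairs. Groups with fewer than |n|
    values map to None (an assumption of this sketch).
    """
    if n == 0:
        raise ValueError("n must be nonzero")
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    out = {}
    for group, vals in groups.items():
        vals.sort()
        idx = n - 1 if n > 0 else n  # 1-based rank from either end
        out[group] = vals[idx] if -len(vals) <= idx < len(vals) else None
    return out

rows = [("a", 3), ("a", 1), ("a", 2), ("b", 10)]
second_smallest = select_by_group(rows, 2)
largest = select_by_group(rows, -1)
```

Here `n=2` plays the role of `select2` and `n=-1` the role of `select-1`.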
In addition, the following are allowed in `gegen` as wrappers to other gtools functions (`stat` is any stat available to `gcollapse`, except `percent`, `nunique`):
Function | calls |
---|---|
xtile | fasterxtile |
standardize | gstats transform |
normalize | gstats transform |
demean | gstats transform |
demedian | gstats transform |
moving_stat | gstats transform |
range_stat | gstats transform |
cumsum | gstats transform |
shift | gstats transform |
rank | gstats transform |
winsor | gstats winsor |
winsorize | gstats winsor |
Last, when `gegen` calls a function that is not implemented internally by `gtools`, it will hash the by variables and call `egen` with `by` set to an id based on the hash. That is, if `fcn` is not one of the functions above,

    gegen outvar = fcn(varlist) [if] [in], by(byvars)

would be the same as

    hashsort byvars, group(id) sortgroup
    egen outvar = fcn(varlist) [if] [in], by(id)

but preserving the original sort order. In case an `egen` option might conflict with a gtools option, the user can pass `gtools_capture(fcn_options)` to `gegen`.
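The fallback described above (collapse the by variables into a single id, apply the function by id, keep the original row order) can be sketched in Python. This is only an illustration of the idea: `by_group_apply` is a hypothetical name, and a Python dict stands in for the hash that the C plugin computes.

```python
def by_group_apply(by_keys, values, fcn):
    """Apply fcn within each group and broadcast the result back to every row.

    by_keys holds one hashable tuple of by-variable values per row. Rows are
    never reordered, so the output lines up with the input order.
    """
    ids = {}
    group_id = []
    for key in by_keys:
        # First time a key is seen it gets the next integer id.
        group_id.append(ids.setdefault(key, len(ids)))
    buckets = [[] for _ in ids]
    for gid, value in zip(group_id, values):
        buckets[gid].append(value)
    results = [fcn(bucket) for bucket in buckets]
    return [results[gid] for gid in group_id]

by_keys = [("x", 1), ("y", 2), ("x", 1)]
values = [1, 10, 3]
group_max = by_group_apply(by_keys, values, max)
```

Because the id is a single integer regardless of how many by variables there are, the function applied afterwards only ever groups on one key.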
Differences and Extras
Differences from `collapse`
- String variables are not allowed for `first`, `last`, `min`, `max`, etc. (see issue 25)
- New functions: `nunique`, `nmissing`, `cv`, `variance`, `select#`, `select-#`, `range`, `gini`
- `rawstat` allows selectively applying weights.
- `rawselect` ignores weights for `select` (analogously to `rawsum`).
- Option `wild` allows bulk-rename. E.g. `gcollapse mean_x* = x*, wild`
- `gcollapse (nansum)` and `gcollapse (rawnansum)` output a missing value for sums if all inputs are missing (instead of 0).
- `gcollapse, merge` merges the collapsed data set back into memory. This is much faster than collapsing a dataset, saving, and merging after. However, Stata's `merge ..., update` functionality is not implemented, only replace. (If the targets exist the function will throw an error without `replace`).
- `gcollapse, labelformat` allows specifying the output label using placeholders.
- `gcollapse, sumcheck` keeps integer types with `sum` if the sum will not overflow.
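The difference between `sum` and `nansum` noted in the list above comes down to what an all-missing group returns. A minimal Python sketch, using `None` as a stand-in for Stata's missing value (the function names are hypothetical and this is not the plugin's code):

```python
def plain_sum(values):
    """Sum that skips missing values; an all-missing input gives 0."""
    return sum(v for v in values if v is not None)

def nan_sum(values):
    """Sum that skips missing values; an all-missing input stays missing."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None

all_missing = [None, None]
mixed = [1, None, 2]
```

With `mixed` both functions agree; only `all_missing` separates them.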
Differences from `reshape`
- Allows an arbitrary number of variables in `i()` and `j()`
- Several options allow turning off error checks for faster execution, including: `fast` (similar to `fast` in `gcollapse`), `unsorted` (do not sort the output), `nodupcheck` (allow duplicates in `i`), `nomisscheck` (allow missing values and/or leading blanks in `j`), or `nochecks` (all of the above).
- Subcommands `gather` and `spread` implement the equivalent commands from R's `tidyr` package.
- At the moment, `j(name [values])` is not supported. All values of `j` are used.
- "reshape mode" is not supported. Reshape variables are not saved as part of the current dataset's characteristics, meaning the user cannot type `reshape wide` and `reshape long` without further arguments to reverse the `reshape`. This syntax is very cumbersome and difficult to support; `greshape` re-wrote much of the code base and had to dispense with this functionality.
- For that same reason, "advanced" syntax is not supported, including the subcommands: clear, error, query, i, j, xij, and xi.
- `@` syntax can be modified via `match()`
- `dropmiss` allows dropping missing observations when reshaping from wide to long (via `long` or `gather`).
Differences from regression models
`gregress`, `givregress`, and `gglm` do not aim to replicate the entire table of estimation results, nor the entire suite of post-estimation results and tests, that `regress` (`reghdfe`), `ivregress 2sls` (`ivreghdfe`), `poisson` (`ppmlhdfe`), or `logit` make available. At the moment, they are considered beta software and only coefficients and standard errors are computed.
- Results are saved either to mata (default) or copied to variables in the dataset in memory.
- `by()` and `absorb()` are allowed and can be combined.
- `givregress` does a small sample adjustment (`small`) automatically.
- `givregress` does not exit with error if covariates are collinear with the dependent variable.
- If the `givregress` model is not identified, standard errors and coefficients are set to missing instead of exiting with error.
- `gglm` runs with option `robust` automatically.
- If there are no non-linear covariates (i.e. all observations are numerically zero) then the coefficients and standard errors are both set to missing.
Differences from `xtile`, `pctile`, and `_pctile`
- Adds support for `by()` (including weights)
- Does not ignore `altdef` with `xtile` (see this Statalist thread)
- Category frequencies can also be requested via `binfreq[()]`.
- `xtile`, `pctile`, and `_pctile` can be combined via `xtile(newvar)` and `pctile(newvar)`
- There is no limit to `nquantiles()` for `xtile`
- Quantiles can be requested via `percentiles()` (or `quantiles()`), `cutquantiles()`, or `quantmatrix()` for `xtile` as well as `pctile`.
- Cutoffs can be requested via `cutquantiles()`, `cutoffs()`, or `cutmatrix()` for `xtile` as well as `pctile`.
- The user has control over the behavior of `cutpoints()` and `cutquantiles()`. They obey `if` `in` with option `cutifin`, they can be group-specific with option `cutby`, and they can be de-duplicated via `dedup`.
- Fixes numerical precision issues with `pctile, altdef` (e.g. see this Statalist thread, which is a very minor thing so Stata and fellow users maintain it's not an issue, but I think it is because Stata/MP gives what I think is the correct answer whereas IC and SE do not).
- Fixes a possible issue with the weights implementation in `_pctile`; see this thread.
Differences from `egen`
- `group` label options are not supported
- weights are supported for internally implemented functions.
- New functions: `nunique`, `nmissing`, `cv`, `variance`, `select#`, `select-#`, `range`
- `gegen` upgrades the type of the target variable if it is not specified by the user. This means that if the sources are `double` then the output will be double. All sums are double. `group` creates a `long` or a `double`. And so on. `egen` will default to the system type, which could cause a loss of precision on some functions.
- For internally supported functions, you can specify a varlist as the source, not just a single variable. Observations will be pooled by row in that case.
- While `gegen` is much faster for `tag`, `group`, and summary stats, most egen functions are not implemented internally, meaning for arbitrary `gegen` calls this is a wrapper for hashsort and egen.
Differences from `tabstat`
- Multiple groups are allowed.
- Saving the output is done via `mata` instead of `r()`. No matrices are saved in `r()` and option `save` is not allowed. However, option `matasave` saves the output and `by()` info in `GstatsOutput` (the object can be named via `matasave(name)`). See `mata GstatsOutput.desc()` after `gstats tab, matasave` for details.
- `GstatsOutput` provides helpers for extracting rows, columns, and levels.
- Options `casewise`, `longstub` are not supported.
- Option `nototal` is on by default; `total` is planned for a future release.
- Option `pooled` pools the source variables into one.
Differences from `summarize, detail`
- The behavior of `summarize` and `summarize, meanonly` can be recovered via options `nodetail` and `meanonly`. These two options are mainly for use with `by()`
- Option `matasave` saves output and `by()` info in `GstatsOutput`, a mata class object (the object can be named via `matasave(name)`). See `mata GstatsOutput.desc()` after `gstats sum, matasave` for details.
- Option `noprint` saves the results but omits printing output.
- Option `tab` prints statistics in the style of `tabstat`
- Option `pooled` pools the source variables and computes summary stats as if it was a single variable.
- `pweights` are allowed.
- Largest and smallest observations are weighted.
- `rolling:`, `statsby:`, and `by:` are not allowed. To use `by` pass the option `by()`
- `display options` are not supported.
- Factor and time series variables are not allowed.
Differences from `levelsof`
- It can take a `varlist` and not just a `varname`; in that case it prints all unique combinations of the varlist. The user can specify column and row separators.
- It can deduplicate an arbitrary number of levels and store the results in a new variable list or replace the old variable list via `gen(prefix)` and `gen(replace)`, respectively. If the user runs up against the maximum macro variable length, add option `nolocal`.

Differences from `isid`
- No support for `using`. The C plugin API does not allow loading a Stata dataset from disk.
- Option `sort` is not available.
- It can also check IDs with `if` and `in` conditions.

Differences from `gsort`
- `hashsort` behaves as if `mfirst` was passed. To recover the default behavior of `gsort` pass option `mlast`.

Differences from `duplicates`
- `gduplicates` does not sort `examples` or `list` by default. This massively enhances performance but it might be harder to read. Pass option `sort` (`sorted`) to mimic `duplicates` behavior and sort the list.
Differences from `rangestat`
- Note that `gstats range` is an alias for `gstats transform` that assumes all the stats requested are range statistics. However, it can be called in conjunction with any other transform via `(range stat ...)`. It was not intended to be a replacement of `rangestat` but it can replicate some of its functionality.
- `flex_stat`s (reg, corr, cov) are not allowed (see `gregress`).
- Intervals are of the form `interval(low high [keyvar])`; if `keyvar` is missing then it is taken to be the source variable.
- Variables are not allowed in place of `low` or `high`. Instead they must be `#[stat]` where `#` is a number and `stat` is an optional summary statistic; e.g. `interval(-sd 0.5sd x)`.
- Separate interval and interval variables can be specified for each target; e.g. `gstats range (mean -3 3) x (mean -2 . time) y ...`.
- All statistics allowed by `gstats tab` are allowed by `gstats range` (except `nunique` or `percent`).
- Options `casewise`, `describe`, and `local` are not allowed.
There are two key insights to the massive speedups of Gtools:
- Hashing the data and sorting a hash is a lot faster than sorting the data to then process it by group. Sorting a hash can be achieved in linear O(N) time, whereas the best general-purpose sorts take O(N log(N)) time. Sorting the groups would then be achievable in O(J log(J)) time (with J groups). Hence the speed improvements are largest when N / J is largest.
- Compiled C code is much faster than Stata commands. While it is true that many of Stata's underpinnings are compiled code, several operations are written in `ado` files without much thought given to optimization. If you're working with tens of thousands of observations you might barely notice (and the difference between 5 seconds and 0.5 seconds might not be particularly important). However, with tens of millions or hundreds of millions of rows, the difference between half a day and an hour can matter quite a lot.
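The first insight can be made concrete with a small Python comparison of the two strategies. This is an illustrative sketch only: a Python dict stands in for the hash used by the C plugin, and both function names are hypothetical.

```python
def group_sum_hash(keys, values):
    """One O(N) pass accumulating into a hash table, then sort only the J groups."""
    totals = {}
    for key, value in zip(keys, values):
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())  # O(J log J) with J distinct groups

def group_sum_sort(keys, values):
    """Sort all N rows first (O(N log N)), then sum runs of equal keys."""
    out = []
    for key, value in sorted(zip(keys, values)):
        if out and out[-1][0] == key:
            out[-1] = (key, out[-1][1] + value)
        else:
            out.append((key, value))
    return out

keys = ["b", "a", "b", "c", "a", "b"]
values = [1, 2, 3, 4, 5, 6]
by_hash = group_sum_hash(keys, values)
by_sort = group_sum_sort(keys, values)
```

Both return the same grouped totals; the hash version only ever sorts the distinct groups, which is why the gain grows with N / J.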
Stata Sorting
It should be noted that Stata's sorting mechanism is hard to improve upon because of the overhead involved in sorting. We have implemented a hash-based sorting command, `hashsort`, which should be faster than Stata's `sort` for groups, but not necessarily otherwise:
Function | Replaces | Speedup (IC / MP) | Unsupported | Extras |
---|---|---|---|---|
hashsort | sort | 2.5 to 4 / .8 to 1.3 | | Group (hash) sorting |
| | gsort | 2 to 18 / 1 to 6 | mfirst (see mlast) | Sorts are stable |

The overhead involves copying the by variables, hashing, sorting the hash, sorting the groups, copying a sort index back to Stata, and having Stata do the final swaps. The plugin runs fast, but the copy overhead plus the Stata swaps often make the function slower than Stata's native `sort`.
The reason that the other functions are faster is because they don't deal with all that overhead. By contrast, Stata's `gsort` is not efficient. To sort data, you need to make pair-wise comparisons. For real numbers, this is just `a > b`. However, a generic comparison function can be written as `compare(a, b) > 0`. This is true if a is greater than b and false otherwise. To invert the sort order, one need only use `compare(b, a) > 0`, which is what gtools does internally.
However, Stata creates a variable that is the inverse of the sort variable. This is equivalent, but the overhead makes it slower than `hashsort`.
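The two ways of reversing a sort described above can be contrasted in a short Python sketch. It is an illustration of the idea for numeric data only, not the C implementation; `compare`, `sort_with_order`, and `sort_via_negated_copy` are hypothetical names.

```python
from functools import cmp_to_key

def compare(a, b):
    """Generic three-way comparison: positive if a > b, negative if a < b, else 0."""
    return (a > b) - (a < b)

def sort_with_order(xs, descending=False):
    """Reverse the order by swapping the comparator's arguments; no extra data copy."""
    cmp = (lambda a, b: compare(b, a)) if descending else compare
    return sorted(xs, key=cmp_to_key(cmp))

def sort_via_negated_copy(xs):
    """Reverse the order by building a negated copy and sorting on it instead."""
    negated = [-x for x in xs]  # the extra variable is the overhead
    order = sorted(range(len(xs)), key=negated.__getitem__)
    return [xs[i] for i in order]

xs = [3, 1, 2]
swapped = sort_with_order(xs, descending=True)
negated_copy = sort_via_negated_copy(xs)
```

Both give the same descending order; the second pays for an extra pass and an extra column of data.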
Planned features:
- Things to add to gcollapse:
  - `prod`
  - `geomean pos`: exclude negative numbers and zero.
  - `geomean abspos`: ibid but take absolute value first.
  - Generally should you add an `abs` option to everything?
- Flexible save options for `gregress`
  - `predict()`, including `xb` and `e`.
  - `absorb(fe1=group1 fe2=group2 ...)` syntax to save the FE.
  - Choose which coefs/se to save.
- Improve formula documentation for summary statistics (e.g. `gini`)
- Internal consistency test for various parts of `gquantiles`. Each function section does cases but they should be consistent!

These are options/features/improvements I would like to add, but I don't have an ETA for them (i.e. they are a wishlist because I am either not sure how to implement them or because writing the code will take a long time). Roughly in order of likelihood:
- `gregress` missing features
  - Non-nested multi-way clustering.
  - HDFE collinear categories check.
  - HDFE drop singletons.
  - Detect separated observations in `gglm, family(poisson)`.
  - Guard against possible overflows in `X' X`
  - Accelerate HDFE corner cases (e.g. very dense multi-way HDFE)
  - Include quick primers on OLS, IV, and IRLS in docs.
  - Some support for Stata's extended syntax in `gregress`
- Update benchmarks for all commands. Still on 0.8 benchmarks.
- Dropmissing vs dropmissing but not extended missing values.
- Allow keeping both variable names and labels in `greshape spread/gather`
- Implement `selectoverflow(missing|closest)`
- Add totals row for `J > 1` in gstats
- Improve debugging info.
- Implement `collapse()` option for `greshape`.
- Rolling (interval) and moving options for `gregress`.
- Add support for binary `strL` variables.
- Minimize memory use.
- Add memory(greedy|lean) to give user fine-grained control over internals.
- Create a Stata C hashing API with thin wrappers around core functions.
  - This will be a C library that other users can import.
  - Some functionality will be available from Stata via gtools, api()
  - Improve code comments when you write the API!
- Have some type of coding standard for the base (coding style)
- Implement `gmerge`
- Integration with ReadStat?
Hi! I'm Mauricio Caceres; I made gtools after some of my Stata jobs were taking literally days to run because of repeat calls to `egen`, `collapse`, and similar on data with over 100M rows. Feedback and comments are welcome! I hope you find this package as useful as I do.
Along those lines, here are some other Stata projects I like:
- `ftools`: The main inspiration for gtools. Not as fast, but it has a rich feature set; its mata API in particular is excellent.
- `reghdfe`: The fastest way to run a regression with multiple fixed effects (as far as I know).
- `stata_kernel`: A Stata kernel for Jupyter; extremely useful for interacting with Stata.
- `stata-cowsay`: Productivity-boosting cowsay functionality in Stata.
Gtools is MIT-licensed. `./lib/spookyhash` and `./src/plugin/common/quicksort.c` belong to their respective authors and are BSD-licensed. Also see `gtools, licenses`.