| Title: | Penalized Poisson Pseudo Maximum Likelihood Regression |
| Version: | 0.2.4 |
| Description: | A set of tools that enables efficient estimation of penalized Poisson Pseudo Maximum Likelihood regressions, using lasso or ridge penalties, for models that feature one or more sets of high-dimensional fixed effects. The methodology is based on Breinlich, Corradi, Rocha, Ruta, Santos Silva, and Zylkin (2021) <http://hdl.handle.net/10986/35451> and takes advantage of the method of alternating projections of Gaure (2013) <doi:10.1016/j.csda.2013.03.024> for dealing with HDFE, as well as the coordinate descent algorithm of Friedman, Hastie and Tibshirani (2010) <doi:10.18637/jss.v033.i01> for fitting lasso regressions. The package is also able to carry out cross-validation and to implement the plugin lasso of Belloni, Chernozhukov, Hansen and Kozbur (2016) <doi:10.1080/07350015.2015.1102733>. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| LazyData: | true |
| LazyDataCompression: | gzip |
| RoxygenNote: | 7.3.2 |
| LinkingTo: | Rcpp, RcppEigen |
| Imports: | Rcpp, glmnet, fixest, collapse, rlang, magrittr, matrixStats, dplyr, devtools |
| Depends: | R (≥ 2.10) |
| URL: | https://github.com/tomzylkin/penppml |
| BugReports: | https://github.com/tomzylkin/penppml/issues |
| Suggests: | testthat (≥ 3.0.0), MASS, knitr, rmarkdown, ggplot2, reshape2 |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr, rmarkdown |
| NeedsCompilation: | yes |
| Packaged: | 2025-02-07 21:19:42 UTC; joaoa |
| Author: | Diego Ferreras Garrucho [aut], Tom Zylkin [aut], Joao Cruz [cre, ctb], Nicolas Apfel [ctb] |
| Maintainer: | Joao Cruz <joaocruz@iseg.ulisboa.pt> |
| Repository: | CRAN |
| Date/Publication: | 2025-02-08 16:20:12 UTC |
penppml: Penalized Poisson Pseudo Maximum Likelihood Regression
Description
A set of tools that enables efficient estimation of penalized Poisson Pseudo Maximum Likelihood regressions, using lasso or ridge penalties, for models that feature one or more sets of high-dimensional fixed effects. The methodology is based on Breinlich, Corradi, Rocha, Ruta, Santos Silva, and Zylkin (2021) <http://hdl.handle.net/10986/35451> and takes advantage of the method of alternating projections of Gaure (2013) <doi:10.1016/j.csda.2013.03.024> for dealing with HDFE, as well as the coordinate descent algorithm of Friedman, Hastie and Tibshirani (2010) <doi:10.18637/jss.v033.i01> for fitting lasso regressions. The package is also able to carry out cross-validation and to implement the plugin lasso of Belloni, Chernozhukov, Hansen and Kozbur (2016) <doi:10.1080/07350015.2015.1102733>.
Functions
The workhorse of this package is the mlfitppml function, which allows users to carry out penalized HDFE-PPML estimation with a wide variety of options. The syntax is very simple, allowing users to select a data frame with all the relevant variables and then select dependent, independent and fixed effects variables by name or column number.
In addition, the internals hdfeppml (post-lasso regression), penhdfeppml (penalized regression for a single lambda), penhdfeppml_cluster (plugin lasso), and xvalidate (cross-validation) are made available on a stand-alone basis for advanced users.
The package also includes alternative versions of mlfitppml, hdfeppml, penhdfeppml and penhdfeppml_cluster. These (mlfitppml_int, hdfeppml_int, penhdfeppml_int and penhdfeppml_cluster_int) use an alternative syntax: users must provide the dependent variable in a vector, the regressors in a matrix and the fixed effects in a list.
Finally, support for the iceberg lasso method in Breinlich, Corradi, Rocha, Ruta, Santos Silva, and Zylkin (2021) is in development and can be accessed at its current stage via the iceberg function.
References
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S. (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
Author(s)
Maintainer: Joao Cruz <joaocruz@iseg.ulisboa.pt> [contributor]
Authors:
Diego Ferreras Garrucho <d.ferreras-garrucho@lse.ac.uk>
Tom Zylkin <tzylkin@richmond.edu>
Other contributors:
Nicolas Apfel <n.apfel@soton.ac.uk> [contributor]
See Also
Useful links:
https://github.com/tomzylkin/penppml
Report bugs at https://github.com/tomzylkin/penppml/issues
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs | A value or the magrittr placeholder. |
rhs | A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
Computing A'A
Description
Computes A'A using C++.
Usage
AtA(A)
Arguments
A | A matrix. |
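The quantity A'A can be illustrated in base R. This is only a sketch of the computation itself, not the package's C++ routine; the example matrix is made up for illustration.

```r
# Base-R illustration of the cross-product A'A (not the package's C++ code).
A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)

AtA_explicit <- t(A) %*% A   # transpose, then multiply
AtA_base     <- crossprod(A) # same product without forming t(A) explicitly

print(AtA_base)
```

The result is a symmetric ncol(A) x ncol(A) matrix, which is the building block of the normal equations used in least squares.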
Bootstrap Lasso Implementation (in development)
Description
This function performs standard plugin lasso PPML estimation for bootreps samples drawn with replacement and reports those regressors selected in at least a certain fraction of the bootstrap repetitions.
Usage
bootstrap(
  data, dep, indep = NULL, cluster_id = NULL, fixed = NULL,
  selectobs = NULL, bootreps = 250, boot_threshold = 0.01,
  colcheck_x = FALSE, colcheck_x_fes = FALSE, post = FALSE,
  gamma_val = NULL, verbose = FALSE, tol = 1e-06, hdfetol = 0.01,
  penweights = NULL, maxiter = 1000, phipost = TRUE
)
Arguments
data | A data frame containing all relevant variables. |
dep | A string with the name of the dependent variable or its column number. |
indep | A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
cluster_id | A string denoting the cluster ID with which to perform the cluster bootstrap. |
fixed | A vector with the names or column numbers of factor variables identifying the fixed effects, or a list with the desired interactions between variables in |
selectobs | Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
bootreps | Number of bootstrap repetitions. |
boot_threshold | Minimal threshold. If a variable is selected in at least this fraction of the bootstrap repetitions, it is reported at the end of the iterations. |
colcheck_x | Logical. If |
colcheck_x_fes | Logical. If |
post | Logical. If |
gamma_val | Numerical value that determines the regularization threshold as defined in Belloni, Chernozhukov, Hansen, and Kozbur (2016). NULL default sets parameter to 0.1/log(n). |
verbose | Logical. If |
tol | Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol | Tolerance parameter for the within-transformation step, passed on to |
penweights | Optional: a vector of coefficient-specific penalties to use in plugin lasso when |
maxiter | Maximum number of iterations (a number). |
phipost | Logical. If |
Details
This function enables users to implement the "bootstrap" step in the procedure described in Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2021). To do this, plugin lasso is run once per bootstrap sample (bootreps times in total). The function can also perform a post-selection estimation.
Value
A matrix with coefficient estimates for all dependent variables.
Examples
## Not run:
bs1 <- bootstrap(data = trade3, dep = "export", cluster_id = "clus",
                 fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")),
                 indep = 7:22, bootreps = 10, colcheck_x = TRUE,
                 colcheck_x_fes = TRUE, boot_threshold = 0.01, post = TRUE,
                 gamma_val = 0.01, verbose = FALSE)
## End(Not run)
Cluster-robust Standard Error Estimation
Description
cluster_matrix is a helper for computation of cluster-robust standard errors.
Usage
cluster_matrix(e, cluster, x)
Arguments
e | Vector of residuals. |
cluster | Vector of clusters. |
x | Regressor matrix. |
Value
Gives the XeeX matrix.
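As a rough illustration, and assuming that "XeeX" denotes the usual clustered "meat" matrix (the sum over clusters g of X_g' e_g e_g' X_g), the computation can be sketched in base R as follows. The simulated data and object names are hypothetical; this is not the package's implementation.

```r
# Sketch of a clustered "meat" matrix, sum_g X_g' e_g e_g' X_g, in base R.
# Assumption: this is what "XeeX" refers to; data below are simulated.
set.seed(1)
n <- 12
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))  # regressor matrix
e <- rnorm(n)                             # residuals
cluster <- rep(1:4, each = 3)             # cluster identifiers

# Row i of x * e is x_i * e_i; summing rows within clusters gives X_g' e_g
scores <- rowsum(x * e, group = cluster)

# Sum of the outer products of the cluster-level scores
XeeX <- crossprod(scores)

# The same matrix via an explicit loop over clusters
XeeX_loop <- matrix(0, ncol(x), ncol(x))
for (g in unique(cluster)) {
  idx <- cluster == g
  s <- crossprod(x[idx, , drop = FALSE], e[idx])  # k x 1 vector X_g' e_g
  XeeX_loop <- XeeX_loop + s %*% t(s)
}
```

Sandwiching this matrix between two copies of the inverse of the (weighted) X'X matrix yields a cluster-robust variance estimate.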
Checking for Perfect Multicollinearity
Description
collinearity_check checks for perfect multicollinearity in a model with high-dimensional fixed effects. It calls lfe::demeanlist in order to partial out the fixed effects, and then uses stats::lm.wfit to discard linearly dependent variables.
Usage
collinearity_check(
  y, x = NULL, fes = NULL, hdfetol,
  colcheck_x_fes = TRUE, colcheck_x = TRUE
)
Arguments
y | Dependent variable (a numeric vector). |
x | Regressor matrix. |
fes | List of fixed effects. |
hdfetol | Tolerance for the centering, passed on to |
colcheck_x_fes | Logical. If |
colcheck_x | Logical. If |
Value
A numeric vector containing the variables that pass the collinearity check.
Fixed Effects Computation
Description
This function is a helper for xvalidate that computes FEs using PPML First Order Conditions (FOCs).
Usage
compute_fes(
  y, fes, x, b, insample_obs = rep(1, n), onlymus = FALSE,
  tol = 1e-08, verbose = FALSE
)
Arguments
y | Dependent variable (a vector). |
fes | List of fixed effects. |
x | Regressor matrix. |
b | A vector of coefficient estimates. |
insample_obs | Vector of observations used to estimate the |
onlymus | Logical. If |
tol | A tolerance parameter. |
verbose | Logical. If |
Value
If onlymus = TRUE, the vector of conditional means. Otherwise, a list with two elements:
mu: conditional means.
fe_values: fixed effects.
Country ISO Codes
Description
An auxiliary data set with basic geographic information about country ISO 3166 codes included in the trade data set.
Usage
countries
Format
A data frame with 249 rows and 4 variables.
iso: Country ISO 3166 code.
name: Country name.
region: Continent.
subregion: Sub-continental region.
Source
The source of the data set is Luke Duncalfe's ISO-3166-Countries-with-Regional-Codes repository on GitHub (https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes#readme).
Faster Matrix Multiplication
Description
Faster matrix multiplication using C++.
Usage
eigenMatMult(A, B)
eigenMapMatMult(A, B)
Arguments
A, B | Matrices. |
Faster Least Squares Estimation
Description
Finds Least Squares solutions using C++.
Usage
fastolsCpp(X, y)
Arguments
X | Regressor matrix. |
y | Dependent variable (a vector). |
Value
The vector of parameter (beta) estimates.
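For reference, the least squares solution can be reproduced in base R via the normal equations. The snippet below is a sketch on simulated data and makes no claim about how the C++ routine solves the problem internally.

```r
# Least squares via the normal equations (X'X) beta = X'y, in base R.
# Simulated data; object names are illustrative only.
set.seed(42)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))                    # intercept + 2 regressors
y <- as.numeric(X %*% c(1, 2, -1) + rnorm(n))        # dependent variable

beta_ne <- solve(crossprod(X), crossprod(X, y))      # normal-equations solution
beta_qr <- lm.fit(X, y)$coefficients                 # QR-based reference solution
```

Both approaches give the same coefficients up to numerical precision; the QR route is usually more stable when X is close to collinear.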
Finding Ridge Regression Solutions
Description
A wrapper around fastridgeCpp, for faster computation of the analytical solution for ridge regression.
Usage
fastridge(x, y, weights = rep(1/n, n), lambda, standardize = TRUE)
Arguments
x | Regressor matrix. |
y | Dependent variable (a numeric vector). |
weights | Vector of weights. |
lambda | Penalty parameter. |
standardize | Logical. If |
Value
A vector of coefficient (beta) estimates.
Faster Ridge Regression
Description
Finds Ridge solutions using C++.
Usage
fastridgeCpp(X, y, lambda)
Arguments
X | Regressor matrix. |
y | Dependent variable (a vector). |
lambda | Penalty parameter (a number). |
Value
The vector of parameter (beta) estimates.
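The textbook closed-form ridge estimator is (X'X + lambda I)^(-1) X'y. The sketch below implements that formula in base R; ridge_closed_form is a hypothetical helper, and the exact scaling of the penalty (for instance by the number of observations or by weights) in the package may differ.

```r
# Textbook closed-form ridge solution in base R.
# Assumption: penalty enters as lambda * I; the package may scale it differently.
ridge_closed_form <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
}

set.seed(7)
X <- matrix(rnorm(40 * 3), nrow = 40, ncol = 3)
y <- as.numeric(X %*% c(0.5, -1, 2) + rnorm(40))

b_ols   <- ridge_closed_form(X, y, lambda = 0)   # lambda = 0 reproduces OLS
b_ridge <- ridge_closed_form(X, y, lambda = 10)  # coefficients shrink toward zero
```

Increasing lambda shrinks the coefficient vector toward zero, which is the behaviour the penalty parameter controls.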
Faster Standard Deviation
Description
Computes standard deviation using C++.
Usage
faststddev(X, w)
Arguments
X | Regressor matrix. |
w | Weights. |
Value
Vector of standard deviations of the parameter estimates.
Faster Weighted Mean
Description
Computes weighted mean using C++.
Usage
fastwmean(X, w)
Arguments
X | Regressor matrix. |
w | Weights. |
Value
Weighted mean.
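A column-wise weighted mean can be written in one line of base R. Whether fastwmean operates column by column and normalizes by the sum of the weights is an assumption here; the toy inputs are made up.

```r
# Column-wise weighted mean in base R (illustrative; assumes normalization
# by sum(w), which may or may not match the C++ routine exactly).
X <- cbind(a = c(1, 2, 3), b = c(10, 20, 40))
w <- c(1, 1, 2)

wmean <- colSums(X * w) / sum(w)            # row i of X is scaled by w[i]
check <- stats::weighted.mean(X[, "a"], w)  # base-R reference for column "a"
```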
Generating a List of Fixed Effects
Description
genfes generates a list of fixed effects by creating interactions of paired factors.
Usage
genfes(data, inter)
Arguments
data | A data frame including the factors. |
inter | A list: each element includes the variables to be interacted (both names and column |
Value
A list containing the desired interactions of vars, with the same length as inter.
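The idea of building fixed effects from interacted factors can be sketched with base R's interaction(). The toy data frame and the naming scheme below are hypothetical, and this is not necessarily how genfes is implemented.

```r
# Building a list of interacted fixed effects with base R (illustrative only).
df <- data.frame(
  exp  = c("A", "A", "B", "B"),
  imp  = c("B", "C", "A", "C"),
  time = c(1, 2, 1, 2)
)
inter <- list(c("exp", "time"), c("imp", "time"), c("exp", "imp"))

# One factor per element of `inter`, dropping unused level combinations
fes <- lapply(inter, function(vars) interaction(df[vars], drop = TRUE))
names(fes) <- vapply(inter, paste, character(1), collapse = "_")
```

The resulting list has one factor per requested interaction, which is the format expected by the functions that take a fes argument.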
Generating Model Structure
Description
genmodel transforms a data frame into the needed components for our main functions (a y vector, an x matrix and a fes list).
Usage
genmodel(
  data, dep = NULL, indep = NULL, fixed = NULL,
  cluster = NULL, selectobs = NULL
)
Arguments
data | A data frame containing all relevant variables. |
dep | A string with the name of the dependent variable or a column number. |
indep | A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
fixed | A vector with the names or column numbers of factor variables identifying the fixed effects, or a list with the desired interactions between variables in |
cluster | Optional. A string with the name of the clustering variable or a column number. It's also possible to input a vector with several variables, in which case the interaction of all of them is taken as the clustering variable. |
selectobs | Optional. A vector indicating which observations to use. |
Value
A list with four elements:
y: y vector.
x: x matrix.
fes: list of fixed effects.
cluster: cluster vector.
PPML Estimation with HDFE
Description
hdfeppml fits an (unpenalized) Poisson Pseudo Maximum Likelihood (PPML) model withhigh-dimensional fixed effects (HDFE).
Usage
hdfeppml(
  data, dep = 1, indep = NULL, fixed = NULL,
  cluster = NULL, selectobs = NULL, ...
)
Arguments
data | A data frame containing all relevant variables. |
dep | A string with the name of the dependent variable or a column number. |
indep | A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
fixed | A vector with the names or column numbers of factor variables identifying the fixed effects, or a list with the desired interactions between variables in |
cluster | Optional. A string with the name of the clustering variable or a column number. It's also possible to input a vector with several variables, in which case the interaction of all of them is taken as the clustering variable. |
selectobs | Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
... | Further options. For a full list, see hdfeppml_int. |
Details
This function is a thin wrapper around hdfeppml_int, providing a more convenient interface for data frames. Whereas the internal function requires some preliminary handling of data sets (y must be a vector, x must be a matrix and fixed effects fes must be provided in a list), the wrapper takes a full data frame in the data argument, and users can simply specify which variables correspond to y, x and the fixed effects, using either variable names or column numbers.
More formally, hdfeppml_int performs iteratively re-weighted least squares (IRLS) on a transformed model, as described in Correia, Guimarães and Zylkin (2020) and similar to the ppmlhdfe package in Stata. In each iteration, the function calculates the transformed dependent variable, partials out the fixed effects (calling collapse::fhdwithin) and then solves a weighted least squares problem (using a fast C++ implementation).
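To make the IRLS loop concrete, here is a minimal base-R sketch for a Poisson model with an intercept only (no high-dimensional fixed effects, so the partialling-out step is omitted). The simulated data, starting values and stopping rule are assumptions chosen for illustration and are not taken from the package.

```r
# Minimal IRLS for Poisson regression in base R (no fixed effects).
# Illustrative sketch; starting values and stopping rule are assumptions.
set.seed(123)
n <- 200
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- rpois(n, lambda = exp(0.5 + 0.3 * x[, 1] - 0.2 * x[, 2]))
X <- cbind(1, x)

mu   <- (y + mean(y)) / 2        # strictly positive starting conditional means
beta <- rep(0, ncol(X))
for (iter in 1:100) {
  eta <- log(mu)
  z   <- eta + (y - mu) / mu     # transformed (working) dependent variable
  beta_new <- lm.wfit(X, z, w = mu)$coefficients  # weighted least squares step
  mu  <- as.numeric(exp(X %*% beta_new))          # update conditional means
  converged <- max(abs(beta_new - beta)) < 1e-8
  beta <- beta_new
  if (converged) break
}

beta_glm <- coef(glm(y ~ x, family = poisson()))  # reference fit
```

With fixed effects, the same loop applies after z and the columns of X have been demeaned (with weights mu) in each iteration.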
Value
A list with the following elements:
coefficients: a 1 x ncol(x) matrix with coefficient (beta) estimates.
residuals: a 1 x length(y) matrix with the residuals of the model.
mu: a 1 x length(y) matrix with the final values of the conditional mean (mu).
deviance:
bic: Bayesian Information Criterion.
x_resid: matrix of demeaned regressors.
z_resid: vector of demeaned (transformed) dependent variable.
se: standard errors of the coefficients.
Examples
## Not run:
# To reduce run time, we keep only countries in the Americas:
americas <- countries$iso[countries$region == "Americas"]
test <- hdfeppml(data = trade[, -(5:6)],
                 dep = "export",
                 fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")),
                 selectobs = (trade$imp %in% americas) & (trade$exp %in% americas))
## End(Not run)
PPML Estimation with HDFE
Description
hdfeppml_int is the internal algorithm called by hdfeppml to fit an (unpenalized) Poisson Pseudo Maximum Likelihood (PPML) regression with high-dimensional fixed effects (HDFE). It takes a vector with the dependent variable, a regressor matrix and a set of fixed effects (in list form: each element in the list should be a separate HDFE).
Usage
hdfeppml_int(
  y, x = NULL, fes = NULL, tol = 1e-08, hdfetol = 1e-04, mu = NULL,
  saveX = TRUE, colcheck = TRUE, colcheck_x = colcheck,
  colcheck_x_fes = colcheck, init_z = NULL, verbose = FALSE,
  maxiter = 1000, cluster = NULL, vcv = TRUE
)
Arguments
y | Dependent variable (a vector). |
x | Regressor matrix. |
fes | List of fixed effects. |
tol | Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol | Tolerance parameter for the within-transformation step, passed on to |
mu | A vector of initial values for mu that can be passed to the command. |
saveX | Logical. If |
colcheck | Logical. If |
colcheck_x | Logical. If |
colcheck_x_fes | Logical. If |
init_z | Optional: initial values of the transformed dependent variable, to be used in the first iteration of the algorithm. |
verbose | Logical. If |
maxiter | Maximum number of iterations (a number). |
cluster | Optional: a vector classifying observations into clusters (to use when calculating SEs). |
vcv | Logical. If |
Details
More formally, hdfeppml_int performs iteratively re-weighted least squares (IRLS) on a transformed model, as described in Correia, Guimarães and Zylkin (2020) and similar to the ppmlhdfe package in Stata. In each iteration, the function calculates the transformed dependent variable, partials out the fixed effects (calling collapse::fhdwithin, which uses the algorithm in Gaure (2013)) and then solves a weighted least squares problem (using a fast C++ implementation).
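The partialling-out step can be illustrated with a small alternating-projections routine in the spirit of Gaure (2013): weighted group means are subtracted for each fixed effect in turn until the variable stops changing. demean_ap is a hypothetical helper written for this sketch on simulated data; it is not the algorithm as implemented in collapse::fhdwithin.

```r
# Weighted demeaning by alternating projections (illustrative sketch only).
set.seed(1)
n  <- 60
f1 <- factor(sample(letters[1:5], n, replace = TRUE))  # first fixed effect
f2 <- factor(sample(LETTERS[1:4], n, replace = TRUE))  # second fixed effect
w  <- runif(n, 0.5, 2)                                 # IRLS-style weights
v  <- rnorm(n)                                         # variable to demean

demean_ap <- function(v, fes, w, tol = 1e-10, maxiter = 1000) {
  for (iter in seq_len(maxiter)) {
    v_old <- v
    for (f in fes) {
      # weighted mean of v within each level of f
      gm <- rowsum(w * v, f)[, 1] / rowsum(w, f)[, 1]
      v  <- v - gm[as.character(f)]
    }
    if (max(abs(v - v_old)) < tol) break  # stop when a full sweep changes nothing
  }
  unname(v)
}

v_ap <- demean_ap(v, list(f1, f2), w)
# Reference: residuals from a weighted regression on both sets of dummies
v_lm <- unname(residuals(lm(v ~ f1 + f2, weights = w)))
```

The appeal of this approach is that it never builds the dummy-variable matrix, so memory use stays proportional to the number of observations even with many fixed-effect levels.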
Value
A list with the following elements:
coefficients: a 1 x ncol(x) matrix with coefficient (beta) estimates.
residuals: a 1 x length(y) matrix with the residuals of the model.
mu: a 1 x length(y) matrix with the final values of the conditional mean (mu).
deviance:
bic: Bayesian Information Criterion.
x_resid: matrix of demeaned regressors.
z_resid: vector of demeaned (transformed) dependent variable.
se: standard errors of the coefficients.
Examples
## Not run:
# To reduce run time, we keep only countries in the Americas:
americas <- countries$iso[countries$region == "Americas"]
trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ]
# Now generate the needed x, y and fes objects:
y <- trade$export
x <- data.matrix(trade[, -1:-6])
fes <- list(exp_time = interaction(trade$exp, trade$time),
            imp_time = interaction(trade$imp, trade$time),
            pair     = interaction(trade$exp, trade$imp))
# Finally, the call to hdfeppml_int:
reg <- hdfeppml_int(y = y, x = x, fes = fes)
## End(Not run)
Iceberg Lasso Implementation (in development)
Description
This function performs standard plugin lasso PPML estimation (without fixed effects) for several dependent variables in a single step. This is still IN DEVELOPMENT: at the current stage, only coefficient estimates are provided and there is no support for clustered errors.
Usage
iceberg(data, dep, indep = NULL, selectobs = NULL, ...)
Arguments
data | A data frame containing all relevant variables. |
dep | A vector with the names of the dependent variables or their column numbers. |
indep | A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
selectobs | Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
... | Further arguments, including:
|
Details
This function enables users to implement the "iceberg" step in the two-step procedure described in Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2021). To do this after using the plugin method in mlfitppml, just select all the variables with non-zero coefficients in dep and the remaining regressors in indep. The function will then perform separate lasso estimation on each of the selected dependent variables and report the coefficients.
Value
A matrix with coefficient estimates for all dependent variables.
Examples
iceberg_results <- iceberg(data = trade[, -(1:6)],
                           dep = c("ad_prov_14", "cp_prov_23", "tbt_prov_07",
                                   "tbt_prov_33", "tf_prov_41", "tf_prov_45"),
                           selectobs = (trade$time == "2016"))
Many Outer Products
Description
Compute a large number of outer products (useful for clustered SEs) using C++.
Usage
manyouter(A, B, c)
Arguments
A, B | Numeric vectors. |
c | Integer. |
General Penalized PPML Estimation
Description
mlfitppml is a general-purpose wrapper function for penalized PPML estimation. This is a flexible tool that allows users to select:
Penalty type: either lasso or ridge.
Penalty parameter: users can provide a single global value for lambda (a single regression is estimated), a vector of lambda values (the function estimates the regression using each of them, sequentially) or even coefficient-specific penalty weights.
Method: plugin lasso estimates can be obtained directly from this function too.
Cross-validation: if this option is enabled, the function uses IDs provided by the user to perform k-fold cross-validation and reports the resulting RMSE for all lambda values.
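Since cross-validation relies on fold IDs supplied by the user, it may help to see how such a vector could be built. The sketch below shows two base-R options: random folds at the observation level and folds assigned at the level of a grouping variable. The sample size, the number of folds and the pair variable are all hypothetical.

```r
# Two ways to build fold IDs for k-fold cross-validation (illustrative only).
set.seed(2024)
n <- 23   # number of observations (hypothetical)
k <- 5    # number of folds (hypothetical)

# (1) Random, roughly balanced folds at the observation level
fold_ids <- sample(rep(seq_len(k), length.out = n))

# (2) Folds assigned by group, so all observations of a pair share a fold
pair <- rep(paste0("pair", 1:8), each = 3)[seq_len(n)]
groups <- unique(pair)
group_fold <- sample(rep(seq_len(k), length.out = length(groups)))
fold_ids_grouped <- group_fold[match(pair, groups)]
```

Grouped folds are often preferable in panel settings, because observations from the same unit are then never split between the training and the hold-out sample.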
Usage
mlfitppml(
  data, dep = 1, indep = NULL, fixed = NULL,
  cluster = NULL, selectobs = NULL, ...
)
Arguments
data | A data frame containing all relevant variables. |
dep | A string with the name of the dependent variable or a column number. |
indep | A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
fixed | A vector with the names or column numbers of factor variables identifying the fixed effects, or a list with the desired interactions between variables in |
cluster | Optional. A string with the name of the clustering variable or a column number. It's also possible to input a vector with several variables, in which case the interaction of all of them is taken as the clustering variable. |
selectobs | Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
... | Further arguments, including:
For a full list of options, see mlfitppml_int. |
Details
This function is a thin wrapper around mlfitppml_int, providing a more convenient interface for data frames. Whereas the internal function requires some preliminary handling of data sets (y must be a vector, x must be a matrix and fes must be provided in a list), the wrapper takes a full data frame in the data argument, and users can simply specify which variables correspond to y, x and the fixed effects, using either variable names or column numbers.
For technical details on the algorithms used, see hdfeppml (post-lasso regression), penhdfeppml (standard penalized regression), penhdfeppml_cluster (plugin lasso), and xvalidate (cross-validation).
Value
A list with the following elements:
beta: if post = FALSE, a length(lambdas) x ncol(x) matrix with coefficient (beta) estimates from the penalized regressions. If post = TRUE, this is the matrix of coefficients from the post-penalty regressions.
beta_pre: if post = TRUE, a length(lambdas) x ncol(x) matrix with coefficient (beta) estimates from the penalized regressions.
bic: Bayesian Information Criterion.
lambdas: vector of penalty parameters.
ses: standard errors of the coefficients of the post-penalty regression. Note that these are only provided when post = TRUE.
rmse: if xval = TRUE, a matrix with the root mean squared error (RMSE, column 2) for each value of lambda (column 1), obtained by cross-validation.
phi: coefficient-specific penalty weights (only if method == "plugin").
Examples
## Not run:
# To reduce run time, we keep only countries in the Americas:
americas <- countries$iso[countries$region == "Americas"]
# Now we can use our main functions on the reduced trade data set:
test <- mlfitppml(data = trade[, -(5:6)],
                  dep = "export",
                  fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")),
                  selectobs = (trade$imp %in% americas) & (trade$exp %in% americas),
                  lambdas = c(0.01, 0.001), tol = 1e-6, hdfetol = 1e-2)
## End(Not run)
General Penalized PPML Estimation
Description
mlfitppml_int is the internal wrapper called by mlfitppml for penalized PPML estimation. This in turn calls penhdfeppml_int, penhdfeppml_cluster_int and hdfeppml_int as needed. It takes a vector with the dependent variable, a regressor matrix and a set of fixed effects (in list form: each element in the list should be a separate HDFE). This is a flexible tool that allows users to select:
Penalty type: either lasso or ridge.
Penalty parameter: users can provide a single global value for lambda (a single regression is estimated), a vector of lambda values (the function estimates the regression using each of them, sequentially) or even coefficient-specific penalty weights.
Method: plugin lasso estimates can be obtained directly from this function too.
Cross-validation: if this option is enabled, the function uses IDs provided by the user to perform k-fold cross-validation and reports the resulting RMSE for all lambda values.
Usage
mlfitppml_int(
  y, x, fes, lambdas, penalty = "lasso", tol = 1e-08, hdfetol = 1e-04,
  colcheck = TRUE, colcheck_x = colcheck, colcheck_x_fes = colcheck,
  post = TRUE, cluster = NULL, method = "bic", IDs = 1:n, verbose = FALSE,
  xval = FALSE, standardize = TRUE, vcv = TRUE, phipost = TRUE,
  penweights = NULL, K = 15, gamma_val = NULL, mu = NULL
)
Arguments
y | Dependent variable (a vector). |
x | Regressor matrix. |
fes | List of fixed effects. |
lambdas | Vector of penalty parameters. |
penalty | A string indicating the penalty type. Currently supported: "lasso" and "ridge". |
tol | Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol | Tolerance parameter for the within-transformation step, passed on to |
colcheck | Logical. If |
colcheck_x | Logical. If |
colcheck_x_fes | Logical. If |
post | Logical. If |
cluster | Optional: a vector classifying observations into clusters (to use when calculating SEs). |
method | The user can set this equal to "plugin" to perform the plugin algorithm with coefficient-specific penalty weights (see details). Otherwise, a single global penalty is used. |
IDs | A vector of fold IDs for k-fold cross validation. If left unspecified, each observation is assigned to a different fold (warning: this is likely to be very resource-intensive). |
verbose | Logical. If |
xval | Logical. If |
standardize | Logical. If |
vcv | Logical. If |
phipost | Logical. If |
penweights | Optional: a vector of coefficient-specific penalties to use in plugin lasso when |
K | Maximum number of iterations for the plugin algorithm to converge. |
gamma_val | Numerical value that determines the regularization threshold as defined in Belloni, Chernozhukov, Hansen, and Kozbur (2016). NULL default sets parameter to 0.1/log(n). |
mu | A vector of initial values for mu that can be passed to the command. |
Details
For technical details on the algorithms used, see hdfeppml_int (post-lasso regression), penhdfeppml_int (standard penalized regression), penhdfeppml_cluster_int (plugin lasso), and xvalidate (cross-validation).
Value
A list with the following elements:
beta: if post = FALSE, a length(lambdas) x ncol(x) matrix with coefficient (beta) estimates from the penalized regressions. If post = TRUE, this is the matrix of coefficients from the post-penalty regressions.
beta_pre: if post = TRUE, a length(lambdas) x ncol(x) matrix with coefficient (beta) estimates from the penalized regressions.
bic: Bayesian Information Criterion.
lambdas: vector of penalty parameters.
ses: standard errors of the coefficients of the post-penalty regression. Note that these are only provided when post = TRUE.
rmse: if xval = TRUE, a matrix with the root mean squared error (RMSE, column 2) for each value of lambda (column 1), obtained by cross-validation.
phi: coefficient-specific penalty weights (only if method == "plugin").
References
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", Stata Journal, 20, 90-115.
Gaure, S. (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
Examples
## Not run:
# First, we need to transform the data (this is what mlfitppml handles internally). Start by
# filtering the data set to keep only countries in the Americas:
americas <- countries$iso[countries$region == "Americas"]
trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ]
# Now generate the needed x, y and fes objects:
y <- trade$export
x <- data.matrix(trade[, -1:-6])
fes <- list(exp_time = interaction(trade$exp, trade$time),
            imp_time = interaction(trade$imp, trade$time),
            pair = interaction(trade$exp, trade$imp))
# Finally, we try mlfitppml_int with a lasso penalty (the default) and two lambda values:
reg <- mlfitppml_int(y = y, x = x, fes = fes, lambdas = c(0.1, 0.01))
# We can also try plugin lasso:
reg <- mlfitppml_int(y = y, x = x, fes = fes, cluster = fes$pair, method = "plugin")
# For an example with cross-validation, please see the vignette.
## End(Not run)

One-Shot Penalized PPML Estimation with HDFE
Description
penhdfeppml fits a penalized PPML regression for a given type of penalty and a given value of the penalty parameter. The penalty can be either lasso or ridge, and the plugin method can be enabled via the method argument.
Usage
penhdfeppml(
  data,
  dep = 1,
  indep = NULL,
  fixed = NULL,
  cluster = NULL,
  selectobs = NULL,
  ...
)
Arguments
data | A data frame containing all relevant variables. |
dep | A string with the name of the dependent variable or a column number. |
indep | A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
fixed | A vector with the names or column numbers of factor variables identifying the fixed effects, or a list with the desired interactions between variables in |
cluster | Optional. A string with the name of the clustering variable or a column number. It's also possible to input a vector with several variables, in which case the interaction of all of them is taken as the clustering variable. |
selectobs | Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
... | Further options, including:
For a full list of options, see penhdfeppml_int. |
Details
This function is a thin wrapper around penhdfeppml_int, providing a more convenient interface for data frames. Whereas the internal function requires some preliminary handling of data sets (y must be a vector, x must be a matrix and fes must be provided in a list), the wrapper takes a full data frame in the data argument, and users can simply specify which variables correspond to y, x and the fixed effects, using either variable names or column numbers.
More formally, penhdfeppml_int performs iteratively re-weighted least squares (IRLS) on a transformed model, as described in Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2021). In each iteration, the function calculates the transformed dependent variable, partials out the fixed effects (calling collapse::fhdwithin) and then calls glmnet::glmnet if the selected penalty is lasso (the default). If the user has selected ridge, the analytical solution is instead computed directly using a fast C++ implementation.
For information on how the plugin lasso method works, see penhdfeppml_cluster.
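For intuition on the ridge branch mentioned above, here is a minimal base R sketch of a single weighted ridge step. It is not the package's C++ routine: the helper name ridge_step, the unscaled penalty lambda * I and the simulated inputs are all assumptions made for illustration, and the package may normalise the objective differently.

```r
# Hypothetical helper (not part of penppml): closed-form weighted ridge step,
# b = (X'WX + lambda * I)^{-1} X'Wz, where W = diag(w).
ridge_step <- function(x, z, w, lambda) {
  xw <- x * w                                     # scales row i of x by w[i]
  A <- crossprod(xw, x) + lambda * diag(ncol(x))  # X'WX + lambda * I
  b <- crossprod(xw, z)                           # X'Wz
  drop(solve(A, b))
}

# Simulated inputs standing in for the demeaned regressors, the transformed
# dependent variable and the IRLS weights.
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
w <- runif(100, 0.5, 2)
z <- drop(x %*% c(1, -0.5)) + rnorm(100, sd = 0.1)

b0 <- ridge_step(x, z, w, lambda = 0)    # no penalty: weighted least squares
b1 <- ridge_step(x, z, w, lambda = 50)   # penalized: coefficients shrink
```

With lambda = 0 the step reduces to ordinary weighted least squares, which gives a simple way to sanity-check the algebra.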
Value
If penalty == "lasso" (the default), an object of class elnet with the elements described in glmnet, as well as:
- mu: a 1 x length(y) matrix with the final values of the conditional mean \mu.
- deviance.
- bic: Bayesian Information Criterion.
- phi: coefficient-specific penalty weights (only if method == "plugin").
- x_resid: matrix of demeaned regressors.
- z_resid: vector of demeaned (transformed) dependent variable.
If penalty == "ridge", a list with the following elements:
- beta: a 1 x ncol(x) matrix with coefficient (beta) estimates.
- mu: a 1 x length(y) matrix with the final values of the conditional mean \mu.
- deviance.
- bic: Bayesian Information Criterion.
- x_resid: matrix of demeaned regressors.
- z_resid: vector of demeaned (transformed) dependent variable.
References
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", Stata Journal, 20, 90-115.
Gaure, S. (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
Examples
## Not run:
# To reduce run time, we keep only countries in the Americas:
americas <- countries$iso[countries$region == "Americas"]
test <- penhdfeppml(data = trade[, -(5:6)],
                    dep = "export",
                    fixed = list(c("exp", "time"),
                                 c("imp", "time"),
                                 c("exp", "imp")),
                    lambda = 0.05,
                    selectobs = (trade$imp %in% americas) & (trade$exp %in% americas))
## End(Not run)

Plugin Lasso Estimation
Description
Performs plugin lasso - PPML estimation with HDFE. This is an internal function, called by mlfitppml and penhdfeppml when users select the method = "plugin" option, but it's made available as a stand-alone option for advanced users who may prefer to avoid some overhead imposed by the wrappers.
Usage
penhdfeppml_cluster(
  data,
  dep = 1,
  indep = NULL,
  fixed = NULL,
  cluster = NULL,
  selectobs = NULL,
  ...
)
Arguments
data | A data frame containing all relevant variables. |
dep | A string with the name of the dependent variable or a column number. |
indep | A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
fixed | A vector with the names or column numbers of factor variables identifying the fixed effects, or a list with the desired interactions between variables in |
cluster | A string with the name of the clustering variable or a column number. It's also possible to input a vector with several variables, in which case the interaction of all of them is taken as the clustering variable. Note that this is NOT OPTIONAL in this case: our plugin algorithm requires clusters to be specified. |
selectobs | Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
... | Further options. For a full list of options, see penhdfeppml_cluster_int. |
Details
This function is a thin wrapper around penhdfeppml_cluster_int, providing a more convenient interface for data frames. Whereas the internal function requires some preliminary handling of data sets (y must be a vector, x must be a matrix and fes must be provided in a list), the wrapper takes a full data frame in the data argument, and users can simply specify which variables correspond to y, x and the fixed effects, using either variable names or column numbers.
The plugin method uses coefficient-specific penalty weights that account for heteroskedasticity. The penalty parameters are calculated automatically by the function using statistical theory - for a brief discussion of this, see Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2021), and for a more in-depth analysis, check Belloni, Chernozhukov, Hansen, and Kozbur (2016), which introduced the specific implementation used in this package. Heuristically, the penalty parameters are set at a level high enough that the absolute value of the score for each regressor must be statistically large relative to its standard error in order for the regressor to be selected.
Value
An object of class elnet with the elements described in glmnet, as well as the following:
- mu: a 1 x length(y) matrix with the final values of the conditional mean \mu.
- deviance.
- bic: Bayesian Information Criterion.
- phi: coefficient-specific penalty weights.
- x_resid: matrix of demeaned regressors.
- z_resid: vector of demeaned (transformed) dependent variable.
References
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", Stata Journal, 20, 90-115.
Gaure, S. (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
Examples
## Not run:
# To reduce run time, we keep only countries in the Americas:
americas <- countries$iso[countries$region == "Americas"]
test <- penhdfeppml_cluster(data = trade[, -(5:6)],
                            dep = "export",
                            fixed = list(c("exp", "time"),
                                         c("imp", "time"),
                                         c("exp", "imp")),
                            cluster = c("exp", "imp"),
                            selectobs = (trade$imp %in% americas) & (trade$exp %in% americas),
                            tol = 1e-5, hdfetol = 1e-1)
## End(Not run)

Plugin Lasso Estimation
Description
Performs plugin lasso - PPML estimation with HDFE. This is an internal function, called by mlfitppml_int and penhdfeppml_int when users select the method = "plugin" option, but it's made available as a stand-alone option for advanced users who may prefer to avoid some overhead imposed by the wrappers.
Usage
penhdfeppml_cluster_int(
  y,
  x,
  fes,
  cluster,
  tol = 1e-08,
  hdfetol = 1e-04,
  glmnettol = 1e-12,
  penalty = "lasso",
  penweights = NULL,
  saveX = TRUE,
  mu = NULL,
  colcheck = TRUE,
  colcheck_x = colcheck,
  colcheck_x_fes = colcheck,
  K = 15,
  init_z = NULL,
  post = FALSE,
  verbose = FALSE,
  lambda = NULL,
  phipost = TRUE,
  gamma_val = NULL
)
Arguments
y | Dependent variable (a vector) |
x | Regressor matrix. |
fes | List of fixed effects. |
cluster | Optional: a vector classifying observations into clusters (to use when calculating SEs). |
tol | Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol | Tolerance parameter for the within-transformation step, passed on to |
glmnettol | Tolerance parameter to be passed on to |
penalty | Only "lasso" is supported at the present stage. |
penweights | Optional: a vector of coefficient-specific penalties to use in plugin lasso when |
saveX | Logical. If |
mu | A vector of initial values for mu that can be passed to the command. |
colcheck | Logical. If |
colcheck_x | Logical. If |
colcheck_x_fes | Logical. If |
K | Maximum number of iterations. |
init_z | Optional: initial values of the transformed dependent variable, to be used in the first iteration of the algorithm. |
post | Logical. If |
verbose | Logical. If |
lambda | Penalty parameter (a number). |
phipost | Logical. If |
gamma_val | Numerical value that determines the regularization threshold as defined in Belloni, Chernozhukov, Hansen, and Kozbur (2016). NULL default sets parameter to 0.1/log(n). |
Details
The plugin method uses coefficient-specific penalty weights that account for heteroskedasticity. The penalty parameters are calculated automatically by the function using statistical theory - for a brief discussion of this, see Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2021), and for a more in-depth analysis, check Belloni, Chernozhukov, Hansen, and Kozbur (2016), which introduced the specific implementation used in this package. Heuristically, the penalty parameters are set at a level high enough that the absolute value of the score for each regressor must be statistically large relative to its standard error in order for the regressor to be selected.
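As a rough guide to the heuristic above, the cluster-robust penalty loadings in the style of Belloni, Chernozhukov, Hansen and Kozbur (2016) can be written as below. This is a sketch under stated assumptions, not a transcription of the package's code: the symbols (demeaned regressor \tilde{x}_{ik}, residual \hat{e}_i, clusters c = 1, ..., C, number of regressors p) are notation introduced here, the constant c is left unspecified, and the exact normalisation of \lambda depends on how the objective function is scaled. Only the default \gamma = 0.1 / \log(n) is taken from the gamma_val argument description.

```latex
\hat{\phi}_k^2 \;=\; \frac{1}{n} \sum_{c=1}^{C} \Big( \sum_{i \in c} \tilde{x}_{ik}\, \hat{e}_i \Big)^2,
\qquad
\lambda \;\propto\; c \,\sqrt{n}\; \Phi^{-1}\!\Big( 1 - \frac{\gamma}{2p} \Big),
\qquad
\gamma \;=\; \frac{0.1}{\log n}.
```

Under this reading, regressor k is penalized by \lambda \hat{\phi}_k, so a regressor is retained only when its score is large relative to the estimated cluster-robust standard deviation of that score.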
Value
An object of class elnet with the elements described in glmnet, as well as the following:
- mu: a 1 x length(y) matrix with the final values of the conditional mean \mu.
- deviance.
- bic: Bayesian Information Criterion.
- phi: coefficient-specific penalty weights.
- x_resid: matrix of demeaned regressors.
- z_resid: vector of demeaned (transformed) dependent variable.
References
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", Stata Journal, 20, 90-115.
Gaure, S. (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
Examples
## Not run:
# To reduce run time, we keep only countries in Latin America and the Caribbean:
LatAmericaCar <- countries$iso[countries$sub.region == "Latin America and the Caribbean"]
trade <- trade[(trade$imp %in% LatAmericaCar) & (trade$exp %in% LatAmericaCar), ]
# Now generate the needed x, y and fes objects:
y <- trade$export
x <- data.matrix(trade[, -1:-6])
fes <- list(exp_time = interaction(trade$exp, trade$time),
            imp_time = interaction(trade$imp, trade$time),
            pair = interaction(trade$exp, trade$imp))
# Finally, we try penhdfeppml_cluster_int:
reg <- penhdfeppml_cluster_int(y = y, x = x, fes = fes, cluster = fes$pair)
## End(Not run)

One-Shot Penalized PPML Estimation with HDFE
Description
penhdfeppml_int is the internal algorithm called by penhdfeppml to fit a penalized PPML regression for a given type of penalty and a given value of the penalty parameter. It takes a vector with the dependent variable, a regressor matrix and a set of fixed effects (in list form: each element in the list should be a separate HDFE). The penalty can be either lasso or ridge, and the plugin method can be enabled via the method argument.
Usage
penhdfeppml_int(
  y,
  x,
  fes,
  lambda,
  tol = 1e-08,
  hdfetol = 1e-04,
  glmnettol = 1e-12,
  penalty = "lasso",
  penweights = NULL,
  saveX = TRUE,
  mu = NULL,
  colcheck = TRUE,
  colcheck_x = colcheck,
  colcheck_x_fes = colcheck,
  init_z = NULL,
  post = FALSE,
  verbose = FALSE,
  phipost = TRUE,
  standardize = TRUE,
  method = "placeholder",
  cluster = NULL,
  debug = FALSE,
  gamma_val = NULL
)
Arguments
y | Dependent variable (a vector) |
x | Regressor matrix. |
fes | List of fixed effects. |
lambda | Penalty parameter (a number). |
tol | Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol | Tolerance parameter for the within-transformation step, passed on to |
glmnettol | Tolerance parameter to be passed on to |
penalty | A string indicating the penalty type. Currently supported: "lasso" and "ridge". |
penweights | Optional: a vector of coefficient-specific penalties to use in plugin lasso when |
saveX | Logical. If |
mu | A vector of initial values for mu that can be passed to the command. |
colcheck | Logical. If |
colcheck_x | Logical. If |
colcheck_x_fes | Logical. If |
init_z | Optional: initial values of the transformed dependent variable, to be used in the first iteration of the algorithm. |
post | Logical. If |
verbose | Logical. If |
phipost | Logical. If |
standardize | Logical. If |
method | The user can set this equal to "plugin" to perform the plugin algorithm with coefficient-specific penalty weights (see details). Otherwise, a single global penalty is used. |
cluster | Optional: a vector classifying observations into clusters (to use when calculating SEs). |
debug | Logical. If |
gamma_val | Numerical value that determines the regularization threshold as defined in Belloni, Chernozhukov, Hansen, and Kozbur (2016). NULL default sets parameter to 0.1/log(n). |
Details
More formally, penhdfeppml_int performs iteratively re-weighted least squares (IRLS) on a transformed model, as described in Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2021). In each iteration, the function calculates the transformed dependent variable, partials out the fixed effects (calling collapse::fhdwithin) and then calls glmnet if the selected penalty is lasso (the default). If the user selects ridge, the analytical solution is instead computed directly using a fast C++ implementation.
For information on the plugin lasso method, see penhdfeppml_cluster_int.
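The IRLS loop described above can be illustrated with a stripped-down base R sketch. To keep it self-contained it has no fixed effects and no penalty, so the weighted least squares step uses lm.wfit where the package would first partial out the fixed effects and then call glmnet or the ridge solver. The helper name ppml_irls, the starting values and the stopping rule are assumptions made for this example, not the package's own choices.

```r
# Hypothetical helper (not part of penppml): unpenalized Poisson IRLS.
ppml_irls <- function(y, x, tol = 1e-08, maxit = 100) {
  mu <- (y + mean(y)) / 2            # strictly positive starting values
  dev_old <- Inf
  for (it in seq_len(maxit)) {
    eta <- log(mu)
    z <- eta + (y - mu) / mu         # transformed dependent variable
    fit <- lm.wfit(x, z, w = mu)     # weighted least squares, weights = mu
    beta <- fit$coefficients
    mu <- exp(drop(x %*% beta))
    # Poisson deviance, using the convention 0 * log(0) = 0
    dev <- 2 * sum(ifelse(y > 0, y * log(y / mu), 0) - (y - mu))
    if (abs(dev - dev_old) / (0.1 + abs(dev)) < tol) break
    dev_old <- dev
  }
  list(beta = beta, mu = mu, deviance = dev, iterations = it)
}

# Simulated data: an intercept column plus two regressors.
set.seed(42)
x <- cbind(1, matrix(rnorm(400), ncol = 2))
y <- rpois(200, exp(drop(x %*% c(0.5, 0.3, -0.2))))
fit <- ppml_irls(y, x)
```

In this unpenalized case the estimates should agree with glm(y ~ x - 1, family = poisson()), which is a convenient check that the transformed variable and weights are set up correctly.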
Value
If penalty == "lasso" (the default), an object of class elnet with the elements described in glmnet, as well as:
- mu: a 1 x length(y) matrix with the final values of the conditional mean \mu.
- deviance.
- bic: Bayesian Information Criterion.
- phi: coefficient-specific penalty weights (only if method == "plugin").
- x_resid: matrix of demeaned regressors.
- z_resid: vector of demeaned (transformed) dependent variable.
If penalty == "ridge", a list with the following elements:
- beta: a 1 x ncol(x) matrix with coefficient (beta) estimates.
- mu: a 1 x length(y) matrix with the final values of the conditional mean \mu.
- deviance.
- bic: Bayesian Information Criterion.
- x_resid: matrix of demeaned regressors.
- z_resid: vector of demeaned (transformed) dependent variable.
References
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", Stata Journal, 20, 90-115.
Gaure, S. (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
Examples
## Not run:
# To reduce run time, we keep only countries in the Americas:
americas <- countries$iso[countries$region == "Americas"]
trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ]
# Now generate the needed x, y and fes objects:
y <- trade$export
x <- data.matrix(trade[, -1:-6])
fes <- list(exp_time = interaction(trade$exp, trade$time),
            imp_time = interaction(trade$imp, trade$time),
            pair = interaction(trade$exp, trade$imp))
# Finally, we try penhdfeppml_int with a lasso penalty (the default):
reg <- penhdfeppml_int(y = y, x = x, fes = fes, lambda = 0.1)
# We can also try ridge:
reg <- penhdfeppml_int(y = y, x = x, fes = fes, lambda = 0.1, penalty = "ridge")
## End(Not run)

Iceberg Lasso Implementation (in development)
Description
This is the internal function upon which the iceberg wrapper is built. It performs standard plugin lasso PPML estimation without fixed effects, relying on glmnet::glmnet. Like the other internals in the package, it needs a y vector and an x matrix.
Usage
plugin_lasso_int(
  y,
  x,
  tol = 1e-08,
  glmnettol = 1e-12,
  penweights = NULL,
  colcheck = FALSE,
  K = 50,
  verbose = FALSE,
  lambda = NULL,
  icepost = FALSE
)
Arguments
y | Dependent variable (a vector). |
x | Regressor matrix. |
tol | Tolerance parameter for convergence of the IRLS algorithm. |
glmnettol | Tolerance parameter to be passed on to |
penweights | Optional: a vector of coefficient-specific penalties to use in plugin lasso. |
colcheck | Logical. If |
K | Maximum number of iterations. |
verbose | Logical. If |
lambda | Penalty parameter (a number). |
icepost | Logical. If |
Value
A list with 14 elements, including beta, which is the only one we use in the wrapper. For a full list, see glmnet.
Filtering fixed effect lists
Description
A helper function for xvalidate that filters a list of fixed effects and returns the modified list. Used to split the fixed effects for cross-validation.
Usage
select_fes(fe_list, select_obs, list = TRUE)
Arguments
fe_list | A list of fixed effects. |
select_obs | A vector of selected observations / rows. |
list | Logical. If |
Value
A modified list of fixed effects.
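The filtering step can be pictured with a small base R sketch. This is an illustrative re-implementation, not the package's select_fes: the helper name filter_fes is hypothetical, and dropping unused factor levels after subsetting is an assumption about what is convenient for cross-validation rather than documented behaviour.

```r
# Hypothetical helper (not part of penppml): subset every fixed effect in a
# list to the selected observations.
filter_fes <- function(fe_list, select_obs) {
  lapply(fe_list, function(fe) droplevels(as.factor(fe)[select_obs]))
}

# Toy fixed effects for five observations.
fes <- list(exporter = factor(c("A", "A", "B", "B", "C")),
            year     = factor(c(1, 2, 1, 2, 1)))
train <- c(TRUE, TRUE, TRUE, TRUE, FALSE)   # hold out the last observation
fes_train <- filter_fes(fes, train)
```

Each element of fes_train has one entry per retained observation, so it lines up with y[train] and x[train, ] when the model is refitted on the training folds.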
Weighted Standardization
Description
Performs weighted standardization of x variables. Used in fastridge.
Usage
standardize_wt(x, weights = rep(1/n, n), intercept = TRUE, return.sd = FALSE)
Arguments
x | Regressor matrix. |
weights | Weights. |
intercept | Logical. If |
return.sd | Logical. If |
Value
If return.sd == FALSE, it returns the matrix of standardized regressors. If return.sd == TRUE, it returns the vector of standard errors of the means of the variables.
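To make the idea of weighted standardization concrete, the sketch below centers each column at its weighted mean and scales it by its weighted standard deviation. It is a minimal illustration under assumptions: the helper name standardize_sketch is hypothetical, the weights are normalised to sum to one (consistent with the rep(1/n, n) default shown in the usage), and the intercept and return.sd options of standardize_wt are not reproduced.

```r
# Hypothetical helper (not part of penppml): weighted centering and scaling.
standardize_sketch <- function(x, weights = rep(1 / nrow(x), nrow(x))) {
  w <- weights / sum(weights)            # normalise weights to sum to one
  means <- colSums(x * w)                # weighted column means
  centered <- sweep(x, 2, means)         # subtract the weighted means
  sds <- sqrt(colSums(centered^2 * w))   # weighted standard deviations
  sweep(centered, 2, sds, "/")
}

set.seed(7)
x <- matrix(rnorm(60, mean = 3, sd = 2), ncol = 3)
w <- runif(20)
xs <- standardize_sketch(x, w)
```

After the transformation every column of xs has weighted mean zero and weighted variance one, which puts the regressors on a common scale before a penalty is applied.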
International trade agreements data set
Description
A panel data set containing bilateral trade flows between 210 exporters and 262 importers between 1964 and 2016. The data set also contains information about trade agreements in force between country pairs, as well as 16 dummies for specific provisions in those agreements (a small selection from a broader data set).
Usage
trade
Format
A data frame with 194,092 rows and 22 variables:
- exp
Exporter country (ISO 3166 code)
- imp
Importer country (ISO 3166 code).
- time
Year.
- export
Merchandise trade exports in USD.
- id
Agreement ID code.
- agreement
Agreement name.
- ad_prov_14
Anti-dumping actions allowed and with specific provisions for material injury.
- cp_prov_23
Does the agreement contain provisions that promote transparency?
- tbt_prov_07
Technical Regulations - Is the use of international standards promoted?
- tbt_prov_33
Does the agreement go beyond the TBT (Technical Barriers to Trade) Agreement?
- tf_prov_41
Harmonization and common legal framework
- tf_prov_45
Issuance of proof of origin
- ser_prov_47
Does the agreement contain a standstill provision?
- inv_prov_22
Does the agreement grant Fair and Equitable Treatment (FET)?
- et_prov_38
Prohibits export-related performance requirements, subject to exemptions.
- ipr_prov_44
Stipulates that GIs can be registered and protected through a TM system
- env_prov_18
Does the agreement require states to control ozone-depleting substances?
- ipr_prov_15
Incorporates/reaffirms all multilateral agreements to which both parties are a party (general obligation)
- moc_prov_21
Does the transfer provision explicitly exclude “good faith and non-discriminatory application of its laws” related to bankruptcy, insolvency or creditor rights protection?
- ste_prov_30
Does the agreement regulate subsidization to state enterprises?
- lm_prov_10
Does the agreement include reference to internationally recognized labor standards?
- cp_prov_26
Does the agreement regulate consumer protection?
Source
Data on international trade flows was obtained from Comtrade. Provision data comes from: Mattoo, A., N. Rocha, M. Ruta (2020). Handbook of deep trade agreements. Washington, DC: World Bank.
XeeX Matrix Computation
Description
Given matrix ee' and matrix X, compute X(k)'ee'X(k) for each regressor X(k).
Usage
xeex(X, e, S)
Arguments
X | Regressor matrix. |
e | Residuals. |
S | Cluster sizes. |
Value
The matrix product X(k)'ee'X(k).
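One way to read this quantity is as a clustered sum of squared scores: if ee' is treated as block-diagonal across clusters, X(k)'ee'X(k) equals the sum over clusters of the squared within-cluster sum of x_ik * e_i. The base R sketch below follows that reading. It is an illustration only: the helper name xeex_sketch is hypothetical, it assumes observations are sorted by cluster so that S lists consecutive block sizes, and it returns one number per regressor, which may differ from the shape returned by the package's C++ routine.

```r
# Hypothetical helper (not part of penppml): clustered X(k)'ee'X(k) for each
# column k of X, without ever forming the n x n matrix ee'.
xeex_sketch <- function(X, e, S) {
  cluster <- rep(seq_along(S), times = S)  # cluster label for each row
  scores <- rowsum(X * e, cluster)         # per-cluster sums of x_ik * e_i
  colSums(scores^2)                        # sum of squared cluster scores
}

set.seed(3)
S <- c(2, 3, 1, 4)                         # four clusters, ten observations
X <- matrix(rnorm(sum(S) * 2), ncol = 2)
e <- rnorm(sum(S))
out <- xeex_sketch(X, e, S)
```

Working with per-cluster sums keeps memory use linear in the number of observations, whereas the explicit matrix ee' would be n x n.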
Implementing Cross Validation
Description
This is the internal function called by mlfitppml_int to perform cross-validation, if the option is enabled. It is also available on a stand-alone basis in case it is needed, but generally users will be better served by using the wrapper mlfitppml.
Usage
xvalidate(
  y,
  x,
  fes,
  IDs,
  testID = NULL,
  tol = 1e-08,
  hdfetol = 1e-04,
  colcheck_x = TRUE,
  colcheck_x_fes = TRUE,
  init_mu = NULL,
  init_x = NULL,
  init_z = NULL,
  verbose = FALSE,
  cluster = NULL,
  penalty = "lasso",
  method = "placeholder",
  standardize = TRUE,
  penweights = rep(1, ncol(x_reg)),
  lambda = 0
)
Arguments
y | Dependent variable (a vector) |
x | Regressor matrix. |
fes | List of fixed effects. |
IDs | A vector of fold IDs for k-fold cross validation. If left unspecified, each observation is assigned to a different fold (warning: this is likely to be very resource-intensive). |
testID | Optional. A number indicating which ID to hold out during cross-validation. If left unspecified, the function cycles through all IDs and reports the average RMSE. |
tol | Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol | Tolerance parameter for the within-transformation step, passed on to |
colcheck_x | Logical. If |
colcheck_x_fes | Logical. If |
init_mu | Optional: initial values of the conditional mean |
init_x | Optional: initial values of the independent variables. |
init_z | Optional: initial values of the transformed dependent variable, to be used in the first iteration of the algorithm. |
verbose | Logical. If |
cluster | Optional: a vector classifying observations into clusters (to use when calculating SEs). |
penalty | A string indicating the penalty type. Currently supported: "lasso" and "ridge". |
method | The user can set this equal to "plugin" to perform the plugin algorithm with coefficient-specific penalty weights (see details). Otherwise, a single global penalty is used. |
standardize | Logical. If |
penweights | Optional: a vector of coefficient-specific penalties to use in plugin lasso when |
lambda | Penalty parameter, to be passed on to penhdfeppml_int or penhdfeppml_cluster_int. |
Details
xvalidate carries out cross-validation with the user-provided IDs by holding out each one of them, sequentially, as in the k-fold procedure (unless testID is specified, in which case it just uses this ID for validation). After filtering out the holdout sample, the function calls penhdfeppml_int or penhdfeppml_cluster_int to estimate the coefficients, predicts the conditional means for the held-out observations and finally calculates the root mean squared error (RMSE).
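The hold-out loop described above can be sketched in base R as follows. This is a simplified stand-in, not the package's code: an unpenalized Poisson fit via glm.fit replaces penhdfeppml_int, there are no fixed effects, the helper name kfold_rmse is hypothetical, and the RMSE is pooled over all held-out predictions, which may differ from how xvalidate averages across folds.

```r
# Hypothetical helper (not part of penppml): k-fold hold-out loop with a plain
# Poisson fit standing in for the penalized estimator.
kfold_rmse <- function(y, x, IDs) {
  mu <- numeric(length(y))
  for (id in unique(IDs)) {
    held_out <- IDs == id
    x_train <- cbind(1, x[!held_out, , drop = FALSE])
    x_test  <- cbind(1, x[held_out, , drop = FALSE])
    # Estimate on the training folds only
    fit <- glm.fit(x_train, y[!held_out], family = poisson())
    # Predict conditional means for the held-out fold
    mu[held_out] <- exp(drop(x_test %*% fit$coefficients))
  }
  list(rmse = sqrt(mean((y - mu)^2)), mu = mu)
}

set.seed(123)
n <- 300
x <- matrix(rnorm(2 * n), ncol = 2)
y <- rpois(n, exp(0.2 + drop(x %*% c(0.4, -0.3))))
IDs <- sample(1:5, size = n, replace = TRUE)   # five folds
cv <- kfold_rmse(y, x, IDs)
```

Every entry of cv$mu is an out-of-sample prediction, because each observation is predicted by a model that never saw its fold.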
Value
A list with two elements:
- rmse: root mean squared error (RMSE).
- mu: conditional means.
References
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", Stata Journal, 20, 90-115.
Gaure, S. (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
Examples
# First, we need to transform the data. Start by filtering the data set to keep only countries in
# the Americas:
americas <- countries$iso[countries$region == "Americas"]
trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ]
# Now generate the needed x, y and fes objects:
y <- trade$export
x <- data.matrix(trade[, -1:-6])
fes <- list(exp_time = interaction(trade$exp, trade$time),
            imp_time = interaction(trade$imp, trade$time),
            pair = interaction(trade$exp, trade$imp))
# We also need to create the IDs. We split the data set by agreement, not observation:
id <- unique(trade[, 5])
nfolds <- 10
unique_ids <- data.frame(id = id, fold = sample(1:nfolds, size = length(id), replace = TRUE))
# match() keeps the folds in the same row order as trade (merge() would sort the rows by id):
cross_ids <- data.frame(id = trade[, 5],
                        fold = unique_ids$fold[match(trade[, 5], unique_ids$id)])
# Finally, we try xvalidate with a lasso penalty (the default) and a single lambda value:
## Not run:
reg <- xvalidate(y = y, x = x, fes = fes, lambda = 0.001,
                 IDs = cross_ids$fold, verbose = TRUE)
## End(Not run)