The R package `pcpr` implements Principal Component Pursuit (PCP), a robust dimensionality reduction technique, for pattern recognition tailored to environmental health (EH) data. The statistical methodology and computational details are provided in Gibson et al. (2022).

In this code-free vignette, we present a crash course in PCP's theoretical background, so researchers can better navigate all of the functionality offered in `pcpr`. We will touch upon:

- The estimated `L` and `S` matrices
- The PCP models offered in `pcpr`
- PCP's parameters `lambda`, `mu`, and `eta`
- Cross-validated parameter tuning with `grid_search_cv()`

We recommend users skim this crash course before reading the two code-heavy vignettes:

- `vignette("pcp-quickstart")`: applying PCP to a simulated environmental mixture
- `vignette("pcp-applied")`: employing PCP for source apportionment of real-world PM2.5 air pollution concentration data using the `queens` dataset

PCP algorithms model an observed exposure matrix \(D\) as the sum of three underlying ground-truth matrices:
- a low-rank matrix \(L_0\) encoding consistent patterns of exposure,
- a sparse matrix \(S_0\) isolating unique or outlying exposure events (that cannot be explained by the consistent exposure patterns), and
- dense noise \(Z_0\).

All of these matrices are of dimension \(n \times p\), where \(n\) is the number of observations (e.g., study participants or measurement dates) and \(p\) is the number of exposures (e.g., chemical and/or non-chemical stressors). Beyond this mixtures model, the main assumption made by PCP is that \(Z_0 \sim N(\mu, \sigma^2)\) consists of independent and identically distributed (i.i.d.) Gaussian noise corrupting each entry of the overall exposure matrix \(D\).
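To make this model concrete, below is a minimal base-R sketch simulating an observed matrix \(D = L_0 + S_0 + Z_0\); all names and values here are illustrative and not part of `pcpr`.

```r
set.seed(42)
n <- 50; p <- 10; r <- 3

# Low-rank component: r consistent exposure patterns shared across observations
L0 <- matrix(rexp(n * r), n, r) %*% matrix(runif(r * p), r, p)

# Sparse component: a handful of unique, outlying exposure events
S0 <- matrix(0, n, p)
S0[sample(n * p, 15)] <- rexp(15, rate = 0.2)

# Dense noise: i.i.d. Gaussian noise corrupting every entry
Z0 <- matrix(rnorm(n * p, sd = 0.1), n, p)

# Observed exposure matrix
D <- L0 + S0 + Z0
```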
The models in `pcpr` seek to decompose an observed data matrix \(D\) into estimated low-rank and sparse components \(L\) and \(S\) for use in downstream environmental health analyses.

The estimated low-rank matrix \(L\) provides information on the consistent exposure patterns, satisfying:
\[r = \text{rank}(L) \ll \min(n,p).\]
The rank \(r\) of a matrix is the number of linearly independent columns or rows in the matrix, and plays an important role in defining the mathematical structure of the data. Intuitively, the rank directly corresponds to the (relatively small) number of underlying patterns governing the mixture. Here, "patterns" can refer to specific sources, profiles, or behaviors leading to exposure, depending on the application.
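As a quick illustration (using base R only; the numbers here are arbitrary), a mixture built from two underlying patterns has rank 2 no matter how many observations or exposures it contains:

```r
set.seed(1)
scores   <- matrix(rexp(100 * 2), nrow = 100)  # pattern scores for 100 observations
patterns <- matrix(runif(2 * 10), nrow = 2)    # 2 exposure patterns across 10 variables
X <- scores %*% patterns                       # 100 x 10 mixture governed by 2 patterns
qr(X)$rank                                     # 2
```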
In contrast to closely related dimension reduction tools such as principal component analysis (PCA), PCP infers the rank \(r\) from the observed data. In `pcpr`, this is done directly during optimization for convex PCP, and via grid search for non-convex PCP. As such, rather than requiring the researcher to choose the number of estimated patterns for use in subsequent health models, PCP allows the observed data to “speak for itself”, thereby removing potential points of subjectivity in model design.
Notice that \(L \in \mathbb{R}^{n \times p}\), meaning it is still defined in terms of the original \(n\)-many observations and \(p\)-many environmental variables. Put differently, \(L\) can be taken as a robust approximation to the true environmental mixture matrix, unperturbed by outliers (captured in \(S\)) or noise (handled by PCP's noise term). In this way, the latent exposure patterns are encoded in \(L\) rather than directly estimated. To explicitly obtain the exposure patterns from \(L\), PCP may then be paired with various matrix factorization methods (e.g., PCA, factor analysis, or non-negative matrix factorization) that yield chemical loadings and individual scores for use in downstream health models.

This flexibility allows \(L\) to adapt to mixture-specific assumptions. For example, if the assumption of orthogonal (i.e., independent) patterns is too strong, then instead of pairing \(L\) with PCA, a more appropriate method such as factor analysis can be used. Alternatively, depending on the sample size and study design, \(L\) may also be directly incorporated into regression models.
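For example, a rough sketch of pairing an estimated low-rank matrix with PCA might look like the following; here `L` stands for the low-rank matrix returned by one of `pcpr`'s models, and `prcomp()` is base R:

```r
# Extract patterns (loadings) and per-observation scores from L via PCA
pca <- prcomp(L, center = TRUE, scale. = TRUE)
loadings <- pca$rotation[, 1:2]  # exposure loadings on the first two patterns
scores   <- pca$x[, 1:2]         # observation-level scores for downstream health models
```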
The estimated sparse matrix \(S\) captures unusually high or low outlying exposure events, unexplained by the identified patterns in \(L\). Most entries in \(S\) are 0, with non-zero entries identifying such extreme exposure activity. The number, location (i.e., support), and value of non-zero entries in \(S\) need not be defined a priori; PCP isolates these itself during optimization.
By separating and retaining sparse exposure events, PCP boasts an enormous advantage over other current dimension reduction techniques. Despite being common phenomena in mixtures data, sparse outliers are typically removed from the exposure matrix prior to analysis. This is because PCA and other conventional dimension reduction approaches are unable to disentangle such unique events from the overall patterns of exposure: if included, even low fractions of outliers can pull the patterns identified by traditional methods away from the true distribution of the data, yielding inaccurate pattern estimates and high false positive rates of detected outliers. By decomposing a mixture into low-rank and sparse components \(L\) and \(S\), PCP avoids such pitfalls.

The functions in `pcpr` are outfitted with three environmental health (EH)-specific extensions, making `pcpr` particularly powerful for EH research:
- Missing value functionality: PCP can handle `NA` values in the observed mixture matrix, often outperforming traditional imputation techniques.
- Limit of detection (LOD) penalty: when \(LOD\) information is available, PCP can treat measurements below the \(LOD\) with a dedicated penalty (described below), rather than relying on ad hoc imputation.
- Non-negativity constraint on the `L` matrix: if desired, PCP can enforce values in the estimated low-rank matrix `L` to be \(\geq 0\), better modeling real-world mixtures data.

PCP assumes that the same data generating mechanisms govern both the missing and the observed entries in \(D\). Because PCP primarily seeks accurate estimation of patterns rather than individual observations, this assumption is reasonable, but in some edge cases may not always be justified. Missing values in \(D\) are therefore reconstructed in the recovered low-rank \(L\) matrix according to the underlying patterns in \(L\). There are three corollaries to keep in mind regarding the quality of recovered missing observations:
When equipped with \(LOD\) information, PCP treats any estimates of values known to be below the \(LOD\) as equally valid if their approximations fall between \(0\) and the \(LOD\). Over the course of optimization, observations below the LOD are pushed into this known range \([0, LOD]\) using penalties from above and below: should a \(< LOD\) estimate be \(< 0\), it is stringently penalized, since measured observations cannot be negative. On the other hand, if a \(< LOD\) estimate is \(> LOD\), it is also heavily penalized: less so than when \(< 0\), but more so than observations known to be above the \(LOD\), because we have prior information that these observations must be below the \(LOD\). Observations known to be above the \(LOD\) are penalized as usual, using the Frobenius norm in PCP's objective function.
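The following toy function is a conceptual sketch (not `pcpr`'s implementation) of the three penalty regimes described above for an estimate of a measurement known to fall below the LOD:

```r
# Classify an estimate x of a below-LOD measurement into its penalty regime
lod_penalty_regime <- function(x, lod) {
  if (x < 0) {
    "strongly penalized: measured observations cannot be negative"
  } else if (x > lod) {
    "heavily penalized: prior information says the true value is below the LOD"
  } else {
    "no penalty: any estimate in [0, LOD] is treated as equally valid"
  }
}

lod_penalty_regime(-0.2, lod = 1)  # negative estimate
lod_penalty_regime( 0.4, lod = 1)  # estimate within [0, LOD]
```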
Gibson et al. (2022) demonstrate that in experimental settings with up to 50% of the data corrupted below the \(LOD\), PCP with the \(LOD\) extension boasts superior accuracy of recovered \(L\) models compared to PCA coupled with \(\frac{LOD}{\sqrt{2}}\) imputation. PCP even outperforms PCA in low-noise scenarios with as much as 75% of the data corrupted below the \(LOD\). The few situations in which PCA bettered PCP were those pathological cases in which \(D\) was characterized by extreme noise and huge proportions (i.e., 75%) of observations falling below the \(LOD\).
To enhance interpretability of PCP-rendered solutions, an optional non-negativity constraint can be imposed on the \(L\) matrix to ensure all estimated values within it are \(\geq 0\). This prevents researchers from having to deal with negative observation values and questions surrounding their meaning and utility. Non-negative \(L\) models also allow for seamless use of methods such as non-negative matrix factorization to extract non-negative patterns.

Currently, the non-negativity constraint is only supported in the convex PCP function `root_pcp()`, incorporated into the ADMM splitting technique via the introduction of an additional optimization variable and corresponding constraint. Future work will extend the constraint to the non-convex PCP method `rrmc()`.

Of the many flavors of PCP undergoing active study in the current literature, we provide two distinct models in `pcpr`: the convex model `root_pcp()` and the non-convex model `rrmc()`. The table below offers a quick glance at their relative differences:
| | `root_pcp()` | `rrmc()` |
|---|---|---|
| Convex? | Yes | No |
| Convergence? | Slow | Fast |
| Expected low-rank structure? | Well-defined | Complex |
| Parameters? | \(D, \lambda, \mu\) | \(D, r, \eta\) |
| Supports missing values? | Yes | Yes |
| Supports LOD penalty? | Yes | Yes |
| Supports non-negativity constraint? | Yes | No |
| Rank determination? | Autonomous | User-defined* |
| Sparse event identification? | Autonomous | Autonomous |
| Optimization approach? | ADMM | Iterative rank-based |
*`rrmc()` can be paired with the cross-validated `grid_search_cv()` function for autonomous rank determination.

Convex PCP via `root_pcp()` is best for data characterized by rapidly decaying singular values (e.g., image and video data), indicative of very well-defined latent patterns.

Non-convex PCP with `rrmc()` is best suited for data characterized by slowly decaying singular values, indicative of complex underlying patterns and a relatively large degree of noise. Most EH data can be described this way, so we expect most EH researchers to utilize `rrmc()` in their analyses; however, there are cases where the convexity of `root_pcp()` may be preferable.
Convex PCP formulations possess a number of particularly attractive properties, foremost of which is convexity, meaning that every local optimum is a global optimum, and a single best solution exists. Convex approaches to PCP also have the virtue that the rank \(r\) of the recovered \(L\) matrix is determined during optimization, without researcher input.
Unfortunately, these benefits come at a cost: convex PCP programs are expensive to run on large datasets, suffering from poor convergence rates. Moreover, convex PCP approaches are best suited to instances in which the target low-rank matrix \(L_0\) can be accurately modeled as low-rank (i.e., \(L_0\) is governed by only a few very well-defined patterns). This is often the case with image and video data (characterized by rapidly decaying singular values), but not common for EH data, which is typically only approximately low-rank (characterized by complex patterns and slowly decaying singular values).

The convex model available in `pcpr` is `root_pcp()`. For a comprehensive technical understanding, we refer readers to Zhang et al. (2021), who introduced the algorithm.

`root_pcp()` optimizes the following objective function:
\[\min_{L, S} ||L||_* + \lambda ||S||_1 +\mu ||L + S - D||_F\]
The first term is the nuclear norm of the \(L\) matrix, incentivizing \(L\) to be low-rank. The second term is the \(\ell_1\) norm of the \(S\) matrix, encouraging \(S\) to be sparse. The third term is the Frobenius norm applied to the model's noise, ensuring that the estimated low-rank and sparse models \(L\) and \(S\) together have high fidelity to the observed data \(D\). The objective is neither smooth nor differentiable; however, it is convex and separable. As such, it is optimized using the Alternating Direction Method of Multipliers (ADMM) algorithm (Boyd et al. 2011; Gao et al. 2020).
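As a rough usage sketch (the argument names `lambda` and `mu` follow the parameter table above, and the values shown are simply the theoretical defaults discussed later in this vignette):

```r
library(pcpr)

# Convex PCP on an observed mixture matrix D
convex_fit <- root_pcp(
  D,
  lambda = 1 / sqrt(max(dim(D))),  # default sparsity parameter
  mu     = sqrt(min(dim(D)) / 2)   # default noise parameter
)
# convex_fit is expected to contain the estimated L and S matrices
```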
To alleviate the high computational complexity of convex methods, non-convex PCP frameworks have been developed. These drastically improve upon the convergence rates of their convex counterparts. Better still, non-convex PCP methods more flexibly accommodate data lacking a well-defined low-rank structure, so they are from the outset better suited to handling EH data. Non-convex formulations provide this flexibility by allowing the user to interrogate the data at different ranks.

The drawback here is that non-convex algorithms can no longer determine the rank best describing the data on their own, instead requiring the researcher to subjectively specify the rank \(r\), as in PCA. However, by pairing non-convex PCP algorithms with the cross-validation routine implemented in the `grid_search_cv()` function, the optimal rank can be determined semi-autonomously; the researcher need only define a rank search space from which the optimal rank will be identified via grid search.

One of the more glaring trade-offs made by non-convex methods for this improved run-time and flexibility is weaker theoretical promises; specifically, non-convex PCP runs the risk of finding spurious local optima, rather than the global optimum guaranteed by their convex siblings. Having said that, theory has been developed guaranteeing equivalent performance between non-convex implementations and closely related convex formulations under certain conditions. These advancements provide strong motivation for non-convex frameworks despite their weaker theoretical promises.

The non-convex model available in `pcpr` is `rrmc()`. We refer readers to Cherapanamjeri et al. (2017) for an in-depth look at the mathematical details.

`rrmc()` implicitly optimizes the following objective function:
\[\min_{L, S} I_{\text{rank}(L) \leq r} + \eta ||S||_0 + ||L + S - D||_F^2\]
The first term is the indicator function checking that the \(L\) matrix is strictly rank \(r\) or less, implemented using a rank-\(r\) projection operator `proj_rank_r()`. The second term is the \(\ell_0\) norm applied to the \(S\) matrix to encourage sparsity, and is implemented with the help of an adaptive hard-thresholding operator `hard_threshold()`. The third term is the squared Frobenius norm applied to the model's noise.
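To give a feel for these two operators, here is an illustrative base-R version of a rank-\(r\) projection (via a truncated SVD) and a simple hard-thresholding step; `pcpr`'s `proj_rank_r()` and `hard_threshold()` implement the same ideas, but the adaptive choice of threshold used by `rrmc()` is omitted here:

```r
# Project X onto the set of matrices with rank at most r (truncated SVD)
project_rank_r <- function(X, r) {
  s <- svd(X)
  s$u[, 1:r, drop = FALSE] %*% diag(s$d[1:r], r, r) %*% t(s$v[, 1:r, drop = FALSE])
}

# Hard thresholding: zero out entries whose magnitude falls below thresh
hard_threshold_simple <- function(X, thresh) {
  X[abs(X) < thresh] <- 0
  X
}
```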
`rrmc()` uses an incremental rank-based strategy in order to estimate \(L\) and \(S\): first, a rank-\(1\) model \((L^{(1)}, S^{(1)})\) is estimated. The rank-\(1\) model is then used as an initialization point to construct a rank-\(2\) model \((L^{(2)}, S^{(2)})\), and so on, until the desired rank-\(r\) model \((L^{(r)}, S^{(r)})\) is recovered. All models from ranks \(1\) through \(r\) are returned by `rrmc()` in this way.
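A corresponding usage sketch for the non-convex model (argument names again follow the parameter table above; the rank and `eta` values are placeholders that would normally come from a grid search):

```r
library(pcpr)

# Non-convex PCP: interrogate the data at ranks 1 through 4
nonconvex_fit <- rrmc(D, r = 4, eta = 0.1)
# nonconvex_fit is expected to contain the estimated L and S matrices
```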
Recall that `root_pcp()`'s objective function is given by:
\[\min_{L, S} ||L||_* + \lambda ||S||_1 +\mu ||L + S - D||_F\]
- The sparsity parameter (`lambda`) controls the sparsity of `root_pcp()`'s output \(S\) matrix; larger values of \(\lambda\) penalize non-zero entries in \(S\) more stringently, driving the recovery of sparser \(S\) matrices. Therefore, if you a priori expect few outlying events in your model, you might expect a grid search to recover relatively larger \(\lambda\) values, and vice versa.
- The noise parameter (`mu`) adjusts `root_pcp()`'s sensitivity to noise; larger values of \(\mu\) penalize errors between the predicted model and the observed data (i.e., noise) more severely. Environmental data subject to higher noise levels therefore requires a `root_pcp()` model equipped with smaller `mu` values (since higher noise means a greater discrepancy between the observed mixture and the true underlying low-rank and sparse model). In virtually noise-free settings (e.g., simulations), larger values of \(\mu\) would be appropriate.

`rrmc()`'s objective function is given by:
\[\min_{L, S} I_{\text{rank}(L) \leq r} + \eta ||S||_0 + ||L + S - D||_F^2\]
- The sparsity parameter (`eta`) controls the sparsity of `rrmc()`'s output \(S\) matrix, just as \(\lambda\) does for `root_pcp()`. Because there are no other parameters scaling the noise term, \(\eta\) can be thought of as a ratio between `root_pcp()`'s \(\lambda\) and \(\mu\): larger values of \(\eta\) place a greater emphasis on penalizing the non-zero entries in \(S\) over penalizing the errors between the predicted and observed data (the dense noise \(Z\)).

The `get_pcp_defaults()` function calculates the “default” PCP parameter settings of \(\lambda\), \(\mu\), and \(\eta\) for a given data matrix \(D\).

The “default” values of \(\lambda\) and \(\mu\) offer theoretical guarantees of optimal estimation performance. Candès et al. (2011) obtained the guarantee for \(\lambda\), while Zhang et al. (2021) obtained the result for \(\mu\). It has not yet been proven whether or not \(\eta\) enjoys similar properties.

The theoretically optimal \(\lambda_*\) is given by:
\[\lambda_* = 1 / \sqrt{\max(n,p)},\]
where \(n\) and \(p\) are the dimensions of the input matrix \(D_{n \times p}\).

The theoretically optimal \(\mu_*\) is given by:

\[\mu_* = \sqrt{\frac{\min(n, p)}{2}}.\]

The “default” value of \(\eta\) is then simply \(\eta = \frac{\lambda_*}{\mu_*}\).

Mixtures data is rarely so well-behaved in practice, however. Instead, it is common to find different empirically optimal parameter values after tuning these parameters in a grid search. Therefore, it is recommended to use `get_pcp_defaults()` primarily to help define a reasonable initial parameter search space to pass into `grid_search_cv()`.
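For instance, the default values for an \(n \times p\) matrix `D` can be computed directly from the formulas above; `get_pcp_defaults(D)` should yield the same quantities (the exact structure of its return value is not shown here):

```r
n <- nrow(D); p <- ncol(D)

lambda_star <- 1 / sqrt(max(n, p))   # default lambda
mu_star     <- sqrt(min(n, p) / 2)   # default mu
eta_default <- lambda_star / mu_star # default eta
```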
`grid_search_cv()` conducts a Monte Carlo-style cross-validated grid search of PCP parameters for a given data matrix \(D\), PCP function `pcp_fn`, and grid of parameter settings to search through `grid`. The run time of the grid search can be sped up using bespoke parallelization settings.
Each hyperparameter setting is cross-validated by:
1. Randomly corrupting \(\xi\) percent of the observed entries in \(D\) as missing (i.e., `NA` values), yielding \(P_\Omega(D)\). Done using the `sim_na()` function.
2. Running the PCP function `pcp_fn` on \(P_\Omega(D)\), yielding estimates \(L\) and \(S\).
3. Recording the relative recovery error of the held-out test entries, and repeating the above steps \(K\) times for each parameter setting in the grid.

In the `grid_search_cv()` function, \(\xi\) is referred to as `perc_test` (percent test), while \(K\) is known as `num_runs` (number of runs).
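A hedged sketch of such a grid search for `rrmc()` follows; the layout of `grid` (a data frame of candidate `eta` values and ranks built with `expand.grid()`) and the specific values are assumptions for illustration, while `pcp_fn`, `grid`, `perc_test`, and `num_runs` are the arguments described above:

```r
library(pcpr)

# Candidate parameter settings to search over (hypothetical values)
grid <- expand.grid(eta = 10^seq(-2, 1, length.out = 8), r = 1:5)

# Cross-validated grid search: hold out 5% of entries per run, repeat 100 times
search <- grid_search_cv(D, pcp_fn = rrmc, grid = grid,
                         perc_test = 0.05, num_runs = 100)

search$summary_stats  # per-setting statistics used to choose the optimal parameters
```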
Experimentally, this grid search procedure retrieves the best performing PCP parameter settings when \(\xi\) is relatively low, e.g., \(\xi = 0.05\) (5%), and \(K\) is relatively high, e.g., \(K = 100\). This is because:
Once the grid search has been conducted, the optimal hyperparameters can be chosen by examining the output statistics `summary_stats`. Below are a few suggestions for how to interpret the `summary_stats` table:
- The `rel_err` statistic captures the relative discrepancy between recovered test sets and their original, observed (yet possibly noisy) values. Lower `rel_err` means the PCP model was better able to recover the held-out test set, so, in general, the best parameter settings are those with the lowest `rel_err`. Having said this, it is important to remember that this statistic should be taken with a grain of salt: because in practice the researcher does not have access to the ground truth \(L\) matrix, the `rel_err` measurement is forced to rely on the comparison between the noisy observed data matrix \(D\) and the estimated low-rank model \(L\). The `rel_err` metric is therefore an “apples to oranges” relative error. For data that is a priori expected to be subject to a high degree of noise, it may actually be better to discard parameter settings with suspiciously low `rel_err` values (in which case the solution may be hallucinating an inaccurate low-rank structure from the observed noise).
- When using `root_pcp()` as the PCP model, parameter settings that fail to converge can be discarded. Generally, fewer `root_pcp()` iterations (`num_iter`) taken to reach convergence portend a more reliable / stable solution. In rare cases, the user may need to increase `root_pcp()`'s `max_iter` argument to reach convergence. `rrmc()` does not report convergence metadata, as its optimization scheme runs for a fixed number of iterations.
- The estimated rank and sparsity statistics depend on the `thresh` set by the `sparsity()` and `matrix_rank()` functions. E.g., it could be that the actual average matrix rank is much higher or lower when a threshold that better takes into account the relative scale of the singular values is used. Likewise for the sparsity estimations. Also, recall that the given value for \(\xi\) artificially sets a sparsity floor, since those missing entries in the test set cannot be recovered in the \(S\) matrix. E.g., if \(\xi = 0.05\), then no parameter setting will have an estimated sparsity lower than 5%.

To see how to apply all of the above in `pcpr`, we recommend reading:
vignette("pcp-quickstart"): applying PCP to a simulatedenvironmental mixturevignette("pcp-applied"): employing PCP for sourceapportionment of real-world PM2.5 air pollution concentration data usingthequeens datasetGibson, Elizabeth A., Junhui Zhang, Jingkai Yan, Lawrence Chillrud,Jaime Benavides, Yanelli Nunez, Julie B. Herbstman, Jeff Goldsmith, JohnWright, and Marianthi-Anna Kioumourtzoglou. “Principal component pursuitfor pattern identification in environmental mixtures.” EnvironmentalHealth Perspectives 130, no. 11 (2022): 117008.
Zhang, Junhui, Jingkai Yan, and John Wright. “Square root principal component pursuit: tuning-free noisy robust matrix recovery.” Advances in Neural Information Processing Systems 34 (2021): 29464-29475.

Boyd, Stephen, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends in Machine Learning 3, no. 1 (2011): 1-122.

Gao, Wenbo, Donald Goldfarb, and Frank E. Curtis. “ADMM for multiaffine constrained optimization.” Optimization Methods and Software 35, no. 2 (2020): 257-303.

Cherapanamjeri, Yeshwanth, Kartik Gupta, and Prateek Jain. “Nearly optimal robust matrix completion.” International Conference on Machine Learning. PMLR, 2017.

Candès, Emmanuel J., Xiaodong Li, Yi Ma, and John Wright. “Robust principal component analysis?” Journal of the ACM (JACM) 58, no. 3 (2011): 1-37.