The R package `pcpr` implements Principal Component Pursuit (PCP), a robust dimensionality reduction technique, for pattern recognition tailored to environmental health (EH) data. The statistical methodology and computational details are provided in Gibson et al. (2022).

In this code-free vignette, we present a crash course in PCP's theoretical background, so researchers can better navigate all of the functionality offered in `pcpr`. We will touch upon:

- The estimated `L` and `S` matrices
- The PCP models offered in `pcpr`
- PCP's parameters `lambda`, `mu`, and `eta`
- Cross-validated parameter tuning with `grid_search_cv()`

We recommend users skim this crash course before reading the two code-heavy vignettes:

- `vignette("pcp-quickstart")`: applying PCP to a simulated environmental mixture
- `vignette("pcp-applied")`: employing PCP for source apportionment of real-world PM2.5 air pollution concentration data using the `queens` dataset

PCP algorithms model an observed exposure matrix \(D\) as the sum of three underlying ground-truth matrices:
- a low-rank matrix \(L_0\) encoding consistent patterns of exposure,
- a sparse matrix \(S_0\) isolating unique or outlying exposure events (that cannot be explained by the consistent exposure patterns), and
- dense noise \(Z_0\).

All of these matrices are of dimension \(n \times p\), where \(n\) is the number of observations (e.g., study participants or measurement dates) and \(p\) is the number of exposures (e.g., chemical and/or non-chemical stressors). Beyond this mixtures model, the main assumption made by PCP is that \(Z_0 \sim N(\mu, \sigma^2)\) consists of independent and identically distributed (i.i.d.) Gaussian noise corrupting each entry of the overall exposure matrix \(D\).
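To make this model concrete, below is a minimal base-R sketch simulating an observed matrix \(D = L_0 + S_0 + Z_0\); all names and values here are illustrative and not part of `pcpr`.

```r
set.seed(42)
n <- 50; p <- 10; r <- 3

# Low-rank component: r consistent exposure patterns shared across observations
L0 <- matrix(rexp(n * r), n, r) %*% matrix(runif(r * p), r, p)

# Sparse component: a handful of unique, outlying exposure events
S0 <- matrix(0, n, p)
S0[sample(n * p, 15)] <- rexp(15, rate = 0.2)

# Dense noise: i.i.d. Gaussian noise corrupting every entry
Z0 <- matrix(rnorm(n * p, sd = 0.1), n, p)

# Observed exposure matrix
D <- L0 + S0 + Z0
```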
The models in `pcpr` seek to decompose an observed data matrix \(D\) into estimated low-rank and sparse components \(L\) and \(S\) for use in downstream environmental health analyses.

The estimated low-rank matrix \(L\) provides information on the consistent exposure patterns, satisfying:
\[r = \text{rank}(L) \ll \min(n,p).\]
The rank \(r\) of a matrix is the number of linearly independent columns or rows in the matrix, and plays an important role in defining the mathematical structure of the data. Intuitively, the rank directly corresponds to the (relatively small) number of underlying patterns governing the mixture. Here, "patterns" can refer to specific sources, profiles, or behaviors leading to exposure, depending on the application.
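As a quick illustration (using base R only; the numbers here are arbitrary), a mixture built from two underlying patterns has rank 2 no matter how many observations or exposures it contains:

```r
set.seed(1)
scores   <- matrix(rexp(100 * 2), nrow = 100)  # pattern scores for 100 observations
patterns <- matrix(runif(2 * 10), nrow = 2)    # 2 exposure patterns across 10 variables
X <- scores %*% patterns                       # 100 x 10 mixture governed by 2 patterns
qr(X)$rank                                     # 2
```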
In contrast to closely related dimension reduction tools such as principal component analysis (PCA), PCP infers the rank \(r\) from the observed data. In `pcpr`, this is done directly during optimization for convex PCP, and via grid search for non-convex PCP. As such, rather than requiring the researcher to choose the number of estimated patterns for use in subsequent health models, PCP allows the observed data to “speak for itself”, thereby removing potential points of subjectivity in model design.
Notice that \(L \in \mathbb{R}^{n \times p}\), meaning it is still defined in terms of the original \(n\)-many observations and \(p\)-many environmental variables. Put differently, \(L\) can be taken as a robust approximation to the true environmental mixture matrix, unperturbed by outliers (captured in \(S\)) or noise (handled by PCP's noise term). In this way, the latent exposure patterns are encoded in \(L\) rather than directly estimated. To explicitly obtain the exposure patterns from \(L\), PCP may then be paired with various matrix factorization methods (e.g., PCA, factor analysis, or non-negative matrix factorization) that yield chemical loadings and individual scores for use in downstream health models.

This flexibility allows \(L\) to adapt to mixture-specific assumptions. For example, if the assumption of orthogonal (i.e., independent) patterns is too strong, then instead of pairing \(L\) with PCA, a more appropriate method such as factor analysis can be used. Alternatively, depending on the sample size and study design, \(L\) may also be directly incorporated into regression models.
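For example, a rough sketch of pairing an estimated low-rank matrix with PCA might look like the following; here `L` stands for the low-rank matrix returned by one of `pcpr`'s models, and `prcomp()` is base R:

```r
# Extract patterns (loadings) and per-observation scores from L via PCA
pca <- prcomp(L, center = TRUE, scale. = TRUE)
loadings <- pca$rotation[, 1:2]  # exposure loadings on the first two patterns
scores   <- pca$x[, 1:2]         # observation-level scores for downstream health models
```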
The estimated sparse matrix \(S\) captures unusually high or low outlying exposure events, unexplained by the identified patterns in \(L\). Most entries in \(S\) are 0, with non-zero entries identifying such extreme exposure activity. The number, location (i.e., support), and value of non-zero entries in \(S\) need not be defined a priori; PCP isolates these itself during optimization.
By separating and retaining sparse exposure events, PCP boasts an enormous advantage over other current dimension reduction techniques. Despite being common phenomena in mixtures data, sparse outliers are typically removed from the exposure matrix prior to analysis. This is because PCA and other conventional dimension reduction approaches are unable to disentangle such unique events from the overall patterns of exposure: if included, even low fractions of outliers can pull the patterns identified by traditional methods away from the true distribution of the data, yielding inaccurate pattern estimates and high false positive rates of detected outliers. By decomposing a mixture into low-rank and sparse components \(L\) and \(S\), PCP avoids such pitfalls.

The functions in `pcpr` are outfitted with three environmental health (EH)-specific extensions, making `pcpr` particularly powerful for EH research:
- Missing value functionality: PCP can handle `NA` values in the observed mixture matrix, often outperforming traditional imputation techniques.
- Limit of detection (LOD) penalty: when \(LOD\) information is available, PCP can treat measurements below the \(LOD\) with a dedicated penalty (described below), rather than relying on ad hoc imputation.
- Non-negativity constraint on the `L` matrix: if desired, PCP can enforce values in the estimated low-rank matrix `L` to be \(\geq 0\), better modeling real-world mixtures data.

PCP assumes that the same data generating mechanisms govern both the missing and the observed entries in \(D\). Because PCP primarily seeks accurate estimation of patterns rather than individual observations, this assumption is reasonable, but in some edge cases may not always be justified. Missing values in \(D\) are therefore reconstructed in the recovered low-rank \(L\) matrix according to the underlying patterns in \(L\). There are three corollaries to keep in mind regarding the quality of recovered missing observations:
When equipped with \(LOD\) information, PCP treats any estimates of values known to be below the \(LOD\) as equally valid if their approximations fall between \(0\) and the \(LOD\). Over the course of optimization, observations below the LOD are pushed into this known range \([0, LOD]\) using penalties from above and below: should a \(< LOD\) estimate be \(< 0\), it is stringently penalized, since measured observations cannot be negative. On the other hand, if a \(< LOD\) estimate is \(> LOD\), it is also heavily penalized: less so than when \(< 0\), but more so than observations known to be above the \(LOD\), because we have prior information that these observations must be below the \(LOD\). Observations known to be above the \(LOD\) are penalized as usual, using the Frobenius norm in PCP's objective function.
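The following toy function is a conceptual sketch (not `pcpr`'s implementation) of the three penalty regimes described above for an estimate of a measurement known to fall below the LOD:

```r
# Classify an estimate x of a below-LOD measurement into its penalty regime
lod_penalty_regime <- function(x, lod) {
  if (x < 0) {
    "strongly penalized: measured observations cannot be negative"
  } else if (x > lod) {
    "heavily penalized: prior information says the true value is below the LOD"
  } else {
    "no penalty: any estimate in [0, LOD] is treated as equally valid"
  }
}

lod_penalty_regime(-0.2, lod = 1)  # negative estimate
lod_penalty_regime( 0.4, lod = 1)  # estimate within [0, LOD]
```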
Gibson et al. (2022) demonstrate that in experimental settings with up to 50% of the data corrupted below the \(LOD\), PCP with the \(LOD\) extension boasts superior accuracy of recovered \(L\) models compared to PCA coupled with \(\frac{LOD}{\sqrt{2}}\) imputation. PCP even outperforms PCA in low-noise scenarios with as much as 75% of the data corrupted below the \(LOD\). The few situations in which PCA bettered PCP were those pathological cases in which \(D\) was characterized by extreme noise and huge proportions (i.e., 75%) of observations falling below the \(LOD\).
To enhance interpretability of PCP-rendered solutions, an optional non-negativity constraint can be imposed on the \(L\) matrix to ensure all estimated values within it are \(\geq 0\). This prevents researchers from having to deal with negative observation values and questions surrounding their meaning and utility. Non-negative \(L\) models also allow for seamless use of methods such as non-negative matrix factorization to extract non-negative patterns.

Currently, the non-negativity constraint is only supported in the convex PCP function `root_pcp()`, incorporated into the ADMM splitting technique via the introduction of an additional optimization variable and corresponding constraint. Future work will extend the constraint to the non-convex PCP method `rrmc()`.

Of the many flavors of PCP undergoing active study in the current literature, we provide two distinct models in `pcpr`: the convex model `root_pcp()` and the non-convex model `rrmc()`. The table below offers a quick glance at their relative differences:
| | `root_pcp()` | `rrmc()` |
|---|---|---|
| Convex? | Yes | No |
| Convergence? | Slow | Fast |
| Expected low-rank structure? | Well-defined | Complex |
| Parameters? | \(D, \lambda, \mu\) | \(D, r, \eta\) |
| Supports missing values? | Yes | Yes |
| Supports LOD penalty? | Yes | Yes |
| Supports non-negativity constraint? | Yes | No |
| Rank determination? | Autonomous | User-defined* |
| Sparse event identification? | Autonomous | Autonomous |
| Optimization approach? | ADMM | Iterative rank-based |
*`rrmc()` can be paired with the cross-validated `grid_search_cv()` function for autonomous rank determination.

Convex PCP via `root_pcp()` is best for data characterized by rapidly decaying singular values (e.g., image and video data), indicative of very well-defined latent patterns.

Non-convex PCP with `rrmc()` is best suited for data characterized by slowly decaying singular values, indicative of complex underlying patterns and a relatively large degree of noise. Most EH data can be described this way, so we expect most EH researchers to utilize `rrmc()` in their analyses; however, there are cases where the convexity of `root_pcp()` may be preferable.
Convex PCP formulations possess a number of particularly attractive properties, foremost of which is convexity, meaning that every local optimum is a global optimum, and a single best solution exists. Convex approaches to PCP also have the virtue that the rank \(r\) of the recovered \(L\) matrix is determined during optimization, without researcher input.
Unfortunately, these benefits come at a cost: convex PCP programs are expensive to run on large datasets, suffering from poor convergence rates. Moreover, convex PCP approaches are best suited to instances in which the target low-rank matrix \(L_0\) can be accurately modeled as low-rank (i.e., \(L_0\) is governed by only a few very well-defined patterns). This is often the case with image and video data (characterized by rapidly decaying singular values), but not common for EH data, which is typically only approximately low-rank (characterized by complex patterns and slowly decaying singular values).

The convex model available in `pcpr` is `root_pcp()`. For a comprehensive technical understanding, we refer readers to Zhang et al. (2021), who introduced the algorithm.

`root_pcp()` optimizes the following objective function:
\[\min_{L, S} ||L||_* + \lambda ||S||_1 +\mu ||L + S - D||_F\]
The first term is the nuclear norm of the \(L\) matrix, incentivizing \(L\) to be low-rank. The second term is the \(\ell_1\) norm of the \(S\) matrix, encouraging \(S\) to be sparse. The third term is the Frobenius norm applied to the model's noise, ensuring that the estimated low-rank and sparse models \(L\) and \(S\) together have high fidelity to the observed data \(D\). The objective is neither smooth nor differentiable; however, it is convex and separable. As such, it is optimized using the Alternating Direction Method of Multipliers (ADMM) algorithm (Boyd et al. 2011; Gao et al. 2020).
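As a rough usage sketch (the argument names `lambda` and `mu` follow the parameter table above, and the values shown are simply the theoretical defaults discussed later in this vignette):

```r
library(pcpr)

# Convex PCP on an observed mixture matrix D
convex_fit <- root_pcp(
  D,
  lambda = 1 / sqrt(max(dim(D))),  # default sparsity parameter
  mu     = sqrt(min(dim(D)) / 2)   # default noise parameter
)
# convex_fit is expected to contain the estimated L and S matrices
```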
To alleviate the high computational complexity of convex methods, non-convex PCP frameworks have been developed. These drastically improve upon the convergence rates of their convex counterparts. Better still, non-convex PCP methods more flexibly accommodate data lacking a well-defined low-rank structure, so they are from the outset better suited to handling EH data. Non-convex formulations provide this flexibility by allowing the user to interrogate the data at different ranks.

The drawback here is that non-convex algorithms can no longer determine the rank best describing the data on their own, instead requiring the researcher to subjectively specify the rank \(r\), as in PCA. However, by pairing non-convex PCP algorithms with the cross-validation routine implemented in the `grid_search_cv()` function, the optimal rank can be determined semi-autonomously; the researcher need only define a rank search space from which the optimal rank will be identified via grid search.

One of the more glaring trade-offs made by non-convex methods for this improved run-time and flexibility is weaker theoretical promises; specifically, non-convex PCP runs the risk of finding spurious local optima, rather than the global optimum guaranteed by their convex siblings. Having said that, theory has been developed guaranteeing equivalent performance between non-convex implementations and closely related convex formulations under certain conditions. These advancements provide strong motivation for non-convex frameworks despite their weaker theoretical promises.

The non-convex model available in `pcpr` is `rrmc()`. We refer readers to Cherapanamjeri et al. (2017) for an in-depth look at the mathematical details.

`rrmc()` implicitly optimizes the following objective function:
\[\min_{L, S} I_{\text{rank}(L) \leq r} + \eta ||S||_0 + ||L + S - D||_F^2\]
The first term is the indicator function checking that the \(L\) matrix is strictly rank \(r\) or less, implemented using a rank-\(r\) projection operator `proj_rank_r()`. The second term is the \(\ell_0\) norm applied to the \(S\) matrix to encourage sparsity, and is implemented with the help of an adaptive hard-thresholding operator `hard_threshold()`. The third term is the squared Frobenius norm applied to the model's noise.
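To give a feel for these two operators, here is an illustrative base-R version of a rank-\(r\) projection (via a truncated SVD) and a simple hard-thresholding step; `pcpr`'s `proj_rank_r()` and `hard_threshold()` implement the same ideas, but the adaptive choice of threshold used by `rrmc()` is omitted here:

```r
# Project X onto the set of matrices with rank at most r (truncated SVD)
project_rank_r <- function(X, r) {
  s <- svd(X)
  s$u[, 1:r, drop = FALSE] %*% diag(s$d[1:r], r, r) %*% t(s$v[, 1:r, drop = FALSE])
}

# Hard thresholding: zero out entries whose magnitude falls below thresh
hard_threshold_simple <- function(X, thresh) {
  X[abs(X) < thresh] <- 0
  X
}
```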
`rrmc()` uses an incremental rank-based strategy in order to estimate \(L\) and \(S\): first, a rank-\(1\) model \((L^{(1)}, S^{(1)})\) is estimated. The rank-\(1\) model is then used as an initialization point to construct a rank-\(2\) model \((L^{(2)}, S^{(2)})\), and so on, until the desired rank-\(r\) model \((L^{(r)}, S^{(r)})\) is recovered. All models from ranks \(1\) through \(r\) are returned by `rrmc()` in this way.
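A corresponding usage sketch for the non-convex model (argument names again follow the parameter table above; the rank and `eta` values are placeholders that would normally come from a grid search):

```r
library(pcpr)

# Non-convex PCP: interrogate the data at ranks 1 through 4
nonconvex_fit <- rrmc(D, r = 4, eta = 0.1)
# nonconvex_fit is expected to contain the estimated L and S matrices
```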
Recall that `root_pcp()`'s objective function is given by:
\[\min_{L, S} ||L||_* + \lambda ||S||_1 +\mu ||L + S - D||_F\]
- The sparsity parameter (`lambda`) controls the sparsity of `root_pcp()`'s output \(S\) matrix; larger values of \(\lambda\) penalize non-zero entries in \(S\) more stringently, driving the recovery of sparser \(S\) matrices. Therefore, if you a priori expect few outlying events in your model, you might expect a grid search to recover relatively larger \(\lambda\) values, and vice versa.
- The noise parameter (`mu`) adjusts `root_pcp()`'s sensitivity to noise; larger values of \(\mu\) penalize errors between the predicted model and the observed data (i.e., noise) more severely. Environmental data subject to higher noise levels therefore requires a `root_pcp()` model equipped with smaller `mu` values (since higher noise means a greater discrepancy between the observed mixture and the true underlying low-rank and sparse model). In virtually noise-free settings (e.g., simulations), larger values of \(\mu\) would be appropriate.

`rrmc()`'s objective function is given by:
\[\min_{L, S} I_{\text{rank}(L) \leq r} + \eta ||S||_0 + ||L + S - D||_F^2\]
- The sparsity parameter (`eta`) controls the sparsity of `rrmc()`'s output \(S\) matrix, just as \(\lambda\) does for `root_pcp()`. Because there are no other parameters scaling the noise term, \(\eta\) can be thought of as a ratio between `root_pcp()`'s \(\lambda\) and \(\mu\): larger values of \(\eta\) place a greater emphasis on penalizing the non-zero entries in \(S\) over penalizing the errors between the predicted and observed data (the dense noise \(Z\)).

The `get_pcp_defaults()` function calculates the “default” PCP parameter settings of \(\lambda\), \(\mu\), and \(\eta\) for a given data matrix \(D\).

The “default” values of \(\lambda\) and \(\mu\) offer theoretical guarantees of optimal estimation performance. Candès et al. (2011) obtained the guarantee for \(\lambda\), while Zhang et al. (2021) obtained the result for \(\mu\). It has not yet been proven whether or not \(\eta\) enjoys similar properties.

The theoretically optimal \(\lambda_*\) is given by:
\[\lambda_* = 1 / \sqrt{\max(n,p)},\]
where \(n\) and \(p\) are the dimensions of the input matrix \(D_{n \times p}\).

The theoretically optimal \(\mu_*\) is given by:

\[\mu_* = \sqrt{\frac{\min(n, p)}{2}}.\]

The “default” value of \(\eta\) is then simply \(\eta = \frac{\lambda_*}{\mu_*}\).

Mixtures data is rarely so well-behaved in practice, however. Instead, it is common to find different empirically optimal parameter values after tuning these parameters in a grid search. Therefore, it is recommended to use `get_pcp_defaults()` primarily to help define a reasonable initial parameter search space to pass into `grid_search_cv()`.
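For instance, the default values for an \(n \times p\) matrix `D` can be computed directly from the formulas above; `get_pcp_defaults(D)` should yield the same quantities (the exact structure of its return value is not shown here):

```r
n <- nrow(D); p <- ncol(D)

lambda_star <- 1 / sqrt(max(n, p))   # default lambda
mu_star     <- sqrt(min(n, p) / 2)   # default mu
eta_default <- lambda_star / mu_star # default eta
```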
`grid_search_cv()` conducts a Monte Carlo-style cross-validated grid search of PCP parameters for a given data matrix \(D\), PCP function `pcp_fn`, and grid of parameter settings to search through `grid`. The run time of the grid search can be sped up using bespoke parallelization settings.
Each hyperparameter setting is cross-validated by:
1. Randomly corrupting \(\xi\) percent of the observed entries in \(D\) as missing (i.e., `NA` values), yielding \(P_\Omega(D)\). Done using the `sim_na()` function.
2. Running the PCP function `pcp_fn` on \(P_\Omega(D)\), yielding estimates \(L\) and \(S\).
3. Recording the relative recovery error of the held-out test entries, and repeating the above steps \(K\) times for each parameter setting in the grid.

In the `grid_search_cv()` function, \(\xi\) is referred to as `perc_test` (percent test), while \(K\) is known as `num_runs` (number of runs).
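A hedged sketch of such a grid search for `rrmc()` follows; the layout of `grid` (a data frame of candidate `eta` values and ranks built with `expand.grid()`) and the specific values are assumptions for illustration, while `pcp_fn`, `grid`, `perc_test`, and `num_runs` are the arguments described above:

```r
library(pcpr)

# Candidate parameter settings to search over (hypothetical values)
grid <- expand.grid(eta = 10^seq(-2, 1, length.out = 8), r = 1:5)

# Cross-validated grid search: hold out 5% of entries per run, repeat 100 times
search <- grid_search_cv(D, pcp_fn = rrmc, grid = grid,
                         perc_test = 0.05, num_runs = 100)

search$summary_stats  # per-setting statistics used to choose the optimal parameters
```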
Experimentally, this grid search procedure retrieves the best performing PCP parameter settings when \(\xi\) is relatively low, e.g., \(\xi = 0.05\) (5%), and \(K\) is relatively high, e.g., \(K = 100\). This is because:
Once the grid search has been conducted, the optimal hyperparameters can be chosen by examining the output statistics `summary_stats`. Below are a few suggestions for how to interpret the `summary_stats` table:
- The `rel_err` statistic captures the relative discrepancy between recovered test sets and their original, observed (yet possibly noisy) values. Lower `rel_err` means the PCP model was better able to recover the held-out test set, so, in general, the best parameter settings are those with the lowest `rel_err`. Having said this, it is important to remember that this statistic should be taken with a grain of salt: because in practice the researcher does not have access to the ground truth \(L\) matrix, the `rel_err` measurement is forced to rely on the comparison between the noisy observed data matrix \(D\) and the estimated low-rank model \(L\). The `rel_err` metric is therefore an “apples to oranges” relative error. For data that is a priori expected to be subject to a high degree of noise, it may actually be better to discard parameter settings with suspiciously low `rel_err` values (in which case the solution may be hallucinating an inaccurate low-rank structure from the observed noise).
- When using `root_pcp()` as the PCP model, parameter settings that fail to converge can be discarded. Generally, fewer `root_pcp()` iterations (`num_iter`) taken to reach convergence portend a more reliable / stable solution. In rare cases, the user may need to increase `root_pcp()`'s `max_iter` argument to reach convergence. `rrmc()` does not report convergence metadata, as its optimization scheme runs for a fixed number of iterations.
- The estimated rank and sparsity statistics depend on the `thresh` set by the `sparsity()` and `matrix_rank()` functions. E.g., it could be that the actual average matrix rank is much higher or lower when a threshold that better takes into account the relative scale of the singular values is used. Likewise for the sparsity estimations. Also, recall that the given value for \(\xi\) artificially sets a sparsity floor, since those missing entries in the test set cannot be recovered in the \(S\) matrix. E.g., if \(\xi = 0.05\), then no parameter setting will have an estimated sparsity lower than 5%.

To see how to apply all of the above in `pcpr`, we recommend reading:
vignette("pcp-quickstart"): applying PCP to a simulatedenvironmental mixturevignette("pcp-applied"): employing PCP for sourceapportionment of real-world PM2.5 air pollution concentration data usingthequeens datasetGibson, Elizabeth A., Junhui Zhang, Jingkai Yan, Lawrence Chillrud,Jaime Benavides, Yanelli Nunez, Julie B. Herbstman, Jeff Goldsmith, JohnWright, and Marianthi-Anna Kioumourtzoglou. “Principal component pursuitfor pattern identification in environmental mixtures.” EnvironmentalHealth Perspectives 130, no. 11 (2022): 117008.
Zhang, Junhui, Jingkai Yan, and John Wright. “Square root principal component pursuit: tuning-free noisy robust matrix recovery.” Advances in Neural Information Processing Systems 34 (2021): 29464-29475.

Boyd, Stephen, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends in Machine Learning 3, no. 1 (2011): 1-122.

Gao, Wenbo, Donald Goldfarb, and Frank E. Curtis. “ADMM for multiaffine constrained optimization.” Optimization Methods and Software 35, no. 2 (2020): 257-303.

Cherapanamjeri, Yeshwanth, Kartik Gupta, and Prateek Jain. “Nearly optimal robust matrix completion.” International Conference on Machine Learning. PMLR, 2017.

Candès, Emmanuel J., Xiaodong Li, Yi Ma, and John Wright. “Robust principal component analysis?” Journal of the ACM (JACM) 58, no. 3 (2011): 1-37.