Theory crash course

The R package pcpr implements Principal Component Pursuit (PCP), a robust dimensionality reduction technique, for pattern recognition tailored to environmental health (EH) data. The statistical methodology and computational details are provided in Gibson et al. (2022).

In this code-free vignette, we present a crash course in PCP’s theoretical background, so researchers can better navigate all of the functionality offered in pcpr. We will touch upon:

  1. PCP’s modeling overview
  2. The low-rank and sparse matrices L and S
  3. The EH-specific extensions in pcpr
  4. Convex and non-convex PCP
  5. PCP parameters lambda, mu, and eta
  6. Tuning parameters with grid_search_cv()

We recommend users skim this crash course before reading the two code-heavy vignettes:

PCP modeling overview

PCP algorithms model an observed exposure matrix \(D\) as the sum of three underlying ground-truth matrices:

\[D = L_0 + S_0 + Z_0,\]

a low-rank matrix \(L_0\) encoding consistent patterns of exposure, a sparse matrix \(S_0\) isolating unique or outlying exposure events (that cannot be explained by the consistent exposure patterns), and dense noise \(Z_0\). All of these matrices are of dimension \(n \times p\), where \(n\) is the number of observations (e.g. study participants or measurement dates) and \(p\) is the number of exposures (e.g. chemical and/or non-chemical stressors). Beyond this mixtures model, the main assumption made by PCP is that \(Z_0 \sim N(\mu, \sigma^2)\) consists of independently and identically distributed (i.i.d.) Gaussian noise corrupting each entry of the overall exposure matrix \(D\).

The models in pcpr seek to decompose an observed data matrix \(D\) into estimated low-rank and sparse components \(L\) and \(S\) for use in downstream environmental health analyses.
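To fix ideas, here is a small base R simulation (with made-up dimensions and values) that builds an observed matrix \(D\) from the three ground-truth components described above; PCP’s task is then to recover approximations of \(L_0\) and \(S_0\) from \(D\) alone:

```r
set.seed(42)
n <- 50; p <- 6; r <- 2

# Ground-truth low-rank patterns, sparse outliers, and dense Gaussian noise.
L0 <- matrix(rexp(n * r), n, r) %*% matrix(runif(r * p), r, p)  # rank-2 pattern matrix
S0 <- matrix(0, n, p)
S0[sample(n * p, 10)] <- runif(10, min = 5, max = 10)           # 10 outlying exposure events
Z0 <- matrix(rnorm(n * p, sd = 0.1), n, p)                      # i.i.d. Gaussian noise

D <- L0 + S0 + Z0  # the observed mixture that would be handed to a PCP model
```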

The low-rank matrix

The estimated low-rank matrix \(L\) provides information on the consistent exposure patterns, satisfying:

\[r = \text{rank}(L) \ll \min(n,p).\]

The rank \(r\) of a matrix is the number of linearly independent columns or rows in the matrix, and plays an important role in defining the mathematical structure of the data. Intuitively, the rank directly corresponds to the (relatively few) number of underlying patterns governing the mixture. Here, “patterns” can refer to specific sources, profiles or behaviors leading to exposure, depending on the application.
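As a concrete (toy) illustration in base R, a noiseless mixture generated by only two underlying patterns has rank 2, no matter how many exposure columns it contains:

```r
set.seed(1)
n <- 100; p <- 8

# Two underlying p-dimensional exposure patterns and per-observation scores.
patterns <- matrix(runif(2 * p), nrow = 2)
scores   <- matrix(rexp(n * 2),  ncol = 2)

L0 <- scores %*% patterns   # 100 x 8 mixture driven by only 2 patterns
qr(L0)$rank                 # 2
sum(svd(L0)$d > 1e-8)       # 2 non-negligible singular values
```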

Unlike closely related dimension reduction tools such as principal component analysis (PCA), PCP infers the rank \(r\) from the observed data. In pcpr, this is done directly during optimization for convex PCP, and via grid search for non-convex PCP. As such, rather than require the researcher to choose the number of estimated patterns for use in subsequent health models, PCP allows the observed data to “speak for itself”, thereby removing potential points of subjectivity in model design.

Notice that \(L \in \mathbb{R}^{n \times p}\), meaning it is still defined in terms of the original \(n\)-many observations and \(p\)-many environmental variables. Put differently, \(L\) can be taken as a robust approximation to the true environmental mixture matrix, unperturbed by outliers (captured in \(S\)) or noise (handled by PCP’s noise term). In this way, the latent exposure patterns are encoded in \(L\) rather than directly estimated. To explicitly obtain the exposure patterns from \(L\), PCP may then be paired with various matrix factorization methods (e.g., PCA, factor analysis, or non-negative matrix factorization) that yield chemical loadings and individual scores for use in downstream health models.

This flexibility allows \(L\) to adapt to mixture-specific assumptions. For example, if the assumption of orthogonal (i.e., independent) patterns is too strong, then instead of pairing \(L\) with PCA, a more appropriate method such as factor analysis can be used. Alternatively, depending on the sample size and study design, \(L\) may also be directly incorporated into regression models.
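For instance, a minimal sketch of pairing an estimated low-rank matrix with PCA (assuming `L` is the low-rank component returned by one of the pcpr models) might be:

```r
# Extract patterns from the estimated low-rank matrix L via PCA;
# any other factorization (factor analysis, NMF, ...) could be swapped in here.
pca_fit <- prcomp(L, center = TRUE, scale. = TRUE)

loadings <- pca_fit$rotation[, 1:2]  # exposure loadings on the first two patterns
scores   <- pca_fit$x[, 1:2]         # per-observation pattern scores for health models
summary(pca_fit)                     # variance explained by each component
```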

The sparse matrix

The estimated sparse matrix \(S\) captures unusually high or low outlying exposure events, unexplained by the identified patterns in \(L\). Most entries in \(S\) are 0, with non-zero entries identifying such extreme exposure activity. The number, location (i.e., support), and value of non-zero entries in \(S\) need not be a priori defined; PCP isolates these itself during optimization.
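A quick way to inspect the result (assuming `S` is the sparse component returned by a pcpr model) is to count and locate its non-zero entries in base R:

```r
mean(S != 0)  # proportion of entries flagged as sparse exposure events

# Which observation/exposure pairs were flagged, and how extreme the events are:
events <- which(S != 0, arr.ind = TRUE)
head(data.frame(events, value = S[events]))
```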

By separating and retaining sparse exposure events, PCP boasts an enormous advantage over other current dimension reduction techniques. Despite being common phenomena in mixtures data, sparse outliers are typically removed from the exposure matrix prior to analysis. This is because PCA and other conventional dimension reduction approaches are unable to disentangle such unique events from the overall patterns of exposure: if included, even low fractions of outliers can pull the patterns identified by traditional methods away from the true distribution of the data, yielding inaccurate pattern estimations and high false positive rates of detected outliers. By decomposing a mixture into low-rank and sparse components \(L\) and \(S\), PCP avoids such pitfalls.

Extensions for environmental health data

The functions in pcpr are outfitted with three environmental health (EH)-specific extensions, making pcpr particularly powerful for EH research:

  1. Missing value functionality: PCP is able to recover NA values in the observed mixture matrix, often outperforming traditional imputation techniques.
  2. Leveraging potential limit of detection (LOD) information: When equipped with LOD information, PCP treats any estimations of values known to be below the LOD as equally valid if their approximations fall between 0 and the LOD. PCP with LOD data often outperforms PCA imputed with \(\frac{LOD}{\sqrt{2}}\).
  3. Non-negativity constraint on the estimated L matrix: If desired, PCP can enforce values in the estimated low-rank matrix L to be \(\geq 0\), better modeling real world mixtures data.

Missing value functionality

PCP assumes that the same data generating mechanisms govern both the missing and the observed entries in \(D\). Because PCP primarily seeks accurate estimation of patterns rather than individual observations, this assumption is reasonable, though it may not hold in some edge cases. Missing values in \(D\) are therefore reconstructed in the recovered low-rank \(L\) matrix according to the underlying patterns in \(L\). There are three corollaries to keep in mind regarding the quality of recovered missing observations:

  1. Recovery of missing entries in \(D\) relies on accurate estimation of \(L\);
  2. The fewer observations there are in \(D\), the harder it is to accurately reconstruct \(L\) (therefore estimation of both unobserved and observed measurements in \(L\) degrades); and
  3. Greater proportions of missingness in \(D\) artificially drive up the sparsity of the estimated \(S\) matrix. This is because it is not possible to recover a sparse event in \(S\) when the corresponding entry in \(D\) is unobserved. By definition, sparse events in \(S\) cannot be explained by the consistent patterns in \(L\). Practically, if 20% of the entries in \(D\) are missing, then at least 20% of the entries in \(S\) will be 0.

Leveraging potential limit of detection (LOD) information

When equipped with \(LOD\) information, PCP treats any estimations of values known to be below the \(LOD\) as equally valid if their approximations fall between \(0\) and the \(LOD\). Over the course of optimization, observations below the LOD are pushed into this known range \([0, LOD]\) using penalties from above and below: should a \(< LOD\) estimate be \(< 0\), it is stringently penalized, since measured observations cannot be negative. On the other hand, if a \(< LOD\) estimate is \(> LOD\), it is also heavily penalized: less so than when \(< 0\), but more so than observations known to be above the \(LOD\), because we have prior information that these observations must be below \(LOD\). Observations known to be above the \(LOD\) are penalized as usual, using the Frobenius norm in the PCP objective function.
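As an illustration only (not the package’s internal implementation; the weights `w_neg` and `w_above` below are hypothetical), the treatment of an entry known to be below the LOD can be sketched as a penalty on how far its estimate falls outside the known interval \([0, LOD]\):

```r
# Conceptual sketch: estimates inside [0, LOD] incur no penalty; estimates
# outside the interval are penalized by their distance from it, most
# stringently when they fall below 0.
below_lod_penalty <- function(estimate, lod, w_neg = 2, w_above = 1) {
  ifelse(estimate < 0,
         w_neg * abs(estimate),              # negative estimates: most stringent
         ifelse(estimate > lod,
                w_above * (estimate - lod),  # above the LOD: heavy, but less so
                0))                          # inside [0, LOD]: equally valid
}

below_lod_penalty(c(-0.3, 0.2, 0.6, 1.0), lod = 0.5)
#> 0.6 0.0 0.1 0.5
```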

Gibson et al. (2022) demonstrate that in experimental settings with up to 50% of the data corrupted below the \(LOD\), PCP with the \(LOD\) extension boasts superior accuracy of recovered \(L\) models compared to PCA coupled with \(\frac{LOD}{\sqrt{2}}\) imputation. PCP even outperforms PCA in low-noise scenarios with as much as 75% of the data corrupted below the \(LOD\). The few situations in which PCA bettered PCP were those pathological cases in which \(D\) was characterized by extreme noise and huge proportions (i.e., 75%) of observations falling below the \(LOD\).

Non-negativity constraint on the estimated L matrix

To enhance interpretability of PCP-rendered solutions, there is an optional non-negativity constraint that can be imposed on the \(L\) matrix to ensure all estimated values within it are \(\geq 0\). This prevents researchers from having to deal with negative observation values and questions surrounding their meaning and utility. Non-negative \(L\) models also allow for seamless use of methods such as non-negative matrix factorization to extract non-negative patterns.

Currently, the non-negativity constraint is only supported in the convex PCP function root_pcp(), incorporated in the ADMM splitting technique via the introduction of an additional optimization variable and corresponding constraint. Future work will extend the constraint to the non-convex PCP method rrmc().

Convex vs. non-convex PCP

Of the many flavors of PCP undergoing active study in the current literature, we provide two distinct models in pcpr: the convex model root_pcp() and the non-convex model rrmc(). The table below offers a quick glance at their relative differences:

                                       root_pcp()              rrmc()
Convex?                                Yes                     No
Convergence?                           Slow                    Fast
Expected low-rank structure?           Well-defined            Complex
Parameters?                            \(D, \lambda, \mu\)     \(D, r, \eta\)
Supports missing values?               Yes                     Yes
Supports LOD penalty?                  Yes                     Yes
Supports non-negativity constraint?    Yes                     No
Rank determination?                    Autonomous              User-defined*
Sparse event identification?           Autonomous              Autonomous
Optimization approach?                 ADMM                    Iterative rank-based

*rrmc() can be paired with the cross-validated grid_search_cv() function for autonomous rank determination.

Convex PCP via root_pcp() is best for data characterized by rapidly decaying singular values (e.g. image and video data), indicative of very well-defined latent patterns.

Non-convex PCP with rrmc() is best suited for data characterized by slowly decaying singular values, indicative of complex underlying patterns and a relatively large degree of noise. Most EH data can be described this way, so we expect most EH researchers to utilize rrmc() in their analyses; however, there are cases where the convexity of root_pcp() may be preferable.
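One practical way to check which regime a dataset falls into is to look at how quickly the singular values of the (standardized) mixture matrix decay; a small base R sketch, assuming `D` is the exposure matrix:

```r
# Inspect singular value decay to help choose between root_pcp() and rrmc().
sv <- svd(scale(D, center = TRUE, scale = TRUE))$d

plot(sv, type = "b", xlab = "Component", ylab = "Singular value",
     main = "Singular value decay of the scaled mixture")

# A rapid drop-off after a few components suggests well-defined low-rank
# structure (root_pcp() is a good candidate); a slow, gradual decay suggests
# complex patterns and noise (rrmc() is usually the better fit for EH data).
```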

Convex PCP

Convex PCP formulations possess a number of particularly attractive properties, foremost of which is convexity, meaning that every local optimum is a global optimum, and a single best solution exists. Convex approaches to PCP also have the virtue that the rank \(r\) of the recovered \(L\) matrix is determined during optimization, without researcher input.

Unfortunately, these benefits come at a cost: convex PCP programs are expensive to run on large datasets, suffering from poor convergence rates. Moreover, convex PCP approaches are best suited to instances in which the target low-rank matrix \(L_0\) can be accurately modelled as low-rank (i.e. \(L_0\) is governed by only a few very well-defined patterns). This is often the case with image and video data (characterized by rapidly decaying singular values), but not common for EH data, which is typically only approximately low-rank (characterized by complex patterns and slowly decaying singular values).

The convex model available in pcpr is root_pcp(). For a comprehensive technical understanding, we refer readers to Zhang et al. (2021), which introduces the algorithm.

root_pcp() optimizes the following objective function:

\[\min_{L, S} ||L||_* + \lambda ||S||_1 +\mu ||L + S - D||_F\]

The first term is the nuclear norm of the \(L\) matrix, incentivizing \(L\) to be low-rank. The second term is the \(\ell_1\) norm of the \(S\) matrix, encouraging \(S\) to be sparse. The third term is the Frobenius norm applied to the model’s noise, ensuring that the estimated low-rank and sparse models \(L\) and \(S\) together have high fidelity to the observed data \(D\). The objective is neither smooth nor differentiable; however, it is convex and separable. As such, it is optimized using the Alternating Direction Method of Multipliers (ADMM) algorithm (Boyd et al. 2011; Gao et al. 2020).
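A hedged usage sketch (assuming `D` is the exposure matrix, that `root_pcp()` accepts `lambda` and `mu` arguments, and that its estimates are returned in components named `L` and `S`; the starting values come from `get_pcp_defaults()`, discussed below):

```r
library(pcpr)

defaults <- get_pcp_defaults(D)  # theoretically motivated lambda and mu (see below)

# Convex PCP via ADMM: lambda weights the sparsity of S,
# mu weights the noise (fidelity) term.
fit <- root_pcp(D, lambda = defaults$lambda, mu = defaults$mu)

L_hat <- fit$L  # estimated low-rank component
S_hat <- fit$S  # estimated sparse component
```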

Non-convex PCP

To alleviate the high computational complexity of convex methods, non-convex PCP frameworks have been developed. These drastically improve upon the convergence rates of their convex counterparts. Better still, non-convex PCP methods more flexibly accommodate data lacking a well-defined low-rank structure, so they are from the outset better suited to handling EH data. Non-convex formulations provide this flexibility by allowing the user to interrogate the data at different ranks.

The drawback here is that non-convex algorithms can no longer determine the rank best describing the data on their own, instead requiring the researcher to subjectively specify the rank \(r\) as in PCA. However, by pairing non-convex PCP algorithms with the cross-validation routine implemented in the grid_search_cv() function, the optimal rank can be determined semi-autonomously; the researcher need only define a rank search space from which the optimal rank will be identified via grid search. One of the more glaring trade-offs made by non-convex methods for this improved run-time and flexibility is weaker theoretical promises; specifically, non-convex PCP runs the risk of finding spurious local optima, rather than the global optimum guaranteed by their convex siblings. Having said that, theory has been developed guaranteeing equivalent performance between non-convex implementations and closely related convex formulations under certain conditions. These advancements provide strong motivation for non-convex frameworks despite the weaker guarantees.

The non-convex model available in pcpr is rrmc(). We refer readers to Cherapanamjeri et al. (2017) for an in-depth look at the mathematical details.

rrmc() implicitly optimizes the following objective function:

\[\min_{L, S} I_{rank(L) \leq r} + \eta||S||_0 + ||L + S - D||_F^2\]

The first term is the indicator function checking that the \(L\) matrix is strictly rank \(r\) or less, implemented using a rank \(r\) projection operator proj_rank_r(). The second term is the \(\ell_0\) norm applied to the \(S\) matrix to encourage sparsity, and is implemented with the help of an adaptive hard-thresholding operator hard_threshold(). The third term is the squared Frobenius norm applied to the model’s noise.
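To give a feel for what these two operators do, here are simplified base R analogues (illustrative only; the signatures and details of pcpr’s proj_rank_r() and hard_threshold() may differ):

```r
# Rank-r projection: keep only the top r singular values/vectors of a matrix.
project_rank_r <- function(M, r) {
  s <- svd(M)
  s$u[, 1:r, drop = FALSE] %*% diag(s$d[1:r], r, r) %*% t(s$v[, 1:r, drop = FALSE])
}

# Hard-thresholding: zero out entries whose magnitude falls below a threshold,
# leaving only the large "sparse event" entries.
hard_threshold_simple <- function(M, thresh) {
  M * (abs(M) >= thresh)
}
```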

rrmc() uses an incremental rank-based strategy in order to estimate \(L\) and \(S\): First, a rank-\(1\) model \((L^{(1)}, S^{(1)})\) is estimated. The rank-\(1\) model is then used as an initialization point to construct a rank-\(2\) model \((L^{(2)}, S^{(2)})\), and so on, until the desired rank-\(r\) model \((L^{(r)}, S^{(r)})\) is recovered. All models from ranks \(1\) through \(r\) are returned by rrmc() in this way.
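In practice, this might look like the following hedged sketch (the value of `eta` is illustrative, and the exact structure of the returned object, assumed here to expose the final `L` and `S` estimates, may vary):

```r
# Fit non-convex PCP up to rank 5; models of rank 1 through 4 are built
# internally as initialization points along the way.
fit <- rrmc(D, r = 5, eta = 0.35)

L_hat <- fit$L  # low-rank estimate from the final, rank-5 model
S_hat <- fit$S  # sparse estimate from the final model
```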

PCP parameters

Intuition behind lambda, mu, and eta

Recall root_pcp()’s objective function is given by:

\[\min_{L, S} ||L||_* + \lambda ||S||_1 +\mu ||L + S - D||_F\]

rrmc()’s objective function is given by:

\[\min_{L, S} I_{rank(L) \leq r} + \eta||S||_0 + ||L + S - D||_F^2\]

In both objectives, the parameter multiplying the sparsity term governs how aggressively sparse events are suppressed: larger values of \(\lambda\) (in root_pcp()) or \(\eta\) (in rrmc()) yield a sparser \(S\) with fewer, more extreme events, while smaller values allow more of \(D\) to be explained as outliers. The parameter \(\mu\) weights root_pcp()’s noise term: larger values of \(\mu\) force \(L + S\) to reproduce \(D\) more faithfully (attributing less of the data to noise), whereas smaller values let PCP absorb more of \(D\) as noise.

Theoretically optimal parameters

The get_pcp_defaults() function calculates the “default” PCP parameter settings of \(\lambda\), \(\mu\) and \(\eta\) for a given data matrix \(D\).

The “default” values of \(\lambda\) and \(\mu\) offer theoretical guarantees of optimal estimation performance. Candès et al. (2011) obtained the guarantee for \(\lambda\), while Zhang et al. (2021) obtained the result for \(\mu\). It has not yet been proven whether or not \(\eta\) enjoys similar properties.

The theoretically optimal \(\lambda_*\) is given by:

\[\lambda_* = 1 / \sqrt{\max(n,p)},\]

where \(n\) and \(p\) are the dimensions of the input matrix \(D_{n \times p}\).

The theoretically optimal \(\mu_*\) is given by:

\[\mu_* = \sqrt{\frac{\min(n,p)}{2}}.\]

The “default” value of \(\eta\) is then simply \(\eta = \frac{\lambda_*}{\mu_*}\).
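These formulas are easy to reproduce directly; for a hypothetical \(100 \times 20\) exposure matrix:

```r
n <- 100; p <- 20

lambda_star <- 1 / sqrt(max(n, p))    # 1 / sqrt(100) = 0.1
mu_star     <- sqrt(min(n, p) / 2)    # sqrt(10) ~ 3.162
eta_star    <- lambda_star / mu_star  # ~ 0.0316

# Equivalently (assuming get_pcp_defaults() returns these values in a list):
# defaults <- get_pcp_defaults(D)
# defaults$lambda; defaults$mu; defaults$eta
```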

Empirically optimal parameters

Mixtures data is rarely so well-behaved in practice, however. Instead, it is common to find different empirically optimal parameter values after tuning these parameters in a grid search. Therefore, it is recommended to use get_pcp_defaults() primarily to help define a reasonable initial parameter search space to pass into grid_search_cv().
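For example, a hedged sketch of building such a search space around the defaults (the multipliers, the rank range, and the use of expand.grid() are illustrative choices, not package requirements):

```r
defaults <- get_pcp_defaults(D)

# Search several multiples of the theoretical default eta, crossed with a
# range of candidate ranks for the non-convex model rrmc().
grid <- expand.grid(
  eta = defaults$eta * c(0.1, 0.5, 1, 5, 10),
  r   = 1:5
)
head(grid)
```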

Tuning parameters with grid_search_cv()

Cross-validation procedure

grid_search_cv() conducts a Monte Carlo style cross-validated grid search of PCP parameters for a given data matrix \(D\), PCP function pcp_fn, and grid of parameter settings grid to search through. The run time of the grid search can be sped up using bespoke parallelization settings.

Each hyperparameter setting is cross-validated by:

  1. Randomly corrupting \(\xi\) percent of the entries in \(D\) as missing (i.e. NA values), yielding \(P_\Omega(D)\). This is done using the sim_na() function.
  2. Running the given PCP function pcp_fn on \(P_\Omega(D)\), yielding estimates \(L\) and \(S\).
  3. Recording the relative recovery error of \(L\) compared with the input data matrix \(D\) for only those values that were imputed as missing during the corruption step (step 1 above). Formally, the relative error is calculated with: \[\text{RelativeError}(L | D) := \frac{||P_{\Omega^c}(D - L)||_F}{||P_{\Omega^c}(D)||_F}\] (a short base R sketch of this computation appears after this list).
  4. Re-running steps 1-3 for a total of \(K\)-many runs (each “run” has a unique random seed from 1 to \(K\) associated with it).
  5. Performance statistics can then be calculated for each “run”, and then summarized across all runs using mean-aggregated statistics.
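A minimal base R sketch of the relative error computation in step 3 (assuming `D` is the original matrix, `D_corrupted` is its copy with the test entries set to NA, and `L` is the low-rank estimate recovered from the corrupted matrix):

```r
held_out <- is.na(D_corrupted)  # the entries masked in the corruption step

# Frobenius norm of the recovery error, restricted to the held-out entries,
# relative to the Frobenius norm of the held-out entries themselves.
relative_error <- sqrt(sum((D[held_out] - L[held_out])^2)) /
  sqrt(sum(D[held_out]^2))
```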

In the grid_search_cv() function, \(\xi\) is referred to as perc_test (percent test), while \(K\) is known as num_runs (number of runs).
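Putting this together, a hedged sketch of a full grid search (aside from perc_test and num_runs, the argument names and the structure of the returned summary_stats are assumptions based on the description above):

```r
results <- grid_search_cv(
  D,
  pcp_fn    = rrmc,   # the PCP model being tuned
  grid      = grid,   # e.g. the eta-by-rank grid built above
  perc_test = 0.05,   # corrupt 5% of the entries in each run (xi)
  num_runs  = 100     # K Monte Carlo runs per parameter setting
)

results$summary_stats  # mean-aggregated recovery statistics for each setting
```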

Best practices for \(\xi\) and \(K\)

Experimentally, this grid search procedure retrieves the best-performing PCP parameter settings when \(\xi\) is relatively low, e.g. \(\xi = 0.05\), or 5%, and \(K\) is relatively high, e.g. \(K = 100\). This is because a small \(\xi\) leaves the vast majority of \(D\) observed, so the PCP model can still estimate \(L\) accurately (recall that recovery of missing entries degrades as more of \(D\) is unobserved), while a large \(K\) averages the recovery statistics over many random corruption patterns, yielding more stable Monte Carlo estimates of each parameter setting’s performance.

Interpretation of results

Once the grid search has been conducted, the optimal hyperparameters can be chosen by examining the output statistics in summary_stats; a natural starting point is to select the parameter setting achieving the lowest mean relative recovery error across runs.

Coded example analyses

To see how to apply all of the above in pcpr, we recommend reading the two code-heavy vignettes mentioned above.

References

Gibson, Elizabeth A., Junhui Zhang, Jingkai Yan, Lawrence Chillrud, Jaime Benavides, Yanelli Nunez, Julie B. Herbstman, Jeff Goldsmith, John Wright, and Marianthi-Anna Kioumourtzoglou. “Principal component pursuit for pattern identification in environmental mixtures.” Environmental Health Perspectives 130, no. 11 (2022): 117008.

Zhang, Junhui, Jingkai Yan, and John Wright. “Square root principal component pursuit: tuning-free noisy robust matrix recovery.” Advances in Neural Information Processing Systems 34 (2021): 29464-29475.

Boyd, Stephen, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends in Machine Learning 3, no. 1 (2011): 1-122.

Gao, Wenbo, Donald Goldfarb, and Frank E. Curtis. “ADMM for multiaffine constrained optimization.” Optimization Methods and Software 35, no. 2 (2020): 257-303.

Cherapanamjeri, Yeshwanth, Kartik Gupta, and Prateek Jain. “Nearly optimal robust matrix completion.” International Conference on Machine Learning. PMLR, 2017.

Candès, Emmanuel J., Xiaodong Li, Yi Ma, and John Wright. “Robust principal component analysis?” Journal of the ACM (JACM) 58, no. 3 (2011): 1-37.

