Stochastic convex sparse principal component analysis
EURASIP Journal on Bioinformatics and Systems Biology, volume 2016, Article number: 15 (2016)
Abstract
Principal component analysis (PCA) is a dimensionality reduction and data analysis tool commonly used in many areas. The main idea of PCA is to represent high-dimensional data with a few representative components that capture most of the variance present in the data. However, there is an obvious disadvantage of traditional PCA when it is applied to analyze data where interpretability is important. In applications where the features have physical meanings, we lose the ability to interpret the principal components extracted by conventional PCA because each principal component is a linear combination of all the original features. For this reason, sparse PCA has been proposed to improve the interpretability of traditional PCA by introducing sparsity to the loading vectors of the principal components. Sparse PCA can be formulated as an ℓ1 regularized optimization problem, which can be solved by proximal gradient methods. However, these methods do not scale well because computation of the exact gradient is generally required at each iteration. The stochastic gradient framework addresses this challenge by computing an expected gradient at each iteration. Nevertheless, stochastic approaches typically have low convergence rates due to high variance. In this paper, we propose a convex sparse principal component analysis (Cvx-SPCA), which leverages a proximal variance-reduced stochastic scheme to achieve a geometric convergence rate. We further show that the convergence analysis can be significantly simplified by using a weak condition, which allows a broader class of objectives to be applied. The efficiency and effectiveness of the proposed method are demonstrated on a large-scale electronic medical record cohort.
1 Introduction
Principal component analysis (PCA) is a commonly used dimensionality reduction and data analysis tool in many areas such as computer vision [1, 2], data mining [3, 4], biomedical informatics [5, 6], and many others. The goal of PCA is to learn a linear transformation such that the learned principal components are the dimensions retaining most of the variance in the data. Principal components are obtained by computing the eigenvalue decomposition of the covariance matrix; they can also be computed from the singular value decomposition of the data matrix. Let \(\mathbf{S} = \frac{1}{n} \sum_{i = 1}^{n} \mathbf{x}_{i}\mathbf{x}_{i}^{T}\) be the normalized covariance matrix for n training data points, where each data point lies in a d-dimensional feature space. The PCA problem of computing the top p components can be written as the following optimization problem:
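In its standard trace-maximization form, and writing the top p loading vectors as the columns of Z, Eq. (1) is presumably:

\[
\max_{\mathbf{Z} \in \mathbb{R}^{d \times p}} \ \operatorname{tr}\left(\mathbf{Z}^{T}\mathbf{S}\mathbf{Z}\right) \quad \text{subject to} \quad \mathbf{Z}^{T}\mathbf{Z} = \mathbf{I}, \qquad (1)
\]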
where Z is an orthogonal projection matrix. In many applications, we are only interested in a few top principal components. In this case, the principal components can be computed in an iterative fashion: the leading principal component is calculated at each iteration (e.g., using power methods), and we then deflate the computed component so that the next principal component becomes the leading one [7]. Therefore, we focus on finding the leading principal component in this paper. In spite of its advantages, PCA has an obvious disadvantage. In the solution of Eq. (1), the principal components are linear combinations of all input variables. This means that the columns of the Z matrix, which are called the loadings of the principal components, are dense. One important implication of dense loadings is that we lose the ability to interpret the output dimensions of conventional PCA. PCA works well if we are not interested in the physical meanings of the features or if the interpretation of principal components is not crucial for the application. However, interpretability is a significant factor in many applications such as biology, finance, and biomedical informatics. In the domain of biomedical informatics, as more and more electronic medical records (EMR) [8] of patients become available, medical researchers are interested in applying various techniques to analyze the EMR data. Each feature of the EMR data is a record/event related to a certain diagnosis. When traditional PCA is applied to such data, the medical features are projected to a low-dimensional space in which each new feature is a linear combination of all the original features. In this case, it is hard to comprehend the meaning of the new features.
Sparse PCA has been proposed to address this drawback. In sparse PCA, we learn sparse loading vectors that combine only a few of the input variables, allowing interpretation of the principal components. Sparse PCA was first proposed by Zou et al. [9], where PCA was formulated as a regression problem and sparsity was introduced by imposing the lasso (elastic net) constraint. Other common approaches to the sparse PCA problem are semi-definite programming [10, 11] and the inverse power method [12]. Moreover, a more recent study [13] investigated sparse PCA with the oracle property. The aforementioned approaches are generally not scalable enough to work with large-scale datasets. One way to deal with large sample sizes is to use stochastic methods. An example of stochastic PCA is given in [7], where the authors described an algorithm with computationally cheap stochastic iterations and the variance reduction suggested in [14].
In this study, sparse PCA is posed as an ℓ1 regularized optimization problem. Standard approaches to solving such sparse learning problems are proximal gradient methods [15–17], which require computation of the full gradient at each iteration. These methods generally work with a composite function comprising a smooth part and a non-smooth part. A large family of machine learning problems [18–23] can be expressed as composite functions. Traditionally, solving problems whose objectives are not continuously differentiable requires subgradient descent [24], which has very poor performance [25]. The recently developed proximal gradient methods can solve these composite problems with fast convergence rates [26, 27]. However, these methods are hardly scalable to problems with large sample sizes because of the computation of the full gradient. Therefore, stochastic gradient-based methods are preferred for such problems. One major disadvantage of stochastic gradient descent is slow convergence due to the high variance introduced by random sampling. Johnson and Zhang proposed a solution for this drawback in [14]. Their solution reduces the variance by using a copy of the estimated optimal point and the full gradient at this point in the gradient step. This approach exploits the strong convexity property to obtain a geometric convergence rate in expectation. Xiao and Zhang similarly presented a multi-stage scheme to progressively reduce the variance of the proximal stochastic gradient (Prox-SVRG), with a geometric convergence rate in expectation [15]. The fundamental assumptions were Lipschitz continuity of the gradient of the smooth part and strong convexity of the objective function.
To tackle the aforementioned challenges, in this paper we introduce a novel stochastic convex sparse PCA (Cvx-SPCA) method that is highly efficient and can handle large-scale datasets. Specifically, we propose to adopt a convex formulation of PCA [28], which provides a strongly convex loss function. The problem structure in this design allows us to leverage the efficient scheme of Prox-SVRG [15], which leads to an exponential (geometric) convergence rate. We also investigate the convergence analysis of Prox-SVRG and present a new proof of the convergence rate that significantly reduces the conditions and assumptions required. As such, we show that the optimization scheme can be applied to a much larger class of problems while retaining the geometric convergence rate. We conducted extensive experiments on both synthetic and real datasets to illustrate the efficiency of the proposed algorithm. Because of this efficiency, we were able to apply the proposed algorithm to analyze a real EMR cohort with a large number of patients, which is hardly feasible with traditional approaches.
2 Convex sparse principal component analysis
In this section, we introduce the problem formulation and optimization scheme of the proposed approach. The problem of finding a sparse loading vector is posed as the combination of the ℓ1 sparsity-inducing norm and the convexity of the convex principal component analysis formulation, which allows us to utilize an extremely efficient stochastic proximal gradient approach.
2.1 Convex sparse PCA
The goal of sparse PCA is to learn sparse loading vectors such that the principal components will be linear combinations of a few key variables instead of all the variables. We propose the following convex optimization problem:
where the convex PCA loss [28] is given by:
and the regularization term R(z) = γ∥z∥1 is the ℓ1 norm of the loading vector z, \(\gamma \in \mathbb{R}\) is the regularization parameter controlling the sparsity of the loading vector, λ > λ1(S) is the convexity parameter, \(\mathbf{w} \in \mathbb{R}^{d}\) is a random vector, and \(\mathbf{S} = \frac{1}{n} \sum_{i = 1}^{n} \mathbf{x}_{i}\mathbf{x}_{i}^{T}\). Here, λ1(S) denotes the largest eigenvalue of the covariance matrix S, and w is a vector of normally distributed random numbers. An upper bound for the regularization parameter γ can be derived by standard subgradient analysis [25]: if γ is larger than the maximum absolute value of the elements of the vector w, i.e., ∥w∥∞, we end up with the trivial all-zero solution. This guides us to use a parameter range of γ ∈ [0, ∥w∥∞].
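Putting these definitions together, the optimization problem and the convex PCA loss are presumably of the following form (a hedged reconstruction based on the description above and the formulation in [28]):

\[
\min_{\mathbf{z} \in \mathbb{R}^{d}} \ P(\mathbf{z}) = F(\mathbf{z}) + R(\mathbf{z}) = F(\mathbf{z}) + \gamma \left\|\mathbf{z}\right\|_{1}, \qquad (2)
\]
\[
F(\mathbf{z}) = \frac{1}{2}\,\mathbf{z}^{T}\left(\lambda \mathbf{I} - \mathbf{S}\right)\mathbf{z} - \mathbf{w}^{T}\mathbf{z}. \qquad (3)
\]

Under this form, λ > λ1(S) makes λI − S positive definite, so F(z) is strongly convex with modulus λ − λ1(S); moreover, ∇F(0) = −w, so γ ≥ ∥w∥∞ makes z = 0 satisfy the optimality condition 0 ∈ ∂P(0), which matches the trivial-solution bound stated above.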
In the above approach, we use a convex optimization formulation for finding the first principal component, inspired by the work in [28]. Even though R(z) is not strongly convex, the overall cost function in Eq. (2) is strongly convex, with the strong convexity coming from F(z). The structure of the problem defined in Eq. (2) allows us to use gradient-based algorithms to obtain the global solution. Moreover, strong convexity usually ensures nice convergence properties for stochastic gradient schemes as well. Therefore, we can also benefit from the faster convergence rate of the proximal stochastic scheme proposed in [15]. We note that the objective function of traditional PCA, as shown in Eq. (1), does not define a convex problem, and thus the analysis in this paper cannot be applied to it.
The most common methods for solving problems such as Eq. (2), where the objective function is comprised of the average of smooth component functions plus a non-smooth function, are proximal gradient methods. In the next section, the method used to solve the convex optimization problem in Eq. (2) is explained.
2.2 Optimization scheme
In this paper, we propose to use a proximal stochastic gradient method with a progressive variance reduction approach [15] to solve the problem in Eq. (2). The function F(z) can also be written as the average of n smooth functions:
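A decomposition consistent with this statement (an assumed form, chosen so that the component functions average to the loss sketched above) is:

\[
F(\mathbf{z}) = \frac{1}{n}\sum_{i=1}^{n} f_{i}(\mathbf{z}), \qquad f_{i}(\mathbf{z}) = \frac{1}{2}\,\mathbf{z}^{T}\left(\lambda \mathbf{I} - \mathbf{x}_{i}\mathbf{x}_{i}^{T}\right)\mathbf{z} - \mathbf{w}^{T}\mathbf{z},
\]

so that averaging the \(f_{i}\) recovers \(\frac{1}{2}\mathbf{z}^{T}\left(\lambda\mathbf{I}-\mathbf{S}\right)\mathbf{z}-\mathbf{w}^{T}\mathbf{z}\).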
When n is very large, calculating the full gradient at each gradient descent iteration is an expensive operation. Hence, stochastic gradient methods are preferred for such problems. In the stochastic approach, instead of calculating gradients for all of the data points, one data point is randomly sampled at each iteration and the gradient at this point is calculated. Therefore, the number of computations decreases. However, the drawback of stochastic gradient methods is the high variance introduced by random sampling, which results in poor convergence rates. As discussed previously, there are solutions that reduce the variance and increase the convergence rate. One study that mitigates the high-variance problem of the stochastic gradient method is the proximal stochastic gradient method with progressive variance reduction [15]. The study in [15] showed that the variance of the gradient can be upper bounded by using a multi-stage scheme that progressively reduces the variance. As the algorithm converges to the optimal point, the variance also converges to zero. Therefore, this approach can achieve better convergence rates than conventional stochastic gradient methods even with constant step sizes. We refer the reader to Section 3.1 in [15] for a detailed proof of the variance bound.
In this paper, we also follow the approach in [15]. The algorithm used in this study is given in Algorithm 1.

In the algorithm, z0 is the initial value of the loading vector z, η is the constant step size, γ is the regularization parameter controlling the sparsity of z, m is the number of iterations in each epoch, and T is the maximum number of epochs. At each epoch, the full gradient at the point \(\tilde{\mathbf{z}}\) is calculated. The cost of calculating the full gradient is dominated by the product of a d×d matrix and a d-dimensional vector; therefore, when the feature dimension is high, the most time-consuming part of our algorithm is the multiplication with the covariance matrix. \(\tilde{\mathbf{z}}\) is an estimate of the optimal point, and it is updated at each epoch to be utilized in the gradient calculations. During the m stochastic gradient steps, we first sample a data point randomly and compute the gradient vk. If we take the expectation of the gradient calculated in Eq. (4), we can see that vk is also an estimate of the full gradient, as in conventional stochastic gradient methods. This shows that vk, given below, coincides with the full gradient in expectation.
where \(\nabla F\left(\tilde{\mathbf{z}}\right)\) is the average gradient of the functions fi(z), i = 1,…,n, i.e., the full gradient at the point \(\tilde{\mathbf{z}}\); ∇fik(zk−1) is the gradient of the component function corresponding to the data point xik sampled at the kth iteration; and \(\tilde{\mathbf{z}}\) is the average of zk, k = 1,…,m, at the end of an epoch.
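Given this description, vk is the standard variance-reduced gradient, so Eq. (4) presumably reads:

\[
\mathbf{v}_{k} = \nabla f_{i_{k}}\left(\mathbf{z}_{k-1}\right) - \nabla f_{i_{k}}\left(\tilde{\mathbf{z}}\right) + \nabla F\left(\tilde{\mathbf{z}}\right), \qquad (4)
\]

and taking the expectation over the sampled index gives \(\mathbb{E}\left[\mathbf{v}_{k}\right] = \nabla F\left(\mathbf{z}_{k-1}\right)\).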
After the gradient computation, we update zk by using the proximal mapping for the ℓ1 norm as follows.
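For R(z) = γ∥z∥1, the proximal mapping is elementwise soft-thresholding, so the update in Eq. (5) presumably has the form:

\[
\mathbf{z}_{k} = \operatorname{prox}_{\eta R}\left(\mathbf{z}_{k-1} - \eta \mathbf{v}_{k}\right), \qquad \left[\operatorname{prox}_{\eta R}(\mathbf{u})\right]_{j} = \operatorname{sign}\left(u_{j}\right)\max\left(\left|u_{j}\right| - \eta\gamma,\, 0\right). \qquad (5)
\]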
In this algorithm, the variance of the stochastic gradient vk is reduced progressively while both \(\tilde{\mathbf{z}}\) and zk−1 converge to the optimal point z∗ = arg minz P(z) [15]. Since the full gradient is utilized to modify the stochastic gradients and the function F is an average of smooth component functions, the variance can be bounded. In the next section, we give the convergence analysis of the aforementioned algorithm.
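For concreteness, the following is a minimal Python sketch of an epoch-based Prox-SVRG loop of this kind, assuming the component gradients and the soft-thresholding proximal operator given above; all function and variable names are illustrative and this is not the authors' implementation.

```python
import numpy as np


def soft_threshold(u, tau):
    """Proximal operator of tau * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)


def cvx_spca_prox_svrg(X, gamma, lam, eta, m, T, seed=0):
    """Sketch of a Prox-SVRG loop for a convex sparse PCA objective.

    X     : (n, d) data matrix with samples x_i as rows
    gamma : l1 regularization parameter
    lam   : convexity parameter, assumed to satisfy lam > lambda_1(S)
    eta   : constant step size, assumed to satisfy 0 < eta < 1/(4 L_Q)
    m     : number of stochastic iterations per epoch
    T     : number of epochs
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.standard_normal(d)           # random vector w in the convex PCA loss (assumed N(0, I))

    def S_dot(v):
        return X.T @ (X @ v) / n         # computes S v without forming the d x d covariance matrix

    z_tilde = np.zeros(d)                # reference point z~, initialized at z_0 = 0
    for _ in range(T):
        full_grad = lam * z_tilde - S_dot(z_tilde) - w   # full gradient of F at z~, once per epoch
        z = z_tilde.copy()
        z_sum = np.zeros(d)
        for _ in range(m):
            i = rng.integers(n)          # sample one data point uniformly at random
            x = X[i]
            # gradients of the component f_i(z) = 0.5 z'(lam I - x x')z - w'z at z_{k-1} and at z~
            g_cur = lam * z - x * (x @ z) - w
            g_ref = lam * z_tilde - x * (x @ z_tilde) - w
            v = g_cur - g_ref + full_grad                 # variance-reduced gradient, Eq. (4)
            z = soft_threshold(z - eta * v, eta * gamma)  # proximal (soft-thresholding) step, Eq. (5)
            z_sum += z
        z_tilde = z_sum / m              # new reference point: average of the inner iterates
    return z_tilde
```

Note that only matrix-vector products with X are needed, so the d×d covariance matrix never has to be formed explicitly.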
3 Convergence analysis
In this section, we present the convergence analysis of the proposed algorithm. The objective function used in this paper is amenable to the convergence analysis in [15]; therefore, our analysis is mostly adapted from [15]. However, we use much weaker conditions, which allow a broader family of objective functions to fit into this scheme and enjoy the geometric convergence. We retain the following assumption used throughout [15]:
Assumption 1
The function R(z) is lower semi-continuous and convex, and its effective domain, \(dom(R):=\left\{\mathbf{z}\in\mathbb{R}^{d} | R\left(\mathbf{z}\right)<+\infty\right\}\), is closed. Each fi(z), for i = 1,…,n, is differentiable on an open set that contains dom(R), and their gradients are Lipschitz continuous. That is, there exist Li > 0 such that for all z, y ∈ dom(R),
which also implies that the gradient of the average function F(z) is Lipschitz continuous, i.e., there is an L > 0 such that for all z, y ∈ dom(R),
where \(L \leq \left(1/n\right)\sum_{i = 1}^{n}L_{i}\).
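For reference, the two Lipschitz conditions referred to above are the standard ones:

\[
\left\|\nabla f_{i}(\mathbf{z}) - \nabla f_{i}(\mathbf{y})\right\| \leq L_{i}\left\|\mathbf{z} - \mathbf{y}\right\|, \qquad \left\|\nabla F(\mathbf{z}) - \nabla F(\mathbf{y})\right\| \leq L\left\|\mathbf{z} - \mathbf{y}\right\|.
\]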
In [15], the convergence analysis was carried out for general F and R, and both were assumed to be strongly convex. In contrast, we only assume that the functions F(z) and R(z) are convex, but not necessarily strongly convex; thus, we relax this strong assumption. Strong convexity provides useful properties and is key to faster convergence rates. However, in many cases the objective function is not strongly convex. Therefore, a simplified version of the analysis is preferable when the objective function does not necessarily have the strong convexity property.
Although our overall objective function is strongly convex, R(z) is not strongly convex, as mentioned in the previous section. Therefore, we drop the strong convexity assumption at two steps in the original analysis of [15] and obtain the convergence rate given in the following theorem.
Theorem 1
Suppose Assumption 1 holds and 0 < η < 1/(4LQ), where LQ = maxi Li. Then the convergence rate is as follows:
where z∗ = arg minz P(z).
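The bound has the same geometric form as in [15], with the strong convexity modulus replaced by the optimal strong convexity parameter ℓ of Lemma 1; under that reading, the rate can be expected to look like the following (an assumed form, not quoted from the original):

\[
\mathbb{E}\left[P\left(\tilde{\mathbf{z}}_{s}\right) - P\left(\mathbf{z}_{*}\right)\right] \leq \rho^{s}\left[P\left(\tilde{\mathbf{z}}_{0}\right) - P\left(\mathbf{z}_{*}\right)\right], \qquad \rho = \frac{1}{\ell\eta\left(1 - 4L_{Q}\eta\right)m} + \frac{4L_{Q}\eta\left(m+1\right)}{\left(1 - 4L_{Q}\eta\right)m}.
\]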
Proof
The proof of Theorem 1 starts by investigating the distance between zk and z∗, i.e., ∥zk−z∗∥2. According to the stochastic gradient mapping definition in [15], zk can be written as zk−1 − ηgk.
The term \(\left(- \mathbf{g_{k}}^{T}\left(\mathbf{z_{k-1}}-\mathbf{z_{*}}\right) + \frac{\eta}{2} \left\|\mathbf{g_{k}}\right\|^{2}\right)\) can be bounded by using the definition of the proximal update as shown below.
According to the optimality condition,
where ξ ∈ ∂R(zk) is a subgradient of R(z) at zk. If we combine the stochastic gradient mapping definition with the optimality condition, we obtain the following expression.
By using the convexity of F(z) and R(z), we can write the following inequality.
The convergence analysis of [15] utilized the strong convexity of F and R in (7). However, we show that strong convexity is not required at this point. Since the gradient of F(z) is Lipschitz continuous with constant L, F(zk−1) can also be bounded by using Theorem 2.1.5 in [29].
If we combine Eqs. (7) and (8), we obtain the following inequality.
Here, we again use the stochastic gradient mapping, zk − zk−1 = −ηgk, to obtain the following inequality.
If we substitute ξ with gk − vk, then add and subtract zk−1 in the term (y − zk):
Under the assumption 0 < η < 1/(4LQ) < 1/L, the term \(\left(\eta - \frac{L}{2}\eta^{2}\right) = \frac{\eta}{2}\left(2 - L\eta\right)\) can be replaced by η/2: since (2 − Lη) lies in (1, 2) under this assumption, dropping the factor (2 − Lη) does not change the direction of the inequality. We now use the result derived above for the term \(\left(-\mathbf{g_{k}}^{T} \left(\mathbf{z_{k-1} - z_{*}}\right) + \frac{\eta}{2}\left\|\mathbf{g_{k}}\right\|^{2}\right)\) in Eq. (6).
where Δ = vk − ∇F(zk−1) and z∗ corresponds to y. The term −2ηΔT(zk − z∗) can further be bounded by using the proximal full gradient update \(\bar{\mathbf{z}}_{k} = \text{prox}_{\eta R}\left(\mathbf{z}_{k-1} - \eta \nabla F\left(\mathbf{z}_{k-1}\right)\right)\). If the Cauchy-Schwarz inequality and the non-expansiveness of the proximal mapping (∥proxηR(x) − proxηR(y)∥ ≤ ∥x − y∥) are utilized, the following expression can be derived.
If we insert the definitions zk = proxηR(zk−1 − ηvk) and \(\bar{\mathbf{z}}_{k} = \text{prox}_{\eta R}\left(\mathbf{z}_{k-1} - \eta \nabla F\left(\mathbf{z}_{k-1}\right)\right)\) and apply the non-expansiveness above, we will have:
If we combine the result shown above with Eq. (9):
Now, expectations of both sides are taken with respect to zk.
Since \(\bar{\mathbf{z}}_{k}\) and z∗ are independent of the random variable zk, \(\mathbb{E} \left\{\Delta^{T} \left(\bar{\mathbf{z}}_{k} - \mathbf{z_{*}}\right)\right\} = \mathbb{E} \left\{\Delta^{T}\right\}\left(\bar{\mathbf{z}}_{k} - \mathbf{z_{*}}\right) = 0\), because \(\mathbb{E} \left\{\Delta\right\} = \mathbb{E} \left\{\mathbf{v_{k}} - \nabla F\left(\mathbf{z_{k-1}}\right)\right\} = \mathbb{E}\left\{\mathbf{v_{k}}\right\} - \nabla F\left(\mathbf{z_{k-1}}\right) = 0\). The variance of the gradient, \(\mathbb{E}\left\{\left\|\Delta\right\|^{2}\right\}\), is upper bounded in the Prox-SVRG algorithm, and we use the result of Corollary 3 in [15], namely \(\mathbb{E}\left\{\left\|\Delta\right\|^{2}\right\} \leq 4L_{Q} \left[P\left(\mathbf{z_{k-1}}\right) - P\left(\mathbf{z_{*}}\right) + P\left(\tilde{\mathbf{z}}\right)-P\left(\mathbf{z_{*}}\right)\right]\), where LQ = maxi Li, \(\tilde{\mathbf{z}}_{s} = \frac{1}{m}\sum_{k=1}^{m} \mathbf{z_{k}}\), and \(\tilde{\mathbf{z}} = \tilde{\mathbf{z}}_{s-1} = \mathbf{z_{0}}\) for a fixed epoch. After incorporating the bound on the variance of the gradient into the analysis, the following expression is obtained.
Now, if we apply the inequality above repeatedly for k = 1,…,m and take the expectation with respect to the previous random variables z1,…,zm, we obtain the following inequality.
Since 2η(1 − 4ηLQ) < 2η, \(\mathbf{z_{0}} = \tilde{\mathbf{z}}\), and P is convex so that \(P\left(\tilde{\mathbf{z}}_{s}\right) \leq \frac{1}{m}\sum_{k=1}^{m}P\left(\mathbf{z_{k}}\right)\), we can write the following inequality.
By using Lemma 1, which relies on a weaker condition than strong convexity, and by applying the above inequality recursively, we derive the convergence rate as follows:
□
Lemma 1
Consider the problem of minimizing the sum of two convex functions:
A standard method for solving the above problem is the proximal gradient method. Given an initial point z0, the proximal mapping shown below iteratively generates a sequence that converges to the optimal solution.
Since R(z) is a convex function, the optimal solution of the above problem is also an optimal solution of the following problem for a tuning parameter μ [30, Theorem 1].
By utilizing the optimal strong convexity condition, which is a weaker condition than strong convexity [31], for a convex function R, we have the following inequality for all z ∈ Ω:
where proxE is the Euclidean projection onto the set E and ℓ is a positive parameter.
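In the notation of [31], the optimal strong convexity condition used here can be stated as follows (a hedged restatement, reading P as the composite objective and E as its set of optimal solutions):

\[
P(\mathbf{z}) - P\left(\operatorname{prox}_{E}(\mathbf{z})\right) \geq \frac{\ell}{2}\left\|\mathbf{z} - \operatorname{prox}_{E}(\mathbf{z})\right\|^{2} \quad \text{for all } \mathbf{z} \in \Omega,
\]

which can hold for convex functions that are not strongly convex.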
We have thus removed the strong convexity condition so that we are able to apply the algorithm in [15] to more generic convex objectives.
4 Results
In this section, we present the results of two types of experiments. First, the proposed algorithm was tested on synthetic datasets to investigate the convergence of the variance-reduced proximal stochastic gradient compared with traditional proximal stochastic gradient descent. In addition, the running times of the proposed stochastic Cvx-SPCA and other sparse PCA methods were compared to emphasize the advantage of a stochastic approach when there is a large number of samples. In our experiments, the step size η was chosen according to the condition 0 < η < 1/(4LQ), where LQ was taken as the largest eigenvalue of the covariance matrix. The iteration number m was chosen as Θ(LQ/(λ − λ1(S))), as suggested in [15]. Second, we present our experiments on electronic medical record data.
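As a concrete illustration of this parameter heuristic (a sketch only; the toy data, the factor 1/8 for η, and the margin used for λ are assumptions, not values reported in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))          # toy data: n = 1000 samples, d = 50 features
S = X.T @ X / X.shape[0]                     # sample covariance matrix S

lambda_1 = np.linalg.eigvalsh(S)[-1]         # largest eigenvalue lambda_1(S)
L_Q = lambda_1                               # L_Q taken as the largest eigenvalue, per the text
eta = 1.0 / (8.0 * L_Q)                      # any step size in (0, 1/(4 L_Q)) is admissible
lam = 1.1 * lambda_1                         # convexity parameter lambda > lambda_1(S) (assumed margin)
m = int(np.ceil(L_Q / (lam - lambda_1)))     # m = Theta(L_Q / (lambda - lambda_1(S))), per [15]
```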
4.1 Synthetic dataset
In this section, we present results of the proposed stochastic Cvx-SPCA algorithm on synthetic datasets. The synthetic datasets used in this section were all generated from the standard normal distribution \(\mathcal{N}\left(0,1\right)\), and datasets with varying sample sizes were prepared by random sampling. First, we compare the convergence of the proximal stochastic gradient with variance reduction and the traditional proximal stochastic gradient for our algorithm. In Fig. 1, the objective value versus the number of epochs is plotted for the traditional proximal stochastic gradient (Prox-SGD) and the proximal stochastic variance reduced gradient (Prox-SVRG) methods.
Convergence for synthetic data. Convergence of the proposed stochastic Cvx-SPCA with variance reduction (Prox-SVRG) and without variance reduction (Prox-SGD). The proximal stochastic gradient with variance reduction has a faster convergence rate, since the variance caused by random sampling is bounded in Prox-SVRG
In Fig. 1, convergence is shown with the maximum number of epochs fixed to 50. We also would like to investigate how many epochs are necessary for both algorithms to converge. Therefore, we conducted another experiment to see how fast Cvx-SPCA with Prox-SVRG converges to a similar sparsity as Cvx-SPCA with Prox-SGD. We generated another synthetic dataset with 100,000 instances and 10,000 dimensions. The result of the experiment is shown in Fig. 2. Cvx-SPCA with traditional SGD took 3646.94 s and Cvx-SPCA with SVRG took 644.60 s to converge to similar sparsity patterns.
Convergence of the sparsity pattern in log scale. Cvx-SPCA with Prox-SGD takes 275 iterations, whereas Cvx-SPCA with Prox-SVRG takes 45 iterations, to converge to a similar sparsity pattern
Second, the running times of other sparse PCA methods and the proposed method are compared in Table 1. In these experiments, the feature dimension was chosen as 1000, and the algorithms ran until they reached similar sparsity patterns. The proposed Cvx-SPCA algorithm is more scalable, since only one gradient is computed at a time and there are no eigenvalue decomposition or SVD steps during the iterations. For instance, [9] requires a singular value decomposition at each iteration, which is a bottleneck in terms of running time; [12] is an inverse power method-based approach; and [11] uses semi-definite programming. Therefore, scalability with respect to sample size and dimension is an issue for these methods.
We also investigated the regularization path for the proposed algorithm. The regularization path illustrates how the solution changes for different values of the regularization parameter γ, which specifies the level of sparsity; computing it is a common way of tuning γ to obtain a suitable level of sparsity. We first generated a random sample with ten features and applied the proposed Cvx-SPCA algorithm to obtain the principal component. Then, the covariance matrix was reconstructed by using the first principal component (corresponding to the largest eigenvalue) with a little random noise. Loading values of the principal component were computed for varying regularization parameters γ using the reconstructed covariance matrix. We started with small γ values, and the loading vector learned in the previous step was used as the initialization for each new Cvx-SPCA step. The result is given in Fig. 3.
Regularization path for Cvx-SPCA. We checked whether the known principal component could be recovered along the path in order to confirm that this is a valid regularization path. When the regularization parameter was around −0.11 (dashed line) on the logarithmic scale, we could exactly recover the non-zero loading values of the known principal component that was used to generate the data
4.2 Large-scale healthcare dataset
We applied our Cvx-SPCA algorithm to analyze disease patterns in a general patient population. The dataset we used is a real-world electronic medical record (EMR) warehouse including the records of 223,076 patients over 4 years. We used the diagnosis information (in terms of ICD9 codes [32]) in our investigation, which resulted in 11,982 features in total. In this dataset, we do not have explicit demographic information for the patients. However, we investigated patient groups of different gender and age by looking at the descriptions of the ICD9 codes. We drew histograms of the number of patients with respect to the number of diagnoses each patient has, in different demographic groups and in the general population, as in Figs. 4 and 5, from which we can observe that the majority of the patients have very few records. In our experiments, we eliminated the patients who have fewer than five records, which resulted in 177,856 patients. As mentioned earlier, some diseases are specifically related to gender or age, which gives us an idea about the demographic composition of the dataset. For instance, complications of pregnancy, female genital disorders, and abortion are diagnoses that explicitly concern women. Similarly, maternal complications affecting the newborn and diseases such as chickenpox and measles are related to children. There are also ICD9 codes whose terms indicate age. For instance, some diagnoses contain the term “senile,” which points to patients above 60 years old. Thus, we sampled female, male, old, and child patients by taking the definitions of the ICD9 codes into account. The age range of child patients spans from infancy to adolescence, and old patients can be thought of as above 60 years old. In Table 2, the numbers of patients and features related to the female, male, above-60, and child groups are given. We should note that there may be female, male, old, and child patients whom we did not include in these demographic groups. For example, there are likely female/male patients whose diagnoses are not gender- or age-specific. It is not always possible to infer gender or age from diagnoses such as hypertension or infectious diseases, which can be encountered in both genders. Therefore, we report the demographic groups whose ICD9 codes have clear terms indicating the demographic information.
Patient distribution of demographic groups. We used only diagnoses/diseases that carry explicit information about the demographics of the patient while sub-sampling the patients. We can observe that each group of patients has a similar trend. Most of the patients have 1–50 diagnoses entered into the record
As can be seen from Table 2, the number of female-specific diseases and the number of female patients are larger than those of the other groups in the EMR dataset used in this paper. The number of old patients given in the table is smaller than that of the other groups. However, this does not necessarily mean that there are fewer old people in the whole patient population, because we could not exactly extract age information from every diagnosis/disease. For instance, hypertension and Alzheimer's were diseases commonly encountered among people above a certain age in the past; however, these conditions have recently been occurring at younger ages as well. For this reason, we used only diagnoses/diseases that carry explicit information about the demographics of the patient while sub-sampling the patients. Distributions of the different patient groups in Table 2 are given in Fig. 4.
In our experiments, we further aggregated all diagnoses belonging to the same ICD9 group, so that each patient is represented by a 918-dimensional feature vector. The value in the ith dimension represents the frequency of the ith diagnosis code appearing in the EMR of the corresponding patient. Since every patient has a limited number of diseases, the patient vectors are very sparse.
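As an illustration of this kind of preprocessing (a hedged sketch; the record layout, the function name, and the ICD9-to-group mapping are assumptions, not the actual EMR schema used by the authors):

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_patient_matrix(records, icd9_to_group, n_groups):
    """Aggregate (patient_id, icd9_code) events into a sparse patient-by-group
    frequency matrix, one row per patient (assumed input format)."""
    patient_ids = sorted({pid for pid, _ in records})
    row_of = {pid: r for r, pid in enumerate(patient_ids)}
    rows, cols = [], []
    for pid, code in records:
        if code in icd9_to_group:            # skip codes outside the retained groups
            rows.append(row_of[pid])
            cols.append(icd9_to_group[code])
    data = np.ones(len(rows))
    # Duplicate (row, col) pairs are summed, giving per-group diagnosis counts.
    return csr_matrix((data, (rows, cols)), shape=(len(patient_ids), n_groups)), patient_ids


# Toy usage with made-up records and a made-up code-to-group mapping.
records = [("p1", "401.9"), ("p1", "401.1"), ("p2", "250.00")]
icd9_to_group = {"401.9": 0, "401.1": 0, "250.00": 1}
X_emr, ids = build_patient_matrix(records, icd9_to_group, n_groups=918)
```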
We would like to emphasize that existing sparse PCA algorithms cannot be used to analyze a dataset at this scale. We carried out both quantitative and qualitative evaluations on this dataset. We studied the convergence of the algorithm with varying numbers of patients and observed that the proposed Cvx-SPCA can still achieve good convergence even when the sample size is very large, as shown in Fig. 6.
Next, we conducted an experiment to show how the proposed algorithm helps us analyze the EMR data. We applied our algorithm to the whole dataset and obtained the output features that correspond to the non-zero loading values of the leading principal component. These output features are interpreted as key medical features. One of the results is summarized in Table 3. The diseases shown in this table are the features with non-zero loadings whose absolute values are greater than a heuristic threshold. In our experiments, we observed that the most frequently encountered output features were infectious diseases, problems related to pregnancy and labor, injuries, and cancer types. This result tells us that the proposed algorithm can provide insight into the diagnoses encountered in the patient population.
We further examined the dataset and divided the features into groups in terms of gender and age. We sampled the patients who have gender- and age-related problems separately and applied our algorithm to those samples to analyze the output dimensions. Examples from each group are shown in Tables 4, 5, 6, and 7. We can see plausible results for the output features of each group. For example, diagnoses such as female genital disorders, perinatal problems, and anemia, which are more common among women, appear in Table 4, where the algorithm was applied to the subset of patients who have female-related problems. Similarly, we can see from Table 5 that a subset of male patients generates prostate cancer along with other diagnoses that can also be frequently seen in the general patient population. Cancer is a commonly encountered problem at nearly every age; it appears in the results for children and old patients as well. Another observation is that tuberculosis and bacterial infections are quite common among children.
5 Discussion
Throughout the paper, the advantage of using a convex optimization approach for sparse PCA has been emphasized. In this section, we would like to discuss our conjecture about the convergence of non-convex stochastic sparse PCA under the same framework. One surprising finding is that if we use the non-convex PCA formulation to construct a non-convex sparse PCA (by adding the ℓ1 norm), we still benefit from a much faster convergence rate using the stochastic scheme studied in this paper. A similar result is also presented in [7], where the authors propose a stochastic PCA approach with an exponential convergence rate by using the variance-reduced stochastic gradient presented in [14]. These results lead us to ask the following question: Can we generalize the convergence analysis of the proximal variance-reduced stochastic gradient method further to non-convex settings? We will investigate this problem in future work.
6 Conclusions
In this paper, a convex stochastic sparse PCA method is proposed. Since the problem of finding the leading eigenvector is formulated as a convex optimization problem, a well-defined convergence rate can be established for the proposed algorithm. A proximal stochastic gradient method with variance reduction is adopted to avoid the slow convergence of traditional stochastic methods. Although strong convexity is usually required in the literature, we simplify the convergence analysis of the existing Prox-SVRG algorithm by using weaker conditions. According to the experiments on several synthetic datasets, the proposed algorithm is shown to be more scalable due to its stochastic approach. In addition, an application of sparse PCA is presented to show how sparse PCA can help interpret electronic medical records. In future work, we would like to investigate whether sparse PCA can be used to cluster patients with respect to their medical records. For instance, we propose to apply the proposed algorithm to analyze medical records and derive clinically meaningful and structural phenotypes, which can further be helpful for patient risk stratification and clustering.
References
1. F De la Torre, MJ Black, in ICCV, Eighth IEEE International Conference on Computer Vision, vol. 1. Robust principal component analysis for computer vision (IEEE, Vancouver, 2001).
2. MW Manal Abdullah, S Bo-saeed, Optimizing face recognition using PCA. Int. J. Artif. Intell. Appl. (IJAIA) 3(2), 23–31 (2012).
3. C Gokulnath, MK Priyan, E Vishnu Balan, KP Rama Prabha, in International Conference on Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM). Preservation of privacy in data mining by using PCA-based perturbation technique (IEEE, Chennai, 2015), pp. 202–206.
4. W-Y Wang, C-X Qu, in Second International Symposium on Information Science and Engineering. Application and research of data mining based on improved PCA method (IEEE, Shanghai, 2009), pp. 140–143.
5. PP Alberto Landi, G Pioggia, in Intelligent Systems Design and Applications, 9th International Conference on. Backpropagation-based non-linear PCA for biomedical applications (IEEE, Pisa, 2009), pp. 635–640.
6. D Omucheni, K Kaduki, W Bulimo, H Angeyo, Application of principal component analysis to multispectral-multimodal optical image analysis for malaria diagnostics. Malar. J. 13(1), 485 (2014).
7. O Shamir, in 32nd International Conference on Machine Learning, vol. 37. A stochastic PCA and SVD algorithm with an exponential convergence rate (JMLR, Lille, 2015).
8. What is an electronic medical record (EMR)? https://www.healthit.gov/providers-professionals/electronic-medical-records-emr.
9. H Zou, T Hastie, R Tibshirani, Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006).
10. A d'Aspremont, L El Ghaoui, M Jordan, G Lanckriet, A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49(3), 434–448 (2007).
11. N Naikal, AY Yang, SS Sastry, Informative feature selection for object recognition via sparse PCA. Int. Conf. Comput. Vision, IEEE, 818–825 (2011).
12. M Hein, T Buhler, An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. Adv. Neural Inf. Process. Syst. 23, 847–855 (2010).
13. Z Gu, Q Wang, H Liu, Sparse PCA with oracle property. Adv. Neural Inf. Process. Syst. (NIPS) 27, 1529–1537 (2014).
14. R Johnson, T Zhang, Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 26, 315–323 (2013).
15. L Xiao, T Zhang, A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014).
16. A Nitanda, Stochastic proximal gradient descent with acceleration techniques. Adv. Neural Inf. Process. Syst. 27, 1574–1582 (2014).
17. S Shalev-Shwartz, T Zhang, in 31st International Conference on Machine Learning, vol. 32. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization (JMLR, Beijing, 2014).
18. J Liu, J Chen, J Ye, in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Large-scale sparse logistic regression (ACM, Paris, 2009), pp. 547–556.
19. R Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.), 267–288 (1996).
20. S Ji, J Ye, in Proceedings of the 26th Annual International Conference on Machine Learning. An accelerated gradient method for trace norm minimization (ACM, Montreal, 2009), pp. 457–464.
21. J Zhou, L Yuan, J Liu, J Ye, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. A multi-task learning formulation for predicting disease progression (ACM, San Francisco, 2011), pp. 814–822.
22. L Jacob, G Obozinski, J-P Vert, in Proceedings of the 26th Annual International Conference on Machine Learning. Group lasso with overlap and graph lasso (ACM, Montreal, 2009), pp. 433–440.
23. J Zhou, J Chen, J Ye, MALSAR: Multi-task learning via structural regularization (Arizona State University, 2011). http://www.public.asu.edu/~jye02/Software/MALSAR.
24. NZ Shor, Minimization Methods for Non-differentiable Functions, vol. 3 (Springer, Berlin Heidelberg, 2012).
25. S Boyd, L Xiao, A Mutapcic, Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2003–2004.
26. A Beck, M Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009).
27. SJ Wright, RD Nowak, MA Figueiredo, Sparse reconstruction by separable approximation. IEEE Trans. Signal Process. 57(7), 2479–2493 (2009).
28. D Garber, E Hazan, Fast and simple PCA via convex optimization (2015). arXiv:1509.05647v4 [math.OC]. https://arxiv.org/abs/1509.05647.
29. Y Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, vol. 87 (Springer US, New York, 2004).
30. M Kloft, U Brefeld, P Laskov, K-R Müller, A Zien, S Sonnenburg, Efficient and accurate lp-norm multiple kernel learning. Adv. Neural Inf. Process. Syst., 997–1005 (2009).
31. J Liu, SJ Wright, Asynchronous stochastic coordinate descent: parallelism and convergence properties. SIAM J. Optim. 25(1), 351–376 (2015).
32. International Classification of Diseases (ICD). http://www.who.int/classifications/icd/en/.
Acknowledgements
This work is supported in part by the Office of Naval Research (ONR) under grant number N00014-14-1-0631 and National Science Foundation under grant numbers IIS-1565596 and IIS-1615597.
Authors’ contributions
IMB and JZ developed the algorithm. KL contributed to the section on dropping the strong convexity assumption. FW provided the EMR data and contributed to the interpretation of the experimental results. IMB wrote the paper, and JZ and AKJ edited the paper. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Author information
Authors and Affiliations
Computer Science and Engineering, Michigan State University, East Lansing, 48824, USA
Inci M. Baytas, Kaixiang Lin, Anil K. Jain & Jiayu Zhou
School of Software Engineering, Beijing University of Technology, Beijing, China
Fei Wang
Corresponding author
Correspondence to Jiayu Zhou.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Cite this article
Baytas, I.M., Lin, K., Wang, F. et al. Stochastic convex sparse principal component analysis. EURASIP J. Bioinform. Syst. Biol. 2016, 15 (2016). https://doi.org/10.1186/s13637-016-0045-x