Illustration of approximate non-negative matrix factorization: the matrix V is represented by the two smaller matrices W and H, which, when multiplied, approximately reconstruct V.
Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation,[1][2] is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically.
In chemometrics, non-negative matrix factorization has a long history under the name "self modeling curve resolution".[9] In this framework the vectors in the right matrix are continuous curves rather than discrete vectors. Also, early work on non-negative matrix factorizations was performed by a Finnish group of researchers in the 1990s under the name positive matrix factorization.[10][11][12] It became more widely known as non-negative matrix factorization after Lee and Seung investigated the properties of the algorithm and published some simple and useful algorithms for two types of factorizations.[13][14]
Matrix multiplication can be implemented as computing the column vectors of V as linear combinations of the column vectors in W using coefficients supplied by columns of H. That is, each column of V can be computed as follows:
$$\mathbf{v}_i = \mathbf{W}\mathbf{h}_i,$$
where $\mathbf{v}_i$ is the i-th column vector of the product matrix V and $\mathbf{h}_i$ is the i-th column vector of the matrix H.
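The column view of the product can be checked numerically. The following is a minimal sketch using NumPy; the small matrices W and H are made-up illustrative values, not data from any cited source.

```python
import numpy as np

# Hypothetical small non-negative factors: W is 4x2, H is 2x3.
W = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0],
              [1.0, 1.0]])
H = np.array([[1.0, 0.5, 0.0],
              [0.0, 2.0, 1.0]])

V = W @ H  # full matrix product, shape (4, 3)

# Rebuild V one column at a time: the i-th column of V is W times the i-th column of H.
columns = [W @ H[:, i] for i in range(H.shape[1])]
V_by_columns = np.column_stack(columns)

# Same statement written as an explicit linear combination of the columns of W,
# here for the second column of V (index 1).
v1_as_combination = H[0, 1] * W[:, 0] + H[1, 1] * W[:, 1]
```

Both `V_by_columns` and `v1_as_combination` agree with the corresponding parts of `V`, which is the property NMF relies on.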
When multiplying matrices, the dimensions of the factor matrices may be significantly lower than those of the product matrix, and it is this property that forms the basis of NMF. NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if V is an m × n matrix, W is an m × p matrix, and H is a p × n matrix, then p can be significantly less than both m and n.
Here is an example based on a text-mining application:
Let the input matrix (the matrix to be factored) be V with 10000 rows and 500 columns where words are in rows and documents are in columns. That is, we have 500 documents indexed by 10000 words. It follows that a column vector v in V represents a document.
Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns.
The product of W and H is a matrix with 10000 rows and 500 columns, the same shape as the input matrix V and, if the factorization worked, it is a reasonable approximation to the input matrix V.
From the treatment of matrix multiplication above it follows that each column in the product matrix WH is a linear combination of the 10 column vectors in the features matrix W with coefficients supplied by the coefficients matrix H.
This last point is the basis of NMF because we can consider each original document in our example as being built from a small set of hidden features. NMF generates these features.
It is useful to think of each feature (column vector) in the features matrix W as a document archetype comprising a set of words where each word's cell value defines the word's rank in the feature: The higher a word's cell value the higher the word's rank in the feature. A column in the coefficients matrix H represents an original document with a cell value defining the document's rank for a feature. We can now reconstruct a document (column vector) from our input matrix by a linear combination of our features (column vectors in W) where each feature is weighted by the feature's cell value from the document's column in H.
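To make the size reduction in this example concrete, the entry counts of the three matrices can be compared directly. This is plain arithmetic on the shapes quoted in the example; the variable names are illustrative only.

```python
# Shapes from the text-mining example: 10000 words, 500 documents, 10 features.
m, n, p = 10_000, 500, 10

entries_in_V = m * n   # full word-by-document matrix
entries_in_W = m * p   # features matrix
entries_in_H = p * n   # coefficients matrix
entries_in_factors = entries_in_W + entries_in_H

compression_ratio = entries_in_V / entries_in_factors
print(entries_in_V, entries_in_factors, round(compression_ratio, 1))
# prints: 5000000 105000 47.6
```

The two factors together hold roughly one forty-eighth as many numbers as the original matrix, which is why the factorization has to capture shared structure rather than memorize individual documents.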
NMF has an inherent clustering property,[15] i.e., it automatically clusters the columns of input data.
More specifically, the approximation of V by V ≃ WH is achieved by finding W and H that minimize the error function (using the Frobenius norm)
$$\left\| \mathbf{V} - \mathbf{W}\mathbf{H} \right\|_F,$$
subject to $\mathbf{W} \geq 0$, $\mathbf{H} \geq 0$.
If we furthermore impose an orthogonality constraint on H, i.e., $\mathbf{H}\mathbf{H}^{T} = \mathbf{I}$, then the above minimization is mathematically equivalent to the minimization of K-means clustering.[15]
Furthermore, the computed H gives the cluster membership, i.e., if $H_{kj} > H_{ij}$ for all i ≠ k, this suggests that the input data $\mathbf{v}_j$ belongs to the k-th cluster. The computed W gives the cluster centroids, i.e., the k-th column gives the cluster centroid of the k-th cluster. This centroid's representation can be significantly enhanced by convex NMF.
When the orthogonality constraint is not explicitly imposed, the orthogonality holds to a large extent, and the clustering property holds too.
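The cluster-membership reading of H can be sketched in a few lines: each data column is assigned to the row of H with the largest coefficient. The matrix H below is a made-up example, assumed to come from an already computed factorization.

```python
import numpy as np

# Hypothetical coefficient matrix H: 2 clusters (rows) and 5 data columns.
H = np.array([[0.9, 0.1, 0.8, 0.0, 0.3],
              [0.1, 0.7, 0.0, 1.2, 0.4]])

# Column j is assigned to the cluster k whose entry H[k, j] is the largest in that column.
cluster_of_column = np.argmax(H, axis=0)
print(cluster_of_column.tolist())  # prints: [0, 1, 0, 1, 1]
```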
Usually the number of columns of W and the number of rows of H in NMF are selected so the product WH will become an approximation to V. The full decomposition of V then amounts to the two non-negative matrices W and H as well as a residual U, such that V = WH + U. The elements of the residual matrix can either be negative or positive.
When W and H are smaller than V they become easier to store and manipulate. Another reason for factorizing V into smaller matrices W and H is that if one's goal is to approximately represent the elements of V by significantly less data, then one has to infer some latent structure in the data.
In standard NMF, matrix factor $W \in \mathbb{R}_{+}^{m \times k}$, i.e., W can be anything in that space. Convex NMF[17] restricts the columns of W to convex combinations of the input data vectors. This greatly improves the quality of data representation of W. Furthermore, the resulting matrix factor H becomes more sparse and orthogonal.
In case the nonnegative rank of V is equal to its actual rank, V = WH is called a nonnegative rank factorization (NRF).[18][19][20] The problem of finding the NRF of V, if it exists, is known to be NP-hard.[21]
There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between V and WH and possibly by regularization of the W and/or H matrices.[1]
Two simple divergence functions studied by Lee and Seung are the squared error (or Frobenius norm) and an extension of the Kullback–Leibler divergence to positive matrices (the original Kullback–Leibler divergence is defined on probability distributions). Each divergence leads to a different NMF algorithm, usually minimizing the divergence using iterative update rules.
The factorization problem in the squared error version of NMF may be stated as: Given a matrix V, find nonnegative matrices W and H that minimize the function
$$F(\mathbf{W},\mathbf{H}) = \left\| \mathbf{V} - \mathbf{W}\mathbf{H} \right\|_F^{2}.$$
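As a small illustration, the squared-error objective can be evaluated for any candidate pair of factors. The helper name `squared_error` and the toy matrices are assumptions made for this sketch.

```python
import numpy as np

def squared_error(V, W, H):
    """Return the squared Frobenius norm of the residual V - WH."""
    residual = V - W @ H
    return float(np.sum(residual ** 2))

# Toy data: a 2x2 matrix and hypothetical rank-1 non-negative factors.
V = np.array([[1.0, 2.0],
              [3.0, 4.0]])
W = np.array([[1.0],
              [2.0]])
H = np.array([[1.0, 2.0]])

cost = squared_error(V, W, H)
print(cost)  # prints: 1.0
```

Here WH reproduces every entry of V except the lower-left one, which is off by 1, so the cost is 1.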
When L1 regularization (akin to Lasso) is added to NMF with the mean squared error cost function, the resulting problem may be called non-negative sparse coding due to the similarity to the sparse coding problem,[23][24] although it may also still be referred to as NMF.[25]
Many standard NMF algorithms analyze all the data together; i.e., the whole matrix is available from the start. This may be unsatisfactory in applications where there are too many data to fit into memory or where the data are provided in streaming fashion. One such use is for collaborative filtering in recommendation systems, where there may be many users and many items to recommend, and it would be inefficient to recalculate everything when one user or one item is added to the system. The cost function for optimization in these cases may or may not be the same as for standard NMF, but the algorithms need to be rather different.[26][27]
If the columns of V represent data sampled over spatial or temporal dimensions, e.g. time signals, images, or video, features that are equivariant w.r.t. shifts along these dimensions can be learned by Convolutional NMF. In this case, W is sparse with columns having local non-zero weight windows that are shared across shifts along the spatio-temporal dimensions of V, representing convolution kernels. By spatio-temporal pooling of H and repeatedly using the resulting representation as input to convolutional NMF, deep feature hierarchies can be learned.[28]
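There are several ways in which W and H may be found: Lee and Seung's multiplicative update rule[14] has been a popular method due to the simplicity of implementation. This algorithm is:
initialize: W and H non-negative.
Then update the values in W and H by computing the following, with n as an index of the iteration.
$$\mathbf{H}_{[i,j]}^{n+1} \leftarrow \mathbf{H}_{[i,j]}^{n} \frac{\left((\mathbf{W}^{n})^{T}\mathbf{V}\right)_{[i,j]}}{\left((\mathbf{W}^{n})^{T}\mathbf{W}^{n}\mathbf{H}^{n}\right)_{[i,j]}}$$
and
$$\mathbf{W}_{[i,j]}^{n+1} \leftarrow \mathbf{W}_{[i,j]}^{n} \frac{\left(\mathbf{V}(\mathbf{H}^{n+1})^{T}\right)_{[i,j]}}{\left(\mathbf{W}^{n}\mathbf{H}^{n+1}(\mathbf{H}^{n+1})^{T}\right)_{[i,j]}}$$
Until W and H are stable.
Note that the updates are done on an element-by-element basis, not by matrix multiplication.
We note that the multiplicative factors for W and H, i.e. the $\frac{\mathbf{W}^{T}\mathbf{V}}{\mathbf{W}^{T}\mathbf{W}\mathbf{H}}$ and $\frac{\mathbf{V}\mathbf{H}^{T}}{\mathbf{W}\mathbf{H}\mathbf{H}^{T}}$ terms, are matrices of ones when V = WH.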
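The multiplicative update rule can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `nmf_multiplicative`, the fixed iteration count in place of a stability test, and the small constant `eps` added to the denominators to avoid division by zero are all assumptions of this sketch.

```python
import numpy as np

def nmf_multiplicative(V, p, n_iter=200, eps=1e-12, seed=0):
    """Factor a non-negative V (m x n) into W (m x p) and H (p x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Initialize both factors with strictly positive random values.
    W = rng.random((m, p)) + eps
    H = rng.random((p, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Element-wise update of H using the current W.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # Element-wise update of W using the freshly updated H.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(float(np.linalg.norm(V - W @ H, "fro")))
    return W, H, errors

# Synthetic test matrix with an exact non-negative rank-3 structure.
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 15))
W, H, errors = nmf_multiplicative(V, p=3)
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H stay non-negative throughout, and the recorded reconstruction error does not increase from one iteration to the next.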
More recently other algorithms have been developed. Some approaches are based on alternating non-negative least squares: in each step of such an algorithm, first H is fixed and W found by a non-negative least squares solver, then W is fixed and H is found analogously. The procedures used to solve for W and H may be the same[29] or different, as some NMF variants regularize one of W and H.[23] Specific approaches include the projected gradient descent methods,[29][30] the active set method,[6][31] the optimal gradient method,[32] and the block principal pivoting method[33] among several others.[34]
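The alternating structure can be illustrated with a simplified projected gradient scheme. Note the simplification: instead of solving each non-negative least squares subproblem to completion, this sketch takes a single projected gradient step per factor, with a step size of one over the largest eigenvalue of the relevant Gram matrix. The function name and parameters are hypothetical and do not reproduce any of the cited methods.

```python
import numpy as np

def nmf_projected_gradient(V, p, n_iter=300, seed=0):
    """Alternate one projected gradient step on H and one on W."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, p))
    H = rng.random((p, n))
    errors = []
    for _ in range(n_iter):
        # W fixed: the gradient of 0.5 * ||V - WH||_F^2 with respect to H is W^T (WH - V).
        step_H = 1.0 / (np.linalg.norm(W.T @ W, 2) + 1e-12)
        H = np.maximum(H - step_H * (W.T @ (W @ H - V)), 0.0)  # project onto H >= 0
        # H fixed: the gradient with respect to W is (WH - V) H^T.
        step_W = 1.0 / (np.linalg.norm(H @ H.T, 2) + 1e-12)
        W = np.maximum(W - step_W * ((W @ H - V) @ H.T), 0.0)  # project onto W >= 0
        errors.append(float(np.linalg.norm(V - W @ H, "fro")))
    return W, H, errors

rng = np.random.default_rng(2)
V = rng.random((15, 3)) @ rng.random((3, 12))
W, H, errors = nmf_projected_gradient(V, p=3)
```

Clipping negative entries to zero after each gradient step is the projection onto the non-negative orthant; a full alternating non-negative least squares method would iterate each subproblem to optimality before switching factors.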
Current algorithms are sub-optimal in that they only guarantee finding a local minimum, rather than a global minimum of the cost function. A provably optimal algorithm is unlikely in the near future as the problem has been shown to generalize the k-means clustering problem which is known to be NP-complete.[35] However, as in many other data mining applications, a local minimum may still prove to be useful.
In addition to the optimization step, initialization has a significant effect on NMF. The initial values chosen for W and H may affect not only the rate of convergence, but also the overall error at convergence. Some options for initialization include complete randomization, SVD, k-means clustering, and more advanced strategies based on these and other paradigms.[36]
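Two of the listed starting points can be sketched briefly. The random initializer is straightforward; the SVD-based one shown here is a deliberately naive variant (absolute values of the scaled leading singular vectors) chosen for brevity, and it should not be read as any of the published SVD-based strategies. Both function names are hypothetical.

```python
import numpy as np

def random_init(m, n, p, seed=0):
    """Complete randomization: uniform values in [0, 1)."""
    rng = np.random.default_rng(seed)
    return rng.random((m, p)), rng.random((p, n))

def abs_svd_init(V, p):
    """Naive SVD-based start: absolute values of the scaled leading singular vectors."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    root = np.sqrt(s[:p])
    W0 = np.abs(U[:, :p]) * root             # scale each of the p columns
    H0 = root[:, None] * np.abs(Vt[:p, :])   # scale each of the p rows
    return W0, H0

rng = np.random.default_rng(3)
V = rng.random((6, 5))
W_rand, H_rand = random_init(6, 5, p=2)
W_svd, H_svd = abs_svd_init(V, p=2)
```

Either pair can be passed as the starting point of an iterative NMF solver; comparing the final errors reached from several starting points is a simple way to observe the sensitivity described above.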
Fractional residual variance (FRV) plots for PCA and sequential NMF;[4] for PCA, the theoretical values are the contribution from the residual eigenvalues. In comparison, the FRV curves for PCA reach a flat plateau where no signal is captured effectively, while the NMF FRV curves decline continuously, indicating a better ability to capture signal. The FRV curves for NMF also converge to higher levels than PCA, indicating the less-overfitting property of NMF.
The sequential construction of NMF components (W and H) was first used to relate NMF with principal component analysis (PCA) in astronomy.[37] The contributions of the PCA components are ranked by the magnitude of their corresponding eigenvalues; for NMF, its components can be ranked empirically when they are constructed one by one (sequentially), i.e., learn the (n+1)-th component with the first n components constructed.
The contribution of the sequential NMF components can be compared with the Karhunen–Loève theorem, an application of PCA, using the plot of eigenvalues. A typical choice of the number of components with PCA is based on the "elbow" point; the existence of a flat plateau then indicates that PCA is not capturing the data efficiently, and finally there is a sudden drop reflecting the capture of random noise, which falls into the regime of overfitting.[38][39] For sequential NMF, the plot of eigenvalues is approximated by the plot of the fractional residual variance curves, where the curves decrease continuously and converge to a higher level than PCA,[4] which indicates less over-fitting by sequential NMF.
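The PCA side of such a comparison can be sketched numerically. The text does not give a formula for the fractional residual variance, so this sketch assumes it is the squared Frobenius norm of the residual divided by that of the data, with no mean subtraction; under that assumption the PCA values follow from the residual squared singular values. The function names are hypothetical.

```python
import numpy as np

def frv_from_singular_values(V):
    """Assumed FRV after keeping the first k components, for k = 0 .. rank."""
    s = np.linalg.svd(V, compute_uv=False)
    energy = s ** 2
    total = energy.sum()
    # Entry k is the share of squared norm carried by the discarded components.
    return np.array([energy[k:].sum() / total for k in range(len(s) + 1)])

def frv_of_reconstruction(V, V_hat):
    """Assumed FRV of any approximation V_hat, e.g. the product WH from NMF."""
    return float(np.sum((V - V_hat) ** 2) / np.sum(V ** 2))

rng = np.random.default_rng(4)
V = rng.random((8, 6))
frv_curve = frv_from_singular_values(V)

# Cross-check with an explicit rank-2 truncated SVD reconstruction.
U, s, Vt = np.linalg.svd(V, full_matrices=False)
V_rank2 = (U[:, :2] * s[:2]) @ Vt[:2, :]
frv_rank2 = frv_of_reconstruction(V, V_rank2)
```

Calling `frv_of_reconstruction` on the product of NMF factors built with an increasing number of components would give the corresponding NMF curve for the same data.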
Exact solutions for the variants of NMF can be expected (in polynomial time) when additional constraints hold for matrix V. A polynomial time algorithm for solving nonnegative rank factorization if V contains a monomial sub matrix of rank equal to its rank was given by Campbell and Poole in 1981.[40] Kalofolias and Gallopoulos (2012)[41] solved the symmetric counterpart of this problem, where V is symmetric and contains a diagonal principal sub matrix of rank r. Their algorithm runs in O(rm²) time in the dense case. Arora, Ge, Halpern, Mimno, Moitra, Sontag, Wu, & Zhu (2013) give a polynomial time algorithm for exact NMF that works for the case where one of the factors W satisfies a separability condition.[42]
In Learning the parts of objects by non-negative matrix factorization, Lee and Seung[43] proposed NMF mainly for parts-based decomposition of images. It compares NMF to vector quantization and principal component analysis, and shows that although the three techniques may be written as factorizations, they implement different constraints and therefore produce different results.
NMF as a probabilistic graphical model: visible units (V) are connected to hidden units (H) through weights W, so that V is generated from a probability distribution with mean $\sum_{a} W_{ia}h_{a}$.[13]: 5
It was later shown that some types of NMF are an instance of a more general probabilistic model called "multinomial PCA".[44] When NMF is obtained by minimizing the Kullback–Leibler divergence, it is in fact equivalent to another instance of multinomial PCA, probabilistic latent semantic analysis,[45] trained by maximum likelihood estimation. That method is commonly used for analyzing and clustering textual data and is also related to the latent class model.
NMF with the least-squares objective is equivalent to a relaxed form of K-means clustering: the matrix factor W contains cluster centroids and H contains cluster membership indicators.[15][46] This provides a theoretical foundation for using NMF for data clustering. However, k-means does not enforce non-negativity on its centroids, so the closest analogy is in fact with "semi-NMF".[17]
NMF can be seen as a two-layer directed graphical model with one layer of observed random variables and one layer of hidden random variables.[47]
NMF extends beyond matrices to tensors of arbitrary order.[48][49][50] This extension may be viewed as a non-negative counterpart to, e.g., the PARAFAC model.
Other extensions of NMF include joint factorization of several data matrices and tensors where some factors are shared. Such models are useful for sensor fusion and relational learning.[51]
NMF is an instance of nonnegative quadratic programming (NQP), just like the support vector machine (SVM). However, SVM and NMF are related at a more intimate level than that of NQP, which allows direct application of the solution algorithms developed for either of the two methods to problems in both domains.[52]
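The factorization is not unique: A matrix and its inverse can be used to transform the two factorization matrices by, e.g.,[53]
$$\mathbf{W}\mathbf{H} = \mathbf{W}\mathbf{B}\mathbf{B}^{-1}\mathbf{H}$$
If the two new matrices $\tilde{\mathbf{W}} = \mathbf{W}\mathbf{B}$ and $\tilde{\mathbf{H}} = \mathbf{B}^{-1}\mathbf{H}$ are non-negative they form another parametrization of the factorization.
The non-negativity of $\tilde{\mathbf{W}}$ and $\tilde{\mathbf{H}}$ applies at least if B is a non-negative monomial matrix. In this simple case it will just correspond to a scaling and a permutation.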
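The monomial case can be verified on a small example. The factors below are made-up values; the monomial matrix B is built as a permutation times a positive diagonal scaling, so its inverse is known in closed form and is also non-negative.

```python
import numpy as np

# Hypothetical non-negative factors.
W = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])
H = np.array([[1.0, 0.5, 0.0],
              [0.0, 2.0, 1.0]])

# Non-negative monomial matrix: permutation P times positive diagonal scaling.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
scales = np.array([2.0, 0.5])
B = P @ np.diag(scales)
B_inv = np.diag(1.0 / scales) @ P.T  # exact inverse, also non-negative

# Transformed factors: columns of W and rows of H are swapped and rescaled.
W_new = W @ B
H_new = B_inv @ H
```

The product `W_new @ H_new` equals `W @ H`, and both new factors are non-negative, so they form an equally valid factorization of the same matrix.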
More control over the non-uniqueness of NMF is obtained with sparsity constraints.[54]
In astronomy, NMF is a promising method for dimension reduction in the sense that astrophysical signals are non-negative. NMF has been applied to spectroscopic observations[55][3] and direct imaging observations[4] as a method to study the common properties of astronomical objects and to post-process astronomical observations. The advance in spectroscopic observations by Blanton & Roweis (2007)[3] takes into account the uncertainties of astronomical observations; it was later improved by Zhu (2016),[37] where missing data are also considered and parallel computing is enabled. Their method was then adopted by Ren et al. (2018)[4] for the direct imaging field as one of the methods of detecting exoplanets, especially for the direct imaging of circumstellar disks.
Ren et al. (2018)[4] were able to prove the stability of NMF components when they are constructed sequentially (i.e., one by one), which enables the linearity of the NMF modeling process; the linearity property is used to separate the stellar light and the light scattered from the exoplanets and circumstellar disks.
In direct imaging, to reveal faint exoplanets and circumstellar disks against the bright surrounding stellar light, which has a typical contrast from 10⁵ to 10¹⁰, various statistical methods have been adopted;[56][57][38] however, the light from the exoplanets or circumstellar disks is usually over-fitted, so forward modeling has to be adopted to recover the true flux.[58][39] Forward modeling is currently optimized for point sources,[39] but not for extended sources, especially irregularly shaped structures such as circumstellar disks. In this situation, NMF has been an excellent method, being less prone to over-fitting owing to the non-negativity and sparsity of the NMF modeling coefficients; forward modeling can therefore be performed with a few scaling factors,[4] rather than a computationally intensive data re-reduction on generated models.
To impute missing data in statistics, NMF can take missing data into account while minimizing its cost function, rather than treating these missing data as zeros.[5] This makes it a mathematically proven method for data imputation in statistics.[5] By first proving that the missing data are ignored in the cost function, then proving that the impact from missing data can be as small as a second-order effect, Ren et al. (2020)[5] studied and applied such an approach in the field of astronomy. Their work focuses on two-dimensional matrices; specifically, it includes mathematical derivation, simulated data imputation, and application to on-sky data.
The data imputation procedure with NMF can be composed of two steps. First, when the NMF components are known, Ren et al. (2020) proved that the impact from missing data during data imputation ("target modeling" in their study) is a second-order effect. Second, when the NMF components are unknown, the authors proved that the impact from missing data during component construction is a first-to-second order effect.
Depending on the way that the NMF components are obtained, the former step above can be either independent of or dependent on the latter. In addition, the imputation quality can be increased when more NMF components are used; see Figure 4 of Ren et al. (2020) for their illustration.[5]
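The idea of ignoring missing entries in the cost function can be sketched with a binary mask. This uses the generic weighted multiplicative update with 0/1 weights as an illustration; it is not claimed to be the exact procedure of Ren et al. (2020), and the function name, iteration count, and `eps` safeguard are assumptions.

```python
import numpy as np

def masked_nmf(V, mask, p, n_iter=500, eps=1e-12, seed=0):
    """NMF where entries with mask == 0 do not contribute to the cost."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Replace missing entries (possibly NaN) by 0; the mask keeps them out of the updates.
    V_obs = np.where(mask > 0, V, 0.0)
    W = rng.random((m, p)) + eps
    H = rng.random((p, n)) + eps
    for _ in range(n_iter):
        # Masking the model WH restricts both updates to the observed entries.
        H *= (W.T @ V_obs) / (W.T @ (mask * (W @ H)) + eps)
        W *= (V_obs @ H.T) / ((mask * (W @ H)) @ H.T + eps)
    return W, H

# Synthetic low-rank data with about 20% of the entries marked as missing.
rng = np.random.default_rng(5)
V_true = rng.random((12, 2)) @ rng.random((2, 10))
mask = (rng.random(V_true.shape) > 0.2).astype(float)
V_missing = np.where(mask > 0, V_true, np.nan)

W, H = masked_nmf(V_missing, mask, p=2)
# Keep observed values as they are and fill the gaps with the model WH.
V_imputed = np.where(mask > 0, V_missing, W @ H)
```

The missing entries never enter the numerators or denominators of the updates, which is the difference from simply setting them to zero and running standard NMF.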
NMF can be used for text mining applications. In this process, a document-term matrix is constructed with the weights of various terms (typically weighted word frequency information) from a set of documents. This matrix is factored into a term-feature and a feature-document matrix. The features are derived from the contents of the documents, and the feature-document matrix describes data clusters of related documents.
One specific application used hierarchical NMF on a small subset of scientific abstracts from PubMed.[59] Another research group clustered parts of the Enron email dataset[60] with 65,033 messages and 91,133 terms into 50 clusters.[61] NMF has also been applied to citations data, with one example clustering English Wikipedia articles and scientific journals based on the outbound scientific citations in English Wikipedia.[62]
Arora, Ge, Halpern, Mimno, Moitra, Sontag, Wu, & Zhu (2013) have given polynomial-time algorithms to learn topic models using NMF. The algorithm assumes that the topic matrix satisfies a separability condition that is often found to hold in these settings.[42]
Hassani, Iranmanesh and Mansouri (2019) proposed a feature agglomeration method for term-document matrices which operates using NMF. The algorithm reduces the term-document matrix into a smaller matrix more suitable for text clustering.[63]
NMF is applied in scalable Internet distance (round-trip time) prediction. For a network with N hosts, with the help of NMF, the distances of all the N² end-to-end links can be predicted after conducting only O(N) measurements. This kind of method was first introduced in the Internet Distance Estimation Service (IDES).[65] Afterwards, as a fully decentralized approach, the Phoenix network coordinate system[66] was proposed. It achieves better overall prediction accuracy by introducing the concept of weight.
Speech denoising has been a long-lasting problem in audio signal processing. There are many algorithms for denoising if the noise is stationary. For example, the Wiener filter is suitable for additive Gaussian noise. However, if the noise is non-stationary, the classical denoising algorithms usually have poor performance because the statistical information of the non-stationary noise is difficult to estimate. Schmidt et al.[67] use NMF to do speech denoising under non-stationary noise, which is completely different from classical statistical approaches. The key idea is that a clean speech signal can be sparsely represented by a speech dictionary, but non-stationary noise cannot. Similarly, non-stationary noise can also be sparsely represented by a noise dictionary, but speech cannot.
The algorithm for NMF denoising goes as follows. Two dictionaries, one for speech and one for noise, need to be trained offline. Once a noisy speech signal is given, we first calculate the magnitude of its short-time Fourier transform. Second, we separate it into two parts via NMF: one part can be sparsely represented by the speech dictionary, and the other part can be sparsely represented by the noise dictionary. Third, the part that is represented by the speech dictionary is taken as the estimated clean speech.
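The separation step can be sketched as NMF with a fixed, concatenated dictionary in which only the activations are updated. Several simplifications are assumed here: the dictionaries are random stand-ins rather than trained ones, the input is a synthetic non-negative matrix standing in for a short-time Fourier transform magnitude, and no sparsity penalty is applied. The function name is hypothetical.

```python
import numpy as np

def denoise_magnitude(V, W_speech, W_noise, n_iter=200, eps=1e-12, seed=0):
    """Split a magnitude spectrogram V into speech and noise parts."""
    rng = np.random.default_rng(seed)
    W = np.hstack([W_speech, W_noise])   # fixed dictionaries, side by side
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # Only the activations H are updated; the dictionaries stay fixed.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    k = W_speech.shape[1]
    speech_estimate = W_speech @ H[:k]   # part explained by the speech dictionary
    noise_estimate = W_noise @ H[k:]     # part explained by the noise dictionary
    return speech_estimate, noise_estimate

# Synthetic stand-ins: 30 frequency bins, 20 time frames, 4 atoms per dictionary.
rng = np.random.default_rng(6)
W_speech = rng.random((30, 4))
W_noise = rng.random((30, 4))
V = W_speech @ rng.random((4, 20)) + W_noise @ rng.random((4, 20))

speech_estimate, noise_estimate = denoise_magnitude(V, W_speech, W_noise)
```

In a real pipeline the estimated speech magnitude would be combined with the phase of the noisy signal and inverted back to a waveform, a step that is outside the scope of this sketch.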
Sparse NMF is used in population genetics for estimating individual admixture coefficients, detecting genetic clusters of individuals in a population sample or evaluating genetic admixture in sampled genomes. In human genetic clustering, NMF algorithms provide estimates similar to those of the computer program STRUCTURE, but the algorithms are more efficient computationally and allow analysis of large population genomic data sets.[68]
NMF has been successfully applied in bioinformatics for clustering gene expression and DNA methylation data and finding the genes most representative of the clusters.[24][69][70][71] In the analysis of cancer mutations it has been used to identify common patterns of mutations that occur in many cancers and that probably have distinct causes.[72] NMF techniques can identify sources of variation such as cell types, disease subtypes, population stratification, tissue composition, and tumor clonality.[73]
A particular variant of NMF, namely Non-Negative Matrix Tri-Factorization (NMTF),[74] has been used for drug repurposing tasks in order to predict novel protein targets and therapeutic indications for approved drugs[75] and to infer pairs of synergistic anticancer drugs.[76]
NMF, also referred to in this field as factor analysis, has been used since the 1980s[77] to analyze sequences of images in SPECT and PET dynamic medical imaging. Non-uniqueness of NMF was addressed using sparsity constraints.[78][79][80]
Current research (since 2010) in nonnegative matrix factorization includes, but is not limited to,
Algorithmic: searching for global minima of the factors and factor initialization.[81]
Scalability: how to factorize million-by-billion matrices, which are commonplace in Web-scale data mining, e.g., see Distributed Nonnegative Matrix Factorization (DNMF),[82] Scalable Nonnegative Matrix Factorization (ScalableNMF),[83] Distributed Stochastic Singular Value Decomposition.[84]
Online: how to update the factorization when new data comes in without recomputing from scratch, e.g., see online CNSC[85]
Collective (joint) factorization: factorizing multiple interrelated matrices for multiple-view learning, e.g. multi-view clustering, see CoNMF[86] and MultiNMF[87]
Cohen and Rothblum 1993 problem: whether a rational matrix always has an NMF of minimal inner dimension whose factors are also rational. Recently, this problem has been answered negatively.[88]
Rainer Gemulla; Erik Nijkamp; Peter J. Haas; Yannis Sismanis (2011). Large-scale matrix factorization with distributed stochastic gradient descent. Proc. ACM SIGKDD Int'l Conf. on Knowledge discovery and data mining. pp. 69–77.
C Ding, T Li, MI Jordan, Convex and semi-nonnegative matrix factorizations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 45-55, 2010
Berman, A.; R.J. Plemmons (1974). "Inverses of nonnegative matrices". Linear and Multilinear Algebra. 2 (2): 161–172. doi:10.1080/03081087408817055.
A. Berman; R.J. Plemmons (1994). Nonnegative matrices in the Mathematical Sciences. Philadelphia: SIAM.
Behnke, S. (2003). "Discovering hierarchical speech features using convolutional non-negative matrix factorization". Proceedings of the International Joint Conference on Neural Networks, 2003. Vol. 4. Portland, Oregon USA: IEEE. pp. 2758–2763. doi:10.1109/IJCNN.2003.1224004. ISBN 978-0-7803-7898-8. S2CID 3109867.
Ding, C.; He, X. & Simon, H.D. (2005). "On the equivalence of nonnegative matrix factorization and spectral clustering". Proc. SIAM Data Mining Conf. Vol. 4. pp. 606–610. doi:10.1137/1.9781611972757.70. ISBN 978-0-89871-593-4.
Hafshejani, Sajad Fathi; Moaberfard, Zahra (November 2022). "Initialization for Nonnegative Matrix Factorization: a Comprehensive Review". International Journal of Data Science and Analytics. 16 (1): 119–134. arXiv:2109.03874. doi:10.1007/s41060-022-00370-9. ISSN 2364-415X.
Zhu, Guangtun B. (2016-12-19). "Nonnegative Matrix Factorization (NMF) with Heteroscedastic Uncertainties and Missing data". arXiv:1612.06037 [astro-ph.IM].
Eric Gaussier & Cyril Goutte (2005). Relation between PLSA and NMF and Implications (PDF). Proc. 28th international ACM SIGIR conference on Research and development in information retrieval (SIGIR-05). pp. 601–602. Archived from the original (PDF) on 2007-09-28. Retrieved 2007-01-29.
Vamsi K. Potluru; Sergey M. Plis; Morten Morup; Vince D. Calhoun & Terran Lane (2009). Efficient Multiplicative updates for Support Vector Machines. Proceedings of the 2009 SIAM Conference on Data Mining (SDM). pp. 1218–1229.
Wahhaj, Zahed; Cieza, Lucas A.; Mawet, Dimitri; Yang, Bin; Canovas, Hector; de Boer, Jozua; Casassus, Simon; Ménard, François; Schreiber, Matthias R.; Liu, Michael C.; Biller, Beth A.; Nielsen, Eric L.; Hayward, Thomas L. (2015). "Improving signal-to-noise in the direct imaging of exoplanets and circumstellar disks with MLOCI". Astronomy & Astrophysics. 581 (24): A24. arXiv:1502.03092. Bibcode:2015A&A...581A..24W. doi:10.1051/0004-6361/201525837. S2CID 20174209.
Hassani, Ali; Iranmanesh, Amir; Mansouri, Najme (2019-11-12). "Text Mining using Nonnegative Matrix Factorization and Latent Semantic Analysis". arXiv:1911.04705 [cs.LG].
Berry, Michael W.; Browne, Murray; Langville, Amy N.; Pauca, V. Paul; Plemmons, Robert J. (15 September 2007). "Algorithms and Applications for Approximate Nonnegative Matrix Factorization". Computational Statistics & Data Analysis. 52 (1): 155–173. doi:10.1016/j.csda.2006.11.006.
Ding; Li; Peng; Park (2006). "Orthogonal nonnegative matrix t-factorizations for clustering". Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 126–135. doi:10.1145/1150402.1150420. ISBN 1-59593-339-5. S2CID 165018.
Pinoli; Ceddia; Ceri; Masseroli (2021). "Predicting drug synergism by means of non-negative matrix tri-factorization". IEEE/ACM Transactions on Computational Biology and Bioinformatics. PP (4): 1956–1967. doi:10.1109/TCBB.2021.3091814. PMID 34166199. S2CID 235634059.
Andrzej Cichocki, Morten Mørup, et al.: "Advances in Nonnegative Matrix and Tensor Factorization", Hindawi Publishing Corporation, ISBN 978-9774540455 (2008).
Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan and Shun-ichi Amari: "Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation", Wiley, ISBN 978-0470746660 (2009).
Andri Mirzal: "Nonnegative Matrix Factorizations for Clustering and LSI: Theory and Programming", LAP LAMBERT Academic Publishing, ISBN 978-3844324891 (2011).