CN107871139A


Info

Publication number: CN107871139A
Application number: CN201711058157.XA
Authority: CN
Inventors: 董渭清; 李玥; 郭桑; 董文鑫; 陈建友; 仓剑; 袁泉
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2017-11-01
Filing date: 2017-11-01
Publication date: 2018-04-03

Abstract


The invention discloses a data dimensionality reduction method based on an improved neighborhood preserving embedding algorithm. First, an adjacency graph is constructed, and geodesics are used to compute the neighboring points of each sample point, forming an adjacency matrix. Next, the reconstruction weights are computed, representing each sampling point by its neighboring points. Finally, the projection matrix is computed, using the reconstruction weight matrix to obtain the transformation projection matrix. By replacing the Euclidean distance with the geodesic distance, the method better preserves the local structure information of the NPE algorithm and improves the algorithm's ability to handle manifold structure.

Description


A Data Dimensionality Reduction Method Based on an Improved Neighborhood Preserving Embedding Algorithm

Technical Field

The invention belongs to the field of big data processing and relates to a data dimensionality reduction method, specifically one based on an improved neighborhood preserving embedding algorithm.

Background

In the era of big data, ever-growing data volumes have led to an information explosion. Such data are often high-dimensional, and because of their structural complexity they are usually difficult to process directly with existing techniques. The main purpose of data mining, for example, is to use efficient algorithms to uncover the information hidden in the data and turn it into knowledge that guides people toward sound decisions. Dimensionality reduction techniques arose to handle such high-dimensional data properly. Dimensionality reduction is the process of projecting data from a high-dimensional feature space into a low-dimensional feature space while preserving as much of the data's essential structure as possible; lowering the dimensionality makes data mining easier. Depending on the characteristics of the data, dimensionality reduction methods are either linear or nonlinear. To explore the nonlinear structure contained in data sets, many nonlinear dimensionality reduction techniques have been developed, including artificial neural networks, genetic algorithms, and manifold learning. These nonlinear manifold algorithms usually perform well on training samples but cannot reduce the dimensionality of test samples: they lack a projection matrix and therefore cannot extract features from newly added samples. To solve this problem, linearized versions of typical manifold learning algorithms have been proposed, such as neighborhood preserving embedding (NPE), which uses local representations to obtain a projection matrix that maps high-dimensional manifold data into a low-dimensional manifold space. However, such local representations usually assume that the manifold is locally linear, which can cause large fluctuations in the dimensionality reduction results.

Summary of the Invention

To solve the problems in the prior art, the invention proposes a data dimensionality reduction method based on an improved neighborhood preserving embedding algorithm. Addressing the limitations of neighborhood preserving embedding (NPE), the geodesic-based variant describes local information more accurately and thereby improves the selection of neighboring points. It reduces the reconstruction error while better preserving local information, and ultimately achieves dimensionality reduction.

To achieve the above object, the invention adopts the following technical solution:

A data dimensionality reduction method based on an improved neighborhood preserving embedding algorithm, comprising the following steps:

1) Construct an adjacency graph: use the geodesic distance to compute the distance between each sampling point and every other point, forming a distance matrix, then select the points at the shortest distances to form the adjacency matrix;

2) Compute the reconstruction weights of the data from the geodesic distances of the adjacency matrix: to minimize the loss after projection, the reconstruction weights are computed according to the contribution of each sample point in the adjacency graph, and each sampling point is represented by its neighboring points in the adjacency matrix, yielding the reconstruction weight matrix;

3) Compute the projection matrix of the data: substitute the reconstruction weight matrix into the eigenvector equation to obtain the transformation matrix of the projection, completing the dimensionality reduction.

Further, for sampling points i and j in the adjacency graph of step 1), if the two sampling points belong to the same category, an edge exists between them and the geodesic distance is d_G(i,j) = d_x(i,j);

If the two sampling points do not belong to the same category, no edge exists between them; first set d_G(i,j) = ∞, then, for every sampling point l = 1, 2, 3, …, N, update d_G(i,j) according to:

d_G(i,j) = min{d_G(i,j), d_G(i,l) + d_G(l,j)}.
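The update rule above is the relaxation step of the Floyd–Warshall all-pairs shortest-path algorithm. A minimal Python sketch follows, assuming the Euclidean distances d_x are supplied as a full matrix and the edges as a boolean mask. The names `geodesic_distances`, `d_x`, and `connected` are illustrative and do not come from the patent.

```python
import math


def geodesic_distances(d_x, connected):
    """Approximate geodesic distances by shortest paths over the adjacency graph.

    d_x       -- N x N list of lists with Euclidean distances (assumed input)
    connected -- N x N list of lists of booleans, True where an edge exists
    """
    n = len(d_x)
    # Initialisation: direct distance where an edge exists, infinity otherwise.
    d_g = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        d_g[i][i] = 0.0
        for j in range(n):
            if i != j and connected[i][j]:
                d_g[i][j] = d_x[i][j]
    # Relaxation: d_G(i,j) = min{d_G(i,j), d_G(i,l) + d_G(l,j)} for every l.
    for l in range(n):
        for i in range(n):
            for j in range(n):
                via_l = d_g[i][l] + d_g[l][j]
                if via_l < d_g[i][j]:
                    d_g[i][j] = via_l
    return d_g


# Three points on an arc: only consecutive points share an edge.
d_x = [[0.0, 1.0, 1.5],
       [1.0, 0.0, 1.0],
       [1.5, 1.0, 0.0]]
connected = [[False, True, False],
             [True, False, True],
             [False, True, False]]
d_g = geodesic_distances(d_x, connected)
print(d_g[0][2])  # 2.0, the path length along the arc rather than the chord 1.5
```

Under the same-category rule stated above, `connected[i][j]` would be set to `labels[i] == labels[j]` for a hypothetical list of class labels.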

Further, the objective function used in step 2) to compute the reconstruction weights, representing each sampling point by its neighboring points, is:

where w_ij is the reconstruction weight obtained for each sampling point using the geodesic distance, and w_i1, ..., w_ik is the weight vector assigned to the corresponding neighboring points.
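The equation for this objective is not reproduced in the text. Assuming it follows the reconstruction objective of standard NPE (an assumption, not confirmed by the source), it would take the form below, where x_{ij} is notation introduced here for the j-th of the k neighboring points selected for x_i by geodesic distance:

```latex
\min_{W} \; \sum_{i=1}^{N} \Bigl\| x_i - \sum_{j=1}^{k} w_{ij}\, x_{ij} \Bigr\|^{2}
\quad \text{subject to} \quad \sum_{j=1}^{k} w_{ij} = 1, \qquad i = 1, \dots, N
```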

Further, because dimensionality reduction changes the feature space, that is, x_i → y_i, the objective function can be simplified using the weight vector matrix to:

Further, in step 3), let the coordinates after projection be y_i. For the formula: make the following definition:

y_i = A^T x_i

Then:

where the vectors a form the projection matrix, Φ(y) denotes the transformation matrix, z denotes the vector form of the transformation matrix, I is the identity matrix, W is the reconstruction weight matrix, X is the coordinate matrix before projection, M denotes (I-W)^T(I-W), and T denotes the matrix transpose.
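The equations that these symbols refer to are not reproduced in the text. A sketch of how they would fit together, assuming the standard NPE derivation and treating y as the vector of one-dimensional projected coordinates y_i = a^T x_i for a single column a of A:

```latex
\begin{aligned}
\Phi(y) &= \sum_{i=1}^{N} \Bigl( y_i - \sum_{j} w_{ij}\, y_j \Bigr)^{2}, \\
z &= y - W y = (I - W)\, y, \\
\Phi(y) &= z^{T} z = y^{T} (I - W)^{T} (I - W)\, y = y^{T} M y, \\
y^{T} &= a^{T} X \;\Longrightarrow\; \Phi(y) = a^{T} X M X^{T} a .
\end{aligned}
```

Each line uses only the symbols defined above (Φ(y), z, I, W, X, M); the intermediate steps are a reconstruction and may differ in layout from the original formulas.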

Further, after a Lagrange multiplier is introduced into the transformation matrix formula, the problem becomes one of solving for XMX^T using SVD: points in the high-dimensional space of dimension N are mapped to points in a subspace of dimension n (N > n). Suppose the rank of X is l. Using SVD, X can be projected into a matrix B of dimension l: X = USV^T, B = U^T X = SV^T, where the columns of U are eigenvectors of XX^T, the columns of V are eigenvectors of X^T X, and S is an l × l diagonal matrix. The eigenvectors of the formula below then become the eigenvectors of the matrix (BB^T)^-1 (BMB^T).

XMX^T A = λ XX^T A

where A denotes the eigenvectors and λ denotes the corresponding eigenvalue.
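The SVD-based solution described above can be sketched with NumPy as follows. The sketch assumes X stores samples as columns (consistent with y_i = A^T x_i) and keeps the eigenvectors with the smallest eigenvalues, as in standard NPE; the function name `npe_projection` and the rank tolerance are illustrative choices, not part of the patent. Because B = SV^T gives BB^T = S^2, substituting b = S^-1 q turns the problem for (BB^T)^-1 (BMB^T) into the symmetric eigenproblem (V^T M V) q = λ q, which is what the code solves.

```python
import numpy as np


def npe_projection(X, W, d):
    """Solve X M X^T a = lambda X X^T a via the SVD reduction.

    X -- (D, N) data matrix, one sample per column (assumed layout)
    W -- (N, N) reconstruction weight matrix
    d -- number of projection directions to keep (d <= rank of X)
    """
    n = X.shape[1]
    I = np.eye(n)
    M = (I - W).T @ (I - W)

    # Thin SVD: X = U S V^T, truncated to the numerical rank l of X.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    rank = int(np.sum(s > 1e-10 * s[0]))
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]

    # Symmetric reduced problem (V^T M V) q = lambda q; eigh sorts ascending.
    eigvals, Q = np.linalg.eigh(Vt @ M @ Vt.T)

    # Map back: b = S^-1 q in the reduced space, a = U b in the original space.
    A = U @ (Q[:, :d] / s[:, None])
    return A, eigvals[:d]
```

A new sample x can then be embedded as `A.T @ x`, which is the practical benefit of having an explicit projection matrix.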

Compared with the prior art, the invention addresses the limitations of neighborhood preserving embedding (NPE) by proposing a geodesic-based neighborhood preserving embedding algorithm. First, an adjacency graph is constructed and geodesics are used to compute the neighboring points of each sample point, forming the adjacency matrix. Next, the reconstruction weights are computed and each sampling point is represented by its neighboring points. Finally, the projection matrix is computed, using the reconstruction weight matrix to obtain the transformation projection matrix. Replacing the Euclidean distance with the geodesic distance better preserves the local structure information of the NPE algorithm and improves the algorithm's ability to handle manifold structure. Local information is described more accurately, the selection of neighboring points is improved, the reconstruction error is reduced while local information is better preserved, and dimensionality reduction is ultimately achieved.

Description of the Drawings

Figure 1 is a three-dimensional plot of the Helix data set with two classes of features;

Figure 2 shows the dimensionality reduction result of applying the NPE method to the data of Figure 1;

Figure 3 shows the dimensionality reduction result of applying the GNPE method of the invention to the data of Figure 1;

Figure 4 is a flow chart of the method of the invention.

The horizontal and vertical axes in the figures represent distances between sample points. The distance scales were adjusted slightly so that the dispersion of the sample points can be seen by eye.

Detailed Description

The invention is further explained below with reference to specific embodiments and the accompanying drawings.

Because NPE assumes that the manifold is locally linear, it performs poorly on manifolds with large curvature. The invention replaces the Euclidean distance with the geodesic distance; by selecting the true neighbors on the manifold it recovers the intrinsic space, preserves local structure information well, and improves the method's ability to handle high-dimensional data.

Referring to Figure 4, the invention comprises the following steps:

Step 01: Construct the adjacency graph and use geodesics to compute a consistent set of neighboring points for each sample point, forming the adjacency matrix;

Step 02: Compute the reconstruction weights, representing each sampling point consistently by its neighboring points;

Step 03: Compute the projection matrix, using the reconstruction weight matrix to obtain the transformation matrix.

In step 01, the adjacency graph is constructed and geodesics are used to compute the neighboring points of each sample point, forming the adjacency matrix. Specifically:

When GNPE selects the neighboring points of any sampling point, it uses the geodesic distance in place of the Euclidean distance. For sampling points i and j, an edge exists if they belong to the same category and does not exist otherwise. If an edge exists between them, d_G(i,j) = d_x(i,j); otherwise first set d_G(i,j) = ∞, then, for all l = 1, 2, 3, …, N, update d_G(i,j) according to:

d_G(i,j) = min{d_G(i,j), d_G(i,l) + d_G(l,j)}
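Once a geodesic distance matrix is available, neighbor selection reduces to picking, for each point, the k points with the smallest finite geodesic distance. The sketch below assumes the matrix is given as a list of lists; `geodesic_knn` is a hypothetical helper name, and excluding unreachable points (infinite distance) is an assumption made for the sketch.

```python
import math


def geodesic_knn(d_g, k):
    """Return, for each point, the indices of its k nearest points by geodesic distance.

    d_g -- N x N list of lists of geodesic distances (math.inf where unreachable)
    k   -- number of neighbors to keep per point
    """
    n = len(d_g)
    neighbors = []
    for i in range(n):
        # Candidates: every other point that is reachable from i.
        candidates = [j for j in range(n) if j != i and d_g[i][j] != math.inf]
        candidates.sort(key=lambda j: d_g[i][j])
        neighbors.append(candidates[:k])
    return neighbors


d_g = [[0.0, 1.0, 3.0],
       [1.0, 0.0, 2.0],
       [3.0, 2.0, 0.0]]
print(geodesic_knn(d_g, 1))  # [[1], [0], [1]]
```

A point with fewer than k reachable candidates simply receives a shorter neighbor list in this sketch.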

In step 02, the reconstruction weights are computed and each sampling point is represented by its neighboring points. The objective function is:

In the above formula, w_ij is the reconstruction weight obtained for each sampling point using the geodesic distance; computed this way, the reconstruction weights describe the low-dimensional structure more faithfully. With this method, the nearest neighbors of a given sample point x_i receive large weights, while more distant points receive small weights that decay exponentially with their distance from the sample point. w_i1, ..., w_ik is the weight vector assigned to the corresponding neighboring points. Because dimensionality reduction changes the feature space, that is, x_i → y_i, the above formula can be simplified using the weight vector matrix to:
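The reconstruction weights of step 02 can be illustrated with a short NumPy sketch. Since the objective's formula is not reproduced in the text, the sketch assumes the standard NPE/LLE constrained least-squares form, in which each point's weights sum to 1. It does not model the exponential decay with distance mentioned above, because no formula for that is given. The function name `reconstruction_weights`, the column-per-sample layout, and the regularization term `reg` are assumptions introduced for the example.

```python
import numpy as np


def reconstruction_weights(X, neighbors, reg=1e-3):
    """Least-squares weights that reconstruct each sample from its neighbors.

    X         -- (D, N) data matrix, one sample per column (assumed layout)
    neighbors -- list of N index lists, e.g. chosen by geodesic distance
    reg       -- small regularizer that keeps the local Gram matrix invertible
    """
    n = X.shape[1]
    W = np.zeros((n, n))
    for i, idx in enumerate(neighbors):
        k = len(idx)
        Z = X[:, idx] - X[:, [i]]          # neighbors centred on x_i, shape (D, k)
        G = Z.T @ Z                        # local Gram matrix, shape (k, k)
        G = G + (reg * np.trace(G) + 1e-12) * np.eye(k)
        w = np.linalg.solve(G, np.ones(k))
        W[i, idx] = w / w.sum()            # enforce sum_j w_ij = 1
    return W


# Three collinear points; the middle one is the midpoint of the other two.
X = np.array([[0.0, 1.0, 2.0],
              [0.0, 0.0, 0.0]])
W = reconstruction_weights(X, [[1, 2], [0, 2], [0, 1]])
print(W[1])  # approximately [0.5, 0. , 0.5]
```

The resulting W is the matrix that enters M = (I-W)^T(I-W) in step 03.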

In step 03, the projection matrix is computed, using the reconstruction weight matrix to obtain the transformation matrix, as follows:

Let the coordinates after projection be y_i. For the formula:

make the following definition:

y_i = A^T x_i

Then:

where the vectors a form the projection matrix. Introducing a Lagrange multiplier turns the formula into the problem of finding the eigenvectors of XMX^T A = λ XX^T A.
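The Lagrange multiplier step can be written out as follows. The normalization constraint a^T X X^T a = 1 is an assumption taken from standard NPE, since the text does not state the constraint explicitly; M = (I-W)^T(I-W) is symmetric, which is what gives the factor of 2 in the gradient.

```latex
\begin{aligned}
&\min_{a} \; a^{T} X M X^{T} a \quad \text{subject to} \quad a^{T} X X^{T} a = 1, \\
&L(a, \lambda) = a^{T} X M X^{T} a - \lambda \bigl( a^{T} X X^{T} a - 1 \bigr), \\
&\frac{\partial L}{\partial a} = 2\, X M X^{T} a - 2 \lambda\, X X^{T} a = 0
\;\Longrightarrow\; X M X^{T} a = \lambda\, X X^{T} a .
\end{aligned}
```

At a stationary point the objective equals λ, so under this assumption the eigenvectors with the smallest eigenvalues would be the ones kept as columns of A.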

To verify the effectiveness of the method, two groups of experiments were carried out. A KNN classifier was used to determine the recognition rate, with the NPE dimensionality reduction algorithm serving as the comparative example against the GNPE algorithm of the invention. Both experiments used reduced dimensions d = 10 and d = 80 with parameter k = 12. From each sample set, 5 samples were selected as the training set and the rest were used as the test set; the experiment was repeated 5 times and the results were averaged. Table 1 compares the average accuracy of the two dimensionality reduction methods for d = 10:

Table 2 compares the average accuracy of the two dimensionality reduction methods for d = 80:

The average accuracies of the different algorithms in Tables 1 and 2 show that, on the ORL face database, the face recognition rate of GNPE is better overall than that of NPE. For a fixed number of training samples, the lower the dimension to which the sample data are reduced, the lower the resulting face recognition rate, because fewer and fewer intrinsic features of the low-dimensional data structure are retained. On a manifold structure such as faces, GNPE restores as neighbors those points that should be neighbors but were pushed apart by NPE's use of the Euclidean distance. Conversely, sample points that are not neighbors on the manifold but became neighbors because they are close in Euclidean distance contribute less to the reconstruction matrix once the geodesic distance is used. Both effects improve the recognition rate. In other words, the original neighborhood preserving embedding algorithm computes the weight matrix from the Euclidean distances between each sample point and its neighbors, which is inaccurate on a manifold. Replacing the Euclidean distance with the geodesic distance re-evaluates how near the k neighbors of each sample point are and redistributes their contributions, producing a new weight matrix and a higher recognition rate, because a manifold of this kind has large curvature and is not close to linear even locally.

Comparing horizontally, for the same reduced dimension the recognition rate of both algorithms improves as the number of training samples increases, and the tabulated data show that the recognition rate of GNPE remains clearly higher than that of NPE.

Similarly, on the PIE face database the face recognition rate of GNPE is again higher than that of NPE. In addition, experiments show that if the reduced dimension is increased beyond 80, the face recognition rate on both the ORL and PIE face databases remains essentially unchanged.

The embodiment uses Helix, an artificial data set with two classes of features, in which hollow points and solid points form interleaved helices in three-dimensional space. The curve formed by the hollow points is the helix of the first class, and the curve formed by the solid points is the helix of the second class; the distribution of the two classes is randomly generated. A helix is a curve whose tangent at every point makes a constant angle with a fixed line. The three-dimensional view of Helix is shown in Figure 1.

First, the original NPE algorithm is used to reduce the dimensionality of Helix, with parameters k = 12 and d = 2. The result is shown in Figure 2: NPE achieves some dimensionality reduction on Helix, but part of the data overlaps.

As Figure 3 shows, GNPE achieves good dimensionality reduction and classification: the sample points of each category are separated, with little overlap between categories.

Because NPE assumes that the manifold is locally linear, it performs poorly on manifolds with large curvature. Replacing the Euclidean distance with the geodesic distance and selecting the true neighbors on the manifold recovers the intrinsic space, preserves local structure information well, and improves the method's ability to handle high-dimensional data.

Finally, it should be noted that the above models and worked examples on real data further verify the purpose, technical solution, and beneficial effects of the invention. They are only specific embodiments of the invention and do not limit its scope of protection. Any modification, improvement, or equivalent replacement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A data dimension reduction method of a neighborhood preserving embedding improved algorithm is characterized by comprising the following steps:

1) Constructing an adjacency graph, calculating the distance between each sampling point and other points by using the geodesic distance to form a matrix, and then selecting a part of points with shorter distances from the points to finally form an adjacency matrix;

2) Calculating a reconstruction weight of the data according to the geodesic distance of the adjacency matrix, wherein in order to minimize loss after projection, the reconstruction weight is calculated according to the contribution rate of each sample point in the adjacency graph, and each sampling point of the data is represented by the adjacent point of the adjacency matrix to obtain a reconstruction weight matrix;

3) Calculating a projection matrix of the data, putting the reconstruction weight matrix into the equation for calculating the eigenvectors, and calculating to obtain a transformation matrix of the projection matrix to finish the data dimension reduction.

2. The data dimension reduction method of the neighborhood preserving embedding improved algorithm according to claim 1, characterized in that, for sampling points i and j in the adjacency graph of step 1), if the two sampling points belong to the same category, a connecting line exists between them and the geodesic distance is d_G(i,j) = d_x(i,j);

If the two sampling points do not belong to the same category, no connecting line exists between them; d_G(i,j) = ∞ is first assumed, then the geodesic distance is determined for all sampling points l = 1, 2, 3, …, N and d_G(i,j) is updated according to the following equation:

d_G(i,j) = min{d_G(i,j), d_G(i,l) + d_G(l,j)}.

3. The data dimension reduction method of the neighborhood preserving embedding improved algorithm according to claim 1, characterized in that the objective function used in step 2) to calculate the reconstruction weights, representing each sampling point by its adjacent points, is:

wherein w_ij is the reconstruction weight obtained for each sampling point using the geodesic distance, and w_i1, ..., w_ik is the weight vector given to the corresponding adjacent points.

4. The method of claim 3, characterized in that, because the feature space is transformed after dimension reduction, namely x_i → y_i, the objective function is simplified according to the weight vector matrix as follows:

5. The method for reducing the data dimension of the neighborhood preserving embedding improved algorithm according to claim 4, characterized in that the coordinates after projection in step 3) are set as y_i and, for the formula: the following definition is made:

y_i = A^T x_i

then:

wherein the matrix formed by a is the projection matrix, Φ(y) represents the transformation matrix, z represents the vector form of the transformation matrix, I represents the identity matrix, W represents the reconstruction weight matrix, X represents the coordinate matrix before projection, M represents (I-W)^T(I-W), and T represents the transpose of a matrix.

6. The method as claimed in claim 5, characterized in that, after a Lagrangian factor is introduced into the transformation matrix formula, the problem is converted into solving XMX^T by SVD: high-dimensional coordinate points of dimension N are mapped to points in a subspace of dimension n (N > n); assuming that the rank of X is l, X can be projected by SVD into a matrix B of dimension l, X = USV^T, B = U^T X = SV^T, wherein U consists of the eigenvectors of XX^T, V consists of the eigenvectors of X^T X, and S is an l × l diagonal matrix; the eigenvectors that solve the following equation then become the eigenvectors of the matrix (BB^T)^-1 (BMB^T);

XMX^T A = λ XX^T A

wherein A denotes the eigenvectors and λ denotes the eigenvalue corresponding to the matrix.