CN106570104A

Movatterモバイル変換

Info

Publication number: CN106570104A
Application number: CN201610934684.1A
Authority: CN
Inventors: 王烁; 李千目; 戚湧; 王印海
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2016-11-01
Filing date: 2016-11-01
Publication date: 2017-04-19
Anticipated expiration: 2036-11-01
Also published as: CN106570104B

Abstract

Translated fromChinese

本发明公开了一种流数据的多分区聚类预处理方法，步骤包括：对流数据的态势因子进行筛选并计算关联度：对流数据进行统计分析，对高维数据库中高关联度因子采用低维数据库分区方法，低关联度因子采用等距三角分区；对低维数据库采用基于分布的分区方法；在每一规则分区内采用基于密度的聚类算法DBSCAN算法进行聚类；将各局部聚类进行合并。本发明对流数据的聚类预处理进行了多分区的改进，使数据分布更均匀，聚类结果更精确，实现了对数据的分布式并行处理，缓解了面对顺序、大量、快速、连续到达的数据序列时数据预处理效率不高的压力。

The invention discloses a multi-partition clustering preprocessing method for streaming data. The steps include: screening the situational factors of the streaming data and calculating the correlation degree; performing statistical analysis on the streaming data, and adopting a low-dimensional database for the high correlation factor in the high-dimensional database Partitioning method, the low correlation factor adopts equidistant triangular partitioning; the low-dimensional database adopts the distribution-based partitioning method; uses the density-based clustering algorithm DBSCAN algorithm for clustering in each regular partition; merges the local clusters . The present invention improves the multi-partition clustering preprocessing of stream data, makes the data distribution more uniform, and the clustering results are more accurate, realizes the distributed parallel processing of data, and eases the problem of sequential, massive, fast and continuous arrival. The data preprocessing efficiency is not high when the data sequence is under pressure.

Description

Translated fromChinese

一种流数据的多分区聚类预处理方法A multi-partition clustering preprocessing method for streaming data

技术领域technical field

本发明涉及DBSCAN聚类算法技术领域，特别是一种流数据的多分区聚类预处理方法。The invention relates to the technical field of DBSCAN clustering algorithm, in particular to a multi-partition clustering preprocessing method for streaming data.

背景技术Background technique

随着大数据时代的到来，数据逐渐已数据流的形式传递。流数据具有以下四个特点：1)数据实时到达；2)数据到达次序独立，不受应用系统所控制；3)数据规模宏大且不能预知其最大值；4)数据一经处理，除非特意保存，否则不能被再次取出处理，或者再次提取数据代价昂贵。流数据的以上特点，也随之带来安全审计数据的膨胀。数据挖掘本身是一项通用的知识发现技术，其目的是要从海量数据中提取出我们所感兴趣的数据信息(知识)，因此将数据挖掘技术应用于对审计数据的分析可以从包含了大量冗余信息的数据中提取出尽可能多的隐藏的安全信息。数据挖掘技术中包括统计、分类、聚类、关联、序列分析、群集分析等方法。其中，数据挖掘中的关联规则算法、序列模式算法和分类算法已经用在了日志分析和入侵检测中，但是这些方法在正确分区上仍然存在着很多不足之处。采用聚类分析日志是解决数据量大，信息匮乏，发现新入侵模式的一种可取的数据预处理方法。改进了传统的生成包分析，SVM等小样本取样的数据挖掘的不足。大规模数据库的聚类时，数据分区可以带来聚类结果的优化，因此，要提高聚类效率，合理的进行数据分区是关键。With the advent of the era of big data, data has gradually been transmitted in the form of data streams. Streaming data has the following four characteristics: 1) The data arrives in real time; 2) The order of data arrival is independent and not controlled by the application system; 3) The data scale is huge and its maximum value cannot be predicted; 4) Once the data is processed, unless it is specially saved, Otherwise, it cannot be retrieved and processed again, or it is expensive to retrieve the data again. The above characteristics of streaming data also bring about the expansion of security audit data. Data mining itself is a common knowledge discovery technology, its purpose is to extract the data information (knowledge) we are interested in from massive data, so the application of data mining technology to the analysis of audit data can contain a lot of redundant information. Extract as much hidden security information as possible from the rest of the data. Data mining techniques include statistics, classification, clustering, association, sequence analysis, cluster analysis and other methods. Among them, association rule algorithms, sequential pattern algorithms and classification algorithms in data mining have been used in log analysis and intrusion detection, but these methods still have many deficiencies in correct partitioning. The use of cluster analysis logs is a desirable data preprocessing method to solve the large amount of data, lack of information, and discover new intrusion patterns. It improves the deficiencies of traditional generation package analysis, small sample sampling data mining such as SVM. When clustering large-scale databases, data partitioning can lead to optimization of clustering results. Therefore, to improve clustering efficiency, reasonable data partitioning is the key.

改进数据分区是聚类算法的一个重要子课题，具有广阔的应用前景，较高的学术价值和理论研究意义。在通用的基于密度的聚类算法DBSCAN算法中，存在着两个弱点：一是在对数据聚类过程中要将整个数据库装入内存，当数据量很大时，算法的效率会急剧下降。二是当空间聚类的密度不均匀时，聚类间距离相差很大，聚类质量较差。在改进的基于网格和基于密度的聚类算法中，也仍然存在处理的数据量范围庞大，导致聚类效率不高的问题Improving data partition is an important sub-topic of clustering algorithm, which has broad application prospects, high academic value and theoretical research significance. In the general density-based clustering algorithm DBSCAN algorithm, there are two weaknesses: one is that the entire database needs to be loaded into the memory during the data clustering process, and when the amount of data is large, the efficiency of the algorithm will drop sharply. Second, when the density of spatial clusters is not uniform, the distances between clusters vary greatly, and the clustering quality is poor. In the improved grid-based and density-based clustering algorithms, there is still a problem that the processing data volume range is huge, resulting in low clustering efficiency.

发明内容Contents of the invention

本发明的目的在于提供一种使数据分布更均匀、聚类结果更精确的流数据的多分区聚类预处理方法，从而实现对数据的分布式并行处理。The purpose of the present invention is to provide a multi-partition clustering preprocessing method for stream data that makes data distribution more uniform and clustering results more accurate, thereby realizing distributed parallel processing of data.

实现本发明目的的技术解决方案为：The technical solution that realizes the object of the present invention is:

一种流数据的多分区聚类预处理方法，包括以下步骤：A multi-partition clustering preprocessing method for stream data, comprising the following steps:

步骤1，确定流数据态势因子的范围，并根据态势因子与网络安全态势的关联度，对流数据态势因子进行筛选；Step 1. Determine the scope of the flow data situation factors, and screen the flow data situation factors according to the correlation between the situation factors and the network security situation;

步骤2，根据筛选得到的态势因子，将流数据库划分为多个分区；Step 2. Divide the flow database into multiple partitions according to the situational factors obtained through screening;

步骤3，对每个分区，分别使用DBSCAN算法进行局部聚类；Step 3, for each partition, use the DBSCAN algorithm to perform local clustering;

步骤4，合并各局部聚类，得到流数据的多分区聚类预处理结果。Step 4, merge the local clusters to obtain the multi-partition clustering preprocessing results of the flow data.

进一步地，步骤1所述确定流数据态势因子的范围，并根据态势因子与网络安全态势的关联度，对流数据态势因子进行筛选，具体如下：Further, as described in step 1, determine the scope of the flow data situation factors, and screen the flow data situation factors according to the degree of correlation between the situation factors and the network security situation, as follows:

(1)确定流数据态势因子的范围；(1) Determine the range of the flow data situation factor;

(2)用随时刻K变化的态势数列和态势因子数列分别作为参考数列和比较数列；(2) Use the situation sequence that changes with time K and the series of situation factors Respectively as a reference sequence and a comparison sequence;

(3)对所述参考数列和比较数列进行无量纲化处理；(3) carry out dimensionless processing to described reference sequence and comparison sequence;

(4)确定K时刻所述参考数列与比较数列的灰色关联系数ε_i(K)，公式如下：(4) Determine the gray relational coefficient ε_i (K) of the reference sequence and the comparison sequence at the K moment, the formula is as follows:

式中，ρ是分辨系数，ρ∈[0，∞)；K＝1，2，…,L表示L个时刻中第K个时刻；i＝1,2,...m表示第i个因子；In the formula, ρ is the resolution coefficient, ρ∈[0, ∞); K=1, 2,..., L represents the Kth moment in the L moments; i=1, 2,...m represents the i-th factor ;

(5)计算态势因子与网络安全态势的关联度ρ_0,i，公式如下：(5) Calculate the correlation degree ρ_0,i between the situation factor and the network security situation, the formula is as follows:

(6)对流数据态势因子进行筛选，当所述关联度ρ_0,i大于阈值的时候，选用该流数据态势因子，否则剔除。(6) Screening the flow data situation factor, when the correlation degree ρ_0,i is greater than the threshold, select the flow data situation factor, otherwise eliminate it.

进一步地，步骤2所述根据筛选得到的态势因子，将流数据库划分为多个分区，具体为：Further, in step 2, according to the situation factors obtained through screening, the flow database is divided into multiple partitions, specifically:

(1)高维数据分区；(1) High-dimensional data partitioning;

对于高维数据，对筛选出的态势因子进行单因子方差分析；经过统计分析，若因子A不显著，则标记为高关联度因子，若A显著，则标记为低关联度因子；其中，对高关联度因子采用基于分布的分区方式，对低关联度因子采取等距的三角分区方式；For high-dimensional data, conduct single-factor analysis of variance on the selected situation factors; after statistical analysis, if factor A is not significant, it will be marked as a high-correlation factor, and if A is significant, it will be marked as a low-correlation factor; among them, for The distribution-based partition method is adopted for the high correlation factor, and the equidistant triangular partition method is adopted for the low correlation factor;

(2)低维数据分区(2) Low-dimensional data partition

对于低维数据采用随机取样算法对整个数据库进行取样，采用直方图方法统计分析采样点投影在X轴和Y轴上的数据分布特性，确定在哪一维上进行分区、分区的数量以及分区的边界。For low-dimensional data, the random sampling algorithm is used to sample the entire database, and the histogram method is used to statistically analyze the data distribution characteristics of the sampling points projected on the X-axis and Y-axis to determine which dimension to partition, the number of partitions, and the size of the partitions. boundary.

进一步地，步骤3所述对每个分区，分别使用DBSCAN算法进行局部聚类，具体过程如下：Further, as described in step 3, for each partition, the DBSCAN algorithm is used to perform local clustering, and the specific process is as follows:

(1)构建分区自身的R*树；(1) Build the R* tree of the partition itself;

(2)建立k-dist图；(2) Establish a k-dist graph;

(3)选取局部扫描半径Eps值，使用DBSCAN算法进行局部聚类。(3) Select the Eps value of the local scanning radius, and use the DBSCAN algorithm for local clustering.

进一步地，步骤4所述合并各局部聚类，具体过程如下：Further, as described in step 4, the local clusters are merged, and the specific process is as follows:

(1)两个类A和B的合并；(1) Merger of two classes A and B;

(2)归并噪声点；(2) merge noise points;

(3)由噪声点产生新类。(3) Generating new classes from noise points.

进一步地，所述对低关联度因子采取等距的三角分区方式，具体步骤如下：Further, the equidistant triangular partition method is adopted for the low correlation factor, and the specific steps are as follows:

(a)给定一个d维的数据集D＝(D₁，D₂，……D_d)，设第e维上的属性值分布在区间[w_e,h_e)中，其中e＝1,2,……，d，则S＝[w₁,h₁)×[w₂,h₂)×...[w_d,h_d)表示该e维数据空间；(a) Given a d-dimensional data set D=(D₁ , D₂ ,...D_d ), let the attribute values on the e-th dimension be distributed in the interval [w_e , he_e ), where e=1 ,2,...,d, then S=[w₁ ,h₁ )×[w₂ ,h₂ )×...[w_d ,h_d ) represents the e-dimensional data space;

(b)将数据空间的每一维划分成长度相等的t段，这样来自于每一维的一个段相交形成一个矩形，称为网格单元；网格单元C_e表示为C_e1,C_e2,...C_ed，C_ej＝[w_ej,h_ej)是一个左闭右开的区间，表示D_j(j＝1，2，……，d)上的一个段；(b) Divide each dimension of the data space into t segments of equal length, so that a segment from each dimension intersects to form a rectangle, which is called a grid unit; the grid unit C_e is expressed as C_e1 , C_e2 ,...C_ed , C_ej =[w_ej ,h_ej ) is a left-closed right-open interval, representing a segment on D_j (j=1,2,...,d);

(c)再取每个矩形的与X轴正方向成锐角的对角线，对矩形进行分割，形成三角单元T_e(T_e1,T_e2,...T_ed)。(c) Take the diagonal of each rectangle forming an acute angle with the positive direction of the X-axis, and divide the rectangle to form triangular units T_e (T_e1 ,T_e2 ,...T_ed ).

进一步地，所述两个类A和B合并，必须满足当且仅当：Further, the combination of the two classes A and B must satisfy if and only if:

(a)A、B分别处于相邻的两个分区P_A、P_B中；(a) A and_{B are located in two adjacent partitions PA and P B}_respectively ;

(b)设Eps(P_A)，Eps(P_B)分别是P_A,P_B的Eps邻域半径值，Eps(P_A,P_B)＝min{Eps(P_A)，Eps(P_B)}，E_A,E_B分别是在局部聚类过程中保存的P_A和P_B的边界区域点集，q∈E_B，满足DISTANCE{p,q}≤Eps(P_A,P_B)。(b) Let Eps(P_A ), Eps(P_B ) be the Eps neighborhood radius values of P_A and P_B respectively, Eps(P_A , P_B )=min{Eps(P_A ), Eps(P_B )}, E_A , E_B are the boundary area point sets of PA and_{P B}_saved in the local clustering process respectively, q∈E_B , satisfying DISTANCE{p,q}≤Eps(P_A ,P_B ).

进一步地，所述归并噪声点，必须满足当且仅当：Further, the merging of noise points must satisfy if and only if:

(a)噪声点P_N和类O分别处于相邻的两个分区中；(a) The noise point P_N and the class O are respectively in two adjacent partitions;

(b)E_O是局部聚类过程中保存在类O所在分区的边界区域点集,P_O是类O所处的分区，满足DISTANCE{P_N,P}≤Eps(P_O)。(b) E_O is the point set of the boundary area stored in the partition where class O is located in the local clustering process, and P_O is the partition where class O is located, satisfying DISTANCE{P_N ,P}≤Eps(P_O ).

本发明与现有技术相比，其显著优点为：(1)对流数据的态势因子进行深入研究，筛选出对安全态势影响最大的几个因子，并计算其关联度，仅对影响力大的因子进行分析处理，降低了数据处理的工作量；(2)利用单因子方差分析方法，对筛选出的因子进行统计区分，分为高关联度因子和低关联度因子，并采用区别处理，提高了聚类的效率；(3)对低关联度因子的数据分区方式采用了以网格分区为基础分割三角的分区方式，利用坐标矩形分区方法使数据分布更均匀，再次进行三角分割使得聚类边界从只有水平和垂直边界扩展到带角度边界，聚类质量得到提高。Compared with the prior art, the present invention has the following significant advantages: (1) Conduct in-depth research on the situational factors of the streaming data, screen out several factors that have the greatest impact on the security situation, and calculate their degree of correlation, only for those with large influence The factors are analyzed and processed, which reduces the workload of data processing; (2) Use the single-factor analysis of variance method to statistically distinguish the screened factors, divide them into high-correlation factors and low-correlation factors, and use differential processing to improve (3) For the data partition method of the low correlation factor, the partition method of dividing the triangle based on the grid partition is adopted, and the coordinate rectangle partition method is used to make the data distribution more uniform, and the triangular partition is performed again to make the clustering The boundaries are expanded from only horizontal and vertical boundaries to angled boundaries, and the clustering quality is improved.

附图说明Description of drawings

图1为本发明的流数据多分区聚类预处理方法流程图。FIG. 1 is a flow chart of the multi-partition clustering preprocessing method for stream data in the present invention.

图2为本发明的筛选态势因子并处理的方法流程图。Fig. 2 is a flow chart of the method for screening and processing situational factors of the present invention.

具体实施方式detailed description

为了更好地理解本发明，下面结合附图对本发明的内容做进一步的说明。In order to better understand the present invention, the content of the present invention will be further described below in conjunction with the accompanying drawings.

结合图1，本发明流数据的多分区聚类预处理方法，改进基于密度的聚类算法的分区方式，包括如下步骤：In conjunction with Fig. 1, the multi-partition clustering preprocessing method of flow data of the present invention improves the partitioning method of the density-based clustering algorithm, including the following steps:

流数据的态势因子是从入侵检测日志、主机设备运行状态、节点流量监控设备、实时告警系统获得的多源异质观测数据。影响安全态势的因子较多，本发明利用灰色理论中的灰色关联度分析法，分析因子与网络安全态势的关联度，选取对网络安全态势有主要影响的因子。结合图2，具体步骤如下：The situational factor of streaming data is multi-source heterogeneous observation data obtained from intrusion detection logs, host equipment running status, node traffic monitoring equipment, and real-time alarm system. There are many factors affecting the security situation. The present invention utilizes the gray relational degree analysis method in the gray theory to analyze the correlation between factors and the network security situation, and select the factors that have a major impact on the network security situation. Combined with Figure 2, the specific steps are as follows:

(1a)确定流数据态势因子的范围，在该范围内进行筛选；(1a) Determine the range of the flow data situation factor, and perform screening within this range;

(1b)选择一个态势因子，整理生成用随时刻K变化的态势数列和态势因子数列分别作为参考数列和比较数列；(1b) Select a situation factor, arrange and generate the situation series that changes with time K and the series of situation factors Respectively as a reference sequence and a comparison sequence;

(1c)对所述参考数列和比较数列进行无量纲化处理；(1c) performing dimensionless processing on the reference sequence and comparison sequence;

(1d)确定K时刻所述参考数列与比较数列的灰色关联系数ε_i(K)，公式如下：(1d) Determine the gray relational coefficient ε_i (K) of the reference sequence and the comparison sequence at the K moment, the formula is as follows:

(1e)计算态势因子与网络安全态势的关联度ρ_0,i，公式如下：(1e) Calculate the correlation degree ρ_0,i between the situation factor and the network security situation, the formula is as follows:

(1f)对流数据态势因子进行筛选，当所述关联度ρ_0,i大于阈值的时候，选用该流数据态势因子，否则剔除。(1f) Screening the flow data situation factor, when the correlation degree ρ_0,i is greater than the threshold, select the flow data situation factor, otherwise eliminate it.

重复(1b)至(1f)，直至筛选完所有因子。Repeat (1b) to (1f) until all factors are screened.

高维数据分区：对于高维数据，对筛选出的态势因子进行单因子方差分析；经过统计分析，若因子A不显著，则标记为高关联度因子，若A显著，则标记为低关联度因子；其中，对高关联度因子采用基于分布的分区方式，对低关联度因子采取等距的三角分区方式；High-dimensional data partitioning: For high-dimensional data, conduct one-way analysis of variance on the selected situation factors; after statistical analysis, if factor A is not significant, it will be marked as a high-correlation factor, and if A is significant, it will be marked as a low-correlation factor Factors; Among them, the distribution-based partitioning method is adopted for the high correlation factor, and the equidistant triangular partition method is adopted for the low correlation factor;

低维数据分区：对于低维数据采用随机取样算法对整个数据库进行取样，采用直方图方法统计分析采样点投影在X轴和Y轴上的数据分布特性，确定在哪一维上进行分区、分区的数量以及分区的边界。Low-dimensional data partition: For low-dimensional data, use random sampling algorithm to sample the entire database, use histogram method to statistically analyze the data distribution characteristics of the sampling points projected on the X-axis and Y-axis, and determine which dimension to partition and partition and the boundaries of the partitions.

所述对低关联度因子采取等距的三角分区方式，即对低关联度因子的数据，在第e维上先进行等距的矩形分区，再对每个矩形单元进行对角分割，形成三角分区。具体步骤：The low-correlation factor adopts equidistant triangular partition method, that is, for the data of the low-correlation factor, equidistant rectangular partition is first performed on the e-th dimension, and then each rectangular unit is diagonally divided to form a triangle partition. Specific steps:

(2a)给定一个d维的数据集D＝(D₁，D₂，……D_d)，设第e维上的属性值分布在区间[w_e,h_e)中，其中e＝1,2,……，d，则S＝[w₁,h₁)×[w₂,h₂)×...[w_d,h_d)表示该e维数据空间；(2a) Given a d-dimensional data set D=(D₁ , D₂ ,...D_d ), let the attribute values on the e-th dimension be distributed in the interval [we_e , he_e ), where e=1 ,2,...,d, then S=[w₁ ,h₁ )×[w₂ ,h₂ )×...[w_d ,h_d ) represents the e-dimensional data space;

(2b)将数据空间的每一维划分成长度相等的t段，这样来自于每一维的一个段相交形成一个矩形，称为网格单元；网格单元C_e表示为C_e1,C_e2,...C_ed，C_ej＝[w_ej,h_ej)是一个左闭右开的区间，表示D_j(j＝1·2，……，d)上的一个段；(2b) Divide each dimension of the data space into t segments of equal length, so that a segment from each dimension intersects to form a rectangle, which is called a grid unit; the grid unit C_e is expressed as C_e1 , C_e2 ,...C_ed , C_ej =[w_ej ,h_ej ) is a left-closed right-open interval, representing a segment on D_j (j=1·2,...,d);

(2c)再取每个矩形的与X轴正方向成锐角的对角线，对矩形进行分割，形成三角单元T_e(T_e1,T_e2,...T_ed)。(2c) Take the diagonal of each rectangle forming an acute angle with the positive direction of the X-axis, and divide the rectangle to form triangular units T_e (T_e1 , T_e2 ,...T_ed ).

密度可达对象的获取是通过不断执行区域查询来实现的。一个区域查询返回指定区域中的所有对象。为了有效地执行区域查询，DBSCAN算法使用了空间查询R*树结构。在进行聚类前，必须建立针对所有数据的R*树。The acquisition of density-reachable objects is achieved by continuously executing region queries. A range query returns all objects in a specified range. In order to efficiently perform region queries, the DBSCAN algorithm uses a spatial query R* tree structure. Before clustering, an R* tree for all data must be established.

为了确定取值，DBSCAN计算任意对象与它的第k个最近的对象之间的距离。然后，距离由小到大排序，并绘出排序后的k-dist图。k-dist图的横坐标表示数据对象与它的第k个最近的对象间的距离；纵坐标表示对应于某一k-dist距离值的数据对象的个数。To determine the value, DBSCAN computes the distance between any object and its kth closest object. Then, the distances are sorted from small to large, and the sorted k-dist graph is drawn. The abscissa of the k-dist graph represents the distance between a data object and its kth closest object; the ordinate represents the number of data objects corresponding to a certain k-dist distance value.

根据k-dist图，选定一个比较合适的Eps值。According to the k-dist diagram, select a more appropriate Eps value.

具体步骤：Specific steps:

(3a)构建分区的R*树，用于执行区域查询；(3a) constructing a partitioned R* tree for performing region queries;

(3b)建立k-dist图；(3b) Establish a k-dist graph;

(3c)选取分区的Eps值；(3c) select the Eps value of the partition;

(3d)采用DBSCAN算法对分区数据聚类，记录可能在合并阶段有用的类的代表点和边界信息：(3d) Use the DBSCAN algorithm to cluster the partition data, and record the representative points and boundary information of the classes that may be useful in the merging stage:

(3d1)检测数据库中尚未检查过的对象x，如果x未被处理(归入某个簇或者标记为噪声)，则检查其邻域，若其邻域包含的对象数不小于minPts，则建立新簇Z，将该邻域中的所有点加入候选集M，否则将x标记为噪声；(3d1) Detect an unchecked object x in the database. If x has not been processed (classified into a certain cluster or marked as noise), check its neighborhood. If the number of objects contained in its neighborhood is not less than minPts, establish For a new cluster Z, add all points in the neighborhood to the candidate set M, otherwise mark x as noise;

(3d2)对候选集M中所有尚未被处理的对象y，检查其邻域，若其中包含的对象数不小于minPts，则将这些对象加入M；如果y未归入任何一个簇，则将y加入Z；(3d2) For all unprocessed objects y in the candidate set M, check their neighborhoods, if the number of objects contained in them is not less than minPts, then add these objects to M; if y is not classified into any cluster, then add y Join Z;

(3d3)重复步骤(3d2)，直到候选集M为空；(3d3) Repeat step (3d2) until the candidate set M is empty;

(3d4)重复步骤(3d1)-(3d3)，直到所有对象都归入了某个簇或标记为噪声。(3d4) Steps (3d1)-(3d3) are repeated until all objects are classified into a certain cluster or marked as noise.

(4a)类的合并Merger of classes (4a)

两个类A和B合并，必须满足当且仅当：Two classes A and B are merged if and only if:

(4a1)A、B分别处于相邻的两个分区P_A、P_B中；(4a1) A and_{B are located in two adjacent partitions PA and P B}_respectively ;

(4a2)设Eps(P_A)，Eps(P_B)分别是P_A,P_B的Eps邻域半径值，Eps(P_A,P_B)＝min{Eps(P_A)，Eps(P_B)}，E_A,E_B分别是在局部聚类过程中保存的P_A和P_B的边界区域点集，q∈E_B，满足DISTANCE{p,q}≤Eps(P_A,P_B)(4a2) Suppose Eps(PA), Eps(P_B ) are the Eps neighborhood radius values of PA and P_B respectively, Eps(PA,_{P B}₎₌ min{Eps(PA₎ , Eps(_P_B )}, E_A , E_B are the boundary area point sets of PA and_{P B}_saved in the local clustering process respectively, q∈E_B , satisfying DISTANCE{p,q}≤Eps(P_A ,P_B )

(4b)归并噪声点：处在分区线附近的噪声点可能是全局中某个类的边界点，因此必须考虑将这些临时噪声点归到相邻分区的某个类中。一个噪声点被归纳入一个类，当且仅当：(4b) Merging noise points: The noise points near the partition line may be the boundary points of a certain class in the whole world, so it must be considered to classify these temporary noise points into a certain class of the adjacent partition. A noise point is classified into a class if and only if:

(4b1)噪声点P_N和类O分别处于相邻的两个分区中；(4b1) The noise point P_N and the class O are respectively in two adjacent partitions;

(4b2)E_O是局部聚类过程中保存在类O所在分区的边界区域点集，P_O是类O所处的分区，满足DISTANCE{P_N，P}≤Eps(P_O)。(4b2) E_O is the point set of the boundary area stored in the partition where class O is located during the local clustering process, and P_O is the partition where class O is located, satisfying DISTANCE{P_N , P}≤Eps(P_O ).

(4c)由噪声点产生新类：在数据分区阶段，一些小的聚类可能被划分到不同分区中去。由于处在每个分区中的数据量太小，在各分区的局部聚类过程中均被标记为噪声点，因此还需要对这些点进行处理。设P_M，P_N为两个相邻分区，为这两个分区在边界区域点集。考虑和中噪声点的集合为SN，若且p_u≠p_v。满足DISTANCE{P_u，P_v}≤Eps(P_M，P_N)，则认为P_u是全局中的一个新类的核心点，{P_v}(v＝1，2，…，mp)是这新类的其它点。剩余噪声点按(4b)进行处理。(4c) Generating new classes from noise points: In the data partition stage, some small clusters may be divided into different partitions. Since the amount of data in each partition is too small, they are marked as noise points in the local clustering process of each partition, so these points also need to be processed. Let_PM and_PN be two adjacent partitions, Set points in the boundary region for both partitions. consider with The set of noise points in the medium is SN, if And p_u ≠ p_v . If DISTANCE {P_u , P_v }≤Eps(P_M , P_N ), then P_u is considered to be the core point of a new class in the global, and {P_v }(v=1, 2, ..., mp) is Other points of this new class. The remaining noise is processed according to (4b).

综上，本发明对流数据的多分区聚类预处理方法，改善了面对大量流数据分布不均匀时，聚类精度不高的问题。相较于DBSCAN，这种改进的基于密度的聚类算法的数据分区方式，在一定程度上改善了聚类速度和聚类质量，减少了数据预处理的工作量，提高聚类效率，实现对实时流量的分布式并行预处理。In summary, the multi-partition clustering preprocessing method for streaming data in the present invention improves the problem of low clustering accuracy when faced with uneven distribution of a large amount of streaming data. Compared with DBSCAN, this improved density-based clustering algorithm's data partitioning method improves the clustering speed and clustering quality to a certain extent, reduces the workload of data preprocessing, improves clustering efficiency, and realizes Distributed parallel preprocessing of real-time traffic.