Movatterモバイル変換


[0]ホーム

URL:


CN106570104A - Multi-partition clustering preprocessing method of stream data - Google Patents

Multi-partition clustering preprocessing method of stream data
Download PDF

Info

Publication number
CN106570104A
CN106570104ACN201610934684.1ACN201610934684ACN106570104ACN 106570104 ACN106570104 ACN 106570104ACN 201610934684 ACN201610934684 ACN 201610934684ACN 106570104 ACN106570104 ACN 106570104A
Authority
CN
China
Prior art keywords
partition
data
clustering
situation
factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610934684.1A
Other languages
Chinese (zh)
Other versions
CN106570104B (en
Inventor
王烁
李千目
戚湧
王印海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and TechnologyfiledCriticalNanjing University of Science and Technology
Priority to CN201610934684.1ApriorityCriticalpatent/CN106570104B/en
Publication of CN106570104ApublicationCriticalpatent/CN106570104A/en
Application grantedgrantedCritical
Publication of CN106570104BpublicationCriticalpatent/CN106570104B/en
Activelegal-statusCriticalCurrent
Anticipated expirationlegal-statusCritical

Links

Classifications

Landscapes

Abstract

Translated fromChinese

本发明公开了一种流数据的多分区聚类预处理方法,步骤包括:对流数据的态势因子进行筛选并计算关联度:对流数据进行统计分析,对高维数据库中高关联度因子采用低维数据库分区方法,低关联度因子采用等距三角分区;对低维数据库采用基于分布的分区方法;在每一规则分区内采用基于密度的聚类算法DBSCAN算法进行聚类;将各局部聚类进行合并。本发明对流数据的聚类预处理进行了多分区的改进,使数据分布更均匀,聚类结果更精确,实现了对数据的分布式并行处理,缓解了面对顺序、大量、快速、连续到达的数据序列时数据预处理效率不高的压力。

The invention discloses a multi-partition clustering preprocessing method for streaming data. The steps include: screening the situational factors of the streaming data and calculating the correlation degree; performing statistical analysis on the streaming data, and adopting a low-dimensional database for the high correlation factor in the high-dimensional database Partitioning method, the low correlation factor adopts equidistant triangular partitioning; the low-dimensional database adopts the distribution-based partitioning method; uses the density-based clustering algorithm DBSCAN algorithm for clustering in each regular partition; merges the local clusters . The present invention improves the multi-partition clustering preprocessing of stream data, makes the data distribution more uniform, and the clustering results are more accurate, realizes the distributed parallel processing of data, and eases the problem of sequential, massive, fast and continuous arrival. The data preprocessing efficiency is not high when the data sequence is under pressure.

Description

Translated fromChinese
一种流数据的多分区聚类预处理方法A multi-partition clustering preprocessing method for streaming data

技术领域technical field

本发明涉及DBSCAN聚类算法技术领域,特别是一种流数据的多分区聚类预处理方法。The invention relates to the technical field of DBSCAN clustering algorithm, in particular to a multi-partition clustering preprocessing method for streaming data.

背景技术Background technique

随着大数据时代的到来,数据逐渐已数据流的形式传递。流数据具有以下四个特点:1)数据实时到达;2)数据到达次序独立,不受应用系统所控制;3)数据规模宏大且不能预知其最大值;4)数据一经处理,除非特意保存,否则不能被再次取出处理,或者再次提取数据代价昂贵。流数据的以上特点,也随之带来安全审计数据的膨胀。数据挖掘本身是一项通用的知识发现技术,其目的是要从海量数据中提取出我们所感兴趣的数据信息(知识),因此将数据挖掘技术应用于对审计数据的分析可以从包含了大量冗余信息的数据中提取出尽可能多的隐藏的安全信息。数据挖掘技术中包括统计、分类、聚类、关联、序列分析、群集分析等方法。其中,数据挖掘中的关联规则算法、序列模式算法和分类算法已经用在了日志分析和入侵检测中,但是这些方法在正确分区上仍然存在着很多不足之处。采用聚类分析日志是解决数据量大,信息匮乏,发现新入侵模式的一种可取的数据预处理方法。改进了传统的生成包分析,SVM等小样本取样的数据挖掘的不足。大规模数据库的聚类时,数据分区可以带来聚类结果的优化,因此,要提高聚类效率,合理的进行数据分区是关键。With the advent of the era of big data, data has gradually been transmitted in the form of data streams. Streaming data has the following four characteristics: 1) The data arrives in real time; 2) The order of data arrival is independent and not controlled by the application system; 3) The data scale is huge and its maximum value cannot be predicted; 4) Once the data is processed, unless it is specially saved, Otherwise, it cannot be retrieved and processed again, or it is expensive to retrieve the data again. The above characteristics of streaming data also bring about the expansion of security audit data. Data mining itself is a common knowledge discovery technology, its purpose is to extract the data information (knowledge) we are interested in from massive data, so the application of data mining technology to the analysis of audit data can contain a lot of redundant information. Extract as much hidden security information as possible from the rest of the data. Data mining techniques include statistics, classification, clustering, association, sequence analysis, cluster analysis and other methods. Among them, association rule algorithms, sequential pattern algorithms and classification algorithms in data mining have been used in log analysis and intrusion detection, but these methods still have many deficiencies in correct partitioning. The use of cluster analysis logs is a desirable data preprocessing method to solve the large amount of data, lack of information, and discover new intrusion patterns. It improves the deficiencies of traditional generation package analysis, small sample sampling data mining such as SVM. When clustering large-scale databases, data partitioning can lead to optimization of clustering results. Therefore, to improve clustering efficiency, reasonable data partitioning is the key.

改进数据分区是聚类算法的一个重要子课题,具有广阔的应用前景,较高的学术价值和理论研究意义。在通用的基于密度的聚类算法DBSCAN算法中,存在着两个弱点:一是在对数据聚类过程中要将整个数据库装入内存,当数据量很大时,算法的效率会急剧下降。二是当空间聚类的密度不均匀时,聚类间距离相差很大,聚类质量较差。在改进的基于网格和基于密度的聚类算法中,也仍然存在处理的数据量范围庞大,导致聚类效率不高的问题Improving data partition is an important sub-topic of clustering algorithm, which has broad application prospects, high academic value and theoretical research significance. In the general density-based clustering algorithm DBSCAN algorithm, there are two weaknesses: one is that the entire database needs to be loaded into the memory during the data clustering process, and when the amount of data is large, the efficiency of the algorithm will drop sharply. Second, when the density of spatial clusters is not uniform, the distances between clusters vary greatly, and the clustering quality is poor. In the improved grid-based and density-based clustering algorithms, there is still a problem that the processing data volume range is huge, resulting in low clustering efficiency.

发明内容Contents of the invention

本发明的目的在于提供一种使数据分布更均匀、聚类结果更精确的流数据的多分区聚类预处理方法,从而实现对数据的分布式并行处理。The purpose of the present invention is to provide a multi-partition clustering preprocessing method for stream data that makes data distribution more uniform and clustering results more accurate, thereby realizing distributed parallel processing of data.

实现本发明目的的技术解决方案为:The technical solution that realizes the object of the present invention is:

一种流数据的多分区聚类预处理方法,包括以下步骤:A multi-partition clustering preprocessing method for stream data, comprising the following steps:

步骤1,确定流数据态势因子的范围,并根据态势因子与网络安全态势的关联度,对流数据态势因子进行筛选;Step 1. Determine the scope of the flow data situation factors, and screen the flow data situation factors according to the correlation between the situation factors and the network security situation;

步骤2,根据筛选得到的态势因子,将流数据库划分为多个分区;Step 2. Divide the flow database into multiple partitions according to the situational factors obtained through screening;

步骤3,对每个分区,分别使用DBSCAN算法进行局部聚类;Step 3, for each partition, use the DBSCAN algorithm to perform local clustering;

步骤4,合并各局部聚类,得到流数据的多分区聚类预处理结果。Step 4, merge the local clusters to obtain the multi-partition clustering preprocessing results of the flow data.

进一步地,步骤1所述确定流数据态势因子的范围,并根据态势因子与网络安全态势的关联度,对流数据态势因子进行筛选,具体如下:Further, as described in step 1, determine the scope of the flow data situation factors, and screen the flow data situation factors according to the degree of correlation between the situation factors and the network security situation, as follows:

(1)确定流数据态势因子的范围;(1) Determine the range of the flow data situation factor;

(2)用随时刻K变化的态势数列和态势因子数列分别作为参考数列和比较数列;(2) Use the situation sequence that changes with time K and the series of situation factors Respectively as a reference sequence and a comparison sequence;

(3)对所述参考数列和比较数列进行无量纲化处理;(3) carry out dimensionless processing to described reference sequence and comparison sequence;

(4)确定K时刻所述参考数列与比较数列的灰色关联系数εi(K),公式如下:(4) Determine the gray relational coefficient εi (K) of the reference sequence and the comparison sequence at the K moment, the formula is as follows:

式中,ρ是分辨系数,ρ∈[0,∞);K=1,2,…,L表示L个时刻中第K个时刻;i=1,2,...m表示第i个因子;In the formula, ρ is the resolution coefficient, ρ∈[0, ∞); K=1, 2,..., L represents the Kth moment in the L moments; i=1, 2,...m represents the i-th factor ;

(5)计算态势因子与网络安全态势的关联度ρ0,i,公式如下:(5) Calculate the correlation degree ρ0,i between the situation factor and the network security situation, the formula is as follows:

(6)对流数据态势因子进行筛选,当所述关联度ρ0,i大于阈值的时候,选用该流数据态势因子,否则剔除。(6) Screening the flow data situation factor, when the correlation degree ρ0,i is greater than the threshold, select the flow data situation factor, otherwise eliminate it.

进一步地,步骤2所述根据筛选得到的态势因子,将流数据库划分为多个分区,具体为:Further, in step 2, according to the situation factors obtained through screening, the flow database is divided into multiple partitions, specifically:

(1)高维数据分区;(1) High-dimensional data partitioning;

对于高维数据,对筛选出的态势因子进行单因子方差分析;经过统计分析,若因子A不显著,则标记为高关联度因子,若A显著,则标记为低关联度因子;其中,对高关联度因子采用基于分布的分区方式,对低关联度因子采取等距的三角分区方式;For high-dimensional data, conduct single-factor analysis of variance on the selected situation factors; after statistical analysis, if factor A is not significant, it will be marked as a high-correlation factor, and if A is significant, it will be marked as a low-correlation factor; among them, for The distribution-based partition method is adopted for the high correlation factor, and the equidistant triangular partition method is adopted for the low correlation factor;

(2)低维数据分区(2) Low-dimensional data partition

对于低维数据采用随机取样算法对整个数据库进行取样,采用直方图方法统计分析采样点投影在X轴和Y轴上的数据分布特性,确定在哪一维上进行分区、分区的数量以及分区的边界。For low-dimensional data, the random sampling algorithm is used to sample the entire database, and the histogram method is used to statistically analyze the data distribution characteristics of the sampling points projected on the X-axis and Y-axis to determine which dimension to partition, the number of partitions, and the size of the partitions. boundary.

进一步地,步骤3所述对每个分区,分别使用DBSCAN算法进行局部聚类,具体过程如下:Further, as described in step 3, for each partition, the DBSCAN algorithm is used to perform local clustering, and the specific process is as follows:

(1)构建分区自身的R*树;(1) Build the R* tree of the partition itself;

(2)建立k-dist图;(2) Establish a k-dist graph;

(3)选取局部扫描半径Eps值,使用DBSCAN算法进行局部聚类。(3) Select the Eps value of the local scanning radius, and use the DBSCAN algorithm for local clustering.

进一步地,步骤4所述合并各局部聚类,具体过程如下:Further, as described in step 4, the local clusters are merged, and the specific process is as follows:

(1)两个类A和B的合并;(1) Merger of two classes A and B;

(2)归并噪声点;(2) merge noise points;

(3)由噪声点产生新类。(3) Generating new classes from noise points.

进一步地,所述对低关联度因子采取等距的三角分区方式,具体步骤如下:Further, the equidistant triangular partition method is adopted for the low correlation factor, and the specific steps are as follows:

(a)给定一个d维的数据集D=(D1,D2,……Dd),设第e维上的属性值分布在区间[we,he)中,其中e=1,2,……,d,则S=[w1,h1)×[w2,h2)×...[wd,hd)表示该e维数据空间;(a) Given a d-dimensional data set D=(D1 , D2 ,...Dd ), let the attribute values on the e-th dimension be distributed in the interval [we , hee ), where e=1 ,2,...,d, then S=[w1 ,h1 )×[w2 ,h2 )×...[wd ,hd ) represents the e-dimensional data space;

(b)将数据空间的每一维划分成长度相等的t段,这样来自于每一维的一个段相交形成一个矩形,称为网格单元;网格单元Ce表示为Ce1,Ce2,...Ced,Cej=[wej,hej)是一个左闭右开的区间,表示Dj(j=1,2,……,d)上的一个段;(b) Divide each dimension of the data space into t segments of equal length, so that a segment from each dimension intersects to form a rectangle, which is called a grid unit; the grid unit Ce is expressed as Ce1 , Ce2 ,...Ced , Cej =[wej ,hej ) is a left-closed right-open interval, representing a segment on Dj (j=1,2,...,d);

(c)再取每个矩形的与X轴正方向成锐角的对角线,对矩形进行分割,形成三角单元Te(Te1,Te2,...Ted)。(c) Take the diagonal of each rectangle forming an acute angle with the positive direction of the X-axis, and divide the rectangle to form triangular units Te (Te1 ,Te2 ,...Ted ).

进一步地,所述两个类A和B合并,必须满足当且仅当:Further, the combination of the two classes A and B must satisfy if and only if:

(a)A、B分别处于相邻的两个分区PA、PB中;(a) A andB are located in two adjacent partitions PA and P Brespectively ;

(b)设Eps(PA),Eps(PB)分别是PA,PB的Eps邻域半径值,Eps(PA,PB)=min{Eps(PA),Eps(PB)},EA,EB分别是在局部聚类过程中保存的PA和PB的边界区域点集,q∈EB,满足DISTANCE{p,q}≤Eps(PA,PB)。(b) Let Eps(PA ), Eps(PB ) be the Eps neighborhood radius values of PA and PB respectively, Eps(PA , PB )=min{Eps(PA ), Eps(PB )}, EA , EB are the boundary area point sets of PA andP Bsaved in the local clustering process respectively, q∈EB , satisfying DISTANCE{p,q}≤Eps(PA ,PB ).

进一步地,所述归并噪声点,必须满足当且仅当:Further, the merging of noise points must satisfy if and only if:

(a)噪声点PN和类O分别处于相邻的两个分区中;(a) The noise point PN and the class O are respectively in two adjacent partitions;

(b)EO是局部聚类过程中保存在类O所在分区的边界区域点集,PO是类O所处的分区,满足DISTANCE{PN,P}≤Eps(PO)。(b) EO is the point set of the boundary area stored in the partition where class O is located in the local clustering process, and PO is the partition where class O is located, satisfying DISTANCE{PN ,P}≤Eps(PO ).

本发明与现有技术相比,其显著优点为:(1)对流数据的态势因子进行深入研究,筛选出对安全态势影响最大的几个因子,并计算其关联度,仅对影响力大的因子进行分析处理,降低了数据处理的工作量;(2)利用单因子方差分析方法,对筛选出的因子进行统计区分,分为高关联度因子和低关联度因子,并采用区别处理,提高了聚类的效率;(3)对低关联度因子的数据分区方式采用了以网格分区为基础分割三角的分区方式,利用坐标矩形分区方法使数据分布更均匀,再次进行三角分割使得聚类边界从只有水平和垂直边界扩展到带角度边界,聚类质量得到提高。Compared with the prior art, the present invention has the following significant advantages: (1) Conduct in-depth research on the situational factors of the streaming data, screen out several factors that have the greatest impact on the security situation, and calculate their degree of correlation, only for those with large influence The factors are analyzed and processed, which reduces the workload of data processing; (2) Use the single-factor analysis of variance method to statistically distinguish the screened factors, divide them into high-correlation factors and low-correlation factors, and use differential processing to improve (3) For the data partition method of the low correlation factor, the partition method of dividing the triangle based on the grid partition is adopted, and the coordinate rectangle partition method is used to make the data distribution more uniform, and the triangular partition is performed again to make the clustering The boundaries are expanded from only horizontal and vertical boundaries to angled boundaries, and the clustering quality is improved.

附图说明Description of drawings

图1为本发明的流数据多分区聚类预处理方法流程图。FIG. 1 is a flow chart of the multi-partition clustering preprocessing method for stream data in the present invention.

图2为本发明的筛选态势因子并处理的方法流程图。Fig. 2 is a flow chart of the method for screening and processing situational factors of the present invention.

具体实施方式detailed description

为了更好地理解本发明,下面结合附图对本发明的内容做进一步的说明。In order to better understand the present invention, the content of the present invention will be further described below in conjunction with the accompanying drawings.

结合图1,本发明流数据的多分区聚类预处理方法,改进基于密度的聚类算法的分区方式,包括如下步骤:In conjunction with Fig. 1, the multi-partition clustering preprocessing method of flow data of the present invention improves the partitioning method of the density-based clustering algorithm, including the following steps:

步骤1,确定流数据态势因子的范围,并根据态势因子与网络安全态势的关联度,对流数据态势因子进行筛选;Step 1. Determine the scope of the flow data situation factors, and screen the flow data situation factors according to the correlation between the situation factors and the network security situation;

流数据的态势因子是从入侵检测日志、主机设备运行状态、节点流量监控设备、实时告警系统获得的多源异质观测数据。影响安全态势的因子较多,本发明利用灰色理论中的灰色关联度分析法,分析因子与网络安全态势的关联度,选取对网络安全态势有主要影响的因子。结合图2,具体步骤如下:The situational factor of streaming data is multi-source heterogeneous observation data obtained from intrusion detection logs, host equipment running status, node traffic monitoring equipment, and real-time alarm system. There are many factors affecting the security situation. The present invention utilizes the gray relational degree analysis method in the gray theory to analyze the correlation between factors and the network security situation, and select the factors that have a major impact on the network security situation. Combined with Figure 2, the specific steps are as follows:

(1a)确定流数据态势因子的范围,在该范围内进行筛选;(1a) Determine the range of the flow data situation factor, and perform screening within this range;

(1b)选择一个态势因子,整理生成用随时刻K变化的态势数列和态势因子数列分别作为参考数列和比较数列;(1b) Select a situation factor, arrange and generate the situation series that changes with time K and the series of situation factors Respectively as a reference sequence and a comparison sequence;

(1c)对所述参考数列和比较数列进行无量纲化处理;(1c) performing dimensionless processing on the reference sequence and comparison sequence;

(1d)确定K时刻所述参考数列与比较数列的灰色关联系数εi(K),公式如下:(1d) Determine the gray relational coefficient εi (K) of the reference sequence and the comparison sequence at the K moment, the formula is as follows:

式中,ρ是分辨系数,ρ∈[0,∞);K=1,2,…,L表示L个时刻中第K个时刻;i=1,2,...m表示第i个因子;In the formula, ρ is the resolution coefficient, ρ∈[0, ∞); K=1, 2,..., L represents the Kth moment in the L moments; i=1, 2,...m represents the i-th factor ;

(1e)计算态势因子与网络安全态势的关联度ρ0,i,公式如下:(1e) Calculate the correlation degree ρ0,i between the situation factor and the network security situation, the formula is as follows:

(1f)对流数据态势因子进行筛选,当所述关联度ρ0,i大于阈值的时候,选用该流数据态势因子,否则剔除。(1f) Screening the flow data situation factor, when the correlation degree ρ0,i is greater than the threshold, select the flow data situation factor, otherwise eliminate it.

重复(1b)至(1f),直至筛选完所有因子。Repeat (1b) to (1f) until all factors are screened.

步骤2,根据筛选得到的态势因子,将流数据库划分为多个分区;Step 2. Divide the flow database into multiple partitions according to the situational factors obtained through screening;

高维数据分区:对于高维数据,对筛选出的态势因子进行单因子方差分析;经过统计分析,若因子A不显著,则标记为高关联度因子,若A显著,则标记为低关联度因子;其中,对高关联度因子采用基于分布的分区方式,对低关联度因子采取等距的三角分区方式;High-dimensional data partitioning: For high-dimensional data, conduct one-way analysis of variance on the selected situation factors; after statistical analysis, if factor A is not significant, it will be marked as a high-correlation factor, and if A is significant, it will be marked as a low-correlation factor Factors; Among them, the distribution-based partitioning method is adopted for the high correlation factor, and the equidistant triangular partition method is adopted for the low correlation factor;

低维数据分区:对于低维数据采用随机取样算法对整个数据库进行取样,采用直方图方法统计分析采样点投影在X轴和Y轴上的数据分布特性,确定在哪一维上进行分区、分区的数量以及分区的边界。Low-dimensional data partition: For low-dimensional data, use random sampling algorithm to sample the entire database, use histogram method to statistically analyze the data distribution characteristics of the sampling points projected on the X-axis and Y-axis, and determine which dimension to partition and partition and the boundaries of the partitions.

所述对低关联度因子采取等距的三角分区方式,即对低关联度因子的数据,在第e维上先进行等距的矩形分区,再对每个矩形单元进行对角分割,形成三角分区。具体步骤:The low-correlation factor adopts equidistant triangular partition method, that is, for the data of the low-correlation factor, equidistant rectangular partition is first performed on the e-th dimension, and then each rectangular unit is diagonally divided to form a triangle partition. Specific steps:

(2a)给定一个d维的数据集D=(D1,D2,……Dd),设第e维上的属性值分布在区间[we,he)中,其中e=1,2,……,d,则S=[w1,h1)×[w2,h2)×...[wd,hd)表示该e维数据空间;(2a) Given a d-dimensional data set D=(D1 , D2 ,...Dd ), let the attribute values on the e-th dimension be distributed in the interval [wee , hee ), where e=1 ,2,...,d, then S=[w1 ,h1 )×[w2 ,h2 )×...[wd ,hd ) represents the e-dimensional data space;

(2b)将数据空间的每一维划分成长度相等的t段,这样来自于每一维的一个段相交形成一个矩形,称为网格单元;网格单元Ce表示为Ce1,Ce2,...Ced,Cej=[wej,hej)是一个左闭右开的区间,表示Dj(j=1·2,……,d)上的一个段;(2b) Divide each dimension of the data space into t segments of equal length, so that a segment from each dimension intersects to form a rectangle, which is called a grid unit; the grid unit Ce is expressed as Ce1 , Ce2 ,...Ced , Cej =[wej ,hej ) is a left-closed right-open interval, representing a segment on Dj (j=1·2,...,d);

(2c)再取每个矩形的与X轴正方向成锐角的对角线,对矩形进行分割,形成三角单元Te(Te1,Te2,...Ted)。(2c) Take the diagonal of each rectangle forming an acute angle with the positive direction of the X-axis, and divide the rectangle to form triangular units Te (Te1 , Te2 ,...Ted ).

步骤3,对每个分区,分别使用DBSCAN算法进行局部聚类;Step 3, for each partition, use the DBSCAN algorithm to perform local clustering;

密度可达对象的获取是通过不断执行区域查询来实现的。一个区域查询返回指定区域中的所有对象。为了有效地执行区域查询,DBSCAN算法使用了空间查询R*树结构。在进行聚类前,必须建立针对所有数据的R*树。The acquisition of density-reachable objects is achieved by continuously executing region queries. A range query returns all objects in a specified range. In order to efficiently perform region queries, the DBSCAN algorithm uses a spatial query R* tree structure. Before clustering, an R* tree for all data must be established.

为了确定取值,DBSCAN计算任意对象与它的第k个最近的对象之间的距离。然后,距离由小到大排序,并绘出排序后的k-dist图。k-dist图的横坐标表示数据对象与它的第k个最近的对象间的距离;纵坐标表示对应于某一k-dist距离值的数据对象的个数。To determine the value, DBSCAN computes the distance between any object and its kth closest object. Then, the distances are sorted from small to large, and the sorted k-dist graph is drawn. The abscissa of the k-dist graph represents the distance between a data object and its kth closest object; the ordinate represents the number of data objects corresponding to a certain k-dist distance value.

根据k-dist图,选定一个比较合适的Eps值。According to the k-dist diagram, select a more appropriate Eps value.

具体步骤:Specific steps:

(3a)构建分区的R*树,用于执行区域查询;(3a) constructing a partitioned R* tree for performing region queries;

(3b)建立k-dist图;(3b) Establish a k-dist graph;

(3c)选取分区的Eps值;(3c) select the Eps value of the partition;

(3d)采用DBSCAN算法对分区数据聚类,记录可能在合并阶段有用的类的代表点和边界信息:(3d) Use the DBSCAN algorithm to cluster the partition data, and record the representative points and boundary information of the classes that may be useful in the merging stage:

(3d1)检测数据库中尚未检查过的对象x,如果x未被处理(归入某个簇或者标记为噪声),则检查其邻域,若其邻域包含的对象数不小于minPts,则建立新簇Z,将该邻域中的所有点加入候选集M,否则将x标记为噪声;(3d1) Detect an unchecked object x in the database. If x has not been processed (classified into a certain cluster or marked as noise), check its neighborhood. If the number of objects contained in its neighborhood is not less than minPts, establish For a new cluster Z, add all points in the neighborhood to the candidate set M, otherwise mark x as noise;

(3d2)对候选集M中所有尚未被处理的对象y,检查其邻域,若其中包含的对象数不小于minPts,则将这些对象加入M;如果y未归入任何一个簇,则将y加入Z;(3d2) For all unprocessed objects y in the candidate set M, check their neighborhoods, if the number of objects contained in them is not less than minPts, then add these objects to M; if y is not classified into any cluster, then add y Join Z;

(3d3)重复步骤(3d2),直到候选集M为空;(3d3) Repeat step (3d2) until the candidate set M is empty;

(3d4)重复步骤(3d1)-(3d3),直到所有对象都归入了某个簇或标记为噪声。(3d4) Steps (3d1)-(3d3) are repeated until all objects are classified into a certain cluster or marked as noise.

步骤4,合并各局部聚类,得到流数据的多分区聚类预处理结果。Step 4, merge the local clusters to obtain the multi-partition clustering preprocessing results of the flow data.

(4a)类的合并Merger of classes (4a)

两个类A和B合并,必须满足当且仅当:Two classes A and B are merged if and only if:

(4a1)A、B分别处于相邻的两个分区PA、PB中;(4a1) A andB are located in two adjacent partitions PA and P Brespectively ;

(4a2)设Eps(PA),Eps(PB)分别是PA,PB的Eps邻域半径值,Eps(PA,PB)=min{Eps(PA),Eps(PB)},EA,EB分别是在局部聚类过程中保存的PA和PB的边界区域点集,q∈EB,满足DISTANCE{p,q}≤Eps(PA,PB)(4a2) Suppose Eps(PA), Eps(PB ) are the Eps neighborhood radius values of PA and PB respectively, Eps(PA,P B)= min{Eps(PA) , Eps(PB )}, EA , EB are the boundary area point sets of PA andP Bsaved in the local clustering process respectively, q∈EB , satisfying DISTANCE{p,q}≤Eps(PA ,PB )

(4b)归并噪声点:处在分区线附近的噪声点可能是全局中某个类的边界点,因此必须考虑将这些临时噪声点归到相邻分区的某个类中。一个噪声点被归纳入一个类,当且仅当:(4b) Merging noise points: The noise points near the partition line may be the boundary points of a certain class in the whole world, so it must be considered to classify these temporary noise points into a certain class of the adjacent partition. A noise point is classified into a class if and only if:

(4b1)噪声点PN和类O分别处于相邻的两个分区中;(4b1) The noise point PN and the class O are respectively in two adjacent partitions;

(4b2)EO是局部聚类过程中保存在类O所在分区的边界区域点集,PO是类O所处的分区,满足DISTANCE{PN,P}≤Eps(PO)。(4b2) EO is the point set of the boundary area stored in the partition where class O is located during the local clustering process, and PO is the partition where class O is located, satisfying DISTANCE{PN , P}≤Eps(PO ).

(4c)由噪声点产生新类:在数据分区阶段,一些小的聚类可能被划分到不同分区中去。由于处在每个分区中的数据量太小,在各分区的局部聚类过程中均被标记为噪声点,因此还需要对这些点进行处理。设PM,PN为两个相邻分区,为这两个分区在边界区域点集。考虑中噪声点的集合为SN,若且pu≠pv。满足DISTANCE{Pu,Pv}≤Eps(PM,PN),则认为Pu是全局中的一个新类的核心点,{Pv}(v=1,2,…,mp)是这新类的其它点。剩余噪声点按(4b)进行处理。(4c) Generating new classes from noise points: In the data partition stage, some small clusters may be divided into different partitions. Since the amount of data in each partition is too small, they are marked as noise points in the local clustering process of each partition, so these points also need to be processed. LetPM andPN be two adjacent partitions, Set points in the boundary region for both partitions. consider with The set of noise points in the medium is SN, if And pu ≠ pv . If DISTANCE {Pu , Pv }≤Eps(PM , PN ), then Pu is considered to be the core point of a new class in the global, and {Pv }(v=1, 2, ..., mp) is Other points of this new class. The remaining noise is processed according to (4b).

综上,本发明对流数据的多分区聚类预处理方法,改善了面对大量流数据分布不均匀时,聚类精度不高的问题。相较于DBSCAN,这种改进的基于密度的聚类算法的数据分区方式,在一定程度上改善了聚类速度和聚类质量,减少了数据预处理的工作量,提高聚类效率,实现对实时流量的分布式并行预处理。In summary, the multi-partition clustering preprocessing method for streaming data in the present invention improves the problem of low clustering accuracy when faced with uneven distribution of a large amount of streaming data. Compared with DBSCAN, this improved density-based clustering algorithm's data partitioning method improves the clustering speed and clustering quality to a certain extent, reduces the workload of data preprocessing, improves clustering efficiency, and realizes Distributed parallel preprocessing of real-time traffic.

Claims (8)

Translated fromChinese
1.一种流数据的多分区聚类预处理方法,其特征在于,包括以下步骤:1. A multi-partition clustering preprocessing method of streaming data, characterized in that, comprising the following steps:步骤1,确定流数据态势因子的范围,并根据态势因子与网络安全态势的关联度,对流数据态势因子进行筛选;Step 1. Determine the scope of the flow data situation factors, and screen the flow data situation factors according to the correlation between the situation factors and the network security situation;步骤2,根据筛选得到的态势因子,将流数据库划分为多个分区;Step 2. Divide the flow database into multiple partitions according to the situational factors obtained through screening;步骤3,对每个分区,分别使用DBSCAN算法进行局部聚类;Step 3, for each partition, use the DBSCAN algorithm to perform local clustering;步骤4,合并各局部聚类,得到流数据的多分区聚类预处理结果。Step 4, merge the local clusters to obtain the multi-partition clustering preprocessing results of the stream data.2.根据权利要求1所述的流数据的多分区聚类预处理方法,其特征在于:步骤1所述确定流数据态势因子的范围,并根据态势因子与网络安全态势的关联度,对流数据态势因子进行筛选,具体如下:2. The multi-partition clustering preprocessing method of stream data according to claim 1, characterized in that: the range of the stream data situation factor is determined in step 1, and the stream data is analyzed according to the degree of association between the situation factor and the network security situation. Situation factors are screened as follows:(1)确定流数据态势因子的范围;(1) Determine the range of the flow data situation factor;(2)用随时刻K变化的态势数列和态势因子数列分别作为参考数列和比较数列;(2) Use the situation sequence that changes with time K and the series of situation factors Respectively as a reference sequence and a comparison sequence;(3)对所述参考数列和比较数列进行无量纲化处理;(3) carry out dimensionless processing to described reference sequence and comparison sequence;(4)确定K时刻所述参考数列与比较数列的灰色关联系数εi(K),公式如下:(4) Determine the gray relational coefficient εi (K) of the reference sequence and the comparison sequence at the K moment, the formula is as follows:ϵϵii((KK))==mmiinnoiimmiinnokk||Xx00((00))((KK))--Xxii((00))((KK))||++ρρmaxmaxiimaxmaxkk||Xx00((00))((KK))--Xxii((00))((KK))||||Xx00((00))((KK))--Xxii((00))((KK))||++ρρmaxmaxiimaxmaxkk||Xx00((00))((KK))--Xxii((00))((KK))||式中,ρ是分辨系数,ρ∈[0,∞);K=1,2,…,L表示L个时刻中第K个时刻;i=1,2,...m表示第i个因子;In the formula, ρ is the resolution coefficient, ρ∈[0, ∞); K=1, 2,..., L represents the Kth moment among the L moments; i=1, 2,...m represents the i-th factor ;(5)计算态势因子与网络安全态势的关联度ρ0,i,公式如下:(5) Calculate the correlation degree ρ0,i between the situation factor and the network security situation, the formula is as follows:ρρ00,,ii==11LLΣΣKK==11LLϵϵii((KK))(6)对流数据态势因子进行筛选,当所述关联度ρ0,i大于阈值的时候,选用该流数据态势因子,否则剔除。(6) Screening the flow data situation factor, when the correlation degree ρ0,i is greater than the threshold, select the flow data situation factor, otherwise eliminate it.3.根据权利要求1所述的流数据的多分区聚类预处理方法,其特征在于,步骤2所述根据筛选得到的态势因子,将流数据库划分为多个分区,具体为:3. The multi-partition clustering preprocessing method of stream data according to claim 1, wherein, in step 2, the stream database is divided into multiple partitions according to the situation factors obtained through screening, specifically:(1)高维数据分区;(1) High-dimensional data partitioning;对于高维数据,对筛选出的态势因子进行单因子方差分析;经过统计分析,若因子A不显著,则标记为高关联度因子,若A显著,则标记为低关联度因子;其中,对高关联度因子采用基于分布的分区方式,对低关联度因子采取等距的三角分区方式;For high-dimensional data, conduct one-way analysis of variance on the screened situation factors; after statistical analysis, if factor A is not significant, it will be marked as a high-correlation factor, and if A is significant, it will be marked as a low-correlation factor; among them, for The distribution-based partition method is adopted for the high-correlation factors, and the equidistant triangular partition method is adopted for the low-correlation factors;(2)低维数据分区(2) Low-dimensional data partition对于低维数据采用随机取样算法对整个数据库进行取样,采用直方图方法统计分析采样点投影在X轴和Y轴上的数据分布特性,确定在哪一维上进行分区、分区的数量以及分区的边界。For low-dimensional data, the random sampling algorithm is used to sample the entire database, and the histogram method is used to statistically analyze the data distribution characteristics of the sampling points projected on the X-axis and Y-axis to determine which dimension to partition, the number of partitions, and the size of the partitions. boundary.4.根据权利要求1所述的流数据的多分区聚类预处理方法,其特征在于,步骤3所述对每个分区,分别使用DBSCAN算法进行局部聚类,具体过程如下:4. the multi-partition clustering preprocessing method of flow data according to claim 1, is characterized in that, for each partition described in step 3, uses DBSCAN algorithm to carry out local clustering respectively, and concrete process is as follows:(1)构建分区自身的R*树;(1) Build the R* tree of the partition itself;(2)建立k-dist图;(2) Establish a k-dist graph;(3)选取局部扫描半径Eps值,使用DBSCAN算法进行局部聚类。(3) Select the Eps value of the local scanning radius, and use the DBSCAN algorithm for local clustering.5.根据权利要求1所述的流数据的多分区聚类预处理方法,其特征在于,步骤4所述合并各局部聚类,具体过程如下:5. the multi-partition clustering preprocessing method of flow data according to claim 1, is characterized in that, described in step 4 merges each local cluster, concrete process is as follows:(1)两个类A和B的合并;(1) Merger of two classes A and B;(2)归并噪声点;(2) merge noise points;(3)由噪声点产生新类。(3) Generating new classes from noise points.6.根据权利要求3所述的流数据的多分区聚类预处理方法,其特征在于,所述对低关联度因子采取等距的三角分区方式,具体步骤如下:6. the multi-partition clustering preprocessing method of stream data according to claim 3, is characterized in that, described adopts equidistant triangular partition mode to low correlation degree factor, concrete steps are as follows:(a)给定一个d维的数据集D=(D1,D2,……Dd),设第e维上的属性值分布在区间[we,he)中,其中e=1,2,……,d,则S=[w1,h1)×[w2,h2)×...[wd,hd)表示该e维数据空间;(a) Given a d-dimensional data set D=(D1 , D2 ,...Dd ), let the attribute values on the e-th dimension be distributed in the interval [we , hee ), where e=1 , 2, ..., d, then S=[w1 ,h1 )×[w2 ,h2 )×...[wd ,hd ) represents the e-dimensional data space;(b)将数据空间的每一维划分成长度相等的t段,这样来自于每一维的一个段相交形成一个矩形,称为网格单元;网格单元Ce表示为Ce1,Ce2,...Ced,Cej=[wej,hej)是一个左闭右开的区间,表示Dj(j=1,2,……,d)上的一个段;(b) Divide each dimension of the data space into t segments of equal length, so that a segment from each dimension intersects to form a rectangle, which is called a grid unit; the grid unit Ce is expressed as Ce1 , Ce2 ,...Ced , Cej =[wej ,hej ) is a left-closed right-open interval, representing a segment on Dj (j=1,2,...,d);(c)再取每个矩形的与X轴正方向成锐角的对角线,对矩形进行分割,形成三角单元Te(Te1,Te2,...Ted)。(c) Take the diagonal of each rectangle forming an acute angle with the positive direction of the X-axis, and divide the rectangle to form triangular units Te (Te1 ,Te2 ,...Ted ).7.根据权利要求5所述的流数据的多分区聚类预处理方法,其特征在于,步骤(1)中所述两个类A和B合并,必须满足当且仅当:7. The multi-partition clustering preprocessing method of stream data according to claim 5, wherein the two classes A and B merged in step (1) must satisfy if and only if:(a)A、B分别处于相邻的两个分区PA、PB中;(a) A andB are located in two adjacent partitions PA and P Brespectively ;(b)设Eps(PA),Eps(PB)分别是PA,PB的Eps邻域半径值,Eps(PA,PB)=min{Eps(PA),Eps(PB)},EA,EB分别是在局部聚类过程中保存的PA和PB的边界区域点集,q∈EB,满足DISTANCE{p,q}≤Eps(PA,PB)。(b) Let Eps(PA ), Eps(PB ) be the Eps neighborhood radius values of PA and PB respectively, Eps(PA , PB )=min{Eps(PA ), Eps(PB )}, EA , EB are the boundary area point sets of PA andP Bsaved in the local clustering process respectively, q∈EB , satisfying DISTANCE{p, q}≤Eps(PA , PB ).8.根据权利要求5所述的流数据的多分区聚类预处理方法,其特征在于,步骤(2)所述归并噪声点,必须满足当且仅当:8. The multi-partition clustering preprocessing method of stream data according to claim 5, wherein the merging of noise points in step (2) must satisfy if and only if:(a)噪声点PN和类O分别处于相邻的两个分区中;(a) The noise point PN and the class O are respectively in two adjacent partitions;(b)EO是局部聚类过程中保存在类O所在分区的边界区域点集,PO是类O所处的分区,满足DISTANCE{PN,P}≤Eps(PO)。(b) EO is the point set of the boundary area stored in the partition where class O is located during the local clustering process, and PO is the partition where class O is located, satisfying DISTANCE{PN , P}≤Eps(PO ).
CN201610934684.1A2016-11-012016-11-01Multi-partition clustering preprocessing method for stream dataActiveCN106570104B (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN201610934684.1ACN106570104B (en)2016-11-012016-11-01Multi-partition clustering preprocessing method for stream data

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN201610934684.1ACN106570104B (en)2016-11-012016-11-01Multi-partition clustering preprocessing method for stream data

Publications (2)

Publication NumberPublication Date
CN106570104Atrue CN106570104A (en)2017-04-19
CN106570104B CN106570104B (en)2020-04-07

Family

ID=58534267

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN201610934684.1AActiveCN106570104B (en)2016-11-012016-11-01Multi-partition clustering preprocessing method for stream data

Country Status (1)

CountryLink
CN (1)CN106570104B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN107169089A (en)*2017-05-122017-09-15武汉理工大学A kind of multidimensional flow data visual analysis method based on cluster
CN107659675A (en)*2017-08-212018-02-02郑州埃文计算机科技有限公司Dynamic IP multizone localization method based on polygon positioning form
CN109068349A (en)*2018-07-122018-12-21重庆邮电大学A kind of indoor intrusion detection method based on small sample iterative migration
CN113779105A (en)*2021-08-112021-12-10桂林电子科技大学Distributed trace stream adjoint mode mining method
WO2023273727A1 (en)*2021-07-022023-01-05华为云计算技术有限公司Data processing method and apparatus, storage medium, and program product

Citations (3)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN101820357A (en)*2010-02-112010-09-01哈尔滨工业大学Network security incident visualization system
US20130246429A1 (en)*2012-03-192013-09-19Microsoft CorporationMulti-center canopy clustering
CN105306438A (en)*2015-09-172016-02-03杭州安恒信息技术有限公司Network security situation assessment method based on fuzzy rough set

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN101820357A (en)*2010-02-112010-09-01哈尔滨工业大学Network security incident visualization system
US20130246429A1 (en)*2012-03-192013-09-19Microsoft CorporationMulti-center canopy clustering
CN105306438A (en)*2015-09-172016-02-03杭州安恒信息技术有限公司Network security situation assessment method based on fuzzy rough set

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张超: ""云计算网络安全态势评估研究与分析"", 《中国优秀硕士学位论文全文数据库信息科技辑》*

Cited By (8)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN107169089A (en)*2017-05-122017-09-15武汉理工大学A kind of multidimensional flow data visual analysis method based on cluster
CN107169089B (en)*2017-05-122020-09-01武汉理工大学Clustering-based multi-dimensional flow data visual analysis method
CN107659675A (en)*2017-08-212018-02-02郑州埃文计算机科技有限公司Dynamic IP multizone localization method based on polygon positioning form
CN109068349A (en)*2018-07-122018-12-21重庆邮电大学A kind of indoor intrusion detection method based on small sample iterative migration
CN109068349B (en)*2018-07-122021-08-06重庆邮电大学 An Indoor Intrusion Detection Method Based on Iterative Migration of Small Samples
WO2023273727A1 (en)*2021-07-022023-01-05华为云计算技术有限公司Data processing method and apparatus, storage medium, and program product
CN113779105A (en)*2021-08-112021-12-10桂林电子科技大学Distributed trace stream adjoint mode mining method
CN113779105B (en)*2021-08-112022-12-13桂林电子科技大学Distributed track flow accompanying mode mining method

Also Published As

Publication numberPublication date
CN106570104B (en)2020-04-07

Similar Documents

PublicationPublication DateTitle
CN102663100B (en) A two-stage hybrid particle swarm optimization clustering method
CN106570104A (en)Multi-partition clustering preprocessing method of stream data
CN104573705B (en)A kind of clustering method of building Point Cloud of Laser Scanner
CN104112026B (en)A kind of short message text sorting technique and system
CN113537321B (en)Network flow anomaly detection method based on isolated forest and X mean value
CN104699755B (en)A kind of intelligent multiple target integrated recognition method based on data mining
CN108388603B (en)Spark framework-based distributed summary data structure construction method and query method
CN113779105B (en)Distributed track flow accompanying mode mining method
Pan et al.Mining regular behaviors based on multidimensional trajectories
CN105868266A (en)Clustering model based high-dimensional data stream outlier detection method
CN111027016B (en) A clustering analysis method for dominant occurrence of rock mass discontinuities based on weaving network algorithm
Lei et al.An incremental clustering algorithm based on grid
CN102831431A (en)Detector training method based on hierarchical clustering
CN112884013A (en)Energy consumption partitioning method based on data mining technology
CN104809210A (en)Top-k query method based on massive data weighing under distributed computing framework
CN105631465A (en)Density peak-based high-efficiency hierarchical clustering method
CN111651501A (en) A spatial aggregation scale selection method for geographic big data
CN106815320B (en)Investigation big data visual modeling method and system based on expanded three-dimensional histogram
Zygouras et al.Corridor learning using individual trajectories
CN115658737B (en)Method and system for querying space-time accompaniment of large-scale track data
CN104123382B (en)A kind of image set abstraction generating method under Social Media
CN109101998B (en) A clustering method and system based on residential context spatial information
CN107992590B (en)Big data system beneficial to information comparison
CN110765130A (en)Ripley's K function-based spatio-temporal POI data point pattern analysis method in distributed environment
CN116522381A (en)Differential privacy-based non-equilibrium position data publishing method

Legal Events

DateCodeTitleDescription
PB01Publication
PB01Publication
SE01Entry into force of request for substantive examination
SE01Entry into force of request for substantive examination
CB03Change of inventor or designer information

Inventor after:Li Qianmu

Inventor after:Wang Shuo

Inventor after:Qi Yong

Inventor after:Wang Yinhai

Inventor before:Wang Shuo

Inventor before:Li Qianmu

Inventor before:Qi Yong

Inventor before:Wang Yinhai

CB03Change of inventor or designer information
GR01Patent grant
GR01Patent grant
EE01Entry into force of recordation of patent licensing contract

Application publication date:20170419

Assignee:NANJING SINOVATIO TECHNOLOGY Co.,Ltd.

Assignor:NANJING University OF SCIENCE AND TECHNOLOGY

Contract record no.:X2022980008506

Denomination of invention:A multi partition clustering preprocessing method for stream data

Granted publication date:20200407

License type:Common License

Record date:20220622

EE01Entry into force of recordation of patent licensing contract

[8]ページ先頭

©2009-2025 Movatter.jp