CN105160347A


Info

Publication number: CN105160347A
Application number: CN201510393861.5A
Authority: CN
Inventors: 刘文婷
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2015-07-07
Filing date: 2015-07-07
Publication date: 2015-12-16

Abstract


The invention discloses a method for detecting outlier data in large-scale high-dimensional data, belonging to the technical field of outlier data mining. The method comprises the following steps: (1) calculate the average cosine distance of each data point; (2) calculate the cosine distances of each data point; (3) calculate the average gap of the cosine distances of each data point; (4) classify the average gaps, and select the few points with the smallest average gap as the outliers with the highest degree of outlierness; (5) determine the outlier data. The invention can efficiently and quickly find the outlier data hidden in large-scale high-dimensional data.

Description


A method for detecting outlier data in large-scale high-dimensional data

Technical field

The invention relates to the technical field of outlier data mining, and in particular to a method for detecting outlier data in large-scale high-dimensional data.

Background

Outlier data mining is currently one of the active research topics in data mining and is widely used in network traffic intrusion detection, credit card fraud detection, detection of abnormal behavior in video surveillance, and similar fields. Existing outlier mining techniques are mainly based on the concepts of distance or nearest neighbors. In high-dimensional data, if the neighbors of a data point are still examined using high-dimensional spatial distance and nearest-neighbor concepts, most of the data end up being judged as outliers. If detection is instead based on the cosine distance of vectors, the outliers hidden in high-dimensional data can be found: the angles between the vectors from an outlier to the other points vary little, whereas a non-outlier is surrounded by data points, so the angles between the vectors from it to the other points vary widely. The magnitude of this angle variation can therefore be used to find the outliers hidden in high-dimensional data.

Summary of the invention

The invention proposes a method for detecting outlier data in large-scale high-dimensional data. It can efficiently and quickly find the outlier data hidden in large-scale high-dimensional data, and can be widely applied to high-dimensional data in credit card fraud detection, detection of abnormal behavior in video surveillance, network traffic intrusion detection, and similar fields.

To achieve the above object, the technical solution adopted by the invention is:

A method for detecting outlier data in large-scale high-dimensional data, comprising the following steps:

(1) Calculate the average cosine distance of each data point in the large-scale high-dimensional data; that is, for each data point A, calculate the average of the cosine distances between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to every pair of other points B and C;

(2) Calculate the cosine distances of each data point A;

(3) Calculate the average gap of all cosine distances of each data point A;

(4) Classify the average gaps of the cosine distances, and select the few points with the smallest average gap as the outliers with the highest degree of outlierness;

(5) Determine the outliers.

The aforementioned step (1) comprises the following steps:

1-1) Formalize the data set. The large-scale high-dimensional data is formalized as follows:

For a given large-scale high-dimensional data set D ⊆ R^d, the norm ||·|| is defined as R^d → R⁺ and the inner product <·,·> is defined as R^d × R^d → R;

for points A, B ∈ D, $\overline{AB}$ denotes the vector from A to B;

where R^d denotes the d-dimensional real space, R⁺ denotes the positive real numbers, R^d → R⁺ denotes a mapping from elements of the d-dimensional real space to the positive reals, and R^d × R^d → R denotes the inner product of two vectors in the d-dimensional real space;

1-2) For every point A in the large-scale high-dimensional data set D, calculate the sum of the cosine distances of the angles between the vectors from A to every pair of other points, denoted M_θ(A), by the formula:

B ∈ D, C ∈ D, with B ∈ D\{A}, C ∈ D\{A,B}

$M_{\theta}(A) = \sum_{A \in D,\, B \in D \setminus \{A\},\, C \in D \setminus \{A,B\}} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}$

where $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of the vectors $\overline{AB}$ and $\overline{AC}$, and $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the vectors $\overline{AB}$ and $\overline{AC}$, respectively;
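As a reading aid, the summand can be rewritten in terms of the angle between the two vectors, using the standard identity ⟨u, v⟩ = ‖u‖‖v‖ cos θ. The symbol θ for the angle at A between $\overline{AB}$ and $\overline{AC}$ is introduced here for illustration and assumes neither vector is zero:

```latex
\frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}
= \frac{\|\overline{AB}\| \, \|\overline{AC}\| \cos\theta}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}
= \frac{\cos\theta}{\|\overline{AB}\| \cdot \|\overline{AC}\|}
```

Each term is therefore the cosine of the angle at A, down-weighted by the lengths of both vectors, so pairs of distant points contribute less to the sum.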

1-3) Calculate the average cosine distance $\overline{M_{\theta}(A)}$ of each point A in the large-scale high-dimensional data set D by the formula:

B ∈ D, C ∈ D, with B ∈ D\{A}, C ∈ D\{A,B}

$\overline{M_{\theta}(A)} = \frac{M_{\theta}(A)}{\frac{1}{2}(n-1)(n-2)} = \frac{2 M_{\theta}(A)}{(n-1)(n-2)}.$
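Steps 1-2) and 1-3) can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the sum is taken over unordered pairs {B, C} (an assumption chosen so that the number of terms matches the normalizer ½(n−1)(n−2)), and the data set is assumed to contain at least three distinct points so that no vector has zero length.

```python
from itertools import combinations


def sub(p, q):
    """Return the vector from q to p, i.e. p - q."""
    return [pi - qi for pi, qi in zip(p, q)]


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def cosine_sum_and_mean(data, a_index):
    """Return (M_theta(A), mean of M_theta(A)) for the point data[a_index]."""
    a = data[a_index]
    others = [p for i, p in enumerate(data) if i != a_index]
    total = 0.0
    # Assumption: each unordered pair {B, C} is counted once.
    for b, c in combinations(others, 2):
        ab, ac = sub(b, a), sub(c, a)
        # <AB, AC> / (|AB|^2 * |AC|^2); dot(v, v) is the squared norm.
        total += dot(ab, ac) / (dot(ab, ab) * dot(ac, ac))
    n = len(data)
    mean = 2.0 * total / ((n - 1) * (n - 2))
    return total, mean
```

For the four corners of the unit square and A at the origin, the three pair terms are 0, 0.5 and 0.5, so the sum is 1 and the mean is 1/3.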

The aforementioned step (2) calculates the cosine distances of data point A; that is, for each data point A, calculate the cosine distance $M_{\theta}(\overline{BAC})$ between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to any two points B and C, by the formula:

B ∈ D, C ∈ D, with B ∈ D\{A}, C ∈ D\{A,B}

$M_{\theta}(\overline{BAC}) = \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}.$
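A single term of this formula can be computed as in the sketch below. The function name `weighted_cosine` is hypothetical, and the code assumes B and C both differ from A so that the denominator is non-zero.

```python
def weighted_cosine(a, b, c):
    """Return <AB, AC> / (|AB|^2 * |AC|^2) for the angle at point a."""
    ab = [bi - ai for ai, bi in zip(a, b)]  # vector from A to B
    ac = [ci - ai for ai, ci in zip(a, c)]  # vector from A to C
    inner = sum(x * y for x, y in zip(ab, ac))
    norm_ab_sq = sum(x * x for x in ab)
    norm_ac_sq = sum(y * y for y in ac)
    return inner / (norm_ab_sq * norm_ac_sq)
```

For example, with A = (0, 0), B = (1, 0) and C = (1, 1) the inner product is 1 and the squared norms are 1 and 2, giving 0.5; perpendicular vectors give 0.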

The aforementioned step (3) calculates the average gap ΔM_θ(A) of all cosine distances of each data point A; that is, it accumulates the absolute values of the differences between each cosine distance $M_{\theta}(\overline{BAC})$ obtained in step 2) and the average cosine distance $\overline{M_{\theta}(A)}$ obtained in step 1), by the formula:

$\Delta M_{\theta}(A) = \sum_{B \in D \setminus \{A\},\, C \in D \setminus \{A,B\}} \left| M_{\theta}(\overline{BAC}) - \overline{M_{\theta}(A)} \right|.$
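The accumulation in step (3) can be sketched as follows. The names are hypothetical, and as an assumption the pairs {B, C} are taken as unordered so that the mean agrees with the ½(n−1)(n−2) normalizer of step 1-3). Following the formula, the result is a sum of absolute deviations and is not divided by the number of pairs.

```python
from itertools import combinations


def weighted_cosine(a, b, c):
    """Return <AB, AC> / (|AB|^2 * |AC|^2) for the angle at point a."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ac = [ci - ai for ai, ci in zip(a, c)]
    inner = sum(x * y for x, y in zip(ab, ac))
    return inner / (sum(x * x for x in ab) * sum(y * y for y in ac))


def average_gap(data, a_index):
    """Return Delta M_theta(A): the sum of |term - mean| over pairs {B, C}."""
    a = data[a_index]
    others = [p for i, p in enumerate(data) if i != a_index]
    terms = [weighted_cosine(a, b, c) for b, c in combinations(others, 2)]
    mean = sum(terms) / len(terms)  # len(terms) == (n - 1) * (n - 2) / 2
    return sum(abs(t - mean) for t in terms)
```

For the four corners of the unit square and A at the origin, the terms are 0, 0.5 and 0.5 with mean 1/3, so the gap is 1/3 + 1/6 + 1/6 = 2/3.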

The aforementioned step (4) comprises the following steps:

4-1) Sort the average gaps of the cosine distances of all points from step (3) in ascending order to obtain the average-gap sequence L;

4-2) Divide the average-gap sequence L into two classes, C_A and C_B.

The classification algorithm is as follows: compare consecutive values in the average-gap sequence L in order; if the change in value is greater than a threshold ε, that value and all values after it are assigned to class C_B, where ε is determined by the user. That is,

C_A = Φ, C_B = L

If d = |l_{i+1} − l_i| < ε, then C_A = C_A ∪ {l_i};

otherwise, C_B = C_B \ {l_i},

where l_i denotes the i-th value in the average-gap sequence L and Φ denotes the empty set.
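One possible reading of the split in step 4-2) is sketched below. It follows the prose description: the values before the first jump larger than ε form C_A, and the value after the jump together with everything that follows forms C_B. The function name is hypothetical, and the handling of a jump exactly equal to ε (kept in C_A here) is an assumption, since the text leaves that case open.

```python
def split_sorted_gaps(gaps, eps):
    """Split the average gaps into (c_a, c_b) at the first jump larger than eps."""
    seq = sorted(gaps)  # step 4-1): ascending average-gap sequence L
    for i in range(len(seq) - 1):
        if abs(seq[i + 1] - seq[i]) > eps:
            # Values up to l_i stay in C_A; l_{i+1} and all later values form C_B.
            return seq[: i + 1], seq[i + 1:]
    # No jump exceeds eps: every value remains in C_A and C_B is empty.
    return seq, []
```

For instance, `split_sorted_gaps([5.0, 0.1, 5.2, 0.2, 4.9], 1.0)` separates the two small values from the three large ones, because the first jump above 1.0 in the sorted sequence lies between 0.2 and 4.9.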

The aforementioned step (5) determines the outliers as follows:

Check the class C_A obtained in step (4). If the number of data items in C_A is greater than a threshold δ, no outliers are detected in the large-scale high-dimensional data; otherwise, the points corresponding to all data in C_A are outliers, where δ is set by the user.
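The decision rule of step (5) is a single size check, sketched below. The function name is hypothetical, and C_A is assumed to be represented as a list of point indices rather than gap values so that the outliers can be reported directly.

```python
def decide_outliers(c_a_indices, delta):
    """Return the outlier indices, or an empty list if C_A holds more than delta items."""
    if len(c_a_indices) > delta:
        # A large low-gap class is treated as "no outliers detected".
        return []
    return list(c_a_indices)
```

With δ = 5, a class C_A containing two points is reported as two outliers, while a class containing six points yields no outliers.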

Compared with the prior art, the invention has positive and evident effects. The invention has the following advantages:

The method for detecting outlier data in large-scale high-dimensional data provided by the invention is based on the cosine distance of vector angles and effectively overcomes the "curse of dimensionality" that affects outlier detection methods based on high-dimensional distance and nearest neighbors. The invention can be widely applied to high-dimensional data in credit card fraud detection, detection of abnormal behavior in video surveillance, network traffic intrusion detection, and similar fields.

Description of the drawings

FIG. 1 is a flow chart of the method of the invention for detecting outlier data in large-scale high-dimensional data.

Detailed description

The invention is further described below with reference to the accompanying drawing and specific embodiments:

The method of the invention for detecting outlier data in large-scale high-dimensional data, as shown in FIG. 1, comprises the following steps:

1) Calculate the average cosine distance of each data point in the large-scale high-dimensional data; that is, for each data point A, calculate the average of the cosine distances between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to every pair of other points B and C;

To obtain the average cosine distance of each data point, a formal description of the large-scale high-dimensional data is needed, together with the methods for calculating the cosine distance of a vector angle and the average cosine distance of a data point. These are, respectively:

1-1) Formalize the data set. The large-scale high-dimensional data can be formalized as follows:

for points A, B ∈ D, $\overline{AB}$ denotes the vector from A to B,

where R^d denotes the d-dimensional real space, R⁺ denotes the positive real numbers, R^d → R⁺ denotes a mapping from elements of the d-dimensional real space to the positive reals, and R^d × R^d → R denotes the inner product of two vectors in the d-dimensional real space.

1-2) For every point A in the large-scale high-dimensional data set D, calculate the sum of the cosine distances of the angles between the vectors from A to every pair of other points, denoted M_θ(A), by the formula:

B ∈ D, C ∈ D, with B ∈ D\{A}, C ∈ D\{A,B}

$M_{\theta}(A) = \sum_{A \in D,\, B \in D \setminus \{A\},\, C \in D \setminus \{A,B\}} \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}$

where $\langle \overline{AB}, \overline{AC} \rangle$ denotes the inner product of the vectors $\overline{AB}$ and $\overline{AC}$, and $\|\overline{AB}\|$ and $\|\overline{AC}\|$ denote the norms of the vectors $\overline{AB}$ and $\overline{AC}$, respectively.

1-3) Calculate the average cosine distance $\overline{M_{\theta}(A)}$ of each point A in the large-scale high-dimensional data set D by the formula:

B ∈ D, C ∈ D, with B ∈ D\{A}, C ∈ D\{A,B}

$\overline{M_{\theta}(A)} = \frac{M_{\theta}(A)}{\frac{1}{2}(n-1)(n-2)} = \frac{2 M_{\theta}(A)}{(n-1)(n-2)},$

where n denotes the number of data points in the large-scale high-dimensional data set D.
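As a quick check of the normalizer, consider a hypothetical data set of n = 5 points. Each point A has n − 1 = 4 other points, which form ½ · 4 · 3 = 6 unordered pairs {B, C}:

```latex
n = 5: \quad \tfrac{1}{2}(n-1)(n-2) = \tfrac{1}{2} \cdot 4 \cdot 3 = 6,
\qquad
\overline{M_{\theta}(A)} = \frac{M_{\theta}(A)}{6} = \frac{2\, M_{\theta}(A)}{4 \cdot 3}
```

The denominator is thus the number of unordered pairs, which suggests the average is meant to count each pair {B, C} once.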

2) Calculate the cosine distances of each data point A; that is, for each data point A, calculate the cosine distance $M_{\theta}(\overline{BAC})$ between the vectors $\overline{AB}$ and $\overline{AC}$ formed from A to any two other points B and C, by the formula:

B ∈ D, C ∈ D, with B ∈ D\{A}, C ∈ D\{A,B}

$M_{\theta}(\overline{BAC}) = \frac{\langle \overline{AB}, \overline{AC} \rangle}{\|\overline{AB}\|^{2} \cdot \|\overline{AC}\|^{2}}.$

3) Calculate the average gap ΔM_θ(A) of all cosine distances of each data point A; that is, accumulate the absolute values of the differences between each cosine distance $M_{\theta}(\overline{BAC})$ obtained in step 2) and the average cosine distance $\overline{M_{\theta}(A)}$ obtained in step 1), by the formula:

$\Delta M_{\theta}(A) = \sum_{B \in D \setminus \{A\},\, C \in D \setminus \{A,B\}} \left| M_{\theta}(\overline{BAC}) - \overline{M_{\theta}(A)} \right|.$

4) Classify the average gaps of the cosine distances, and select the few points with the smallest average gap as the outliers with the highest degree of outlierness, comprising the following steps:

4-1) Sort the average gaps of the cosine distances of all points from step 3) in ascending order to obtain the average-gap sequence L,

where, because outliers in high-dimensional data have small average gaps, the sequence L has the following characteristic: a small portion of the values are small, while the vast majority of the remaining values are large;

4-2) Divide the data sequence L into two classes, C_A and C_B, where C_A is the class with smaller values and C_B is the class with larger values.

The classification algorithm is as follows: compare consecutive values in the data sequence L in order; if the change in value is greater than a threshold ε, that value and all values after it are assigned to class C_B, where ε can be determined by the user. That is,

C_A = Φ, C_B = L

If d = |l_{i+1} − l_i| < ε, then C_A = C_A ∪ {l_i};

otherwise, C_B = C_B \ {l_i},

5) Determine the outliers as follows:

Check the class C_A obtained in step 4). If the number of data items in C_A is greater than a threshold δ, no outliers are detected in the large-scale high-dimensional data; otherwise, the points corresponding to all data in C_A are outliers, where δ can be set by the user.
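Steps 1) to 5) can be combined into a short end-to-end sketch. This is an illustration under stated assumptions rather than the patented implementation: the function names, the toy data and the threshold values are hypothetical, pairs {B, C} are treated as unordered, all points are assumed distinct, and C_A is taken to be the points before the first jump larger than ε in the sorted gaps. Evaluating every pair for every point costs on the order of n³ term evaluations, so this brute-force form is only practical for small n.

```python
from itertools import combinations


def weighted_cosine(a, b, c):
    """Return <AB, AC> / (|AB|^2 * |AC|^2) for the angle at point a."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ac = [ci - ai for ai, ci in zip(a, c)]
    inner = sum(x * y for x, y in zip(ab, ac))
    return inner / (sum(x * x for x in ab) * sum(y * y for y in ac))


def average_gap(data, a_index):
    """Steps 1) to 3): sum of |term - mean| over unordered pairs {B, C}."""
    a = data[a_index]
    others = [p for i, p in enumerate(data) if i != a_index]
    terms = [weighted_cosine(a, b, c) for b, c in combinations(others, 2)]
    mean = sum(terms) / len(terms)
    return sum(abs(t - mean) for t in terms)


def detect_outliers(data, eps, delta):
    """Steps 4) and 5): return the indices of detected outliers (possibly empty)."""
    gaps = [average_gap(data, i) for i in range(len(data))]
    order = sorted(range(len(data)), key=lambda i: gaps[i])
    c_a = list(order)  # if no jump exceeds eps, every point stays in C_A
    for k in range(len(order) - 1):
        if gaps[order[k + 1]] - gaps[order[k]] > eps:
            c_a = order[: k + 1]  # points before the first large jump
            break
    return [] if len(c_a) > delta else c_a


# Toy data: five points near the unit square and one far-away point (index 5).
data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [100.0, 100.0]]
print(detect_outliers(data, eps=1.0, delta=2))
```

On this toy data the far-away point has an average gap close to zero while the clustered points have gaps of roughly 4 or more, so the first large jump isolates index 5. The thresholds ε and δ remain user choices, as the description states, and would need tuning for real data.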