Technical Field
The invention relates to the field of outlier data mining, and in particular to a method for detecting outlier data in large-scale high-dimensional data.
Background
Outlier data mining is currently one of the active research topics in data mining and is widely applied to network traffic intrusion detection, credit card fraud detection, abnormal behavior detection in video surveillance, and similar fields. Existing outlier mining methods rely mainly on distance or nearest-neighbor concepts. In high-dimensional data, if the neighbors of a point are still examined using high-dimensional distances and nearest neighbors, most of the data end up being judged as outliers. If detection is instead based on the cosine distance between vectors, the outliers hidden in high-dimensional data can be found: the angles between the vectors from an outlier to the other points vary little, whereas a non-outlier is surrounded by data points, so the angles between the vectors from it to the other points vary widely. The magnitude of this angular variation therefore reveals the outliers hidden in high-dimensional data.
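The contrast described above can be illustrated on toy data. The following is a minimal sketch and not part of the claimed method: the 5x5 grid, the isolated point at (20, 20), and the helper name `pair_cosines` are assumptions chosen purely for illustration. It compares how widely the pairwise cosines vary when viewed from an interior point and from an isolated point.

```python
import math
from itertools import combinations

def pair_cosines(a, others):
    """Cosines of the angles at `a` between the vectors to every pair of other points."""
    cosines = []
    for b, c in combinations(others, 2):
        ab = [bi - ai for ai, bi in zip(a, b)]
        ac = [ci - ai for ai, ci in zip(a, c)]
        dot = sum(x * y for x, y in zip(ab, ac))
        cosines.append(dot / (math.hypot(*ab) * math.hypot(*ac)))
    return cosines

# Hypothetical toy data: a 5x5 grid of ordinary points plus one isolated point.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
center = (2.0, 2.0)       # surrounded by other points
far_point = (20.0, 20.0)  # isolated from the grid

center_cos = pair_cosines(center, [p for p in grid if p != center])
far_cos = pair_cosines(far_point, grid)

# Spread of the cosines seen from each point.
center_range = max(center_cos) - min(center_cos)
far_range = max(far_cos) - min(far_cos)
print(center_range, far_range)
```

Seen from the interior point, the other points lie in every direction, so the cosines span almost the whole interval from -1 to 1. Seen from the isolated point, all other points lie in nearly the same direction, so the cosines stay close to 1.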
Summary of the Invention
The invention proposes a method for detecting outlier data in large-scale high-dimensional data. The method efficiently and quickly finds the outliers hidden in such data and is broadly applicable to high-dimensional data in credit card fraud detection, abnormal behavior detection in video surveillance, network traffic intrusion detection, and similar tasks.
To achieve the above object, the invention adopts the following technical solution:
A method for detecting outlier data in large-scale high-dimensional data, comprising the following steps:
(1) Compute the mean cosine distance of each data point in the large-scale high-dimensional data: for each data point A, compute the mean of the cosine distances between the vectors $\overrightarrow{AB}$ and $\overrightarrow{AC}$ from A to every pair of other points B and C;
(2) Compute the cosine distances of each data point A;
(3) Compute the average spacing of all cosine distances of each data point A;
(4) Partition the average spacings into classes, and select the few points with the smallest average spacing as the outliers with the highest degree of outlierness;
(5) Determine the outliers.
Step (1) comprises the following steps:
1-1) Formalize the data set. The large-scale high-dimensional data are formalized as follows:
For a given large-scale high-dimensional data set $D \subseteq \mathbb{R}^d$, the norm $\|\cdot\|$ is defined as $\mathbb{R}^d \to \mathbb{R}^+$ and the inner product $\langle\cdot,\cdot\rangle$ is defined as $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$;
for points $A, B \in D$, $\overrightarrow{AB}$ denotes the vector $B - A$,
where $\mathbb{R}^d$ denotes the d-dimensional real space, $\mathbb{R}^+$ denotes the positive reals, $\mathbb{R}^d \to \mathbb{R}^+$ denotes a mapping from elements of the d-dimensional real space to the positive reals, and $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ denotes the inner product of two vectors in the d-dimensional real space;
1-2) For every point A in the large-scale high-dimensional data set D, compute the sum of the cosine distances of the angles between the vectors from A to each pair of other points, denoted $M_\theta(A)$:
$$M_\theta(A) = \sum_{B \in D\setminus\{A\}} \; \sum_{C \in D\setminus\{A,B\}} \frac{\langle \overrightarrow{AB}, \overrightarrow{AC} \rangle}{\|\overrightarrow{AB}\| \, \|\overrightarrow{AC}\|}$$
where $\langle \overrightarrow{AB}, \overrightarrow{AC} \rangle$ denotes the inner product of the vectors $\overrightarrow{AB}$ and $\overrightarrow{AC}$, and $\|\overrightarrow{AB}\|$ and $\|\overrightarrow{AC}\|$ denote their norms;
1-3) Compute the mean cosine distance of each point A in the large-scale high-dimensional data set D, denoted $\overline{M_\theta}(A)$, by dividing $M_\theta(A)$ by the number of pairs $(B, C)$ with $B \in D\setminus\{A\}$, $C \in D\setminus\{A,B\}$:
$$\overline{M_\theta}(A) = \frac{M_\theta(A)}{(n-1)(n-2)}$$
where n denotes the number of data points in D.
Step (2) computes the cosine distances of data point A: for each data point A, compute the cosine distance between the vectors $\overrightarrow{AB}$ and $\overrightarrow{AC}$ from A to every two other points B and C:
$$\cos(\overrightarrow{AB}, \overrightarrow{AC}) = \frac{\langle \overrightarrow{AB}, \overrightarrow{AC} \rangle}{\|\overrightarrow{AB}\| \, \|\overrightarrow{AC}\|}, \qquad B \in D\setminus\{A\},\ C \in D\setminus\{A,B\}$$
Step (3) computes the average spacing $\Delta M_\theta(A)$ of all cosine distances of each data point A: it accumulates, over all pairs of other points, the absolute value of the difference between each cosine distance obtained in step (2) and the mean cosine distance obtained in step (1).
Step (4) comprises the following steps:
4-1) Sort the average spacings of all points obtained in step (3) in ascending order to obtain the average-spacing sequence L;
4-2) Partition the average-spacing sequence L into two classes, $C_A$ and $C_B$.
The classification algorithm compares consecutive entries of L in order. If the change in value exceeds a threshold $\varepsilon$, that entry and all entries after it are assigned to class $C_B$, where $\varepsilon$ is chosen by the user. That is:
$C_A = \Phi$, $C_B = L$
If $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$;
otherwise, $C_B = C_B \setminus \{l_i\}$,
where $l_i$ denotes the i-th entry of the average-spacing sequence L and $\Phi$ denotes the empty set.
Step (5) determines the outliers as follows:
Examine the class $C_A$ obtained in step (4). If the number of entries in $C_A$ exceeds a threshold $\delta$, no outliers are detected in the large-scale high-dimensional data; otherwise, the points corresponding to all entries in $C_A$ are outliers, where $\delta$ is set by the user.
Compared with the prior art, the invention has positive and evident effects. It has the following advantages:
The method provided by the invention for detecting outlier data in large-scale high-dimensional data is based on the cosine distance of vector angles. It effectively overcomes the "curse of dimensionality" that affects outlier detection methods based on high-dimensional distance and nearest neighbors, and it can be widely applied to high-dimensional data in credit card fraud detection, abnormal behavior detection in video surveillance, network traffic intrusion detection, and similar tasks.
Brief Description of the Drawings
FIG. 1 is a flow chart of the method of the invention for detecting outlier data in large-scale high-dimensional data.
Detailed Description
The invention is further described below with reference to the drawing and specific embodiments:
As shown in FIG. 1, the method of the invention for detecting outlier data in large-scale high-dimensional data comprises the following steps:
1) Compute the mean cosine distance of each data point in the large-scale high-dimensional data: for each data point A, compute the mean of the cosine distances between the vectors $\overrightarrow{AB}$ and $\overrightarrow{AC}$ from A to every pair of other points B and C;
To obtain the mean cosine distance of each data point, a formal description of the large-scale high-dimensional data, the cosine distance of the angle between vectors, and the method for computing the mean cosine distance of a data point are needed. They are given as follows:
1-1) Formalize the data set. Large-scale high-dimensional data can be formalized as follows:
For a given large-scale high-dimensional data set $D \subseteq \mathbb{R}^d$, the norm $\|\cdot\|$ is defined as $\mathbb{R}^d \to \mathbb{R}^+$ and the inner product $\langle\cdot,\cdot\rangle$ is defined as $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$;
for points $A, B \in D$, $\overrightarrow{AB}$ denotes the vector $B - A$,
where $\mathbb{R}^d$ denotes the d-dimensional real space, $\mathbb{R}^+$ denotes the positive reals, $\mathbb{R}^d \to \mathbb{R}^+$ denotes a mapping from elements of the d-dimensional real space to the positive reals, and $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ denotes the inner product of two vectors in the d-dimensional real space.
1-2) For every point A in the large-scale high-dimensional data set D, compute the sum of the cosine distances of the angles between the vectors from A to each pair of other points, denoted $M_\theta(A)$:
$$M_\theta(A) = \sum_{B \in D\setminus\{A\}} \; \sum_{C \in D\setminus\{A,B\}} \frac{\langle \overrightarrow{AB}, \overrightarrow{AC} \rangle}{\|\overrightarrow{AB}\| \, \|\overrightarrow{AC}\|}$$
where $\langle \overrightarrow{AB}, \overrightarrow{AC} \rangle$ denotes the inner product of the vectors $\overrightarrow{AB}$ and $\overrightarrow{AC}$, and $\|\overrightarrow{AB}\|$ and $\|\overrightarrow{AC}\|$ denote their norms.
1-3) Compute the mean cosine distance of each point A in the large-scale high-dimensional data set D, denoted $\overline{M_\theta}(A)$, by dividing $M_\theta(A)$ by the number of pairs $(B, C)$ with $B \in D\setminus\{A\}$, $C \in D\setminus\{A,B\}$:
$$\overline{M_\theta}(A) = \frac{M_\theta(A)}{(n-1)(n-2)}$$
where n denotes the number of data points in the large-scale high-dimensional data set D.
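Steps 1-2) and 1-3) can be sketched for a single point as follows. This is an illustrative sketch under stated assumptions rather than the patent's implementation: the function names `cosine` and `cosine_sum_and_mean`, the `ordered` flag, and the sample points are hypothetical, and all points are assumed to be distinct so that no vector has zero norm. With `ordered=True` the sum runs over every ordered pair (B, C), mirroring the index sets $B \in D\setminus\{A\}$, $C \in D\setminus\{A,B\}$; with `ordered=False` each unordered pair is counted once.

```python
import math
from itertools import combinations, permutations

def cosine(a, b, c):
    """Cosine of the angle at a between the vectors a->b and a->c."""
    ab = [y - x for x, y in zip(a, b)]
    ac = [y - x for x, y in zip(a, c)]
    dot = sum(p * q for p, q in zip(ab, ac))
    norm_ab = math.sqrt(sum(p * p for p in ab))
    norm_ac = math.sqrt(sum(q * q for q in ac))
    return dot / (norm_ab * norm_ac)

def cosine_sum_and_mean(a, others, ordered=True):
    """Sum and mean of the cosines at a over pairs (B, C) of other points."""
    pairs = permutations(others, 2) if ordered else combinations(others, 2)
    values = [cosine(a, b, c) for b, c in pairs]
    total = sum(values)
    return total, total / len(values)

# Hypothetical sample: A at the origin and three unit vectors around it.
a = (0.0, 0.0)
others = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
sum_ordered, mean_ordered = cosine_sum_and_mean(a, others, ordered=True)
sum_unordered, mean_unordered = cosine_sum_and_mean(a, others, ordered=False)
print(sum_ordered, sum_unordered, mean_ordered, mean_unordered)
```

Because the cosine is symmetric in B and C, the ordered sum is twice the unordered sum while the two means coincide. An implementation that only needs the mean can therefore visit each unordered pair once and halve the work.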
2) Compute the cosine distances of each data point A: for each data point A, compute the cosine distance between the vectors $\overrightarrow{AB}$ and $\overrightarrow{AC}$ from A to every two other points B and C:
$$\cos(\overrightarrow{AB}, \overrightarrow{AC}) = \frac{\langle \overrightarrow{AB}, \overrightarrow{AC} \rangle}{\|\overrightarrow{AB}\| \, \|\overrightarrow{AC}\|}, \qquad B \in D\setminus\{A\},\ C \in D\setminus\{A,B\}$$
3) Compute the average spacing $\Delta M_\theta(A)$ of all cosine distances of each data point A: accumulate, over all pairs of other points, the absolute value of the difference between each cosine distance obtained in step 2) and the mean cosine distance obtained in step 1).
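A possible reading of step 3) is sketched below. Since the source formula is not reproduced here, the normalization is an assumption: the accumulated absolute differences are divided by the number of pairs, which gives a mean absolute deviation. Dividing by a constant does not change the ranking of points within one data set, but it does change the scale on which a threshold would be chosen. The function name `average_spacing` and the toy data are hypothetical, and each unordered pair is visited once.

```python
import math
from itertools import combinations

def average_spacing(points):
    """Mean absolute deviation of the pairwise cosines seen from each point."""
    spacings = []
    for i, a in enumerate(points):
        others = points[:i] + points[i + 1:]
        cosines = []
        for b, c in combinations(others, 2):
            ab = [y - x for x, y in zip(a, b)]
            ac = [y - x for x, y in zip(a, c)]
            dot = sum(p * q for p, q in zip(ab, ac))
            cosines.append(dot / (math.hypot(*ab) * math.hypot(*ac)))
        mean = sum(cosines) / len(cosines)  # mean cosine distance of step 1)
        # Assumed normalization: divide the accumulated differences by the pair count.
        spacings.append(sum(abs(v - mean) for v in cosines) / len(cosines))
    return spacings

# Hypothetical toy data: a 5x5 grid plus one isolated point (index 25).
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
points = grid + [(20.0, 20.0)]
spacings = average_spacing(points)
outlier_index = min(range(len(points)), key=lambda i: spacings[i])
print(outlier_index, spacings[outlier_index])
```

The loop over all pairs costs on the order of n squared operations per point, so this direct form is only practical for small data sets.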
4) Partition the average spacings into classes, and select the few points with the smallest average spacing as the outliers with the highest degree of outlierness. This comprises the following steps:
4-1) Sort the average spacings of all points obtained in step 3) in ascending order to obtain the average-spacing sequence L.
Because the average spacing of outliers in high-dimensional data is small, the sequence L has the following characteristic: a small fraction of its entries have small values, while the vast majority have large values;
4-2) Partition the sequence L into two classes, $C_A$ and $C_B$, where $C_A$ is the class with the smaller values and $C_B$ is the class with the larger values.
The classification algorithm compares consecutive entries of the sequence L in order. If the change in value exceeds a threshold $\varepsilon$, that entry and all entries after it are assigned to class $C_B$, where $\varepsilon$ can be chosen by the user. That is:
$C_A = \Phi$, $C_B = L$
If $d = |l_{i+1} - l_i| < \varepsilon$, then $C_A = C_A \cup \{l_i\}$;
otherwise, $C_B = C_B \setminus \{l_i\}$,
where $l_i$ denotes the i-th entry of the average-spacing sequence L and $\Phi$ denotes the empty set.
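The partition in step 4-2) can be sketched as follows. The pseudocode leaves open which class receives the entry just before the jump, so this sketch follows the prose description: everything before the first jump goes to $C_A$, and the entry after the jump together with all later entries goes to $C_B$. The function name `split_by_gap` and the sample values are hypothetical, and a jump is taken to be a difference of at least $\varepsilon$, the complement of the condition $d < \varepsilon$.

```python
def split_by_gap(sorted_spacings, eps):
    """Split an ascending sequence at the first consecutive jump of at least eps."""
    for i in range(len(sorted_spacings) - 1):
        if abs(sorted_spacings[i + 1] - sorted_spacings[i]) >= eps:
            # Entries up to index i form C_A; the rest form C_B.
            return sorted_spacings[: i + 1], sorted_spacings[i + 1:]
    # No jump found: every entry ends up in C_A and C_B is empty.
    return list(sorted_spacings), []

# Hypothetical sorted average spacings with one clear jump.
c_a, c_b = split_by_gap([0.01, 0.02, 0.5, 0.52, 0.6], eps=0.1)
print(c_a, c_b)

# Without a jump, C_A absorbs the whole sequence.
all_a, empty_b = split_by_gap([0.1, 0.15, 0.2], eps=0.1)
print(all_a, empty_b)
```

When no jump is found, $C_A$ holds every entry, which is the situation the size check in step 5) is designed to catch.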
5) Determine the outliers as follows:
Examine the class $C_A$ obtained in step 4). If the number of entries in $C_A$ exceeds a threshold $\delta$, no outliers are detected in the large-scale high-dimensional data; otherwise, the points corresponding to all entries in $C_A$ are outliers, where $\delta$ can be set by the user.
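Steps 4) and 5) can be combined into one routine that maps average spacings back to point indices. This is a sketch under assumptions, not the patent's implementation: the function name `detect_outliers`, the sample spacings, and the values of `eps` and `delta` are hypothetical, and a jump is taken to be a difference of at least `eps`.

```python
def detect_outliers(spacings, eps, delta):
    """Return the indices of points judged to be outliers from their average spacings."""
    # Step 4-1: order the point indices by ascending average spacing.
    order = sorted(range(len(spacings)), key=lambda i: spacings[i])
    cut = len(order)  # if no jump is found, every point falls into C_A
    # Step 4-2: cut at the first consecutive jump of at least eps.
    for k in range(len(order) - 1):
        if spacings[order[k + 1]] - spacings[order[k]] >= eps:
            cut = k + 1
            break
    c_a = order[:cut]
    # Step 5: a C_A larger than delta means no outliers were detected.
    return [] if len(c_a) > delta else c_a

# Hypothetical average spacings for five points; points 1 and 4 stand out.
spacings = [0.40, 0.02, 0.45, 0.42, 0.03]
found = detect_outliers(spacings, eps=0.1, delta=2)
none_found = detect_outliers(spacings, eps=0.1, delta=1)
no_jump = detect_outliers([0.40, 0.41, 0.42], eps=0.1, delta=2)
print(found, none_found, no_jump)
```

Tracking indices rather than values keeps the link between each entry of L and the data point it came from, which is what step 5) needs in order to report the outlying points themselves.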
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510393861.5A | 2015-07-07 | 2015-07-07 | Method for detecting outlier data of large-scale high dimension data |
| Publication Number | Publication Date |
|---|---|
| CN105160347A | 2015-12-16 |
| Code | Title | Description |
|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2015-12-16 |