Disclosure of Invention
The present invention is directed to solve at least one of the problems in the background art and provides a time series exception handling method, a time series exception handling apparatus, an electronic device, and a computer-readable storage medium.
In order to achieve the above object, the present invention provides a method for processing time series exception, comprising the following steps:
acquiring time sequence data, training the time sequence data, and constructing a model;
detecting whether abnormal data exist in the time sequence data obtained in real time according to the model, and if so, recommending part of the abnormal data;
judging whether the recommended part of abnormal data is reasonable or not, and then feeding back a judgment result;
and optimizing the model according to the judgment result, and then continuously detecting the real-time sequence data.
According to one aspect of the invention, acquiring time series data comprises acquiring regular small-scale time series data and irregular large-scale time series data, clustering all time series data when acquiring irregular large-scale time series data, and then training various types of time series data to construct a model.
According to one aspect of the invention, the clustering process is to capture the correlation among the time sequence data to be trained through DBSCAN, and cluster the data with approximate shape and consistent periodicity.
According to an aspect of the present invention, in the clustering process, in calculating the approximation degree of the time-series data, the distance between the time-series data is calculated using DTW.
According to one aspect of the invention, according to the type of the time sequence data, feature data capable of representing the corresponding type of the time sequence data is selected for training, and a model is constructed.
According to one aspect of the invention, RRCF is adopted to select all the feature data for training, all the feature data are iterated to obtain a plurality of decision trees, the decision trees form a decision forest, and then whether abnormal data exist in the real-time sequence data is determined through voting of the decision forest.
According to one aspect of the invention, when constructing the decision tree, the RRCF selects a segmentation dimension for segmenting the feature data, and the probability that feature data i is selected is p_i = (l_i + g_i) / (Σ_j l_j + Σ_j g_j), with l_i = max_{x∈S}(x_i) − min_{x∈S}(x_i) and g_i = max_j (x_j − x_{j−1}); where i is the feature data; p_i represents the probability that feature i is selected, with a value between 0 and 1; l_i represents the difference between the maximum value and the minimum value in the feature set calculated for feature i over the training sample set S; g_i represents the maximum difference between two adjacent feature values after the calculated values of feature i over the training sample set are sorted by size; Σ_j g_j represents the sum of the g_j calculated for every feature dimension j; and Σ_j l_j represents the sum of the l_j calculated for every feature dimension j.
According to one aspect of the invention, the RRCF equally divides the feature data in the segmentation dimension into N intervals [l_0, h_0], [l_1, h_1], ..., [l_{N-1}, h_{N-1}], and calculates the density of each interval d_i = Count(p, p ∈ [l_i, h_i]); the probability that each interval is selected is determined by its density d_i; finally, a cut point X_i ~ Uniform[l_i, h_i] is randomly selected from the selected interval. Here l_0 represents the minimum value of the feature in the segmentation dimension calculated over the training set, h_{N-1} represents its maximum value, and the difference between the minimum and maximum values is divided by N to obtain the N equal intervals.
According to one aspect of the present invention, when abnormal data exists, the abnormality score CoDisp of the abnormal data is calculated using the dividing points; when calculating the abnormality score, the ratio CoDisp_Node of the number of data points contained in the sibling subtree of a dividing point to the number contained in its father subtree is calculated, the largest ratio CoDisp_Node is selected on each tree, and the abnormality score CoDisp_{x_i} of the abnormal data x_i is the average of the selected CoDisp_Node values over the trees of the decision forest.
According to one aspect of the invention, recommending part of the abnormal data means selecting a plurality of the most abnormal segments in the abnormal data and recommending them after obtaining labels for those segments; or
selecting a plurality of the most uncertain segments in the abnormal data and recommending them after obtaining labels for those segments; or
dividing the abnormal data into a plurality of groups according to the abnormality scores, obtaining a plurality of segments in each group, and recommending them after obtaining labels for those segments.
According to one aspect of the invention, after the abnormal data of the n labeled segments are obtained by the model, the abnormal data and the M decision trees in the decision forest of the model jointly form an abnormality score matrix CoDisp_M[x_i][tree_j]; for each abnormal data x_i, if the fed-back judgment result is a true positive, the weight of decision tree tree_j is updated as tw_j = tw_j + δ × CoDisp_M[x_i][tree_j], and the decision trees with higher weight are selected according to the fed-back judgment results, thereby optimizing the model.
In order to achieve the above object, the present invention further provides a time-series exception handling apparatus, including:
the data processing module is used for acquiring time series data, training the time series data and constructing a model;
the abnormal data detection recommending module detects whether abnormal data exist in the time sequence data obtained in real time according to the model, and if the abnormal data exist, part of the abnormal data are recommended;
the abnormal data judgment feedback module judges whether the part of abnormal data is reasonable or not and then feeds back a judgment result;
and the model optimization module optimizes the model according to the feedback judgment result and then continuously detects the real-time sequence data.
According to an aspect of the invention, further comprising:
and the data classification processing module is used for acquiring irregular large-scale time sequence data, clustering all the time sequence data, training various time sequence data and constructing a model.
In order to achieve the above object, the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the above time-series exception handling method.
To achieve the above object, the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the above time-series exception handling method.
According to one scheme of the invention, because the number of time series to be monitored in a production environment is extremely large, and each production unit can generate dozens or even hundreds of monitoring indexes that all need to be monitored, training each time series separately in a targeted manner would require an extremely large number of models and resources, which existing operation and maintenance resources can hardly support. Therefore, the data is clustered before the targeted training stage of the index data, which greatly reduces the detection and processing time and allows anomalies to be handled quickly and accurately.
According to one scheme of the invention, a characteristic data selection stage is provided, and more appropriate characteristic data are extracted in a targeted manner according to the statistical information and characteristics of indexes, so that the accuracy of the model is improved.
According to one scheme of the invention, the 30 most abnormal segments are selected and labels for these abnormal segments are acquired, so that obvious anomalies can be further confirmed and the false positive rate reduced.
According to one scheme of the invention, the 30 most uncertain segments (i.e., those near the anomaly judgment threshold) are selected, and these labels help the model draw a clear classification boundary, thereby improving the identification accuracy of ambiguous anomalies.
According to one aspect of the invention, the abnormal data is divided into 10 groups according to the abnormality scores and at most 3 segments are obtained from each group; these labels capture the judgment feedback module's stance on different degrees of abnormality, helping the model determine the optimal threshold selection range.
According to one scheme of the invention, the invention provides an unsupervised, white-box, accurate time series exception handling method that cooperates with active learning and can actively and efficiently collect feedback information. On the basis of a traditional unsupervised learning framework, an active learning stage is introduced: anomalies are actively recommended to a judgment feedback party (such as a judgment feedback module or operation and maintenance personnel) and feedback is acquired, so that the model is corrected and its accuracy improved. The method retains the advantages of traditional unsupervised learning in terms of parameter tuning and labeling, designs the application strategy for labeled feedback in a targeted manner, and further optimizes the recall rate, detection speed and capability of the model.
According to one scheme of the invention, the processing method has no obvious bias toward particular data, can adapt to indexes with specific scene semantics, can meet operation and maintenance requirements outside the traditional Internet field, has high extensibility and universality, and can give a specific cause for each reported anomaly.
According to one aspect of the present invention, the invention is able to accurately detect and interpret anomalies. Tested on one public data set and two sets of time series data from the actual production environment of a commercial bank, it ultimately reaches F1-scores of 0.81 and 0.89 on the two data sets. Compared with traditional unsupervised exception handling methods, the best F1-score is improved by 0.19-0.5 on the two data sets, and the detection time is shortened by 58%.
Detailed Description
The content of the invention will now be discussed with reference to exemplary embodiments. It is to be understood that the embodiments discussed are merely intended to enable one of ordinary skill in the art to better understand and thus implement the teachings of the present invention, and do not imply any limitations on the scope of the invention.
As used herein, the term "include" and its variants are to be read as open-ended terms meaning "including, but not limited to". The term "based on" is to be read as "based, at least in part, on". The terms "one embodiment" and "an embodiment" are to be read as "at least one embodiment".
In view of the above-described drawbacks of the prior art described in the background art, the present invention provides a time series exception handling method, which can detect anomalies in time series data obtained in real time and optimize the detection model or give feedback according to the detection and judgment results.
FIG. 1 schematically shows a flow diagram of a method for time series exception handling according to one embodiment of the present invention. As shown in fig. 1, a time-series exception handling method according to an embodiment of the present invention includes the following steps:
a. acquiring time sequence data, training the time sequence data, and constructing a model;
b. detecting whether abnormal data exist in the time sequence data obtained in real time according to the model, and if so, recommending part of the abnormal data;
c. judging whether the recommended part of abnormal data is reasonable or not, and then feeding back a judgment result;
d. and optimizing the model according to the judgment result, and then continuously detecting the real-time sequence data.
In practice, the time series data may be represented by x, where x = {x_1, x_2, ..., x_N}, N is the length of the data x, and the data point x_t at any time t is a specific data value. The time series may be collected from many sources, such as networks, transaction links, request logs, and the like. Time series from the same source have a greater probability of having similar characteristics.
Because the number of time series to be monitored in a production environment is extremely large, and each production unit can generate dozens or even hundreds of monitoring indexes that all need to be monitored, training each time series separately in a targeted manner would require an extremely large number of models and resources, which existing operation and maintenance resources can hardly support. Therefore, the data is clustered before the targeted training stage of the index data, which greatly reduces the detection and processing time and allows anomalies to be handled quickly and accurately.
Specifically, according to an embodiment of the present invention, in step a, the clustering stage uses DBSCAN to capture the association between the time series indexes to be trained and clusters indexes with similar shapes and consistent periodicity. When calculating index similarity, the distance between indexes is calculated using DTW (Dynamic Time Warping). DBSCAN does not require predefined category information, and the clustering accuracy can be controlled by adjusting the clustering radius, so DBSCAN is very suitable for index clustering scenarios.
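As an illustration only, this clustering stage can be sketched roughly as follows, assuming the series are equal-length NumPy arrays; the hand-written DTW, the radius eps and min_samples are illustrative choices rather than values prescribed by the invention.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def cluster_series(series_list, eps=5.0, min_samples=2):
    """Cluster series whose shapes are close under DTW; label -1 marks unclustered noise."""
    k = len(series_list)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(series_list[i], series_list[j])
    # DBSCAN needs no predefined number of categories; eps plays the role of the clustering radius.
    return DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed").fit_predict(dist)
```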
Figure 2 schematically shows similar indexes collected from the same switch. As shown in fig. 2, two network traffic curves for different ports of the same switch exhibit substantially the same trend and scale. In an actual production environment, data of the same type under the same monitoring unit also exhibits this clustering characteristic; by exploiting it, the number of models generated in the model training stage can be greatly reduced, the resources consumed are reduced, and the cost-effectiveness of the operation and maintenance tool is improved. In addition, in some scenarios the amount of data is small and the accuracy requirement is high; in such cases training each series individually is more cost-effective than pre-clustering, so the clustering stage is treated as an optional step.
As can be seen from the above, in the present invention, acquiring time series data includes acquiring regular small-scale time series data and acquiring irregular large-scale time series data. Regular small-scale time series data is trained directly, whereas irregular large-scale time series data is first clustered; each type of time series data is then trained and the model is constructed.
According to one embodiment of the invention, feature data capable of characterizing the corresponding type of time series data is selected for training according to the type of the time series data, and the model is constructed. Different time series data have different characteristics. For example, percentage-type sequence data tends to stay flat, with short dips or spikes when failures occur; service-related transaction sequence data often shows periodic peaks and valleys, with small fluctuations in the case of failure; and infrastructure sequence data such as swap space may rise slowly over time. Therefore, the invention provides a feature data selection stage in which more suitable feature data is extracted in a targeted manner according to the statistical information and characteristics of the indexes, thereby improving model accuracy. The specific extraction rules are shown in the following table:
TABLE 2
In this embodiment, table 2 contains simple and effective feature data that can cover the different features of most curves, and is easy to calculate and performs well.
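Purely for illustration, the idea behind Table 2 (whose full contents are not reproduced here) can be sketched as a mapping from index type to feature set; the feature names, window and period parameters below are hypothetical stand-ins, not the actual rules of Table 2.

```python
import numpy as np

# Hypothetical per-type feature rules; the authoritative rules are those of Table 2.
FEATURE_RULES = {
    "percentage":     ["window_mean", "window_std", "diff"],        # flat curves with short dips/spikes
    "transaction":    ["window_mean", "diff", "period_offset"],     # periodic peaks and valleys
    "infrastructure": ["window_mean", "trend_slope"],               # slowly rising curves, e.g. swap space
}

def extract_features(series, index_type, window=10, period=1440):
    """Turn one time series into a feature matrix according to its (assumed) type."""
    s = np.asarray(series, dtype=float)
    pad = np.concatenate([np.repeat(s[0], window - 1), s])
    windows = np.lib.stride_tricks.sliding_window_view(pad, window)
    feats = {
        "window_mean":   windows.mean(axis=1),
        "window_std":    windows.std(axis=1),
        "diff":          np.concatenate([[0.0], np.diff(s)]),
        "period_offset": s - np.roll(s, period),   # distance to the same point one period earlier
    }
    feats["trend_slope"] = np.gradient(feats["window_mean"])  # crude local trend of the smoothed curve
    return np.column_stack([feats[name] for name in FEATURE_RULES[index_type]])
```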
According to one embodiment of the invention, RRCF is adopted to select all feature data for training, all feature data are iterated to obtain a plurality of decision trees, the decision trees form a decision forest, and then whether abnormal data exist in real-time sequence data is determined through voting of the decision forest.
When the decision tree is constructed, the RRCF selects the segmentation dimension for segmenting the feature data, and the probability of the RRCF selecting the feature data under the segmentation dimension is
g
i=max
x∈Sx
j-x
j-1(ii) a Where i is the characteristic data, p
iRepresenting the probability, that the feature i is selectedThe value is between 0 and 1; l
iRepresenting the difference between the maximum value and the minimum value of the characteristic i in a training sample set and in a characteristic set obtained by calculation; gi represents the maximum difference between two adjacent characteristic values in the characteristic set obtained by calculation after the characteristic i is sorted according to the characteristic size in the training sample set; sigma g
jRepresenting g calculated for each feature dimension j
jThe summation ∑ l
jRepresents l calculated for each feature dimension j
jAnd (6) summing.
Specifically, the unsupervised anomaly detection base algorithm selected by the invention is RRCF (Robust Random Cut Forest). Its detection effect is better than that of other unsupervised anomaly detection algorithms, but there is still a gap between its accuracy and the accuracy required in actual production use. The RRCF trains all training sample feature data in batches; each batch of feature data is iterated over multiple rounds to obtain a decision tree, and all decision trees finally form a decision forest that decides by voting whether the data is abnormal. In the process of constructing a decision tree, a segmentation dimension needs to be selected from the multiple dimensions of the feature data. The RRCF considers that segmenting on a dimension covering a larger data range distinguishes the samples better, i.e. the probability that feature i is selected is p_i = l_i / Σ_j l_j, where l_i = max_{x∈S}(x_i) − min_{x∈S}(x_i), p_i represents the probability that feature i is selected, l_i represents the difference between the maximum and minimum values of feature i, S represents the training sample set, and x_i represents the value of feature i calculated for one sample in S. However, this does not take into account the effect of the distribution of the dimension itself. According to an embodiment of the invention, when constructing a decision tree and selecting the dimension for cutting branches, in addition to the coverage range of the data of that dimension, the maximum gap of the data is used as an influence factor, i.e. the invention selects feature i with probability p_i = (l_i + g_i) / (Σ_j l_j + Σ_j g_j), where g_i = max_j (x_j − x_{j−1}) is the maximum difference between adjacent values of feature i after sorting. Thus, the larger the maximum gap of the data distribution in a dimension, the higher the degree of discrimination provided by segmenting at that gap, so the segmentation dimension is selected more effectively and model accuracy is improved.
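A minimal sketch of this split-dimension selection, under the assumed probability form p_i = (l_i + g_i) / (Σ_j l_j + Σ_j g_j) reconstructed above; the fallback for degenerate data is an implementation detail added here only for robustness.

```python
import numpy as np

def choose_split_dimension(X, rng=None):
    """Sample a split dimension with probability proportional to range plus largest gap.

    X is an (n_samples, n_features) array of feature data at one tree node.
    Assumed form: p_i = (l_i + g_i) / sum_j (l_j + g_j), where l_i is the range of
    dimension i and g_i the largest gap between adjacent sorted values of dimension i.
    """
    rng = rng or np.random.default_rng()
    l = X.max(axis=0) - X.min(axis=0)                      # l_i: range of each dimension
    if X.shape[0] > 1:
        g = np.max(np.diff(np.sort(X, axis=0), axis=0), axis=0)   # g_i: largest adjacent gap
    else:
        g = np.zeros(X.shape[1])
    weights = l + g
    if weights.sum() == 0:                                 # all points identical: fall back to uniform
        probs = np.full(X.shape[1], 1.0 / X.shape[1])
    else:
        probs = weights / weights.sum()
    return rng.choice(X.shape[1], p=probs)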
Further, when a decision tree is constructed, after each iteration determines a segmentation dimension, a suitable dividing point needs to be selected on the data of that dimension, and the left and right subtrees are divided according to that dividing point. The standard RRCF simply selects a dividing point at random after equally dividing the dimension data, without considering the distribution characteristics of the dimension. According to one embodiment of the invention, the RRCF equally divides the feature data in the segmentation dimension into N intervals [l_0, h_0], [l_1, h_1], ..., [l_{N-1}, h_{N-1}] and calculates the density of each interval, d_i = Count(p, p ∈ [l_i, h_i]); the probability that each interval is selected is determined by its density d_i, and finally a cut point X_i ~ Uniform[l_i, h_i] is randomly selected from the selected interval. Here l_0 represents the minimum value of the feature in the segmentation dimension calculated over the training set, h_{N-1} represents its maximum value, and the difference between the minimum and maximum values is divided by N to obtain the N equal intervals; for example, the left and right endpoints of the i-th interval are l_i and h_i. This selection strategy identifies the sparse part of the segmentation dimension more accurately, thereby improving the discrimination. In this embodiment, d_i represents the density of an interval, that is, the number of samples falling within its range; since the interval widths are identical, the greater the number of samples, the greater the density. Count denotes counting and p denotes each sample falling in the interval, i.e., the number of samples within the range [l_i, h_i] is counted. Uniform[l_i, h_i] denotes the uniform distribution over the interval [l_i, h_i], and X_i is a cut point drawn uniformly at random from the selected interval.
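A rough sketch of this cut-point selection; the exact form of the interval-selection probability is not spelled out in the text, so the inverse-density weight 1/(1 + d_i) below is an assumed choice that merely reflects the stated preference for sparse regions.

```python
import numpy as np

def choose_cut_point(values, n_intervals=10, rng=None):
    """Pick a cut point on one feature dimension using interval densities.

    The dimension range is split into N equal-width intervals; an interval is sampled
    with a probability that decreases with its density d_i (assumed form 1/(1 + d_i),
    normalised), then the cut point is drawn uniformly inside the chosen interval.
    """
    rng = rng or np.random.default_rng()
    edges = np.linspace(values.min(), values.max(), n_intervals + 1)   # l_0 ... h_{N-1}
    density, _ = np.histogram(values, bins=edges)                      # d_i = Count(p in [l_i, h_i])
    weights = 1.0 / (1.0 + density)                                    # favour sparse intervals (assumption)
    probs = weights / weights.sum()
    i = rng.choice(n_intervals, p=probs)
    return rng.uniform(edges[i], edges[i + 1])                         # X_i ~ Uniform[l_i, h_i]
```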
Further, when abnormal data exists, the abnormality score CoDisp of the abnormal data is calculated using the dividing points (specific nodes). When calculating the abnormality score, the ratio CoDisp_Node of the number of data points contained in the sibling subtree of a dividing point to the number contained in its father subtree is calculated; the higher the ratio, the higher the degree of abnormality of the abnormal data. Since the calculation for each abnormal data involves multiple feature data, the model moves upward step by step from the initial node during detection and, after repeated iterations, selects the largest ratio CoDisp_Node; the abnormality score CoDisp_{x_i} of the abnormal data x_i is then obtained from these selected values. The abnormality score CoDisp_{x_i} indicates the degree of abnormality calculated for the sample x_i. First, x_i falls on a leaf of a decision tree; the algorithm searches upward from the leaf until it finds a branch node whose subtree contains far fewer samples than its sibling subtree. The final CoDisp of sample x_i is the average, over all trees in the forest, of the CoDisp of the node found for the sample in each tree. In this embodiment, when selecting the largest ratio CoDisp_Node, the depth of the node is taken into account, since deeper nodes in a tree are more normal. In this way the dividing point at which x_i is isolated from the other, larger group of samples is found, which is more representative.
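The CoDisp computation described above can be sketched for a single sample as follows, using a deliberately minimal node structure (not the full RRCF tree): walk up from the leaf, take at each dividing point the ratio of sibling-subtree size to father-subtree size, keep the per-tree maximum, and average these maxima over the forest.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    size: int                                  # number of samples contained in this subtree
    parent: Optional["Node"] = None
    sibling: Optional["Node"] = None

def codisp_on_tree(leaf: Node) -> float:
    """Maximum over the dividing points above the leaf of |sibling subtree| / |father subtree|."""
    best = 0.0
    node = leaf
    while node.parent is not None:
        ratio = node.sibling.size / node.parent.size   # CoDisp_Node for this dividing point
        best = max(best, ratio)
        node = node.parent                             # move upward toward the root
    return best

def codisp(leaves_per_tree: List[Node]) -> float:
    """Abnormality score of one sample: average of the per-tree maxima over the forest."""
    return sum(codisp_on_tree(leaf) for leaf in leaves_per_tree) / len(leaves_per_tree)
```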
Further, in step b, recommending part of the abnormal data means selecting a plurality of the most abnormal segments in the abnormal data and recommending them after obtaining labels for those segments; or
selecting a plurality of the most uncertain segments in the abnormal data and recommending them after obtaining labels for those segments; or
dividing the abnormal data into a plurality of groups according to the abnormality scores, obtaining a plurality of segments in each group, and recommending them after obtaining labels for those segments.
Figures 3-5 schematically show three different schemes for actively recommending abnormal segments. As shown in fig. 3, according to an embodiment of the present invention, scheme A selects the 30 most abnormal segments; the labels of these abnormal segments can further confirm obvious anomalies and reduce the false positive rate.
According to another embodiment of the invention, as shown in fig. 4, scheme B selects the 30 most uncertain segments (i.e., those near the anomaly judgment threshold); these labels help the model draw a clear classification boundary, thereby improving the identification accuracy of ambiguous anomalies.
As shown in fig. 5, according to a third embodiment of the present invention, scheme C divides the abnormal data into 10 groups according to the abnormality score and obtains at most 3 segments from each group; these labels capture, for example, the judgment feedback module's stance on different degrees of abnormality, thereby helping the model determine the optimal threshold selection range.
In experiments on the public data set, the F1-score of scheme A was higher than that of the other two schemes, but each of the other two schemes has its own applicable scenarios.
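The three recommendation schemes can be sketched as simple selections over the abnormality scores; the counts (30, 30, and 10 groups of at most 3) follow the embodiments above, while the threshold value and the random sampling within groups are illustrative assumptions.

```python
import numpy as np

def scheme_a(scores, k=30):
    """Scheme A: the k most abnormal segments, highest score first."""
    return np.argsort(scores)[-k:][::-1]

def scheme_b(scores, threshold, k=30):
    """Scheme B: the k segments closest to the anomaly-judgment threshold (most uncertain)."""
    return np.argsort(np.abs(np.asarray(scores) - threshold))[:k]

def scheme_c(scores, n_groups=10, per_group=3, rng=None):
    """Scheme C: split by score into n_groups buckets and take at most per_group segments per bucket."""
    rng = rng or np.random.default_rng()
    scores = np.asarray(scores)
    edges = np.linspace(scores.min(), scores.max(), n_groups + 1)
    buckets = np.clip(np.digitize(scores, edges[1:-1]), 0, n_groups - 1)
    picked = []
    for g in range(n_groups):
        members = np.where(buckets == g)[0]
        if len(members) > 0:
            picked.extend(rng.choice(members, size=min(per_group, len(members)), replace=False))
    return np.array(picked)
```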
Furthermore, the invention improves the processing efficiency of the model in the online detection stage through several techniques and gives the model the ability to adjust dynamically according to user feedback. In the online detection stage, only extreme abnormal values are selected as automatic model feedback data to dynamically adjust the RRCF model, which reduces the model update frequency and improves detection performance. According to an embodiment of the invention, after the abnormal data of the n labeled segments are obtained by the model, the abnormal data and the M trees in the decision forest of the model jointly form an abnormality score matrix CoDisp_M[x_i][tree_j]; for each abnormal data x_i, if the user marks it as a true positive, the weight of tree_j is updated as tw_j = tw_j + δ × CoDisp_M[x_i][tree_j]. This feedback-driven self-correction helps the model screen out the decision trees of higher quality, which then carry higher weight in later abnormality judgments; selecting the decision trees with higher weight optimizes the model and improves the detection result.
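A sketch of this feedback step, assuming the model keeps a per-tree weight vector tw and the score matrix CoDisp_M for the n recommended segments; the value of δ (delta) and the weighted-voting helper are illustrative assumptions.

```python
import numpy as np

def update_tree_weights(tw, codisp_m, feedback, delta=0.1):
    """Raise the weight of trees that scored confirmed (true-positive) anomalies highly.

    tw        : (M,) current tree weights
    codisp_m  : (n, M) matrix CoDisp_M[x_i][tree_j] for the n labeled segments
    feedback  : (n,) booleans, True where the judgment result is a true positive
    """
    tw = tw.copy()
    for i, is_true_positive in enumerate(feedback):
        if is_true_positive:
            tw += delta * codisp_m[i]      # tw_j = tw_j + delta * CoDisp_M[x_i][tree_j]
    return tw

def weighted_vote(scores_per_tree, tw):
    """Combine per-tree scores with the learned weights, so higher-weight trees count more."""
    return scores_per_tree @ (tw / tw.sum())
```

In use, the updated weights would replace the uniform vote of the original forest, so that higher-quality trees contribute more to later abnormality judgments.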
Furthermore, the present invention provides a time-series exception handling apparatus for implementing the time-series exception handling method, as shown in fig. 6, the apparatus including:
the data processing module is used for acquiring time series data, training the time series data and constructing a model;
the abnormal data detection recommending module detects whether abnormal data exist in the time sequence data obtained in real time according to the model, and if the abnormal data exist, part of the abnormal data are recommended;
the abnormal data judgment feedback module judges whether the part of abnormal data is reasonable or not and then feeds back a judgment result;
and the model optimization module optimizes the model according to the feedback judgment result and then continuously detects the real-time sequence data.
According to an embodiment of the present invention, further comprising:
and the data classification processing module is used for acquiring irregular large-scale time sequence data, clustering all the time sequence data, training various time sequence data and constructing a model.
In the invention, the data processing module acquires time sequence data, including acquiring regular small-scale time sequence data and irregular large-scale time sequence data, and when acquiring irregular large-scale time sequence data, all the time sequence data are clustered, and then various time sequence data are trained to construct a model.
The clustering process is to capture the incidence relation among the time sequence data to be trained through DBSCAN and cluster the data with approximate shape and consistent periodicity.
In the clustering process, in calculating the approximation degree of the time-series data, the distance between the time-series data is calculated using Dynamic Time Warping (DTW).
And the data classification processing module selects characteristic data which can represent the time sequence data of the corresponding type according to the type of the time sequence data to train and construct a model.
According to one embodiment of the invention, the abnormal data detection recommendation module adopts RRCF to select all feature data for training, the feature data are iterated to obtain a plurality of decision trees, the decision trees form a decision forest, and then whether abnormal data exist in the real-time sequence data or not is determined through decision forest voting.
In this embodiment, when constructing the decision tree, the RRCF selects a segmentation dimension for segmenting the feature data, and the probability that feature data i is selected in the segmentation dimension is p_i = (l_i + g_i) / (Σ_j l_j + Σ_j g_j), with l_i = max_{x∈S}(x_i) − min_{x∈S}(x_i) and g_i = max_j (x_j − x_{j−1}); where i is the feature data; p_i represents the probability that feature i is selected, with a value between 0 and 1; l_i represents the difference between the maximum value and the minimum value in the feature set calculated for feature i over the training sample set S; g_i represents the maximum difference between two adjacent feature values after the calculated values of feature i over the training sample set are sorted by size; Σ_j g_j represents the sum of the g_j calculated for every feature dimension j; and Σ_j l_j represents the sum of the l_j calculated for every feature dimension j.
In this embodiment, the RRCF equally divides the feature data in the segmentation dimension into N intervals [l_0, h_0], [l_1, h_1], ..., [l_{N-1}, h_{N-1}], and calculates the density of each interval d_i = Count(p, p ∈ [l_i, h_i]); the probability that each interval is selected is determined by its density d_i; finally, a cut point X_i ~ Uniform[l_i, h_i] is randomly selected from the selected interval. Here l_0 represents the minimum value of the feature in the segmentation dimension calculated over the training set, h_{N-1} represents its maximum value, and the difference between the minimum and maximum values is divided by N to obtain the N equal intervals.
When abnormal data exists, the abnormality score CoDisp of the abnormal data is calculated using the dividing points; when calculating the abnormality score, the ratio CoDisp_Node of the number of data points contained in the sibling subtree of a dividing point to the number contained in its father subtree is calculated, the largest ratio CoDisp_Node is selected on each tree, and the abnormality score CoDisp_{x_i} of the abnormal data x_i is the average of the selected CoDisp_Node values over the trees of the decision forest.
In the invention, the abnormal data detection recommending module recommends part of the abnormal data by selecting a plurality of the most abnormal segments in the abnormal data and recommending them after obtaining labels for those segments; or
by selecting a plurality of the most uncertain segments in the abnormal data and recommending them after obtaining labels for those segments; or
by dividing the abnormal data into a plurality of groups according to the abnormality scores, obtaining a plurality of segments in each group, and recommending them after obtaining labels for those segments.
According to an embodiment of the present invention, after the abnormal data of the n labeled segments are obtained by the model, the abnormal data and the M decision trees in the decision forest of the model jointly form an abnormality score matrix CoDisp_M[x_i][tree_j]; for each abnormal data x_i, if the fed-back judgment result is a true positive, the weight of decision tree tree_j is updated as tw_j = tw_j + δ × CoDisp_M[x_i][tree_j], and the decision trees with higher weight are selected according to the fed-back judgment results, thereby optimizing the model.
To achieve the above object, the present invention also provides an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the above time-series exception handling method.
In order to achieve the above object, the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to implement the above time-series exception handling method.
According to the above scheme, the invention provides an unsupervised, white-box, accurate time series exception handling method that cooperates with active learning and can actively and efficiently collect feedback information. On the basis of a traditional unsupervised learning framework, an active learning stage is introduced: anomalies are actively recommended to a judgment feedback party (such as a judgment feedback module or operation and maintenance personnel) and feedback is acquired, so that the model is corrected and its accuracy improved. The method retains the advantages of traditional unsupervised learning in terms of parameter tuning and labeling, designs the application strategy for labeled feedback in a targeted manner, and further optimizes the recall rate, detection speed and capability of the model.
Moreover, the processing method has no obvious bias on data, can adapt to indexes with specific scene semantics, can meet the operation and maintenance requirements in the field of non-traditional Internet, has higher expandability and universality, and can give specific abnormal reasons to the given abnormal result.
Moreover, the present invention is able to accurately detect and interpret anomalies. Tested on one public data set and two sets of time series data from a commercial bank's actual production environment, it ultimately reaches F1-scores of 0.81 and 0.89 on the two data sets. Compared with traditional unsupervised exception handling methods, the best F1-score is improved by 0.19-0.5 on the two data sets, and the detection time is shortened by 58%.
Those of ordinary skill in the art will appreciate that the modules and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and devices may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, each functional module in the embodiments of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.
It should be understood that the order in which the steps are written in the summary of the invention and in the embodiments of the present invention does not imply a strict order of execution; the order of execution of the steps should be determined by their functions and internal logic, and should not be construed as limiting the implementation process of the embodiments of the present invention.