Disclosure of Invention
In view of the above, the present invention aims to provide a multi-factor fusion method for identifying outlier data of the Internet of things, which fuses the influence factors associated with the main factor for comprehensive analysis, and dynamically sets a threshold by adopting a sliding window technique based on the distribution of anomaly scores. The method remarkably improves the rationality, accuracy and real-time performance of outlier data identification, and provides solid technical support for outlier data identification in the Internet of things.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a multi-factor fusion method for identifying outlier data of the Internet of things comprises the following steps:
S1, data fusion and preprocessing, namely determining identified target factors, fusing a plurality of influence factors related to the identified target factors, acquiring corresponding historical data and preprocessing;
S2, data correlation analysis, namely performing correlation analysis on specific indexes in the multi-factor data preprocessed in the step S1 to select index data with obvious influence on target factors so as to form a final effective index combination;
S3, data prediction, namely constructing a prediction model of LSTM combined with a characteristic attention mechanism, predicting the data screened by the correlation analysis, outputting predicted values for a period of time in the future, and obtaining actual predicted values through inverse normalization;
And S4, identifying outlier data, namely, on the basis of the prediction model, adopting an outlier data identification method fusing multi-factor comprehensive analysis: modeling with an isolated forest algorithm, taking the prediction residual and the influence factors as input features, calculating the anomaly score of each data point, dynamically setting a threshold value based on the distribution of the anomaly scores by adopting a sliding window technique, and judging data points exceeding the threshold value as outlier data.
Further, the step S1 specifically includes the following steps:
S11, data fusion, namely, in the identification of the outlier data of the Internet of things, firstly determining the target factor to be identified, and integrating a plurality of influence factors related to the target factor;
S12, processing missing values and abnormal points, namely processing data by adopting a linear interpolation method, and estimating the value at a certain moment by using a linear interpolation function when a peak outlier or missing value appears at the moment;
And S13, normalization processing, namely normalizing the data by adopting an extremum normalization method to eliminate the influence of dimension, and performing inverse normalization processing on the result output by the prediction model to obtain an actual predicted value.
Further, in step S2, for the nonlinear relations among different indexes, the correlation between each pair of indexes is analyzed by adopting the Spearman rank correlation coefficient. Specifically, the Spearman rank correlation coefficient of each pair of index data to be analyzed is calculated through rank conversion, reflecting the strength of the monotone relation between the indexes; influence factor indexes highly correlated with the main factor index are selected according to the result of the correlation analysis, and indexes weakly correlated with the main factor are eliminated, so that the selection of input features for the subsequent model is optimized.
Further, the step S3 specifically includes the following steps:
S31, dividing the preprocessed data into a training set, a test set and a verification set;
S32, constructing a prediction model of LSTM combined with a characteristic attention mechanism, defining an input layer, a hidden layer and an output layer, carrying out window construction on the data set, and then carrying out multivariate time sequence prediction with the data as input of the prediction model; a time sequence Y = (y_1, …, y_{T−1}, y_T) ∈ R^T is set as the prediction target, and the index data of each factor over the historical T moments are set as a time sequence matrix X = (x_1, x_2, …, x_N)^T ∈ R^{T×N} of related characteristic variables, wherein N represents the dimension of the parameters and comprises each index parameter in each factor, and x_t^n represents the value of the nth variable at time t;
The important variables are weighted in the encoding stage by the characteristic attention mechanism to obtain the importance weight c_N of each hidden state to the predicted output, which represents the importance of the current input feature to the output:

c_N = f_attention(x)

In the encoding stage, the context vector updated by the characteristic attention mechanism is fused with the previous history information to produce the output; using these weight coefficients, the input variables at each time are updated to obtain the matrix X̃ = (c_1 x_1, c_2 x_2, …, c_N x_N)^T ∈ R^{T×N};
the weight updating is carried out on the input vector and each hidden layer state by the characteristic attention mechanism, so that the time sequence coding hidden layer state at each moment contains the association relation between the predicted target parameter and the other characteristic parameters, thereby obtaining the predicted value ŷ_{t+1} of the historical data at the next moment;
S33, configuring network parameters and training the prediction model, with convergence of the loss function L(θ) as the termination condition, the formula being as follows:

L(θ) = (1/m) · Σ_{i=1}^{m} √( (1/T) · Σ_{t=1}^{T} ( y_t^i − ŷ_t^i )² )

where θ represents the set of network parameters, m is the number of target parameters, and y_t^i and ŷ_t^i respectively represent the actual value and the predicted value of the target parameter i at time t;
the root mean square errors between the predicted and actual values of each target parameter are summed and averaged, measuring the overall prediction accuracy of the model;
S34, predicting real-time data with the trained prediction model: the data at the first n moments before the moment t+1 to be predicted are used as the input sequence, and the index parameters at moment t+1 are predicted, namely ŷ_{t+1} = F(x_{t−n+1}, x_{t−n+2}, …, x_t), where F denotes the trained prediction model.
Further, the step S4 includes the steps of:
S41, comparing the predicted value output by the prediction model with the actual value, and calculating the prediction residual;
S42, constructing an outlier data identification model of an isolated forest, taking the multidimensional prediction residual and the influence factors as input features, and taking the change of the influence factors into account when calculating the anomaly score of a sample point; the updated anomaly score is calculated as

s(x) = 2^( −E[h_t(x)] / c(n) )

where s(x) represents the anomaly score of data point x, h_t(x) represents the path length of data point x in the t-th tree, E[h_t(x)] is the average path length over all trees, and c(n) is a constant that normalizes the path length, used for adjusting for differences between input features;
Counting the anomaly scores over a period of time by adopting a sliding window technique, calculating the mean value and standard deviation in the window, and dynamically adjusting the threshold value;
At time t, the length of the sliding window is W and the anomaly scores in the window are {s_{t−W+1}, s_{t−W+2}, …, s_t}; after the mean μ_t and the standard deviation σ_t in the sliding window are calculated, the dynamic threshold formula is as follows:
U_t = μ_t + k·σ_t
wherein U_t is the threshold value set in the current window, and k is a constant controlling the sensitivity of outlier detection;
Under real-time data updating, the mean value and standard deviation of the window are rolled forward: the anomaly score s_{t+1} at time t+1 is introduced and the earliest datum s_{t−W+1} of the window is removed, calculated as follows:

μ_{t+1} = μ_t + ( s_{t+1} − s_{t−W+1} ) / W
σ²_{t+1} = σ²_t + ( s²_{t+1} − s²_{t−W+1} ) / W + μ²_t − μ²_{t+1}

Through this recursive mode, the mean value and the standard deviation are dynamically adjusted each time the window is updated;
S43, setting the hyperparameters of the isolated forest model, including the number of isolated trees, the sampling amount of each tree and the length of the sliding window, training the model by gradually adjusting the hyperparameters, and comparing the performance differences under different settings;
S44, identifying outlier data in the prediction residual with the trained isolated forest model: the calculated anomaly score is compared with the threshold value, and if the anomaly score is greater than the threshold value, the data point is judged to be outlier data.
The invention has the beneficial effects that:
1. Through linear interpolation and extremum normalization technology, missing values and abnormal values can be effectively processed, inconsistency and dimension differences in data are eliminated, and dimension consistency among different indexes is ensured. The processing step obviously improves the quality and reliability of the data, and provides a more accurate basis for subsequent analysis and modeling.
2. Through analysis of the correlation among the indexes of the influence factors with the Spearman rank correlation coefficient, indexes strongly correlated with the main factor index can be effectively identified, and indexes irrelevant or weakly related to the target factor are removed. This process helps reduce the interference of irrelevant features, optimizes the selection of input features, and improves the performance and efficiency of subsequent models.
3. By combining the feature attention mechanism, the LSTM model can adaptively adjust the weight according to the change of each influence factor, so that the model can dynamically pay attention to key features in an input sequence at each moment, and the accuracy and reliability of prediction are improved. And the precision of the prediction result is effectively improved through the feature selection and dynamic adjustment of the optimization model.
4. An outlier data identification model is constructed with the isolated forest algorithm, on the basis of the prediction model and fusing multi-factor comprehensive analysis. By taking the multidimensional prediction residual and the influence factors as input features, scale consistency among different features is ensured through standardization, and the relations among the input features are then implicitly considered. These relations are reflected in the construction of the decision trees and affect the final anomaly score calculation. Finally, based on the anomaly score distribution, a sliding window technique is adopted to dynamically set the threshold value, and data points exceeding the threshold value are judged to be outlier data. The accuracy, effectiveness and reliability of outlier data identification are thereby effectively improved.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other, different embodiments, and the details of the present description may be modified or varied in various respects without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention, and the following embodiments and the features in the embodiments may be combined with each other without conflict.
Only the components related to the present invention are shown in the drawings, which are not drawn according to the number, shape and size of the components in actual implementation; in actual implementation, the form, number and proportion of the components may be changed arbitrarily, and the layout of the components may be more complicated.
In the following description, numerous details are set forth in order to provide a more thorough explanation of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the embodiments of the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form rather than in detail, in order to avoid obscuring the embodiments of the present invention.
As shown in fig. 1, the invention provides a multi-factor fusion internet of things outlier data identification method, which comprises the following specific steps:
The first step, data fusion and preprocessing comprises the following steps:
In step 1, data fusion: in the identification of the outlier data of the Internet of things, the target factor to be identified is first determined, and a plurality of influence factors related to it are fused. The corresponding historical data are then merged to provide a comprehensive data basis for subsequent analysis. For example, in water environment monitoring, the water quality factor can be determined as the target factor to be identified; associated influence factors such as weather affect the water quality index data to a certain extent, and combining these factors allows the water quality index data to be analyzed more comprehensively.
And step 2, processing missing values and abnormal points: in the process of Internet of things data acquisition, data breakpoints and noise are often produced under the influence of factors such as network, weather and overhaul. Therefore, a linear interpolation method is used: given the data x_i and x_j sampled at two times t_i and t_j, when a peak outlier or missing value appears at a time t between them, the value at time t is estimated by the linear interpolation function L(t) = x_i + (x_j − x_i)·(t − t_i)/(t_j − t_i).
And step 3, normalization: after linear interpolation, the data are normalized by the extremum normalization method to eliminate the influence of dimension, according to the formula x' = (x − x_min)/(x_max − x_min + ε), wherein x_min and x_max are the minimum and maximum values in the data, and ε is a small constant, chosen based on the range and precision of the data, used to keep the denominator from being zero. The result output by the prediction model is subjected to inverse normalization, x = x'·(x_max − x_min + ε) + x_min, to obtain the actual predicted value, wherein x' represents the normalized value of a certain index.
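The interpolation and normalization of this preprocessing step can be sketched in Python as follows; the function names and the default ε are illustrative assumptions, not part of the claimed method:

```python
import numpy as np

def linear_interpolate(t, ti, tj, xi, xj):
    """Estimate the value at time t from known samples (ti, xi) and (tj, xj)."""
    return xi + (xj - xi) * (t - ti) / (tj - ti)

def minmax_normalize(x, eps=1e-8):
    """Extremum (min-max) normalization; eps keeps the denominator nonzero."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min + eps), x_min, x_max

def minmax_denormalize(x_norm, x_min, x_max, eps=1e-8):
    """Inverse normalization to recover actual predicted values."""
    return x_norm * (x_max - x_min + eps) + x_min
```

The same (x_min, x_max) recorded during training must be reused when de-normalizing model outputs, otherwise the recovered values drift.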
And secondly, data correlation analysis: in order to explore the influence of the different indexes among all influence factors on the main factor index, and in view of the nonlinear relations among different indexes, the correlation between each pair of indexes is analyzed by adopting the Spearman rank correlation coefficient.
Specifically, by performing rank conversion on each pair of index data to be analyzed, their Spearman rank correlation coefficient is calculated, reflecting the strength of the monotone relation between the indexes. If the correlation coefficient is close to 1, a strong positive correlation exists between the two indexes; if it is close to −1, a strong negative correlation exists; and if it is close to 0, no obvious monotone relation exists between them. According to the result of the correlation analysis, influence factor indexes highly correlated with the main factor index are selected, and indexes weakly correlated with the main factor are eliminated, thereby optimizing the selection of input features for the subsequent model. The highly relevant indexes serve as key input features in the subsequent modeling process and provide the basis for updating the input feature weights in the prediction model through the feature attention mechanism, improving the model's attention to key factors.
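The rank conversion and index screening described above can be sketched as follows; the helper names and the 0.5 selection threshold are assumptions for illustration, and ties in the data are not handled:

```python
import numpy as np

def rank(a):
    """Rank-transform a 1-D array (assumes no tied values)."""
    order = a.argsort()
    ranks = np.empty(len(a))
    ranks[order] = np.arange(1, len(a) + 1)
    return ranks

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank-transformed data."""
    return np.corrcoef(rank(a), rank(b))[0, 1]

def select_features(target, candidates, threshold=0.5):
    """Keep only candidate indexes whose |rho| with the target meets the threshold."""
    selected = {}
    for name, series in candidates.items():
        rho = spearman(target, series)
        if abs(rho) >= threshold:
            selected[name] = rho
    return selected
```

Because Spearman's coefficient only measures monotone association, a nonlinear but monotone index (e.g. a cubic relation) still scores near 1, which is why it suits the nonlinear relations mentioned above.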
Third, referring to fig. 2, constructing LSTM combined with attention mechanism for prediction, comprising the steps of:
Step 1, dividing the preprocessed data into a training set, a test set and a verification set in the proportion of 8:1:1.
Step 2, constructing a prediction model of LSTM combined with a characteristic attention mechanism, defining an input layer, a hidden layer and an output layer, performing window construction on the data set, and then performing multivariate time sequence prediction with the data as input of the prediction model; the time sequence Y = (y_1, …, y_{T−1}, y_T) ∈ R^T of any index is set as the prediction target, and the index data of each factor over the historical T moments are set as the time sequence matrix X = (x_1, x_2, …, x_N)^T ∈ R^{T×N} of related characteristic variables, wherein N represents the dimension of the parameters and comprises each index parameter in each factor, and x_t^n represents the value of the nth variable at time t.
The important variables are weighted in the encoding stage by the characteristic attention mechanism to obtain the importance weight c_N of each hidden state to the predicted output, which represents the importance of the current input feature to the output:

c_N = f_attention(x)

In the encoding stage, the context vector updated by the characteristic attention mechanism is fused with the previous history information to produce the output; using these weight coefficients, the input variables at each time are updated to obtain the matrix X̃ = (c_1 x_1, c_2 x_2, …, c_N x_N)^T ∈ R^{T×N}.
The weight updating is carried out on the input vector and each hidden layer state by the characteristic attention mechanism, so that the time sequence coding hidden layer state at each moment contains the association relation between the predicted target parameter and the other characteristic parameters, thereby obtaining the predicted value ŷ_{t+1} of the historical data at the next moment.
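Since f_attention is not fully specified in the text, the following numpy sketch shows one minimal way the per-variable importance weights c_1, …, c_N could re-weight the input matrix; the scoring function and its parameters w and b are purely illustrative placeholders for what the network would learn jointly with the LSTM:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def feature_attention(X, w, b):
    """Score each of the N input variables and normalize to importance weights.

    X    : (T, N) history matrix of the related characteristic variables.
    w, b : (N,) scoring parameters (illustrative stand-ins for learned weights).
    Returns the re-weighted matrix (c_1 x_1, ..., c_N x_N) and the weights c.
    """
    scores = X.mean(axis=0) * w + b   # one scalar score per variable
    c = softmax(scores)               # importance weights, sum to 1
    return X * c, c

rng = np.random.default_rng(0)
T, N = 24, 4
X = rng.normal(size=(T, N))
X_tilde, c = feature_attention(X, w=rng.normal(size=N), b=np.zeros(N))
```

The softmax keeps the weights positive and normalized, so down-weighting one variable necessarily re-allocates attention to the others at each update.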
And step 3, configuring network parameters such as the learning rate and the number of training epochs, and training the prediction model, the training terminating when the loss function L(θ) converges:

L(θ) = (1/m) · Σ_{i=1}^{m} √( (1/T) · Σ_{t=1}^{T} ( y_t^i − ŷ_t^i )² )

where θ represents the set of network parameters, m is the number of target parameters, and y_t^i and ŷ_t^i represent the actual value and the predicted value of the target parameter i at time t, respectively.
The overall prediction accuracy of the model can be effectively measured by summing and averaging the root mean square errors of the predicted value and the actual value of each target parameter, and the model parameters can be optimized in the training process, so that the overfitting is reduced.
And step 4, predicting real-time data with the trained prediction model: the data at the first n moments before the moment t+1 to be predicted are used as the input sequence, and the index parameters at moment t+1 are predicted, namely ŷ_{t+1} = F(x_{t−n+1}, x_{t−n+2}, …, x_t), where F denotes the trained prediction model.
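The sliding-window input construction and the averaged-RMSE loss used above can be sketched as follows; make_windows and loss are hypothetical helper names for illustration:

```python
import numpy as np

def make_windows(X, y, n):
    """Slide a length-n window over the series: X[t-n+1..t] is used to predict y[t+1].

    X : (T, N) feature matrix, y : (T,) target series.
    Returns inputs of shape (T-n, n, N) and the aligned targets y[n:].
    """
    inputs = np.stack([X[t - n + 1 : t + 1] for t in range(n - 1, len(X) - 1)])
    targets = y[n:]
    return inputs, targets

def loss(Y_true, Y_pred):
    """Average of per-parameter root-mean-square errors, matching L(theta)."""
    rmse = np.sqrt(((Y_true - Y_pred) ** 2).mean(axis=0))  # RMSE per target parameter
    return rmse.mean()
```

Each training sample thus pairs n consecutive moments of all retained indexes with the next-moment target, which is exactly the shape an LSTM encoder consumes.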
fourth, referring to fig. 3, constructing an isolated forest for anomaly detection, comprising the steps of:
Firstly, the predicted value output by the prediction model is compared with the actual value, and the prediction residual is calculated. Then, the Z-score standardization method is used to standardize the multidimensional prediction residual and the influence factors, so as to ensure a consistent scale among different features and facilitate subsequent model training.
Step 2, constructing the outlier data identification model of an isolated forest, taking the multidimensional prediction residual and the influence factors as input features. The model implicitly considers the relations between the input features during training; these relations are reflected in the decision trees and in the calculation of the anomaly scores. Specifically, the change of the influence factors is taken into account in the calculation of the anomaly score of a sample point, the updated anomaly score being calculated as

s(x) = 2^( −E[h_t(x)] / c(n) )

where s(x) represents the anomaly score of data point x, h_t(x) represents the path length of data point x in the t-th tree, E[h_t(x)] is the average path length over all trees, and c(n) is a constant that normalizes the path length, used for adjusting for differences between input features.
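The anomaly score described above can be computed from the per-tree path lengths; this sketch follows the standard isolated forest score definition, with the normalizing constant c(n) taken as the usual average unsuccessful-search path length of a binary search tree (harmonic-number approximation):

```python
import math

def c(n):
    """Normalizing constant: average path length of an unsuccessful
    binary-search-tree search over n samples (harmonic approximation)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649  # H(n-1) via Euler-Mascheroni constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """s(x) = 2^(-E[h_t(x)] / c(n)); scores near 1 indicate likely outliers."""
    e_h = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-e_h / c(n))
```

Short average paths mean the point was isolated quickly across the trees and push the score toward 1; paths comparable to c(n) give scores near 0.5, i.e. ordinary points.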
A sliding window technique is adopted to count the anomaly scores over a period of time; the mean value and standard deviation in the window are calculated and the threshold value is dynamically adjusted, so that the threshold is set flexibly as the anomaly distribution changes, improving the adaptability of the model to data fluctuation. At time t, the length of the sliding window is W and the anomaly scores in the window are {s_{t−W+1}, s_{t−W+2}, …, s_t}; after the mean μ_t and the standard deviation σ_t in the sliding window are calculated, the dynamic threshold is given by the following formula, wherein U_t is the threshold set in the current window and k is a constant, usually 2 or 3, controlling the sensitivity of outlier detection:
U_t = μ_t + k·σ_t
In the case of real-time data updating, the mean value and standard deviation of the window can be updated by rolling to reduce the amount of computation: the anomaly score s_{t+1} is introduced at time t+1 and the earliest datum s_{t−W+1} of the window is removed, calculated as follows:

μ_{t+1} = μ_t + ( s_{t+1} − s_{t−W+1} ) / W
σ²_{t+1} = σ²_t + ( s²_{t+1} − s²_{t−W+1} ) / W + μ²_t − μ²_{t+1}
Through this recursive method, the mean value and the standard deviation are dynamically adjusted each time the window is updated, reducing the recalculation cost while ensuring that the threshold changes dynamically over time to adapt to changes in the anomaly score distribution.
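The sliding-window dynamic threshold with O(1) rolling updates can be sketched as follows; the class name and the default k = 3 are illustrative assumptions (the update tracks the window sum and sum of squares, which is equivalent to the recursive mean/variance adjustment above):

```python
import math
from collections import deque

class DynamicThreshold:
    """Sliding-window threshold U_t = mu_t + k * sigma_t over anomaly scores."""

    def __init__(self, window, k=3.0):
        self.scores = deque(maxlen=window)
        self.k = k
        self.sum = 0.0     # rolling window sum
        self.sum_sq = 0.0  # rolling window sum of squares

    def update(self, s):
        """Absorb a new score; evict s_{t-W+1} when the window is full."""
        if len(self.scores) == self.scores.maxlen:
            old = self.scores[0]          # deque drops it on append
            self.sum -= old
            self.sum_sq -= old * old
        self.scores.append(s)
        self.sum += s
        self.sum_sq += s * s

    def threshold(self):
        w = len(self.scores)
        mu = self.sum / w
        var = max(self.sum_sq / w - mu * mu, 0.0)  # clamp tiny negatives
        return mu + self.k * math.sqrt(var)

    def is_outlier(self, s):
        """Judge s against the current window's threshold, then absorb it."""
        flag = len(self.scores) == self.scores.maxlen and s > self.threshold()
        self.update(s)
        return flag
```

Judging a point before absorbing it avoids the new score inflating the very threshold it is compared against; whether to then absorb confirmed outliers into the window is a design choice this sketch leaves simple.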
And step 3, setting the hyperparameters of the isolated forest model, including the number of isolated trees, the sampling amount of each tree, the length of the sliding window, and the like. The model is trained by stepwise adjustment of these hyperparameters, and the performance differences under different settings are compared. When evaluating model performance, indexes such as the ROC curve and the AUC value are adopted to analyze the accuracy and stability of the model, so that the parameter combination with optimal performance is selected.
And 4, identifying outlier data of the prediction residual by using the trained isolated forest model. Comparing the calculated anomaly score with a threshold value, and if the anomaly score is greater than the threshold value, determining the data point as outlier data.
In the foregoing embodiments, references in the specification to "this embodiment" indicate that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least some, but not necessarily all, embodiments. Multiple occurrences of "this embodiment" do not necessarily all refer to the same embodiment.
In the above embodiments, while the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, the embodiments discussed may be used with other memory structures (e.g., dynamic RAM (DRAM)). The embodiments of the invention are intended to embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims.
The present embodiment also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the methods of the present embodiments.
The embodiment also provides an electronic terminal, which comprises a processor and a memory;
The memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the terminal executes any one of the methods in the present embodiment.
As for the computer-readable storage medium of the present embodiment, those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware related to a computer program. The aforementioned computer program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the method embodiments described above. The storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disk or optical disk.
The electronic terminal provided in this embodiment includes a processor, a memory, a transceiver, and a communication interface, where the memory and the communication interface are connected to the processor and the transceiver and complete communication with each other, the memory is used to store a computer program, the communication interface is used to perform communication, and the processor and the transceiver are used to run the computer program, so that the electronic terminal performs each step of the above method.
In this embodiment, the memory may include a random access memory (Random Access Memory, abbreviated as RAM), and may further include a non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor may be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP); a digital signal processor (DSP); an application-specific integrated circuit (ASIC); a field-programmable gate array (FPGA) or other programmable logic device; a discrete gate or transistor logic device; or discrete hardware components.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. Such as a personal computer, a server computer, a hand-held or portable device, a tablet device, a multiprocessor system, a microprocessor-based system, a set top box, a programmable consumer electronics, a network PC, a minicomputer, a mainframe computer, a distributed computing environment that includes any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.