Disclosure of Invention
In order to achieve the above object, the present invention provides a method for detecting abnormalities in power time series data based on a long short-term memory network. The invention uses a long short-term memory network model from deep learning to analyze power time series data and detect the abnormal data within it, so as to help the power system find existing faults in time.
An abnormality detection method for power time series data based on a long-short term memory network comprises the following steps:
S1: preprocessing the power time series data, removing unimportant features from the data, and cleaning part of the noisy data; the data preprocessing result is taken as the input of the subsequent model training;
S2: pre-training a neural network model, computing hierarchical dynamic attention with an encoder-decoder structure; the basic model uses a long short-term memory network (LSTM), ReLU is used as the activation function, the loss function is a custom loss function, the optimizer is Adam, and the model is trained until convergence; based on the power time series data set {x_1, …, x_N}, this step yields the initial values of the neural network model parameters for the next step and a representative point c of the whole time series data set;
S3: abnormal data detection with the neural network model.
After training is completed, the model weights W are stored locally; when new power time series data x are to be detected, the model is loaded directly and the distance ||φ(x; W) - c||^2 from the representative point c is calculated as the anomaly score of the user, which is used to judge whether the user is abnormal.
Compared with the prior art, the invention has the beneficial effects that:
the existing data are first preprocessed to obtain a relatively regular, redundancy-free data set; the invention adopts a long short-term memory network, which is better suited to the analysis of time series data than other networks; the Adam optimizer is adopted, and the Adam algorithm, which computes an adaptive learning rate for each parameter, requires little memory, is computationally efficient, and is suitable for problems with large-scale data and parameters; a custom loss function is used and is specifically optimized for anomaly detection, which can improve both network training efficiency and anomaly detection accuracy; the ReLU activation function is used, and its one-sided suppression gives the neurons in the neural network sparse activation, so that the model sparsified by the ReLU can better mine relevant features and fit the training data; and the invention uses a network pre-training technique, which greatly reduces the training time of the model.
Detailed Description
The embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It should be expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the present invention.
Example 1
Although illustrative embodiments of the present invention are described to facilitate the understanding of the invention by those skilled in the art, it should be understood that the invention is not limited to the scope of these embodiments. It will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined in the appended claims, and all changes that come within the meaning and range of equivalency of the claims are to be embraced therein.
As shown in fig. 1 and fig. 3, a method for detecting abnormalities in power time series data based on a long short-term memory network, comprising two parts, off-line training of the model and anomaly detection based on the model, includes the following steps:
S1: preprocessing the power time series data, removing unimportant features from the data, and cleaning part of the noisy data. The result of the data preprocessing is used as the input of the next step of model training;
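As an illustration of the preprocessing in S1, the following is a minimal Python sketch, assuming pandas-style tabular power data; the dropped columns, the z-score threshold used to clean noisy rows, and the min-max scaling are hypothetical choices, not prescribed by the method.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, drop_cols: list, z_thresh: float = 3.0) -> np.ndarray:
    """Sketch of S1: drop unimportant features, clean part of the noisy data, scale the rest."""
    df = df.drop(columns=drop_cols, errors="ignore")    # remove unimportant features
    df = df.interpolate().dropna()                       # fill short gaps, drop unrecoverable rows
    z = (df - df.mean()) / df.std()
    df = df[(z.abs() < z_thresh).all(axis=1)]            # treat gross outliers as noise and remove them
    df = (df - df.min()) / (df.max() - df.min())         # min-max scale to [0, 1]
    return df.to_numpy(dtype=np.float32)                 # result feeds the model training in S2
```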
S2: pre-training the neural network model. Based on an encoder-decoder neural network structure with a hierarchical attention mechanism (figure 2), the basic model uses a long short-term memory network (LSTM), the optimizer uses Adam, the loss function uses a custom loss function, ReLU is used as the activation function, and the model is trained until convergence. In the training process, the back-propagation algorithm is adopted. The algorithm is based on the chain rule for composite functions and uses gradient descent; intuitively, the gradient is a first-order approximation, so it can be understood as a coefficient describing how sensitive the output is to a certain variable or intermediate variable, and this view becomes even more intuitive when the chain rule is written as a product of Jacobian matrices rather than scalar multiplications. To reduce training time, pre-training can be used to find a near-optimal initial solution for the weights of the neural network.
The pre-trained model adopts an encoder-decoder structure based on a hierarchical attention mechanism and comprises the following modules:
2.1 encoder
Input sequence X = (x_1, …, x_l), a k-variable sequence of length l, is passed through the LSTM unit to obtain the hidden state sequence of the n encoder units, E = (e_1, …, e_n), wherein x_i corresponds to the encoder hidden state e_i.
2.2 decoder
The hidden state d_{i-1} of the previous decoder unit is combined with the k-dimensional prediction result of the previous decoder unit and passed through the LSTM unit to obtain the current decoder hidden state d_i, thereby obtaining the sequence of decoder unit hidden states D = (d_1, …, d_m), wherein d_i is the hidden state of the i-th decoder unit.
2.3 computing hierarchical dynamic attention
(1) Calculating attention weights
For each decoder unit, the attention scores α_{i*} are calculated from all encoder hidden states E obtained in step 2.1 and the decoder hidden state d_i obtained in step 2.2. The method used is either a bilinear mapping (i.e., α_{ij} = d_i^T · W_α · e_j) or a location-based method (i.e., computing α_{i*} from the decoder hidden state d_i alone). α_{i*} is then normalized with softmax to obtain the attention weights β_{i*}.
(2) Computing dynamic attention context vectors
Using the attention weights β_{i*} obtained in the previous step, a weighted sum is taken over the set H of hidden states consisting of all encoder hidden states and the decoder hidden states, H = E ∪ D, to obtain the dynamic attention context vector, i.e., c_i^d = Σ_j β_{ij} · h_j with h_j ∈ H.
(3) Computing hierarchical attention hidden states
The dynamic attention context vector c_i^d obtained in the previous step is concatenated with the decoder hidden state d_i and passed through a ReLU activation function to obtain the corresponding dynamic attention hidden state d̃_i. Then a max-pooling or averaging method is applied to all encoder hidden states E obtained in step 2.1 to compute the encoder context vector c^e. Finally, the encoder context vector c^e and the dynamic attention hidden state d̃_i are combined by a blending function (a serial affine transformation of the two vectors, or a pooling operation over both) to obtain the hierarchical attention hidden state h̃_i.
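The following is a minimal PyTorch sketch of the encoder-decoder with hierarchical dynamic attention described in sections 2.1 to 2.3. The module names, the bilinear score, the teacher-forced decoder input, and max-pooling for the encoder context are assumptions made for illustration; this is a sketch under those assumptions, not the exact architecture of the invention.

```python
import torch
import torch.nn as nn

class HierAttnEncoderDecoder(nn.Module):
    """Sketch of sections 2.1-2.3: LSTM encoder/decoder with hierarchical dynamic attention."""

    def __init__(self, k: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.LSTM(k, hidden, batch_first=True)      # 2.1 encoder
        self.decoder = nn.LSTM(k, hidden, batch_first=True)      # 2.2 decoder
        self.bilinear = nn.Linear(hidden, hidden, bias=False)    # W_alpha for the bilinear score
        self.attn_proj = nn.Linear(2 * hidden, hidden)           # dynamic attention hidden state
        self.blend = nn.Linear(2 * hidden, hidden)               # blending function (affine on both vectors)
        self.out = nn.Linear(hidden, k)                          # k-dimensional prediction

    def forward(self, x):                                        # x: (batch, l, k)
        E, _ = self.encoder(x)                                   # encoder hidden states E
        D, _ = self.decoder(x)                                   # decoder hidden states D (teacher-forced input)
        H = torch.cat([E, D], dim=1)                             # H = E ∪ D

        # (1) attention weights: bilinear score between each decoder state and every state in H
        scores = torch.bmm(self.bilinear(D), H.transpose(1, 2))  # (batch, l, 2l)
        weights = torch.softmax(scores, dim=-1)

        # (2) dynamic attention context vectors: weighted sum over H
        ctx = torch.bmm(weights, H)                              # (batch, l, hidden)

        # (3) hierarchical attention hidden states
        d_tilde = torch.relu(self.attn_proj(torch.cat([ctx, D], dim=-1)))
        c_e = E.max(dim=1, keepdim=True).values.expand_as(d_tilde)  # max-pooled encoder context vector
        h = self.blend(torch.cat([c_e, d_tilde], dim=-1))           # hierarchical attention hidden state
        return self.out(h), h
```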
Specifically, the number of hidden units in the model is 128, a single-layer LSTM is used, and parameter estimation is performed by using an Adam optimization algorithm. The size of the model training batch is set to 512, the number of iterations is 500, the learning rate is set to 0.001, overfitting is prevented by using early stopping, and the time point of early stopping is judged by the error reduction trend of the model on the cross validation set.
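A minimal sketch of the pre-training loop with the stated settings (Adam, batch size 512, up to 500 iterations, learning rate 0.001, early stopping judged on a validation set) follows. The exact pre-training objective is not spelled out above, so the simple sequence-reconstruction loss and the patience value are stand-in assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def pretrain(model, train_x, val_x, epochs=500, lr=1e-3, batch=512, patience=10):
    """Sketch: pre-train with Adam and stop early when the validation error stops falling."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(train_x), batch_size=batch, shuffle=True)
    best, bad = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for (xb,) in loader:
            opt.zero_grad()
            pred, _ = model(xb)
            loss = torch.mean((pred - xb) ** 2)        # assumed pre-training target: reconstruct the input
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val_pred, _ = model(val_x)
            val_loss = torch.mean((val_pred - val_x) ** 2).item()
        if val_loss < best:                            # error still decreasing on the validation set
            best, bad = val_loss, 0
            torch.save(model.state_dict(), "pretrained.pt")   # keep the best weights W locally
        else:
            bad += 1
            if bad >= patience:                        # early stopping
                break
    return best
```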
S3: abnormal data detection in neural network models
After training is finished, the model weights W are stored locally. When new power time series data x are to be detected, the model is loaded directly and the distance ||φ(x; W) - c||^2 from the representative point c is calculated as the anomaly score of the user, which is used to judge whether the user is abnormal.
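The detection step can be sketched as follows: load the stored weights, map the new sequence with the network, and take the squared distance to the representative point c as the anomaly score. Mean-pooling the hidden states into one representation per sequence, the way c is obtained, and the threshold are illustrative assumptions.

```python
import torch

def anomaly_score(model, x, c):
    """Squared distance between the mapped representation of x and the representative point c."""
    model.eval()
    with torch.no_grad():
        _, h = model(x)                  # hierarchical hidden states, (batch, l, hidden)
        phi = h.mean(dim=1)              # one representation per sequence (assumed pooling)
        return torch.sum((phi - c) ** 2, dim=1)

# Usage sketch: c is assumed to be the mean mapped representation of the normal training data.
# model.load_state_dict(torch.load("pretrained.pt"))
# scores = anomaly_score(model, new_sequences, c)
# is_abnormal = scores > threshold       # threshold chosen on validation data (assumption)
```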
Further, step S3 uses a long short-term memory network (LSTM), an excellent variant of the RNN. It inherits most of the characteristics of the RNN model while solving the vanishing-gradient problem caused by gradients shrinking step by step during back-propagation, so the LSTM is well suited to highly time-correlated problems such as the power time series data addressed in this patent.
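To illustrate why the LSTM mitigates the vanishing-gradient problem mentioned above, here is a minimal sketch of a single LSTM step; the gated, largely additive cell-state update lets gradients flow backwards without repeated matrix powers. The weight shapes are assumptions for illustration.

```python
import torch

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (k, 4h), U: (h, 4h), b: (4h,) hold the four gate parameters."""
    z = x_t @ W + h_prev @ U + b                 # all four gate pre-activations at once
    i, f, o, g = z.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_t = f * c_prev + i * g                     # additive cell-state update eases gradient flow
    h_t = o * torch.tanh(c_t)                    # hidden state passed to the next time step
    return h_t, c_t
```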
Specifically, the number of hidden units in the model is 96, a single-layer LSTM is used, and parameter estimation is performed with the Adam optimization algorithm. The model training batch size is set to 512, the number of iterations is 200, the learning rate is set to 0.001, and overfitting is prevented by early stopping.
The beneficial effect of adopting the further scheme is as follows: the LSTM is more suitable for time sequence data than other networks, so that the addition of the LSTM effectively improves the effect of anomaly detection.
Further, step S3 uses Adam as the optimization function. Adam is essentially RMSprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first-moment and second-moment estimates of the gradient. The formulas are as follows:
(1) m_t = μ*m_{t-1} + (1-μ)*g_t
(2) n_t = v*n_{t-1} + (1-v)*g_t^2
(3) m̂_t = m_t / (1-μ^t)
(4) n̂_t = n_t / (1-v^t)
(5) Δθ_t = -η * m̂_t / (√(n̂_t) + ε)
where equations (1) and (2) are the first-order and second-order moment estimates of the gradient, which can be regarded as estimates of E|g_t| and E|g_t^2|; equations (3) and (4) are corrections of the first-order and second-order moment estimates, which can be regarded as approximately unbiased estimates of the expectations. It can be seen that moment estimation performed directly on the gradient places no additional requirements on memory and can be adjusted dynamically according to the gradient. In the last equation (5), the factor -m̂_t / (√(n̂_t) + ε) preceding the learning rate η forms a dynamic constraint on the learning rate, and it has a well-defined range.
The beneficial effect of adopting the further scheme is as follows: after offset correction, the learning rate of each iteration has a certain range, so that the parameters are relatively stable.
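A minimal NumPy sketch of one update following equations (1) to (5) is shown below; it is the standard Adam step written out for illustration, not the patent's exact implementation.

```python
import numpy as np

def adam_step(theta, g, m, n, t, lr=1e-3, mu=0.9, v=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient g at step t (t starts at 1)."""
    m = mu * m + (1 - mu) * g                              # (1) first-moment estimate
    n = v * n + (1 - v) * g ** 2                           # (2) second-moment estimate
    m_hat = m / (1 - mu ** t)                              # (3) bias-corrected first moment
    n_hat = n / (1 - v ** t)                               # (4) bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(n_hat) + eps)    # (5) dynamically constrained update
    return theta, m, n
```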
Further, step S3 uses a custom loss function, which is defined as follows:
min over R and W of:  R^2 + (1/(α*N)) * Σ_{i=1}^{N} max{0, ||φ(x_i; W) - c||^2 - R^2} + (λ/2) * ||W||^2
Here φ(·; W) denotes the LSTM model whose network parameters are W; it maps a time sequence x_i into a hypersphere with center c and radius R. The objective function above requires that the raw data be mapped into the hypersphere so that as much data as possible is contained within it. The first term minimizes the volume of the hypersphere; the second term is the penalty term for points located outside the hypersphere, where the hyperparameter α controls the trade-off between the hypersphere volume and boundary violations; the third term is a regularization term that controls the norm of the network parameters W to avoid model overfitting.
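A minimal PyTorch sketch of this three-term objective (hypersphere volume, out-of-sphere penalty weighted by α, and a norm regularizer on W) is given below; mean-pooling the hidden states into one representation per sequence and summing the regularizer over all parameters are illustrative assumptions.

```python
import torch

def hypersphere_loss(model, x, c, R, alpha=0.1, lam=1e-4):
    """Three terms: sphere volume + penalty for points outside the sphere + ||W||^2 regularizer."""
    _, h = model(x)
    phi = h.mean(dim=1)                                      # mapped representation of each sequence
    dist = torch.sum((phi - c) ** 2, dim=1)                  # squared distance to the center c
    volume = R ** 2                                          # first term: minimize the sphere volume
    penalty = torch.mean(torch.clamp(dist - R ** 2, min=0)) / alpha     # second term: out-of-sphere penalty
    reg = lam / 2 * sum(torch.sum(w ** 2) for w in model.parameters())  # third term: norm of W
    return volume + penalty + reg
```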
After Adam completes the training of the model, when new power time series data x are to be detected, the model is loaded directly and the distance ||φ(x; W) - c||^2 from the representative point c is calculated as the anomaly score of the user, which is used to judge whether the user is abnormal.
The beneficial effect of adopting the further scheme is as follows: the loss function is specifically optimized for anomaly detection, network training efficiency is effectively improved, and anomaly detection accuracy and recall rate are improved.
Further, step S3 uses ReLU as the activation function, whose formula is defined as follows:
ReLU(x)=max{0,x}
The function sets all negative values to 0 and leaves positive values unchanged; this one-sided suppression gives the neurons in the neural network sparse activation.
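A two-line illustration of this one-sided suppression:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(torch.relu(x))   # negatives become 0, positives unchanged
```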
The beneficial effect of adopting the further scheme is as follows: compared with a linear function, the ReLU has stronger expressive power, which is especially evident in deep networks; as a nonlinear function, because its gradient is constant over the non-negative interval, the ReLU does not suffer from the vanishing gradient problem, so the convergence rate of the model stays stable.
Further, step S3 initializes the parameters of the LSTM model with the network parameters pre-trained in S2. A pre-trained model is a model trained on a large reference data set to solve a similar problem. Because the computational cost of training such a model is high, the common practice is to import published results and use the corresponding model, then slightly adjust the model parameters on that basis to complete the training of the model.
The beneficial effect of adopting the further scheme is as follows: the training speed of the model can be increased, and the resulting model can be stored in the form of weights and migrated to the solution of other problems; this is the idea adopted in transfer learning.
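A minimal sketch of this initialization step, reusing the pre-trained weights saved in S2 and the illustrative HierAttnEncoderDecoder class from the earlier sketch; copying only shape-compatible layers, the hypothetical NUM_FEATURES value, and the fine-tuning learning rate are assumptions.

```python
import torch

NUM_FEATURES = 8                                              # hypothetical number of power variables k
model = HierAttnEncoderDecoder(k=NUM_FEATURES, hidden=96)     # detection-stage model from the sketch above
pretrained = torch.load("pretrained.pt")                      # weights W saved during S2 pre-training
own = model.state_dict()
# Reuse only the layers whose shapes match (the pre-trained model may use a different hidden size).
own.update({k: v for k, v in pretrained.items() if k in own and v.shape == own[k].shape})
model.load_state_dict(own)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # slight adjustment of the imported parameters
```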