Disclosure of Invention
In order to achieve the above object, the present invention provides a method for detecting abnormalities in power time series data based on a long short-term memory network. The invention uses a long short-term memory network model from deep learning to analyze power time series data and detect the abnormal data within it, so as to help the power system find existing faults in time.
An abnormality detection method for power time series data based on a long-short term memory network comprises the following steps:
S1: preprocessing the power time series data, removing unimportant features from the data, and cleaning part of the noisy data; the data preprocessing result is taken as the input of the subsequent model training;
S2: pre-training a neural network model, computing hierarchical dynamic attention with an encoder-decoder structure; the basic model uses a long short-term memory network (LSTM), ReLU is used as the activation function, the loss function is a custom loss function, the optimizer is Adam, and the model is trained until convergence; based on the power time series data set {x_1, …, x_N}, this step yields the initial values of the neural network model parameters for the next step and a representative point c of the whole time series data set;
S3: abnormal data detection with the neural network model.
After training is completed, the model weights W are stored locally; when new power time series data x are to be detected, the model is loaded directly and the distance ||φ(x; W) - c||^2 from the representative point c is calculated as the anomaly score of the user, which is used to judge whether the user is abnormal.
Compared with the prior art, the invention has the beneficial effects that:
the existing data are first preprocessed to obtain a relatively regular, redundancy-free data set; the invention adopts a long short-term memory network, which is better suited to the analysis of time series data than other networks; the Adam optimizer is adopted, and the Adam algorithm, which computes an adaptive learning rate for each parameter, requires little memory, is computationally efficient, and is suitable for problems with large-scale data and parameters; a custom loss function is used and is specifically optimized for anomaly detection, which can improve both network training efficiency and anomaly detection accuracy; the ReLU activation function is used, and its one-sided suppression gives the neurons in the neural network sparse activation, so that the model sparsified by the ReLU can better mine relevant features and fit the training data; and the invention uses a network pre-training technique, which greatly reduces the training time of the model.
Detailed Description
The embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It should be expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the present invention.
Example 1
Although illustrative embodiments of the present invention are described to facilitate the understanding of the invention by those skilled in the art, it should be understood that the invention is not limited to the scope of these embodiments. It will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined in the appended claims, and all changes that come within the meaning and range of equivalency of the claims are to be embraced therein.
As shown in fig. 1 and fig. 3, a method for detecting abnormalities in power time series data based on a long short-term memory network, comprising two parts, off-line training of the model and anomaly detection based on the model, includes the following steps:
S1: preprocessing the power time series data, removing unimportant features from the data, and cleaning part of the noisy data. The result of the data preprocessing is used as the input of the next step of model training;
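As an illustration of the preprocessing in S1, the following is a minimal Python sketch, assuming pandas-style tabular power data; the dropped columns, the z-score threshold used to clean noisy rows, and the min-max scaling are hypothetical choices, not prescribed by the method.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, drop_cols: list, z_thresh: float = 3.0) -> np.ndarray:
    """Sketch of S1: drop unimportant features, clean part of the noisy data, scale the rest."""
    df = df.drop(columns=drop_cols, errors="ignore")    # remove unimportant features
    df = df.interpolate().dropna()                       # fill short gaps, drop unrecoverable rows
    z = (df - df.mean()) / df.std()
    df = df[(z.abs() < z_thresh).all(axis=1)]            # treat gross outliers as noise and remove them
    df = (df - df.min()) / (df.max() - df.min())         # min-max scale to [0, 1]
    return df.to_numpy(dtype=np.float32)                 # result feeds the model training in S2
```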
S2: pre-training the neural network model. Based on an encoder-decoder neural network structure with a hierarchical attention mechanism (figure 2), the basic model uses a long short-term memory network (LSTM), the optimizer uses Adam, the loss function uses a custom loss function, ReLU is used as the activation function, and the model is trained until convergence. In the training process, the back-propagation algorithm is adopted. The algorithm is based on the chain rule for composite functions and uses gradient descent; intuitively, the gradient is a first-order approximation, so it can be understood as a coefficient describing how sensitive the output is to a certain variable or intermediate variable, and this view becomes even more intuitive when the chain rule is written as a product of Jacobian matrices rather than scalar multiplications. To reduce training time, pre-training can be used to find a near-optimal initial solution for the weights of the neural network.
The pre-trained model adopts an encoder-decoder structure based on a hierarchical attention mechanism and comprises the following modules:
2.1 encoder
Input sequence X = (x_1, …, x_l), a k-variable sequence of length l, is passed through the LSTM unit to obtain the hidden state sequence of the n encoder units, E = (e_1, …, e_n), wherein x_i corresponds to the encoder hidden state e_i.
2.2 decoder
The hidden state d_{i-1} of the previous decoder unit is combined with the k-dimensional prediction result of the previous decoder unit and passed through the LSTM unit to obtain the current decoder hidden state d_i, thereby obtaining the sequence of decoder unit hidden states D = (d_1, …, d_m), wherein d_i is the hidden state of the i-th decoder unit.
2.3 computing hierarchical dynamic attention
(1) Calculating attention weights
For each decoder unit, the attention scores α_{i*} are calculated from all encoder hidden states E obtained in step 2.1 and the decoder hidden state d_i obtained in step 2.2. The method used is either a bilinear mapping (i.e., α_{ij} = d_i^T · W_α · e_j) or a location-based method (i.e., computing α_{i*} from the decoder hidden state d_i alone). α_{i*} is then normalized with softmax to obtain the attention weights β_{i*}.
(2) Computing dynamic attention context vectors
Using the attention weights β_{i*} obtained in the previous step, a weighted sum is taken over the set H of hidden states consisting of all encoder hidden states and the decoder hidden states, H = E ∪ D, to obtain the dynamic attention context vector, i.e., c_i^d = Σ_j β_{ij} · h_j with h_j ∈ H.
(3) Computing hierarchical attention hidden states
The dynamic attention context vector c_i^d obtained in the previous step is concatenated with the decoder hidden state d_i and passed through a ReLU activation function to obtain the corresponding dynamic attention hidden state d̃_i. Then a max-pooling or averaging method is applied to all encoder hidden states E obtained in step 2.1 to compute the encoder context vector c^e. Finally, the encoder context vector c^e and the dynamic attention hidden state d̃_i are combined by a blending function (a serial affine transformation of the two vectors, or a pooling operation over both) to obtain the hierarchical attention hidden state h̃_i.
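The following is a minimal PyTorch sketch of the encoder-decoder with hierarchical dynamic attention described in sections 2.1 to 2.3. The module names, the bilinear score, the teacher-forced decoder input, and max-pooling for the encoder context are assumptions made for illustration; this is a sketch under those assumptions, not the exact architecture of the invention.

```python
import torch
import torch.nn as nn

class HierAttnEncoderDecoder(nn.Module):
    """Sketch of sections 2.1-2.3: LSTM encoder/decoder with hierarchical dynamic attention."""

    def __init__(self, k: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.LSTM(k, hidden, batch_first=True)      # 2.1 encoder
        self.decoder = nn.LSTM(k, hidden, batch_first=True)      # 2.2 decoder
        self.bilinear = nn.Linear(hidden, hidden, bias=False)    # W_alpha for the bilinear score
        self.attn_proj = nn.Linear(2 * hidden, hidden)           # dynamic attention hidden state
        self.blend = nn.Linear(2 * hidden, hidden)               # blending function (affine on both vectors)
        self.out = nn.Linear(hidden, k)                          # k-dimensional prediction

    def forward(self, x):                                        # x: (batch, l, k)
        E, _ = self.encoder(x)                                   # encoder hidden states E
        D, _ = self.decoder(x)                                   # decoder hidden states D (teacher-forced input)
        H = torch.cat([E, D], dim=1)                             # H = E ∪ D

        # (1) attention weights: bilinear score between each decoder state and every state in H
        scores = torch.bmm(self.bilinear(D), H.transpose(1, 2))  # (batch, l, 2l)
        weights = torch.softmax(scores, dim=-1)

        # (2) dynamic attention context vectors: weighted sum over H
        ctx = torch.bmm(weights, H)                              # (batch, l, hidden)

        # (3) hierarchical attention hidden states
        d_tilde = torch.relu(self.attn_proj(torch.cat([ctx, D], dim=-1)))
        c_e = E.max(dim=1, keepdim=True).values.expand_as(d_tilde)  # max-pooled encoder context vector
        h = self.blend(torch.cat([c_e, d_tilde], dim=-1))           # hierarchical attention hidden state
        return self.out(h), h
```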
Specifically, the number of hidden units in the model is 128, a single-layer LSTM is used, and parameter estimation is performed by using an Adam optimization algorithm. The size of the model training batch is set to 512, the number of iterations is 500, the learning rate is set to 0.001, overfitting is prevented by using early stopping, and the time point of early stopping is judged by the error reduction trend of the model on the cross validation set.
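A minimal sketch of the pre-training loop with the stated settings (Adam, batch size 512, up to 500 iterations, learning rate 0.001, early stopping judged on a validation set) follows. The exact pre-training objective is not spelled out above, so the simple sequence-reconstruction loss and the patience value are stand-in assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def pretrain(model, train_x, val_x, epochs=500, lr=1e-3, batch=512, patience=10):
    """Sketch: pre-train with Adam and stop early when the validation error stops falling."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(train_x), batch_size=batch, shuffle=True)
    best, bad = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for (xb,) in loader:
            opt.zero_grad()
            pred, _ = model(xb)
            loss = torch.mean((pred - xb) ** 2)        # assumed pre-training target: reconstruct the input
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val_pred, _ = model(val_x)
            val_loss = torch.mean((val_pred - val_x) ** 2).item()
        if val_loss < best:                            # error still decreasing on the validation set
            best, bad = val_loss, 0
            torch.save(model.state_dict(), "pretrained.pt")   # keep the best weights W locally
        else:
            bad += 1
            if bad >= patience:                        # early stopping
                break
    return best
```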
S3: abnormal data detection in neural network models
After training is finished, the model weights W are stored locally. When new power time series data x are to be detected, the model is loaded directly and the distance ||φ(x; W) - c||^2 from the representative point c is calculated as the anomaly score of the user, which is used to judge whether the user is abnormal.
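The detection step can be sketched as follows: load the stored weights, map the new sequence with the network, and take the squared distance to the representative point c as the anomaly score. Mean-pooling the hidden states into one representation per sequence, the way c is obtained, and the threshold are illustrative assumptions.

```python
import torch

def anomaly_score(model, x, c):
    """Squared distance between the mapped representation of x and the representative point c."""
    model.eval()
    with torch.no_grad():
        _, h = model(x)                  # hierarchical hidden states, (batch, l, hidden)
        phi = h.mean(dim=1)              # one representation per sequence (assumed pooling)
        return torch.sum((phi - c) ** 2, dim=1)

# Usage sketch: c is assumed to be the mean mapped representation of the normal training data.
# model.load_state_dict(torch.load("pretrained.pt"))
# scores = anomaly_score(model, new_sequences, c)
# is_abnormal = scores > threshold       # threshold chosen on validation data (assumption)
```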
Further, step S3 uses a long short-term memory network (LSTM), an excellent variant of the RNN. It inherits most of the characteristics of the RNN model while solving the vanishing-gradient problem caused by gradients shrinking step by step during back-propagation, so the LSTM is well suited to highly time-correlated problems such as the power time series data addressed in this patent.
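To illustrate why the LSTM mitigates the vanishing-gradient problem mentioned above, here is a minimal sketch of a single LSTM step; the gated, largely additive cell-state update lets gradients flow backwards without repeated matrix powers. The weight shapes are assumptions for illustration.

```python
import torch

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (k, 4h), U: (h, 4h), b: (4h,) hold the four gate parameters."""
    z = x_t @ W + h_prev @ U + b                 # all four gate pre-activations at once
    i, f, o, g = z.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_t = f * c_prev + i * g                     # additive cell-state update eases gradient flow
    h_t = o * torch.tanh(c_t)                    # hidden state passed to the next time step
    return h_t, c_t
```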
Specifically, the number of hidden units in the model is 96, a single-layer LSTM is used, and parameter estimation is performed with the Adam optimization algorithm. The model training batch size is set to 512, the number of iterations is 200, the learning rate is set to 0.001, and overfitting is prevented by early stopping.
The beneficial effect of adopting the further scheme is as follows: the LSTM is more suitable for time sequence data than other networks, so that the addition of the LSTM effectively improves the effect of anomaly detection.
Further, step S3 uses Adam as the optimization function. Adam is essentially RMSprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first-moment and second-moment estimates of the gradient. The formulas are as follows:
(1) m_t = μ*m_{t-1} + (1-μ)*g_t
(2) n_t = v*n_{t-1} + (1-v)*g_t^2
(3) m̂_t = m_t / (1-μ^t)
(4) n̂_t = n_t / (1-v^t)
(5) Δθ_t = -η * m̂_t / (√(n̂_t) + ε)
where equations (1) and (2) are the first-order and second-order moment estimates of the gradient, which can be regarded as estimates of E|g_t| and E|g_t^2|; equations (3) and (4) are corrections of the first-order and second-order moment estimates, which can be regarded as approximately unbiased estimates of the expectations. It can be seen that moment estimation performed directly on the gradient places no additional requirements on memory and can be adjusted dynamically according to the gradient. In the last equation (5), the factor -m̂_t / (√(n̂_t) + ε) preceding the learning rate η forms a dynamic constraint on the learning rate, and it has a well-defined range.
The beneficial effect of adopting the further scheme is as follows: after offset correction, the learning rate of each iteration has a certain range, so that the parameters are relatively stable.
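A minimal NumPy sketch of one update following equations (1) to (5) is shown below; it is the standard Adam step written out for illustration, not the patent's exact implementation.

```python
import numpy as np

def adam_step(theta, g, m, n, t, lr=1e-3, mu=0.9, v=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient g at step t (t starts at 1)."""
    m = mu * m + (1 - mu) * g                              # (1) first-moment estimate
    n = v * n + (1 - v) * g ** 2                           # (2) second-moment estimate
    m_hat = m / (1 - mu ** t)                              # (3) bias-corrected first moment
    n_hat = n / (1 - v ** t)                               # (4) bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(n_hat) + eps)    # (5) dynamically constrained update
    return theta, m, n
```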
Further, step S3 uses a custom loss function, which is defined as follows:
min over R and W of:  R^2 + (1/(α*N)) * Σ_{i=1}^{N} max{0, ||φ(x_i; W) - c||^2 - R^2} + (λ/2) * ||W||^2
Here φ(·; W) denotes the LSTM model whose network parameters are W; it maps a time sequence x_i into a hypersphere with center c and radius R. The objective function above requires that the raw data be mapped into the hypersphere so that as much data as possible is contained within it. The first term minimizes the volume of the hypersphere; the second term is the penalty term for points located outside the hypersphere, where the hyperparameter α controls the trade-off between the hypersphere volume and boundary violations; the third term is a regularization term that controls the norm of the network parameters W to avoid model overfitting.
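A minimal PyTorch sketch of this three-term objective (hypersphere volume, out-of-sphere penalty weighted by α, and a norm regularizer on W) is given below; mean-pooling the hidden states into one representation per sequence and summing the regularizer over all parameters are illustrative assumptions.

```python
import torch

def hypersphere_loss(model, x, c, R, alpha=0.1, lam=1e-4):
    """Three terms: sphere volume + penalty for points outside the sphere + ||W||^2 regularizer."""
    _, h = model(x)
    phi = h.mean(dim=1)                                      # mapped representation of each sequence
    dist = torch.sum((phi - c) ** 2, dim=1)                  # squared distance to the center c
    volume = R ** 2                                          # first term: minimize the sphere volume
    penalty = torch.mean(torch.clamp(dist - R ** 2, min=0)) / alpha     # second term: out-of-sphere penalty
    reg = lam / 2 * sum(torch.sum(w ** 2) for w in model.parameters())  # third term: norm of W
    return volume + penalty + reg
```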
After Adam completes the training of the model, when new power time series data x are to be detected, the model is loaded directly and the distance ||φ(x; W) - c||^2 from the representative point c is calculated as the anomaly score of the user, which is used to judge whether the user is abnormal.
The beneficial effect of adopting the further scheme is as follows: the loss function is specifically optimized for anomaly detection, network training efficiency is effectively improved, and anomaly detection accuracy and recall rate are improved.
Further, step S3 uses ReLU as the activation function, whose formula is defined as follows:
ReLU(x)=max{0,x}
The function sets all negative values to 0 and leaves positive values unchanged; this one-sided suppression gives the neurons in the neural network sparse activation.
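A two-line illustration of this one-sided suppression:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(torch.relu(x))   # negatives become 0, positives unchanged
```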
The beneficial effect of adopting the further scheme is as follows: compared with a linear function, the ReLU has stronger expressive power, which is especially evident in deep networks; as a nonlinear function, because its gradient is constant over the non-negative interval, the ReLU does not suffer from the vanishing gradient problem, so the convergence rate of the model stays stable.
Further, step S3 initializes the parameters of the LSTM model with the network parameters pre-trained in S2. A pre-trained model is a model trained on a large reference data set to solve a similar problem. Because the computational cost of training such a model is high, the common practice is to import published results and use the corresponding model, then slightly adjust the model parameters on that basis to complete the training of the model.
The beneficial effect of adopting the further scheme is as follows: the training speed of the model can be increased, and the resulting model can be stored in the form of weights and migrated to the solution of other problems; this is the idea adopted in transfer learning.
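A minimal sketch of this initialization step, reusing the pre-trained weights saved in S2 and the illustrative HierAttnEncoderDecoder class from the earlier sketch; copying only shape-compatible layers, the hypothetical NUM_FEATURES value, and the fine-tuning learning rate are assumptions.

```python
import torch

NUM_FEATURES = 8                                              # hypothetical number of power variables k
model = HierAttnEncoderDecoder(k=NUM_FEATURES, hidden=96)     # detection-stage model from the sketch above
pretrained = torch.load("pretrained.pt")                      # weights W saved during S2 pre-training
own = model.state_dict()
# Reuse only the layers whose shapes match (the pre-trained model may use a different hidden size).
own.update({k: v for k, v in pretrained.items() if k in own and v.shape == own[k].shape})
model.load_state_dict(own)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # slight adjustment of the imported parameters
```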