Disclosure of Invention
The present invention has been made in view of the above-described problems occurring in the prior art.
Therefore, the invention provides a new energy station monitoring data quality evaluation method and system based on multi-source data, which solve the problems of accurate identification and dynamic quality evaluation of new energy station monitoring data deviation in a multi-source data environment, and realize efficient evaluation and intelligent early warning of the influence of the data deviation on running performance.
In order to solve the above technical problems, the invention provides the following technical solutions:
In a first aspect, an embodiment of the present invention provides a new energy station monitoring data quality evaluation method based on multi-source data, comprising:
acquiring multi-source monitoring data collected by each monitoring device in the new energy station, and preprocessing the multi-source monitoring data;
training a data deviation recognition model on the preprocessed multi-source monitoring data using the Baum-Welch algorithm;
integrating the learning network with the data deviation recognition model, constructing an observation sequence, converting the predicted hidden state sequence into state label data, and identifying the data deviations among the monitoring data sources;
calculating an anomaly score for each data deviation, assigning a data quality grade to each data deviation point, and evaluating the influence of the data deviations on the operating performance of the new energy station;
establishing a real-time monitoring platform based on the data quality grades, tracking the trend of the data deviations, setting a change rate threshold, and issuing an early warning for any data deviation that exceeds the change rate threshold.
As a preferred scheme of the new energy station monitoring data quality evaluation method based on multi-source data, acquiring the multi-source monitoring data collected by each monitoring device in the new energy station and preprocessing the multi-source monitoring data comprises the following steps:
listing all monitoring devices in the new energy station, and clearly defining the types and indicators of the data to be acquired;
installing the monitoring devices so that their positions and installation methods meet the applicable standards, and commissioning the devices so that they operate normally and acquire data accurately;
transmitting the acquired data to a cloud server over a network, and preprocessing the raw data.
As a preferred scheme of the new energy station monitoring data quality evaluation method based on multi-source data, training the data deviation recognition model on the preprocessed multi-source monitoring data using the Baum-Welch algorithm comprises the following steps:
normalizing the multi-source monitoring data, and setting the parameters of the data deviation recognition model based on the normalization result;
initializing an initial state probability vector, a state transition probability matrix and an observation probability matrix, and constructing a hidden Markov model;
applying the Baum-Welch algorithm: calculating the forward probability at the initial moment, calculating the forward probability at each subsequent moment by forward recursion, and from these calculating the total probability of the observation sequence;
setting the backward probability of the final moment, and calculating the backward probability of each moment through backward recursion;
calculating a state occupancy probability based on the forward probability and the backward probability;
calculating a state transition probability based on the forward probability and the backward probability;
the state occupancy probability is calculated from the forward probability and the backward probability as:
$$\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)}$$
the state transition probability is calculated from the forward probability and the backward probability as:
$$\xi_t(i,j) = \frac{\alpha_t(i)\,a_{ij}\,b_j(O_{t+1})\,\beta_{t+1}(j)}{P(O \mid \lambda)}$$
wherein $\gamma_t(i)$ represents the state occupancy probability of state $i$ at time $t$, $\alpha_t(i)$ represents the forward probability of state $i$ at time $t$, $\beta_t(i)$ represents the backward probability of state $i$ at time $t$, $P(O \mid \lambda)$ represents the total probability of the observation sequence $O$ under the given model parameters $\lambda$, $\xi_t(i,j)$ represents the probability of a transition from state $i$ at time $t$ to state $j$ at time $t+1$, $a_{ij}$ represents the probability of a transition from state $i$ to state $j$, $b_j(O_{t+1})$ represents the probability of generating the observation $O_{t+1}$ in state $j$, $O_{t+1}$ represents the observation at time $t+1$, and $\beta_{t+1}(j)$ represents the backward probability of state $j$ at time $t+1$;
And updating the initial state probability vector, the state transition probability matrix and the observation probability matrix based on the state occupancy probability and the state transition probability.
As a preferred scheme of the new energy station monitoring data quality evaluation method based on multi-source data, integrating the learning network with the data deviation recognition model, constructing the observation sequence, converting the predicted hidden state sequence into state label data, and identifying the data deviations among the monitoring data sources comprises the following steps:
integrating the learning network with the data deviation recognition model on the basis of the initial state probability vector, the state transition probability matrix and the observation probability matrix;
the observed data of all data sources are aligned according to time and fused into a multidimensional time sequence;
inputting the fused multidimensional time sequence as an observation sequence into a trained data deviation recognition model, and constructing the observation sequence;
Initializing a path probability and a path record matrix, and generating an initialized path probability matrix and a path record matrix;
recursively calculating the path probability at each moment while recording the optimal predecessor state of each state, and updating the path probability matrix and the path record matrix, until the final moment is reached;
the path probability at each moment is calculated as:
$$\delta_{t+1}(j) = \max_i\left[\delta_t(i)\,a_{ij}\right] b_j(O_{t+1})$$
wherein $\delta_{t+1}(j)$ represents the optimal path probability of reaching state $j$ at time $t+1$, and $\delta_t(i)$ represents the optimal path probability of reaching state $i$ at time $t$;
calculating, based on the optimal predecessor states, the optimal path probability and the optimal state at the final moment;
extracting the optimal path probability at the final moment, determining the optimal state at the final moment, and initializing the optimal hidden state sequence;
determining the optimal state at each moment by using the path record matrix to trace back from the final moment to the initial moment;
constructing the optimal hidden state sequence from the optimal states obtained by backtracking, and converting the predicted hidden state sequence into state label data;
identifying the data deviations among the monitoring data sources by comparing the observed data of each monitoring data source with the state label data.
As a preferred scheme of the new energy station monitoring data quality evaluation method based on multi-source data, the method further comprises the following steps:
using a logging module to collect, every minute, the prediction result output by the model together with the predicted deviation, the actual deviation and the corresponding timestamp, for comparison;
using an automated script to classify and label the collected data according to preset deviation types and severity criteria, distinguishing true positives, false positives, true negatives and false negatives, to form a labeled feedback data set;
setting a monthly retraining day, and triggering model retraining on that day;
for the hidden Markov model, re-estimating the state transition probability matrix and the observation probability matrix using the Baum-Welch algorithm;
for the isolation forest model, adjusting the number of trees and the sub-sample size so that the model adapts to the characteristics of the new data;
adopting stochastic gradient descent as the online learning algorithm, randomly drawing a mini-batch from the latest feedback data set each time, and updating the model parameters according to:
$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L(\theta_t; x_t, y_t)$$
the loss function is:
$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)$$
wherein $P(y \mid x; \theta)$ represents the probability that the model predicts class $y$ under the parameters $\theta$, $N$ represents the total number of samples, $\theta$ represents the model parameters, $\eta$ represents the learning rate, $L(\theta_t; x_t, y_t)$ represents the loss function, $x_t$ and $y_t$ represent the features and label of the current sample, and $x_i$ and $y_i$ represent the $i$-th input feature and output label, respectively;
randomly initializing the model parameters $\theta_0$; feeding the samples of the data set in as a stream, one mini-batch at a time; for each mini-batch, calculating the gradient of the loss function with respect to the model parameters and updating the model parameters according to the calculated gradient and the preset learning rate; and iterating until a preset number of iterations is reached;
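The streaming mini-batch update described above can be illustrated with a minimal sketch. It assumes a logistic model for P(y|x; θ) and a mean negative log-likelihood loss, neither of which is fixed by the text; the names `nll`, `gradient`, `sgd` and the toy feedback data are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll(theta, batch):
    """Mean negative log-likelihood of (x, y) pairs, y in {0, 1} (assumed loss)."""
    total = 0.0
    for x, y in batch:
        p = sigmoid(sum(w * xk for w, xk in zip(theta, x)))
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(batch)


def gradient(theta, batch):
    """Gradient of the mean negative log-likelihood with respect to theta."""
    grad = [0.0] * len(theta)
    for x, y in batch:
        err = sigmoid(sum(w * xk for w, xk in zip(theta, x))) - y
        for k, xk in enumerate(x):
            grad[k] += err * xk / len(batch)
    return grad


def sgd(data, eta=0.5, batch_size=4, iterations=200, seed=0):
    """theta <- theta - eta * gradient, one random mini-batch per iteration."""
    rng = random.Random(seed)
    theta = [rng.uniform(-0.1, 0.1) for _ in data[0][0]]  # random theta_0
    for _ in range(iterations):
        batch = rng.sample(data, batch_size)
        g = gradient(theta, batch)
        theta = [w - eta * gk for w, gk in zip(theta, g)]
    return theta


# Hypothetical feedback set: features (bias, deviation magnitude), label 1 = true deviation.
magnitudes = [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0]
data = [((1.0, m), 1 if m > 0.5 else 0) for m in magnitudes]
theta = sgd(data)
```

The fixed iteration count stands in for the preset number of iterations; in a deployed system the mini-batches would be drawn from the latest labeled feedback set, not from a static list.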
adopting a deep Q-network (DQN) algorithm and setting a reward function that gives a positive reward $r^{+}$ for each correct early warning and a negative reward $r^{-}$ for each incorrect early warning, the reward design following these principles:
$r^{+} = \ln(1 + V)$ for each correct early warning;
$r^{-} = -\ln(1 + RL)$ for each incorrect early warning;
wherein $V$ represents the response speed of the system for a correct early warning, and $RL$ represents the response delay of the system for an incorrect early warning;
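As a small illustration of the reward rule only (the deep Q-network itself is not sketched here), the two cases can be written as one function. The function name and argument names are hypothetical.

```python
import math


def early_warning_reward(correct, response_speed=0.0, response_delay=0.0):
    """Reward r+ = ln(1 + V) for a correct warning, r- = -ln(1 + RL) otherwise."""
    if correct:
        return math.log1p(response_speed)
    return -math.log1p(response_delay)


r_plus = early_warning_reward(True, response_speed=2.0)
r_minus = early_warning_reward(False, response_delay=3.0)
```

The logarithm makes the reward grow slowly with V and the penalty grow slowly with RL, so a single very fast or very slow response does not dominate the learning signal.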
setting a change rate threshold $\Delta$, wherein when the change rate of the monitoring data satisfies $\Delta_{new} > \Delta$, the system determines that an abnormal condition may exist and triggers an early warning, according to the formula:
setting an anomaly score threshold $s$, and triggering an early warning when the anomaly score of a monitoring point satisfies $s_{new} > s$;
wherein $\Delta_{new}$ and $s_{new}$ represent the new change rate threshold and the new anomaly score threshold, respectively, $\alpha$ and $\beta$ represent the false positive rate and the false negative rate, respectively, and $T$ represents the average response time;
setting one working day of each month as the evaluation day, and evaluating the strategy on the basis of the early warning performance and the operation and maintenance efficiency improvement over the past month, wherein the data deviation recognition accuracy ACC is:
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
wherein TP represents the number of anomalies correctly identified by the system, TN represents the number of data items correctly judged to be normal, FP represents the number of data items incorrectly judged to be abnormal, and FN represents the number of actual anomalies that the system failed to identify;
the early warning response time RT is:
$$RT = \frac{1}{N}\sum_{i=1}^{N}\left(T_{r,i} - T_{t,i}\right)$$
wherein $T_{r,i}$ represents the time at which the system responded to the $i$-th abnormal event, $T_r$ representing the response time, $T_{t,i}$ represents the time at which the $i$-th abnormal event actually occurred, $T_t$ representing the actual occurrence time, and $N$ represents the total number of abnormal events in the monitoring period;
the operation and maintenance efficiency improvement percentage E is:
$$E = \frac{OT - NT}{OT} \times 100\%$$
wherein OT represents the manual operation and maintenance time required to solve a problem of the same scale before the automatic early warning system was introduced, and NT represents the actual time taken to handle the same problem with the system in place;
the CPI equation reflecting overall performance is:
wherein $RT_{max}$ represents the maximum acceptable early warning response time, $\alpha$ represents the adjustment factor, and $w_1$, $w_2$ and $w_3$ represent the weight coefficients of the accuracy, the response time and the operation and maintenance efficiency improvement percentage, respectively.
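The monthly evaluation metrics can be sketched as follows. The sketch assumes the conventional definitions: accuracy as (TP + TN) / (TP + TN + FP + FN), response time as the mean of response time minus occurrence time, and efficiency gain as (OT - NT) / OT in percent. The composite CPI is deliberately left out because its combination rule is not reproduced in the text. All function names and sample figures are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all judgements (abnormal or normal) that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mean_response_time(response_times, occurrence_times):
    """Average delay between an abnormal event occurring and the system responding."""
    delays = [r - o for r, o in zip(response_times, occurrence_times)]
    return sum(delays) / len(delays)


def efficiency_gain_percent(ot, nt):
    """Percentage of manual operation and maintenance time saved."""
    return (ot - nt) / ot * 100.0


# Hypothetical figures for one evaluation month (times in minutes and hours).
acc = accuracy(tp=45, tn=940, fp=10, fn=5)
rt = mean_response_time([12, 35, 61], [10, 30, 55])
gain = efficiency_gain_percent(ot=8.0, nt=2.0)
```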
As a preferred scheme of the new energy station monitoring data quality evaluation method based on multi-source data, calculating the anomaly score of each data deviation, assigning a data quality grade to each data deviation point, and evaluating the influence of the data deviations on the operating performance of the new energy station comprises the following steps:
carrying out standardization processing on each data deviation, and forming a test data set from the standardized data deviations;
constructing an anomaly detection model and setting its parameters; training the anomaly detection model with the test data set as training data; and predicting the anomaly score of each data deviation in the test data set with the trained model, as follows: each data deviation in the test data set is fed into every isolation tree of the anomaly detection model and, starting from the root node, is passed down according to the splitting conditions of the tree until a leaf node is reached; the path length traversed in each tree is recorded; the average of the path lengths over all isolation trees is calculated, and the final anomaly score of the current data deviation is obtained from this average; and the anomaly scores are normalized to a preset range;
the anomaly score is calculated as:
$$S(x) = 2^{-\frac{E(h(x))}{C(n)}}$$
wherein $S(x)$ represents the anomaly score of the data deviation $x$, $E(h(x))$ represents the average path length of the data deviation $x$ over all isolation trees, and $C(n)$ represents the adjustment coefficient;
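The score computation can be sketched as below. The sketch assumes the normalization commonly used with isolation forests, in which the adjustment coefficient C(n) is the average path length of an unsuccessful binary search tree lookup over n samples and the score is 2 raised to the power -E(h(x)) / C(n). The text does not state the form of C(n), so this choice is an assumption, and the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # used in the harmonic-number approximation


def adjustment_coefficient(n):
    """Assumed C(n): average path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(path_lengths, n):
    """Score in (0, 1); shorter average paths give scores closer to 1. Requires n > 1."""
    mean_depth = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_depth / adjustment_coefficient(n))


# Hypothetical path lengths of two data deviations across three isolation trees.
easily_isolated = anomaly_score([2, 3, 2], n=256)
deeply_nested = anomaly_score([12, 14, 13], n=256)
```

A deviation that is isolated after only a few splits receives a score well above 0.5, while one whose average path length equals C(n) scores exactly 0.5.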
setting data quality grades based on the anomaly scores, and scoring the abnormal data points;
the data quality grades may be set according to score ranges:
high-quality data: anomaly score in the range 0-a;
good data: anomaly score in the range a-b;
moderate data: anomaly score in the range b-c;
low-quality data: anomaly score in the range c-d;
abnormal data: anomaly score in the range d-e;
assigning the corresponding quality grade to each data point based on its anomaly score;
assigning a score to each abnormal data point, which may be set according to the magnitude of the anomaly score:
data points with anomaly scores in the range d-e are scored E' (highest score);
data points with anomaly scores in the range c-d are scored D';
data points with anomaly scores in the range b-c are scored C';
data points with anomaly scores in the range a-b are scored B';
data points with anomaly scores in the range 0-a are scored A' (lowest score);
and evaluating the influence of the data deviation on the operation performance of the new energy station according to the scoring of the abnormal data points.
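The grade assignment amounts to a lookup over ordered score ranges, sketched below. The numeric boundaries chosen for a, b, c and d are hypothetical placeholders, as are the integer scores standing in for A' through E'; the text leaves all of them open, and it also does not say which grade a score lying exactly on a boundary receives (here it goes to the higher range).

```python
from bisect import bisect_right

# Hypothetical boundaries a, b, c, d; the upper limit e is taken to be 1.0.
BOUNDS = [0.3, 0.5, 0.65, 0.8]
GRADES = ["high-quality", "good", "moderate", "low-quality", "abnormal"]
POINT_SCORES = [1, 2, 3, 4, 5]  # stand-ins for A' (lowest) through E' (highest)


def quality_grade(anomaly_score):
    """Return (grade, point score) for a normalized anomaly score."""
    idx = bisect_right(BOUNDS, anomaly_score)
    return GRADES[idx], POINT_SCORES[idx]


graded = [quality_grade(s) for s in (0.1, 0.4, 0.6, 0.7, 0.9)]
```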
As a preferred scheme of the new energy station monitoring data quality evaluation method based on multi-source data, establishing the real-time monitoring platform based on the data quality grades, tracking the trend of the data deviations, setting the change rate threshold, and issuing an early warning for any data deviation that exceeds the change rate threshold comprises the following steps:
Based on the data quality level, establishing a real-time monitoring platform;
performing a differencing operation on the data deviation indicators to obtain differenced values, performing a stationarity test on the differenced values, and taking the differenced values that pass the stationarity test as the change rate of the data deviation;
setting the change rate threshold of the data deviation as $\mu_0$, wherein the deviation change rate $\mu$ is given by:
$$\mu = k \cdot \sigma$$
wherein $\sigma$ represents the standard deviation of the data, and $k$ represents the safety factor;
introducing an objective function $G(k)$ to evaluate the early warning performance under different values of $k$ and obtain the optimal safety factor, wherein:
$$G(k) = w_4\,FPR(k) + w_5\,FNR(k)$$
wherein $FPR(k)$ represents the false alarm rate, $FNR(k)$ represents the missed alarm rate, and $w_4$ and $w_5$ represent the weight coefficients of the false alarm rate and the missed alarm rate, respectively;
the optimal safety factor $k^{*}$ is:
$$k^{*} = \arg\min_{k} G(k)$$
when $\mu > \mu_0$, triggering an early warning and sending the early warning information to operation and maintenance personnel, who analyze it as follows:
collecting all data deviations with the change rate exceeding a preset threshold value;
classifying the data deviations that exceed the threshold by type to obtain the abnormal data points of each category;
Performing root cause analysis on abnormal data points of each category to acquire the cause of data deviation;
And according to the analysis result, periodically evaluating and adjusting the early warning rule, and optimizing the real-time monitoring platform.
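The choice of safety factor can be illustrated with a small grid search. The sketch assumes that the objective is a weighted sum of the false alarm rate and the missed alarm rate, that a change is flagged when its absolute first difference exceeds k times the standard deviation, and that labeled history is available. All three are assumptions made for illustration, the stationarity test is omitted, and every name below is hypothetical.

```python
from statistics import pstdev


def differences(series):
    """First-order differences, used here as the change rate of the deviation."""
    return [b - a for a, b in zip(series, series[1:])]


def objective(k, rates, labels, sigma, w4=0.5, w5=0.5):
    """Assumed G(k) = w4 * FPR(k) + w5 * FNR(k) for the threshold k * sigma."""
    flagged = [abs(r) > k * sigma for r in rates]
    fp = sum(f and not y for f, y in zip(flagged, labels))
    fn = sum(y and not f for f, y in zip(flagged, labels))
    negatives, positives = labels.count(False), labels.count(True)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return w4 * fpr + w5 * fnr


def best_safety_factor(rates, labels, sigma, candidates):
    """Candidate k with the smallest objective (first one wins on ties)."""
    return min(candidates, key=lambda k: objective(k, rates, labels, sigma))


# Hypothetical deviation index with one genuine jump between samples 4 and 5.
deviation = [0.10, 0.12, 0.11, 0.13, 0.60, 0.62, 0.61, 0.63]
rates = differences(deviation)
labels = [False, False, False, True, False, False, False]  # True = real abnormal change
sigma = pstdev(rates)
k_star = best_safety_factor(rates, labels, sigma, candidates=[0.05, 1.0, 2.0, 3.0])
```

A very small k flags every fluctuation and a very large k misses the real jump, so the search settles on an intermediate value.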
In a second aspect, an embodiment of the present invention provides a new energy station monitoring data quality evaluation system based on multi-source data, comprising a data acquisition and preprocessing module, a data deviation recognition model training module, an integrated learning network and deviation recognition module, a data deviation anomaly score and quality evaluation module, and a real-time monitoring and early warning response module;
the data acquisition and preprocessing module is used for acquiring the multi-source monitoring data collected by each monitoring device in the new energy station and preprocessing the multi-source monitoring data;
the data deviation recognition model training module is used for training a data deviation recognition model on the preprocessed multi-source monitoring data using the Baum-Welch algorithm;
the integrated learning network and deviation recognition module is used for integrating the learning network and the data deviation recognition model, constructing an observation sequence, converting the predicted hidden state sequence into state label data and recognizing the data deviation among all monitoring data sources;
the data deviation abnormal score and quality evaluation module is used for calculating the data deviation abnormal score, setting data quality grades for each data deviation point and evaluating the influence of the data deviation on the operation performance of the new energy station;
The real-time monitoring and early warning response module is used for establishing a real-time monitoring platform, tracking the change trend of the data deviation, setting a change rate threshold value and carrying out early warning on the data deviation exceeding the change rate threshold value.
In a third aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory stores a computer program, where the computer program when executed by the processor implements any step of the new energy station monitoring data quality evaluation method based on multi-source data according to the first aspect of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements any step of the new energy station monitoring data quality evaluation method based on multi-source data according to the first aspect of the present invention.
The method has the beneficial effects that the accuracy and the reliability of the monitoring data of the new energy station are improved, the predictability and the coping capacity of potential risks are greatly enhanced through a real-time monitoring and early warning mechanism, and finally the stable operation and the optimized management of the new energy station are ensured.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Embodiment 1, referring to fig. 1 and 2, provides a new energy station monitoring data quality evaluation method based on multi-source data, which comprises the following steps:
S1, acquiring multi-source monitoring data acquired by each monitoring device in the new energy station, and preprocessing the multi-source monitoring data.
Listing all monitoring devices in the new energy station, and clearly defining the types and indicators of the data to be acquired;
installing the monitoring devices so that their positions and installation methods meet the applicable standards, and commissioning the devices so that they operate normally and acquire data accurately;
transmitting the acquired data to a cloud server over a network, and performing cleaning, integration, feature engineering and normalization on the raw data.
By way of explanation, all monitoring devices in the new energy station, including sensors, meters and the like, are listed, and the data types and indicators to be acquired, such as temperature, humidity, wind speed, power generation and device status, are clearly defined. The monitoring devices are then installed so that their positions and installation methods meet the applicable standards, and are commissioned so that they operate normally and acquire data accurately.
S2, based on the multi-source monitoring data, training a data deviation recognition model of the preprocessed multi-source monitoring data by utilizing a Baum-Welch algorithm.
Carrying out normalization processing on the multi-source monitoring data, and setting parameters of a data deviation recognition model based on the normalization processing result;
By way of explanation, all multi-source monitoring data are collected so that coverage is complete and includes every parameter to be monitored. The data from the different sources are organized and their timestamps aligned so that all data share a consistent time dimension. A suitable normalization method, such as min-max normalization or Z-score normalization, is selected and applied to each monitoring parameter so that the values fall within a common range, for example [0,1], or have zero mean and unit variance. The normalized data are then checked to confirm that they are free of errors and meet expectations.
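The two normalization options named above can be sketched as follows; the function names and the sample readings are hypothetical, and the handling of a constant series (mapped to zeros) is a design choice made here, not something the text prescribes.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Scale values linearly into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical wind-speed readings (m/s) from one monitoring source.
wind_speed = [3.0, 5.0, 7.0, 9.0, 11.0]
scaled = min_max_normalize(wind_speed)
standardized = z_score_normalize(wind_speed)
```

Min-max scaling is sensitive to a single extreme reading, whereas Z-score scaling keeps outliers visible as large absolute values, which may matter when the normalized data later feed a deviation model.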
In the present application, the data deviation recognition model is a hidden Markov model (HMM), a statistical model that describes an observation sequence generated by a hidden state sequence and is commonly used to model and analyze time series data.
Initializing an initial state probability vector, a state transition probability matrix and an observation probability matrix to construct a hidden Markov model;
The Baum-Welch algorithm is applied to calculate the forward probability at the initial moment, and the forward probability at each moment is calculated through forward recursion to calculate the total probability of the observation sequence;
setting the backward probability of the final moment, and calculating the backward probability of each moment through backward recursion;
calculating a state occupancy probability based on the forward probability and the backward probability;
calculating a state transition probability based on the forward probability and the backward probability;
The state occupancy probability is calculated from the forward probability and the backward probability as:
$$\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)}$$
the state transition probability is calculated from the forward probability and the backward probability as:
$$\xi_t(i,j) = \frac{\alpha_t(i)\,a_{ij}\,b_j(O_{t+1})\,\beta_{t+1}(j)}{P(O \mid \lambda)}$$
wherein $\gamma_t(i)$ represents the state occupancy probability of state $i$ at time $t$, $\alpha_t(i)$ represents the forward probability of state $i$ at time $t$, $\beta_t(i)$ represents the backward probability of state $i$ at time $t$, $P(O \mid \lambda)$ represents the total probability of the observation sequence $O$ under the given model parameters $\lambda$, $\xi_t(i,j)$ represents the probability of a transition from state $i$ at time $t$ to state $j$ at time $t+1$, $a_{ij}$ represents the probability of a transition from state $i$ to state $j$, $b_j(O_{t+1})$ represents the probability of generating the observation $O_{t+1}$ in state $j$, $O_{t+1}$ represents the observation at time $t+1$, and $\beta_{t+1}(j)$ represents the backward probability of state $j$ at time $t+1$;
And updating the initial state probability vector, the state transition probability matrix and the observation probability matrix based on the state occupancy probability and the state transition probability.
It is to be explained that the initial state probability vector, the state transition probability matrix and the observation probability matrix are all parameters of the data deviation recognition model;
wherein the initial state probability vector represents the probability that the data bias identification model was initially in each hidden state;
the state transition probability matrix represents the probability of transitioning from one hidden state to another hidden state;
The observation probability matrix represents the probability of generating a certain observation under a certain hidden state.
The forward algorithm is a dynamic programming algorithm for calculating the probability of an observation sequence under given hidden Markov model (HMM) parameters. At each step it recursively propagates the state probabilities of the previous moment to the next moment and updates them with the observation at the current moment. The forward probability is defined as the probability of being in state i at time t and having observed the first t observations.
The backward algorithm is likewise a dynamic programming algorithm. The backward probability is defined as the probability, given that the model is in state i at time t, of observing the remaining observations from time t+1 to the final moment T.
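The forward and backward recursions, and the two posterior quantities derived from them, can be sketched for a discrete-observation HMM as follows. The sketch assumes discretized observation symbols and a two-state model with made-up parameters; the function names and numbers are hypothetical, and only the transition matrix update of one Baum-Welch step is shown.

```python
def forward(pi, A, B, obs):
    """alpha[t][i]: probability of the first t+1 observations and state i at step t."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    """beta[t][i]: probability of the observations after step t, given state i at t."""
    n = len(A)
    beta = [[1.0] * n]  # backward probability at the final moment
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def posteriors(pi, A, B, obs):
    """Return (gamma, xi, total): occupancy and transition posteriors, sequence probability."""
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    n, T = len(pi), len(obs)
    total = sum(alpha[-1])
    gamma = [[alpha[t][i] * beta[t][i] / total for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / total
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    return gamma, xi, total


def reestimate_transitions(gamma, xi):
    """One Baum-Welch update of the state transition probability matrix."""
    n = len(gamma[0])
    return [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
             for j in range(n)] for i in range(n)]


# Hypothetical model: state 0 = sources agree, state 1 = sources deviate.
pi = [0.8, 0.2]
A = [[0.9, 0.1], [0.3, 0.7]]
B = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]  # three discretized observation symbols
obs = [0, 0, 2, 2, 1]

gamma, xi, total = posteriors(pi, A, B, obs)
A_new = reestimate_transitions(gamma, xi)
```

For long sequences the raw probabilities underflow, so a practical implementation would typically rescale each step or work in log space; that refinement is omitted here for brevity.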
S3, integrating the learning network and the data deviation recognition model, constructing an observation sequence, converting the predicted hidden state sequence into state label data, and recognizing the data deviation among all monitoring data sources.
Integrating the learning network with the data deviation recognition model on the basis of the initial state probability vector, the state transition probability matrix and the observation probability matrix;
the observed data of all data sources are aligned according to time and fused into a multidimensional time sequence;
inputting the fused multidimensional time sequence as an observation sequence into a trained data deviation recognition model, and constructing the observation sequence;
Initializing a path probability and a path record matrix, and generating an initialized path probability matrix and a path record matrix;
recursively calculating the path probability at each moment while recording the optimal predecessor state of each state, and updating the path probability matrix and the path record matrix, until the final moment is reached;
the path probability at each moment is calculated as:
$$\delta_{t+1}(j) = \max_i\left[\delta_t(i)\,a_{ij}\right] b_j(O_{t+1})$$
wherein $\delta_{t+1}(j)$ represents the optimal path probability of reaching state $j$ at time $t+1$, and $\delta_t(i)$ represents the optimal path probability of reaching state $i$ at time $t$;
By way of explanation, the path probability of each possible state at the initial moment is first determined from the initial state probability and the probability of the first observation, and a matrix is created to store these initial probabilities, each row representing a state and each column a time point. A second matrix is created to record the predecessor state through which each state is reached; because this is the initial time point, every entry of the predecessor matrix may be set to 0, indicating that there is no predecessor state. Starting from time point t=2, the path probability is calculated step by step. For each time point and each state, the paths arriving from all possible predecessor states are compared, the path with the maximum probability is selected as the optimal path, and its probability is stored in the path probability matrix at the position corresponding to the current time point and state.
Specifically, while the optimal path is being determined, the index of the predecessor state with the highest path probability, that is, the state from which the transition into the current state at the current moment is most probable, is stored in the path record matrix at the position corresponding to the current time point and state. The path record matrix is updated continuously so that the optimal predecessor state is recorded for each state at each time point and the predecessor information for every step of the whole path is retained. These steps are repeated recursively for every time point and state: at each time point the optimal path probability is calculated from the path probabilities of the previous time point and the transition probabilities, and the optimal predecessor state of each state is recorded. Updating the path probability matrix ensures that it reflects the optimal path probability of each state at each time point, and updating the path record matrix ensures that the optimal path can be traced back in full. Based on the optimal predecessor states, the optimal path probability and the optimal state at the final moment are calculated; the optimal path probability is extracted at the final moment, the optimal state at the final moment is determined, and the optimal hidden state sequence is initialized;
It should be explained that, the observation data of all the monitoring devices are collected, each data source is ensured to have an accurate time stamp, the interpolation method is adopted to time align the data at different moments, and each time point is ensured to have complete multidimensional data.
The aligned data take the form $X=\{(t_1,x_{11},x_{12},\dots,x_{1n}),(t_2,x_{21},x_{22},\dots,x_{2n}),\dots,(t_T,x_{T1},x_{T2},\dots,x_{Tn})\}$, where $t_i$ represents the timestamp of the i-th time point and $x_{ij}$ represents the j-th dimension observation at the i-th time point;
fusing the aligned multi-source data into a multi-dimensional time sequence $O=\{o_1,o_2,\dots,o_T\}$, wherein $o_i=(x_{i1},x_{i2},\dots,x_{in})$ represents the multi-dimensional observation vector of the i-th time point;
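The alignment and fusion steps above can be illustrated with a minimal sketch. It assumes linear interpolation onto a shared time grid (the text only says "interpolation"), and the function names `interpolate` and `align_sources` and the sample data are hypothetical.

```python
from bisect import bisect_left

def interpolate(times, values, t):
    """Linearly interpolate one source at time t (clamped at both ends)."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    j = bisect_left(times, t)              # first index with times[j] >= t
    t0, t1 = times[j - 1], times[j]
    v0, v1 = values[j - 1], values[j]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def align_sources(sources, grid):
    """Return O = [o_1, ..., o_T]: one n-dimensional vector per grid time."""
    return [tuple(interpolate(ts, vs, t) for ts, vs in sources) for t in grid]

# Two hypothetical sources sampled at different moments: (timestamps, values).
wind = ([0.0, 2.0, 4.0], [5.0, 7.0, 9.0])
temp = ([0.0, 1.0, 3.0, 4.0], [20.0, 21.0, 23.0, 24.0])
O = align_sources([wind, temp], grid=[0.0, 1.0, 2.0, 3.0, 4.0])
```

Each element of `O` then plays the role of one observation vector o_i in the fused sequence.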
Inputting the fused multidimensional time sequence O into the trained data deviation recognition model, ensuring that the model can process multidimensional observation vectors, calculating the probability of each observation vector according to the observation probability matrix B of the model, and initializing a path probability matrix δ, which represents the optimal path probability of reaching each state at each moment;
$\delta_1(i)=\pi_i b_i(o_1)$, where $\pi_i$ represents the initial probability of state i, and $b_i(o_1)$ is the probability of observing $o_1$ in state i at the first time point.
And initializing a path record matrix ψ for recording the optimal precursor state of each state at each moment.
Based on the initialized path probability and the path record matrix, the path probability at each moment is calculated by recursion based on an observation sequence by utilizing a Viterbi algorithm, the optimal precursor state reaching each state is recorded, the optimal path probability and the optimal state at the final moment are calculated based on the optimal precursor state, and the optimal hidden state sequence is obtained by backtracking the recorded optimal precursor state from the optimal state at the final moment.
The Viterbi algorithm is a dynamic programming algorithm used to find the most likely hidden state sequence for a given observation sequence in a hidden Markov model (HMM), and is widely used in fields such as speech recognition, bioinformatics and communication signal processing.
The optimal state of each moment is determined by utilizing the path record matrix and tracing back from the final moment to the first moment;
Constructing an optimal hidden state sequence according to the optimal state at each moment obtained by backtracking, and converting the predicted hidden state sequence into state label data;
By comparing the difference between the observed data and the state label data of each monitoring data, the data deviation between each monitoring data source is accurately identified.
It should be explained that at the final moment, all values in the path probability matrix are checked to find the largest value, which is the optimal path probability of the whole observation sequence from the beginning to the final moment. The state corresponding to this maximum path probability is the optimal state at the final moment, and it is recorded for subsequent analysis and used to initialize the optimal hidden state sequence: an array or list is created to store the sequence, and the optimal state at the final moment is taken as its last element. Starting from this known optimal state, the path record matrix, which stores the optimal precursor state of each state at each moment, is traced back step by step. At each time point, the precursor state of the current optimal state is read from the path record matrix and inserted into the corresponding position of the optimal hidden state sequence. Backtracking continues until the first moment is reached, which yields a complete optimal hidden state sequence containing the optimal hidden state of every step from the initial moment to the final moment. Finally, the optimal hidden state sequence is checked to ensure that no time point was missed during backtracking.
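The initialization, recursion, termination and backtracking described above can be sketched as follows. This is an illustration only: it assumes a discrete observation alphabet so that B can be indexed by symbol (multidimensional observation vectors would need an emission density instead), and the two-state model at the bottom is hypothetical.

```python
def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence for a discrete-observation HMM.

    pi[i]   : initial probability of state i
    A[i][j] : probability of moving from state i to state j
    B[i][k] : probability of observing symbol k in state i
    obs     : observation sequence given as symbol indices
    """
    n_states, T = len(pi), len(obs)
    delta = [[0.0] * T for _ in range(n_states)]   # path probability matrix
    psi = [[0] * T for _ in range(n_states)]       # path record matrix

    # Initialization: delta_1(i) = pi_i * b_i(o_1); precursor entries stay 0.
    for i in range(n_states):
        delta[i][0] = pi[i] * B[i][obs[0]]

    # Recursion from the second time point onwards.
    for t in range(1, T):
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: delta[i][t - 1] * A[i][j])
            psi[j][t] = best_i
            delta[j][t] = delta[best_i][t - 1] * A[best_i][j] * B[j][obs[t]]

    # Termination: optimal state and path probability at the final moment.
    last = max(range(n_states), key=lambda i: delta[i][T - 1])
    best_prob = delta[last][T - 1]

    # Backtracking through psi from the final moment to the first.
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[path[-1]][t])
    path.reverse()
    return path, best_prob

# Hypothetical model: state 0 = normal, state 1 = abnormal; three symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
path, prob = viterbi(pi, A, B, obs=[0, 1, 2])   # path == [0, 0, 1]
```

For long sequences the products underflow, so a practical version would typically work with log-probabilities and sums instead.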
Each hidden state can be mapped to a specific label, such as normal or abnormal, and the corresponding label is assigned to the hidden state of each time point to form a complete state label data sequence. The observed data and the state label data are then compared at each time point, and the differences between them are analyzed to find whether the monitoring data sources are inconsistent or deviate from one another, thereby identifying possible data deviations.
S4, integrating the learning network and the data deviation recognition model.
A log recording module is used to collect, every minute, the prediction result output by the model, and to compare the predicted deviation with the real deviation together with the corresponding timestamp data;
Classifying and marking the collected data according to a preset deviation type and severity standard by using an automatic script, and distinguishing true positives, false positives, true negatives and false negatives to form a marked feedback data set;
Setting a monthly retraining day, and triggering retraining of the model on that day each month;
for the hidden Markov model, re-estimating a state transition probability matrix and an observation probability matrix by using a Baum-Welch algorithm;
For an isolated forest model, the number of trees and the size of sub-samples are adjusted, so that the model is ensured to adapt to new data characteristics;
Adopting stochastic gradient descent as the online learning algorithm, randomly extracting a small batch of data from the latest feedback data set each time, and updating the model parameters with the following formula:
the formula of the loss function is as follows;
wherein P (y|x; θ) represents the probability that the model predicts class y under parameter θ, N represents the total number of samples, θ represents the model parameter, η represents the learning rate, L (θt;xt,yt) represents the loss function, xt and yt represent the characteristics and labels of the current sample, and xi and yi represent the ith input characteristic and output label, respectively;
Randomly initializing the model parameter θ0, inputting the samples of the data set in a streaming manner, acquiring one small batch of data each time, calculating the gradient of the loss function with respect to the model parameters for each small batch, updating the model parameters according to the calculated gradient and the preset learning rate, and iterating continuously until a preset number of iterations is reached;
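The update and loss formulas are not reproduced in this text. The sketch below assumes the conventional forms suggested by the symbol definitions, namely the update θ(t+1) = θ(t) - η·∇L(θ(t); x_t, y_t) and the average negative log-likelihood L(θ) = -(1/N)·Σ log P(y_i|x_i; θ), and uses a logistic model as a stand-in for the classifier. The model choice, function names and feedback data are all hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(theta, data):
    """Average negative log-likelihood -(1/N) * sum(log P(y_i | x_i; theta))."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(w * xi for w, xi in zip(theta, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)        # guard against log(0)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(data)

def sgd(data, eta=0.5, batch_size=4, iterations=200, seed=0):
    """Mini-batch stochastic gradient descent for a logistic model."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    theta = [rng.uniform(-0.1, 0.1) for _ in range(dim)]    # random theta_0
    for _ in range(iterations):
        batch = rng.sample(data, batch_size)                # random mini-batch
        grad = [0.0] * dim
        for x, y in batch:
            p = sigmoid(sum(w * xi for w, xi in zip(theta, x)))
            for d in range(dim):
                grad[d] += (p - y) * x[d] / batch_size      # gradient of the loss
        # Assumed update rule: theta <- theta - eta * gradient.
        theta = [w - eta * g for w, g in zip(theta, grad)]
    return theta

# Hypothetical feedback set: features (bias term, deviation magnitude),
# label 1 when the deviation was confirmed as real.
feedback = [((1.0, v), 1 if v > 0.5 else 0) for v in [0.05 * k for k in range(21)]]
theta = sgd(feedback)
```

The fixed iteration count mirrors the stopping rule in the text; a streaming deployment would draw each batch from the newest feedback records rather than from a static list.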
Adopting a deep Q network algorithm and setting a reward function that gives a positive reward R+ for each correct early warning and a negative reward R- for each incorrect early warning, wherein the reward design follows the following principles:
$R^{+}=\ln(1+V)$ for each correct early warning;
$R^{-}=-\ln(1+RL)$ for each incorrect early warning;
wherein V and RL represent the response speed and the response delay of the system in the case of a correct and an incorrect early warning, respectively;
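As a small illustration of the reward design above, the following sketch evaluates the two reward expressions. The function name `reward` and its argument names are hypothetical, and V and RL are assumed to be non-negative.

```python
import math

def reward(correct: bool, v: float = 0.0, rl: float = 0.0) -> float:
    """R+ = ln(1 + V) for a correct warning, R- = -ln(1 + RL) otherwise."""
    if correct:
        return math.log(1.0 + v)
    return -math.log(1.0 + rl)
```

Because of the logarithm, the reward grows slowly with response speed and the penalty grows slowly with delay, so a single very fast or very slow response does not dominate the learning signal.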
Setting a change rate threshold δ; when the change rate δnew of the monitoring data satisfies δnew > δ, the system judges that an abnormal condition may exist and triggers early warning according to the formula:
Setting an abnormal score threshold s, and triggering early warning when the abnormal score snew of a monitoring point satisfies snew > s;
δnew and snew represent a new change rate threshold and an abnormality score threshold, respectively, α and β are the false positive rate and the false negative rate, respectively, and T is the average response time;
The main parameters of the isolated forest model are determined first, namely the number of trees (the number of isolated trees), the sub-sample size (the number of data samples each tree uses for training), the maximum feature number and the like. These parameters influence the performance and detection precision of the model and can be adjusted according to actual requirements and data characteristics; for example, a larger number of trees gives a more stable model, while the sub-sample size affects the training speed and precision. The test data set is used as training data and must be preprocessed, for example by removing missing values and normalizing, to ensure effective training; the data set should cover various data deviation conditions so that the model can learn both the normal and the abnormal patterns of the data. The set parameters and the prepared training data are input into the isolated forest model for training. The model learns the data distribution by constructing a plurality of isolated trees, each of which randomly partitions the data samples and, through recursive partitioning, isolates abnormal points with only a small number of partitioning rules. The trained model is then used to predict an anomaly score for each data point, the scores are normalized to a unified range so that a threshold can be set conveniently, and data points whose anomaly score exceeds the threshold are regarded as abnormal.
A common method includes selecting a fixed score value as the threshold value or setting a percentile (such as marking the point with the top 5% of the score as abnormal) according to the distribution condition of the data, and marking all the data points with the score higher than the set threshold value in the test data set as abnormal data points.
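The percentile-based threshold mentioned above can be sketched as follows. The helper names are hypothetical, and the comparison uses "at or above" so that exactly the top fraction is flagged when all scores are distinct.

```python
def percentile_threshold(scores, top_fraction=0.05):
    """Score value at which roughly the top `top_fraction` of points begins."""
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))    # number of points to flag
    return ranked[k - 1]

def mark_abnormal(scores, threshold):
    """Flag every point whose score is at or above the threshold."""
    return [s >= threshold for s in scores]

scores = [i / 100 for i in range(100)]             # hypothetical normalized scores
thr = percentile_threshold(scores, 0.05)
flags = mark_abnormal(scores, thr)
```

A fixed threshold is simply `mark_abnormal(scores, 0.8)` with a value chosen in advance; the percentile variant adapts to the score distribution of each batch.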
Setting one working day of each month as an evaluation day, and executing a strategy evaluation of the early warning effect and the operation and maintenance efficiency over the past month, wherein the formula of the data deviation recognition accuracy ACC is as follows:
TP is the number of true positives, representing the number of anomalies correctly recognized by the system; TN is the number of true negatives, representing the number of data correctly judged as normal by the system; FP is the number of false positives, representing the number of normal data incorrectly judged as anomalies by the system; FN is the number of false negatives, representing the number of actual anomalies that the system fails to recognize;
The early warning response time RT formula is:
Wherein, Tr,i represents the response time point of the ith abnormal event by the system, Tr represents the response time, Tt,i represents the actual occurrence time of the ith abnormal event, Tt represents the actual occurrence time of the abnormal event, and N represents the total number of the abnormal events in the monitoring period;
The operation and maintenance efficiency improvement percentage E is expressed as follows:
wherein OT is old operation and maintenance time, representing manual operation and maintenance time required for solving the same scale problem before introducing an automatic early warning system, NT is new operation and maintenance time, representing actual time consumption for processing the same problem after system intervention;
the CPI equation reflecting overall performance is:
wherein, RTmax represents the maximum acceptable early warning response time, alpha represents the regulating factor, and w1、w2 and w3 respectively represent the weight coefficient ratio of accuracy, response time and operation and maintenance efficiency improvement percentage.
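The formulas for ACC, RT and E are not reproduced in this text. The sketch below assumes the conventional definitions implied by the symbol descriptions: ACC = (TP+TN)/(TP+TN+FP+FN), RT as the mean of Tr,i - Tt,i over the N abnormal events, and E = (OT-NT)/OT × 100%. The CPI is left out because its exact form cannot be inferred from the symbols alone. All function names are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Assumed form: ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def mean_response_time(response_times, occurrence_times):
    """Assumed form: RT = (1/N) * sum(Tr_i - Tt_i) over the N abnormal events."""
    delays = [r - t for r, t in zip(response_times, occurrence_times)]
    return sum(delays) / len(delays)

def efficiency_gain(old_time, new_time):
    """Assumed form: E = (OT - NT) / OT * 100, expressed in percent."""
    return (old_time - new_time) / old_time * 100.0
```

For example, 40 true positives, 50 true negatives, 5 false positives and 5 false negatives give an accuracy of 0.9 under this definition.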
S5, calculating abnormal scores of the data deviations, setting data quality grades for all the data deviation points, and evaluating the influence of the data deviations on the operation performance of the new energy station.
Carrying out standardization processing on each data deviation, and forming a test data set from the standardized data deviations;
Constructing an anomaly monitoring model, setting parameters of the anomaly monitoring model, training the anomaly monitoring model by using a test data set as training data, predicting anomaly scores of data deviations in the test data set by using the trained anomaly monitoring model, traversing the data deviations in the test data set, inputting the data deviations into an isolated tree of the anomaly monitoring model, starting from a root node, traversing downwards according to splitting conditions of the isolated tree until leaf nodes are reached, recording traversal path lengths of each data point in each tree, calculating average values of path lengths of the data deviations in all the isolated trees with respect to the data deviations, taking the average values as final anomaly scores of current data deviations, and normalizing the anomaly scores to a preset range;
the calculation formula of the anomaly score is as follows;
wherein S(x) represents the anomaly score of the data deviation x, the averaged term represents the average path length of the data deviation x over all the isolated trees, C(n) represents the adjustment coefficient, T represents the total number of trees in the isolated forest, and n represents any positive integer;
It should be explained that each data point in the test data set is traversed in turn so that every data point is evaluated by the isolated forest model. The current data point is input into each isolated tree of the trained isolated forest model; each isolated tree is constructed by randomly selecting features and randomly selecting split points, so as to isolate abnormal points as quickly as possible. Starting from the root node, the tree is traversed downwards according to its splitting conditions and the feature values of the current data point: at each node, the comparison of the feature value with the split point determines whether the data point enters the left child node or the right child node, until a leaf node is reached.
In each isolated tree, the traversal path length of the current data point from the root node to the leaf node is recorded. The path length represents the number of splits the isolated tree needs to isolate the data point: the shorter the path, the more easily the data point is isolated and the more likely it is to be an abnormal point. The traversal path lengths of the current data point in all the isolated trees are collected and averaged; the average path length reflects how difficult the data point is to isolate in the isolated forest model. The average path length is then converted into an anomaly score, with a shorter average path length corresponding to a higher anomaly score, and the scores are normalized to a preset range, for example 0 to 1, which unifies the standard of the anomaly scores and facilitates the subsequent threshold setting and abnormal point identification.
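The traversal and scoring procedure can be sketched with a small pure-Python isolated forest. The score formula is not reproduced in this text, so the sketch assumes the standard isolation forest normalization S(x) = 2^(-h/c(n)), where h is the average path length and c(n) is the expected path length for n samples. All function names, the parameter defaults and the synthetic data are hypothetical.

```python
import math
import random

def c(n):
    """Expected path length of an unsuccessful search among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649      # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively partition points with random features and split values."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))            # random feature
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)                    # random split point
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(x, node, depth=0):
    """Depth reached by x, plus c(size) for points left unsplit in the leaf."""
    if "size" in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)

def fit_forest(data, n_trees=100, sample_size=64, seed=0):
    """Build n_trees isolated trees; needs at least two data points."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, sample_size

def anomaly_score(x, trees, sample_size):
    """Assumed normalization: S(x) = 2 ** (-mean path length / c(n))."""
    mean_path = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c(sample_size))

# Hypothetical standardized deviations clustered around the origin.
gen = random.Random(1)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
trees, m = fit_forest(data)
inlier_score = anomaly_score((0.0, 0.0), trees, m)
outlier_score = anomaly_score((8.0, 8.0), trees, m)
```

Under this normalization the scores already lie between 0 and 1, with values nearer 1 indicating points that are isolated after only a few splits.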
Setting a data quality level based on the anomaly score and scoring the anomaly data points;
The quality level may be set according to the score range. For example, the following levels may be set:
High-quality data: the anomaly score is in the range 0-a;
Good data: the anomaly score is in the range a-b;
Moderate data: the anomaly score is in the range b-c;
Low-quality data: the anomaly score is in the range c-d;
Abnormal data: the anomaly score is in the range d-e;
where a=0.2, b=0.4, c=0.6, d=0.8, e=1.0;
each data point is assigned a corresponding quality level based on its anomaly score.
Each outlier data point is assigned a score, which may be set according to the magnitude of the outlier score.
For example:
data points with anomaly scores in the d-e range are scored as E' (highest score);
data points with anomaly scores in the c-d range are scored as D';
data points with anomaly scores in the b-c range are scored as C';
data points with anomaly scores in the a-b range are scored as B';
data points with anomaly scores in the 0-a range are scored as A' (lowest score);
wherein A' = 1 point, B' = 2 points, C' = 3 points, D' = 4 points, E' = 5 points;
and evaluating the influence of the data deviation on the operation performance of the new energy station according to the scoring of the abnormal data points.
Summarizing scores of all abnormal data points, calculating the data point proportion and distribution condition of each quality grade, counting the number of data points of each grade and the proportion of the data points in the total data, analyzing the overall condition of data quality, and evaluating the influence of high-score (namely high abnormal score) data points on the station operation performance. High scoring data points often mean that the data bias is large and may adversely affect the operational decisions, predictions, and controls of the station.
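The grading, scoring and proportion statistics described above can be sketched as follows, using the example boundaries a to e. The text does not say which level a boundary value belongs to, so the sketch assumes it belongs to the lower level; the function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

BOUNDS = [0.2, 0.4, 0.6, 0.8, 1.0]                 # a, b, c, d, e
LEVELS = ["high-quality", "good", "moderate", "low-quality", "abnormal"]
POINTS = [1, 2, 3, 4, 5]                           # A', B', C', D', E'

def grade(score):
    """Return (quality level, points) for an anomaly score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("anomaly score must lie in [0, 1]")
    idx = bisect_left(BOUNDS, score)               # upper boundary is inclusive
    return LEVELS[idx], POINTS[idx]

def level_shares(scores):
    """Proportion of data points falling into each quality level."""
    counts = Counter(grade(s)[0] for s in scores)
    return {level: counts[level] / len(scores) for level in LEVELS}
```

For instance, `grade(0.85)` yields the abnormal level with 5 points, and `level_shares` gives the distribution that the overall quality analysis relies on.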
Specific analysis may include:
(1) Performance comparison: comparing the key performance indexes (such as generated energy and equipment utilization rate) of the time periods containing high-score data points with those of normal time periods, and identifying the specific influence of the abnormal data on performance.
(2) Operation strategy adjustment: adjusting the operation strategy and control method of the station based on the distribution and scores of the abnormal data points, for example by removing high-score data points or increasing the strength of data correction.
(3) Prediction accuracy evaluation: checking the influence of historical data containing abnormal data on the prediction model of future operation, and evaluating the accuracy and reliability of the prediction model.
And according to the evaluation result, corresponding data management and improvement measures are formulated, and the influence of data deviation on the operation performance of the new energy station is reduced.
The improvement may include:
(1) Data cleaning and preprocessing: strengthening data cleaning and preprocessing, and correcting or eliminating abnormal data points in time.
(2) Monitoring equipment optimization: optimizing the arrangement of the monitoring equipment and sensors, and improving the accuracy and reliability of data acquisition.
(3) Data fusion and correction: reducing the influence of the deviation of a single data source on the overall data quality by using multi-source data fusion and correction techniques.
And S6, based on the data quality grade, a real-time monitoring platform is established, the change trend of the data deviation is tracked, a change rate threshold is set, and early warning is carried out on the data deviation exceeding the change rate threshold.
Based on the data quality level, establishing a real-time monitoring platform;
Performing a difference operation on the data deviation indexes to obtain difference values, performing a stationarity test on the difference values, and taking the difference values that pass the stationarity test as the change rate of the data deviation;
setting a change rate threshold of the data deviation as μ0, wherein the formula of the deviation change rate μ is as follows:
$\mu = k\cdot\sigma$;
wherein σ is the standard deviation of the data, and k is the safety factor;
introducing an objective function G (k), and evaluating early warning effects under different k values to obtain an optimal safety coefficient k, wherein the formula is as follows:
wherein FPR(k) represents the false positive (false alarm) rate, FNR(k) represents the false negative (missed alarm) rate, and w4 and w5 represent the weight coefficients of the false positive rate and the false negative rate, respectively;
The formula of the optimal safety coefficient k is:
when μ > μ0, triggering early warning, and transmitting the early warning information to operation and maintenance personnel so that the early warning information can be analyzed as follows:
collecting all data deviations with the change rate exceeding a preset threshold value;
classifying the threshold-exceeding data deviations by type to obtain abnormal data points of each category;
Performing root cause analysis on abnormal data points of each category to acquire the cause of data deviation;
And according to the analysis result, periodically evaluating and adjusting the early warning rule, and optimizing the real-time monitoring platform.
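The expressions for the objective function G(k) and for the optimal safety coefficient are not reproduced in this text. The sketch below is one possible reading: it assumes G(k) = w4·FPR(k) + w5·FNR(k), that the optimal k minimizes G over a candidate grid, and that k·σ acts as the alarm threshold applied to labelled historical change rates. The function names, weights and sample data are hypothetical.

```python
def rates(values, labels, threshold):
    """False positive and false negative rate of the rule value > threshold."""
    fp = sum(1 for v, y in zip(values, labels) if v > threshold and not y)
    fn = sum(1 for v, y in zip(values, labels) if v <= threshold and y)
    negatives = sum(1 for y in labels if not y)
    positives = len(labels) - negatives
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr

def best_safety_factor(values, labels, sigma, candidates, w4=0.5, w5=0.5):
    """Candidate k minimizing the assumed G(k) = w4 * FPR(k) + w5 * FNR(k)."""
    def objective(k):
        fpr, fnr = rates(values, labels, k * sigma)
        return w4 * fpr + w5 * fnr
    return min(candidates, key=objective)

# Hypothetical labelled history: change rates and whether each was a real anomaly.
history = [0.1, 0.2, 0.3, 0.4, 1.5, 2.0]
is_anomaly = [False, False, False, False, True, True]
k_best = best_safety_factor(history, is_anomaly, sigma=0.5,
                            candidates=[0.5, 1.0, 2.0, 3.0, 5.0])
```

When several candidates reach the same objective value, `min` returns the first one in the list, so the candidate order acts as a tie-break.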
It should be explained that a platform capable of collecting, processing and displaying data in real time is constructed by selecting a suitable technical framework and tools; commonly used tools include Kafka, InfluxDB and Grafana. The real-time monitoring platform needs data collection, storage, analysis and visualization capabilities so that changes and deviations of the data can be monitored at any time. The various indexes in the real-time data are collected and data deviation values are calculated, where a deviation value represents the difference between the actually observed data and the expected data. A difference operation is then performed on the collected data deviation values. Differencing calculates the difference between the data values of adjacent time points; it can remove trend components in time series data (and, when taken at the seasonal lag, seasonal components), so that the data become more stationary.
In the difference operation step, the difference between the deviation value of each time point and the deviation value of the previous time point is calculated, that is, difference value = deviation value at the current time point - deviation value at the previous time point.
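The difference operation amounts to a single pass over adjacent pairs, as in this minimal sketch (the function name is hypothetical):

```python
def first_difference(deviations):
    """difference[t] = deviation[t] - deviation[t-1] for adjacent time points."""
    return [cur - prev for prev, cur in zip(deviations, deviations[1:])]
```

The result is one element shorter than the input, because the first time point has no predecessor.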
The stationarity test ensures that the differenced data are stationary, that is, that the statistical properties of the data (such as mean and variance) do not change with time; stationary data are more suitable for time series analysis and prediction. Commonly used stationarity tests include the ADF (Augmented Dickey-Fuller) test and the KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test.
ADF test: the null hypothesis is that the data have a unit root (that is, the data are non-stationary); if the test result is significant, the null hypothesis is rejected, indicating that the data are stationary.
KPSS test: the null hypothesis is that the data are stationary; if the test result is significant, the null hypothesis is rejected, indicating that the data are non-stationary.
The stationarity test is carried out on the differenced data deviation values. If the data pass the test, they are considered stationary, and the difference values that pass the test are regarded as the change rate of the data deviation, reflecting how the data deviation changes over time. The calculated change rate of the data deviation is stored in the real-time monitoring platform for subsequent analysis and early warning.
According to historical data and business requirements, a reasonable data deviation change rate threshold is set, and the change rate threshold can be a single fixed value or can be dynamically adjusted according to different time periods and conditions.
For example, a change rate threshold is set, and when the change rate of the data deviation exceeds the change rate threshold, the data deviation is considered to be abnormal in change, and attention is required;
A specific early warning rule is defined so that early warning is triggered when the change rate of the data deviation exceeds the set threshold. The early warning can be delivered in various forms, such as short messages and mail notifications, and the triggering conditions are specified explicitly, for example that the change rate of the data deviation exceeds the threshold at several consecutive time points, or that the change rate at a single time point is abnormally high. The early warning function is implemented in the real-time monitoring platform, which monitors the change rate of the data deviation in real time, triggers the early warning mechanism when the change rate exceeds the preset threshold, and promptly notifies the relevant personnel to take corresponding measures, so as to avoid a significant impact of the data deviation on the service.
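The two example triggering conditions can be combined in a short sketch. The number of consecutive points, the spike multiplier and the use of the absolute change rate are assumptions chosen for illustration, and the function name is hypothetical.

```python
def should_alert(change_rates, threshold, consecutive=3, spike_factor=3.0):
    """Alert on `consecutive` successive exceedances or on one extreme spike."""
    run = 0
    for rate in change_rates:
        magnitude = abs(rate)
        if magnitude > spike_factor * threshold:
            return True                    # single abnormally high time point
        run = run + 1 if magnitude > threshold else 0
        if run >= consecutive:
            return True                    # sustained exceedance
    return False
```

Requiring several consecutive exceedances suppresses isolated noise, while the spike condition still reacts immediately to a large jump.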
All data deviation points whose change rate exceeds the preset threshold are collected from the real-time monitoring platform, and are recorded and stored for later analysis. Classification standards are set according to the characteristics of the data deviation (such as time period, equipment type and geographical position), and the collected threshold-exceeding data deviations are classified accordingly to obtain the abnormal data points of each category, which are sorted and recorded. Using analysis tools and methods such as fault tree analysis (FTA), fishbone (Ishikawa) diagrams and causal analysis, the abnormal data points of each category are traced in detail according to the classification result to find the specific causes of the data deviation. The root causes of the data deviation of each category, which may include equipment faults, data transmission problems, environmental factors and the like, are summarized, and the result of the root cause analysis is recorded to form a detailed analysis report. According to the result of the root cause analysis, the effectiveness of the current early warning rule is evaluated, that is, whether the rule identifies important abnormal data points timely and accurately; feedback from the relevant personnel is collected to understand the effect and shortcomings of the rule in practical application. The early warning threshold is adjusted according to the analysis of the data deviation and its change rate so that early warning becomes more accurate and timely, and the early warning rule is updated and refined according to newly discovered root causes, for example by formulating a more refined early warning strategy for a specific type of abnormal data point. The early warning rule and the monitoring platform are checked and evaluated regularly to ensure that they remain in an optimal state, and are continuously improved according to actual operating conditions and analysis results to improve performance and reliability.
The embodiment also provides a new energy station monitoring data quality evaluation system based on multi-source data, which comprises a data acquisition and preprocessing module, a data deviation recognition model training module, an integrated learning network and deviation recognition module, a data deviation abnormal score and quality evaluation module, and a real-time monitoring and early warning response module. The data acquisition and preprocessing module is used for acquiring the multi-source monitoring data collected by each monitoring device in the new energy station and preprocessing the multi-source monitoring data. The data deviation recognition model training module is used for training the data deviation recognition model on the preprocessed multi-source monitoring data by using the Baum-Welch algorithm. The integrated learning network and deviation recognition module is used for integrating the learning network and the data deviation recognition model, constructing an observation sequence, converting the predicted hidden state sequence into state label data, and recognizing the data deviation among the monitoring data sources. The data deviation abnormal score and quality evaluation module is used for calculating the abnormal score of the data deviation, setting a data quality grade for each data deviation point, and evaluating the influence of the data deviation on the running performance of the new energy station. The real-time monitoring and early warning response module is used for establishing a real-time monitoring platform based on the data quality grade, tracking the change trend of the data deviation, setting a change rate threshold, and carrying out early warning on data deviation exceeding the change rate threshold.
The embodiment also provides computer equipment applicable to the new energy station monitoring data quality evaluation method based on multi-source data. The computer equipment comprises a memory and a processor, wherein the memory is used for storing computer executable instructions, and the processor is used for executing the computer executable instructions to implement the new energy station monitoring data quality evaluation method based on multi-source data provided by the above embodiment.
The computer device may be a terminal comprising a processor, a memory, a communication interface, a display screen and input means connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
The present embodiment also provides a storage medium having a computer program stored thereon which, when executed by a processor, implements the new energy station monitoring data quality evaluation method based on multi-source data proposed in the above embodiments. The storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk.
In summary, the invention solves the problems of accurate identification and dynamic quality evaluation of new energy station monitoring data deviation in a multi-source data environment, and realizes efficient evaluation and intelligent early warning of the influence of the data deviation on running performance.
Example 2: Referring to Table 1, the second example of the present invention provides experimental simulation data for the new energy station monitoring data quality evaluation method based on multi-source data, in order to further verify the advancement of the invention.
A typical wind farm was chosen as the test subject. The wind farm is equipped with an advanced sensor network comprising multi-source monitoring equipment for wind speed, temperature, humidity and equipment vibration, with 200 monitoring points in total, from which 5 monitoring points were randomly selected for the experiment. The experiment compares the traditional monitoring method with the method provided by the invention in terms of data preprocessing, deviation recognition, anomaly detection and real-time monitoring.
First, raw data from all monitoring points is continuously collected and cleaned: missing values and obvious outliers are removed to ensure data integrity, and the data is standardized so that the different types of monitoring data are comparable.
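The cleaning and standardization step can be illustrated with a minimal Python sketch. The source does not state which outlier rule or standardization formula is used, so the function name preprocess, the median-absolute-deviation rule, the cutoff of 3.5 and the z-score standardization are all assumptions made for illustration only.

```python
import math
import statistics


def preprocess(readings, mad_cutoff=3.5):
    """Clean and standardize one monitoring channel (illustrative sketch).

    readings   : list of floats; None or NaN marks a missing sample.
    mad_cutoff : hypothetical cutoff on the modified z-score.
    """
    # 1. Drop missing values.
    present = [x for x in readings if x is not None and not math.isnan(x)]
    if not present:
        return []

    # 2. Remove obvious outliers with the median absolute deviation (MAD),
    #    which is less distorted by the outliers than mean and std are.
    median = statistics.median(present)
    mad = statistics.median([abs(x - median) for x in present])
    if mad > 0:
        kept = [x for x in present
                if 0.6745 * abs(x - median) / mad <= mad_cutoff]
    else:
        kept = present  # all values (nearly) identical: nothing to remove

    # 3. Standardize to zero mean and unit variance (z-score) so that
    #    channels with different units become comparable.
    mean = statistics.fmean(kept)
    std = statistics.pstdev(kept)
    if std == 0:
        return [0.0] * len(kept)
    return [(x - mean) / std for x in kept]


# Made-up wind speed samples with one gap and one obvious spike.
wind_speed = [7.9, 8.1, None, 8.0, 8.2, 55.0, 7.8]
standardized = preprocess(wind_speed)
print(len(standardized))  # the gap and the spike are dropped, 5 values remain
```

Applying the same function to each channel (wind speed, temperature, humidity, vibration) puts all channels on a common scale before model training.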
Next, the preprocessed data is used to train a data deviation recognition model with the Baum-Welch algorithm. The model can automatically identify potential deviations between the monitored data sources.
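As a hedged illustration of the training step, the sketch below fits a small discrete hidden Markov model with a scaled Baum-Welch (expectation-maximization) loop in plain Python. The source does not specify the model structure, so the number of hidden states, the discretization of the monitoring data into symbols, the random initialization and the function name baum_welch are assumptions, not the patented configuration.

```python
import math
import random


def baum_welch(obs, n_states, n_symbols, n_iter=20, seed=0):
    """Fit a discrete HMM to one symbol sequence; returns (pi, A, B, log_liks)."""
    rng = random.Random(seed)

    def random_dist(n):
        w = [rng.random() + 0.5 for _ in range(n)]
        total = sum(w)
        return [x / total for x in w]

    pi = random_dist(n_states)                           # initial state probs
    A = [random_dist(n_states) for _ in range(n_states)]   # transitions
    B = [random_dist(n_symbols) for _ in range(n_states)]  # emissions
    T, S = len(obs), range(n_states)
    log_liks = []

    for _ in range(n_iter):
        # Forward pass, rescaled at every step to avoid underflow.
        alpha, scale = [], []
        for t in range(T):
            if t == 0:
                row = [pi[i] * B[i][obs[0]] for i in S]
            else:
                row = [sum(alpha[t - 1][i] * A[i][j] for i in S) * B[j][obs[t]]
                       for j in S]
            c = sum(row)
            scale.append(c)
            alpha.append([x / c for x in row])
        # Log-likelihood of the sequence under the current parameters.
        log_liks.append(sum(math.log(c) for c in scale))

        # Backward pass using the same scaling factors.
        beta = [[1.0] * n_states for _ in range(T)]
        for t in range(T - 2, -1, -1):
            for i in S:
                beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                                 for j in S) / scale[t + 1]

        # E-step: state posteriors and expected transition counts.
        gamma = [[alpha[t][i] * beta[t][i] for i in S] for t in range(T)]
        xi_sum = [[0.0] * n_states for _ in S]
        for t in range(T - 1):
            for i in S:
                for j in S:
                    xi_sum[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                     * beta[t + 1][j] / scale[t + 1])

        # M-step: re-estimate the parameters from the expected counts.
        pi = gamma[0][:]
        for i in S:
            out = sum(gamma[t][i] for t in range(T - 1))
            A[i] = [xi_sum[i][j] / out for j in S]
            occ = sum(gamma[t][i] for t in range(T))
            B[i] = [sum(gamma[t][i] for t in range(T) if obs[t] == k) / occ
                    for k in range(n_symbols)]
    return pi, A, B, log_liks


# Hypothetical symbols: 0 = sources agree, 1 = small gap, 2 = large gap.
symbols = [0, 0, 0, 1, 0, 0, 2, 2, 1, 2, 2, 0, 0, 0, 1, 0, 0, 2, 2, 2, 1, 0, 0, 0]
pi, A, B, log_liks = baum_welch(symbols, n_states=2, n_symbols=3)
print(log_liks[0], log_liks[-1])  # the log-likelihood should not decrease
```

In this sketch the two hidden states could be read as "consistent" and "deviating" regimes, but that interpretation is an assumption; a production system would more likely rely on an established HMM library and continuous emission distributions.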
The trained model is then combined with an integrated learning network to identify the data deviations among all the monitoring data sources. Deviations are identified by comparing the observed data with the model prediction results.
Then, an anomaly score is calculated for each data deviation, and the data quality is classified into four grades (excellent, good, general, poor) according to the score. This helps to assess the impact of data deviation on wind farm performance.
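The mapping from anomaly score to quality grade can be sketched as a simple threshold lookup. The source names the four grades but does not give the score boundaries, so the values in GRADE_BOUNDS below are hypothetical; they were chosen only so that the scores reported in this example (0.12 rated excellent, 0.49 rated poor) fall into the stated grades.

```python
from bisect import bisect_right

# Hypothetical upper bounds of the "excellent", "good" and "general" grades;
# the actual boundaries are not given in the source.
GRADE_BOUNDS = (0.15, 0.30, 0.45)
GRADES = ("excellent", "good", "general", "poor")


def quality_grade(anomaly_score):
    """Map an anomaly score (lower is better) to one of the four grades."""
    # bisect_right counts how many bounds are <= the score, which is
    # exactly the index of the grade the score falls into.
    return GRADES[bisect_right(GRADE_BOUNDS, anomaly_score)]


print(quality_grade(0.12))  # excellent
print(quality_grade(0.49))  # poor
```

Keeping the boundaries in a single tuple makes it easy to tune the grading to a particular station without changing the lookup logic.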
Finally, a real-time monitoring platform is established to track the change trend of the data deviation, and a change rate threshold is set. When the change rate of the data deviation exceeds the threshold, the system automatically triggers an early warning so that operation and maintenance personnel can respond in time.
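The early-warning rule can be sketched as a check on the rate of change between consecutive deviation samples. The sampling interval, the threshold value, the sample data and the function name change_rate_alerts are assumptions for illustration; the source does not specify how the change rate is computed.

```python
def change_rate_alerts(deviations, interval_s, rate_threshold):
    """Return (index, rate) pairs where the deviation changes too fast.

    deviations     : deviation values sampled at a fixed interval
    interval_s     : sampling interval in seconds (assumed constant)
    rate_threshold : hypothetical limit on |change per second|
    """
    alerts = []
    for k in range(1, len(deviations)):
        # Finite-difference estimate of the change rate at sample k.
        rate = (deviations[k] - deviations[k - 1]) / interval_s
        if abs(rate) > rate_threshold:
            alerts.append((k, rate))
    return alerts


# Made-up deviation trend sampled once per minute, with one sudden jump.
trend = [0.10, 0.11, 0.12, 0.30, 0.31]
for index, rate in change_rate_alerts(trend, interval_s=60, rate_threshold=0.001):
    print(f"early warning at sample {index}: rate {rate:.4f} per second")
```

With these made-up values only the jump from 0.12 to 0.30 is reported, since the other steps change by far less than the threshold per second.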
The specific experimental records are shown in Table 1:
Table 1 Experimental records
The effectiveness and superiority of the method are evident from the comparison data. As Table 1 shows, monitoring point 001 has a data deviation anomaly score of 0.12 and is rated "excellent", meaning that the deviation of this data source has very little influence on the running performance of the wind farm, with an impact index of only 0.03. In contrast, monitoring point 004 has a score as high as 0.49 and is rated "poor"; its impact index on running performance is 0.22, significantly higher than that of the other monitoring points.
The proposed system also has a clear advantage in data deviation recognition accuracy. For example, the deviation recognition accuracy at monitoring point 003 is 88%, whereas the conventional method achieves only 78%. The method can therefore not only identify data deviation accurately but also evaluate the influence of the deviation on running performance, improving the operation and maintenance efficiency and the safety of the wind farm.
In summary, the comparative experimental data demonstrate the innovation and practicability of the new energy station monitoring data quality evaluation system based on multi-source data. The system outperforms the traditional method in data preprocessing, deviation recognition, anomaly detection and real-time monitoring, and provides a more reliable and efficient monitoring solution for new energy stations.
It should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the present invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications and substitutions are intended to be covered by the scope of the claims of the present invention.