Disclosure of Invention
The invention aims to provide a low-voltage distribution network voltage control method and system based on deep reinforcement learning. By introducing a Transformer network and the SAC (Soft Actor-Critic) algorithm, the method solves the problem that voltage is difficult to regulate when new energy is connected to the low-voltage distribution network on a large scale. Specifically, global features are extracted by the Transformer network and a control strategy is optimized with the SAC algorithm, thereby realizing optimal control of the low-voltage distribution network voltage. As a powerful sequence modeling tool, the Transformer network can extract effective global features from high-dimensional data and overcomes the shortcomings of traditional deep learning methods in handling complex low-voltage distribution network states. The SAC algorithm enhances the exploration capability and stability of the system by optimizing policy entropy and expected return, further improves the accuracy and robustness of voltage control, and provides a more efficient solution for voltage control of the low-voltage distribution network.
The invention is realized by at least one of the following technical schemes.
The low-voltage power distribution network voltage control method based on deep reinforcement learning comprises the following steps:
Embedding the trained Transformer-SAC model into the voltage control system of the low-voltage power distribution network, and enabling the Transformer-SAC model to interact with the environment of the low-voltage power distribution network to form an action strategy;
the action strategy is issued to the execution equipment to control the power of the distributed power supply equipment and the switching quantity of the reactive compensators;
the execution equipment executes the action after receiving the command, the low-voltage distribution network forms a new state variable, and the new state variable is fed back to the Transformer-SAC model in real time for optimization;
the Transformer-SAC model and the environment of the low-voltage power distribution network interact to form an action strategy, and the method comprises the following steps:
S1, extracting an associated feature matrix from the state variables of the low-voltage power distribution network by using a Transformer network, and outputting a feature matrix after the associated feature matrix is processed by the Transformer network;
S2, inputting the feature matrix output by the Transformer network into the SAC algorithm to generate an action strategy.
Further, in step S1, a Transformer network is used to extract an associated feature matrix from the state variables of the low-voltage power distribution network, and the associated feature matrix is processed by the Transformer network to output a feature matrix, which comprises the following steps:
S1.1, extracting the node voltage, active power load, reactive power load and node voltage sensitivity in the state variables of the low-voltage power distribution network as an associated feature matrix;
S1.2, calculating the attention weight of each attention head from the associated feature matrix through a multi-head attention mechanism;
S1.3, splicing the weights of all attention heads to obtain a spliced feature matrix, and performing linear mapping; adding the spliced feature matrix and the associated feature matrix to realize residual connection, and carrying out normalization processing to obtain a normalized feature matrix;
S1.4, inputting the normalized feature matrix into a feedforward neural network to obtain the feature matrix processed by the feedforward neural network.
Further, in step S2, the SAC algorithm includes:
1) The policy optimization objective of the SAC algorithm, which is to maximize the expected return and the policy entropy;
2) Experience replay, namely storing the interaction data of the agent with the environment, where the interaction data include the states, actions and rewards fed back by the environment;
3) Updating the Q network by minimizing the Bellman error of the Q function;
4) Updating the policy network by maximizing the policy entropy and the Q value.
Further, the reward function $r_t(s_t,a_t)$ is calculated by the following formula:

$$r_t(s_t,a_t) = -\big(\lambda_1 f_{V}(t) + \lambda_2 f_{C}(t) + \lambda_3 f_{P}(t) + \lambda_4 f_{S}(t)\big)$$

where $f_{V}(t)$ is the voltage out-of-limit penalty value at time $t$, $f_{C}(t)$ is the control cost at time $t$, $f_{P}(t)$ is the power loss penalty of the low-voltage distribution network at time $t$, $f_{S}(t)$ is the node voltage sensitivity penalty value at time $t$, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the weights of the corresponding sub-items; $a_t$ is the decision action at time $t$ and $s_t$ is the state variable of the low-voltage distribution network at time $t$.
Further, training of the Transformer-SAC model comprises the following steps:
(1) Constructing a virtual environment of the low-voltage power distribution network in a computer based on the historical data of the low-voltage power distribution network; the Transformer network acquires state variables from the virtual environment, extracts the associated feature matrix and generates a new state variable;
(2) Inputting the new state variable into the SAC algorithm to obtain an action strategy and output an action;
(3) Executing the action by the execution equipment to change the state of the virtual environment of the low-voltage power distribution network, and feeding the changed state variables back to the Transformer-SAC model;
(4) Optimizing the Transformer-SAC model by using the changed state variables;
(5) Updating the action strategy with the optimized Transformer-SAC model;
Repeating steps (3) to (5) to carry out the learning iterations of the Transformer-SAC model until convergence, completing the training of the Transformer-SAC model.
Further, the optimization of the Transformer-SAC model includes the optimization of the parameters of the Transformer network: a loss function that minimizes the mean square error is used, gradients are back-propagated through the chain rule, and the parameters of the Transformer network are updated by an adaptive moment estimation optimizer.
Further, embedding the trained Transformer-SAC model into the voltage control system of the low-voltage power distribution network, and enabling the Transformer-SAC model to interact with the environment of the low-voltage power distribution network to form an action strategy, comprises the following steps:
1.1, generating an action strategy with the trained Transformer-SAC model based on the real-time power grid state, and issuing action instructions to the execution equipment of the low-voltage power distribution network;
1.2, executing the selected control actions by the execution equipment to change the state of the actual low-voltage power distribution network environment, and feeding the changed power grid state back to the trained Transformer-SAC model;
1.3, performing feature processing on the input power grid state by the Transformer network and optimizing the weights and bias terms of the Transformer network; the Transformer network extracts the latest associated feature matrix from the low-voltage power distribution network state updated in real time and inputs it into the SAC algorithm to generate a control strategy, completing the real-time iterative training of the Transformer-SAC model and the voltage regulation control of the low-voltage power distribution network.
The system for realizing the low-voltage distribution network voltage control method based on deep reinforcement learning comprises the voltage control system of the low-voltage distribution network and a Transformer-SAC control module;
the Transformer-SAC control module comprises a Transformer-SAC model for extracting features from the power grid state and generating a control strategy, wherein the Transformer network is used for extracting features and optimizing the input features of the control strategy, and the SAC algorithm is used for generating an optimized control strategy and is trained through a policy network and a Q network;
The voltage control system of the low-voltage power distribution network comprises a power distribution network state monitoring system, a central controller and execution equipment;
The power distribution network state monitoring system monitors the state of the power grid in real time through sensors, and the sensors feed real-time data back to the Transformer-SAC model for evaluating the effect of the control strategy;
the central controller sends the control instructions formed by the Transformer-SAC model to the execution equipment through a communication network; after receiving a control instruction, the execution equipment adjusts its running state according to the instruction; the environment state during equipment execution is fed back to the Transformer-SAC model in real time through the sensors, and the Transformer-SAC model realizes voltage control and optimization of the Transformer network parameters in the Transformer-SAC model according to this feedback.
Further, the execution equipment comprises a reactive compensator for adjusting the reactive power output of the power grid and distributed power supply equipment for adjusting the active power output according to the control instructions.
Further, the communication network employs the IEC 61850 standardized communication protocol, and control instructions are transmitted from the central controller to each substation and forwarded by the substation to the execution devices.
The method combines a Transformer network and the SAC algorithm, solves the problem of forward and reverse voltage limit violations when massive new energy is connected to the low-voltage distribution network, and realizes optimal control of the low-voltage distribution network voltage. When handling the low-voltage distribution network voltage optimization problem, the method has obvious advantages in global feature extraction, robustness and adaptability, the real-time feedback mechanism, and adaptability to topology changes.
Compared with the prior art, the invention has the following beneficial effects:
(1) By introducing a Transformer network for global feature extraction, the self-attention mechanism of the Transformer network is used to identify complex dependency relationships among different nodes in the low-voltage power distribution network, so that global information among all nodes is captured when various factors such as voltage, power and load of the low-voltage power distribution network are considered, improving the accuracy and adaptability of the voltage control strategy.
(2) According to the invention, entropy regularization is introduced so that the agent can balance exploration and exploitation during optimization, which improves the robustness of the system; in particular, a relatively stable control effect can be maintained in the face of uncertainty in the operating conditions of the power grid.
(3) Through real-time interactive feedback with the low-voltage distribution network environment, the agent can dynamically adjust and optimize control decisions according to the immediate rewards and the new state. This adaptive learning process allows the agent to adjust the control strategy in real time as the operating conditions of the power grid continuously change, thereby ensuring the optimal voltage regulation effect and the stable operation of the power grid.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the low-voltage distribution network voltage control method based on deep reinforcement learning of the embodiment includes the following steps:
The method comprises the steps of embedding a trained Transformer-SAC model into the low-voltage distribution network, having the Transformer-SAC model interact with the environment of the low-voltage distribution network to form an action strategy, issuing the action strategy to an intelligent controller at the switch or user side to control the power of the distributed power supply equipment and the switching quantity of the reactive compensators, executing the actions by the distributed power supply equipment and the reactive compensators after receiving the commands, forming new environment data, and feeding the new environment data back to the Transformer-SAC model for self-optimization in real time.
The action strategy formed by the interaction of the Transformer-SAC model with the environment of the low-voltage power distribution network comprises the following steps:
S1, extracting an associated feature matrix from the state variables of the low-voltage power distribution network by using a Transformer network, and outputting a feature matrix after processing by the Transformer network, which specifically comprises the following steps:
S1.1, extract, from the state variable $s_t$ of the low-voltage distribution network at time $t$, the node voltage $U_t$, the active power load $P^{\mathrm{L}}_t$, the reactive power load $Q^{\mathrm{L}}_t$ and the node voltage sensitivity $S_t$ as the associated feature matrix:

$$X_t = \big[\,U_t,\ P^{\mathrm{L}}_t,\ Q^{\mathrm{L}}_t,\ S_t\,\big] \in \mathbb{R}^{N\times F}$$

where $\mathbb{R}^{N\times F}$ denotes the real space of dimension $N\times F$, $N$ is the number of nodes and $F$ is the number of features; in this embodiment there are 4 features, namely $F=4$.
The state variable $s_t$ of the low-voltage distribution network in this embodiment includes the node voltage $U_t$, the active power load $P^{\mathrm{L}}_t$, the reactive power load $Q^{\mathrm{L}}_t$, the network topology connection matrix $A$, the node voltage sensitivity $S_t$, the distributed energy generation power $P^{\mathrm{DG}}_t$ and the capacitor switching amount $Q^{\mathrm{C}}_t$.
The node voltage is $U_t=[U_{1,t},\dots,U_{N,t}]$, expressed in per unit, where $U_{i,t}$ is the voltage value of node $i$ at time $t$ and $N$ is the number of nodes; converted to per unit, each node voltage lies in the range 0.95–1.05.
The active power load is $P^{\mathrm{L}}_t=[P^{\mathrm{L}}_{1,t},\dots,P^{\mathrm{L}}_{N,t}]$, in kilowatts, where $P^{\mathrm{L}}_{i,t}$ is the active power load of node $i$ at time $t$; typical base loads lie in the range [20 kW, 500 kW].
The reactive power load is $Q^{\mathrm{L}}_t=[Q^{\mathrm{L}}_{1,t},\dots,Q^{\mathrm{L}}_{N,t}]$, in kilovolt-amperes, where $Q^{\mathrm{L}}_{i,t}$ is the reactive power load of node $i$ at time $t$.
The distributed energy generation power is $P^{\mathrm{DG}}_t=[P^{\mathrm{DG}}_{1,t},\dots,P^{\mathrm{DG}}_{M,t}]$, in kilowatts, where $P^{\mathrm{DG}}_{j,t}$ is the generation power of distributed power supply $j$ at time $t$ and $M$ is the number of distributed power supplies; the distributed energy generation range is [0 kW, 250 kW].
The capacitor switching amount is $Q^{\mathrm{C}}_t$, in kilovolt-amperes, where $Q^{\mathrm{C}}_{k,t}$ is the reactive switching amount of capacitor $k$ at time $t$.
The network topology matrix $A$ represents the connection relations between nodes: matrix element $A_{ij}=1$ indicates that node $i$ and node $j$ are connected, and $A_{ij}=0$ indicates that node $i$ and node $j$ are not connected.
As one embodiment, the connections between the nodes in this embodiment follow the standard topology of the IEEE 33-node low-voltage distribution network, and the network topology matrix $A$ contains the connection information of the 33 nodes. The node voltage sensitivity $S_{i,t}$ denotes the sensitivity of the voltage of node $i$ to active power changes at time $t$ and is calculated by the following equation:

$$S_{i,t} = \frac{\Delta U_{i,t}}{\Delta P_{i,t}}$$

where $\Delta U_{i,t}$ is the difference in the per-unit voltage of node $i$ between the preceding and current time instants, and $\Delta P_{i,t}$ is the active power difference of node $i$ between the preceding and current time instants.
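As an illustration of step S1.1 and the sensitivity calculation above, the following NumPy sketch assembles the N×4 associated feature matrix from per-unit node voltages, active and reactive loads and the finite-difference voltage sensitivity; the array names, shapes and example value ranges are assumptions for illustration only, not values fixed by the invention.

```python
import numpy as np

def voltage_sensitivity(u_now, u_prev, p_now, p_prev, eps=1e-6):
    """Finite-difference sensitivity S_i = dU_i / dP_i between two consecutive time steps."""
    dp = p_now - p_prev
    dp = np.where(np.abs(dp) < eps, eps, dp)   # guard against division by zero
    return (u_now - u_prev) / dp

def build_feature_matrix(u_now, u_prev, p_now, p_prev, q_now):
    """Stack node voltage, active load, reactive load and sensitivity into an N x 4 matrix."""
    s = voltage_sensitivity(u_now, u_prev, p_now, p_prev)
    return np.stack([u_now, p_now, q_now, s], axis=1)   # shape (N, 4)

# Example with N = 33 nodes (the IEEE 33-node case used in this embodiment)
N = 33
u_prev, u_now = np.full(N, 1.0), np.random.uniform(0.95, 1.05, N)             # per-unit voltages
p_prev, p_now = np.random.uniform(20, 500, N), np.random.uniform(20, 500, N)  # active loads, kW
q_now = np.random.uniform(10, 250, N)                                          # reactive loads
X_t = build_feature_matrix(u_now, u_prev, p_now, p_prev, q_now)
print(X_t.shape)  # (33, 4)
```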
S1.2, the associated feature matrix $X_t$ is passed through a multi-head attention mechanism to calculate the attention weight of each attention head:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_i\mathbf{K}_i^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_i$$

$$\mathbf{Q}_i = X_t W_i^{Q},\qquad \mathbf{K}_i = X_t W_i^{K},\qquad \mathbf{V}_i = X_t W_i^{V}$$

where $\mathrm{head}_i$ is the weight of the $i$-th attention head, $i=1,\dots,h$, and $h$ is the number of attention heads; $\mathbf{Q}_i$, $\mathbf{K}_i$ and $\mathbf{V}_i$ respectively denote the query matrix, key matrix and value matrix, $\mathbf{Q}_i,\mathbf{K}_i,\mathbf{V}_i\in\mathbb{R}^{N\times d_k}$, where $\mathbb{R}^{N\times d_k}$ denotes the real space of dimension $N\times d_k$, $d_k$ is the dimension of each attention head, and they are obtained by linear transformation of the input features $X_t$ at time $t$; $\mathrm{softmax}(\cdot)$ denotes the normalization operation; $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are the weight matrices of the linear transformations, obtained adaptively as parameters during training of the Transformer network; $X_t\in\mathbb{R}^{N\times F}$, where $F$ is the number of features.
S1.3, the outputs of all attention heads are spliced to obtain the spliced feature matrix $X'_t$, and a linear mapping is applied:

$$X'_t = \mathrm{Concat}\big(\mathrm{head}_1,\dots,\mathrm{head}_h\big)\,W^{O}$$

where the $\mathrm{Concat}(\cdot)$ function transversely splices the output matrices of the attention heads along the channel dimension to form the complete spliced feature matrix, $h$ is the total number of attention heads, $\mathrm{head}_i$ is the $i$-th attention head, and $W^{O}$ is the output weight matrix of the attention layer. The spliced feature matrix $X'_t$ is added to the input associated feature matrix $X_t$ to realize a residual connection, followed by normalization:

$$X^{\mathrm{norm}}_t = \mathrm{LayerNorm}\big(X_t + X'_t\big)$$

where $\mathrm{LayerNorm}(\cdot)$ is the normalization operation and $X^{\mathrm{norm}}_t$ denotes the normalized feature matrix.
S1.4, the normalized feature matrix $X^{\mathrm{norm}}_t$ is passed through a feed-forward neural network:

$$H_t = \mathrm{ReLU}\big(X^{\mathrm{norm}}_t W_1 + b_1\big)$$

$$F_t = H_t W_2 + b_2$$

where $H_t$ is the feature matrix after the first feed-forward layer, $F_t$ is the output matrix processed by the feed-forward neural network, $W_1$ and $W_2$ are the weight matrices of the two layers of the feed-forward network, $d_{\mathrm{ff}}$ is the middle-layer dimension of the feed-forward network, $W_1\in\mathbb{R}^{F\times d_{\mathrm{ff}}}$ and $W_2\in\mathbb{R}^{d_{\mathrm{ff}}\times F}$; $b_1$ is the bias of the first feed-forward layer, of size $d_{\mathrm{ff}}$, applied to the linear transformation result of the first layer; $b_2$ is the bias of the second feed-forward layer, of size $F$, applied to the linear transformation result of the second layer; $\mathrm{ReLU}(\cdot)$ is the activation function, a rectified linear unit applied to the linear transformation result. The matrix $F_t$ processed by the feed-forward neural network and $X^{\mathrm{norm}}_t$ are then normalized:

$$Z_t = \mathrm{LayerNorm}\big(X^{\mathrm{norm}}_t + F_t\big)$$

where $Z_t$ represents the output features of the Transformer network.
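A minimal PyTorch sketch of the feature-extraction block described in S1.2–S1.4: multi-head self-attention, a residual connection with layer normalization, and a two-layer feed-forward sub-layer with a second residual connection and normalization. An input embedding is added so that the model dimension divides evenly across the attention heads; this embedding and all layer sizes are implementation assumptions rather than values stated in the text.

```python
import torch
import torch.nn as nn

class FeatureEncoderBlock(nn.Module):
    """One Transformer encoder block operating on the (N, F) associated feature matrix."""
    def __init__(self, num_features=4, d_model=64, num_heads=4, d_ff=128):
        super().__init__()
        self.embed = nn.Linear(num_features, d_model)        # lift the F=4 node features to d_model
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                             # two-layer feed-forward network, ReLU activation
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                     # x: (batch, N, F)
        h = self.embed(x)
        attn_out, _ = self.attn(h, h, h)                      # S1.2: multi-head self-attention
        h = self.norm1(h + attn_out)                          # S1.3: residual connection + LayerNorm
        z = self.norm2(h + self.ffn(h))                       # S1.4: feed-forward + residual + LayerNorm
        return z                                              # (batch, N, d_model) output features

# Example: encode one state of a 33-node network with 4 features per node
x_t = torch.randn(1, 33, 4)
z_t = FeatureEncoderBlock()(x_t)
print(z_t.shape)  # torch.Size([1, 33, 64])
```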
S2, inputting the feature matrix output by the Transformer network into the SAC algorithm to generate an action strategy. The SAC algorithm comprises:
1) The policy optimization objective of the SAC algorithm is to maximize the expected return and the policy entropy:

$$J(\pi) = \sum_{t=0}^{\infty}\mathbb{E}_{(s_t,a_t)\sim\rho_{\pi}}\Big[\gamma^{t}\big(r(s_t,a_t)+\alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\big)\Big]$$

where $\alpha$ is the entropy regularization coefficient, $\mathcal{H}(\pi(\cdot\mid s_t))$ is the policy entropy, $r(s_t,a_t)$ is the reward function, $a_t$ is the decision action at time $t$, and $s_t$ is the state variable of the low-voltage power distribution network; $\pi$ is the policy in reinforcement learning, used to represent the rule for taking actions on the low-voltage distribution network; $\pi(\cdot\mid s_t)$ is the probability distribution over actions selected under the state variable $s_t$; $\gamma$ is the discount factor of the expected reward, ensuring the attenuation of long-term rewards; $\mathbb{E}_{(s_t,a_t)\sim\rho_{\pi}}[\cdot]$ denotes the expectation operation based on the policy $\pi$; $J(\pi)$ denotes the objective function of the policy $\pi$; the summation runs over an infinite horizon.
2) Experience replay: the interaction data of the agent with the environment, including the states, actions and rewards fed back by the environment, are stored. During training, the agent interacts with the environment of the low-voltage power distribution network and collects the states, actions, rewards and other information fed back by the environment, which form the experience learning samples. Within a training period, the agent stores the experience learning samples in an experience replay pool and samples them randomly (i.e., takes samples from the storage pool for learning), which increases the diversity of the samples and avoids training bias in the reinforcement learning process; a minimal sketch of this buffer is given after item 4) below.
3) The Q network is updated by minimizing the Bellman error of the Q function:

$$J_Q = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}}\bigg[\frac{1}{2}\Big(Q(s_t,a_t)-\big(r_t+\gamma\,\mathbb{E}_{a_{t+1}\sim\pi}\big[Q(s_{t+1},a_{t+1})-\alpha\log\pi(a_{t+1}\mid s_{t+1})\big]\big)\Big)^{2}\bigg]$$

where $\gamma$ is the discount factor of the Q function; $Q(s_t,a_t)$ denotes the Q value of executing action $a_t$ under the state variable $s_t$ at time $t$, reflecting the expected return obtained after taking the action; $\log\pi(a_{t+1}\mid s_{t+1})$ is the log-probability of the next policy action $a_{t+1}$; $J_Q$ is the error-weighted average of the Q network; $\mathbb{E}_{(s_t,a_t)\sim\mathcal{D}}[\cdot]$ denotes the weighted average of the network error over the experience replay pool $\mathcal{D}$; $s_t$ is the state variable given by the output features $Z_t$ of the Transformer network at time $t$, $s_{t+1}$ is the state variable given by the Transformer network output at time $t+1$, $r_t$ is the reward at time $t$, $a_t$ is the action executed by the agent at time $t$, and $a_{t+1}$ is the action taken by the agent at time $t+1$.
4) The policy network is updated by maximizing the policy entropy and the Q value:

$$J_{\pi} = \mathbb{E}_{s_t\sim\mathcal{D}}\Big[\mathbb{E}_{a_t\sim\pi}\big[\alpha\log\pi(a_t\mid s_t)-Q(s_t,a_t)\big]\Big]$$

where $J_{\pi}$ is the loss function of the policy network, $\mathbb{E}_{s_t\sim\mathcal{D}}[\cdot]$ denotes the weighted average over the state variables $s_t$, $Q(s_t,a_t)$ is the expected value of the action $a_t$ selected based on the policy $\pi$, and $\pi(a_t\mid s_t)$ is the probability of selecting action $a_t$ under the state variable $s_t$ and the policy $\pi$.
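The following is a minimal PyTorch sketch of items 2)–4) above: a uniform replay buffer, one critic update minimizing the soft Bellman error, and one policy update maximizing the entropy-regularized Q value. It is a simplified illustration under stated assumptions, not the invention's implementation: twin critics, target networks and automatic temperature tuning common in SAC implementations are omitted, transitions are assumed to be stored as tensors, and `policy.sample` is an assumed helper returning a reparameterized action and its log-probability.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tensors and samples them uniformly."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)                 # oldest samples are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=256):
        batch = random.sample(self.buffer, batch_size)        # random sampling increases sample diversity
        return tuple(map(torch.stack, zip(*batch)))           # batched tensors (requires tensor transitions)

    def __len__(self):
        return len(self.buffer)

def critic_update(q_net, policy, batch, q_optimizer, gamma=0.99, alpha=0.2):
    """One gradient step on the Q network: minimize the soft Bellman error (item 3)."""
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        next_actions, next_log_probs = policy.sample(next_states)       # a_{t+1}, log pi(a_{t+1}|s_{t+1})
        soft_q_next = q_net(next_states, next_actions) - alpha * next_log_probs
        target = rewards + gamma * (1.0 - dones) * soft_q_next           # soft Bellman target
    q_loss = F.mse_loss(q_net(states, actions), target)                  # squared Bellman error
    q_optimizer.zero_grad()
    q_loss.backward()
    q_optimizer.step()
    return q_loss.item()

def policy_update(q_net, policy, batch, policy_optimizer, alpha=0.2):
    """One gradient step on the policy network: maximize Q value plus policy entropy (item 4)."""
    states = batch[0]
    actions, log_probs = policy.sample(states)                # reparameterized a_t ~ pi(.|s_t)
    policy_loss = (alpha * log_probs - q_net(states, actions)).mean()
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()
    return policy_loss.item()
```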
The reward function $r_t(s_t,a_t)$ is calculated by the following formula:

$$r_t(s_t,a_t) = -\big(\lambda_1 f_{V}(t) + \lambda_2 f_{C}(t) + \lambda_3 f_{P}(t) + \lambda_4 f_{S}(t)\big)$$

where $f_{V}(t)$ is the voltage out-of-limit penalty value at time $t$, $f_{C}(t)$ is the control cost at time $t$, $f_{P}(t)$ is the power loss penalty of the low-voltage distribution network at time $t$, $f_{S}(t)$ is the node voltage sensitivity penalty value at time $t$, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the weights of the corresponding sub-items; $a_t$ is the decision action at time $t$ and $s_t$ is the state variable of the low-voltage distribution network at time $t$.
The voltage out-of-limit penalty value penalizes node voltages that exceed the allowable range, and its calculation formula is:

$$f_{V}(t) = \sum_{i=1}^{N}\max\big(0,\ \lvert U_{i,t}-U_{\mathrm{ref}}\rvert-\Delta U_{\max}\big)$$

where $f_{V}(t)$ is the penalty value, $U_{\mathrm{ref}}$ is the reference voltage, $\Delta U_{\max}$ is the maximum allowed deviation, $U_{i,t}$ is the per-unit voltage value of node $i$ at time $t$, and $N$ is the number of nodes.
The control cost penalizes the execution cost of the control actions, and its calculation formula is:

$$f_{C}(t) = \sum_{j} c_{j}\,\lvert a_{j,t}\rvert$$

where $c_{j}$ is the cost coefficient of control action $a_{j,t}$ and $f_{C}(t)$ is the total execution cost.
The power loss penalty of the low-voltage distribution network is calculated as:

$$f_{P}(t) = \sum_{i=1}^{N} P^{\mathrm{loss}}_{i,t}$$

where $P^{\mathrm{loss}}_{i,t}$ is the power loss of node $i$ and $f_{P}(t)$ is the total power loss value.
The node voltage sensitivity penalty guides the agent to preferentially adjust the nodes with the greatest influence on the overall voltage, and its calculation formula is:

$$f_{S}(t) = \sum_{i=1}^{N} S_{i,t}\,\lvert\Delta Q_{i,t}\rvert$$

where $\Delta Q_{i,t}$ is the reactive power variation of node $i$, $S_{i,t}$ is the sensitivity of node $i$ at time $t$, and $f_{S}(t)$ is the voltage sensitivity penalty.
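A NumPy sketch of the reward computation as reconstructed above. The weighted-sum form, the example weight values and the exact per-term expressions are assumptions consistent with the four penalty terms described in the text, not a verbatim implementation of the invention.

```python
import numpy as np

def reward(u, p_loss, action_cost, delta_q, sensitivity,
           u_ref=1.0, dev_max=0.05, weights=(1.0, 0.1, 0.5, 0.2)):
    """Negative weighted sum of the four penalty terms f_V, f_C, f_P, f_S."""
    lam1, lam2, lam3, lam4 = weights
    f_v = np.maximum(0.0, np.abs(u - u_ref) - dev_max).sum()   # voltage out-of-limit penalty
    f_c = action_cost                                           # execution cost of the control actions
    f_p = p_loss.sum()                                          # total network power loss
    f_s = (sensitivity * np.abs(delta_q)).sum()                 # sensitivity-weighted reactive adjustment
    return -(lam1 * f_v + lam2 * f_c + lam3 * f_p + lam4 * f_s)

# Example for a 33-node network
u = np.random.uniform(0.93, 1.07, 33)        # per-unit node voltages
p_loss = np.random.uniform(0.0, 5.0, 33)     # per-node losses
print(reward(u, p_loss, action_cost=2.0, delta_q=np.zeros(33), sensitivity=np.zeros(33)))
```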
This embodiment trains the Transformer-SAC model based on the load history data, topology parameters, distributed power supply parameters and capacitor parameters of the low-voltage power distribution network, and comprises the following steps:
(1) Constructing a virtual environment of the low-voltage power distribution network based on the historical data of the low-voltage power distribution network; the Transformer network acquires the state variables from the virtual environment, extracts the associated feature matrix and generates a new state variable.
As one embodiment, PyTorch/TensorFlow is used to construct the virtual environment of the low-voltage power distribution network; the IEEE 33-node low-voltage distribution network is used for simulation, and different load fluctuations and distributed energy access conditions are simulated. The historical data of the low-voltage distribution network are relevant historical data of the actual engineering environment, including time-series load history data, network topology matrix parameters, distributed power supply parameters, capacitor parameters and the like. The simulation environment is formed by mathematical modeling of the topology matrix and the historical active and reactive load information, and the operating states of the simulation environment, including voltages and currents, under different instructions or control strategies given by deep reinforcement learning (DRL) are then obtained through power flow calculation.
The virtual environment provides the state variables of the low-voltage distribution network required by the Transformer-SAC model and receives the execution result of the action $a_t$ under the control strategy $\pi$ generated by the Transformer-SAC model.
(2) Based on the initial virtual environment state, the Transformer-SAC model outputs an action instruction $a_t$ according to the action strategy obtained by the SAC algorithm; the action $a_t$ includes discrete control actions for adjusting the reactive compensator and continuous control actions for adjusting the generation power of the distributed energy sources.
(3) The reactive compensator and the distributed power supply execute the actions to change the state of the virtual environment of the low-voltage power distribution network, and the changed state variables are fed back to the Transformer-SAC model for iterative optimization.
(4) The Transformer-SAC model is optimized using the changed state variables, and the training is iterated through loss calculation, back-propagation and parameter updating.
(5) The optimized Transformer-SAC model updates the action strategy.
Steps (3) to (5) are repeated to carry out the learning iterations of the Transformer-SAC model until convergence, which completes the training of the Transformer-SAC model; a minimal sketch of this interaction loop is given below.
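A minimal sketch of the training interaction loop in steps (1)–(5). `GridEnv` is a placeholder for the power-flow-based virtual environment (e.g. built around the IEEE 33-node case), and `agent` is assumed to bundle the Transformer feature encoder, the SAC networks and the updates sketched above; all class and method names are illustrative assumptions, not part of the invention.

```python
class GridEnv:
    """Placeholder for the virtual low-voltage distribution network environment.

    reset() returns the initial state variables; step(action) applies the capacitor
    switching and DG power commands, runs a power flow, and returns
    (next_state, reward, done, info).
    """
    def reset(self): ...
    def step(self, action): ...

def train(env, agent, buffer, episodes=200, batch_size=256, warmup=1000):
    """Iterate steps (1)-(5): act, observe the new state, store it, and update until convergence."""
    for _ in range(episodes):
        state = env.reset()                                   # step (1): state from the virtual environment
        done = False
        while not done:
            action = agent.act(state)                         # step (2): action from the SAC policy
            next_state, reward, done, _ = env.step(action)    # step (3): environment state changes
            buffer.push(state, action, reward, next_state, done)
            if len(buffer) > warmup:
                agent.update(buffer.sample(batch_size))       # steps (4)-(5): optimize model, refresh policy
            state = next_state
```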
The optimization of the Transformer-SAC model includes the parameter optimization of the Transformer network: a loss function that minimizes the mean square error is used, gradients are back-propagated through the chain rule, and the parameters of the Transformer network are updated with an adaptive moment estimation optimizer.
The loss function minimizes the mean square error:

$$L = \frac{1}{N}\sum_{i=1}^{N}\big(\hat{y}_i - y_i\big)^{2}$$

where $L$ is the loss function, $\hat{y}_i$ is the feature prediction value of the $i$-th node, $y_i$ is the true feature value of the $i$-th node, and $N$ is the number of nodes. Back-propagation calculates the gradient of the loss function with respect to the Transformer network parameters (including the query, key and value weight matrices and the biases) through the chain rule:

$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial y^{(l)}}\cdot\frac{\partial y^{(l)}}{\partial \theta}$$

where $y^{(l)}$ is the output of each layer of the Transformer network; the gradient is propagated backwards through each layer of the Transformer network and the gradient of every parameter is calculated; $\theta$ denotes the parameters, including $W^{Q}$, $W^{K}$, $W^{V}$, $W^{O}$, $W_1$ and $W_2$.
The adaptive moment estimation optimizer updates the parameters of the Transformer network according to the calculated gradients, with the following update rule:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t,\qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^{2}$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^{t}},\qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}},\qquad \theta_{t+1} = \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$$

where $m_t$ and $v_t$ are the momentum of the gradient and of the squared gradient at step $t$, $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected values at step $t$, $g_t$ is the gradient at step $t$, $\eta$ is the learning rate, $\epsilon$ is a constant that prevents division-by-zero errors, and $\beta_1$, $\beta_2$ are the momentum decay coefficients of the adaptive moment estimation optimizer.
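The update above is the standard adaptive moment estimation (Adam) rule, which in practice can be applied through a library optimizer. A PyTorch sketch of one optimization step on a stand-in module follows; the learning rate, β1, β2 and ε values are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)                               # stand-in for the Transformer network parameters
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3, betas=(0.9, 0.999), eps=1e-8)   # eta, beta1, beta2, epsilon

x, target = torch.randn(33, 4), torch.randn(33, 4)    # node features and their "true" values
loss = nn.functional.mse_loss(model(x), target)       # mean-square-error loss over the nodes
optimizer.zero_grad()
loss.backward()                                       # back-propagation of gradients through the chain rule
optimizer.step()                                      # Adam step: m_t, v_t, bias correction, parameter update
```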
Embedding the trained Transformer-SAC model into the voltage control system of the low-voltage distribution network, where the Transformer-SAC model interacts with the environment to form an action strategy, comprises the following steps:
1.1, inputting the real-time power grid state variables into the trained Transformer-SAC model to generate an action strategy, and issuing action instructions to the distributed power supplies and the reactive compensators of the low-voltage power distribution network.
1.2, the reactive compensators and the distributed power supplies execute the selected control actions to change the state of the actual low-voltage distribution network environment, and the changed power grid state is fed back to the trained Transformer-SAC model.
1.3, the Transformer network performs feature processing on the input power grid state and optimizes the weights and bias terms of the Transformer network based on step (4); the Transformer network extracts the latest associated feature matrix from the low-voltage power distribution network state updated in real time and inputs it into the SAC algorithm to generate a control strategy, completing the real-time iterative training of the Transformer-SAC model and the voltage regulation control of the low-voltage power distribution network.
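A sketch of the on-line control loop of steps 1.1–1.3. `read_grid_state` and `dispatch` are hypothetical placeholders for the monitoring system and the communication layer toward the execution equipment, and `agent.observe` stands in for the on-line fine-tuning of the Transformer-SAC model; none of these names come from the invention itself.

```python
import time

def control_loop(agent, read_grid_state, dispatch, period_s=60.0):
    """Closed-loop operation: measure the grid, generate actions, dispatch them, learn from feedback."""
    state = read_grid_state()                       # real-time grid state from the sensors
    while True:
        action = agent.act(state)                   # 1.1: Transformer-SAC model generates the action strategy
        dispatch(action)                            # 1.1: instructions to DG units and reactive compensators
        time.sleep(period_s)                        # wait for the devices to act and the grid to settle
        next_state = read_grid_state()              # 1.2: changed grid state fed back to the model
        agent.observe(state, action, next_state)    # 1.3: on-line iterative optimization of the model
        state = next_state
```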
This embodiment also provides a system for realizing the low-voltage distribution network voltage control method based on deep reinforcement learning, which comprises the voltage control system of the low-voltage distribution network and a Transformer-SAC control module.
The Transformer-SAC control module comprises a Transformer-SAC model composed of a Transformer network and the SAC algorithm and is used for extracting features from the power grid state and generating a control strategy. The Transformer network is used for extracting features and optimizing the input features of the control strategy. The SAC algorithm is used for generating an optimized control strategy and is trained through a policy network and a Q network.
The voltage control system of the low-voltage distribution network comprises:
The power distribution network state monitoring system monitors the state of the power grid (including voltage, power and the like) in real time through sensors (such as voltage sensors and current sensors), and the sensors feed the real-time data back to the Transformer-SAC model for evaluating the effect of the control strategy and guiding the next action.
The central controller, arranged in the control system on the low-voltage distribution side, acquires the control strategy of the power grid based on the Transformer-SAC model and issues control instructions through a communication protocol.
The communication network adopts the IEC 61850 standardized communication protocol and transmits the control instructions from the central controller to each substation, which forwards them to the execution equipment.
The execution equipment comprises a reactive compensator for adjusting the reactive power output of the power grid and distributed power supply equipment for adjusting the active power output according to the control instructions.
The working flow of the system for realizing the low-voltage distribution network voltage control method based on deep reinforcement learning is as follows:
The distribution-side central controller sends the control instructions formed by the Transformer-SAC model (comprising the capacitor switching adjustment quantity of the reactive compensator in the low-voltage power distribution network and the active power output quantity of the distributed power supply equipment) to all substations of the low-voltage power distribution network through the standardized communication protocol IEC 61850; the substations serve as an intermediate layer and are responsible for forwarding the received control instructions to the corresponding reactive compensator and distributed power supply equipment;
After receiving the control command, the reactive power compensator and the distributed power equipment adjust the running state according to the command, and change the power output of the distributed power equipment and the capacitance switching quantity of the reactive power compensator;
The Transformer-SAC model realizes voltage control according to the environmental feedback and self-optimization of the Transformer network parameters within the Transformer-SAC model.
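As an illustration of how a control instruction formed by the model (capacitor switching adjustment of the reactive compensator and active power output of the distributed power supply equipment, as in the workflow above) can be mapped to device commands, the following sketch splits the action vector into discrete capacitor steps and continuous generation setpoints; the action layout, step size and power limits are assumptions for illustration only.

```python
import numpy as np

def split_action(action, num_caps=2, num_dg=3, cap_step_kvar=50.0, dg_max_kw=250.0):
    """Map the SAC action vector to device commands: discrete capacitor steps + continuous DG setpoints."""
    cap_part = action[:num_caps]
    dg_part = action[num_caps:num_caps + num_dg]
    cap_steps_kvar = np.round(cap_part).astype(int) * cap_step_kvar   # discrete reactive switching
    dg_setpoints_kw = np.clip(dg_part, 0.0, 1.0) * dg_max_kw          # continuous active power output
    return cap_steps_kvar, dg_setpoints_kw

# Example: an action produced by the policy for 2 capacitors and 3 distributed generators
a_t = np.array([1.2, -0.4, 0.8, 0.3, 0.95])
print(split_action(a_t))
```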
The control effect of the invention is shown in the accompanying drawings: the reward function curve of the Transformer-SAC model is shown in fig. 2 and the cost function loss curve in fig. 3, which show that the Transformer-SAC model begins to converge at iteration 82. The node voltage control effect is shown in fig. 4, where the node limit violations disappear after iteration 63.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples, and any other modifications, substitutions, combinations, and simplifications without departing from the spirit and principles of the present invention should be made in the equivalent manner, and are included in the scope of the present invention.