Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by providing an intelligent activation method for intelligent reflecting surface elements based on deep reinforcement learning. The technical scheme of the invention is as follows:
An intelligent activation method for intelligent reflecting surface elements based on deep reinforcement learning comprises the following steps:
S1, establishing a system model according to an actual communication scene, and formulating the target problem to be solved according to the system model;
S2, abstracting the target problem as a Markov decision problem, and defining the basic elements involved in the reinforcement learning agent's interaction with the environment, including the actions, the states, and the setting of the reward function;
S3, constructing a deep reinforcement learning algorithm based on the classical actor-critic framework, the deep reinforcement learning algorithm comprising a policy network and an evaluation network, wherein the policy network is used for outputting the action decisions of the agent, and the evaluation network is used for evaluating the actions taken by the policy network in the current state and, on that basis, providing the gradient for the policy network update;
S4, redesigning the structure of the networks, adopting bidirectional convolution and a densely connected fully-connected structure to complete the extraction of the channel state information. The channel state reflects factors such as the positional relationship among the base station, the intelligent reflecting surface and the users, their mutual interference, and signal attenuation; a neural network is therefore used to fit the mapping between the channel state and the activation of the intelligent reflecting surface elements, thereby completing the intelligent activation of the intelligent reflecting surface elements.
Further, the intelligent reflecting surface is a passive intelligent reflecting surface that mainly serves as a relay device between the base station and the user equipment, assisting communication between them.
Further, in step S1, for the practical scenario in which no line-of-sight link exists between the base station and the users due to blockage, an intelligent reflecting surface is deployed between the base station and the users, a system model of intelligent-reflecting-surface-assisted communication is established, and the target problem to be solved is formulated from the system model; specifically:
The transmission rate of the kth user equipment can be expressed as:
Rk = log2(1/ek)
wherein Φ represents the intelligent reflecting surface coefficient matrix; ek represents the mean square error between the information received at the receiving end and the information at the transmitting end, which under a linear MMSE receiver satisfies ek = 1/(1 + pt hk^H (σ2 IM + Σj≠k pt hj hj^H)^(-1) hk) with the equivalent channel hk = hd,k + G Φ hr,k; pt is the user transmit power; σ2 is the Gaussian noise power; IM is the standard identity matrix; and hd,k, hr,k and G represent the channel states from the base station to the user equipment, from the intelligent reflecting surface to the user equipment, and from the intelligent reflecting surface to the base station, respectively.
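By way of illustration, the per-user rate above can be evaluated numerically. The following is a minimal NumPy sketch under the linear-MMSE-receiver assumption stated above; the values of pt and σ2 and the random Rayleigh channel draws are illustrative placeholders, while M = 2 antennas, N = 24 elements and K = 2 users match the embodiment described later.

```python
import numpy as np

M, N, K = 2, 24, 2          # BS antennas, IRS elements, users
pt, sigma2 = 1.0, 0.1       # transmit power and noise power (placeholders)
rng = np.random.default_rng(0)

def rayleigh(*shape):
    """i.i.d. Rayleigh-fading (complex Gaussian) channel samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

h_d = rayleigh(K, M)        # direct channels hd,k: base station <-> user k
h_r = rayleigh(K, N)        # IRS-user channels hr,k
G   = rayleigh(M, N)        # IRS-base-station channel G

# IRS coefficient matrix: diagonal unit-modulus phases gated by the 0/1
# activation vector v produced by the policy network.
v     = rng.integers(0, 2, N)              # element activation (example)
theta = rng.uniform(0, 2 * np.pi, N)       # phases (from the classical algorithm)
Phi   = np.diag(v * np.exp(1j * theta))

def rate(k):
    """Rk = log2(1/ek) for a linear MMSE receiver."""
    h = [h_d[j] + G @ Phi @ h_r[j] for j in range(K)]   # equivalent channels
    C = sigma2 * np.eye(M, dtype=complex)               # interference + noise
    for j in range(K):
        if j != k:
            C += pt * np.outer(h[j], h[j].conj())
    sinr = pt * np.real(h[k].conj() @ np.linalg.solve(C, h[k]))
    return np.log2(1 + sinr)                            # = -log2(ek)

print([rate(k) for k in range(K)])
```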
Further, in step S2, the target problem is abstracted as a Markov decision problem, and the basic elements involved in the reinforcement learning agent's interaction with the environment are defined, including the actions, the states, and the setting of the reward function; specifically:
In the establishment of the Markov decision model, the state is set as the channel state at the current moment, where the channel state information comprises a real part and an imaginary part. The action of the policy is a one-dimensional vector v whose length equals the number of intelligent reflecting surface elements; assuming the number of intelligent reflecting surface elements is N, the action at any moment t is expressed as:
v(t)=[v1(t),v2(t),…vN(t)]
The reward reflects performance on the target problem; according to the users' transmission rate formula, the sum of the rates of all user equipment is taken as the reward function, which can be expressed as:
r(t) = Σk Rk(t)
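As a concrete illustration, the state, action and reward of this Markov decision model can be encoded as below. This is a minimal sketch: the helper per_user_rates standing in for the rate formula above is a hypothetical placeholder, and N = 24, K = 2 are example sizes.

```python
import numpy as np

N, K = 24, 2                                   # IRS elements, users (example)
rng = np.random.default_rng(1)

# State: real and imaginary parts of all channel state information,
# flattened into one vector (random placeholders for h_d, h_r, G).
channels = [rng.standard_normal((K, 2)) + 1j * rng.standard_normal((K, 2)),
            rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N)),
            rng.standard_normal((2, N)) + 1j * rng.standard_normal((2, N))]
state = np.concatenate([np.concatenate([c.real.ravel(), c.imag.ravel()])
                        for c in channels])

# Action: one-dimensional 0/1 vector v(t) = [v1(t), ..., vN(t)],
# thresholded here from the policy network's probability output.
probs = rng.uniform(size=N)                    # stand-in for the policy output
v_t = (probs > 0.5).astype(int)

def per_user_rates(state, v_t):
    """Hypothetical placeholder for the per-user rate formula Rk above."""
    return rng.uniform(1.0, 3.0, K)

# Reward: sum of the rates of all user equipment.
r_t = per_user_rates(state, v_t).sum()
print(state.shape, v_t, r_t)
```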
further, the step S3 constructs a deep reinforcement learning algorithm based on a classical presenter-reviewer framework, and the deep reinforcement learning algorithm comprises a policy network and an evaluation network, wherein the policy network is used for outputting action decisions of an agent, and the evaluation network is used for evaluating actions taken by the policy network in a current state and providing gradients of updating of the policy network on the basis of the actions, and the method specifically comprises the following steps:
The policy network consists of three modules. Module 1 is a bidirectional convolution: it convolves the input link channels in the horizontal and vertical directions, then applies a 1×1 convolution to the outputs of the two directions, and finally merges the two outputs into a one-dimensional vector that is fed to Module 2. Module 2 is mainly a fully-connected structure with two densely connected layers; the dense connections allow the input of the current layer to contain the features of all preceding layers, further improving the channel feature extraction and analysis capability. Module 3 adjusts the output of the network with two fully-connected layers; the output of the policy network is a one-dimensional probability vector whose length equals the number of intelligent reflecting surface elements, a larger probability indicating a higher likelihood of that element being activated.
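A minimal PyTorch sketch of this three-module policy network follows. The intermediate channel count of 64 before the 1×1 convolutions, the layer width of 512, and the single-link input shape (batch, 2, N, M) with real/imaginary parts as the two input channels are illustrative assumptions, not values fixed by the foregoing description.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Three-module policy network: bidirectional conv -> dense FC -> output."""
    def __init__(self, n_elements=24, n_antennas=2, width=512):
        super().__init__()
        # Module 1: bidirectional convolution over one link's channel matrix.
        self.conv_v = nn.Conv2d(2, 64, kernel_size=(n_elements, 1))  # vertical
        self.conv_h = nn.Conv2d(2, 64, kernel_size=(1, n_antennas))  # horizontal
        self.mix_v = nn.Conv2d(64, 64, kernel_size=1)                # 1x1 conv
        self.mix_h = nn.Conv2d(64, 64, kernel_size=1)
        feat = 64 * n_antennas + 64 * n_elements
        # Module 2: two densely connected fully-connected layers; each layer's
        # input is the concatenation of all previous layers' outputs.
        self.fc1 = nn.Linear(feat, width)
        self.fc2 = nn.Linear(feat + width, width)
        # Module 3: two fully-connected layers adjusting the output dimension;
        # Sigmoid yields per-element activation probabilities.
        self.out1 = nn.Linear(feat + 2 * width, width)
        self.out2 = nn.Linear(width, n_elements)

    def forward(self, x):
        relu = torch.relu
        v = relu(self.mix_v(relu(self.conv_v(x)))).flatten(1)
        h = relu(self.mix_h(relu(self.conv_h(x)))).flatten(1)
        d0 = torch.cat([v, h], dim=1)            # merged one-dimensional vector
        d1 = relu(self.fc1(d0))
        d2 = relu(self.fc2(torch.cat([d0, d1], dim=1)))
        z = relu(self.out1(torch.cat([d0, d1, d2], dim=1)))
        return torch.sigmoid(self.out2(z))       # activation probabilities

net = PolicyNet()
probs = net(torch.randn(4, 2, 24, 2))            # batch of 4 channel states
print(probs.shape)                               # torch.Size([4, 24])
```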
The evaluation network likewise consists of three modules. Module 1 also adopts bidirectional convolution; compared with the policy network, the input of the evaluation network additionally contains the action output by the policy network, so the evaluation network fuses the policy network's action into the features after the 1×1 convolution and then feeds the result to Module 2. The structure of Module 2 is identical to that of the policy network. Module 3 uses two fully-connected layers mainly to adjust the output of the network; the output of the evaluation network is a Q value, so it finally outputs a single value.
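Analogously, a compact sketch of the evaluation network, reusing the policy network's Module 1/2 layout. That the action is concatenated to the flattened features right after the 1×1 convolutions is an assumption consistent with, but not pinned down by, the description above.

```python
import torch
import torch.nn as nn

class CriticNet(nn.Module):
    """Evaluation network: Modules 1-2 as in the policy net, action fused in,
    a single scalar Q value out."""
    def __init__(self, n_elements=24, n_antennas=2, width=512):
        super().__init__()
        self.conv_v = nn.Conv2d(2, 64, kernel_size=(n_elements, 1))
        self.conv_h = nn.Conv2d(2, 64, kernel_size=(1, n_antennas))
        self.mix_v = nn.Conv2d(64, 64, kernel_size=1)
        self.mix_h = nn.Conv2d(64, 64, kernel_size=1)
        feat = 64 * n_antennas + 64 * n_elements + n_elements  # + action
        self.fc1 = nn.Linear(feat, width)
        self.fc2 = nn.Linear(feat + width, width)
        self.out = nn.Linear(feat + 2 * width, 1)   # Module 3: single Q value

    def forward(self, x, action):
        relu = torch.relu
        v = relu(self.mix_v(relu(self.conv_v(x)))).flatten(1)
        h = relu(self.mix_h(relu(self.conv_h(x)))).flatten(1)
        d0 = torch.cat([v, h, action], dim=1)       # fuse the policy's action
        d1 = relu(self.fc1(d0))
        d2 = relu(self.fc2(torch.cat([d0, d1], dim=1)))
        return self.out(torch.cat([d0, d1, d2], dim=1))

q = CriticNet()(torch.randn(4, 2, 24, 2), torch.rand(4, 24))
print(q.shape)                                      # torch.Size([4, 1])
```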
Further, the bidirectional convolution module performs bidirectional convolution on the channel state of each link separately;
Module 2 is a fully-connected structure with two densely connected layers. The width of all fully-connected layers is limited to at most 1024. As for activation functions, the action output of the policy network uses Sigmoid to represent the probability of activating each intelligent reflecting surface element, and all other activation functions are ReLU. The size of the experience replay pool is 2^18, the learning rate of the neural networks is 2^-16, the variance of the exploration noise is 0.0001, the optimizer is Adam, and the discount coefficient is set to 0.
The invention has the following advantages and beneficial effects:
1. Aiming at the cumbersome integer optimization of traditional communication algorithms, the invention provides a joint optimization mode combining a traditional communication algorithm with deep reinforcement learning: the fitting capability of a neural network replaces the complex iterations of the traditional algorithm, and only the adjustment of the intelligent reflecting surface element phases is computed by the traditional communication algorithm.
2. Aiming at the structure of the channel state, a combination of bidirectional convolution and a densely connected fully-connected structure is provided to extract channel features, which greatly improves the capability of extracting channel state information and accelerates convergence during network training.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
An intelligent activation method for intelligent reflecting surface elements based on deep reinforcement learning is characterized by comprising the following steps:
S1, establishing a system model according to an actual communication scene and formulating the target problem to be solved according to the system model;
S2, abstracting the problem, as is usual in reinforcement learning, as a Markov decision problem, and defining the basic elements involved in the reinforcement learning agent's interaction with the environment, including the actions, the states, and the setting of the reward function;
S3, constructing a deep reinforcement learning algorithm based on the classical actor-critic framework, the algorithm mainly comprising two parts: a policy network used for outputting the action decisions of the agent, and an evaluation network used for evaluating the actions taken by the policy network in the current state and, on that basis, providing the gradient for the policy network update;
S4, since introducing the intelligent reflecting surface poses challenges to the channel feature extraction of the policy network and the evaluation network, redesigning the structure of the networks to complete the extraction of channel state information.
Preferably, the intelligent reflecting surface mainly serves as a relay device between the base station and the user equipment to assist communication between them. A multi-user uplink transmission scene is considered in which the line-of-sight channel between the base station and the user equipment is severely blocked, so the cascaded link of the intelligent reflecting surface is used to assist in completing the communication task between the base station and the user equipment; the channels follow a Rayleigh fading model throughout. Generally, unless otherwise specified, the intelligent reflecting surfaces referred to in the present invention are all passive intelligent reflecting surfaces.
Preferably, in S2, according to the interaction model between the agent and the environment in a Markov decision process, at any moment t the agent takes an action according to the current environment state and obtains an instant reward, after which the environment state is updated to that of the next moment. According to this basic interaction model and the target problem, the channel state information at each moment is set as the state of the environment; the action is a one-dimensional vector whose length equals the number of intelligent reflecting surface elements and indicates the current intelligent reflecting surface activation strategy; and since a reward function is usually used to measure how well the target is met, the sum of the communication rates of the user equipment is taken as the reward.
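Per time step, the interaction described here reduces to the standard MDP loop. A minimal sketch with placeholder dynamics follows; the names sample_channel_state and sum_rate, and their random bodies, are illustrative stand-ins, not part of the invention.

```python
import numpy as np

N = 24                                      # number of IRS elements (example)
rng = np.random.default_rng(2)

def sample_channel_state():
    """Placeholder: real/imag channel state information as a flat vector."""
    return rng.standard_normal(2 * (2 + N + 2 * N))   # h_d, h_r, G for 2 users

def sum_rate(state, action):
    """Placeholder for the sum of user rates under activation vector `action`."""
    return float(action.sum()) * 0.1                  # illustrative only

policy = lambda s: (rng.uniform(size=N) > 0.5).astype(float)  # stand-in agent
replay = []

state = sample_channel_state()
for t in range(100):
    action = policy(state)                    # activation strategy at moment t
    reward = sum_rate(state, action)          # instant reward
    next_state = sample_channel_state()       # environment moves on
    replay.append((state, action, reward, next_state))
    state = next_state
```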
Preferably, the deep reinforcement learning algorithm based on the actor-critic framework is a currently mainstream deep reinforcement learning algorithm. The policy network focuses on the policy itself and outputs the intelligent reflecting surface element activation strategy according to the current state. If the policy network were updated by the policy gradient alone, the agent would usually be required to complete full exploration trajectories, a requirement that cannot be met in practice. To improve the update speed of the policy network, an evaluation network that evaluates the performance of the policy network in real time is added; while evaluating the policy, it simultaneously drives the policy network to update in the direction guided by the evaluation network. It is worth mentioning that the phases of the intelligent reflecting surface are calculated by a traditional communication algorithm, while the neural network acts as the decision maker that selects which intelligent reflecting surface elements to activate in the environment with which the agent interacts, thereby handling the continuous mixed-integer optimization problem.
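The update just described corresponds to a deterministic actor-critic step of the DDPG type: the critic regresses toward the reward target, and the actor ascends the critic's output. Below is a minimal sketch assuming small stand-in fully-connected networks in place of the convolutional networks described later; with the discount coefficient set to 0, as in the embodiment below, the critic target is simply the instant reward.

```python
import torch
import torch.nn as nn

state_dim, n_elements, gamma = 100, 24, 0.0      # gamma = 0 per the embodiment
actor = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                      nn.Linear(256, n_elements), nn.Sigmoid())
critic = nn.Sequential(nn.Linear(state_dim + n_elements, 256), nn.ReLU(),
                       nn.Linear(256, 1))
opt_a = torch.optim.Adam(actor.parameters(), lr=2 ** -16)
opt_c = torch.optim.Adam(critic.parameters(), lr=2 ** -16)

# One update on a sampled mini-batch (random placeholders here).
s = torch.randn(32, state_dim)
a = (torch.rand(32, n_elements) > 0.5).float()
r = torch.rand(32, 1)

# Critic: regress Q(s, a) toward the (undiscounted) reward target.
q = critic(torch.cat([s, a], dim=1))
critic_loss = nn.functional.mse_loss(q, r)       # target = r since gamma = 0
opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

# Actor: the critic provides the gradient -- update the policy in the
# direction that increases the critic's evaluation of its actions.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```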
Preferably, since introducing the intelligent reflecting surface greatly increases the number and complexity of the channels, an extraction structure based on the ideas of bidirectional convolution and dense connection, as shown in figs. 4 and 5, is used to improve the network's capability of extracting channel features. The convolution kernels in the bidirectional convolution are chosen mainly according to the shape of each link's channel state, exploiting the translational invariance of the channel state. The densely connected structure can, on the one hand, compensate for insufficient network capacity and, on the other hand, synthesize all input features to further improve the feature extraction capability of the network.
According to the intelligent-reflecting-surface-enhanced multi-user uplink multiple-input single-output system model considered in fig. 2, it is assumed that 2 single-antenna user equipments transmit information to a remote data center through a base station configured with 2 antennas; considering that no line-of-sight channel exists between the base station and the user equipments, communication can only be assisted by means of the intelligent reflecting surface between them. The target problem to be solved is to complete high-quality activation of a given number of intelligent reflecting surface elements while guaranteeing the communication rate. Giving only the formulas necessary for solving the problem, the transmission rate of the kth user device can be expressed as:
Rk = log2(1/ek)
wherein Φ represents the intelligent reflecting surface coefficient matrix, IM represents the standard identity matrix, and hd,k, hr,k and G represent the channel states from the base station to the user equipment, from the intelligent reflecting surface to the user equipment, and from the intelligent reflecting surface to the base station, respectively; ek is the mean square error defined above.
In the establishment of the Markov decision model, the state is set as the channel state at the current moment, where the channel state information comprises a real part and an imaginary part. The action of the policy is a one-dimensional vector v whose length equals the number of intelligent reflecting surface elements; assuming the number of intelligent reflecting surface elements is N, the action at any moment t can be expressed as:
v(t)=[v1(t),v2(t),…vN(t)]
The reward generally needs to reflect performance on the target problem. Taking the sum of the rates of all user devices, computed from the users' transmission rate formula, as the reward function, it can be expressed as:
r(t) = Σk Rk(t)
According to the overall algorithm framework in fig. 1, the policy network outputs a high-quality activation decision for the intelligent reflecting surface elements, which is the activation action made by the policy network for the current environment state; the evaluation network is used to evaluate the action of the current policy network and, during the update of the policy network, guides it to update in the direction that increases the evaluation network's output, i.e., it provides the gradient for the policy network update. Meanwhile, to ensure that the policy of the policy network retains a certain degree of exploration, exploration noise is added while the agent interacts with the environment to collect data, serving as the agent's trade-off between exploration and exploitation. The inputs of the policy network and the evaluation network both contain channel state information; in addition, the input of the evaluation network requires the action output by the policy network, since its output is an overall evaluation of the current state-action pair.
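A short illustration of the exploration noise mentioned above: Gaussian noise with the variance 0.0001 given in the embodiment is added to the policy output during data collection. Clipping the result back to valid probabilities before thresholding is an assumption, not stated in the description.

```python
import torch

noise_std = 0.0001 ** 0.5            # variance 0.0001 from the embodiment
probs = torch.rand(24)               # stand-in for the policy network output
noisy = (probs + noise_std * torch.randn_like(probs)).clamp(0.0, 1.0)
action = (noisy > 0.5).float()       # activation decision while exploring
```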
According to the overall network framework in fig. 3, Modules 1 and 2 mainly complete the extraction of the new channel state and form the core of the network, while Module 3 mainly meets the output-dimension requirements of the different networks and will not be described in detail. The focus here is on the structures of Modules 1 and 2. The network structure in Module 1 is mainly the bidirectional convolution of fig. 4, which has a stronger feature extraction capability than a fully-connected structure and helps the algorithm train faster. Bidirectional convolution is adopted mainly in consideration of the actual translational invariance of the channel state. In practice there are two main implementations of bidirectional convolution: one is to splice the channel state information of all links into a regular rectangle, which makes the convolution convenient to implement; the other is to perform the bidirectional convolution on the channel state of each link separately. Since the padding elements inevitably introduced by the splicing of the first mode would cause errors, the latter mode is adopted. Specifically, the total number of intelligent reflecting surface elements is set to 24, and the overall flow of the bidirectional convolution is analyzed in detail, taking the channel G from the intelligent reflecting surface to the base station as an example. The bidirectional convolution, as the name implies, applies convolution kernels of size (24×1) and (1×2) to G in the two directions, then applies a convolution with kernel size (1×1) to each of the two directional outputs, with 512 and 256 channels respectively, and finally merges the two outputs into a given output dimension to be sent to Module 2.
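Following the shapes stated above for the channel G (24 elements × 2 antennas, with real and imaginary parts as the input channels), the bidirectional convolution can be sketched as below. The 64 intermediate channels before the 1×1 convolutions are an assumption, since the description fixes only the 512 and 256 channel counts of the 1×1 convolutions.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 2, 24, 2)       # (batch, real/imag, 24 elements, 2 antennas)

conv_24x1 = nn.Conv2d(2, 64, kernel_size=(24, 1))    # one direction over G
conv_1x2  = nn.Conv2d(2, 64, kernel_size=(1, 2))     # the other direction
mix_512   = nn.Conv2d(64, 512, kernel_size=1)        # 1x1 conv, 512 channels
mix_256   = nn.Conv2d(64, 256, kernel_size=1)        # 1x1 conv, 256 channels

a = torch.relu(mix_512(torch.relu(conv_24x1(x))))    # -> (8, 512, 1, 2)
b = torch.relu(mix_256(torch.relu(conv_1x2(x))))     # -> (8, 256, 24, 1)

# Merge both directional outputs into one one-dimensional vector for Module 2.
merged = torch.cat([a.flatten(1), b.flatten(1)], dim=1)
print(merged.shape)                                  # torch.Size([8, 7168])
```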
Module 2 adopts the fully-connected structure with two densely connected layers shown in fig. 5. This connection can, on the one hand, avoid the problem of insufficient network capacity and, on the other hand, because in a dense connection the current layer comprehensively contains the features of all preceding layers, it further improves the extraction of the channel state. It is worth mentioning that the width of all fully-connected layers is limited to at most 1024; as for activation functions, the action output of the policy network uses Sigmoid to represent the probability of activating each intelligent reflecting surface element, and all other activation functions are ReLU; the size of the experience replay pool is 2^18; the learning rate of the neural networks is 2^-16; the variance of the exploration noise is 0.0001; the optimizer is Adam; and the discount coefficient is set to 0.
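For completeness, a sketch of Module 2's two densely connected layers together with the hyperparameters listed above, collected in one place. The input dimension of 7168 follows the bidirectional-convolution sketch above and is otherwise an assumption.

```python
import torch
import torch.nn as nn

class DenseFC(nn.Module):
    """Two densely connected FC layers: each layer sees all previous outputs."""
    def __init__(self, in_dim=7168, width=1024):     # width capped at 1024
        super().__init__()
        self.fc1 = nn.Linear(in_dim, width)
        self.fc2 = nn.Linear(in_dim + width, width)

    def forward(self, x):
        h1 = torch.relu(self.fc1(x))
        h2 = torch.relu(self.fc2(torch.cat([x, h1], dim=1)))
        return torch.cat([x, h1, h2], dim=1)          # features of all layers

# Hyperparameters of the embodiment, gathered as one configuration.
config = dict(replay_pool_size=2 ** 18,
              learning_rate=2 ** -16,
              exploration_noise_var=1e-4,
              optimizer="Adam",
              discount=0.0)
print(DenseFC()(torch.rand(8, 7168)).shape, config)
```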
FIG. 6 compares the training curves of the actual algorithm training process using the bidirectional convolution and a fully-connected structure, and FIG. 7 shows the achievable sum of user equipment rates when the total number of intelligent reflecting surface elements is 18, 24 and 32, respectively.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.