Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by providing an intelligent activation method for intelligent reflecting surface elements based on deep reinforcement learning. The technical scheme of the invention is as follows:
An intelligent activation method for intelligent reflecting surface elements based on deep reinforcement learning comprises the following steps:
S1, establishing a system model according to an actual communication scene, and formulating the target problem to be solved according to the system model;
S2, abstracting the target problem as a Markov decision problem, and defining the basic elements involved in the reinforcement learning agent's interaction with the environment, including the actions, the states, and the setting of the reward function;
S3, constructing a deep reinforcement learning algorithm based on the classical actor-critic framework, the deep reinforcement learning algorithm comprising a policy network and an evaluation network, wherein the policy network is used for outputting the action decisions of the agent, and the evaluation network is used for evaluating the actions taken by the policy network in the current state and, on that basis, providing the gradient for the policy network update;
S4, redesigning the structure of the networks, adopting bidirectional convolution and a densely connected fully-connected structure to complete the extraction of the channel state information. The channel state reflects factors such as the positional relationship among the base station, the intelligent reflecting surface and the users, their mutual interference, and signal attenuation; a neural network is therefore used to fit the mapping between the channel state and the activation of the intelligent reflecting surface elements, thereby completing the intelligent activation of the intelligent reflecting surface elements.
Further, the intelligent reflecting surface is a passive intelligent reflecting surface that mainly serves as a relay device between the base station and the user equipment, assisting communication between them.
Further, in step S1, for the practical scenario in which no line-of-sight link exists between the base station and the users due to blockage, an intelligent reflecting surface is deployed between the base station and the users, a system model of intelligent-reflecting-surface-assisted communication is established, and the target problem to be solved is formulated from the system model; specifically:
The transmission rate of the kth user equipment can be expressed as:
Rk = log2(1/ek)
wherein Φ represents the intelligent reflecting surface coefficient matrix; ek represents the mean square error between the information received at the receiving end and the information at the transmitting end, which under a linear MMSE receiver satisfies ek = 1/(1 + pt hk^H (σ2 IM + Σj≠k pt hj hj^H)^(-1) hk) with the equivalent channel hk = hd,k + G Φ hr,k; pt is the user transmit power; σ2 is the Gaussian noise power; IM is the standard identity matrix; and hd,k, hr,k and G represent the channel states from the base station to the user equipment, from the intelligent reflecting surface to the user equipment, and from the intelligent reflecting surface to the base station, respectively.
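By way of illustration, the per-user rate above can be evaluated numerically. The following is a minimal NumPy sketch under the linear-MMSE-receiver assumption stated above; the values of pt and σ2 and the random Rayleigh channel draws are illustrative placeholders, while M = 2 antennas, N = 24 elements and K = 2 users match the embodiment described later.

```python
import numpy as np

M, N, K = 2, 24, 2          # BS antennas, IRS elements, users
pt, sigma2 = 1.0, 0.1       # transmit power and noise power (placeholders)
rng = np.random.default_rng(0)

def rayleigh(*shape):
    """i.i.d. Rayleigh-fading (complex Gaussian) channel samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

h_d = rayleigh(K, M)        # direct channels hd,k: base station <-> user k
h_r = rayleigh(K, N)        # IRS-user channels hr,k
G   = rayleigh(M, N)        # IRS-base-station channel G

# IRS coefficient matrix: diagonal unit-modulus phases gated by the 0/1
# activation vector v produced by the policy network.
v     = rng.integers(0, 2, N)              # element activation (example)
theta = rng.uniform(0, 2 * np.pi, N)       # phases (from the classical algorithm)
Phi   = np.diag(v * np.exp(1j * theta))

def rate(k):
    """Rk = log2(1/ek) for a linear MMSE receiver."""
    h = [h_d[j] + G @ Phi @ h_r[j] for j in range(K)]   # equivalent channels
    C = sigma2 * np.eye(M, dtype=complex)               # interference + noise
    for j in range(K):
        if j != k:
            C += pt * np.outer(h[j], h[j].conj())
    sinr = pt * np.real(h[k].conj() @ np.linalg.solve(C, h[k]))
    return np.log2(1 + sinr)                            # = -log2(ek)

print([rate(k) for k in range(K)])
```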
Further, in step S2, the target problem is abstracted as a Markov decision problem, and the basic elements involved in the reinforcement learning agent's interaction with the environment are defined, including the actions, the states, and the setting of the reward function; specifically:
In the establishment of the Markov decision model, the state is set as the channel state at the current moment, where the channel state information comprises a real part and an imaginary part. The action of the policy is a one-dimensional vector v whose length equals the number of intelligent reflecting surface elements; assuming the number of intelligent reflecting surface elements is N, the action at any moment t is expressed as:
v(t)=[v1(t),v2(t),…vN(t)]
The reward reflects performance on the target problem; according to the users' transmission rate formula, the sum of the rates of all user equipment is taken as the reward function, which can be expressed as:
r(t) = Σk Rk(t)
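As a concrete illustration, the state, action and reward of this Markov decision model can be encoded as below. This is a minimal sketch: the helper per_user_rates standing in for the rate formula above is a hypothetical placeholder, and N = 24, K = 2 are example sizes.

```python
import numpy as np

N, K = 24, 2                                   # IRS elements, users (example)
rng = np.random.default_rng(1)

# State: real and imaginary parts of all channel state information,
# flattened into one vector (random placeholders for h_d, h_r, G).
channels = [rng.standard_normal((K, 2)) + 1j * rng.standard_normal((K, 2)),
            rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N)),
            rng.standard_normal((2, N)) + 1j * rng.standard_normal((2, N))]
state = np.concatenate([np.concatenate([c.real.ravel(), c.imag.ravel()])
                        for c in channels])

# Action: one-dimensional 0/1 vector v(t) = [v1(t), ..., vN(t)],
# thresholded here from the policy network's probability output.
probs = rng.uniform(size=N)                    # stand-in for the policy output
v_t = (probs > 0.5).astype(int)

def per_user_rates(state, v_t):
    """Hypothetical placeholder for the per-user rate formula Rk above."""
    return rng.uniform(1.0, 3.0, K)

# Reward: sum of the rates of all user equipment.
r_t = per_user_rates(state, v_t).sum()
print(state.shape, v_t, r_t)
```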
further, the step S3 constructs a deep reinforcement learning algorithm based on a classical presenter-reviewer framework, and the deep reinforcement learning algorithm comprises a policy network and an evaluation network, wherein the policy network is used for outputting action decisions of an agent, and the evaluation network is used for evaluating actions taken by the policy network in a current state and providing gradients of updating of the policy network on the basis of the actions, and the method specifically comprises the following steps:
The policy network consists of three modules. Module 1 is a bidirectional convolution: it convolves the input link channels in the horizontal and vertical directions, then applies a 1×1 convolution to the outputs of the two directions, and finally merges the two outputs into a one-dimensional vector that is fed to Module 2. Module 2 is mainly a fully-connected structure with two densely connected layers; the dense connections allow the input of the current layer to contain the features of all preceding layers, further improving the channel feature extraction and analysis capability. Module 3 adjusts the output of the network with two fully-connected layers; the output of the policy network is a one-dimensional probability vector whose length equals the number of intelligent reflecting surface elements, a larger probability indicating a higher likelihood of that element being activated.
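A minimal PyTorch sketch of this three-module policy network follows. The intermediate channel count of 64 before the 1×1 convolutions, the layer width of 512, and the single-link input shape (batch, 2, N, M) with real/imaginary parts as the two input channels are illustrative assumptions, not values fixed by the foregoing description.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Three-module policy network: bidirectional conv -> dense FC -> output."""
    def __init__(self, n_elements=24, n_antennas=2, width=512):
        super().__init__()
        # Module 1: bidirectional convolution over one link's channel matrix.
        self.conv_v = nn.Conv2d(2, 64, kernel_size=(n_elements, 1))  # vertical
        self.conv_h = nn.Conv2d(2, 64, kernel_size=(1, n_antennas))  # horizontal
        self.mix_v = nn.Conv2d(64, 64, kernel_size=1)                # 1x1 conv
        self.mix_h = nn.Conv2d(64, 64, kernel_size=1)
        feat = 64 * n_antennas + 64 * n_elements
        # Module 2: two densely connected fully-connected layers; each layer's
        # input is the concatenation of all previous layers' outputs.
        self.fc1 = nn.Linear(feat, width)
        self.fc2 = nn.Linear(feat + width, width)
        # Module 3: two fully-connected layers adjusting the output dimension;
        # Sigmoid yields per-element activation probabilities.
        self.out1 = nn.Linear(feat + 2 * width, width)
        self.out2 = nn.Linear(width, n_elements)

    def forward(self, x):
        relu = torch.relu
        v = relu(self.mix_v(relu(self.conv_v(x)))).flatten(1)
        h = relu(self.mix_h(relu(self.conv_h(x)))).flatten(1)
        d0 = torch.cat([v, h], dim=1)            # merged one-dimensional vector
        d1 = relu(self.fc1(d0))
        d2 = relu(self.fc2(torch.cat([d0, d1], dim=1)))
        z = relu(self.out1(torch.cat([d0, d1, d2], dim=1)))
        return torch.sigmoid(self.out2(z))       # activation probabilities

net = PolicyNet()
probs = net(torch.randn(4, 2, 24, 2))            # batch of 4 channel states
print(probs.shape)                               # torch.Size([4, 24])
```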
The evaluation network likewise consists of three modules. Module 1 also adopts bidirectional convolution; compared with the policy network, the input of the evaluation network additionally contains the action output by the policy network, so the evaluation network fuses the policy network's action into the features after the 1×1 convolution and then feeds the result to Module 2. The structure of Module 2 is identical to that of the policy network. Module 3 uses two fully-connected layers mainly to adjust the output of the network; the output of the evaluation network is a Q value, so it finally outputs a single value.
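Analogously, a compact sketch of the evaluation network, reusing the policy network's Module 1/2 layout. That the action is concatenated to the flattened features right after the 1×1 convolutions is an assumption consistent with, but not pinned down by, the description above.

```python
import torch
import torch.nn as nn

class CriticNet(nn.Module):
    """Evaluation network: Modules 1-2 as in the policy net, action fused in,
    a single scalar Q value out."""
    def __init__(self, n_elements=24, n_antennas=2, width=512):
        super().__init__()
        self.conv_v = nn.Conv2d(2, 64, kernel_size=(n_elements, 1))
        self.conv_h = nn.Conv2d(2, 64, kernel_size=(1, n_antennas))
        self.mix_v = nn.Conv2d(64, 64, kernel_size=1)
        self.mix_h = nn.Conv2d(64, 64, kernel_size=1)
        feat = 64 * n_antennas + 64 * n_elements + n_elements  # + action
        self.fc1 = nn.Linear(feat, width)
        self.fc2 = nn.Linear(feat + width, width)
        self.out = nn.Linear(feat + 2 * width, 1)   # Module 3: single Q value

    def forward(self, x, action):
        relu = torch.relu
        v = relu(self.mix_v(relu(self.conv_v(x)))).flatten(1)
        h = relu(self.mix_h(relu(self.conv_h(x)))).flatten(1)
        d0 = torch.cat([v, h, action], dim=1)       # fuse the policy's action
        d1 = relu(self.fc1(d0))
        d2 = relu(self.fc2(torch.cat([d0, d1], dim=1)))
        return self.out(torch.cat([d0, d1, d2], dim=1))

q = CriticNet()(torch.randn(4, 2, 24, 2), torch.rand(4, 24))
print(q.shape)                                      # torch.Size([4, 1])
```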
Further, the bidirectional convolution module performs bidirectional convolution on the channel state of each link separately;
Module 2 is a fully-connected structure with two densely connected layers. The width of all fully-connected layers is limited to at most 1024. As for activation functions, the action output of the policy network uses Sigmoid to represent the probability of activating each intelligent reflecting surface element, and all other activation functions are ReLU. The size of the experience replay pool is 2^18, the learning rate of the neural networks is 2^-16, the variance of the exploration noise is 0.0001, the optimizer is Adam, and the discount coefficient is set to 0.
The invention has the following advantages and beneficial effects:
1. Aiming at the cumbersome integer optimization of traditional communication algorithms, the invention provides a joint optimization mode combining a traditional communication algorithm with deep reinforcement learning: the fitting capability of a neural network replaces the complex iterations of the traditional algorithm, and only the adjustment of the intelligent reflecting surface element phases is computed by the traditional communication algorithm.
2. Aiming at the structure of the channel state, a combination of bidirectional convolution and a densely connected fully-connected structure is provided to extract channel features, which greatly improves the capability of extracting channel state information and accelerates convergence during network training.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
An intelligent activation method for intelligent reflecting surface elements based on deep reinforcement learning is characterized by comprising the following steps:
S1, establishing a system model according to an actual communication scene and formulating the target problem to be solved according to the system model;
S2, abstracting the problem, as is usual in reinforcement learning, as a Markov decision problem, and defining the basic elements involved in the reinforcement learning agent's interaction with the environment, including the actions, the states, and the setting of the reward function;
S3, constructing a deep reinforcement learning algorithm based on the classical actor-critic framework, the algorithm mainly comprising two parts: a policy network used for outputting the action decisions of the agent, and an evaluation network used for evaluating the actions taken by the policy network in the current state and, on that basis, providing the gradient for the policy network update;
S4, since introducing the intelligent reflecting surface poses challenges to the channel feature extraction of the policy network and the evaluation network, redesigning the structure of the networks to complete the extraction of channel state information.
Preferably, the intelligent reflecting surface mainly serves as a relay device between the base station and the user equipment to assist communication between them. A multi-user uplink transmission scene is considered in which the line-of-sight channel between the base station and the user equipment is severely blocked, so the cascaded link of the intelligent reflecting surface is used to assist in completing the communication task between the base station and the user equipment; the channels follow a Rayleigh fading model throughout. Generally, unless otherwise specified, the intelligent reflecting surfaces referred to in the present invention are all passive intelligent reflecting surfaces.
Preferably, in S2, according to the interaction model between the agent and the environment in a Markov decision process, at any moment t the agent takes an action according to the current environment state and obtains an instant reward, after which the environment state is updated to that of the next moment. According to this basic interaction model and the target problem, the channel state information at each moment is set as the state of the environment; the action is a one-dimensional vector whose length equals the number of intelligent reflecting surface elements and indicates the current intelligent reflecting surface activation strategy; and since a reward function is usually used to measure how well the target is met, the sum of the communication rates of the user equipment is taken as the reward.
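Per time step, the interaction described here reduces to the standard MDP loop. A minimal sketch with placeholder dynamics follows; the names sample_channel_state and sum_rate, and their random bodies, are illustrative stand-ins, not part of the invention.

```python
import numpy as np

N = 24                                      # number of IRS elements (example)
rng = np.random.default_rng(2)

def sample_channel_state():
    """Placeholder: real/imag channel state information as a flat vector."""
    return rng.standard_normal(2 * (2 + N + 2 * N))   # h_d, h_r, G for 2 users

def sum_rate(state, action):
    """Placeholder for the sum of user rates under activation vector `action`."""
    return float(action.sum()) * 0.1                  # illustrative only

policy = lambda s: (rng.uniform(size=N) > 0.5).astype(float)  # stand-in agent
replay = []

state = sample_channel_state()
for t in range(100):
    action = policy(state)                    # activation strategy at moment t
    reward = sum_rate(state, action)          # instant reward
    next_state = sample_channel_state()       # environment moves on
    replay.append((state, action, reward, next_state))
    state = next_state
```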
Preferably, the deep reinforcement learning algorithm based on the actor-critic framework is a currently mainstream deep reinforcement learning algorithm. The policy network focuses on the policy itself and outputs the intelligent reflecting surface element activation strategy according to the current state. If the policy network were updated by the policy gradient alone, the agent would usually be required to complete full exploration trajectories, a requirement that cannot be met in practice. To improve the update speed of the policy network, an evaluation network that evaluates the performance of the policy network in real time is added; while evaluating the policy, it simultaneously drives the policy network to update in the direction guided by the evaluation network. It is worth mentioning that the phases of the intelligent reflecting surface are calculated by a traditional communication algorithm, while the neural network acts as the decision maker that selects which intelligent reflecting surface elements to activate in the environment with which the agent interacts, thereby handling the continuous mixed-integer optimization problem.
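The update just described corresponds to a deterministic actor-critic step of the DDPG type: the critic regresses toward the reward target, and the actor ascends the critic's output. Below is a minimal sketch assuming small stand-in fully-connected networks in place of the convolutional networks described later; with the discount coefficient set to 0, as in the embodiment below, the critic target is simply the instant reward.

```python
import torch
import torch.nn as nn

state_dim, n_elements, gamma = 100, 24, 0.0      # gamma = 0 per the embodiment
actor = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                      nn.Linear(256, n_elements), nn.Sigmoid())
critic = nn.Sequential(nn.Linear(state_dim + n_elements, 256), nn.ReLU(),
                       nn.Linear(256, 1))
opt_a = torch.optim.Adam(actor.parameters(), lr=2 ** -16)
opt_c = torch.optim.Adam(critic.parameters(), lr=2 ** -16)

# One update on a sampled mini-batch (random placeholders here).
s = torch.randn(32, state_dim)
a = (torch.rand(32, n_elements) > 0.5).float()
r = torch.rand(32, 1)

# Critic: regress Q(s, a) toward the (undiscounted) reward target.
q = critic(torch.cat([s, a], dim=1))
critic_loss = nn.functional.mse_loss(q, r)       # target = r since gamma = 0
opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

# Actor: the critic provides the gradient -- update the policy in the
# direction that increases the critic's evaluation of its actions.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```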
Preferably, since introducing the intelligent reflecting surface greatly increases the number and complexity of the channels, an extraction structure based on the ideas of bidirectional convolution and dense connection, as shown in figs. 4 and 5, is used to improve the network's capability of extracting channel features. The convolution kernels in the bidirectional convolution are chosen mainly according to the shape of each link's channel state, exploiting the translational invariance of the channel state. The densely connected structure can, on the one hand, compensate for insufficient network capacity and, on the other hand, synthesize all input features to further improve the feature extraction capability of the network.
According to the intelligent-reflecting-surface-enhanced multi-user uplink multiple-input single-output system model considered in fig. 2, it is assumed that 2 single-antenna user equipments transmit information to a remote data center through a base station configured with 2 antennas; considering that no line-of-sight channel exists between the base station and the user equipments, communication can only be assisted by means of the intelligent reflecting surface between them. The target problem to be solved is to complete high-quality activation of a given number of intelligent reflecting surface elements while guaranteeing the communication rate. Giving only the formulas necessary for solving the problem, the transmission rate of the kth user device can be expressed as:
Rk = log2(1/ek)
wherein Φ represents the intelligent reflecting surface coefficient matrix, IM represents the standard identity matrix, and hd,k, hr,k and G represent the channel states from the base station to the user equipment, from the intelligent reflecting surface to the user equipment, and from the intelligent reflecting surface to the base station, respectively; ek is the mean square error defined above.
In the establishment of the Markov decision model, the state is set as the channel state at the current moment, where the channel state information comprises a real part and an imaginary part. The action of the policy is a one-dimensional vector v whose length equals the number of intelligent reflecting surface elements; assuming the number of intelligent reflecting surface elements is N, the action at any moment t can be expressed as:
v(t)=[v1(t),v2(t),…vN(t)]
The reward generally needs to reflect performance on the target problem. Taking the sum of the rates of all user devices, computed from the users' transmission rate formula, as the reward function, it can be expressed as:
r(t) = Σk Rk(t)
According to the overall algorithm framework in fig. 1, the policy network outputs a high-quality activation decision for the intelligent reflecting surface elements, which is the activation action made by the policy network for the current environment state; the evaluation network is used to evaluate the action of the current policy network and, during the update of the policy network, guides it to update in the direction that increases the evaluation network's output, i.e., it provides the gradient for the policy network update. Meanwhile, to ensure that the policy of the policy network retains a certain degree of exploration, exploration noise is added while the agent interacts with the environment to collect data, serving as the agent's trade-off between exploration and exploitation. The inputs of the policy network and the evaluation network both contain channel state information; in addition, the input of the evaluation network requires the action output by the policy network, since its output is an overall evaluation of the current state-action pair.
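A short illustration of the exploration noise mentioned above: Gaussian noise with the variance 0.0001 given in the embodiment is added to the policy output during data collection. Clipping the result back to valid probabilities before thresholding is an assumption, not stated in the description.

```python
import torch

noise_std = 0.0001 ** 0.5            # variance 0.0001 from the embodiment
probs = torch.rand(24)               # stand-in for the policy network output
noisy = (probs + noise_std * torch.randn_like(probs)).clamp(0.0, 1.0)
action = (noisy > 0.5).float()       # activation decision while exploring
```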
According to the overall network framework in fig. 3, Modules 1 and 2 mainly complete the extraction of the new channel state and form the core of the network, while Module 3 mainly meets the output-dimension requirements of the different networks and will not be described in detail. The focus here is on the structures of Modules 1 and 2. The network structure in Module 1 is mainly the bidirectional convolution of fig. 4, which has a stronger feature extraction capability than a fully-connected structure and helps the algorithm train faster. Bidirectional convolution is adopted mainly in consideration of the actual translational invariance of the channel state. In practice there are two main implementations of bidirectional convolution: one is to splice the channel state information of all links into a regular rectangle, which makes the convolution convenient to implement; the other is to perform the bidirectional convolution on the channel state of each link separately. Since the padding elements inevitably introduced by the splicing of the first mode would cause errors, the latter mode is adopted. Specifically, the total number of intelligent reflecting surface elements is set to 24, and the overall flow of the bidirectional convolution is analyzed in detail, taking the channel G from the intelligent reflecting surface to the base station as an example. The bidirectional convolution, as the name implies, applies convolution kernels of size (24×1) and (1×2) to G in the two directions, then applies a convolution with kernel size (1×1) to each of the two directional outputs, with 512 and 256 channels respectively, and finally merges the two outputs into a given output dimension to be sent to Module 2.
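Following the shapes stated above for the channel G (24 elements × 2 antennas, with real and imaginary parts as the input channels), the bidirectional convolution can be sketched as below. The 64 intermediate channels before the 1×1 convolutions are an assumption, since the description fixes only the 512 and 256 channel counts of the 1×1 convolutions.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 2, 24, 2)       # (batch, real/imag, 24 elements, 2 antennas)

conv_24x1 = nn.Conv2d(2, 64, kernel_size=(24, 1))    # one direction over G
conv_1x2  = nn.Conv2d(2, 64, kernel_size=(1, 2))     # the other direction
mix_512   = nn.Conv2d(64, 512, kernel_size=1)        # 1x1 conv, 512 channels
mix_256   = nn.Conv2d(64, 256, kernel_size=1)        # 1x1 conv, 256 channels

a = torch.relu(mix_512(torch.relu(conv_24x1(x))))    # -> (8, 512, 1, 2)
b = torch.relu(mix_256(torch.relu(conv_1x2(x))))     # -> (8, 256, 24, 1)

# Merge both directional outputs into one one-dimensional vector for Module 2.
merged = torch.cat([a.flatten(1), b.flatten(1)], dim=1)
print(merged.shape)                                  # torch.Size([8, 7168])
```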
Module 2 adopts the fully-connected structure with two densely connected layers shown in fig. 5. This connection can, on the one hand, avoid the problem of insufficient network capacity and, on the other hand, because in a dense connection the current layer comprehensively contains the features of all preceding layers, it further improves the extraction of the channel state. It is worth mentioning that the width of all fully-connected layers is limited to at most 1024; as for activation functions, the action output of the policy network uses Sigmoid to represent the probability of activating each intelligent reflecting surface element, and all other activation functions are ReLU; the size of the experience replay pool is 2^18; the learning rate of the neural networks is 2^-16; the variance of the exploration noise is 0.0001; the optimizer is Adam; and the discount coefficient is set to 0.
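For completeness, a sketch of Module 2's two densely connected layers together with the hyperparameters listed above, collected in one place. The input dimension of 7168 follows the bidirectional-convolution sketch above and is otherwise an assumption.

```python
import torch
import torch.nn as nn

class DenseFC(nn.Module):
    """Two densely connected FC layers: each layer sees all previous outputs."""
    def __init__(self, in_dim=7168, width=1024):     # width capped at 1024
        super().__init__()
        self.fc1 = nn.Linear(in_dim, width)
        self.fc2 = nn.Linear(in_dim + width, width)

    def forward(self, x):
        h1 = torch.relu(self.fc1(x))
        h2 = torch.relu(self.fc2(torch.cat([x, h1], dim=1)))
        return torch.cat([x, h1, h2], dim=1)          # features of all layers

# Hyperparameters of the embodiment, gathered as one configuration.
config = dict(replay_pool_size=2 ** 18,
              learning_rate=2 ** -16,
              exploration_noise_var=1e-4,
              optimizer="Adam",
              discount=0.0)
print(DenseFC()(torch.rand(8, 7168)).shape, config)
```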
FIG. 6 compares the training curves of the actual algorithm training process using the bidirectional convolution and a fully-connected structure, and FIG. 7 shows the achievable sum of user equipment rates when the total number of intelligent reflecting surface elements is 18, 24 and 32, respectively.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.