Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, and not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort shall fall within the scope of the invention.
Referring to fig. 1, a flow chart of a multi-agent path planning method based on an improved RND3QN network according to the present application is shown.
As shown in fig. 1, the multi-agent path planning method based on the improved RND3QN network specifically includes the following steps:
Step S101, acquiring original image data, radar data, and state data of an agent.
In this step, a color image is obtained and resized;
The image captured by the laser radar is a 320×240-pixel three-channel RGB color image, which is easily corrupted by noise from surrounding environmental factors. The RGB image captured by the laser radar is acquired and fused with IMU data to obtain a road-surface visual RGB image, which is then converted into an HSL image according to a preset format conversion strategy. The HSL feature map provides the RND3QN network with an illumination-invariant topological representation, which improves the planning success rate of the path decision module in day-night switching scenes. The format conversion strategy is expressed as:
,
,
,
where R, G and B are the color channels, max and min are the maximum and minimum values of the three color channels, and H, S and L are the hue, saturation and lightness of the HSL image, respectively;
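The exact conversion used by this embodiment is given by the equations above. For illustration only, the sketch below implements the conventional RGB-to-HSL formulas in terms of the same quantities (R, G, B, max, min, H, S, L); it is an assumption that the embodiment follows this standard convention.

```python
import numpy as np

def rgb_to_hsl(img: np.ndarray) -> np.ndarray:
    """img: HxWx3 RGB array scaled to [0, 1]; returns stacked H, S, L channels.
    Illustrative standard conversion, not necessarily the patented variant."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    cmax = img.max(axis=-1)
    cmin = img.min(axis=-1)
    delta = cmax - cmin

    l = (cmax + cmin) / 2.0                                      # lightness
    s = np.where(delta == 0, 0.0,
                 delta / np.maximum(1.0 - np.abs(2.0 * l - 1.0), 1e-12))  # saturation
    h = np.zeros_like(l)                                          # hue in degrees [0, 360)
    mask = delta > 0
    r_max = mask & (cmax == r)
    g_max = mask & (cmax == g)
    b_max = mask & (cmax == b)
    h[r_max] = 60.0 * (((g - b)[r_max] / delta[r_max]) % 6.0)
    h[g_max] = 60.0 * ((b - r)[g_max] / delta[g_max] + 2.0)
    h[b_max] = 60.0 * ((r - g)[b_max] / delta[b_max] + 4.0)
    return np.stack([h, s, l], axis=-1)
```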
In the simulation environment, the laser radar sensor provides an array at each time step that represents the distances from the agent to obstacles at 1° intervals over a 180° range, taking the agent as the reference point, so that the distance and position information of objects in the scene is extracted. The agent's own state is calculated from the output of the simulated inertial measurement unit sensor, including the agent's position, current speed, angular velocity, distance to the target location, and angle to the target; these quantities are finally combined into one group of numbers.
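A minimal sketch of how the radar scan and the agent's own state might be assembled into a single observation vector, assuming one range reading per degree over the 180° field and the five state quantities named above; the function and variable names are illustrative, not taken from the embodiment.

```python
import numpy as np

def build_observation(lidar_ranges, position, speed, angular_velocity,
                      dist_to_goal, angle_to_goal):
    """Concatenate the 180-degree lidar scan (one reading per degree)
    with the agent's own state into one observation vector."""
    lidar = np.asarray(lidar_ranges, dtype=np.float32)        # e.g. shape (181,)
    state = np.array([*position, speed, angular_velocity,
                      dist_to_goal, angle_to_goal], dtype=np.float32)
    return np.concatenate([lidar, state])
```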
And step S102, extracting features of the original image data, the radar data and the state data of the intelligent agent according to a preset convolutional neural network-long-short-term memory architecture to obtain a multi-source measurement feature vector.
In this step, the convolutional layers of a CNN extract spatial features from the input data. After the convolutional layers, the data enters a pooling layer for compression and dimensionality reduction, which also helps avoid overfitting. The convolution kernels perform filtering operations on the input data, a ReLU function provides the nonlinear mapping, and the pooling layer reduces dimensionality through downsampling.
Multiple LSTM units are stacked to capture long-term dependencies in the time-series data. The agent's four-dimensional state vector is combined with the radar data as the LSTM input; the LSTM retains memory of past patterns while processing the data, and its output is combined with the CNN output to guide navigation decisions.
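The PyTorch sketch below illustrates one possible CNN-LSTM feature extractor of the kind described: convolution and pooling layers compress the image, stacked LSTM layers process the radar-plus-state sequence, and the two outputs are concatenated into the multi-source measurement feature vector. The layer sizes, input dimensions and class name are assumptions for illustration, not the patented architecture.

```python
import torch
import torch.nn as nn

class CnnLstmExtractor(nn.Module):
    """Illustrative CNN-LSTM feature extractor (layer sizes are assumptions)."""
    def __init__(self, lidar_state_dim=185, lstm_hidden=64, lstm_layers=2):
        super().__init__()
        self.cnn = nn.Sequential(                        # spatial features from the HSL image
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.MaxPool2d(2),                             # downsampling / dimension reduction
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten())
        self.lstm = nn.LSTM(lidar_state_dim, lstm_hidden,
                            num_layers=lstm_layers, batch_first=True)

    def forward(self, image, lidar_state_seq):
        img_feat = self.cnn(image)                       # (B, 32*4*4)
        _, (h_n, _) = self.lstm(lidar_state_seq)         # stacked LSTMs over time steps
        seq_feat = h_n[-1]                               # final hidden state of last layer
        return torch.cat([img_feat, seq_feat], dim=1)    # multi-source feature vector
```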
The multi-source measurement feature vector extraction strategy (MMFS) is introduced because traditional algorithms handle dynamic-obstacle information poorly; the strategy improves the efficiency of extracting dynamic-obstacle feature information. Its expression is as follows:
,
,
where the terms denote the multi-source measurement feature vector extraction strategy; a weighting function based on the spatial distance Pi and the terrain roughness, used to assign different importance or influence to data points or regions according to different factors; a covariance function based on the spatial distance Pi and the terrain roughness; the localization function of the agent at different spatial distances; the localization function of the agent under different terrain roughness; the weight function; the spatial distance; the terrain roughness; the phase differences between time t=1 and the initial time, between t=2 and t=1, between t=3 and t=2, between t=4 and t=3, and between t=5 and t=4; the vector bases at t=1 through t=5; the vector base at the initial moment; and the angle values at different moments;
In the multi-source measurement feature vector extraction strategy, a comprehensive environment perception model is built by fusing data from various sensors; its core is to extract key information from complex data sources to support the subsequent path planning and decision-making processes. By simulating the principle of human binocular vision, a binocular camera provides depth information and accurate two-dimensional pixel coordinates of the scene. This not only enriches the dimensions of the multi-source measurement feature vector, but also provides high-precision visual data support for obstacle detection and obstacle avoidance strategies in path planning. Specifically:
,
,
where the terms denote the path energy consumption; the observation data at the camera pose; the weight function; the curvature feature function; two measurement thresholds; a particular angular frequency; the path energy consumption at the zero moment; the path energy consumption at the initial moment; the deflection angle at the initial moment; the rotation angle at the initial moment; the deflection angles at different moments; the rotation angles at different moments; the angular frequency at the initial moment; the different moments; and the angular frequencies at different moments;
Global positioning efficiency is improved by calculating the scale of the feature points of a single-frame image. Feature points are extracted from the left-eye image and tracked in the right-eye image using an optical flow method, and the feature points of the left and right images are then matched. After the binocular camera is calibrated, a set of correctly matched feature points is selected; their homogeneous coordinates in the pixel coordinate system are (u, v, 1), the coordinates (x, y, z) in the camera coordinate system are calculated, and the feature-point scale is then obtained as follows:
,
,
where C is the pixel coordinate under binocular vision, and the remaining terms denote the projection matrix, the coordinate matrices of the left and right cameras, the intrinsic matrices of the left and right cameras, the extrinsic matrices of the left and right cameras, and the distortion parameter.
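A hedged sketch of recovering the 3D coordinates of matched left/right feature points using OpenCV's triangulation; the 3×4 projection matrices stand in for the intrinsic/extrinsic matrices named above and are assumed to come from the camera calibration. This is a generic stereo reconstruction, not necessarily the scale computation of the embodiment.

```python
import cv2
import numpy as np

def triangulate_matches(P_left, P_right, pts_left, pts_right):
    """P_left, P_right: 3x4 projection matrices from calibration.
    pts_left, pts_right: 2xN arrays of matched pixel coordinates (u, v).
    Returns an Nx3 array of (x, y, z) points in the left camera frame."""
    pts4d = cv2.triangulatePoints(P_left, P_right, pts_left, pts_right)
    pts3d = (pts4d[:3] / pts4d[3]).T          # convert from homogeneous coordinates
    return pts3d
```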
The expression of the positioning function of the intelligent agent under different characteristic conditions is as follows:
,
where the terms denote the set of natural numbers; a natural number; the map edge information of an arbitrary line feature point Pk; the distortion of the optimal position estimate; and an arbitrary line feature point k;
Non-iterative distortion compensation is introduced to calculate the distortion of the optimal position estimate, expressed as follows:
,
where the terms denote the undistorted point cloud; the Jacobian matrix obtained by minimizing the weighted point-to-line and point-to-plane distance values; and the position transform coefficient;
The estimated position coordinates are optimized using the Gauss-Newton method; a left perturbation model is introduced and the Jacobian matrix is constructed as follows:
,
where the terms denote the inter-frame space; the position transformation matrix; the weight of the line feature point Pk; the weight of the face feature point Ps; a 3×3 identity matrix; a 1×3 identity matrix; a 1×3 zero matrix; and a 3×3 zero matrix;
The line feature points are described in three dimensions, and their antisymmetric matrix is then extracted. The 5 points in the global map closest to each line feature point are selected, and two points are determined from the eigenvectors of these points. Similarly, the 5 points closest to each surface feature point are selected and normalized to form a plane, and the residual matrix and Jacobian matrix are constructed again.
When line features and surface features are acquired continuously, a weight function is introduced to balance the extraction of line and surface features. The weight function is defined as:
,
where the terms denote the distance between the line feature point Pk and the global line feature map; the local roughness of the line feature point Pk; the local roughness of the edge information of the line feature point Pk; the global map edge information; and the line feature point set;
,
where the terms denote the local roughness of the edge information of the surface feature point Ps and the set of surface feature points;
Position estimation of the mobile agent in the global map is achieved by minimizing the distance between the feature points and the global features. The distance to the global line feature map is calculated as follows:
,
where the terms denote the distance coefficient; the surface feature points; the line feature points; the face position of the neighboring point cloud; the line position; the point coordinates; the normal vector of the face; and the direction of the candidate line in the moving agent's location space.
Step S103, inputting the multisource measurement feature vector to an improved RND3QN network, training the improved RND3QN network through an incremental learning strategy to obtain a path planning model, wherein training the improved RND3QN network comprises the steps of introducing a greedy strategy and a Boltzmann probability selection strategy to converge, and adopting an n-step time sequence difference algorithm to adjust path deviation.
In this step, the greedy strategy rapidly guides the agent to explore effective paths by selecting the current optimal action, while the Boltzmann probability selection strategy introduces randomness in the subsequent stage to ensure that the agent can escape local optima and search for the globally optimal path.
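A minimal sketch of combining the two selection rules just described: an ε-greedy rule exploits the current best action with occasional random exploration, while a Boltzmann (softmax over Q-values) rule injects graded randomness so the agent can escape local optima. The temperature value and the switch between the two modes are illustrative assumptions.

```python
import numpy as np

def select_action(q_values, epsilon, temperature=1.0, use_boltzmann=False, rng=None):
    """Pick an action either epsilon-greedily or by Boltzmann sampling over Q-values."""
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=np.float64)
    if use_boltzmann:
        logits = (q - q.max()) / temperature          # shift for numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return int(rng.choice(len(q), p=probs))
    if rng.random() < epsilon:                        # explore
        return int(rng.integers(len(q)))
    return int(q.argmax())                            # exploit current best action
```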
The memory effect of a magnetic memristor is combined with the temporal modeling capability of the GRU (gated recurrent unit) to construct a chaotic generator with dynamic weight-adjustment capability:
,
where the terms denote the GRU unit computation; the weight matrices applied to the previous-moment hidden state in the reset and update functions; the bias vector of the hidden state; the weight matrices of the hidden state and of the previous-moment hidden state; element-wise multiplication; the hidden state at the previous moment; the input information at the current moment; the update function and its bias vector; the reset function; the sigmoid function; the candidate hidden state; the weight matrix of the reset function; the bias vector of the weight function; the hidden state passed to the next moment; the weight matrix of the update function; and the tanh function;
Driven by the GRU-memristor framework, a chaotic oscillator with multistable characteristics is designed based on the nonlinear coupling between an improved Chua's circuit and a magnetically controlled memristor. The system equation of the chaotic oscillator is expressed as:
,
where f(w) = dw² is a function of the state variable w; the remaining variables are system state variables; g(x) = 0.9x is a function of the state variable x; and the system parameters are taken as 9, 10 and 60, respectively;
A step function is selected as the nonlinear kernel function for chaos generation, and the system equation of the chaotic oscillator evolves into:
,
where the term denotes a nonlinear function of the state variable y;
wherein the nonlinear function is set to one of the following two forms, expressed as:
,
,
where both expressions are nonlinear shape functions, and the remaining terms denote a system parameter; the set of natural numbers; the sign function; the number of attractors; and the sign function applied to the variable y;
A chaotic sequence in [0, 1] is generated through the improved Tent map and converted into the individual search space. The expression of the improved Tent map is as follows:
,
where the terms denote the output value at time i+1, the output value at time i, and an input value;
After the Bernoulli shift transform, the Tent map is expressed as:
,
where the first term is a random number in (0, 1) and Z is the number of particles in the chaotic sequence.
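For intuition, the sketch below generates a chaotic sequence with the classical Tent map and rescales it into a search interval; the improved variant and the Bernoulli-shift form used in this embodiment follow the equations above, so this standard form is only an approximation for illustration.

```python
import numpy as np

def tent_map_sequence(x0, length, lo=0.0, hi=1.0):
    """Classical Tent map x_{i+1} = 2*x_i (x_i < 0.5) else 2*(1 - x_i),
    rescaled from [0, 1] into the search interval [lo, hi]."""
    seq = np.empty(length)
    x = x0
    for i in range(length):
        x = 2.0 * x if x < 0.5 else 2.0 * (1.0 - x)
        seq[i] = lo + (hi - lo) * x
    return seq
```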
The chaotic maximum-step-number random generation strategy guides the agent to transition smoothly from a broad exploration stage to a refined exploitation stage by gradually reducing randomness, ensuring that the model converges efficiently to the globally optimal solution and markedly improving convergence speed and stability. The greedy strategy and the Boltzmann strategy form the action-selection base, while the chaotic mechanism provides the driving force for environment exploration; through temporal coupling, a dynamic balance between exploration breadth (chaotic disturbance) and decision accuracy (Q-value guidance) is achieved, the strength of the randomness decays automatically as training proceeds, and convergence oscillation caused by over-exploration is avoided.
In the later stage of training, the model has fully explored the state space and learned an effective strategy, and the action selection strategy adopts the greedy algorithm to select the best action. In the greedy algorithm, the exploration rate ε is set low to ensure that the optimal action is chosen in most cases. ε is initially set to 1.0 and gradually decreases, but never falls below 0.1, expressed as:
,
where episode_num denotes the number of training rounds.
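One possible implementation of the decay described (start at 1.0, decrease each round, never below 0.1); the decay rate is an assumption, since the exact schedule is given by the equation above.

```python
def epsilon_for_episode(episode_num, eps_start=1.0, eps_min=0.1, decay=0.995):
    """Exponentially decay epsilon per training round, clipped at eps_min."""
    return max(eps_min, eps_start * (decay ** episode_num))
```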
The deviation is reduced by adopting an n-step time sequence difference algorithm (n-step bootstrapping), and the specific contents are as follows:
The n-step return uses the value function to correct the return at time t+n, replacing the sum of all remaining rewards; this ensures that the maximum expected error is no greater than the product of the n-step discount factor and the worst error of the current value estimate, expressed as:
,
where the terms denote the reward accumulated from time t over the next n steps; zero-mean noise; the feature expression at time t; the feature domain; the value of the value function under the probability mapping policy π; the n-step discount factor; the value function; and the value function of the probability mapping policy π.
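A sketch of the n-step bootstrapped return described here: the discounted rewards of the next n steps plus a discounted value-function estimate that replaces the remaining rewards. Function and argument names are illustrative.

```python
def n_step_return(rewards, value_at_t_plus_n, gamma=0.99):
    """rewards: list [R_{t+1}, ..., R_{t+n}]; value_at_t_plus_n: V(S_{t+n}).
    Returns the n-step return G_{t:t+n}."""
    g = 0.0
    for i, r in enumerate(rewards):
        g += (gamma ** i) * r                              # discounted rewards over n steps
    g += (gamma ** len(rewards)) * value_at_t_plus_n       # bootstrap with the value function
    return g
```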
The invention provides a dynamic density-guided reward (DDGR), which specifically aims to solve the problem of path deviation of multiple agents in a dynamic environment; the dynamic density-guided reward is therefore added to improve the robustness of multi-agent path planning in dynamic environments;
The density distribution for the density-function transition is generated in advance: when the shape of the obstacle map is known, a control point is calculated from the forms that require interpolation of the target point, a dynamic transition function of the intermediate form is generated, and the interpolation object is taken as a probability density. The Wasserstein distance between the two probability density distributions is given as follows:
,
,
where the terms denote a transport plan from α to β; the joint distribution function in the two-dimensional Wasserstein distance, representing the density distribution of the optimal transport plan from one probability distribution to the other; the differential measure of the joint distribution function, describing how the mass at a point moves from one location to another in the Wasserstein distance calculation; the position transformation matrix; the optimal position estimation matrix and the actual position matrix of the left camera image; the optimal position estimation matrix and the actual position matrix of the right camera image; and the squared distance between x and y.
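For intuition only, the sketch below computes the 2-Wasserstein distance between two one-dimensional empirical distributions via sorted samples; the embodiment uses the two-dimensional distance with an explicit transport plan, which this simplified form does not capture.

```python
import numpy as np

def wasserstein2_1d(samples_a, samples_b):
    """2-Wasserstein distance between two equal-size 1-D empirical distributions:
    for sorted samples it reduces to the RMS difference of order statistics."""
    a = np.sort(np.asarray(samples_a, dtype=np.float64))
    b = np.sort(np.asarray(samples_b, dtype=np.float64))
    assert a.shape == b.shape, "illustrative form assumes equal sample counts"
    return float(np.sqrt(np.mean((a - b) ** 2)))
```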
The transition probability density distribution is generated as follows:
,
where the terms denote the initial and desired density distributions of the formation, respectively, and the transition probability density distribution of the difference;
A position along the optimal transport path between the two density distributions is selected, and the transition density distribution is calculated by barycentric interpolation over the point sets on all paths, realizing the transition transformation of the density function.
Centered on the extension of the Bellman equation, its iterative formula can be derived as:
,
where the terms denote the cumulative reward; the return value at time t; the cumulative reward at time t+n; and the return value at time t+i;
The target value in the RND3QN network is calculated as follows:
,
where the terms denote the target value and the neural network parameters.
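As a hedged sketch, the snippet below shows the kind of target commonly used in double/dueling DQN variants combined with the n-step return above: the online network selects the greedy action and the target network evaluates it. Whether RND3QN uses exactly this form is specified only by the equation above, so treat the networks and names here as assumptions.

```python
import torch

@torch.no_grad()
def n_step_double_dqn_target(reward_n, next_state, done, online_net, target_net,
                             gamma=0.99, n=3):
    """reward_n: discounted sum of the next n rewards; next_state: S_{t+n}.
    The online net picks the argmax action, the target net evaluates it."""
    best_action = online_net(next_state).argmax(dim=1, keepdim=True)
    next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward_n + (gamma ** n) * next_q * (1.0 - done)
```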
Step S104, acquiring a real-time multi-source measurement feature vector, inputting the real-time multi-source measurement feature vector into the path planning model, and outputting the path planning model to obtain a collaborative path plan of a plurality of agents, wherein the path planning model comprises a new rewarding function based on multi-agent position constraint.
In this step, the new reward function based on the multi-agent position constraint includes a main-line reward function and an auxiliary reward function;
The main-line reward function is used for the global guidance of algorithm convergence: it gives negative feedback when the agent collides with an obstacle or a wall, and positive feedback when the agent reaches the target point without collision, expressed as:
,
where the terms denote the main-line reward function; the wall, which is provided with a plurality of holes; and the obstacle;
The sub-targets of the auxiliary reward function reflect the task and correspond to a state space comprising the angle between the agent's orientation and the target and the distance d. The clockwise rotation angle of the agent lies within a given range; when the angle reaches either boundary of that range, the agent's direction is exactly opposite to the target direction. The expression is:
,
A potential-energy-based distance auxiliary reward is constructed, expressed as follows:
,
,
where the terms denote the distance between the agent's current position and the target position; the absolute distance between the agent's initial position and the target position; and the distance reward;
The angle is used as another source of auxiliary reward, continuously guiding the agent toward the target position through continuous rewards. The angle reward is calculated as follows:
,
,
where the first term takes the integer part of a floating-point number; the relation between the agent's heading and the angle of the target point determines the yaw angle, which is positive when the agent's head yaws to the right and negative otherwise; the remaining terms denote the agent's heading angle, the angle of the target point, the yaw angle, and the remainder (modulo) function;
The distance and angle components of the target are combined to obtain the auxiliary reward function, expressed as:
,
where the terms denote the dot product, the angle reward, and the distance reward.
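A minimal sketch of the reward structure described in this step, assuming illustrative constants: a main-line term for collisions and goal arrival, a potential-energy distance term, and a continuous angle term combined into the auxiliary reward. The constants, weights and exact combination are assumptions; the governing expressions are the equations above.

```python
import math

def main_line_reward(collided, reached_goal, r_collision=-20.0, r_goal=20.0):
    """Global guidance term: penalize wall/obstacle collisions, reward goal arrival."""
    if collided:
        return r_collision
    if reached_goal:
        return r_goal
    return 0.0

def auxiliary_reward(dist_to_goal, initial_dist, heading_angle, goal_angle,
                     w_dist=1.0, w_angle=0.5):
    """Potential-energy-style distance reward plus a continuous angle reward."""
    dist_reward = (initial_dist - dist_to_goal) / max(initial_dist, 1e-6)
    yaw_error = abs((goal_angle - heading_angle + math.pi) % (2 * math.pi) - math.pi)
    angle_reward = 1.0 - yaw_error / math.pi    # 1 when facing the goal, 0 when opposite
    return w_dist * dist_reward + w_angle * angle_reward
```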
In summary, the method of the present application acquires and pre-processes data, performing format conversion on the original image, radar and self-state data; constructs a convolutional neural network-long short-term memory (CNN-LSTM) architecture and extracts multi-source measurement feature vectors as the input of the RND3QN network; adopts an improved RND3QN network architecture, designs a new reward function based on multi-agent position constraints, and trains the model through an incremental learning strategy to realize collaborative path planning for multiple agents; introduces a greedy strategy (ε-greedy) and a Boltzmann probability selection strategy to avoid convergence to local extrema; and adopts an n-step time sequence difference algorithm (n-step bootstrapping) to reduce deviation. The method improves the accuracy of path planning, reduces the probability of collision during movement, shortens the tracking time, enhances the ability to avoid dynamic obstacles, and facilitates the reasonable scheduling of multiple agents.
Referring to fig. 2, a block diagram of a multi-agent path planning system based on an improved RND3QN network according to the present application is shown.
As shown in fig. 2, the multi-agent path planning system 200 includes an acquisition module 210, an extraction module 220, a training module 230, and an output module 240.
The acquisition module 210 is configured to acquire original image data, radar data and state data of an agent. The extraction module 220 is configured to perform feature extraction on the original image data, the radar data and the state data of the agent according to a preset convolutional neural network-long short-term memory architecture to obtain multi-source measurement feature vectors. The training module 230 is configured to input the multi-source measurement feature vectors to an improved RND3QN network and train the improved RND3QN network through an incremental learning strategy to obtain a path planning model, wherein training the improved RND3QN network comprises introducing a greedy strategy and a Boltzmann probability selection strategy to converge and adopting an n-step time sequence difference algorithm to adjust path deviation. The output module 240 is configured to acquire real-time multi-source measurement feature vectors, input them into the path planning model, and output a collaborative path plan for a plurality of agents, wherein the path planning model comprises a new reward function based on multi-agent position constraints.
It should be understood that the modules depicted in fig. 2 correspond to the various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are equally applicable to the modules in fig. 2, and are not described here again.
In other embodiments, the embodiments of the present invention further provide a computer-readable storage medium having a computer program stored thereon; when the program instructions are executed by a processor, they cause the processor to perform the multi-agent path planning method based on the improved RND3QN network of any of the method embodiments described above;
as one embodiment, the computer-readable storage medium of the present invention stores computer-executable instructions configured to:
acquiring original image data, radar data and state data of an intelligent agent;
Performing feature extraction on the original image data, the radar data and the state data of the intelligent agent according to a preset convolutional neural network-long-short-term memory architecture to obtain a multi-source measurement feature vector;
inputting the multisource measurement feature vector to an improved RND3QN network, training the improved RND3QN network through an incremental learning strategy to obtain a path planning model, wherein training the improved RND3QN network comprises the steps of introducing a greedy strategy and a Boltzmann probability selection strategy to converge, and adopting an n-step time sequence difference algorithm to adjust path deviation;
And acquiring a real-time multi-source measurement feature vector, inputting the real-time multi-source measurement feature vector into the path planning model, and outputting the path planning model to obtain a collaborative path plan of a plurality of agents, wherein the path planning model comprises a new rewarding function based on multi-agent position constraint.
The computer-readable storage medium may include a storage program area that may store an operating system, an application program required for at least one function, and a storage data area that may store data created according to the use of a multi-agent path planning system based on an improved RND3QN network, etc. In addition, the computer-readable storage medium may include high-speed random access memory, and may also include memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the computer readable storage medium optionally includes memory remotely located with respect to the processor, the remote memory being connectable over a network to the improved RND3QN network-based multi-agent path planning system. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 3, the device includes a processor 310 and a memory 320. The electronic device may further comprise input means 330 and output means 340. The processor 310, memory 320, input device 330, and output device 340 may be connected by a bus or other means, for example in fig. 3. Memory 320 is the computer-readable storage medium described above. The processor 310 executes various functional applications of the server and data processing by running non-volatile software programs, instructions and modules stored in the memory 320, i.e. implementing the multi-agent path planning method of the above-described method embodiments based on improving the RND3QN network. The input device 330 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the improved RND3QN network based multi-agent path planning system. The output device 340 may include a display device such as a display screen.
The electronic equipment can execute the method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
As an embodiment, the electronic device is applied to a multi-agent path planning system based on an improved RND3QN network, and is used for a client, and comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
acquiring original image data, radar data and state data of an intelligent agent;
Performing feature extraction on the original image data, the radar data and the state data of the intelligent agent according to a preset convolutional neural network-long-short-term memory architecture to obtain a multi-source measurement feature vector;
inputting the multisource measurement feature vector to an improved RND3QN network, training the improved RND3QN network through an incremental learning strategy to obtain a path planning model, wherein training the improved RND3QN network comprises the steps of introducing a greedy strategy and a Boltzmann probability selection strategy to converge, and adopting an n-step time sequence difference algorithm to adjust path deviation;
And acquiring a real-time multi-source measurement feature vector, inputting the real-time multi-source measurement feature vector into the path planning model, and outputting the path planning model to obtain a collaborative path plan of a plurality of agents, wherein the path planning model comprises a new rewarding function based on multi-agent position constraint.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied essentially or in part in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the various embodiments or methods of some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.