CN117933666B

Movatterモバイル変換

Info

Publication number: CN117933666B
Application number: CN202410325765.6A
Authority: CN
Inventors: 陶震宇
Original assignee: Hefei Yihao Intelligent Technology Co ltd
Current assignee: Hefei Yihao Intelligent Technology Co ltd
Priority date: 2024-03-21
Filing date: 2024-03-21
Publication date: 2024-07-12
Anticipated expiration: 2044-03-21
Also published as: CN117933666A

Abstract

The invention discloses a dispatching method, a dispatching device, a dispatching medium, dispatching equipment and dispatching system for a dense storage robot. The method comprises a warehouse overall state monitoring step, a robot scheduling analysis step and a robot scheduling execution step. The whole state monitoring step of the warehouse is used for tracking and monitoring the cargo state of each storage unit in the warehouse, the cargo state of each robot and the storage unit where each robot is located. The robot scheduling analysis step obtains the warehouse overall state data through the warehouse overall state monitoring step, constructs a Q value neural network input vector of each robot according to the warehouse overall state data, inputs the Q value neural network input vector into the Q value neural network to calculate the Q value obtained by different actions executed by each robot, then determines the next action based on the action Q value and greedy strategy, and finally instructs each robot to execute the next action through robot scheduling. The scheduling evaluation of the Q value neural network is based on the cooperative operation of all robots, so that the overall efficiency of all robots in a warehouse is highest.

Description

Intensive warehouse robot scheduling method, device, medium, equipment and system

Technical Field

The invention relates to dense warehouse robot scheduling.

Background

The dense storage stores goods in a compact space by optimizing the warehouse space, so that the maximization of the warehouse space utilization rate is realized. The automatic management of the goods stock management, the warehouse entry and the warehouse exit of the intensive warehouse is generally realized by a intensive warehouse system. Along with the development of artificial intelligence technology, the intensive warehouse system also gradually introduces artificial intelligence. An artificial intelligence based dense warehousing system typically includes a central control system and robots coupled to the central control system. Robots are intelligent mechanical devices that are placed in a warehouse, typically in many numbers. These robots are responsible for the handling of the in-warehouse, out-warehouse and bin collations of the goods within the warehouse and transact the goods registration as they are in-warehouse, out-warehouse and bin collations. The information of the robot handling the registration of the goods includes information of the goods themselves and information of the positions of the goods in the warehouse. The information is transmitted to the central control system through the network and is stored in a database of the central control system. These robots are required to work in conjunction with each other within the warehouse. The collaborative work between robots needs to be scheduled by the central control system. Because the robot walks the route limited in the warehouse, in order to avoid the collision of robot and other robots or on-the-spot staff when walking in the warehouse, under prior art, need the robot wait on the spot or detour, on the one hand extravagant energy consumption, on the other hand lead to the robot operating efficiency low, and reduced real-time response speed.

Disclosure of Invention

The invention aims to solve the problems that: the safety of the robot equipment in the warehouse is ensured, the operation efficiency of the robot in the warehouse is improved, and the real-time response speed is accelerated.

In order to solve the problems, the invention adopts the following scheme:

the invention relates to a dispatching method of a dense warehouse robot, which comprises a robot dispatching analysis step;

The robot scheduling analysis step comprises the following steps:

step S11: acquiring integral state data of a warehouse; the whole state data of the warehouse comprises stock state data of each warehouse unit in the warehouse, cargo carrying states of each robot and the warehouse unit where each robot is currently located in the warehouse;

step S12: constructing a robot Q value neural network input vector for each robot according to the inventory state data of each storage unit in the warehouse, the cargo carrying state of each robot and the storage unit where each robot is currently located in the warehouse;

Step S13: inputting the input vector of the robot Q value neural network of each robot into the Q value neural network to obtain the action Q value evaluation vector of each robot; the action Q value evaluation vector is a vector formed by a plurality of action Q value evaluation values; the action Q value evaluation value corresponds to the next action of the robot, and is the evaluation value of the Q value of the robot for executing the next action; the next action of the robot at least comprises azimuth movement and goods storage and retrieval; the Q value neural network is D3QN;

step S14: performing next action filtering on action Q value evaluation vectors of all robots according to rules of avoiding conflict when the robots walk in a warehouse to obtain filtered action Q value evaluation vectors of all robots;

step S15: selecting a random value for each robot; if the random value is smaller than epsilon, randomly selecting a next action from the filtered action Q value evaluation vector as the next action of the robot, otherwise, selecting the next action corresponding to the element with the largest Q value evaluation value from the filtered action Q value evaluation vector as the next action of the robot.

Further, according to the dense warehouse robot scheduling method, the method further comprises a robot state monitoring step and a robot scheduling execution step; the robot state monitoring step is used for tracking and monitoring the cargo carrying state of each robot and the storage unit of each robot in the warehouse at present; the robot scheduling execution step is used for distributing corresponding action instructions to the corresponding robots to enable the robots to execute the corresponding next actions.

According to the dense storage robot scheduling method, the method further comprises a neural network training step; the neural network training step comprises the following steps:

step ST1: initializing an evaluation network and a target network for the Q-value neural network, and generating current warehouse overall state data in a random generation mode;

Step ST2: inputting vectors of a current robot Q value neural network constructed by the current warehouse overall state data for each robot into an evaluation network to obtain action Q value evaluation vectors of each robot; then, performing next action filtering on action Q value evaluation vectors of all robots according to rules of avoiding conflict when the robots walk in a warehouse to obtain filtered action Q value evaluation vectors of all robots; finally, selecting a random value for each robot; if the random value is smaller than epsilon, randomly selecting a next action from the filtered action Q value evaluation vector as the next action of the robot, otherwise, selecting the next action corresponding to the element with the largest Q value evaluation value from the filtered action Q value evaluation vector as the next action of the robot;

Step ST3: determining the whole state data of a warehouse of the next step and the direct rewards of each robot according to the next step of action of each robot; then calculating joint rewards of all robots according to the direct rewards weight of all robots; the joint rewards of the robot are calculated according to the following formula:

ru (i) =sum (w (j) ×r (j)); wherein,

Ru (i) represents a joint reward for the ith robot;

sum represents all robot summations;

r (j) represents a direct reward for the jth robot;

w (j) represents the weight of the direct reward of the jth robot, wherein when i=j, w (j) =1;

Step ST4: building a historical experience data item, and adding the historical experience data item into an experience playback pool; the historical experience data items comprise current warehouse overall state data, a collection of robot action rewards and warehouse overall state data of the next step; the robot action rewarding item comprises the next action of the robot and combined rewarding thereof;

step ST5: if the number of the historical experience data items in the experience playback pool is smaller than a first fixed value, taking the whole state data of the warehouse in the next step as the whole state data of the current warehouse, and repeatedly executing the steps ST2 to ST5;

Step ST6: randomly extracting m historical experience data items from an experience playback pool, acquiring a current robot Q value neural network input vector of each robot and a next robot Q value neural network input vector of each robot according to the extracted historical experience data items, calculating a return difference value L (i) of each robot by inputting the robot Q value neural network input vector and the next action into an evaluation network and a target network, and updating the evaluation network according to the return difference value L (i) of each robot; if the number of times of updating the evaluation network reaches a second fixed value, updating the target network in a mode of copying model parameters;

step ST7: taking the whole state data of the warehouse in the next step as the whole state data of the current warehouse, and repeatedly executing the steps ST2 to ST7 until the iteration times reach a preset third fixed value or the return difference value L (i) of each robot is smaller than a preset threshold value;

the return difference is calculated according to the following formula:

L(i)=max[Qt(i,sn(i),al)]*gamma+Ru(i)-Qv(i,sc(i),a(i))；

Wherein,

L (i) represents the return difference of the ith robot;

sc (i) represents the current robot Q neural network input vector of the ith robot;

sn (i) represents a robot Q-value neural network input vector of the next step of the ith robot;

a (i) represents the next action of the ith robot;

al represents each possible next action of the robot;

Qv (i, sc (i), a (i)) represents a Q value evaluation value obtained after the current robot Q value neural network input vector sc (i) of the ith robot and the next action a (i) are input to the evaluation network;

max [ Qt (i, sn (i), al) ] represents the maximum value in the Q-value target values corresponding to each next-step action after the input vector sn (i) of the Q-value neural network of the robot at the next step of the ith robot and each possible next-step action are input to the target network;

gamma is a predetermined parameter.

The invention relates to a dense warehouse robot scheduling device, which comprises a robot scheduling analysis module;

the robot scheduling analysis module comprises the following modules:

A module M11 for: acquiring integral state data of a warehouse; the whole state data of the warehouse comprises stock state data of each warehouse unit in the warehouse, cargo carrying states of each robot and the warehouse unit where each robot is currently located in the warehouse;

Module M12 for: constructing a robot Q value neural network input vector for each robot according to the inventory state data of each storage unit in the warehouse, the cargo carrying state of each robot and the storage unit where each robot is currently located in the warehouse;

Module M13 for: inputting the input vector of the robot Q value neural network of each robot into the Q value neural network to obtain the action Q value evaluation vector of each robot; the action Q value evaluation vector is a vector formed by a plurality of action Q value evaluation values; the action Q value evaluation value corresponds to the next action of the robot, and is the evaluation value of the Q value of the robot for executing the next action; the next action of the robot at least comprises azimuth movement and goods storage and retrieval;

Module M14 for: performing next action filtering on action Q value evaluation vectors of all robots according to rules of avoiding conflict when the robots walk in a warehouse to obtain filtered action Q value evaluation vectors of all robots;

Module M15 for: selecting a random value for each robot; if the random value is smaller than epsilon, randomly selecting a next action from the filtered action Q value evaluation vector as the next action of the robot, otherwise, selecting the next action corresponding to the element with the largest Q value evaluation value from the filtered action Q value evaluation vector as the next action of the robot.

Further, the dense warehouse robot dispatching device also comprises a robot state monitoring module and a robot dispatching execution module; the robot state monitoring module is used for tracking and monitoring the cargo carrying state of each robot and the storage unit of each robot in the warehouse at present; the robot scheduling execution module is used for distributing corresponding action instructions to the corresponding robots according to the next actions of the robots obtained by the robot online scheduling module, so that the corresponding next actions are executed by the robots.

Further, the dense warehouse robot scheduling device also comprises a neural network training module; the neural network training module comprises the following modules:

A module MT1 for: initializing an evaluation network and a target network for the Q-value neural network, and generating current warehouse overall state data in a random generation mode;

A module MT2 for: inputting vectors of a current robot Q value neural network constructed by the current warehouse overall state data for each robot into an evaluation network to obtain action Q value evaluation vectors of each robot; then, performing next action filtering on action Q value evaluation vectors of all robots according to rules of avoiding conflict when the robots walk in a warehouse to obtain filtered action Q value evaluation vectors of all robots; finally, selecting a random value for each robot; if the random value is smaller than epsilon, randomly selecting a next action from the filtered action Q value evaluation vector as the next action of the robot, otherwise, selecting the next action corresponding to the element with the largest Q value evaluation value from the filtered action Q value evaluation vector as the next action of the robot;

A module MT3 for: determining the whole state data of a warehouse of the next step and the direct rewards of each robot according to the next step of action of each robot; then calculating joint rewards of all robots according to the direct rewards weight of all robots; the joint rewards of the robot are calculated according to the following formula:

ru (i) =sum (w (j) ×r (j)); wherein,

Ru (i) represents a joint reward for the ith robot;

sum represents all robot summations;

r (j) represents a direct reward for the jth robot;

A module MT4 for: building a historical experience data item, and adding the historical experience data item into an experience playback pool; the historical experience data items comprise current warehouse overall state data, a collection of robot action rewards and warehouse overall state data of the next step; the robot action rewarding item comprises the next action of the robot and combined rewarding thereof;

A module MT5 for: if the number of the historical experience data items in the experience playback pool is smaller than a first fixed value, taking the whole state data of the warehouse in the next step as the whole state data of the current warehouse, and repeatedly executing the functions of the modules MT2 to MT 5;

a module MT6 for: randomly extracting m historical experience data items from an experience playback pool, acquiring a current robot Q value neural network input vector of each robot and a next robot Q value neural network input vector of each robot according to the extracted historical experience data items, calculating a return difference value L (i) of each robot by inputting the robot Q value neural network input vector and the next action into an evaluation network and a target network, and updating the evaluation network according to the return difference value L (i) of each robot; if the number of times of updating the evaluation network reaches a second fixed value, updating the target network in a mode of copying model parameters;

A module MT7 for: taking the whole state data of the warehouse in the next step as the whole state data of the current warehouse, and repeatedly executing the functions of the modules MT2 to MT7 until the iteration times reach a preset third fixed value or the return difference value L (i) of each robot is smaller than a preset threshold value;

the return difference is calculated according to the following formula:

L=max[Qt(i,sn(i),al)]*gamma+Ru(i)-Qv(i,sc(i),a(i))；

Wherein,

L (i) represents the return difference of the ith robot;

a (i) represents the next action of the ith robot;

al represents each possible next action of the robot;

gamma is a predetermined parameter.

A machine-readable medium according to the present invention has stored thereon a set of program instructions readable by a machine; the method is characterized in that the method for dispatching the intensive warehouse robots can be realized when a program instruction set stored in the medium is read by a machine and loaded and executed.

An electronic device according to the present invention includes a processor and a memory; the memory stores a program instruction set; the method is characterized in that the device can realize the dense storage robot scheduling method when the program instruction set stored in the memory is executed by the processor.

The invention relates to a dispatching system of a dense warehouse robot, which comprises a dispatching center and a plurality of robots arranged in a warehouse; the robot is connected with the dispatching center; the robot moves among the storage units; the dispatching center comprises the dense storage robot dispatching device.

The invention has the following technical effects:

According to the intelligent scheduling method, the warehouse is divided into storage units, the input vector which can be input into the Q-value neural network is constructed for the warehouse goods shelf stock state, the robot position, the robot cargo carrying state and the tasks executed by the robots based on the storage units, so that the Q value of each robot for executing corresponding actions is judged, and finally, the next action of the robot is determined according to the Q value, and intelligent scheduling of the robots is realized.

The Q-value neural network constructed by the invention has strong expandability, and the increase or decrease of the number of robots in a warehouse does not affect the dimension number of input vectors and the dimension number of output vectors, so that the Q-value neural network does not need to be regulated.

The Q-value neural network is not based on a single robot but based on all robots in a warehouse during training, and belongs to joint evaluation of all robots, so that the Q-value neural network can be based on evaluation of all robots during actual scheduling, collaborative operation among the robots is realized, and the overall efficiency of all robots in the warehouse is highest.

Drawings

FIG. 1 is a flow chart of an embodiment of a dense warehouse robot scheduling method of the present invention.

Fig. 2 is a schematic structural diagram of an embodiment of the dense warehouse robot scheduling system of the present invention.

Fig. 3 is a schematic structural diagram of a dispatch center device of the present invention.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings.

Fig. 2 illustrates a dense warehouse robot scheduling system. The system comprises a number of robots 101 arranged within a warehouse 100 and a dispatch center 200 connected to these robots 101 via a network 300. The dispatch center 200 is typically a general purpose computer device or a cluster of general purpose computer devices in von neumann form, and for the electronic device to which the present invention is directed as described above, referring to fig. 3, includes a processor 201, a memory 202, and a communication unit 203. Wherein the memory 202 and the communication unit 203 are connected to the processor 201. Memory 202 is used to store a set of computer program instructions and data. The processor 201 achieves its corresponding functions by loading and executing a set of computer program instructions stored in the memory 202. The memory 202, which is a machine-readable medium to which the present invention refers, is typically a sustainable storage device, including, but not limited to, storage devices such as magnetic disks, magnetic tape, solid state disks, and the like. The communication unit 203 is for connecting to the network 300. In particular, in the present invention, the processor 201 implements the connection between the dispatch center 200 and the network 300 by executing the instruction set of the computer program, and implements the information interaction with the robot 101, and implements the method for dispatching the intensive warehouse robot according to the present invention.

Referring to fig. 1, the dispatching method of the intensive warehouse robot of the embodiment mainly includes the following steps: the method comprises a warehouse overall state monitoring step, a robot scheduling analysis step, a robot scheduling execution step and a neural network training step. The warehouse overall state monitoring step is used for tracking and monitoring the warehouse overall state and comprises a warehouse inventory state monitoring step and a robot state monitoring step. The warehouse inventory status monitoring step is used for tracking and monitoring inventory status data of each warehouse unit in the warehouse. The robot state monitoring step is used for tracking and monitoring the cargo carrying state of each robot 101 and the storage unit where each robot 101 is currently located in the warehouse. And the step of monitoring the integral state of the warehouse forms integral state data of the warehouse through tracking and monitoring the integral state of the warehouse. The warehouse overall status data includes individual warehouse unit inventory status data, individual robot 101 cargo status, and the warehouse unit in which the individual robot 101 is currently located within the warehouse. The warehouse overall state monitoring step is input to the robot scheduling analysis step for scheduling analysis to determine the next action to be performed by each robot 101. When the robot scheduling analysis step determines the next action to be executed by each robot 101, the next action to be executed by each robot 101 is respectively issued to each robot 101 by means of the regulation and control instruction through the robot scheduling execution step, and each robot 101 executes the corresponding action after receiving the regulation and control instruction. The warehouse overall state monitoring step, the robot scheduling analysis step and the robot scheduling execution step are driven by a state that the timing driving or the robot action execution is completed. After the robot 101 completes its corresponding action, it waits for a regulatory instruction issued by the dispatch center 200. When the robots 101 are executing actions, the dispatching center 200 makes analysis and decision for the next action of the robots 101 through the robot dispatching analysis step according to the warehouse overall state data formed after the current actions of the robots 101 are completed. When the robot 101 action execution is completed, the warehouse overall state monitoring step collects and updates warehouse overall state data.

The core of the dispatching method of the intensive warehouse robots is a robot dispatching analysis step and a neural network training step.

The warehouse units are logical space units divided by warehouse. Referring to fig. 2, in the example of fig. 2, the warehouse is divided into M rows by N columns = K warehouse units. The gray part storage unit is used for indicating a main road through which the robot 101 passes, the white part storage unit is used for indicating a part with a goods shelf, the blank white storage unit is used for indicating that goods are not stored in the goods shelf, and the white storage unit with oblique lines is used for indicating that goods are stored in the goods shelf. The following rules are followed by the robot 101 while walking within the warehouse:

1. two robots 101 cannot enter one storage unit at the same time to avoid collision problems;

2. the robot 101 carrying the goods cannot enter the storage unit storing the goods on the shelf;

3. the robot 101 not carrying the goods can enter the storage unit with the goods stored on the goods shelf, but cannot travel from the storage unit with the goods stored on the goods shelf to the storage unit with the goods stored on the other goods shelf;

4. The robot 101 can walk arbitrarily between the stocker units on which the goods are not stored on the shelves or the stocker units on the main road.

In another alternative embodiment, the robot Q-value neural network input vector may further include robot task state information. The robot task state information includes three vector elements respectively representing three tasks: in the first vector element, if the task executed by the current robot is a cargo warehouse-in task, the first vector element is set to be 1, otherwise, the first vector element is set to be 0; in the second vector element, if the task executed by the current robot is a cargo unloading task, the second vector element is set to be 1, otherwise, the second vector element is set to be 0; and in the third vector element, if the task executed by the current robot is bin arrangement, the third vector element is set to be 1, otherwise, the third vector element is set to be 0. At this time, the robot Q-value neural network input vector contains 4×k+4=4×m×n+4 vector elements. In addition, those skilled in the art understand that if the robot further includes other tasks, the robot task state information may further include vector elements corresponding to the other tasks. Those skilled in the art understand that when the robot Q-value neural network input vector includes robot task state information, the warehouse overall state data also includes tasks currently performed by the respective robots correspondingly.

And step S13, namely, inputting the input vector of the robot Q-value neural network of each robot into the Q-value neural network to obtain the action Q-value evaluation vector of each robot. The Q-value neural Network may be DQN, i.e. Deep Q-Network, double DQN, i.e. Double Q-Network, or D3QN, i.e. Dueling Double Deep Q-Network, which is a neural Network trained in advance by a neural Network training step. Whether DQN, double DQN or D3QN is a Q-value neural network familiar to those skilled in the art, and will not be described in detail in this specification. In this embodiment, the Q-value neural network is preferably D3QN. The motion Q value evaluation vector here is a vector composed of a plurality of motion Q value evaluation values. The action Q value evaluation value corresponds to the next action of the robot, and is the evaluation value of the Q value of the robot executing the next action. The robot further actions include at least azimuthal movement and storing and picking up the cargo. The azimuth movement here includes movement in four directions, i.e., southwest and northwest in the present embodiment. In other alternative embodiments, the azimuthal movement may also include, for example, upward movement and downward movement. The goods are accessed in two actions, namely stock and goods taking, and vector elements of the vector can be evaluated through two action Q values, and vector elements of the vector can be evaluated through one action Q value. When the vector representation is evaluated through an action Q value, if the current robot carries goods, the vector element corresponds to the inventory action; if the current robot does not bear the goods, the vector element corresponds to the goods taking action.

In another alternative embodiment, the robot further action may also include actions such as stopping movement, sweeping the cargo, etc. Stopping the movement action means that the robot waits in place. The goods scanning action scans the identification code on the goods, and the identification code can be a two-dimensional code, an NFC identification code or RFID. The robot performs verification with the cargo information stored in the dispatch center 200 after scanning the identification code on the warehouse cargo.

And a conflict action eliminating step, namely the step S14, wherein the action Q value evaluation vector of each robot is subjected to next action filtering according to a rule that the robot walks in a warehouse to avoid conflict, so as to obtain the filtered action Q value evaluation vector of each robot. The rule that the robot walks in the warehouse to avoid collision is the rule that the robot needs to follow when walking in the warehouse:

two robots 101 cannot enter one storage unit at the same time;

The robot 101 carrying the goods cannot enter the storage unit storing the goods on the shelf;

The robot 101 not carrying the goods can enter the storage unit with the goods stored on the goods shelf, but cannot travel from the storage unit with the goods stored on the goods shelf to the storage unit with the goods stored on the other goods shelf;

The robot 101 can walk arbitrarily between the stocker units on which the goods are not stored on the shelves or the stocker units on the main road.

The rules of collisions that the robot walks to avoid are based on the warehouse and the shelf structure within the warehouse. For example, in another embodiment, the robot may only walk on the main road of the gray square portion in the example of fig. 2, and when the robot on the main road accesses goods, the robot on the main road accesses goods corresponds to the shelves adjacent to the main road square. In another possible implementation, the robot walks on the track, and whether the robot carries the goods or not and whether the goods are stored on the shelf or not does not collide, and at this time, the rule of avoiding the collision in the warehouse is that two robots 101 cannot enter one storage unit at the same time.

After the conflict action eliminating step, according to the rule that the robot walks in the warehouse to avoid the conflict, the Q value evaluation value corresponding to the next action which the robot cannot execute in the action Q value evaluation vector is set to 0, or the action Q value evaluation vector can be used as a set to delete the element corresponding to the next action which the robot cannot execute in the set.

The action optimization step, namely the step S15, is to select a random value for each robot; if the random value is smaller than epsilon, randomly selecting a next action from the filtered action Q value evaluation vector as the next action of the robot, otherwise, selecting the next action corresponding to the element with the largest Q value evaluation value from the filtered action Q value evaluation vector as the next action of the robot. It should be noted that, in the foregoing collision action eliminating step, if an implementation in which the Q value evaluation value corresponding to the Q value evaluation vector is set to 0 is adopted for the next action that cannot be executed by the robot, when the next action is randomly selected, the situation that the Q value evaluation value corresponding to the Q value is set to 0 should be avoided, that is, when a next action corresponding to the Q value evaluation value greater than 0 is randomly selected as the next action of the robot. After the action priority step is finished, the next action of each robot output by the action priority step is the robot scheduling analysis step.

The neural network training step is used for training the Q value neural network. In this embodiment, the neural network training step adopts an offline mode and adopts an experience playback mechanism, and includes the following steps:

step ST1: initializing overall state data of an evaluation network, a target network and a current warehouse;

step ST2: carrying out evaluation network simulation scheduling analysis on the whole state data of the current warehouse;

step ST3: determining the whole state data of the warehouse and the joint rewards of the robots in the next step;

Step ST4: building a historical experience data item, and adding the historical experience data item into an experience playback pool;

Step ST5: if the number of the historical experience data items in the experience playback pool is smaller than the first fixed value, returning to the step ST2;

Step ST6: updating the evaluation network by calculating the return difference value, and further updating the target network;

step ST7: steps ST2 to ST7 are repeated until the return difference of each robot is sufficiently small or the number of iterations reaches a certain number.

In the above step, the evaluation network and the target network are two Q-value neural networks, and the structure thereof is the same as that of the Q-value neural network in the robot scheduling analysis step. After the training is received, the evaluation network is the Q value neural network in the robot scheduling analysis step. This initializing evaluation network and target network can be described as initializing evaluation network and target network for the Q-value neural network. At the initialization of step ST1, model parameters of the evaluation network and the target network are randomly generated, and the model parameters of the evaluation network and the target network are the same. The current warehouse overall state data is the same as the warehouse overall state data structure in the robot scheduling analysis step. The current warehouse overall state data is typically generated in a random manner. It should be noted that the number of robots in generating the current warehouse overall state data in a random generation manner is the same as the number of robots in the warehouse overall state data in the robot scheduling analysis step.

Step ST2 is specifically that the current warehouse overall state data is input into an evaluation network for the Q value neural network of the current robot constructed by each robot to obtain an action Q value evaluation vector of each robot; then, performing next action filtering on action Q value evaluation vectors of all robots according to rules of avoiding conflict when the robots walk in a warehouse to obtain filtered action Q value evaluation vectors of all robots; finally, selecting a random value for each robot; if the random value is smaller than epsilon, randomly selecting a next action from the filtered action Q value evaluation vector as the next action of the robot, otherwise, selecting the next action corresponding to the element with the largest Q value evaluation value from the filtered action Q value evaluation vector as the next action of the robot.

The specific process of step ST2 is substantially identical to steps S12, S13, S14 and S15 in the robot scheduling analysis step, and will not be described again. The above steps ST2 to ST7 are an iterative loop process. The current warehouse overall state data in step ST2 comes from step ST1 or the previous round of the iterative loop process. The current warehouse overall state data typically already builds the corresponding robot Q neural network input vector in the previous cycle. Thus, in another optimized embodiment, step ST2 does not need to construct the current robot Q-value neural network input vector from step S12, unlike the robot schedule analysis step, but directly from using the robot Q-value neural network input vector already constructed in the previous cycle.

Step ST3, specifically, determining the whole state data of the warehouse of the next step and the direct rewards of each robot according to the next step action of each robot; the joint rewards for each robot are then calculated based on the direct rewards weighting for each robot. In the process, the whole state data of the warehouse in the next step is obtained through executing the simulation action according to the next action of each robot and the whole state data of the warehouse in the current step. The direct rewards of the robot are determined according to the task completed by the goods access: if the robot completes the corresponding task, for example, the robot finishes the inventory action in the goods warehouse-in task, the robot gives a certain reward to the robot, and finishes the goods taking action in the goods warehouse-out task, the robot gives a certain reward to the robot. In this embodiment, the robot completes the inventory pick action with a direct reward at 10 points, otherwise with a direct reward at-0.1 points. -0.1 point direct rewards for motivating the robot to complete the pick-and-place action as soon as possible. The joint rewards of the robot are calculated according to the following formula: s

Ru (i) =sum (w (j) ×r (j)); wherein,

I, j represents a robot serial number;

Ru (i) represents a joint reward for the ith robot;

sum represents all robot summations;

r (j) represents a direct reward for the jth robot;

w (j) represents the weight of the direct reward of the jth robot, where when i=j, w (j) =1, otherwise w (j) =c, c is a predetermined constant less than 1.

In another alternative embodiment, step ST3 may further construct a Q-value neural network input vector of the next robot for each robot according to the warehouse overall state data of the next step, thereby facilitating the subsequent processing. The process of constructing the Q-value neural network input vector of the robot for each robot according to the warehouse overall state data of the next step is the same as that of the previous step S12, and will not be repeated.

In step ST4, more specifically, a history experience data item is constructed, and the history experience data item is added to the experience playback pool. Wherein the historical experience data items include current warehouse overall state data, a set of robotic action rewards items, and next warehouse overall state data. The robot action rewards term includes the next action of the robot and its joint rewards.

In another alternative embodiment, the robot action rewards term may also include the current robot Q neural network input vector and may even further include the next robot Q neural network input vector. And the Q value neural network input vector of the robot is obtained through construction of the whole state data of the warehouse in the next step. The robot action rewarding item comprises the current robot Q value neural network input vector and the next robot Q value neural network input vector, so that subsequent playback processing can be facilitated. And in the subsequent playback processing, the current robot Q value neural network input vector and the next robot Q value neural network input vector of each robot are not required to be respectively constructed according to the current warehouse overall state data and the next warehouse overall state data.

The historical experience data items in the experience playback pool are keyed to the current warehouse overall state data. And if the current warehouse overall state data in the current constructed historical experience data item exists in the experience playback pool, the constructed historical experience data item does not need to be added into the experience playback pool.

With sufficient memory, the data in the experience playback pool always increases as the training process progresses. In another alternative embodiment, there may be a problem of insufficient storage space, where historical experience data items in the experience playback pool may be stored by replacing old data with new data.

More specifically, if the number of history experience data items in the experience playback pool is smaller than the first fixed value, step ST5 is repeated with the next warehouse overall state data as the current warehouse overall state data, and steps ST2 to ST5 are repeated. In another embodiment, if the Q-value neural network input vector of each robot in the next step has been constructed from the warehouse global status data in the next step before step ST5, the Q-value neural network input vector of each robot in the next step is generally required to be used as the Q-value neural network input vector of each robot in the current robot before step ST5 returns to step ST2.

Step ST6: randomly extracting m historical experience data items from an experience playback pool, acquiring a current robot Q value neural network input vector of each robot and a next robot Q value neural network input vector of each robot according to the extracted historical experience data items, calculating a return difference value L (i) of each robot by inputting the robot Q value neural network input vector and the next action to an evaluation network and a target network, and updating the evaluation network according to the return difference value L (i) of each robot; if the number of times of updating the evaluation network reaches a second fixed value, updating the target network by means of model parameter copying. The update evaluation network here is a model parameter of the update evaluation network. The target network is updated by means of model parameter copying, in particular, model parameters of the evaluation network are copied to the target network such that the model parameters of the target network and the evaluation network are identical. In this step, updating the evaluation network according to the extracted historical experience data item can be specifically expressed as the following steps:

step ST61: acquiring a current robot Q value neural network input vector of each robot and a robot Q value neural network input vector of the next step according to the historical experience data item;

Step ST62: inputting the current robot Q value neural network input vector of each robot obtained according to the historical experience data item and the corresponding next action in the set of robot action rewards items into an evaluation network to obtain Q value evaluation values Qv (i, sc (i), a (i)) of each robot;

step ST63: inputting the input vector of the Q value neural network of the robot and each action of the next step of each robot into a target network to obtain a Q value target value Qt (i, sn (i), al (k)) of each action of each robot; then, the maximum Q value target value max [ Qt (i, sn (i), al) ] of the Q value target value maximum operation of each robot is obtained at the maximum Q value target value Qt (i, sn (i), al) corresponding to each operation;

Step ST64: calculating a return difference value L (i) =max [ Qt (i, sn (i), al) ]. Gamma+Ru (i) -Qv (i, sc (i), a (i)) of each robot according to the Q value evaluation value and the Q value target value of each robot and the joint rewards, and updating an evaluation network according to the return difference value L (i) of each robot;

In the above-mentioned formula(s),

L (i) represents a third return difference value of the ith robot;

a (i) represents the next action of the ith robot;

al (k) represents the kth next step action in each possible next step action of the robot;

al represents each possible next action of the robot;

Qv (i, sc (i), a (i)) represents a Q value evaluation value obtained after the current robot Q value neural network input vector sc (i) of the i-th robot and the next action a (i) are input to the evaluation network, or the Q value evaluation value of the i-th robot;

max [ Qt (i, sn (i), al) ] represents the maximum value of Q-value target values corresponding to each next-step action after the input vector sn (i) of the Q-value neural network of the robot of the next-step action of the i-th robot and each possible next-step action are input to the target network, or the Q-value target value of the Q-value target value maximum action of the next-step action Q-value of the i-th robot;

gamma is a predetermined parameter.

In step ST6, the evaluation network is updated according to the return difference, and the following method is specifically adopted in this embodiment: and calculating a loss coefficient by taking the return difference value as a loss coefficient factor, and updating model parameters of the Q-value neural network according to the loss coefficient by a gradient descent method. The Q-value neural network is herein referred to as an evaluation network. The loss coefficient is calculated by the following formula: loss (i) =l (i) ×l (i)/2 or Loss (i) =abs (L (i)). Where Loss (i) represents a Loss coefficient of the i-th robot, and abs represents an absolute value. In another possible embodiment, the Loss coefficient may also be calculated according to other formulas, such as Loss (i) =sqrt (abs (L (i))), where sqrt represents the square root.

Furthermore, it should be noted that each update of the evaluation network is based on the difference in return of a certain robot, and accordingly, how many robots have the difference in return, and how many times the update is needed. If there are n robots, one history experience data item corresponds to the update of n evaluation networks, and in combination with the m history experience items extracted in step ST6, the evaluation networks are updated m×n times in total in step ST6, corresponding to m×n return difference values.

In addition, in the foregoing step ST61, if the history experience data item does not include the current robot Q-value neural network input vector and the next robot Q-value neural network input vector of each robot, the current robot Q-value neural network input vector and the next robot Q-value neural network input vector of each robot need to be respectively constructed according to the current warehouse overall state data and the next warehouse overall state data in the history experience data item.

More specifically, step ST7 is repeated with the next warehouse overall state data as the current warehouse overall state data until the number of iterations reaches a preset third fixed value or the return difference of each robot is smaller than a preset threshold. The return difference in step ST7 is generally derived from step ST6, corresponding to the aforementioned m×n return differences. The difference in return for each robot being less than the preset threshold may be expressed in several ways such as the following, similarly:

the first way is: the squares of the m×n return difference values each satisfy a condition of less than a first threshold, i.e., L (i) < H1;

the second way is: the sum of squares of the m x n return difference values satisfies the condition of being smaller than the second threshold, i.e；

Third mode: the absolute values of the m×n return difference values all satisfy the condition of being smaller than the third threshold, that is abs (L (i)) < H3;

fourth mode: the sum of absolute values of m x n return difference values satisfies the condition of being smaller than the fourth threshold, i.e；

A fifth mode: the absolute average of the m n return difference values satisfies the condition of being smaller than the fifth threshold, i.e；

In the above formula, abs represents an absolute value; h1, H2, H3, H4, H5 are five predetermined thresholds, respectively. Although the above modes are different in form, they are substantially the same, and it will be understood by those skilled in the art that, in addition to the above modes, the differences in return of each robot being smaller than the preset threshold may be represented in other ways.

In the above embodiment, step ST7 directly uses the return difference calculated in step ST 6. In another possible implementation manner, the return difference value may also be calculated according to the current robot Q-value neural network input vector and the next robot Q-value neural network input vector corresponding to the current warehouse overall state data and the next warehouse overall state data obtained in the loop of steps ST2 to ST 7.

In another alternative embodiment, when the step ST2 to the step ST7 iterates, in step ST7, if the current warehouse overall state data of the history experience data item in the experience playback pool is the same as the warehouse overall state data of the next step, the current warehouse overall state data may be randomly generated again when the current warehouse overall state data is set in step ST 7.

In addition, it should be noted that, in the foregoing embodiment of the present invention, the dispatching device for the intensive warehouse robot is a virtual device implemented by the dispatching center 200 by executing the computer program instruction, and the modules included in the virtual device are in one-to-one correspondence with the steps of the dispatching method for the intensive warehouse robot, which is not described in detail.

Claims

1. The dispatching method of the intensive warehouse robot is characterized by comprising a robot dispatching analysis step and a neural network training step;

The robot scheduling analysis step comprises the following steps:

Step S13: inputting the input vector of the robot Q value neural network of each robot into the Q value neural network to obtain the action Q value evaluation vector of each robot; the action Q value evaluation vector is a vector formed by a plurality of action Q value evaluation values; the action Q value evaluation value corresponds to the next action of the robot, and is the evaluation value of the Q value of the robot for executing the next action; the next action of the robot at least comprises azimuth movement and goods storage and retrieval;

step S15: selecting a random value for each robot; if the random value is smaller than epsilon, randomly selecting a next action from the filtered action Q value evaluation vector as the next action of the robot, otherwise, selecting the next action corresponding to the element with the largest Q value evaluation value from the filtered action Q value evaluation vector as the next action of the robot;

the neural network training step comprises the following steps:

ru (i) =sum (w (j) ×r (j)); wherein,

Ru (i) represents a joint reward for the ith robot;

sum represents all robot summations;

r (j) represents a direct reward for the jth robot;

Step ST6: randomly extracting m historical experience data items from an experience playback pool, acquiring a current robot Q value neural network input vector of each robot and a next robot Q value neural network input vector of each robot according to the extracted historical experience data items, calculating a return difference value L (i) of each robot by inputting the robot Q value neural network input vector and the next action into an evaluation network and a target network, and updating the evaluation network according to the return difference value of each robot; if the number of times of updating the evaluation network reaches a second fixed value, updating the target network in a mode of copying model parameters;

the return difference is calculated according to the following formula:

L(i)=max[Qt(i,sn(i),al)]*gamma+Ru(i)-Qv(i,sc(i),a(i))；

Wherein,

L (i) represents the return difference of the ith robot;

a (i) represents the next action of the ith robot;

al represents each possible next action of the robot;

gamma is a predetermined parameter.

2. The dense warehouse robot scheduling method of claim 1, further comprising a robot state monitoring step and a robot scheduling execution step; the robot state monitoring step is used for tracking and monitoring the cargo carrying state of each robot and the storage unit of each robot in the warehouse at present; the robot scheduling execution step is used for distributing corresponding action instructions to each corresponding robot for the next action of each robot obtained in the robot scheduling analysis step, so that each robot executes the corresponding next action.

3. The intensive warehouse robot scheduling device is characterized by comprising a robot scheduling analysis module and a neural network training module;

the robot scheduling analysis module comprises the following modules:

module M15 for: selecting a random value for each robot; if the random value is smaller than epsilon, randomly selecting a next action from the filtered action Q value evaluation vector as the next action of the robot, otherwise, selecting the next action corresponding to the element with the largest Q value evaluation value from the filtered action Q value evaluation vector as the next action of the robot;

the neural network training module comprises the following modules:

ru (i) =sum (w (j) ×r (j)); wherein,

Ru (i) represents a joint reward for the ith robot;

sum represents all robot summations;

r (j) represents a direct reward for the jth robot;

the return difference is calculated according to the following formula:

L(i)=max[Qt(i,sn(i),al)]*gamma+Ru(i)-Qv(i,sc(i),a(i))；

Wherein,

L (i) represents the return difference of the ith robot;

a (i) represents the next action of the ith robot;

al represents each possible next action of the robot;

gamma is a predetermined parameter.

4. The dense warehouse robot scheduling apparatus of claim 3, further comprising a robot status monitoring module and a robot scheduling execution module; the robot state monitoring module is used for tracking and monitoring the cargo carrying state of each robot and the storage unit of each robot in the warehouse at present; the robot scheduling execution module is used for distributing corresponding action instructions to the corresponding robots according to the next actions of the robots obtained by the robot scheduling analysis module, so that the corresponding next actions are executed by the robots.

5. A machine-readable medium having stored thereon a set of program instructions readable by a machine; the method for dispatching the intensive warehouse robot according to claim 1 or 2 is realized when a program instruction set stored in the medium is read by a machine and loaded and executed.

6. An electronic device includes a processor and a memory; the memory stores a program instruction set; characterized in that the apparatus is capable of implementing a dense warehouse robot scheduling method according to claim 1 or 2 when the set of program instructions stored in said memory is executed by a processor.

7. The dispatching system of the intensive warehouse robots is characterized by comprising a dispatching center and a plurality of robots arranged in a warehouse; the robot is connected with the dispatching center; the robot moves among the storage units; the dispatching center comprises the dense storage robot dispatching device according to claim 3 or 4.