CN116690616A

Movatterモバイル変換

Info

Publication number: CN116690616A
Application number: CN202310746824.2A
Authority: CN
Inventors: 叶峰; 黄棋敬; 赖乙宗
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2023-06-25
Filing date: 2023-06-25
Publication date: 2023-09-05
Anticipated expiration: 2043-06-25
Also published as: CN116690616B

Abstract

The invention relates to a robot instruction operation method, a system and a medium based on natural language, wherein the method comprises the following steps: s1, defining action primitive labels required by a robot; s2, visual image information in the robot operation environment and language instructions given by operators in a natural language form are obtained; s3, inputting the visual image information and the language instruction of an operator into a language-visual target mask model, and outputting an action primitive target mask diagram; s4, inputting the visual image information and the action primitive object mask map into a strategy network based on reinforcement learning, and generating an operation instruction tuple of the corresponding action primitive of the corresponding operation target; s5, analyzing the obtained action primitive operation instruction tuple, converting the action primitive operation instruction tuple into a robot joint target position, and controlling the robot to execute the action primitive. The robot can complete task targets indicated by different language instructions, master various operation skills and adapt to the working environment of the multi-task targets.

Description

Translated fromChinese

一种基于自然语言的机器人指令操作方法、系统及介质A method, system and medium for robot command operation based on natural language

技术领域technical field

本发明涉及智能机器人控制技术领域，特别是涉及一种基于自然语言的机器人指令操作方法、系统及介质。The invention relates to the technical field of intelligent robot control, in particular to a natural language-based robot command operation method, system and medium.

背景技术Background technique

机器人技术已经广泛应用于工业制造、物流运输等多个领域，替代人类完成许多重复性且高强度的工作，进一步提高人类的生产效率。但传统的机器人应用集中于单一任务目标的结构化场景，且机器人所涉及的操作技能固定，一旦任务目标发生变化或是操作场景变得复杂，需要改变整个机器人操作的控制方式，缺乏灵活性和自主性。Robot technology has been widely used in many fields such as industrial manufacturing, logistics and transportation, replacing humans to complete many repetitive and high-intensity tasks, and further improving human production efficiency. However, traditional robot applications focus on the structured scene of a single task goal, and the operation skills involved in the robot are fixed. Once the task goal changes or the operation scene becomes complicated, the control method of the entire robot operation needs to be changed, which lacks flexibility and autonomy.

而当前，基于人工智能技术的机器人操作实现方式主要包括模仿学习和深度强化学习两种。这两种方式尽管能够让机器人适应一定的复杂操作场景，但大多数研究以端到端的形式直接操控机器人，集中于解决机器人完成单一任务目标的操作难题，策略网络泛化性较差，面对新任务目标时，需要重新进行技能学习，无法适应多任务目标的作业环境。At present, the realization of robot operation based on artificial intelligence technology mainly includes imitation learning and deep reinforcement learning. Although these two methods can make the robot adapt to certain complex operation scenarios, most of the research directly controls the robot in the form of end-to-end, focusing on solving the operation problem of the robot completing a single task, and the generalization of the policy network is poor. When a new task target needs to be re-learned, it cannot adapt to the working environment of multi-task targets.

而在现实生活中，自然语言是人类进行信息交流的重要方式，人类能够通过自然语言的方式灵活表达多种任务目标，智能机器人应具备对自然语言形式表征的语言指令的语义理解能力，从中推断出相应的操作目标，并进行相应的动作规划，以完成不同语言指令所指示的任务目标。In real life, natural language is an important way for human beings to communicate information. Human beings can flexibly express a variety of task goals through natural language. Intelligent robots should have the ability to understand the semantics of language instructions represented by natural language forms, and infer from them It can formulate the corresponding operating objectives and carry out corresponding action planning to complete the task objectives indicated by different language instructions.

发明内容Contents of the invention

针对现有技术中存在的技术问题，本发明的目的之一是：提供一种基于自然语言的机器人指令操作方法，以实现机器人完成不同语言指令所指示的任务目标，掌握多样的操作技能，适应多任务目标的作业环境。In view of the technical problems existing in the prior art, one of the purposes of the present invention is to provide a method for operating robot instructions based on natural language, so as to realize that the robot can complete the tasks indicated by instructions in different languages, master various operating skills, adapt to Working environment for multitasking targets.

本发明的目的之二是：提供一种基于自然语言的机器人指令操作系统。The second object of the present invention is to provide a robot instruction operating system based on natural language.

本发明的目的之三是：提供一种存储介质。The third object of the present invention is to provide a storage medium.

为了达到上述目的，本发明采用如下技术方案：In order to achieve the above object, the present invention adopts following technical scheme:

第一方面，本发明提供了一种基于自然语言的机器人指令操作方法，该方法包括以下步骤：In a first aspect, the present invention provides a method for operating a robot instruction based on natural language, the method comprising the following steps:

S1、机器人动作基元库定义：根据作业环境，定义机器人所需的动作基元标签，可表示为：S1. Definition of robot action primitive library: According to the working environment, define the action primitive tags required by the robot, which can be expressed as:

Action Primitives＝{a₀(0,x₀),a₁(1,x₁),···,a_i(i,x_i),···,a_n(n,x_n)}；Action Primitives＝{a₀ (0,x₀ ),a₁ (1,x₁ ),···,a_i (i,x_i ),···,a_n (n,x_n )};

其中，第i个运动基元的名称为a_i，对应的标签值为i，对应的操作指令元组为x_i；Among them, the name of the i-th motion primitive is a_i , the corresponding label value is i, and the corresponding operation instruction tuple is x_i ;

S2、环境信息感知：获取机器人操作环境中的视觉图像信息，以及操作人员以自然语言形式给出的语言指令；S2. Environmental information perception: Obtain visual image information in the robot's operating environment, as well as language instructions given by the operator in the form of natural language;

S3、机器人任务理解：将视觉图像信息以及操作人员的语言指令输入到语言-视觉目标掩码模型中，输出动作基元目标掩码图；S3. Robot task understanding: input the visual image information and the operator's language instructions into the language-visual target mask model, and output the action primitive target mask map;

其中，所述的动作基元目标掩码图与输入的视觉图像信息的分辨率相同，并在语言指令所指示的操作目标上对应的像素值为机器人对该操作目标应执行的动作基元标签值，该动作基元标签值由步骤S1所定义；Wherein, the resolution of the action primitive target mask map is the same as that of the input visual image information, and the corresponding pixel value on the operation target indicated by the language instruction is the action primitive label that the robot should perform on the operation target value, the action primitive tag value is defined by step S1;

S4、机器人操作指令生成：将步骤S2所获取的视觉图像信息以及步骤S3所输出的动作基元目标掩码图输入到基于强化学习的策略网络中，生成相应操作目标的相应动作基元的操作指令元组，该操作指令元组的具体参数由步骤S1所定义；S4. Robot operation instruction generation: input the visual image information obtained in step S2 and the action primitive target mask map output in step S3 into the policy network based on reinforcement learning, and generate the operation of the corresponding action primitive of the corresponding operation target An instruction tuple, the specific parameters of the operation instruction tuple are defined by step S1;

S5、机器人底层运动控制：将步骤S4所获取的动作基元操作指令元组进行进一步的解析，通过坐标系转换、逆运动学求解等方法，转换为机器人关节目标位置，控制机器人进行运动，执行动作基元。S5. Bottom-layer motion control of the robot: further analyze the action primitive operation instruction tuple obtained in step S4, and convert it into the target position of the robot joint through coordinate system transformation and inverse kinematics solution to control the robot to move and execute Action primitives.

进一步地，步骤S1所述的操作指令元组包括机器人末端具体的操作位置、操作姿态等参数。Further, the operation instruction tuple described in step S1 includes parameters such as the specific operation position and operation posture of the robot end.

进一步地，步骤S2所述的视觉图像信息包括RGB图像和深度图像；步骤S2所述的语言指令指定机器人动作属性和对象属性。Further, the visual image information described in step S2 includes RGB images and depth images; the language instruction described in step S2 specifies robot action attributes and object attributes.

进一步地，步骤S2所述的语言指令遵循一定格式的指令模板。Further, the language instruction in step S2 follows an instruction template in a certain format.

进一步地，步骤S3所述的语言-视觉目标掩码模型包括视觉特征编码模块、语言特征编码模块、多模态特征融合模块、特征解码模块，基于视觉特征编码模块获得图像信息的特征表示，基于语言特征编码模块获得语言指令的特征表示，多模态融合模块将图像信息的特征表示与语言指令的特征表示进行多模态融合，获得融合特征表示，特征解码模块将融合特征表示进行上采样，最终获得动作基元目标掩码图。Further, the language-visual target mask model described in step S3 includes a visual feature encoding module, a language feature encoding module, a multi-modal feature fusion module, and a feature decoding module. The feature representation of image information is obtained based on the visual feature encoding module, based on The language feature encoding module obtains the feature representation of the language instruction. The multimodal fusion module performs multimodal fusion of the feature representation of the image information and the feature representation of the language instruction to obtain the fusion feature representation. The feature decoding module performs upsampling of the fusion feature representation. Finally, the action primitive target mask map is obtained.

进一步地，所述的多模态特征融合模块的工作过程，具体包括以下子步骤：Further, the working process of the multimodal feature fusion module specifically includes the following sub-steps:

S31：给定视觉特征V_i与语言特征L。S31: given visual features V_i and language features L.

S32：将语言特征L通过两个全连接层，并对所产生的特征向量进一步进行复制拓展，分别产生乘积因子α_i和偏置参数β_i，乘积因子α_i和偏置参数β_i的维度与视觉特征V_i相同；S32: Pass the language feature L through two fully connected layers, and further copy and expand the generated feature vectors to generate the multiplication factor α_i and the bias parameter β_i , and the dimensions of the multiplication factor α_i and the bias parameter β_i Same as visual feature V_i ;

S33：将视觉特征与乘积因子α_i进行逐元素相乘，与偏置参数β_i进行逐元素相加，产生融合特征G_i，可表示如下式：S33: Multiply the visual feature and the product factor α_i element by element, and add the bias parameter β_i element by element to generate the fusion feature G_i , which can be expressed as follows:

进一步地，步骤S4所述的基于强化学习的策略网络，采用深度Q网络的方式构建与训练策略网络，操作指令的生成过程包括以下步骤：Further, the policy network based on reinforcement learning described in step S4 adopts a deep Q network to construct and train the policy network, and the generation process of the operation instruction includes the following steps:

S41：给定视觉图像信息和动作基元目标掩码图，输入到深度Q网络中；S41: Input the visual image information and the action primitive target mask map into the deep Q network;

S42：深度Q网络端到端生成动作基元Q值图，动作基元Q值图与输入的视觉图像信息、动作基元目标掩码图的分辨率相同，通道数与相应动作基元操作姿态的数目相同，动作基元Q值图上某一点像素值Q_i(u_i,v_i,r_i)代表：机器人在像素点(u_i,v_i)上以通道r_i对应的操作姿态执行动作基元的动作Q值；S42: The deep Q network generates an action primitive Q value map end-to-end, the resolution of the action primitive Q value map is the same as that of the input visual image information and the action primitive target mask map, and the number of channels is the same as the operation gesture of the corresponding action primitive The number of is the same, and the pixel value Q_i (u_i , vi_, ri₎ of a certain point on the action primitive Q value map represents: the robot executes with the operation posture corresponding to channel_ri on the pixel point (u_i , v_i The action Q value of the action primitive;

S43：取动作基元Q值图上像素值最大的一点Q_max(u_max,v_max,r_max)作为该动作基元的最佳执行点，即该动作基元的最佳操作位置为像素点(u_max,v_max)，最佳操作姿态为通道r_max对应的操作姿态。S43: Take the point Q_max (u_max , v_max , r_max ) with the largest pixel value on the action primitive Q value map as the best execution point of the action primitive, that is, the best operation position of the action primitive is the pixel point (u_max , v_max ), the best operation posture is the operation posture corresponding to the channel r_max .

第二方面，本发明还提供了一种基于自然语言的机器人指令操作系统，包括视觉信息采集模块、计算机控制模块、机器人控制模块；In the second aspect, the present invention also provides a robot instruction operating system based on natural language, including a visual information acquisition module, a computer control module, and a robot control module;

所述的视觉信息采集模块用于实时获取视觉图像信息；The visual information collection module is used to obtain visual image information in real time;

所述的计算机控制模块包括人机交互子模块与计算推理子模块；The computer control module includes a human-computer interaction submodule and a calculation reasoning submodule;

所述的人机交互子模块用于实时观察所获取的视觉图像信息，并获取操作人员所输入的语言指令；The human-computer interaction sub-module is used to observe the obtained visual image information in real time, and obtain the language instruction input by the operator;

所述的计算推理子模块用于存储与运行所述的语言-视觉目标掩码模型以及基于强化学习的策略网络；The calculation reasoning sub-module is used to store and run the language-visual target mask model and the policy network based on reinforcement learning;

所述的机器人控制模块用于机器人底层运动控制，包括对所述的操作指令中的操作位置、操作姿态等进行坐标系转换，并进行逆运动学求解，驱动外部机器人的关节电机到达目标位置。The robot control module is used for bottom-level motion control of the robot, including performing coordinate system conversion on the operating position and posture in the operating instructions, and performing inverse kinematics solution to drive the joint motors of the external robot to reach the target position.

进一步地，所述视觉图像信息感知模块通过外部的RGB-D相机获取视觉图像信息，并返回给计算机控制模块。Further, the visual image information perception module acquires visual image information through an external RGB-D camera, and returns it to the computer control module.

第三方面，本发明还提供一种存储介质，其上存有计算机程序，所述计算机程序被处理器执行时，实现如上述的一种基于自然语言的机器人指令操作方法。In a third aspect, the present invention also provides a storage medium on which a computer program is stored. When the computer program is executed by a processor, the above-mentioned method for operating a robot based on natural language instructions is realized.

本发明具有以下有益效果：The present invention has the following beneficial effects:

传统的机器人应用集中于任务目标单一、操作技能单一的结构化场景，无法适应多任务目标的作业环境；而现有的基于人工智能的机器人操作方式，尽管能让机器人自主掌握操作技能，但通常用于解决单一任务目标的操作难题。而本发明方法能够让机器人根据操作人员的语言指令灵活变换操作目标，从中推断出哪个物体应该被执行哪个动作基元，并通过强化方法让机器人掌握更多样的操作技能，更具灵活性与自主性，能够完成不同语言指令表征的任务目标，适用于多任务目标的作业环境。Traditional robot applications focus on a structured scene with a single task objective and a single operational skill, and cannot adapt to a multi-task operating environment; while the existing artificial intelligence-based robot operation methods allow robots to autonomously master operational skills, but usually Used to solve operational puzzles with a single mission objective. The method of the present invention can allow the robot to flexibly change the operation target according to the operator's language instructions, infer which object should be executed and which action primitive, and use the strengthening method to allow the robot to master more diverse operating skills, which is more flexible and Autonomy, able to complete the task objectives represented by different language instructions, suitable for the working environment of multi-task objectives.

附图说明Description of drawings

图1为本发明所述的一种基于自然语言的机器人指令操作方法的工作流程示意图。FIG. 1 is a schematic workflow diagram of a natural language-based robot instruction operation method according to the present invention.

图2为本发明方法所述的动作基元目标掩码图的示意图。FIG. 2 is a schematic diagram of an action primitive object mask map described in the method of the present invention.

图3为本发明方法所述的语言-视觉目标掩码模型的结构示意图。Fig. 3 is a schematic structural diagram of the language-visual target mask model described in the method of the present invention.

图4为本发明实施例提供的一种基于自然语言的机器人指令操作方法的整体工作过程示意图。FIG. 4 is a schematic diagram of an overall working process of a natural language-based robot instruction operation method provided by an embodiment of the present invention.

图5为本发明所述的一种基于自然语言的机器人指令操作系统的结构示意图。Fig. 5 is a schematic structural diagram of a robot instruction operating system based on natural language according to the present invention.

具体实施方式Detailed ways

下面结合实施例及附图，对本发明作进一步地详细说明，但本发明的实施方式不限于此。The present invention will be described in further detail below in conjunction with the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.

如图1所示，本发明公开了一种基于自然语言的机器人指令操作方法，以机器人平面抓取与放置的作业环境为例，阐述其工作流程为：As shown in Fig. 1, the present invention discloses a robot command operation method based on natural language. Taking the working environment of robot plane grabbing and placing as an example, the working process is described as follows:

S1动作基元库定义：根据作业环境，定义机器人所需的动作基元标签，可表示为：S1 Action primitive library definition: According to the working environment, define the action primitive tags required by the robot, which can be expressed as:

Action Primitives＝{grasp(1,(x_grasp,y_grasp,γ_grasp)),place(2,(x_place,y_place)),}Action Primitives＝{grasp(1,(_xgrasp ,_ygrasp ,_γgrasp )),place(2,(_xplace ,_yplace )),}

其中，包括抓取动作基元grasp(1,(x_grasp,y_grasp,γ_grasp))，其代表抓取动作基元的标签值为1，操作指令元组包括二维坐标(x_grasp,y_grasp)的抓取操作位置，抓取操作姿态γ_grasp为垂直于工作平面的旋转角度，包括：Among them, the grasp action primitive grassp(1,(x_grasp ,y_grasp ,γ_grasp )) is included, which represents that the label value of the grasp action primitive is 1, and the operation instruction tuple includes two-dimensional coordinates (x grasp ,y_grasp_grasp operation position, grasp operation attitude γ_grasp is the rotation angle perpendicular to the working plane, including:

γ_grasp＝{0:0°1:30°2:60°3:90°4:-30°5:-60°}；γ_grasp = {0:0°1:30°2:60°3:90°4:-30°5:-60°};

其中，0代表旋转角度为0°，1代表旋转角度为1°，以此类推。Among them, 0 means the rotation angle is 0°, 1 means the rotation angle is 1°, and so on.

还包括放置动作基元place(2,(x_place,y_place))，其代表抓取动作基元的标签值为2，操作指令元组包括二维坐标(x_grasp,y_grasp)的放置操作位置，Also includes the placement action primitive place(2,(x_place ,y_place )), which represents the tag value of the grasping action primitive is 2, and the operation instruction tuple includes the placement operation of the two-dimensional coordinates (x_grasp ,y_grasp ) Location,

S2环境信息感知：获取机器人操作环境中的视觉图像信息，以及操作人员以自然语言形式给出的语言指令，如图4的步骤S2所示。S2 Environmental Information Perception: Obtain the visual image information in the robot's operating environment, as well as the language instructions given by the operator in the form of natural language, as shown in step S2 of Figure 4 .

其中，所述的视觉图像信息包括RGB图像和深度图像，且分辨率为H×W。Wherein, the visual image information includes RGB image and depth image, and the resolution is H×W.

其中，所述的语言指令指定机器人的动作属性和对象属性，并遵循一定格式的指令模板包括：Wherein, the language instruction specifies the action attribute and object attribute of the robot, and the instruction template following a certain format includes:

Stack{object1}on the{object2}；Stack{object1} on the{object2};

Pick the{object1}；Pick the{object1};

Put the{object1}into the{object2}；Put the{object1}into the{object2};

其中，stack/pick/put即为机器人的动作属性，object1/object2即为机器人需要操作的对象属性，该对象属性可包括物体类别、形状、颜色、位置，如语言指令：Among them, stack/pick/put are the action attributes of the robot, and object1/object2 are the object attributes that the robot needs to operate. The object attributes can include object category, shape, color, and position, such as language instructions:

Put the rightmost trapezoid into the top plate；Put the rightmost trapezoid into the top plate;

rightmost/top即为对象的位置属性，trapezoid/plate即为对象的形状属性。rightmost/top is the position attribute of the object, and trapezoid/plate is the shape attribute of the object.

S3机器人任务理解：将步骤S2所获得RGB图像以及操作人员的语言指令输入到语言-视觉目标掩码模型中，输出动作基元目标掩码图。S3 Robot task understanding: Input the RGB image obtained in step S2 and the operator's language instructions into the language-visual target mask model, and output the action primitive target mask map.

其中，如图3所示，S3所述的语言-视觉目标掩码模型包括视觉特征编码模块、语言特征编码模块、多模态特征融合模块、特征解码模块，基于视觉特征编码模块获得图像信息的特征表示，基于语言特征编码模块获得语言指令的特征表示，多模态融合模块将图像信息的特征表示与语言指令的特征表示进行多模态融合，获得融合特征表示，特征解码模块将融合特征表示进行上采样，最终获得动作基元目标掩码图。Among them, as shown in Figure 3, the language-visual target mask model described in S3 includes a visual feature encoding module, a language feature encoding module, a multimodal feature fusion module, and a feature decoding module, and the image information is obtained based on the visual feature encoding module. Feature representation, based on the language feature encoding module to obtain the feature representation of the language instruction, the multi-modal fusion module performs multi-modal fusion of the feature representation of the image information and the feature representation of the language instruction to obtain the fusion feature representation, and the feature decoding module fuses the feature representation Upsampling is performed to finally obtain the action primitive target mask map.

S31：给定视觉特征V_i与语言特征L；S31: given visual feature V_i and language feature L;

其中，所获得的动作基元目标掩码图与输入的RGB图像的分辨率相同，并在语言指令所指示的操作目标上对应的像素值为机器人对该操作目标应执行的动作基元标签值，如在Put the rightmost trapezoid into the top plate语言指令下，动作基元目标掩码图中的最右边的梯形所在区域的像素值为1，代表应该对最右边的梯形执行抓取动作，而最上面的盘子所在区域的像素值为2，代表应该对最上面的盘子执行放置动作，如图2所示。Among them, the obtained action primitive target mask image has the same resolution as the input RGB image, and the corresponding pixel value on the operation target indicated by the language instruction is the action primitive label value that the robot should perform on the operation target , for example, under the Put the rightmost trapezoid into the top plate language instruction, the pixel value of the area where the rightmost trapezoid is located in the action primitive target mask image is 1, which means that the capture action should be performed on the rightmost trapezoid, and the most The pixel value of the area where the upper plate is located is 2, which means that the uppermost plate should be placed, as shown in Figure 2.

S4机器人操作指令生成：将步骤S2所获取的视觉图像信息以及步骤S3所输出的动作基元目标掩码图输入到基于强化学习的策略网络中，生成相应操作目标的相应动作基元的操作指令元组，该操作指令元组的具体参数由步骤S1所定义。S4 Robot operation command generation: input the visual image information acquired in step S2 and the action primitive target mask map output in step S3 into the policy network based on reinforcement learning, and generate the operation command of the corresponding action primitive of the corresponding operation target tuple, the specific parameters of the operation instruction tuple are defined by step S1.

其中，如图4的步骤S4所示，所述的基于强化学习的策略网络，采用深度Q网络的方式构建与训练策略网络，操作指令的生成过程包括以下步骤：Wherein, as shown in step S4 of FIG. 4 , the policy network based on reinforcement learning adopts the deep Q network to construct and train the policy network, and the generation process of the operation instruction includes the following steps:

S41：给定RGB图像、深度图像以及动作基元目标掩码图，输入到深度Q网络中；S41: Input the RGB image, the depth image and the target mask image of the action primitive into the deep Q network;

S42：深度Q网络端到端生成抓取Q值图与放置Q值图，抓取Q值图及放置Q值图的分辨率也为H×W，抓取Q值图的通道数与抓取操作姿态γ_grasp的数目相同，即为6通道；放置操作只需考虑放置操作位置，因而放置Q值图的通道数为1。S42: The deep Q network generates end-to-end capture Q-value map and place Q-value map. The resolution of capture Q-value map and place Q-value map is also H×W. The number of operation posture γ_grasps is the same, that is, 6 channels; the placement operation only needs to consider the placement operation position, so the number of channels for placing the Q value map is 1.

抓取Q值图上某一点像素值Q_i(u_i,v_i,r_i)代表：机器人在像素点(u_i,v_i)上以通道r_i对应的操作姿态γ_grasp(r_i)执行抓取动作的动作Q值。如Q(100,100,1)，机器人操作位置应为像素点(100,100)所在位置，垂直于工作平面的旋转角度应为通道1对应的旋转角度γ_grasp(1)＝30°。Grasp the pixel value Q_i (u_i , v_i , r_i ) of a certain point on the Q value map to represent: the robot’s operation posture γ_grasp (r_i ) corresponding to the channel r_i on the pixel point (u_i , v_i ) The Q value of the action to perform the grasping action. For example, Q(100,100,1), the robot’s operating position should be the position of the pixel point (100,100), and the rotation angle perpendicular to the working plane should be the rotation angle corresponding to channel 1 γ_grasp (1)=30°.

S43：取抓取Q值图上像素值最大的一点Q_g(u_g,v_g,r_g)作为抓取操作的最佳执行点，即抓取操作的最佳操作位置为像素点(u_g,v_g)，最佳操作姿态为通道r_g对应的操作姿态γ_grasp(r_g)。S43: Take the point Q_g (u_g , v_g , r_g ) with the largest pixel value on the grab Q value map as the best execution point of the grab operation, that is, the best operation position of the grab operation is the pixel point (u_g , v_g ), and the best operation posture is the operation posture γ_grassp (r_g ) corresponding to the channel r_g .

取放置Q值图上像素值最大的一点Q_p(u_p,v_p)作为放置操作的最佳执行点，即放置操作的最佳操作位置为像素点(u_p,v_p)。Take the point Q_p (_up , v_p ) with the largest pixel value on the placement Q value map as the best execution point of the placement operation, that is, the best operation position of the placement operation is the pixel point (_up , v_p ).

综上，输出抓取操作指令grasp(u_g,v_g,γ_grasp(r_g)),放置操作指令place(u_p,v_p)。To sum up, the grab operation instruction grasp(u_g ,v_g ,γ_graspp (r_g )) and the place operation instruction place(u_p ,v_p ) are output.

S5机器人底层运动控制：将步骤S4所获取的动作基元操作指令元组进行进一步的解析，通过机器人坐标系与像素坐标系之间的转换，转换为机器人坐标系下坐标，如将抓取操作指令grasp(u_g,v_g,γ_grasp(r_g))转换为机器人坐标系下的坐标grasp(x_g,y_g)，机器人末端姿态为垂直于工作平面旋转γ_grasp(r_g)，而后再通过逆运动学求解获得机器人各关节的目标位置，驱动关节电机进行运动，完成抓取操作。S5 Bottom-level motion control of the robot: further analyze the action primitive operation instruction tuple obtained in step S4, and convert it into coordinates in the robot coordinate system through the conversion between the robot coordinate system and the pixel coordinate system. For example, the grasping operation The instruction graspp(u_g ,v_g ,γ_graspp (r_g )) is transformed into the coordinates graspp(x_g ,y_g ) in the robot coordinate system, and the end posture of the robot is to rotate γ_graspp (r_g ) perpendicular to the working plane, and then Then, the target position of each joint of the robot is obtained through inverse kinematics solution, and the joint motor is driven to move to complete the grasping operation.

对于放置操作指令place(u_p,v_p)，同样也是执行上述过程，机器人运动过程如图4的步骤S5所示。For the placing operation command place(_up , v_p ), the above-mentioned process is also performed, and the movement process of the robot is shown in step S5 of FIG. 4 .

此外，如图5所示，本发明还提供了一种基于自然语言的机器人指令操作系统，包括视觉信息采集模块、计算机控制模块、机器人控制模块。In addition, as shown in Figure 5, the present invention also provides a robot instruction operating system based on natural language, including a visual information collection module, a computer control module, and a robot control module.

所述的视觉信息采集模块通过外部的RGB-D相机实时获取RGB图像与深度图像，并返回给所述的计算机控制模块。The visual information acquisition module acquires RGB images and depth images in real time through an external RGB-D camera, and returns them to the computer control module.

所述的计算机控制模块包括人机交互子模块与计算推理子模块。The computer control module includes a human-computer interaction submodule and a calculation reasoning submodule.

所述的人机交互子模块将从所述的视觉信息采集模块中实时获取视觉图像信息，并实时显示给操作人员，同时操作人员可通过人机交互子模块输入语言指令，并传达机器人是否执行语言指令的控制命令。The human-computer interaction sub-module will acquire visual image information in real time from the visual information acquisition module, and display it to the operator in real time. At the same time, the operator can input language instructions through the human-computer interaction sub-module, and convey whether the robot executes or not. Control commands for language instructions.

所述的计算推理子模块用于存储与运行所述的语言-视觉目标掩码模型以及基于强化学习的策略网络。当所述的人机交互子模块传达机器人执行语言指令的控制命令时，所述的计算推理子模块从所述的人机交互子模块中获取操作人员的语言指令，并从所述的视觉信息采集模块中获取当前视觉图像信息，随后依次运行语言-视觉目标掩码模型与基于强化学习的策略网络，获取机器人的操作指令，并传达给所述的机器人控制模块中。The calculation reasoning sub-module is used for storing and running the language-visual target mask model and the strategy network based on reinforcement learning. When the human-computer interaction sub-module conveys the control command for the robot to execute language instructions, the calculation reasoning sub-module obtains the operator's language instructions from the human-computer interaction sub-module, and uses the visual information The current visual image information is obtained in the acquisition module, and then the language-visual target mask model and the policy network based on reinforcement learning are run in sequence to obtain the robot's operation instructions and communicate them to the robot control module.

所述的机器人控制模块从计算推理子模块中获取机器人的操作指令，对操作指令中的操作位置、操作姿态等进行坐标系转换，并进行逆运动学求解，获取外部机器人各关节的目标位置，并驱动外部机器人的关节电机到达目标位置，从而控制机器人完成相应的操作。The robot control module obtains the operation instruction of the robot from the calculation and reasoning sub-module, performs coordinate system transformation on the operation position and operation posture in the operation instruction, and performs inverse kinematics solution to obtain the target position of each joint of the external robot, And drive the joint motor of the external robot to reach the target position, so as to control the robot to complete the corresponding operation.

此外，本发明还提供了一种存储介质，有可执行本发明所提供的一种基于自然语言的机器人指令操作的计算机指令或程序，当运行该计算机指令或程序时，可执行方法实施例的任意组合实施步骤，具备该方法相应的功能和有益效果。In addition, the present invention also provides a storage medium with computer instructions or programs that can execute a natural language-based robot instruction operation provided by the present invention. When the computer instructions or programs are run, the method embodiments can be executed. Any combination of implementation steps has the corresponding functions and beneficial effects of the method.

上述实施例为本发明较佳的实施方式，但本发明的实施方式并不受上述实施例的限制，其他的任何未背离本发明的精神实质与原理下所作的改变、修饰、替代、组合、简化，均应为等效的置换方式，都包含在本发明的保护范围之内。The above-mentioned embodiment is a preferred embodiment of the present invention, but the embodiment of the present invention is not limited by the above-mentioned embodiment, and any other changes, modifications, substitutions, combinations, Simplifications should be equivalent replacement methods, and all are included in the protection scope of the present invention.