Technical Field
The present invention relates to artificial intelligence technology, and in particular to a natural-language-based method and device for deep interaction and reasoning with robots.
Background Art
In recent years, with the rapid development of intelligent robots, people have come to expect robots to complete all kinds of tasks in complex environments through dialogue. Communicating with machines in natural language has long been a goal: users could operate a robot in the language they are most accustomed to, without spending large amounts of time and effort learning complex computer languages.
In this process, the intelligent robot system must understand natural language, grasp the user's expectations, and possess a reasoning mechanism with which to reason about, solve, and learn from problems in real time. Among current research results, the representative reasoning mechanisms are Rule-Based Reasoning (RBR), the Procedural Reasoning System (PRS), and Case-Based Reasoning (CBR). Rule-based reasoning has not been widely used because inference rules are difficult to acquire in some domains. Procedural reasoning shortens reasoning time, but has shortcomings such as a fixed plan library and an inability to learn and store newly generated plans. Case-based reasoning obtains a solution for the current case by consulting source cases in a case base, so it has a certain learning ability and high practicality.
However, case-based reasoning has no analytic capability: it cannot analyze an unclear user request and respond with guiding feedback, and it is not autonomous. Against this background, the present method introduces the BDI (belief-desire-intention) model, a behavioral cognition architecture whose essence is to determine an agent's goals and how the agent should achieve them. Combining the case-based reasoning mechanism with the BDI model both increases the autonomy of the reasoning system and remedies the BDI model's lack of learning ability. A deep dialogue and three-dimensional scene reasoning process is also introduced, tying the reasoning to the actual scene and improving the robot's intelligence.
Summary of the Invention
The technical problem to be solved by the present invention, in view of the defects of the prior art, is to provide a natural-language-based method and device for deep interaction and reasoning with robots, which realize deep interaction and reasoning between user and robot through natural language and improve the robot's intelligence and autonomy.
The technical solution adopted by the present invention to solve this technical problem is a natural-language-based method for deep interaction and reasoning with robots, comprising the following steps:
1) Speech recognition: receive the user's voice input, process the input signal, and obtain text information;
2) Obtain case attributes: perform word segmentation on the text obtained in step 1), then match the segmented text against the cases in a case library to extract the attributes of the current case;
the case library stores cases pre-designed according to actual scenarios, each case comprising the following basic attribute values: the case's attribute set and the case's solution;
3) Deep dialogue and three-dimensional scene interaction: if the user intention obtained from the case attributes extracted in step 2) is incomplete, guide the user repeatedly in combination with the real-time map information acquired by a Kinect sensor until the complete intention is obtained, then generate a solution for the task expressed by the user's complete intention;
4) Speech synthesis: the inference engine expresses the obtained solution in text form, and the machine sends it to the user as speech, synthesized with TTS technology and fed back to the user through audio equipment.
In the above scheme, the speech recognition process of step 1) specifically includes the following steps:
1.1) Preprocessing: collect the user's voice through the microphone array, process the raw input speech signal, filter out unimportant information and background noise, and perform endpoint detection, frame segmentation, and pre-emphasis on the speech signal;
1.2) Feature extraction: extract the key feature parameters reflecting the characteristics of the speech signal to form a feature vector sequence;
1.3) Use a Hidden Markov Model (HMM) to build the acoustic model; during recognition, match the speech to be recognized against the acoustic model to obtain the recognition result;
1.4) Perform grammatical and semantic analysis on the training text database and train a statistical model to obtain an N-Gram language model, thereby improving the recognition rate and narrowing the search range;
1.5) For the input speech signal, build a recognition network from the trained HMM acoustic model, the language model, and the dictionary, and use a search algorithm to find the best path through this network, namely the word string that outputs the speech signal with maximum probability, thereby determining the text contained in the speech sample.
In the above scheme, extracting the attributes of the current case in step 2) means matching the segmented text against the cases in the case library by vector-space-model text similarity to extract the attributes of the current case.
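The vector-space-model matching can be illustrated with a minimal sketch. The token lists, the raw term-count weighting, and the case names below are assumptions for illustration; the patent does not fix a particular weighting scheme:

```python
from collections import Counter
import math

def cosine_similarity(tokens_a, tokens_b):
    # bag-of-words term counts as the vector-space representation
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# match a segmented utterance against case descriptions, keep the best
query = ["grab", "one", "apple"]
cases = {
    "sorting": ["grab", "one", "red", "apple", "basket"],
    "inquiry": ["how", "many", "apples", "are", "there"],
}
best = max(cases, key=lambda k: cosine_similarity(query, cases[k]))
print(best)   # -> "sorting"
```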
In the above scheme, the case library of step 2) is built as follows:
Design dialogue topics according to requirements, and design a topic tree for each topic. The topic tree consists of topic nodes, necessary attribute nodes, and leaf nodes; leaf nodes are subordinate to necessary attribute nodes, and necessary attribute nodes are subordinate to topic nodes. Every node carries a binary validity flag; leaf nodes are combined by an OR relation, while necessary attribute nodes are combined by an AND relation;
Write a dialogue generation function for each node of the topic tree; the collection of these functions constitutes the guidance library. Under different system states, calling such a function yields different response outputs, and each dialogue generation function is responsible only for the responses of its own node, so the functions do not affect one another during design and modification.
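The topic tree can be sketched as a small data structure; the node names below are hypothetical, and only the OR/AND validity logic described above is taken from the text:

```python
class Node:
    def __init__(self, name, children=None, valid=False):
        self.name = name
        self.children = children or []   # empty for leaf nodes
        self.valid = valid               # binary validity flag

def attribute_satisfied(attr_node):
    # OR over leaf nodes: one recognized leaf validates the attribute
    return any(leaf.valid for leaf in attr_node.children)

def topic_satisfied(topic_node):
    # AND over necessary attribute nodes: all must be validated
    return all(attribute_satisfied(a) for a in topic_node.children)

topic = Node("sorting", [
    Node("object_name", [Node("apple", valid=True), Node("orange")]),
    Node("destination", [Node("left basket"), Node("right basket")]),
])
print(topic_satisfied(topic))   # False: destination not yet validated
```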
In the above scheme, the process of obtaining case attributes in step 2) specifically includes the following steps:
2.1) Perform word segmentation on the text obtained in step 1), i.e. split the text into individual words;
2.2) Match the segmented text against the cases in the case library; since each case contains the task's attribute set, when the most similar case is retrieved, the task attributes corresponding to that case are extracted.
In the above scheme, the deep dialogue and three-dimensional scene interaction of step 3) specifically includes the following steps:
3.1) When the inference engine receives voice input, the robot evaluates the user's utterance against the map information acquired by the Kinect sensor. If the input is unrelated to the current map, the robot guides the user; if it is related, the robot matches the input against the cases in the case library, and if a similar case exists, the user input is matched against the Kinect map information to judge whether the user's expectation can be satisfied, with the result fed back to the user;
3.2) After case retrieval and map matching, the inference engine has obtained the corresponding task attributes and matching degree, and analyzes this information to derive the user expectation. If the computed expectation is complete, no further guidance is needed and the process goes to step 3.4); if it is incomplete, further user guidance is required and the process goes to step 3.3);
3.3) Build a guidance case library as an XML file; the guidance library contains the guidance schemes addressed to the user for each missing attribute when the expectation is incomplete. Compare each attribute of the user expectation with the attributes of each guidance case one by one, scoring 1 when they are the same and 0 when they differ; sum the scores, take the case with the largest sum as the best case, and use it as the guidance scheme, repeating until the complete user expectation is obtained;
3.4) Invoke the solution corresponding to the complete expectation in the case library, match it against the real-time three-dimensional environment information, and reuse it to generate a sequence of executable actions (the intention), thereby accomplishing the specified task.
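The attribute-scoring retrieval of step 3.3 admits a direct sketch; the attribute names and guidance prompts below are illustrative assumptions, while the 1/0 scoring and the choice of the highest total follow the description above:

```python
def best_guidance(expectation, guidance_cases):
    # score 1 per attribute that matches, 0 otherwise; keep the max
    def score(case):
        return sum(1 if case["attrs"].get(k) == v else 0
                   for k, v in expectation.items())
    return max(guidance_cases, key=score)

expectation = {"name": "apple", "quantity": "one", "destination": None}
guidance_cases = [
    {"attrs": {"name": "apple", "quantity": "one", "destination": None},
     "prompt": "Which basket should it be placed in?"},
    {"attrs": {"name": None, "quantity": None, "destination": None},
     "prompt": "What would you like me to do?"},
]
print(best_guidance(expectation, guidance_cases)["prompt"])
```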
A natural-language-based device for deep interaction and reasoning with robots, comprising:
a point cloud acquisition module, which fuses the map depth information and color information collected by the Kinect to generate three-dimensional point cloud data (PCD), performs preprocessing, keypoint extraction, and descriptor extraction, and then performs feature matching against an object feature database to obtain a three-dimensional scene semantic map description file;
a speech recognition module, which denoises the user speech signal collected by the microphone array, extracts features with the MFCC algorithm, and then, combining the HMM acoustic model and the N-gram language model, converts the speech signal into text through a speech decoding search algorithm;
a deep dialogue and three-dimensional scene interaction module, which retrieves the case most similar to the received text from the case library, performs map matching, expectation analysis, and guidance in combination with the map file produced by the object recognition node so as to complete the user expectation and generate a solution, and sends replies and guidance information to the speech synthesis node in text form;
a speech synthesis module, which uses TTS technology to turn the text produced during human-computer interaction into a speech signal through the three steps of text analysis, prosody modeling, and speech synthesis, and feeds it back to the user;
a case library, a knowledge base built as XML files for storing real-world experience; drawing on the human pattern of experiential memory, cases are designed according to actual scenarios, and each case contains the following basic attribute values: the attribute set and the case's solution.
The beneficial effects produced by the present invention are:
1. During interaction both the robot and the user use natural language; the robot can autonomously guide the user until the user's complete expectation is obtained, and match it against the case library to obtain a solution for executing the task.
2. The present invention adopts a deep interaction and reasoning mechanism oriented to Chinese speech, adding a deep dialogue and three-dimensional scene interaction module on top of traditional CBR-BDI, and handles interaction and reasoning both when "the user's expressed intention does not match the actual scene" and when "the user's expressed intention is incomplete". Because the method supplements the unknown attributes of an intention through human-computer interaction, it is more accurate, flexible, and practical than common-sense-based reasoning. At the same time, the reasoning mechanism is based on CBR-BDI, so it can apply past experience to present problems, give feedback on them, and pursue goals autonomously, giving it good market prospects and development potential.
Brief Description of the Drawings
The present invention will be further described below with reference to the accompanying drawings and embodiments, in which:
Fig. 1 is a hardware architecture diagram of the robot deep interaction and reasoning device in an embodiment of the present invention;
Fig. 2 is a program flowchart of the robot deep interaction and reasoning method in an embodiment of the present invention;
Fig. 3 is a flowchart of the deep interaction and reasoning mechanism in an embodiment of the present invention;
Fig. 4 is a reasoning flowchart of the deep dialogue and three-dimensional scene interaction module in an embodiment of the present invention.
Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
As shown in Fig. 1, which depicts the hardware architecture when the natural-language-based robot deep interaction and reasoning method proposed by the present invention is applied to a robot sorting system: speech input from the microphone array is converted to text by the speech recognition module; the text is fed into the human-machine deep interaction and reasoning module, to which the map file produced by the Kinect camera's recognition is also sent; the improved CBR-BDI reasoning mechanism obtains the complete user expectation, retrieves the target's coordinate position from the map file, and generates the solution. The system platform used in the present invention is the Ubuntu (version 12.04) embedded platform.
Fig. 2 is a program flowchart of the natural-language-based robot deep interaction and reasoning method implemented by the present invention, as follows. The speech recognition process converts the user's input into text; after word segmentation, case retrieval is performed in the case library, and the retrieved case attributes are analyzed. If the number of case attributes is greater than 0, map matching is performed; if the number of initial state attributes is 0, the user input is invalid, and guidance cases are extracted from the guidance library to guide the user until the number of state attributes is greater than 0, after which map matching is performed. If the number of objects the user expects matches the number of objects in the map, expectation analysis is performed; if they do not match, guidance is required until they do. Finally, the case attributes and the map matching values are added to the user expectation for expectation analysis, which checks whether the expectation is complete: if none of the necessary attributes among the current case's attribute values is empty, the expectation is complete, otherwise it is incomplete (completeness is judged by whether the attributes obtained for the current case include all the attributes of the matched case). If the expectation is incomplete, further guidance is required until it is complete; if it is complete, the required information is extracted to generate the solution, i.e. the user intention.
Fig. 3 is a flowchart of the natural-language-based robot deep interaction and reasoning method, which comprises five parts: speech recognition, case storage, case attribute acquisition, deep dialogue with three-dimensional scene interaction, and speech synthesis.
The specific implementation of the present invention is as follows:
S1: Speech recognition
S11: The user inputs speech through the microphone array; the raw input speech signal is processed to filter out unimportant information and background noise, and endpoint detection, frame segmentation, and pre-emphasis are performed on the speech signal.
S12: The Mel Frequency Cepstral Coefficient (MFCC) algorithm is used for speech feature extraction. The speech waveform is divided into frames of roughly 10 ms, and from each frame 39 numbers representing that frame's speech are extracted; these 39 numbers are the frame's MFCC features, represented as a feature vector.
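A minimal sketch of such 39-dimensional feature extraction, assuming the librosa library as a stand-in (the patent names no implementation): 13 static coefficients plus their first- and second-order deltas give 39 numbers per frame, and a 10 ms hop at a 16 kHz sampling rate is 160 samples:

```python
import librosa
import numpy as np

# "utterance.wav" is a hypothetical input file
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)   # 25 ms window, 10 ms hop
d1 = librosa.feature.delta(mfcc)                         # first-order deltas
d2 = librosa.feature.delta(mfcc, order=2)                # second-order deltas
features = np.vstack([mfcc, d1, d2])                     # shape: (39, n_frames)
```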
S13: A Hidden Markov Model (HMM) is used for acoustic modeling. A statistical model is built for the time-series structure of the speech signal, and a Markov chain with a finite number of states simulates the changing statistical characteristics of the speech signal.
S14: An N-Gram model is used for language modeling, describing the relationships between words. This technical solution uses the CMUCLMTK training tool provided by CMU to obtain the N-gram language model.
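The counting that underlies an N-gram model can be shown with a toy maximum-likelihood bigram; CMUCLMTK itself additionally performs smoothing, vocabulary handling, and ARPA-format output, none of which this sketch attempts:

```python
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]     # sentence boundary markers
        unigrams.update(words[:-1])            # context counts
        bigrams.update(zip(words[:-1], words[1:]))
    # P(w | prev) = count(prev, w) / count(prev)
    return lambda prev, w: (bigrams[(prev, w)] / unigrams[prev]
                            if unigrams[prev] else 0.0)

p = train_bigram([["grab", "an", "apple"], ["grab", "an", "orange"]])
print(p("grab", "an"))   # 1.0: "an" always follows "grab" in this toy data
```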
S15: The dynamic-programming-based Viterbi algorithm is applied to every state at every time step: it computes the posterior probability of each decoding state sequence given the observation sequence, retains the highest-probability path, and records the corresponding state information at each node so that the word decoding sequence can finally be recovered by backtracking.
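A generic log-domain Viterbi decoder sketch of this search follows; the model matrices are assumed given, whereas a real recognizer searches a network composed from the HMM, the dictionary, and the language model:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, observations):
    """log_init: (S,), log_trans: (S, S), log_emit: (S, V);
    observations: list of symbol indices. Returns the best state path."""
    n_states, T = len(log_init), len(observations)
    score = np.full((T, n_states), -np.inf)   # best log score per state/time
    back = np.zeros((T, n_states), dtype=int) # backpointers
    score[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        for s in range(n_states):
            cand = score[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(cand))
            score[t, s] = cand[back[t, s]] + log_emit[s, observations[t]]
    # backtrack from the best final state to recover the decoded sequence
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```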
S2: Case storage
In this technical solution the case library is stored as XML files. The library holds 1 to n cases designed according to the actual scenario, and each case has two basic attribute values, namely the case's attribute set and the case's solution (a sequence of robot actions), together with the final attribute set produced after interaction with the environment and reasoning.
For each newly generated case, similarity matching yields an initial attribute set, which keeps changing during the interactive reasoning process; once the complete intention has been produced, the final state is stored in the final attribute set.
In this design the cases are divided into an inquiry topic and a sorting topic. A case's attribute set includes: object quantity, object name, object position, object color, object size, and the name of the destination where the object is to be placed. For example, for the case "grab a big red apple and put it in the left basket", the attributes are assigned as: object quantity: "one"; object name: "apple"; object position: empty; object color: "red"; object size: "big"; destination name: "left basket".
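One such case might be stored as in the following sketch, parsed with the Python standard library; the tag names are hypothetical, since the patent fixes only the attribute set and the solution as the basic fields:

```python
import xml.etree.ElementTree as ET

case_xml = """
<case topic="sorting">
  <attributes>
    <quantity>one</quantity>
    <name>apple</name>
    <position/>
    <color>red</color>
    <size>big</size>
    <destination>left basket</destination>
  </attributes>
  <solution>locate;grasp;move;release</solution>
</case>
"""
case = ET.fromstring(case_xml)
# collect the attribute set; empty elements such as <position/> map to None
attrs = {el.tag: el.text for el in case.find("attributes")}
print(attrs["name"], "->", case.findtext("solution"))
```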
S3: Obtain case attributes
S31: Segment the text obtained in S1 with a tokenizer. Example 1: the text converted from the user's speech is "grab an apple"; after tokenization the result is "grab / an / apple".
S32: Match each segmented word against the case library. If no similar case is retrieved, create a new case; if a similar case is retrieved, return it and count the initial case attributes. Example 2: for the case "grab an apple", the initial attributes are object quantity and object name, so the initial attribute count is 2. When the initial attribute count is greater than 0, map matching is performed; when it equals 0, the input is invalid and the robot actively guides the user.
S4: Deep dialogue and three-dimensional scene interaction, the specific flow of which is shown in Fig. 4:
S41: Map matching
S411: The system needs to acquire high-quality semantic map information about the working environment through three-dimensional visual perception. This design uses the Kinect to capture 3D point cloud images and builds CSHOT object models for feature matching within the scene. The Point Cloud Library (PCL) is called, and a method based on local surface feature descriptors achieves real-time recognition and understanding of common everyday rigid objects. Objects are detected with a region-growing segmentation algorithm and the ISS feature points of the scene are extracted; CSHOT feature description vectors are computed at the keypoints; candidate models are generated through distance-threshold-based 3D feature matching; transformation hypotheses are generated with a random sample consensus algorithm and verified with the iterative closest point algorithm, producing a solution globally consistent with the scene; the objects' coordinate information is then converted into the robot coordinate system through a coordinate transformation. The identification and geometric information of the recognized objects is written into an XML semantic map file.
The object attributes in the XML map file include: the object's number in the scene map; the object's name, e.g. apple or orange; the object's color; the object's shape, e.g. cylinder or cube; and the object's size, i.e. length * width * height, pi * (base radius)^2 * height, and so on.
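A rough sketch of the registration part of this S411 pipeline is given below using Open3D, an assumed substitute for PCL in this illustration; since Open3D provides no CSHOT implementation, FPFH descriptors stand in for CSHOT, the segmentation and ISS keypoint steps are omitted, and Open3D >= 0.12 is assumed:

```python
import open3d as o3d

def fpfh(pcd, radius):
    # local surface descriptors require normals
    pcd.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=radius, max_nn=30))
    return o3d.pipelines.registration.compute_fpfh_feature(
        pcd, o3d.geometry.KDTreeSearchParamHybrid(radius=radius * 2.5, max_nn=100))

scene = o3d.io.read_point_cloud("scene.pcd")        # hypothetical files
model = o3d.io.read_point_cloud("apple_model.pcd")
scene_d = scene.voxel_down_sample(0.005)
model_d = model.voxel_down_sample(0.005)
# feature matching + RANSAC generates a transformation hypothesis
coarse = o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
    model_d, scene_d, fpfh(model_d, 0.01), fpfh(scene_d, 0.01),
    mutual_filter=True, max_correspondence_distance=0.01)
# ICP verifies and refines the hypothesis against the scene
fine = o3d.pipelines.registration.registration_icp(
    model_d, scene_d, 0.005, coarse.transformation)
pose_in_camera_frame = fine.transformation   # then convert to the robot frame
```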
S412: Search the XML map for the objects the user expects, count them, and determine how the user's expected object quantity matches the map. Four matching situations can arise: (1) the scene contains no object meeting the requirements; (2) the scene contains fewer such objects than the user expects; (3) the two quantities are exactly equal; (4) the scene contains more such objects than the user expects.
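A sketch of this comparison against the XML semantic map, with hypothetical tag names:

```python
import xml.etree.ElementTree as ET

def match_situation(map_xml, expected_name, expected_count):
    root = ET.fromstring(map_xml)
    found = sum(1 for obj in root.findall("object")
                if obj.findtext("name") == expected_name)
    if found == 0:
        return 1          # no matching object in the scene
    if found < expected_count:
        return 2          # fewer objects than expected
    if found == expected_count:
        return 3          # exactly equal: expectation is valid
    return 4              # more objects than expected

map_xml = """<map>
  <object><name>apple</name><color>red</color></object>
</map>"""
print(match_situation(map_xml, "apple", 1))   # -> 3
```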
S42: Expectation analysis
When situation (3) of S412 occurs, the user's expectation is valid and determinate, and the next step, expectation analysis, can proceed; when situation (1), (2), or (4) occurs, the user must be guided using the method of S43.
During expectation analysis, the case attributes and the map matching values are added to the user expectation and the expectation is analyzed. If the expectation is complete, the case can be reused and an intention, i.e. a robot action sequence, is produced; otherwise the guidance case library must be called for guidance.
Example 3: the map file contains one big red apple, and the user says to the robot: "grab an apple". Map matching gives the situation in which the number of objects in the scene equals the number the user requires, but during expectation analysis the destination-name attribute is found to be empty, so the expectation is incomplete and the user must be guided using the method of S43.
S43: User guidance
When expectation analysis finds the expectation incomplete, user guidance is performed. Every attribute node in the case storage has a corresponding dialogue generation function, and the collection of these functions constitutes the guidance library. In Example 3 the expectation is incomplete, so the guidance library is searched and, based on the missing attribute, the robot asks the user: "Which basket should it be placed in?"; the case attributes are then completed from the information the user feeds back.
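A sketch of this guidance mechanism, with one dialogue generation function per attribute node; apart from the basket question quoted above, the prompts and attribute names are illustrative assumptions:

```python
PROMPTS = {
    "destination": lambda: "Which basket should it be placed in?",
    "name":        lambda: "Which object would you like me to grab?",
    "quantity":    lambda: "How many should I grab?",
}
NECESSARY = ("name", "quantity", "destination")

def next_prompt(expectation):
    for attr in NECESSARY:
        if not expectation.get(attr):
            return PROMPTS[attr]()   # ask about the first missing attribute
    return None                      # expectation is complete

print(next_prompt({"name": "apple", "quantity": "one", "destination": None}))
```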
S44: Case reuse and complete intention generation
When, after one or more rounds of guidance, an incomplete intention yields a complete expectation (all necessary attributes are non-empty; not every attribute is required to have a value), the solution corresponding to that complete expectation is invoked from the case library, matched against the real-time three-dimensional environment information, and reused to generate a sequence of executable actions, thereby accomplishing the specified task.
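This reuse step can be sketched as instantiating the stored solution with coordinates from the semantic map; the action names, the map structure, and the coordinate values are assumptions for illustration:

```python
def generate_intention(expectation, solution_steps, semantic_map):
    # bind the abstract solution arguments to real scene coordinates
    bindings = {"target": semantic_map[expectation["name"]],
                "destination": semantic_map[expectation["destination"]]}
    return [(step, bindings[arg]) for step, arg in solution_steps]

steps = [("locate", "target"), ("grasp", "target"),
         ("move", "destination"), ("release", "destination")]
semantic_map = {"apple": (0.42, -0.10, 0.05),       # (x, y, z) in robot frame
                "left basket": (0.30, 0.25, 0.0)}
intention = generate_intention(
    {"name": "apple", "quantity": "one", "destination": "left basket"},
    steps, semantic_map)
```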
S5: Speech synthesis
S51: Text analysis
The input text is normalized, possible splicing errors by the user are handled, and nonstandard or unpronounceable characters are filtered out. The boundaries of words and phrases in the text are analyzed and the pronunciation of the text determined, including the readings of numbers, surnames, special characters, and the various polyphonic characters appearing in the text. The tone changes and the relative stress of different sounds during pronunciation are determined. Finally, the input text is converted into internal parameters that the computer can process, so that subsequent modules can process it further and generate the corresponding information.
S52: Prosody modeling
Segmental features are planned for the synthesized speech; the prosody parameters include fundamental frequency, duration, and intensity, so that the synthesized speech expresses the meaning correctly and sounds more natural.
S53: Speech synthesis
Based on the results of prosody modeling, the pitch-synchronous overlap-add method (PSOLA) converts the text into speech output. The speech units corresponding to the words or phrases of the processed text are extracted from the speech synthesis library; a specific speech synthesis technique adjusts and modifies the prosodic characteristics of these units; and finally speech meeting the requirements is synthesized and fed back to the user through audio equipment.
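A simplified time-domain PSOLA sketch for duration modification follows; it assumes a constant fundamental frequency for brevity, whereas production systems track pitch epochs frame by frame:

```python
import numpy as np

def psola_stretch(x, sr, f0, rate):
    """Stretch the duration of voiced speech x by `rate` while keeping
    its pitch: two-period Hann-windowed grains taken at analysis epochs
    are overlap-added at synthesis epochs spaced one pitch period apart."""
    period = int(sr / f0)                       # samples per pitch period
    win = np.hanning(2 * period)
    epochs = np.arange(period, len(x) - period, period)   # analysis epochs
    out_len = int(len(x) * rate)
    y = np.zeros(out_len + 2 * period)
    t_out = period
    while t_out < out_len:
        # reuse the analysis grain nearest the corresponding input time
        t_in = int(epochs[np.argmin(np.abs(epochs - t_out / rate))])
        y[t_out - period:t_out + period] += x[t_in - period:t_in + period] * win
        t_out += period                         # pitch period stays fixed
    return y[:out_len]
```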
It should be understood that those skilled in the art can make improvements or changes based on the above description, and all such improvements and changes shall fall within the protection scope of the appended claims of the present invention.