CN108766429B


Info

Publication number: CN108766429B
Application number: CN201810568760.0A
Authority: CN
Inventors: Lu Hua (路华); Huang Shiwei (黄世维); Huang Shuo (黄硕)
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2018-06-05
Filing date: 2018-06-05
Publication date: 2020-08-21
Anticipated expiration: 2038-06-05
Also published as: CN108766429A

Abstract

The embodiments of the present application disclose a voice interaction method and apparatus. One embodiment of the method includes: extracting first voice information that contains a target-word speech segment; superimposing a prompt tone on the target-word speech segment and outputting, by voice, the first voice information with the prompt tone superimposed, where the prompt tone indicates that the content currently being broadcast is a target word; in response to collecting second voice information fed back by the user, matching the second voice information against the target word; and in response to determining that the second voice information matches the target word, outputting, by voice, third voice information associated with the target word. This embodiment improves the efficiency of voice interaction.

Description


Voice interaction method and apparatus

Technical Field

The embodiments of the present application relate to the field of computer technology, and in particular to a voice interaction method and apparatus.

Background

With the development of computer technology, voice interaction products have become increasingly diverse. In voice-only products, what the user can say is not constrained by a graphical interface, so the degree of freedom is extremely high and the user's answers usually have to be restricted. In a voice-only environment it is therefore particularly important to tell users, efficiently and at low cost, what those restrictions are.

In existing approaches, the user is usually given prompts through a graphical interface and learns the available voice commands after reading instructions or tutorials. In another existing approach, the available voice commands are announced to the user by voice output.

Summary

The embodiments of the present application propose a voice interaction method and apparatus.

In a first aspect, an embodiment of the present application provides a voice interaction method. The method includes: extracting first voice information containing a target-word speech segment; superimposing a prompt tone on the target-word speech segment and outputting, by voice, the first voice information with the prompt tone superimposed, where the prompt tone indicates that the content currently being broadcast is a target word; in response to collecting second voice information fed back by the user, matching the second voice information against the target word; and in response to determining that the second voice information matches the target word, outputting, by voice, third voice information associated with the target word.

In some embodiments, superimposing the prompt tone on the target-word speech segment and outputting, by voice, the first voice information with the prompt tone superimposed includes: superimposing a pulse-type prompt tone at the start of the target-word speech segment and outputting the resulting first voice information by voice, where the prompt tone ends before the target-word speech segment ends.

In some embodiments, superimposing the prompt tone on the target-word speech segment and outputting, by voice, the first voice information with the prompt tone superimposed includes: superimposing a continuous prompt tone at the start of the target-word speech segment and outputting the resulting first voice information by voice, where the prompt tone ends when the target-word speech segment ends.

In some embodiments, outputting, by voice, third voice information associated with the target word in response to determining that the second voice information matches the target word includes: in response to determining that the second voice information matches the target word, determining the type of the first voice information; determining, based on that type, the third voice information associated with the target word; and outputting the third voice information by voice.

In some embodiments, determining the third voice information associated with the target word based on the type of the first voice information and outputting it by voice includes: in response to determining that the type of the first voice information is news broadcast, generating an information search request containing the target word; sending the information search request to a server and receiving the search result returned by the server; and taking the voice information corresponding to the search result as the third voice information and outputting it by voice.

In some embodiments, determining the third voice information associated with the target word based on the type of the first voice information and outputting it by voice includes: in response to determining that the type of the first voice information is service query, generating a service query request containing the target word; sending the service query request to a server and receiving the query result returned by the server; and taking the voice information corresponding to the query result as the third voice information and outputting it by voice.

In some embodiments, determining the third voice information associated with the target word based on the type of the first voice information and outputting it by voice includes: in response to determining that the type of the first voice information is information confirmation, generating a jump instruction for jumping to a preset next piece of voice information, and determining that next piece of voice information as the third voice information.
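The three type-dependent branches above amount to a small dispatch on the type of the first voice information. The following is a minimal sketch, not the patent's implementation: the `VoiceInfoType` enum, the `Request` record and its `kind` labels are hypothetical names introduced for illustration, and the actual request to the server is left out.

```python
from dataclasses import dataclass
from enum import Enum


class VoiceInfoType(Enum):
    # The three types named in the text; the enum itself is hypothetical.
    NEWS_BROADCAST = "news_broadcast"
    SERVICE_QUERY = "service_query"
    INFO_CONFIRMATION = "info_confirmation"


@dataclass
class Request:
    kind: str      # hypothetical labels: "search", "service_query" or "jump"
    payload: str   # the target word, or the id of the next voice item


def build_request(info_type: VoiceInfoType, target_word: str, next_item_id: str) -> Request:
    """Map the type of the first voice information to a follow-up action."""
    if info_type is VoiceInfoType.NEWS_BROADCAST:
        # News broadcast: information search request containing the target word.
        return Request("search", target_word)
    if info_type is VoiceInfoType.SERVICE_QUERY:
        # Service query: service query request containing the target word.
        return Request("service_query", target_word)
    if info_type is VoiceInfoType.INFO_CONFIRMATION:
        # Information confirmation: jump to the preset next piece of voice information.
        return Request("jump", next_item_id)
    raise ValueError(f"unsupported type: {info_type}")
```

The first two branches would be sent to the server and their results converted to speech; the jump branch stays on the device, which is why it carries the next item's id rather than the target word.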

In some embodiments, the volume of the prompt tone is lower than that of the target-word speech segment.

In a second aspect, an embodiment of the present application provides a voice interaction apparatus. The apparatus includes: an extraction unit configured to extract first voice information containing a target-word speech segment; a first output unit configured to superimpose a prompt tone on the target-word speech segment and output, by voice, the first voice information with the prompt tone superimposed, the prompt tone indicating that the content currently being broadcast is a target word; a matching unit configured to, in response to collecting second voice information fed back by the user, match the second voice information against the target word; and a second output unit configured to, in response to determining that the second voice information matches the target word, output, by voice, third voice information associated with the target word.

In some embodiments, the first output unit is further configured to: superimpose a pulse-type prompt tone at the start of the target-word speech segment and output, by voice, the first voice information with the prompt tone superimposed, where the prompt tone ends before the target-word speech segment ends.

In some embodiments, the first output unit is further configured to: superimpose a continuous prompt tone at the start of the target-word speech segment and output, by voice, the first voice information with the prompt tone superimposed, where the prompt tone ends when the target-word speech segment ends.

In some embodiments, the matching unit is further configured to: in response to determining that the second voice information matches the target word, determine the type of the first voice information; determine, based on that type, the third voice information associated with the target word; and output the third voice information by voice.

In some embodiments, the matching unit includes: a first generation module configured to generate an information search request containing the target word in response to determining that the type of the first voice information is news broadcast; a first sending module configured to send the information search request to a server and receive the search result returned by the server; and a first output module configured to take the voice information corresponding to the search result as the third voice information and output it by voice.

In some embodiments, the matching unit includes: a second generation module configured to generate a service query request containing the target word in response to determining that the type of the first voice information is service query; a second sending module configured to send the service query request to a server and receive the query result returned by the server; and a second output module configured to take the voice information corresponding to the query result as the third voice information and output it by voice.

In some embodiments, the matching unit includes: a third generation module configured to, in response to determining that the type of the first voice information is information confirmation, generate a jump instruction for jumping to a preset next piece of voice information, and determine that next piece of voice information as the third voice information.

In a third aspect, an embodiment of the present application provides a terminal device, including: one or more processors; and a storage apparatus storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any embodiment of the voice interaction method.

In a fourth aspect, an embodiment of the present application provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method of any embodiment of the voice interaction method.

In the voice interaction method and apparatus provided by the embodiments of the present application, first voice information containing a target-word speech segment is extracted; a prompt tone is superimposed on the target-word speech segment and the resulting first voice information is output by voice; when second voice information fed back by the user is collected, the third voice information to be output is determined based on whether the second voice information matches the target word; and finally the third voice information is output by voice. There is thus no need to use a graphical interface or voice announcements to tell the user which voice commands can be input, and the user does not have to spend extra time reading or listening to instructions and tutorials: the superimposed prompt tone alone indicates which voice commands can be issued, which improves the efficiency of voice interaction.

Brief Description of the Drawings

Other features, objects and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:

FIG. 1 is a diagram of an exemplary system architecture to which the present application can be applied;

FIG. 2 is a flowchart of an embodiment of the voice interaction method according to the present application;

FIG. 3 is a schematic diagram of an application scenario of the voice interaction method according to the present application;

FIG. 4 is a flowchart of another embodiment of the voice interaction method according to the present application;

FIG. 5 is a schematic structural diagram of an embodiment of the voice interaction apparatus according to the present application;

FIG. 6 is a schematic structural diagram of a computer system suitable for implementing the terminal device of the embodiments of the present application.

Detailed Description

The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention.

It should be noted that, where they do not conflict, the embodiments of the present application and the features in those embodiments may be combined with each other. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

FIG. 1 shows an exemplary system architecture 100 to which the voice interaction method or voice interaction apparatus of the present application may be applied.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.

A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as voice interaction applications, shopping applications, search applications, instant messaging tools, email clients and social platform software.

The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices that support voice interaction, including but not limited to smartphones, tablet computers, e-book readers, laptop computers and desktop computers. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.

The server 105 may be a server that provides various services, for example a back-end server that supports the voice interaction applications installed on the terminal devices 101, 102, 103. The back-end server may analyze and otherwise process received data such as information search requests and service query requests, and feed the processing results back to the terminal devices.

The server 105 may be hardware or software. When it is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When it is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.

It should be noted that the voice interaction method provided by the embodiments of the present application is generally executed by the terminal devices 101, 102, 103; accordingly, the voice interaction apparatus is generally provided in the terminal devices 101, 102, 103.

It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers, as required by the implementation.

Continuing to refer to FIG. 2, a flow 200 of an embodiment of the voice interaction method according to the present application is shown. The voice interaction method includes the following steps:

Step 201: extract first voice information containing a target-word speech segment.

In this embodiment, the execution body of the voice interaction method (for example, the terminal devices 101, 102, 103 shown in FIG. 1) may extract the first voice information that is to be output by voice. The first voice information may contain a target-word speech segment.

The target-word speech segment may be the segment of the first voice information formed by the speech into which a target word has been converted. A target word may be a word used to generate an instruction (for example, an information search instruction, a service query instruction or a jump instruction). For example, in the first voice information "Arsenal beat Chelsea 3-0", both "Arsenal" and "Chelsea" can serve as target words, and the speech segments corresponding to them are target-word speech segments. After the user answers "Arsenal" by voice, the execution body may generate an information search instruction containing the string "Arsenal"; after the user answers "Chelsea", it may generate an information search instruction containing the string "Chelsea".
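To locate a target-word speech segment, the device needs to know where each target word falls in the synthesized audio. A minimal sketch, assuming the text-to-speech engine reports word-level timings as `(word, start_seconds, end_seconds)` tuples (an assumption; the source does not say how segments are located). The instruction dictionary produced for the user's reply is likewise a hypothetical format.

```python
from typing import List, Optional, Set, Tuple

# (word, start_seconds, end_seconds); assumes the TTS engine reports word timings.
WordTiming = Tuple[str, float, float]


def target_word_segments(timings: List[WordTiming], targets: Set[str]) -> List[WordTiming]:
    """Return the time span of every target-word speech segment, in order."""
    return [(w, s, e) for (w, s, e) in timings if w in targets]


def build_search_instruction(reply: str, targets: Set[str]) -> Optional[dict]:
    """Build a hypothetical search instruction when the reply is a target word."""
    if reply in targets:
        return {"action": "search", "query": reply}
    return None
```

For "Arsenal beat Chelsea 3-0", the two returned spans are where the prompt tones would later be superimposed.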

Step 202: superimpose a prompt tone on the target-word speech segment, and output, by voice, the first voice information with the prompt tone superimposed.

In this embodiment, the execution body may superimpose a prompt tone on the target-word speech segment and output, by voice, the first voice information with the prompt tone superimposed. The prompt tone indicates that the content currently being broadcast is a target word. For example, the prompt tone may be a pulse-type tone (such as a "ding" or "dong") whose volume gradually decays over time. It may also be a continuous tone that keeps the same volume from start to end. Note that the prompt tone may differ from the target-word speech segment in volume (for example, be quieter) and in timbre (for example, a timbre that people subjectively perceive as softer), so as to reduce interference to the user.
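The two tone shapes and the "quieter than the speech" requirement can be sketched with NumPy. Everything numeric here is an illustrative assumption not taken from the source: the 16 kHz sample rate, sine-based tones, the frequencies, the exponential decay constant, and scaling the tone to a fixed fraction of the segment's RMS level.

```python
import numpy as np

SR = 16000  # sample rate in Hz (assumed)


def pulse_tone(freq: float = 880.0, dur: float = 0.15, decay: float = 30.0, sr: int = SR) -> np.ndarray:
    """'Ding'-like pulse: a sine whose amplitude decays exponentially over time."""
    t = np.arange(int(dur * sr)) / sr
    return np.sin(2 * np.pi * freq * t) * np.exp(-decay * t)


def continuous_tone(dur: float, freq: float = 440.0, sr: int = SR) -> np.ndarray:
    """Tone that keeps the same amplitude from start to end."""
    t = np.arange(int(dur * sr)) / sr
    return np.sin(2 * np.pi * freq * t)


def scale_below(tone: np.ndarray, segment: np.ndarray, ratio: float = 0.3) -> np.ndarray:
    """Scale the tone so its RMS is `ratio` times the speech segment's RMS."""
    seg_rms = np.sqrt(np.mean(segment ** 2))
    tone_rms = np.sqrt(np.mean(tone ** 2))
    return tone * (ratio * seg_rms / tone_rms) if tone_rms > 0 else tone
```

A ratio below 1 keeps the tone quieter than the target-word segment; timbre (the "softer" tone) would be shaped by choosing frequency and harmonics, which this sketch does not attempt.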

Here, superimposing the prompt tone on the target-word speech segment may mean superimposing it at the start of the segment, or at a preset time before the segment starts (for example, 0.1 s or 0.2 s before). The prompt tone may end when the target-word speech segment ends, or before it ends.

In some optional implementations of this embodiment, the execution body may superimpose a pulse-type prompt tone at the start of the target-word speech segment and output, by voice, the first voice information with the prompt tone superimposed, where the prompt tone may end before the target-word speech segment ends.

In some optional implementations of this embodiment, the execution body may superimpose a continuous prompt tone at the start of the target-word speech segment and output, by voice, the first voice information with the prompt tone superimposed, where the prompt tone may end when the target-word speech segment ends.
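The placement rules above (start at the segment or a preset lead time before it; end before or exactly at the segment end) reduce to index arithmetic on the audio buffer. A minimal sketch, assuming the speech is a mono NumPy array and segment boundaries are given in seconds; the `stretch` flag and the tiling approach for the continuous case are illustrative choices, not the patent's method.

```python
import numpy as np


def overlay_prompt(speech: np.ndarray, tone: np.ndarray, seg_start: float, seg_end: float,
                   lead: float = 0.0, sr: int = 16000, stretch: bool = False) -> np.ndarray:
    """Mix `tone` into `speech` for a target-word segment [seg_start, seg_end) in seconds.

    lead: seconds before the segment start at which the tone begins (e.g. 0.1).
    stretch=False (pulse style): tone is cut if needed so it ends no later than seg_end.
    stretch=True (continuous style): tone is tiled and cut so it ends exactly at seg_end.
    """
    out = speech.astype(float).copy()
    start = max(0, int(round((seg_start - lead) * sr)))
    end = min(len(out), int(round(seg_end * sr)))
    span = end - start
    if span <= 0:
        return out
    if stretch:
        reps = int(np.ceil(span / len(tone)))
        piece = np.tile(tone, reps)[:span]
    else:
        piece = tone[:span]
    out[start:start + len(piece)] += piece  # superimpose, leaving the speech intact
    return out
```

Tiling a sine can introduce small clicks at the seams unless the tone holds a whole number of periods; generating the continuous tone at exactly the segment length avoids that.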

Step 203: in response to collecting second voice information fed back by the user, match the second voice information against the target word.

In this embodiment, in response to collecting the second voice information fed back by the user, the execution body may match it against the target word, specifically as follows:

In the first step, after outputting, by voice, the first voice information with the prompt tone superimposed, the execution body may use its microphone to collect the voice signal within a preset duration.

In the second step, the execution body may process the collected voice signal to obtain voice information and use it as the second voice information fed back by the user. The collected signal may be processed in various ways. For example, the voice signal may first be high-pass filtered to eliminate (or attenuate) interfering sounds in it. Various echo cancellation methods may then be applied to the filtered signal to obtain an echo-cancelled voice signal. Finally, automatic gain control may be applied to the echo-cancelled signal to amplify it, yielding voice information that is used as the second voice information fed back by the user.
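The three preprocessing stages can be sketched as follows. This is an illustrative sketch, not the patent's processing chain: the first-order IIR high-pass filter, the NLMS adaptive filter (one common echo cancellation method, chosen here as an assumption), the block-level gain rule, and all parameter values (100 Hz cutoff, 32 taps, step size 0.5, target RMS 0.1, gain cap 20) are assumptions. `ref` is the signal the device played, from which the echo originates.

```python
import numpy as np


def high_pass(x: np.ndarray, sr: int = 16000, cutoff: float = 100.0) -> np.ndarray:
    """First-order IIR high-pass filter to attenuate low-frequency interference."""
    rc = 1.0 / (2 * np.pi * cutoff)
    a = rc / (rc + 1.0 / sr)
    y = np.zeros(len(x))
    for n in range(1, len(x)):
        y[n] = a * (y[n - 1] + x[n] - x[n - 1])
    return y


def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray, taps: int = 32,
                     mu: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """NLMS adaptive filter: estimate the echo of `ref` in `mic` and subtract it."""
    w = np.zeros(taps)    # adaptive estimate of the echo path
    buf = np.zeros(taps)  # most recent reference samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        e = mic[n] - w @ buf                      # mic minus estimated echo
        w += mu * e * buf / (eps + buf @ buf)     # normalized LMS update
        out[n] = e
    return out


def agc(x: np.ndarray, target_rms: float = 0.1, max_gain: float = 20.0) -> np.ndarray:
    """Block-level automatic gain control: scale toward a target RMS, with a gain cap."""
    rms = np.sqrt(np.mean(x ** 2))
    if rms == 0:
        return x
    return x * min(max_gain, target_rms / rms)
```

A real device would run these per frame in a streaming loop; whole-buffer functions keep the sketch short.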

In the third step, the execution body may recognize the second voice information with a pre-trained acoustic model to obtain a speech recognition result (for example, the character string corresponding to the second voice information). The acoustic model may be obtained by supervised training, using a machine learning method, on training samples consisting of a large amount of voice information. Various models may be used to train the acoustic model, such as a Hidden Markov Model (HMM), a Recurrent Neural Network (RNN) or a Deep Neural Network (DNN), and a combination of several models may also be used.

In the fourth step, the execution body may match the speech recognition result against the target word. For example, it may determine whether the recognition result is identical to the target word: if so, the second voice information is determined to match the target word; otherwise, it is determined not to match. As another example, the execution body may determine whether the recognition result contains the target word: if so, the second voice information is determined to match the target word; otherwise, it is determined not to match.
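The two matching rules in the fourth step ("identical to" and "contains") can be expressed as one function with a mode switch. A minimal sketch; the normalization step (Unicode NFKC, case folding, dropping whitespace and punctuation) is an assumption added so that trivial formatting differences in the recognizer output do not cause false mismatches.

```python
import unicodedata


def normalize(text: str) -> str:
    """Normalize width and case, and keep only letters and digits (an assumption)."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return "".join(ch for ch in text if ch.isalnum())


def matches(recognized: str, target: str, mode: str = "exact") -> bool:
    """exact: result equals the target word; contains: result contains the target word."""
    r, t = normalize(recognized), normalize(target)
    if mode == "exact":
        return r == t
    if mode == "contains":
        return t in r
    raise ValueError(f"unknown mode: {mode}")
```

The "contains" rule tolerates replies such as "tell me about Arsenal" at the cost of occasionally matching a word the user did not intend as a command.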

It should be noted that the execution body may also determine in other ways whether the second voice information matches the target word. For example, it may compare the second voice information with the target-word speech segment to determine whether the two are similar; if so, the second voice information is determined to match the target word.
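The source does not say how the acoustic comparison is done. One classic technique for comparing two utterances of possibly different lengths is dynamic time warping (DTW) over frame-level feature vectors; the sketch below uses it as a swapped-in illustration. Feature extraction (for example MFCCs) is assumed to have been done already, and the length normalization and the `threshold` value are hypothetical and would need tuning.

```python
import numpy as np


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalized DTW distance between feature sequences of shape (frames, dims)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m] / (n + m))


def is_similar(reply_feats: np.ndarray, target_feats: np.ndarray, threshold: float = 0.5) -> bool:
    """Treat the reply as a match if its DTW distance to the target segment is small."""
    return dtw_distance(reply_feats, target_feats) < threshold
```

DTW absorbs differences in speaking rate, so a user saying the target word more slowly than the synthesized segment can still match.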

Step 204: in response to determining that the second voice information matches the target word, output, by voice, third voice information associated with the target word.

In this embodiment, in response to determining that the second voice information matches the target word, the execution body may output, by voice, third voice information associated with the target word. Here, a preset instruction associated with the target word (for example, an information search instruction that searches for information using the target word as the search term) may first be executed, and the execution result (for example, the retrieved information) is determined as the third voice information associated with the target word.

It should be noted that, in response to determining that the speech recognition result does not match the target word, preset voice information prompting the user to speak again may be determined as the third voice information.
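Step 204 and its fallback can be sketched as a lookup from target word to a preset action, with a re-prompt when nothing matches. The `instructions` mapping, the stand-in search function and the re-prompt wording are hypothetical; the returned string is the text that would then be synthesized as the third voice information.

```python
from typing import Callable, Dict, Optional

# Hypothetical re-prompt text; the source only says it is preset.
REPROMPT = "Sorry, I did not catch that. Please say one of the highlighted words."


def respond(recognized: str, instructions: Dict[str, Callable[[str], str]]) -> str:
    """Return the text of the third voice information.

    `instructions` maps each target word to a preset action, e.g. an
    information search that uses the target word as the query (hypothetical).
    """
    word = recognized.strip()
    action: Optional[Callable[[str], str]] = instructions.get(word)
    if action is None:
        return REPROMPT      # no match: ask the user to speak again
    return action(word)      # match: the execution result becomes the reply
```

The lookup uses exact matching for brevity; either matching rule from step 203 could be substituted.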

Continuing to refer to FIG. 3, which is a schematic diagram of an application scenario of the voice interaction method according to this embodiment. In the application scenario of FIG. 3, a user holds a terminal device 301 and interacts with it by voice.

First, the terminal device 301 extracts the first voice information "Arsenal beat Chelsea 3:0", which contains the target-word speech segments "Arsenal" and "Chelsea". It superimposes a prompt tone on each of the segments "Arsenal" and "Chelsea" and then outputs, by voice, the first voice information with the prompt tones superimposed. After hearing this broadcast, the user knows that questions can be asked about "Arsenal" and "Chelsea", and says the second voice information "Arsenal". The terminal device 301 then searches for the target word "Arsenal", converts the retrieved introduction to Arsenal into speech, and broadcasts it as the third voice information.

In the method provided by the above embodiment of the present application, first voice information containing a target word voice segment is extracted; a prompt tone is superimposed at the target word voice segment, and the first voice information with the superimposed prompt tone is output by voice; when second voice information fed back by the user is collected, the third voice information to be output is determined based on whether the second voice information matches the target word; finally, the third voice information is output by voice. Thus, there is no need to use a graphical interface or a spoken explanation to tell the user which voice commands can be input, nor for the user to spend extra time reading or listening to instructions and tutorials: the superimposed prompt tone alone tells the user which voice commands can be issued, which improves the efficiency and flexibility of voice interaction.

With further reference to FIG. 4, a flow 400 of yet another embodiment of the voice interaction method is shown. The flow 400 of the voice interaction method includes the following steps:

Step 401, extract first voice information containing a target word voice segment.

In this embodiment, the execution subject of the voice interaction method (for example, the terminal devices 101, 102, and 103 shown in FIG. 1) may extract first voice information containing a target word voice segment. The target word voice segment is the part of the first voice information made up of speech converted from a target word. A target word is a word that can be used to generate an instruction (for example, an information search instruction, a business query instruction, or a jump instruction).
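The text does not say how the position of each target word voice segment inside the first voice information is known. One plausible approach, assuming the text-to-speech engine reports word-level start and end times, is sketched below; `WordTiming` and `target_segments` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class WordTiming:
    """One synthesized word and its position in the audio (assumed TTS output)."""
    text: str
    start: float  # seconds from the start of the first voice information
    end: float


def target_segments(timings: Sequence[WordTiming],
                    target_words: Sequence[str]) -> List[Tuple[str, float, float]]:
    """Return (word, start, end) for every synthesized word that is a target word."""
    targets = set(target_words)
    return [(w.text, w.start, w.end) for w in timings if w.text in targets]
```

Each returned interval marks where a prompt tone would later be superimposed.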

Step 402, superimpose a continuous prompt tone starting at the beginning of the target word voice segment, and output by voice the first voice information with the superimposed prompt tone.

In this embodiment, the execution subject may superimpose a continuous prompt tone starting at the beginning of the target word voice segment and output by voice the first voice information with the superimposed prompt tone, where the prompt tone ends when the target word voice segment ends. Here, the volume of the prompt tone may be lower than that of the target word voice segment.
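A minimal sketch of the continuous prompt tone, assuming mono audio held as floats in [-1, 1] and a sine tone as the prompt sound (the patent does not specify the waveform, frequency, or level). The tone spans exactly the segment and its amplitude is a fraction `rel_level` of the segment's peak, which is one simple way to keep it quieter than the target word speech.

```python
import math
from typing import List


def overlay_continuous_tone(samples: List[float], sample_rate: int,
                            start_s: float, end_s: float,
                            freq_hz: float = 880.0,
                            rel_level: float = 0.3) -> List[float]:
    """Mix a sine prompt tone into the target-word segment [start_s, end_s).

    The tone starts at the segment start and stops at the segment end.
    Its amplitude is rel_level times the segment's peak, so with
    rel_level < 1 it stays quieter than the speech.
    """
    out = list(samples)
    start = max(0, round(start_s * sample_rate))
    end = min(len(out), round(end_s * sample_rate))
    peak = max((abs(x) for x in out[start:end]), default=0.0)
    amp = rel_level * peak
    for i in range(start, end):
        t = (i - start) / sample_rate
        mixed = out[i] + amp * math.sin(2 * math.pi * freq_hz * t)
        out[i] = max(-1.0, min(1.0, mixed))  # clip to the valid range
    return out
```

Scaling against the segment's own peak, rather than using a fixed gain, keeps the tone below the speech even when different target words are synthesized at different loudness.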

Step 403, in response to collecting second voice information fed back by the user, match the second voice information against the target word.

In this embodiment, after outputting the first voice information with the superimposed prompt tone, the execution subject may use its installed microphone to collect a voice signal for a preset period of time. The collected voice signal may then be processed to obtain voice information, which serves as the second voice information fed back by the user. Next, the second voice information may be recognized with a pre-trained acoustic model to obtain a speech recognition result. Finally, the speech recognition result may be matched against the target word. For example, it may be determined whether the speech recognition result is consistent with the target word: if so, the second voice information is determined to match the target word; otherwise, it is determined not to match.
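The consistency check in the last step can be sketched as a normalized string comparison. Which normalization to apply is an assumption here (Unicode NFKC, case folding, and dropping whitespace and punctuation); the patent only says the recognition result and the target word are compared for consistency. `match_target` is a hypothetical name.

```python
import unicodedata
from typing import Iterable, Optional


def _normalize(text: str) -> str:
    # NFKC folds full-width forms, casefold() handles Latin case, and
    # non-alphanumeric characters (spaces, punctuation) are dropped.
    text = unicodedata.normalize("NFKC", text).casefold()
    return "".join(ch for ch in text if ch.isalnum())


def match_target(recognized: str, target_words: Iterable[str]) -> Optional[str]:
    """Return the target word the recognition result is consistent with, else None."""
    key = _normalize(recognized)
    if not key:
        return None
    for word in target_words:
        if _normalize(word) == key:
            return word
    return None
```

`str.isalnum()` is true for CJK ideographs, so Chinese target words such as 阿森纳 survive normalization while the full stop 。 is removed.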

Step 404, in response to determining that the second voice information matches the target word, determine the type of the first voice information.

In this embodiment, in response to determining that the second voice information matches the target word, the execution subject determines the type of the first voice information. The types of first voice information may include, but are not limited to, news broadcast, business query, and information confirmation, and different types may correspond to different third voice information associated with the target word.

Step 405, based on the type of the first voice information, determine third voice information associated with the target word, and output the third voice information by voice.

In this embodiment, the execution subject may determine the third voice information associated with the target word based on the type of the first voice information, and output the third voice information by voice. As noted above, different types of voice information may correspond to different third voice information associated with the target word.
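Steps 404 and 405 amount to a dispatch on the type of the first voice information. A minimal sketch, assuming one handler per type registered in a dictionary; `VoiceInfoType`, `third_voice_for_type`, and the handler signature are hypothetical, and the type list mirrors the three examples in the text, which is explicitly open-ended.

```python
from enum import Enum
from typing import Callable, Dict


class VoiceInfoType(Enum):
    """Types of first voice information named in the text (not exhaustive)."""
    NEWS_BROADCAST = "news_broadcast"
    BUSINESS_QUERY = "business_query"
    INFO_CONFIRMATION = "info_confirmation"


# A handler turns the matched target word into the text of the third voice information.
Handler = Callable[[str], str]


def third_voice_for_type(info_type: VoiceInfoType, target_word: str,
                         handlers: Dict[VoiceInfoType, Handler]) -> str:
    """Dispatch on the type of the first voice information."""
    handler = handlers.get(info_type)
    if handler is None:
        raise ValueError(f"no handler registered for {info_type.name}")
    return handler(target_word)
```

A registry like this lets new types of first voice information be added without touching the matching logic.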

In some optional implementations of this embodiment, in response to determining that the first voice information is of the news broadcast type, the execution subject may first generate an information search request containing the target word, then send the request to a server (for example, the server 105 shown in FIG. 1) and receive the search result returned by the server. The voice information corresponding to the search result may then be used as the third voice information and output by voice. As an example, the execution subject outputs the first voice information "Arsenal beat Chelsea 3:0", in which "Arsenal" and "Chelsea" are both target word voice segments with prompt tones superimposed. In response to determining that the second voice information replied by the user contains the target word voice segment "Arsenal", the execution subject may send the server an information search request containing the target word "Arsenal" to search for information related to Arsenal (for example, an introduction to Arsenal). The execution subject may then convert the search result into speech and output it; the converted speech is the third voice information.

In some optional implementations of this embodiment, in response to determining that the first voice information is of the business query type, the execution subject may first generate a business query request containing the target word, then send the request to the server and receive the query result returned by the server. The voice information corresponding to the query result may then be used as the third voice information and output by voice. As an example, the execution subject outputs the first voice information "You can inquire about your balance and other information", in which "balance" and "other" are both target word voice segments with prompt tones superimposed. In response to determining that the second voice information replied by the user contains the target word voice segment "balance", the execution subject may send the server a business query request containing the target word "balance" to query the balance. The execution subject may then convert the query result into speech and output it; the converted speech is the third voice information.
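The news broadcast and business query implementations share a request/response shape: build a request carrying the target word, send it to the server, and take the returned text as the third voice information. The sketch below assumes a JSON payload with hypothetical `type`, `keyword`, and `text` fields and injects the transport as a callable; the patent does not define any wire format.

```python
import json
from typing import Callable

# Hypothetical transport: sends a JSON request string to the server and
# returns the server's JSON response string (HTTP, RPC, etc. are out of scope).
Transport = Callable[[str], str]


def build_request(kind: str, target_word: str) -> str:
    """Build a search ('search') or business query ('query') request carrying the target word."""
    if kind not in ("search", "query"):
        raise ValueError(f"unknown request kind: {kind}")
    return json.dumps({"type": kind, "keyword": target_word}, ensure_ascii=False)


def fetch_result_text(kind: str, target_word: str, send: Transport) -> str:
    """Send the request and return the result text that will be converted to speech."""
    response = json.loads(send(build_request(kind, target_word)))
    return response["text"]
```

In a test, `send` can be a stub that echoes the keyword back, which keeps the dialogue logic testable without a real server.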

In some optional implementations of this embodiment, in response to determining that the first voice information is of the information confirmation type, the execution subject may generate a jump instruction for jumping to a preset next piece of voice information, and determine that next piece of voice information as the third voice information. As an example, the execution subject outputs the first voice information "Your destination is Conference Room 5. Do you confirm?", in which "destination" and "confirm" are both target word voice segments with prompt tones superimposed. In response to determining that the second voice information replied by the user contains the target word voice segment "confirm", the execution subject may generate a jump instruction for jumping to the preset next voice information "Starting navigation for you now", and determine that voice information as the third voice information. In response to determining that the second voice information contains the target word voice segment "destination", the execution subject may instead generate a jump instruction for jumping to the preset next voice information "Please re-enter the destination", and determine that voice information as the third voice information. Note that voice information of the information confirmation type need not contain the word "confirm". For example, "Hello, we serve Chinese food and burgers, and drinks include cola and orange juice" can also be of the information confirmation type. In this voice information, "Chinese food", "burgers", "cola", and "orange juice" are all target word voice segments with prompt tones superimposed. In response to determining that the second voice information replied by the user contains any of these target word voice segments (for example, "Chinese food"), the execution subject may generate a jump instruction for jumping to the preset next voice information corresponding to "Chinese food", namely "For Chinese food we have steamed buns, dumplings, and rice. Please choose.", and determine that voice information as the third voice information.
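The jump instruction for information confirmation can be modeled as a lookup in a preset dialogue table keyed by the current prompt and the matched target word. The table below is a hypothetical encoding of the destination and menu examples; entries for "burgers", "cola", and "orange juice" are left out because the text does not give their next prompts.

```python
from typing import Dict, Optional

# Hypothetical dialogue table built from the examples in the text: for each
# information-confirmation prompt, the preset next prompt per target word.
DESTINATION_PROMPT = "Your destination is Conference Room 5. Do you confirm?"
MENU_PROMPT = ("Hello, we serve Chinese food and burgers, "
               "and drinks include cola and orange juice.")

JUMP_TABLE: Dict[str, Dict[str, str]] = {
    DESTINATION_PROMPT: {
        "confirm": "Starting navigation for you now.",
        "destination": "Please re-enter the destination.",
    },
    MENU_PROMPT: {
        "Chinese food": ("For Chinese food we have steamed buns, dumplings, "
                         "and rice. Please choose."),
    },
}


def jump(current_prompt: str, target_word: str) -> Optional[str]:
    """Follow the jump instruction: return the preset next voice information, if any."""
    return JUMP_TABLE.get(current_prompt, {}).get(target_word)
```

Because each next prompt can itself carry target words with prompt tones, chaining `jump` calls gives the multi-round interaction described for flow 400.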

As can be seen from FIG. 4, compared with the embodiment corresponding to FIG. 2, the flow 400 of the voice interaction method in this embodiment highlights the step of determining, for different types of first voice information, the third voice information to be output. The solution described in this embodiment therefore superimposes a prompt tone at the target word voice segment to tell the user which voice commands can be issued, improving the efficiency and flexibility of voice interaction. In addition, this approach supports multiple rounds of voice interaction without requiring the user to read instructions or tutorials and without broadcasting the rules for issuing commands, which further improves the flexibility and efficiency of voice interaction.

With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of a voice interaction apparatus. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.

As shown in FIG. 5, the voice interaction apparatus 500 of this embodiment includes: an extraction unit 501, configured to extract first voice information containing a target word voice segment; a first output unit 502, configured to superimpose a prompt tone at the target word voice segment and output by voice the first voice information with the superimposed prompt tone, the prompt tone indicating that the content currently being broadcast is a target word; a matching unit 503, configured to match, in response to collecting second voice information fed back by the user, the second voice information against the target word; and a second output unit 504, configured to output by voice, in response to determining that the second voice information matches the target word, third voice information associated with the target word.
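The four units of apparatus 500 can be sketched as one class whose I/O is injected as callables. This is an illustrative wiring only: the callables `extract`, `speak`, `listen`, and `lookup` are hypothetical stand-ins for TTS with prompt tones, microphone capture plus recognition, and the type-specific lookup, and the match here is a plain membership test.

```python
from typing import Callable, List, Optional, Tuple


class VoiceInteractionApparatus:
    """Sketch of apparatus 500 with its four units as steps of run_once().

    `extract` returns the first voice text plus its target words, `speak`
    plays text with prompt tones on the given target words, `listen` returns
    the recognized reply, and `lookup` produces the third voice text.
    """

    def __init__(self,
                 extract: Callable[[], Tuple[str, List[str]]],
                 speak: Callable[[str, List[str]], None],
                 listen: Callable[[], str],
                 lookup: Callable[[str], str]) -> None:
        self._extract, self._speak = extract, speak
        self._listen, self._lookup = listen, lookup

    def run_once(self) -> Optional[str]:
        text, targets = self._extract()      # extraction unit 501
        self._speak(text, targets)           # first output unit 502
        reply = self._listen().strip()       # matching unit 503: collect the reply...
        if reply not in targets:             # ...and match it against the target words
            return None
        third = self._lookup(reply)          # second output unit 504
        self._speak(third, [])               # design choice here: no prompt tones on the answer
        return third
```

Injecting the callables keeps each unit replaceable, for example swapping the exact membership test for a fuzzier matcher.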

In some embodiments, the first output unit 502 may be further configured to superimpose a pulse-type prompt tone at the beginning of the target word voice segment and output by voice the first voice information with the superimposed prompt tone, where the prompt tone ends before the end of the target word voice segment.
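For contrast with the continuous tone, a pulse-type prompt tone can be sketched as a short beep starting at the segment start and cut off so that it always ends before the segment does. The pulse length, frequency, fixed amplitude, and float-sample audio format are all assumptions, not values from the patent.

```python
import math
from typing import List


def overlay_pulse_tone(samples: List[float], sample_rate: int,
                       start_s: float, end_s: float,
                       pulse_s: float = 0.05, freq_hz: float = 1000.0,
                       amp: float = 0.2) -> List[float]:
    """Mix a short sine beep in at the start of the target-word segment.

    The pulse lasts at most pulse_s seconds and is truncated so that it
    always ends before the segment [start_s, end_s) ends.
    """
    out = list(samples)
    start = max(0, round(start_s * sample_rate))
    seg_end = min(len(out), round(end_s * sample_rate))
    # Stop at least one sample before the segment end.
    pulse_end = min(start + round(pulse_s * sample_rate), seg_end - 1)
    for i in range(start, pulse_end):
        t = (i - start) / sample_rate
        mixed = out[i] + amp * math.sin(2 * math.pi * freq_hz * t)
        out[i] = max(-1.0, min(1.0, mixed))  # clip to the valid range
    return out
```

A pulse marks only the onset of the target word, so most of the word is heard without the added tone.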

In some embodiments, the first output unit 502 may be further configured to superimpose a continuous prompt tone starting at the beginning of the target word voice segment and output by voice the first voice information with the superimposed prompt tone, where the prompt tone ends when the target word voice segment ends.

In some embodiments, the matching unit 503 may be further configured to, in response to determining that the second voice information matches the target word, determine the type of the first voice information, determine third voice information associated with the target word based on that type, and output the third voice information by voice.

In some embodiments, the matching unit 503 may include a first generation module, a first sending module, and a first output module (not shown in the figure). The first generation module may be configured to generate an information search request containing the target word in response to determining that the first voice information is of the news broadcast type. The first sending module may be configured to send the information search request to a server and receive the search result returned by the server. The first output module may be configured to use the voice information corresponding to the search result as the third voice information and output the third voice information by voice.

In some embodiments, the matching unit 503 may include a second generation module, a second sending module, and a second output module (not shown in the figure). The second generation module may be configured to generate a business query request containing the target word in response to determining that the first voice information is of the business query type. The second sending module may be configured to send the business query request to the server and receive the query result returned by the server. The second output module may be configured to use the voice information corresponding to the query result as the third voice information and output the third voice information by voice.

In some embodiments, the matching unit 503 may include a third generation module (not shown in the figure), which may be configured to, in response to determining that the first voice information is of the information confirmation type, generate a jump instruction for jumping to a preset next piece of voice information and determine that next piece of voice information as the third voice information.

In some embodiments, the volume of the prompt tone is lower than that of the target word voice segment.

In the apparatus provided by the above embodiment of the present application, the extraction unit 501 extracts first voice information containing a target word voice segment; the first output unit 502 superimposes a prompt tone at the target word voice segment and outputs by voice the first voice information with the superimposed prompt tone; when second voice information fed back by the user is collected, the matching unit 503 matches it against the target word; and if they match, the second output unit 504 outputs by voice third voice information associated with the target word. Thus, there is no need to use a graphical interface or a spoken explanation to tell the user which voice commands can be input, nor for the user to spend extra time reading or listening to instructions and tutorials: the superimposed prompt tone alone tells the user which voice commands can be issued, which improves the efficiency and flexibility of voice interaction.

Referring next to FIG. 6, a schematic structural diagram of a computer system 600 suitable for implementing the terminal device of the embodiments of the present application is shown. The terminal device shown in FIG. 6 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.

As shown in FIG. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

The following components are connected to the I/O interface 605: an input section 606 including a touch screen, a touch pad, and the like; an output section 607 including a liquid crystal display (LCD), a speaker, and the like; a storage section 608; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a semiconductor memory, is mounted on the drive 610 as needed so that a computer program read from it can be installed into the storage section 608 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 609 and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above functions defined in the method of the present application are performed. It should be noted that the computer-readable medium described in this application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.

The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor; for example, a processor may be described as including an extraction unit, a first output unit, a matching unit, and a second output unit. The names of these units do not, in some cases, limit the units themselves; for example, the extraction unit may also be described as "a unit for extracting first voice information containing a target word voice segment".

In another aspect, the present application further provides a computer-readable medium, which may be included in the apparatus described in the above embodiments or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: extract first voice information containing a target word voice segment; superimpose a prompt tone at the target word voice segment and output by voice the first voice information with the superimposed prompt tone, the prompt tone indicating that the content currently being broadcast is a target word; in response to collecting second voice information fed back by the user, match the second voice information against the target word; and in response to determining that the second voice information matches the target word, output by voice third voice information associated with the target word.

The above description is only a preferred embodiment of the present application and an illustration of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) this application.