CN108831439B

Info

Publication number: CN108831439B
Application number: CN201810677565.1A
Authority: CN
Inventors: 李忠杰
Original assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date: 2018-06-27
Filing date: 2018-06-27
Publication date: 2023-04-18
Anticipated expiration: 2038-06-27
Also published as: CN108831439A

Abstract

The invention discloses a speech recognition method comprising the following steps: acquiring a speech signal; decoding the speech signal to obtain a plurality of optimal paths; evaluating the plurality of optimal paths according to a pre-trained user model; and, according to the evaluation result, extracting from the plurality of optimal paths the one that matches the user model as the target optimal path, and determining the speech recognition result of the speech signal according to the target optimal path. A speech recognition apparatus, a speech recognition device and a speech recognition system are also disclosed. Because the speech signal is decoded into a plurality of optimal paths, the user model is invoked to evaluate them, and the final recognition result is chosen according to that evaluation, the low recognition accuracy of traditional speech recognition technology is addressed and the accuracy of the recognition result is greatly improved. The speech recognition system has high recognition accuracy and can effectively improve the security of the user's personal information.

Description

Translated from Chinese

Speech recognition method, apparatus, device and system

Technical field

The present invention relates to the technical field of speech recognition, and in particular to a speech recognition method, apparatus, device and system.

Background

With the rapid development of intelligent interaction technology and the continuous expansion of market demand, speech recognition technology has made great progress in recent years and is now widely used in many fields. Speech recognition technology, as the name suggests, recognizes an input speech signal and converts it into text information that a computer can process. It enables intelligent voice interaction in many application scenarios, such as voice assistants and intelligent control based on speech recognition.

In a traditional speech recognition scheme, the system generally extracts features after receiving the speech signal, classifies the speech signal based on the extracted features, and then decodes with a weighted finite-state transducer (WFST) to output the speech recognition result. However, the accuracy of the recognition results of traditional speech recognition technology is still not high.

Summary of the invention

Based on this, the present invention provides a speech recognition method, a speech recognition apparatus, a speech recognition device and a speech recognition system.

To achieve the above object, in one aspect, an embodiment of the present invention provides a speech recognition method, comprising the steps of:

acquiring a speech signal;

decoding the speech signal to obtain a plurality of optimal paths;

evaluating the plurality of optimal paths according to a pre-trained user model;

according to the evaluation result, extracting from the plurality of optimal paths one optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the speech signal according to the target optimal path.

In one embodiment, the process of decoding the speech signal to obtain a plurality of optimal paths comprises the following steps:

performing feature extraction on the speech signal to obtain corresponding acoustic feature information;

according to the acoustic feature information, classifying the speech signal into categories through a pre-built acoustic model and determining the corresponding classification probabilities;

according to the speech signal of each category and the corresponding classification probabilities, performing a forward search based on a pre-built WFST module to obtain the plurality of optimal paths.

In one embodiment, the step of performing a forward search based on a pre-built WFST module, according to the speech signal of each category and the corresponding classification probabilities, to obtain a plurality of optimal paths comprises:

performing an independent forward search on each of a plurality of pre-built WFST modules, to obtain a plurality of optimal paths respectively corresponding to the plurality of WFST modules.

In one embodiment, the step of performing a forward search based on a pre-built WFST module, according to the speech signal of each category and the corresponding classification probabilities, to obtain a plurality of optimal paths further comprises:

performing a synchronous forward search based on the plurality of pre-built WFST modules and their corresponding weights, to obtain a plurality of optimal paths corresponding to the plurality of WFST modules. This keeps the accuracy of speech recognition high while greatly increasing recognition speed.

In one embodiment, after the step of extracting, according to the evaluation result, one optimal path that matches the user model from the plurality of optimal paths as the target optimal path and determining the speech recognition result of the speech signal according to the target optimal path, the method further comprises:

if it is detected that the speech recognition result contains newly added contact information, newly added self-created phrases and/or newly added characteristic language information, updating the user model according to the newly added contact information, the newly added self-created phrases and/or the newly added characteristic language information.

In one embodiment, the plurality of WFST modules include a custom WFST module, which is obtained through the following steps:

collecting set words, sentences and grammar information;

performing word segmentation on the set words and sentences using a dictionary;

performing statistical training on the grammar information to obtain a corresponding language model;

compiling the custom WFST module according to the word segmentation result and the language model. Incorporating the custom WFST module can further improve the accuracy of speech recognition.
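
The four steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the lexicon, the sample phrases and the function names `segment`, `train_bigram` and `compile_graph` are hypothetical; forward maximum matching and an add-one-smoothed bigram model are stand-ins chosen for brevity, since the text does not name a segmentation or training algorithm; the bigram statistics are taken from the segmented phrases themselves rather than from separate grammar information; and the "compiled" result is a plain dictionary of weighted arcs, not a real WFST.

```python
import math
from collections import Counter, defaultdict

def segment(sentence, lexicon, max_len=4):
    """Forward maximum matching: take the longest lexicon entry at each position."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

def train_bigram(sentences):
    """Count unigrams and bigrams over segmented sentences, with <s> and </s> markers."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def compile_graph(unigrams, bigrams):
    """Turn bigram counts into arcs: state -> {next word: -log P(next | state)}."""
    vocab_size = len({word for _, word in bigrams})
    arcs = defaultdict(dict)
    for (prev, nxt), count in bigrams.items():
        prob = (count + 1) / (unigrams[prev] + vocab_size)  # add-one smoothing
        arcs[prev][nxt] = -math.log(prob)
    return dict(arcs)

# Hypothetical lexicon: "turn on", "TV", "turn off", "volume".
lexicon = {"打开", "电视", "关闭", "音量"}
# Hypothetical set phrases: "turn on TV", "turn off TV", "turn on volume".
phrases = ["打开电视", "关闭电视", "打开音量"]

segmented = [segment(p, lexicon) for p in phrases]
unigrams, bigrams = train_bigram(segmented)
graph = compile_graph(unigrams, bigrams)
```

In this toy graph a lower arc cost means a more likely continuation, so a phrase start that appears more often in the collected phrases is cheaper to enter than a rarer one.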

In another aspect, an embodiment of the present invention further provides a speech recognition method, comprising the steps of:

sending a speech signal to a server;

acquiring a plurality of optimal paths fed back by the server after it decodes the speech signal;

In a further aspect, an embodiment of the present invention provides a speech recognition apparatus, comprising:

a speech acquisition module, configured to acquire a speech signal;

a decoding processing module, configured to decode the speech signal to obtain a plurality of optimal paths;

a first evaluation module, configured to evaluate the plurality of optimal paths according to a pre-trained user model;

a first result acquisition module, configured to extract, according to the evaluation result, one optimal path that matches the user model from the plurality of optimal paths as the target optimal path, and to determine the speech recognition result of the speech signal according to the target optimal path.

In a further aspect, an embodiment of the present invention further provides a speech recognition apparatus, comprising:

a speech sending module, configured to send a speech signal to a server;

a word sequence acquisition module, configured to acquire the optimal paths fed back by the server after it decodes the speech signal;

a second evaluation module, configured to evaluate the plurality of optimal paths according to a pre-trained user model;

a second result acquisition module, configured to extract, according to the evaluation result, one optimal path that matches the user model from the plurality of optimal paths as the target optimal path, and to determine the speech recognition result of the speech signal according to the target optimal path.

In a further aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of any one of the speech recognition methods described above.

In a further aspect, an embodiment of the present invention provides a speech recognition device comprising a memory and a processor; the memory stores a computer program which, when executed by the processor, implements any one of the speech recognition methods described above.

In a further aspect, an embodiment of the present invention further provides a speech recognition system comprising a server and a terminal;

the terminal is configured to send a speech signal to the server;

the server is configured to decode the speech signal to obtain a plurality of optimal paths;

the terminal is further configured to evaluate the plurality of optimal paths according to a pre-trained user model, to extract, according to the evaluation result, one optimal path that matches the user model from the plurality of optimal paths as the target optimal path, and to determine the speech recognition result of the speech signal according to the target optimal path.

In one embodiment, the terminal is further configured to: if it is detected that the speech recognition result contains newly added contact information, newly added self-created phrases and/or newly added characteristic language information, update the user model according to the newly added contact information, the newly added self-created phrases and/or the newly added characteristic language information.

One of the above technical solutions has the following advantages and beneficial effects:

A pre-trained user model is invoked to evaluate the plurality of optimal paths output by the WFST modules; according to the evaluation result, the one optimal path that matches the user model is extracted from the plurality of optimal paths as the target optimal path; and the speech recognition result of the speech signal is determined according to the target optimal path. The resulting speech recognition result can cover as many voice interaction scenarios and fields as possible and takes the user's speech characteristics into account, so that it is closer to the user's actual application scenario and the accuracy of the recognition result is considerably improved.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment;

Fig. 2 is a schematic flowchart of optimal path acquisition according to an embodiment;

Fig. 3 is a brief schematic flowchart of the construction of a custom decoder according to an embodiment;

Fig. 4 is a schematic diagram of a first illustrative speech recognition process according to an embodiment;

Fig. 5 is a schematic diagram of a second illustrative speech recognition process according to an embodiment;

Fig. 6 is a schematic flowchart of another speech recognition method according to an embodiment;

Fig. 7 is a schematic diagram of the module structure of a first speech recognition apparatus according to an embodiment;

Fig. 8 is a schematic structural diagram of a decoding processing module according to an embodiment;

Fig. 9 is a schematic diagram of the module structure of a second speech recognition apparatus according to an embodiment;

Fig. 10 is a schematic structural diagram of a speech recognition system according to an embodiment;

Fig. 11 is a first sequence diagram of the speech recognition process according to an embodiment;

Fig. 12 is a second sequence diagram of the speech recognition process according to an embodiment.

Detailed description

The present invention is described in further detail below with reference to preferred embodiments and the accompanying drawings. The embodiments described below serve only to explain the present invention and do not limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Speech recognition technology, also called automatic speech recognition (ASR), has the task of converting the lexical content of human speech into text that a computer can read. It is a comprehensive technology that draws on many disciplines, such as the mechanisms of speech production and hearing, signal processing, probability theory and information theory, pattern recognition, and artificial intelligence. At present, mainstream large-vocabulary speech recognition systems usually use recognition techniques based on statistical models. Speech recognition technology is generally deployed as a speech recognition system, which usually comprises a server and a terminal: the speech signal is input at the terminal and sent to the server, and the server performs speech recognition on it and returns the corresponding result. The terminal may be, for example, a smartphone. The user speaks into the phone, the phone sends the input speech to the server for recognition and receives the recognition result the server returns, and the user finally sees on the phone the text corresponding to the input speech, or the phone displays the text and then performs the corresponding control operation, such as opening the corresponding application. The terminal may also be any of various smart devices, such as a smart TV, a tablet, or other smart home appliances and smart office equipment.

However, in the course of implementing the technical solutions of the embodiments of the present invention, the inventor found that, as application requirements keep rising, the recognition methods of traditional speech recognition technology still suffer from low recognition accuracy. To this end, referring to Fig. 1, a speech recognition method is provided, comprising the following steps:

S10, acquiring a speech signal.

The speech signal may be a user-input speech signal that the server obtains from a terminal. The terminal may be, but is not limited to, a smartphone, a tablet computer, a smart TV, a smart robot, an interactive flat panel, a smart wearable device or a smart medical device, and may also be another type of smart home appliance, a car, and so on.

S12, decoding the speech signal to obtain a plurality of optimal paths.

The decoding may be performed on the speech signal by a pre-built search module. An optimal path may be a path, among the search paths output by the decoding, that satisfies a given requirement, for example the search path corresponding to the decoding result with the highest weight.

In some embodiments, the pre-built search module may be a WFST module, which is the search function module in a decoder. Here a decoder is a software program (such as a mobile phone application or a server program) or an apparatus (such as a stand-alone speech translator) that decodes an input audio signal and outputs the corresponding text result. The plurality of optimal paths can usually be obtained directly from a plurality of WFST modules, or from the word lattice information output by the decoding. The word lattice is one representation of the decoding result, and it contains a plurality of optimal paths.
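
As a rough illustration of how several optimal paths can be read out of a word lattice, the sketch below enumerates the paths of a tiny hand-made lattice and keeps the highest-scoring ones. Everything in it is an assumption made for illustration: the lattice contents, the node numbering, and the function names `all_paths` and `n_best`. Exhaustive enumeration is only practical for toy lattices; it is not how a production decoder extracts its N-best list.

```python
import heapq
import math

# Hypothetical toy lattice: node -> list of (next_node, word, probability).
LATTICE = {
    0: [(1, "call", 0.6), (1, "tall", 0.4)],
    1: [(2, "li", 0.5), (2, "lee", 0.3), (2, "me", 0.2)],
    2: [],
}

def all_paths(lattice, node, final):
    """Yield (words, log_probability) for every path from node to final."""
    if node == final:
        yield [], 0.0
        return
    for nxt, word, prob in lattice[node]:
        for words, score in all_paths(lattice, nxt, final):
            yield [word] + words, math.log(prob) + score

def n_best(lattice, start, final, n):
    """Return the n paths with the highest log-probability, best first."""
    return heapq.nlargest(n, all_paths(lattice, start, final), key=lambda p: p[1])

top = n_best(LATTICE, 0, 2, 3)
for words, score in top:
    print(" ".join(words), round(math.exp(score), 2))
```

Each returned entry pairs a word sequence with its score, which is the form a later evaluation step needs in order to compare candidates.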

In this embodiment, the pre-built WFST modules may be individual WFST modules, each built from the acoustic model, pronunciation dictionary and language model of a predetermined field, a predetermined scene or a set language mode, so that there is one module for each predetermined field, each predetermined scene and each set language mode; alternatively, the individual WFST modules may be combined into a single general WFST module. The predetermined fields may be academic disciplines, categories of goods or other specific fields; each field usually has its own distinctive vocabulary, such as common and specialized terms, and its pronunciation habits differ or have their own emphasis. The predetermined scenes may be, for example, the life and work scenes in which the user often finds himself or herself, and each scene likewise has its own speech characteristics. The set language modes may be language patterns that arise from the user's own language or pronunciation habits and represent the user's personal characteristics, such as the user's accent and habitual expressions.

Specifically, the server may invoke the individual pre-built WFST modules, or the general WFST module combined from them, to decode the speech signal and output a plurality of optimal paths. At this point the server has completed the search through the WFST modules and obtained a plurality of preliminary speech recognition results with different probabilities. The individual WFST modules or the general WFST module may be built or combined using methods common in the art, which are not limited in this specification.

In other embodiments, other types of search modules known in the prior art may be used to obtain the optimal paths; they are not described further here.

S14, evaluating the plurality of optimal paths according to a pre-trained user model.

S16, according to the evaluation result, extracting from the plurality of optimal paths one optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the speech signal according to the target optimal path.

The user model may be a statistical model that reflects the user's personal characteristics, and it can generally be obtained by collecting the required user data in advance and training on it. The user model may be trained on the required user data by any of the technical means common in the art; this specification does not limit the training method.

It will be understood that the server or the terminal may invoke the pre-trained user model to evaluate the plurality of optimal paths obtained above. After evaluation, each optimal path can be assigned a corresponding evaluation index, for example a score for how closely it matches the user's personal characteristics, or a combined score that takes into account both that closeness and the weight of the optimal path. The server or the terminal may, without limitation, extract from the plurality of optimal paths the one that best matches the user model as the target optimal path, and determine the speech recognition result of the corresponding speech signal according to the target optimal path.
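
The combined score described above can be sketched as a weighted sum of the decoder's path score and a user-model score. The following is a minimal sketch under stated assumptions, not the patented evaluation: the user model is reduced to an add-one-smoothed unigram model over text the user has produced, the interpolation `weight` of 0.5 is arbitrary, and the names `train_user_model` and `rescore` as well as the sample data are hypothetical.

```python
import math
from collections import Counter

def train_user_model(user_texts):
    """Build a smoothed unigram log-probability function from the user's own text."""
    counts = Counter(word for text in user_texts for word in text.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    return lambda word: math.log((counts[word] + 1) / (total + vocab))

def rescore(paths, user_logprob, weight=0.5):
    """Combine each path's decoder score with its average user-model score."""
    scored = []
    for words, decoder_logprob in paths:
        user_score = sum(user_logprob(w) for w in words) / len(words)
        combined = weight * decoder_logprob + (1 - weight) * user_score
        scored.append((combined, words))
    return max(scored)  # highest combined score wins

# Hypothetical user data: this user often mentions a contact named "li wei".
user_logprob = train_user_model(["call li wei", "message li wei now"])

# Hypothetical decoder output: (word sequence, log-probability of the path).
paths = [
    (["call", "lee", "wei"], math.log(0.40)),
    (["call", "li", "wei"], math.log(0.35)),
]
best_score, best_words = rescore(paths, user_logprob)
print(best_words)
```

In this example the decoder alone slightly prefers "lee", but the user model tips the combined score toward "li", the spelling that appears in the user's own data.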

The speech recognition method of the above embodiment evaluates the plurality of optimal paths by invoking the pre-trained user model and thereby obtains, from the plurality of optimal paths, the speech recognition result that best matches the user model, that is, the result that best fits the user's actual situation.

In addition, building the WFST modules and evaluating with the pre-trained user model together make it possible to adapt to complex and changeable spoken communication scenarios and to take into account the various fields and speaking habits covered by the user's speech. The result is closer to the user's actual application scenario, the accuracy of the recognition result is greatly improved, and the low recognition accuracy of traditional speech recognition technology is effectively avoided.

In one embodiment, the speech recognition result may be a word sequence, or a control instruction corresponding to the word sequence. The word sequence may be a character string that corresponds to the target optimal path, carries a corresponding probability and has a lattice structure; specifically, it may be the text information obtained by decoding and searching the speech information. After the terminal receives the speech recognition result, it may display the text or perform the corresponding control operation. For example, when the terminal is a mobile phone, the user can speak into the phone, and the back-end server quickly and accurately converts the user's speech into text, which is then displayed. When the terminal is a television, the user can speak a voice command to the television, and the back-end server quickly and accurately recognizes the command, obtains the corresponding control instruction and returns it to the television, which then performs the corresponding control operation, such as switching programs.

In one embodiment, the user model of the above embodiments may be trained on contact information, self-created phrases and/or characteristic language information associated with the user. The associated contact information may be retrieved from the user's terminal in advance, or obtained when the terminal automatically synchronizes contact information to the server. Self-created phrases may be phrases the user creates in various ways during everyday use of the terminal, for example phrases created by typing text, or phrases extracted from speech input to the terminal. Self-created phrases generally do not exist in existing dictionaries but are created by the user for the first time. The characteristic language information may include information that characterizes the user's language habits and speech usage habits, such as the user's pronunciation, average speaking rate, pet phrases, or other information that characterizes the user's speech. By collecting the user's speech characteristics periodically or online and using them to train the user model, a user model that matches the user's real situation as closely as possible is obtained, which secures the improvement in the accuracy of the speech recognition result.
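
One way to picture the data such a user model accumulates is a small container that is updated as new contacts and phrases are observed. The sketch below is purely illustrative: the class `UserModel`, its fields and its `update` method are hypothetical names, and it stores raw contacts and phrase counts rather than performing any statistical training.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class UserModel:
    """Hypothetical container for user-specific data used in evaluation."""
    contacts: set = field(default_factory=set)
    phrase_counts: Counter = field(default_factory=Counter)

    def update(self, new_contacts=(), new_phrases=()):
        """Fold observed contacts and phrases into the model; return what was new."""
        new_phrases = list(new_phrases)
        added_contacts = set(new_contacts) - self.contacts
        added_phrases = {p for p in new_phrases if p not in self.phrase_counts}
        self.contacts |= added_contacts
        self.phrase_counts.update(new_phrases)
        return added_contacts, added_phrases

# Hypothetical usage: one known contact, then a sync that brings one new contact
# and one phrase the user has coined.
model = UserModel(contacts={"Li Wei"})
added = model.update(new_contacts=["Li Wei", "Zhang Min"],
                     new_phrases=["standup sync"])
print(added)
```

Returning the newly added items makes it easy for a caller to decide whether retraining or re-evaluation is worth triggering after a periodic or online collection pass.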

It should be noted that, of the steps of the speech recognition method in this specification, some may be executed on the terminal and the rest on the server, or all of them may be executed on the terminal, as in offline speech recognition. Execution of the steps by the server as described here is therefore one exemplary mode of execution, not the only one.

Referring to Fig. 2, in one embodiment, step S12 may specifically comprise the following steps:

S122, performing feature extraction on the speech signal to obtain corresponding acoustic feature information.

It will be understood that, after acquiring the speech signal, the server may perform feature extraction on it to obtain the acoustic feature information of the speech signal. The server may carry out feature extraction using conventional technical means in the art; the embodiments of this specification do not limit the method the server uses to extract the acoustic feature information. For example, any one of linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) and Mel-scale filter bank features (FBANK) may be used.
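
All of the feature types listed above begin by cutting the signal into short overlapping frames. The sketch below shows only that shared front end (pre-emphasis, framing, Hamming window) and reduces each frame to a single log-energy value; it is deliberately not LPCC, MFCC, PLP or FBANK. The function name `frame_log_energy`, the 25 ms frame and 10 ms hop at an assumed 16 kHz sampling rate, and the test tone are all assumptions made for illustration.

```python
import math

def frame_log_energy(samples, frame_len=400, hop=160, pre_emphasis=0.97):
    """Return one log-energy value per windowed frame of a non-empty signal."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - a * x[n - 1].
    emphasized = [samples[0]] + [
        samples[i] - pre_emphasis * samples[i - 1] for i in range(1, len(samples))
    ]
    # Hamming window to reduce spectral leakage at the frame edges.
    window = [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    features = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = emphasized[start:start + frame_len]
        energy = sum((s * w) ** 2 for s, w in zip(frame, window))
        features.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return features

# One second of a 440 Hz tone sampled at 16 kHz (synthetic stand-in for speech).
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
features = frame_log_energy(signal)
print(len(features))
```

A real front end would replace the single energy value with a vector per frame, for example filter bank energies or cepstral coefficients, but the framing logic stays the same.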

S124, according to the acoustic feature information, classifying the speech signal into categories through a pre-built acoustic model and determining the corresponding classification probabilities.

The acoustic model may be built in advance by conventional methods in the art; this specification does not limit the method used. For example, it may be built on any one of a convolutional neural network, a recurrent neural network, a deep neural network, a Gaussian mixture model or a long short-term memory network.

It will be understood that the server may use the pre-built acoustic model to classify the speech signal according to the acoustic feature information obtained above and, using settings such as the number and kinds of categories, divide the speech signal into a certain number of categories and give the classification probability of each. In general, each classification search path in the acoustic model carries a corresponding weight (probability), and by merging the weights along each classification path, the classification probability of a category is obtained together with the category itself. For example, a given frame of the speech signal may be classified into category A with probability 0.8 and into category B with probability 0.4. The number of categories may be, for example, 3,000 to 10,000, and it can be determined from the fine-grained categories of the common scenarios in which the speech recognition technology is to be applied; for example, category A may be mobile phones, category B televisions and category C electronic thermometers.
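
To make the idea of per-category classification probabilities concrete, the sketch below converts raw per-frame scores into probabilities with a softmax and keeps the most likely categories. It is an assumption-laden illustration: the text does not say how the acoustic model produces its probabilities or that they are normalized to sum to 1, as softmax outputs are; the scores, the labels and the function names `softmax` and `classify_frame` are hypothetical; and three categories stand in for the thousands mentioned above.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_frame(scores, labels, top_k=2):
    """Return the top_k (label, probability) pairs for one frame, best first."""
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Hypothetical raw acoustic-model outputs for a single frame.
frame_scores = [2.0, 1.0, 0.1]
labels = ["A", "B", "C"]
top = classify_frame(frame_scores, labels)
print(top)
```

The ranked pairs are what a subsequent search step would consume, frame by frame, when weighing competing paths.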

S126. Based on the speech signals of each category and the corresponding classification probabilities, perform a forward search with the pre-built WFST modules to obtain multiple optimal paths.

Specifically, the server may perform the forward search based on multiple pre-built WFST modules or on one general WFST module to obtain multiple optimal paths corresponding to each predetermined field, each predetermined scene, and each set language mode. In this way, the above decoding steps quickly yield multiple optimal paths that cover as many voice interaction scenarios and fields as possible, which gives the method wider applicability.

In one embodiment, step S126 may specifically include the following step:

Performing independent forward searches based on the multiple pre-built WFST modules to obtain multiple optimal paths respectively corresponding to the multiple WFST modules.

It can be understood that, while performing the decoding search, the server can run an independent forward search in each WFST module of each field, each WFST module of each scene, and/or each WFST module of each set language mode, based on the speech signals of each category and the corresponding classification probabilities, and thereby obtain the optimal paths output by the multiple WFST modules. One WFST module may correspond to one optimal path, and each optimal path generally carries its own weight. In this way, the multiple optimal paths obtained by independent forward searches over the multiple WFST modules help ensure relatively accurate recognition results in each field, each scene, and/or each set language mode.
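The independent search can be sketched as follows. This is a toy illustration under stated assumptions, not the decoder of the invention: each WFST module is modeled as a small dictionary of arcs `(input_category, output_word, arc_probability, next_state)`, each frame is a dictionary of classification probabilities, and the names `best_path`, `independent_search`, `phone_module`, and `tv_module` are hypothetical.

```python
import math


def best_path(module, frames):
    """Viterbi-style forward search over one toy WFST module.

    Returns (log_score, words) for the best path that ends in a final
    state, or None if no path survives.
    """
    hyps = {module["start"]: (0.0, [])}  # state -> (log score, output words)
    for frame in frames:
        nxt = {}
        for state, (score, words) in hyps.items():
            for in_label, out_word, arc_prob, dest in module["arcs"].get(state, []):
                p = frame.get(in_label, 0.0) * arc_prob
                if p <= 0.0:
                    continue
                cand = (score + math.log(p), words + [out_word])
                # Keep only the best hypothesis per destination state.
                if dest not in nxt or cand[0] > nxt[dest][0]:
                    nxt[dest] = cand
        hyps = nxt
    finals = [hyp for state, hyp in hyps.items() if state in module["finals"]]
    return max(finals, key=lambda hyp: hyp[0]) if finals else None


def independent_search(modules, frames):
    """Run one separate forward search per module."""
    return {name: best_path(module, frames) for name, module in modules.items()}


# Two hypothetical domain modules and two frames of classification probabilities.
phone_module = {"start": 0, "finals": {2},
                "arcs": {0: [("a", "call", 1.0, 1)], 1: [("b", "mom", 1.0, 2)]}}
tv_module = {"start": 0, "finals": {2},
             "arcs": {0: [("a", "play", 0.5, 1), ("b", "stop", 0.5, 1)],
                      1: [("b", "movie", 1.0, 2)]}}
frames = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
results = independent_search({"phone": phone_module, "tv": tv_module}, frames)
```

Each module returns its own best path with its own score, so a later evaluation stage can compare candidates from different fields.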

In one embodiment, step S126 may alternatively be: performing a synchronous forward search based on the multiple pre-built WFST modules and their corresponding weights to obtain multiple optimal paths corresponding to the multiple WFST modules.

It can be understood that the server can input the speech signals of each category and the corresponding classification probabilities into the multiple WFST modules at the same time and, in combination with the Viterbi algorithm, bring the weight of each WFST module into the search. For example, the multiple WFST modules perform a synchronous forward search according to the Viterbi algorithm and the weight of each WFST module, and the paths found during the search are subject to unified threshold pruning: paths whose probability falls below a set threshold are pruned, and a limited number of better paths are retained for the continued forward search, so that multiple optimal paths are finally output. Each WFST module can obtain its weight when it is generated, for example the weight of the speech signal in the field to which the WFST module corresponds. In this way, during the synchronous forward search each WFST module can output optimal paths carrying weight values based on its own weight, which effectively reduces the time consumed by the search. In the subsequent user model evaluation, the server or the terminal can take these weights into account for a comprehensive evaluation, improving recognition accuracy as well as recognition speed.
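A minimal sketch of the synchronous search with unified pruning follows. It is an illustration under assumptions rather than the invention's implementation: the toy module format (arcs of `(input_category, output_word, arc_probability, next_state)`), the function name `synchronous_search`, the parameter names `threshold` and `beam`, and the example weights are all hypothetical.

```python
import math


def synchronous_search(modules, module_weights, frames, threshold=1e-4, beam=4):
    """Advance all modules frame by frame over one shared hypothesis pool.

    Hypotheses are keyed by (module name, state). Each starts from the
    weight of its module; paths below `threshold` are pruned and at most
    `beam` hypotheses are kept after every frame.
    """
    hyps = {(name, module["start"]): (math.log(module_weights[name]), [])
            for name, module in modules.items()}
    log_threshold = math.log(threshold)
    for frame in frames:
        nxt = {}
        for (name, state), (score, words) in hyps.items():
            for in_label, out_word, arc_prob, dest in modules[name]["arcs"].get(state, []):
                p = frame.get(in_label, 0.0) * arc_prob
                if p <= 0.0:
                    continue
                new_score = score + math.log(p)
                if new_score < log_threshold:
                    continue  # unified threshold pruning
                key = (name, dest)
                if key not in nxt or new_score > nxt[key][0]:
                    nxt[key] = (new_score, words + [out_word])
        # Retain a limited number of better paths across all modules.
        kept = sorted(nxt.items(), key=lambda item: item[1][0], reverse=True)[:beam]
        hyps = dict(kept)
    best = {}
    for (name, state), hyp in hyps.items():
        if state in modules[name]["finals"] and (name not in best or hyp[0] > best[name][0]):
            best[name] = hyp
    return best


# Hypothetical modules, module weights, and frames.
modules = {
    "phone": {"start": 0, "finals": {2},
              "arcs": {0: [("a", "call", 1.0, 1)], 1: [("b", "mom", 1.0, 2)]}},
    "tv": {"start": 0, "finals": {2},
           "arcs": {0: [("a", "play", 0.5, 1), ("b", "stop", 0.5, 1)],
                    1: [("b", "movie", 1.0, 2)]}},
}
weights = {"phone": 0.7, "tv": 0.3}
frames = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
loose = synchronous_search(modules, weights, frames)
strict = synchronous_search(modules, weights, frames, threshold=0.2)
```

With the loose threshold both modules return a weighted path; with the stricter one the lower-weighted module is pruned after the first frame. This illustrates how a shared threshold can shorten the search at the cost of dropping candidates.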

In one embodiment, the following step may also be included after step S16: if it is detected that the speech recognition result contains newly added contact information, newly added self-created phrases, and/or newly added characteristic language information, updating the user model according to the newly added contact information, self-created phrases, and/or characteristic language information.

The newly added contact information may be a contact newly added to the user's contact information, or updated partial information such as a new name, new number, or new address produced after the user has changed existing contact information. A newly added self-created phrase may be a phrase coined by the user during daily use of the terminal, for example one that appears when the user corrects a recognition result. Newly added characteristic language information may be language habits the user has most recently formed during daily use of the terminal; for example, a user who lives for a long time in a different language environment may form a new accent or new habits of expression. Habits of expression, such as pet phrases and high-frequency words, can also be obtained from the corrections the user makes to recognition results.

It can be understood that, when the server or the terminal detects that the speech recognition result contains newly added contact information, newly added self-created phrases, and/or newly added characteristic language information, it automatically obtains that information in order to retrain and update the user model in time. This ensures that the user model stays consistent with the user's characteristics during daily use and accurately reflects the user's actual situation. In this way, the training update of the user model safeguards the accuracy of the evaluation results obtained with the user model.
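The update step can be sketched as follows. The structure of the user model is not specified in the text, so everything here is an assumption for illustration: the `UserModel` fields, the use of word frequencies as a stand-in for characteristic language information, and the function name `update_user_model` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UserModel:
    """Hypothetical user model: contacts, coined phrases, word frequencies."""
    contacts: set = field(default_factory=set)
    phrases: set = field(default_factory=set)
    word_counts: Counter = field(default_factory=Counter)


def update_user_model(model, words, contacts=(), phrases=()):
    """Merge new items found in a recognition result into the model.

    Returns True if a new contact or a new phrase was added.
    """
    new_contacts = set(contacts) - model.contacts
    new_phrases = set(phrases) - model.phrases
    model.contacts |= new_contacts
    model.phrases |= new_phrases
    # Word frequencies approximate the user's habits of expression.
    model.word_counts.update(words)
    return bool(new_contacts or new_phrases)


model = UserModel(contacts={"Li"})
first = update_user_model(model, ["call", "Wang"], contacts=["Wang"])
second = update_user_model(model, ["call", "Wang"], contacts=["Wang"])
```

The second call adds nothing new to the contact set, so only the frequency counts change; a real system would presumably trigger retraining only when the first kind of change is detected.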

Referring to FIG. 3, in one embodiment the multiple WFST modules in the above embodiments may include customized WFST modules. That is, the WFST modules may comprise at least two types. One type consists of conventional WFST modules (conventional relative to the customized decoder), each built from the acoustic model, pronunciation dictionary, and language model of a predetermined field, predetermined scene, or set language mode, and each corresponding to that field, scene, or language mode. The other type consists of customized WFST modules built from special grammar and rare words and sentences that are seldom used in daily life, as well as recently emerged new expressions or trending Internet words. The new expressions or trending words may be, for example, new or hot words that become popular on the Internet each year, such as "我要打" (I want to call/play), "我要看" (I want to watch), "我要听" (I want to listen), "我要买" (I want to buy), and "OMG" (Oh My God). The words and sentences needed to build a customized WFST module can be obtained by crawling relevant corpora from the Internet; the specific crawling method is not limited here, and methods commonly used in the field may be adopted.

The main steps of building a customized WFST module may be the following steps S20 to S26:

S20. Collect the set words and sentences and the grammatical information;

S22. Perform word segmentation on the set words and sentences using a dictionary;

S24. Perform statistical training on the grammatical information to obtain a corresponding language model;

S26. Compile the customized WFST module from the word segmentation result and the language model.

The dictionary mentioned above may be the traditional pronunciation dictionary used when generating conventional WFST modules. The statistical training of the language model may also use conventional methods in the field, for example an N-gram language model.

It can be understood that, when the server generates the WFST modules of the various fields with a traditional WFST generation method, it can also collect the set words and sentences and the grammatical information, perform word segmentation on the former and statistical language model training on the latter, and then, from the segmentation result and the trained language model, compile the set words, sentences, and grammatical information into a customized WFST module using a conventional decoder construction method. The customized WFST modules may be, for example, customized modules for subdivided fields such as spoken language, written language, chemistry, or mathematics. In this way, by performing forward searches in the conventional WFST modules and the customized WFST modules respectively, a relatively accurate speech recognition result can still be output when the acquired speech signal contains rare words and sentences, new expressions popular on the Internet, trending expressions, and the grammar they involve.
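Steps S22 and S24 can be sketched as follows; the compilation in step S26 is omitted. This is a sketch under assumptions, not the prescribed procedure: forward maximum matching is only one possible dictionary-based segmentation method, the bigram model is an unsmoothed maximum-likelihood estimate, and the names `segment`, `train_bigram`, and the tiny example dictionary are hypothetical.

```python
from collections import Counter


def segment(sentence, dictionary, max_len=4):
    """Segment a sentence by forward maximum matching against a dictionary."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def train_bigram(corpus):
    """Estimate P(next | previous) from segmented sentences (no smoothing)."""
    unigrams, bigrams = Counter(), Counter()
    for words in corpus:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}


# Hypothetical dictionary and two collected sentences.
dictionary = {"我要", "听", "音乐", "看", "电影"}
corpus = [segment(s, dictionary) for s in ["我要听音乐", "我要看电影"]]
language_model = train_bigram(corpus)
```

The segmented words and the bigram probabilities are the two inputs that step S26 would compile into a customized WFST module; a production system would normally add smoothing for unseen word pairs.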

In one embodiment, the terminal mentioned above is the terminal from which the speech signal originates, such as a mobile phone, a tablet device, a PDA, or an intelligent interactive device; it may also be another device that the speech signal is intended to control, such as a television, a smart tablet, or another intelligent interactive device. After the server has processed the speech signal into the corresponding speech recognition result (for example, the text corresponding to a word sequence), the server can determine the terminal to which the speech signal is directed according to the instruction information contained in the speech recognition result. That is, after acquiring the user's speech signal, performing speech recognition, and obtaining the corresponding speech recognition result, the server can send the result to the terminal corresponding to the speech signal. The whole recognition and response process for the speech signal is thus completed, the corresponding terminal can promptly carry out the corresponding display, interaction, or operation control, and the server is highly integrated.

FIGS. 4 and 5 are brief schematic diagrams of the speech recognition process, provided to make the steps of some of the above embodiments easier to understand. It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention certain steps may be performed in other orders.

Referring to FIG. 6, another speech recognition method is also provided, including the following steps S11 to S17:

S11. Send a speech signal to the server;

S13. Obtain the multiple optimal paths fed back by the server after it has decoded the speech signal;

S15. Evaluate the multiple optimal paths according to a pre-trained user model;

S17. According to the evaluation result, extract from the multiple optimal paths the one optimal path that matches the user model as the target optimal path, and determine the speech recognition result of the speech signal according to the target optimal path.

It can be understood that, for the decoding processing and evaluation methods involved in the above steps, reference may be made to the corresponding decoding processing and evaluation methods in the foregoing embodiments, which are not repeated here.

Specifically, after receiving the speech signal input by the user, the terminal may send it to the server responsible for decoding speech signals. After receiving the speech signal, the server decodes it, obtains multiple optimal paths, and feeds them back to the terminal. After receiving the multiple optimal paths returned by the server, the terminal evaluates them according to the pre-trained user model, extracts from them, according to the evaluation result, the one optimal path that matches the user model as the target optimal path, and determines the speech recognition result of the speech signal according to the target optimal path. Because the user model is applied on the terminal to evaluate the multiple optimal paths and obtain the final speech recognition result, the user's personal information involved in the user model is kept from leaking, and the security of that information is improved.
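The terminal-side evaluation in steps S15 and S17 can be sketched as follows. The scoring rule is not given in the text, so this is only an assumed illustration: each returned path is a pair of words and a path probability, the user model is reduced to a set of user-specific words, and the function name `select_target_path` and the `bonus` parameter are hypothetical.

```python
import math


def select_target_path(paths, user_vocabulary, bonus=2.0):
    """Pick the path that best matches the user model.

    paths: list of (words, probability) pairs returned by the server.
    user_vocabulary: words known from the local user model.
    The score is the log path probability plus a fixed bonus for every
    word that also appears in the user's vocabulary (an assumed rule).
    """
    def score(path):
        words, probability = path
        matches = sum(1 for word in words if word in user_vocabulary)
        return math.log(probability) + bonus * matches

    return max(paths, key=score)


# Hypothetical candidates: the decoder slightly prefers the first path,
# but the second contains a name stored in the user's contacts.
candidates = [(["call", "Lee", "John"], 0.5), (["call", "Li", "Zhong"], 0.4)]
target_words, target_probability = select_target_path(candidates, {"Li", "Zhong"})
```

Only the candidate paths travel between server and terminal in this arrangement; the vocabulary derived from the user model stays on the terminal, which is the privacy property described above.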

Referring to FIG. 7, a speech recognition apparatus 100 is provided, including a speech acquisition module 12, a decoding processing module 14, a first evaluation module 16, and a first result acquisition module 18. The speech acquisition module 12 is used to acquire a speech signal. The first evaluation module 16 is used to evaluate the multiple optimal paths according to a pre-trained user model. The first result acquisition module 18 is used to extract from the multiple optimal paths, according to the evaluation result, the one optimal path that matches the user model as the target optimal path, and to determine the speech recognition result of the speech signal according to the target optimal path.

In this way, the technical solution of the above embodiment uses its modules to evaluate the multiple optimal paths obtained by decoding in combination with the pre-trained user model, and obtains the target optimal path according to the evaluation result in order to produce the final speech recognition result. The solution adapts effectively to complex and changeable voice communication scenarios, takes into account both the fields covered by the content of the user's speech and the user's speaking habits, and stays closer to the user's actual application scenario. The accuracy of the recognition result is greatly improved, and the low recognition accuracy of traditional speech recognition technology is effectively avoided.

Referring to FIG. 8, in one embodiment the decoding processing module 14 may include a feature extraction module 142, a classification calculation module 144, and a decoding search module 146. The feature extraction module 142 is used to perform feature extraction on the speech signal to obtain the corresponding acoustic feature information. The classification calculation module 144 is used to classify the speech signal into categories through the pre-built acoustic model according to the acoustic feature information and to determine the corresponding classification probabilities. The decoding search module 146 is used to perform a forward search with the pre-built WFST modules, based on the speech signals of each category and the corresponding classification probabilities, to obtain multiple optimal paths. For the feature extraction, classification, and forward search methods in this embodiment, reference may be made to the corresponding methods in the foregoing embodiments of the speech recognition method, which are not repeated here.

In one embodiment, the decoding search module 146 may include a first search module, which is used to perform independent forward searches based on the multiple pre-built WFST modules to obtain multiple optimal paths respectively corresponding to the multiple WFST modules.

In one embodiment, the decoding search module 146 may include a second search module, which is used to perform a synchronous forward search based on the multiple pre-built WFST modules and their corresponding weights to obtain multiple optimal paths respectively corresponding to the multiple WFST modules.

In one embodiment, the speech recognition apparatus 100 may further include a user model update module. The user model update module is used to update the user model according to newly added contact information, newly added self-created phrases, and/or newly added characteristic language information if it is detected that the speech recognition result contains such information.

In one embodiment, the above speech recognition apparatus 100 may further include a preset information collection module, a word segmentation processing module, and a customized decoder construction module. The preset information collection module is used to collect the set words and sentences and the grammatical information. The word segmentation processing module is used to perform word segmentation on the set words and sentences using a dictionary, and to perform statistical training on the grammatical information to obtain the corresponding language model. The customized decoder construction module is used to compile the customized WFST module from the word segmentation result and the obtained language model. In this way, by performing forward searches in the conventional WFST modules and the customized WFST modules respectively, a relatively accurate speech recognition result can still be output when the acquired speech signal contains rare words and sentences, new expressions popular on the Internet, trending expressions, and the grammar they involve.

Referring to FIG. 9, in one embodiment a speech recognition apparatus 200 is also provided, which includes a speech sending module 22, a path acquisition module 24, a second evaluation module 26, and a second result acquisition module 28. The speech sending module 22 is used to send a speech signal to the server. The path acquisition module 24 is used to obtain the multiple optimal paths fed back by the server after it has decoded the speech signal. The second evaluation module 26 is used to evaluate the multiple optimal paths according to a pre-trained user model. The second result acquisition module 28 is used to extract from the multiple optimal paths, according to the evaluation result, the one optimal path that matches the user model as the target optimal path, and to determine the speech recognition result of the speech signal according to the target optimal path.

In this way, the technical solution of the above embodiment uses its modules to evaluate the multiple optimal paths returned by the server in combination with the pre-trained user model, and obtains the target optimal path according to the evaluation result in order to produce the final speech recognition result. The solution adapts effectively to complex and changeable voice communication scenarios, takes into account both the fields covered by the content of the user's speech and the user's speaking habits, and stays closer to the user's actual application scenario. The accuracy of the recognition result is greatly improved, the low recognition accuracy of traditional speech recognition technology is effectively avoided, and the security of the user's personal information is also improved.

The first evaluation module 16 in the speech recognition apparatus 100 and the second evaluation module 26 in the speech recognition apparatus 200 can be understood as the same module with the same function; they are named differently only because they belong to different apparatuses, not because they differ in essence. The relationship between the first result acquisition module 18 in the speech recognition apparatus 100 and the second result acquisition module 28 in the speech recognition apparatus 200 can be understood in the same way.

Each module in the speech recognition apparatus 100 and the speech recognition apparatus 200 described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded, in hardware form, in a processor of a computer device or be independent of it, or may be stored, in software form, in a memory of the computer device so that the processor can call them and perform the operations corresponding to each module.

In one embodiment, a speech recognition device is provided. The speech recognition device may be a computer device, for example an ordinary computer or a server. The speech recognition device includes a memory and a processor. The memory stores a computer program that can run on the processor. The processor of the speech recognition device provides computing and control capabilities. The memory of the speech recognition device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The speech recognition device may include a network interface for communicating with an external interactive terminal over a network connection. When the processor executes the computer program in the memory, it may perform the following steps: acquiring a speech signal; decoding the speech signal to obtain multiple optimal paths; evaluating the multiple optimal paths according to a pre-trained user model; and, according to the evaluation result, extracting from the multiple optimal paths the one optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the speech signal according to the target optimal path.

In one embodiment, another speech recognition device is also provided. The speech recognition device may be a smart terminal device, for example a mobile terminal or any of various intelligent interactive devices such as a smart TV or a smart tablet. The speech recognition device includes a memory and a processor. The memory stores a computer program that can run on the processor. The processor of the speech recognition device provides computing and control capabilities. The memory of the speech recognition device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The speech recognition device may include a network interface for communicating with other external interactive terminals over a network connection. When the processor executes the computer program in the memory, it may perform the following steps: sending a speech signal to the server; obtaining the multiple optimal paths fed back by the server after it has decoded the speech signal; evaluating the multiple optimal paths according to a pre-trained user model; and, according to the evaluation result, extracting from the multiple optimal paths the one optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the speech signal according to the target optimal path.

In one embodiment, when the processor in the speech recognition device of any of the above embodiments executes the computer program in its memory, it may also implement the corresponding embodiments of the speech recognition method of the present invention described above.

Generally, a program stored in a storage medium is executed by reading the program directly from the storage medium or by installing or copying the program to a storage device (such as a hard disk and/or memory) of a data processing device. Such a storage medium therefore also constitutes the present invention. The storage medium may use any type of recording method, for example a paper storage medium (such as paper tape), a magnetic storage medium (such as a floppy disk, hard disk, or flash memory), an optical storage medium (such as a CD-ROM), or a magneto-optical storage medium (such as an MO). The present invention therefore also discloses a computer-readable storage medium storing a computer program which, when run, performs the following steps: acquiring a speech signal; decoding the speech signal to obtain multiple optimal paths; evaluating the multiple optimal paths according to a pre-trained user model; and, according to the evaluation result, extracting from the multiple optimal paths the one optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the speech signal according to the target optimal path.

In one embodiment, the present invention also discloses another computer-readable storage medium storing a computer program which, when run, performs the following steps: sending a speech signal to the server; obtaining the multiple optimal paths fed back by the server after it has decoded the speech signal; evaluating the multiple optimal paths according to a pre-trained user model; and, according to the evaluation result, extracting from the multiple optimal paths the one optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the speech signal according to the target optimal path.

In one embodiment, the computer program on the computer-readable storage medium of any of the foregoing embodiments, when run, is also used to execute the corresponding embodiments of the speech recognition method of the present invention described above.

In accordance with the speech recognition methods of the embodiments of the present invention described above, and referring to FIG. 10, an embodiment of the present invention also provides a speech recognition system 300. The speech recognition system 300 is described in detail below with reference to the timing sequences shown in FIG. 11 and FIG. 12 and the optional embodiments.

The speech recognition system 300 may include a server 32 and a terminal 34. The terminal 34 may be used to send a speech signal to the server 32. The server 32 may be used to decode the speech signal to obtain multiple optimal paths. The terminal 34 may also be used to evaluate the multiple optimal paths according to a pre-trained user model and, according to the evaluation result, to extract from the multiple optimal paths the one optimal path that matches the user model as the target optimal path and determine the speech recognition result of the speech signal according to the target optimal path.

The server 32 may be a back-end processing device for voice signals, such as a physical server or a cloud computing server, or a voice signal recognition and processing platform formed by combining physical servers and cloud computing servers. The terminal 34 may be any of various smart devices, such as a smartphone, a smart TV, a tablet computer, or other smart home appliances, smart office equipment and smart vehicles.

Specifically, after obtaining a voice signal that the user inputs directly by speaking, or inputs indirectly through another device, the terminal 34 may send the voice signal to the server 32. The server 32 can then decode the received voice signal and, once the multiple optimal paths have been obtained, return them to the terminal 34. The terminal 34 can then invoke the pre-trained user model to evaluate the returned optimal paths, extract from them, according to the evaluation result, the one optimal path that matches the user model as the target optimal path, and determine the speech recognition result of the voice signal input by the user according to the target optimal path. It can be understood that the decoding performed by the server 32 may be understood with reference to the decoding described in the embodiments of the speech recognition method above, and that the evaluation of the multiple optimal paths by the terminal 34 according to the user model may likewise refer to the user model evaluation described in those embodiments; these are not repeated in this embodiment.
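
The terminal-side selection step described above can be illustrated with a minimal sketch. It is not the scoring method of the invention: the names `Path`, `user_model_score` and `pick_target_path`, the representation of the user model as a table of per-word bonuses, and the additive combination of the two scores are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One candidate optimal path returned by the server (hypothetical shape)."""
    words: tuple          # word sequence of the candidate
    decoder_score: float  # log-score from server-side decoding, higher is better


def user_model_score(words, user_vocab):
    """Toy user model: sum the bonuses of words the user is known to use."""
    return sum(user_vocab.get(w, 0.0) for w in words)


def pick_target_path(paths, user_vocab, weight=1.0):
    """Return the candidate whose combined score is highest.

    The additive combination and the weight are assumptions for this sketch.
    """
    if not paths:
        raise ValueError("no candidate paths were returned")
    return max(
        paths,
        key=lambda p: p.decoder_score + weight * user_model_score(p.words, user_vocab),
    )


def recognize(paths, user_vocab):
    """Turn the target optimal path into the final recognition text."""
    return " ".join(pick_target_path(paths, user_vocab).words)
```

Under these assumptions, a candidate that the decoder ranks slightly lower can still be selected when it contains words that appear in the user model, for example a contact name. The user model is only read on the terminal and is never sent to the server.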

In this way, the server 32 decodes the voice signal using individual WFST modules or one general WFST module and returns multiple optimal paths to the terminal 34, and the terminal 34 then evaluates these optimal paths according to the pre-trained user model, thereby finally determining the speech recognition result of the input voice signal. In summary, the speech recognition system 300 can cover as many speech application scenarios and domains as possible while also taking user habits into account, so that recognition stays closer to the user's actual usage and the accuracy of the recognition result is considerably improved. In addition, because the user model is not shared with the public environment in which the server 32 resides, the personal information involved in the user model is not exposed there; the user's personal information is therefore better protected and the user experience can be considerably improved.

In one embodiment, there may be one server 32 or multiple servers 32. For example, among multiple interconnected servers 32, each server 32 may store the WFST modules for one or more domains, scenarios or preset language modes. The multiple servers 32 form a distributed server decoding network and work in coordination, so that the voice signal can be decoded and searched quickly across different domains, scenarios or preset language modes. The speech decoding of the voice signal can thus be completed more quickly and accurately, and the network can also handle the decoding of voice signals to be recognized that are sent by a large number of terminals 34 in the same period, giving high processing efficiency.

The multiple servers 32 may be configured with one master control server 32, which handles the connection to each terminal 34 and the address matching when results are returned, increasing the speed at which the multiple optimal paths, or the word lattice information containing them, are returned to each terminal 34. In this way, the distributed network of servers 32 can cooperate to complete the speech decoding of the voice signal input by the user through the terminal 34, improving the speech recognition processing efficiency and capacity of the whole speech recognition system 300.
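
The coordination between a master control server and several domain servers can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the dispatch protocol of the invention: each domain server is stood in for by a plain function that returns `(text, score)` candidates, and `master_decode`, `terminal_id` and `n_best` are hypothetical names.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def master_decode(terminal_id, signal, domain_decoders, n_best=3):
    """Fan one voice signal out to every domain decoder and merge the results.

    domain_decoders maps a domain name to a callable that takes the signal and
    returns a list of (text, score) candidates; higher scores are better.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(domain_decoders))) as pool:
        futures = [pool.submit(decode, signal) for decode in domain_decoders.values()]
        candidates = [cand for future in futures for cand in future.result()]
    # Keep the n_best highest-scoring candidates across all domains.
    best = heapq.nlargest(n_best, candidates, key=lambda cand: cand[1])
    # Tag the reply with the terminal id so it can be routed back to the sender.
    return {"terminal_id": terminal_id, "paths": best}
```

In this sketch the terminal identifier plays the role of the address matching performed by the master control server: the reply carries the identifier of the terminal that sent the signal, so the merged candidates can be returned to that terminal.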

In one embodiment, the terminal 34 may further be used to: if it detects that the speech recognition result contains newly added contact information, newly added user-coined phrases and/or newly added characteristic language information, update the user model according to that newly added contact information, those user-coined phrases and/or that characteristic language information. In this way, the terminal 34 can periodically detect and collect newly added characteristic information of the user, such as the contact information, user-coined phrases and/or characteristic language information mentioned above, and use it to retrain and update the user model, so that the user model matches the user's real situation as closely as possible and the accuracy of the speech recognition result is effectively improved over time.
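
A minimal sketch of such an update is given below. It assumes, purely for illustration, that the user model is a table mapping words to bonuses and that the terminal can supply its current contacts and user-coined phrases as sets; `update_user_model` and `bonus` are hypothetical names, and a real update would retrain the model rather than insert fixed values.

```python
def update_user_model(user_vocab, result_words, contacts, custom_phrases, bonus=1.0):
    """Add personal items found in a recognition result to the user model.

    Returns the list of newly added items, in the order they first appeared.
    """
    personal = set(contacts) | set(custom_phrases)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    added = [
        word
        for word in dict.fromkeys(result_words)
        if word in personal and word not in user_vocab
    ]
    for word in added:
        user_vocab[word] = bonus  # fixed bonus is an assumption of this sketch
    return added
```

The function changes `user_vocab` in place, which mirrors the idea that the model lives on the terminal and is refreshed there as new contacts or phrases appear.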

In one embodiment, in the server 32 of the above embodiments of the speech recognition system 300, the WFST modules used in the decoding process, or the WFST modules that make up the general decoder, include a customized WFST module. The customized WFST module may be obtained as follows: the server 32 collects specified words, sentences and grammar information, segments the specified words and sentences using a dictionary, performs statistical training on the grammar information to obtain a corresponding language model, and then compiles the module from the segmentation result and the resulting language model. By combining conventional WFST modules with the customized WFST module, multiple optimal paths of high accuracy can still be output when the acquired voice signal contains rare words and sentences, new expressions popular on the Internet, trending words and sentences, and the grammar they use, so that the terminal 34 finally obtains a speech recognition result of higher accuracy.
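
The two preparation steps named above, dictionary-based segmentation and statistical training, can be sketched as follows. The text does not specify the algorithms, so forward maximum matching and raw bigram counting are assumptions chosen for brevity, and `segment`, `bigram_counts` and `max_len` are hypothetical names. The sketch stops before WFST compilation, which requires a dedicated toolkit.

```python
from collections import Counter


def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon entry.

    A single character is emitted when no lexicon entry matches.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def bigram_counts(sentences):
    """Count adjacent word pairs, with sentence start and end markers."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts
```

The bigram counts are the raw statistics from which an n-gram language model could be estimated; smoothing and the compilation of the lexicon and language model into a WFST are left out of this sketch.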

In one embodiment, a client may be installed on the terminal 34 of the above embodiments. The client may be used to carry out the communication between the terminal 34 and the server 32, and to perform the speech recognition steps of the terminal 34 described above.

In one embodiment, after obtaining a voice signal input, the terminal 34 or the server 32 performs timbre matching on the voice signal against pre-stored timbre features. If the timbre matches, the subsequent speech recognition steps continue to be performed on the voice signal; otherwise the voice signal is intercepted and an alarm is raised, or the voice signal is deleted, so that the subsequent recognition steps for that voice signal are terminated. The pre-stored timbre features may be the spectral features of speech recorded by a first user of the terminal 34 (for example, the owner of the terminal 34), and timbre matching is the process of comparing the pre-stored spectral features with the spectral features of the input voice signal. By screening the voice signal in this way before recognition, unauthorized use of the terminal 34 can be prevented and the security of speech recognition improved.
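
A minimal sketch of such a gate is given below. The comparison measure is not specified in the text; cosine similarity between two spectral feature vectors and a fixed threshold are assumptions made for illustration, as are the names `cosine_similarity`, `gate_signal` and `threshold`. Extraction of the spectral features from audio is outside the sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def gate_signal(stored_features, input_features, threshold=0.9):
    """Return "continue" when the timbre matches, otherwise "block".

    The threshold value is an assumption of this sketch.
    """
    similarity = cosine_similarity(stored_features, input_features)
    return "continue" if similarity >= threshold else "block"
```

A caller would run the recognition steps only when the gate returns "continue", and would raise an alarm or delete the signal when it returns "block".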

The technical features of the embodiments described above may be combined in any manner. For brevity, not every possible combination of these technical features has been described; however, as long as a combination of these technical features involves no contradiction, it should be regarded as falling within the scope of this specification.

The embodiments described above represent only several implementations of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present invention, and all of these fall within the scope of protection of the present invention. The scope of protection of this patent shall therefore be determined by the appended claims.