CN110148416B

Movatterモバイル変換

Info

Publication number: CN110148416B
Application number: CN201910327337.6A
Authority: CN
Inventors: 骆彬; 彭学政; 涂润; 陈召
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-04-23
Filing date: 2019-04-23
Publication date: 2024-03-15
Anticipated expiration: 2039-04-23
Also published as: CN110148416A

Abstract

Description

Speech recognition method, device, equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of voice recognition, in particular to a voice recognition method, a voice recognition device, voice recognition equipment and a storage medium.

Background

The voice recognition technology is a technology for converting human voice into text, and is widely applied to various artificial intelligent products, such as intelligent dialogue robots, intelligent sound boxes, intelligent translation equipment and the like.

The general flow of speech recognition includes: the voice recognition device acquires voice data input by a user and sends the voice data to the voice recognition engine, the voice recognition engine recognizes the voice data and feeds a voice recognition result back to the voice recognition device, and the voice recognition device outputs the voice recognition result.

Currently, speech recognition relies on a single speech recognition engine, and the speech recognition engine is a general platform, and has relatively poor recognition effect for certain specific fields, resulting in inaccurate recognition results.

Disclosure of Invention

The embodiment of the application provides a voice recognition method, a device, equipment and a storage medium, which can be used for solving the problems that in the related technology, the recognition effect of a voice recognition engine aiming at certain specific fields is relatively poor and the recognition result is inaccurate. The technical scheme is as follows:

in one aspect, an embodiment of the present application provides a method for voice recognition, where the method includes:

acquiring voice data to be recognized;

transmitting the voice data to be recognized to n voice recognition engines to obtain n voice recognition results, wherein n is an integer greater than 1;

selecting a target voice recognition result from the n voice recognition results according to the characteristic information of the n voice recognition results; the characteristic information of the voice recognition result is used for indicating the adaptation degree between a voice recognition engine outputting the voice recognition result and the voice data to be recognized and the credibility degree of words contained in the voice recognition result.

In another aspect, an embodiment of the present application provides a voice recognition apparatus, including:

the data acquisition module is used for acquiring voice data to be identified;

the data transmission module is used for transmitting the voice data to be recognized to n voice recognition engines to obtain n voice recognition results, wherein n is an integer greater than 1;

the result selection module is used for selecting a target voice recognition result from the n voice recognition results according to the characteristic information of the n voice recognition results; the characteristic information of the voice recognition result is used for indicating the adaptation degree between a voice recognition engine outputting the voice recognition result and the voice data to be recognized and the credibility degree of words contained in the voice recognition result.

In yet another aspect, embodiments of the present application provide a computer device including a processor and a memory having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the speech recognition method as described in the above aspects.

In yet another aspect, embodiments of the present application provide a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the speech recognition method as described in the above aspects.

In yet another aspect, embodiments of the present application provide a computer program product for implementing the above-described speech recognition method when the computer program product is executed by a processor.

The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:

and sending the voice data to be recognized to a plurality of voice recognition engines for recognition to obtain a plurality of voice recognition results, and selecting one from the plurality of voice recognition results as a target voice recognition result according to the characteristic information of the plurality of voice recognition results. Compared with the prior art, the voice recognition is dependent on a single voice recognition engine, the voice recognition engine is a universal platform, recognition effects for certain specific fields are relatively poor, in the technical scheme provided by the application, voice recognition results of a plurality of voice recognition engines are adopted as references, and a better voice recognition result is selected from the voice recognition results, so that accuracy of the recognition results is improved.

Drawings

FIG. 1 is a schematic illustration of an implementation environment provided by one embodiment of the present application;

FIG. 2 is a schematic diagram of a complete speech recognition process of the present application;

FIG. 3 is a flow chart of a method of speech recognition provided in one embodiment of the present application;

FIG. 4 is a flow chart of a method of speech recognition provided in another embodiment of the present application;

FIG. 5 is a block diagram illustrating an error correction rewrite system;

FIG. 6 is a block diagram illustrating another error correction rewrite system;

fig. 7 is a schematic diagram exemplarily showing a second rewrite layer rewrite flow;

FIG. 8 is a block diagram of a speech recognition device provided in one embodiment of the present application;

FIG. 9 is a block diagram of a speech recognition device provided in one embodiment of the present application;

fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

Referring to fig. 1, a schematic diagram of an implementation environment provided in one embodiment of the present application is shown. The implementation environment may include: a terminal 10, a background server 20 and a speech recognition server 30.

In the embodiment of the present application, the terminal 10 may be configured with a voice capturing device, where the voice capturing device may be a microphone, a microphone array, a microphone, or the like, for acquiring voice data input by a user. The background server 20 provides a background service to the terminal 10.

Alternatively, an application having a voice input function, such as an instant messaging application, a voice input method application, a voice assistant, etc., may be installed in the terminal 10. When the applications run, the voice acquisition device is called to acquire voice data input by a user.

Alternatively, the background server 20 may be a background server of the application having the voice input function.

In one possible embodiment, the application with a voice input function can collect voice data input by a user and recognize the voice data.

In another possible implementation, the application with the voice input function only can collect voice data input by a user, the collected voice data needs to be sent to the voice recognition server 30 via the background server 20 to be recognized, and the voice recognition server 30 feeds back the voice recognition result to the terminal 10 via the background server 20 to output after the recognition is completed. In the embodiment of the present application, the voice recognition server 30 recognizes voice data input by a user as an example.

The terminal 10 may be an electronic device that can interact with a user in voice, such as a smart phone, tablet, PC (Personal Computer ), smart robot, smart television, smart speaker, etc.

In the embodiment of the present application, the speech recognition engine is run on the speech recognition server 30. Alternatively, different speech recognition engines may run in different speech recognition servers; a speech recognition server may have one speech recognition engine running therein, or may have a plurality of different speech recognition engines running therein, which is not limited in this embodiment of the present application.

The background server 20 and the speech recognition server 30 may be one server, a server cluster composed of a plurality of servers, or a cloud computing service center.

The terminal 10 may communicate with the background server 20 through a network, and the background server 20 communicates with the voice recognition server 30 through a network.

It should be noted that in some embodiments, the number of the voice recognition servers 30 is one, and a plurality of different voice recognition engines are running in the voice recognition server 30; in other embodiments, the number of speech recognition servers 30 is multiple, one speech recognition engine running in each speech recognition server, and the speech recognition engines in each speech recognition server being different.

In addition, the technical scheme provided by the embodiment of the application can be suitable for the voice recognition of a plurality of different languages, such as Chinese, english, french, german, japanese, korean and the like. In the embodiment of the present application, the application is mainly described by taking the application in chinese as an example, but the technical scheme of the present application is not limited.

It should be noted that the technical solution provided in the embodiment of the present application may be applied to various artificial intelligence products, where application scenarios include, but are not limited to, home, vehicle, game, and so on.

Referring to fig. 2, a schematic diagram of a complete speech recognition process of the present application is shown. A user inputs voice data to be recognized through a voice acquisition device configured in the terminal 10 (such as an intelligent robot, an intelligent television, an intelligent sound box and the like); then, the terminal 10 transmits the voice data to be recognized to a plurality of voice recognition engines, such as a voice recognition engine a, a voice recognition engine B, and a voice recognition engine C, via the background server 20, to obtain voice recognition results, such as a voice recognition result a, a voice recognition result B, and a voice recognition result C, of the respective voice recognition engines. The above-described speech recognition engine operates in the speech recognition server 30. Then, the background server 20 may select a target speech recognition result from the plurality of speech recognition results. And further performing error correction rewriting on the target voice recognition result to obtain a target voice recognition result after error correction rewriting. In the process of selecting the target voice recognition result and correcting and rewriting, the contents such as a historical interaction log and a domain knowledge graph are combined. Finally, the background server 20 feeds back the target speech recognition result after error correction and rewriting to the terminal 10, so that the terminal 10 can respond to the user by combining the target speech recognition result after error correction and rewriting through the interactive system.

The following describes the technical scheme of the application through several embodiments.

Referring to fig. 3, a flowchart of a voice recognition method according to an embodiment of the present application is shown. In the embodiment of the present application, the background server in the implementation environment shown in fig. 1 is mainly used for illustration. The method may comprise the following steps:

step 301, obtaining voice data to be recognized.

The voice data to be recognized refers to voice data input by a user. When a user wants to interact with the terminal in a voice mode, the user can directly speak to the voice acquisition device in the terminal, and correspondingly, the terminal can acquire the words spoken by the user through the voice acquisition device as voice data to be recognized.

Alternatively, the terminal may collect the voice data to be recognized when receiving a voice recognition instruction, where the voice recognition instruction may be triggered by a user, and the user may be triggered by a specified operation, where the specified operation may include a clicking operation, a sliding operation, and so on, and the embodiment of the present application is not limited to this.

For example, a voice recognition option may be provided in the terminal that may be clicked to trigger the voice recognition instruction when the user wants to interact with the terminal by voice. After receiving the voice recognition instruction, the terminal acquires voice data to be recognized.

Step 302, sending the voice data to be recognized to n voice recognition engines to obtain n voice recognition results, where n is an integer greater than 1.

After the terminal obtains the voice data to be recognized, the voice data to be recognized can be sent to a plurality of voice recognition engines for voice recognition, and correspondingly, the voice recognition engines can convert the voice data to be recognized into texts after receiving the voice data to be recognized, and the texts are the voice recognition results.

The n speech recognition engines may be operated in one server or in a plurality of servers. The n speech recognition correspondence characteristics are different, and the characteristics may include a domain characteristic, a history characteristic, a gender characteristic, a region characteristic, and the like.

In the embodiment of the present application, the n speech recognition engines are general speech recognition engines.

Step 303, selecting a target speech recognition result from the n speech recognition results according to the feature information of the n speech recognition results.

After n voice recognition results output by the n voice recognition engines are obtained, the n voice recognition engines may have different accuracies for different scene recognition, so that one recognition result is selected from the n voice recognition results as a target voice recognition result.

The above-mentioned degree of confidence is used for reflecting the degree of coincidence of the words contained in the speech recognition result and the words in the predefined word stock, and the higher the degree of confidence, the higher the degree of coincidence of the words contained in the speech recognition result and the words in the predefined word stock, the higher the accuracy of the words contained in the speech recognition result. For example, the speech recognition result a is "you do not and Lv Bu who is severe", the speech recognition result B is "you do not and Lv Bu who is severe", the words in the predefined word stock include "plum white", "Lv Bu", i.e., the speech recognition result B overlaps with the words in the predefined word stock, which indicates that the accuracy of the words contained in the speech recognition result B is higher.

Details of the above feature information are described in the following embodiment of fig. 4, and are not repeated here.

Optionally, the selecting the target speech recognition result from the n speech recognition results according to the feature information of the n speech recognition results includes: calculating confidence scores corresponding to the n voice recognition results according to the characteristic information of the n voice recognition results; and selecting the voice recognition result with the highest confidence score from the n voice recognition results as the target voice recognition result.

The calculation of confidence scores corresponding to the n speech recognition results may be implemented by a machine learning model. The machine learning model may be a markov continuous speech recognition model, a neural network model, an SVM (Support Vector Machine ), etc., which embodiments of the present application do not limit.

In summary, in the technical solution provided in the embodiments of the present application, the voice data to be recognized is sent to a plurality of voice recognition engines to perform recognition, so as to obtain a plurality of voice recognition results, and one of the plurality of voice recognition results is selected as the target voice recognition result according to the feature information of the plurality of voice recognition results. Compared with the prior art, the voice recognition is dependent on a single voice recognition engine, the voice recognition engine is a universal platform, recognition effects for certain specific fields are relatively poor, in the technical scheme provided by the application, voice recognition results of a plurality of voice recognition engines are adopted as references, and a better voice recognition result is selected from the voice recognition results, so that accuracy of the recognition results is improved.

Referring to fig. 4, a flowchart of a voice recognition method according to another embodiment of the present application is shown. In the embodiment of the present application, the background server in the implementation environment shown in fig. 1 is mainly used for illustration. The method may comprise the following steps:

in step 401, speech data to be recognized is acquired.

This step is the same as or similar to step 301 in the embodiment of fig. 3, and will not be described again here.

Step 402, sending the voice data to be recognized to n voice recognition engines to obtain n voice recognition results, where n is an integer greater than 1.

This step is the same as or similar to step 302 in the embodiment of fig. 3, and will not be described again here.

Step 403, calculating confidence scores corresponding to the n voice recognition results according to the feature information of the n voice recognition results.

The confidence score is used for reflecting the accuracy of the voice recognition result, and the higher the confidence score is, the higher the accuracy of the voice recognition result is.

The calculating the confidence scores corresponding to the n voice recognition results according to the feature information of the n voice recognition results may include the following two steps: for the ith voice recognition result in the n voice recognition results, acquiring characteristic information of the ith voice recognition result; and inputting the characteristic information of the ith voice recognition result into a machine learning model to obtain a confidence score corresponding to the ith voice recognition result, wherein i is a positive integer less than or equal to n.

The above feature information is used to reflect the voice data to be recognized, the voice recognition engine outputting the i-th voice recognition result, and the abstract feature of the i-th voice recognition result, and the feature information includes, but is not limited to, at least one of the following: the method comprises the steps of identifying field features corresponding to voice data to be identified, regional features corresponding to the voice data to be identified, gender features corresponding to the voice data to be identified, historical features corresponding to the voice data to be identified, regional features corresponding to a voice recognition engine outputting an ith voice recognition result, gender features corresponding to a voice recognition engine outputting an ith voice recognition result, historical features corresponding to a voice recognition engine outputting an ith voice recognition result, matching degree between the ith voice recognition result and a predefined word stock and the like.

The domain features corresponding to the voice data to be identified are used for reflecting the domain to which the theme of the voice data belongs, such as the domain of game theme, the domain of music theme, the domain of children story, the domain of economic theme, the domain of scientific theme and the like; the domain features corresponding to the voice recognition engine outputting the ith voice recognition result are used for reflecting the confidence of the voice recognition engine for the domain recognition.

The regional characteristics corresponding to the voice data to be recognized are used for reflecting regional accents of the user inputting the voice data, such as northeast dialect, guangdong dialect, shanghai dialect and the like; the region characteristics corresponding to the voice recognition engine outputting the ith voice recognition result are used for reflecting the confidence of the voice recognition engine on the voice data recognition with different region characteristics.

The gender characteristic corresponding to the voice data to be recognized is used for reflecting the gender of the user inputting the voice data, such as male voice or female voice; the gender feature corresponding to the voice recognition engine outputting the ith voice recognition result is used for reflecting the confidence of the voice recognition engine for voice data recognition with different gender features.

The historical features corresponding to the voice data to be recognized are used for reflecting whether the voice data to be recognized appears in a historical interaction log; the historical features corresponding to the voice recognition engine outputting the ith voice recognition result are used for reflecting the confidence of the voice recognition engine on the recognition of the voice data in the historical interaction log.

The matching degree between the ith speech recognition result and the predefined word stock is used for reflecting the similarity degree of the ith speech recognition result and the words in the predefined word stock. The predefined word stock may include a domain knowledge graph of a domain to which the subject of the ith speech recognition result belongs, where the domain knowledge graph includes a proprietary word stock of the domain; the method can also comprise a word stock of history identification recorded in a history interaction log; but also a word library of high frequency usage words collected by a background designer in advance, etc., which is not limited in the embodiment of the present application.

In addition, the above feature information may further include other features, which are not limited in this embodiment of the present application.

After the feature information corresponding to the ith speech recognition result is obtained, the feature information may be input into a pre-trained machine learning model, and a confidence score corresponding to the ith speech recognition result may be calculated. The input of the machine learning model is a plurality of feature information corresponding to the voice recognition result, and the output is a confidence score corresponding to the voice recognition result.

Step 404, selecting the speech recognition result with the highest confidence score from the n speech recognition results as the target speech recognition result.

The higher the confidence score, the more accurate the speech recognition result. Therefore, after the confidence scores of the n speech recognition results are obtained, the speech recognition result with the highest confidence score can be used as the target speech recognition result.

In addition, when two or more confidence scores are highest in parallel in the speech recognition result, any one of them may be selected as the target speech recognition result.

Optionally, after the target speech recognition result is obtained, error correction and rewrite can be performed on the error word in the target speech recognition result, so as to obtain the target speech recognition result after error correction and rewrite.

Because the above-mentioned speech recognition engine is a general speech recognition engine, for some specific scenes, the accuracy of recognition is not enough, and some error words may occur, which affects man-machine interaction. Therefore, error correction rewrite is introduced, error words in the target voice recognition result can be found, error correction rewrite is further carried out, the target voice recognition result after error correction rewrite is obtained, response is carried out through an interaction system of the terminal, and the accuracy of voice recognition is improved.

The above-mentioned error correction and rewrite of the error word in the target speech recognition result may include the following steps 405 to 406.

Step 405, inputting the target speech recognition result to the error correction rewrite system.

Because the above-mentioned speech recognition engine is a general speech recognition engine, for some specific scenes, the accuracy of recognition is not enough, and some error words may occur, which affects man-machine interaction. Therefore, the target voice recognition result is input into the error correction rewrite system, the error word in the target voice recognition result can be found, error correction rewrite is further carried out, and the accuracy of voice recognition is improved.

The error correction rewrite system is used for determining the error word in the target voice recognition result, obtaining the correct word corresponding to the error word, and rewriting the error word in the target voice recognition result into the correct word to obtain the target voice recognition result after error correction rewrite. For example, the target voice recognition result is "you do not match" and Lv Bu who is severe ", the error correction rewrite system may determine" you do not match "in the target voice recognition result as an error word, further obtain a correct word corresponding to" you do so, such as "liqueur", and rewrite the error word "you do so as to be the correct word" liqueur ", so as to obtain a corrected and rewritten target voice recognition result of" liqueur and Lv Bu who is severe "; for another example, the target speech recognition result is "song of my Zhou Jielun", the error correction rewrite system may determine that "my" in the target speech recognition result is an error word, further obtain a correct word corresponding to "my", such as "play", and rewrite the error word "my" to the correct word "play", so as to obtain a corrected and rewritten target speech recognition result as "song of play Zhou Jielun".

Optionally, referring to fig. 5, the error correction rewrite system 500 includes: a first writing layer 501, a second writing layer 502, and a third writing layer 503; the first rewrite layer 501 is used for rewriting high-frequency error words in the target voice recognition result, the second rewrite layer 502 is used for rewriting error words in the field to which the target voice recognition result belongs, and the third rewrite layer 503 is used for rewriting redundant error words in the target voice recognition result.

Next, the contents of the first writing layer, the second writing layer, and the third writing layer will be described in detail with reference to fig. 6.

1. The first rewrite layer 501 includes a whitelist and/or at least one rewrite rule. The white list is a mapping table of high-frequency error words and correct words corresponding to the high-frequency error words, and when the high-frequency error words are detected to be contained in the target voice recognition result, error correction and rewrite are carried out on the high-frequency error words according to the white list and/or rewrite rules; the mapping table may be acquired according to a historical interaction log, or may be determined in advance by a designer, which is not limited in the embodiment of the present application.

The above-mentioned rewrite rule is a rewrite rule set for a high-frequency erroneous word, which may include a rewrite rule for the same high-frequency erroneous word, when belonging to different fields, such as a rewrite mode of the high-frequency erroneous word "i put", the rewrite rule may be a rewrite of "i put" to "play" when the voice data belongs to the music field; when the voice data belongs to the game field, "my place" is rewritten as "my side". The rewrite rule may be acquired according to a history interaction log, or may be determined in advance by a background designer, which is not limited in the embodiment of the present application.

2. The second rewrite layer 502 includes a first error detection module, a second error detection module, and a rewrite module. The first error detection module is used for calling the language model to detect the error word in the field of the target voice recognition result. Alternatively, the language model may be an N-gram (N-gram) model. The values of N in the application are 1, 2 and 3, namely a unigram (1-gram) model, a bigram (2-gram) model and a trigram (3-gram) model. In other embodiments, N may also be a natural number greater than 3, which is not limited in this embodiment. The method for detecting the error words in the field of target voice recognition results by using the N-gram model mainly comprises the following steps:

(1) Inputting the target voice recognition result into an N-gram model to obtain a corresponding score value;

after the target speech recognition result is obtained, the target speech recognition result may be input into the above-described one-gram model, and one-gram model, respectively. Optionally, before the grammar model is input, the target speech recognition result may be segmented to obtain a word pair list corresponding to the target speech recognition result, for example, taking the binary grammar model as an example, assuming that the target speech recognition result is "red bean of my Wang Fei", the word pair list of the binary grammar model is: [ My, wang Fei ], [ Wang Fei ], red bean ]. Similarly, a word pair list of the unigram model and the trigram model is obtained. Further, the word pair list of each grammar model is respectively input into the corresponding grammar model, and each grammar model can calculate two score values for any word. In the embodiment of the application, for any word in the target speech recognition result, score values calculated by the unigram model, the bigram model and the trigram model can be obtained respectively, and total six score values are obtained. Alternatively, the score value may be calculated by using the longest common substring and/or edit distance method.

(2) And calculating the average value of the score values according to the score values corresponding to the N-gram model.

In the embodiment of the application, for any word, score values calculated by a unigram model, a bigram model and a trigram model can be respectively obtained, and the total is six score values; further, a mean score for each word may be obtained.

(3) And detecting the error word in the field of the target voice recognition result by adopting a filtering algorithm.

In the embodiment of the present application, taking the filtering algorithm as an MAD (Mean Absolute Differences, average absolute difference) algorithm as an example for illustration, the larger the MAD value, the greater the possibility that the word is an erroneous word, and further, the word with the largest MAD value may be determined as the erroneous word.

After the average score of each word is obtained, the average absolute difference of each word relative to the average score can be further obtained, and the word with the largest average absolute difference is determined as the error word according to the filtering algorithm. In some other embodiments, the filtering algorithm may also be a SAD (Sum of Absolute Differences, sum of absolute error) algorithm, SSD (Sum of Squared Differences, sum of square error) algorithm, MSD (Mean Square Differences, sum of average error) algorithm, and the like, which is not limited in this embodiment.

The second error detection module is used for detecting the error word in the field of the target voice recognition result according to a syntactic analysis algorithm, namely, detecting the error word in the field of the target voice recognition result by analyzing the syntactic structure (mainly called binoding-shaped complement) of the target voice recognition result or analyzing the dependency relationship among vocabularies in the target voice recognition result. Mainly comprises the following steps:

(1) Detecting keywords in a target voice recognition result;

typically, the keywords of a sentence are predicates. Illustratively, assuming that the target speech recognition result is "guide to play Zhou Jielun," it may be determined that "play" should be a predicate, "Zhou Jielun" be a predicate, and "guide" be an object, according to a syntactic structure, and thus "play" may be determined as a keyword.

(2) The entity in the keyword detection target voice recognition result is explored radically;

typically the entity of a sentence is a word following a keyword. Illustratively, assuming that the target speech recognition result is a "guide to play Zhou Jielun," the entity may include "Zhou Jielun" and "guide.

(3) Extracting target entities in the entities, and determining the target entities as error words in the field of target voice recognition results;

the target entity may be a suspected wrong word in the entity. Such as "guidance" among the entities "Zhou Jielun" and "guidance", and determines the "guidance" as an erroneous word in the field to which the target speech recognition result belongs.

The rewriting module is used for rewriting the error words detected by the first error detection module and the second error detection module into correct words.

The rewrite module may include a correct word recall unit, a filter unit, and a rewrite unit. Optionally, the rewrite module may further comprise a ranking unit.

The correct word recall unit is used for selecting at least one candidate correct word with the similarity score larger than a preset threshold value from the domain knowledge graph of the domain to which the target voice recognition result belongs.

Illustratively, the process of determining candidate correct words is described using pinyin similarity as an example. Assuming that the target voice recognition result is the guiding of playing Zhou Jielun, and belongs to the field of music; if the detected error word is "guide", the corresponding pinyin text is "dao xiang", the "dao xiang" is compared with the pinyin of the special word stock in the music field, and the similarity score is calculated, for example, the similarity score of "Daoxiang" is 90 points, the similarity score of "reverse" is 65 points, and the similarity score of "island direction" is 50 points; and selecting a proper preset threshold value, such as 60 minutes, and taking words in the special word stock of the music field, such as 'Daoxiang' and 'reverse', which are larger than the preset threshold value, as candidate correct words.

The ordering unit is used for ordering the similarity scores corresponding to the at least one candidate correct word according to a decreasing rule. For example, if the similarity score of "rice aroma" is 90 points and the similarity score of "reverse" is 65 points, then "rice aroma" is used as the first candidate correct word and "reverse" is used as the second candidate correct word.

The filtering unit is used for calculating the confusion degree score of each candidate correct word and determining the target correct word from each candidate correct word according to the confusion degree score of each candidate correct word. In this embodiment of the present application, the PPL (confusion) score is used to characterize the accuracy of overwriting the at least one candidate correct word into the target speech recognition result. The PPL score may be calculated using the following formula:

where N represents the length of the target speech recognition result, P (w_i ) Is the probability of occurrence of the i-th candidate correct word. The smaller the PPL score, the P (w)_i ) The larger the i-th candidate correct word is, the higher the accuracy of rewriting the i-th candidate correct word into the target speech recognition result is.

Further, the candidate correct word with the highest PPL score may be determined as the target correct word.

For example, the target speech recognition result is "guidance in playing Zhou Jielun", and the candidate correct words include "Daoxiang" whose PPL score is 95 and "reverse" whose PPL score is 55, so that "Daoxiang" can be regarded as the target correct word.

Alternatively, the filtering unit may invoke the language model to determine the target correct word from the at least one candidate correct word.

The rewriting unit is used for rewriting the wrong word in the target voice recognition result into the target correct word.

For example, the target speech recognition result is "guide to play Zhou Jielun" and "rice" is the target correct word, so that "rice" can be substituted for "guide" in the target speech recognition result to obtain the target speech recognition result after error correction and rewriting, that is, "rice" play Zhou Jielun ".

Further, the error correction rewritten target speech recognition result may be output.

3. The third rewrite layer 503 is configured to invoke the neural network model to rewrite redundant error words in the target speech recognition result.

Optionally, the neural network model has an end-to-end (sequence to sequence) framework, namely an encoding-decoding (Encoder-Decoder) framework; the neural network model may be a CNN (Convolutional Neural Networks, convolutional neural network) model, an RNN (Recurrent Neural Network, cyclic neural network) model, a transformer model, an LSTM (Long Short-Term Memory) model, or the like, which is not limited in this embodiment of the present application.

The redundant error word may be a repeated word in the target speech recognition result, or may be a word of a mood in the target speech recognition result. For example, the target speech recognition result is "i am to listen to the song", and the target speech recognition result after rewriting the redundant wrong word is "i am to listen to the song".

Step 406, obtaining the target voice recognition result after error correction and rewriting output by the error correction and rewriting system.

The terminal can acquire the target voice recognition result after error correction and rewriting output by the error correction and rewriting system, and further responds to the user through the interaction system of the terminal according to the target voice recognition result after error correction and rewriting.

Note that, in the embodiment of the present application, the error correction rewrite is not dependent on the general-purpose speech recognition engine, and may be executed directly by the terminal or a server serving the terminal.

In addition, the error correction rewrite system uses different error correction rewrite strategies or models for different types of error words, such as high-frequency error words, error words in the field, redundant error words and the like, so that the rationality of the final voice recognition result is effectively improved.

In addition, in the embodiment of the application, the error correction rewriting is not dependent on a general language recognition engine, and can be directly executed by the terminal or a server serving the terminal, so that the time and cost of voice recognition are saved.

Next, referring to fig. 7 in combination, taking the language model of the first error detection module as an N-gram (N-gram) model, the filtering algorithm as an MAD algorithm as an example, a simple introduction is made on the complete flow of rewriting the error word in the field to which the target speech recognition result belongs by the second rewrite layer. The target voice recognition result is "you are severe with Lv Bu", the first error detection module of the second rewrite layer calls an N-gram model, such as a unigram model, a bigram model and a trigram model, and further obtains score values calculated by the unigram model, the bigram model and the trigram model respectively, namely a 1-gram score, a 2-gram score and a 3-gram score; calculating according to the 1-gram score, the 2-gram score and the 3-gram score to obtain N-gram average score, and determining "you do not" as an error word by adopting an MAD filtering algorithm based on the N-gram average score; the second error detection module determines 'you do not' as an error word through keyword detection, entity detection and extraction of a target entity; then, in the correct word recall unit, at least one candidate correct word is retrieved from the domain knowledge graph according to the pinyin similarity and/or the font similarity; the sorting unit sorts the similarity scores corresponding to the candidate correct words according to a decreasing rule, calculates the PPL score of each candidate correct word according to the sorting of the candidate correct words in the filtering unit, and determines the candidate correct word with the highest PPL score as a target correct word; and the rewriting unit rewrites the error word in the target voice recognition result into the target correct word to obtain an error correction rewritten target voice recognition result of 'Libai and Lv Bu who is severe', and thus the error correction rewriting is completed.

The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.

Referring to fig. 8, a block diagram of a voice recognition device according to an embodiment of the present application is shown. The device has the function of realizing the method example, and the function can be realized by hardware or can be realized by executing corresponding software by hardware. The device may be a speech recognition device or a terminal as described above, or may be provided on a background server or terminal. The apparatus 800 may include: a data acquisition module 810, a data transmission module 820, a result selection module 830, and a result rewrite module 840.

The data acquisition module 810 is configured to acquire voice data to be recognized.

The data sending module 820 is configured to send the voice data to be recognized to n voice recognition engines to obtain n voice recognition results, where n is an integer greater than 1.

A result selecting module 830, configured to select a target speech recognition result from the n speech recognition results according to the feature information of the n speech recognition results; the characteristic information of the voice recognition result is used for indicating the adaptation degree between a voice recognition engine outputting the voice recognition result and the voice data to be recognized and the credibility degree of words contained in the voice recognition result.

In some possible designs, as shown in fig. 9, the result selection module 830 includes: a score calculation unit 831 and a result selection unit 832.

And the score calculating unit 831 is used for calculating the confidence scores corresponding to the n voice recognition results according to the characteristic information of the n voice recognition results.

The result selecting unit 832 is configured to select, from the n speech recognition results, the speech recognition result with the highest confidence score as the target speech recognition result.

In some possible designs, as shown in fig. 9, the apparatus 800 further includes: the result is rewritten block 840.

And the result rewriting module 840 is configured to perform error correction rewriting on the error word in the target speech recognition result, so as to obtain the target speech recognition result after error correction rewriting.

It should be noted that, in the apparatus provided in the foregoing embodiment, when implementing the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be implemented by different functional modules, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.

Referring to fig. 10, a schematic structural diagram of a computer device according to an embodiment of the present application is shown. The computer device is used for implementing the voice recognition method provided in the above embodiment. The computer device may be a background server as described above, or may be a terminal as described above, such as a smart phone, tablet, PC, smart robot, smart television, smart speaker, etc., that may interact with a user in voice. Specifically, the present invention relates to a method for manufacturing a semiconductor device.

The computer apparatus 1000 includes a Central Processing Unit (CPU) 1001, a system memory 1004 including a Random Access Memory (RAM) 1002 and a Read Only Memory (ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001. The computer device 1000 also includes a basic input/output system (I/O system) 1006, which facilitates the transfer of information between the various devices within the computer, and a mass storage device 1007 for storing an operating system 1013, application programs 1014, and other program modules 1012.

The basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1008 and the input device 1009 are connected to the central processing unit 1001 via an input output controller 1010 connected to a system bus 1005. The basic input/output system 1006 may also include an input/output controller 1010 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 1010 also provides output to a display screen, a printer, or other type of output device.

The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the computer device 1000. That is, the mass storage device 1007 may include a computer readable medium (not shown) such as a hard disk or CD-ROM drive.

The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the one described above. The system memory 1004 and mass storage devices 1007 described above may be collectively referred to as memory.

According to various embodiments of the present application, the computer device 1000 may also operate by being connected to a remote computer on a network, such as the Internet. I.e., the computer device 1000 may be connected to the network 1012 through a network interface unit 1011 connected to the system bus 1005, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 1011.

The memory also includes at least one instruction, at least one program, code set, or instruction set stored in the memory and configured to be executed by one or more processors to implement the above-described speech recognition method.

In an exemplary embodiment, a computer readable storage medium is also provided, in which at least one instruction, at least one program, a set of codes or a set of instructions is stored, which, when executed by a processor, implements the above-described speech recognition method.

In an exemplary embodiment, a computer program product is also provided, which, when being executed by a processor, is adapted to carry out the above-mentioned speech recognition method.

It should be understood that references herein to "a plurality" are to two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.

The foregoing description of the exemplary embodiments of the present application is not intended to limit the invention to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, alternatives, and alternatives falling within the spirit and scope of the invention.

Claims

1. A method of speech recognition, the method comprising:

acquiring voice data to be recognized;

calculating confidence scores corresponding to the n voice recognition results according to the characteristic information of the n voice recognition results; the feature information of the voice recognition result is used for indicating the adaptation degree between a voice recognition engine outputting the voice recognition result and the voice data to be recognized and the credibility of words contained in the voice recognition result, the adaptation degree is used for reflecting the accuracy of the voice recognition engine for recognizing the voice data to be recognized, the credibility is used for reflecting the overlapping degree of words contained in the voice recognition result and words in a predefined word bank, and the feature information comprises the domain feature corresponding to the voice data to be recognized, the domain feature corresponding to the voice recognition engine outputting the voice recognition result and the matching degree between the voice recognition result and the predefined word bank;

And selecting the voice recognition result with the highest confidence score from the n voice recognition results as a target voice recognition result.

2. The method according to claim 1, wherein calculating confidence scores for the n speech recognition results according to the feature information of the n speech recognition results comprises:

for an ith voice recognition result in the n voice recognition results, acquiring characteristic information of the ith voice recognition result, wherein i is a positive integer less than or equal to n;

and inputting the characteristic information of the ith voice recognition result into a machine learning model to obtain a confidence score corresponding to the ith voice recognition result.

3. The method according to any one of claims 1 to 2, further comprising:

and carrying out error correction and rewriting on the error words in the target voice recognition result to obtain the target voice recognition result after error correction and rewriting.

4. The method of claim 3, wherein performing error correction rewrite on the error word in the target speech recognition result to obtain the error correction rewritten target speech recognition result comprises:

Inputting the target voice recognition result into an error correction rewrite system, wherein the error correction rewrite system is used for determining error words in the target voice recognition result, obtaining correct words corresponding to the error words, and rewriting the error words in the target voice recognition result into the correct words to obtain the target voice recognition result after error correction rewrite;

and acquiring the target voice recognition result after error correction and rewriting, which is output by the error correction and rewriting system.

5. The method of claim 4, wherein the error correction rewrite system comprises: a first writing layer, a second writing layer, and a third writing layer;

the first rewrite layer is used for rewriting high-frequency error words in the target voice recognition result, the second rewrite layer is used for rewriting error words in the field of the target voice recognition result, and the third rewrite layer is used for rewriting redundant error words in the target voice recognition result.

6. The method of claim 5, wherein the first overwrite layer is configured to:

when the target voice recognition result contains the high-frequency error word, correcting and rewriting the high-frequency error word according to a white list and/or a rewrite rule, wherein the white list is a mapping table between the high-frequency error word and a correct word corresponding to the high-frequency error word, and the rewrite rule is set for the high-frequency error word.

7. The method of claim 5, wherein the second overwrite layer comprises: the system comprises a first error detection module, a second error detection module and a rewriting module;

the first error detection module is used for calling a language model to detect error words in the field to which the target voice recognition result belongs;

the second error detection module is used for detecting error words in the field to which the target voice recognition result belongs according to a syntactic analysis algorithm;

and the rewriting module is used for rewriting the error word detected by the first error detection module and the second error detection module into a correct word.

8. The method of claim 7, wherein the rewrite module comprises a correct word recall unit, a filter unit, and a rewrite unit;

the correct word recall unit is used for selecting at least one candidate correct word with the similarity score larger than a preset threshold value from a domain knowledge graph of the domain to which the target voice recognition result belongs;

the filtering unit is used for calculating the confusion degree scores of the candidate correct words, and determining target correct words from the candidate correct words according to the confusion degree scores of the candidate correct words, wherein the confusion degree scores are used for representing the accuracy of rewriting the candidate correct words into the target voice recognition results;

And the rewriting unit is used for rewriting the error word in the target voice recognition result into the target correct word.

9. The method of claim 5, wherein the third overwrite layer is configured to:

and calling a neural network model to rewrite redundant error words in the target voice recognition result.

10. A speech recognition device, the device comprising:

the data acquisition module is used for acquiring voice data to be identified;

11. The apparatus of claim 10, wherein the apparatus further comprises:

and the result rewriting module is used for correcting and rewriting the error words in the target voice recognition result to obtain the corrected and rewritten target voice recognition result.

12. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set that is loaded and executed by the processor to implement the method of any one of claims 1 to 9.

13. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the method of any one of claims 1 to 9.