Disclosure of Invention
The application provides a text error correction method and a text error correction device, which are used for accurately correcting the voice recognition text output by a voice recognition system, so that voice recognition text of high accuracy is output to the user.
According to a first aspect, an embodiment of the application provides a text error correction method, which comprises the steps of: identifying, based on a voice recognition text output by a voice recognition system, the voice recognition text to respectively obtain error information of the voice recognition text, text semantics of the voice recognition text and domain information of the voice recognition text; obtaining an error correction position of the voice recognition text through a position sub-model in a text error correction model based on the error information, the text semantics and the domain information; and correcting the error text at the error correction position through an error correction sub-model in the text error correction model to obtain the corrected voice recognition text, wherein the text error correction model is obtained by training with a first loss value of the position sub-model and a second loss value of the error correction sub-model.
In the above scheme, by respectively identifying the error information, the text semantics and the domain information of a section of voice recognition text output by the voice recognition system, the error correction position of the voice recognition text can be obtained through the position sub-model in the text error correction model according to the obtained error information, text semantics and domain information, and the error text at the determined error correction position is corrected through the error correction sub-model in the text error correction model, so that the corrected voice recognition text is obtained. In this method, integrating the error information, the text semantics and the domain information of the voice recognition text greatly improves the accuracy of the error correction position obtained through the position sub-model, and correcting the error text at the accurately located error position in turn greatly improves the accuracy of the finally corrected voice recognition text, thereby improving the experience of a user in a voice-to-text scene.
In one possible implementation method, the step of recognizing the voice recognition text to obtain the error information of the voice recognition text includes the step of recognizing the voice recognition text through an error learning module to obtain the error information of the voice recognition text, wherein the error learning module is used for comparing an original text with an error correction text in a history sample to generate and store each error information.
In this scheme, since the error learning module compares the original text in the history sample with the error correction text, and obtains and stores the error correction position and error correction information of each original text, identifying the voice recognition text with the error learning module yields the error information of the voice recognition text. The obtained error information can subsequently be used by the text error correction model, and the voice recognition text can finally be subjected to the error correction processing of the text error correction model to generate corrected voice recognition text with high accuracy, improving the experience of a user in a voice-to-text scene.
In one possible implementation method, the method for recognizing the voice recognition text to obtain text semantics of the voice recognition text comprises the steps of recognizing the voice recognition text through a text semantics acquisition module to obtain text semantics of the voice recognition text, wherein the text semantics acquisition module is obtained by performing word vector learning on an original text in a history sample and performing semantic learning based on word vectors.
According to the scheme, the text semantic acquisition module performs word vector learning through the original text in the history sample and performs semantic learning based on the word vector, so that text semantic acquisition is performed on the voice recognition text by using the text semantic acquisition module, the text semantic of the voice recognition text can be obtained, the obtained text semantic can be used for a text correction model subsequently, and finally the voice recognition text can be subjected to correction processing of the text correction model to generate corrected voice recognition text with high accuracy, so that experience of a user in a voice-to-text scene is improved.
In one possible implementation method, the step of recognizing the voice recognition text to obtain the domain information of the voice recognition text comprises the step of recognizing the voice recognition text through a domain information acquisition module to obtain the domain information of the voice recognition text, wherein the domain information acquisition module performs domain learning on word vectors of original texts in historical samples, and sets domain weights on the domain word vectors of the original texts based on domains.
In the above scheme, in view of the fact that the domain information acquisition module performs domain learning on the word vector of the original text in the history sample and sets the domain weight for the domain word vector of the original text based on the domain, the domain information acquisition module is used for acquiring the domain information of the voice recognition text, the acquired domain information can be used for a text error correction model, and finally the voice recognition text after error correction with high accuracy can be generated after error correction processing of the voice recognition text by the text error correction model, so that experience of a user in a voice-to-text scene is improved.
In a possible implementation method, the text error correction model is trained by the following steps of obtaining error information of an original text through the error learning module according to the original text and the error correction text in a history sample, obtaining text semantics of the original text through the text semantic acquisition module according to the original text in the history sample, obtaining domain information of the original text through the domain information acquisition module according to the original text in the history sample, taking the error information of the original text, the text semantics of the original text and the domain information of the original text as input values of the text error correction model, taking the error correction position of the error information as a label value of the position sub-model, taking the error correction information of the error information as a label value of the error correction sub-model, and training the text error correction model.
In the above scheme, for the original text and the error correction text in the history sample, firstly, the error learning module can obtain the error information of the original text, the text semantic of the original text can be obtained by the text semantic obtaining module, the field information of the original text can be obtained by the field information obtaining module, then the error information, the text semantic and the field information of the original text can be input into the original text error correction model, the error correction position in the error information is used as the label value of the position sub-model during training, and the error correction information in the error information is used as the label value of the error correction sub-model during training, so that the original text error correction model can be trained. The text error correction model trained by the method can be used for accurately identifying the position of a text error in a voice recognition text and the information of the correct text corresponding to the error position, so that the experience of a user in a voice-to-text scene is improved.
In a possible implementation method, the error correction sub-model comprises a first sub-model based on a generation mode, a second sub-model based on a judgment mode and an evaluation sub-model, error correction is carried out on the error text at the error correction position through the error correction sub-model in the text error correction model to obtain an error corrected voice recognition text, the error correction method comprises the steps of correcting the error text at the error correction position through the first sub-model to obtain a first error correction result, correcting the error text at the error correction position through the second sub-model to obtain a second error correction result, evaluating the first error correction result and the second error correction result through the evaluation sub-model, and taking the error correction result meeting the evaluation result as the error corrected voice recognition text.
In the above scheme, for a speech recognition text, after the text is confirmed to have a position where an error occurs through a position sub-model in the text error correction model, the error correction sub-model in the text error correction model can be used to correct the error text at the error position. In order to improve the accuracy of text error correction, the application proposes that error correction can be carried out on error text at the error position according to a first sub-model based on a generation model and a second sub-model based on a judgment mode respectively, and a first error correction result and a second error correction result are respectively generated, and for the two error correction results, according to the embodiment of the application, the two error correction results can be further evaluated by using the evaluation submodel, and the error correction result meeting the evaluation result is used as the speech recognition result after error correction, so that the accuracy of text error correction can be greatly improved by integrating the error correction results of the text error correction model in at least two modes, and the experience of a user in a speech-to-text scene is improved.
In one possible implementation method, the step of evaluating the first error correction result and the second error correction result and taking the error correction result meeting the evaluation result as the corrected voice recognition text comprises: determining a first score based on the word semantic distances between the first error correction result and the word vectors in the voice recognition text and the sentence semantic distance between the first error correction result and the voice recognition text; determining a second score based on the word semantic distances between the second error correction result and the word vectors in the voice recognition text and the sentence semantic distance between the second error correction result and the voice recognition text; and taking the error correction result with the lower of the first score and the second score as the corrected voice recognition text.
In the above scheme, when the evaluation submodel is used to screen the first error correction result and the second error correction result, the word semantic distance and the sentence semantic distance between the error correction result and each word vector in the speech recognition text may be calculated for any error correction result, the score corresponding to the current error correction result is generated, and finally, the error correction result with the lower score in the two scores may be used as the speech recognition text after error correction. According to the method, meaning of the vocabulary and semantic consistency of sentence level are comprehensively considered, so that when a good error correction result is confirmed, the confirmed error correction result can be infinitely close to a correct text, namely accuracy in the error correction process of the voice recognition text is improved, and experience of a user in a voice-to-text scene is improved.
In one possible implementation, the speech recognition system is any of a variety of speech recognition systems.
In the above scheme, since the error learning module can be used for learning and recording the error information of the speech recognition text, the application can be used for interfacing any speech recognition system in various speech recognition systems, namely the method of the application can realize the effect of cooperative work with various speech recognition systems.
In a second aspect, an embodiment of the present application provides a text error correction device, which comprises an obtaining unit configured to identify, based on a voice recognition text output by a voice recognition system, the voice recognition text to respectively obtain error information of the voice recognition text, text semantics of the voice recognition text and domain information of the voice recognition text, and an error correction unit configured to obtain an error correction position of the voice recognition text through a position sub-model in a text error correction model based on the error information, the text semantics and the domain information, and to correct the error text at the error correction position through an error correction sub-model in the text error correction model to obtain corrected voice recognition text, where the text error correction model is obtained by training with a first loss value of the position sub-model and a second loss value of the error correction sub-model.
In a third aspect, embodiments of the present application provide a computing device comprising:
a memory for storing program instructions;
a processor for calling the program instructions stored in the memory and executing any of the above implementation methods according to the obtained program.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform any of the implementation methods of the first aspect.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
For the problem that current speech recognition systems are prone to errors in the text converted from speech, the embodiment of the application provides a text error correction method. As shown in fig. 1, which is a schematic diagram of a text error correction method according to an embodiment of the present application, the method includes the following steps:
Step 101, based on a voice recognition text output by a voice recognition system, recognizing the voice recognition text to respectively obtain error information of the voice recognition text, text semantics of the voice recognition text and domain information of the voice recognition text;
and step 102, obtaining the error correction position of the voice recognition text through a position sub-model in a text error correction model based on the error information, the text semantics and the domain information, and correcting the error text at the error correction position through an error correction sub-model in the text error correction model to obtain the corrected voice recognition text.
The text error correction model is obtained by training a first loss value of the position sub-model and a second loss value of the error correction sub-model.
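To make the above two steps concrete, the following is a minimal sketch of how steps 101 and 102 could be wired together in code. The class and method names (identify, encode, locate, correct, and so on) are illustrative assumptions made for this sketch and are not interfaces defined by the application.

# Hypothetical orchestration of steps 101 and 102; all names are assumptions.
class TextCorrectionPipeline:
    def __init__(self, error_module, semantic_module, domain_module, correction_model):
        self.error_module = error_module          # yields error information (step 101)
        self.semantic_module = semantic_module    # yields text semantics (step 101)
        self.domain_module = domain_module        # yields domain information (step 101)
        self.correction_model = correction_model  # position sub-model + error correction sub-model (step 102)

    def correct(self, asr_text: str) -> str:
        # Step 101: identify the speech recognition text from three aspects.
        error_info = self.error_module.identify(asr_text)
        semantics = self.semantic_module.encode(asr_text)
        domain_info = self.domain_module.encode(asr_text)
        # Step 102: locate the error correction position, then correct the error text there.
        span = self.correction_model.locate(error_info, semantics, domain_info)
        return self.correction_model.correct(asr_text, span)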
In the above scheme, by respectively identifying the error information, the text semantics and the domain information of a section of speech recognition text output by the speech recognition system, the error correction position of the speech recognition text can be obtained through the position sub-model in the text error correction model according to the obtained error information, text semantics and domain information, and the error text at the determined error correction position is corrected through the error correction sub-model in the text error correction model, so that the corrected speech recognition text is obtained. In this method, integrating the error information, the text semantics and the domain information of the speech recognition text greatly improves the accuracy of the error correction position obtained through the position sub-model, and correcting the error text at the accurately located error position in turn greatly improves the accuracy of the finally corrected speech recognition text, thereby improving the experience of a user in a voice-to-text scene.
Some of the above steps will be described in detail below with reference to examples, respectively.
In one implementation of the step 101, the step of identifying the speech recognition text to obtain the error information of the speech recognition text includes identifying the speech recognition text by an error learning module to obtain the error information of the speech recognition text, where the error learning module is configured to compare an original text in a history sample with an error correction text to generate and store each error information.
In the embodiment of the application, the error learning module is used for identifying and recording the error information in the voice recognition text output by the voice recognition system, wherein the identified and recorded error information can be used for learning a subsequent text error correction model.
Specifically, the corpus used for learning by the error learning module is composed of supervised speech recognition text and weakly supervised speech recognition text. The supervised speech recognition text is labeled manually: the error positions and the corrected information in the original text are marked into the original text through labels. The weakly supervised speech recognition text comprises the uncorrected text after speech recognition (namely the original text) and the text automatically corrected by an error correction system (namely the error correction text), and does not include error position labels. The error correction system may be the text error correction model of the present application or an error correction system currently used to correct speech recognition text.
Further, the error learning module may comprise a speech recognition evaluation unit and an error information memory unit:
The speech recognition evaluation unit can be used for organizing unstructured text into structured data from which a model can learn, automatically comparing the differences between the original text and the corrected text (namely the error correction text), and marking the error positions and the corrected content in the original text.
The error information memory unit can store the error information so obtained; further, when training the text error correction model, the error information memory unit can transmit the error information to the text error correction model for training the text error correction model.
The following two aspects are descriptions made in connection with the error learning module, the supervised speech recognition text, and the weakly supervised speech recognition text.
1. Supervised speech recognition text
The supervised speech recognition text may be used for supervised training of the text error correction model.
For example, (w_1, w_2, …, w_n) may be used to represent a segment of speech recognition text, where w_i is the i-th word in the speech recognition text, and (c_0, c_1, …, c_n, c_{n+1}) is used to represent its text encoding, where c_0 is the text start encoding and c_{n+1} is the text end encoding. The error location label is then given by the start and end encodings of the erroneous words in the text; for example, (c_i, c_{i+3}) indicates that w_i through w_{i+3} belong to the erroneous content.
For example, for a segment of speech "natural language processing means processing linguistic problems by a computer", suppose a speech recognition system recognizes the speech as the speech recognition text "natural language processing essentially processing linguistic problems by a computer". This speech recognition text contains 7 words, so it can be represented as (w_1, w_2, …, w_7), and its text encoding is represented as (c_0, c_1, …, c_7, c_8). In this speech recognition text, the speech recognition evaluation unit in the error learning module can determine that the error location tag of the speech recognition text is (c_2, c_2), which indicates that the word w_2 is wrong: "essentially" should be "means".
After that, the error information memory unit can store the original text, the text error position and the text after error correction, and form an error message. The above error information will be used as text error characteristics of the text error correction model for training the text error correction model.
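As a concrete illustration of the record formed above, the following minimal sketch shows one possible in-memory representation of an error message; the field names and types are assumptions made for illustration only.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class ErrorRecord:
    original_text: str            # raw speech recognition output, e.g. the sentence containing "essentially"
    corrected_text: str           # manually labeled or system-corrected text
    error_span: Tuple[int, int]   # (start, end) positions in the text encoding, e.g. (2, 2)
    correction: str               # the correct content for that span, e.g. "means"

# The speech recognition evaluation unit could derive error_span by aligning original_text
# with corrected_text; the error information memory unit then stores the record and later
# feeds it to the text error correction model as a text error feature.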
2. Weak supervision speech recognition text
The weakly supervised speech recognition text may come from the text error correction model of the present application or from text data generated by error correction systems currently used to correct speech recognition text. After such data has accumulated to a certain extent, it can be used as an enhanced data set to train the text error correction model in the present application.
For a weakly supervised speech recognition text, the speech recognition evaluation unit can automatically mark the error position by comparing the original text with the text after error correction, and the error information memory unit can store the original text, the text error position and the text after error correction and form an error message. The above error information will be used as text error characteristics of the text error correction model for training the text error correction model.
Based on the above error learning module, for a piece of speech recognition text output by the speech recognition system, error information of the speech recognition text can be determined by the error learning module, wherein the error information includes an error location and a text in which an error occurs at the error location. The error position is the error correction position, and the text with errors in the error position is the error correction information.
In one implementation of the step 101, the step of identifying the speech recognition text to obtain text semantics of the speech recognition text includes identifying the speech recognition text by a text semantics acquisition module to obtain text semantics of the speech recognition text, where the text semantics acquisition module performs word vector learning on an original text in a history sample and performs semantic learning based on word vectors.
In the embodiment of the application, the text semantic acquisition module is used for preprocessing the voice recognition text, converting the text into word vectors in a pre-training mode and calculating text semantics.
Specifically, the text semantic acquisition module may be composed of a text preprocessing unit, a word vector training unit and a text semantic calculation unit, including:
The text preprocessing unit can be used for preprocessing the input speech recognition text. The preprocessing may include performing word segmentation on the Chinese speech recognition text and removing function words and stop words from the segmented words, so that the speech recognition text is converted into a word set used for training word vectors (a minimal preprocessing sketch is given after the description of these units).
The word vector training unit can learn words based on BERT (Bidirectional Encoder Representations from Transformers) pre-training models and convert the words into word vectors with set dimensions.
The text semantic computation unit can further extract semantic features of the word vectors on the basis of word vector training to obtain deeper text semantics, and the obtained text semantics can be used as input of a text correction model for training the text correction model.
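A minimal preprocessing sketch consistent with the text preprocessing unit described above is given below; the use of the jieba tokenizer for Chinese word segmentation and a caller-supplied stop-word list are assumptions of this sketch, not requirements of the application.

import jieba  # assumed third-party library for Chinese word segmentation

def preprocess(asr_text: str, stopwords: set) -> list:
    # Segment the Chinese speech recognition text into words.
    words = jieba.lcut(asr_text)
    # Remove stop words and empty tokens, yielding the word set used for word vector training.
    return [w for w in words if w.strip() and w not in stopwords]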
A word vector is a vector representation obtained through training that converts the text into a form a computer can understand and serves as the input of the next unit. Text semantic computation refers to transforming the sparse word vector matrix into a dense matrix capable of conveying deep text semantics.
For example, the word vector representation can convert words in the text such as "natural language processing", "means", "through", "computer", "coming", "processing", "linguistic problem" into a vector of a fixed dimension, such as 100 dimensions, so that the speech recognition text can be converted into a word vector matrix of 100 x 7 size, and the text semantic computation can represent compressing the word vector matrix of 100 x 7 size into a text semantic matrix of 20 x 5 size.
Specifically, in the embodiment of the application, the voice recognition text can be converted into the word vector matrix by adopting the BERT pre-training model, and the text semantic is extracted through matrix dimension reduction, so that the text semantic matrix is finally obtained.
Fig. 2 is a schematic diagram of a BERT pre-training model according to an embodiment of the present application. In fig. 2, Trm denotes the Transformer, the core structure of the BERT pre-training model, which employs an encoder-decoder structure for converting an input matrix into an output matrix of a desired size, the dimensions of which are controlled by model parameters. In the word vector training stage, the input is a word sequence and the output is word vectors of a set dimension, with each word represented by a specific vector. In the text semantic calculation stage, the BERT pre-training model converts the word vector matrix into a text semantic matrix of a specific dimension, and the deep text semantics are represented in the form of a dense matrix.
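A hedged sketch of the word vector training and text semantic calculation stages using the Hugging Face transformers library follows; the checkpoint name "bert-base-chinese" and the linear projection used to obtain the dense text semantic matrix are assumptions made for illustration.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")
projection = torch.nn.Linear(bert.config.hidden_size, 20)       # learned in practice; 20 is the assumed semantic dimension

def text_semantics(words):
    # Word vector stage: encode the segmented words and take the token embeddings.
    inputs = tokenizer(" ".join(words), return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state.squeeze(0)    # (seq_len, hidden_size)
    # Text semantic stage: compress the word vector matrix into a dense, lower-dimensional matrix.
    return projection(hidden)                                   # (seq_len, 20)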
In one implementation of the step 101, the step of identifying the speech recognition text to obtain the domain information of the speech recognition text includes identifying the speech recognition text by a domain information obtaining module to obtain the domain information of the speech recognition text, where the domain information obtaining module performs domain learning on word vectors of original text in a history sample, and sets a domain weight for the domain word vectors of the original text based on the domain.
In the embodiment of the application, the function of the domain information acquisition module is to acquire the domain characteristics of the input text on the basis of a domain dictionary, so as to judge the domain type, and finally, the domain weight is added to the related words, wherein the added domain weight can be used for the subsequent text error correction model learning.
Specifically, the domain information obtaining module may be composed of a domain feature calculating unit, a domain determining unit, and a domain weight assigning unit, and includes:
The domain feature calculation unit can be used for learning the domain features of the text through the deep neural network and acquiring the tendency of the text to the specific domain.
The domain judging unit can be used for judging the domain type of the text according to the domain dictionary and the domain feature calculation result to acquire the domain type of the text.
The domain weight assignment unit is used for increasing the weight of the words in the specific domain according to the domain type of the text, wherein the updated weight of the words can be used for training a text error correction model, so that the text error correction model can make different error correction judgments for the specific domain.
For the domain judging unit: on one hand, the text word vectors can be used as the input of the domain feature calculation, and the domain features of the word vector matrix of the text can be extracted through a bidirectional long short-term memory network (BiLSTM) to obtain domain features in the form of a matrix of set dimensions. On the other hand, for different domains, the domain judging unit can obtain the final domain type of the text according to the domain dictionary and the calculated text domain features.
Fig. 3 is a schematic structural diagram of a BiLSTM model according to an embodiment of the present application. In fig. 3, the model input x_t is a word vector and the output o_t is the domain feature of the text; the domain feature is compared with the domain dictionary features, and the domain with the greatest similarity is determined as the domain type to which the text belongs. The BiLSTM model has bidirectional learning capability, and, trained in this way on the basis of the domain dictionary, the resulting model can make domain type decisions on speech recognition text.
For example, the word vector matrix of the speech recognition text "natural language processing means processing linguistic problems by a computer" is input into the BiLSTM model; the network learns the matrix from both directions and obtains its domain features. According to the domain dictionary, the domain to which the speech recognition text belongs is obtained as "artificial intelligence".
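The following is a minimal PyTorch sketch of the domain feature calculation and domain decision described above; the hidden size, the use of the final hidden states of both directions as the domain feature, and cosine similarity against precomputed domain dictionary features are assumptions of this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainFeatureNet(nn.Module):
    def __init__(self, emb_dim=100, hidden_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, word_vectors):                       # word_vectors: (1, seq_len, emb_dim)
        _, (h_n, _) = self.bilstm(word_vectors)
        # Concatenate the final hidden states of the forward and backward directions.
        return torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)   # (2 * hidden_dim,)

def decide_domain(domain_feature, domain_dict_features):
    # domain_dict_features: {"artificial intelligence": tensor, ...}, assumed precomputed per domain.
    scores = {name: F.cosine_similarity(domain_feature, feat, dim=0).item()
              for name, feat in domain_dict_features.items()}
    return max(scores, key=scores.get)                     # the domain with the greatest similarity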
Aiming at the domain weight assignment unit, according to the determined domain category and domain dictionary, the words in the text sequence are compared, and the weight of the words in the specific domain in the training process of the text correction model is increased, so that the text correction model focuses on the domain words in the original text in the training process, and further, the text generated in the actual text correction process can be ensured to meet the domain characteristics. In the above example, the "artificial intelligence" domain includes several related words in the domain dictionary, such as "deep learning", and so on, then during the training process of the text error correction model, the model focuses on the words in the "artificial intelligence" domain, and weights the words again, so that the model does not ignore the words in the "artificial intelligence" domain during the error location determination and error correction process.
In the general domain, word weights are determined by model training. During the training of the text error correction model, words in a specific domain are additionally weighted according to their importance in the text, so that their weight is larger than in the general domain. For example, the weight W'_{w_i} of the word w_i during training in a particular domain is determined by the following quantities: W_{w_i}, the term weight in the general domain; C_i, the number of occurrences of the term w_i in the text; the total number of words in the text; N_d, the total number of words in the input text of the text error correction model; and N_{d_i}, the number of occurrences of the word w_i in the input text. A weight parameter ensures that domain-specific words have a greater weight than in the general domain, with the magnitude of the weight determined by the importance of the word itself within the domain.
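Because the exact expression is not reproduced above, the following is one plausible form of the domain-specific weight that is consistent with the quantities just defined; the multiplicative way the two frequency ratios are combined and the symbol λ for the weight parameter are assumptions of this sketch, not the exact formula of the embodiment.

$$W'_{w_i} = W_{w_i}\left(1 + \lambda \cdot \frac{C_i}{\sum_{j} C_j} \cdot \frac{N_{d_i}}{N_d}\right), \qquad \lambda > 0$$

Under this reading, a word that occurs frequently both in the current text and in the input text of the model receives a weight strictly larger than its general-domain weight W_{w_i}, with the increase governed by λ and by the word's own importance in the domain.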
For example, for an original text in the history sample and the error correction text corresponding to that original text, the error information of the original text can be obtained by the error learning module, the text semantics of the original text can be obtained by the text semantic acquisition module, and the domain information of the original text can be obtained by the domain information acquisition module. Thus, for any original text in the history sample, three aspects of information are obtained: its error information, its text semantics and its domain information. These three aspects of information are then input into the initial text error correction model as model inputs; meanwhile, the error correction position in the error information is used as the label value for training the position sub-model in the initial text error correction model, and the error text at the error correction position in the error information (i.e., the error correction information) is used as the label value for training the error correction sub-model in the initial text error correction model, thereby training the initial text error correction model. If, after multiple rounds of training, the loss value of the text error correction model meets the set requirement, the text error correction model can be used to correct the speech recognition text output by the speech recognition system, so that correct speech recognition text is output to the user.
The position sub-model in the text error correction model meeting the training requirements can be used for determining the position where errors occur in the voice recognition text output by the voice recognition system, and the error correction sub-model in the text error correction model meeting the training requirements can be used for correcting the error text at the position where errors occur in the voice recognition text so as to form the correct voice recognition text for output.
Specifically, the application also comprises a text error correction model training module. The text error correction model training module can be used for generating model parameters with loss values meeting a loss threshold value through gradual iterative training, and a model corresponding to the model parameters is the text error correction model.
The text error correction model training module may be composed of a model parameter adjusting unit, an error position labeling unit and an error correction unit, and includes:
The model parameter adjusting unit can be used for acquiring error information, text semantics and domain weight corresponding to each voice recognition text, taking the error information, the text semantics and the domain weight as input information, and adjusting parameters of a text error correction model according to the input information.
The error position labeling unit can be used for training the error position by taking the error position in the error information of the voice recognition text as the label information based on the deep neural network, and the position sub-model formed after training can determine and label the position of the error in the voice recognition text.
The error correction unit can be used for correcting the error text at the position where the error occurs in the speech recognition text.
Fig. 4 is a schematic diagram of a training network of a location sub-model according to an embodiment of the present application. The application learns the voice recognition text through the deep neural network, and trains the network aiming at the error position marked in the voice recognition text. Referring to fig. 4, the location sub-model is a network model built based on an encoder-decoder, wherein the input of the model is a text semantic feature matrix and the output is an error location marker.
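A minimal sketch of the position sub-model treated as token-level sequence labeling is shown below; building it from a Transformer encoder with a per-token binary "error / not error" head is one assumed way to realize the error location marker output shown in fig. 4.

import torch
import torch.nn as nn

class PositionSubModel(nn.Module):
    def __init__(self, feat_dim=20, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 2)       # per-token logits: erroneous vs. not erroneous

    def forward(self, semantic_features):        # (batch, seq_len, feat_dim)
        encoded = self.encoder(semantic_features)
        return self.head(encoded)                # logits used to mark error locations

# Loss1 of this sub-model could then be a token-level cross-entropy against the labeled
# error positions (c_i, c_j) taken from the error information.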
In some implementations of the application, the error correction sub-model includes a first sub-model based on a generation pattern, a second sub-model based on a decision pattern, and an evaluation sub-model.
In the above example, in the training process of the text error correction model, for a speech recognition text, the position of the error in the speech recognition text can be obtained through the position sub-model in the text error correction model, and then the problem of how to correct the error text at the position of the error can be achieved through the following method.
In the present application, the error correction unit includes an error correction subunit based on a generation pattern, an error correction subunit based on a determination pattern, and an evaluation unit. The error correction subunit based on the generation mode and the error correction subunit based on the judgment mode can be used for correcting errors of the error text at the error position in the speech recognition text respectively to obtain error correction results, and the error correction results can be evaluated by an evaluation unit for the error correction results, wherein the evaluation unit can judge the matching degree of the generated error correction content and the original text context through the semantic similarity among vocabularies and the sentence-level semantic consistency, and select and generate a final error correction result.
The error correction subunit of the generation model is a first sub-model based on the generation mode, the error correction subunit of the judgment mode is a second sub-model based on the judgment mode, and the evaluation unit is an evaluation sub-model.
Specifically, the methods of using the generation mode-based error correction subunit and the decision mode-based error correction subunit will be described below, respectively:
(1) Error correction subunit based on generation pattern
Fig. 5 is a schematic diagram of an error correction subunit based on a generation mode according to an embodiment of the present application. In fig. 5, the error correction subunit based on the generation pattern is based on a sequence-to-sequence generation model (Seq 2 Seq), where xi is the encoder input sequence, contains text semantic features and error location information, and yi is the error correction text output by the decoder, i.e. the error correction text that replaces the error text in the error location.
For example, given that the error location information of the original text "natural language processing essentially processing linguistic problems by a computer" is (c_2, c_2), the error correction text "means" at the error location is directly generated through model learning, so that the corrected text "natural language processing means processing linguistic problems by a computer" is generated for the original text.
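The following is a hedged sketch of a Seq2Seq generation-based correction subunit; the use of GRU encoder and decoder layers, teacher forcing during training, and the vocabulary interface are assumptions about one possible realization, not the specific model of the embodiment.

import torch
import torch.nn as nn

class Seq2SeqCorrector(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # src_ids encodes x_i (text semantic features plus error location information as token ids);
        # tgt_ids is the reference correction y_i, used here for teacher forcing during training.
        _, state = self.encoder(self.embed(src_ids))
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)   # logits over the vocabulary for each output position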
(2) Error correction subunit based on decision mode
The error correction subunit based on the judging mode adopts a traditional N-gram model. First, the binary phrases (bigrams) and ternary phrases (trigrams) in the input sequence are obtained by traversal; then, the error rate of the phrases at the error position is calculated using bi-gram and tri-gram models; finally, the confusion degree (perplexity) of all semantically similar phrases is calculated according to the text and the domain dictionary, and the word with the largest confusion degree is selected as the correct word after error correction. The confusion degree of the word w_i is calculated as

PP(w_i) = p(w_1 w_2 … w_i … w_l)^(-1/l),

where S = w_1 w_2 … w_i … w_l is the sentence in which the word w_i is located, l is the number of words in the sentence, and p(w_1 w_2 … w_i … w_l) is the probability of the sentence S.
For example, given that the error location information of the original text "natural language processing essentially processing linguistic problems by a computer" is (c_2, c_2), the N-gram model obtains the two-word and three-word phrases in the text in units of words, calculates the confusion degree of the words in the domain dictionary in turn, obtains the word with the highest confusion degree, "means", and generates the corrected text "natural language processing means processing linguistic problems by a computer" for the original text.
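A minimal sketch of the decision-mode subunit follows: each semantically similar candidate from the domain dictionary is substituted at the error position and the resulting sentence is scored by the perplexity defined above. Estimating the sentence probability from bigram counts with add-one smoothing, and keeping the candidate whose sentence is most probable (lowest perplexity), are assumptions of this sketch.

import math
from collections import Counter

def sentence_perplexity(words, unigrams: Counter, bigrams: Counter, vocab_size: int) -> float:
    # Bigram estimate of the sentence probability with add-one smoothing (an assumption).
    log_p = 0.0
    for prev, cur in zip(words[:-1], words[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_p += math.log(p)
    return math.exp(-log_p / len(words))          # PP(S) = p(S)^(-1/l)

def correct_by_ngram(words, err_pos, candidates, unigrams, bigrams, vocab_size):
    # Substitute each candidate from the domain dictionary at the error position and score it.
    def ppl(cand):
        trial = words[:err_pos] + [cand] + words[err_pos + 1:]
        return sentence_perplexity(trial, unigrams, bigrams, vocab_size)
    return min(candidates, key=ppl)               # assumed: the most fluent candidate is the correction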
Finally, the correction text (i.e., error correction text) generated by the two error correction subunits is simultaneously transferred to the next unit, i.e., evaluation unit.
For a speech recognition text in a history sample, an error correction result can be generated by the error correction subunit based on the generation pattern, and an error correction result can be generated by the error correction subunit based on the decision pattern, with respect to which two error correction results it can be further decided by the evaluation unit in the error correction unit. The function of the evaluation unit will be described next:
When the evaluation unit accepts or rejects the error correction results generated by the two error correction subunits, its evaluation indexes comprise the semantic similarity between the generated error correction word and the other words in the text sentence, and the semantic consistency of the sentences. The semantic similarity between words is the semantic distance of the words, which the application represents by the Euclidean distance between word vectors. The semantic consistency of sentences refers to the distance between sentences in the text, which the application also calculates using the Euclidean distance. Finally, the two indexes are added to obtain the final Score of an error correction result.
In the Score calculation, w_i is the error correction term generated by the model, w_j denotes the other terms in the sentence where w_i is located, l is the number of terms in that sentence, s_i denotes a sentence in the text, and n is the number of sentences in the text.
When the Euclidean distance is adopted for calculation, the smaller the error correction result score is, the more the error correction words conform to the original text semantics, the stronger the text semantics after error correction are consistent, so that the two error correction results are scored through the evaluation unit, and the evaluation unit can select the error correction result with small score as the output of the final text error correction model.
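The scoring of the evaluation unit can be sketched as follows, using Euclidean distances between the vector of the generated correction word and the other word vectors in its sentence, plus Euclidean distances between the corrected sentence vector and the other sentence vectors in the text; averaging each group of distances before adding the two terms is an assumption of this sketch.

import numpy as np

def correction_score(corr_word_vec, other_word_vecs, corr_sent_vec, other_sent_vecs):
    # Word-level term: mean Euclidean distance to the other words in the sentence.
    word_term = np.mean([np.linalg.norm(corr_word_vec - w) for w in other_word_vecs])
    # Sentence-level term: mean Euclidean distance to the other sentences in the text.
    sent_term = np.mean([np.linalg.norm(corr_sent_vec - s) for s in other_sent_vecs])
    return word_term + sent_term   # a lower Score means the correction better fits the original semantics

# The evaluation sub-model computes this Score for both error correction results and keeps
# the result with the lower Score as the corrected speech recognition text.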
In the above example, the two error correction subunits generate two error correction results, for example, the error correction subunit based on the determination mode outputs "natural language processing to process linguistic problems by a computer", the error correction subunit based on the generation mode outputs "natural language processing to process linguistic problems by a computer", and after the scores of the two error correction results are calculated by using the above evaluation method, it is assumed that the score in which the "natural language processing output by the error correction subunit based on the generation mode refers to process linguistic problems by a computer" is low, so that the error correction result of the error correction subunit based on the generation mode can be determined as the final model output in the embodiment of the present application.
Further, the error correction unit includes a training decision unit and an optimal model acquisition unit in addition to the error correction subunit based on the generation mode, the error correction subunit based on the decision mode, and the evaluation unit, and the following contents are:
And the training decision unit can be used for judging the training effect of the text error correction model and deciding whether to continue training. And stopping training the model when the training loss of the model is smaller than the set loss threshold value, determining to generate optimal model parameters, otherwise, continuing training the model.
The optimal model obtaining unit can be used for obtaining parameters of the optimal model obtained after training, storing and outputting, namely the optimal model obtaining unit can put the obtained optimal model into use and is used for identifying the correctness of the voice recognition text output by the voice recognition system and correcting the voice recognition text identified as the error, so that the correct voice recognition text of the error is generated for the error voice recognition text.
Specifically, whether the text correction model needs to be trained continuously or not may be determined by the training decision unit, and when the loss of training of the text correction model reaches or is smaller than a certain set value, training of the text correction model may be stopped. The Loss of the text error correction model comprises a Loss1 generated by the position sub model and a Loss2 generated by the error correction sub model, and the calculation method is shown in the following formula.
Loss = Loss1 + Loss2
In the loss calculation, {x_1, x_2, …, x_N} represents the error position output of the model, {y_1, y_2, …, y_N} represents the error correction output of the model, and γ ≥ 0. A weighting parameter in Loss2 can balance the weights of the positive and negative samples, so that the model pays attention to difficult samples, namely errors in the text that are hard to correct.
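The detailed expressions for Loss1 and Loss2 are not reproduced above; the following is one form consistent with the description, namely a position loss over the error-position outputs plus a focal-style correction loss in which γ ≥ 0 focuses training on hard samples and a parameter α balances positive and negative samples. The functional forms and the symbol α are assumptions of this sketch.

$$\mathrm{Loss}_1 = -\frac{1}{N}\sum_{i=1}^{N}\log p(x_i), \qquad \mathrm{Loss}_2 = -\frac{1}{N}\sum_{i=1}^{N}\alpha\,\bigl(1 - p(y_i)\bigr)^{\gamma}\log p(y_i), \qquad \mathrm{Loss} = \mathrm{Loss}_1 + \mathrm{Loss}_2$$

Here p(x_i) is the probability the position sub-model assigns to the labeled error position of the i-th sample, and p(y_i) is the probability the error correction sub-model assigns to the labeled correction.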
Based on the training process of the text error correction model, the text error correction model with the loss value meeting the loss threshold value, namely the optimal text error correction model, can be used for detecting the correctness of the speech recognition text output by the speech recognition system.
In some implementations of the present application, the correcting the error text at the error correction location through the error correction sub-model in the text error correction model to obtain a corrected speech recognition text includes correcting the error text at the error correction location through the first sub-model to obtain a first error correction result, correcting the error text at the error correction location through the second sub-model to obtain a second error correction result, and evaluating the first error correction result and the second error correction result through the evaluation sub-model, wherein the error correction result satisfying the evaluation result is used as the corrected speech recognition text.
In some implementations of the application, the evaluating the first error correction result and the second error correction result, taking the error correction result meeting the evaluation result as an error corrected voice recognition text, comprises determining a first score based on word semantic distance between the first error correction result and each word vector in the voice recognition text and sentence semantic distance between the first error correction result and the voice recognition text, determining a second score based on word semantic distance between the second error correction result and each word vector in the voice recognition text and sentence semantic distance between the second error correction result and the voice recognition text, and taking the error correction result with low score in the first score and the second score as the error corrected voice recognition text.
In one implementation of step 101 above, the speech recognition system is any of a variety of speech recognition systems.
The application can perform error learning and memorization for the speech recognition system, so it has a mechanism for cooperating with various speech recognition systems. Referring to fig. 6, a schematic diagram of the cooperation mechanism provided by an embodiment of the present application, ASRn in fig. 6 represents a speech recognition system. For example, when the text error correction device of the present application works together with ASR1, the text error correction device learns from the corpus of ASR1, obtains the error information of the text output by that speech recognition system, learns and memorizes it, and trains a text error correction model; the trained text error correction model can then perform text error correction adapted to the characteristics of ASR1. When the error correction device needs to be adapted to other speech recognition systems, the above steps are repeated, which will not be described again.
Based on the same concept, an embodiment of the present application provides a text error correction apparatus, as shown in fig. 7, which is a schematic diagram of the text error correction apparatus provided in the embodiment of the present application, where the apparatus includes an obtaining unit 701 and an error correction unit 702;
an obtaining unit 701, configured to identify a speech recognition text based on the speech recognition text output by the speech recognition system, and obtain error information of the speech recognition text, text semantics of the speech recognition text, and domain information of the speech recognition text, respectively;
The error correction unit 702 is configured to obtain an error correction position of the speech recognition text through a position sub-model in a text error correction model based on the error information, the text semantics and the domain information, correct the error text in the error correction position through the error correction sub-model in the text error correction model, and obtain an error corrected speech recognition text, where the text error correction model is obtained by training a first loss value of the position sub-model and a second loss value of the error correction sub-model.
Further, for the device, the obtaining unit 701 is specifically configured to identify, by an error learning module, the speech recognition text to obtain error information of the speech recognition text, and the error learning module is configured to compare an original text and an error correction text in a history sample, and generate and store each error information.
Further, for the device, the obtaining unit 701 is specifically configured to identify the speech recognition text by using a text semantic obtaining module to obtain text semantics of the speech recognition text, where the text semantic obtaining module is obtained by performing word vector learning on an original text in a history sample and performing semantic learning based on the word vector.
Further, for the device, the obtaining unit 701 is specifically configured to identify the speech recognition text by using a domain information obtaining module to obtain domain information of the speech recognition text, where the domain information obtaining module performs domain learning on a word vector of an original text in a history sample, and sets a domain weight for the domain word vector of the original text based on a domain.
Further, the device further comprises a text error correction model training unit 703, wherein the text error correction model training unit 703 is used for obtaining error information of the original text through the error learning module according to the original text and the error correction text in the history sample, the error information comprises error correction positions and error correction information, obtaining text semantics of the original text through the text semantic acquisition module according to the original text in the history sample, obtaining field information of the original text through the field information acquisition module according to the original text in the history sample, taking the error information of the original text, the text semantics of the original text and the field information of the original text as input values of the text error correction model, taking the error correction positions of the error information as label values of the position sub-model, taking the error correction information of the error information as label values of the error correction sub-model, and training the text error correction model.
Further, for the device, the error correction sub-model comprises a first sub-model based on a generation mode, a second sub-model based on a judgment mode and an evaluation sub-model, wherein the text error correction model training unit 703 is specifically configured to perform error correction on the error text at the error correction position through the first sub-model to obtain a first error correction result, perform error correction on the error text at the error correction position through the second sub-model to obtain a second error correction result, perform evaluation on the first error correction result and the second error correction result through the evaluation sub-model, and take the error correction result meeting the evaluation result as an error corrected voice recognition text.
Further, with respect to the device, a text correction model training unit 703 is specifically configured to determine a first score based on a word semantic distance between the first correction result and each word vector in the speech recognition text and a sentence semantic distance between the first correction result and the speech recognition text, determine a second score based on a word semantic distance between the second correction result and each word vector in the speech recognition text and a sentence semantic distance between the second correction result and the speech recognition text, and use an error correction result with a low score in the first score and the second score as the speech recognition text after error correction.
The embodiment of the application also provides a computing device, which can be a desktop computer, a portable computer, a smart phone, a tablet personal computer, a personal digital assistant (Personal Digital Assistant, PDA) and the like. The computing device may include a central processing unit (Central Processing Unit, CPU), memory, input/output devices, etc.; the input devices may include a keyboard, mouse, touch screen, etc., and the output devices may include a display device, such as a liquid crystal display (Liquid Crystal Display, LCD), a cathode ray tube (Cathode Ray Tube, CRT), etc.
Memory, which may include Read Only Memory (ROM) and Random Access Memory (RAM), provides program instructions and data stored in the memory to the processor. In an embodiment of the present application, the memory may be used to store program instructions of a text error correction method;
And the processor is used for calling the program instructions stored in the memory and executing a text error correction method according to the obtained program.
Referring to fig. 8, a schematic diagram of a computing device according to an embodiment of the present application is provided, where the computing device includes:
processor 801, memory 802, transceiver 803, bus interface 804, wherein processor 801, memory 802 and transceiver 803 are connected by bus 805;
The processor 801 is configured to read the program in the memory 802, and execute the text error correction method described above;
The processor 801 may be a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), or a combination of a CPU and an NP. It may also be a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (Programmable Logic Device, PLD), or a combination thereof. The PLD may be a complex programmable logic device (Complex Programmable Logic Device, CPLD for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short), a generic array logic (Generic Array Logic, GAL for short), or any combination thereof.
The memory 802, configured to store one or more executable programs, may store data used by the processor 801 in performing operations.
In particular, the program may include program code including computer operating instructions. The memory 802 may include volatile memory, such as random-access memory (RAM), and non-volatile memory, such as flash memory, a hard disk drive (HDD) or a solid-state disk (SSD); the memory 802 may also include a combination of the above types of memory.
Memory 802 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof:
The operation instructions comprise various operation instructions for realizing various operations.
Operating system-including various system programs for implementing various basic services and handling hardware-based tasks.
Bus 805 may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean there is only one bus or one type of bus.
The bus interface 804 may be a wired communication interface, a wireless bus interface, or a combination thereof, wherein the wired bus interface may be, for example, an ethernet interface. The ethernet interface may be an optical interface, an electrical interface, or a combination thereof. The wireless bus interface may be a WLAN interface.
Embodiments of the present application also provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform a text error correction method.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, or as a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.