CN112767921A - Voice recognition self-adaption method and system based on cache language model - Google Patents

Voice recognition self-adaption method and system based on cache language model

Info

Publication number
CN112767921A
Authority
CN
China
Prior art keywords
language model
recognition
local
cache
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110018749.9A
Other languages
Chinese (zh)
Inventor
张宏达
胡若云
沈然
黄俊杰
丁麒
盛琦慧
陈金威
熊剑峰
丁莹
姜伟昊
丁丹翔
李一夫
陈哲乾
赵洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yizhi Intelligent Technology Co ltd
Zhejiang University ZJU
State Grid Zhejiang Electric Power Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Zhejiang University ZJU
State Grid Zhejiang Electric Power Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd, Zhejiang University ZJU, State Grid Zhejiang Electric Power Co Ltd, Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Priority to CN202110018749.9A
Publication of CN112767921A
Legal status: Pending (current)

Abstract

The invention belongs to the field of speech recognition, and in particular relates to a speech recognition adaptation method and system based on a cached language model. The method comprises: segmenting a continuous long voice into a plurality of short voices and forming a task queue in time order; obtaining recognition text through a dynamic language model; for the recognition text of each short voice, judging in real time whether probability correction is required, and if so, performing a keyword search against a preset associated word list to obtain a keyword group, calculating a local vocabulary probability distribution, and constructing a local language model; and interpolating and merging the local language model with the dynamic language model to obtain an updated dynamic language model. By searching for keywords according to the preset associated word list, obtaining the keyword group, calculating the local vocabulary probability distribution, constructing the local language model, and interpolating and merging it with the dynamic language model, the invention obtains an updated dynamic language model and thereby improves recognition accuracy.

Description

Voice recognition self-adaption method and system based on cache language model
Technical Field
The invention belongs to the field of voice recognition, and particularly relates to a voice recognition self-adaption method and system based on a cache language model.
Background
After decades of development, speech recognition has become a mature technology; in practical applications, systems such as Siri and Cortana achieve high recognition accuracy under ideal conditions.
The performance of a speech recognition system depends to a large extent on the similarity between the language model (LM) used and the task to be processed. This similarity is particularly important when the statistical properties of the language change over time, such as in application scenarios involving spontaneous and multi-domain speech. Topic identification (TI) based on information retrieval is a key technology here: the topic under discussion is obtained through semantic analysis of the historical recognition results, the language model is adjusted accordingly, and dynamic self-adaptation is realized.
A problem with topic identification is that individual low-frequency words may cause unnecessary changes to the language model because of the apparent domain characteristics they carry. On the speech-signal-processing side, current speech recognition systems mainly perform single-sentence task recognition: no matter how long the input speech is, the system recognizes each single sentence, delimited by the voice activity detection (VAD) result, as an independent task. This has the advantage of better recognition timeliness and reduces system overhead to some extent.
For scenarios with strong contextual dependence or a specialized field, such as academic conferences and interview recordings, a single-sentence recognition system ignores the context, repeats the same mistakes on words it recognizes inaccurately, and cannot exploit domain information to recognize low-frequency words. On the other hand, a speech recognition system configured with multiple domain language models generally requires the domain model to be specified manually before recognition starts, or requires the outputs of several domain models to be compared and selected (for example by perplexity), which adds unnecessary steps and makes the recognition system insufficiently intelligent.
Disclosure of Invention
The invention aims to provide a speech recognition self-adaption method and a speech recognition self-adaption system based on a cache language model.
In order to solve the technical problems, the invention adopts the following technical scheme: a speech recognition adaptive method based on a cache language model comprises the following steps:
s1: dividing a section of continuous long voice to obtain a plurality of short voices, and forming a task queue according to a time sequence;
s2: the first short voice in the task queue is taken to obtain a recognition text through a dynamic language model, and that short voice is deleted from the task queue;
s3: establishing a cache model, judging whether probability correction is needed or not in real time according to the recognition text of each short voice, if not, returning to S2 until the task queue is empty, and completing the recognition task; if yes, performing keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, calculating local vocabulary probability distribution, and constructing a local language model;
s4: and carrying out interpolation combination on the local language model and the dynamic language model to obtain an updated dynamic language model, returning to S2 until the task queue is empty, and completing the identification task.
Preferably, the method further comprises the following steps:
s5: the recognized text in S2 is manually corrected.
Preferably, the step S3 specifically includes:
s301: establishing a cache model which comprises a plurality of cache regions;
s302: judging whether low-frequency words or strong-field feature words exist or not in real time according to the recognition text of each short voice, if so, performing probability correction, entering S303, if not, performing no probability correction, returning to S2 until the task queue is empty, and completing the recognition task;
s303: performing keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, and calculating the probability distribution of local vocabularies, wherein the calculation formula is as follows:
P_cache(ω_t | ω_1, …, ω_{t-1}) ∝ Σ_i 1{ω_t = ω_{i+1}} · K(θ(h_t, h_i))
where the sum runs over the pairs (h_i, ω_{i+1}) stored in the cache region; ω_1, …, ω_{t-1} denotes the historical recognized text sequence, ω_i the i-th word of that sequence, and ω_t the word at the current time; P_cache(ω_t | ω_1, …, ω_{t-1}) is the cache-based probability that the current word is ω_t; K denotes a kernel function; h_t and h_i are the hidden vectors corresponding to ω_t and ω_i respectively; θ(·,·) denotes the Euclidean distance between hidden vectors, ‖·‖ denotes the vector (modular) length, and ∝ denotes proportionality;
s304: and constructing a local language model based on the 3-gram according to the local vocabulary probability distribution.
Preferably, when the associated word groups stored in the cache region of the cache model reach a threshold on the order of one million entries, the local vocabulary probability distribution calculation formula in step S303 is replaced with the following formula:
P_cache(ω_t | ω_1, …, ω_{t-1}) ∝ Σ_{h_i ∈ N(h_t)} 1{ω_t = ω_{i+1}} · K(‖h_t − h_i‖ / θ(h_t))
where N(h_t) is the set of cached hidden vectors h_i with the shortest Euclidean distance to h_t, and θ(h_t) is the Euclidean distance from h_t to its k-th nearest neighboring word.
Preferably, the step S4 specifically includes:
s401: and carrying out interpolation combination on the local language model and the dynamic language model, and correcting the probability of the local words by the following formula:
P(ω_t | ω_1, …, ω_{t-1}) = (1 − λ)·P_model(ω_t | ω_1, …, ω_{t-1}) + λ·P_cache(ω_t | ω_1, …, ω_{t-1})
where λ is the tuning (interpolation) parameter and P_model(ω_t | ω_1, …, ω_{t-1}) is the probability assigned by the existing language model to the current word ω_t given the preceding words ω_1, …, ω_{t-1};
s402: and obtaining an updated dynamic language model through the corrected word probability, and returning to the step S2 until the task queue is empty, thereby completing the recognition task.
A speech recognition adaptive system based on a cache language model is used for realizing the speech recognition adaptive method, and comprises the following steps:
the voice signal receiving module is used for receiving a continuous voice signal to be recognized;
the voice signal segmentation module is used for segmenting a section of continuous long voice acquired by the voice signal receiving module into a plurality of short voices;
the storage module is used for storing a task queue to be recognized, a recognized text result and a related word corpus, wherein the task queue is formed by a plurality of short voices according to a time sequence;
the ASR decoding module comprises a task reading unit, a dynamic language model and a recognition result output unit; the task reading unit is used for reading the first short voice in the task queue of the storage module as the input of the dynamic language model, the dynamic language model is used for recognizing the voice into text, and the recognition result output unit is used for outputting the recognized text to the storage module;
the probability correction judging module is used for judging whether the recognition text output by the ASR decoding module needs probability correction in real time, if so, sending a starting signal to the related word searching module, and if not, sending a signal to the ASR decoding module to start the recognition of the next short voice;
the related word searching module is used for searching keywords according to a related word corpus preset in the storage module to obtain related word groups of the historical recognition words;
and the language model correction module is used for updating parameters of the dynamic language model in the ASR decoding module according to the real-time associated phrases and the historical associated phrases obtained by the associated word searching module.
Preferably, the language model modification module includes:
the storage area is used for storing the related phrases output by the related word searching module;
the local probability calculation unit is used for calculating the local vocabulary probability distribution corresponding to the associated phrases at the latest moment in the storage area;
and the language model correction unit is used for constructing a local language model according to the local vocabulary probability distribution, interpolating and combining the local language model and the dynamic language submodel in the ASR decoding module, updating the automatic speech recognition model, sending a signal to the ASR decoding module after updating, and starting the recognition of the next short speech.
Preferably, the method further comprises the following steps:
and the manual correction module is used for manually correcting the recognition text output by the ASR decoding module.
Preferably, the calculating the local vocabulary probability distribution corresponding to the associated phrase at the latest time in the storage area includes:
performing keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, and calculating the probability distribution of local vocabularies, wherein the calculation formula is as follows:
P_cache(ω_t | ω_1, …, ω_{t-1}) ∝ Σ_i 1{ω_t = ω_{i+1}} · K(θ(h_t, h_i))
where the sum runs over the pairs (h_i, ω_{i+1}) stored in the cache region; ω_1, …, ω_{t-1} denotes the historical recognized text sequence, ω_i the i-th word of that sequence, and ω_t the word at the current time; P_cache(ω_t | ω_1, …, ω_{t-1}) is the cache-based probability that the current word is ω_t; K denotes a kernel function; h_t and h_i are the hidden vectors corresponding to ω_t and ω_i respectively; θ(·,·) denotes the Euclidean distance between hidden vectors, ‖·‖ denotes the vector (modular) length, and ∝ denotes proportionality.
Preferably, the interpolating and merging the local language model and the dynamic language submodel in the ASR decoding module, and updating the automatic speech recognition model includes:
and carrying out interpolation combination on the local language model and the dynamic language model, and correcting the probability of the local words by the following formula:
P(ω_t | ω_1, …, ω_{t-1}) = (1 − λ)·P_model(ω_t | ω_1, …, ω_{t-1}) + λ·P_cache(ω_t | ω_1, …, ω_{t-1})
where λ is the tuning (interpolation) parameter and P_model(ω_t | ω_1, …, ω_{t-1}) is the probability assigned by the existing language model to the current word ω_t given the preceding words ω_1, …, ω_{t-1};
and obtaining the updated dynamic language model through the corrected word probability until the task queue is empty, and completing the recognition task.
The technical scheme adopted by the invention has the following beneficial effects:
(1) Keywords are searched according to a preset associated word list to obtain a keyword group, the keyword group is stored in a cache region of the cache model, the local vocabulary probability distribution is calculated, a local language model is constructed, and the local language model is interpolated and merged with the dynamic language model to obtain an updated dynamic language model, thereby improving recognition accuracy.
(2) A related word list is constructed; after historical cache information is obtained, the list is searched to obtain associated word groups, a local language model is constructed from these associated word groups, and it is interpolated and merged with the original dynamic language model for subsequent recognition. Adding the associated word groups improves the domain relevance of the dynamic language model and helps recognition tasks with strong domain characteristics to be recognized accurately.
(3) For long continuous recording-and-transcription tasks, a traditional speech recognition system cannot intervene in the recognition results during recognition, so certain low-frequency words are repeatedly misrecognized. The manual correction module designed in the invention allows a user to manually correct historical recognition results, which raises the probability of the corresponding low-frequency words in the cache language model, prevents those words from being repeatedly misrecognized, and improves the accuracy of subsequent recognition.
The following detailed description of the present invention will be provided in conjunction with the accompanying drawings.
Drawings
The invention is further described with reference to the accompanying drawings and the detailed description below:
FIG. 1 is a flow chart of a speech recognition adaptive method based on a cache language model according to the present invention;
FIG. 2 is a schematic flowchart illustrating step S3 of the speech recognition adaptive method based on the cached language model according to the present invention;
FIG. 3 is a schematic flowchart illustrating step S4 of the speech recognition adaptive method based on the cached language model according to the present invention;
FIG. 4 is a schematic flowchart illustrating step S5 of the speech recognition adaptive method based on the cached language model according to the present invention;
FIG. 5 is a schematic structural diagram of a speech recognition adaptive system based on a cache language model according to the present invention;
FIG. 6 is a schematic structural diagram of an ASR decoding module in the speech recognition adaptive system based on the cache language model according to the present invention;
FIG. 7 is a schematic structural diagram of a language model modification module in a speech recognition adaptive system based on a cache language model according to the present invention;
FIG. 8 is a schematic structural diagram of another speech recognition adaptive system based on a cache language model according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The basic idea of the invention is to receive a continuous voice signal input by a user, divide the continuous voice signal into a plurality of short voices based on voice activity detection (VAD), recognize the short voices in sequence with a dynamic language model and generate a corresponding recognition result for each short voice, obtain an associated word list based on keyword search, process the associated words with an extended cache model based on a recurrent network (RNN) to obtain a local language model that adapts to local changes in the distribution of the historical recognition text, and interpolate and merge the local language model with the dynamic language model to obtain an updated dynamic language model. After this processing, the dynamic language model matches the speech content better, and the accuracy of recognizing the continuous long voice is improved.
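For orientation only, the following is a minimal Python sketch of this adaptive loop; every helper passed in (segment_by_vad, recognize, needs_correction, search_associated_words, build_local_lm, interpolate_lms) is a hypothetical placeholder for the corresponding component described in the embodiments below, not an implementation taken from the patent.

```python
from collections import deque

def adaptive_recognition(long_speech, dynamic_lm, associated_index,
                         segment_by_vad, recognize, needs_correction,
                         search_associated_words, build_local_lm,
                         interpolate_lms, lam=0.25):
    """Outline of the cache-language-model adaptation loop (steps S1-S4)."""
    task_queue = deque(segment_by_vad(long_speech))    # S1: short voices in time order
    transcripts, cache = [], []
    while task_queue:
        short_voice = task_queue.popleft()             # S2: take and remove the first short voice
        text = recognize(short_voice, dynamic_lm)
        transcripts.append(text)
        keywords = needs_correction(text, dynamic_lm)  # S3: low-frequency / strong-domain words?
        if keywords:
            group = search_associated_words(keywords, associated_index)
            cache.extend(group)                        # store the associated word group
            local_lm = build_local_lm(cache)           # local language model from cache statistics
            dynamic_lm = interpolate_lms(dynamic_lm, local_lm, lam)  # S4: interpolation merge
    return transcripts
```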
Based on the foregoing basic idea, an embodiment of the present invention provides a speech recognition adaptive method based on a cache language model, which is shown in fig. 1 and includes the following steps:
s1: and a plurality of short voices are obtained by segmenting a continuous long voice, and a task queue is formed according to the time sequence.
First, a continuous speech signal input by a user is received. In the existing implementation, the user inputs speech through the recognition entrance either by uploading an offline recording or by clicking a recording button, and then clicks to end the recording and finish the input. The system performs subsequent recognition processing on the received voice signal.
The continuous voice signal is segmented into a plurality of short voices based on voice activity detection (VAD). Specifically, each frame of the continuous voice signal is classified against a preset silence model using a deep learning algorithm to detect silence frames; frames reaching a preset long-silence threshold are used as segmentation points to divide the continuous voice signal into a plurality of effective speech segments, so that the speech is segmented and the short voices can be recognized separately.
The short voices obtained from the VAD segmentation are added to the task queue; if the task queue is not empty, the task voices are processed in sequence; otherwise, the process ends.
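As an illustrative sketch only (not the patent's implementation), the segmentation and queueing step can look as follows; `is_silent_frame` is a hypothetical stand-in for the deep-learning silence model described above, and the frame and silence lengths are assumed values.

```python
from collections import deque

FRAME_MS = 10                 # assumed frame shift
LONG_SILENCE_MS = 500         # assumed "long silence" threshold

def split_into_short_voices(frames, is_silent_frame):
    """Cut a long utterance (sequence of frames) at silence runs that reach
    the long-silence threshold, returning the short voices as a FIFO task queue."""
    min_silent_frames = LONG_SILENCE_MS // FRAME_MS
    task_queue = deque()
    current, silent_run = [], 0
    for frame in frames:
        if is_silent_frame(frame):
            silent_run += 1
            # a sufficiently long silence closes the current effective speech segment
            if silent_run >= min_silent_frames and current:
                task_queue.append(current)
                current = []
        else:
            silent_run = 0
            current.append(frame)
    if current:                                # trailing speech without a closing silence
        task_queue.append(current)
    return task_queue                          # segments kept in time order
```

The recognizer then repeatedly takes `task_queue.popleft()` until the queue is empty, which matches the processing order described above.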
S2: and taking a first phrase voice in the task queue, obtaining an identification text through a dynamic language model, and deleting the phrase voice from the task queue.
Specifically, the initial dynamic language model can be a general language model or a preset domain model, the recognition result is input into a cache model of a nonparametric method, and after the cache model processes the existing historical recognition records and associated words, the dynamic language model is modified for recognition.
Specifically, a dynamic language model constructed on a big data text is used as a background language model, basic functional support is provided for a speech recognition system, or a corresponding domain language model is directly used when a speech domain can be predicted.
S3: establishing a cache model, judging whether probability correction is needed or not in real time according to the recognition text of each short voice, if not, returning to S2 until the task queue is empty, and completing the recognition task; if yes, performing keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, calculating local vocabulary probability distribution, and constructing a local language model.
As shown in fig. 2, step S3 specifically includes the following steps:
s301: and establishing a cache model which comprises a plurality of cache regions.
S302: and judging whether low-frequency words or strong-field characteristic words exist or not in real time according to the recognition text of each short voice, if so, performing probability correction, entering S303, if not, performing no probability correction, returning to S2 until the task queue is empty, and completing the recognition task.
Specifically, the method of the present invention is directed to low frequency words/strong domain feature words in the history text, that is, for the obtained history recognition text, if the observation probability of the history recognition text in the dynamic language model is higher than a threshold, the subsequent probability correction is not performed.
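A minimal sketch of this trigger, under the assumption that the dynamic language model exposes a per-word log probability; the -4.75 threshold quoted in the experiments below is used as an illustrative default, and `lm_logprob` is a hypothetical accessor.

```python
LOW_FREQ_LOGPROB_THRESHOLD = -4.75   # threshold used in the experiments reported below

def words_needing_correction(recognized_words, lm_logprob,
                             threshold=LOW_FREQ_LOGPROB_THRESHOLD):
    """Return the recognized words whose observation (log) probability in the
    dynamic language model falls below the threshold; probability correction
    is triggered only when this list is non-empty."""
    return [w for w in recognized_words if lm_logprob(w) < threshold]
```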
S303: and searching keywords according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, and calculating the probability distribution of local vocabularies.
And searching in a preset associated word list to obtain an associated word group by taking the word which is judged to need probability correction as a keyword. In order to solve the problem that the number of the cached associated phrases is increased quickly after a large number of recognition tasks, the associated phrases are input into a cache model after pruning.
Specifically, a related word list is constructed based on pronunciation similarity, semantic field similarity and logic correlation, and an inverted index and product quantification are established for the word list in preprocessing, wherein the inverted index is used for recording all words in a corpus, and takes words or field subjects as main keys, and each main key points to a series of connected word groups, so that rapid keyword retrieval and low memory occupation are realized.
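A minimal sketch of the inverted-index lookup just described, assuming the associated-word corpus is already available as (primary key, related word) pairs; product quantization and persistence are omitted.

```python
from collections import defaultdict

def build_inverted_index(associated_word_corpus):
    """associated_word_corpus: iterable of (primary_key, related_word) pairs,
    where the primary key is a word or a domain topic. Returns the inverted index."""
    index = defaultdict(set)
    for key, related in associated_word_corpus:
        index[key].add(related)
    return index

def search_associated_words(keywords, index):
    """Collect the associated word group for the keywords that triggered
    probability correction (keys absent from the index contribute nothing)."""
    group = set()
    for kw in keywords:
        group |= index.get(kw, set())
    return group
```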
In particular, according to the locality principle, a word that has been used recently has a higher probability of being used again, and words correlated with recently used words also have a higher probability of being used. In the present invention, a recurrent network designed specifically for sequence modeling is used; at each time step, the network encodes the historical recognition information and represents it as a time-dependent hidden vector h_t.
The prediction probability based on this representation is:
P(ω_t | ω_1, …, ω_{t-1}) ∝ exp(o_{ω_t}ᵀ · h_t)
where o_ω is the column of the output coefficient matrix corresponding to word ω, and ∝ denotes proportionality.
The hidden vector h_t is updated by the following recursive formula:
h_t = Φ(x_t, h_{t-1})
The update function Φ takes different forms in different network structures; the Elman network is used in the invention, and the update function Φ is defined as:
h_t = σ(L·x_t + R·h_{t-1})
where σ is the non-linear function tanh, L is the word embedding matrix, and R is the recursive (recurrent) weight matrix.
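A numpy sketch of the Elman recurrence and the output-layer prediction above; the dimensions, the random initialization and the example word ids are assumptions for illustration, not values taken from the patent.

```python
import numpy as np

d, vocab_size = 256, 5000                          # assumed hidden size and vocabulary size
rng = np.random.default_rng(0)
L = rng.normal(scale=0.01, size=(d, vocab_size))   # word embedding matrix
R = rng.normal(scale=0.01, size=(d, d))            # recursive (recurrent) matrix
O = rng.normal(scale=0.01, size=(d, vocab_size))   # output coefficient matrix

def elman_step(word_id, h_prev):
    """h_t = tanh(L x_t + R h_{t-1}); with a one-hot x_t this reduces to
    selecting column `word_id` of L."""
    return np.tanh(L[:, word_id] + R @ h_prev)

def predict_next(h):
    """P(omega | history) proportional to exp(o_omega . h), normalized here."""
    logits = O.T @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()

# encode a recognized history word by word (hypothetical word ids)
h = np.zeros(d)
for w in [17, 4032, 881]:
    h = elman_step(w, h)
probs = predict_next(h)
```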
On the basis of this representation of the historical information, the cache model stores the history as hidden-vector/observed-word pairs (h_i, ω_{i+1}), and the word probability up to time t is obtained in the cache component with the following kernel density estimator:
P_cache(ω_t | ω_1, …, ω_{t-1}) ∝ Σ_i 1{ω_t = ω_{i+1}} · K(θ(h_t, h_i))
where the sum runs over the cached pairs (h_i, ω_{i+1}); ω_1, …, ω_{t-1} denotes the historical recognized text sequence, ω_i the i-th word of that sequence, and ω_t the word at the current time; P_cache(ω_t | ω_1, …, ω_{t-1}) is the cache-based probability that the current word is ω_t; K denotes a kernel function; h_t and h_i are the hidden vectors corresponding to ω_t and ω_i respectively; θ(·,·) denotes the Euclidean distance between hidden vectors, ‖·‖ denotes the vector (modular) length, and ∝ denotes proportionality.
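A minimal sketch of this kernel density estimate, assuming the cache holds (hidden vector, following word) pairs as described and using a Gaussian kernel with a fixed bandwidth `theta`; both the kernel choice and the bandwidth value are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def cache_probability(h_t, cached_pairs, theta=1.0):
    """Kernel density estimate of P_cache(word | history).
    cached_pairs: list of (h_i, next_word) tuples stored by the cache model,
    with h_i and h_t as numpy vectors; theta: bandwidth of the assumed Gaussian kernel."""
    scores = defaultdict(float)
    for h_i, next_word in cached_pairs:
        dist = np.linalg.norm(h_t - h_i)                    # Euclidean distance
        scores[next_word] += np.exp(-(dist / theta) ** 2)   # kernel weight
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()} if total else {}
```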
If the cache content is not cleared and the stored associated word groups reach the million level, the system does not perform an exact search; instead, the following approximate k-nearest-neighbor algorithm, also called a variable kernel density estimator, is used to estimate the word distribution probability:
P_cache(ω_t | ω_1, …, ω_{t-1}) ∝ Σ_{h_i ∈ N(h_t)} 1{ω_t = ω_{i+1}} · K(‖h_t − h_i‖ / θ(h_t))
where ω_t is the current word and ω_1, …, ω_{t-1} is the historical recognition sequence; N(h_t) is the set of cached hidden vectors h_i with the shortest Euclidean distance to h_t; K is a kernel function (often chosen as a Gaussian kernel); and θ(h_t) is the Euclidean distance from h_t to its k-th nearest neighboring word.
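A sketch of this variable-kernel (k-nearest-neighbor) variant; exact top-k selection with numpy stands in for the approximate nearest-neighbor search a million-entry cache would actually use, and the Gaussian kernel is again an assumption.

```python
import numpy as np
from collections import defaultdict

def knn_cache_probability(h_t, cached_h, cached_words, k=100):
    """Variable-kernel estimate over only the k cached hidden vectors nearest
    to h_t. cached_h: array of shape (N, d); cached_words: the N following words."""
    if len(cached_words) == 0:
        return {}
    dists = np.linalg.norm(cached_h - h_t, axis=1)          # Euclidean distances
    k = min(k, len(dists))
    nearest = np.argpartition(dists, k - 1)[:k]             # indices of the k nearest entries
    bandwidth = max(dists[nearest].max(), 1e-8)             # theta(h_t): distance to k-th neighbor
    scores = defaultdict(float)
    for i in nearest:
        scores[cached_words[i]] += np.exp(-(dists[i] / bandwidth) ** 2)
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}
```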
S304: and constructing a local language model based on the 3-gram according to the local vocabulary probability distribution.
S4: and carrying out interpolation combination on the local language model and the dynamic language model to obtain an updated dynamic language model, returning to S2 until the task queue is empty, and completing the identification task.
As shown in fig. 3, step S4 specifically includes:
s401: and carrying out interpolation combination on the local language model and the dynamic language model, and correcting the probability of the local words by the following formula:
P(ω_t | ω_1, …, ω_{t-1}) = (1 − λ)·P_model(ω_t | ω_1, …, ω_{t-1}) + λ·P_cache(ω_t | ω_1, …, ω_{t-1})
where λ is the tuning (interpolation) parameter and P_model(ω_t | ω_1, …, ω_{t-1}) is the probability assigned by the existing language model to the current word ω_t given the preceding words ω_1, …, ω_{t-1}. λ is adjusted according to the experimental results, and the corrected language model containing the historical recognition text information is finally obtained and used to recognize the remaining speech (a brief code sketch of this interpolation is given after step S402 below).
S402: and obtaining an updated dynamic language model through the corrected word probability, and returning to the step S2 until the task queue is empty, thereby completing the recognition task.
Preferably, as shown in fig. 4, the method of the present invention further comprises the steps of:
s5: the recognized text in S2 is manually corrected.
Specifically, depending on the use scenario, in some scenarios (such as recording transcription) a user can correct incorrectly recognized homophones and low-frequency words during the recognition process; this cooperates with the subsequent local adjustment of the language model to improve the accuracy of later recognition.
Fig. 5 is a schematic structural diagram of a speech recognition adaptive system based on a cache language model according to an embodiment of the present invention. The speech recognition adaptive system provided by the embodiment of the invention can execute the speech recognition adaptive method based on the cache language model provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
A speech recognition self-adaptive system based on a cache language model is used for realizing the speech recognition self-adaptive method and comprises a speech signal receiving module, a speech signal segmentation module, a storage module, an ASR decoding module, a probability correction judgment module, a related word searching module and a language model correction module.
And the voice signal receiving module is used for receiving the continuous voice signal to be recognized. Specifically, the user uploads an offline audio file through the client, or clicks a recording button, inputs voice through the recording device, and clicks recording to finish recording. In the recording process, the system can automatically perform segmentation recognition according to the recorded voice signal and return a recognition result in real time.
And the voice signal segmentation module is used for segmenting a section of continuous long voice acquired by the voice signal receiving module into a plurality of short voices. The continuous speech signal is segmented into a plurality of short voices based on a voice activity detection VAD. Specifically, a silence model is established by using a pre-labeled corpus, a silence frame recognition is performed on each frame in the continuous voice signal, the frame reaching a preset long silence threshold value is used as a voice signal segmentation point, the continuous voice signal is segmented into a plurality of effective voice segments, and therefore a voice task group to be recognized is obtained, and voice recognition is performed in sequence.
The storage module is used for storing a task queue to be recognized, a recognized text result and a relevant word corpus, wherein the task queue is formed by a plurality of short voices according to a time sequence.
The ASR decoding module, as shown in fig. 6, includes a task reading unit, a dynamic language model, and a recognition result output unit; the task reading unit is used for reading the first short voice in the task queue of the storage module as the input of the dynamic language model, the dynamic language model is used for recognizing the voice into text, and the recognition result output unit is used for outputting the recognized text to the storage module.
For the first recognition, or when the number of associated words of the historical recognition text is not yet sufficient to construct a local language model, the ASR decoding module decodes and recognizes with a general language model or a predetermined domain language model; after the local language model has been constructed, the ASR decoding module decodes and recognizes with the interpolation-corrected language model.
And the probability correction judging module is used for judging whether the recognition text output by the ASR decoding module needs probability correction in real time, if so, sending a starting signal to the related word searching module, and if not, sending a signal to the ASR decoding module to start the recognition of the next short voice.
And the related word searching module is used for searching keywords according to a related word corpus preset in the storage module to obtain related word groups of the historical recognition words. The association relation in the preset association word list is pronunciation approximation, semantic field approximation and logic correlation, the word groups with the relation form the association word list, and an inverted index is established on the association word list to carry out rapid keyword retrieval.
And the language model correction module is used for updating the parameters of the dynamic language model in the ASR decoding module according to the real-time associated word groups and the historical associated word groups obtained by the associated word searching module. Specifically, an extended cache model based on a recurrent network (RNN) processes the associated word groups of the historical recognition text; once the number of associated words of the historical recognition text reaches a threshold, the local observation probability is calculated, a local language model is built on this basis, and the local language model and the general language model are interpolated and merged in the language model correction unit, finally yielding a corrected language model containing local information of the continuous voice signal for the subsequent speech recognition process.
In this embodiment, as shown in fig. 7, the language model modification module includes: the storage area is used for storing the related phrases output by the related word searching module; the local probability calculation unit is used for calculating the local vocabulary probability distribution corresponding to the associated phrases at the latest moment in the storage area; and the language model correction unit is used for constructing a local language model according to the local vocabulary probability distribution, interpolating and combining the local language model and the dynamic language submodel in the ASR decoding module, updating the automatic speech recognition model, sending a signal to the ASR decoding module after updating, and starting the recognition of the next short speech.
Preferably, as shown in fig. 8, the system further includes: and the manual correction module is used for manually correcting the recognition text output by the ASR decoding module.
Specifically, incorrectly recognized low-frequency words are corrected, and the manual correction prevents the system from introducing wrong associated words, further improving the system's recognition of continuous voice signals. Because additional interactive operations are introduced, this module is better suited to offline transcription of recorded voice signals, improving recognition accuracy at the cost of some real-time performance.
Specifically, the calculating of the local vocabulary probability distribution corresponding to the associated phrase at the latest moment in the storage area includes:
performing keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, and calculating the probability distribution of local vocabularies, wherein the calculation formula is as follows:
P_cache(ω_t | ω_1, …, ω_{t-1}) ∝ Σ_i 1{ω_t = ω_{i+1}} · K(θ(h_t, h_i))
where the sum runs over the pairs (h_i, ω_{i+1}) stored in the cache region; ω_1, …, ω_{t-1} denotes the historical recognized text sequence, ω_i the i-th word of that sequence, and ω_t the word at the current time; P_cache(ω_t | ω_1, …, ω_{t-1}) is the cache-based probability that the current word is ω_t; K denotes a kernel function; h_t and h_i are the hidden vectors corresponding to ω_t and ω_i respectively; θ(·,·) denotes the Euclidean distance between hidden vectors, ‖·‖ denotes the vector (modular) length, and ∝ denotes proportionality.
Specifically, the interpolation and merging of the local language model and the dynamic language submodel in the ASR decoding module, and the updating of the automatic speech recognition model includes:
and carrying out interpolation combination on the local language model and the dynamic language model, and correcting the probability of the local words by the following formula:
P(ω_t | ω_1, …, ω_{t-1}) = (1 − λ)·P_model(ω_t | ω_1, …, ω_{t-1}) + λ·P_cache(ω_t | ω_1, …, ω_{t-1})
where λ is the tuning (interpolation) parameter and P_model(ω_t | ω_1, …, ω_{t-1}) is the probability assigned by the existing language model to the current word ω_t given the preceding words ω_1, …, ω_{t-1};
and obtaining the updated dynamic language model through the corrected word probability until the task queue is empty, and completing the recognition task.
The specific technical scheme and the beneficial effects based on the modules are disclosed in the embodiment of the method, and therefore, the detailed description is omitted.
This example was trained and tested using the following data set:
LibriSpeech: a well-known open-source English data set (LibriSpeech: an ASR corpus based on public domain audio books, Vassil Panayotov, Guoguo Chen, Daniel Povey and Sanjeev Khudanpur, ICASSP 2015), prepared by Vassil Panayotov, comprising about 1000 hours of 16 kHz read English speech. The data are taken from audiobooks of the LibriVox project and are carefully segmented and aligned.
News Crawl: a data set of news articles containing news data from 2007-2013. In this embodiment, the cache language model is tested on the News Crawl 2007-2011 sub-datasets, whose data distribution changes over time.
WikiText: an English data set developed by Merity et al. (S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. In ICLR, 2017.) and derived from Wikipedia articles; the underlying recognition model is trained on the original annotation-format data.
The Book Corpus: a Project Gutenberg data set (S. Lahiri. Complexity of word collocation networks: A preliminary structural analysis. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014.) containing the text corpora of 3036 English books.
Phone-record: real telephone-scenario recordings, containing about 200 hours of 16 kHz speech with good-quality label text.
The implementation details are as follows:
In the basic speech recognition system, 39-dimensional MFCC features are extracted on LibriSpeech with a frame length of 25 ms and a frame shift of 10 ms, an acoustic model is trained with the Kaldi chain recipe, and recognition is performed with the standard nnet3 lattice-generation decoding flow. In the modeling setup, the GMM stage uses 250000 final phoneme Gaussians and 18000 leaf nodes; the final GMM-stage alignment is used as the initial alignment for training the chain model; the chain network comprises 3 LSTM layers with 1024 LSTM units per layer. The initial language model is a 3-gram language model built from the LibriSpeech, WikiText and The Book Corpus text corpora;
The associated word list is backed by a MySQL relational database built over the texts of News Crawl, WikiText and The Book Corpus, with multidimensional tags including topic and domain information, and an inverted index is established. During associated-word search, the low-frequency-word threshold is -4.75; words whose observation probability is higher than this threshold are not processed and are not added to the cache model. The local language model interpolation coefficient λ is set to 0.25;
performance evaluation:
for cache model performance, the recognition accuracy is evaluated using Word Error Rate (WER), which is calculated as follows:
Figure BDA0002887958490000151
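For reference, a small self-contained sketch of the WER computation via the usual edit-distance dynamic program (not code from the patent):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cache language model", "the cash language model"))  # 0.25
```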
in this implementation, the recognition performance of four schemes is evaluated:
table 1: caching model comparison test results
Base: only the basic speech recognition system is used and the initial language model is not modified;
Local cache: a baseline proposed by Grave et al. (E. Grave, A. Joulin, and N. Usunier. Improving neural language models with a continuous cache. In ICLR, 2017.), a relatively common way of using historical recognition information;
Cache LM: the cache model used by the invention; the word error rate is computed from the continuous recognition results;
Final LM: each test data set is re-recognized with the final language model, and the language model is not further modified by the cache model during this recognition.
According to table 1, the method of the invention improves over the control schemes on all three test sets. Where the training and test sets are more similar and low-frequency words are fewer, the improvement brought by the cache language model is limited; a remarkable effect is shown on the other two test sets, especially Phone-record, which has stronger domain characteristics.
It is worth mentioning that, as an evaluation of the effect of the final dynamic language model modification, the Final LM obtains the best accuracy on all three test sets; that is, the corrected language model describes the textual relations of the test data better than the original language model, which demonstrates that modifying the dynamic language model with historical recognition information is effective and has a positive influence on context-dependent speech recognition tasks.
For the manual modification module, the present embodiment performs a test on a single conference recording, the recording duration is one hour, and the results are shown in table 2:
Table 2: Manual correction module test (the data are error rates for specific words)
In the conference recording, two low-frequency words that do not appear, or appear only rarely, in the training data were selected. The recognition accuracy of the system without manual correction (base) is poor; after manual correction is used (modified), the recognition accuracy improves markedly, fully demonstrating the benefit of the manual correction module for long-speech transcription tasks.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that the invention is not limited thereto, and may be embodied in other forms without departing from the spirit or essential characteristics thereof. Any modification which does not depart from the functional and structural principles of the present invention is intended to be included within the scope of the claims.

Claims (10)

1. A speech recognition adaptive method based on a cache language model is characterized by comprising the following steps:
s1: dividing a section of continuous long voice to obtain a plurality of short voices, and forming a task queue according to a time sequence;
s2: the first short voice in the task queue is taken to obtain a recognition text through a dynamic language model, and that short voice is deleted from the task queue;
s3: establishing a cache model, judging whether probability correction is needed or not in real time according to the recognition text of each short voice, if not, returning to S2 until the task queue is empty, and completing the recognition task; if yes, performing keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, calculating local vocabulary probability distribution, and constructing a local language model;
s4: and carrying out interpolation combination on the local language model and the dynamic language model to obtain an updated dynamic language model, returning to S2 until the task queue is empty, and completing the identification task.
2. The adaptive speech recognition method based on cached language models according to claim 1, further comprising the steps of:
s5: the recognized text in S2 is manually corrected.
3. The speech recognition adaptive method based on the cached language model as recited in claim 1, wherein the step S3 specifically comprises:
s301: establishing a cache model which comprises a plurality of cache regions;
s302: judging whether low-frequency words or strong-field feature words exist or not in real time according to the recognition text of each short voice, if so, performing probability correction, entering S303, if not, performing no probability correction, returning to S2 until the task queue is empty, and completing the recognition task;
s303: performing keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, and calculating the probability distribution of local vocabularies, wherein the calculation formula is as follows:
P_cache(ω_t | ω_1, …, ω_{t-1}) ∝ Σ_i 1{ω_t = ω_{i+1}} · K(θ(h_t, h_i))
where the sum runs over the pairs (h_i, ω_{i+1}) stored in the cache region; ω_1, …, ω_{t-1} denotes the historical recognized text sequence, ω_i the i-th word of that sequence, and ω_t the word at the current time; P_cache(ω_t | ω_1, …, ω_{t-1}) is the cache-based probability that the current word is ω_t; K denotes a kernel function; h_t and h_i are the hidden vectors corresponding to ω_t and ω_i respectively; θ(·,·) denotes the Euclidean distance between hidden vectors, ‖·‖ denotes the vector (modular) length, and ∝ denotes proportionality;
s304: and constructing a local language model based on the 3-gram according to the local vocabulary probability distribution.
4. The adaptive speech recognition method according to claim 3, wherein when the associated word groups stored in the cache region of the cache model reach a threshold on the order of one million entries, the local vocabulary probability distribution calculation formula in step S303 is replaced with the following formula:
P_cache(ω_t | ω_1, …, ω_{t-1}) ∝ Σ_{h_i ∈ N(h_t)} 1{ω_t = ω_{i+1}} · K(‖h_t − h_i‖ / θ(h_t))
where N(h_t) is the set of cached hidden vectors h_i with the shortest Euclidean distance to h_t, and θ(h_t) is the Euclidean distance from h_t to its k-th nearest neighboring word.
5. The speech recognition adaptive method based on the cached language model as recited in claim 1, wherein the step S4 specifically comprises:
s401: and carrying out interpolation combination on the local language model and the dynamic language model, and correcting the probability of the local words by the following formula:
P(ω_t | ω_1, …, ω_{t-1}) = (1 − λ)·P_model(ω_t | ω_1, …, ω_{t-1}) + λ·P_cache(ω_t | ω_1, …, ω_{t-1})
where λ is the tuning (interpolation) parameter and P_model(ω_t | ω_1, …, ω_{t-1}) is the probability assigned by the existing language model to the current word ω_t given the preceding words ω_1, …, ω_{t-1};
s402: and obtaining an updated dynamic language model through the corrected word probability, and returning to the step S2 until the task queue is empty, thereby completing the recognition task.
6. A speech recognition adaptive system based on a cache language model, for implementing the speech recognition adaptive method of any one of claims 1-5, comprising:
the voice signal receiving module is used for receiving a continuous voice signal to be recognized;
the voice signal segmentation module is used for segmenting a section of continuous long voice acquired by the voice signal receiving module into a plurality of short voices;
the storage module is used for storing a task queue to be recognized, a recognized text result and a related word corpus, wherein the task queue is formed by a plurality of short voices according to a time sequence;
the ASR decoding module comprises a task reading unit, a dynamic language model and a recognition result output unit; the task reading unit is used for reading the first short voice in the task queue of the storage module as the input of the dynamic language model, the dynamic language model is used for recognizing the voice into text, and the recognition result output unit is used for outputting the recognized text to the storage module;
the probability correction judging module is used for judging whether the recognition text output by the ASR decoding module needs probability correction in real time, if so, sending a starting signal to the related word searching module, and if not, sending a signal to the ASR decoding module to start the recognition of the next short voice;
the related word searching module is used for searching keywords according to a related word corpus preset in the storage module to obtain related word groups of the historical recognition words;
and the language model correction module is used for updating parameters of the dynamic language model in the ASR decoding module according to the real-time associated phrases and the historical associated phrases obtained by the associated word searching module.
7. The adaptive speech recognition system according to claim 6, wherein the language model modification module comprises:
the storage area is used for storing the related phrases output by the related word searching module;
the local probability calculation unit is used for calculating the local vocabulary probability distribution corresponding to the associated phrases at the latest moment in the storage area;
and the language model correction unit is used for constructing a local language model according to the local vocabulary probability distribution, interpolating and combining the local language model and the dynamic language submodel in the ASR decoding module, updating the automatic speech recognition model, sending a signal to the ASR decoding module after updating, and starting the recognition of the next short speech.
8. The cache language model-based speech recognition adaptation system of claim 6, further comprising:
and the manual correction module is used for manually correcting the recognition text output by the ASR decoding module.
9. The adaptive speech recognition system according to claim 7, wherein the calculating a local lexical probability distribution corresponding to the associated phrase at the latest moment in the storage area comprises:
performing keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, and calculating the probability distribution of local vocabularies, wherein the calculation formula is as follows:
P_cache(ω_t | ω_1, …, ω_{t-1}) ∝ Σ_i 1{ω_t = ω_{i+1}} · K(θ(h_t, h_i))
where the sum runs over the pairs (h_i, ω_{i+1}) stored in the cache region; ω_1, …, ω_{t-1} denotes the historical recognized text sequence, ω_i the i-th word of that sequence, and ω_t the word at the current time; P_cache(ω_t | ω_1, …, ω_{t-1}) is the cache-based probability that the current word is ω_t; K denotes a kernel function; h_t and h_i are the hidden vectors corresponding to ω_t and ω_i respectively; θ(·,·) denotes the Euclidean distance between hidden vectors, ‖·‖ denotes the vector (modular) length, and ∝ denotes proportionality.
10. The adaptive speech recognition system according to claim 7, wherein the local language model is interpolated with the dynamic language submodel in the ASR decoding module, and the updating the automatic speech recognition model comprises:
and carrying out interpolation combination on the local language model and the dynamic language model, and correcting the probability of the local words by the following formula:
P(ω_t | ω_1, …, ω_{t-1}) = (1 − λ)·P_model(ω_t | ω_1, …, ω_{t-1}) + λ·P_cache(ω_t | ω_1, …, ω_{t-1})
where λ is the tuning (interpolation) parameter and P_model(ω_t | ω_1, …, ω_{t-1}) is the probability assigned by the existing language model to the current word ω_t given the preceding words ω_1, …, ω_{t-1};
and obtaining the updated dynamic language model through the corrected word probability until the task queue is empty, and completing the recognition task.
CN202110018749.9A — Priority date 2021-01-07 — Filing date 2021-01-07 — Voice recognition self-adaption method and system based on cache language model — Pending — CN112767921A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110018749.9A (CN112767921A, en) | 2021-01-07 | 2021-01-07 | Voice recognition self-adaption method and system based on cache language model

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110018749.9A (CN112767921A, en) | 2021-01-07 | 2021-01-07 | Voice recognition self-adaption method and system based on cache language model

Publications (1)

Publication Number | Publication Date
CN112767921A (en) | 2021-05-07

Family

ID=75700640

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110018749.9A (Pending, CN112767921A, en) | Voice recognition self-adaption method and system based on cache language model | 2021-01-07 | 2021-01-07

Country Status (1)

Country | Link
CN (1) | CN112767921A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US20090099841A1 (en)* | 2007-10-04 | 2009-04-16 | Kabushiki Kaisha Toshiba | Automatic speech recognition method and apparatus
CN111429912A (en)* | 2020-03-17 | 2020-07-17 | 厦门快商通科技股份有限公司 | Keyword detection method, system, mobile terminal and storage medium
CN112509560A (en)* | 2020-11-24 | 2021-03-16 | 杭州一知智能科技有限公司 | Voice recognition self-adaption method and system based on cache language model

Cited By (5)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN114678018A (en)* | 2022-02-18 | 2022-06-28 | 北京字跳网络技术有限公司 | Voice recognition method, device, equipment, medium and product
CN114938477A (en)* | 2022-06-23 | 2022-08-23 | 阿里巴巴(中国)有限公司 | Video topic determination method, device and equipment
CN114938477B (en)* | 2022-06-23 | 2024-05-03 | 阿里巴巴(中国)有限公司 | Video topic determination method, device and equipment
WO2024045475A1 (en)* | 2022-09-01 | 2024-03-07 | 北京百度网讯科技有限公司 | Speech recognition method and apparatus, and device and medium
CN116246633A (en)* | 2023-05-12 | 2023-06-09 | 深圳市宏辉智通科技有限公司 | Wireless intelligent Internet of things conference system

Similar Documents

Publication | Title
US12254865B2 | Multi-dialect and multilingual speech recognition
US11545142B2 | Using context information with end-to-end models for speech recognition
US20230237995A1 | Minimum word error rate training for attention-based sequence-to-sequence models
CN106683677B | Voice recognition method and device
US10176802B1 | Lattice encoding using recurrent neural networks
Asami et al. | Domain adaptation of dnn acoustic models using knowledge distillation
JP6222821B2 | Error correction model learning device and program
US11043205B1 | Scoring of natural language processing hypotheses
CN112509560B | Voice recognition self-adaption method and system based on cache language model
US11081104B1 | Contextual natural language processing
US9514126B2 | Method and system for automatically detecting morphemes in a task classification system using lattices
KR20220004224A | Context biasing for speech recognition
CN112767921A | Voice recognition self-adaption method and system based on cache language model
US20050159949A1 | Automatic speech recognition learning using user corrections
CN102280106A | VWS method and apparatus used for mobile communication terminal
US8494847B2 | Weighting factor learning system and audio recognition system
CN112233651A | Method, device, device and storage medium for determining dialect type
Kala et al. | Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection
CN119547136A | Context-aware neural confidence estimation for rare word speech recognition
CN115376547B | Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN113012685B | Audio recognition method, device, electronic device and storage medium
Tabibian | A survey on structured discriminative spoken keyword spotting
JP6027754B2 | Adaptation device, speech recognition device, and program thereof
CN115294974B | A speech recognition method, device, equipment and storage medium
Norouzian et al. | An approach for efficient open vocabulary spoken term detection

Legal Events

Code | Title / Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
WD01 | Invention patent application deemed withdrawn after publication (application publication date: 2021-05-07)
