





Technical Field
The invention relates to the field of speech recognition technology, and in particular to a method for improving Chongqing dialect speech recognition through transfer learning.
Background Art
Speech recognition technology originated in the 1950s and has since achieved considerable success. Likewise, with the development of deep learning, natural language processing has gradually moved from statistical models to deep semantic models and is now widely applied in classic NLP scenarios such as natural language generation (NLG) and named entity recognition.
Artificial intelligence products are widely used across IT fields. ASR technology is an important component of artificial intelligence: it enables computers to "understand" human speech. Advances in ASR help people interact with more artificial intelligence products and realize "human-computer interaction", so that people can enjoy the convenience and efficiency that technological progress brings to daily life.
ASR can be implemented with either a pipeline or an end-to-end approach; the main difference lies in the recognition unit of the acoustic model. The size of the recognition unit (word, character, semi-syllable or phoneme model) has a large impact on the amount of training data required, the recognition rate, and the flexibility of the system. For systems with a medium or larger vocabulary, a small recognition unit means less computation, less model storage and relatively little training data, but it makes locating and segmenting the corresponding speech segments difficult and requires more complex recognition rules. A large recognition unit can easily incorporate co-articulation into the model, which helps improve the recognition rate but requires correspondingly more training data.
In summary, statistics-based language models are constrained by corpus size, their effect is limited, and statistical information has limited expressive power at the semantic level. The prior art does not integrate a language model into the acoustic model, and most deep-learning acoustic models adopt CNN or RNN-like structures with limited computational efficiency. Models such as BERT, because of their bidirectional attention mechanism, perform poorly on text-generation (NLG) tasks.
Summary of the Invention
The purpose of the present invention is to provide a method for improving Chongqing dialect speech recognition through transfer learning, comprising the following steps:
1) Acquire speech data. The speech data includes dialect speech.
2) Perform a Fourier transform on the speech data to obtain a speech spectrogram.
3) Vectorize the speech spectrogram with a VGG network to obtain the vector v.
The vector v is given by:
v = VGG(DFT(A))  (1)
where A is the speech data.
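A minimal sketch of steps 2) and 3), assuming single-channel audio at 16 kHz, a Hann-windowed STFT as the Fourier transform, and a small VGG-style convolution stack standing in for the full VGG network; the class name, window size, hop length, channel counts and output dimension are illustrative choices, not values given by the invention:

```python
import torch
import torch.nn as nn

def spectrogram(audio: torch.Tensor, n_fft: int = 400, hop: int = 160) -> torch.Tensor:
    """Discrete Fourier transform of the waveform -> magnitude spectrogram (freq x time)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(audio, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return spec.abs()

class TinyVGG(nn.Module):
    """VGG-style stack (conv-conv-pool blocks) that maps a spectrogram to a vector v."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, out_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        x = self.features(spec.unsqueeze(0).unsqueeze(0))   # (1, 1, F, T)
        return self.fc(self.pool(x).flatten(1))             # v, shape (1, out_dim)

audio = torch.randn(16000)          # 1 s of dummy audio in place of real dialect speech
v = TinyVGG()(spectrogram(audio))   # v = VGG(DFT(A)), formula (1)
```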
4) Obtain the input X of the transformer model. The transformer model includes an encoder encoder1, an encoder encoder2 and a decoder.
The input X of the transformer is:
X = PE(DFT(A)) + Fbank(v)  (2)
where PE is the position encoding function and Fbank denotes the speech feature extraction (filterbank) operation.
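A short sketch of formula (2), assuming the sinusoidal position code of the original transformer and using a random tensor as a stand-in for Fbank features already projected to the model width; the frame count and model dimension are illustrative:

```python
import math
import torch

def positional_encoding(n_frames: int, d_model: int) -> torch.Tensor:
    """Sinusoidal position code PE: one d_model-dim vector per spectrogram frame."""
    pos = torch.arange(n_frames).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(n_frames, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# X = PE(DFT(A)) + Fbank(v), formula (2): the position code is added to the frame features.
n_frames, d_model = 101, 256
fbank = torch.randn(n_frames, d_model)        # placeholder for projected Fbank features
X = positional_encoding(n_frames, d_model) + fbank
```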
5) Transform the input X to obtain the parameters Q, K and V:
Q = XW^Q, K = XW^K, V = XW^V  (3)
6) Input the parameters Q, K and V into the encoder encoder1 and the encoder encoder2 of the transformer model to obtain the encoder outputs Y1 and Y2 respectively.
Each encoder processes the parameters Q, K and V as follows.
The encoder includes a multi-head attention layer and a feed-forward layer.
The output MultiHead(Q,K,V) of the multi-head attention layer is:
MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W^O  (4)
where the parameter head_i is:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h  (5)
where h is the number of attention heads and W_i^Q, W_i^K, W_i^V are the weights of the i-th head.
The attention Attention(Q,K,V) is:
Attention(Q,K,V) = softmax(QK^T / √d_k)V  (6)
where √d_k is the normalization factor.
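A minimal sketch of formulas (3)-(6): linear projections to Q, K and V, scaled dot-product attention with the √d_k normalization, and concatenation of the heads projected by W^O. The class name, head count and model width are illustrative:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Formulas (3)-(6): project to Q, K, V, attend per head, concatenate, project with W^O."""
    def __init__(self, d_model: int = 256, h: int = 4):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # W^Q
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, x_q: torch.Tensor, x_kv: torch.Tensor) -> torch.Tensor:
        t_q, t_k = x_q.size(0), x_kv.size(0)
        # split into h heads: (h, T, d_k)
        q = self.w_q(x_q).view(t_q, self.h, self.d_k).transpose(0, 1)
        k = self.w_k(x_kv).view(t_k, self.h, self.d_k).transpose(0, 1)
        v = self.w_v(x_kv).view(t_k, self.h, self.d_k).transpose(0, 1)
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, formula (6)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        heads = torch.softmax(scores, dim=-1) @ v
        # MultiHead = Concat(head_1, ..., head_h) W^O, formula (4)
        return self.w_o(heads.transpose(0, 1).reshape(t_q, -1))

X = torch.randn(101, 256)
att = MultiHeadAttention()(X, X)   # self-attention over the acoustic frames
```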
The output FFN(x) of the feed-forward layer is:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2  (7)
The input x of the feed-forward layer is:
x = norm(X + MultiHead(Q,K,V))  (8)
The output Y of the encoder is:
Y = FFN(x)  (9)
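Continuing the previous sketch, a single encoder layer implementing formulas (7)-(9); it reuses the MultiHeadAttention class above, and all widths are illustrative. Formula (9) stops at Y = FFN(x), whereas a standard transformer encoder would add a second residual connection and normalization around the feed-forward layer; the sketch follows the formulas as written:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Formulas (7)-(9): x = norm(X + MultiHead(Q,K,V)); Y = FFN(x) = max(0, xW1 + b1)W2 + b2."""
    def __init__(self, d_model: int = 256, d_ff: int = 1024, h: int = 4):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, h)   # from the previous sketch
        self.norm = nn.LayerNorm(d_model)
        self.w1 = nn.Linear(d_model, d_ff)           # W1, b1
        self.w2 = nn.Linear(d_ff, d_model)           # W2, b2

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        x = self.norm(X + self.attn(X, X))           # formula (8)
        return self.w2(torch.relu(self.w1(x)))       # formulas (7) and (9)

Y = EncoderLayer()(torch.randn(101, 256))            # encoder output Y
```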
7) Input the encoder output Y1 and the encoder output Y2 into the decoder of the transformer model to obtain the recognized text.
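A hedged sketch of steps 6) and 7), reusing the EncoderLayer and MultiHeadAttention classes above. How the decoder combines Y1 and Y2 is not spelled out in the text (it is deferred to Fig. 3), so concatenating the two encoder memories along the time axis is an assumption made only for illustration, as are the absence of a causal mask and of a token-embedding layer in this toy decoder:

```python
import torch
import torch.nn as nn

class DualEncoderDecoder(nn.Module):
    """Two encoders produce Y1 and Y2; a single decoder layer cross-attends to both."""
    def __init__(self, d_model: int = 256, vocab: int = 4000):
        super().__init__()
        self.enc1 = EncoderLayer(d_model)             # encoder1
        self.enc2 = EncoderLayer(d_model)             # encoder2
        self.self_attn = MultiHeadAttention(d_model)
        self.cross_attn = MultiHeadAttention(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.out = nn.Linear(d_model, vocab)          # distribution over text tokens

    def forward(self, X: torch.Tensor, prev_tokens: torch.Tensor) -> torch.Tensor:
        Y1, Y2 = self.enc1(X), self.enc2(X)           # encoder outputs Y1, Y2
        memory = torch.cat([Y1, Y2], dim=0)           # assumed merge of the two encoders
        t = self.norm1(prev_tokens + self.self_attn(prev_tokens, prev_tokens))
        t = self.norm2(t + self.cross_attn(t, memory))
        return self.out(t)                            # logits for the recognized text

logits = DualEncoderDecoder()(torch.randn(101, 256), torch.randn(20, 256))
```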
8) Based on the recognized text, determine the input x of the pinyin BERT model.
The input x of the pinyin BERT model is:
x = Concat(CE, GE, PYE)W_F + PE  (10)
where CE is the character embedding, GE is the glyph embedding, PYE is the pinyin embedding, PE is the position embedding, W_F is a fully connected layer, and Concat denotes vector concatenation.
The glyph embedding GE is:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W_G  (11)
where I_1, I_2 and I_3 are glyph images, W_G is a fully connected layer, and flatten converts a two-dimensional image into a one-dimensional vector.
The pinyin embedding PYE is:
PYE = max-pooling(CNN(S))  (12)
where S is the pinyin character sequence, max-pooling denotes max pooling, and CNN denotes a convolution.
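A minimal sketch of formula (10), assuming BERT-style learned absolute position embeddings and using random tensors as stand-ins for CE, GE and PYE (in the method these come from the character embedding table, the glyph network of formula (11) and the pinyin CNN of formula (12)); the class name and all widths are illustrative:

```python
import torch
import torch.nn as nn

class FusionEmbedding(nn.Module):
    """Formula (10): concatenate character, glyph and pinyin embeddings, project with W_F,
    then add the position embedding."""
    def __init__(self, d_char: int = 768, d_glyph: int = 768, d_pinyin: int = 768,
                 d_model: int = 768, max_len: int = 512):
        super().__init__()
        self.w_f = nn.Linear(d_char + d_glyph + d_pinyin, d_model)   # W_F
        self.pos = nn.Embedding(max_len, d_model)                    # PE (learned positions assumed)

    def forward(self, ce: torch.Tensor, ge: torch.Tensor, pye: torch.Tensor) -> torch.Tensor:
        n = ce.size(0)
        fused = self.w_f(torch.cat([ce, ge, pye], dim=-1))           # Concat(CE, GE, PYE) W_F
        return fused + self.pos(torch.arange(n))                     # + PE

n_chars = 10
x = FusionEmbedding()(torch.randn(n_chars, 768),
                      torch.randn(n_chars, 768),
                      torch.randn(n_chars, 768))
```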
9) Input x into the pinyin BERT model to obtain the speech recognition result.
The speech recognition result p(x_1, x_2, x_3, ..., x_n) is factorized as:
p(x_1, x_2, x_3, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)...p(x_n|x_1, x_2, ..., x_{n-1})
= p(x_3)p(x_1|x_3)p(x_2|x_3, x_1)...p(x_n|x_3, x_1, ..., x_{n-1})
= ...
= p(x_{n-1})p(x_1|x_{n-1})p(x_n|x_{n-1}, x_1)...p(x_2|x_{n-1}, x_1, ..., x_3)  (13)
where factors such as p(x_2|x_1) are conditional probability distributions over the recognized text.
The technical effect of the present invention is beyond doubt. The invention replaces the statistical language model in ASR with a pre-trained model built on a large-scale corpus, which captures semantic-level information more comprehensively, and through a pipeline design it separates the acoustic model and the language model in ASR, increasing the diversity of model choices.
By placing the position embedding in the acoustic model, the invention gives the acoustic model some of the capability of a language model, and at the same time enhances the effectiveness with which the acoustic model extracts acoustic information and completes decoding.
By introducing pinyin, glyph and other embeddings, the invention captures linguistic information from multiple angles. This matches the characteristics that make Chinese ASR difficult, such as characters sharing the same initial, the same final or the same pronunciation, and it also improves the accuracy of the language model during decoding.
The invention applies the UniLM model to the ASR scenario and, drawing on the effectiveness of the UniLM algorithm in text-generation tasks, improves the accuracy of ASR decoding.
In view of the excellent performance of NLP pre-training methods on a large number of NLP tasks in recent years, the invention proposes using a transformer as the acoustic model to obtain a preliminary ASR result, and then, based on a corpus of the language scenario, using a pre-trained model obtained with pinyin pre-training (UniLM) as the language model to produce the final ASR output.
Brief Description of the Drawings
Fig. 1 shows the speech recognition flow;
Fig. 2 shows the speech feature processing flow;
Fig. 3 shows the transformer structure;
Fig. 4 shows the input arrangement;
Fig. 5 shows the input information fusion;
Fig. 6 shows the pinyin embedding.
Detailed Description of the Embodiments
The present invention is further described below in conjunction with the embodiments, but this should not be understood as limiting the scope of the above subject matter of the invention to the following embodiments. Various substitutions and changes made on the basis of common technical knowledge and customary means in the field, without departing from the above technical idea of the invention, shall all fall within the protection scope of the invention.
Embodiment 1:
Referring to Fig. 1, Fig. 2, Fig. 3, Fig. 4, Fig. 5 and Fig. 6, a method for improving Chongqing dialect speech recognition through transfer learning comprises the following steps:
1) Acquire speech data. The speech data includes dialect speech.
2) Perform a Fourier transform on the speech data to obtain a speech spectrogram.
3) Vectorize the speech spectrogram with a VGG network to obtain the vector v.
The vector v is given by:
v = VGG(DFT(A))  (1)
where A is the speech data.
4) Obtain the input X of the transformer model. The transformer model includes an encoder encoder1, an encoder encoder2 and a decoder.
The input X of the transformer is:
X = PE(DFT(A)) + Fbank(v)  (2)
where PE is the position encoding function and Fbank() denotes the speech feature extraction operation.
5) Transform the input X to obtain the parameters Q, K and V:
Q = XW^Q, K = XW^K, V = XW^V  (3)
6) Input the parameters Q, K and V into the encoder encoder1 and the encoder encoder2 of the transformer model to obtain the encoder outputs Y1 and Y2 respectively.
Each encoder processes the parameters Q, K and V as follows.
The encoder includes a multi-head attention layer and a feed-forward layer.
The output MultiHead(Q,K,V) of the multi-head attention layer is:
MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W^O  (4)
where the parameter head_i is:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h  (5)
where h is the number of attention heads and W_i^Q, W_i^K, W_i^V are the weights of the i-th head.
The attention Attention(Q,K,V) is:
Attention(Q,K,V) = softmax(QK^T / √d_k)V  (6)
where √d_k is the normalization factor.
The output FFN(x) of the feed-forward layer is:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2  (7)
The input x of the feed-forward layer is:
x = norm(X + MultiHead(Q,K,V))  (8)
The output Y of the encoder is:
Y = FFN(x)  (9)
7) Input the encoder output Y1 and the encoder output Y2 into the decoder of the transformer model to obtain the recognized text.
8) Based on the recognized text, determine the input x of the pinyin BERT model.
The input x of the pinyin BERT model is:
x = Concat(CE, GE, PYE)W_F + PE  (10)
where CE is the character embedding, GE is the glyph embedding, PYE is the pinyin embedding, PE is the position embedding, W_F is a fully connected layer, and Concat denotes vector concatenation.
The glyph embedding GE is:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W_G  (11)
where I_1, I_2 and I_3 are glyph images, W_G is a fully connected layer, and flatten converts a two-dimensional image into a one-dimensional vector.
The pinyin embedding PYE is:
PYE = max-pooling(CNN(S))  (12)
where S is the pinyin character sequence, max-pooling denotes max pooling, and CNN denotes a convolution.
9) Input x into the pinyin BERT model to obtain the speech recognition result.
The speech recognition result p(x_1, x_2, x_3, ..., x_n) is factorized as:
p(x_1, x_2, x_3, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)...p(x_n|x_1, x_2, ..., x_{n-1})
= p(x_3)p(x_1|x_3)p(x_2|x_3, x_1)...p(x_n|x_3, x_1, ..., x_{n-1})
= ...
= p(x_{n-1})p(x_1|x_{n-1})p(x_n|x_{n-1}, x_1)...p(x_2|x_{n-1}, x_1, ..., x_3)  (13)
where factors such as p(x_2|x_1) are conditional probability distributions over the recognized text.
Embodiment 2:
A method for improving Chongqing dialect speech recognition through transfer learning comprises the following steps:
1) From the audio, use signal processing techniques and the Fourier transform to obtain the spectrogram of a single audio file, and extract the vector representation of the whole spectrogram through a VGG network.
The formula can be expressed as:
V = VGG(DFT(A))
A: audio file; DFT: discrete Fourier transform; VGG: VGG network; V: vector representation output by the VGG.
2) From the spectrogram, obtain the position information of each spectral unit in the original image, vectorize it as an embedding, and input it into the transformer together with the Fbank features.
Encoder calculation flow and formulas:
The transformer input X consists of two parts, the position encoding and the Fbank features, where PE is the position encoding function:
X = PE(DFT(A)) + Fbank(V)
Transform the input X into Q, K and V:
Q = XW^Q, K = XW^K, V = XW^V
Attention calculation formula:
Attention(Q,K,V) = softmax(QK^T / √d_k)V
Multi-head attention layer:
MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W^O
where:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), i = 1, 2, ..., h
Feed-forward layer:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
where:
x = norm(X + MultiHead(Q,K,V))
Output of the encoder:
Y = FFN(x)
The calculation process of the decoder is similar to that of the encoder; see Fig. 3 for details, which will not be repeated here.
3) Chinese characters have two major characteristics: the glyph and the pinyin. Chinese is a typical logographic-phonetic writing system; from the point of view of its origin, the glyph itself already carries part of the semantics. For example, the characters in "江河湖泊" (rivers and lakes) all share the water radical, which indicates that they are all related to water. From the point of view of pronunciation, the pinyin of a Chinese character can also reflect its semantics to a certain extent and helps distinguish word senses. For example, the character "乐" has two pronunciations, yuè and lè: the former means "music" and is a noun, while the latter means "happy" and is an adjective. For such a polyphonic character, given only the character "乐" the model cannot tell whether it stands for "music" or "happiness", and additional pronunciation information is needed for disambiguation. Starting from these two characteristics, the glyph and pinyin information of Chinese characters is integrated into the pre-training process on the Chinese corpus. The glyph vector of a character is formed from images of the character in several different fonts, while the pinyin vector is obtained from the corresponding romanized pinyin character sequence. The two are fused together with the character vector to obtain the final fusion vector, which serves as the input of the pre-training model. The model is trained with both Whole Word Masking and Character Masking strategies, so that it builds a more comprehensive connection between characters, glyphs, pronunciation and context.
X = Concat(CE, GE, PYE)W_F + PE
CE: character embedding, GE: glyph embedding, PYE: pinyin embedding, PE: position embedding, W_F: fully connected layer, X: input of BERT, Concat: vector concatenation.
The bottom fusion layer (Fusion Layer) fuses the glyph embedding (Glyph Embedding) and the pinyin embedding (Pinyin Embedding) with the character embedding (Char Embedding) to obtain the fusion embedding (Fusion Embedding), which is then added to the position embedding to form the input of the model. The glyph embedding is obtained from images of the character in different fonts. Each image is 24×24 in size; the images in the three fonts Fangsong, Xingkai and Lishu are flattened into vectors, concatenated, and passed through a fully connected layer W_G to obtain the glyph embedding of the Chinese character.
The process is shown in Fig. 5:
GE = Concat(flatten(I_1), flatten(I_2), flatten(I_3))W_G
I: glyph image, W_G: fully connected layer, GE: glyph embedding, flatten: converts a two-dimensional image into a one-dimensional vector.
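A minimal sketch of this glyph embedding, assuming the three 24×24 bitmaps of one character have already been rendered (random tensors stand in for them here); the class name and the output width are illustrative:

```python
import torch
import torch.nn as nn

class GlyphEmbedding(nn.Module):
    """Flatten the 24x24 images of one character rendered in three fonts
    (e.g. Fangsong, Xingkai, Lishu), concatenate them, and project with W_G."""
    def __init__(self, d_glyph: int = 768):
        super().__init__()
        self.w_g = nn.Linear(3 * 24 * 24, d_glyph)    # W_G

    def forward(self, i1: torch.Tensor, i2: torch.Tensor, i3: torch.Tensor) -> torch.Tensor:
        flat = torch.cat([i1.flatten(), i2.flatten(), i3.flatten()])   # flatten + Concat
        return self.w_g(flat)                                          # GE

# Dummy grayscale glyph bitmaps in place of real rendered font images.
ge = GlyphEmbedding()(torch.rand(24, 24), torch.rand(24, 24), torch.rand(24, 24))
```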
For the pinyin embedding, pypinyin is first used to convert the pinyin of each Chinese character into a sequence of romanized characters that also encodes the tone. For example, for the character "猫" (cat) the pinyin character sequence is "mao1". For polyphonic characters such as "乐", pypinyin can identify the correct pinyin for the current context very accurately.
The process is shown in Fig. 6:
PYE = max-pooling(CNN(S))
S: pinyin sequence, max-pooling: max pooling, CNN: convolution, PYE: pinyin embedding.
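A minimal sketch of this pinyin embedding using the real pypinyin package (lazy_pinyin with Style.TONE3 yields strings such as "mao1"); the character vocabulary, embedding width, convolution width, fixed sequence length and class name are illustrative choices:

```python
import torch
import torch.nn as nn
from pypinyin import lazy_pinyin, Style

class PinyinEmbedding(nn.Module):
    """Embed the romanized pinyin characters of one Chinese character, run a 1-D
    convolution over them, and max-pool to a single vector: PYE = max-pooling(CNN(S))."""
    def __init__(self, d_pinyin: int = 768, seq_len: int = 8):
        super().__init__()
        # vocabulary: a-z plus tone digits 1-5; index 0 is padding
        self.char2id = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz12345")}
        self.emb = nn.Embedding(len(self.char2id) + 1, 128, padding_idx=0)
        self.cnn = nn.Conv1d(128, d_pinyin, kernel_size=2)
        self.seq_len = seq_len

    def forward(self, hanzi: str) -> torch.Tensor:
        s = lazy_pinyin(hanzi, style=Style.TONE3)[0]       # e.g. "猫" -> "mao1"
        ids = [self.char2id.get(c, 0) for c in s][: self.seq_len]
        ids += [0] * (self.seq_len - len(ids))             # pad to fixed length
        x = self.emb(torch.tensor(ids)).t().unsqueeze(0)   # (1, 128, seq_len)
        return self.cnn(x).max(dim=-1).values.squeeze(0)   # max over time, shape (d_pinyin,)

pye = PinyinEmbedding()("猫")
```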
4) Combine the pre-trained UniLM model to generate the final ASR recognition result. Compared with generation models built directly as language models, BERT with its bidirectional decoding cannot by itself satisfy the requirements of a language model; however, by manually controlling the decoding direction through an attention mask, the attention can be changed from bidirectional to unidirectional:
p(x_1, x_2, x_3, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)...p(x_n|x_1, x_2, ..., x_{n-1})
= p(x_3)p(x_1|x_3)p(x_2|x_3, x_1)...p(x_n|x_3, x_1, ..., x_{n-1})
= ...
= p(x_{n-1})p(x_1|x_{n-1})p(x_n|x_{n-1}, x_1)...p(x_2|x_{n-1}, x_1, ..., x_3)
Any "order of appearance" of x_1, x_2, ..., x_n is possible. In principle, each ordering corresponds to one model, so in principle there are n! language models. Implementing a language model for one particular ordering is equivalent to shuffling the original lower-triangular mask in some way. Precisely because attention provides such an n×n attention matrix, there are enough degrees of freedom to mask this matrix in different ways and obtain diverse behaviors, thereby satisfying the requirements of a language model.
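A minimal sketch of such a mask in the UniLM seq2seq style, where the prefix (here, the preliminary text from the acoustic model) is attended bidirectionally and the generated continuation only left-to-right; the segment lengths are illustrative, and in practice the mask is applied by setting the masked attention scores to -inf before the softmax:

```python
import torch

def unilm_mask(n_src: int, n_tgt: int) -> torch.Tensor:
    """UniLM-style seq2seq attention mask: the source segment attends bidirectionally
    to itself; each target token attends to the source and only to earlier target
    tokens. 1 = may attend, 0 = masked."""
    n = n_src + n_tgt
    mask = torch.zeros(n, n)
    mask[:, :n_src] = 1                               # everyone sees the source segment
    mask[n_src:, n_src:] = torch.tril(torch.ones(n_tgt, n_tgt))  # left-to-right decoding
    mask[:n_src, n_src:] = 0                          # source never peeks at the target
    return mask

# Example: 4 source tokens (preliminary ASR text), 3 target tokens being generated.
print(unilm_mask(4, 3))
```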