Disclosure of Invention
The technical purpose is as follows: the invention provides a voice recognition method and system based on contrastive predictive coding. The method makes full use of the large amount of insufficient voice data collected in the background and treats the voice data as time-series data, converting it end to end without extracting intermediate spectral features. From each voice, a certain number of fragments of fixed duration are extracted at random, and each fragment is divided into front data and rear data. When the front data are input, a first converter predictively encodes the rear time-series data; when the rear data are input, a second converter predictively encodes the front time-series data. After the predicted data are combined, they are compared end to end, in pairs, directly against the data to be detected of the same category (or of different categories), and end-to-end voice recognition is finally achieved according to the voice category label requirements.
Technical scheme
The first purpose of the present invention is to provide a speech recognition method based on contrastive predictive coding, comprising the following steps:
S1, collecting A voice files of each voice category, and preprocessing each voice file to obtain PCM-coded voice time-series data; A is a natural number greater than 1;
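By way of illustration only, and assuming the collected files are 16-bit mono WAV (the claim only requires PCM-coded time-series output), the preprocessing of S1 might reduce to reading the raw PCM samples:

```python
# A minimal preprocessing sketch; the WAV format and the normalization to [-1, 1]
# are assumptions, not requirements stated in the claim.
import wave
import numpy as np

def wav_to_pcm_series(path):
    """Read a 16-bit WAV file and return its PCM samples as a 1-D float32 array."""
    with wave.open(path, "rb") as f:
        raw = f.readframes(f.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32)
    return samples / 32768.0  # scale 16-bit PCM into [-1, 1]
```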
S2, constructing a paired data set of the voice time-series data; the paired data set comprises N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet, X2 is the second piece of voice time-series data of the triplet, and the pairing label Y is defined as 0 for a homogeneous pair and as 1 for a heterogeneous pair; each element of the homogeneous pairing set and of the heterogeneous pairing set consists of two pieces of voice time-series data; the two pieces of voice time-series data of each element of the homogeneous pairing set belong to the same voice category, and the two pieces of voice time-series data of each element of the heterogeneous pairing set belong to different voice categories;
S3, constructing a paired fragment data set; the method specifically comprises the following steps:
For the first piece of voice time-series data X1 in the paired data set, first randomly intercept M fragments S from the fixed-length series, each fragment S keeping a fixed length m; then take the first half of each fragment S as the front data of the fragment, denoted Sp, and take the remaining part as the rear data of the fragment, denoted Ss. For the second piece of voice time-series data X2 in the paired data set, likewise randomly intercept M fragments from the fixed-length series, each denoted S'. Finally, for each fragment S and each fragment S', copy the front data Sp and the rear data Ss corresponding to the fragment S together with the fragment S' and the label Y, obtaining a paired fragment data set of N × M quadruplets (Sp, Ss, S', Y);
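As an illustration of this step, a minimal numpy sketch follows; x1 and x2 stand for X1 and X2, m is the fixed fragment length, and the front data is taken to be exactly the first half of each fragment, as defined above:

```python
# Sketch of S3: cut M random fragments from each series and split the first
# series' fragments into front/rear halves; returns quadruplets (Sp, Ss, S', Y).
import numpy as np

def make_fragment_quads(x1, x2, y, m, n_fragments, rng=np.random.default_rng()):
    quads = []
    for _ in range(n_fragments):
        i = rng.integers(0, len(x1) - m + 1)      # random start inside x1
        s = x1[i:i + m]                           # fragment S of fixed length m
        s_p, s_s = s[:m // 2], s[m // 2:]         # front data Sp / rear data Ss
        j = rng.integers(0, len(x2) - m + 1)      # random start inside x2
        s_prime = x2[j:j + m]                     # fragment S' to be compared
        quads.append((s_p, s_s, s_prime, y))
    return quads
```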
S4, constructing an artificial neural network; the method specifically comprises the following steps:
S401, establishing an adversarial generative model combined with variational auto-encoding conditions, for extracting the implicit features of the voice time-series data;
S4011, processing the front data Sp of the fragment through the first converter to obtain Sps, and processing the rear data Ss of the fragment through the second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete fragment Sf;
S4013, creating a one-dimensional convolutional neural network that accepts any fragment as input; when the input is the complete fragment Sf, the output is recorded as Z; when the input is the fragment S' to be compared, the output is recorded as Z'; each time a complete fragment Sf is input, a fragment S' to be compared is input next;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖₂;
S4015, calculating the loss L from the distance d and the label Y as the contrastive loss
L = (1 − Y)·d² + Y·max(margin − d, 0)²;
margin is a user-defined real number greater than 0 and is usually set to 1;
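Purely as a sketch of S4011 to S4015 (the claims do not fix the converter architecture, so Converter below is a placeholder convolutional stack, and the concatenation order inside Sf, predicted front followed by predicted rear, is an assumption), the two converters, the one-dimensional CNN and the contrastive loss might be realized in PyTorch as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Converter(nn.Module):
    """Stand-in converter: maps one half-fragment to a prediction of the other half."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=5, padding=2),
        )

    def forward(self, x):                     # x: (batch, 1, m // 2)
        return self.net(x)                    # same shape as input

class Encoder1D(nn.Module):
    """The one-dimensional CNN of S4013: maps a full fragment to an embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):                     # x: (batch, 1, m)
        return self.fc(self.conv(x).squeeze(-1))   # Z or Z': (batch, embed_dim)

def forward_pair(conv1, conv2, encoder, s_p, s_s, s_prime):
    s_ps = conv1(s_p)                         # predicted rear half (S4011)
    s_sp = conv2(s_s)                         # predicted front half (S4011)
    s_f = torch.cat([s_sp, s_ps], dim=-1)     # recombined full fragment Sf (S4012)
    return encoder(s_f), encoder(s_prime)     # Z and Z' (S4013)

def contrastive_loss(z, z_prime, y, margin=1.0):
    d = torch.norm(z_prime - z, dim=1)        # d = ||Z' - Z||_2 (S4014)
    # Y = 0 pulls same-category embeddings together; Y = 1 pushes different
    # categories apart up to the margin (S4015).
    return ((1 - y) * d.pow(2) + y * F.relu(margin - d).pow(2)).mean()
```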
S5, training a voice recognition network consisting of the first converter, the second converter and the one-dimensional convolutional neural network;
and S6, performing voice recognition through the voice recognition network.
Preferably, S5 is specifically:
S501, initializing the first converter, the second converter and the one-dimensional convolutional neural network;
S502, the training data are the N × M paired fragment data set;
S503, feeding the training data into the voice recognition network as input, one by one;
S504, calculating the loss with L as the loss function;
S505, updating the weights of the voice recognition network with the ADAM optimization method;
S506, every M0 pieces of data processed are counted as one batch, and one pass over all the training data is counted as one epoch; M0 is 128 or 256;
S507, training for K epochs; K is a natural number.
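A sketch of this training procedure, reusing forward_pair and contrastive_loss from the sketch after S4015; wrapping the quadruplets in a PyTorch DataLoader with batch size M0, the tensor shape (batch, 1, length), the learning rate and the default K are illustrative assumptions:

```python
import torch

def train(conv1, conv2, encoder, loader, epochs_k=10, lr=1e-3):
    params = (list(conv1.parameters()) + list(conv2.parameters())
              + list(encoder.parameters()))
    opt = torch.optim.Adam(params, lr=lr)               # the ADAM method of S505
    for _ in range(epochs_k):                           # K epochs (S507)
        for s_p, s_s, s_prime, y in loader:             # one batch of M0 quadruplets (S506)
            z, z_prime = forward_pair(conv1, conv2, encoder, s_p, s_s, s_prime)
            loss = contrastive_loss(z, z_prime, y.float())   # L as loss function (S504)
            opt.zero_grad()
            loss.backward()
            opt.step()                                  # S505: weight update
```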
Preferably, S6 is specifically: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain the slice Sw; the slice Sw replaces the fragment S' to be compared in S4013, so that one-to-many pairs are formed and input into the voice recognition network; Z and Z' are calculated for each pair, the distance d is obtained from Z and Z', and a list {dw} is finally formed; the subscript corresponding to the minimum value in the list is the voice category number.
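The claim leaves open which side of each pair passes through the converters at recognition time; the sketch below assumes each category's reference fragment takes the converter path (yielding Z) while the user's slice Sw is encoded directly (yielding Z'), reusing the modules from the earlier sketches:

```python
import torch

@torch.no_grad()
def recognize(conv1, conv2, encoder, s_w, references, m):
    """Return the category number whose reference is closest to the slice s_w."""
    z_prime = encoder(s_w)                              # user slice in the role of S'
    distances = []
    for ref in references:                              # one fragment per category, (1, 1, m)
        s_p, s_s = ref[..., : m // 2], ref[..., m // 2:]
        s_f = torch.cat([conv2(s_s), conv1(s_p)], dim=-1)
        z = encoder(s_f)
        distances.append(torch.norm(z_prime - z).item())    # one entry of the list {dw}
    return min(range(len(distances)), key=distances.__getitem__)  # argmin = category number
```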
It is a second object of the present invention to provide a speech recognition system based on contrastive predictive coding, comprising:
the preprocessing module is used for acquiring A voice files of each voice category and preprocessing each voice file to obtain PCM-coded voice time-series data; A is a natural number greater than 1;
the paired data set construction module is used for constructing a paired data set of the voice time-series data; the paired data set comprises N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet, X2 is the second piece of voice time-series data of the triplet, and the pairing label Y is defined as 0 for a homogeneous pair and as 1 for a heterogeneous pair; each element of the homogeneous pairing set and of the heterogeneous pairing set consists of two pieces of voice time-series data; the two pieces of voice time-series data of each element of the homogeneous pairing set belong to the same voice category, and the two pieces of voice time-series data of each element of the heterogeneous pairing set belong to different voice categories;
the paired fragment data set construction module is used for constructing a paired fragment data set; the construction process is specifically as follows:
For the first piece of voice time-series data X1 in the paired data set, first randomly intercept M fragments S from the fixed-length series, each fragment S keeping a fixed length m; then take the first half of each fragment S as the front data of the fragment, denoted Sp, and take the remaining part as the rear data of the fragment, denoted Ss. For the second piece of voice time-series data X2 in the paired data set, likewise randomly intercept M fragments from the fixed-length series, each denoted S'. Finally, for each fragment S and each fragment S', copy the front data Sp and the rear data Ss corresponding to the fragment S together with the fragment S' and the label Y, obtaining a paired fragment data set of N × M quadruplets (Sp, Ss, S', Y);
an artificial neural network construction module; the construction process comprises the following steps:
S401, establishing an adversarial generative model combined with variational auto-encoding conditions, for extracting the implicit features of the voice time-series data;
S4011, processing the front data Sp of the fragment through the first converter to obtain Sps, and processing the rear data Ss of the fragment through the second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete fragment Sf;
S4013, creating a one-dimensional convolutional neural network that accepts any fragment as input; when the input is the complete fragment Sf, the output is recorded as Z; when the input is the fragment S' to be compared, the output is recorded as Z'; each time a complete fragment Sf is input, a fragment S' to be compared is input next;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖₂;
S4015, calculating the loss L from the distance d and the label Y as the contrastive loss
L = (1 − Y)·d² + Y·max(margin − d, 0)²;
margin is a user-defined real number greater than 0 and is usually set to 1;
the training module trains a voice recognition network formed by the first converter, the second converter and the one-dimensional convolutional neural network;
and the recognition module is used for carrying out voice recognition through a voice recognition network.
Preferably, the training process of the training module is as follows:
S501, initializing the first converter, the second converter and the one-dimensional convolutional neural network;
S502, the training data are the N × M paired fragment data set;
S503, feeding the training data into the voice recognition network as input, one by one;
S504, calculating the loss with L as the loss function;
S505, updating the weights of the voice recognition network with the ADAM optimization method;
S506, every M0 pieces of data processed are counted as one batch, and one pass over all the training data is counted as one epoch; M0 is 128 or 256;
S507, training for K epochs; K is a natural number.
Preferably, the recognition process of the recognition module is as follows: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain the slice Sw; the slice Sw replaces the fragment S' to be compared in S4013, so that one-to-many pairs are formed and input into the voice recognition network; Z and Z' are calculated for each pair, the distance d is obtained from Z and Z', and a list {dw} is finally formed; the subscript corresponding to the minimum value in the list is the voice category number.
A third object of the present invention is to provide an information data processing terminal for implementing the above speech recognition method based on contrastive predictive coding.
It is a fourth object of the present invention to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the above speech recognition method based on contrastive predictive coding.
The invention has the advantages and positive effects that:
the invention fully utilizes a large amount of insufficient voice data acquired by a background, takes the voice data as time sequence data, directly converts end to end without extracting the characteristics of the voice sequence data in the middle, randomly extracts a certain number of fragments with fixed time length from each voice, divides each fragment into front part data and back part data, realizes coding prediction of the back part sequence data through a first converter when the front data is input, realizes the sequence prediction of the front data through a second converter when the back part data is input, directly compares the predicted data with the data to be detected of the same type (or different types) end to end in pairs after combining, and finally realizes end to end voice recognition according to the requirements of voice category labels.
Detailed Description
In order to further explain the contents, features and effects of the present invention, the following embodiments are described in detail with reference to the accompanying drawings.
Referring to fig. 1 to 4, a speech recognition method based on contrastive predictive coding includes:
S1, collecting A voice files of each voice category, and preprocessing each voice file to obtain PCM-coded voice time-series data; A is a natural number greater than 1;
S2, constructing a paired data set of the voice time-series data; the paired data set comprises N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet, X2 is the second piece of voice time-series data of the triplet, and the pairing label Y is defined as 0 for a homogeneous pair and as 1 for a heterogeneous pair; each element of the homogeneous pairing set and of the heterogeneous pairing set consists of two pieces of voice time-series data; the two pieces of voice time-series data of each element of the homogeneous pairing set belong to the same voice category, and the two pieces of voice time-series data of each element of the heterogeneous pairing set belong to different voice categories; referring to fig. 1, the construction process of the paired data set is specifically as follows:
In fig. 1, for ease of explanation, three voice categories are used as an example; the original voice time-series data samples on the left comprise samples of three voice categories, and the samples of each voice category are distinguished by different fill patterns;
First, two pieces of voice time-series data are extracted from the voice time-series data of the same voice category and paired, giving a homogeneous pair whose label Y is defined as 0; one piece of voice time-series data is extracted from each of two different voice categories and paired, giving a heterogeneous pair whose label Y is defined as 1;
Then, given the total pairing number N, the number of voice categories (denoted the category number k) and the homogeneous sampling ratio α, and in order to sample fairly, i.e. so that every voice category has the same probability of being drawn, the constraint S1 + S2 = N/k is set; the number of homogeneous pairs S1 and the number of heterogeneous pairs S2 to be extracted are then calculated from N, k and α.
Finally, the homogeneous pairs, the heterogeneous pairs and their labels form a data set comprising N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet and X2 is the second piece of voice time-series data of the triplet; Y is the label.
The present invention constructs the data set by pairing these voice time-series data through sampling with replacement:
From the voice time-series data samples of one voice category, two samples are extracted at a time to complete one pairing, and the paired samples are labeled Y = 0.
From the voice time-series data samples of different voice categories, one sample is randomly extracted from one category and one from another category to complete one pairing, which is labeled Y = 1.
Homogeneous pairing is extracted for S1 rounds and heterogeneous pairing for S2 rounds, yielding S1 + S2 pairs that form the data set on which training and testing can be performed. Sampling with replacement addresses the problem of an insufficient number of voice time-series data within the various voice categories.
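A sketch of this with-replacement pairing under stated assumptions: samples_by_class maps each category index to its list of PCM arrays, and s1 and s2 are the pairing counts S1 and S2 computed above:

```python
# Build N = s1 + s2 triplets (X1, X2, Y) by sampling with replacement.
import random

def build_pairs(samples_by_class, s1, s2, rng=random.Random(0)):
    classes = list(samples_by_class)
    triplets = []
    for _ in range(s1):                                  # S1 rounds of homogeneous pairing
        c = rng.choice(classes)
        x1, x2 = rng.choices(samples_by_class[c], k=2)   # two draws, with replacement
        triplets.append((x1, x2, 0))                     # Y = 0: same category
    for _ in range(s2):                                  # S2 rounds of heterogeneous pairing
        ca, cb = rng.sample(classes, 2)                  # two distinct categories
        triplets.append((rng.choice(samples_by_class[ca]),
                         rng.choice(samples_by_class[cb]), 1))   # Y = 1: different
    return triplets
```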
S3, constructing a paired fragment data set; the method specifically comprises the following steps:
For the first piece of voice time-series data X1 in the paired data set, first randomly intercept M fragments S from the fixed-length series, each fragment S keeping a fixed length m; then take the first half of each fragment S as the front data of the fragment, denoted Sp, and take the remaining part as the rear data of the fragment, denoted Ss. For the second piece of voice time-series data X2 in the paired data set, likewise randomly intercept M fragments from the fixed-length series, each denoted S'. Finally, for each fragment S and each fragment S', copy the front data Sp and the rear data Ss corresponding to the fragment S together with the fragment S' and the label Y, obtaining a paired fragment data set of N × M quadruplets (Sp, Ss, S', Y);
S4, constructing an artificial neural network; the method specifically comprises the following steps:
S401, establishing an adversarial generative model combined with variational auto-encoding conditions, for extracting the implicit features of the voice time-series data;
S4011, establishing a first converter corresponding to the front data Sp of the fragment, whose processed result is Sps, and a second converter corresponding to the rear data Ss of the fragment, whose processed result is Ssp;
S4012, combining the front and rear predictions (Sps, Ssp) into a complete fragment Sf;
S4013, creating a one-dimensional convolutional neural network that accepts any fragment as input; when the input is the complete fragment Sf, the output is recorded as Z; when the input is the fragment S' to be compared, the output is recorded as Z'; each time a complete fragment Sf is input, a fragment S' to be compared is input next;
S4014, according to the paired fragments (S, S'): the fragment S is divided into its front and rear parts, the first and second converters output Sps and Ssp, which are combined into the fragment Sf and passed through the one-dimensional convolutional neural network to output Z; the fragment S' to be compared is passed directly through the one-dimensional convolutional neural network to output Z'; the distance d is then calculated from (Z, Z'):
d = ‖Z' − Z‖₂;
S4015, calculating the loss L from the distance d and the label Y as the contrastive loss
L = (1 − Y)·d² + Y·max(margin − d, 0)²;
margin is a user-defined real number greater than 0 and is usually set to 1;
S5, training a voice recognition network consisting of the first converter, the second converter and the one-dimensional convolutional neural network; the method specifically comprises the following steps:
S501, initializing the first converter, the second converter and the one-dimensional convolutional neural network;
S502, the training data are the N × M paired fragment data set;
S503, feeding the training data into the voice recognition network as input, one by one;
S504, calculating the loss with L as the loss function;
S505, updating the weights of the voice recognition network with the ADAM optimization method;
S506, every M0 pieces of data processed (M0 is a user-defined natural number, suggested to be 128 or 256) are counted as one batch, and one pass over all the training data is counted as one epoch;
S507, training for K epochs; K is a natural number;
S6, performing voice recognition through the voice recognition network;
From the reference voice library of each category, one reference voice is taken per category to form S'; a voice to be recognized is taken from the user and sliced to obtain the slice Sw, which replaces the fragment S' to be compared in S4013; one-to-many pairs can thus be formed as above and input into the voice recognition network trained in S5; Z and Z' are calculated for each pair, the distance d is obtained from Z and Z', and a list {dw} is finally formed; the subscript corresponding to the minimum value in the list is the voice category number.
A speech recognition system based on contrastive predictive coding, comprising:
the preprocessing module is used for acquiring A voice files of each voice category and preprocessing each voice file to obtain PCM-coded voice time-series data; A is a natural number greater than 1;
the paired data set construction module is used for constructing a paired data set of the voice time-series data; the paired data set comprises N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet, X2 is the second piece of voice time-series data of the triplet, and the pairing label Y is defined as 0 for a homogeneous pair and as 1 for a heterogeneous pair; each element of the homogeneous pairing set and of the heterogeneous pairing set consists of two pieces of voice time-series data; the two pieces of voice time-series data of each element of the homogeneous pairing set belong to the same voice category, and the two pieces of voice time-series data of each element of the heterogeneous pairing set belong to different voice categories; referring to fig. 1, the construction process of the paired data set is specifically as follows:
In fig. 1, for ease of explanation, three voice categories are used as an example; the original voice time-series data samples on the left comprise samples of three voice categories, and the samples of each voice category are distinguished by different fill patterns;
First, two pieces of voice time-series data are extracted from the voice time-series data of the same voice category and paired, giving a homogeneous pair whose label Y is defined as 0; one piece of voice time-series data is extracted from each of two different voice categories and paired, giving a heterogeneous pair whose label Y is defined as 1;
Then, given the total pairing number N, the number of voice categories (denoted the category number k) and the homogeneous sampling ratio α, and in order to sample fairly, i.e. so that every voice category has the same probability of being drawn, the constraint S1 + S2 = N/k is set; the number of homogeneous pairs S1 and the number of heterogeneous pairs S2 to be extracted are then calculated from N, k and α.
Finally, the homogeneous pairs, the heterogeneous pairs and their labels Y form a data set comprising N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet and X2 is the second piece of voice time-series data of the triplet; Y is the label.
The present invention constructs the data set by pairing these voice time-series data through sampling with replacement:
From the voice time-series data samples of one voice category, two samples are extracted at a time to complete one pairing, and the paired samples are labeled Y = 0.
From the voice time-series data samples of different voice categories, one sample is randomly extracted from one category and one from another category to complete one pairing, which is labeled Y = 1.
Homogeneous pairing is extracted for S1 rounds and heterogeneous pairing for S2 rounds, yielding S1 + S2 pairs that form the data set on which training and testing can be performed. Sampling with replacement addresses the problem of an insufficient number of voice time-series data within the various voice categories.
The paired fragment data set construction module is used for constructing a paired fragment data set; the construction process is specifically as follows:
For the first piece of voice time-series data X1 in the paired data set, first randomly intercept M fragments S from the fixed-length series, each fragment S keeping a fixed length m; then take the first half of each fragment S as the front data of the fragment, denoted Sp, and take the remaining part as the rear data of the fragment, denoted Ss. For the second piece of voice time-series data X2 in the paired data set, likewise randomly intercept M fragments from the fixed-length series, each denoted S'. Finally, for each fragment S and each fragment S', copy the front data Sp and the rear data Ss corresponding to the fragment S together with the fragment S' and the label Y, obtaining a paired fragment data set of N × M quadruplets (Sp, Ss, S', Y);
the artificial neural network construction module is used for constructing an artificial neural network; the construction process is specifically as follows:
S401, establishing an adversarial generative model combined with variational auto-encoding conditions, for extracting the implicit features of the voice time-series data;
S4011, establishing a first converter corresponding to the front data Sp of the fragment, whose processed result is Sps, and a second converter corresponding to the rear data Ss of the fragment, whose processed result is Ssp;
S4012, combining the front and rear predictions (Sps, Ssp) into a complete fragment Sf;
S4013, creating a one-dimensional convolutional neural network that accepts any fragment as input; when the input is the complete fragment Sf, the output is recorded as Z; when the input is the fragment S' to be compared, the output is recorded as Z'; each time a complete fragment Sf is input, a fragment S' to be compared is input next;
S4014, according to the paired fragments (S, S'): the fragment S is divided into its front and rear parts, the first and second converters output Sps and Ssp, which are combined into the fragment Sf and passed through the one-dimensional convolutional neural network to output Z; the fragment S' to be compared is passed directly through the one-dimensional convolutional neural network to output Z'; the distance d is then calculated from (Z, Z'):
d = ‖Z' − Z‖₂;
S4015, calculating the loss L from the distance d and the label Y as the contrastive loss
L = (1 − Y)·d² + Y·max(margin − d, 0)²;
margin is a user-defined real number greater than 0 and is usually set to 1;
the training module trains the voice recognition network formed by the first converter, the second converter and the one-dimensional convolutional neural network; the training process is as follows:
S501, initializing the first converter, the second converter and the one-dimensional convolutional neural network;
S502, the training data are the N × M paired fragment data set;
S503, feeding the training data into the voice recognition network as input, one by one;
S504, calculating the loss with L as the loss function;
S505, updating the weights of the voice recognition network with the ADAM optimization method;
S506, every M0 pieces of data processed (M0 is a user-defined natural number, suggested to be 128 or 256) are counted as one batch, and one pass over all the training data is counted as one epoch;
S507, training for K epochs; K is a natural number;
the voice recognition module is used for carrying out voice recognition through a voice recognition network;
From the reference voice library of each category, one reference voice is taken per category to form S'; a voice to be recognized is taken from the user and sliced to obtain the slice Sw, which replaces the fragment S' to be compared in S4013; one-to-many pairs can thus be formed as above and input into the voice recognition network trained in S5; Z and Z' are calculated for each pair, the distance d is obtained from Z and Z', and a list {dw} is finally formed; the subscript corresponding to the minimum value in the list is the voice category number.
An information data processing terminal is used for implementing the above speech recognition method based on contrastive predictive coding.
A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the above speech recognition method based on contrastive predictive coding.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented wholly or partially in software, it may take the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention in any way; all simple modifications, equivalent changes and variations made to the above embodiments according to the technical essence of the present invention remain within the scope of the technical solution of the present invention.