Disclosure of Invention
The technical purpose is as follows: the invention provides a voice recognition method and system based on contrastive predictive coding. The method makes full use of the large amount of insufficient voice data collected in the background and treats the voice data as time-series data, converting it end to end without extracting intermediate spectral features. From each voice, a certain number of fragments of fixed duration are extracted at random, and each fragment is divided into front data and rear data. When the front data are input, a first converter predictively encodes the rear time-series data; when the rear data are input, a second converter predictively encodes the front time-series data. After the predicted data are combined, they are compared end to end, in pairs, directly against the data to be detected of the same category (or of different categories), and end-to-end voice recognition is finally achieved according to the voice category label requirements.
Technical scheme
The first purpose of the present invention is to provide a speech recognition method based on contrastive predictive coding, comprising the following steps:
S1, collecting A voice files of each voice category, and preprocessing each voice file to obtain PCM-coded voice time-series data; A is a natural number greater than 1;
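By way of illustration only, and assuming the collected files are 16-bit mono WAV (the claim only requires PCM-coded time-series output), the preprocessing of S1 might reduce to reading the raw PCM samples:

```python
# A minimal preprocessing sketch; the WAV format and the normalization to [-1, 1]
# are assumptions, not requirements stated in the claim.
import wave
import numpy as np

def wav_to_pcm_series(path):
    """Read a 16-bit WAV file and return its PCM samples as a 1-D float32 array."""
    with wave.open(path, "rb") as f:
        raw = f.readframes(f.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32)
    return samples / 32768.0  # scale 16-bit PCM into [-1, 1]
```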
S2, constructing a paired data set of the voice time-series data; the paired data set comprises N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet, X2 is the second piece of voice time-series data of the triplet, and the pairing label Y is defined as 0 for a homogeneous pair and as 1 for a heterogeneous pair; each element of the homogeneous pairing set and of the heterogeneous pairing set consists of two pieces of voice time-series data; the two pieces of voice time-series data of each element of the homogeneous pairing set belong to the same voice category, and the two pieces of voice time-series data of each element of the heterogeneous pairing set belong to different voice categories;
S3, constructing a paired fragment data set; the method specifically comprises the following steps:
For the first piece of voice time-series data X1 in the paired data set, first randomly intercept M fragments S from the fixed-length series, each fragment S keeping a fixed length m; then take the first half of each fragment S as the front data of the fragment, denoted Sp, and take the remaining part as the rear data of the fragment, denoted Ss. For the second piece of voice time-series data X2 in the paired data set, likewise randomly intercept M fragments from the fixed-length series, each denoted S'. Finally, for each fragment S and each fragment S', copy the front data Sp and the rear data Ss corresponding to the fragment S together with the fragment S' and the label Y, obtaining a paired fragment data set of N × M quadruplets (Sp, Ss, S', Y);
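As an illustration of this step, a minimal numpy sketch follows; x1 and x2 stand for X1 and X2, m is the fixed fragment length, and the front data is taken to be exactly the first half of each fragment, as defined above:

```python
# Sketch of S3: cut M random fragments from each series and split the first
# series' fragments into front/rear halves; returns quadruplets (Sp, Ss, S', Y).
import numpy as np

def make_fragment_quads(x1, x2, y, m, n_fragments, rng=np.random.default_rng()):
    quads = []
    for _ in range(n_fragments):
        i = rng.integers(0, len(x1) - m + 1)      # random start inside x1
        s = x1[i:i + m]                           # fragment S of fixed length m
        s_p, s_s = s[:m // 2], s[m // 2:]         # front data Sp / rear data Ss
        j = rng.integers(0, len(x2) - m + 1)      # random start inside x2
        s_prime = x2[j:j + m]                     # fragment S' to be compared
        quads.append((s_p, s_s, s_prime, y))
    return quads
```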
S4, constructing an artificial neural network; the method specifically comprises the following steps:
S401, establishing an adversarial generative model combined with variational auto-encoding conditions, for extracting the implicit features of the voice time-series data;
S4011, processing the front data Sp of the fragment through the first converter to obtain Sps, and processing the rear data Ss of the fragment through the second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete fragment Sf;
S4013, creating a one-dimensional convolutional neural network that accepts any fragment as input; when the input is the complete fragment Sf, the output is recorded as Z; when the input is the fragment S' to be compared, the output is recorded as Z'; each time a complete fragment Sf is input, a fragment S' to be compared is input next;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖₂;
S4015, calculating the loss L from the distance d and the label Y as the contrastive loss
L = (1 − Y)·d² + Y·max(margin − d, 0)²;
margin is a user-defined real number greater than 0 and is usually set to 1;
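Purely as a sketch of S4011 to S4015 (the claims do not fix the converter architecture, so Converter below is a placeholder convolutional stack, and the concatenation order inside Sf, predicted front followed by predicted rear, is an assumption), the two converters, the one-dimensional CNN and the contrastive loss might be realized in PyTorch as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Converter(nn.Module):
    """Stand-in converter: maps one half-fragment to a prediction of the other half."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=5, padding=2),
        )

    def forward(self, x):                     # x: (batch, 1, m // 2)
        return self.net(x)                    # same shape as input

class Encoder1D(nn.Module):
    """The one-dimensional CNN of S4013: maps a full fragment to an embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):                     # x: (batch, 1, m)
        return self.fc(self.conv(x).squeeze(-1))   # Z or Z': (batch, embed_dim)

def forward_pair(conv1, conv2, encoder, s_p, s_s, s_prime):
    s_ps = conv1(s_p)                         # predicted rear half (S4011)
    s_sp = conv2(s_s)                         # predicted front half (S4011)
    s_f = torch.cat([s_sp, s_ps], dim=-1)     # recombined full fragment Sf (S4012)
    return encoder(s_f), encoder(s_prime)     # Z and Z' (S4013)

def contrastive_loss(z, z_prime, y, margin=1.0):
    d = torch.norm(z_prime - z, dim=1)        # d = ||Z' - Z||_2 (S4014)
    # Y = 0 pulls same-category embeddings together; Y = 1 pushes different
    # categories apart up to the margin (S4015).
    return ((1 - y) * d.pow(2) + y * F.relu(margin - d).pow(2)).mean()
```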
S5, training a voice recognition network consisting of the first converter, the second converter and the one-dimensional convolutional neural network;
and S6, performing voice recognition through the voice recognition network.
Preferably, S5 is specifically:
S501, initializing the first converter, the second converter and the one-dimensional convolutional neural network;
S502, the training data are the N × M paired fragment data set;
S503, feeding the training data into the voice recognition network as input, one by one;
S504, calculating the loss with L as the loss function;
S505, updating the weights of the voice recognition network with the ADAM optimization method;
S506, every M0 pieces of data processed are counted as one batch, and one pass over all the training data is counted as one epoch; M0 is 128 or 256;
S507, training for K epochs; K is a natural number.
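A sketch of this training procedure, reusing forward_pair and contrastive_loss from the sketch after S4015; wrapping the quadruplets in a PyTorch DataLoader with batch size M0, the tensor shape (batch, 1, length), the learning rate and the default K are illustrative assumptions:

```python
import torch

def train(conv1, conv2, encoder, loader, epochs_k=10, lr=1e-3):
    params = (list(conv1.parameters()) + list(conv2.parameters())
              + list(encoder.parameters()))
    opt = torch.optim.Adam(params, lr=lr)               # the ADAM method of S505
    for _ in range(epochs_k):                           # K epochs (S507)
        for s_p, s_s, s_prime, y in loader:             # one batch of M0 quadruplets (S506)
            z, z_prime = forward_pair(conv1, conv2, encoder, s_p, s_s, s_prime)
            loss = contrastive_loss(z, z_prime, y.float())   # L as loss function (S504)
            opt.zero_grad()
            loss.backward()
            opt.step()                                  # S505: weight update
```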
Preferably, S6 is specifically: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain the slice Sw; the slice Sw replaces the fragment S' to be compared in S4013, so that one-to-many pairs are formed and input into the voice recognition network; Z and Z' are calculated for each pair, the distance d is obtained from Z and Z', and a list {dw} is finally formed; the subscript corresponding to the minimum value in the list is the voice category number.
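The claim leaves open which side of each pair passes through the converters at recognition time; the sketch below assumes each category's reference fragment takes the converter path (yielding Z) while the user's slice Sw is encoded directly (yielding Z'), reusing the modules from the earlier sketches:

```python
import torch

@torch.no_grad()
def recognize(conv1, conv2, encoder, s_w, references, m):
    """Return the category number whose reference is closest to the slice s_w."""
    z_prime = encoder(s_w)                              # user slice in the role of S'
    distances = []
    for ref in references:                              # one fragment per category, (1, 1, m)
        s_p, s_s = ref[..., : m // 2], ref[..., m // 2:]
        s_f = torch.cat([conv2(s_s), conv1(s_p)], dim=-1)
        z = encoder(s_f)
        distances.append(torch.norm(z_prime - z).item())    # one entry of the list {dw}
    return min(range(len(distances)), key=distances.__getitem__)  # argmin = category number
```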
It is a second object of the present invention to provide a speech recognition system based on contrastive predictive coding, comprising:
the preprocessing module is used for acquiring A voice files of each voice category and preprocessing each voice file to obtain PCM-coded voice time-series data; A is a natural number greater than 1;
the paired data set construction module is used for constructing a paired data set of the voice time-series data; the paired data set comprises N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet, X2 is the second piece of voice time-series data of the triplet, and the pairing label Y is defined as 0 for a homogeneous pair and as 1 for a heterogeneous pair; each element of the homogeneous pairing set and of the heterogeneous pairing set consists of two pieces of voice time-series data; the two pieces of voice time-series data of each element of the homogeneous pairing set belong to the same voice category, and the two pieces of voice time-series data of each element of the heterogeneous pairing set belong to different voice categories;
the paired fragment data set construction module is used for constructing a paired fragment data set; the construction process is specifically as follows:
For the first piece of voice time-series data X1 in the paired data set, first randomly intercept M fragments S from the fixed-length series, each fragment S keeping a fixed length m; then take the first half of each fragment S as the front data of the fragment, denoted Sp, and take the remaining part as the rear data of the fragment, denoted Ss. For the second piece of voice time-series data X2 in the paired data set, likewise randomly intercept M fragments from the fixed-length series, each denoted S'. Finally, for each fragment S and each fragment S', copy the front data Sp and the rear data Ss corresponding to the fragment S together with the fragment S' and the label Y, obtaining a paired fragment data set of N × M quadruplets (Sp, Ss, S', Y);
an artificial neural network construction module; the construction process comprises the following steps:
S401, establishing an adversarial generative model combined with variational auto-encoding conditions, for extracting the implicit features of the voice time-series data;
S4011, processing the front data Sp of the fragment through the first converter to obtain Sps, and processing the rear data Ss of the fragment through the second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete fragment Sf;
S4013, creating a one-dimensional convolutional neural network that accepts any fragment as input; when the input is the complete fragment Sf, the output is recorded as Z; when the input is the fragment S' to be compared, the output is recorded as Z'; each time a complete fragment Sf is input, a fragment S' to be compared is input next;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖₂;
S4015, calculating the loss L from the distance d and the label Y as the contrastive loss
L = (1 − Y)·d² + Y·max(margin − d, 0)²;
margin is a user-defined real number greater than 0 and is usually set to 1;
the training module trains a voice recognition network formed by the first converter, the second converter and the one-dimensional convolutional neural network;
and the recognition module is used for carrying out voice recognition through a voice recognition network.
Preferably, the training process of the training module is as follows:
S501, initializing the first converter, the second converter and the one-dimensional convolutional neural network;
S502, the training data are the N × M paired fragment data set;
S503, feeding the training data into the voice recognition network as input, one by one;
S504, calculating the loss with L as the loss function;
S505, updating the weights of the voice recognition network with the ADAM optimization method;
S506, every M0 pieces of data processed are counted as one batch, and one pass over all the training data is counted as one epoch; M0 is 128 or 256;
S507, training for K epochs; K is a natural number.
Preferably, the recognition process of the recognition module is as follows: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain the slice Sw; the slice Sw replaces the fragment S' to be compared in S4013, so that one-to-many pairs are formed and input into the voice recognition network; Z and Z' are calculated for each pair, the distance d is obtained from Z and Z', and a list {dw} is finally formed; the subscript corresponding to the minimum value in the list is the voice category number.
A third object of the present invention is to provide an information data processing terminal for implementing the above speech recognition method based on contrastive predictive coding.
It is a fourth object of the present invention to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the above speech recognition method based on contrastive predictive coding.
The invention has the advantages and positive effects that:
the invention fully utilizes a large amount of insufficient voice data acquired by a background, takes the voice data as time sequence data, directly converts end to end without extracting the characteristics of the voice sequence data in the middle, randomly extracts a certain number of fragments with fixed time length from each voice, divides each fragment into front part data and back part data, realizes coding prediction of the back part sequence data through a first converter when the front data is input, realizes the sequence prediction of the front data through a second converter when the back part data is input, directly compares the predicted data with the data to be detected of the same type (or different types) end to end in pairs after combining, and finally realizes end to end voice recognition according to the requirements of voice category labels.
Detailed Description
In order to further explain the contents, features and effects of the present invention, the following embodiments are described in detail with reference to the accompanying drawings.
Referring to fig. 1 to 4, a speech recognition method based on contrastive predictive coding includes:
S1, collecting A voice files of each voice category, and preprocessing each voice file to obtain PCM-coded voice time-series data; A is a natural number greater than 1;
S2, constructing a paired data set of the voice time-series data; the paired data set comprises N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet, X2 is the second piece of voice time-series data of the triplet, and the pairing label Y is defined as 0 for a homogeneous pair and as 1 for a heterogeneous pair; each element of the homogeneous pairing set and of the heterogeneous pairing set consists of two pieces of voice time-series data; the two pieces of voice time-series data of each element of the homogeneous pairing set belong to the same voice category, and the two pieces of voice time-series data of each element of the heterogeneous pairing set belong to different voice categories; referring to fig. 1, the construction process of the paired data set is specifically as follows:
In fig. 1, for ease of explanation, three voice categories are used as an example; the original voice time-series data samples on the left comprise samples of three voice categories, and the samples of each voice category are distinguished by different fill patterns;
First, two pieces of voice time-series data are extracted from the voice time-series data of the same voice category and paired, giving a homogeneous pair whose label Y is defined as 0; one piece of voice time-series data is extracted from each of two different voice categories and paired, giving a heterogeneous pair whose label Y is defined as 1;
Then, given the total pairing number N, the number of voice categories (denoted the category number k) and the homogeneous sampling ratio α, and in order to sample fairly, i.e. so that every voice category has the same probability of being drawn, the constraint S1 + S2 = N/k is set; the number of homogeneous pairs S1 and the number of heterogeneous pairs S2 to be extracted are then calculated from N, k and α.
Finally, the homogeneous pairs, the heterogeneous pairs and their labels form a data set comprising N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet and X2 is the second piece of voice time-series data of the triplet; Y is the label.
The present invention constructs the data set by pairing these voice time-series data through sampling with replacement:
From the voice time-series data samples of one voice category, two samples are extracted at a time to complete one pairing, and the paired samples are labeled Y = 0.
From the voice time-series data samples of different voice categories, one sample is randomly extracted from one category and one from another category to complete one pairing, which is labeled Y = 1.
Homogeneous pairing is extracted for S1 rounds and heterogeneous pairing for S2 rounds, yielding S1 + S2 pairs that form the data set on which training and testing can be performed. Sampling with replacement addresses the problem of an insufficient number of voice time-series data within the various voice categories.
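A sketch of this with-replacement pairing under stated assumptions: samples_by_class maps each category index to its list of PCM arrays, and s1 and s2 are the pairing counts S1 and S2 computed above:

```python
# Build N = s1 + s2 triplets (X1, X2, Y) by sampling with replacement.
import random

def build_pairs(samples_by_class, s1, s2, rng=random.Random(0)):
    classes = list(samples_by_class)
    triplets = []
    for _ in range(s1):                                  # S1 rounds of homogeneous pairing
        c = rng.choice(classes)
        x1, x2 = rng.choices(samples_by_class[c], k=2)   # two draws, with replacement
        triplets.append((x1, x2, 0))                     # Y = 0: same category
    for _ in range(s2):                                  # S2 rounds of heterogeneous pairing
        ca, cb = rng.sample(classes, 2)                  # two distinct categories
        triplets.append((rng.choice(samples_by_class[ca]),
                         rng.choice(samples_by_class[cb]), 1))   # Y = 1: different
    return triplets
```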
S3, constructing a paired fragment data set; the method specifically comprises the following steps:
For the first piece of voice time-series data X1 in the paired data set, first randomly intercept M fragments S from the fixed-length series, each fragment S keeping a fixed length m; then take the first half of each fragment S as the front data of the fragment, denoted Sp, and take the remaining part as the rear data of the fragment, denoted Ss. For the second piece of voice time-series data X2 in the paired data set, likewise randomly intercept M fragments from the fixed-length series, each denoted S'. Finally, for each fragment S and each fragment S', copy the front data Sp and the rear data Ss corresponding to the fragment S together with the fragment S' and the label Y, obtaining a paired fragment data set of N × M quadruplets (Sp, Ss, S', Y);
S4, constructing an artificial neural network; the method specifically comprises the following steps:
S401, establishing an adversarial generative model combined with variational auto-encoding conditions, for extracting the implicit features of the voice time-series data;
S4011, establishing a first converter corresponding to the front data Sp of the fragment, whose processed result is Sps, and a second converter corresponding to the rear data Ss of the fragment, whose processed result is Ssp;
S4012, combining the front and rear predictions (Sps, Ssp) into a complete fragment Sf;
S4013, creating a one-dimensional convolutional neural network that accepts any fragment as input; when the input is the complete fragment Sf, the output is recorded as Z; when the input is the fragment S' to be compared, the output is recorded as Z'; each time a complete fragment Sf is input, a fragment S' to be compared is input next;
S4014, according to the paired fragments (S, S'): the fragment S is divided into its front and rear parts, the first and second converters output Sps and Ssp, which are combined into the fragment Sf and passed through the one-dimensional convolutional neural network to output Z; the fragment S' to be compared is passed directly through the one-dimensional convolutional neural network to output Z'; the distance d is then calculated from (Z, Z'):
d = ‖Z' − Z‖₂;
S4015, calculating the loss L from the distance d and the label Y as the contrastive loss
L = (1 − Y)·d² + Y·max(margin − d, 0)²;
margin is a user-defined real number greater than 0 and is usually set to 1;
S5, training a voice recognition network consisting of the first converter, the second converter and the one-dimensional convolutional neural network; the method specifically comprises the following steps:
S501, initializing the first converter, the second converter and the one-dimensional convolutional neural network;
S502, the training data are the N × M paired fragment data set;
S503, feeding the training data into the voice recognition network as input, one by one;
S504, calculating the loss with L as the loss function;
S505, updating the weights of the voice recognition network with the ADAM optimization method;
S506, every M0 pieces of data processed (M0 is a user-defined natural number, suggested to be 128 or 256) are counted as one batch, and one pass over all the training data is counted as one epoch;
S507, training for K epochs; K is a natural number;
S6, performing voice recognition through the voice recognition network;
From the reference voice library of each category, one reference voice is taken per category to form S'; a voice to be recognized is taken from the user and sliced to obtain the slice Sw, which replaces the fragment S' to be compared in S4013; one-to-many pairs can thus be formed as above and input into the voice recognition network trained in S5; Z and Z' are calculated for each pair, the distance d is obtained from Z and Z', and a list {dw} is finally formed; the subscript corresponding to the minimum value in the list is the voice category number.
A speech recognition system based on contrastive predictive coding, comprising:
the preprocessing module is used for acquiring A voice files of each voice category and preprocessing each voice file to obtain PCM-coded voice time-series data; A is a natural number greater than 1;
the paired data set construction module is used for constructing a paired data set of the voice time-series data; the paired data set comprises N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet, X2 is the second piece of voice time-series data of the triplet, and the pairing label Y is defined as 0 for a homogeneous pair and as 1 for a heterogeneous pair; each element of the homogeneous pairing set and of the heterogeneous pairing set consists of two pieces of voice time-series data; the two pieces of voice time-series data of each element of the homogeneous pairing set belong to the same voice category, and the two pieces of voice time-series data of each element of the heterogeneous pairing set belong to different voice categories; referring to fig. 1, the construction process of the paired data set is specifically as follows:
In fig. 1, for ease of explanation, three voice categories are used as an example; the original voice time-series data samples on the left comprise samples of three voice categories, and the samples of each voice category are distinguished by different fill patterns;
First, two pieces of voice time-series data are extracted from the voice time-series data of the same voice category and paired, giving a homogeneous pair whose label Y is defined as 0; one piece of voice time-series data is extracted from each of two different voice categories and paired, giving a heterogeneous pair whose label Y is defined as 1;
Then, given the total pairing number N, the number of voice categories (denoted the category number k) and the homogeneous sampling ratio α, and in order to sample fairly, i.e. so that every voice category has the same probability of being drawn, the constraint S1 + S2 = N/k is set; the number of homogeneous pairs S1 and the number of heterogeneous pairs S2 to be extracted are then calculated from N, k and α.
Finally, the homogeneous pairs, the heterogeneous pairs and their labels Y form a data set comprising N triplets (X1, X2, Y), wherein: X1 is the first piece of voice time-series data of the triplet and X2 is the second piece of voice time-series data of the triplet; Y is the label.
The present invention constructs the data set by pairing these voice time-series data through sampling with replacement:
From the voice time-series data samples of one voice category, two samples are extracted at a time to complete one pairing, and the paired samples are labeled Y = 0.
From the voice time-series data samples of different voice categories, one sample is randomly extracted from one category and one from another category to complete one pairing, which is labeled Y = 1.
Homogeneous pairing is extracted for S1 rounds and heterogeneous pairing for S2 rounds, yielding S1 + S2 pairs that form the data set on which training and testing can be performed. Sampling with replacement addresses the problem of an insufficient number of voice time-series data within the various voice categories.
The paired fragment data set construction module is used for constructing a paired fragment data set; the construction process is specifically as follows:
For the first piece of voice time-series data X1 in the paired data set, first randomly intercept M fragments S from the fixed-length series, each fragment S keeping a fixed length m; then take the first half of each fragment S as the front data of the fragment, denoted Sp, and take the remaining part as the rear data of the fragment, denoted Ss. For the second piece of voice time-series data X2 in the paired data set, likewise randomly intercept M fragments from the fixed-length series, each denoted S'. Finally, for each fragment S and each fragment S', copy the front data Sp and the rear data Ss corresponding to the fragment S together with the fragment S' and the label Y, obtaining a paired fragment data set of N × M quadruplets (Sp, Ss, S', Y);
the artificial neural network construction module is used for constructing an artificial neural network; the construction process is specifically as follows:
S401, establishing an adversarial generative model combined with variational auto-encoding conditions, for extracting the implicit features of the voice time-series data;
S4011, establishing a first converter corresponding to the front data Sp of the fragment, whose processed result is Sps, and a second converter corresponding to the rear data Ss of the fragment, whose processed result is Ssp;
S4012, combining the front and rear predictions (Sps, Ssp) into a complete fragment Sf;
S4013, creating a one-dimensional convolutional neural network that accepts any fragment as input; when the input is the complete fragment Sf, the output is recorded as Z; when the input is the fragment S' to be compared, the output is recorded as Z'; each time a complete fragment Sf is input, a fragment S' to be compared is input next;
S4014, according to the paired fragments (S, S'): the fragment S is divided into its front and rear parts, the first and second converters output Sps and Ssp, which are combined into the fragment Sf and passed through the one-dimensional convolutional neural network to output Z; the fragment S' to be compared is passed directly through the one-dimensional convolutional neural network to output Z'; the distance d is then calculated from (Z, Z'):
d = ‖Z' − Z‖₂;
S4015, calculating the loss L from the distance d and the label Y as the contrastive loss
L = (1 − Y)·d² + Y·max(margin − d, 0)²;
margin is a user-defined real number greater than 0 and is usually set to 1;
the training module trains the voice recognition network formed by the first converter, the second converter and the one-dimensional convolutional neural network; the training process is as follows:
S501, initializing the first converter, the second converter and the one-dimensional convolutional neural network;
S502, the training data are the N × M paired fragment data set;
S503, feeding the training data into the voice recognition network as input, one by one;
S504, calculating the loss with L as the loss function;
S505, updating the weights of the voice recognition network with the ADAM optimization method;
S506, every M0 pieces of data processed (M0 is a user-defined natural number, suggested to be 128 or 256) are counted as one batch, and one pass over all the training data is counted as one epoch;
S507, training for K epochs; K is a natural number;
the voice recognition module is used for carrying out voice recognition through a voice recognition network;
From the reference voice library of each category, one reference voice is taken per category to form S'; a voice to be recognized is taken from the user and sliced to obtain the slice Sw, which replaces the fragment S' to be compared in S4013; one-to-many pairs can thus be formed as above and input into the voice recognition network trained in S5; Z and Z' are calculated for each pair, the distance d is obtained from Z and Z', and a list {dw} is finally formed; the subscript corresponding to the minimum value in the list is the voice category number.
An information data processing terminal is used for implementing the above speech recognition method based on contrastive predictive coding.
A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the above speech recognition method based on contrastive predictive coding.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented wholly or partially in software, it may take the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention in any way; all simple modifications, equivalent changes and variations made to the above embodiments according to the technical essence of the present invention remain within the scope of the technical solution of the present invention.