This application is a divisional application of the original application No. 2019105787247, filed on June 28, 2019, and entitled "Speech phoneme recognition method and device, storage medium and electronic device".
Detailed Description of Embodiments
In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second" and the like in the specification, the claims and the above accompanying drawings are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.
According to an aspect of the embodiments of the present invention, a speech phoneme recognition method is provided. Optionally, the speech phoneme recognition method can be, but is not limited to being, applied in the application environment shown in Fig. 1. As shown in Fig. 1, the method involves interaction between a terminal device 102 and a server 106 through a network 104.
The terminal device 102 may collect, or obtain from another device, multiple speech frames ordered in time, and send them to the server 106 through the network 104. Alternatively, the terminal device 102 may collect or obtain target speech data and send it to the server 106 through the network 104, and the server 106 then obtains the multiple speech frames from the target speech data.
After obtaining the multiple speech frames, the server 106 may: extract, from the multiple speech frames, multiple first phonetic features in one-to-one correspondence with the speech frames; determine multiple key phonetic features from the first phonetic features, where the probability that each key phonetic feature corresponds to a phoneme in a phoneme set is greater than or equal to a target probability threshold; determine a phonetic feature set corresponding to each key phonetic feature, where each phonetic feature set includes the corresponding key phonetic feature and one or more of the first phonetic features adjacent to that key phonetic feature; perform feature fusion on the phonetic features in each phonetic feature set to obtain multiple fused phonetic features, where each phonetic feature set corresponds to one fused phonetic feature; and identify, in the phoneme set, the phoneme corresponding to each fused phonetic feature.
Optionally, in this embodiment, the terminal device may include, but is not limited to, at least one of: a mobile phone, a tablet computer, a desktop computer, and the like. The network may include, but is not limited to, at least one of: a wireless network or a wired network, where the wireless network includes Bluetooth, WIFI and other networks realizing wireless communication, and the wired network may include a local area network, a metropolitan area network, a wide area network, and the like. The server may include, but is not limited to, a device that processes a target sequence model using a target neural network model. The above is only an example, and this embodiment imposes no limitation in this regard.
Optionally, in this embodiment, as an optional implementation, as shown in Fig. 2, the flow of the speech phoneme recognition method may include the following steps (a schematic sketch of the whole flow is given after the step list):
S202: extract, from multiple speech frames ordered in time, multiple first phonetic features in one-to-one correspondence with the speech frames;
S204: determine multiple key phonetic features from the multiple first phonetic features, where the probability that each key phonetic feature corresponds to a phoneme in a phoneme set is greater than or equal to a target probability threshold;
S206: determine a phonetic feature set corresponding to each key phonetic feature, where each phonetic feature set includes the corresponding key phonetic feature and one or more of the first phonetic features adjacent to that key phonetic feature;
S208: perform feature fusion on the phonetic features in each phonetic feature set to obtain multiple fused phonetic features, where each phonetic feature set corresponds to one fused phonetic feature;
S210: identify, in the phoneme set, the phoneme corresponding to each fused phonetic feature.
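To make the flow of steps S202 to S210 easier to follow, the following Python sketch (not part of the original disclosure) outlines the whole pipeline; every helper it calls (extract_feature, phoneme_probs, segment_around, fuse, decode_phoneme) is a hypothetical placeholder for a module described in detail below.

```python
# Schematic sketch of steps S202-S210; illustrative only.
# All helper functions are hypothetical placeholders.

def recognize_phonemes(speech_frames, prob_threshold=0.8):
    # S202: one first phonetic feature per speech frame
    features = [extract_feature(frame) for frame in speech_frames]

    # S204: key features are those whose best phoneme probability
    # reaches the target probability threshold
    key_indices = [i for i, feat in enumerate(features)
                   if max(phoneme_probs(feat)) >= prob_threshold]

    # S206: a phonetic feature set (segment) around each key feature
    segments = [segment_around(i, key_indices, len(features))
                for i in key_indices]

    # S208: fuse each segment into a single fused phonetic feature
    fused = [fuse(features[lo:hi + 1]) for (lo, hi) in segments]

    # S210: decode one phoneme per fused feature
    return [decode_phoneme(f) for f in fused]
```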
Optionally, the speech phoneme recognition method may be executed by a target server, and can be, but is not limited to being, applied in tasks such as speech recognition and language translation.
Taking language translation as an example, the multiple speech frames are obtained from to-be-translated speech data in a first language (for example, Chinese). As shown in Fig. 3, module one extracts the multiple first phonetic features from the multiple speech frames; module two determines the multiple key phonetic features from the first phonetic features and outputs the key phonetic feature identifiers to module three; module three determines the phonetic feature set corresponding to each key phonetic feature and performs feature fusion on the phonetic features in each phonetic feature set; and module four identifies, in the phoneme set, the phoneme corresponding to each fused phonetic feature. After each phoneme is identified, the words (or sentences) contained in the to-be-translated speech data are determined from the identified phonemes, and those words (or sentences) are translated into words (or sentences) of a second language.
Through this embodiment, on the basis of key phonetic features determined from frame-level feature coding, the key phonetic features are used to determine phonetic feature segments (phonetic feature sets), so that more accurate segment (unit) level features are extracted and the phoneme corresponding to each phonetic feature segment is determined. This solves the technical problem in the related art that speech phoneme recognition methods have a low recognition accuracy, and improves the accuracy of the recognition result.
The above speech recognition flow is explained below with reference to Fig. 2.
In step S202, multiple first phonetic features in one-to-one correspondence with multiple speech frames are extracted from the multiple speech frames ordered in time.
The multiple speech frames may be obtained from target speech data. The target speech data may be a piece of speech of a target duration, for example, a 2 s piece of speech.
Before obtaining the multiple speech frames from the target speech data, the target server may obtain the target speech data. The target speech data may be sent to the target server by a terminal through a network, or sent to the target server by a server that stores the target speech data. The terminal may be a terminal that records the target speech data, a terminal that stores the target speech data, or another terminal that requests processing of the target speech data.
Optionally, in this embodiment, before the multiple first phonetic features in one-to-one correspondence with the speech frames are extracted from the multiple speech frames, the target speech data may be divided according to a predetermined duration to obtain multiple unit frames, and the multiple speech frames may then be determined from the unit frames according to a target period, where each speech frame includes one or more unit frames.
After the target speech data is obtained, the multiple speech frames can be obtained from it in various ways: the target speech data is divided into multiple unit frames, and the speech frames are then either sampled from the unit frames or formed by combining the unit frames.
The target speech data may be divided into multiple unit frames according to a predetermined duration. The predetermined duration may satisfy the following division condition: a specific phonetic feature can be recognized from a unit frame. It may also satisfy the condition that the number of phonetic features contained in a unit frame is less than or equal to 1. The predetermined duration can be set as needed, for example, to 10 ms. Setting the predetermined duration in this way ensures that phonetic features can be recognized, and that no missed or mistaken recognition is caused by an overly long duration.
For example, speech data of length 2 s can be divided with a predetermined duration of 10 ms, yielding 200 unit frames.
After the multiple unit frames are obtained, the multiple speech frames can be determined from them according to a target period, where each speech frame includes one or more unit frames.
To reduce computational complexity and improve the efficiency of speech phoneme recognition, the unit frames can be sampled or combined. For example, sampling may extract one or more unit frames out of every N unit frames (the target period being N unit frames) to obtain the multiple speech frames. As another example, the unit frames may be combined in groups of M to obtain the multiple speech frames.
For example, for the 200 unit frames obtained by dividing 2 s of speech data with a 10 ms predetermined duration, one unit frame may be extracted out of every 2 unit frames to obtain 100 speech frames, or one unit frame out of every 4 unit frames to obtain 50 speech frames; alternatively, the unit frames may be combined in groups of 4 to obtain 50 speech frames.
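As a concrete illustration of the division and sampling just described, the following minimal sketch assumes 16 kHz single-channel audio as input (a detail the text does not fix), produces 10 ms unit frames, and then either samples or groups them:

```python
import numpy as np

def split_into_unit_frames(samples, sample_rate=16000, unit_ms=10):
    """Divide audio into fixed-length unit frames (10 ms each by default)."""
    frame_len = sample_rate * unit_ms // 1000        # 160 samples at 16 kHz
    n_frames = len(samples) // frame_len             # e.g. 2 s -> 200 frames
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)

def sample_speech_frames(unit_frames, period=2):
    """Keep one unit frame out of every `period` (200 -> 100 when period=2)."""
    return unit_frames[::period]

def group_speech_frames(unit_frames, group=4):
    """Concatenate every `group` unit frames into one speech frame (200 -> 50)."""
    n = len(unit_frames) // group
    return unit_frames[:n * group].reshape(n, -1)
```

With 2 s of 16 kHz audio, split_into_unit_frames yields 200 unit frames, sample_speech_frames with a period of 2 yields 100 speech frames, and group_speech_frames with groups of 4 yields 50, matching the example above.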
Through this embodiment, by dividing the speech data into unit frames and obtaining the speech frames by sampling the unit frames, the computational complexity of speech phoneme recognition can be reduced and its efficiency improved.
After the multiple speech frames are obtained, the target server can extract the multiple first phonetic features from them, with a one-to-one correspondence between the speech frames and the first phonetic features.
Phonetic features can be identified from speech frames in various ways. Any existing speech feature extraction method whose extracted features can be used for speech phoneme recognition can be used in the speech phoneme recognition method of this embodiment.
To improve the validity of the extracted phonetic features, a target neural network model can be used for feature extraction.
Optionally, in this embodiment, extracting the multiple first phonetic features in one-to-one correspondence with the speech frames may include: inputting each of the multiple speech frames in turn into a target neural network model, where the target neural network model is used for extracting the first phonetic feature corresponding to each speech frame; and obtaining the multiple first phonetic features output by the target neural network model.
The target neural network model may be a frame-level encoder model (that is, the Encoder part) and may be any kind of deep neural network, including but not limited to at least one of: a multilayer LSTM (Long Short-Term Memory) network, for example a BiLSTM (bidirectional LSTM) or a UniLSTM (unidirectional LSTM); a multilayer convolutional network; an FSMN (Feedforward Sequential Memory Network); or a TDNN (Time Delay Neural Network).
For example, as shown in Fig. 4, each of the multiple speech frames can be input in turn into a CNN (Convolutional Neural Network), and the CNN extracts and outputs the first phonetic feature corresponding to each speech frame.
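A minimal PyTorch sketch of such a frame-level encoder is given below; the input dimensionality (40 filterbank coefficients per frame) and the 256-dimensional feature width are illustrative assumptions, and a multilayer LSTM, FSMN or TDNN could be substituted as noted above.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Frame-level encoder: one first phonetic feature per speech frame.

    A sketch only; input and feature dimensions are assumptions, not
    values fixed by the disclosure.
    """
    def __init__(self, in_dim=40, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):                      # x: (batch, time, in_dim)
        h = self.conv(x.transpose(1, 2))       # convolve over the time axis
        return h.transpose(1, 2)               # (batch, time, feat_dim)
```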
Through this embodiment, by using a neural network model for speech feature extraction, network model training can be performed as needed, improving the accuracy and validity of the feature extraction.
In step S204, multiple key phonetic features are determined from the multiple first phonetic features, where the probability that each key phonetic feature corresponds to a phoneme in the phoneme set is greater than or equal to the target probability threshold.
For each extracted first phonetic feature, the probability that it corresponds to each phoneme in the phoneme set can be determined from the feature itself.
A phoneme (phone) is an element that makes up speech, the smallest linguistic unit divided according to the natural quality of a language. It can be analyzed according to the articulation within a syllable, one articulatory action constituting one phoneme. For Chinese, phonemes can be divided into vowels and consonants; a Chinese syllable may contain, for example, one, two or three phonemes. During phoneme recognition, the tone of a syllable (for example, the first, second, third or fourth tone of Mandarin) may or may not be recognized.
For each first phonetic feature, the probabilities of it corresponding to the phonemes in the phoneme set can sum to 1 (normalization). Among all the first phonetic features: some contain limited information, so the probabilities of them corresponding to the phonemes in the phoneme set cannot be determined, and these features can be ignored; for some, the information they represent is indefinite, and the probability of corresponding to any phoneme in the phoneme set is below the target probability threshold (for example, 80%), so these first phonetic features are not key phonetic features; for others, the information they represent is clear, and the probability of corresponding to a certain phoneme in the phoneme set exceeds the target probability threshold (the probability determined for that phoneme is greater than 80%), and these first phonetic features are determined to be key phonetic features.
Key phonetic features can be determined in various ways. Any method that can determine, from a phonetic feature, the probability of that feature corresponding to each phoneme in the phoneme set can be used for the determination of the key phonetic features.
Optionally, in this embodiment, determining the multiple key phonetic features from the multiple first phonetic features may include: determining multiple peak locations from the multiple first phonetic features using a CTC model, where each peak location corresponds to one key phonetic feature.
The CTC model can be as shown in Fig. 5. It includes an encoder; x_1, ..., x_T are input into the encoder in turn, and the encoder output h^enc is processed with a Softmax function (normalized exponential function), yielding the probabilities P(y_1|x), ..., P(y_T|x) of each output y_1, ..., y_T for the inputs x_1, ..., x_T.
CTC mainly solves, in traditional RNN (Recurrent Neural Network, a neural network for processing sequence data) models, the alignment problem between the label sequence and the input sequence. A blank symbol is added to the label symbol set, and labeling is then performed with the RNN: when no effective output can be determined, the blank symbol is output; when an effective unit can be determined with sufficient confidence, a meaningful symbol is output. CTC can therefore provide the peak locations of the meaningful symbols in the label sequence.
For example, as shown in Fig. 6, after the CNN identifies the multiple first phonetic features, the CTC criterion can be used to output multiple peak locations, each corresponding to one key phonetic feature; the peak location serves as the identifier of the key phonetic feature.
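The following sketch illustrates peak-location selection on the output of a CTC-trained encoder. It assumes the per-frame outputs have already been projected to logits over the phoneme set plus a blank symbol at index 0, and reuses the 80% target probability threshold mentioned earlier; neither assumption is fixed by the disclosure.

```python
import torch

def find_ctc_peaks(logits, blank=0, threshold=0.8):
    """Return frame indices whose most probable label is a non-blank
    phoneme with posterior >= threshold; these mark the key phonetic
    features.

    logits: (time, num_phonemes + 1) raw per-frame outputs of an
    encoder trained under the CTC criterion.
    """
    probs = torch.softmax(logits, dim=-1)
    best_prob, best_label = probs.max(dim=-1)
    return [t for t in range(len(probs))
            if best_label[t] != blank and best_prob[t] >= threshold]
```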
Through this embodiment, using a CTC model to locate the key phonetic features means the boundary of each phoneme does not need to be labeled when training the model, which improves the convenience of model training and model use.
In step S206, the phonetic feature set corresponding to each key phonetic feature is determined, where each phonetic feature set includes the corresponding key phonetic feature and one or more of the first phonetic features adjacent to that key phonetic feature.
For each determined key phonetic feature, a corresponding phonetic feature set can be determined. For the current key phonetic feature, the corresponding phonetic feature set includes the current key phonetic feature itself and one or more of the first phonetic features adjacent to it.
The phonetic feature set corresponding to each key phonetic feature can be determined in various ways. For example, the current key phonetic feature together with one or more first phonetic features both before and after it may be determined as the corresponding phonetic feature set; or the current key phonetic feature together with one or more first phonetic features before it; or the current key phonetic feature together with one or more first phonetic features after it.
Optionally, in this embodiment, determining the phonetic feature set corresponding to each key phonetic feature may include: determining a second phonetic feature and a third phonetic feature corresponding to the current key phonetic feature among the multiple key phonetic features, where the second phonetic feature is the first key phonetic feature among the multiple first phonetic features that is before the current key phonetic feature and not adjacent to it, and the third phonetic feature is the first key phonetic feature among the multiple first phonetic features that is after the current key phonetic feature and not adjacent to it; and determining the current phonetic feature set corresponding to the current key phonetic feature, where the current phonetic feature set is a subset of a target phonetic feature set, and the target phonetic feature set includes the second phonetic feature, the third phonetic feature, and the first phonetic features between the second phonetic feature and the third phonetic feature.
That is, for the current key phonetic feature among the multiple key phonetic features, the first key phonetic feature before it and not adjacent to it (the second phonetic feature) and the first key phonetic feature after it and not adjacent to it (the third phonetic feature) are determined first; the second phonetic feature, the third phonetic feature and the first phonetic features between them are then taken as the target phonetic feature set; and one or more phonetic features are selected from the target phonetic feature set as the phonetic feature set corresponding to the current key phonetic feature.
It should be noted that for the first key phonetic feature, the corresponding second phonetic feature is the first of the first phonetic features, and for the last key phonetic feature, the corresponding third phonetic feature is the last of the first phonetic features.
For example, suppose there are 12 first phonetic features corresponding to 12 speech frames, and the key phonetic features are the 3rd, 6th, 7th and 10th first phonetic features. For the 1st key phonetic feature, the corresponding target phonetic feature set is the 1st to 6th first phonetic features; for the 2nd key phonetic feature, the 3rd to 10th first phonetic features; for the 3rd key phonetic feature, the 3rd to 10th first phonetic features; and for the 4th key phonetic feature, the 7th to 12th first phonetic features.
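This rule can be written compactly. The sketch below, using the 1-based indexing of the example, reproduces the four target phonetic feature sets just listed:

```python
def target_segment(k, key_indices, n_features):
    """Bounds of the target feature set for key index k (1-based)."""
    left = [j for j in key_indices if j < k and k - j > 1]   # non-adjacent, before
    right = [j for j in key_indices if j > k and j - k > 1]  # non-adjacent, after
    lo = max(left) if left else 1              # fall back to the first feature
    hi = min(right) if right else n_features   # fall back to the last feature
    return lo, hi

keys = [3, 6, 7, 10]
print([target_segment(k, keys, 12) for k in keys])
# [(1, 6), (3, 10), (3, 10), (7, 12)] -- matching the example above
```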
Through this embodiment, by determining the second phonetic feature and the third phonetic feature corresponding to the current key phonetic feature, the target phonetic feature set corresponding to the current key phonetic feature is delimited by the second and third phonetic features, and the phonetic feature set corresponding to the current key phonetic feature is then determined from the target phonetic feature set. This avoids interference between different key phonetic features and guarantees the accuracy of phoneme recognition.
In step S208, feature fusion is performed on the phonetic features in each phonetic feature set, yielding multiple fused phonetic features, where each phonetic feature set corresponds to one fused phonetic feature.
For the current phonetic feature set among the multiple phonetic feature sets, the phonetic features in it can be fused to obtain the fused phonetic feature corresponding to that set.
Feature fusion can be performed in various ways, for example, a weighted summation over the phonetic features of the current phonetic feature set. The weights of the phonetic features may be the same or different. For example, different weights may be assigned according to the distance between each phonetic feature in the current set and the current key phonetic feature: the closer a feature is to the current key phonetic feature, the larger its weight.
It should be noted that the distance between two phonetic features can be expressed in terms of the distance between the speech frames corresponding to them, and the distance between two speech frames can be the time difference between their start positions, end positions, or any other matching positions.
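As an illustration of the distance-weighted variant, the following sketch normalizes inverse-distance weights over a segment, using the frame index as the distance measure; the particular weight function is an assumption, since the disclosure does not prescribe one.

```python
import torch

def distance_weighted_fusion(features, lo, hi, k):
    """Fuse features[lo:hi+1] (0-based indices) into one vector,
    weighting each feature by its closeness to the key feature at k."""
    idx = torch.arange(lo, hi + 1, dtype=torch.float32)
    weights = 1.0 / (1.0 + (idx - k).abs())  # closer to key -> larger weight
    weights = weights / weights.sum()         # normalize to sum to 1
    segment = features[lo:hi + 1]             # (segment_len, feat_dim)
    return (weights.unsqueeze(1) * segment).sum(dim=0)
```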
Optionally, in this embodiment, performing feature fusion on the phonetic features in each phonetic feature set to obtain the multiple fused phonetic features may include: inputting the phonetic features in each phonetic feature set into a target self-attention layer to obtain the multiple fused phonetic features, where the target self-attention layer is used for performing a weighted summation over the phonetic features in each phonetic feature set to obtain the fused phonetic feature corresponding to each set.
A self-attention (Self-Attention) layer can be used to fuse the phonetic features in each phonetic feature set and extract features at the unit length scale, yielding the fused phonetic features.
A self-attention model is a model using a self-attention mechanism. It differs from the standard attention mechanism in that, in standard attention, the query vector is related to the output label and is obtained in the RNN by feeding back the output label, whereas in self-attention the query vector is generated by a transformation of the encoder itself.
For example, as shown in Fig. 7, the self-attention layer determines, from the multiple peak locations output by CTC and the multiple first phonetic features output by the CNN, the phonetic feature segment corresponding to each peak location, and outputs the fused phonetic feature corresponding to each phonetic feature segment. For instance, if the phonetic feature set corresponding to the 1st key phonetic feature is the 1st to 6th first phonetic features, those features are input into the self-attention layer, which outputs the fused phonetic feature corresponding to the 1st key phonetic feature.
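A sketch of the fusion using torch.nn.MultiheadAttention follows. Taking the key phonetic feature as the query and the whole segment as keys and values is one simple reading of the scheme above; the single head and 256-dimensional features are assumptions, not requirements of the disclosure.

```python
import torch
import torch.nn as nn

feat_dim = 256
attn = nn.MultiheadAttention(embed_dim=feat_dim, num_heads=1,
                             batch_first=True)

def fuse_segment(features, lo, hi, k):
    """Attention-based fusion over one phonetic feature set (0-based).

    The key phonetic feature serves as the query; every feature in the
    segment serves as key and value, yielding one fused feature.
    """
    segment = features[lo:hi + 1].unsqueeze(0)   # (1, seg_len, feat_dim)
    query = features[k].view(1, 1, feat_dim)     # (1, 1, feat_dim)
    fused, _ = attn(query, segment, segment)
    return fused.squeeze(0).squeeze(0)           # (feat_dim,)
```

In full self-attention the queries would also come from the segment itself; using the key feature as the sole query simply yields one fused vector per phonetic feature set directly.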
Through this embodiment, segment-level features are extracted with a self-attention layer, which ensures the accuracy of the feature fusion and in turn improves the accuracy of speech phoneme recognition.
In step S210, the phoneme corresponding to each fused phonetic feature is identified in the phoneme set.
After the multiple fused phonetic features are obtained, the phoneme corresponding to each fused phonetic feature can be obtained from them.
For the current fused phonetic feature among the multiple fused phonetic features, the probability of it corresponding to each phoneme in the phoneme set can be obtained from the feature itself, and the phoneme corresponding to each fused phonetic feature is then determined from these probabilities.
Optionally, in this embodiment, identifying the phoneme corresponding to each fused phonetic feature in the phoneme set may include: inputting each fused phonetic feature in turn into the decoder of a target attention model to obtain the phoneme corresponding to each fused phonetic feature, where the decoder obtains the current phoneme corresponding to the current fused phonetic feature at least according to the current fused phonetic feature being input and the previous phoneme obtained by the decoder from the phonetic feature preceding the current fused phonetic feature.
Attention is a mechanism for improving the effect of RNN-based Encoder+Decoder models, commonly called the Attention Mechanism. It can be applied to many fields such as machine translation, speech recognition and image captioning. Attention gives the model the ability to discriminate; for example, in machine translation and speech recognition it assigns different weights to each word in a sentence, making the learning of the neural network model more flexible (soft). Attention itself can also serve as an alignment relation that interprets the alignment between input and output sentences, explaining what knowledge the model has learned.
The structure of the attention model can be as shown in Fig. 8, where x_1, ..., x_T is the input of the encoder and h^enc is the encoder output; c_{u-1} is the previous state output of the attention layer (for the previous input x_{u-1} of the attention model), c_u is its current state output (for the current input x_u), y_{u-1} is the previous output of the attention model, the decoder produces the current output, and P(y_u | y_{u-1}, ..., y_0, x) is the current output of the attention model.
The decoder network in the target attention (Attention) model can be used to determine the phoneme corresponding to each fused phonetic feature. The target attention model may be a standard Attention model or an improved one; any network model that can obtain the phoneme corresponding to each fused phonetic feature from the multiple input fused phonetic features can be used in this phoneme determination flow.
For example, as shown in Fig. 9, the multiple fused phonetic features output by the self-attention layer can be input into the decoder of the attention model, and the decoder determines the phoneme corresponding to the current fused phonetic feature according to the input current fused phonetic feature and the phonemes corresponding to the preceding fused phonetic features.
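One way to realize such a decoder step is sketched below, assuming an LSTM-cell decoder in which the embedding of the previously decoded phoneme is concatenated with the current fused feature; this is an illustrative reading of the autoregressive dependence described above, not the exact model of the disclosure.

```python
import torch
import torch.nn as nn

class PhonemeDecoder(nn.Module):
    """Sketch of the attention-model decoder: each step conditions on
    the current fused feature and the previously decoded phoneme."""
    def __init__(self, num_phonemes, feat_dim=256, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, feat_dim)
        self.cell = nn.LSTMCell(feat_dim * 2, hid_dim)
        self.out = nn.Linear(hid_dim, num_phonemes)

    def forward(self, fused_features, bos=0):
        # fused_features: tensor of shape (num_fused, feat_dim)
        h = c = fused_features.new_zeros(1, self.cell.hidden_size)
        prev = torch.tensor([bos])
        phonemes = []
        for feat in fused_features:            # one step per fused feature
            x = torch.cat([self.embed(prev), feat.view(1, -1)], dim=-1)
            h, c = self.cell(x, (h, c))
            prev = self.out(h).argmax(dim=-1)  # greedy choice of phoneme
            phonemes.append(prev.item())
        return phonemes
```

A practical decoder would typically use beam search rather than the greedy argmax shown here.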
Through this embodiment, using the decoder of an attention model to identify the phoneme corresponding to each fused phonetic feature can improve the accuracy of speech phoneme recognition.
After the phonemes corresponding to the fused phonetic features are identified in the phoneme set, the phoneme combination corresponding to the multiple speech frames can be obtained from the identified phonemes.
Since the same phoneme is likely to correspond to multiple speech frames, at least two of the identified key phonetic features may correspond to the same phoneme.
For example, as shown in Fig. 10, "hello" (Chinese "ni hao") contains 5 phonemes "n", "i", "h", "a", "o" corresponding to 12 speech frames, where "n" corresponds to the 1st to 4th speech frames, "i" to the 5th to 7th speech frames, "h" to the 8th to 9th speech frames, "a" to the 10th to 11th speech frames, and "o" to the 12th speech frame. For "n", the identified key phonetic features are the first phonetic features corresponding to the 3rd and 4th speech frames, while for each of the other phonemes only one key phonetic feature is identified; the final output combination of phonemes corresponding to the fused phonetic features is therefore "nnihao".
Optionally, in this embodiment, after the phonemes corresponding to the fused phonetic features are identified in the phoneme set, the phonemes corresponding to the fused phonetic features can be combined according to the language type to which the phoneme set belongs, to obtain target display information, where the target display information is one or more syllables corresponding to the multiple speech frames, or one or more words corresponding to the multiple speech frames; the target display information is then output to a display device for display.
While the multiple phonemes are identified, each syllable can be determined. According to the rules of the given language type, the phoneme recognition results corresponding to the same phoneme can be merged to obtain one or more syllables, and one or more words corresponding to the obtained syllables can then be determined according to the rules of the language type.
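A sketch of the merging rule for repeated phoneme results follows; mapping the merged phonemes onto syllables or words would additionally require a language-specific lexicon, which is omitted here.

```python
def collapse_repeats(phonemes):
    """Merge adjacent identical phoneme results ("nnihao" -> "nihao")."""
    merged = []
    for p in phonemes:
        if not merged or merged[-1] != p:
            merged.append(p)
    return merged

print("".join(collapse_repeats(list("nnihao"))))   # nihao
```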
After the one or more syllables, or the one or more words, corresponding to the multiple speech frames are obtained, they can be output as target display information to a display device (for example, a terminal device) for display.
Through this embodiment, the multiple identified phonemes are converted, according to the language type to which the phoneme set belongs, into one or more syllables or one or more words and shown on a display device, so that the phoneme recognition result can be presented clearly, improving the user experience.
The speech phoneme recognition method is illustrated below with an optional example. In this example, a deep convolutional neural network model is used to extract the first phonetic features, a self-attention layer is used for feature fusion, and the decoder of a standard attention model is used to identify the phoneme corresponding to each fused phonetic feature.
Two end-to-end modeling approaches can be applied in acoustic model modeling: one is CTC, the other is Attention. A CTC model mainly contains only an encoder, i.e., the frame-level feature coding module, and has the advantage of being simple and stable; its shortcoming is a conditional independence assumption, namely that the current output is related only to the input features and not to the history of outputs. An Attention model has two main modules, an encoder and a decoder; its output is related both to the input features and to the output history, which is more complete than CTC as a probabilistic model. Meanwhile, Attention can capture longer-range features and is not limited to neighboring frames.
The two modeling approaches can be combined through a multi-task training framework, as shown in Fig. 11: the encoder module is shared, and an interpolated loss function is optimized in training, as shown in formula (1):
L_MTL = λ · L_CTC + (1 − λ) · L_Attention    (1)
where L_MTL is the combined loss function, L_CTC is the loss function of CTC, and L_Attention is the loss function of the Attention model.
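Formula (1) can be sketched in PyTorch as follows; nn.CTCLoss plus cross-entropy for the attention branch, the tensor shapes noted in the docstring, and λ = 0.5 are all illustrative assumptions rather than details of the disclosure.

```python
import torch
import torch.nn as nn

ctc_loss_fn = nn.CTCLoss(blank=0)
att_loss_fn = nn.CrossEntropyLoss()

def multitask_loss(ctc_log_probs, input_lens, att_logits,
                   targets, target_lens, lam=0.5):
    """L_MTL = lam * L_CTC + (1 - lam) * L_Attention, as in formula (1).

    ctc_log_probs: (time, batch, num_labels) log-softmax encoder outputs.
    att_logits:    (batch * out_len, num_labels) decoder outputs, flattened.
    targets:       (batch, out_len) label sequence; for brevity the same
                   padded target tensor feeds both branches here.
    """
    l_ctc = ctc_loss_fn(ctc_log_probs, targets, input_lens, target_lens)
    l_att = att_loss_fn(att_logits, targets.flatten())
    return lam * l_ctc + (1 - lam) * l_att
```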
However, when the two approaches are combined through a multi-task training framework, the CTC and Attention output unit sets must be identical, Attention cannot use the unit range information provided by CTC, and since CTC produces frame-level outputs while Attention produces unit-level outputs, a specially designed fusion strategy is required.
The speech phoneme recognition method in this example is an acoustic modeling method that combines the existing CTC, Attention and Self-Attention end-to-end modeling techniques. On the basis of the peak locations provided by the CTC model, it effectively uses the boundary ranges of the several preceding and following units, first extracts more accurate unit-level length-scale features with a Self-Attention layer, and then uses a standard Attention Decoder layer, which can further repair errors on top of CTC and reach a better recognition accuracy.
As shown in Fig. 12, the modeling corresponding to the speech phoneme recognition method in this example can be divided into the following four modules: module one, the frame-level encoder model; module two, the pronunciation unit boundary and position discrimination module; module three, the segment (unit) level feature encoder module; and module four, the decoder (output unit discrimination) module.
For the frame-level encoder model, any kind of deep neural network model can be used, for example a multilayer LSTM, a multilayer convolutional network, an FSMN or a TDNN network. For the pronunciation unit boundary and position discrimination module, the CTC criterion can be used, which outputs the pronunciation unit peak locations. For the segment (unit) level feature encoder module, a Self-Attention layer can be used, extracting unit-length-scale features with a self-attention network over a range covering N units on each side of each unit. For the pronunciation unit discrimination output module, the Decoder network of a standard Attention model can be used.
The pronunciation unit set of module two and the output unit set of module four can be different; for example, the pronunciation unit set may use context-dependent phonemes while the output unit set uses syllables.
As shown in Fig. 13, the encoder output layer is the output of the frame-level encoder model, where a dark circle represents a spike of an effective label under the CTC criterion; the self-attention layer extracts higher-level features within the boundary range of several units on each side (the range around each unit in the figure) through an unsupervised self-attention mechanism; and on the basis of the segment (unit) level features extracted by the self-attention layer, the decoder of the standard attention model performs the final discrimination of the output units.
Through this example, the unit range information provided by CTC is used to extract segment (unit) level features through the self-attention layer. By introducing a self-attention layer between CTC and attention, the output of the attention model no longer depends on the original CTC output, giving the model the ability to repair the insertion and deletion errors introduced in the CTC model; the final unified output is produced by the Attention Decoder layer, so no fusion strategy with CTC needs to be considered, which improves the convenience of processing.
It should be noted that, for each of the foregoing method embodiments, for the sake of simple description, the embodiment is expressed as a series of action combinations; however, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
According to another aspect of the embodiments of the present invention, a speech phoneme recognition device for implementing the above speech phoneme recognition method is further provided. As shown in Fig. 14, the device includes:
(1) an extraction unit 1402, configured to extract, from multiple speech frames ordered in time, multiple first phonetic features in one-to-one correspondence with the speech frames;
(2) a first determination unit 1404, configured to determine multiple key phonetic features from the multiple first phonetic features, where the probability that each key phonetic feature corresponds to a phoneme in a phoneme set is greater than or equal to a target probability threshold;
(3) a second determination unit 1406, configured to determine a phonetic feature set corresponding to each key phonetic feature, where each phonetic feature set includes the corresponding key phonetic feature and one or more of the first phonetic features adjacent to that key phonetic feature;
(4) a fusion unit 1408, configured to perform feature fusion on the phonetic features in each phonetic feature set to obtain multiple fused phonetic features, where each phonetic feature set corresponds to one fused phonetic feature;
(5) a recognition unit 1410, configured to identify, in the phoneme set, the phoneme corresponding to each fused phonetic feature.
Optionally, the speech phoneme recognition device may be deployed on a target server, and can be, but is not limited to being, applied in tasks such as speech recognition and language translation.
Optionally, the extraction unit 1402 can be used to execute step S202, the first determination unit 1404 can be used to execute step S204, the second determination unit 1406 can be used to execute step S206, the fusion unit 1408 can be used to execute step S208, and the recognition unit 1410 can be used to execute step S210.
Through this embodiment, on the basis of key phonetic features determined from frame-level feature coding, the key phonetic features are used to determine phonetic feature segments (phonetic feature sets), so that more accurate segment (unit) level features are extracted and the phoneme corresponding to each phonetic feature segment is determined. This solves the technical problem in the related art that speech phoneme recognition methods have a low recognition accuracy, and improves the accuracy of the recognition result.
As an optional implementation, the device further includes:
(1) a division unit, configured to divide target speech data according to a predetermined duration to obtain multiple unit frames, before the multiple first phonetic features in one-to-one correspondence with the speech frames are extracted from the multiple speech frames ordered in time;
(2) a determination unit, configured to determine the multiple speech frames from the multiple unit frames according to a target period, where each speech frame includes one or more unit frames.
Through this embodiment, by dividing the speech data into unit frames and obtaining the speech frames by sampling the unit frames, the computational complexity of speech phoneme recognition can be reduced and its efficiency improved.
As an optional implementation, the extraction unit 1402 includes:
(1) a first input module, configured to input each of the multiple speech frames in turn into a target neural network model, where the target neural network model is used for extracting the first phonetic feature corresponding to each speech frame;
(2) an obtaining module, configured to obtain the multiple first phonetic features output by the target neural network model.
Through this embodiment, by using a neural network model for speech feature extraction, network model training can be performed as needed, improving the accuracy and validity of the feature extraction.
As an optional implementation, the first determination unit 1404 includes:
(1) a first determining module, configured to determine multiple peak locations from the multiple first phonetic features using a Connectionist Temporal Classification (CTC) model, where each peak location corresponds to one key phonetic feature.
Through this embodiment, using a CTC model to locate the key phonetic features means the boundary of each phoneme does not need to be labeled when training the model, which improves the convenience of model training and model use.
As an optional implementation, the second determination unit 1406 includes:
(1) a second determining module, configured to determine a second phonetic feature and a third phonetic feature corresponding to the current key phonetic feature among the multiple key phonetic features, where the second phonetic feature is the first key phonetic feature among the multiple first phonetic features that is before the current key phonetic feature and not adjacent to it, and the third phonetic feature is the first key phonetic feature among the multiple first phonetic features that is after the current key phonetic feature and not adjacent to it;
(2) a third determining module, configured to determine the current phonetic feature set corresponding to the current key phonetic feature, where the current phonetic feature set is a subset of a target phonetic feature set, and the target phonetic feature set includes the second phonetic feature, the third phonetic feature, and the first phonetic features between the second phonetic feature and the third phonetic feature.
Through this embodiment, by determining the second phonetic feature and the third phonetic feature corresponding to the current key phonetic feature, the target phonetic feature set corresponding to the current key phonetic feature is delimited by the second and third phonetic features, and the phonetic feature set corresponding to the current key phonetic feature is then determined from the target phonetic feature set. This avoids interference between different key phonetic features and guarantees the accuracy of phoneme recognition.
As an optional implementation, the fusion unit 1408 includes:
(1) an input module, configured to input the phonetic features in each phonetic feature set into a target self-attention layer to obtain the multiple fused phonetic features, where the target self-attention layer is used for performing a weighted summation over the phonetic features in each phonetic feature set to obtain the fused phonetic feature corresponding to each set.
Through this embodiment, segment-level features are extracted with a self-attention layer, which ensures the accuracy of the feature fusion and in turn improves the accuracy of speech phoneme recognition.
As an optional implementation, the recognition unit 1410 includes:
(1) a second input module, configured to input each fused phonetic feature in turn into the decoder of a target attention model to obtain the phoneme corresponding to each fused phonetic feature, where the decoder obtains the current phoneme corresponding to the current fused phonetic feature at least according to the current fused phonetic feature being input and the previous phoneme obtained by the decoder from the phonetic feature preceding the current fused phonetic feature.
Through this embodiment, using the decoder of an attention model to identify the phoneme corresponding to each fused phonetic feature can improve the accuracy of speech phoneme recognition.
As an optional implementation, the device further includes:
(1) a combination unit, configured to, after the phonemes corresponding to the fused phonetic features are identified in the phoneme set, combine the phonemes corresponding to the fused phonetic features according to the language type to which the phoneme set belongs, to obtain target display information, where the target display information is one or more syllables corresponding to the multiple speech frames, or one or more words corresponding to the multiple speech frames;
(2) an output unit, configured to output the target display information to a display device for display.
Through this embodiment, the multiple identified phonemes are converted, according to the language type to which the phoneme set belongs, into one or more syllables or one or more words and shown on a display device, so that the phoneme recognition result can be presented clearly, improving the user experience.
According to yet another aspect of the embodiments of the present invention, a storage medium is further provided, in which a computer program is stored, where the computer program is configured to execute, when run, the steps in any one of the above method embodiments.
Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1: extract, from multiple speech frames ordered in time, multiple first phonetic features in one-to-one correspondence with the speech frames;
S2: determine multiple key phonetic features from the multiple first phonetic features, where the probability that each key phonetic feature corresponds to a phoneme in a phoneme set is greater than or equal to a target probability threshold;
S3: determine a phonetic feature set corresponding to each key phonetic feature, where each phonetic feature set includes the corresponding key phonetic feature and one or more of the first phonetic features adjacent to that key phonetic feature;
S4: perform feature fusion on the phonetic features in each phonetic feature set to obtain multiple fused phonetic features, where each phonetic feature set corresponds to one fused phonetic feature;
S5: identify, in the phoneme set, the phoneme corresponding to each fused phonetic feature.
Optionally, in this embodiment, those of ordinary skill in the art can understand that all or some of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware of a terminal device; the program can be stored in a computer-readable storage medium, and the storage medium may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
According to still another aspect of the embodiments of the present invention, an electronic device for implementing the above speech phoneme recognition method is further provided. As shown in Fig. 15, the electronic device includes a processor 1502, a memory 1504, a transmission device 1506 and the like. A computer program is stored in the memory, and the processor is configured to execute, through the computer program, the steps in any one of the above method embodiments.
Optionally, in this embodiment, the electronic device may be located in at least one of multiple network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps through the computer program:
S1: extract, from multiple speech frames ordered in time, multiple first phonetic features in one-to-one correspondence with the speech frames;
S2: determine multiple key phonetic features from the multiple first phonetic features, where the probability that each key phonetic feature corresponds to a phoneme in a phoneme set is greater than or equal to a target probability threshold;
S3: determine a phonetic feature set corresponding to each key phonetic feature, where each phonetic feature set includes the corresponding key phonetic feature and one or more of the first phonetic features adjacent to that key phonetic feature;
S4: perform feature fusion on the phonetic features in each phonetic feature set to obtain multiple fused phonetic features, where each phonetic feature set corresponds to one fused phonetic feature;
S5: identify, in the phoneme set, the phoneme corresponding to each fused phonetic feature.
Optionally, those skilled in the art can understand that the structure shown in Fig. 15 is only illustrative. The electronic device may also be a terminal device such as a smart phone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile internet device (MID) or a PAD. Fig. 15 does not limit the structure of the electronic device; for example, the electronic device may include more or fewer components (such as a network interface) than shown in Fig. 15, or have a configuration different from that shown in Fig. 15.
The memory 1504 can be used to store software programs and modules, such as the program instructions/modules corresponding to the speech phoneme recognition method and device in the embodiments of the present invention. The processor 1502 executes various functional applications and speech phoneme recognition, i.e., realizes the above speech phoneme recognition method, by running the software programs and modules stored in the memory 1504. The memory 1504 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memories or other non-volatile solid-state memories. In some examples, the memory 1504 may further include memories remotely located relative to the processor 1502, and these remote memories can be connected to the terminal through a network. Examples of the network include, but are not limited to, the internet, an intranet, a local area network, a mobile communication network and combinations thereof.
The transmission device 1506 is used to receive or send data via a network. Specific examples of the network may include a wired network and a wireless network. In one example, the transmission device 1506 includes a network interface controller (NIC), which can be connected to other network devices and a router through a cable so as to communicate with the internet or a local area network. In another example, the transmission device 1506 is a radio frequency (RF) module, which is used to communicate with the internet wirelessly.
The serial numbers of the above embodiments of the present invention are only for description and do not represent the superiority or inferiority of the embodiments.
If the integrated unit in the above embodiments is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods of the embodiments of the present invention.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for a part not described in detail in one embodiment, reference can be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client can be realized in other ways. The device embodiments described above are only illustrative; for example, the division of units is only a logical functional division, and there may be another division manner in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in the form of hardware or in the form of a software functional unit.
The above is only the preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.