Detailed Description
The embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
Artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning and decision-making.
Artificial intelligence is a comprehensive discipline that involves a wide range of fields, covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning. The key technologies of speech technology (Speech Technology) are automatic speech recognition (Automatic Speech Recognition, ASR), speech synthesis (Text To Speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of human-computer interaction in the future, and voice is expected to become one of the preferred modes of human-computer interaction.
The speech recognition technology involved in artificial intelligence refers to a technology that converts pronunciation data into corresponding text information or operation instructions by using a voiceprint recognition algorithm, a speech conversion algorithm, and the like. The pronunciation data may be input by a user or downloaded from a network, and its language may include, but is not limited to, Chinese, English, French, etc.; the pronunciation data may specifically correspond to a word (e.g., an English word), a character (e.g., a Chinese character), or a plurality of words or phrases. The audio recognition process may specifically include the following three stages: 1. a feature extraction stage for the pronunciation data to be recognized; 2. a stage of acquiring the pronunciation information corresponding to the pronunciation data; 3. a stage of determining the text information of the pronunciation data according to the pronunciation information. The three stages are described in detail below in connection with fig. 1.
Fig. 1 is a schematic structural diagram of an audio recognition system according to an exemplary embodiment of the present application, where the audio recognition system includes a server and at least one terminal. The terminal is a user-facing device, and may specifically be a smart device such as a smartphone, a tablet computer, a portable personal computer, a smart watch, a smart band, or a smart television. The server may be a stand-alone server, a server cluster composed of several servers, or a cloud computing center. In an exemplary embodiment of the present application, the terminal may be configured to collect pronunciation data, and the server may serve as the audio recognition device; that is, the server may include a decoder for audio recognition and use the built-in decoder to recognize the pronunciation data collected by the terminal, so as to obtain a recognition result. In another exemplary embodiment of the present application, the server may send the decoder to the terminal, and the terminal may then both collect the pronunciation data and serve as the audio recognition device that directly uses the decoder to recognize the pronunciation data and obtain the recognition result. In the subsequent embodiments of the present application, it is assumed that the terminal collects the pronunciation data and the server serves as the audio recognition device that performs audio recognition on the pronunciation data collected by the terminal.
The decoder is a tool for performing audio recognition. As shown in fig. 2, the decoder is a recognition network established based on an acoustic model, a pronunciation dictionary and a language model, and comprises a plurality of paths, each path corresponding to one combination of text information and pronunciation information of the pronunciation data. The recognition network is used for searching for the path with the highest decoding score for the pronunciation data to be recognized and outputting the text information corresponding to the pronunciation data based on that path, thereby completing audio recognition.
The acoustic model is a model for forming a large number of acoustic decoding paths, where the acoustic decoding paths correspond to the pronunciation information of the pronunciation data. The pronunciation information corresponding to the pronunciation data comprises at least one candidate pronunciation unit set, each candidate pronunciation unit set comprising a plurality of pronunciation units and an acoustic score for each pronunciation unit; the acoustic score may be equal to the difference between the posterior probability and the prior probability of the pronunciation unit. One acoustic decoding path corresponds to one candidate pronunciation unit set, and each acoustic decoding path is used for indicating the pronunciation order of the pronunciation units in the corresponding candidate pronunciation unit set. The acoustic score indicates the degree of match between the pronunciation data and the pronunciation unit: the greater the degree of match, the higher the acoustic score; the smaller the degree of match, the lower the acoustic score. Accordingly, the higher the acoustic score of each pronunciation unit in a candidate pronunciation unit set, the higher the degree of match between each pronunciation unit in the set and the pronunciation data, i.e., the closer the standard pronunciation of each pronunciation unit in the set is to the pronunciation data, and thus the higher the accuracy of the candidate pronunciation unit set. The standard pronunciation of a pronunciation unit can be obtained statistically from a large amount of pronunciation data. Conversely, the lower the acoustic score of each pronunciation unit in a candidate pronunciation unit set, the lower the degree of match between each pronunciation unit in the set and the pronunciation data, i.e., the larger the difference between the standard pronunciation of each pronunciation unit in the set and the pronunciation data, and thus the lower the accuracy of the candidate pronunciation unit set. A pronunciation unit is a unit of pronunciation of the candidate text information corresponding to the pronunciation data: when the language of the pronunciation data is Chinese, the pronunciation unit may be a phoneme, an initial, a final or a syllable; when the language of the pronunciation data is English, the pronunciation unit may be a phoneme (phone) or a word piece (word-piece); each pronunciation unit may be represented by a plurality of pronunciation states. For example, as shown in fig. 3, for an acoustic model in which the language of the pronunciation data is English and each pronunciation unit is represented by three states, the model includes three acoustic decoding paths, namely acoustic decoding path 1, acoustic decoding path 2 and acoustic decoding path 3; each circle on an acoustic decoding path represents one pronunciation state of a pronunciation unit, and the arrows indicate the pronunciation order. Taking acoustic decoding path 1 as an example, the candidate pronunciation unit set corresponding to acoustic decoding path 1 includes the pronunciation units w, ah and n, whose pronunciation order is w, ah and n in sequence.
In fig. 3, the pronunciation states of each pronunciation unit are represented by the corresponding pronunciation unit itself (w, w and w; ah and ah; n and n, respectively); of course, the pronunciation states may also be represented by other information, such as s1, s2, s3, etc. The sil in the acoustic decoding paths in fig. 3 indicates silence, meaning that the acoustic recognition processing of the pronunciation data has been completed.
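For concreteness, a candidate pronunciation unit set can be pictured as a structure pairing each pronunciation unit with its acoustic score, in pronunciation order. The following Python sketch is purely illustrative: the class names and score values are hypothetical, not part of the embodiment, and the product rule for the set score anticipates step S5 below.

```python
from dataclasses import dataclass

@dataclass
class PronunciationUnit:
    symbol: str            # e.g. "w", "ah", "n" for the word "one" in fig. 3
    acoustic_score: float  # degree of match between this unit and the pronunciation data

@dataclass
class CandidateUnitSet:
    units: list  # stored in pronunciation order, mirroring one acoustic decoding path

    def set_score(self) -> float:
        # acoustic score of the whole set = product of the unit scores (see step S5)
        score = 1.0
        for unit in self.units:
            score *= unit.acoustic_score
        return score

# Acoustic decoding path 1 from fig. 3: units w, ah, n in pronunciation order
path1 = CandidateUnitSet([PronunciationUnit("w", 0.9),
                          PronunciationUnit("ah", 0.8),
                          PronunciationUnit("n", 0.85)])
print(path1.set_score())  # hypothetical scores, for illustration only
```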
The pronunciation dictionary comprises the set of words that the decoder can process and the pronunciation unit set of each word in that set, and can be used for mapping a pronunciation unit set to a word. The word set may include English words, Chinese characters, etc. For example, as shown in fig. 3, the pronunciation dictionary 11 includes English words and the pronunciation units corresponding to those words; from the pronunciation dictionary it can be seen that the pronunciation units of the word "one" include w, ah and n, and the pronunciation units of the English word "two" include t, uw, etc.
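A minimal sketch of such a dictionary in Python, using only the two entries named above; the reverse lookup from a pronunciation unit sequence to a word is an assumed illustration of how the mapping might be applied, not a prescribed implementation.

```python
# Pronunciation dictionary: word -> pronunciation unit sequence (entries from fig. 3)
pronunciation_dict = {
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
}

# Reverse mapping: pronunciation unit sequence -> word, used to map a candidate
# pronunciation unit set to candidate text information
unit_to_word = {tuple(units): word for word, units in pronunciation_dict.items()}

print(unit_to_word.get(("w", "ah", "n")))  # -> "one"
```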
The language model is a model for forming a large number of language decoding paths, where the language decoding paths correspond to the text information of the pronunciation data; that is, one language decoding path corresponds to one piece of candidate text information of the pronunciation data, and the candidate text information is obtained by matching a candidate pronunciation unit set against the pronunciation dictionary. Each piece of candidate text information may be formed of one word or of a plurality of words or phrases, and each piece of candidate text information has a language score indicating the degree of similarity between the pronunciation units in the candidate pronunciation unit set and the pronunciation units in the pronunciation dictionary. Alternatively, the language score may also indicate the degree of association between a word and its context.
Based on the above description, please refer to the processing flow of audio recognition shown in fig. 4, which may include the following steps S1-S6.
S1, the terminal acquires pronunciation data to be identified and sends the pronunciation data to the server, wherein the pronunciation data can be acquired by the terminal through a voice device or downloaded from a network, and the voice device can be a microphone or the like.
S2, the server acquires an acoustic feature set corresponding to the pronunciation data, where the acoustic feature set comprises a plurality of acoustic features. To filter out noise in the pronunciation data, the pronunciation data may first be filtered to obtain processed pronunciation data, and the processed pronunciation data is then framed to obtain multiple frames of pronunciation sub-data. Each frame of pronunciation sub-data is then transformed into the frequency domain, and feature extraction is performed on each frame in the frequency domain to obtain the acoustic feature set corresponding to the pronunciation data. The acoustic feature set comprises a plurality of acoustic features arranged in order: each acoustic feature corresponds to one frame of pronunciation sub-data, and the order of the acoustic features in the set corresponds to the time order in which the pronunciation sub-data was collected. The acoustic features are used here to characterize the energy, amplitude, zero-crossing rate, linear prediction coefficients (Linear Prediction Coefficient, LPC), etc. of the pronunciation data, and may specifically include filter-bank (Fbank) features, mel-frequency cepstral coefficient (Mel-scale Frequency Cepstral Coefficients, MFCC) features, perceptual linear prediction (Perceptual Linear Predictive, PLP) features, etc.
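As an illustrative sketch of the filtering, framing and frequency-domain steps just described, the following Python/NumPy pipeline computes simple log-spectral features. It is a minimal sketch under assumed parameters (16 kHz audio, 25 ms frames, 10 ms shift); a production Fbank or MFCC front end would additionally apply a mel filter bank, and nothing here is mandated by the embodiment.

```python
import numpy as np

def extract_features(pronunciation_data: np.ndarray,
                     frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
    """Return one acoustic feature vector per frame of pronunciation sub-data."""
    # 1. simple filtering step (pre-emphasis) applied to the raw pronunciation data
    filtered = np.append(pronunciation_data[0],
                         pronunciation_data[1:] - 0.97 * pronunciation_data[:-1])
    # 2. framing: split into overlapping frames (the pronunciation sub-data),
    #    preserving the time order in which the audio was collected
    n_frames = 1 + max(0, (len(filtered) - frame_len) // frame_shift)
    frames = np.stack([filtered[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # 3. frequency-domain transform of each frame (windowed FFT -> power spectrum)
    power = np.abs(np.fft.rfft(frames * np.hamming(frame_len), n=512)) ** 2
    # 4. feature extraction: log energies, a crude stand-in for Fbank/MFCC features
    return np.log(power + 1e-10)

features = extract_features(np.random.randn(16000))  # 1 s of dummy 16 kHz audio
print(features.shape)  # (98, 257): one feature vector per frame, in time order
```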
S3, the server inputs the acoustic feature set into the acoustic model of the decoder for acoustic recognition processing, obtaining the pronunciation information corresponding to the pronunciation data. The pronunciation information comprises a plurality of candidate pronunciation unit sets, each candidate pronunciation unit set comprising a plurality of pronunciation units and the acoustic score of each pronunciation unit.
S4, the server can query, through the pronunciation dictionary of the decoder, the candidate text information corresponding to each candidate pronunciation unit set, and calculate the language score of each piece of candidate text information through the language model.
S5, the server can calculate the acoustic score of each candidate pronunciation unit set according to the acoustic score of each pronunciation unit in each candidate pronunciation unit set, wherein the acoustic score of each candidate pronunciation unit set can be the product of the acoustic scores of the pronunciation units in the corresponding candidate pronunciation unit set. Further, the sum of the acoustic score of each candidate pronunciation unit set and the language score of the corresponding candidate text information can be determined to be the corresponding decoding score of the candidate text information, and the candidate text information with the highest decoding score is selected from the plurality of candidate text information to serve as the recognition result of pronunciation data, so that the audio recognition of the pronunciation data is completed.
S6, the server returns the identification result to the terminal.
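The score combination in step S5 can be sketched as follows: the acoustic score of a candidate set is the product of its unit scores, the decoding score adds the language score, and the highest-scoring candidate wins. The candidate words, unit scores and language scores below are hypothetical values chosen only to exercise the rule.

```python
import math

def decoding_score(unit_scores, language_score):
    # acoustic score of the candidate pronunciation unit set = product of the
    # acoustic scores of its pronunciation units (step S5)
    acoustic = math.prod(unit_scores)
    # decoding score = acoustic score of the set + language score of its text
    return acoustic + language_score

candidates = [  # (candidate text information, unit acoustic scores, language score)
    ("one", [0.9, 0.8, 0.85], 0.5),
    ("won", [0.9, 0.8, 0.85], 0.3),
]
best = max(candidates, key=lambda c: decoding_score(c[1], c[2]))
print(best[0])  # candidate text information with the highest decoding score -> "one"
```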
Steps S1-S2 constitute the feature extraction stage for the pronunciation data to be recognized; step S3 constitutes the stage of acquiring the pronunciation information corresponding to the pronunciation data; and steps S4-S6 constitute the stage of determining the text information of the pronunciation data according to the pronunciation information. In practice, it is found that, due to regional accents, pronunciation variants and other factors, some pronunciation units are pronounced inaccurately or insufficiently. For example: 1) Pronunciation units subject to pronunciation variants are prone to inaccurate pronunciation, where a pronunciation variant means that the same pronunciation unit is pronounced differently in different words. Pronunciation variants may include the following 4 cases: (1) the velarized alveolar approximant, accompanied by velarized or pharyngealized co-articulation, such as the pronunciation unit l in the words all and little; (2) alveolar flapping, where a pronunciation unit whose pronunciation order follows a vowel is not released, such as the pronunciation unit t of the word better; (3) affricated sounds, such as the pronunciation unit ts of the word its and the pronunciation unit dz of the word goods; (4) unaspirated pronunciation, such as when the pronunciation order of the pronunciation unit p, t or k in a word follows that of the pronunciation unit s, e.g. the pronunciation unit p of the word speak. 2) Regional accents easily cause some pronunciation units to be pronounced insufficiently; for example, the pronunciation unit t of the word basketball is easily under-pronounced. Likewise, overlapping sounds between adjacent words are easily elided, so that some pronunciation units are insufficiently pronounced; for example, in the phrase next to, the pronunciation unit t of the word next is easily elided and thus insufficiently pronounced. If pronunciation units are insufficiently or inaccurately pronounced as in 1) or 2) above, their acoustic scores tend to be low, so that the decoder cannot accurately recognize the text information corresponding to the pronunciation data and the expected audio recognition effect cannot be achieved.
In order to improve the accuracy of audio recognition, an embodiment of the present application provides an audio recognition method that improves the basic processing flow of steps S1-S5 above as follows: (1) in the stage of acquiring the pronunciation information corresponding to the pronunciation data, acoustic compensation processing is performed on the acoustic score of each pronunciation unit in the pronunciation unit set of the pronunciation data; (2) in the stage of determining the text information of the pronunciation data according to the pronunciation information, text recognition is performed on the pronunciation unit set after the acoustic compensation processing to obtain the text information corresponding to the pronunciation data. This improvement alleviates the problem of pronunciation units receiving low acoustic scores due to regional accents, pronunciation variants and similar factors: through the acoustic compensation processing, a pronunciation unit can be compensated to an appropriate acoustic score, so that the pronunciation data can be decoded correctly and the accuracy of audio recognition of the pronunciation data is improved.
Based on the above description, an audio recognition method according to an embodiment of the present application is described below with reference to fig. 5. The method may be performed by an audio recognition device, which may be, for example, the server or the terminal shown in fig. 1. As shown in fig. 5, the audio recognition method may include the following steps S101 to S104:
S101, acquiring pronunciation data to be identified, and extracting an acoustic feature set of the pronunciation data, wherein the acoustic feature set comprises a plurality of acoustic features.
The pronunciation data refers to pronunciation data that needs to be converted into text information. In one embodiment, the pronunciation data may be input by a user. Specifically, the terminal may include an audio control used to collect pronunciation data; if an operation on the audio control is detected, the audio data input by the user may be collected by a voice device of the terminal. The audio control may be a physical key or a virtual key, and the operation on the audio control may be a touch operation, a cursor operation, a key operation or a voice operation. The touch operation may be a touch click operation, a touch press operation or a touch slide operation, and may be a single-point or multi-point touch operation; the cursor operation may be an operation of controlling a cursor to click or an operation of controlling a cursor to press; the key operation may be a virtual key operation or a physical key operation, etc. In another embodiment, the pronunciation data may be downloaded from a network; for example, in a voice conversation scenario, the pronunciation data may be downloaded from the conversation window. After the pronunciation data is acquired, when the audio recognition device is the server, the terminal can send the pronunciation data to the server, and the server receives the pronunciation data and extracts the acoustic feature set of the pronunciation data, the acoustic feature set comprising a plurality of acoustic features; when the audio recognition device is the terminal, the terminal may directly extract the acoustic feature set of the pronunciation data. The manner of extracting the acoustic feature set of the pronunciation data is as described in steps S1-S2 above.
S102, performing acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, wherein the target pronunciation unit set comprises a plurality of pronunciation units and acoustic scores of the pronunciation units.
The audio recognition device may input the acoustic feature set of the pronunciation data into the acoustic model for acoustic recognition processing, so as to obtain a plurality of candidate pronunciation unit sets corresponding to the pronunciation data; the target pronunciation unit set may be any one of the plurality of candidate pronunciation unit sets. The acoustic model here may include, but is not limited to: acoustic models based on the hidden Markov model (Hidden Markov Model, HMM), such as the Gaussian mixture model-hidden Markov model (GMM-HMM) and the deep neural network-hidden Markov model (Deep Neural Networks Hidden Markov Model, DNN-HMM); and, of course, end-to-end (End to End) acoustic models, such as the connectionist temporal classification (Connectionist Temporal Classification, CTC) model, the long short-term memory (Long-Short Term Memory, LSTM) model, and the attention (Attention) model.
S103, performing acoustic compensation processing on the acoustic score of each pronunciation unit in the target pronunciation unit set.
In order to avoid the problem of a pronunciation unit receiving a low acoustic score due to inaccurate or insufficient pronunciation, the audio recognition device may perform acoustic compensation processing on the acoustic score of each pronunciation unit in the target pronunciation unit set so as to raise the acoustic scores of the pronunciation units in the set. Specifically, the audio recognition device may judge whether the target pronunciation unit set meets an acoustic compensation condition; if not, no acoustic compensation processing is performed on the target pronunciation unit set; if so, acoustic compensation processing is performed on the target pronunciation unit set. Here, the target pronunciation unit set not meeting the acoustic compensation condition may mean that every pronunciation unit in the target pronunciation unit set is fully and accurately pronounced, i.e., the acoustic score of each pronunciation unit in the set is high; this indicates that the standard pronunciations of the pronunciation units in the set match the pronunciation data well, i.e., the accuracy of the target pronunciation unit set is high, and the audio recognition device can directly perform text recognition on the target pronunciation unit set to obtain the recognition result of the pronunciation data. Optionally, the target pronunciation unit set not meeting the acoustic compensation condition may also mean that the acoustic scores of most pronunciation units in the set are low, i.e., the standard pronunciations of the pronunciation units match the pronunciation data poorly and the accuracy of the target pronunciation unit set is low; such a set can be discarded. The target pronunciation unit set meeting the acoustic compensation condition may mean that the acoustic scores of only a few pronunciation units in the set are low, i.e., only a few pronunciation units in the target pronunciation unit set are insufficiently or inaccurately pronounced.
In one embodiment, performing acoustic compensation processing on the acoustic score of each pronunciation unit in the target pronunciation unit set may specifically include: performing acoustic compensation processing on the acoustic scores of the insufficiently or inaccurately pronounced pronunciation units in the target pronunciation unit set. In another alternative embodiment, the acoustic score of the target pronunciation unit set may be calculated from the acoustic scores of the pronunciation units in the set, and the acoustic compensation processing may be applied to the acoustic score of the set itself.
S104, carrying out text recognition on the target pronunciation unit set subjected to the acoustic compensation processing to obtain text information corresponding to the pronunciation data.
The audio recognition device can determine, according to the pronunciation dictionary, the candidate text information corresponding to the target pronunciation unit set after the acoustic compensation processing; calculate the language score of the candidate text information through the language model; and calculate the acoustic score of the target pronunciation unit set from the acoustic scores of the pronunciation units in the set after the acoustic compensation processing. Further, the sum of the language score of the candidate text information and the acoustic score of the target pronunciation unit set is used as the decoding score of the candidate text information, and if the decoding score of the candidate text information is greater than a preset score threshold, the candidate text information is used as the recognition result of the pronunciation data. When the pronunciation data corresponds to a plurality of candidate pronunciation unit sets, each candidate pronunciation unit set is recognized to obtain its candidate text information; the decoding score of each piece of candidate text information is calculated as above, and the candidate text information with the highest decoding score is selected from among them as the recognition result of the pronunciation data.
In the embodiment of the present application, performing acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set raises those acoustic scores, thereby avoiding the problem of a pronunciation unit receiving a low acoustic score due to inaccurate or insufficient pronunciation. In addition, performing text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain the text information corresponding to the pronunciation data improves the accuracy of recognizing the pronunciation data, while the normal recognition of the pronunciation data of other words is not affected. Moreover, the audio recognition accuracy is improved without optimizing the acoustic model on a large amount of training data; that is, there is no need to collect a large amount of training data or to perform extensive iterative training of the acoustic model, which reduces the difficulty of data collection and saves substantial resources.
In one embodiment, each pronunciation unit includes a plurality of pronunciation states, each pronunciation state corresponding to one acoustic feature, and the acoustic features in the acoustic feature set of the pronunciation data are arranged in order. In this case, S102 may include the following steps s11-s13.
s11, sequentially recognizing each acoustic feature in the acoustic feature set according to the order of the acoustic features in the set.
s12, calculating the acoustic score of a pronunciation unit each time one is recognized.
s13, obtaining the target pronunciation unit set once every acoustic feature in the acoustic feature set has been recognized.
The order in which the pronunciation units in the target pronunciation unit set are recognized corresponds to the pronunciation order of those pronunciation units.
In steps s11-s13, the audio recognition device may sequentially recognize each acoustic feature in the acoustic feature set according to the order of the acoustic features in the set; as is known from step S2 above, one acoustic feature corresponds to one frame of pronunciation sub-data, and the order of the acoustic features corresponds to the order in which the pronunciation sub-data was collected. That is, the audio recognition device may input the acoustic features into the acoustic model one by one, in their order in the acoustic feature set, and calculate the acoustic score of a pronunciation unit each time one is recognized. After every acoustic feature in the acoustic feature set has been recognized, the target pronunciation unit set is obtained.
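Steps s11-s13 amount to a loop over the ordered acoustic features. The sketch below is a schematic rendering in Python: `acoustic_model` is a hypothetical stand-in that returns a (unit, score) pair when a pronunciation unit has just been recognized and None otherwise, since the embodiment does not fix a particular model interface.

```python
def recognize_sequentially(feature_set, acoustic_model):
    """Feed acoustic features to the acoustic model in collection order (s11),
    scoring each pronunciation unit as it is recognized (s12)."""
    target_unit_set = []
    for feature in feature_set:  # features are already in acquisition order
        unit = acoustic_model(feature)  # None while a unit is still forming
        if unit is not None:
            symbol, acoustic_score = unit  # score computed upon recognition
            target_unit_set.append((symbol, acoustic_score))
    # once every feature has been processed, the target set is complete (s13);
    # list order matches the pronunciation order of the units
    return target_unit_set

# Mock model: emits units w, ah, n across five feature frames
mock_outputs = iter([None, ("w", 0.9), None, ("ah", 0.8), ("n", 0.85)])
print(recognize_sequentially(range(5), lambda f: next(mock_outputs)))
```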
Optionally, before step S103, the method further includes the following step s21.
s21, during the acoustic recognition processing, judging, according to the acoustic score of each pronunciation unit in the target pronunciation unit set, whether the target pronunciation unit set meets the acoustic compensation condition; if so, executing step S103.
In step s21, if the target pronunciation unit set does not meet the acoustic compensation condition, the accuracy of the target pronunciation unit set is relatively low; performing acoustic compensation processing on it anyway would easily make the candidate text information corresponding to this low-accuracy set the recognition result, reducing the accuracy of audio recognition. Accordingly, the audio recognition device performs the acoustic compensation processing on the target pronunciation unit set only when the set meets the acoustic compensation condition. Specifically, the audio recognition device may judge in real time, during the acoustic recognition processing, whether the target pronunciation unit set meets the acoustic compensation condition according to the acoustic score of each pronunciation unit in the set: each time a pronunciation unit is recognized, whether the target pronunciation unit set meets the acoustic compensation condition is judged according to the acoustic score of the currently recognized pronunciation unit; if so, this indicates that the target pronunciation unit set contains a pronunciation unit that is insufficiently or inaccurately pronounced, and step S103 is executed. Compensating the acoustic scores of the pronunciation units in the target pronunciation unit set only when the set meets the acoustic compensation condition both avoids the problem of low acoustic scores caused by inaccurate or insufficient pronunciation and prevents acoustic compensation of sets that do not meet the condition, improving the accuracy and effectiveness of the acoustic compensation of the target pronunciation unit set.
In this embodiment, step s21 includes the following steps s31 to s34.
s31, each time a pronunciation unit is recognized, judging whether the currently recognized pronunciation unit is a first pronunciation unit, where the first pronunciation unit is a pronunciation unit to be compensated that is obtained through statistics in the historical audio recognition process.
s32, if so, verifying whether the acoustic score of the currently recognized pronunciation unit is smaller than a preset acoustic score threshold.
s33, if so, counting the number of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently recognized pronunciation unit, and comparing the acoustic score of each such pronunciation unit with the preset acoustic score threshold.
s34, if the counted number is greater than a first number threshold and the acoustic score of each pronunciation unit in the target pronunciation unit set whose pronunciation order precedes that of the currently recognized pronunciation unit is greater than or equal to the preset acoustic score threshold, determining that the target pronunciation unit set meets the acoustic compensation condition.
In steps s31 to s34, the audio recognition device may detect in real time, in combination with historical empirical data, whether the target pronunciation unit set meets the acoustic compensation condition. Specifically, each time the audio recognition device recognizes a pronunciation unit, it judges whether the currently recognized pronunciation unit is a first pronunciation unit. A first pronunciation unit is a pronunciation unit to be compensated that is obtained through statistics in the historical audio recognition process, i.e., a pronunciation unit that is prone to insufficient or inaccurate pronunciation; concretely, it is a pronunciation unit whose acoustic score fell below the preset acoustic score threshold, during historical audio recognition, with a frequency greater than a preset frequency threshold. For example, if in 10 historical audio recognitions the acoustic score of the pronunciation unit t was smaller than the preset acoustic score threshold in 8 of them, the pronunciation unit t is taken as a first pronunciation unit. If the currently recognized pronunciation unit is a first pronunciation unit, indicating that it is prone to insufficient or inaccurate pronunciation, it is verified whether its acoustic score is smaller than the preset acoustic score threshold. If its acoustic score is smaller than the preset acoustic score threshold, the number of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently recognized pronunciation unit is counted, and the acoustic score of each such pronunciation unit is compared with the preset acoustic score threshold. If the counted number is greater than the first number threshold and the acoustic score of each preceding pronunciation unit is greater than or equal to the preset acoustic score threshold, then among the recognized pronunciation units only the currently recognized one has a low acoustic score, i.e., only the currently recognized pronunciation unit is insufficiently or inaccurately pronounced; in other words, the accuracy of the target pronunciation unit set is high and only a few pronunciation units are mispronounced, so it is determined that the target pronunciation unit set meets the acoustic compensation condition. In this way, acoustic compensation processing of low-accuracy target pronunciation unit sets is avoided, and the accuracy of the acoustic compensation processing of the target pronunciation unit set is improved.
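A sketch of the real-time check in steps s31-s34. The set of first pronunciation units, the acoustic score threshold and the first number threshold are assumed inputs that, per the embodiment, would come from historical recognition statistics; the values in the usage line are hypothetical.

```python
def meets_compensation_condition(current_unit, current_score, preceding_scores,
                                 first_units, score_threshold, count_threshold):
    # s31: the currently recognized unit must be a first pronunciation unit,
    # i.e. one statistically prone to insufficient or inaccurate pronunciation
    if current_unit not in first_units:
        return False
    # s32: its acoustic score must fall below the preset acoustic score threshold
    if current_score >= score_threshold:
        return False
    # s33/s34: enough preceding units, all scoring at or above the threshold,
    # so that only the current unit scores low
    return (len(preceding_scores) > count_threshold and
            all(s >= score_threshold for s in preceding_scores))

# Hypothetical call: "t" is a first pronunciation unit, threshold 0.6,
# first number threshold 2, three well-scored preceding units
print(meets_compensation_condition("t", 0.4, [0.9, 0.8, 0.85],
                                   {"t"}, 0.6, 2))  # -> True
```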
Optionally, step s21 includes the following steps s41 to s43.
s41, each time a pronunciation unit is recognized, verifying whether the acoustic score of the currently recognized pronunciation unit is smaller than the preset acoustic score threshold.
s42, if so, counting the number of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently recognized pronunciation unit, and comparing the acoustic score of each such pronunciation unit with the preset acoustic score threshold.
s43, if the counted number is greater than a second number threshold and the acoustic score of each pronunciation unit in the target pronunciation unit set whose pronunciation order precedes that of the currently recognized pronunciation unit is greater than or equal to the preset acoustic score threshold, determining that the target pronunciation unit set meets the acoustic compensation condition.
In steps s41 to s43, the audio recognition device may detect in real time, during the acoustic recognition processing, whether the target pronunciation unit set meets the acoustic compensation condition. Specifically, each time a pronunciation unit is recognized, it is verified whether the acoustic score of the currently recognized pronunciation unit is smaller than the preset acoustic score threshold. If it is, the number of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently recognized pronunciation unit is counted, and the acoustic score of each such pronunciation unit is compared with the preset acoustic score threshold. If the counted number is greater than the second number threshold and the acoustic score of each preceding pronunciation unit is greater than or equal to the preset acoustic score threshold, then among the recognized pronunciation units only the currently recognized one has a low acoustic score, i.e., only the currently recognized pronunciation unit is insufficiently or inaccurately pronounced; in other words, the accuracy of the target pronunciation unit set is high and only a few pronunciation units are mispronounced, so it is determined that the target pronunciation unit set meets the acoustic compensation condition. In this way, acoustic compensation processing of low-accuracy target pronunciation unit sets is avoided, and the accuracy of the acoustic compensation processing of the target pronunciation unit set is improved. This variant differs from steps s31 to s34 only in that it omits the check against the first pronunciation units.
In this embodiment, step S103 may include the following steps s51 and s52.
s51, performing acoustic compensation processing on the acoustic score of the currently recognized pronunciation unit by using the acoustic scores of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently recognized pronunciation unit, so as to obtain the compensated acoustic score of the currently recognized pronunciation unit.
s52, updating the target pronunciation unit set with the compensated acoustic score of the currently recognized pronunciation unit to obtain the target pronunciation unit set after the acoustic compensation processing.
In steps s51 and s52, when it is detected in real time during the acoustic recognition processing that the target pronunciation unit set meets the acoustic compensation condition, the acoustic compensation processing may be performed on the set in real time. Specifically, the audio recognition device may perform acoustic compensation processing on the acoustic score of the currently recognized pronunciation unit using the acoustic scores of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently recognized pronunciation unit, so as to obtain its compensated acoustic score. In one embodiment, the maximum or the average of the acoustic scores of all preceding pronunciation units may be used to compensate the acoustic score of the currently recognized pronunciation unit. Optionally, an acoustic score may instead be selected at random from the acoustic scores of the preceding pronunciation units and used for the compensation. Further, the target pronunciation unit set is updated with the compensated acoustic score of the currently recognized pronunciation unit to obtain the target pronunciation unit set after the acoustic compensation processing. Since only the insufficiently or inaccurately pronounced pronunciation units in the target pronunciation unit set are compensated, the acoustic score of the set is raised while the accuracy of the compensation is maintained; moreover, compensating every pronunciation unit in the set is avoided, so the normal recognition of the pronunciation data of other words is not affected.
In this embodiment, step s51 may include the following steps s61 to s64.
s61, calculating a first average value of the acoustic scores of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently recognized pronunciation unit.
s62, obtaining the probability that the acoustic score of the currently recognized pronunciation unit is smaller than the preset acoustic score threshold.
s63, determining a compensation acoustic score for the currently recognized pronunciation unit according to the first average value and the probability.
s64, determining the sum of the acoustic score of the currently recognized pronunciation unit and the compensation acoustic score as the compensated acoustic score of the currently recognized pronunciation unit.
In steps s61 to s64, the audio recognition device may use the average of the acoustic scores of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently recognized pronunciation unit to perform acoustic compensation processing on the acoustic score of the currently recognized pronunciation unit. Specifically, the audio recognition device may calculate the first average value of those preceding acoustic scores with a preset averaging algorithm, which may be an arithmetic mean, a statistical mean, or the like. Further, the probability that the acoustic score of the currently recognized pronunciation unit is smaller than the preset acoustic score threshold is obtained, this probability having been gathered statistically during historical audio recognition processing; a compensation acoustic score for the currently recognized pronunciation unit is then determined from the first average value and the probability. Finally, the sum of the acoustic score of the currently recognized pronunciation unit and the compensation acoustic score is determined as the compensated acoustic score of the currently recognized pronunciation unit, which may be expressed as the following formula (1).
$$P(x'_n) = P(x_n) + \Delta P(x_n), \qquad \Delta P(x_n) = \alpha\,\bar{P}_n + \beta\,P_{prior}(x_n) \qquad (1)$$

In formula (1), $x_n$ represents the nth pronunciation unit in the target pronunciation unit set, i.e., $x_n$ is the currently recognized pronunciation unit; $P(x_n)$ represents the acoustic score of the currently recognized pronunciation unit; $P_{prior}(x_n)$ represents the probability that the acoustic score of the currently recognized pronunciation unit is smaller than the preset acoustic score threshold; $\bar{P}_n = \frac{1}{n-1}\sum_{i=1}^{n-1} P(x_i)$ represents the first average value; $\alpha$ and $\beta$ represent weight coefficients, which can be obtained through statistics in the historical audio recognition process; $\Delta P(x_n)$ represents the compensation acoustic score; and $P(x'_n)$ represents the compensated acoustic score of the currently recognized pronunciation unit.
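A direct transcription of formula (1) as reconstructed above, in Python. The weight coefficients alpha and beta and the prior probability are hypothetical values here; in the embodiment they come from statistics over historical audio recognition.

```python
def compensate_current_unit(preceding_scores, current_score, p_prior,
                            alpha=0.5, beta=0.5):
    """Formula (1): P(x'_n) = P(x_n) + alpha * first_average + beta * P_prior(x_n)."""
    # s61: first average = arithmetic mean of the acoustic scores of all units
    # whose pronunciation order precedes the currently recognized unit
    first_average = sum(preceding_scores) / len(preceding_scores)
    # s63: compensation acoustic score from the first average and the prior
    delta = alpha * first_average + beta * p_prior
    # s64: compensated score = original acoustic score + compensation score
    return current_score + delta

# Hypothetical values: three well-scored preceding units, a low current score,
# and a high prior probability of under-scoring for this unit
print(compensate_current_unit([0.9, 0.8, 0.85], 0.4, p_prior=0.8))  # -> 1.225
```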
Optionally, before step S103, the method further includes the following step s71.
s71, after the acoustic recognition processing is completed, judging, according to the acoustic score of each pronunciation unit in the target pronunciation unit set, whether the target pronunciation unit set meets the acoustic compensation condition; if so, executing step S103.
In step s71, if the target pronunciation unit set does not meet the acoustic compensation condition, the accuracy of the target pronunciation unit set is relatively low; performing acoustic compensation processing on it anyway would easily make the candidate text information corresponding to this low-accuracy set the recognition result, reducing the accuracy of audio recognition. Accordingly, the audio recognition device performs the acoustic compensation processing on the target pronunciation unit set only when the set meets the acoustic compensation condition. Specifically, after the acoustic recognition processing is completed, i.e., after all acoustic features have been recognized, the audio recognition device judges, according to the acoustic score of each pronunciation unit in the target pronunciation unit set, whether the set meets the acoustic compensation condition. If so, this indicates that the target pronunciation unit set contains pronunciation units that are insufficiently or inaccurately pronounced, and step S103 is executed; if not, no acoustic compensation processing is performed on the target pronunciation unit set. Compensating the acoustic scores of the pronunciation units in the target pronunciation unit set only when the set meets the acoustic compensation condition both avoids the problem of low acoustic scores caused by inaccurate or insufficient pronunciation and prevents acoustic compensation of sets that do not meet the condition, improving the accuracy and effectiveness of the acoustic compensation of the target pronunciation unit set.
In this embodiment, step s71 includes the following steps s81 to s84.
s81, after the acoustic recognition processing is completed, detecting whether there is a target pronunciation unit in the target pronunciation unit set that is the same as a first pronunciation unit, where the first pronunciation unit is a pronunciation unit to be compensated that is obtained through statistics in the historical audio recognition process.
s82, if so, verifying whether the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold.
s83, if the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold, counting the number of all pronunciation units in the target pronunciation unit set whose acoustic scores are greater than the preset acoustic score threshold.
s84, if the counted number is greater than the third number threshold, determining that the target pronunciation unit set meets the acoustic compensation condition.
In steps s81 to s84, the audio recognition device may detect, in combination with historical empirical data after the acoustic recognition processing is completed, whether the target pronunciation unit set meets the acoustic compensation condition. Specifically, after the acoustic recognition processing is completed, the audio recognition device detects whether the target pronunciation unit set contains a target pronunciation unit identical to a first pronunciation unit; if so, the target pronunciation unit is one that is prone to insufficient or inaccurate pronunciation, and it is verified whether its acoustic score is smaller than the preset acoustic score threshold. If the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold, its acoustic score is low, and the number of all pronunciation units in the target pronunciation unit set whose acoustic scores are greater than the preset acoustic score threshold is counted. If the counted number is greater than the third number threshold, the acoustic scores of most pronunciation units in the set are high and those of only a few are low, i.e., only the target pronunciation unit is insufficiently pronounced or pronounced with low accuracy; in other words, the accuracy of the target pronunciation unit set is high and only a few pronunciation units are mispronounced, so it is determined that the target pronunciation unit set meets the acoustic compensation condition. In this way, acoustic compensation processing of low-accuracy target pronunciation unit sets is avoided, and the accuracy of the acoustic compensation processing of the target pronunciation unit set is improved. It should be noted that the preset acoustic score threshold and the first, second, third and fourth number thresholds may all be obtained through statistics over historical audio recognition, and the first, second, third and fourth number thresholds may be dynamically adjusted according to the number of pronunciation units in the target pronunciation unit set.
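The post-completion check of steps s81-s84 can be sketched as follows; the representation of the target pronunciation unit set as (unit, score) pairs and the threshold values are assumptions for illustration.

```python
def meets_condition_after_recognition(unit_set, first_units,
                                      score_threshold, count_threshold):
    """unit_set: list of (pronunciation unit, acoustic score) pairs."""
    for symbol, score in unit_set:
        # s81/s82: a target unit matching a first pronunciation unit scores low
        if symbol in first_units and score < score_threshold:
            # s83: count the units in the set scoring above the threshold
            high = sum(1 for _, s in unit_set if s > score_threshold)
            # s84: most units score high, so only a few need compensation
            return high > count_threshold
    return False

# Hypothetical set for the word "better": the flapped "t" scores low
print(meets_condition_after_recognition(
    [("b", 0.9), ("eh", 0.8), ("t", 0.4), ("er", 0.85)],
    first_units={"t"}, score_threshold=0.6, count_threshold=2))  # -> True
```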
In this embodiment, step s71 includes the following steps s91 to s93.
s91, after the acoustic recognition processing is completed, judging whether there is a target pronunciation unit in the target pronunciation unit set whose acoustic score is smaller than the preset acoustic score threshold.
s92, if such a target pronunciation unit exists, counting the number of all pronunciation units in the target pronunciation unit set whose acoustic scores are greater than the preset acoustic score threshold.
s93, if the counted number is greater than the fourth number threshold, determining that the target pronunciation unit set meets the acoustic compensation condition.
In steps s91 to s93, the audio recognition device may detect, after the acoustic recognition processing is completed, whether the target pronunciation unit set meets the acoustic compensation condition. Specifically, after the acoustic recognition processing is completed, it is judged whether the target pronunciation unit set contains a target pronunciation unit whose acoustic score is smaller than the preset acoustic score threshold; if so, the acoustic score of that target pronunciation unit is low, and the number of all pronunciation units in the set whose acoustic scores are greater than the preset acoustic score threshold is counted. If the counted number is greater than the fourth number threshold, the acoustic scores of most pronunciation units in the set are high and those of only a few are low, i.e., only the target pronunciation unit is insufficiently pronounced or pronounced with low accuracy; in other words, the accuracy of the target pronunciation unit set is high and only a few pronunciation units are mispronounced, so it is determined that the target pronunciation unit set meets the acoustic compensation condition. In this way, acoustic compensation processing of low-accuracy target pronunciation unit sets is avoided, and the accuracy of the acoustic compensation processing of the target pronunciation unit set is improved.
In this embodiment, step S103 includes the following steps s111 to s112.
s111, performing acoustic compensation processing on the acoustic score of the target pronunciation unit by using the acoustic scores of the pronunciation units in the target pronunciation unit set other than the target pronunciation unit, so as to obtain the compensated acoustic score of the target pronunciation unit.
s112, updating the target pronunciation unit set with the compensated acoustic score of the target pronunciation unit to obtain the target pronunciation unit set after the acoustic compensation processing.
In steps s111 to s112, when it is detected, after the acoustic recognition processing is completed, that the target pronunciation unit set meets the acoustic compensation condition, the acoustic compensation processing may be performed on the set. Specifically, the acoustic score of the target pronunciation unit can be compensated using the acoustic scores of the other pronunciation units in the target pronunciation unit set, so as to obtain its compensated acoustic score. In one embodiment, the average or the maximum of the acoustic scores of the pronunciation units other than the target pronunciation unit may be used to compensate the acoustic score of the target pronunciation unit. In another embodiment, an acoustic score may be selected at random from the acoustic scores of the other pronunciation units and used for the compensation. Further, the target pronunciation unit set is updated with the compensated acoustic score of the target pronunciation unit to obtain the target pronunciation unit set after the acoustic compensation processing. Since only the insufficiently or inaccurately pronounced pronunciation units in the target pronunciation unit set are compensated, the acoustic scores of the pronunciation units in the set are raised while the accuracy of the acoustic compensation of the set is improved.
In this embodiment, step s111 includes the following steps s211 to s214.
s211, calculating a second average value of acoustic scores of other pronunciation units except the target pronunciation unit in the target pronunciation unit set.
s212, obtaining the probability that the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold value.
s213, determining a compensation acoustic score of the target pronunciation unit according to the second average value and the probability.
s214, determining the sum of the acoustic score of the target pronunciation unit and the compensation acoustic score as the compensated acoustic score of the target pronunciation unit.
In steps s211 to s214, the audio recognition device may perform acoustic compensation processing on the acoustic score of the target pronunciation unit by using the average value of the acoustic scores of the pronunciation units other than the target pronunciation unit in the target pronunciation unit set, to obtain the compensated acoustic score of the target pronunciation unit. Specifically, the audio recognition device may calculate the second average value of the acoustic scores of the pronunciation units other than the target pronunciation unit in the target pronunciation unit set by using a preset averaging algorithm. Further, the probability that the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold is obtained, and the compensation acoustic score of the target pronunciation unit is determined according to the second average value and the probability; the sum of the acoustic score of the target pronunciation unit and the compensation acoustic score is then determined as the compensated acoustic score of the target pronunciation unit. The compensated acoustic score of the target pronunciation unit can be expressed by the following formula (2).
$$P(x_i') = P(x_i) + P_{prior}(x_i)\cdot\bar{P}, \qquad \bar{P}=\frac{1}{N-1}\sum_{j\neq i}P(x_j) \tag{2}$$

In formula (2), $x_i$ represents the i-th pronunciation unit in the target pronunciation unit set, i.e., $x_i$ is the target pronunciation unit; $P(x_i)$ represents the acoustic score of the target pronunciation unit; $P_{prior}(x_i)$ represents the probability that the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold; $\bar{P}$ represents the second average value; $N$ represents the number of pronunciation units in the target pronunciation unit set; $P_{prior}(x_i)\cdot\bar{P}$ represents the compensation acoustic score; and $P(x_i')$ represents the compensated acoustic score of the target pronunciation unit.
In one embodiment, step s104 includes the following steps s311 to s313.
s311, performing text recognition on the target pronunciation unit set after the acoustic compensation processing, to obtain candidate text information corresponding to the pronunciation data and a language score of the candidate text information.
s312, determining the acoustic score of the target pronunciation unit set according to the acoustic score of each pronunciation unit in the target pronunciation unit set after the acoustic compensation processing.
s313, if the sum of the acoustic score of the target pronunciation unit set and the language score of the candidate text information is greater than a preset score threshold, determining the candidate text information as the text information corresponding to the pronunciation data.
In steps s311 to s313, the audio recognition device may query, through the pronunciation dictionary, the candidate text information corresponding to the target pronunciation unit set after the acoustic compensation processing, and calculate the language score of the candidate text information according to the language model. The product of the acoustic scores of the pronunciation units in the target pronunciation unit set after the acoustic compensation processing is determined as the acoustic score of the target pronunciation unit set. If the sum of the acoustic score of the target pronunciation unit set and the language score is greater than the preset score threshold, the candidate text information is determined as the text information corresponding to the pronunciation data.
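The decode-time scoring in steps s311 to s313 can be sketched as below; the threshold value and the score lists are assumptions for illustration.

```python
import math

PRESET_SCORE_THRESHOLD = 1.2  # assumed preset score threshold

def accept_candidate(compensated_unit_scores, language_score):
    # Acoustic score of the set = product of the compensated unit scores.
    acoustic_score = math.prod(compensated_unit_scores)
    return acoustic_score + language_score > PRESET_SCORE_THRESHOLD

print(accept_candidate([0.9, 0.85, 0.8, 0.88, 0.9], language_score=0.9))  # True
```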
In one embodiment, step s104 is followed by the following steps s411 to s413.
s411, detecting whether the text information corresponding to the pronunciation data includes a field matched with an operation instruction.
s412, if yes, generating a target operation instruction according to the text information corresponding to the pronunciation data.
s413, sending the target operation instruction to the terminal, where the terminal executes the target operation instruction.
In steps s411 to s413, the audio recognition device may generate an operation instruction from the text information corresponding to the pronunciation data. Specifically, it may be detected whether the text information corresponding to the pronunciation data includes a field matched with an operation instruction; for example, the field may include "open", "close", "start", and the like. If so, the audio recognition device may generate a target operation instruction according to the text information corresponding to the pronunciation data. When the audio recognition device is a server, the server may send the target operation instruction to the terminal, and the terminal executes the target operation instruction; when the audio recognition device is a terminal, the terminal may execute the target operation instruction directly.
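The field matching in steps s411 to s413 might look like the following sketch; the field list and the instruction format are illustrative assumptions.

```python
COMMAND_FIELDS = ("open", "close", "start")  # example fields matched to operation instructions

def build_operation_instruction(text):
    """Return a target operation instruction if the text contains a command field."""
    for field in COMMAND_FIELDS:
        if field in text:
            return {"action": field, "text": text}  # hypothetical instruction format
    return None  # no matching field: no instruction is generated

print(build_operation_instruction("open the next page"))  # {'action': 'open', ...}
```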
The audio recognition method provided in the present application can be applied to scenarios such as automatic translation, voice search, voice input, and voice dialogue. The application is described in detail below by taking the case where the audio recognition device is a server as an example. Referring to fig. 6, fig. 6 illustrates an audio recognition method provided in the present application.
As shown in fig. 6, the terminal includes a search interface 12, where the search interface 12 includes an audio control 13 and a text input box 14. The search interface may be a browser, a user interface of a social application program, and so on; the text input box allows a user to enter the text information to be searched. When the terminal detects a clicking operation on the audio control 13, the terminal may collect the pronunciation data input by the user through the voice device and send the pronunciation data to the server.
As shown in fig. 6, the server may obtain a set of acoustic features corresponding to the pronunciation data. Specifically, filtering processing can be performed on pronunciation data to obtain processed pronunciation data; and carrying out frame processing on the processed pronunciation data to obtain multi-frame pronunciation sub-data. Further, frequency domain transformation is carried out on each frame of pronunciation sub data in the multi-frame pronunciation sub data to obtain frequency domain pronunciation sub data, and feature extraction is carried out on each frame of pronunciation sub data in the frequency domain to obtain an acoustic feature set corresponding to the pronunciation data. The acoustic feature set comprises a plurality of acoustic features, the acoustic features in the acoustic feature set are arranged in sequence, and each acoustic feature corresponds to one frame of pronunciation sub-data.
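A compact sketch of this front-end (framing, frequency-domain transform, per-frame feature extraction) is given below; the frame sizes, window, and log-spectral feature are common choices assumed for illustration, not choices mandated by the application.

```python
import numpy as np

def extract_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Turn filtered pronunciation data into one acoustic feature per frame, in order."""
    frame = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    features = []
    for start in range(0, len(signal) - frame + 1, hop):
        sub = signal[start:start + frame]                        # one frame of pronunciation sub-data
        spectrum = np.abs(np.fft.rfft(sub * np.hamming(frame)))  # frequency-domain transform
        features.append(np.log(spectrum + 1e-8))                 # simple log-spectral feature
    return features  # ordered acoustic feature set

features = extract_features(np.random.randn(16000))  # one second of dummy audio
print(len(features))  # about 98 frames
```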
As shown in fig. 6, the server may perform acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a plurality of candidate pronunciation unit sets corresponding to the pronunciation data, where each candidate pronunciation unit set includes a plurality of pronunciation units and the acoustic score of each pronunciation unit. Here, three candidate pronunciation unit sets are taken as an example: candidate pronunciation unit set 1, candidate pronunciation unit set 2, and candidate pronunciation unit set 3. After the acoustic recognition processing is completed, whether each candidate pronunciation unit set satisfies the acoustic compensation condition is detected according to the acoustic scores of the pronunciation units in that set. If it is detected that the acoustic score of every pronunciation unit in candidate pronunciation unit set 2 is smaller than the preset acoustic score threshold, it is determined that candidate pronunciation unit set 2 does not satisfy the acoustic compensation condition. If it is detected that the acoustic score of every pronunciation unit in candidate pronunciation unit set 3 is greater than or equal to the preset acoustic score threshold, it is determined that candidate pronunciation unit set 3 does not satisfy the acoustic compensation condition. If it is detected that candidate pronunciation unit set 1 contains a target pronunciation unit whose acoustic score is smaller than the preset acoustic score threshold, and the number of pronunciation units in candidate pronunciation unit set 1 whose acoustic scores are greater than or equal to the preset acoustic score threshold is greater than the fourth number threshold, it is determined that candidate pronunciation unit set 1 satisfies the acoustic compensation condition. For example, candidate pronunciation unit set 1 includes the pronunciation units n, e, k, s, t; if it is detected that the acoustic score of the pronunciation unit t is smaller than the preset acoustic score threshold while the acoustic scores of the other pronunciation units are all greater than or equal to the preset acoustic score threshold, it is determined that candidate pronunciation unit set 1 satisfies the acoustic compensation condition. Further, the average value of the acoustic scores of the pronunciation units n, e, k, s may be calculated, and the acoustic score of the pronunciation unit t may be subjected to acoustic compensation processing based on this average value. Finally, text recognition is performed on candidate pronunciation unit set 1 after the acoustic compensation processing and on candidate pronunciation unit set 3, respectively, to obtain candidate text information 1 and candidate text information 2 corresponding to the pronunciation data, together with a decoding score for each piece of candidate text information. The candidate text information with the highest decoding score is selected from candidate text information 1 and candidate text information 2 as the text information of the pronunciation data.
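Continuing the n, e, k, s, t example with assumed scores, the compensation of the unit t can be traced numerically as follows; the scores and the prior probability are invented for this walk-through.

```python
scores = {"n": 0.9, "e": 0.85, "k": 0.8, "s": 0.88, "t": 0.3}  # assumed acoustic scores
others = [v for unit, v in scores.items() if unit != "t"]
average = sum(others) / len(others)   # average of n, e, k, s = 0.8575
p_prior = 0.7                         # assumed P_prior for the unit t
scores["t"] += p_prior * average      # compensated score of t, about 0.90
print(round(scores["t"], 3))          # 0.9
```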
As shown in fig. 6, the server may transmit the text information corresponding to the pronunciation data to the terminal; for example, the text information includes "next", and the terminal may display the text information in the text input box 14 in the search interface. Optionally, the server may further generate a search instruction according to the text information and send the search instruction to the terminal, where the search instruction is used to instruct the terminal to search for entries associated with the text information. The terminal may receive and execute the search instruction, and output a plurality of entries associated with the text information.
The embodiment of the application provides an audio recognition apparatus, which may be disposed in an audio recognition device; for example, the audio recognition apparatus may be a decoder in the audio recognition device or an application program having a decoding function. Referring to fig. 7, the apparatus includes:
an obtaining unit 701, configured to obtain pronunciation data to be identified, and extract an acoustic feature set of the pronunciation data, where the acoustic feature set includes a plurality of acoustic features;
a recognition unit 702, configured to perform acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, where the target pronunciation unit set includes a plurality of pronunciation units and the acoustic score of each pronunciation unit;
a compensation unit 703, configured to perform acoustic compensation processing on the acoustic score of each pronunciation unit in the target pronunciation unit set;
the recognition unit 702 is further configured to perform text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain text information corresponding to the pronunciation data.
Optionally, the recognition unit 702 is configured to sequentially identify each acoustic feature in the acoustic feature set according to the arrangement order of the acoustic features in the acoustic feature set; calculate an acoustic score for a pronunciation unit each time one pronunciation unit is identified; and obtain the target pronunciation unit set when every acoustic feature in the acoustic feature set has been identified; wherein the order in which the pronunciation units in the target pronunciation unit set are identified matches the pronunciation order of the pronunciation units.
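As a sketch of this sequential behavior, assuming a hypothetical `acoustic_model` callable that maps one acoustic feature to a (unit, score) pair:

```python
def acoustic_recognition(feature_set, acoustic_model):
    """Identify pronunciation units feature by feature, preserving pronunciation order."""
    target_units = []
    for feature in feature_set:                # arrangement order of the acoustic features
        unit, score = acoustic_model(feature)  # hypothetical model interface
        target_units.append((unit, score))     # identification order matches pronunciation order
    return target_units                        # the target pronunciation unit set
```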
Optionally, the apparatus further includes a judging unit 704, configured to judge, during the acoustic recognition processing, whether the target pronunciation unit set satisfies an acoustic compensation condition according to the acoustic score of each pronunciation unit in the target pronunciation unit set; or judge, after the acoustic recognition processing is completed, whether the target pronunciation unit set satisfies an acoustic compensation condition according to the acoustic score of each pronunciation unit in the target pronunciation unit set; and if so, trigger the step of performing acoustic compensation processing on the acoustic score of each pronunciation unit in the target pronunciation unit set.
Optionally, the judging unit 704 is specifically configured to: each time one pronunciation unit is identified, judge whether the currently identified pronunciation unit is a first pronunciation unit, where the first pronunciation unit is a pronunciation unit to be compensated obtained by statistics in the historical audio recognition process; if yes, verify whether the acoustic score of the currently identified pronunciation unit is smaller than a preset acoustic score threshold; if the acoustic score is smaller than the preset acoustic score threshold, count the number of pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit, and compare the acoustic score of each of those pronunciation units with the preset acoustic score threshold; and if the counted number is greater than a first number threshold and the acoustic score of each pronunciation unit in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit is greater than or equal to the preset acoustic score threshold, determine that the target pronunciation unit set satisfies the acoustic compensation condition.
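A sketch of this during-recognition check follows; the historical unit list, the threshold values, and the data layout are assumptions for the example.

```python
HISTORICAL_FIRST_UNITS = {"t", "d"}  # assumed units to be compensated, from past runs
SCORE_THRESHOLD = 0.6                # assumed preset acoustic score threshold
FIRST_NUMBER_THRESHOLD = 2           # assumed first number threshold

def check_during_recognition(preceding_scores, unit, score):
    """preceding_scores: scores of units already identified, in pronunciation order."""
    if unit not in HISTORICAL_FIRST_UNITS or score >= SCORE_THRESHOLD:
        return False
    # All preceding units must score at or above the threshold, and there must
    # be more of them than the first number threshold.
    return (len(preceding_scores) > FIRST_NUMBER_THRESHOLD
            and all(s >= SCORE_THRESHOLD for s in preceding_scores))

print(check_during_recognition([0.9, 0.85, 0.8], "t", 0.3))  # True
```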
Optionally, the judging unit 704 is specifically configured to: each time one pronunciation unit is identified, verify whether the acoustic score of the currently identified pronunciation unit is smaller than a preset acoustic score threshold; if the acoustic score is smaller than the preset acoustic score threshold, count the number of pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit, and compare the acoustic score of each of those pronunciation units with the preset acoustic score threshold; and if the counted number is greater than a second number threshold and the acoustic score of each pronunciation unit in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit is greater than or equal to the preset acoustic score threshold, determine that the target pronunciation unit set satisfies the acoustic compensation condition.
Optionally, the compensation unit 703 is specifically configured to perform acoustic compensation processing on the acoustic score of the currently identified pronunciation unit by using the acoustic scores of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit, to obtain the compensated acoustic score of the currently identified pronunciation unit; and update the target pronunciation unit set with the compensated acoustic score of the currently identified pronunciation unit to obtain the target pronunciation unit set after the acoustic compensation processing.
Optionally, the compensation unit 703 is specifically configured to calculate a first average value of the acoustic scores of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit; obtain the probability that the acoustic score of the currently identified pronunciation unit is smaller than the preset acoustic score threshold; determine a compensation acoustic score of the currently identified pronunciation unit according to the first average value and the probability; and determine the sum of the acoustic score of the currently identified pronunciation unit and the compensation acoustic score as the compensated acoustic score of the currently identified pronunciation unit.
Optionally, the audio recognition apparatus further includes the judging unit 704, which is specifically configured to detect, after the acoustic recognition processing is completed, whether a target pronunciation unit identical to a first pronunciation unit exists in the target pronunciation unit set, where the first pronunciation unit is a pronunciation unit to be compensated obtained by statistics in the historical audio recognition process; if yes, verify whether the acoustic score of the target pronunciation unit is smaller than a preset acoustic score threshold; if the acoustic score is smaller than the preset acoustic score threshold, count the number of all pronunciation units in the target pronunciation unit set whose acoustic scores are greater than or equal to the preset acoustic score threshold; and if the counted number is greater than a third number threshold, determine that the target pronunciation unit set satisfies the acoustic compensation condition.
Optionally, the judging unit 704 is specifically configured to judge, after the acoustic recognition processing is completed, whether a target pronunciation unit whose acoustic score is smaller than a preset acoustic score threshold exists in the target pronunciation unit set; if such a target pronunciation unit exists, count the number of all pronunciation units in the target pronunciation unit set whose acoustic scores are greater than or equal to the preset acoustic score threshold; and if the counted number is greater than a fourth number threshold, determine that the target pronunciation unit set satisfies the acoustic compensation condition.
Optionally, the compensation unit 703 is specifically configured to perform acoustic compensation processing on the acoustic score of the target pronunciation unit by using the acoustic scores of the pronunciation units other than the target pronunciation unit in the target pronunciation unit set, to obtain the compensated acoustic score of the target pronunciation unit; and update the target pronunciation unit set with the compensated acoustic score of the target pronunciation unit to obtain the target pronunciation unit set after the acoustic compensation processing.
Optionally, the compensation unit 703 is specifically configured to calculate a second average value of the acoustic scores of the pronunciation units other than the target pronunciation unit in the target pronunciation unit set; obtain the probability that the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold; determine a compensation acoustic score of the target pronunciation unit according to the second average value and the probability; and determine the sum of the acoustic score of the target pronunciation unit and the compensation acoustic score as the compensated acoustic score of the target pronunciation unit.
Optionally, the recognition unit 702 is specifically configured to perform text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain candidate text information corresponding to the pronunciation data and a language score of the candidate text information; determine the acoustic score of the target pronunciation unit set according to the acoustic score of each pronunciation unit in the target pronunciation unit set after the acoustic compensation processing; and if the sum of the acoustic score of the target pronunciation unit set and the language score is greater than a preset score threshold, determine the candidate text information as the text information corresponding to the pronunciation data.
Optionally, the audio recognition apparatus further includes a generating unit 705, configured to detect whether the text information corresponding to the pronunciation data includes a field matched with an operation instruction; and if yes, generate a target operation instruction according to the text information corresponding to the pronunciation data and send the target operation instruction to the terminal, where the terminal executes the target operation instruction.
In the embodiment of the application, performing acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set can raise the acoustic scores of those pronunciation units, thereby avoiding the problem that the acoustic score of a pronunciation unit is low because the pronunciation unit is inaccurately or insufficiently pronounced. In addition, performing text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain the text information corresponding to the pronunciation data can improve the accuracy of recognizing the pronunciation data, while the recognition of the pronunciation data of other audio words is not affected. Furthermore, the audio recognition accuracy is improved without optimizing the acoustic model through a large amount of training data; that is, there is no need to collect a large amount of training data or to perform a large amount of iterative training on the acoustic model, which reduces the difficulty of data collection and saves substantial resources.
An embodiment of the present application provides an audio recognition device, please refer to fig. 8. The audio recognition device includes a processor 151, a user interface 152, a network interface 154, and a storage device 155, which are connected via a bus 153.
The user interface 152 is used for enabling human-machine interaction, and may include a display screen, a keyboard, and the like. The network interface 154 is used for communication connection with external devices. The storage device 155 is coupled to the processor 151 and is used for storing various software programs and/or sets of instructions. In particular implementations, the storage device 155 may include high-speed random access memory, and may also include non-volatile memory, such as one or more disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. The storage device 155 may store an operating system such as ANDROID, IOS, or WINDOWS, or an embedded operating system such as LINUX. The storage device 155 may also store a network communication program, which may be used to communicate with one or more additional devices and one or more audio recognition devices. The storage device 155 may also store a user interface program that can vividly display the content of an application program through a graphical operation interface, and receive control operations on the application program from a user through input controls such as menus, dialog boxes, and buttons. The storage device 155 may also store acoustic models, language models, pronunciation dictionaries, and the like.
In one embodiment, the storage device 155 may be used to store one or more instructions, and the processor 151 may invoke the one or more instructions to implement an audio recognition method. Specifically, the processor 151 invokes the one or more instructions to perform the following steps:
acquiring pronunciation data to be identified, and extracting an acoustic feature set of the pronunciation data, wherein the acoustic feature set comprises a plurality of acoustic features;
performing acoustic recognition processing on the acoustic feature set of the pronunciation data to obtain a target pronunciation unit set corresponding to the pronunciation data, wherein the target pronunciation unit set comprises a plurality of pronunciation units and acoustic scores of each pronunciation unit;
performing acoustic compensation processing on acoustic scores of all pronunciation units in the target pronunciation unit set;
and carrying out text recognition on the target pronunciation unit set subjected to the acoustic compensation processing to obtain text information corresponding to the pronunciation data.
Optionally, the processor invokes an instruction to execute the following steps:
the acoustic recognition processing for the acoustic feature set of the pronunciation data comprises the following steps:
sequentially identifying each acoustic feature in the acoustic feature set according to the arrangement order of the acoustic features in the acoustic feature set;
calculating an acoustic score for a pronunciation unit each time one pronunciation unit is identified;
obtaining the target pronunciation unit set when every acoustic feature in the acoustic feature set has been identified;
wherein the order in which the pronunciation units in the target pronunciation unit set are identified matches the pronunciation order of the pronunciation units.
Optionally, the processor invokes an instruction to execute the following steps:
during the acoustic recognition processing, judging whether the target pronunciation unit set satisfies an acoustic compensation condition according to the acoustic score of each pronunciation unit in the target pronunciation unit set; or after the acoustic recognition processing is completed, judging whether the target pronunciation unit set satisfies an acoustic compensation condition according to the acoustic score of each pronunciation unit in the target pronunciation unit set;
and if so, executing the step of performing acoustic compensation processing on the acoustic score of each pronunciation unit in the target pronunciation unit set.
Optionally, the processor invokes an instruction to execute the following steps:
each time one pronunciation unit is identified, judging whether the currently identified pronunciation unit is a first pronunciation unit, where the first pronunciation unit is a pronunciation unit to be compensated obtained by statistics in the historical audio recognition process;
if yes, verifying whether the acoustic score of the currently identified pronunciation unit is smaller than a preset acoustic score threshold;
if the acoustic score is smaller than the preset acoustic score threshold, counting the number of pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit, and comparing the acoustic score of each of those pronunciation units with the preset acoustic score threshold;
and if the counted number is greater than a first number threshold and the acoustic score of each pronunciation unit in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit is greater than or equal to the preset acoustic score threshold, determining that the target pronunciation unit set satisfies the acoustic compensation condition.
Optionally, the processor invokes an instruction to execute the following steps:
each time one pronunciation unit is identified, verifying whether the acoustic score of the currently identified pronunciation unit is smaller than a preset acoustic score threshold;
if the acoustic score is smaller than the preset acoustic score threshold, counting the number of pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit, and comparing the acoustic score of each of those pronunciation units with the preset acoustic score threshold;
and if the counted number is greater than a second number threshold and the acoustic score of each pronunciation unit in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit is greater than or equal to the preset acoustic score threshold, determining that the target pronunciation unit set satisfies the acoustic compensation condition.
Optionally, the processor invokes an instruction to execute the following steps:
performing acoustic compensation processing on the acoustic score of the currently identified pronunciation unit by using the acoustic scores of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit, to obtain the compensated acoustic score of the currently identified pronunciation unit;
and updating the target pronunciation unit set with the compensated acoustic score of the currently identified pronunciation unit to obtain the target pronunciation unit set after the acoustic compensation processing.
Optionally, the processor invokes an instruction to execute the following steps:
calculating a first average value of the acoustic scores of all pronunciation units in the target pronunciation unit set whose pronunciation order precedes that of the currently identified pronunciation unit;
obtaining the probability that the acoustic score of the currently identified pronunciation unit is smaller than the preset acoustic score threshold;
determining a compensation acoustic score of the currently identified pronunciation unit according to the first average value and the probability;
and determining the sum of the acoustic score of the currently identified pronunciation unit and the compensation acoustic score as the compensated acoustic score of the currently identified pronunciation unit.
Optionally, the processor invokes an instruction to execute the following steps:
after the acoustic recognition processing is finished, detecting whether a target pronunciation unit which is the same as a first pronunciation unit exists in the target pronunciation unit set, wherein the first pronunciation unit is a pronunciation unit to be compensated which is obtained through statistics in a historical audio recognition process;
if yes, verifying whether the acoustic score of the target pronunciation unit is smaller than a preset acoustic score threshold value;
if the acoustic score is smaller than the preset acoustic score threshold, counting the number of all the pronunciation units with acoustic scores larger than or equal to the preset acoustic score threshold in the target pronunciation unit set;
and if the counted number is larger than a third number threshold, determining that the target pronunciation unit set meets the acoustic compensation condition.
Optionally, the processor invokes an instruction to execute the following steps:
after the acoustic recognition processing is completed, judging whether a target pronunciation unit whose acoustic score is smaller than a preset acoustic score threshold exists in the target pronunciation unit set;
if such a target pronunciation unit exists, counting the number of all pronunciation units in the target pronunciation unit set whose acoustic scores are greater than or equal to the preset acoustic score threshold;
and if the counted number is greater than a fourth number threshold, determining that the target pronunciation unit set satisfies the acoustic compensation condition.
Optionally, the processor invokes an instruction to execute the following steps:
performing acoustic compensation processing on the acoustic score of the target pronunciation unit by using the acoustic scores of the pronunciation units other than the target pronunciation unit in the target pronunciation unit set, to obtain the compensated acoustic score of the target pronunciation unit;
and updating the target pronunciation unit set with the compensated acoustic score of the target pronunciation unit to obtain the target pronunciation unit set after the acoustic compensation processing.
Optionally, the processor invokes an instruction to execute the following steps:
calculating a second average value of the acoustic scores of the pronunciation units other than the target pronunciation unit in the target pronunciation unit set;
obtaining the probability that the acoustic score of the target pronunciation unit is smaller than the preset acoustic score threshold;
determining a compensation acoustic score of the target pronunciation unit according to the second average value and the probability;
and determining the sum of the acoustic score of the target pronunciation unit and the compensation acoustic score as the compensated acoustic score of the target pronunciation unit.
Optionally, the processor invokes an instruction to execute the following steps:
performing text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain candidate text information corresponding to the pronunciation data and a language score of the candidate text information;
determining the acoustic score of the target pronunciation unit set according to the acoustic score of each pronunciation unit in the target pronunciation unit set after the acoustic compensation processing;
and if the sum of the acoustic score of the target pronunciation unit set and the language score is greater than a preset score threshold, determining the candidate text information as the text information corresponding to the pronunciation data.
Optionally, the processor invokes an instruction to execute the following steps:
detecting whether text information corresponding to the pronunciation data comprises a field matched with an operation instruction or not;
If yes, generating a target operation instruction according to the text information corresponding to the pronunciation data, sending the target operation instruction to a terminal, and executing the target operation instruction by the terminal.
In the embodiment of the application, performing acoustic compensation processing on the acoustic scores of the pronunciation units in the target pronunciation unit set can raise the acoustic scores of those pronunciation units, thereby avoiding the problem that the acoustic score of a pronunciation unit is low because the pronunciation unit is inaccurately or insufficiently pronounced. In addition, performing text recognition on the target pronunciation unit set after the acoustic compensation processing to obtain the text information corresponding to the pronunciation data can improve the accuracy of recognizing the pronunciation data, while the recognition of the pronunciation data of other audio words is not affected. Furthermore, the audio recognition accuracy is improved without optimizing the acoustic model through a large amount of training data; that is, there is no need to collect a large amount of training data or to perform a large amount of iterative training on the acoustic model, which reduces the difficulty of data collection and saves substantial resources.
The embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. For the implementation and beneficial effects of the program in solving the problem, reference may be made to the implementation and beneficial effects of the audio recognition method described in fig. 2, and the repeated description is omitted here.
The foregoing disclosure is only illustrative of some of the embodiments of the present application and is not to be construed as limiting the scope of the appended claims; therefore, all changes that come within the meaning and range of equivalency of the claims are intended to be embraced therein.